For example, If the physical address layout on a two node system with 8 GB
memory is something like:
node 0: 0-2GB, 4-6GB
node 1: 2-4GB, 6-8GB
Current kernels fail to boot/detect this NUMA topology.
ACPI SRAT tables can expose such a topology which needs to be supported.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Increase the number of bits in an apicid from 8 to 32.
By default, MP_processor_info() gets the APICID from the
mpc_config_processor structure. However, this structure limits
the size of APICID to 8 bits. This patch allows the caller of
MP_processor_info() to optionally pass a larger APICID that will
be used instead of the one in the mpc_config_processor struct.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we don't need get that so early.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
reserve_hotadd() are only used by __init acpi_numa_memory_affinity_init().
Annotate reserve_hotadd() with __init is the trivial fix.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patchset adds a flags variable to reserve_bootmem() and uses the
BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
between crashkernel area and already used memory.
This patch:
Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
If that flag is set, the function returns with -EBUSY if the memory already
has been reserved in the past. This is to avoid conflicts.
Because that code runs before SMP initialisation, there's no race condition
inside reserve_bootmem_core().
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix powerpc build]
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following warnings:
WARNING: arch/x86/mm/built-in.o(.text+0x1abc): Section mismatch: reference to .init.data:nodes_parsed in 'unparse_node'
WARNING: arch/x86/mm/built-in.o(.text+0x1ac6): Section mismatch: reference to .cpuinit.data:apicid_to_node in 'unparse_node'
WARNING: arch/x86/mm/built-in.o(.text+0x1ad2): Section mismatch: reference to .cpuinit.data:apicid_to_node in 'unparse_node'
unparse_node are static and only used by acpi_scan_nodes which
is already annotated __init.
So we annotate unparse_node with __init.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I found a small bug of NUMA emulation code for x86_64. (CONFIG_NUMA_EMU)
If machine is non-NUMA, find_node_by_addr() should return
NUMA_NO_NODE, but current implementation code returns existent maximum
NUMA node number + 1.
This is not existent NUMA node number.
However, this behaviour does not affect NUMA emulation fortunately, because
acpi_fake_nodes() that is caller of find_node_by_addr() gets pxm
(proximity domain) by node_to_pxm() from non-existent NUMA node number
that was returned by find_node_by_addr().
node_to_pxm() returns PXM_INVAL that means illegal or non-existent
NUMA node number.
Signed-off-by: Minoru Usui <usui@mxm.nes.nec.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change static bios_cpu_apicid array to a per_cpu data variable.
This includes using a static array used during initialization
similar to the way x86_cpu_to_apicid[] is handled.
There is one early use of bios_cpu_apicid in apic_is_clustered_box().
The other reference in cpu_present_to_apicid() is called after
smp_set_apicids() has setup the percpu version of bios_cpu_apicid.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change the following static arrays sized by NR_CPUS to
per_cpu data variables:
char cpu_to_node_map[NR_CPUS];
fixup:
- Split cpu_to_node function into "early" and "late" versions
so that x86_cpu_to_node_map_early_ptr is not EXPORT'ed and
the cpu_to_node inline function is more streamlined.
- This also involves setting up the percpu maps as early as possible.
- Fix X86_32 NUMA build errors that previous version of this
patch caused.
V2->V3:
- add early_cpu_to_node function to keep cpu_to_node efficient
- move and rename smp_set_apicids() to setup_percpu_maps()
- call setup_percpu_maps() as early as possible
V1->V2:
- Removed extraneous casts
- Fix !NUMA builds with '#ifdef CONFIG_NUMA"
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change the size of node ids from 8 bits to 16 bits to
accomodate more than 256 nodes.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change the size of APICIDs from u8 to u16. This partially
supports the new x2apic mode that will be present on future
processor chips. (Chips actually support 32-bit APICIDs, but that
change is more intrusive. Supporting 16-bit is sufficient for now).
Signed-off-by: Jack Steiner <steiner@sgi.com>
I've included just the partial change from u8 to u16 apicids. The
remaining x2apic changes will be in a separate patch.
In addition, the fake_node_to_pxm_map[] and fake_apicid_to_node[]
tables have been moved from local data to the __initdata section
reducing stack pressure when MAX_NUMNODES and MAX_LOCAL_APIC are
increased in size.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use sparsemem as the only memory model for UP, SMP and NUMA. Measurements
indicate that DISCONTIGMEM has a higher overhead than sparsemem. And
FLATMEMs benefits are minimal. So I think its best to simply standardize
on sparsemem.
Results of page allocator tests (test can be had via git from slab git
tree branch tests)
Measurements in cycle counts. 1000 allocations were performed and then the
average cycle count was calculated.
Order FlatMem Discontig SparseMem
0 639 665 641
1 567 647 593
2 679 774 692
3 763 967 781
4 961 1501 962
5 1356 2344 1392
6 2224 3982 2336
7 4869 7225 5074
8 12500 14048 12732
9 27926 28223 28165
10 58578 58714 58682
(Note that FlatMem is an SMP config and the rest NUMA configurations)
Memory use:
SMP Sparsemem
-------------
Kernel size:
text data bss dec hex filename
3849268 397739 1264856 5511863 541ab7 vmlinux
total used free shared buffers cached
Mem: 8242252 41164 8201088 0 352 11512
-/+ buffers/cache: 29300 8212952
Swap: 9775512 0 9775512
SMP Flatmem
-----------
Kernel size:
text data bss dec hex filename
3844612 397739 1264536 5506887 540747 vmlinux
So 4.5k growth in text size vs. FLATMEM.
total used free shared buffers cached
Mem: 8244052 40544 8203508 0 352 11484
-/+ buffers/cache: 28708 8215344
2k growth in overall memory use after boot.
NUMA discontig:
text data bss dec hex filename
3888124 470659 1276504 5635287 55fcd7 vmlinux
total used free shared buffers cached
Mem: 8256256 56908 8199348 0 352 11496
-/+ buffers/cache: 45060 8211196
Swap: 9775512 0 9775512
NUMA sparse:
text data bss dec hex filename
3896428 470659 1276824 5643911 561e87 vmlinux
8k text growth. Given that we fully inline virt_to_page and friends now
that is rather good.
total used free shared buffers cached
Mem: 8264720 57240 8207480 0 352 11516
-/+ buffers/cache: 45372 8219348
Swap: 9775512 0 9775512
The total available memory is increased by 8k.
This patch makes sparsemem the default and removes discontig and
flatmem support from x86.
[ akpm@linux-foundation.org: allnoconfig build fix ]
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In x86_64 and i386 architectures most arrays that are sized using
NR_CPUS lay in local memory on node 0. Not only will most (99%?) of the
systems not use all the slots in these arrays, particularly when NR_CPUS
is increased to accommodate future very high cpu count systems, but a
number of cache lines are passed unnecessarily on the system bus when
these arrays are referenced by cpus on other nodes.
Typically, the values in these arrays are referenced by the cpu
accessing it's own values, though when passing IPI interrupts, the cpu
does access the data relevant to the targeted cpu/node. Of course, if
the referencing cpu is not on node 0, then the reference will still
require cross node exchanges of cache lines. A common use of this is
for an interrupt service routine to pass the interrupt to other cpus
local to that node.
Ideally, all the elements in these arrays should be moved to the per_cpu
data area. In some cases (such as x86_cpu_to_apicid) the array is
referenced before the per_cpu data areas are setup. In this case, a
static array is declared in the __initdata area and initialized by the
booting cpu (BSP). The values are then moved to the per_cpu area after
it is initialized and the original static array is freed with the rest
of the __initdata.
This patch:
Fix four instances where cpu_to_node is referenced by array instead of
via the cpu_to_node macro. This is preparation to moving it to the
per_cpu data area.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>