Commit Graph

1570 Commits (32a4330d4156e55a4888a201f484dbafed9504ed)

Author SHA1 Message Date
Rik van Riel 32a4330d41 mm: prevent kswapd from freeing excessive amounts of lowmem
The current VM can get itself into trouble fairly easily on systems with a
small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.

On one side, page_alloc() will allocate down to zone->pages_low, while on
the other side, kswapd() and balance_pgdat() will try to free memory from
every zone, until every zone has more free pages than zone->pages_high.

Highmem can be filled up to zone->pages_low with page tables, ramfs,
vmalloc allocations and other unswappable things quite easily and without
many bad side effects, since we still have a huge ZONE_NORMAL to do future
allocations from.

However, as long as the number of free pages in the highmem zone is below
zone->pages_high, kswapd will continue swapping things out from
ZONE_NORMAL, too!

Sami Farin managed to get his system into a stage where kswapd had freed
about 700MB of low memory and was still "going strong".

The attached patch will make kswapd stop paging out data from zones when
there is more than enough memory free.  We do go above zone->pages_high in
order to keep pressure between zones equal in normal circumstances, but the
patch should prevent the kind of excesses that made Sami's computer totally
unusable.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Jesper Juhl 8691f3a72f mm: no need to cast vmalloc() return value in zone_wait_table_init()
vmalloc() returns a void pointer, so there's no need to cast its
return value in mm/page_alloc.c::zone_wait_table_init().

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Christoph Lameter ef8b4520bd Slab allocators: fail if ksize is called with a NULL parameter
A NULL pointer means that the object was not allocated.  One cannot
determine the size of an object that has not been allocated.  Currently we
return 0 but we really should BUG() on attempts to determine the size of
something nonexistent.

krealloc() interprets NULL to mean a zero sized object.  Handle that
separately in krealloc().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Dean Nelson 0da7e01f5f calculation of pgoff in do_linear_fault() uses mixed units
The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
PAGE_CACHE_SIZE.  At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
defined as PAGE_SHIFT, but should that ever change this calculation would
break.

Signed-off-by: Dean Nelson <dcn@sgi.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Satyam Sharma 2408c55037 {slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check
Considering kfree(NULL) would normally occur only in error paths and
kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the
condition check in SLUB's and SLOB's kfree() to optimize for the common
case.  SLAB has this already.

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin b55ed81623 mm: clarify __add_to_swap_cache locking
__add_to_swap_cache unconditionally sets the page locked, which can be a bit
alarming to the unsuspecting reader: in the code paths where the page is
visible to other CPUs, the page should be (and is) already locked.

Instead, just add a check to ensure the page is locked here, and teach the one
path relying on the old behaviour to call SetPageLocked itself.

[hugh@veritas.com: locking fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin 45726cb43d mm: improve find_lock_page
find_lock_page does not need to recheck ->index because if the page is in the
right mapping then the index must be the same.  Also, tree_lock does not need
to be retaken after the page is locked in order to test that ->mapping has not
changed, because holding the page lock pins its mapping.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin 0012818810 mm: use lockless radix-tree probe
Probing pages and radix_tree_tagged are lockless operations with the lockless
radix-tree.  Convert these users to RCU locking rather than using tree_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin 557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Christoph Lameter aadb4bc4a1 SLUB: direct pass through of page size or higher kmalloc requests
This gets rid of all kmalloc caches larger than page size.  A kmalloc
request larger than PAGE_SIZE > 2 is going to be passed through to the page
allocator.  This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.

kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.

Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
  need to pass through the slab allocator to get to the
  page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
  PAGE_SIZEd allocations.
  SLUB must call page allocator for each alloc anyways since
  the higher order pages which that allowed avoiding the page alloc calls
  are not available in a reliable way anymore. So we are basically removing
  useless slab allocator overhead.
- Large kmallocs yields page aligned object which is what
  SLAB did. Bad things like using page sized kmalloc allocations to
  stand in for page allocate allocs can be transparently handled and are not
  distinguishable from page allocator uses.
- Checking for too large objects can be removed since
  it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu 57f6b96c09 filemap: convert some unsigned long to pgoff_t
Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu b2c3843b1e filemap: trivial code cleanups
- remove unused local next_index in do_generic_mapping_read()
- remove a redudant page_cache_read() declaration

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu 535443f515 readahead: remove several readahead macros
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu 7ff81078d8 readahead: remove the local copy of ra in do_generic_mapping_read()
The local copy of ra in do_generic_mapping_read() can now go away.

It predates readanead(req_size).  In a time when the readahead code was called
on *every* single page.  Hence a local has to be made to reduce the chance of
the readahead state being overwritten by a concurrent reader.  More details
in: Linux: Random File I/O Regressions In 2.6
<http://kerneltrap.org/node/3039>

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu 6b10c6c9fb readahead: basic support of interleaved reads
This is a simplified version of the pagecache context based readahead.  It
handles the case of multiple threads reading on the same fd and invalidating
each others' readahead state.  It does the trick by scanning the pagecache and
recovering the current read stream's readahead status.

The algorithm works in a opportunistic way, in that it does not try to detect
interleaved reads _actively_, which requires a probe into the page cache
(which means a little more overhead for random reads).  It only tries to
handle a previously started sequential readahead whose state was overwritten
by another concurrent stream, and it can do this job pretty well.

Negative and positive examples(or what you can expect from it):

1) it cannot detect and serve perfect request-by-request interleaved reads
   right:
	time	stream 1  stream 2
	0 	1
	1 	          1001
	2 	2
	3 	          1002
	4 	3
	5 	          1003
	6 	4
	7 	          1004
	8 	5
	9	          1005

Here no single readahead will be carried out.

2) However, if it's two concurrent reads by two threads, the chance of the
   initial sequential readahead be started is huge. Once the first sequential
   readahead is started for a stream, this patch will ensure that the readahead
   window continues to rampup and won't be disturbed by other streams.

	time	stream 1  stream 2
	0 	1
	1 	2
	2 	          1001
	3 	3
	4 	          1002
	5 	          1003
	6 	4
	7 	5
	8 	          1004
	9 	6
	10	          1005
	11	7
	12	          1006
	13	          1007

Here stream 1 will start a readahead at page 2, and stream 2 will start its
first readahead at page 1003.  From then on the two streams will be served
right.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu f4e6b498d6 readahead: combine file_ra_state.prev_index/prev_offset into prev_pos
Combine the file_ra_state members
				unsigned long prev_index
				unsigned int prev_offset
into
				loff_t prev_pos

It is more consistent and better supports huge files.

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu 0bb7ba6b9c readahead: mmap read-around simplification
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu 937085aa35 readahead: compacting file_ra_state
Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

This helps reduce memory consumption on 64bit CPU when a lot of files are
opened.

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Jesper Juhl 43fac94dd6 Clean up duplicate includes in mm/
This patch cleans up duplicate includes in
	mm/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Adrian Bunk 1cd7daa51b slub.c:early_kmem_cache_node_alloc() shouldn't be __init
WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
...

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft 29c71111d0 vmemmap: generify initialisation via helpers
Convert the common vmemmap population into initialisation helpers for use by
architecture vmemmap populators.  All architecture implementing the
SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
initialiser, which may make use of the helpers.

This allows us to clean up and remove the initialisation Kconfig entries.
With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
indicate use of that variant.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter 8f6aac419b Generic Virtual Memmap support for SPARSEMEM
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contigious area, only the active sections are
physically backed.  This allows virt_to_page page_address and cohorts become
simple shift/add operations.  No page flag fields, no table lookups, nothing
involving memory is required.

The two key operations pfn_to_page and page_to_page become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This vill make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

[apw@shadowen.org: config cleanups, resplit code etc]
[kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
[apw@shadowen.org: vmemmap: remove excess debugging]
[apw@shadowen.org: simplify initialisation code and reduce duplication]
[apw@shadowen.org: pull out the vmemmap code into its own file]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft 540557b943 sparsemem: record when a section has a valid mem_map
We have flags to indicate whether a section actually has a valid mem_map
associated with it.  This is never set and we rely solely on the present bit
to indicate a section is valid.  By definition a section is not valid if it
has no mem_map and there is a window during init where the present bit is set
but there is no mem_map, during which pfn_valid() will return true
incorrectly.

Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add a new
present_section{,_nr} and pfn_present() interfaces for those users who care to
know that a section is going to be valid.

[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft cd881a6b22 sparsemem: clean up spelling error in comments
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contigious area, only the active sections are
physically backed.  This allows virt_to_page page_address and cohorts become
simple shift/add operations.  No page flag fields, no table lookups, nothing
involving memory is required.

The two key operations pfn_to_page and page_to_page become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This vill make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

The current aim is to bring a common virtually mapped mem_map to all
architectures.  This should facilitate the removal of the bespoke
implementations from the architectures.  This also brings performance
improvements for most architecture making sparsmem vmemmap the more desirable
memory model.  The ultimate aim of this work is to expand sparsemem support to
encompass all the features of the other memory models.  This could allow us to
drop support for and remove the other models in the longer term.

Below are some comparitive kernbench numbers for various architectures,
comparing default memory model against SPARSEMEM VMEMMAP.  All but ia64 show
marginal improvement; we expect the ia64 figures to be sorted out when the
larger mapping support returns.

x86-64 non-NUMA
             Base    VMEMAP    % change (-ve good)
User        85.07     84.84    -0.26
System      34.32     33.84    -1.39
Total      119.38    118.68    -0.59

ia64
             Base    VMEMAP    % change (-ve good)
User      1016.41   1016.93    0.05
System      50.83     51.02    0.36
Total     1067.25   1067.95    0.07

x86-64 NUMA
             Base   VMEMAP    % change (-ve good)
User        30.77   431.73     0.22
System      45.39    43.98    -3.11
Total      476.17   475.71    -0.10

ppc64
             Base   VMEMAP    % change (-ve good)
User       488.77   488.35    -0.09
System      56.92    56.37    -0.97
Total      545.69   544.72    -0.18

Below are some AIM bencharks on IA64 and x86-64 (thank Bob).  The seems
pretty much flat as you would expect.

ia64 results 2 cpu non-numa 4Gb SCSI disk

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	Jun  1 07:17:24 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	98.9		100	58.9	1.3	1.6482
101	5547.1		95	106.0	79.4	0.9154
201	6377.7		95	183.4	158.3	0.5288
301	6932.2		95	252.7	237.3	0.3838
401	7075.8		93	329.8	316.7	0.2941
501	7235.6		94	403.0	396.2	0.2407
600	7387.5		94	472.7	475.0	0.2052

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	Jun  1 09:59:04 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	99.1		100	58.8	1.2	1.6509
101	5480.9		95	107.2	79.2	0.9044
201	6490.3		95	180.2	157.8	0.5382
301	6886.6		94	254.4	236.8	0.3813
401	7078.2		94	329.7	316.0	0.2942
501	7250.3		95	402.2	395.4	0.2412
600	7399.1		94	471.9	473.9	0.2055

open power 710 2 cpu, 4 Gb, SCSI and configured physically

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	May 29 15:42:53 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	25.7		100	226.3	4.3	0.4286
101	1096.0		97	536.4	199.8	0.1809
201	1236.4		96	946.1	389.1	0.1025
301	1280.5		96	1368.0	582.3	0.0709
401	1270.2		95	1837.4	771.0	0.0528
501	1251.4		96	2330.1	955.9	0.0416
601	1252.6		96	2792.4	1139.2	0.0347
701	1245.2		96	3276.5	1334.6	0.0296
918	1229.5		96	4345.4	1728.7	0.0223

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	May 30 07:28:26 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	25.6		100	226.9	4.3	0.4275
101	1049.3		97	560.2	198.1	0.1731
201	1199.1		97	975.6	390.7	0.0994
301	1261.7		96	1388.5	591.5	0.0699
401	1256.1		96	1858.1	771.9	0.0522
501	1220.1		96	2389.7	955.3	0.0406
601	1224.6		96	2856.3	1133.4	0.0340
701	1252.0		96	3258.7	1314.1	0.0298
915	1232.8		96	4319.7	1704.0	0.0225

amd64 2 2-core, 4Gb and SATA

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	Jun  2 03:59:48 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	13.0		100	446.4	2.1	0.2173
101	533.4		97	1102.0	110.2	0.0880
201	578.3		97	2022.8	220.8	0.0480
301	583.8		97	3000.6	332.3	0.0323
401	580.5		97	4020.1	442.2	0.0241
501	574.8		98	5072.8	558.8	0.0191
600	566.5		98	6163.8	671.0	0.0157

Benchmark	Version	Machine	Run Date
AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	Jun  3 04:19:31 2007

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
1	13.0		100	447.8	2.0	0.2166
101	536.5		97	1095.6	109.7	0.0885
201	567.7		97	2060.5	219.3	0.0471
301	582.1		96	3009.4	330.2	0.0322
401	578.2		96	4036.4	442.4	0.0240
501	585.1		98	4983.2	555.1	0.0195
600	565.5		98	6175.2	660.6	0.0157

This patch:

Fix some spelling errors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Al Viro 9d966d495c mm/migrate.c __user annotation
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:51 -07:00
NeilBrown 6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Jens Axboe f5ff8422bb Fix warnings with !CONFIG_BLOCK
Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup
the (few) files that fail to build because they were relying on blkdev.h
pulling in extra includes for them.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Yan Zheng 745ad48e8c fix page release issue in filemap_fault
find_lock_page increases page's usage count, we should decrease it
before return VM_FAULT_SIGBUS

Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-08 12:58:14 -07:00
Yan Zheng dd204d63cd fix VM_CAN_NONLINEAR check in sys_remap_file_pages
The test for VM_CAN_NONLINEAR always fails

Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-08 12:58:14 -07:00
Peter Zijlstra a200ee182a mm: set_page_dirty_balance() vs ->page_mkwrite()
All the current page_mkwrite() implementations also set the page dirty. Which
results in the set_page_dirty_balance() call to _not_ call balance, because the
page is already found dirty.

This allows us to dirty a _lot_ of pages without ever hitting
balance_dirty_pages().  Not good (tm).

Force a balance call if ->page_mkwrite() was successful.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-08 12:58:14 -07:00
Jeremy Fitzhardinge 67dd5a25f4 xen: disable split pte locks for now
When pinning and unpinning pagetables, we must protect them against
being used by other CPUs, lest they see the pagetable in an
intermediate read-only-but-not-pinned state.

When using split pte locks, doing this properly would require taking
all the pte locks for the pagetable while pinning, but this may overflow
the PREEMPT_BITS part of the preempt counter if the process has mapped
more than about 512M of memory.

However, failing to take the pte locks causes write-protect faults when
the pageout code is trying to clear the Access bit on a pte which is part
of a freshy created and still being pinned process after fork.

This is a short-term fix until the problem is solved properly.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Keir Fraser <keir@xensource.com>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-06 09:31:30 -07:00
Hugh Dickins 16abfa0860 Fix sys_remap_file_pages BUG at highmem.c:15!
Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
sys_remap_file_pages, while running Oracle database test on x86 in 6GB
RAM: kunmap thinks we're in_interrupt because the preempt count has
wrapped.

That's because __do_fault expected to unmap page_table, but one of its
two callers do_nonlinear_fault already unmapped it: let do_linear_fault
unmap it first too, and then there's no need to pass the page_table arg
down.

Why have we been so slow to notice this? Probably through forgetting
that the mapping_cap_account_dirty test means that sys_remap_file_pages
nowadays only goes the full nonlinear vma route on a few memory-backed
filesystems like ramfs, tmpfs and hugetlbfs.

[ It also depends on CONFIG_HIGHPTE, so it becomes even harder to
  trigger in practice. Many who have need of large memory have probably
  migrated to x86-64..

  Problem introduced by commit d0217ac04c
  ("mm: fault feedback #1")                -- Linus ]

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: gurudas pai <gurudas.pai@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-04 10:13:09 -07:00
Ralf Baechle 281e0e3b34 hugetlb: fix clear_user_highpage arguments
The virtual address space argument of clear_user_highpage is supposed to be
the virtual address where the page being cleared will eventually be mapped.
 This allows architectures with virtually indexed caches a few clever
tricks.  That sort of trick falls over in painful ways if the virtual
address argument is wrong.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-01 07:52:23 -07:00
Lee Schermerhorn 480eccf9ae Fix NUMA Memory Policy Reference Counting
This patch proposes fixes to the reference counting of memory policy in the
page allocation paths and in show_numa_map().  Extracted from my "Memory
Policy Cleanups and Enhancements" series as stand-alone.

Shared policy lookup [shmem] has always added a reference to the policy,
but this was never unrefed after page allocation or after formatting the
numa map data.

Default system policy should not require additional ref counting, nor
should the current task's task policy.  However, show_numa_map() calls
get_vma_policy() to examine what may be [likely is] another task's policy.
The latter case needs protection against freeing of the policy.

This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another task's
mempolicy.  Again, shared policy is already reference counted on lookup.  A
matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
shared and vma policies, and in show_numa_map() for shared and another
task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL policy.

Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a shared or vma
'BIND policy.  In this case, we should hold the reference until after the
huge page allocation in dequeue_hugepage().  The patch modifies
huge_zonelist() to return a pointer to the mempolicy if it needs to be
unref'd after allocation.

Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

		w/o patch	w/ refcount patch
	    Avg	  Std Devn	   Avg	  Std Devn
Real:	 100.59	    0.38	 100.63	    0.43
User:	1209.60	    0.37	1209.91	    0.31
System:   81.52	    0.42	  81.64	    0.34

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19 11:24:18 -07:00
Christoph Lameter ba0268a8b0 SLUB: accurately compare debug flags during slab cache merge
This was posted on Aug 28 and fixes an issue that could cause troubles
when slab caches >=128k are created.

http://marc.info/?l=linux-mm&m=118798149918424&w=2

Currently we simply add the debug flags unconditional when checking for a
matching slab.  This creates issues for sysfs processing when slabs exist
that are exempt from debugging due to their huge size or because only a
subset of slabs was selected for debugging.

We need to only add the flags if kmem_cache_open() would also add them.

Create a function to calculate the flags that would be set
if the cache would be opened and use that function to determine
the flags before looking for a compatible slab.

[akpm@linux-foundation.org: fixlets]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:27 -07:00
Christoph Lameter 3b42d28b2a Page migration: Do not accept invalid nodes in the target nodeset
Page migration currently does not check if the target of the move contains
nodes that that are invalid (if root attempts to migrate pages)
and may try to allocate from invalid nodes if these are specified
leading to oopses.

Return -EINVAL if an offline node is specified.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:23 -07:00
Christoph Lameter 5d540fb715 slub: do not fail if we cannot register a slab with sysfs
Do not BUG() if we cannot register a slab with sysfs.  Just print an error.
 The only consequence of not registering is that the slab cache is not
visible via /sys/slab.  A BUG() may not be visible that early during boot
and we have had multiple issues here already.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:22 -07:00
KAMEZAWA Hiroyuki 989f89c57e fix rcu_read_lock() in page migraton
In migration fallback path, write_page() or lock_page() will be called.
This causes sleep with holding rcu_read_lock().
For avoding that, just do rcu_lock if the page is Anon.(this is enough.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:22 -07:00
Andrew Morton 6419168813 process_zones(): fix recovery code
Don't try to free memory which we didn't allocate.

Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:22 -07:00
Mel Gorman b377fd3982 Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE
The NUMA layer only supports NUMA policies for the highest zone.  When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE.  The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.

The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64.  No abnormal results were seen in
kernbench, tbench, dbench or hackbench.  It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.

akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on.  That solution
would only use a single zonelist per node and filter on the fly.  That may
help performance and also help to make memory policies work better."

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Christoph Lameter a2f92ee7e7 SLUB: do not fail on broken memory configurations
Print a big fat warning and do what is necessary to continue if a node is
marked as up (meaning either node is online (upstream) or node has memory
(Andrew's tree)) but allocations from the node do not succeed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Christoph Lameter 9e86943b6c SLUB: use atomic_long_read for atomic_long variables
SLUB is using atomic_read() for variables declared atomic_long_t.
Switch to atomic_long_read().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Adam Litke a89182c76e Fix VM_FAULT flags conversion for hugetlb
It seems a simple mistake was made when converting follow_hugetlb_page()
over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
83c54070ee).

By using the wrong bitmask, hugetlb_fault() failures are not being
recognized.  This results in an infinite loop whenever follow_hugetlb_page
is involved in a failed fault.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:46 -07:00
Siddha, Suresh B 1807a1aaf5 slab: skip calling cache_free_alien() when the platform is not numa capable
Skip calling cache_free_alien() when the platform is not numa capable.
This will avoid cache misses that happen while accessing slabp (which is
per page memory reference) to get nodeid.  Instead use a global variable to
skip the call, which is mostly likely to be present in the cache.

This gives a 0.8% performance boost with the database oltp workload on a
quad-core SMP platform and by any means the number is not small :)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:46 -07:00
Alan Cox 34b4e4aa3c fix NULL pointer dereference in __vm_enough_memory()
The new exec code inserts an accounted vma into an mm struct which is not
current->mm.  The existing memory check code has a hard coded assumption
that this does not happen as does the security code.

As the correct mm is known we pass the mm to the security method and the
helper function.  A new security test is added for the case where we need
to pass the mm and the existing one is modified to pass current->mm to
avoid the need to change large amounts of code.

(Thanks to Tobias for fixing rejects and testing)

Signed-off-by: Alan Cox <alan@redhat.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Cc: James Morris <jmorris@redhat.com>
Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:45 -07:00
Andy Whitcroft c661b078fd synchronous lumpy reclaim: wait for page writeback when directly reclaiming contiguous areas
Lumpy reclaim works by selecting a lead page from the LRU list and then
selecting pages for reclaim from the order-aligned area of pages.  In the
situation were all pages in that region are inactive and not referenced by any
process over time, it works well.

In the situation where there is even light load on the system, the pages may
not free quickly.  Out of a area of 1024 pages, maybe only 950 of them are
freed when the allocation attempt occurs because lumpy reclaim returned early.
 This patch alters the behaviour of direct reclaim for large contiguous
blocks.

The first attempt to call shrink_page_list() is asynchronous but if it fails,
the pages are submitted a second time and the calling process waits for the IO
to complete.  This may stall allocators waiting for contiguous memory but that
should be expected behaviour for high-order users.  It is preferable behaviour
to potentially queueing unnecessary areas for IO.  Note that kswapd will not
stall in this fashion.

[apw@shadowen.org: update to version 2]
[apw@shadowen.org: update to version 3]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:45 -07:00
Andy Whitcroft e9187bdcbb synchronous lumpy reclaim: ensure we count pages transitioning inactive via clear_active_flags
As pointed out by Mel when reclaim is applied at higher orders a significant
amount of IO may be started.  As this takes finite time to drain reclaim will
consider more areas than ultimatly needed to satisfy the request.  This leads
to more reclaim than strictly required and reduced success rates.

I was able to confirm Mel's test results on systems locally.  These show that
even under light load the success rates drop off far more than expected.
Testing with a modified version of his patch (which follows) I was able to
allocate almost all of ZONE_MOVABLE with a near idle system.  I ran 5 test
passes sequentially following system boot (the system has 29 hugepages in
ZONE_MOVABLE):

  2.6.23-rc1              11  8  6  7  7
  sync_lumpy              28 28 29 29 26

These show that although hugely better than the near 0% success normally
expected we can only allocate about a 1/4 of the zone.  Using synchronous
reclaim for these allocations we get close to 100% as expected.

I have also run our standard high order tests and these show no regressions in
allocation success rates at rest, and some significant improvements under
load.

This patch:

We are transitioning pages from active to inactive in clear_active_flags,
those need counting as PGDEACTIVATE vm events.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:45 -07:00
Andy Whitcroft 85770ffe4f sparsemem: ensure we initialise the node mapping for SPARSEMEM_STATIC
Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c:

	Initializing HighMem for node 0 (00038000:00100000)
	Initializing HighMem for node 1 (00100000:001ffe00)
	------------[ cut here ]------------
	kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456!
	[...]

This occurs because the section to node id mapping is not being
setup correctly during init under SPARSEMEM_STATIC, leading to an
attempt to free pages from all nodes into the zones on node 0.

When the zone_table[] was removed in the following commit, a new
section to node mapping table was introduced:

    commit 89689ae7f9
    [PATCH] Get rid of zone_table[]

That conversion inadvertantly only initialised the node mapping in
SPARSEMEM_EXTREME.  Ensure we initialise the node mapping in
SPARSEMEM_STATIC.

[akpm@linux-foundation.org: make the stubs static inline]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:44 -07:00
Linus Torvalds dc8a7b11aa Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n
  sysace: HDIO_GETGEO has it's own method for ages
  drivers/block/cpqarray.c: better error handling and kmalloc + memset conversion to k[cz]alloc
  drivers/block/cciss.c: kmalloc + memset conversion to kzalloc
  Clean up duplicate includes in drivers/block/
  Fix remap handling by blktrace
  [PATCH] remove mm/filemap.c:file_send_actor()
2007-08-11 16:01:06 -07:00
Stephen Hemminger f0b85c0cfd readahead: docbook fix
Minor docbook error since argument name in comment doesn't match function

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-11 15:47:42 -07:00