linux

q3k/linux

Author	SHA1	Message	Date
Lee Schermerhorn	252c5f94d9	mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() We noticed very erratic behavior [throughput] with the AIM7 shared workload running on recent distro [SLES11] and mainline kernels on an 8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel [2.6.27.19+] with Barcelona processors, as we increased the load [10s of thousands of tasks], the throughput would vary between two "plateaus"--one at ~65K jobs per minute and one at ~130K jpm. The simple patch below causes the results to smooth out at the ~130k plateau. But wait, there's more: We do not see this behavior on smaller platforms--e.g., 4 socket/8 core. This could be the result of the larger number of cpus on the larger platform--a scalability issue--or it could be the result of the larger number of interconnect "hops" between some nodes in this platform and how the tasks for a given load end up distributed over the nodes' cpus and memories--a stochastic NUMA effect. The variability in the results are less pronounced [on the same platform] with Shanghai processors and with mainline kernels. With 31-rc6 on Shanghai processors and 288 file systems on 288 fibre attached storage volumes, the curves [jpm vs load] are both quite flat with the patched kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K jpm] than the unpatched kernel. Profiling indicated that the "slow" runs were incurring high[er] contention on an anon_vma lock in vma_adjust(), apparently called from the sbrk() system call. The patch: A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the anon_vma lock when we're only adjusting the end of a vma, as is the case for brk(). The comment questions whether it's worth while to optimize for this case. Apparently, on the newer, larger x86_64 platforms, with interesting NUMA topologies, it is worth while--especially considering that the patch [if correct!] is quite simple. We can detect this condition--no overlap with next vma--by noting a NULL "importer". The anon_vma pointer will also be NULL in this case, so simply avoid loading vma->anon_vma to avoid the lock. However, we DO need to take the anon_vma lock when we're inserting a vma ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we need to check for 'insert', as well. And Hugh points out that we should also take it when adjusting vm_start (so that rmap.c can rely upon vma_address() while it holds the anon_vma lock). akpm: Zhang Yanmin reprts a 150% throughput improvement with aim7, so it might be -stable material even though thiss isn't a regression: "this issue is not clear on dual socket Nehalem machine (242 cpu), but is severe on large machine (482 cpu)" [hugh.dickins@tiscali.co.uk: test vma start too] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	3f96b79ad9	tmpfs: depend on shmem CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make more sense of it by removing CONFIG_TMPFS altogether, always enabling its code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on CONFIG_TMPFS off that we'd better leave that as is. But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off: make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL shmem_acl.o being pointlessly built into the kernel when SHMEM is off. And a selfish change, to prevent the world from being rebuilt when I switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the header files is mm.h shmem_lock() - give that a shmem.c stub instead. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Huang Shijie	cdf7b3418a	mmap: remove unnecessary code If (flags & MAP_LOCKED) is true, it means vm_flags has already contained the bit VM_LOCKED which is set by calc_vm_flag_bits(). So there is no need to reset it again, just remove it. Signed-off-by: Huang Shijie <shijie8@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	03f6462a3a	mm: move highest_memmap_pfn Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn __read_mostly in memory.c: to help them share a cacheline, since they're very often tested together in vm_normal_page(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	62eede62da	mm: ZERO_PAGE without PTE_SPECIAL Reinstate anonymous use of ZERO_PAGE to all architectures, not just to those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin. Contrary to how I'd imagined it, there's nothing ugly about this, just a zero_pfn test built into one or another block of vm_normal_page(). But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for that: it would have to take vm_flags to weed out some cases. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	3ae77f43b1	mm: hugetlbfs_pagecache_present Rename hugetlbfs_backed() to hugetlbfs_pagecache_present() and add more comments, as suggested by Mel Gorman. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	6e919717c8	mm: m(un)lock avoid ZERO_PAGE I'm still reluctant to clutter __get_user_pages() with another flag, just to avoid touching ZERO_PAGE count in mlock(); though we can add that later if it shows up as an issue in practice. But when mlocking, we can test page->mapping slightly earlier, to avoid the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock didn't lock_page in olden ZERO_PAGE days, so we might have regressed. And when munlocking, it turns out that FOLL_DUMP coincidentally does what's needed to avoid all updates to ZERO_PAGE, so use that here also. Plus add comment suggested by KAMEZAWA Hiroyuki. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	58fa879e1e	mm: FOLL flags for GUP flags __get_user_pages() has been taking its own GUP flags, then processing them into FOLL flags for follow_page(). Though oddly named, the FOLL flags are more widely used, so pass them to __get_user_pages() now. Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct. (The patch to __get_user_pages() looks peculiar, with both gup_flags and foll_flags: the gup_flags remain constant; but as before there's an exceptional case, out of scope of the patch, in which foll_flags per page have FOLL_WRITE masked off.) Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	a13ea5b759	mm: reinstate ZERO_PAGE KAMEZAWA Hiroyuki has observed customers of earlier kernels taking advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from using in 2.6.24. And there were a couple of regression reports on LKML. Following suggestions from Linus, reinstate do_anonymous_page() use of the ZERO_PAGE; but this time avoid dirtying its struct page cacheline with (map)count updates - let vm_normal_page() regard it as abnormal. Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32, most powerpc): that's not essential, but minimizes additional branches (keeping them in the unlikely pte_special case); and incidentally excludes mips (some models of which needed eight colours of ZERO_PAGE to avoid costly exceptions). Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages() callers won't want to make exceptions for it, so increment its count there. Changes to mlock and migration? happily seems not needed. In most places it's quicker to check pfn than struct page address: prepare a __read_mostly zero_pfn for that. Does get_dump_page() still need its ZERO_PAGE check? probably not, but keep it anyway. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	1ac0cb5d0e	mm: fix anonymous dirtying do_anonymous_page() has been wrong to dirty the pte regardless. If it's not going to mark the pte writable, then it won't help to mark it dirty here, and clogs up memory with pages which will need swap instead of being thrown away. Especially wrong if no overcommit is chosen, and this vma is not yet VM_ACCOUNTed - we could exceed the limit and OOM despite no overcommit. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	2a15efc953	mm: follow_hugetlb_page flags follow_hugetlb_page() shouldn't be guessing about the coredump case either: pass the foll_flags down to it, instead of just the write bit. Remove that obscure huge_zeropage_ok() test. The decision is easy, though unlike the non-huge case - here vm_ops->fault is always set. But we know that a fault would serve up zeroes, unless there's already a hugetlbfs pagecache page to back the range. (Alternatively, since hugetlb pages aren't swapped out under pressure, you could save more dump space by arguing that a page not yet faulted into this process cannot be relevant to the dump; but that would be more surprising.) Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	8e4b9a6071	mm: FOLL_DUMP replace FOLL_ANON The "FOLL_ANON optimization" and its use_zero_page() test have caused confusion and bugs: why does it test VM_SHARED? for the very good but unsatisfying reason that VMware crashed without. As we look to maybe reinstating anonymous use of the ZERO_PAGE, we need to sort this out. Easily done: it's silly for __get_user_pages() and follow_page() to be guessing whether it's safe to assume that they're being used for a coredump (which can take a shortcut snapshot where other uses must handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP. get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	f3e8fccd06	mm: add get_dump_page In preparation for the next patch, add a simple get_dump_page(addr) interface for the CONFIG_ELF_CORE dumpers to use, instead of calling get_user_pages() directly. They're not interested in errors: they just want to use holes as much as possible, to save space and make sure that the data is aligned where the headers said it would be. Oh, and don't use that horrid DUMP_SEEK(off) macro! Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	1c3aff1cee	mm: remove unused GUP flags GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were flags added solely to prevent __get_user_pages() from doing some of what it usually does, in the munlock case: we can now remove them. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Hugh Dickins	408e82b78b	mm: munlock use follow_page Hiroaki Wakabayashi points out that when mlock() has been interrupted by SIGKILL, the subsequent munlock() takes unnecessarily long because its use of __get_user_pages() insists on faulting in all the pages which mlock() never reached. It's worse than slowness if mlock() is terminated by Out Of Memory kill: the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the pages which mlock() could not find memory for; so innocent bystanders are killed too, and perhaps the system hangs. __get_user_pages() does a lot that's silly for munlock(): so remove the munlock option from __mlock_vma_pages_range(), and use a simple loop of follow_page()s in munlock_vma_pages_range() instead; ignoring absent pages, and not marking present pages as accessed or dirty. (Change munlock() to only go so far as mlock() reached? That does not work out, given the convention that mlock() claims complete success even when it has to give up early - in part so that an underlying file can be extended later, and those pages locked which earlier would give SIGBUS.) Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Hiroaki Wakabayashi <primulaelatior@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
Mel Gorman	a6f9edd65b	page-allocator: maintain rolling count of pages to free from the PCP When round-robin freeing pages from the PCP lists, empty lists may be encountered. In the event one of the lists has more pages than another, there may be numerous checks for list_empty() which is undesirable. This patch maintains a count of pages to free which is incremented when empty lists are encountered. The intention is that more pages will then be freed from fuller lists than the empty ones reducing the number of empty list checks in the free path. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Mel Gorman	5f8dcc2121	page-allocator: split per-cpu list into one-list-per-migrate-type The following two patches remove searching in the page allocator fast-path by maintaining multiple free-lists in the per-cpu structure. At the time the search was introduced, increasing the per-cpu structures would waste a lot of memory as per-cpu structures were statically allocated at compile-time. This is no longer the case. The patches are as follows. They are based on mmotm-2009-08-27. Patch 1 adds multiple lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. Patch 2 notes that the pcpu drain path check empty lists multiple times. The patch reduces the number of checks by maintaining a count of free lists encountered. Lists containing pages will then free multiple pages in batch The patches were tested with kernbench, netperf udp/tcp, hackbench and sysbench. The netperf tests were not bound to any CPU in particular and were run such that the results should be 99% confidence that the reported results are within 1% of the estimated mean. sysbench was run with a postgres background and read-only tests. Similar to netperf, it was run multiple times so that it's 99% confidence results are within 1%. The patches were tested on x86, x86-64 and ppc64 as x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.34% to 2.28% gain netperf-tcp - 0.45% to 1.22% gain hackbench - Small variances, very close to noise sysbench - Very small gains x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.83% to 10.42% gains netperf-tcp - No conclusive until buffer >= PAGE_SIZE 4096 +15.83% 8192 + 0.34% (not significant) 16384 + 1% hackbench - Small gains, very close to noise sysbench - 0.79% to 1.6% gain ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation) kernbench - No significant difference, variance well within noise netperf-udp - 2-3% gain for almost all buffer sizes tested netperf-tcp - losses on small buffers, gains on larger buffers possibly indicates some bad caching effect. hackbench - No significant difference sysbench - 2-4% gain This patch: Currently the per-cpu page allocator searches the PCP list for pages of the correct migrate-type to reduce the possibility of pages being inappropriate placed from a fragmentation perspective. This search is potentially expensive in a fast-path and undesirable. Splitting the per-cpu list into multiple lists increases the size of a per-cpu structure and this was potentially a major problem at the time the search was introduced. These problem has been mitigated as now only the necessary number of structures is allocated for the running system. This patch replaces a list search in the per-cpu allocator with one list per migrate type. The potential snag with this approach is when bulk freeing pages. We round-robin free pages based on migrate type which has little bearing on the cache hotness of the page and potentially checks empty lists repeatedly in the event the majority of PCP pages are of one type. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
KOSAKI Motohiro	8c5cd6f3a1	oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, "No available memory (MPOL_BIND)"); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
KOSAKI Motohiro	495789a51a	oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
KOSAKI Motohiro	28b83c5193	oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Vincent Li	f168e1b639	mm/vmscan: remove page_queue_congested() comment Commit 084f71ae5c(kill page_queue_congested()) removed page_queue_congested(). Remove the page_queue_congested() comment in vmscan pageout() too. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Wu Fengguang	f862963174	mm: do batched scans for mem_cgroup For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in which case shrink_list() _still_ calls isolate_pages() with the much larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan rate by up to 32 times. For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4. So when shrink_zone() expects to scan 4 pages in the active/inactive list, the active list will be scanned 4 pages, while the inactive list will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break the balance between the two lists. It can further impact the scan of anon active list, due to the anon active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone(): inactive anon list over scanned => inactive_anon_is_low() == TRUE => shrink_active_list() => active anon list over scanned So the end result may be - anon inactive => over scanned - anon active => over scanned (maybe not as much) - file inactive => over scanned - file active => under scanned (relatively) The accesses to nr_saved_scan are not lock protected and so not 100% accurate, however we can tolerate small errors and the resulted small imbalanced scan rates between zones. Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Vincent Li	0b21767637	mm/vmscan: rename zone_nr_pages() to zone_nr_lru_pages() The name `zone_nr_pages' can be mis-read as zone's (total) number pages, but it actually returns zone's LRU list number pages. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Jan Beulich	2c85f51d22	mm: also use alloc_large_system_hash() for the PID hash table This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Jan Beulich	4481374ce8	mm: replace various uses of num_physpages by totalram_pages Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Jan Beulich	4738e1b9cf	memory hotplug: fix updating of num_physpages for hot plugged memory Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. In line with that, the memory hotplug code should update num_physpages in a way that it retains its original (post-boot) meaning; in particular, decreasing the value should at best be done with great care - this patch doesn't try to ever decrease this value at all as it doesn't really seem meaningful to do so. Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Mel Gorman	78986a678f	page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone After anti-fragmentation was merged, a bug was reported whereby devices that depended on high-order atomic allocations were failing. The solution was to preserve a property in the buddy allocator which tended to keep the minimum number of free pages in the zone at the lower physical addresses and contiguous. To preserve this property, MIGRATE_RESERVE was introduced and a number of pageblocks at the start of a zone would be marked "reserve", the number of which depended on min_free_kbytes. Anti-fragmentation works by avoiding the mixing of page migratetypes within the same pageblock. One way of helping this is to increase min_free_kbytes because it becomes less like that it will be necessary to place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept there in large contiguous blocks instead of helping anti-fragmentation as much as it should. With the page-allocator tracepoint patches applied, it was found during anti-fragmentation tests that the number of fragmentation-related events were far higher than expected even with min_free_kbytes at higher values. This patch limits the number of MIGRATE_RESERVE blocks that exist per zone to two. For example, with a sufficient min_free_kbytes, 4MB of memory will be kept aside on an x86-64 and remain more or less free and contiguous for the systems uptime. This should be sufficient for devices depending on high-order atomic allocations while helping fragmentation control when min_free_kbytes is tuned appropriately. As side-effect of this patch is that the reserve variable is converted to int as unsigned long was the wrong type to use when ensuring that only the required number of reserve blocks are created. With the patches applied, fragmentation-related events as measured by the page allocator tracepoints were significantly reduced when running some fragmentation stress-tests on systems with min_free_kbytes tuned to a value appropriate for hugepage allocations at runtime. On x86, the events recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by 99.83%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Johannes Weiner	ceddc3a52d	mm: document is_page_cache_freeable() Enlighten the reader of this code about what reference count makes a page cache page freeable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Johannes Weiner	edcf4748cd	mm: return boolean from page_has_private() Make page_has_private() return a true boolean value and remove the double negations from the two callsites using it for arithmetic. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Johannes Weiner	6c0b13519d	mm: return boolean from page_is_file_cache() page_is_file_cache() has been used for both boolean checks and LRU arithmetic, which was always a bit weird. Now that page_lru_base_type() exists for LRU arithmetic, make page_is_file_cache() a real predicate function and adjust the boolean-using callsites to drop those pesky double negations. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:37 -07:00
Johannes Weiner	401a8e1c16	mm: introduce page_lru_base_type() Instead of abusing page_is_file_cache() for LRU list index arithmetic, add another helper with a more appropriate name and convert the non-boolean users of page_is_file_cache() accordingly. This new helper gives the LRU base type a page is supposed to live on, inactive anon or inactive file. [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Johannes Weiner	b7c46d151c	mm: drop unneeded double negations Remove double negations where the operand is already boolean. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Sage Weil	bba7881954	mm: remove broken 'kzalloc' mempool The kzalloc mempool zeros items when they are initially allocated, but does not rezero used items that are returned to the pool. Consequently mempool_alloc()s may return non-zeroed memory. Since there are/were only two in-tree users for mempool_create_kzalloc_pool(), and 'fixing' this in a way that will re-zero used (but not new) items before first use is non-trivial, just remove it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Jaswinder Singh Rajput	72ff13b703	mm: includecheck fix for mm/nommu.c Fix the following 'make includecheck' warning: mm/nommu.c: internal.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Jaswinder Singh Rajput	cff397e6b3	mm: includecheck fix for mm/shmem.c Fix the following 'make includecheck' warning: mm/shmem.c: linux/vfs.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Daisuke Nishimura	2ca4532a49	mm: add_to_swap_cache() does not return -EEXIST After commit `355cfa73` ("mm: modify swap_map and add SWAP_HAS_CACHE flag"), only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare() or get_swap_page() would call add_to_swap_cache(). So add_to_swap_cache() doesn't return -EEXIST any more. Even though it doesn't return -EEXIST, it's not good behavior conceptually to call swapcache_prepare() in the -EEXIST case, because it means clearing SWAP_HAS_CACHE flag while the entry is on swap cache. This patch removes redundant codes and comments from callers of it, and adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Daisuke Nishimura	31a5639623	mm: add_to_swap_cache() must not sleep After commit `355cfa73` ("mm: modify swap_map and add SWAP_HAS_CACHE flag"), read_swap_cache_async() will busy-wait while a entry doesn't exist in swap cache but it has SWAP_HAS_CACHE flag. Such entries can exist on add/delete path of swap cache. On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is cleared) soon after __delete_from_swap_cache() is called. So, the busy-wait works well in most cases. But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and read_swap_cache_async() tries to swap-in the same entry on the same cpu. This patch calls radix_tree_preload() before swapcache_prepare() and divides add_to_swap_cache() into two part: radix_tree_preload() part and radix_tree_insert() part(define it as __add_to_swap_cache()). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Mel Gorman	0d3d062a6e	tracing, page-allocator: add trace event for page traffic related to the buddy lists The page allocation trace event reports that a page was successfully allocated but it does not specify where it came from. When analysing performance, it can be important to distinguish between pages coming from the per-cpu allocator and pages coming from the buddy lists as the latter requires the zone lock to the taken and more data structures to be examined. This patch adds a trace event for __rmqueue reporting when a page is being allocated from the buddy lists. It distinguishes between being called to refill the per-cpu lists or whether it is a high-order allocation. Similarly, this patch adds an event to catch when the PCP lists are being drained a little and pages are going back to the buddy lists. This is trickier to draw conclusions from but high activity on those events could explain why there were a large number of cache misses on a page-allocator-intensive workload. The coalescing and splitting of buddies involves a lot of writing of page metadata and cache line bounces not to mention the acquisition of an interrupt-safe lock necessary to enter this path. [akpm@linux-foundation.org: fix build] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
Mel Gorman	e0fff1bd12	tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypes Fragmentation avoidance depends on being able to use free pages from lists of the appropriate migrate type. In the event this is not possible, __rmqueue_fallback() selects a different list and in some circumstances change the migratetype of the pageblock. Simplistically, the more times this event occurs, the more likely that fragmentation will be a problem later for hugepage allocation at least but there are other considerations such as the order of page being split to satisfy the allocation. This patch adds a trace event for __rmqueue_fallback() that reports what page is being used for the fallback, the orders of relevant pages, the desired migratetype and the migratetype of the lists being used, whether the pageblock changed type and whether this event is important with respect to fragmentation avoidance or not. This information can be used to help analyse fragmentation avoidance and help decide whether min_free_kbytes should be increased or not. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
Mel Gorman	4b4f278c03	tracing, page-allocator: add trace events for page allocation and page freeing This patch adds trace events for the allocation and freeing of pages, including the freeing of pagevecs. Using the events, it will be known what struct page and pfns are being allocated and freed and what the call site was in many cases. The page alloc tracepoints be used as an indicator as to whether the workload was heavily dependant on the page allocator or not. You can make a guess based on vmstat but you can't get a per-process breakdown. Depending on the call path, the call_site for page allocation may be __get_free_pages() instead of a useful callsite. Instead of passing down a return address similar to slab debugging, the user should enable the stacktrace and seg-addr options to get a proper stack trace. The pagevec free tracepoint has a different usecase. It can be used to get a idea of how many pages are being dumped off the LRU and whether it is kswapd doing the work or a process doing direct reclaim. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
Mel Gorman	38a398572f	page-allocator: remove dead function free_cold_page() The function free_cold_page() has no callers so delete it. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
KAMEZAWA Hiroyuki	d0107eb073	kcore: fix vread/vwrite to be aware of holes vread/vwrite access vmalloc area without checking there is a page or not. In most case, this works well. In old ages, the caller of get_vm_ara() is only IOREMAP and there is no memory hole within vm_struct's [addr...addr + size - PAGE_SIZE] ( -PAGE_SIZE is for a guard page.) After per-cpu-alloc patch, it uses get_vm_area() for reserve continuous virtual address but remap _later_. There tend to be a hole in valid vmalloc area in vm_struct lists. Then, skip the hole (not mapped page) is necessary. This patch updates vread/vwrite() for avoiding memory hole. Routines which access vmalloc area without knowing for which addr is used are - /proc/kcore - /dev/kmem kcore checks IOREMAP, /dev/kmem doesn't. After this patch, IOREMAP is checked and /dev/kmem will avoid to read/write it. Fixes to /proc/kcore will be in the next patch in series. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
KAMEZAWA Hiroyuki	dd32c27998	vmalloc: unmap vmalloc area after hiding it vmap area should be purged after vm_struct is removed from the list because vread/vwrite etc...believes the range is valid while it's on vm_struct list. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Mel Gorman	2f66a68f3f	page-allocator: change migratetype for all pageblocks within a high-order page during __rmqueue_fallback When there are no pages of a target migratetype free, the page allocator selects a high-order block of another migratetype to allocate from. When the order of the page taken is greater than pageblock_order, all pageblocks within that high-order page should change migratetype so that pages are later freed to the correct free-lists. The current behaviour is that pageblocks change migratetype if the order being split matches the pageblock_order. When pageblock_order < MAX_ORDER-1, ownership is not changing correct and pages are being later freed to the incorrect list and this impacts fragmentation avoidance. This patch changes all pageblocks within the high-order page being split to the correct migratetype. Without the patch, allocation success rates for hugepages under stress were about 59% of physical memory on x86-64. With the patch applied, this goes up to 65%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Benjamin Herrenschmidt	fe1ff49d0d	mm: kmem_cache_create(): make it easier to catch NULL cache names Right now, if you inadvertently pass NULL to kmem_cache_create() at boot time, it crashes much later after boot somewhere deep inside sysfs which makes it very non obvious to figure out what's going on. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	7103ad323b	ksm: mremap use err from ksm_madvise mremap move's use of ksm_madvise() was assuming -ENOMEM on failure, because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says -ENOMEM (letting madvise convert that to -EAGAIN), and can also say -ERESTARTSYS when signalled: so pass the error from ksm_madvise. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	35451beecb	ksm: unmerge is an origin of OOMs Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	a913e182ab	ksm: clean up obsolete references A few cleanups, given the munlock fix: the comment on ksm_test_exit() no longer applies, and it can be made private to ksm.c; there's no more reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	8314c4f24a	ksm: remove VM_MERGEABLE_FLAGS KSM originally stood for Kernel Shared Memory: but the kernel has long supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is something else. So we switched to saying "merge" instead of "share". But Chris Wright points out that this is confusing where mmap.c merges adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by is_mergeable_vma() to let vmas be merged despite flags being different. Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use that directly, with a comment on it in is_mergeable_vma(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	7701c9c0f5	ksm: add some documentation Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	2ffd8679c8	ksm: sysfs and defaults At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y to provide the /sys/kernel/mm/ksm files to tune and activate it. Make KSM depend on SYSFS? Could do, but it might be better to provide some defaults so that KSM works out-of-the-box, ready for testers to madvise MADV_MERGEABLE, even without SYSFS. Though anyone serious is likely to want to retune the numbers to their taste once they have experience; and whether these settings ever reach 2.6.32 can be discussed along the way. Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Andrea Arcangeli	1c2fb7a4c2	ksm: fix deadlock with munlock in exit_mmap Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT \| MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Justin M. Forbes <jforbes@redhat.com> Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	9ba6929480	ksm: fix oom deadlock There's a now-obvious deadlock in KSM's out-of-memory handling: imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex, trying to allocate a page to break KSM in an mm which becomes the OOM victim (quite likely in the unmerge case): it's killed and goes to exit, and hangs there waiting to acquire ksm_thread_mutex. Clearly we must not require ksm_thread_mutex in __ksm_exit, simple though that made everything else: perhaps use mmap_sem somehow? And part of the answer lies in the comments on unmerge_ksm_pages: __ksm_exit should also leave all the rmap_item removal to ksmd. But there's a fundamental problem, that KSM relies upon mmap_sem to guarantee the consistency of the mm it's dealing with, yet exit_mmap tears down an mm without taking mmap_sem. And bumping mm_users won't help at all, that just ensures that the pages the OOM killer assumes are on their way to being freed will not be freed. The best answer seems to be, to move the ksm_exit callout from just before exit_mmap, to the middle of exit_mmap: after the mm's pages have been freed (if the mmu_gather is flushed), but before its page tables and vma structures have been freed; and down_write,up_write mmap_sem there to serialize with KSM's own reliance on mmap_sem. But KSM then needs to be careful, whenever it downs mmap_sem, to check that the mm is not already exiting: there's a danger of using find_vma on a layout that's being torn apart, or writing into page tables which have been freed for reuse; and even do_anonymous_page and __do_fault need to check they're not being called by break_ksm to reinstate a pte after zap_pte_range has zapped that page table. Though it might be clearer to add an exiting flag, set while holding mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating a zapped pte. All we need is to check whether mm_users is 0 - but must remember that ksmd may detect that before __ksm_exit is reached. So, ksm_test_exit(mm) added to comment such checks on mm->mm_users. __ksm_exit now has to leave clearing up the rmap_items to ksmd, that needs ksm_thread_mutex; but shift the exiting mm just after the ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise mm_count to hold the mm_struct, ksmd's exit processing (exactly like its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it, similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd). But also give __ksm_exit a fast path: when there's no complication (no rmap_items attached to mm and it's not at the ksm_scan cursor), it can safely do all the exiting work itself. This is not just an optimization: when ksmd is not running, the raised mm_count would otherwise leak mm_structs. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	cd551f9751	ksm: distribute remove_mm_from_lists Do some housekeeping in ksm.c, to help make the next patch easier to understand: remove the function remove_mm_from_lists, distributing its code to its callsites scan_get_next_rmap_item and __ksm_exit. That turns out to be a win in scan_get_next_rmap_item: move its remove_trailing_rmap_items and cursor advancement up, and it becomes simpler than before. __ksm_exit becomes messier, but will change again; and moving its remove_trailing_rmap_items up lets us strengthen the unstable tree item's age condition in remove_rmap_item_from_tree. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	d952b79136	ksm: fix endless loop on oom break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should only be a problem for ksmd when a memory control group imposes limits (normally the OOM killer will kill others with an mm until it succeeds); but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we do need to route the error (or kill) back to the caller (or sighandling). Test signal_pending in unmerge_ksm_pages, which could be a lengthy procedure if it has to spill into swap: returning -ERESTARTSYS so that trivial signals will restart but fatals will terminate (is that right? we do different things in different places in mm, none exactly this). unmerge_and_remove_all_rmap_items was forgetting to lock when going down the mm_list: fix that. Whether it's successful or not, reset ksm_scan cursor to head; but only if it's successful, reset seqnr (shown in full_scans) - page counts will have gone down to zero. This patch leaves a significant OOM deadlock, but it's a good step on the way, and that deadlock is fixed in a subsequent patch. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	81464e3060	ksm: five little cleanups 1. We don't use __break_cow entry point now: merge it into break_cow. 2. remove_all_slot_rmap_items is just a special case of remove_trailing_rmap_items: use the latter instead. 3. Extend comment on unmerge_ksm_pages and rmap_items. 4. try_to_merge_two_pages should use try_to_merge_with_ksm_page instead of duplicating its code; and so swap them around. 5. Comment on cmp_and_merge_page described last year's: update it. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	6e15838425	ksm: keep quiet while list empty ksm_scan_thread already sleeps in wait_event_interruptible until setting ksm_run activates it; but if there's nothing on its list to look at, i.e. nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking up system time and full_scans: ksmd_should_run added to check that too. And move the mutex_lock out around it: the new counts showed that when ksm_run is stopped, a little work often got done afterwards, because it had been read before taking the mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	26465d3ea5	ksm: break cow once unshared We kept agreeing not to bother about the unswappable shared KSM pages which later become unshared by others: observation suggests they're not a significant proportion. But they are disadvantageous, and it is easier to break COW to replace them by swappable pages, than offer statistics to show that they don't matter; then we can stop worrying about them. Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on this pass: give them a good chance of getting into the unstable tree on the next pass, or back into the stable, by computing checksum now. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	473b0ce4d1	ksm: pages_unshared and pages_volatile The pages_shared and pages_sharing counts give a good picture of how successful KSM is at sharing; but no clue to how much wasted work it's doing to get there. Add pages_unshared (count of unique pages waiting in the unstable tree, hoping to find a mate) and pages_volatile. pages_volatile is harder to define. It includes those pages changing too fast to get into the unstable tree, but also whatever other edge conditions prevent a page getting into the trees: a high value may deserve investigation. Don't try to calculate it from the various conditions: it's the total of rmap_items less those accounted for. Also show full_scans: the number of completed scans of everything registered in the mm list. The locking for all these counts is simply ksm_thread_mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	e178dfde39	ksm: move pages_sharing updates The pages_shared count is incremented and decremented when adding a node to and removing a node from the stable tree: easy to understand. But the pages_sharing count was hard to follow, being adjusted in various places: increment and decrement it when adding to and removing from the stable tree. And the pages_sharing variable used to include the pages_shared, then those were subtracted when shown in the pages_sharing sysfs file: now keep it as an exclusive count of leaves hanging off the stable tree nodes, throughout. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	b402826033	ksm: rename kernel_pages_allocated We're not implementing swapping of KSM pages in its first release; but when that follows, "kernel_pages_allocated" will be a very poor name for the sysfs file showing number of nodes in the stable tree: rename that to "pages_shared" throughout. But we already have a "pages_shared", counting those page slots sharing the shared pages: first rename that to... "pages_sharing". What will become of "max_kernel_pages" when the pages shared can be swapped? I guess it will just be removed, so keep that name. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Izik Eidus	339aa62469	ksm: change ksm nice level to be 5 ksm should try not to disturb other tasks as much as possible. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Izik Eidus	36b2528dc1	ksm: change copyright message Adding Hugh Dickins into the authors list. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Hugh Dickins	1ff8299573	ksm: prevent mremap move poisoning KSM's scan allows for user pages to be COWed or unmapped at any time, without requiring any notification. But its stable tree does assume that when it finds a KSM page where it placed a KSM page, then it is the same KSM page that it placed there. mremap move could break that assumption: if an area containing a KSM page was unmapped, then an area containing a different KSM page was moved with mremap into the place of the original, before KSM's scan came around to notice. That could then poison a node of the stable tree, so that memcmps would "lie" and upset the ordering of the tree. Probably noone will ever need mremap move on a VM_MERGEABLE area; except that prohibiting it would make trouble for schemes in which we try making everything VM_MERGEABLE e.g. for testing: an mremap which normally works would then fail mysteriously. There's no need to go to any trouble, such as re-sorting KSM's list of rmap_items to match the new layout: simply unmerge the area to COW all its KSM pages before moving, but leave VM_MERGEABLE on so that they're remerged later. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Izik Eidus	31dbd01f31	ksm: Kernel SamePage Merging Ksm is code that allows merging of identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only, then COWed when any application tries to change them. Whereas fork() allows sharing anonymous pages between parent and child, ksm can share anonymous pages between unrelated processes. Ksm works by walking over the memory pages of the applications it scans, in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to locate identical pages in an effective way. When ksm finds two identical pages, it marks them as readonly and merges them into a single page. After the pages have been marked as readonly and merged into one, Linux treats them as normal copy-on-write pages, copying to a fresh anonymous page if write access is required later. Ksm scans and merges anonymous pages only in those memory areas that have been registered with it by madvise(addr, length, MADV_MERGEABLE). The ksm scanner is controlled by sysfs files in /sys/kernel/mm/ksm/: max_kernel_pages - the maximum number of unswappable kernel pages which may be allocated by ksm (0 for unlimited). kernel_pages_allocated - how many ksm pages are currently allocated, sharing identical content between different processes (pages unswappable in this release). pages_shared - how many pages have been saved by sharing with ksm pages (kernel_pages_allocated being excluded from this count). pages_to_scan - how many pages ksm should scan before sleeping. sleep_millisecs - how many milliseconds ksm should sleep between scans. run - write 0 to disable ksm, read 0 while ksm is disabled (default), write 1 to run ksm, read 1 while ksm is running, write 2 to disable ksm and unmerge all its pages. Includes contributions by Andrea Arcangeli Chris Wright and Hugh Dickins. [hugh.dickins@tiscali.co.uk: fix rare page leak] Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Hugh Dickins	9a84089514	ksm: identify PageKsm pages KSM will need to identify its kernel merged pages unambiguously, and /proc/kpageflags will probably like to do so too. Since KSM will only be substituting anonymous pages, statistics are best preserved by making a PageKsm page a special PageAnon page: one with no anon_vma. But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Hugh Dickins	21333b2b66	ksm: no debug in page_dup_rmap() page_dup_rmap(), used on each mapped page when forking, was originally just an inline atomic_inc of mapcount. 2.6.22 added CONFIG_DEBUG_VM out-of-line checks to it, which would need to be ever-so-slightly complicated to allow for the PageKsm() we're about to define. But I think these checks never caught anything. And if it's coding errors we're worried about, such checks should be in page_remove_rmap() too, not just when forking; whereas if it's pagetable corruption we're worried about, then they shouldn't be limited to CONFIG_DEBUG_VM. Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Hugh Dickins	f8af4da3b4	ksm: the mm interface to ksm This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Hugh Dickins	3866ea90d3	ksm: first tidy up madvise_vma() madvise.c has several levels of switch statements, what to do in which? Move MADV_DOFORK code down from madvise_vma() to madvise_behavior(), so madvise_vma() can be a simple router, to madvise_behavior() by default. vma->vm_flags is an unsigned long so use the same type for new_flags. Add missing comment lines to describe MADV_DONTFORK and MADV_DOFORK. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Izik Eidus	828502d300	ksm: add mmu_notifier set_pte_at_notify() KSM is a linux driver that allows dynamicly sharing identical memory pages between one or more processes. Unlike tradtional page sharing that is made at the allocation of the memory, ksm do it dynamicly after the memory was created. Memory is periodically scanned; identical pages are identified and merged. The sharing is made in a transparent way to the processes that use it. Ksm is highly important for hypervisors (kvm), where in production enviorments there might be many copys of the same data data among the host memory. This kind of data can be: similar kernels, librarys, cache, and so on. Even that ksm was wrote for kvm, any userspace application that want to use it to share its data can try it. Ksm may be useful for any application that might have similar (page aligment) data strctures among the memory, ksm will find this data merge it to one copy, and even if it will be changed and thereforew copy on writed, ksm will merge it again as soon as it will be identical again. Another reason to consider using ksm is the fact that it might simplify alot the userspace code of application that want to use shared private data, instead that the application will mange shared area, ksm will do this for the application, and even write to this data will be allowed without any synchinization acts from the application. Ksm was designed to be a loadable module that doesn't change the VM code of linux. This patch: The set_pte_at_notify() macro allows setting a pte in the shadow page table directly, instead of flushing the shadow page table entry and then getting vmexit to set it. It uses a new change_pte() callback to do so. set_pte_at_notify() is an optimization for kvm, and other users of mmu_notifiers, for COW pages. It is useful for kvm when ksm is used, because it allows kvm not to have to receive vmexit and only then map the ksm page into the shadow page table, but instead map it directly at the same time as Linux maps the page into the host page table. Users of mmu_notifiers who don't implement new mmu_notifier_change_pte() callback will just receive the mmu_notifier_invalidate_page() callback. Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
Johannes Weiner	451ea25da7	mm: perform non-atomic test-clear of PG_mlocked on free By the time PG_mlocked is cleared in the page freeing path, nobody else is looking at our page->flags anymore. It is thus safe to make the test-and-clear non-atomic and thereby removing an unnecessary and expensive operation from a hotpath. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
Figo.zhang	bf88c8c83e	vmalloc.c: fix double error checking There is no need for double error checking. Signed-off-by: Figo.zhang <figo1802@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
Akinobu Mita	945a11136e	mm: add gfp mask checking for __get_free_pages() __get_free_pages() with __GFP_HIGHMEM is not safe because the return address cannot represent a highmem page. get_zeroed_page() already has such a debug checking. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
KOSAKI Motohiro	a26f5320c4	vmscan: kill unnecessary prefetch The pages in the list passed move_active_pages_to_lru() are already touched by shrink_active_list(). IOW the prefetch in move_active_pages_to_lru() don't populate any cache. it's pointless. This patch remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
KOSAKI Motohiro	74a1c48fb4	vmscan: kill unnecessary page flag test The page_lru() already evaluate PageActive() and PageSwapBacked(). We don't need to re-evaluate it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
KOSAKI Motohiro	5205e56eea	vmscan: move ClearPageActive from move_active_pages() to shrink_active_list() The move_active_pages_to_lru() function is called under irq disabled and ClearPageActive() doesn't need irq disabling. Then, this patch move it into shrink_active_list(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
Minchan Kim	de2e7567c7	vmscan: don't attempt to reclaim anon page in lumpy reclaim when no swap space is available The VM already avoids attempting to reclaim anon pages in various places, But it doesn't avoid it for lumpy reclaim. It shuffles lru list unnecessary so that it is pointless. [akpm@linux-foundation.org: cleanup] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
Wu Fengguang	adea02a1be	mm: count only reclaimable lru pages global_lru_pages() / zone_lru_pages() can be used in two ways: - to estimate max reclaimable pages in determine_dirtyable_memory() - to calculate the slab scan ratio When swap is full or not present, the anon lru lists are not reclaimable and also won't be scanned. So the anon pages shall not be counted in both usage scenarios. Also rename to _reclaimable_pages: now they are counting the possibly reclaimable lru pages. It can greatly (and correctly) increase the slab scan rate under high memory pressure (when most file pages have been reclaimed and swap is full/absent), thus reduce false OOM kills. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: David Howells <dhowells@redhat.com> Cc: "Li, Ming Chun" <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:30 -07:00
Rik van Riel	35cd78156c	vmscan: throttle direct reclaim when too many pages are isolated already When way too many processes go into direct reclaim, it is possible for all of the pages to be taken off the LRU. One result of this is that the next process in the page reclaim code thinks there are no reclaimable pages left and triggers an out of memory kill. One solution to this problem is to never let so many processes into the page reclaim path that the entire LRU is emptied. Limiting the system to only having half of each inactive list isolated for reclaim should be safe. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:29 -07:00
KOSAKI Motohiro	a731286de6	mm: vmstat: add isolate pages If the system is running a heavy load of processes then concurrent reclaim can isolate a large number of pages from the LRU. /proc/vmstat and the output generated for an OOM do not show how many pages were isolated. This has been observed during process fork bomb testing (mstctl11 in LTP). This patch shows the information about isolated pages. Reproduced via: ----------------------- % ./hackbench 140 process 1000 => OOM occur active_anon:146 inactive_anon:0 isolated_anon:49245 active_file:79 inactive_file:18 isolated_file:113 unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39 free:370 slab_reclaimable:309 slab_unreclaimable:5492 mapped:53 shmem:15 pagetables:28140 bounce:0 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:29 -07:00
KOSAKI Motohiro	b35ea17b7b	mm: shrink_inactive_list() nr_scan accounting fix fix If sc->isolate_pages() return 0, we don't need to call shrink_page_list(). In past days, shrink_inactive_list() handled it properly. But commit `fb8d14e1` (three years ago commit!) breaked it. current shrink_inactive_list() always call shrink_page_list() although isolate_pages() return 0. This patch restore proper return value check. Requirements: o "nr_taken == 0" condition should stay before calling shrink_page_list(). o "nr_taken == 0" condition should stay after nr_scan related statistics modification. Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:28 -07:00
KOSAKI Motohiro	44c241f166	mm: rename pgmoved variable in shrink_active_list() Currently the pgmoved variable has two meanings. It causes harder reviewing. This patch separates it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
David Rientjes	b259fbde0a	mm: update alloc_flags after oom killer has been called It is possible for the oom killer to select current as the task to kill. When this happens, alloc_flags needs to be updated accordingly to set ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory reserves as the result of its thread having TIF_MEMDIE set if the allocation is not __GFP_NOMEMALLOC. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	4b02108ac1	mm: oom analysis: add shmem vmstat Recently we encountered OOM problems due to memory use of the GEM cache. Generally a large amuont of Shmem/Tmpfs pages tend to create a memory shortage problem. We often use the following calculation to determine the amount of shmem pages: shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES however the expression does not consider isolated and mlocked pages. This patch adds explicit accounting for pages used by shmem and tmpfs. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	c6a7f5728a	mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output The amount of memory allocated to kernel stacks can become significant and cause OOM conditions. However, we do not display the amount of memory consumed by stacks. Add code to display the amount of memory used for stacks in /proc/meminfo. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	71de1ccbe1	mm: oom analysis: add buffer cache information to show_free_areas() It is often useful to know the statistics for all pages that are handled like page cache pages when looking at OOM log output. Therefore show_free_areas() should also display buffer cache statistics. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	4a0aa73f1d	mm: oom analysis: add per-zone statistics to show_free_areas() show_free_areas() displays only a limited amount of zone counters. This patch includes additional counters in the display to allow easier debugging. This may be especially useful if an OOM is due to running out of DMA memory. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	3701b03323	mm: show_free_areas(): display slab pages in two separate fields If an OOM happens, we really want to know the number of remaining reclaimable pages. So the reclaimable slab and unreclaimable slab fields should not be combined for display. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
KOSAKI Motohiro	b904dcfed6	mm: clean up page_remove_rmap() page_remove_rmap() has multiple PageAnon() tests and it has deep nesting. Clean this up. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Lee Schermerhorn	57dd28fb05	hugetlb: restore interleaving of bootmem huge pages I noticed that alloc_bootmem_huge_page() will only advance to the next node on failure to allocate a huge page, potentially filling nodes with huge-pages. I asked about this on linux-mm and linux-numa, cc'ing the usual huge page suspects. Mel Gorman responded: I strongly suspect that the same node being used until allocation failure instead of round-robin is an oversight and not deliberate at all. It appears to be a side-effect of a fix made way back in commit `63b4613c3f` ["hugetlb: fix hugepage allocation with memoryless nodes"]. Prior to that patch it looked like allocations would always round-robin even when allocation was successful. This patch--factored out of my "hugetlb mempolicy" series--moves the advance of the hstate next node from which to allocate up before the test for success of the attempted allocation. Note that alloc_bootmem_huge_page() is only used for order > MAX_ORDER huge pages. I'll post a separate patch for mainline/stable, as the above mentioned "balance freeing" series renamed the next node to alloc function. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andy Whitcroft <apw@canonical.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Lee Schermerhorn	685f345708	hugetlb: use free_pool_huge_page() to return unused surplus pages Use the [modified] free_pool_huge_page() function to return unused surplus pages. This will help keep huge pages balanced across nodes between freeing of unused surplus pages and freeing of persistent huge pages [from set_max_huge_pages] by using the same node id "cursor". It also eliminates some code duplication. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Lee Schermerhorn	e8c5c82498	hugetlb: balance freeing of huge pages across nodes Free huges pages from nodes in round robin fashion in an attempt to keep [persistent a.k.a static] hugepages balanced across nodes New function free_pool_huge_page() is modeled on and performs roughly the inverse of alloc_fresh_huge_page(). Replaces dequeue_huge_page() which now has no callers, so this patch removes it. Helper function hstate_next_node_to_free() uses new hstate member next_to_free_nid to distribute "frees" across all nodes with huge pages. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Randy Dunlap	55a4462af5	page_alloc: fix kernel-doc warning Ummark function as having kernel-doc notation, fixing the kernel-doc warning. Warning(mm/page_alloc.c:4519): No description found for parameter 'zone' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Shaohua Li	abfc348811	memory hotplug: migrate swap cache page In test, some pages in swap-cache can't be migrated, as they aren't rmap. unmap_and_move() ignores swap-cache page which is just read in and hasn't rmap (see the comments in the code), but swap_aops provides .migratepage. Better to migrate such pages instead of ignore them. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Shaohua Li	f52407ce2d	memory hotplug: alloc page from other node in memory online To initialize hotadded node, some pages are allocated. At that time, the node hasn't memory, this makes the allocation always fail. In such case, let's allocate pages from other nodes. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Yakui Zhao <yakui.zhao@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Shaohua Li	8e7e40d965	memory hotplug: make pages from movable zone always isolatable Pages on movable zone have two types, MIGRATE_MOVABLE and MIGRATE_RESERVE, both them can be movable, because only movable memory allocation can get pages from movable zone. This makes pages in movable zone always be able to migrate. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:26 -07:00
Shaohua Li	6fb332fabd	memory hotplug: exclude isolated page from pco page alloc Pages marked as isolated should not be allocated again. If such pages reside in pcp list, they can be allocated too, so there is a ping-pong memory offline frees some pages to pcp list and the pages get allocated and then memory offline frees them again, this loop will happen again and again. This should have no impact in normal code path, because in normal code path, pages in pcp list aren't isolated, and below loop will break in the first entry. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Shaohua Li	112067f090	memory hotplug: update zone pcp at memory online In my test, 128M memory is hot added, but zone's pcp batch is 0, which is an obvious error. When pages are onlined, zone pcp should be updated accordingly. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
David Rientjes	478b81fd84	mm: remove obsoleted alloc_pages cpuset comment When a cpuset's nodemask is updated, all attached tasks have their cached task->mems_allowed updated by a heap instead of requiring an explicit call to cpuset_update_task_memory_state(), which has since been removed in `58568d2a82` ("cpuset,mm: update tasks' mems_allowed in time"). Remove the obsoleted comment from the page allocator. Cc: Paul Menage <menage@google.com> Acked-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Linus Torvalds	43c1266ce4	Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Tidy up after the big rename perf: Do the big rename: Performance Counters -> Performance Events perf_counter: Rename 'event' to event_id/hw_event perf_counter: Rename list_entry -> group_entry, counter_list -> group_list Manually resolved some fairly trivial conflicts with the tracing tree in include/trace/ftrace.h and kernel/trace/trace_syscalls.c.	2009-09-21 09:15:07 -07:00
Jens Axboe	87c6a9b253	writeback: make balance_dirty_pages() gradually back more off Currently it just sleeps for a very short time, just 1 jiffy. If we keep looping in there, continually delay for a little longer of up to 100msec in total. That was the old limit for congestion wait. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-21 15:40:33 +02:00
Jens Axboe	3542a5c0de	writeback: don't use schedule_timeout() without setting runstate Just use schedule_timeout_interruptible(), saves a call to set_current_state(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-21 15:40:33 +02:00
Frans Pop	22f8b45822	trivial: improve help text for mm debug config options Improve the help text for PAGE_POISONING. Also fix some typos and improve consistency within the file. Signed-of-by: Frans Pop <elendil@planet.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:57 +02:00
Ingo Molnar	cdd6c482c9	perf: Do the big rename: Performance Counters -> Performance Events Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f \| grep -vE 'oprofile\|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N \| sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 14:28:04 +02:00
Alexey Dobriyan	6952b61de9	headers: taskstats_kern.h trim Remove net/genetlink.h inclusion, now sched.c won't be recompiled because of some networking changes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-18 09:48:52 -07:00
Jianjun Kong	27f5de7963	mm: Fix problem of parameter in note 'current' is a pointer, so the right form is 'down_write(&current->mm->mmap_sem)'. Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-18 09:48:52 -07:00
Linus Torvalds	ab86e5765d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev debugfs: Modify default debugfs directory for debugging pktcdvd. debugfs: Modified default dir of debugfs for debugging UHCI. debugfs: Change debugfs directory of IWMC3200 debugfs: Change debuhgfs directory of trace-events-sample.h debugfs: Fix mount directory of debugfs by default in events.txt hpilo: add poll f_op hpilo: add interrupt handler hpilo: staging for interrupt handling driver core: platform_device_add_data(): use kmemdup() Driver core: Add support for compatibility classes uio: add generic driver for PCI 2.3 devices driver-core: move dma-coherent.c from kernel to driver/base mem_class: fix bug mem_class: use minor as index instead of searching the array driver model: constify attribute groups UIO: remove 'default n' from Kconfig Driver core: Add accessor for device platform data Driver core: move dev_get/set_drvdata to drivers/base/dd.c Driver core: add new device to bus's list before probing	2009-09-16 08:27:10 -07:00
Linus Torvalds	a3eb51ecfa	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: fix possible bdi writeback refcounting problem writeback: Fix bdi use after free in wb_work_complete() writeback: improve scalability of bdi writeback work queues writeback: remove smp_mb(), it's not needed with list_add_tail_rcu() writeback: use schedule_timeout_interruptible() writeback: add comments to bdi_work structure writeback: splice dirty inode entries to default bdi on bdi_destroy() writeback: separate starting of sync vs opportunistic writeback writeback: inline allocation failure handling in bdi_alloc_queue_work() writeback: use RCU to protect bdi_list writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout fs: Assign bdi in super_block writeback: make wb_writeback() take an argument structure writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE writeback: get rid of wbc->for_writepages fs: remove bdev->bd_inode_backing_dev_info	2009-09-16 07:45:38 -07:00
Jens Axboe	ce5f8e7795	writeback: splice dirty inode entries to default bdi on bdi_destroy() We cannot safely ensure that the inodes are all gone at this point in time, and we must not destroy this bdi with inodes having off it. So just splice our entries to the default bdi since that one will always persist. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-16 15:18:52 +02:00
Jens Axboe	b6e51316da	writeback: separate starting of sync vs opportunistic writeback bdi_start_writeback() is currently split into two paths, one for WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback() for WB_SYNC_ALL writeback and let bdi_start_writeback() handle only WB_SYNC_NONE. Push down the writeback_control allocation and only accept the parameters that make sense for each function. This cleans up the API considerably. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-16 15:18:52 +02:00
Jens Axboe	cfc4ba5365	writeback: use RCU to protect bdi_list Now that bdi_writeback_all() no longer handles integrity writeback, it doesn't have to block anymore. This means that we can switch bdi_list reader side protection to RCU. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-16 15:18:51 +02:00
Jens Axboe	1fe06ad892	writeback: get rid of wbc->for_writepages It's only set, it's never checked. Kill it. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-16 15:16:18 +02:00
Andi Kleen	cae681fc12	HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Useful for some testing scenarios, although specific testing is often done better through MADV_POISON This can be done with the x86 level MCE injector too, but this interface allows it to do independently from low level x86 changes. v2: Add module license (Haicheng Li) Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:17 +02:00
Andi Kleen	9893e49d64	HWPOISON: Add madvise() based injector for hardware poisoned pages v4 Impact: optional, useful for debugging Add a new madvice sub command to inject poison for some pages in a process' address space. This is useful for testing the poison page handling. This patch can allow root to tie up large amounts of memory. I got feedback from container developers and they didn't see any problem. v2: Use write flag for get_user_pages to make sure to always get a fresh page v3: Don't request write mapping (Fengguang Wu) v4: Move MADV_* number to avoid conflict with KSM (Hugh Dickins) Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:17 +02:00
Andi Kleen	aa261f549d	HWPOISON: Enable .remove_error_page for migration aware file systems Enable removing of corrupted pages through truncation for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs These should cover most server needs. I chose the set of migration aware file systems for this for now, assuming they have been especially audited. But in general it should be safe for all file systems on the data area that support read/write and truncate. Caveat: the hardware error handler does not take i_mutex for now before calling the truncate function. Is that ok? Cc: tytso@mit.edu Cc: hch@infradead.org Cc: mfasheh@suse.com Cc: aia21@cantab.net Cc: hugh.dickins@tiscali.co.uk Cc: swhiteho@redhat.com Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:16 +02:00
Andi Kleen	6a46079cf5	HWPOISON: The high level memory error handler in the VM v7 Add the high level memory handler that poisons pages that got corrupted by hardware (typically by a two bit flip in a DIMM or a cache) on the Linux level. The goal is to prevent everyone from accessing these pages in the future. This done at the VM level by marking a page hwpoisoned and doing the appropriate action based on the type of page it is. The code that does this is portable and lives in mm/memory-failure.c To quote the overview comment: High level machine check handler. Handles pages reported by the hardware as being corrupted usually due to a 2bit ECC memory or cache failure. This focuses on pages detected as corrupted in the background. When the current CPU tries to consume corruption the currently running process can just be killed directly instead. This implies that if the error cannot be handled for some reason it's safe to just ignore it because no corruption has been consumed yet. Instead when that happens another machine check will happen. Handles page cache pages in various states. The tricky part here is that we can access any page asynchronous to other VM users, because memory failures could happen anytime and anywhere, possibly violating some of their assumptions. This is why this code has to be extremely careful. Generally it tries to use normal locking rules, as in get the standard locks, even if that means the error handling takes potentially a long time. Some of the operations here are somewhat inefficient and have non linear algorithmic complexity, because the data structures have not been optimized for this case. This is in particular the case for the mapping from a vma to a process. Since this case is expected to be rare we hope we can get away with this. There are in principle two strategies to kill processes on poison: - just unmap the data and wait for an actual reference before killing - kill as soon as corruption is detected. Both have advantages and disadvantages and should be used in different situations. Right now both are implemented and can be switched with a new sysctl vm.memory_failure_early_kill The default is early kill. The patch does some rmap data structure walking on its own to collect processes to kill. This is unusual because normally all rmap data structure knowledge is in rmap.c only. I put it here for now to keep everything together and rmap knowledge has been seeping out anyways Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu, Nick Piggin (who did a lot of great work) and others. Cc: npiggin@suse.de Cc: riel@redhat.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>	2009-09-16 11:50:15 +02:00
Wu Fengguang	6746aff74d	HWPOISON: shmem: call set_page_dirty() with locked page The dirtying of page and set_page_dirty() can be moved into the page lock. - In shmem_write_end(), the page was dirtied while the page lock was held, but it's being marked dirty just after dropping the page lock. - In shmem_symlink(), both dirtying and marking can be moved into page lock. It's valuable for the hwpoison code to know whether one bad page can be dropped without losing data. It mainly judges by testing the PG_dirty bit after taking the page lock. So it becomes important that the dirtying of page and the marking of dirtiness are both done inside the page lock. Which is a common practice, but sadly not a rule. The noticeable exceptions are - mapped pages - pages with buffer_heads The above pages could go dirty at any time. Fortunately the hwpoison will unmap the page and release the buffer_heads beforehand anyway. Many other types of pages (eg. metadata pages) can also be dirtied at will by their owners, the hwpoison code cannot do meaningful things to them anyway. Only the dirtiness of pagecache pages owned by regular files are interested. v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra) Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:14 +02:00
Andi Kleen	2571873621	HWPOISON: Define a new error_remove_page address space op for async truncation Truncating metadata pages is not safe right now before we haven't audited all file systems. To enable truncation only for data address space define a new address_space callback error_remove_page. This is used for memory_failure.c memory error handling. This can be then set to truncate_inode_page() This patch just defines the new operation and adds documentation. Callers and users come in followon patches. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:13 +02:00
Wu Fengguang	83f786680a	HWPOISON: Add invalidate_inode_page Add a simple way to invalidate a single page This is just a refactoring of the truncate.c code. Originally from Fengguang, modified by Andi Kleen. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:13 +02:00
Nick Piggin	750b4987b0	HWPOISON: Refactor truncate to allow direct truncating of page v2 Extract out truncate_inode_page() out of the truncate path so that it can be used by memory-failure.c [AK: description, headers, fix typos] v2: Some white space changes from Fengguang Wu Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:12 +02:00
Wu Fengguang	2a7684a23e	HWPOISON: check and isolate corrupted free pages v2 If memory corruption hits the free buddy pages, we can safely ignore them. No one will access them until page allocation time, then prep_new_page() will automatically check and isolate PG_hwpoison page for us (for 0-order allocation). This patch expands prep_new_page() to check every component page in a high order page allocation, in order to completely stop PG_hwpoison pages from being recirculated. Note that the common case -- only allocating a single page, doesn't do any more work than before. Allocating > order 0 does a bit more work, but that's relatively uncommon. This simple implementation may drop some innocent neighbor pages, hopefully it is not a big problem because the event should be rare enough. This patch adds some runtime costs to high order page users. [AK: Improved description] v2: Andi Kleen: Port to -mm code Move check into separate function. Don't dump stack in bad_pages for hwpoisoned pages. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:12 +02:00
Andi Kleen	888b9f7c58	HWPOISON: Handle hardware poisoned pages in try_to_unmap When a page has the poison bit set replace the PTE with a poison entry. This causes the right error handling to be done later when a process runs into it. v2: add a new flag to not do that (needed for the memory-failure handler later) (Fengguang) v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan) Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:11 +02:00
Andi Kleen	14fa31b89c	HWPOISON: Use bitmask/action code for try_to_unmap behaviour try_to_unmap currently has multiple modi (migration, munlock, normal unmap) which are selected by magic flag variables. The logic is not very straight forward, because each of these flag change multiple behaviours (e.g. migration turns off aging, not only sets up migration ptes etc.) Also the different flags interact in magic ways. A later patch in this series adds another mode to try_to_unmap, so this becomes quickly unmanageable. Replace the different flags with a action code (migration, munlock, munmap) and some additional flags as modifiers (ignore mlock, ignore aging). This makes the logic more straight forward and allows easier extension to new behaviours. Change all the caller to declare what they want to do. This patch is supposed to be a nop in behaviour. If anyone can prove it is not that would be a bug. Cc: Lee.Schermerhorn@hp.com Cc: npiggin@suse.de Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:10 +02:00
Andi Kleen	a3b947eacf	HWPOISON: Add poison check to page fault handling Bail out early when hardware poisoned pages are found in page fault handling. Since they are poisoned they should not be mapped freshly into processes, because that would cause another (potentially deadly) machine check This is generally handled in the same way as OOM, just a different error code is returned to the architecture code. v2: Do a page unlock if needed (Fengguang Wu) Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:08 +02:00
Andi Kleen	d1737fdbec	HWPOISON: Add basic support for poisoned pages in fault handler v3 - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now architectures have to explicitely enable poison page support, so this is forward compatible to all architectures. They only need to add it when they enable poison page support. - Add poison page handling in swap in fault code v2: Add missing delayacct_clear_flag (Hidehiro Kawai) v3: Really use delayacct_clear_flag (Hidehiro Kawai) Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:06 +02:00
Andi Kleen	a7420aa54d	HWPOISON: Add support for poison swap entries v2 Memory migration uses special swap entry types to trigger special actions on page faults. Extend this mechanism to also support poisoned swap entries, to trigger poison handling on page faults. This allows follow-on patches to prevent processes from faulting in poisoned pages again. v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu) v3: Better overflow fix (Hidehiro Kawai) Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:05 +02:00
Andi Kleen	10be22dfe1	HWPOISON: Export some rmap vma locking to outside world Needed for later patch that walks rmap entries on its own. This used to be very frowned upon, but memory-failure.c does some rather specialized rmap walking and rmap has been stable for quite some time, so I think it's ok now to export it. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-09-16 11:50:04 +02:00
Ingo Molnar	fdaa45e95d	slub: Fix build error in kmem_cache_open() with !CONFIG_SLUB_DEBUG This build bug: mm/slub.c: In function 'kmem_cache_open': mm/slub.c:2476: error: 'disable_higher_order_debug' undeclared (first use in this function) mm/slub.c:2476: error: (Each undeclared identifier is reported only once mm/slub.c:2476: error: for each function it appears in.) Triggers because there's no !CONFIG_SLUB_DEBUG definition for disable_higher_order_debug. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-09-15 22:32:10 +03:00
Kay Sievers	2b2af54a5b	Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev Devtmpfs lets the kernel create a tmpfs instance called devtmpfs very early at kernel initialization, before any driver-core device is registered. Every device with a major/minor will provide a device node in devtmpfs. Devtmpfs can be changed and altered by userspace at any time, and in any way needed - just like today's udev-mounted tmpfs. Unmodified udev versions will run just fine on top of it, and will recognize an already existing kernel-created device node and use it. The default node permissions are root:root 0600. Proper permissions and user/group ownership, meaningful symlinks, all other policy still needs to be applied by userspace. If a node is created by devtmps, devtmpfs will remove the device node when the device goes away. If the device node was created by userspace, or the devtmpfs created node was replaced by userspace, it will no longer be removed by devtmpfs. If it is requested to auto-mount it, it makes init=/bin/sh work without any further userspace support. /dev will be fully populated and dynamic, and always reflect the current device state of the kernel. With the commonly used dynamic device numbers, it solves the problem where static devices nodes may point to the wrong devices. It is intended to make the initial bootup logic simpler and more robust, by de-coupling the creation of the inital environment, to reliably run userspace processes, from a complex userspace bootstrap logic to provide a working /dev. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Tested-By: Harald Hoyer <harald@redhat.com> Tested-By: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-09-15 09:50:49 -07:00
Linus Torvalds	ada3fa1505	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits) powerpc64: convert to dynamic percpu allocator sparc64: use embedding percpu first chunk allocator percpu: kill lpage first chunk allocator x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA percpu: update embedding first chunk allocator to handle sparse units percpu: use group information to allocate vmap areas sparsely vmalloc: implement pcpu_get_vm_areas() vmalloc: separate out insert_vmalloc_vm() percpu: add chunk->base_addr percpu: add pcpu_unit_offsets[] percpu: introduce pcpu_alloc_info and pcpu_group_info percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward percpu: add @align to pcpu_fc_alloc_fn_t percpu: make @dyn_size mandatory for pcpu_setup_first_chunk() percpu: drop @static_size from first chunk allocators percpu: generalize first chunk allocator selection percpu: build first chunk allocators selectively percpu: rename 4k first chunk allocator to page percpu: improve boot messages percpu: fix pcpu_reclaim() locking ... Fix trivial conflict as by Tejun Heo in kernel/sched.c	2009-09-15 09:39:44 -07:00
Linus Torvalds	227423904c	Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, pat: Fix cacheflush address in change_page_attr_set_clr() mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set x86: Fix earlyprintk=dbgp for machines without NX x86, pat: Sanity check remap_pfn_range for RAM region x86, pat: Lookup the protection from memtype list on vm_insert_pfn() x86, pat: Add lookup_memtype to get the current memtype of a paddr x86, pat: Use page flags to track memtypes of RAM pages x86, pat: Generalize the use of page flag PG_uncached x86, pat: Add rbtree to do quick lookup in memtype tracking x86, pat: Add PAT reserve free to io_mapping* APIs x86, pat: New i/f for driver to request memtype for IO regions x86, pat: ioremap to follow same PAT restrictions as other PAT users x86, pat: Keep identity maps consistent with mmaps even when pat_disabled x86, mtrr: make mtrr_aps_delayed_init static bool x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init generic-ipi: Allow cpus not yet online to call smp_call_function with irqs disabled x86: Fix an incorrect argument of reserve_bootmem() x86: Fix system crash when loading with "reservetop" parameter	2009-09-15 09:19:38 -07:00
Tejun Heo	5579fd7e6a	Merge branch 'for-next' into for-linus * pcpu_chunk_page_occupied() doesn't exist in for-next. * pcpu_chunk_addr_search() updated to use raw_smp_processor_id(). Conflicts: mm/percpu.c	2009-09-15 09:57:19 +09:00
Linus Torvalds	355bbd8cb8	Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...	2009-09-14 17:55:15 -07:00
Linus Torvalds	69def9f05d	Merge branch 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (202 commits) MAINTAINERS: update KVM entry KVM: correct error-handling code KVM: fix compile warnings on s390 KVM: VMX: Check cpl before emulating debug register access KVM: fix misreporting of coalesced interrupts by kvm tracer KVM: x86: drop duplicate kvm_flush_remote_tlb calls KVM: VMX: call vmx_load_host_state() only if msr is cached KVM: VMX: Conditionally reload debug register 6 KVM: Use thread debug register storage instead of kvm specific data KVM guest: do not batch pte updates from interrupt context KVM: Fix coalesced interrupt reporting in IOAPIC KVM guest: fix bogus wallclock physical address calculation KVM: VMX: Fix cr8 exiting control clobbering by EPT KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp KVM: Document KVM_CAP_IRQCHIP KVM: Protect update_cr8_intercept() when running without an apic KVM: VMX: Fix EPT with WP bit change during paging KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors KVM: x86 emulator: Add adc and sbb missing decoder flags KVM: Add missing #include ...	2009-09-14 17:43:43 -07:00
Linus Torvalds	bb193c986a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: fix slab_pad_check() slub: release kobject if sysfs_create_group failed in sysfs_slab_add SLUB: fix ARCH_KMALLOC_MINALIGN cases 64 and 256 SLUB: Fix some coding style issues SLUB: Drop write permission to /proc/slabinfo slab: remove duplicate kmem_cache_init_late() declarations slub: change kmem_cache->align to record the real alignment slub: use size and objsize orders to disable debug flags slub: add option to disable higher order debugging slabs	2009-09-14 17:38:52 -07:00
Pekka Enberg	aceda77360	Merge branches 'slab/cleanups' and 'slab/fixes' into for-linus	2009-09-14 20:19:06 +03:00
Jan Kara	18f2ee705d	vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() Remove these three functions since nobody uses them anymore. Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:17 +02:00
Jan Kara	148f948ba8	vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:15 +02:00
Christoph Hellwig	eef9938067	vfs: Rename generic_file_aio_write_nolock generic_file_aio_write_nolock() is now used only by block devices and raw character device. Filesystems should use __generic_file_aio_write() in case generic_file_aio_write() doesn't suit them. So rename the function to blkdev_aio_write() and move it to fs/blockdev.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:15 +02:00
Jan Kara	c7b50db21f	vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() generic_file_direct_write() and generic_file_buffered_write() called generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But this is superfluous since generic_file_aio_write() does the syncing as well. Also XFS and OCFS2 which call these functions directly handle syncing themselves. So let's have a single place where syncing happens: generic_file_aio_write(). We slightly change the behavior by syncing only the range of file to which the write happened for buffered writes but that should be all that is required. CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:15 +02:00
Jan Kara	e4dd9de3c6	vfs: Export __generic_file_aio_write() and add some comments Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add comments to write helpers explaining how they should be used and export __generic_file_aio_write() since it will be used by some filesystems. CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:14 +02:00
Jan Kara	d3bccb6f4b	vfs: Introduce filemap_fdatawait_range This simple helper saves some filesystems conversion from byte offset to page numbers and also makes the fdata* interface more complete. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:14 +02:00
Christoph Hellwig	746cd1e7e4	block: use blkdev_issue_discard in blk_ioctl_discard blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard, the only difference between the two is that blkdev_issue_discard needs to send a barrier discard request and blk_ioctl_discard a non-barrier one, and blk_ioctl_discard needs to wait on the request. To facilitates this add a flags argument to blkdev_issue_discard to control both aspects of the behaviour. This will be very useful later on for using the waiting funcitonality for other callers. Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:53 +02:00
Eric Dumazet	8a3d271deb	slub: fix slab_pad_check() When SLAB_POISON is used and slab_pad_check() finds an overwrite of the slab padding, we call restore_bytes() on the whole slab, not only on the padding. Acked-by: Christoph Lameer <cl@linux-foundation.org> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-09-14 07:57:55 +03:00
Linus Torvalds	a12e4d304c	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: check for registered bdi in flusher add and inode dirty writeback: add name to backing_dev_info writeback: add some debug inode list counters to bdi stats writeback: get rid of pdflush completely writeback: switch to per-bdi threads for flushing data writeback: move dirty inodes from super_block to backing_dev_info writeback: get rid of generic_sync_sb_inodes() export	2009-09-11 09:17:05 -07:00
Linus Torvalds	1b195b170d	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Improve the "Early log buffer exceeded" error message kmemleak: fix sparse warning for static declarations kmemleak: fix sparse warning over overshadowed flags kmemleak: move common painting code together kmemleak: add clear command support kmemleak: use bool for true/false questions kmemleak: Do no create the clean-up thread during kmemleak_disable() kmemleak: Scan all thread stacks kmemleak: Don't scan uninitialized memory when kmemcheck is enabled kmemleak: Ignore the aperture memory hole on x86_64 kmemleak: Printing of the objects hex dump kmemleak: Do not report alloc_bootmem blocks as leaks kmemleak: Save the stack trace for early allocations kmemleak: Mark the early log buffer as __initdata kmemleak: Dump object information on request kmemleak: Allow rescheduling during an object scanning	2009-09-11 09:16:22 -07:00
Catalin Marinas	addd72c1a9	kmemleak: Improve the "Early log buffer exceeded" error message Based on a suggestion from Jaswinder, clarify what the user would need to do to avoid this error message from kmemleak. Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-11 10:42:09 +01:00
Jens Axboe	500b067c5e	writeback: check for registered bdi in flusher add and inode dirty Also a debugging aid. We want to catch dirty inodes being added to backing devices that don't do writeback. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:26 +02:00
Jens Axboe	d993831fa7	writeback: add name to backing_dev_info This enables us to track who does what and print info. Its main use is catching dirty inodes on the default_backing_dev_info, so we can fix that up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:26 +02:00
Jens Axboe	f09b00d3e7	writeback: add some debug inode list counters to bdi stats Add some debug entries to be able to inspect the internal state of the writeback details. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:26 +02:00
Jens Axboe	d0bceac747	writeback: get rid of pdflush completely It is now unused, so kill it off. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:25 +02:00
Jens Axboe	03ba3782e8	writeback: switch to per-bdi threads for flushing data This gets rid of pdflush for bdi writeout and kupdated style cleaning. pdflush writeout suffers from lack of locality and also requires more threads to handle the same workload, since it has to work in a non-blocking fashion against each queue. This also introduces lumpy behaviour and potential request starvation, since pdflush can be starved for queue access if others are accessing it. A sample ffsb workload that does random writes to files is about 8% faster here on a simple SATA drive during the benchmark phase. File layout also seems a LOT more smooth in vmstat: r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42 0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44 1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37 0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58 0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34 0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37 0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44 0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38 0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41 0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45 where vanilla tends to fluctuate a lot in the creation phase: r b swpd free buff cache si so bi bo in cs us sy id wa 1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36 1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51 0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40 0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37 1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41 0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49 0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36 1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43 0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39 1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45 1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34 0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54 A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A SSD based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1GB/sec, where pdflush only manages 750MB/sec and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as does random mmap'ed writes. A separate thread is added to sync the super blocks. In the long term, adding sync_supers_bdi() functionality could get rid of this thread again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:25 +02:00
Jens Axboe	66f3b8e2e1	writeback: move dirty inodes from super_block to backing_dev_info This is a first step at introducing per-bdi flusher threads. We should have no change in behaviour, although sb_has_dirty_inodes() is now ridiculously expensive, as there's no easy way to answer that question. Not a huge problem, since it'll be deleted in subsequent patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:25 +02:00
Joerg Roedel	f340ca0f06	hugetlbfs: export vma_kernel_pagsize to modules This function is required by KVM. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-09-10 08:33:01 +03:00
Linus Torvalds	6d848a488a	shmfs: use 'check_acl' instead of 'permission' shmfs wants purely standard POSIX ACL semantics, so we can use the new generic VFS layer POSIX ACL checking rather than cooking our own 'permission()' function. Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-08 11:08:46 -07:00
Luis R. Rodriguez	7eb0d5e5be	kmemleak: fix sparse warning for static declarations This fixes these sparse warnings: mm/kmemleak.c:1179:6: warning: symbol 'start_scan_thread' was not declared. Should it be static? mm/kmemleak.c:1194:6: warning: symbol 'stop_scan_thread' was not declared. Should it be static? Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-08 17:34:07 +01:00
Luis R. Rodriguez	0580a1819c	kmemleak: fix sparse warning over overshadowed flags A secondary irq_save is not required as a locking before it was already disabling irqs. This fixes this sparse warning: mm/kmemleak.c:512:31: warning: symbol 'flags' shadows an earlier one mm/kmemleak.c:448:23: originally declared here Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-08 17:34:06 +01:00
Luis R. Rodriguez	a1084c8779	kmemleak: move common painting code together When painting grey or black we do the same thing, bring this together into a helper and identify coloring grey or black explicitly with defines. This makes this a little easier to read. Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-08 17:22:20 +01:00
Luis R. Rodriguez	30b3710105	kmemleak: add clear command support In an ideal world your kmemleak output will be small, when its not (usually during initial bootup) you can use the clear command to ingore previously reported and unreferenced kmemleak objects. We do this by painting all currently reported unreferenced objects grey. We paint them grey instead of black to allow future scans on the same objects as such objects could still potentially reference newly allocated objects in the future. To test a critical section on demand with a clean /sys/kernel/debug/kmemleak you can do: echo clear > /sys/kernel/debug/kmemleak test your kernel or modules echo scan > /sys/kernel/debug/kmemleak Then as usual to get your report with: cat /sys/kernel/debug/kmemleak Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-08 16:36:08 +01:00
Luis R. Rodriguez	4a558dd6f9	kmemleak: use bool for true/false questions Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-08 16:34:50 +01:00
Catalin Marinas	179a8100e1	kmemleak: Do no create the clean-up thread during kmemleak_disable() The kmemleak_disable() function could be called from various contexts including IRQ. It creates a clean-up thread but the kthread_create() function has restrictions on which contexts it can be called from, mainly because of the kthread_create_lock. The patch changes the kmemleak clean-up thread to a workqueue. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Eric Paris <eparis@redhat.com>	2009-09-08 16:31:15 +01:00
Linus Torvalds	931f70350e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: don't assume existence of cpu0	2009-09-05 14:22:00 -07:00
Linus Torvalds	e305fc5ecd	Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU	2009-09-05 13:57:53 -07:00
Mel Gorman	dd5d241ea9	page-allocator: always change pageblock ownership when anti-fragmentation is disabled On low-memory systems, anti-fragmentation gets disabled as fragmentation cannot be avoided on a sufficiently large boundary to be worthwhile. Once disabled, there is a period of time when all the pageblocks are marked MOVABLE and the expectation is that they get marked UNMOVABLE at each call to __rmqueue_fallback(). However, when MAX_ORDER is large the pageblocks do not change ownership because the normal criteria are not met. This has the effect of prematurely breaking up too many large contiguous blocks. This is most serious on NOMMU systems which depend on high-order allocations to boot. This patch causes pageblocks to change ownership on every fallback when anti-fragmentation is disabled. This prevents the large blocks being prematurely broken up. This is a fix to commit `49255c619f` [page allocator: move check for disabled anti-fragmentation out of fastpath] and the problem affects 2.6.31-rc8. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-05 11:30:42 -07:00
David Howells	a190887b58	nommu: fix error handling in do_mmap_pgoff() Fix the error handling in do_mmap_pgoff(). If do_mmap_shared_file() or do_mmap_private() fail, we jump to the error_put_region label at which point we cann __put_nommu_region() on the region - but we haven't yet added the region to the tree, and so __put_nommu_region() may BUG because the region tree is empty or it may corrupt the region tree. To get around this, we can afford to add the region to the region tree before calling do_mmap_shared_file() or do_mmap_private() as we keep nommu_region_sem write-locked, so no-one can race with us by seeing a transient region. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-05 11:30:42 -07:00
Catalin Marinas	43ed5d6ee0	kmemleak: Scan all thread stacks This patch changes the for_each_process() loop with the do_each_thread()/while_each_thread() pair. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-09-04 16:05:55 +01:00
Pekka Enberg	8e019366ba	kmemleak: Don't scan uninitialized memory when kmemcheck is enabled Ingo Molnar reported the following kmemcheck warning when running both kmemleak and kmemcheck enabled: PM: Adding info for No Bus:vcsa7 WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f6f6e1a4) d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000 i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u ^ Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6 EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0 EIP is at scan_block+0x3f/0xe0 EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001 ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff4ff0 DR7: 00000400 [<c110313c>] scan_object+0x7c/0xf0 [<c1103389>] kmemleak_scan+0x1d9/0x400 [<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0 [<c10819d4>] kthread+0x74/0x80 [<c10257db>] kernel_thread_helper+0x7/0x3c [<ffffffff>] 0xffffffff kmemleak: 515 new suspected memory leaks (see /sys/kernel/debug/kmemleak) kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak) The problem here is that kmemleak will scan partially initialized objects that makes kmemcheck complain. Fix that up by skipping uninitialized memory regions when kmemcheck is enabled. Reported-by: Ingo Molnar <mingo@elte.hu> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-09-04 16:05:55 +01:00
Eric Dumazet	d76b1590e0	slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU kmem_cache_destroy() should call rcu_barrier() after kmem_cache_close() and before sysfs_slab_remove() or risk rcu_free_slab() being called after kmem_cache is deleted (kfreed). rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache. Cc: <stable@kernel.org> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-09-03 22:38:59 +03:00
Xiaotian Feng	5788d8ad6c	slub: release kobject if sysfs_create_group failed in sysfs_slab_add When CONFIG_SLUB_DEBUG is enabled, sysfs_slab_add should unlink and put the kobject if sysfs_create_group failed. Otherwise, sysfs_slab_add returns error then free kmem_cache s, thus memory of s->kobj is leaked. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-09-03 21:11:41 +03:00
Tejun Heo	04a13c7c63	percpu: don't assume existence of cpu0 percpu incorrectly assumed that cpu0 was always there which led to the following warning and eventual oops on sparc machines w/o cpu0. WARNING: at mm/percpu.c:651 pcpu_map+0xdc/0x100() Modules linked in: Call Trace: [000000000045eb70] warn_slowpath_common+0x50/0xa0 [000000000045ebdc] warn_slowpath_null+0x1c/0x40 [00000000004d493c] pcpu_map+0xdc/0x100 [00000000004d59a4] pcpu_alloc+0x3e4/0x4e0 [00000000004d5af8] __alloc_percpu+0x18/0x40 [00000000005b112c] __percpu_counter_init+0x4c/0xc0 ... Unable to handle kernel NULL pointer dereference ... I7: <sysfs_new_dirent+0x30/0x120> Disabling lock debugging due to kernel taint Caller[000000000053c1b0]: sysfs_new_dirent+0x30/0x120 Caller[000000000053c7a4]: create_dir+0x24/0xc0 Caller[000000000053c870]: sysfs_create_dir+0x30/0x80 Caller[00000000005990e8]: kobject_add_internal+0xc8/0x200 ... Kernel panic - not syncing: Attempted to kill the idle task! This patch fixes the problem by backporting parts from devel branch to make percpu core not depend on the existence of cpu0. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Meelis Roos <mroos@linux.ee> Cc: David Miller <davem@davemloft.net>	2009-09-01 21:23:18 +09:00
H. Peter Anvin	a269cca992	mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set CONFIG_PAGEFLAGS_EXTENDED disables a trick to conserve pageflags. This trick is indended to be enabled when the pressure on page flags is very high. The previous condition was: - depends on 64BIT \|\| SPARSEMEM_VMEMMAP \|\| !NUMA \|\| !SPARSEMEM ... however, the sparsemem code already has a way to crowd out the node number from the pageflags, which means that !NUMA actually doesn't contribute to hard pageflags exhaustion. This is required for the new PG_uncached flag to not cause pageflags exhaustion on x86_32 + PAE + SPARSEMEM + !NUMA. Signed-off-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <4A9828F4.4040905@zytor.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Suresh Siddha <suresh.siddha@intel.com>	2009-08-31 11:17:44 -07:00
Aaro Koskinen	acdfcd04d9	SLUB: fix ARCH_KMALLOC_MINALIGN cases 64 and 256 If the minalign is 64 bytes, then the 96 byte cache should not be created because it would conflict with the 128 byte cache. If the minalign is 256 bytes, patching the size_index table should not result in a buffer overrun. The calculation "(i - 1) / 8" used to access size_index[] is moved to a separate function as suggested by Christoph Lameter. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-30 14:56:48 +03:00
Sergey Senozhatsky	0494e08281	kmemleak: Printing of the objects hex dump Introducing printing of the objects hex dump to the seq file. The number of lines to be printed is limited to HEX_MAX_LINES to prevent seq file spamming. The actual number of printed bytes is less than or equal to (HEX_MAX_LINES * HEX_ROW_SIZE). (slight adjustments by Catalin Marinas) Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-08-27 14:29:18 +01:00
Catalin Marinas	008139d914	kmemleak: Do not report alloc_bootmem blocks as leaks This patch sets the min_count for alloc_bootmem objects to 0 so that they are never reported as leaks. This is because many of these blocks are only referred via the physical address which is not looked up by kmemleak. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-27 14:29:17 +01:00
Catalin Marinas	fd6789675e	kmemleak: Save the stack trace for early allocations Before slab is initialised, kmemleak save the allocations in an early log buffer. They are later recorded as normal memory allocations. This patch adds the stack trace saving to the early log buffer, otherwise the information shown for such objects only refers to the kmemleak_init() function. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-08-27 14:29:17 +01:00
Catalin Marinas	a6186d89c9	kmemleak: Mark the early log buffer as __initdata This buffer isn't needed after kmemleak was initialised so it can be freed together with the .init.data section. This patch also marks functions conditionally accessing the early log variables with __ref. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-08-27 14:29:16 +01:00
Catalin Marinas	189d84ed54	kmemleak: Dump object information on request By writing dump=<addr> to the kmemleak file, kmemleak will look up an object with that address and dump the information it has about it to syslog. This is useful in debugging memory leaks. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-08-27 14:29:15 +01:00
Catalin Marinas	af98603dad	kmemleak: Allow rescheduling during an object scanning If the object size is bigger than a predefined value (4K in this case), release the object lock during scanning and call cond_resched(). Re-acquire the lock after rescheduling and test whether the object is still valid. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-08-27 14:29:12 +01:00
Minchan Kim	03ef83af52	mm: fix for infinite churning of mlocked pages An mlocked page might lose the isolatation race. This causes the page to clear PG_mlocked while it remains in a VM_LOCKED vma. This means it can be put onto the [in]active list. We can rescue it by using try_to_unmap() in shrink_page_list(). But now, As Wu Fengguang pointed out, vmscan has a bug. If the page has PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is put into the active list. If the page is referenced repeatedly, it can remain on the [in]active list without being moving to the unevictable list. This patch fixes it. Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <<kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-26 20:06:52 -07:00
Amerigo Wang	5086c389cb	SLUB: Fix some coding style issues Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-19 21:44:13 +03:00
Linus Torvalds	77f312a96d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: use the right flag for get_vm_area() percpu, sparc64: fix sparse possible cpu map handling init: set nr_cpu_ids before setup_per_cpu_areas()	2009-08-18 19:41:05 -07:00
Bo Liu	7f9cfb3103	mm: build_zonelists(): move clear node_load[] to __build_all_zonelists() If node_load[] is cleared everytime build_zonelists() is called,node_load[] will have no help to find the next node that should appear in the given node's fallback list. Because of the bug, zonelist's node_order is not calculated as expected. This bug affects on big machine, which has asynmetric node distance. [synmetric NUMA's node distance] 0 1 2 0 10 12 12 1 12 10 12 2 12 12 10 [asynmetric NUMA's node distance] 0 1 2 0 10 12 20 1 12 10 14 2 20 14 10 This (my bug) is very old but no one has reported this for a long time. Maybe because the number of asynmetric NUMA is very small and they use cpuset for customizing node memory allocation fallback. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: Bo Liu <bo-liu@hotmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-18 16:31:13 -07:00
Graff Yang	28d7a6ae92	nommu: check fd read permission in validate_mmap_request() According to the POSIX (1003.1-2008), the file descriptor shall have been opened with read permission, regardless of the protection options specified to mmap(). The ltp test cases mmap06/07 need this. Signed-off-by: Graff Yang <graff.yang@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-18 16:31:13 -07:00
KOSAKI Motohiro	0753ba01e1	mm: revert "oom: move oom_adj value" The commit `2ff05b2b` (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom / ... if (vfork() == 0) { set_oom_adj(0); / Invoked child can be killed */ execve("foo-bar-cmd"); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit `2ff05b2b` and related commit. Reverted commit list --------------------- - commit `2ff05b2b4e` (oom: move oom_adj value from task_struct to mm_struct) - commit `4d8b9135c3` (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit `8123681022` (oom: only oom kill exiting tasks with attached memory) - commit `933b787b57` (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-18 16:31:13 -07:00
WANG Cong	cf5d11317e	SLUB: Drop write permission to /proc/slabinfo SLUB does not support writes to /proc/slabinfo so there should not be write permission to do that either. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-18 19:11:40 +03:00
Eric Paris	788084aba2	Security/SELinux: seperate lsm specific mmap_min_addr Currently SELinux enforcement of controls on the ability to map low memory is determined by the mmap_min_addr tunable. This patch causes SELinux to ignore the tunable and instead use a seperate Kconfig option specific to how much space the LSM should protect. The tunable will now only control the need for CAP_SYS_RAWIO and SELinux permissions will always protect the amount of low memory designated by CONFIG_LSM_MMAP_MIN_ADDR. This allows users who need to disable the mmap_min_addr controls (usual reason being they run WINE as a non-root user) to do so and still have SELinux controls preventing confined domains (like a web server) from being able to map some area of low memory. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-08-17 15:09:11 +10:00
Tejun Heo	e933a73f48	percpu: kill lpage first chunk allocator With x86 converted to embedding allocator, lpage doesn't have any user left. Kill it along with cpa handling code. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Beulich <JBeulich@novell.com>	2009-08-14 15:00:53 +09:00
Tejun Heo	c8826dd538	percpu: update embedding first chunk allocator to handle sparse units Now that percpu core can handle very sparse units, given that vmalloc space is large enough, embedding first chunk allocator can use any memory to build the first chunk. This patch teaches pcpu_embed_first_chunk() about distances between cpus and to use alloc/free callbacks to allocate node specific areas for each group and use them for the first chunk. This brings the benefits of embedding allocator to NUMA configurations - no extra TLB pressure with the flexibility of unified dynamic allocator and no need to restructure arch code to build memory layout suitable for percpu. With units put into atom_size aligned groups according to cpu distances, using large page for dynamic chunks is also easily possible with falling back to reuglar pages if large allocation fails. Embedding allocator users are converted to specify NULL cpu_distance_fn, so this patch doesn't cause any visible behavior difference. Following patches will convert them. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:52 +09:00
Tejun Heo	6563297cea	percpu: use group information to allocate vmap areas sparsely ai->groups[] contains which units need to be put consecutively and at what offset from the chunk base address. Compile this information into pcpu_group_offsets[] and pcpu_group_sizes[] in pcpu_setup_first_chunk() and use them to allocate sparse vm areas using pcpu_get_vm_areas(). This will be used to allow directly using sparse NUMA memories as percpu areas. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>	2009-08-14 15:00:52 +09:00
Tejun Heo	ca23e405e0	vmalloc: implement pcpu_get_vm_areas() To directly use spread NUMA memories for percpu units, percpu allocator will be updated to allow sparsely mapping units in a chunk. As the distances between units can be very large, this makes allocating single vmap area for each chunk undesirable. This patch implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which allocates and frees sparse congruent vmap areas. pcpu_get_vm_areas() take @offsets and @sizes array which define distances and sizes of vmap areas. It scans down from the top of vmalloc area looking for the top-most address which can accomodate all the areas. The top-down scan is to avoid interacting with regular vmallocs which can push up these congruent areas up little by little ending up wasting address space and page table. To speed up top-down scan, the highest possible address hint is maintained. Although the scan is linear from the hint, given the usual large holes between memory addresses between NUMA nodes, the scanning is highly likely to finish after finding the first hole for the last unit which is scanned first. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>	2009-08-14 15:00:52 +09:00
Tejun Heo	cf88c79006	vmalloc: separate out insert_vmalloc_vm() Separate out insert_vmalloc_vm() from __get_vm_area_node(). insert_vmalloc_vm() initializes vm_struct from vmap_area and inserts it into vmlist. insert_vmalloc_vm() only initializes fields which can be determined from @vm, @flags and @caller The rest should be initialized by the caller. For __get_vm_area_node(), all other fields just need to be cleared and this is done by using kzalloc instead of kmalloc. This will be used to implement pcpu_get_vm_areas(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>	2009-08-14 15:00:52 +09:00
Tejun Heo	bba174f5e0	percpu: add chunk->base_addr The only thing percpu allocator wants to know about a vmalloc area is the base address. Instead of requiring chunk->vm, add chunk->base_addr which contains the necessary value. This simplifies the code a bit and makes the dummy first_vm unnecessary. This change will ease allowing a chunk to be mapped by multiple vms. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:51 +09:00
Tejun Heo	fb435d5233	percpu: add pcpu_unit_offsets[] Currently units are mapped sequentially into address space. This patch adds pcpu_unit_offsets[] which allows units to be mapped to arbitrary offsets from the chunk base address. This is necessary to allow sparse embedding which might would need to allocate address ranges and memory areas which aren't aligned to unit size but allocation atom size (page or large page size). This also simplifies things a bit by removing the need to calculate offset from unit number. With this change, there's no need for the arch code to know pcpu_unit_size. Update pcpu_setup_first_chunk() and first chunk allocators to return regular 0 or -errno return code instead of unit size or -errno. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David S. Miller <davem@davemloft.net>	2009-08-14 15:00:51 +09:00
Tejun Heo	fd1e8a1fe2	percpu: introduce pcpu_alloc_info and pcpu_group_info Till now, non-linear cpu->unit map was expressed using an integer array which maps each cpu to a unit and used only by lpage allocator. Although how many units have been placed in a single contiguos area (group) is known while building unit_map, the information is lost when the result is recorded into the unit_map array. For lpage allocator, as all allocations are done by lpages and whether two adjacent lpages are in the same group or not is irrelevant, this didn't cause any problem. Non-linear cpu->unit mapping will be used for sparse embedding and this grouping information is necessary for that. This patch introduces pcpu_alloc_info which contains all the information necessary for initializing percpu allocator. pcpu_alloc_info contains array of pcpu_group_info which describes how units are grouped and mapped to cpus. pcpu_group_info also has base_offset field to specify its offset from the chunk's base address. pcpu_build_alloc_info() initializes this field as if all groups are allocated back-to-back as is currently done but this will be used to sparsely place groups. pcpu_alloc_info is a rather complex data structure which contains a flexible array which in turn points to nested cpu_map arrays. * pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to help dealing with pcpu_alloc_info. * pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info, generalized and renamed to pcpu_build_alloc_info(). @cpu_distance_fn may be NULL indicating that all cpus are of LOCAL_DISTANCE. * pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info, generalized and renamed to pcpu_dump_alloc_info(). It now also prints which group each alloc unit belongs to. * pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the separate parameters. All first chunk allocators are updated to use pcpu_build_alloc_info() to build alloc_info and call pcpu_setup_first_chunk() with it. This has the side effect of packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4. * x86 setup_pcpu_lpage() is updated to deal with alloc_info. * sparc64 setup_per_cpu_areas() is updated to build alloc_info. Although the changes made by this patch are pretty pervasive, it doesn't cause any behavior difference other than packing of sparse cpus. It mostly changes how information is passed among initialization functions and makes room for more flexibility. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>	2009-08-14 15:00:51 +09:00
Tejun Heo	033e48fb82	percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward Unit map handling will be generalized and extended and used for embedding sparse first chunk and other purposes. Relocate two unit_map related functions upward in preparation. This patch just moves the code without any actual change. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:51 +09:00
Tejun Heo	3cbc856527	percpu: add @align to pcpu_fc_alloc_fn_t pcpu_fc_alloc_fn_t is about to see more interesting usage, add @align parameter. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:50 +09:00
Tejun Heo	1d9d325721	percpu: make @dyn_size mandatory for pcpu_setup_first_chunk() Now that all actual first chunk allocation and copying happen in the first chunk allocators and helpers, there's no reason for pcpu_setup_first_chunk() to try to determine @dyn_size automatically. The only left user is page first chunk allocator. Make it determine dyn_size like other allocators and make @dyn_size mandatory for pcpu_setup_first_chunk(). Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:50 +09:00
Tejun Heo	9a7737691e	percpu: drop @static_size from first chunk allocators First chunk allocators assume percpu areas have been linked using one of PERCPU_*() macros and depend on __per_cpu_load symbol defined by those macros, so there isn't much point in passing in static area size explicitly when it can be easily calculated from __per_cpu_start and __per_cpu_end. Drop @static_size from all percpu first chunk allocators and helpers. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:50 +09:00
Tejun Heo	f58dc01ba2	percpu: generalize first chunk allocator selection Now that all first chunk allocators are in mm/percpu.c, it makes sense to make generalize percpu_alloc kernel parameter. Define PCPU_FC_* and set pcpu_chosen_fc using early_param() in mm/percpu.c. Arch code can use the set value to determine which first chunk allocator to use. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:50 +09:00
Tejun Heo	08fc458061	percpu: build first chunk allocators selectively There's no need to build unused first chunk allocators in. Define CONFIG_NEED_PER_CPU_*_FIRST_CHUNK and let archs enable them selectively. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 15:00:49 +09:00
Tejun Heo	00ae4064b1	percpu: rename 4k first chunk allocator to page Page size isn't always 4k depending on arch and configuration. Rename 4k first chunk allocator to page. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Howells <dhowells@redhat.com>	2009-08-14 15:00:49 +09:00
Tejun Heo	004018e2c0	percpu: improve boot messages Improve percpu boot messages such that they're uniform and contain more information. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>	2009-08-14 15:00:49 +09:00
Tejun Heo	971f3918a5	percpu: fix pcpu_reclaim() locking pcpu_reclaim() calls pcpu_depopulate_chunk() which makes use of pages array and bitmap returned by pcpu_get_pages_and_bitmap() and thus should be called under pcpu_alloc_mutex. pcpu_reclaim() released the mutex before calling depopulate leading to double free and other strange problems caused by the unexpected concurrent usages of pages array and bitmap. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>	2009-08-14 15:00:49 +09:00
Tejun Heo	384be2b18a	Merge branch 'percpu-for-linus' into percpu-for-next Conflicts: arch/sparc/kernel/smp_64.c arch/x86/kernel/cpu/perf_counter.c arch/x86/kernel/setup_percpu.c drivers/cpufreq/cpufreq_ondemand.c mm/percpu.c Conflicts in core and arch percpu codes are mostly from commit ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all the first chunk allocators into mm/percpu.c, the changes are moved from arch code to mm/percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 14:45:31 +09:00
Amerigo Wang	142d44b0dd	percpu: use the right flag for get_vm_area() get_vm_area() only accepts VM_* flags, not GFP_*. And according to the doc of get_vm_area(), here should be VM_ALLOC. Signed-off-by: WANG Cong <amwang@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-08-14 13:21:10 +09:00
Tejun Heo	74d46d6b2d	percpu, sparc64: fix sparse possible cpu map handling percpu code has been assuming num_possible_cpus() == nr_cpu_ids which is incorrect if cpu_possible_map contains holes. This causes percpu code to access beyond allocated memories and vmalloc areas. On a sparc64 machine with cpus 0 and 2 (u60), this triggers the following warning or fails boot. WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240() Modules linked in: Call Trace: [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240 [00000000004b1840] map_vm_area+0x20/0x60 [00000000004b1950] __vmalloc_area_node+0xd0/0x160 [0000000000593434] deflate_init+0x14/0xe0 [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0 [00000000005844f0] crypto_alloc_base+0x50/0xa0 [000000000058b898] alg_test_comp+0x18/0x80 [000000000058dad4] alg_test+0x54/0x180 [000000000058af00] cryptomgr_test+0x40/0x60 [0000000000473098] kthread+0x58/0x80 [000000000042b590] kernel_thread+0x30/0x60 [0000000000472fd0] kthreadd+0xf0/0x160 ---[ end trace 429b268a213317ba ]--- This patch fixes generic percpu functions and sparc64 setup_per_cpu_areas() so that they handle sparse cpu_possible_map properly. Please note that on x86, cpu_possible_map() doesn't contain holes and thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu>	2009-08-14 13:20:53 +09:00
Figo.zhang	5e2f89b5d5	mempool.c: clean up type-casting clean up type-casting twice. "size_t" is typedef as "unsigned long" in 64-bit system, and "unsigned int" in 32-bit system, and the intermediate cast to 'long' is pointless. Signed-off-by: Figo.zhang <figo1802@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-10 08:31:16 -07:00
KAMEZAWA Hiroyuki	4bfc44958e	mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware At first, init_task's mems_allowed is initialized as this. init_task->mems_allowed == node_state[N_POSSIBLE] And cpuset's top_cpuset mask is initialized as this top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY] Before 2.6.29: policy's mems_allowed is initialized as this. 1. update tasks->mems_allowed by its cpuset->mems_allowed. 2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask) Updating task's mems_allowed in reference to top_cpuset's one. cpuset's mems_allowed is aware of N_HIGH_MEMORY, always. In 2.6.30: After commit `58568d2a82` ("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed is initialized as this. 1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask) Here, if task is in top_cpuset, task->mems_allowed is not updated from init's one. Assume user excutes command as #numactrl --interleave=all ,.... policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK) Then, policy's mems_allowd can includes a possible node, which has no pgdat. MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this directly. NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL Then, what's we need is making policy->mems_allowed be aware of N_HIGH_MEMORY. This patch does that. But to do so, extra nodemask will be on statck. Because I know cpumask has a new interface of CPUMASK_ALLOC(), I added it to node. This patch stands on old behavior. But I feel this fix itself is just a Band-Aid. But to do fundametal fix, we have to take care of memory hotplug and it takes time. (task->mems_allowd should be N_HIGH_MEMORY, I think.) mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask should be includes only online nodes. In old behavior, this is guaranteed by frequent reference to cpuset's code. Now, most of them are removed and mempolicy has to check it by itself. To do check, a few nodemask_t will be used for calculating nodemask. But, size of nodemask_t can be big and it's not good to allocate them on stack. Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area. NODEMASK_ALLOC/FREE shoudl be there. [akpm@linux-foundation.org: cleanups & tweaks] Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-07 10:39:55 -07:00
Wu Fengguang	bbff2e433e	slab: remove duplicate kmem_cache_init_late() declarations kmem_cache_init_late() has been declared in slab.h CC: Nick Piggin <npiggin@suse.de> CC: Matt Mackall <mpm@selenic.com> CC: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-06 11:36:25 +03:00
Zhang, Yanmin	dcb0ce1bdf	slub: change kmem_cache->align to record the real alignment kmem_cache->align records the original align parameter value specified by users. Function calculate_alignment might change it based on cache line size. So change kmem_cache->align correspondingly. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-08-01 18:26:40 +03:00
Linus Torvalds	91a5698d1f	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Replace bdget call with simple atomic_inc of i_count PM / ACPI: HP G7000 Notebook needs a SCI_EN resume quirk	2009-07-29 19:15:18 -07:00
Mel Gorman	1fc28b70fe	page-allocator: allow too high-order warning messages to be suppressed with __GFP_NOWARN The page allocator warns once when an order >= MAX_ORDER is specified. This is to catch callers of the allocator that are always falling back to their worst-case when it was not expected. However, there are cases where the caller is behaving correctly but cannot suppress the warning. This patch allows the warning to be suppressed by the callers by specifying __GFP_NOWARN. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 19:10:35 -07:00
KAMEZAWA Hiroyuki	887032670d	cgroup avoid permanent sleep at rmdir After commit `ec64f51545` ("cgroup: fix frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg) doesn't return -EBUSY by temporary ref counts. That commit expects all refs after pre_destroy() is temporary but...it wasn't. Then, rmdir can wait permanently. This patch tries to fix that and change followings. - set CGRP_WAIT_ON_RMDIR flag before pre_destroy(). - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case. if there are sleeping ones, wakes them up. - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set. Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Paul Menage <menage@google.com> Acked-by: Balbir Sigh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 19:10:35 -07:00
Eric Sandeen	e4c6f8bed0	hugetlbfs: fix i_blocks accounting As reported in Red Hat bz #509671, i_blocks for files on hugetlbfs get accounting wrong when doing something like: $ > foo $ date > foo date: write error: Invalid argument $ /usr/bin/stat foo File: `foo' Size: 0 Blocks: 18446744073709547520 IO Block: 2097152 regular ... This is because hugetlb_unreserve_pages() is unconditionally removing blocks_per_huge_page(h) on each call rather than using the freed amount. If there were 0 blocks, it goes negative, resulting in the above. This is a regression from commit `a551643895` ("hugetlb: modular state for hugetlb page size") which did: - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed; + inode->i_blocks -= blocks_per_huge_page(h); so just put back the freed multiplier, and it's all happy again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Andi Kleen <andi@firstfloor.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 19:10:35 -07:00
David Rientjes	6583bb64fc	mm: avoid endless looping for oom killed tasks If a task is oom killed and still cannot find memory when trying with no watermarks, it's better to fail the allocation attempt than to loop endlessly. Direct reclaim has already failed and the oom killer will be a no-op since current has yet to die, so there is no other alternative for allocations that are not __GFP_NOFAIL. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 19:10:34 -07:00
Mel Gorman	e084b2d95e	page-allocator: preserve PFN ordering when __GFP_COLD is set Fix a post-2.6.24 performace regression caused by `3dfa5721f1` ("page-allocator: preserve PFN ordering when __GFP_COLD is set"). Narayanan reports "The regression is around 15%. There is no disk controller as our setup is based on Samsung OneNAND used as a memory mapped device on a OMAP2430 based board." The page allocator tries to preserve contiguous PFN ordering when returning pages such that repeated callers to the allocator have a strong chance of getting physically contiguous pages, particularly when external fragmentation is low. However, of the bulk of the allocations have __GFP_COLD set as they are due to aio_read() for example, then the PFNs are in reverse PFN order. This can cause performance degration when used with IO controllers that could have merged the requests. This patch attempts to preserve the contiguous ordering of PFNs for users of __GFP_COLD. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reported-by: Narayananu Gopalakrishnan <narayanan.g@samsung.com> Tested-by: Narayanan Gopalakrishnan <narayanan.g@samsung.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 19:10:34 -07:00
Catalin Marinas	f5886c7f96	kmemleak: Protect the seq start/next/stop sequence by rcu_read_lock() Objects passed to kmemleak_seq_next() have an incremented reference count (hence not freed) but they may point via object_list.next to other freed objects. To avoid this, the whole start/next/stop sequence must be protected by rcu_read_lock(). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-29 12:34:58 -07:00
Alan Jenkins	dddac6a7b4	PM / Hibernate: Replace bdget call with simple atomic_inc of i_count Create bdgrab(). This function copies an existing reference to a block_device. It is safe to call from any context. Hibernation code wishes to copy a reference to the active swap device. Right now it calls bdget() under a spinlock, but this is wrong because bdget() can sleep. It doesn't need a full bdget() because we already hold a reference to active swap devices (and the spinlock protects against swapoff). Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827 Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-07-29 21:07:55 +02:00
David Rientjes	3de472138a	slub: use size and objsize orders to disable debug flags This patch moves the masking of debugging flags which increase a cache's min order due to metadata when `slub_debug=O' is used from kmem_cache_flags() to kmem_cache_open(). Instead of defining the maximum metadata size increase in a preprocessor macro, this approach uses the cache's ->size and ->objsize members to determine if the min order increased due to debugging options. If so, the flags specified in the more appropriately named DEBUG_METADATA_FLAGS are masked off. This approach was suggested by Christoph Lameter <cl@linux-foundation.org>. Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-07-28 10:53:09 +03:00
Benjamin Herrenschmidt	9e1b32caa5	mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() Upcoming paches to support the new 64-bit "BookE" powerpc architecture will need to have the virtual address corresponding to PTE page when freeing it, due to the way the HW table walker works. Basically, the TLB can be loaded with "large" pages that cover the whole virtual space (well, sort-of, half of it actually) represented by a PTE page, and which contain an "indirect" bit indicating that this TLB entry RPN points to an array of PTEs from which the TLB can then create direct entries. Thus, in order to invalidate those when PTE pages are deleted, we need the virtual address to pass to tlbilx or tlbivax instructions. The old trick of sticking it somewhere in the PTE page struct page sucks too much, the address is almost readily available in all call sites and almost everybody implemets these as macros, so we may as well add the argument everywhere. I added it to the pmd and pud variants for consistency. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV] Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-27 12:10:38 -07:00
Linus Torvalds	7638d5322b	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Remove alloc_bootmem annotations introduced in the past kmemleak: Add callbacks to the bootmem allocator kmemleak: Allow partial freeing of memory blocks kmemleak: Trace the kmalloc_large* functions in slub kmemleak: Scan objects allocated during a scanning episode kmemleak: Do not acquire scan_mutex in kmemleak_open() kmemleak: Remove the reported leaks number limitation kmemleak: Add more cond_resched() calls in the scanning thread kmemleak: Renice the scanning thread to +10	2009-07-12 12:24:35 -07:00
Jens Axboe	8aa7e847d8	Fix congestion_wait() sync/async vs read/write confusion Commit `1faa16d228` accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:53 +02:00
David Rientjes	fa5ec8a1f6	slub: add option to disable higher order debugging slabs When debugging is enabled, slub requires that additional metadata be stored in slabs for certain options: SLAB_RED_ZONE, SLAB_POISON, and SLAB_STORE_USER. Consequently, it may require that the minimum possible slab order needed to allocate a single object be greater when using these options. The most notable example is for objects that are PAGE_SIZE bytes in size. Higher minimum slab orders may cause page allocation failures when oom or under heavy fragmentation. This patch adds a new slub_debug option, which disables debugging by default for caches that would have resulted in higher minimum orders: slub_debug=O When this option is used on systems with 4K pages, kmalloc-4096, for example, will not have debugging enabled by default even if CONFIG_SLUB_DEBUG_ON is defined because it would have resulted in a order-1 minimum slab order. Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-07-10 09:52:55 +03:00
Catalin Marinas	264ef8a904	kmemleak: Remove alloc_bootmem annotations introduced in the past kmemleak_alloc() calls were added in some places where alloc_bootmem was called. Since now kmemleak tracks bootmem allocations, these explicit calls should be run. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-07-09 17:07:02 +01:00
Catalin Marinas	ec3a354bd4	kmemleak: Add callbacks to the bootmem allocator This patch adds kmemleak_alloc/free callbacks to the bootmem allocator. This would allow scanning of such blocks and help avoiding a whole class of false positives and more kmemleak annotations. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>	2009-07-08 14:25:14 +01:00
Catalin Marinas	53238a60dd	kmemleak: Allow partial freeing of memory blocks Functions like free_bootmem() are allowed to free only part of a memory block. This patch adds support for this via the kmemleak_free_part() callback which removes the original object and creates one or two additional objects as a result of the memory block split. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-07-08 14:25:14 +01:00
Catalin Marinas	e4f7c0b44a	kmemleak: Trace the kmalloc_large* functions in slub The kmalloc_large() and kmalloc_large_node() functions were missed when adding the kmemleak hooks to the slub allocator. However, they should be traced to avoid false positives. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-07-08 14:25:14 +01:00
Catalin Marinas	2587362eaf	kmemleak: Scan objects allocated during a scanning episode Many of the false positives in kmemleak happen on busy systems where objects are allocated during a kmemleak scanning episode. These objects aren't scanned by default until the next memory scan. When such object is added, for example, at the head of a list, it is possible that all the other objects in the list become unreferenced until the next scan. This patch adds checking for newly allocated objects at the end of the scan and repeats the scanning on these objects. If Linux allocates new objects at a higher rate than their scanning, it stops after a predefined number of passes. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-07-08 14:25:13 +01:00
Catalin Marinas	b87324d082	kmemleak: Do not acquire scan_mutex in kmemleak_open() Initially, the scan_mutex was acquired in kmemleak_open() and released in kmemleak_release() (corresponding to /sys/kernel/debug/kmemleak operations). This was causing some lockdep reports when the file was closed from a different task than the one opening it. This patch moves the scan_mutex acquiring in kmemleak_write() or kmemleak_seq_start() with releasing in kmemleak_seq_stop(). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-07-08 14:25:13 +01:00
Catalin Marinas	288c857d66	kmemleak: Remove the reported leaks number limitation Since the leaks are no longer printed to the syslog, there is no point in keeping this limitation. All the suspected leaks are shown on /sys/kernel/debug/kmemleak file. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-07-08 14:25:12 +01:00
Catalin Marinas	4b8a96744c	kmemleak: Add more cond_resched() calls in the scanning thread Following recent fix to no longer reschedule in the scan_block() function, the system may become unresponsive with !PREEMPT. This patch re-adds the cond_resched() call to scan_block() but conditioned by the allow_resched parameter. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-07 10:32:56 +01:00
Catalin Marinas	bf2a76b317	kmemleak: Renice the scanning thread to +10 This is a long-running thread but not high-priority. So it makes sense to renice it to +10. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-07-07 10:32:55 +01:00
Linus Torvalds	9861df15f4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLAB: Fix lockdep annotations fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b	2009-07-06 14:05:09 -07:00
Kevin Cernekee	5bfd756097	Fix virt_to_phys() warnings These warnings were observed on MIPS32 using 2.6.31-rc1 and gcc-4.2.0: mm/page_alloc.c: In function 'alloc_pages_exact': mm/page_alloc.c:1986: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast drivers/usb/mon/mon_bin.c: In function 'mon_alloc_buff': drivers/usb/mon/mon_bin.c:1264: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [akpm@linux-foundation.org: fix kernel/perf_counter.c too] Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-06 13:57:03 -07:00
Josef Bacik	c8236db9cd	mm: mark page accessed before we write_end() In testing a backport of the write_begin/write_end AOPs, a 10% re-read regression was noticed when running iozone. This regression was introduced because the old AOPs would always do a mark_page_accessed(page) after the commit_write, but when the new AOPs where introduced, the only place this was kept was in pagecache_write_end(). This patch does the same thing in the generic case as what is done in pagecache_write_end(), which is just to mark the page accessed before we do write_end(). Signed-off-by: Josef Bacik <jbacik@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-06 13:57:03 -07:00
Pekka Enberg	67fc25ef34	Merge branch 'slab/urgent' into for-linus	2009-07-06 10:51:54 +03:00
Tejun Heo	a530b79586	percpu: teach large page allocator about NUMA Large page first chunk allocator is primarily used for NUMA machines; however, its NUMA handling is extremely simplistic. Regardless of their proximity, each cpu is put into separate large page just to return most of the allocated space back wasting large amount of vmalloc space and increasing cache footprint. This patch teachs NUMA details to large page allocator. Given processor proximity information, pcpu_lpage_build_unit_map() will find fitting cpu -> unit mapping in which cpus in LOCAL_DISTANCE share the same large page and not too much virtual address space is wasted. This greatly reduces the unit and thus chunk size and wastes much less address space for the first chunk. For example, on 4/4 NUMA machine, the original code occupied 16MB of virtual space for the first chunk while the new code only uses 4MB - one 2MB page for each node. [ Impact: much better space efficiency on NUMA machines ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Beulich <JBeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Miller <davem@davemloft.net>	2009-07-04 08:11:00 +09:00
Tejun Heo	2f39e637ea	percpu: allow non-linear / sparse cpu -> unit mapping Currently cpu and unit are always identity mapped. To allow more efficient large page support on NUMA and lazy allocation for possible but offline cpus, cpu -> unit mapping needs to be non-linear and/or sparse. This can be easily implemented by adding a cpu -> unit mapping array and using it whenever looking up the matching unit for a cpu. The only unusal conversion is in pcpu_chunk_addr_search(). The passed in address is unit0 based and unit0 might not be in use so it needs to be converted to address of an in-use unit. This is easily done by adding the unit offset for the current processor. [ Impact: allows non-linear/sparse cpu -> unit mapping, no visible change yet ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>	2009-07-04 08:11:00 +09:00
Tejun Heo	ce3141a277	percpu: drop pcpu_chunk->page[] percpu core doesn't need to tack all the allocated pages. It needs to know whether certain pages are populated and a way to reverse map address to page when freeing. This patch drops pcpu_chunk->page[] and use populated bitmap and vmalloc_to_page() lookup instead. Using vmalloc_to_page() exclusively is also possible but complicates first chunk handling, inflates cache footprint and prevents non-standard memory allocation for percpu memory. pcpu_chunk->page[] was used to track each page's allocation and allowed asymmetric population which happens during failure path; however, with single bitmap for all units, this is no longer possible. Bite the bullet and rewrite (de)populate functions so that things are done in clearly separated steps such that asymmetric population doesn't happen. This makes the (de)population process much more modular and will also ease implementing non-standard memory usage in the future (e.g. large pages). This makes @get_page_fn parameter to pcpu_setup_first_chunk() unnecessary. The parameter is dropped and all first chunk helpers are updated accordingly. Please note that despite the volume most changes to first chunk helpers are symbol renames for variables which don't need to be referenced outside of the helper anymore. This change reduces memory usage and cache footprint of pcpu_chunk. Now only #unit_pages bits are necessary per chunk. [ Impact: reduced memory usage and cache footprint for bookkeeping ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>	2009-07-04 08:11:00 +09:00
Tejun Heo	c8a51be4ca	percpu: reorder a few functions in mm/percpu.c (de)populate functions are about to be reimplemented to drop pcpu_chunk->page array. Move a few functions so that the rewrite patch doesn't have code movement making it more difficult to read. [ Impact: code movement ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:59 +09:00
Tejun Heo	38a6be5254	percpu: simplify pcpu_setup_first_chunk() Now that all first chunk allocator helpers allocate and map the first chunk themselves, there's no need to have optional default alloc/map in pcpu_setup_first_chunk(). Drop @populate_pte_fn and only leave @dyn_size optional and make all other params mandatory. This makes it much easier to follow what pcpu_setup_first_chunk() is doing and what actual differences tweaking each parameter results in. [ Impact: drop unused code path ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:59 +09:00
Tejun Heo	8c4bfc6e88	x86,percpu: generalize lpage first chunk allocator Generalize and move x86 setup_pcpu_lpage() into pcpu_lpage_first_chunk(). setup_pcpu_lpage() now is a simple wrapper around the generalized version. Other than taking size parameters and using arch supplied callbacks to allocate/free/map memory, pcpu_lpage_first_chunk() is identical to the original implementation. This simplifies arch code and will help converting more archs to dynamic percpu allocator. While at it, factor out pcpu_calc_fc_sizes() which is common to pcpu_embed_first_chunk() and pcpu_lpage_first_chunk(). [ Impact: code reorganization and generalization ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:59 +09:00
Tejun Heo	8f05a6a65d	percpu: make 4k first chunk allocator map memory At first, percpu first chunk was always setup page-by-page by the generic code. To add other allocators, different parts of the generic initialization was made optional. Now we have three allocators - embed, remap and 4k. embed and remap fully handle allocation and mapping of the first chunk while 4k still depends on generic code for those. This makes the generic alloc/map paths specifci to 4k and makes the code unnecessary complicated with optional generic behaviors. This patch makes the 4k allocator to allocate and map memory directly instead of depending on the generic code. The only outside visible change is that now dynamic area in the first chunk is allocated up-front instead of on-demand. This doesn't make any meaningful difference as the area is minimal (usually less than a page, just enough to fill the alignment) on 4k allocator. Plus, dynamic area in the first chunk usually gets fully used anyway. This will allow simplification of pcpu_setpu_first_chunk() and removal of chunk->page array. [ Impact: no outside visible change other than up-front allocation of dyn area ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:59 +09:00
Tejun Heo	d4b95f8039	x86,percpu: generalize 4k first chunk allocator Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk(). setup_pcpu_4k() now is a simple wrapper around the generalized version. Other than taking size parameters and using arch supplied callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical to the original implementation. This simplifies arch code and will help converting more archs to dynamic percpu allocator. While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for consistency. [ Impact: code reorganization and generalization ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:59 +09:00
Tejun Heo	788e5abc54	percpu: drop @unit_size from embed first chunk allocator The only extra feature @unit_size provides is making dead space at the end of the first chunk which doesn't have any valid usecase. Drop the parameter. This will increase consistency with generalized 4k allocator. James Bottomley spotted missing conversion for the default setup_per_cpu_areas() which caused build breakage on all arcsh which use it. [ Impact: drop unused code path ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:58 +09:00
Tejun Heo	79ba6ac825	x86: make pcpu_chunk_addr_search() matching stricter The @addr passed into pcpu_chunk_addr_search() is unit0 based address and thus should be matched inside unit0 area. Currently, when it uses chunk size when determining whether the address falls in the first chunk. Addresses in unitN where N>0 shouldn't be passed in anyway, so this doesn't cause any malfunction but fix it for consistency. [ Impact: mostly cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-07-04 08:10:58 +09:00
Tejun Heo	c43768cbb7	Merge branch 'master' into for-next Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix changes. As alpha in percpu tree uses 'weak' attribute instead of inline assembly, there's no need for __used attribute. Conflicts: arch/alpha/include/asm/percpu.h arch/mn10300/kernel/vmlinux.lds.S include/linux/percpu-defs.h	2009-07-04 07:13:18 +09:00
Linus Torvalds	5a475ce469	Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: sh: LCDC dcache flush for deferred io sh: Fix compiler error and include the definition of IS_ERR_VALUE sh: re-add LCDC fbdev support to the Migo-R defconfig sh: fix se7724 ceu names sh: ms7724se: Enable sh_eth in defconfig. arch/sh/boards/mach-se/7206/io.c: Remove unnecessary semicolons sh: ms7724se: Add sh_eth support nommu: provide follow_pfn(). sh: Kill off unused DEBUG_BOOTMEM symbol. perf_counter tools: add cpu_relax()/rmb() definitions for sh. sh64: Hook up page fault events for software perf counters. sh: Hook up page fault events for software perf counters. sh: make set_perf_counter_pending() static inline. clocksource: sh_tmu: Make undefined TCOR behaviour less undefined.	2009-07-01 11:46:30 -07:00
Ingo Molnar	57d81f6f39	kmemleak: Fix scheduling-while-atomic bug One of the kmemleak changes caused the following scheduling-while-holding-the-tasklist-lock regression on x86: BUG: sleeping function called from invalid context at mm/kmemleak.c:795 in_atomic(): 1, irqs_disabled(): 0, pid: 1737, name: kmemleak 2 locks held by kmemleak/1737: #0: (scan_mutex){......}, at: [<c10c4376>] kmemleak_scan_thread+0x45/0x86 #1: (tasklist_lock){......}, at: [<c10c3bb4>] kmemleak_scan+0x1a9/0x39c Pid: 1737, comm: kmemleak Not tainted 2.6.31-rc1-tip #59266 Call Trace: [<c105ac0f>] ? __debug_show_held_locks+0x1e/0x20 [<c102e490>] __might_sleep+0x10a/0x111 [<c10c38d5>] scan_yield+0x17/0x3b [<c10c3970>] scan_block+0x39/0xd4 [<c10c3bc6>] kmemleak_scan+0x1bb/0x39c [<c10c4331>] ? kmemleak_scan_thread+0x0/0x86 [<c10c437b>] kmemleak_scan_thread+0x4a/0x86 [<c104d73e>] kthread+0x6e/0x73 [<c104d6d0>] ? kthread+0x0/0x73 [<c100959f>] kernel_thread_helper+0x7/0x10 kmemleak: 834 new suspected memory leaks (see /sys/kernel/debug/kmemleak) The bit causing it is highly dubious: static void scan_yield(void) { might_sleep(); if (time_is_before_eq_jiffies(next_scan_yield)) { schedule(); next_scan_yield = jiffies + jiffies_scan_yield; } } It called deep inside the codepath and in a conditional way, and that is what crapped up when one of the new scan_block() uses grew a tasklist_lock dependency. This minimal patch removes that yielding stuff and adds the proper cond_resched(). The background scanning thread could probably also be reniced to +10. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-01 10:26:23 -07:00
Linus Torvalds	e83c2b0ff3	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Inform kmemleak about pid_hash kmemleak: Do not warn if an unknown object is freed kmemleak: Do not report new leaked objects if the scanning was stopped kmemleak: Slightly change the policy on newly allocated objects kmemleak: Do not trigger a scan when reading the debug/kmemleak file kmemleak: Simplify the reports logged by the scanning thread kmemleak: Enable task stacks scanning by default kmemleak: Allow the early log buffer to be configurable.	2009-06-30 19:04:53 -07:00
Yinghai Lu	66918dcdf9	x86: only clear node_states for 64bit Nathan reported that \| commit `73d60b7f74` \| Author: Yinghai Lu <yinghai@kernel.org> \| Date: Tue Jun 16 15:33:00 2009 -0700 \| \| page-allocator: clear N_HIGH_MEMORY map before we set it again \| \| SRAT tables may contains nodes of very small size. The arch code may \| decide to not activate such a node. However, currently the early boot \| code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be \| active although these nodes have no present pages. \| \| For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too unintentionally and incorrectly clears the cpuset.mems cgroup attribute on an i386 kvm guest, meaning that cpuset.mems can not be used. Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only. and need to do save/restore for that in find_zone_movable_pfn Reported-by: Nathan Lynch <ntl@pobox.com> Tested-by: Nathan Lynch <ntl@pobox.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-30 18:56:01 -07:00
Richard Kennedy	d7831a0bdf	mm: prevent balance_dirty_pages() from doing too much work balance_dirty_pages can overreact and move all of the dirty pages to writeback unnecessarily. balance_dirty_pages makes its decision to throttle based on the number of dirty plus writeback pages that are over the calculated limit,so it will continue to move pages even when there are plenty of pages in writeback and less than the threshold still dirty. This allows it to overshoot its limits and move all the dirty pages to writeback while waiting for the drives to catch up and empty the writeback list. A simple fio test easily demonstrates this problem. fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10 This is the simplest fix I could find, but I'm not entirely sure that it alone will be enough for all cases. But it certainly is an improvement on my desktop machine writing to 2 disks. Do we need something more for machines with large arrays where bdi_threshold * number_of_drives is greater than the dirty_ratio ? Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-30 18:56:01 -07:00
Thomas Gleixner	c49568235d	dmapools: protect page_list walk in show_pools() show_pools() walks the page_list of a pool w/o protection against the list modifications in alloc/free. Take pool->lock to avoid stomping into nirvana. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-30 18:56:00 -07:00
Catalin Marinas	b6e687221e	kmemleak: Do not warn if an unknown object is freed vmap'ed memory blocks are not tracked by kmemleak (yet) but they may be released with vfree() which is tracked. The corresponding kmemleak warning is only enabled in debug mode. Future patch will add support for ioremap and vmap. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-29 17:14:14 +01:00
Catalin Marinas	17bb9e0d90	kmemleak: Do not report new leaked objects if the scanning was stopped If the scanning was stopped with a signal, it is possible that some objects are left with a white colour (potential leaks) and reported. Add a check to avoid reporting such objects. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-29 17:14:13 +01:00
Pekka Enberg	ec5a36f94e	SLAB: Fix lockdep annotations Commit 8429db5... ("slab: setup cpu caches later on when interrupts are enabled") broke mm/slab.c lockdep annotations: [ 11.554715] ============================================= [ 11.555249] [ INFO: possible recursive locking detected ] [ 11.555560] 2.6.31-rc1 #896 [ 11.555861] --------------------------------------------- [ 11.556127] udevd/1899 is trying to acquire lock: [ 11.556436] (&nc->lock){-.-...}, at: [<ffffffff810c337f>] kmem_cache_free+0xcd/0x25b [ 11.557101] [ 11.557102] but task is already holding lock: [ 11.557706] (&nc->lock){-.-...}, at: [<ffffffff810c3cd0>] kfree+0x137/0x292 [ 11.558109] [ 11.558109] other info that might help us debug this: [ 11.558720] 2 locks held by udevd/1899: [ 11.558983] #0: (&nc->lock){-.-...}, at: [<ffffffff810c3cd0>] kfree+0x137/0x292 [ 11.559734] #1: (&parent->list_lock){-.-...}, at: [<ffffffff810c36c7>] __drain_alien_cache+0x3b/0xbd [ 11.560442] [ 11.560443] stack backtrace: [ 11.561009] Pid: 1899, comm: udevd Not tainted 2.6.31-rc1 #896 [ 11.561276] Call Trace: [ 11.561632] [<ffffffff81065ed6>] __lock_acquire+0x15ec/0x168f [ 11.561901] [<ffffffff81065f60>] ? __lock_acquire+0x1676/0x168f [ 11.562171] [<ffffffff81063c52>] ? trace_hardirqs_on_caller+0x113/0x13e [ 11.562490] [<ffffffff8150c337>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 11.562807] [<ffffffff8106603a>] lock_acquire+0xc1/0xe5 [ 11.563073] [<ffffffff810c337f>] ? kmem_cache_free+0xcd/0x25b [ 11.563385] [<ffffffff8150c8fc>] _spin_lock+0x31/0x66 [ 11.563696] [<ffffffff810c337f>] ? kmem_cache_free+0xcd/0x25b [ 11.563964] [<ffffffff810c337f>] kmem_cache_free+0xcd/0x25b [ 11.564235] [<ffffffff8109bf8c>] ? __free_pages+0x1b/0x24 [ 11.564551] [<ffffffff810c3564>] slab_destroy+0x57/0x5c [ 11.564860] [<ffffffff810c3641>] free_block+0xd8/0x123 [ 11.565126] [<ffffffff810c372e>] __drain_alien_cache+0xa2/0xbd [ 11.565441] [<ffffffff810c3ce5>] kfree+0x14c/0x292 [ 11.565752] [<ffffffff8144a007>] skb_release_data+0xc6/0xcb [ 11.566020] [<ffffffff81449cf0>] __kfree_skb+0x19/0x86 [ 11.566286] [<ffffffff81449d88>] consume_skb+0x2b/0x2d [ 11.566631] [<ffffffff8144cbe0>] skb_free_datagram+0x14/0x3a [ 11.566901] [<ffffffff81462eef>] netlink_recvmsg+0x164/0x258 [ 11.567170] [<ffffffff81443461>] sock_recvmsg+0xe5/0xfe [ 11.567486] [<ffffffff810ab063>] ? might_fault+0xaf/0xb1 [ 11.567802] [<ffffffff81053a78>] ? autoremove_wake_function+0x0/0x38 [ 11.568073] [<ffffffff810d84ca>] ? core_sys_select+0x3d/0x2b4 [ 11.568378] [<ffffffff81065f60>] ? __lock_acquire+0x1676/0x168f [ 11.568693] [<ffffffff81442dc1>] ? sockfd_lookup_light+0x1b/0x54 [ 11.568961] [<ffffffff81444416>] sys_recvfrom+0xa3/0xf8 [ 11.569228] [<ffffffff81063c8a>] ? trace_hardirqs_on+0xd/0xf [ 11.569546] [<ffffffff8100af2b>] system_call_fastpath+0x16/0x1b# Fix that up. Closes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13654 Tested-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-06-29 09:57:10 +03:00
Linus Torvalds	8326e284f8	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, delay: tsc based udelay should have rdtsc_barrier x86, setup: correct include file in <asm/boot.h> x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h> x86, mce: percpu mcheck_timer should be pinned x86: Add sysctl to allow panic on IOCK NMI error x86: Fix uv bau sending buffer initialization x86, mce: Fix mce resume on 32bit x86: Move init_gbpages() to setup_arch() x86: ensure percpu lpage doesn't consume too much vmalloc space x86: implement percpu_alloc kernel parameter x86: fix pageattr handling for lpage percpu allocator and re-enable it x86: reorganize cpa_process_alias() x86: prepare setup_pcpu_lpage() for pageattr fix x86: rename remap percpu first chunk allocator to lpage x86: fix duplicate free in setup_pcpu_remap() failure path percpu: fix too lazy vunmap cache flushing x86: Set cpu_llc_id on AMD CPUs	2009-06-28 11:05:28 -07:00
Catalin Marinas	acf4968ec9	kmemleak: Slightly change the policy on newly allocated objects Newly allocated objects are more likely to be reported as false positives. Kmemleak ignores the reporting of objects younger than 5 seconds. However, this age was calculated after the memory scanning completed which usually takes longer than 5 seconds. This patch make the minimum object age calculation in relation to the start of the memory scanning. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-26 17:38:29 +01:00
Catalin Marinas	4698c1f2bb	kmemleak: Do not trigger a scan when reading the debug/kmemleak file Since there is a kernel thread for automatically scanning the memory, it makes sense for the debug/kmemleak file to only show its findings. This patch also adds support for "echo scan > debug/kmemleak" to trigger an intermediate memory scan and eliminates the kmemleak_mutex (scan_mutex covers all the cases now). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-26 17:38:27 +01:00
Catalin Marinas	bab4a34afc	kmemleak: Simplify the reports logged by the scanning thread Because of false positives, the memory scanning thread may print too much information. This patch changes the scanning thread to only print the number of newly suspected leaks. Further information can be read from the /sys/kernel/debug/kmemleak file. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-26 17:38:26 +01:00
Catalin Marinas	e0a2a1601b	kmemleak: Enable task stacks scanning by default This is to reduce the number of false positives reported. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-26 17:38:25 +01:00
Paul E. McKenney	7ed9f7e5db	fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b Jesper noted that kmem_cache_destroy() invokes synchronize_rcu() rather than rcu_barrier() in the SLAB_DESTROY_BY_RCU case, which could result in RCU callbacks accessing a kmem_cache after it had been destroyed. Cc: <stable@kernel.org> Acked-by: Matt Mackall <mpm@selenic.com> Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-06-26 12:10:47 +03:00
Paul Mundt	dfc2f91ac2	nommu: provide follow_pfn(). With the introduction of follow_pfn() as an exported symbol, modules have begun making use of it. Unfortunately this was not reflected on nommu at the time, so the in-tree users have subsequently all blown up with link errors there. This provides a simple follow_pfn() that just returns addr >> PAGE_SHIFT, which will do the right thing on nommu. There is no need to do range checking within the vma, as the find_vma() case will already take care of this. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2009-06-26 04:31:57 +09:00
Peter Zijlstra	9d73777e50	clarify get_user_pages() prototype Currently the 4th parameter of get_user_pages() is called len, but its in pages, not bytes. Rename the thing to nr_pages to avoid future confusion. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-25 11:22:13 -07:00
Catalin Marinas	a9d9058aba	kmemleak: Allow the early log buffer to be configurable. (feature suggested by Sergey Senozhatsky) Kmemleak needs to track all the memory allocations but some of these happen before kmemleak is initialised. These are stored in an internal buffer which may be exceeded in some kernel configurations. This patch adds a configuration option with a default value of 400 and also removes the stack dump when the early log buffer is exceeded. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by>	2009-06-25 10:16:13 +01:00
Linus Torvalds	c622304825	Merge branches 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/{vfs-2.6,audit-current} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: another race fix in jfs_check_acl() Get "no acls for this inode" right, fix shmem breakage inline functions left without protection of ifdef (acl) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL	2009-06-24 14:17:14 -07:00
Al Viro	72c04902d1	Get "no acls for this inode" right, fix shmem breakage Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-24 16:58:48 -04:00
Pekka Enberg	ba52270d18	SLUB: Don't pass __GFP_FAIL for the initial allocation SLUB uses higher order allocations by default but falls back to small orders under memory pressure. Make sure the GFP mask used in the initial allocation doesn't include __GFP_NOFAIL. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-24 12:20:14 -07:00
Linus Torvalds	4923abf9f1	Don't warn about order-1 allocations with __GFP_NOFAIL Traditionally, we never failed small orders (even regardless of any __GFP_NOFAIL flags), and slab will allocate order-1 allocations even for small allocations that could fit in a single page (in order to avoid excessive fragmentation). Maybe we should remove this warning entirely, but before making that judgement, at least limit it to bigger allocations. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-24 12:16:49 -07:00
Al Viro	06b16e9f68	switch shmem to inode->i_acl Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-24 08:17:06 -04:00
Tejun Heo	245b2e70ea	percpu: clean up percpu variable definitions Percpu variable definition is about to be updated such that all percpu symbols including the static ones must be unique. Update percpu variable definitions accordingly. * as,cfq: rename ioc_count uniquely * cpufreq: rename cpu_dbs_info uniquely * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and rename it * ipv4,6: rename cookie_scratch uniquely * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to pmc_irq_entry and nmi_entry to pmc_nmi_entry * perf_counter: rename disable_count to perf_disable_count * ftrace: rename test_event_disable to ftrace_test_event_disable * kmemleak: rename test_pointer to kmemleak_test_pointer * mce: rename next_interval to mce_next_interval [ Impact: percpu usage cleanups, no duplicate static percpu var names ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm <linux-mm@kvack.org> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andi Kleen <andi@firstfloor.org>	2009-06-24 15:13:48 +09:00
Tejun Heo	204fba4aa3	percpu: cleanup percpu array definitions Currently, the following three different ways to define percpu arrays are in use. 1. DEFINE_PER_CPU(elem_type[array_len], array_name); 2. DEFINE_PER_CPU(elem_type, array_name[array_len]); 3. DEFINE_PER_CPU(elem_type, array_name)[array_len]; Unify to #1 which correctly separates the roles of the two parameters and thus allows more flexibility in the way percpu variables are defined. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm@kvack.org Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net>	2009-06-24 15:13:45 +09:00
Tejun Heo	e74e396204	percpu: use dynamic percpu allocator as the default percpu allocator This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use dynamic percpu allocator. The first chunk is allocated using embedding helper and 8k is reserved for modules. This ensures that the new allocator behaves almost identically to the original allocator as long as static percpu variables are concerned, so it shouldn't introduce much breakage. s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing range limit the addressing model imposes. Unfortunately, this breaks if the address is specified using a variable, so for now, the two archs aren't converted. The following architectures are affected by this change. * sh * arm * cris * mips * sparc(32) * blackfin * avr32 * parisc (broken, under investigation) * m32r * powerpc(32) As this change makes the dynamic allocator the default one, CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert - CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted archs. These archs implement their own setup_per_cpu_areas() and the conversion is not trivial. * powerpc(64) * sparc(64) * ia64 * alpha * s390 Boot and batch alloc/free tests on x86_32 with debug code (x86_32 doesn't use default first chunk initialization). Compile tested on sparc(32), powerpc(32), arm and alpha. Kyle McMartin reported that this change breaks parisc. The problem is still under investigation and he is okay with pushing this patch forward and fixing parisc later. [ Impact: use dynamic allocator for most archs w/o custom percpu setup ] Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Mikael Starvik <starvik@axis.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Bryan Wu <cooloney@kernel.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>	2009-06-24 15:13:35 +09:00
Dimitri Sivanich	364df0ebfb	mm: fix handling of pagesets for downed cpus After downing/upping a cpu, an attempt to set /proc/sys/vm/percpu_pagelist_fraction results in an oops in percpu_pagelist_fraction_sysctl_handler(). If a processor is downed then we need to set the pageset pointer back to the boot pageset. Updates of the high water marks should not access pagesets of unpopulated zones (those pointer go to the boot pagesets which would be no longer functional if their size would be increased beyond zero). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-23 12:50:05 -07:00
Hugh Dickins	a5c9b696ec	mm: pass mm to grab_swap_token If a kthread happens to use get_user_pages() on an mm (as KSM does), there's a chance that it will end up trying to read in a swap page, then oops in grab_swap_token() because the kthread has no mm: GUP passes down the right mm, so grab_swap_token() ought to be using it. We have not identified a stronger case than KSM's daemon (not yet in mainline), but the issue must have come up before, since RHEL has included a fix for this for years (though a different fix, they just back out of grab_swap_token if current->mm is unset: which is what we first proposed, but using the right mm here seems more correct). Reported-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-23 12:50:05 -07:00
Linus Torvalds	95b3692d9c	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Do not force the slab debugging Kconfig options kmemleak: use pr_fmt	2009-06-23 11:25:04 -07:00
Hugh Dickins	d26ed650d9	mm: don't rely on flags coincidence Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE, and it's tempting to devise a set of Grand Unified Paging flags; but not today. So until then, let's rely upon the compiler to spot the coincidence, "rather than have that subtle dependency and a comment for it" - as you remarked in another context yesterday. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-23 11:23:33 -07:00
Hugh Dickins	788c7df451	hugetlb: fault flags instead of write_access handle_mm_fault() is now passing fault flags rather than write_access down to hugetlb_fault(), so better recognize that in hugetlb_fault(), and in hugetlb_no_page(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-23 11:23:33 -07:00
KAMEZAWA Hiroyuki	cb4cbcf6b3	mm: fix incorrect page removal from LRU The isolated page is "cursor_page" not "page". This could cause LRU list corruption under memory pressure, caught by CONFIG_DEBUG_LIST. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-23 10:17:28 -07:00
Joe Perches	ae281064be	kmemleak: use pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-23 14:40:26 +01:00
Tejun Heo	fa8a7094ba	x86: implement percpu_alloc kernel parameter According to Andi, it isn't clear whether lpage allocator is worth the trouble as there are many processors where PMD TLB is far scarcer than PTE TLB. The advantage or disadvantage probably depends on the actual size of percpu area and specific processor. As performance degradation due to TLB pressure tends to be highly workload specific and subtle, it is difficult to decide which way to go without more data. This patch implements percpu_alloc kernel parameter to allow selecting which first chunk allocator to use to ease debugging and testing. While at it, make sure all the failure paths report why something failed to help determining why certain allocator isn't working. Also, kill the "Great future plan" comment which had already been realized quite some time ago. [ Impact: allow explicit percpu first chunk allocator selection ] Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jan Beulich <JBeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu>	2009-06-22 11:56:24 +09:00
Tejun Heo	85ae87c1ad	percpu: fix too lazy vunmap cache flushing In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as the page is going to be returned to the page allocator. Only TLB flushing can be put off such that vmalloc code can handle it lazily. Fix it. [ Impact: fix subtle virtual cache flush bug ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu>	2009-06-22 11:56:23 +09:00
Linus Torvalds	d06063cc22	Move FAULT_FLAG_xyz into handle_mm_fault() callers This allows the callers to now pass down the full set of FAULT_FLAG_xyz flags to handle_mm_fault(). All callers have been (mechanically) converted to the new calling convention, there's almost certainly room for architectures to clean up their code and then add FAULT_FLAG_RETRY when that support is added. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-21 13:08:22 -07:00
Linus Torvalds	30c9f3a9fa	Remove internal use of 'write_access' in mm/memory.c The fault handling routines really want more fine-grained flags than a single "was it a write fault" boolean - the callers will want to set flags like "you can return a retry error" etc. And that's actually how the VM works internally, but right now the top-level fault handling functions in mm/memory.c all pass just the 'write_access' boolean around. This switches them over to pass around the FAULT_FLAG_xyzzy 'flags' variable instead. The 'write_access' calling convention still exists for the exported 'handle_mm_fault()' function, but that is next. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-21 13:06:05 -07:00
Johannes Weiner	c277331d5f	mm: page_alloc: clear PG_locked before checking flags on free `da456f1` "page allocator: do not disable interrupts in free_page_mlock()" moved the PG_mlocked clearing after the flag sanity checking which makes mlocked pages always trigger 'bad page'. Fix this by clearing the bit up front. Reported--and-debugged-by: Peter Chubb <peter.chubb@nicta.com.au> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-20 16:08:22 -07:00
Joe Perches	433f13a727	bootmem.c: avoid c90 declaration warning [akpm@linux-foundation.org: cleanup] Signed-off-by: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-19 16:46:04 -07:00
Benjamin Herrenschmidt	dcce284a25	mm: Extend gfp masking to the page allocator The page allocator also needs the masking of gfp flags during boot, so this moves it out of slab/slub and uses it with the page allocator as well. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:12:57 -07:00
KAMEZAWA Hiroyuki	2ffebca6aa	memcg: fix lru rotation in isolate_pages Try to fix memcg's lru rotation sanity: make memcg use the same logic as the global LRU does. Now, at __isolate_lru_page() retruns -EBUSY, the page is rotated to the tail of LRU in global LRU's isolate LRU pages. But in memcg, it's not handled. This makes memcg do the same behavior as global LRU and rotate LRU in the page is busy. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:48 -07:00
KAMEZAWA Hiroyuki	22a668d7c3	memcg: fix behavior under memory.limit equals to memsw.limit A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the user just want to limit the total size of applications, in other words, not very interested in memory usage itself. In this case, swap-out will be done only by global-LRU. But, under current implementation, memory.limit_in_bytes is checked at first and try_to_free_page() may do swap-out. But, that swap-out is useless for memsw.limit_in_bytes and the thread may hit limit again. This patch tries to fix the current behavior at memory.limit == memsw.limit case. And documentation is updated to explain the behavior of this special case. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:48 -07:00
KAMEZAWA Hiroyuki	8a9478ca7f	memcg: fix swap accounting This patch fixes mis-accounting of swap usage in memcg. In the current implementation, memcg's swap account is uncharged only when swap is completely freed. But there are several cases where swap cannot be freed cleanly. For handling that, this patch changes that memcg uncharges swap account when swap has no references other than cache. By this, memcg's swap entry accounting can be fully synchronous with the application's behavior. This patch also changes memcg's hooks for swap-out. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:47 -07:00
Li Zefan	338c843108	memcg: remove some redundant checks We don't need to check do_swap_account in the case that the function which checks do_swap_account will never get called if do_swap_account == 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:47 -07:00
Balbir Singh	d69b042f3d	memcg: add file-based RSS accounting Add file RSS tracking per memory cgroup We currently don't track file RSS, the RSS we report is actually anon RSS. All the file mapped pages, come in through the page cache and get accounted there. This patch adds support for accounting file RSS pages. It should 1. Help improve the metrics reported by the memory resource controller 2. Will form the basis for a future shared memory accounting heuristic that has been proposed by Kamezawa. Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat to "anon_rss". We however, add "mapped_file" data and hope to educate the end user through documentation. [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.cn> Cc: Paul Menage <menage@google.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:47 -07:00
Randy Dunlap	8ca739e369	cgroups: make messages more readable Fix some cgroup messages to read better. Update MAINTAINERS to include mm/cgroup files. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-18 13:03:46 -07:00
Linus Torvalds	3fe0344faf	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Fix some typos in comments kmemleak: Rename kmemleak_panic to kmemleak_stop kmemleak: Only use GFP_KERNEL\|GFP_ATOMIC for the internal allocations	2009-06-17 10:42:21 -07:00
Catalin Marinas	2030117d27	kmemleak: Fix some typos in comments Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-06-17 18:29:04 +01:00
Catalin Marinas	000814f44e	kmemleak: Rename kmemleak_panic to kmemleak_stop This is to avoid the confusion created by the "panic" word. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-06-17 18:29:03 +01:00
Catalin Marinas	216c04b0d8	kmemleak: Only use GFP_KERNEL\|GFP_ATOMIC for the internal allocations Kmemleak allocates memory for pointer tracking and it tries to avoid using GFP_ATOMIC if the caller doesn't require it. However other gfp flags may be passed by the caller which aren't required by kmemleak. This patch filters the gfp flags so that only GFP_KERNEL \| GFP_ATOMIC are used. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>	2009-06-17 18:29:02 +01:00
Pekka Enberg	5caf5c7dc2	Merge branch 'slub/earlyboot' into for-linus Conflicts: mm/slub.c	2009-06-17 08:30:54 +03:00
Pekka Enberg	e03ab9d415	Merge branches 'slab/documentation', 'slab/fixes', 'slob/cleanups' and 'slub/fixes' into for-linus	2009-06-17 08:30:15 +03:00
Linus Torvalds	517d08699b	Merge branch 'akpm' * akpm: (182 commits) fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset fbdev: bfin: fix __dev{init,exit} markings fbdev: bfin: drop unnecessary calls to memset fbdev: bfin-t350mcqb-fb: drop unused local variables fbdev: blackfin has __raw I/O accessors, so use them in fb.h fbdev: s1d13xxxfb: add accelerated bitblt functions tcx: use standard fields for framebuffer physical address and length fbdev: add support for handoff from firmware to hw framebuffers intelfb: fix a bug when changing video timing fbdev: use framebuffer_release() for freeing fb_info structures radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb? s3c-fb: CPUFREQ frequency scaling support s3c-fb: fix resource releasing on error during probing carminefb: fix possible access beyond end of carmine_modedb[] acornfb: remove fb_mmap function mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF mb862xxfb: restrict compliation of platform driver to PPC Samsung SoC Framebuffer driver: add Alpha Channel support atmel-lcdc: fix pixclock upper bound detection offb: use framebuffer_alloc() to allocate fb_info struct ... Manually fix up conflicts due to kmemcheck in mm/slab.c	2009-06-16 19:50:13 -07:00
KAMEZAWA Hiroyuki	ee993b135e	mm: fix lumpy reclaim lru handling at isolate_lru_pages At lumpy reclaim, a page failed to be taken by __isolate_lru_page() can be pushed back to "src" list by list_move(). But the page may not be from "src" list. This pushes the page back to wrong LRU. And list_move() itself is unnecessary because the page is not on top of LRU. Then, leave it as it is if __isolate_lru_page() fails. Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:46 -07:00
Mel Gorman	24cf72518c	vmscan: count the number of times zone_reclaim() scans and fails On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but it is possible that the heuristic will fail and the CPU gets tied up scanning uselessly. Detecting the situation requires some guesswork and experimentation so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during high CPU utilisation this counter is increasing rapidly, then the resolution to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0. [akpm@linux-foundation.org: name things consistently] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:46 -07:00
Mel Gorman	fa5e084e43	vmscan: do not unconditionally treat zones that fail zone_reclaim() as full On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. The problem is that zone_reclaim() failing at all means the zone gets marked full. This can cause situations where a zone is usable, but is being skipped because it has been considered full. Take a situation where a large tmpfs mount is occuping a large percentage of memory overall. The pages do not get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full and the zonelist cache considers them not worth trying in the future. This patch makes zone_reclaim() return more fine-grained information about what occured when zone_reclaim() failued. The zone only gets marked full if it really is unreclaimable. If it's a case that the scan did not occur or if enough pages were not reclaimed with the limited reclaim_mode, then the zone is simply skipped. There is a side-effect to this patch. Currently, if zone_reclaim() successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go ahead. With this patch applied, zone watermarks are rechecked after zone_reclaim() does some work. This bug was introduced by commit `9276b1bc96` ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the zonelist_cache was introduced. It was not intended that zone_reclaim() aggressively consider the zone to be full when it failed as full direct reclaim can still be an option. Due to the age of the bug, it should be considered a -stable candidate. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:45 -07:00
Mel Gorman	90afa5de6f	vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim A bug was brought to my attention against a distro kernel but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists although I don't have specific examples at the moment. The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists are uselessly scanned frequencly making the CPU spin at near 100%. This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with expectations which were noticed during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim(). Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall as it uselessly scanned the LRU constantly. Currently, zone_reclaim is assuming zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided. Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable. Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporarily resolution and a bug reported because the scan-avoidance heuristic is still broken. This patch: On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manfiest as high CPU usage as the LRU list is scanned uselessly. Historically, once enabled it was depending on NR_FILE_PAGES which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed such as swapcache and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure. This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in the reclaim_mode it will either consider NR_FILE_PAGES as potential candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED are not. [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages] [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:45 -07:00
David Rientjes	8123681022	oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:45 -07:00
Daisuke Nishimura	9198e96c06	vmscan: handle may_swap more strictly Commit `2e2e425989` ("vmscan,memcg: reintroduce sc->may_swap) add may_swap flag and handle it at get_scan_ratio(). But the result of get_scan_ratio() is ignored when priority == 0, so anon lru is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO, this is not an expected behavior. As for memcg especially, because of this behavior many and many pages are swapped-out just in vain when oom is invoked by mem+swap limit. This patch is for handling may_swap flag more strictly. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:45 -07:00
Wu Fengguang	3eb4140f03	vmscan: merge duplicate code in shrink_active_list() The "move pages to active list" and "move pages to inactive list" code blocks are mostly identical and can be served by a function. Thanks to Andrew Morton for pointing this out. Note that buffer_heads_over_limit check will also be carried out for re-activated pages, which is slightly different from pre-2.6.28 kernels. Also, Rik's "vmscan: evict use-once pages first" patch could totally stop scans of active file list when memory pressure is low. So the net effect could be, the number of buffer heads is now more likely to grow large. However that's fine according to Johannes' comments: I don't think that this could be harmful. We just preserve the buffer mappings of what we consider the working set and with low memory pressure, as you say, this set is not big. As to stripping of reactivated pages: the only pages we re-activate for now are those VM_EXEC mapped ones. Since we don't expect IO from or to these pages, removing the buffer mappings in case they grow too large should be okay, I guess. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:45 -07:00
Wu Fengguang	8cab4754d2	vmscan: make mapped executable pages the first class citizen Protect referenced PROT_EXEC mapped pages from being deactivated. PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some currently running executables and their linked libraries, they shall really be cached aggressively to provide good user experiences. Thanks to Johannes Weiner for the advice to reuse the VMA walk in page_referenced() to get the PROT_EXEC bit. [more details] ( The consequences of this patch will have to be discussed together with Rik van Riel's recent patch "vmscan: evict use-once pages first". ) ( Some of the good points and insights are taken into this changelog. Thanks to all the involved people for the great LKML discussions. ) the problem =========== For a typical desktop, the most precious working set is composed of actively accessed (1) memory mapped executables (2) and their anonymous pages (3) and other files (4) and the dcache/icache/.. slabs while the least important data are (5) infrequently used or use-once files For a typical desktop, one major problem is busty and large amount of (5) use-once files flushing out the working set. Inside the working set, (4) dcache/icache have already been too sticky ;-) So we only have to care (2) anonymous and (1)(3) file pages. anonymous pages =============== Anonymous pages are effectively immune to the streaming IO attack, because we now have separate file/anon LRU lists. When the use-once files crowd into the file LRU, the list's "quality" is significantly lowered. Therefore the scan balance policy in get_scan_ratio() will choose to scan the (low quality) file LRU much more frequently than the anon LRU. file pages ========== Rik proposed to not scan the active file LRU when the inactive list grows larger than active list. This guarantees that when there are use-once streaming IO, and the working set is not too large(so that active_size < inactive_size), the active file LRU will not be scanned at all. So the not-too-large working set can be well protected. But there are also situations where the file working set is a bit large so that (active_size >= inactive_size), or the streaming IOs are not purely use-once. In these cases, the active list will be scanned slowly. Because the current shrink_active_list() policy is to deactivate active pages regardless of their referenced bits. The deactivated pages become susceptible to the streaming IO attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that the deactivated pages don't have enough time to get re-referenced. Because a user tend to switch between windows in intervals from seconds to minutes. This patch holds mapped executable pages in the active list as long as they are referenced during each full scan of the active list. Because the active list is normally scanned much slower, they get longer grace time (eg. 100s) for further references, which better matches the pace of user operations. Therefore this patch greatly prolongs the in-cache time of executable code, when there are moderate memory pressures. before patch: guaranteed to be cached if reference intervals < I after patch: guaranteed to be cached if reference intervals < I+A (except when randomly reclaimed by the lumpy reclaim) where A = time to fully scan the active file LRU I = time to fully scan the inactive file LRU Note that normally A >> I. side effects ============ This patch is safe in general, it restores the pre-2.6.28 mmap() behavior but in a much smaller and well targeted scope. One may worry about some one to abuse the PROT_EXEC heuristic. But as Andrew Morton stated, there are other tricks to getting that sort of boost. Another concern is the PROT_EXEC mapped pages growing large in rare cases, and therefore hurting reclaim efficiency. But a sane application targeted for large audience will never use PROT_EXEC for data mappings. If some home made application tries to abuse that bit, it shall be aware of the consequences. If it is abused to scale of 2/3 total memory, it gains nothing but overheads. benchmarks ========== 1) memory tight desktop 1.1) brief summary - clock time and major faults are reduced by 50%; - pswpin numbers are reduced to ~1/3. That means X desktop responsiveness is doubled under high memory/swap pressure. 1.2) test scenario - nfsroot gnome desktop with 512M physical memory - run some programs, and switch between the existing windows after starting each new program. 1.3) progress timing (seconds) before after programs 0.02 0.02 N xeyes 0.75 0.76 N firefox 2.02 1.88 N nautilus 3.36 3.17 N nautilus --browser 5.26 4.89 N gthumb 7.12 6.47 N gedit 9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf 13.58 12.55 N xterm 15.87 14.57 N mlterm 18.63 17.06 N gnome-terminal 21.16 18.90 N urxvt 26.24 23.48 N gnome-system-monitor 28.72 26.52 N gnome-help 32.15 29.65 N gnome-dictionary 39.66 36.12 N /usr/games/sol 43.16 39.27 N /usr/games/gnometris 48.65 42.56 N /usr/games/gnect 53.31 47.03 N /usr/games/gtali 58.60 52.05 N /usr/games/iagno 65.77 55.42 N /usr/games/gnotravex 70.76 61.47 N /usr/games/mahjongg 76.15 67.11 N /usr/games/gnome-sudoku 86.32 75.15 N /usr/games/glines 92.21 79.70 N /usr/games/glchess 103.79 88.48 N /usr/games/gnomine 113.84 96.51 N /usr/games/gnotski 124.40 102.19 N /usr/games/gnibbles 137.41 114.93 N /usr/games/gnobots2 155.53 125.02 N /usr/games/blackjack 179.85 135.11 N /usr/games/same-gnome 224.49 154.50 N /usr/bin/gnome-window-properties 248.44 162.09 N /usr/bin/gnome-default-applications-properties 282.62 173.29 N /usr/bin/gnome-at-properties 323.72 188.21 N /usr/bin/gnome-typing-monitor 363.99 199.93 N /usr/bin/gnome-at-visual 394.21 206.95 N /usr/bin/gnome-sound-properties 435.14 224.49 N /usr/bin/gnome-at-mobility 463.05 234.11 N /usr/bin/gnome-keybinding-properties 503.75 248.59 N /usr/bin/gnome-about-me 554.00 276.27 N /usr/bin/gnome-display-properties 615.48 304.39 N /usr/bin/gnome-network-preferences 693.03 342.01 N /usr/bin/gnome-mouse-properties 759.90 388.58 N /usr/bin/gnome-appearance-properties 937.90 508.47 N /usr/bin/gnome-control-center 1109.75 587.57 N /usr/bin/gnome-keyboard-properties 1399.05 758.16 N : oocalc 1524.64 830.03 N : oodraw 1684.31 900.03 N : ooimpress 1874.04 993.91 N : oomath 2115.12 1081.89 N : ooweb 2369.02 1161.99 N : oowriter Note that the last ": oo" commands are actually commented out. 1.4) vmstat numbers (some relevant ones are marked with ) before after nr_free_pages 1293 3898 nr_inactive_anon 59956 53460 nr_active_anon 26815 30026 nr_inactive_file 2657 3218 nr_active_file 2019 2806 nr_unevictable 4 4 nr_mlock 4 4 nr_anon_pages 26706 27859 nr_mapped 3542 4469 nr_file_pages 72232 67681 nr_dirty 1 0 nr_writeback 123 19 nr_slab_reclaimable 3375 3534 nr_slab_unreclaimable 11405 10665 nr_page_table_pages 8106 7864 nr_unstable 0 0 nr_bounce 0 0 nr_vmscan_write 394776 230839 nr_writeback_temp 0 0 numa_hit 6843353 3318676 numa_miss 0 0 numa_foreign 0 0 numa_interleave 1719 1719 numa_local 6843353 3318676 numa_other 0 0 pgpgin 5954683 2057175 pgpgout `1578276` 922744 pswpin 1486615 512238 pswpout 394568 230685 pgalloc_dma 277432 56602 pgalloc_dma32 6769477 3310348 pgalloc_normal 0 0 pgalloc_movable 0 0 pgfree 7048396 3371118 pgactivate 2036343 1471492 pgdeactivate 2189691 1612829 pgfault 3702176 3100702 pgmajfault 452116 201343 pgrefill_dma 12185 7127 pgrefill_dma32 334384 653703 pgrefill_normal 0 0 pgrefill_movable 0 0 pgsteal_dma 74214 22179 pgsteal_dma32 3334164 1638029 pgsteal_normal 0 0 pgsteal_movable 0 0 pgscan_kswapd_dma 1081421 1216199 pgscan_kswapd_dma32 58979118 46002810 pgscan_kswapd_normal 0 0 pgscan_kswapd_movable 0 0 pgscan_direct_dma 2015438 1086109 pgscan_direct_dma32 55787823 36101597 pgscan_direct_normal 0 0 pgscan_direct_movable 0 0 pginodesteal 3461 7281 slabs_scanned 564864 527616 kswapd_steal 2889797 1448082 kswapd_inodesteal 14827 14835 pageoutrun 43459 21562 allocstall 9653 4032 pgrotated 384216 228631 1.5) free numbers at the end of the tests before patch: total used free shared buffers cached Mem: 474 467 7 0 0 236 -/+ buffers/cache: 230 243 Swap: 1023 418 605 after patch: total used free shared buffers cached Mem: 474 457 16 0 0 236 -/+ buffers/cache: 221 253 Swap: 1023 404 619 2) memory flushing in a file server 2.1) brief summary The number of major faults from 50 to 3 during 10% cache hot reads. That means this patch successfully stops major faults when the active file list is slowly scanned when there are partially cache hot streaming IO. 2.2) test scenario Do 100000 pread(size=110 pages, offset=(i100) pages), where 10% of the pages will be activated: for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10 iotrace.rb --load pattern-hot-10 --play /b/sparse vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree and monitor /proc/vmstat during the time. The test box has 2G memory. I carried out tests on fresh booted console as well as X desktop, and fetched the vmstat numbers on (1) begin: shortly after the big read IO starts; (2) end: just before the big read IO stops; (3) restore: the big read IO stops and the zsh working set restored (4) restore X: after IO, switch back and forth between the urxvt and firefox windows to restore their working set. 2.3) console mode results nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree 2.6.29 VM_EXEC protection ON: begin: 2481 2237 8694 630 0 574299 end: 275 231976 233914 633 776271 20933042 restore: 370 232154 234524 691 777183 20958453 2.6.29 VM_EXEC protection ON (second run): begin: 2434 2237 8493 629 0 574195 end: 284 231970 233536 632 771918 20896129 restore: 399 232218 234789 690 774526 20957909 2.6.30-rc4-mm VM_EXEC protection OFF: begin: 2479 2344 9659 210 0 579643 end: 284 232010 234142 260 772776 20917184 restore: 379 232159 234371 301 774888 20967849 The above console numbers show that - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29. I'd attribute that improvement to the mmap readahead improvements :-) - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50. That's a huge improvement - which means with the VM_EXEC protection logic, active mmap pages is pretty safe even under partially cache hot streaming IO. - when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8 under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree) That roughly means the active mmap pages get 20.8 more chances to get re-referenced to stay in memory. - The absolute nr_mapped drops considerably to 1/9 during the big IO, and the dropped pages are mostly inactive ones. The patch has almost no impact in this aspect, that means it won't unnecessarily increase memory pressure. (In contrast, your 20% mmap protection ratio will keep them all, and therefore eliminate the extra 41 major faults to restore working set of zsh etc.) The iotrace.rb read throughput is 151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse which means the inactive list is rotated at the speed of 250MB/s, so a full scan of which takes about 3.5 seconds, while a full scan of active file list takes about 77 seconds. 2.4) X mode results We can reach roughly the same conclusions for X desktop: nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree 2.6.30-rc4-mm VM_EXEC protection ON: begin: 9740 8920 64075 561 0 678360 end: 768 218254 220029 565 798953 21057006 restore: 857 218543 220987 606 799462 21075710 restore X: 2414 218560 225344 797 799462 21080795 2.6.30-rc4-mm VM_EXEC protection OFF: begin: 9368 5035 26389 554 0 633391 end: 770 218449 221230 661 646472 17832500 restore: 1113 218466 220978 710 649881 17905235 restore X: 2687 218650 225484 947 802700 21083584 - the absolute nr_mapped drops considerably (to 1/13 of the original size) during the streaming IO. - the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393 during the whole process. Cc: Elladan <elladan@eskimo.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:44 -07:00
Wu Fengguang	6fe6b7e357	vmscan: report vm_flags in page_referenced() Collect vma->vm_flags of the VMAs that actually referenced the page. This is preparing for more informed reclaim heuristics, eg. to protect executable file pages more aggressively. For now only the VM_EXEC bit will be used by the caller. Thanks to Johannes, Peter and Minchan for all the good tips. Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:44 -07:00
Sergei Trofimovich	168f5ac668	mm cleanup: shmem_file_setup: 'char ' -> 'const char ' for name argument As function shmem_file_setup does not modify/allocate/free/pass given filename - mark it as const. Signed-off-by: Sergei Trofimovich <slyfox@inbox.ru> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:44 -07:00
Minchan Kim	aca8bf323e	mm: remove file argument from swap_readpage() The file argument resulted from address_space's readpage long time ago. We don't use it any more. Let's remove unnecessary argement. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:44 -07:00
Minchan Kim	8192da6a88	mm: remove annotation of gfp_mask in add_to_swap Hugh removed add_to_swap's gfp_mask argument. (mm: remove gfp_mask from add_to_swap) So we have to remove annotation of gfp_mask of the function. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
Yinghai Lu	73d60b7f74	page-allocator: clear N_HIGH_MEMORY map before we set it again SRAT tables may contains nodes of very small size. The arch code may decide to not activate such a node. However, currently the early boot code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be active although these nodes have no present pages. For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too Signed-off-by: Yinghai Lu <Yinghai@kernel.org> Tested-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
Mike Waychison	286973552f	mm: remove __invalidate_mapping_pages variant Remove __invalidate_mapping_pages atomic variant now that its sole caller can sleep (fixed in `eccb95cee4` ("vfs: fix lock inversion in drop_pagecache_sb()")). This fixes softlockups that can occur while in the drop_caches path. Signed-off-by: Mike Waychison <mikew@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
David Rientjes	82553a937f	oom: invoke oom killer for __GFP_NOFAIL The oom killer must be invoked regardless of the order if the allocation is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free some memory. Cc: Nick Piggin <npiggin@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
David Rientjes	4d8b9135c3	oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
David Rientjes	2ff05b2b4e	oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:43 -07:00
KAMEZAWA Hiroyuki	c9e444103b	mm: reuse unused swap entry if necessary Presently we can know a swap entry is just used as SwapCache via swap_map, without looking up swap cache. Then, we have a chance to reuse swap-cache-only swap entries in get_swap_pages(). This patch tries to free swap-cache-only swap entries if swap is not enough. Note: We hit following path when swap_cluster code cannot find a free cluster. Then, vm_swap_full() is not only condition to allow the kernel to reclaim unused swap. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
KAMEZAWA Hiroyuki	355cfa73dd	mm: modify swap_map and add SWAP_HAS_CACHE flag This is a part of the patches for fixing memcg's swap accountinf leak. But, IMHO, not a bad patch even if no memcg. There are 2 kinds of references to swap. - reference from swap entry - reference from swap cache Then, - If there is swap cache && swap's refcnt is 1, there is only swap cache. (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL This counting logic have worked well for a long time. But considering that we cannot know there is a _real_ reference or not by swap_map[], current usage of counter is not very good. This patch adds a flag SWAP_HAS_CACHE and recored information that a swap entry has a cache or not. This will remove -1 magic used in swapfile.c and be a help to avoid unnecessary find_get_page(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
KAMEZAWA Hiroyuki	cb4b86ba47	mm: add swap cache interface for swap reference In a following patch, the usage of swap cache is recorded into swap_map. This patch is for necessary interface changes to do that. 2 interfaces: - swapcache_prepare() - swapcache_free() are added for allocating/freeing refcnt from swap-cache to existing swap entries. But implementation itself is not changed under this patch. At adding swapcache_free(), memcg's hook code is moved under swapcache_free(). This is better than using scattered hooks. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
KOSAKI Motohiro	6837765963	mm: remove CONFIG_UNEVICTABLE_LRU config option Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
Minchan Kim	bce7394a3e	page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens Solve two problems. Whenever memory hotplug sucessfully happens, zone->present_pages have to be changed. 1) Now memory hotplug calls setup_per_zone_wmark_min only when online_pages called, not offline_pages. It breaks balance. 2) If zone->present_pages is changed, we also have to change zone->inactive_ratio. That's because inactive_ratio depends on zone->present_pages. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
Minchan Kim	96cb4df5dd	page-allocator: add inactive ratio calculation function of each zone Factor the per-zone arithemetic inside setup_per_zone_inactive_ratio()'s loop into a a separate function, calculate_zone_inactive_ratio(). This function will be used in a later patch [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:42 -07:00
Minchan Kim	bc75d33f0f	page-allocator: clean up functions related to pages_min Change the names of two functions. It doesn't affect behavior. Presently, setup_per_zone_pages_min() changes low, high of zone as well as min. So a better name is setup_per_zone_wmarks(). That's because Mel changed zone->pages_[hig/low/min] to zone->watermark array in "page allocator: replace the watermark-related union in struct zone with a watermark[] array". * setup_per_zone_pages_min => setup_per_zone_wmarks Of course, we have to change init_per_zone_pages_min, too. There are not pages_min any more. * init_per_zone_pages_min => init_per_zone_wmark_min [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:41 -07:00
MinChan Kim	69c8548175	vmscan: prevent shrinking of active anon lru list in case of no swap space V3 shrink_zone() can deactivate active anon pages even if we don't have a swap device. Many embedded products don't have a swap device. So the deactivation of anon pages is unnecessary. This patch prevents unnecessary deactivation of anon lru pages. But, it don't prevent aging of anon pages to swap out. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:41 -07:00
Brice Goglin	35282a2de4	migration: only migrate_prep() once per move_pages() migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz). Commit `3140a22730` improved move_pages() throughput by breaking it into chunks, but it also made migrate_prep() be called once per chunk (every 128pages or so) instead of once per move_pages(). This patch reverts to calling migrate_prep() only once per chunk as we did before 2.6.29. It is also a followup to commit `0aedadf91a` ("mm: move migrate_prep out from under mmap_sem"). This improves migration throughput on the above machine from 600MB/s to 750MB/s. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:41 -07:00
Rafael J. Wysocki	7f33d49a2e	mm, PM/Freezer: Disable OOM killer when tasks are frozen Currently, the following scenario appears to be possible in theory: * Tasks are frozen for hibernation or suspend. * Free pages are almost exhausted. * Certain piece of code in the suspend code path attempts to allocate some memory using GFP_KERNEL and allocation order less than or equal to PAGE_ALLOC_COSTLY_ORDER. * __alloc_pages_internal() cannot find a free page so it invokes the OOM killer. * The OOM killer attempts to kill a task, but the task is frozen, so it doesn't die immediately. * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries to find a free page and invokes the OOM killer. * No progress can be made. Although it is now hard to trigger during hibernation due to the memory shrinking carried out by the hibernation code, it is theoretically possible to trigger during suspend after the memory shrinking has been removed from that code path. Moreover, since memory allocations are going to be used for the hibernation memory shrinking, it will be even more likely to happen during hibernation. To prevent it from happening, introduce the oom_killer_disabled switch that will cause __alloc_pages_internal() to fail in the situations in which the OOM killer would have been called and make the freezer set this switch after tasks have been successfully frozen. [akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:40 -07:00
Nick Piggin	75927af8bc	mm: madvise(): correct return code The posix_madvise() function succeeds (and does nothing) when called with parameters (NULL, 0, -1); according to LSB tests, it should fail with EINVAL because -1 is not a valid flag. When called with a valid address and size, it correctly fails. So perform an initial check for valid flags first. Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:40 -07:00
Andrew Morton	dab48dab37	page-allocator: warn if __GFP_NOFAIL is used for a large allocation __GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either). Attempting to use __GFP_NOFAIL for a higher-order allocation is even worse, so add a once-off runtime check for this to slap people around for even thinking about trying it. Cc: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:40 -07:00
Johannes Weiner	3b6748e2dd	mm: introduce follow_pfn() Analoguous to follow_phys(), add a helper that looks up the PFN at a user virtual address in an IO mapping or a raw PFN mapping. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:40 -07:00
Johannes Weiner	03668a4deb	mm: use generic follow_pte() in follow_phys() Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:40 -07:00
Johannes Weiner	f8ad0f499f	mm: introduce follow_pte() A generic readonly page table lookup helper to map an address space and an address from it to a pte. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Cyrill Gorcunov	e9bb35df6f	mm: setup_per_zone_inactive_ratio - fix comment and make it __init The caller of setup_per_zone_inactive_ratio is an __init function. There is no need to keep the callee after it completed as well. Also fix a comment. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Cyrill Gorcunov	5c87eada68	mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not needed int_sqrt() returns 0 if its argument is zero so call it if only needed. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Wu Fengguang	af166777cf	vmscan: ZVC updates in shrink_active_list() can be done once This effectively lifts the unit of updates to nr_inactive_* and pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or MAX_ORDER_NR_PAGES=1024 for reclaim_zone(). Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Wu Fengguang	08d9ae7cbb	vmscan: don't export nr_saved_scan in /proc/zoneinfo The lru->nr_saved_scan's are not meaningful counters for even kernel developers. They typically are smaller than 32 and are always 0 for large lists. So remove them from /proc/zoneinfo. Hopefully this interface change won't break too many scripts. /proc/zoneinfo is too unstructured to be script friendly, and I wonder the affected scripts - if there are any - are still bleeding since the not long ago commit "vmscan: split LRU lists into anon & file sets", which also touched the "scanned" line :) If we are to re-export accumulated vmscan counts in the future, they can go to new lines in /proc/zoneinfo instead of the current form, or to /sys/devices/system/node/node0/meminfo? Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Wu Fengguang	6e08a369ee	vmscan: cleanup the scan batching code The vmscan batching logic is twisting. Move it into a standalone function nr_scan_try_batch() and document it. No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:39 -07:00
Rik van Riel	56e49d2188	vmscan: evict use-once pages first When the file LRU lists are dominated by streaming IO pages, evict those pages first, before considering evicting other pages. This should be safe from deadlocks or performance problems because only three things can happen to an inactive file page: 1) referenced twice and promoted to the active list 2) evicted by the pageout code 3) under IO, after which it will get evicted or promoted The pages freed in this way can either be reused for streaming IO, or allocated for something else. If the pages are used for streaming IO, this pageout pattern continues. Otherwise, we will fall back to the normal pageout pattern. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Elladan <elladan@eskimo.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:38 -07:00
Wu Fengguang	20a0307c03	mm: introduce PageHuge() for testing huge/gigantic pages A series of patches to enhance the /proc/pagemap interface and to add a userspace executable which can be used to present the pagemap data. Export 10 more flags to end users (and more for kernel developers): 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD () 16. KPF_COMPOUND_TAIL () 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. a simple demo of the page-types tool # ./page-types -h page-types [options] -r\|--raw Raw mode, for kernel developers -a\|--addr addr-spec Walk a range of pages -b\|--bits bits-spec Walk pages with specified bits -l\|--list Show page details in ranges -L\|--list-each Show page details one by one -N\|--no-summary Don't show summay info -h\|--help Show this usage message addr-spec: N one page at offset N (unit: pages) N+M pages range from N to N+M-1 N,M pages range from N to M-1 N, pages range from N to end ,M pages range from 0 to M bits-spec: bit1,bit2 (flags & (bit1\|bit2)) != 0 bit1,bit2=bit1 (flags & (bit1\|bit2)) == bit1 bit1,~bit2 (flags & (bit1\|bit2)) == bit1 =bit1,bit2 flags == (bit1\|bit2) bit-names: locked error referenced uptodate dirty lru active slab writeback reclaim buddy mmap anonymous swapcache swapbacked compound_head compound_tail huge unevictable hwpoison nopage reserved(r) mlocked(r) mappedtodisk(r) private(r) private_2(r) owner_private(r) arch(r) uncached(r) readahead(o) slob_free(o) slub_frozen(o) slub_debug(o) (r) raw mode bits (o) overloaded bits # ./page-types flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 487369 1903 _________________________________ 0x0000000000000014 5 0 __R_D____________________________ referenced,dirty 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000000000024 34 0 __R__l___________________________ referenced,lru 0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead 0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x0000000000000040 8344 32 ______A__________________________ active 0x0000000000000060 1 0 _____lA__________________________ lru,active 0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 503 1 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types -r flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 468002 1828 _________________________________ 0x0000000100000000 19102 74 _____________________r___________ reserved 0x0000000000008000 41 0 _______________H_________________ compound_head 0x0000000000010000 188 0 ________________T________________ compound_tail 0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head 0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private 0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead 0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk 0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead 0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk 0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private 0x0000000800000040 8124 31 ______A_________________P________ active,private 0x0000000000000040 219 0 ______A__________________________ active 0x0000000800000060 1 0 _____lA_________________P________ lru,active,private 0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk 0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private 0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk 0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private 0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private 0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 538 2 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types --raw --list --no-summary --bits reserved offset count flags 0 15 _____________________r___________ 31 4 _____________________r___________ 159 97 _____________________r___________ 4096 2067 _____________________r___________ 6752 2390 _____________________r___________ 9355 3 _____________________r___________ 9728 14526 _____________________r___________ This patch: Introduce PageHuge(), which identifies huge/gigantic pages by their dedicated compound destructor functions. Also move prep_compound_gigantic_page() to hugetlb.c and make __free_pages_ok() non-static. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:36 -07:00
Mel Gorman	a1dd268cf6	mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated logic alloc_large_system_hash() has logic for freeing pages at the end of an excessively large power-of-two buffer that is a duplicate of what is in alloc_pages_exact(). This patch converts alloc_large_system_hash() to use alloc_pages_exact(). Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:36 -07:00
Mel Gorman	72807a74c0	page allocator: sanity check order in the page allocator slow path Callers may speculatively call different allocators in order of preference trying to allocate a buffer of a given size. The order needed to allocate this may be larger than what the page allocator can normally handle. While the allocator mostly does the right thing, it should not direct reclaim or wakeup kswapd with a bogus order. This patch sanity checks the order in the slow path and returns NULL if it is too large. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:36 -07:00
KOSAKI Motohiro	092cead617	page allocator: move free_page_mlock() to page_alloc.c Currently, free_page_mlock() is only called from page_alloc.c. Thus, we can move it to page_alloc.c. Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Mel Gorman	b6e68bc1ba	page allocator: slab: use nr_online_nodes to check for a NUMA platform SLAB currently avoids checking a bitmap repeatedly by checking once and storing a flag. When the addition of nr_online_nodes as a cheaper version of num_online_nodes(), this check can be replaced by nr_online_nodes. (Christoph did a patch that this is lifted almost verbatim from) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Christoph Lameter	62bc62a873	page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths num_online_nodes() is called in a number of places but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function and touches a number of cache lines. This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes(). [rientjes@google.com: do not override definition of node_set_online() with macro] Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Mel Gorman	974709bdb2	page allocator: get the pageblock migratetype without disabling interrupts Local interrupts are disabled when freeing pages to the PCP list. Part of that free checks what the migratetype of the pageblock the page is in but it checks this with interrupts disabled and interupts should never be disabled longer than necessary. This patch checks the pagetype with interrupts enabled with the impact that it is possible a page is freed to the wrong list when a pageblock changes type. As that block is now already considered mixed from an anti-fragmentation perspective, it's not of vital importance. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Mel Gorman	f2260e6b1f	page allocator: update NR_FREE_PAGES only as necessary When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page freeing, it's updated once per page. This retouches cache lines more than necessary. Update the counters one per per-cpu bulk free. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Mel Gorman	418589663d	page allocator: use allocation flags as an index to the zone watermark ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:35 -07:00
Nick Piggin	a3af9c389a	page allocator: do not check for compound pages during the page allocator sanity checks A number of sanity checks are made on each page allocation and free including that the page count is zero. page_count() checks for compound pages and checks the count of the head page if true. However, in these paths, we do not care if the page is compound or not as the count of each tail page should also be zero. This patch makes two changes to the use of page_count() in the free path. It converts one check of page_count() to a VM_BUG_ON() as the count should have been unconditionally checked earlier in the free path. It also avoids checking for compound pages. [mel@csn.ul.ie: Wrote changelog] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:34 -07:00
Mel Gorman	d395b73428	page allocator: do not setup zonelist cache when there is only one node There is a zonelist cache which is used to track zones that are not in the allowed cpuset or found to be recently full. This is to reduce cache footprint on large machines. On smaller machines, it just incurs cost for no gain. This patch only uses the zonelist cache when there are NUMA nodes. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:34 -07:00
Mel Gorman	da456f14d2	page allocator: do not disable interrupts in free_page_mlock() free_page_mlock() tests and clears PG_mlocked using locked versions of the bit operations. If set, it disables interrupts to update counters and this happens on every page free even though interrupts are disabled very shortly afterwards a second time. This is wasteful. This patch splits what free_page_mlock() does. The bit check is still made. However, the update of counters is delayed until the interrupts are disabled and the non-lock version for clearing the bit is used. One potential weirdness with this split is that the counters do not get updated if the bad_page() check is triggered but a system showing bad pages is getting screwed already. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 19:47:34 -07:00

... 5 6 7 8 9 ...

3808 commits