linux/mm
KAMEZAWA Hiroyuki 4bfc44958e mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware
At first, init_task's mems_allowed is initialized as this.
 init_task->mems_allowed == node_state[N_POSSIBLE]

And cpuset's top_cpuset mask is initialized as this
 top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

Before 2.6.29:
policy's mems_allowed is initialized as this.

  1. update tasks->mems_allowed by its cpuset->mems_allowed.
  2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)

Updating task's mems_allowed in reference to top_cpuset's one.
cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.

In 2.6.30: After commit 58568d2a82
("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
is initialized as this.

  1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask)

Here, if task is in top_cpuset, task->mems_allowed is not updated from
init's one.  Assume user excutes command as #numactrl --interleave=all
,....

  policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK)

Then, policy's mems_allowd can includes a possible node, which has no pgdat.

MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this
directly.

  NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL

Then, what's we need is making policy->mems_allowed be aware of
N_HIGH_MEMORY.  This patch does that.  But to do so, extra nodemask will
be on statck.  Because I know cpumask has a new interface of
CPUMASK_ALLOC(), I added it to node.

This patch stands on old behavior.  But I feel this fix itself is just a
Band-Aid.  But to do fundametal fix, we have to take care of memory
hotplug and it takes time.  (task->mems_allowd should be N_HIGH_MEMORY, I
think.)

mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask
should be includes only online nodes.

In old behavior, this is guaranteed by frequent reference to cpuset's
code.  Now, most of them are removed and mempolicy has to check it by
itself.

To do check, a few nodemask_t will be used for calculating nodemask.  But,
size of nodemask_t can be big and it's not good to allocate them on stack.

Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area.
NODEMASK_ALLOC/FREE shoudl be there.

[akpm@linux-foundation.org: cleanups & tweaks]
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-07 10:39:55 -07:00
..
Kconfig Merge branch 'akpm' 2009-06-16 19:50:13 -07:00
Kconfig.debug kmemcheck: enable in the x86 Kconfig 2009-06-15 15:49:15 +02:00
Makefile Merge branch 'akpm' 2009-06-16 19:50:13 -07:00
allocpercpu.c percpu: __percpu_depopulate_mask can take a const mask 2009-04-06 13:44:15 -07:00
backing-dev.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
bootmem.c kmemleak: Add callbacks to the bootmem allocator 2009-07-08 14:25:14 +01:00
bounce.c block: remove some includings of blktrace_api.h 2009-06-16 11:19:36 +02:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c dmapools: protect page_list walk in show_pools() 2009-06-30 18:56:00 -07:00
fadvise.c readahead: move max_sane_readahead() calls into force_page_cache_readahead() 2009-06-16 19:47:28 -07:00
failslab.c kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c 2009-04-03 12:23:01 +02:00
filemap.c mm: mark page accessed before we write_end() 2009-07-06 13:57:03 -07:00
filemap_xip.c mm: do_xip_mapping_read: fix length calculation 2009-04-02 19:04:49 -07:00
fremap.c
highmem.c block: remove some includings of blktrace_api.h 2009-06-16 11:19:36 +02:00
hugetlb.c hugetlbfs: fix i_blocks accounting 2009-07-29 19:10:35 -07:00
init-mm.c mm: consolidate init_mm definition 2009-06-16 19:47:28 -07:00
internal.h vmscan: do not unconditionally treat zones that fail zone_reclaim() as full 2009-06-16 19:47:45 -07:00
kmemcheck.c kmemcheck: add hooks for the page allocator 2009-06-15 15:48:33 +02:00
kmemleak-test.c kmemleak: Simple testing module for kmemleak 2009-06-11 17:04:19 +01:00
kmemleak.c kmemleak: Protect the seq start/next/stop sequence by rcu_read_lock() 2009-07-29 12:34:58 -07:00
maccess.c [S390] maccess: add weak attribute to probe_kernel_write 2009-06-12 10:27:37 +02:00
madvise.c mm: madvise(): correct return code 2009-06-16 19:47:40 -07:00
memcontrol.c cgroup avoid permanent sleep at rmdir 2009-07-29 19:10:35 -07:00
memory.c mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() 2009-07-27 12:10:38 -07:00
memory_hotplug.c page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens 2009-06-16 19:47:42 -07:00
mempolicy.c mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware 2009-08-07 10:39:55 -07:00
mempool.c
migrate.c migration: only migrate_prep() once per move_pages() 2009-06-16 19:47:41 -07:00
mincore.c
mlock.c mm: remove CONFIG_UNEVICTABLE_LRU config option 2009-06-16 19:47:42 -07:00
mm_init.c
mmap.c Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-06-11 14:01:07 -07:00
mmu_notifier.c
mmzone.c [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 2009-05-18 11:22:24 +01:00
mprotect.c perf_counter: Add mmap event hooks to mprotect() 2009-06-08 23:10:43 +02:00
mremap.c
msync.c
nommu.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 2009-07-01 11:46:30 -07:00
oom_kill.c oom: only oom kill exiting tasks with attached memory 2009-06-16 19:47:45 -07:00
page-writeback.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
page_alloc.c page-allocator: allow too high-order warning messages to be suppressed with __GFP_NOWARN 2009-07-29 19:10:35 -07:00
page_cgroup.c memcg: remove some redundant checks 2009-06-18 13:03:47 -07:00
page_io.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
page_isolation.c
pagewalk.c
pdflush.c Revert "mm: add /proc controls for pdflush threads" 2009-05-15 11:32:24 +02:00
percpu.c x86: implement percpu_alloc kernel parameter 2009-06-22 11:56:24 +09:00
prio_tree.c
quicklist.c
readahead.c readahead: introduce context readahead algorithm 2009-06-16 19:47:30 -07:00
rmap.c memcg: add file-based RSS accounting 2009-06-18 13:03:47 -07:00
shmem.c Get "no acls for this inode" right, fix shmem breakage 2009-06-24 16:58:48 -04:00
shmem_acl.c switch shmem to inode->i_acl 2009-06-24 08:17:06 -04:00
slab.c SLAB: Fix lockdep annotations 2009-06-29 09:57:10 +03:00
slob.c fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b 2009-06-26 12:10:47 +03:00
slub.c kmemleak: Trace the kmalloc_large* functions in slub 2009-07-08 14:25:14 +01:00
sparse-vmemmap.c
sparse.c mm: mminit_validate_memmodel_limits(): remove redundant test 2009-04-01 08:59:11 -07:00
swap.c mm: fix Committed_AS underflow on large NR_CPUS environment 2009-05-02 15:36:10 -07:00
swap_state.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
swapfile.c PM / Hibernate: Replace bdget call with simple atomic_inc of i_count 2009-07-29 21:07:55 +02:00
thrash.c mm: pass mm to grab_swap_token 2009-06-23 12:50:05 -07:00
truncate.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
util.c Merge branches 'slab/documentation', 'slab/fixes', 'slob/cleanups' and 'slub/fixes' into for-linus 2009-06-17 08:30:15 +03:00
vmalloc.c Merge branch 'for-linus' of git://linux-arm.org/linux-2.6 2009-06-11 14:15:57 -07:00
vmscan.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
vmstat.c vmscan: count the number of times zone_reclaim() scans and fails 2009-06-16 19:47:46 -07:00