linux/include/asm-generic
Andrea Arcangeli 26c191788f mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.

With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.

This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.

Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.

NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:24 -07:00
..
bitops Add #includes needed to permit the removal of asm/system.h 2012-03-28 18:30:03 +01:00
4level-fixup.h mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() 2009-07-27 12:10:38 -07:00
Kbuild include: replace unifdef-y with header-y 2010-08-14 22:26:51 +02:00
Kbuild.asm include: replace unifdef-y with header-y 2010-08-14 22:26:51 +02:00
atomic-long.h
atomic.h Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
atomic64.h
audit_change_attr.h audit: support the "standard" <asm-generic/unistd.h> 2011-05-04 14:41:28 -04:00
audit_dir_write.h audit: support the "standard" <asm-generic/unistd.h> 2011-05-04 14:41:28 -04:00
audit_read.h audit: support the "standard" <asm-generic/unistd.h> 2011-05-04 14:41:28 -04:00
audit_signal.h
audit_write.h audit: support the "standard" <asm-generic/unistd.h> 2011-05-04 14:41:28 -04:00
auxvec.h
barrier.h Create asm-generic/barrier.h 2012-03-28 18:30:03 +01:00
bitops.h bitops: remove minix bitops from asm/bitops.h 2011-03-23 19:46:22 -07:00
bitsperlong.h
bug.h consolidate WARN_...ONCE() static variables 2012-03-23 16:58:31 -07:00
bugs.h
cache.h
cacheflush.h asm-generic/cacheflush.h: flush icache when copying to user pages 2011-05-25 08:39:37 -07:00
checksum.h Add extra arch overrides to asm-generic/checksum.h 2011-11-01 07:34:21 -07:00
cmpxchg-local.h Fix IRQ flag handling naming 2010-10-07 14:08:55 +01:00
cmpxchg.h asm-generic: add linux/types.h to cmpxchg.h 2012-04-02 14:41:27 -07:00
cputime.h Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-01-06 08:44:54 -08:00
current.h
delay.h asm-generic: delay.h fix udelay and ndelay for 8 bit args 2011-07-22 18:45:33 +02:00
device.h Driver Core: Add platform device arch data V3 2009-07-22 00:28:38 +02:00
div64.h
dma-coherent.h common: add dma_mmap_from_coherent() function 2012-05-21 15:06:09 +02:00
dma-contiguous.h drivers: add Contiguous Memory Allocator 2012-05-21 15:09:37 +02:00
dma-mapping-broken.h dma-mapping: remove dma_is_consistent API 2010-08-11 08:59:21 -07:00
dma-mapping-common.h BUG: headers with BUG/BUG_ON etc. need linux/bug.h 2012-03-04 17:54:34 -05:00
dma.h
emergency-restart.h
errno-base.h
errno.h mm: make __get_user_pages return -EHWPOISON for HWPOISON page optionally 2011-03-17 13:08:27 -03:00
exec.h Split arch_align_stack() out from asm-generic/system.h 2012-03-28 18:30:03 +01:00
fb.h
fcntl.h locks: move F_INPROGRESS from fl_type to fl_flags field 2011-08-19 13:25:34 -04:00
ftrace.h asm-generic headers: add ftrace.h 2011-03-17 09:19:04 +08:00
futex.h futex: Sanitize futex ops argument types 2011-03-11 12:23:31 +01:00
getorder.h bitops: Add missing parentheses to new get_order macro 2012-02-24 10:39:27 -08:00
gpio.h gpiolib: Remove 'const' from data argument of gpiochip_find() 2012-05-18 23:01:05 -06:00
hardirq.h Fix IRQ flag handling naming 2010-10-07 14:08:55 +01:00
hw_irq.h
ide_iops.h
int-l64.h
int-ll64.h
io-64-nonatomic-hi-lo.h asm-generic: architecture independent readq/writeq for 32bit environment 2012-02-21 16:47:28 -08:00
io-64-nonatomic-lo-hi.h asm-generic: architecture independent readq/writeq for 32bit environment 2012-02-21 16:47:28 -08:00
io.h lib: use generic pci_iomap on all architectures 2012-01-10 18:04:27 -08:00
ioctl.h
ioctls.h tty: add TIOCVHANGUP to allow clean tty shutdown of all ttys 2011-02-17 14:16:30 -08:00
iomap.h [PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional 2012-02-27 09:43:30 -06:00
ipcbuf.h
irq.h
irq_regs.h core: Replace __get_cpu_var with __this_cpu_read if not used for an address. 2010-12-17 15:07:19 +01:00
irqflags.h Fix IRQ flag handling naming 2010-10-07 14:08:55 +01:00
kdebug.h asm-generic: kdebug.h: Checkpatch cleanup 2010-10-09 21:51:44 +02:00
kmap_types.h include/asm-generic/kmap_types.h: add helpful reminder 2010-05-25 08:07:03 -07:00
kvm_para.h KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block 2012-05-21 17:47:52 +03:00
libata-portmap.h
linkage.h
local.h atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
local64.h atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
memory_model.h mm: fix __page_to_pfn for a const struct page argument 2011-08-17 13:00:20 -07:00
mm_hooks.h
mman-common.h coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP 2012-03-23 16:58:42 -07:00
mman.h mm: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions 2009-09-22 07:17:41 -07:00
mmu.h
mmu_context.h
module.h
msgbuf.h
mutex-dec.h
mutex-null.h
mutex-xchg.h
mutex.h
page.h The following changes since commit 3ee72ca992 2012-01-10 17:39:40 -08:00
param.h UAPI: Rearrange definition of HZ in asm-generic/param.h 2011-12-13 09:26:45 +00:00
parport.h
pci-bridge.h PCI: work around Stratus ftServer broken PCIe hierarchy 2012-04-30 15:21:02 -06:00
pci-dma-compat.h dma-mapping: pci: move pci_set_dma_mask and pci_set_consistent_dma_mask to pci-dma-compat.h 2010-03-12 15:52:42 -08:00
pci.h PCI: collapse pcibios_resource_to_bus 2012-02-23 20:19:04 -07:00
pci_iomap.h [PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional 2012-02-27 09:43:30 -06:00
percpu.h percpu: Optimize __get_cpu_var() 2010-09-10 10:56:51 +02:00
pgalloc.h
pgtable-nopmd.h mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() 2009-07-27 12:10:38 -07:00
pgtable-nopud.h mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() 2009-07-27 12:10:38 -07:00
pgtable.h mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition 2012-05-29 16:22:24 -07:00
poll.h epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() 2012-02-24 11:42:50 -08:00
posix_types.h posix_types: Introduce __kernel_[u]long_t 2012-02-20 12:48:47 -08:00
ptrace.h asm-generic/ptrace.h: start a common low level ptrace helper 2011-05-26 17:12:36 -07:00
resource.h ulimit: raise default hard ulimit on number of files to 4096 2011-05-25 08:39:43 -07:00
rtc.h
rwsem.h Hexagon: Add locking types and functions 2011-11-01 07:34:20 -07:00
scatterlist.h asm-generic: remove ARCH_HAS_SG_CHAIN in scatterlist.h 2010-05-27 09:12:54 -07:00
sections.h x86: Separate out entry text section 2011-03-08 17:22:11 +01:00
segment.h
sembuf.h
serial.h
setup.h
shmbuf.h
shmparam.h
siginfo.h Linux 3.4-rc5 2012-05-04 12:46:40 +10:00
signal-defs.h
signal.h
sizes.h asm-generic headers: add sizes.h 2011-03-17 09:19:04 +08:00
socket.h net: Add framework to allow sending packets with customized CRC. 2012-02-24 01:37:35 -08:00
sockios.h
spinlock.h
stat.h asm-generic/stat.h: support 64-bit file time_t for stat() 2010-11-01 15:31:29 -04:00
statfs.h asm-generic: Use __BITS_PER_LONG in statfs.h 2012-04-30 12:55:15 -07:00
string.h
swab.h
switch_to.h Split the switch_to() wrapper out of asm-generic/system.h 2012-03-28 18:30:03 +01:00
syscall.h asm/syscall.h: add syscall_get_arch 2012-04-14 11:13:19 +10:00
syscalls.h Fix the declaration of sys_execve() in asm-generic/syscalls.h 2010-08-18 12:12:38 -07:00
termbits.h tty: Add EXTPROC support for LINEMODE 2010-08-10 13:47:39 -07:00
termios-base.h
termios.h
timex.h
tlb.h thp: add tlb_remove_pmd_tlb_entry 2012-01-12 20:13:08 -08:00
tlbflush.h BUG: headers with BUG/BUG_ON etc. need linux/bug.h 2012-03-04 17:54:34 -05:00
topology.h topology: alternate fix for ia64 tiger_defconfig build breakage 2010-08-09 20:44:57 -07:00
types.h consolidate umode_t declarations 2012-01-03 22:55:17 -05:00
uaccess-unaligned.h
uaccess.h fix default __strnlen_user macro 2011-10-06 19:47:12 -04:00
ucontext.h
unaligned.h
unistd.h compat: use sys_sendfile64() implementation for sendfile syscall 2012-03-27 13:36:57 -04:00
user.h asm-generic/user.h: Fix spelling in comment 2011-03-01 15:49:39 +01:00
vga.h
vmlinux.lds.h ftrace: Sort all function addresses, not just per page 2012-05-16 19:58:44 -04:00
word-at-a-time.h word-at-a-time: make the interfaces truly generic 2012-05-26 11:33:40 -07:00
xor.h sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00