linux/arch
Robert Richter 4177c42a63 perf, x86: Try to handle unknown nmis with an enabled PMU
When the PMU is enabled it is valid to have unhandled nmis: two
events could trigger 'simultaneously', raising two back-to-back
NMIs. If the first NMI handles both, the latter will be empty
and daze the CPU.

The solution to avoid an 'unknown nmi' message in this case was
simply to stop the nmi handler chain when the PMU is enabled by
stating the nmi was handled. This has the drawback that a) we
can not detect unknown nmis anymore, and b) subsequent nmi
handlers are not called.

This patch addresses this. Now, we check whether the unknown NMI
could be a PMU back-to-back NMI and drop it if so. Otherwise we
pass it on and let the kernel handle the unknown nmi.

This is a debug log:

 cpu #6, nmi #32333, skip_nmi #32330, handled = 1, time = 1934364430
 cpu #6, nmi #32334, skip_nmi #32330, handled = 1, time = 1934704616
 cpu #6, nmi #32335, skip_nmi #32336, handled = 2, time = 1936032320
 cpu #6, nmi #32336, skip_nmi #32336, handled = 0, time = 1936034139
 cpu #6, nmi #32337, skip_nmi #32336, handled = 1, time = 1936120100
 cpu #6, nmi #32338, skip_nmi #32336, handled = 1, time = 1936404607
 cpu #6, nmi #32339, skip_nmi #32336, handled = 1, time = 1937983416
 cpu #6, nmi #32340, skip_nmi #32341, handled = 2, time = 1938201032
 cpu #6, nmi #32341, skip_nmi #32341, handled = 0, time = 1938202830
 cpu #6, nmi #32342, skip_nmi #32341, handled = 1, time = 1938443743
 cpu #6, nmi #32343, skip_nmi #32341, handled = 1, time = 1939956552
 cpu #6, nmi #32344, skip_nmi #32341, handled = 1, time = 1940073224
 cpu #6, nmi #32345, skip_nmi #32341, handled = 1, time = 1940485677
 cpu #6, nmi #32346, skip_nmi #32347, handled = 2, time = 1941947772
 cpu #6, nmi #32347, skip_nmi #32347, handled = 1, time = 1941949818
 cpu #6, nmi #32348, skip_nmi #32347, handled = 0, time = 1941951591
 Uhhuh. NMI received for unknown reason 00 on CPU 6.
 Do you have a strange power saving mode enabled?
 Dazed and confused, but trying to continue

Deltas:

 nmi #32334 340186
 nmi #32335 1327704
 nmi #32336 1819      <<<< back-to-back nmi [1]
 nmi #32337 85961
 nmi #32338 284507
 nmi #32339 1578809
 nmi #32340 217616
 nmi #32341 1798      <<<< back-to-back nmi [2]
 nmi #32342 240913
 nmi #32343 1512809
 nmi #32344 116672
 nmi #32345 412453
 nmi #32346 1462095   <<<< 1st nmi (standard) handling 2 counters
 nmi #32347 2046      <<<< 2nd nmi (back-to-back) handling one counter
 nmi #32348 1773      <<<< 3rd nmi (back-to-back) handling no counter! [3]
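
The deltas above are the differences between the 'time' values of
consecutive lines in the debug log. A minimal sketch of that
arithmetic follows; the helper name nmi_deltas() is hypothetical and
is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: store
 * time[i] - time[i - 1] in delta[i - 1] for i = 1..n-1 and
 * return the number of deltas written (0 if n < 2).
 */
static size_t nmi_deltas(const uint64_t *time, size_t n, uint64_t *delta)
{
	size_t i;

	if (n < 2)
		return 0;

	for (i = 1; i < n; i++)
		delta[i - 1] = time[i] - time[i - 1];

	return n - 1;
}
```

Fed the timestamps of nmi #32333 to #32336, this yields the first
three deltas listed above, the last of which is the short gap marked
as back-to-back nmi [1].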

For back-to-back nmi detection there are the following rules:

The PMU nmi handler was handling more than one counter and no
counter was handled in the subsequent nmi (see [1] and [2]
above).

There is another case if there are two subsequent back-to-back
nmis [3]. The 2nd is detected as back-to-back because the first
handled more than one counter. If the second handles one counter
and the 3rd handles nothing, we drop the 3rd nmi because it
could be a back-to-back nmi.
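
The two rules can be sketched as a small user-space state machine.
This is an illustration under stated assumptions, not the code of the
patch: struct pmu_nmi_state, its fields and pmu_nmi_claim() are
hypothetical names, the state would be per CPU in the kernel, and
'handled' is the number of counters the PMU handler serviced for the
current NMI.

```c
#include <stdbool.h>

/* Hypothetical state, one instance per CPU. */
struct pmu_nmi_state {
	unsigned int marked;	/* nmi number that may be an empty back-to-back nmi */
	int handled;		/* counters handled by the nmi that set the mark */
};

/*
 * Returns true if the PMU claims this nmi (the handler chain stops),
 * false if it should be passed on as an unknown nmi.
 */
static bool pmu_nmi_claim(struct pmu_nmi_state *st,
			  unsigned int this_nmi, int handled)
{
	if (handled == 0) {
		/* Empty nmi: drop it only if the previous nmi marked it. */
		return this_nmi == st->marked;
	}

	/*
	 * Mark the next nmi as a possible back-to-back nmi if this one
	 * handled more than one counter, or if this one was itself
	 * marked by an nmi that handled more than one counter.
	 */
	if (handled > 1 ||
	    (st->marked == this_nmi && st->handled > 1)) {
		st->marked = this_nmi + 1;
		st->handled = handled;
	}

	return true;
}
```

Replaying case [3] through this sketch: nmi #32346 handles 2 counters
and marks #32347, #32347 handles one counter and marks #32348, and the
empty #32348 is dropped. A further empty nmi #32349 would not be
marked and would be passed on as an unknown nmi.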

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ renamed nmi variable to pmu_nmi to avoid clash with .nmi in entry.S ]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: fweisbec@gmail.com
Cc: ying.huang@intel.com
Cc: ming.m.lin@intel.com
Cc: eranian@google.com
LKML-Reference: <1283454469-1909-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-03 08:05:18 +02:00
..
alpha Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
arm ARM: imx: fix build failure concerning otg/ulpi 2010-08-23 20:50:17 -07:00
avr32 Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
blackfin Blackfin: wire up new fanotify/prlimit64 syscalls 2010-08-23 04:24:09 -04:00
cris Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
frv Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
h8300 Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
ia64 [IA64] Fix build error: conflicting types for ‘sys_execve’ 2010-08-18 10:17:44 -07:00
m32r Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
m68k Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu 2010-08-18 09:27:10 -07:00
m68knommu Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu 2010-08-18 09:27:10 -07:00
microblaze Merge branch 'merge-devicetree' of git://git.secretlab.ca/git/linux-2.6 2010-08-18 09:26:17 -07:00
mips Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
mn10300 arch/mn10300/mm: eliminate NULL dereference 2010-08-23 11:41:24 -07:00
parisc Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
powerpc powerpc: Fix config dependency problem with MPIC_U3_HT_IRQS 2010-08-24 15:28:29 +10:00
s390 [S390] fix tlb flushing vs. concurrent /proc accesses 2010-08-24 09:26:34 +02:00
score Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
sh Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
sparc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6 2010-08-24 10:10:13 -07:00
tile Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
um uml: fix compile error in dma_get_cache_alignment() 2010-08-20 09:34:55 -07:00
x86 perf, x86: Try to handle unknown nmis with an enabled PMU 2010-09-03 08:05:18 +02:00
xtensa Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
.gitignore
Kconfig Merge branch 'perf/nmi' into perf/core 2010-08-05 08:45:05 +02:00