Commit Graph

22 Commits (9c547768e7d9f456f1b145102e75f79e30f7b709)

Author SHA1 Message Date
Linas Vepstas 5794dbcbab [POWERPC] EEH: multifunction recovery bugfix
If the second or higher function of a multi-function device fails
to recover, this failure is not reported upwards. Fix this.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:53 +11:00
Linas Vepstas 90fdd6130f [POWERPC] EEH: hotplug recovery bugfix
If a device driver does not have native PCI error recovery,
a hotplug error recovery will be attemped. In this case,
the device driver will not report back whether its healthy
or not; simply assume that it is.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:52 +11:00
Linas Vepstas e0f90b6418 [POWERPC] EEH: Add clarifying messages.
There are multiple code patchs tht resuls in a "permanent
failure"; when examining rare events, it can be hard to see
which was taken. This patch adds printk's to assist.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:50 +11:00
Linas Vepstas a885902de3 [POWERPC] Clarify EEH error message
Clarify error message re EEH permanent failure.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-01-24 21:13:56 +11:00
Linas Vepstas d0e70341c0 [POWERPC] EEH recovery tweaks
If one attempts to create a device driver recovery sequence that
does not depend on a hard reset of the device, but simply just
attempts to resume processing, then one discovers that the
recovery sequence implemented on powerpc is not quite right.
This patch fixes this up.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-08 17:10:18 +11:00
Linas Vepstas 6a1ca373a1 [POWERPC] EEH: support MMIO enable recovery step
Update to the PowerPC PCI error recovery code.

Add code to enable MMIO if a device driver reports that it is capable
of recovering on its own.  One anticipated use of this having a device
driver enable MMIO so that it can take a register dump, which might
then be followed by the device driver requesting a full reset.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-09-21 22:59:20 +10:00
Linas Vepstas cb5b562444 [POWERPC] EEH: code comment cleanup
Clean up subroutine documentation; mostly formatting changes, with
some new content.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-09-21 22:59:10 +10:00
Jeremy Kerr 954a46e2d5 [POWERPC] pseries: Constify & voidify get_property()
Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.

pseries platform changes.

Built for pseries_defconfig

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-31 15:55:04 +10:00
Adrian Bunk 80f7228b59 typo fixes: occuring -> occurring
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:27:16 +02:00
Linas Vepstas 0aa8d15b01 [POWERPC] pseries: Print PCI slot location code on failure
The PCI error recovery code will printk diagnostic info when
a PCI error event occurs. Change the messages to include the slot
location code, which is how most sysadmins will know the device.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-21 15:01:32 +10:00
Linas Vepstas 4240545661 [PATCH] powerpc/pseries: Increment fail counter in PCI recovery
When a PCI device driver does not support PCI error recovery,
the powerpc/pseries code takes a walk through a branch of code
that resets the failure counter. Because of this, if a broken
PCI card is present, the kernel will attempt to reset it an
infinite number of times. (This is annoying but mostly harmless:
each reset takes about 10-20 seconds, and uses almost no CPU time).

This patch preserves the failure count across resets.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-05-19 13:51:12 +10:00
Linas Vepstas ac325acd50 [PATCH] powerpc/pseries: clear PCI failure counter if no new failures
The current PCI error recovery system keeps track of the number of PCI card
resets, and refuses to bring a card back up if this number is too large.
The goal of doing this was to avoid an infinite loop of resets if a card is
obviously dead.  However, if the failures are rare, but the machine has a
high uptime, this mechanism might still be triggered; this is too harsh.

This patch will avoids this problem by decrementing the fail count after an
hour.  Thus, as long as a pci card BSOD's less than 6 times an hour, it
will continue to be reset indefinitely.  If it's failure rate is greater
than that, it will be taken off-line permanently.

This patch is larger than it might otherwise be because it changes
indentation by removing a pointless while-loop.  The while loop is not
needed, as the handler is invoked once fo each event (by schedule_work());
the loop is leftover cruft from an earlier implementation.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-22 18:46:13 +10:00
Linas Vepstas a219be2cf4 [PATCH] powerpc/pseries: fix device name printing, again.
The recent patch to print device names in EEH reset messages
was lacking ... this patch works better.

Signed-off-by: Linas Vepstas <linas@linas.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-01 22:37:02 +11:00
Linas Vepstas 8df83028cf [PATCH] powerpc/pseries: print message if EEH recovery fails
The current code prints an ambiguous message if the recovery
of a failed PCI device fails. Give this special case its own
unique message.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-01 22:35:01 +11:00
Linas Vepstas b4f382a3e5 [PATCH] powerpc/pseries: Cleanup device name printing.
This avoids printk'ing a NULL string.

Signed-off-by: Linas Vepstas <linas@linas.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-27 14:48:46 +11:00
Olaf Hering 273d280381 [PATCH] powerpc: fix NULL pointer in handle_eeh_events
This patch fixes a crash in handle_eeh_events,
but ethtool -t still doesnt work right.

...
pepino:~ # cpu 0x3: Vector: 300 (Data Access) at [c00000005192bbe0]
    pc: c00000000004a380: .handle_eeh_events+0xe0/0x23c
    lr: c00000000004a374: .handle_eeh_events+0xd4/0x23c
    sp: c00000005192be60
   msr: 9000000000009032
   dar: 268
 dsisr: 40000000
  current = 0xc0000001fe7bf1a0
  paca    = 0xc00000000048b280
    pid   = 16322, comm = eehd
enter ? for help
[c00000005192bf00] c00000000004a808 .eeh_event_handler+0xcc/0x130
[c00000005192bf90] c000000000025e00 .kernel_thread+0x4c/0x68

...

(none):/# /usr/sbin/ethtool -i eth0
driver: e100
version: 3.5.10-k2-NAPI
firmware-version: N/A
bus-info: 0000:21:01.0
(none):/# /usr/sbin/ethtool -t eth0
Call Trace:
[C00000000F8DEFF0] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000F8DF0A0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000F8DF150] [C000000000049E58] .eeh_check_failure+0x10c/0x138
[C00000000F8DF1E0] [C0000000002DFDB0] .e100_hw_reset+0x70/0xf4
[C00000000F8DF270] [C0000000002E1BBC] .e100_hw_init+0x2c/0x260
[C00000000F8DF310] [C0000000002E2464] .e100_loopback_test+0x8c/0x220
[C00000000F8DF3C0] [C0000000002E28DC] .e100_diag_test+0xdc/0x16c
[C00000000F8DF490] [C000000000420BE0] .dev_ethtool+0xf24/0x14f8
[C00000000F8DF8F0] [C00000000041F4A8] .dev_ioctl+0x5cc/0x740
[C00000000F8DFA20] [C00000000040FEFC] .sock_ioctl+0x3d0/0x404
[C00000000F8DFAC0] [C0000000000D513C] .do_ioctl+0x68/0x108
[C00000000F8DFB50] [C0000000000D56B0] .vfs_ioctl+0x4d4/0x510
[C00000000F8DFC10] [C0000000000D5740] .sys_ioctl+0x54/0x94
[C00000000F8DFCC0] [C0000000000FB6EC] .ethtool_ioctl+0x11c/0x150
[C00000000F8DFD60] [C0000000000F7E40] .compat_sys_ioctl+0x338/0x3bc
[C00000000F8DFE30] [C00000000000871C] syscall_exit+0x0/0x40
EEH: Detected PCI bus error on device 0000:21:01.0
EEH: This PCI device has failed 1 times since last reboot: <NULL> -

modprobe: FATAL: Could not load /lib/modules/2.6.16-rc4-git7/modules.dep: No such file or directory

Cannot get strings: No such device
(none):/#
(none):/# EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2

(none):/# Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
EEH: This PCI device has failed 1 times since last reboot: <NULL> -
EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2
Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
EEH: This PCI device has failed 1 times since last reboot: <NULL> -
EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2
Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
and so on

Signed-off-by: Olaf Hering <olh@suse.de>
Acked-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-28 16:25:54 +11:00
Al Viro d04e4e115b [PATCH] eeh_driver NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:58:33 -05:00
Paul Mackerras 18eb3b398d powerpc: Fix up some compile errors in the PCI error recovery code
<asm/systemcfg.h> is gone now, and the PCI error recovery constants
in include/linux/pci.h changed their names in the process of getting
accepted.

Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from 5a2516156c591fc3d2059fbd93f97e15eb6010d6 commit)
2006-01-10 15:32:31 +11:00
Linas Vepstas 3914ac7b0e [PATCH] powerpc: handle multifunction PCI devices properly
239-eeh-multifunction-consolidate.patch

New-style firmware will often place multiple different functions
under a non-EEH-aware parent.  However, these devices might share
a common PE "partition endpoint" and config address, ad thus any
EEH events will affect all of the devices in common.  This patch
makes the effort to find all of these common devices and handle
them together.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from 216810296bb97d39da8e176822e9de78d2f00187 commit)
2006-01-10 15:30:23 +11:00
Linas Vepstas b6495c0c8f [PATCH] powerpc: Don't continue with PCI Error recovery if slot reset failed.
238-eeh-stop-if-reset_failed.patch

If the firmware is unable to reset the PCI slot for some reason, then
don't attempt any further recovery steps after that point.  Instead,
mark the device as permanently failed.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from e06b942521eb2cdaf232726f45a820d5837acb12 commit)
2006-01-10 15:30:14 +11:00
Linas Vepstas 9fb40eb883 [PATCH] powerpc: Remove duplicate code
234-eeh-find-pe.patch

The find_device_pe() routine is duplicated in two files. Remove one of
the two copies, declare the other extern.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from 48408e708282d4d0269136ff27ea5acbd9410b5a commit)
2006-01-10 15:29:33 +11:00
Linas Vepstas 77bd741561 [PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines
Various PCI bus errors can be signaled by newer PCI controllers.  The
core error recovery routines are architecture dependent.  This patch adds
a recovery infrastructure for the  PPC64 pSeries systems.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from e8ca11b460c4c9c7fa6b529be221529ebd770e38 commit)
2006-01-10 15:28:32 +11:00