eeh-pci-error-recovery.txt - OpenGrok cross reference for /linux-4.4.14/Documentation/powerpc/eeh-pci-error-recovery.txt

Lines Matching refs:to
16 hardware features allow PCI bus errors to be cleared and a PCI
17 card to be "rebooted", without also having to reboot the operating
20 This is in contrast to traditional PCI error handling, where the
21 PCI chip is wired directly to the CPU, and an error would cause
23 Another "traditional" technique is to ignore such errors, which
24 can lead to data corruption, both of user data and of kernel data,
28 the OS the ability to "reboot"/recover individual PCI devices.
36 EEH was originally designed to guard against hardware failure, such
39 "real life" are due to either poorly seated PCI cards, or,
40 unfortunately quite commonly, due to device driver bugs, device firmware
43 The most common software bug is one that causes the device to
44 attempt to DMA to a location in system memory that has not been
50 address line parity errors (for example, due to poor electrical
51 connectivity due to a poorly seated card), and PCI-X split-completion
52 errors (due to software, device firmware, or device PCI hardware bugs).
59 In the following discussion, a generic overview of how to detect
62 kernel does it.  The actual implementation is subject to change,
68 PCI bus to the system CPU electronics complex) detects a PCI error
70 will block all writes (either to the card from the system, or
71 from the card to the system), and it will cause all reads to
75 This includes access to PCI memory, I/O space, and PCI config
76 space.  Interrupts, however, will continue to be delivered.
80 into the firmware are referred to as RTAS (Run-Time Abstraction
89 EEH-isolated, there is a firmware call it can make to determine if
91 into a consistent state (given that it won't be able to complete any
98 the power to the card can be toggled, at least on hot-plug-capable
100 do not need to know that the PCI card has been "rebooted" in this
106 card has died completely, and report this error to the sysadmin.
108 syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
109 The correct way to deal with failed adapters is to use the standard
110 PCI hotplug tools to remove and replace the dead card.
116 so that individual device drivers do not need to be modified to support
125 drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code.
130 registered with the EEH code; the EEH code needs to know about
131 the I/O address ranges of the PCI device in order to detect an
137 etc. include a check to see if the I/O read returned all-0xff's.
138 If so, these make a call to eeh_dn_check_failure(), which in turn
142 seen in /proc/ppc64/eeh (subject to change).  Normally, almost
147 arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
148 syslog (/var/log/messages).  This stack trace has proven to be very
149 useful to device-driver authors for finding out at what point the EEH
153 Next, it uses the Linux kernel notifier chain/work queue mechanism to
154 allow any interested parties to find out about the failure.  Device
156 eeh_register_notifier(struct notifier_block *) to find out about EEH
157 events.  The event will include a pointer to the PCI device, the
166 rtas_configure_bridge() -- ask firmware to configure any PCI bridges
175 This last call causes the device driver for the card to be stopped,
176 which causes uevents to go out to user space. This triggers
179 hoping to give the user-space scripts enough time to complete.
190 events get delivered to user-space scripts.
193 close function to be called during the first phase of an EEH reset.
234                              to stop the device
253 Following is the analogous stack trace for events sent to user-space
273                 a call to
276                   dev->bus->uevent() which is really just a call to
281                then kobject_uevent() sends a netlink uevent to userspace
283                (during early boot, nobody listens to netlink events and
297 big plus of the current design is that no changes need to be made to
300 network daemons and file systems that didn't need to be disturbed.
303    user-space back-to-back ifdown/ifup burps that potentially disturb
304    network daemons that didn't even need to know that the PCI
308    causes havoc to mounted file systems.  Scripts cannot post-facto
315    Ext3fs seems to be tolerant, retrying reads/writes until it does
322    from the block layer.  It would be very natural to add an EEH
326    the sysadmin had the foresight to run /bin, /sbin, /etc, /var
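
The detection step listed above (an I/O read that returns all-0xff's
leads to a call into the EEH check) can be illustrated with a small
user-space sketch.  This is not the kernel implementation: struct
fake_slot, raw_read32(), slot_is_frozen() and checked_read32() are
hypothetical names invented for this example, and slot_is_frozen()
merely stands in for the firmware query that the real code would make.
The sketch assumes that an isolated slot answers every read with
all-ones, and that an all-ones value read from a healthy slot is
counted as a false positive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel data structure. */
struct fake_slot {
	bool frozen;                  /* true once the bridge has isolated the slot */
	uint32_t reg;                 /* register value a healthy device returns */
	unsigned int failures;        /* all-ones reads confirmed as real errors */
	unsigned int false_positives; /* all-ones reads from a healthy slot */
};

/* Simulated MMIO read: an isolated slot returns all-ones for every read. */
uint32_t raw_read32(const struct fake_slot *slot)
{
	return slot->frozen ? UINT32_MAX : slot->reg;
}

/* Hypothetical stand-in for asking firmware whether the slot is isolated. */
bool slot_is_frozen(const struct fake_slot *slot)
{
	return slot->frozen;
}

/* Read wrapper: only an all-ones result triggers the slower slot check. */
uint32_t checked_read32(struct fake_slot *slot)
{
	uint32_t val = raw_read32(slot);

	if (val == UINT32_MAX) {
		if (slot_is_frozen(slot))
			slot->failures++;        /* genuine isolation event */
		else
			slot->false_positives++; /* device really returned 0xffffffff */
	}
	return val;
}
```

The point of this arrangement is that the common path, a read that
returns ordinary data, costs only one comparison; the slot state is
queried only when the value read is suspicious.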