kernel-2.6.18-194.11.1.el5.src.rpm

From: Scott Moser <smoser@redhat.com>
Subject: [PATCH RHEL5u1] bz252405 EEH kernel crash on power6 blades
Date: Fri, 17 Aug 2007 10:20:39 -0400 (EDT)
Bugzilla: 252405
Message-Id: <Pine.LNX.4.64.0708171018340.30310@squad5-lp1.lab.boston.redhat.com>
Changelog: [ppc] EEH: better status string detection


RHBZ#: 252405
------
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=252405

Description:
------------
During verification of RHEL5u1 snapshots on power6 blades, a kernel crash
was found. Any EEH hardware event (triggered by hardware error detection)
will cause a system crash, and such events are likely to occur.

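It seems that some versions of firmware will report a device node status
as the string "okay". As we are not expecting this string, the device node
will be ignored by the EEH subsystem, which means EEH will not be
enabled. Fix this by comparing only the first two characters of the status
string, so that both "ok" and "okay" are accepted.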

When EEH is not enabled, PCI errors will be converted into Machine Check
exceptions, and we'll have a very unhappy system.
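
The difference between the old and new status checks can be reproduced in
user space. The following is a minimal sketch, not kernel code: the helper
names skipped_before_patch() and skipped_after_patch() are hypothetical and
exist only for illustration. Each returns true when a node with the given
status string would be ignored by the check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the pre-patch test: a node is ignored
 * unless its status is absent (NULL) or exactly "ok". */
bool skipped_before_patch(const char *status)
{
	return status && strcmp(status, "ok") != 0;
}

/* Hypothetical helper mirroring the post-patch test: only the first two
 * characters are compared, so "okay" is accepted as well as "ok". */
bool skipped_after_patch(const char *status)
{
	return status && strncmp(status, "ok", 2) != 0;
}
```

With the status "okay", the first helper returns true (node ignored) and the
second returns false (node kept). Note that a prefix comparison accepts any
string beginning with "ok", not only "ok" and "okay".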

Also remove dead code, and a misleading comment about EEH checking for video
devices.  The removed code is a left-over from the olden days when there
was concern over how video devices worked in Linux. We are never going to
go that way again, so kill this.

RHEL Version Found:
-------------------
This is a bug found in RHEL5u1 kernel 2.6.18-39.el5.

Upstream Status:
----------------
These patches have been posted for upstream review at [1,2].

Test Status:
------------
To ensure cross platform build of this patch, a brew scratch build has
been done against kernel-2.6.18-40 and is available at [3].

This patch has been tested by Linas Vepstas of IBM.
 - 'cat /proc/ppc64/eeh' shows that EEH has been enabled.
 - With an unpatched kernel, the following would have crashed with "machine check"
   and entered xmon.  With the patched kernel it does not.
   To inject the EEH error:
   errinjct eeh -f 5 -s usb_host/usb_host1

   To trigger the EEH error:
   lspci -v -x -s 0001:00:01.1

Proposed Patch:
----------------
Please review and ACK for RHEL5.1

--
[1] http://patchwork.ozlabs.org/linuxppc/patch?id=12855
[2] http://patchwork.ozlabs.org/linuxppc/patch?id=12856
[3] http://brewweb.devel.redhat.com/brew/taskinfo?taskID=924761
---
 arch/powerpc/platforms/pseries/eeh.c |   19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)
Index: b/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -792,7 +792,7 @@ static void *early_enable_eeh(struct dev
 	pdn->eeh_check_count = 0;
 	pdn->eeh_freeze_count = 0;
 
-	if (status && strcmp(status, "ok") != 0)
+	if (status && strncmp(status, "ok",2) != 0)
 		return NULL;	/* ignore devices with bad status */
 
 	/* Ignore bad nodes. */
@@ -806,23 +806,6 @@ static void *early_enable_eeh(struct dev
 	}
 	pdn->class_code = *class_code;
 
-	/*
-	 * Now decide if we are going to "Disable" EEH checking
-	 * for this device.  We still run with the EEH hardware active,
-	 * but we won't be checking for ff's.  This means a driver
-	 * could return bad data (very bad!), an interrupt handler could
-	 * hang waiting on status bits that won't change, etc.
-	 * But there are a few cases like display devices that make sense.
-	 */
-	enable = 1;	/* i.e. we will do checking */
-#if 0
-	if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY)
-		enable = 0;
-#endif
-
-	if (!enable)
-		pdn->eeh_mode |= EEH_MODE_NOCHECK;
-
 	/* Ok... see if this device supports EEH.  Some do, some don't,
 	 * and the only way to find out is to check each and every one. */
 	regs = (u32 *)get_property(dn, "reg", NULL);