Sophie

Sophie

distrib > Scientific%20Linux > 5x > x86_64 > by-pkgid > 89877e42827f16fa5f86b1df0c2860b1 > files > 1708

kernel-2.6.18-128.1.10.el5.src.rpm

From: Doug Ledford <dledford@redhat.com>
Date: Mon, 17 Nov 2008 11:35:59 -0500
Subject: [openib] IPoIB: fix oops on fabric events
Message-id: 1226939759.4765.406.camel@firewall.xsintricity.com
O-Subject: [Patch RHEL5.3] IPoIB: fix oops on fabric events
Bugzilla: 471890
RH-Acked-by: Jay Fenlason <fenlason@redhat.com>

This is for https://bugzilla.redhat.com/show_bug.cgi?id=471890

Long story short, when something happens to the IB fabric, each machine
has to possibly record new paths.  There is an oopser in the IPoIB code
such that the process of recording new paths might bring the machine
down.  Here's the fun part: if a machine goes down because it oopsed, it
will then trigger an *additional* fabric event for other machines in the
fabric.  So, given an 8000 node cluster, if a fabric event happens,
possibly all 8000 nodes risk oopsing as they record the event, and if
any of them *do* oops, then they trigger a new event that the remaining
7999 nodes must process in addition to the first event.  I think you get
the idea.  This bug can, under the right conditions, act just like the
process of thermonuclear fission inside an atomic weapon and eventually
1 machine can take down the entire 8000 node cluster.

I have the patch posted in the bug with a specific request that IBM test
my exact patch and report their results back.  I have built and run
tested the patch, but I don't have a cluster large enough to effectively
test the patch.  I'm waiting on IBM feedback for that.  In the meantime,
acks for the code itself would be appreciated.

--
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

commit 252b989b87014fc820484abcab1cf2bf26549ee9
Author: Doug Ledford <dledford@redhat.com>
Date:   Mon Nov 17 11:23:43 2008 -0500

    IPoIB: Fix crash when path record fails after path flush, rhel5 specific

    This is a modified version of this upstream commit:

    IPoIB: Fix crash when path record fails after path flush

    Commit ee1e2c82 ("IPoIB: Refresh paths instead of flushing them on SM
    change events") changed how paths are flushed on an SM event.  This
    change introduces a problem if the path record query triggered by
    fails, causing path->ah to become NULL.  A later successful path query
    will then trigger WARN_ON() in path_rec_completion(), and crash
    because path->ah has already been freed, so the ipoib_put_ah() inside
    the lock in path_rec_completion() may actually drop the last reference
    (contrary to the comment that claims this is safe).

    Fix this by updating path->ah and freeing old_ah only when the path
    record query is successful.  This prevents the neighbour AH and that
    path AH from getting out of sync.

    This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1194>

    Reported-by: Rabah Salem <ravah@mellanox.com>
    Debugged-by: Eli Cohen <eli@mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd@cisco.com>

    Signed-off-by: Doug Ledford <dledford@redhat.com>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e95e58d..0f03d20 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -385,6 +385,7 @@ static void path_rec_completion(int status,
 	struct net_device *dev = path->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_ah *ah = NULL;
+	struct ipoib_ah *old_ah = NULL;
 	struct ipoib_neigh *neigh, *tn;
 	struct sk_buff_head skqueue;
 	struct sk_buff *skb;
@@ -408,11 +409,12 @@ static void path_rec_completion(int status,
 
 	spin_lock_irqsave(&priv->lock, flags);
 
-	path->ah = ah;
-
 	if (ah) {
 		path->pathrec = *pathrec;
 
+		old_ah = path->ah;
+		path->ah = ah;
+
 		ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n",
 			  ah, be16_to_cpu(pathrec->dlid), pathrec->sl);
 
@@ -420,6 +422,17 @@ static void path_rec_completion(int status,
 			__skb_queue_tail(&skqueue, skb);
 
 		list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
+			if (neigh->ah) {
+				WARN_ON(neigh->ah != old_ah);
+				/*
+				 * Dropping the ah reference inside
+				 * priv->lock is safe here, because we
+				 * will hold one more reference from
+				 * the original value of path->ah (ie
+				 * old_ah).
+				 */
+				ipoib_put_ah(neigh->ah);
+			}
 			kref_get(&path->ah->ref);
 			neigh->ah = path->ah;
 			memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
@@ -449,6 +462,9 @@ static void path_rec_completion(int status,
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
+	if (old_ah)
+		ipoib_put_ah(old_ah);
+
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = dev;
 		if (dev_queue_xmit(skb))