Sophie: kernel-2.6.18-238.el5 src

kernel-2.6.18-238.el5.src.rpm

From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 4 Aug 2010 15:24:14 -0400
Subject: [net] cxgb3: alt buffer freeing strategy when xen dom0
Message-id: <1280935454-22447-3-git-send-email-pbonzini@redhat.com>
Patchwork-id: 27384
O-Subject: [RHEL5.5 PATCH] cxgb3: adopt alternative buffer freeing strategy when
	running on dom0
Bugzilla: 488882

Bugzilla: 488882

Upstream status: driver does not exist in linux-2.6.18.hg

Brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=2651766

Patch mostly coming from Chelsio, the only change I made is to restrict
the new path to dom0, in order to avoid losing performance when the NIC
is passed through to a PV domain.

> In all the virtualized environments we have tested, the VM's app's
> send buffer frees up its load only when the hypervisor's driver frees
> the corresponding skb.  cxgb3 however does not free a TX skb on DMA
> completion.  The driver relies on FW generated credit returns posted on
> the receive control queues.
>
> In non virtual environments, the driver programs the HW to coalesce
> these credit returns to minimize the FW management load, and relies on
> skb_orphan() to free up space in the app'send buffer. skbs are freed on
> credit return receptions.  It does not work for the VMs, skb_orphan()
> won't free up virtualized app'send buffer.
>
> The attached patch provides a much more aggressive credit return policy,
> and has solved our perf issues on other virtualized platforms.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index 4d44cec..0ca7c72 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1265,7 +1265,31 @@ int t3_eth_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	gen = q->gen;
 	q->unacked += ndesc;
-	compl = (q->unacked & 8) << (S_WR_COMPL - 3);
+#ifdef CONFIG_XEN
+	if (is_initial_xendomain())
+		/*
+		 * Some Guest OS clients get terrible performance when
+		 * they have bad message size / socket send buffer space
+		 * parameters.  For instance, if an application selects an
+		 * 8KB message size and an 8KB send socket buffer size.
+		 * This forces the application into a single packet
+		 * stop-and-go mode where it's only willing to have a
+		 * single message outstanding.  The next message is only
+		 * sent when the previous message is noted as having
+		 * been sent.  Until we issue a kfree_skb() against the
+		 * TX skb, the skb is charged against the application's
+		 * send buffer space.  We only free up TX skbs when we
+		 * get a TX credit return from the hardware / firmware
+		 * which is fairly lazy about this.  So we request a TX
+		 * WR Completion Notification on every TX descriptor in
+		 * order to accellerate TX credit returns.  See also the
+		 * change in handle_rsp_cntrl_info() to free up TX skb's
+		 * when we receive the TX WR Completion Notifications ...
+		 */
+		compl = F_WR_COMPL;
+	else
+#endif
+		compl = (q->unacked & 8) << (S_WR_COMPL - 3);
 	q->unacked &= 7;
 	pidx = q->pidx;
 	q->pidx += ndesc;
@@ -2171,9 +2195,38 @@ static inline void handle_rsp_cntrl_info(struct sge_qset *qs, u32 flags)
 #endif
 
 	credits = G_RSPD_TXQ0_CR(flags);
-	if (credits)
+	if (credits) {
 		qs->txq[TXQ_ETH].processed += credits;
-
+#ifdef CONFIG_XEN
+		if (is_initial_xendomain()) {
+			/*
+			 * In the normal Linux driver t3_eth_xmit()
+			 * routine, we call skb_orphan() on unshared TX skb.
+			 * This results in a call to the destructor for
+			 * the skb which frees up the send buffer space
+			 * it was holding down.  This, in turn, allows the
+			 * application to make forward progress generating
+			 * more data which is important at 10Gb/s.
+			 * For Virtual Machine Guest Operating Systems
+			 * this doesn't work since the send buffer space is
+			 * being held down in the Virtual Machine.  Thus we
+			 * need to get the TX skb's freed up as soon as
+			 * possible in order to prevent applications from
+			 * stalling.  This code is largely copied from the
+			 * corresponding code in sge_timer_tx() and should
+			 * probably be kept in sync with any changes there.  */
+			if (spin_trylock(&qs->txq[TXQ_ETH].lock)) {
+				struct sge_txq *q = &qs->txq[TXQ_ETH];
+				struct port_info *pi = netdev_priv(qs->netdev);
+				struct adapter *adap = pi->adapter;
+
+				reclaim_completed_tx(adap, &qs->txq[TXQ_ETH],
+						     TX_RECLAIM_CHUNK);
+				spin_unlock(&qs->txq[TXQ_ETH].lock);
+			}
+		}
+#endif
+	}
 	credits = G_RSPD_TXQ2_CR(flags);
 	if (credits)
 		qs->txq[TXQ_CTRL].processed += credits;