kernel-2.6.18-238.el5.src.rpm

From: AMEET M. PARANJAPE <aparanja@redhat.com>
Date: Sat, 7 Mar 2009 15:12:52 -0600
Subject: [misc] signal: modify locking to handle large loads
Message-id: 49B2E354.4010109@REDHAT.COM
O-Subject: Re: [PATCH RHEL5.4 BZ487376] Lapi takes too long to run
Bugzilla: 487376
RH-Acked-by: Oleg Nesterov <oleg@redhat.com>
RH-Acked-by: David Howells <dhowells@redhat.com>

RHBZ#:
======
https://bugzilla.redhat.com/show_bug.cgi?id=487376

Description:
===========
IBM's HPC InfiniBand solution supports MPI and LAPI applications running on
multiple Power6 32-way CECs interconnected via an IB switch. The customer's
scaling expectation is that they should be able to run 64 tasks (SMT mode) or
32 tasks (ST mode) per CEC. When we tested our solution on RHEL 5.3, we found
that many of our tests took significantly longer to complete than they should
have. For example, a job that should complete in 5 minutes might instead take
20 minutes, or in the worst case an hour. In short, our HPC solution did not
scale on RHEL 5.3. We cannot GA our IB solution without these scaling
problems being resolved. The patch referenced in this bugzilla has resolved
them.

RHEL Version Found:
================
RHEL 5.4

kABI Status:
============
No symbols were harmed.

Brew:
=====
Built on all platforms.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1704010

Upstream Status:
================
GIT commit ID 3547ff3aefbe092ca35506c60c02e2d17a4f2199

Test Status:
============
A PPC system was set up with two test cases, mpi_collective and mpi_lapi,
that ran slower without the kernel patch and faster with it applied. The
comparison results are given below.

Execution times without the patch applied:
mpi_collective          1104 seconds
mpi_lapi                628 seconds

Execution times with patch applied:
mpi_collective          953 seconds
mpi_lapi                269 seconds

The collective communication test case is still slow, but I think that
can be attributed to running with CPU affinity.
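
As an illustration of the kind of load that hits this code path, below is a
hypothetical user-space micro-benchmark (not part of the original mpi_lapi
test suite): NTHREADS worker threads repeatedly send a thread-directed signal
to each other with tgkill(2), which funnels every call through do_tkill() in
kernel/signal.c. On an unpatched kernel each call serializes on the global
tasklist_lock; with the patch most calls take only the RCU read lock plus the
target's siglock. The thread count, iteration count, and signal choice are
arbitrary assumptions, not values taken from the original report.

/* tkill_stress.c: hypothetical reproducer for do_tkill() contention.
 * Build: gcc -O2 -pthread tkill_stress.c -o tkill_stress
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#define NTHREADS   64        /* mirrors the 64-task SMT configuration */
#define ITERATIONS 100000    /* tgkill() calls per thread (arbitrary) */

static pid_t tids[NTHREADS];
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
	long idx = (long)arg;
	pid_t tgid = getpid();
	int i;

	tids[idx] = syscall(SYS_gettid);
	pthread_barrier_wait(&barrier);   /* wait until every tid is published */

	for (i = 0; i < ITERATIONS; i++) {
		/* Signal the "next" thread.  SIGUSR1 is ignored, so only the
		 * send path (do_tkill) is exercised; ESRCH from threads that
		 * have already exited is harmless. */
		pid_t target = tids[(idx + 1) % NTHREADS];
		syscall(SYS_tgkill, tgid, target, SIGUSR1);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	struct timeval start, end;
	long i;

	signal(SIGUSR1, SIG_IGN);         /* delivery is a no-op; send path still runs */
	pthread_barrier_init(&barrier, NULL, NTHREADS);

	gettimeofday(&start, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	gettimeofday(&end, NULL);

	printf("%d threads x %d tgkill() calls: %.2f seconds\n",
	       NTHREADS, ITERATIONS,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6);
	return 0;
}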

===============================================================

diff --git a/kernel/signal.c b/kernel/signal.c
index 44391c7..493b8d2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2020,6 +2020,8 @@ static int do_tkill(int tgid, int pid, int sig)
 	int error;
 	struct siginfo info;
 	struct task_struct *p;
+	unsigned long flags;
+	int acquired_tasklist_lock = 0;
 
 	error = -ESRCH;
 	info.si_signo = sig;
@@ -2028,22 +2030,31 @@ static int do_tkill(int tgid, int pid, int sig)
 	info.si_pid = current->tgid;
 	info.si_uid = current->uid;
 
-	read_lock(&tasklist_lock);
+	rcu_read_lock();
+	if (unlikely(sig_needs_tasklist(sig))) {
+		read_lock(&tasklist_lock);
+		acquired_tasklist_lock = 1;
+	}
 	p = find_task_by_pid(pid);
 	if (p && (tgid <= 0 || p->tgid == tgid)) {
 		error = check_kill_permission(sig, &info, p);
 		/*
 		 * The null signal is a permissions and process existence
 		 * probe.  No signal is actually delivered.
+	 	 *
+		 * If lock_task_sighand() fails we pretend the task dies
+		 * after receiving the signal. The window is tiny, and the
+		 * signal is private anyway.
 		 */
-		if (!error && sig && p->sighand) {
-			spin_lock_irq(&p->sighand->siglock);
+		if (!error && sig && lock_task_sighand(p, &flags)) {
 			handle_stop_signal(sig, p);
 			error = specific_send_sig_info(sig, &info, p);
-			spin_unlock_irq(&p->sighand->siglock);
+			unlock_task_sighand(p, &flags);
 		}
 	}
-	read_unlock(&tasklist_lock);
+	if (unlikely(acquired_tasklist_lock))
+		read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 
 	return error;
 }
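
For reference, the change leans on lock_task_sighand(), which pins the
target's sighand_struct under RCU and takes its siglock without requiring
tasklist_lock. Below is a simplified sketch of that helper as it exists in
mainline kernels of this era (paraphrased from memory, not quoted from the
RHEL5 tree; consult kernel/signal.c in the source RPM for the exact backport):

/* Sketch of the 2.6.18-era helper: retry until the siglock we took still
 * belongs to the task's current sighand_struct.  RCU keeps the structure
 * from being freed underneath us, so tasklist_lock is not needed. */
struct sighand_struct *lock_task_sighand(struct task_struct *tsk,
					 unsigned long *flags)
{
	struct sighand_struct *sighand;

	for (;;) {
		sighand = rcu_dereference(tsk->sighand);
		if (unlikely(sighand == NULL))
			break;		/* task is exiting; caller bails out */

		spin_lock_irqsave(&sighand->siglock, *flags);
		if (likely(sighand == rcu_dereference(tsk->sighand)))
			break;		/* still the task's sighand; we hold it */
		spin_unlock_irqrestore(&sighand->siglock, *flags);
	}

	return sighand;
}

do_tkill() above already holds rcu_read_lock() across the lookup, which is
what keeps tsk->sighand from being freed while this retry loop runs.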