Sophie: kernel-2.6.18-238.el5 src

kernel-2.6.18-238.el5.src.rpm

From: Erik Jacobson <ejacobso@redhat.com>
Subject: [RHEL5.1 Patch] BZ 220416 FPSWA exceptions taking excessive system time (performance issue)
Date: Thu, 21 Dec 2006 13:32:16 -0500
Bugzilla: 220416
Message-Id: <20061221183215.GA17230@redhat.com>
Changelog: [ia64] FPSWA exceptions take excessive system time 


This is bugzilla 220416.

Improve the scalability of the fpswa code that rate-limits
logging of messages.

There are 2 distinctly different problems in this code.

1) If prctl is used to disable logging, last_time is never
   updated. The result is that fpu_swa_count is zeroed out on
   EVERY fp fault. This causes a very very hot cache line.
   The fix reduces the wallclock time of a 1024p FP exception test
   from 28734 sec to 19 sec!!!

2) On VERY large systems, excessive messages are logged because
   multiple cpus can each reset or increment fpu_swa_count at
   about the same time. The result is that hundreds of messages
   are logged each second. The fixes reduces the logging rate
   to ~1 per second.


	Signed-off-by: Jack Steiner <steiner@sgi.com>

---
Here is some additional detail about the second problem.

The second problem occurs only at high cpu counts. The problem has been seen
on 768p & 1024p systems. It does not _appear_ to occur on 512p systems
although I would not rule out that it could occur under some conditions.

The current algorithm relies on the fact that <last_time> & <fpu_swa_count>
are infrequently modified. Under most circumstances, these variables are
globally shared & each cpu has a read-only copy of the cache line.

When one of the <last_time>/<fpu_swa_count> variables is modified, the cpu
must first cause the line to be evicted from all other cpus, then obtain an
exclusive copy of the cacheline. When the line is finally obtained
exclusively, there is no guarantee that the contents have not changed from
the values observed when the line was held shared.

Look specifically at this code and consider the case where <fpu_swa_count>
is incremented:

  (1)   if ((fpu_swa_count < 4) ...
  (2)           ....
  (3)           ++fpu_swa_count;


        - code at line (1) gets a shared copy of the cacheline containing
          <fpu_swa_count>.

        - the value of <fpu_swa_count> is loaded into a register

        - code at line (2) generates a few more memory references &
          introduces a small delay between (1) & (3)

        - code at line (3) does:
                - add "1" to the possibly stale register copy of
                  <fpu_swa_count> that was read from the shared line in (1)
                - obtains an exclusive copy of <fpu_swa_count>
                - OTHER CPUS MAY HAVE already update this line!!
                - stores the potentially stale result to <fpu_swa_count>

On heavily loaded large systems, it is probable that multiple cpus will each
make the same updates to <fpu_swa_count>. The <fpu_swa_count> can be
repeated reset to the same or smaller values. The result is that excessive
messages are logged & the cachelines containing <fpu_swa_count> becomes very
hot.


This is easily verified. The log from the customer site showed periods of
correct behavior & periods of bad behavior.

I stripped the timestamps out of the logs, and ran the "times" thru "uniq -c".

This is from one portion of a log and is what I would expect to see. Column
1 is a count of the number of "floating-point" messages that were logged for
the 1 second interval. Note the pattern of a burst of 4 messages every 5 sec

      4 00:21:29
      4 00:21:34
      5 00:21:39
      4 00:21:44
      4 00:21:50
      4 00:21:55
      4 00:22:00


Here is another portion of the log that shows the bad behavior.  The system
is logging 100's of messages per second!!!! Logging never stops.

    103 14:26:07
    122 14:26:08
    112 14:26:09
     83 14:26:10
     98 14:26:11
     91 14:26:12
    123 14:26:13
    123 14:26:14
     97 14:26:15
    100 14:26:16
    117 14:26:17
     84 14:26:18

Index: linux/arch/ia64/kernel/traps.c
===================================================================
--- linux.orig/arch/ia64/kernel/traps.c	2006-12-20 13:26:47.276186100 -0600
+++ linux/arch/ia64/kernel/traps.c	2006-12-21 10:27:50.987899608 -0600
@@ -325,6 +325,15 @@ fp_emulate (int fp_fault, void *bundle, 
 	return ret.status;
 }
 
+struct fpu_swa_msg {
+	unsigned long count;
+	unsigned long time;
+};
+static DEFINE_PER_CPU(struct fpu_swa_msg, cpulast);
+DECLARE_PER_CPU(struct fpu_swa_msg, cpulast);
+static struct fpu_swa_msg last __cacheline_aligned;
+
+
 /*
  * Handle floating-point assist faults and traps.
  */
@@ -334,8 +343,6 @@ handle_fpu_swa (int fp_fault, struct pt_
 	long exception, bundle[2];
 	unsigned long fault_ip;
 	struct siginfo siginfo;
-	static int fpu_swa_count = 0;
-	static unsigned long last_time;
 
 	fault_ip = regs->cr_iip;
 	if (!fp_fault && (ia64_psr(regs)->ri == 0))
@@ -343,14 +350,37 @@ handle_fpu_swa (int fp_fault, struct pt_
 	if (copy_from_user(bundle, (void __user *) fault_ip, sizeof(bundle)))
 		return -1;
 
-	if (jiffies - last_time > 5*HZ)
-		fpu_swa_count = 0;
-	if ((fpu_swa_count < 4) && !(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
-		last_time = jiffies;
-		++fpu_swa_count;
-		printk(KERN_WARNING
-		       "%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
-		       current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
+	if (!(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT))  {
+		unsigned long count, current_jiffies = jiffies;
+		struct fpu_swa_msg *cp = &__get_cpu_var(cpulast);
+
+		if (unlikely(current_jiffies > cp->time))
+			cp->count = 0;
+		if (unlikely(cp->count < 5)) {
+			cp->count++;
+			cp->time = current_jiffies + 5 * HZ;
+
+			/* minimize races by grabbing a copy of count BEFORE checking last.time. */
+			count = last.count;
+			barrier();
+
+			/*
+			 * Lower 4 bits are used as a count. Upper bits are a sequence
+			 * number that is updated when count is reset. The cmpxchg will
+			 * fail is seqno has changed. This minimizes mutiple cpus
+			 * reseting the count.
+			 */
+			if (current_jiffies > last.time)
+				(void) cmpxchg_acq(&last.count, count, 16 + (count & ~15));
+
+			/* used fetchadd to atomically update the count */
+			if ((last.count & 15) < 5 && (ia64_fetchadd(1, &last.count, acq) & 15) < 5) {
+				last.time = current_jiffies + 5 * HZ;
+				printk(KERN_WARNING
+		       			"%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
+		       			current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
+			}
+		}
 	}
 
 	exception = fp_emulate(fp_fault, bundle, &regs->cr_ipsr, &regs->ar_fpsr, &isr, &regs->pr,