From: Larry Woodman <lwoodman@redhat.com>
Date: Fri, 19 Nov 2010 13:27:34 -0500
Subject: [misc] prevent divide by 0 in the kernel during boot
Message-id: <4CE67B46.4080800@redhat.com>
Patchwork-id: 29514
O-Subject: [RHEL5.6 Patch] Prevent divide by 0 in the kernel during boot
Bugzilla: 508140

This is similar to the patch I posted yesterday to fix the RHEL6 divide-by-zero
in find_busiest_group().  The description below is taken directly from the
BZ (508140).  In this case find_busiest_group() ran at boot time, before
everything was properly initialized, so I fixed it by making sure cpu_power
cannot inadvertently be zero.

---------------------------------------------------------------------------------

Sometimes when a node boots up, the nmi_watchdog triggers a kernel panic
immediately after nmi is started.  I tried booting with notsc and the behavior
is the same.  I have attached the console messages from hyperion718.

This appears to be a statistical phenomenon and happens <1% of the time, but it
is still a few nodes every time we reboot the cluster.  Below it looks like
we're hitting a divide error (possibly divide by zero or something) somewhere
during boot.  The machine tries to panic, hits another divide error, and so on,
until the NMI watchdog kills the machine.

>2009-04-22 13:48:22 Intel(R) Xeon(R) CPU E5530 @ 2.40GHz stepping 05
>2009-04-22 13:48:22 Brought up 16 CPUs
>2009-04-22 13:48:22 testing NMI watchdog ... OK.
>2009-04-22 13:48:22 time.c: Using 14.318180 MHz WALL HPET GTOD HPET timer.
>2009-04-22 13:48:22 time.c: Detected 2400.184 MHz processor.
>2009-04-22 13:48:22 divide error: 0000 [1] SMP
>2009-04-22 13:48:22 last sysfs file:
>2009-04-22 13:48:22 CPU 2
>2009-04-22 13:48:22 Modules linked in:
>2009-04-22 13:48:22 Pid: 0, comm: swapper Not tainted 2.6.18-66chaos #1
>2009-04-22 13:48:22 RIP: 0010:[<ffffffff8008b49d>]  [<ffffffff8008b49d>] find_busiest_group+0x23a/0x621
>2009-04-22 13:48:23 RSP: 0018:ffff8101c55bfdb8  EFLAGS: 00010006
>2009-04-22 13:48:23 RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
>2009-04-22 13:48:23 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
>2009-04-22 13:48:23 RBP: ffff8101c55bfea8 R08: 0000000000000010 R09: 0000000000000030
>2009-04-22 13:48:23 R10: ffff81033fc1f848 R11: 0000000000000000 R12: ffff81033fc1f840
>2009-04-22 13:48:23 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
>2009-04-22 13:48:23 FS:  0000000000000000(0000) GS:ffff8101c55a4740(0000) knlGS:0000000000000000
>2009-04-22 13:48:23 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>2009-04-22 13:48:23 CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>2009-04-22 13:48:23 Process swapper (pid: 0, threadinfo ffff8101c55b8000, task ffff8101062da0c0)
>2009-04-22 13:48:23 Stack:  0000000000000000 ffff8101c55bfee8 ffff8101c55bff10 0000000000000000
>2009-04-22 13:48:23  ffff8101c55bff08 0000000200000000 ffff8101c545b140 0000000000000000
>2009-04-22 13:48:23  ffff81033fc1f800 0000000000000000 0000000000000080 0000000000000000
>2009-04-22 13:48:23 Call Trace:
>2009-04-22 13:48:23  <IRQ>  [<ffffffff8008d526>] rebalance_tick+0x18c/0x3ce
>2009-04-22 13:48:23  [<ffffffff80097a3c>] update_process_times+0x7a/0x8a
>2009-04-22 13:48:23  [<ffffffff800770ed>] smp_local_timer_interrupt+0x2f/0x64
>2009-04-22 13:48:23  [<ffffffff800777f3>] smp_apic_timer_interrupt+0x41/0x47
>2009-04-22 13:48:23  [<ffffffff800569a4>] mwait_idle+0x0/0x4a
>2009-04-22 13:48:23  [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
>2009-04-22 13:48:23  <EOI>  [<ffffffff800569da>] mwait_idle+0x36/0x4a
>2009-04-22 13:48:23  [<ffffffff80048bf6>] cpu_idle+0x95/0xb8
>2009-04-22 13:48:23  [<ffffffff80076eff>] start_secondary+0x45a/0x469
>2009-04-22 13:48:23

No hits in RH's IT for "divide error: 0000 [1] SMP" that affect RHEL5.  There
were 4 old closed cases on RHEL4 but nothing recent.  Asked them to reproduce
on an official RHEL5 kernel and to give me a better quantified notion of how
often this happens.  This is going to be important because it will give us some
idea of how many times we need to serially reboot a test machine to get a
reasonable comfort level that the problem is fixed.

http://lkml.org/lkml/2008/11/26/524 looks similar, but we already have Ingo's
patch in the RHEL5 kernel.
http://lkml.org/lkml/2008/11/27/30
However, RHEL doesn't have the ACCESS_ONCE macro.
http://lkml.org/lkml/2008/11/29/129

These are 16-core Nehalem E5530 machines:
http://ark.intel.com/cpu.aspx?groupId=37103

-------------------------------------------------------------------------------

Signed-off-by: Jarod Wilson <jarod@redhat.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index d49ddab..6251dbf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6533,6 +6533,8 @@ static void set_domain_attribute(struct sched_domain *sd, int level, int *attr)
 	}
 }

+#define NZ(x) (x?x:1)
+
 /*
  * Build sched domains for a given set of cpus and attach the sched domains
  * to the individual cpus
@@ -6787,7 +6789,7 @@ static int __build_sched_domains(const cpumask_t *cpu_map, int *attr)
 		struct sched_domain *sd;
 		sd = &per_cpu(core_domains, i);
 		if (sched_smt_power_savings)
-			power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
+			power = SCHED_LOAD_SCALE * NZ(cpus_weight(sd->groups->cpumask));
 		else
 			power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1)
 					* SCHED_LOAD_SCALE / 10;
@@ -6837,7 +6839,7 @@ static int __build_sched_domains(const cpumask_t *cpu_map, int *attr)
 		int power;
 		sd = &per_cpu(phys_domains, i);
 		if (sched_smt_power_savings)
-			power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
+			power = SCHED_LOAD_SCALE * NZ(cpus_weight(sd->groups->cpumask));
 		else
 			power = SCHED_LOAD_SCALE;
 		sd->groups->cpu_power = power;