kernel-2.6.18-238.el5.src.rpm

From: Don Zickus <dzickus@redhat.com>
Date: Mon, 10 May 2010 16:13:51 -0400
Subject: [misc] nmi: fix bogus nmi watchdog stuck messages
Message-id: <1273508031-2296-1-git-send-email-dzickus@redhat.com>
Patchwork-id: 24927
O-Subject: [RHEL-6 PATCH] [x86] nmi watchdog:  fix bogus nmi stuck messages
Bugzilla: 455323
RH-Acked-by: Jarod Wilson <jarod@redhat.com>

https://bugzilla.redhat.com/process_bug.cgi

As the number of CPUs has increased, various machines have started exhibiting
behaviour like

testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!

Now obviously the nmi watchdog is working because the count increased from
1150 to 1153.  Internally, the nmi watchdog check only passes if the count
increased by more than 5, where 5 is some magic number.

I am having trouble reproducing this problem, and for RHEL-5 I am not sure how
interesting it is to get to the root of it.

So the lazy fix I am proposing is attached below: just check whether the
count moved at all.  Whether something fired 1 time over 20 ms or 5 times, it
shouldn't matter as long as it fires.
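
For illustration only, here is a minimal standalone sketch of the two
comparisons.  The helper names nmi_stuck_old() and nmi_stuck_new() are
hypothetical and do not exist in the kernel; they just model the condition
that the patch changes, using the counts from the warning above.

```c
#include <stdbool.h>

/*
 * Hypothetical helpers, not kernel code: they model the condition in
 * check_nmi_watchdog() before and after this patch.
 */

/* Old test: the CPU is reported stuck unless more than 5 NMIs arrived. */
static bool nmi_stuck_old(unsigned int prev, unsigned int now)
{
	return now - prev <= 5;
}

/* New test: the CPU is reported stuck only if the count did not move. */
static bool nmi_stuck_new(unsigned int prev, unsigned int now)
{
	return now - prev == 0;
}
```

With the counts from the message (1150 -> 1153), the old test reports the
CPU as stuck even though three NMIs arrived, while the new test does not.
A count that stays at 1150 is still reported by both.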

Tested by a customer who is happy with it.

It doesn't apply upstream, because all this is going away soon.

Please review and ack.

Signed-off-by: Don Zickus <dzickus@redhat.com>

diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c
index 0947d35..02fe497 100644
--- a/arch/i386/kernel/nmi.c
+++ b/arch/i386/kernel/nmi.c
@@ -129,7 +129,7 @@ static int __init check_nmi_watchdog(void)
 		if (!cpu_isset(cpu, cpu_callin_map))
 			continue;
 #endif
-		if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
+		if (nmi_count(cpu) - prev_nmi_count[cpu] == 0) {
 			endflag = 1;
 			/* most hypervisors do not emulate nmi watchdog
 			 * ticks correctly.  do not print anything if we
diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c
index ce6f499..6c854a6 100644
--- a/arch/x86_64/kernel/nmi.c
+++ b/arch/x86_64/kernel/nmi.c
@@ -135,7 +135,7 @@ int __init check_nmi_watchdog (void)
 		if (!per_cpu(wd_enabled, cpu))
 			continue;
 
-		if (cpu_pda(cpu)->__nmi_count - counts[cpu] <= 5) {
+		if (cpu_pda(cpu)->__nmi_count - counts[cpu] == 0) {
 			endflag = 1;
 			/* most hypervisors do not emulate nmi watchdog
 			 * ticks correctly.  do not print anything if we