From: Don Zickus <dzickus@redhat.com>
Date: Mon, 10 May 2010 16:13:51 -0400
Subject: [misc] nmi: fix bogus nmi watchdog stuck messages

Message-id: <1273508031-2296-1-git-send-email-dzickus@redhat.com>
Patchwork-id: 24927
O-Subject: [RHEL-6 PATCH] [x86] nmi watchdog: fix bogus nmi stuck messages
Bugzilla: 455323
RH-Acked-by: Jarod Wilson <jarod@redhat.com>

https://bugzilla.redhat.com/process_bug.cgi

As the number of cpus has increased, various machines have started
exhibiting behaviours like

testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (1150->1153)!

Now obviously the nmi watchdog is working, because the count increased from
1150 to 1153. Internally, the nmi watchdog checks whether the count increased
by more than 5, where 5 is some magic number.

I am having trouble reproducing this problem, and for RHEL-5 I am not sure
how interesting it is to get to the root of the problem.

So, the lazy fix I am proposing is attached below: just check to see if the
count moved at all. Whether something fired 1 time over 20 ms or 5 times, it
shouldn't matter as long as it fires.

Tested by a customer who is happy with it. It doesn't apply upstream,
because all this is going away soon.

Please review and ack.

Signed-off-by: Don Zickus <dzickus@redhat.com>

diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c
index 0947d35..02fe497 100644
--- a/arch/i386/kernel/nmi.c
+++ b/arch/i386/kernel/nmi.c
@@ -129,7 +129,7 @@ static int __init check_nmi_watchdog(void)
 		if (!cpu_isset(cpu, cpu_callin_map))
 			continue;
 #endif
-		if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
+		if (nmi_count(cpu) - prev_nmi_count[cpu] == 0) {
 			endflag = 1;
 			/* most hypervisors do not emulate nmi watchdog
 			 * ticks correctly.  do not print anything if we
diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c
index ce6f499..6c854a6 100644
--- a/arch/x86_64/kernel/nmi.c
+++ b/arch/x86_64/kernel/nmi.c
@@ -135,7 +135,7 @@ int __init check_nmi_watchdog (void)
 		if (!per_cpu(wd_enabled, cpu))
 			continue;
 
-		if (cpu_pda(cpu)->__nmi_count - counts[cpu] <= 5) {
+		if (cpu_pda(cpu)->__nmi_count - counts[cpu] == 0) {
 			endflag = 1;
 			/* most hypervisors do not emulate nmi watchdog
 			 * ticks correctly.  do not print anything if we