From: John Villalovos <jvillalo@redhat.com>
Date: Mon, 17 Aug 2009 11:40:44 -0400
Subject: [x86] suspend-resume: work on large logical CPU systems
Message-id: 20090817154044.GA3117@linuxjohn.usersys.redhat.com
O-Subject: [RHEL5.5 BZ499271] Make suspend-resume work on systems with lots of logical CPUs
Bugzilla: 499271
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: John Feeney <jfeeney@redhat.com>

Patch for RHEL 5.5 based on the 2.6.18-153 kernel

Bugzilla 499271 Make suspend-resume work on systems with lots of logical CPUs
(e.g. Boxboro-EX which has 64 logical CPUs).
https://bugzilla.redhat.com/show_bug.cgi?id=499271

No equivalent upstream because upstream works differently now.

Brew build, all builds passed:
https://brewweb.devel.redhat.com/taskinfo?taskID=1929443

Testing: This has been tested on a system with 64 virtual CPUs (32 physical
plus hyperthreading) and C3 states (deep C-states) enabled.  Without the patch
the system hangs in suspend; with the patch it does not.

Also did an RHTS kernel test pass:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=84862
All tests passed except on s390x, where the test failed because the OS
failed to install; previous RHTS runs I did passed on that architecture.

[PATCH] Make suspend-resume work on systems with lots of logical CPUs

The validation team reported a "softlockup" on S4 suspend/resume on a system
with a large number of logical CPUs and with deep CPU C-states enabled. The
softlockup does not happen when deep C-states are disabled. This softlockup
during S4 suspend (and also during S4 resume) was root-caused to the
smp_send_timer_broadcast_ipi() function.

In RHEL 5.2, this function used
		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);

In RHEL 5.3, this was changed to
	if (cpus_equal(cpu_online_map, timer_interrupt_broadcast_ipi_mask)) {
		__send_IPI_shortcut(APIC_DEST_ALLINC, LOCAL_TIMER_VECTOR,
					APIC_DEST_LOGICAL);
	} else {
		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
	}

The reason for that change: send_IPI_mask() sends IPIs to the CPUs one after
another, and if a receiving CPU is in a deep C-state, the sending CPU waits
until the receiver wakes up (on the order of 50-100us). The 5.3 change
eliminated this per-CPU wait by using __send_IPI_shortcut(), which sends the
IPI once to all CPUs. The issue had been identified on a DP server where one
CPU spent too much time waiting for IPIs to be sent.
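
For reference, the sequential path amounts to the loop sketched below. This
is a condensed, illustrative sketch, not the exact RHEL source; it is modeled
on the i386 send_IPI_mask_sequence() of this era, and the helper names
(apic_wait_icr_idle, apic_write_around, cpu_to_logical_apicid) are taken from
that code. The key point is the busy-wait on the ICR delivery-status bit for
every target CPU:

	/*
	 * Illustrative sketch only, modeled on send_IPI_mask_sequence().
	 * The sender spins in apic_wait_icr_idle() until the previous IPI
	 * has been delivered; a target sleeping in a deep C-state can keep
	 * the ICR busy for 50-100us, so the waits add up per target CPU.
	 */
	static void send_IPI_mask_sketch(cpumask_t mask, int vector)
	{
		unsigned long flags;
		unsigned int cpu;

		local_irq_save(flags);
		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			if (!cpu_isset(cpu, mask))
				continue;
			apic_wait_icr_idle();	/* wait out the previous IPI */
			apic_write_around(APIC_ICR2,
					SET_APIC_DEST_FIELD(cpu_to_logical_apicid(cpu)));
			apic_write_around(APIC_ICR, APIC_DEST_LOGICAL | vector);
		}
		local_irq_restore(flags);
	}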

Now, more recent systems with 32 or 64 logical CPUs do use this IPI shortcut
during normal run time. But on entry to and exit from S1/S3/S4, RHEL 5.3
offlines all the secondary CPUs one by one, so cpu_online_map no longer
equals the broadcast mask and we fall into the "else" branch of the code
above. That branch uses the old "send an IPI to each CPU in the mask, one
after another" logic. Take the 50-100us latency to send an IPI to one CPU and
multiply it by 30 or 62 CPUs (assuming one CPU is offline and one is actually
sending the IPI): the function takes more than 1 ms, by which time a 1000 HZ
kernel is already due for the next timer interrupt, which again wants to send
an IPI to all the non-idle CPUs. Thus the system fails to make any forward
progress during suspend/resume once one CPU is offline, eventually ending up
with a "softlockup".

The fix below changes the function to always use __send_IPI_shortcut(), so we
never fall back to the sequential IPIs. The offline CPUs have their APICs
disabled and interrupts disabled, and are not affected by this broadcast.

This patch fixes S1/S3/S4 on systems with a large number of logical CPUs
when those CPUs support deep C-states.

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5dc606e..47a4e02 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -1307,16 +1307,10 @@ static void up_apic_timer_interrupt_call(struct pt_regs *regs)
 void smp_send_timer_broadcast_ipi(struct pt_regs *regs)
 {
 #ifdef CONFIG_SMP
-	cpumask_t mask;
-
-	if (cpus_equal(cpu_online_map, timer_bcast_ipi)) {
-		__send_IPI_shortcut(APIC_DEST_ALLINC, LOCAL_TIMER_VECTOR);
+	/* If none of the CPUs are using IPI then no need to continue */
+	if (cpus_empty(timer_bcast_ipi))
 		return;
-	}
-	cpus_and(mask, cpu_online_map, timer_bcast_ipi);
-	if (!cpus_empty(mask)) {
-		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
-	}
+	__send_IPI_shortcut(APIC_DEST_ALLINC, LOCAL_TIMER_VECTOR);
 #else
 	if (!cpus_empty(timer_bcast_ipi)) {
 		/*
diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 940f365..de9f6a4 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -926,18 +926,11 @@ EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
 
 void smp_send_timer_broadcast_ipi(void)
 {
-	cpumask_t mask;
-
-	if (cpus_equal(cpu_online_map, timer_interrupt_broadcast_ipi_mask)) {
-		__send_IPI_shortcut(APIC_DEST_ALLINC, LOCAL_TIMER_VECTOR,
-					APIC_DEST_LOGICAL);
+	/* If none of the CPUs are using IPI then no need to continue */
+	if (cpus_empty(timer_interrupt_broadcast_ipi_mask))
 		return;
-	}
-
-	cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
-	if (!cpus_empty(mask)) {
-		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
-	}
+	__send_IPI_shortcut(APIC_DEST_ALLINC, LOCAL_TIMER_VECTOR,
+				APIC_DEST_LOGICAL);
 }
 
 void switch_ipi_to_APIC_timer(void *cpumask)