Sophie

Sophie

distrib > Scientific%20Linux > 5x > x86_64 > by-pkgid > 27922b4260f65d317aabda37e42bbbff > files > 89

kernel-2.6.18-238.el5.src.rpm

From: Luming Yu <luyu@redhat.com>
Date: Thu, 19 Feb 2009 17:07:29 +0800
Subject: [acpi] fix C-states less efficient on certain machines
Message-id: 499D2151.50903@redhat.com
O-Subject: [RHEL 5.4 PATCH] bz 484174: Fix C-states less efficient on some Nehalem machines with certain BIOS
Bugzilla: 484174
RH-Acked-by: Pete Zaitcev <zaitcev@redhat.com>

BZ 484174

Description of problem:

Looking at the idle residencies, average CC6 across all cores was 70%+ but
Package C6 (PC6) was less than 20%. Looking at individual cores instead of
average across all cores revealed that CPU 0 and its sibling thread had < 20%
CC6 while other threads had >80% CC6. This CPU 0 and its sibling were
eventually limiting the package C-state for both sockets.

Probing further, the OS seemd to choose C6 for CPU 0 most of the time. But, for
SMT sibling of CPU 0, OS C-state policy chose C1 more often than C6.
Instrumenting the C-state entry exit showed the wakeup pattern on this sibling
thread to be something like

3 830.76
3 174.16
1 847.55
1 1000.31
1 971.42
1 1024.36
1 1002.37
1 966.38
1 986.84
1 994.96
1 996.17
1 970.77
3 819.49
3 159.93
1 856.66
1 988.59
1 998.31
1 987.61
1 986.50
1 964.69
1 978.42

Where first column is th C-state entered (C3 meaning hw C6) and second column
in time spent in that state in us. Note that OS tick rate is 1000, so we expect
one wakeup every 1000us.

We see alternating short and long wakeups which sum up to 1000uS while we enter
C6 and we see residency of ~1000us while in C1. The reason for this being that
the external timer interrupt to CPU 0 wakes up sibling as well when in C6 and
sibling goes back to sleep, to be woken up soon by timer IPI from CPU 0. This
double wakeup in a tick only happens when sibling is in C6 and does not happen
when sibling is in C1. And due to this short wakeup, OS promotion/demotion
algorithm demotes this particular thread to go to C1, and the thread keeps
going back to C1 for "demotion.threshold" number of time (which is 10). As a
result of this double wakeup within a tick, the sibling of CPU 0 spends more
time in C1, limiting the whole platform.

int sleep_ticks;
unsigned int threshold_ticks;

// record start_time
// enter C6-state
// record stop_time
sleep_ticks = stop_time - start_time - C6_exit_latency

If (sleep_ticks < threshold_ticks)
 // check for threshold count, etc

The problem here being sleep_ticks is int and threshold_ticks is unsigned int.
The above comparison turns into a unsigned comparison and negative sleep_ticks
(due to small sleep and high exit latency in supermicro) gets counted as very
high idle time and demotion does not happen on supermicro. GC on the other hand
has a lower exit latency resulting in sleep_ticks still being small positive
number for these short wakes and the policy demotes C-state from C6 to C1 and
stays in C1 for next 10 C-state invocations. Fixing this int, unsigned int bug
in OS makes this low package C6 problem appear on Supermicro as well!

Now to fix the problem, we have to change the C-state policy in Redhat EL5 to
look for 2 consecutive short wakeups in place of just 1, before it does any
demotion.

< The attached patch fixes the issue >

Power difference:

Testing status:

Upstream status:
Upstream doesn't need this fix due to different code base of cpu idle
driver.

Brew:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1698914

Please review and Ack.
Thanks,
Luming

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index f40b175..299734e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -434,6 +434,9 @@ static void acpi_processor_idle(void)
 	if ((cx->type != ACPI_STATE_C1) && (sleep_ticks > 0))
 		cx->time += sleep_ticks;
 
+	if (sleep_ticks != 0xFFFFFFFF && sleep_ticks < 0)
+		sleep_ticks = 0;
+
 	next_state = pr->power.state;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -473,6 +476,9 @@ static void acpi_processor_idle(void)
 					goto end;
 				}
 			}
+		} else {
+			if (cx->promotion.count > 0)
+				cx->promotion.count--;
 		}
 	}
 
@@ -490,6 +496,9 @@ static void acpi_processor_idle(void)
 				next_state = cx->demotion.state;
 				goto end;
 			}
+		} else {
+			if (cx->demotion.count > 0)
+				cx->demotion.count--;
 		}
 	}
 
@@ -557,7 +566,7 @@ static int acpi_processor_set_power_policy(struct acpi_processor *pr)
 		if (lower) {
 			cx->demotion.state = lower;
 			cx->demotion.threshold.ticks = cx->latency_ticks;
-			cx->demotion.threshold.count = 1;
+			cx->demotion.threshold.count = 2;
 			if (cx->type == ACPI_STATE_C3)
 				cx->demotion.threshold.bm = bm_history;
 		}