kernel-2.6.18-238.el5.src.rpm

From: Larry Woodman <lwoodman@redhat.com>
Date: Tue, 10 Aug 2010 11:51:44 -0400
Subject: [mm] add option to skip ZERO_PAGE mmap of /dev/zero
Message-id: <4C613D50.3030308@redhat.com>
Patchwork-id: 27482
O-Subject: [RHEL5-U6 Patch] Remove optimization to map the ZERO_PAGE when
	mmap()'ng /dev/zero
Bugzilla: 619541
RH-Acked-by: Jarod Wilson <jarod@redhat.com>

We have a customer running Oracle Tuxedo who migrated from RHEL3 to
RHEL5. When they did, the performance of the application suite dropped
by hundreds of percent. After investigating this I found the application
was mmap()'ng and munmap()'ng /dev/zero millions of times, a common way
of doing malloc()/free() of anonymous memory on Solaris.

Anyway, an mmap() of /dev/zero results in calling mmap_zero(), which on
RHEL5 maps the ZERO_PAGE into every pte within that virtual address
range. Since the application is also multi-threaded, the subsequent
munmap() of /dev/zero results in TLB shootdowns to all other CPUs. When
this happens thousands or millions of times, application performance is
terrible. Mapping the ZERO_PAGE into every pte within the range was an
optimization to make subsequent pagefaults faster on RHEL5; it has been
removed/changed upstream.

Rather than removing this optimization, I added a new RHEL5 tunable,
/proc/sys/vm/vm_devzero_optimized, that allows one to disable it. By
default it is set to 1, so the optimization remains enabled. If you set
it to zero, mmap_zero() will not map the ZERO_PAGE, so the address range
behaves as ordinary anonymous virtual memory. This makes the pagefault
slower but the mmap() much faster.
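On a kernel carrying this patch, the tunable can be toggled at runtime; a sketch (the sysctl name follows the procname added in the diff below, and writing it requires root):

```shell
# Disable the ZERO_PAGE mapping optimization (default is 1, enabled)
echo 0 > /proc/sys/vm/vm_devzero_optimized

# Equivalently, via sysctl (persistable in /etc/sysctl.conf):
sysctl -w vm.vm_devzero_optimized=0
```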

The attached patch adds this and fixes BZ619541.

Signed-off-by: Jarod Wilson <jarod@redhat.com>

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 2451273..97b34af 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -34,6 +34,8 @@
 # include <linux/efi.h>
 #endif
 
+int vm_devzero_optimized = 1;
+
 static inline int range_is_allowed(unsigned long pfn, unsigned long size)
 {
 	u64 from = ((u64)pfn) << PAGE_SHIFT;
@@ -595,7 +597,8 @@ static int mmap_zero(struct file * file, struct vm_area_struct * vma)
 {
 	if (vma->vm_flags & VM_SHARED)
 		return shmem_zero_setup(vma);
-	if (zeromap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start, vma->vm_page_prot))
+	if (vm_devzero_optimized &&
+	    zeromap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start, vma->vm_page_prot))
 		return -EAGAIN;
 	return 0;
 }
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 88e94c0..adfb1c9 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -214,6 +214,7 @@ enum
 	VM_ZONE_RECLAIM_INTERVAL=41, /* interval between zone_reclaim failures */
 	VM_TOPDOWN_ALLOCATE_FAST=42, /* optimize speed over fragmentation in topdown alloc */
 	VM_MAX_RECLAIMS=43,     /* max reclaims allowed */
+	VM_DEVZERO_OPTIMIZED=44, /* pagetables initialized with ZERO_PAGE at mmap time */
 };
 
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ff0cf3..494f90b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -83,6 +83,7 @@ extern int compat_log;
 extern int flush_mmap_pages;
 extern int max_writeback_pages;
 extern int blk_iopoll_enabled;
+extern int vm_devzero_optimized;
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 extern int proc_unknown_nmi_panic(ctl_table *, int, struct file *,
@@ -1232,6 +1233,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 	},
+	{
+		.ctl_name	= VM_DEVZERO_OPTIMIZED,
+		.procname	= "vm_devzero_optimized",
+		.data		= &vm_devzero_optimized,
+		.maxlen		= sizeof(vm_devzero_optimized),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };