Sophie

Sophie

distrib > Scientific%20Linux > 5x > x86_64 > by-pkgid > 27922b4260f65d317aabda37e42bbbff > files > 2208

kernel-2.6.18-238.el5.src.rpm

From: Eric Sandeen <esandeen@redhat.com>
Subject: [PATCH RHEL5] VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Tue, 2 Jan 2007 17:12:17 -0600
Bugzilla: 220963
Message-Id: <200701022312.l02NCHe0007055@neon.msp.redhat.com>
Changelog: VM: Fix nasty and subtle race in shared mmap'ed page writeback


This is for Bug 220963: data corruption bug in 2.6.18 - patch available

2 patches below, first one is part of a no-op cleanup from andrew, it makes
the 2nd patch (which actually fixes the bug) apply pretty cleanly.  I could combine
them if preferable.

Both patches are from upstream; peterz was on the thread from the beginning
but somehow I drew the short straw for this bug I guess. :)

The testcase is at http://marc.theaimsgroup.com/?l=linux-kernel&m=116726650423768&w=2

It consistently & massively corrupts for me on a 4-way x86_64 RHEL5 box, and with 
the change in place I had 20 successful runs in a row..

Linus has documented the issue better than I can in the comments below.

-Eric

X-Git-Tag: v2.6.20-rc1^0~145^2~63
X-Git-Url: http://www.kernel.org/git/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=8c08540f8755c451d8b96ea14cfe796bc3cd712d

[PATCH] clean up __set_page_dirty_nobuffers()

Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
and a few other places.  No functional changes.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

(Partial patch below, just for clear_page_dirty_for_io)

---

Index: linux-2.6.18-1.2910.el5/mm/page-writeback.c
===================================================================
--- linux-2.6.18-1.2910.el5.orig/mm/page-writeback.c
+++ linux-2.6.18-1.2910.el5/mm/page-writeback.c
@@ -761,17 +761,17 @@ int clear_page_dirty_for_io(struct page 
 	struct address_space *mapping = page_mapping(page);
 
 	WARN_ON_ONCE(!PageLocked(page));
-	if (mapping) {
-		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping)) {
-				page_mkclean(page);
-				dec_zone_page_state(page, NR_FILE_DIRTY);
-			}
-			return 1;
+	if (!mapping)
+		return TestClearPageDirty(page);
+
+	if (TestClearPageDirty(page)) {
+		if (mapping_cap_account_dirty(mapping)) {
+			page_mkclean(page);
+			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
-		return 0;
+		return 1;
 	}
-	return TestClearPageDirty(page);
+	return 0;
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
X-Git-Tag: v2.6.20-rc3^0~58
X-Git-Url: http://www.kernel.org/git/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=7658cc289288b8ae7dd2c2224549a048431222b3

VM: Fix nasty and subtle race in shared mmap'ed page writeback

The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time.  Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".

But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.

That last caveat is the important one.

Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.

That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.

Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger.  But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.

This fixes it. Finally. After way too much hair-pulling.

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Martin J. Bligh <mbligh@google.com>
Acked-by: Martin Michlmayr <tbm@cyrius.com>
Acked-by: Martin Johansson <martin@fatbob.nu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Andrei Popa <andrei.popa@i-neo.ro>
Cc: High Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Kenneth Cheng <kenneth.w.chen@intel.com>
Cc: Tobias Diedrich <ranma@tdiedrich.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

Index: linux-2.6.18-1.2910.el5/mm/page-writeback.c
===================================================================
--- linux-2.6.18-1.2910.el5.orig/mm/page-writeback.c
+++ linux-2.6.18-1.2910.el5/mm/page-writeback.c
@@ -761,17 +761,46 @@ int clear_page_dirty_for_io(struct page 
 	struct address_space *mapping = page_mapping(page);
 
 	WARN_ON_ONCE(!PageLocked(page));
-	if (!mapping)
-		return TestClearPageDirty(page);
-
-	if (TestClearPageDirty(page)) {
-		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		/*
+		 * Yes, Virginia, this is indeed insane.
+		 *
+		 * We use this sequence to make sure that
+		 *  (a) we account for dirty stats properly
+		 *  (b) we tell the low-level filesystem to
+		 *      mark the whole page dirty if it was
+		 *      dirty in a pagetable. Only to then
+		 *  (c) clean the page again and return 1 to
+		 *      cause the writeback.
+		 *
+		 * This way we avoid all nasty races with the
+		 * dirty bit in multiple places and clearing
+		 * them concurrently from different threads.
+		 *
+		 * Note! Normally the "set_page_dirty(page)"
+		 * has no effect on the actual dirty bit - since
+		 * that will already usually be set. But we
+		 * need the side effects, and it can help us
+		 * avoid races.
+		 *
+		 * We basically use the page "master dirty bit"
+		 * as a serialization point for all the different
+		 * threads doing their things.
+		 *
+		 * FIXME! We still have a race here: if somebody
+		 * adds the page back to the page tables in
+		 * between the "page_mkclean()" and the "TestClearPageDirty()",
+		 * we might have it mapped without the dirty bit set.
+		 */
+		if (page_mkclean(page))
+			set_page_dirty(page);
+		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			return 1;
 		}
-		return 1;
+		return 0;
 	}
-	return 0;
+	return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 

From: Peter Zijlstra <pzijlstr@redhat.com>
Subject: Re: [PATCH RHEL5] VM: Fix nasty and subtle race in shared mmap'ed 	page writeback
Date: Mon, 08 Jan 2007 14:20:16 +0100

>From c2fda5fed81eea077363b285b66eafce20dfd45a Mon Sep 17 00:00:00 2001

 - add flush_cache_page() for all those virtual indexed cache
   architectures.

 - handle s390.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


(akpm: macros are wonderful)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
 mm/rmap.c |   23 +++++++++++++----------
 1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index b3deda8..57306fa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -433,7 +433,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -445,17 +445,18 @@ static int page_mkclean_one(struct page 
 	if (!pte)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
+	if (pte_dirty(*pte) || pte_write(*pte)) {
+		pte_t entry;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		entry = ptep_clear_flush(vma, address, pte);
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(mm, address, pte, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
 	pte_unmap_unlock(pte, ptl);
 out:
 	return ret;
@@ -490,6 +491,8 @@ int page_mkclean(struct page *page)
 		if (mapping)
 			ret = page_mkclean_file(mapping, page);
 	}
+	if (page_test_and_clear_dirty(page))
+		ret = 1;
 
 	return ret;
 }
--