kernel-2.6.18-128.1.10.el5.src.rpm

From: Peter Staubach <staubach@redhat.com>
Date: Thu, 12 Jun 2008 15:14:00 -0400
Subject: [nfs] pages of a memory mapped file get corrupted
Message-id: 48517578.4070205@redhat.com
O-Subject: Re: [RHEL-5.3 PATCH] pages of a memory mapped NFS file get corrupted
Bugzilla: 435291
RH-Acked-by: Jeff Layton <jlayton@redhat.com>

Attached is a patch to address BZ435291, "LTC41974-Pages of a
memory mapped NFS file get corrupted".  This bugzilla describes
a situation where modifications to some pages made through a
mmap'd region may be lost if the NFS client decides that it needs
to invalidate the page cache for that particular file.

The problem in this case is a lack of synchronization between the
code which handles page faults and the code which the NFS client
uses to invalidate pages in the page cache.  The code to invalidate
the pages would lock the page, unmap it from all of the mmap'd
regions that it was mapped into, and then would basically just
unilaterally destroy it.  The code to handle page faults would
find the page in the page cache, map it back into the process'
address space, and then restart the instruction which caused the
page fault.  A small percentage of the time, this would lead to
the page being destroyed after the instruction was re-executed,
thus losing the results of the memory store operation.
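
(For illustration only -- the test programs from BZ432974 and BZ435291
are not reproduced here.  The sketch below, with an assumed file name,
just shows the access pattern exposed to the race: a store through a
MAP_SHARED mapping of an NFS file that can be silently discarded if
the client invalidates the page cache at the wrong moment.  Actually
hitting the window needs an external trigger, such as another client
modifying the file, which is not shown.)

/*
 * Hypothetical sketch, not the bugzilla test program.  "testfile" is an
 * assumed name for a file on an NFS mount; the event that makes the NFS
 * client invalidate its page cache must be provided externally.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("testfile", O_RDWR);
	volatile char *map;
	long i;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; ; i++) {
		/*
		 * The store below goes through the mapping.  If the fault
		 * backing it races with the client invalidating the page,
		 * the store lands in a page that is destroyed right after
		 * the instruction is re-executed, and a later read sees
		 * stale data fetched from the server instead.
		 */
		map[0] = (char)i;
		if (map[0] != (char)i) {
			fprintf(stderr, "lost update at iteration %ld\n", i);
			return 1;
		}
	}
}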

The solution is to provide some synchronization between the
code to handle page faults and the code to handle the invalidation
of pages by the NFS client and also to simplify and correct the
semantics of the code to handle the invalidation of the pages.
The synchronization takes the form of the page fault handling code
being modified to acquire the page lock and the page invalidation
code being modified to correctly increment the truncate_count
element in the address_space mapping struct, indicating that
something in the mapping has changed and that the page fault
code should release its page and start over again.
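
(To make that handshake easier to see in isolation, here is a small
user-space analogue of the truncate_count retry pattern -- a sketch
only, using a pthread mutex and a plain counter in place of the page
lock and mapping->truncate_count; the real kernel code is in the
mm/memory.c hunks below.)

/*
 * User-space analogue of the retry pattern described above; the names
 * are made up, only the shape of the handshake matches the kernel code.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long truncate_count;	/* bumped on every invalidation */

/* Invalidation side: destroy the cached page, then advertise the change. */
static void invalidate_pages(void)
{
	pthread_mutex_lock(&lock);
	/* ... unmap and drop the cached page here ... */
	truncate_count++;
	pthread_mutex_unlock(&lock);
}

/* Fault-handling side: snapshot the counter, prepare the page, then
 * re-check before committing; if the counter moved, the page we found
 * may already be doomed, so drop it and start over. */
static void handle_fault(void)
{
	unsigned long sequence;

retry:
	pthread_mutex_lock(&lock);
	sequence = truncate_count;
	pthread_mutex_unlock(&lock);

	/* ... look the page up in the cache and prepare it ... */

	pthread_mutex_lock(&lock);
	if (sequence != truncate_count) {
		pthread_mutex_unlock(&lock);
		goto retry;		/* release the page, try again */
	}
	/* ... safe to map the page into the address space ... */
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	handle_fault();
	invalidate_pages();
	handle_fault();
	printf("truncate_count is now %lu\n", truncate_count);
	return 0;
}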

The semantics that the NFS client requires are that pages which
are marked as modified should not be invalidated because they
need to be written out to the server before they can be destroyed.
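
(A self-contained sketch of that rule, with made-up types and names
rather than the kernel's: a page that still carries modifications is
refused and the caller sees -EIO, which is also why the NFS path above
calls nfs_sync_mapping() before invalidating.)

/*
 * Illustrative only: fake types and helpers, not the kernel's.  The
 * point is the contract -- a modified ("dirty") page is never thrown
 * away; the caller gets -EIO and must write it to the server first.
 */
#include <errno.h>
#include <stdio.h>

struct fake_page {
	int dirty;	/* stands in for PageDirty() */
	int cached;	/* still present in the "page cache"? */
};

static int try_invalidate(struct fake_page *p)
{
	if (p->dirty)
		return -EIO;	/* must reach the server before destruction */
	p->cached = 0;		/* a clean page may be dropped */
	return 0;
}

int main(void)
{
	struct fake_page clean = { 0, 1 };
	struct fake_page dirty = { 1, 1 };
	int ret;

	ret = try_invalidate(&clean);
	printf("clean page: ret=%d cached=%d\n", ret, clean.cached);
	ret = try_invalidate(&dirty);
	printf("dirty page: ret=%d cached=%d\n", ret, dirty.cached);
	return 0;
}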

The new support was tested by running the test programs supplied
in BZ432974 and BZ435291 for many hours, the NFS Connectathon
testsuite for many hours against many different servers, and on
a desktop system.  The changes have been verified by IBM to
address the problem that they were seeing.

Upstream has a different solution to this situation, but that
support was not applicable to RHEL-5 for kABI reasons.

The new patch reduces the time that the page is held locked
and in particular, releases the page lock before calling the
routine which ends up trying to lock the page for itself.
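
(The reason the ordering matters: the page lock is not recursive, so
holding it while calling into a routine that does lock_page() on the
same page would simply block forever.  A trivial user-space sketch of
that property, with a plain pthread mutex standing in for the page
lock:)

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

	pthread_mutex_lock(&page_lock);		/* "page is locked" */
	/* A second attempt on an already-held, non-recursive lock cannot
	 * succeed, which is why the lock is dropped first. */
	if (pthread_mutex_trylock(&page_lock) != 0)
		printf("second lock attempt fails; lock_page() would block forever\n");
	pthread_mutex_unlock(&page_lock);
	return 0;
}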

The new patch passes the reproducible testcases in bz435291
and bz432974.

    Thanx...

       ps

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index ff11522..c82cf26 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -716,7 +716,7 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 		if (S_ISREG(inode->i_mode))
 			nfs_sync_mapping(mapping);
-		invalidate_inode_pages2(mapping);
+		invalidate_inode_pages3(mapping);
 
 		spin_lock(&inode->i_lock);
 		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6644592..28c201f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1577,6 +1577,9 @@ static inline void invalidate_remote_inode(struct inode *inode)
 extern int invalidate_inode_pages2(struct address_space *mapping);
 extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
+extern int invalidate_inode_pages3(struct address_space *mapping);
+extern int invalidate_inode_pages3_range(struct address_space *mapping,
+					 pgoff_t start, pgoff_t end);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
diff --git a/mm/memory.c b/mm/memory.c
index 67dcecb..2d376a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2354,6 +2354,8 @@ retry:
 		}
 	}
 
+	lock_page(new_page);
+
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	/*
 	 * For a file-backed vma, someone could have truncated or otherwise
@@ -2362,6 +2364,7 @@ retry:
 	 */
 	if (mapping && unlikely(sequence != mapping->truncate_count)) {
 		pte_unmap_unlock(page_table, ptl);
+		unlock_page(new_page);
 		page_cache_release(new_page);
 		cond_resched();
 		sequence = mapping->truncate_count;
@@ -2393,9 +2396,11 @@ retry:
 			inc_mm_counter(mm, anon_rss);
 			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			unlock_page(new_page);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			unlock_page(new_page);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);
@@ -2403,6 +2408,7 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		unlock_page(new_page);
 		page_cache_release(new_page);
 		goto unlock;
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index 2599770..7728e45 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -408,3 +408,88 @@ int invalidate_inode_pages2(struct address_space *mapping)
 	return invalidate_inode_pages2_range(mapping, 0, -1);
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
+
+/**
+ * invalidate_inode_pages3_range - remove range of pages from an address_space
+ * @mapping: the address_space
+ * @start: the page offset 'from' which to invalidate
+ * @end: the page offset 'to' which to invalidate (inclusive)
+ *
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
+ *
+ * Returns -EIO if any pages could not be invalidated.
+ */
+int invalidate_inode_pages3_range(struct address_space *mapping,
+				  pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	pgoff_t next;
+	int i;
+	int ret = 0;
+	int wrapped = 0;
+
+	pagevec_init(&pvec, 0);
+	next = start;
+	while (next <= end && !wrapped &&
+		pagevec_lookup(&pvec, mapping, next,
+			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			pgoff_t page_index;
+
+			lock_page(page);
+			if (page->mapping != mapping) {
+				unlock_page(page);
+				continue;
+			}
+			page_index = page->index;
+			next = page_index + 1;
+			if (next == 0)
+				wrapped = 1;
+			if (page_index > end) {
+				unlock_page(page);
+				break;
+			}
+			wait_on_page_writeback(page);
+			while (page_mapped(page)) {
+				unmap_mapping_range(mapping,
+				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
+				  PAGE_CACHE_SIZE, 0);
+			}
+			if (!invalidate_complete_page2(mapping, page))
+				ret = -EIO;
+			else {
+				/*
+				 * Update the truncate_count.  This should
+				 * prevent any threads in do_no_page()
+				 * who found this page in the page cache
+				 * from using it.
+				 */
+				spin_lock(&mapping->i_mmap_lock);
+				mapping->truncate_count++;
+				spin_unlock(&mapping->i_mmap_lock);
+			}
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(invalidate_inode_pages3_range);
+
+/**
+ * invalidate_inode_pages3 - remove all pages from an address_space
+ * @mapping: the address_space
+ *
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
+ *
+ * Returns -EIO if any pages could not be invalidated.
+ */
+int invalidate_inode_pages3(struct address_space *mapping)
+{
+	return invalidate_inode_pages3_range(mapping, 0, -1);
+}
+EXPORT_SYMBOL_GPL(invalidate_inode_pages3);