kernel-2.6.18-238.el5.src.rpm

From: Brad Peters <bpeters@redhat.com>
Date: Mon, 4 Aug 2008 20:00:41 -0400
Subject: [fs] jbd: fix races that lead to EIO for O_DIRECT
Message-id: 20080805000041.1004.22941.sendpatchset@squad5-lp1.lab.bos.redhat.com
O-Subject: [PATCH RHEL5.3] Journaling Block Device (jbd) races lead to EIO for O_DIRECT
Bugzilla: 446599
RH-Acked-by: David Howells <dhowells@redhat.com>
RH-Acked-by: Josef Bacik <jbacik@redhat.com>

RHBZ#:
======
https://bugzilla.redhat.com/show_bug.cgi?id=446599

Description:
===========
Bug fix

A race condition in the journaling block device (jbd) layer leads to
occasional EIO errors.  journal_try_to_free_buffers() can race with the jbd
transaction commit: while the commit code waits for a data buffer to be
flushed to disk, it holds a reference to that buffer, so
try_to_free_buffers() fails.

With this patch, if the caller passes GFP_KERNEL, indicating that the call
may block, and try_to_free_buffers() fails, we wait for
journal_commit_transaction() to finish committing the current transaction
and then try to free those buffers again with the journal locked.
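
For illustration only, the retry logic described above can be modelled in
user space.  This is a minimal sketch, not kernel code: the MODEL_GFP_* bits,
struct model_page and the model_* helpers are hypothetical stand-ins, and
"waiting for the commit" is reduced to dropping the commit's references.

```c
/* Hypothetical stand-ins for the kernel's gfp bits; values are illustrative. */
#define MODEL_GFP_WAIT   0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_FS)

/* A page whose buffers may still be referenced by a committing transaction. */
struct model_page {
	int commit_refs;	/* references held by the commit code */
};

/* Models try_to_free_buffers(): succeeds only if nothing holds a reference. */
static int model_try_to_free(struct model_page *page)
{
	return page->commit_refs == 0;
}

/* Models waiting for the commit: once it is done, its references are gone. */
static void model_wait_for_commit(struct model_page *page)
{
	page->commit_refs = 0;
}

/* Returns 1 if the buffers were freed, 0 otherwise. */
static int model_release(struct model_page *page, unsigned int gfp_mask)
{
	int ret = model_try_to_free(page);

	/* Retry only when the caller said it may block. */
	if (ret == 0 && (gfp_mask & MODEL_GFP_WAIT) && (gfp_mask & MODEL_GFP_FS)) {
		model_wait_for_commit(page);
		ret = model_try_to_free(page);
	}
	return ret;
}
```

In this model a caller that passes MODEL_GFP_KERNEL gets a second attempt
after the commit has let go of the buffers, while a caller that cannot block
still sees the first failure.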

RHEL Version Found:
================
RHEL 5.1

kABI Status:
============
No symbols were harmed.

Brew:
=====
Built on all platforms.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1385528

Upstream Status:
================
Upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3f31fddfa26b7594b44ff2b34f9a04ba409e0f91

Test Status:
============
Tested using the inline test program, which stresses the filesystem through
multiple user-space processes all writing into the same large file.  No
EIO errors were seen with the patch applied.
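
The test program itself is not reproduced here.  As a rough idea of the kind
of load described, the following is a hypothetical sketch, not the program
that was used: the function names, the 4096-byte block size and all
parameters are assumptions, and whether it triggers the race depends on the
filesystem, the journal mode and timing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLK 4096	/* assumed to satisfy O_DIRECT alignment requirements */

/* One writer: overwrite nblocks blocks of path, passes times.
 * Returns 0 on success or an errno value (e.g. EIO) on failure. */
static int writer(const char *path, int nblocks, int passes, int use_direct)
{
	void *buf;
	int err = 0;
	int fd = open(path, O_WRONLY | O_CREAT | (use_direct ? O_DIRECT : 0), 0600);

	if (fd < 0)
		return errno;
	if (posix_memalign(&buf, BLK, BLK) != 0) {
		close(fd);
		return ENOMEM;
	}
	memset(buf, 'x', BLK);
	for (int p = 0; p < passes && !err; p++) {
		for (int b = 0; b < nblocks && !err; b++) {
			ssize_t n = pwrite(fd, buf, BLK, (off_t)b * BLK);

			if (n != BLK)
				err = (n < 0) ? errno : EIO;
		}
	}
	free(buf);
	close(fd);
	return err;
}

/* Fork nprocs writers against the same file; return how many failed. */
static int stress_same_file(const char *path, int nprocs, int nblocks,
			    int passes, int use_direct)
{
	int started = 0, failed = 0;

	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0)
			_exit(writer(path, nblocks, passes, use_direct) ? 1 : 0);
		started++;
	}
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed++;
	}
	/* Writers that could not be forked count as failures too. */
	return failed + (nprocs - started);
}
```

A call such as stress_same_file(path, 8, 1024, 100, 1) against a file on the
filesystem under test would run eight O_DIRECT writers over the same range;
a non-zero return means at least one writer hit an error.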

Brad Peters 7/29/08
===============================================================

Brad Peters 1-978-392-1000 x 23183
IBM on-site partner.

Proposed Patch:
===============
This patch is based on 2.6.18-95.el5

diff --git a/fs/buffer.c b/fs/buffer.c
index 17ebeab..c3880ab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1595,9 +1595,8 @@ static void discard_buffer(struct buffer_head * bh)
  * Otherwise return zero.
  *
  * The @gfp_mask argument specifies whether I/O may be performed to release
- * this page (__GFP_IO), and whether the call may block (__GFP_WAIT).
+ * this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
  *
- * NOTE: @gfp_mask may go away, and this function may become non-blocking.
  */
 int try_to_release_page(struct page *page, gfp_t gfp_mask)
 {
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 664a740..08f4f90 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -472,7 +472,9 @@ void journal_commit_transaction(journal_t *journal)
 	 * transaction!  Now comes the tricky part: we need to write out
 	 * metadata.  Loop over the transaction's entire buffer list:
 	 */
+	spin_lock(&journal->j_state_lock);
 	commit_transaction->t_state = T_COMMIT;
+	spin_unlock(&journal->j_state_lock);
 
 	J_ASSERT(commit_transaction->t_nr_buffers <=
 		 commit_transaction->t_outstanding_credits);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 68eb45c..3b501ea 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1634,13 +1634,41 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction().
+ * The latter might still hold a reference to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * trying to free that buffer.
+ *
+ * Called without journal->j_state_lock held; it is taken and dropped here.
+ */
+static void journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+	transaction_t *transaction = NULL;
+	tid_t tid;
+
+	spin_lock(&journal->j_state_lock);
+	transaction = journal->j_committing_transaction;
+
+	if (!transaction){
+		spin_unlock(&journal->j_state_lock);
+		return;
+	}
+
+	tid = transaction->t_tid;
+	spin_unlock(&journal->j_state_lock);
+	log_wait_commit(journal, tid);
+}
 
 /** 
  * int journal_try_to_free_buffers() - try to free page buffers.
  * @journal: journal for operation
  * @page: to try and free
- * @unused_gfp_mask: unused
- *
+ * @gfp_mask: specifies whether the call may block
+ *		(__GFP_WAIT & __GFP_FS via GFP_KERNEL)
  * 
  * For all the buffers on this page,
  * if they are fully written out ordered data, move them onto BUF_CLEAN
@@ -1668,9 +1696,11 @@ out:
  * journal_try_to_free_buffer() is changing its state.  But that
  * cannot happen because we never reallocate freed data as metadata
  * while the data is part of a transaction.  Yes?
+ *
+ * Returns 0 on failure, 1 on success
  */
 int journal_try_to_free_buffers(journal_t *journal, 
-				struct page *page, gfp_t unused_gfp_mask)
+				struct page *page, gfp_t gfp_mask)
 {
 	struct buffer_head *head;
 	struct buffer_head *bh;
@@ -1699,7 +1729,28 @@ int journal_try_to_free_buffers(journal_t *journal,
 		if (buffer_jbd(bh))
 			goto busy;
 	} while ((bh = bh->b_this_page) != head);
+
 	ret = try_to_free_buffers(page);
+
+ 	/*
+	 * There are a number of places where journal_try_to_free_buffers()
+	 * could race with journal_commit_transaction(); the latter still
+	 * holds a reference to the buffers to be freed while processing them,
+	 * so try_to_free_buffers() fails to free those buffers. Some callers
+	 * of releasepage() require the page buffers to be dropped and
+	 * treat a failure to free as an error (such as generic_file_direct_IO()).
+	 *
+	 * So, if the caller of try_to_release_page() wants the synchronous
+	 * behaviour (i.e. make sure buffers are dropped upon return),
+	 * let's wait for the current transaction to finish flushing
+	 * dirty data buffers, then try to free those buffers again,
+	 * with the journal locked.
+	 */
+	if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
+		journal_wait_for_transaction_sync_data(journal);
+		ret = try_to_free_buffers(page);
+	}
+
 busy:
 	return ret;
 }