From: Jeff Moyer <jmoyer@redhat.com>
Date: Thu, 25 Mar 2010 16:53:12 -0400
Subject: [block] cfq-iosched: propagate down request sync flag
Message-id: <1269535995-21213-3-git-send-email-jmoyer@redhat.com>
Patchwork-id: 23755
O-Subject: [RHEL5 PATCH 2/5] block: Propagate down request sync flag
Bugzilla: 574285
RH-Acked-by: Josef Bacik <josef@redhat.com>
RH-Acked-by: Vivek Goyal <vgoyal@redhat.com>
RH-Acked-by: Jerome Marchand <jmarchan@redhat.com>

There was a performance regression reported against RHEL 5.5 for a workload
that issued O_DIRECT write I/O from two dd processes to two separate
partitions on a single disk.  The reproducer is relatively simple, just
something like:

# dd if=/dev/zero of=/dev/sdb1 bs=4096 count=10000 oflag=direct & \
  dd if=/dev/zero of=/dev/sdb2 bs=4096 count=10000 oflag=direct

Throughput dropped (in my setup) by 18%.  The following RHEL5 patch was
tracked down as the source of the regression:

  commit 78e25ae5a8f85ea09bd170a5c1b4d28bc8a0e30c
  Author: Jeff Moyer <jmoyer@redhat.com>
  Date:   Wed Nov 25 20:13:35 2009 -0500

    [block] cfq-iosched: get rid of cfqq hash

Understanding the reason for this requires a little bit of background.

CFQ differentiates between two types of I/O: sync and async.  Sync I/O
is I/O that an application is likely actively waiting for, such as
read requests, or O_DIRECT reads or writes.  Async I/O is typically
background write-out done by some kernel thread (pdflush, for example).

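As a rough user-space illustration (not part of the patch; the file name,
alignment, and sizes below are arbitrary), an O_DIRECT write is one the
application blocks on until the device completes it, which is why CFQ should
treat it as sync I/O; a plain buffered write, by contrast, just dirties the
page cache and is written back later by pdflush:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      void *buf;
      int fd;

      /* O_DIRECT buffers must be suitably aligned. */
      if (posix_memalign(&buf, 4096, 4096))
          return 1;
      memset(buf, 0, 4096);

      fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      if (fd < 0)
          return 1;

      /* The caller waits here until the I/O reaches the device. */
      if (write(fd, buf, 4096) != 4096)
          return 1;

      close(fd);
      free(buf);
      return 0;
  }
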
Now, if we look closely at the code path for O_DIRECT writes, we can see
that the request sent down to the I/O scheduler was not actually marked as
being synchronous.  This means that O_DIRECT WRITES were treated as async
I/O by CFQ!

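For illustration only (the bit values below mirror the 2.6.18-era
<linux/bio.h> and <linux/fs.h> definitions and should be treated as
assumptions), this toy program shows why both hunks of the diff below are
needed: the old cfq_set_request() check compared the request's data
direction, which is just the READ/WRITE bit, against WRITE_SYNC, which also
carries the sync bit, so it could never fire for a write; and the sync bit
was never on the request anyway, because get_request() was only handed the
direction:

  #include <stdio.h>

  #define READ            0
  #define BIO_RW          0       /* bit 0: 0 = read, 1 = write */
  #define BIO_RW_SYNC     4       /* bit 4: the caller is waiting on this I/O */
  #define WRITE_SYNC      ((1 << BIO_RW) | (1 << BIO_RW_SYNC))

  /* Direction only, mirroring rq_data_dir(): everything but bit 0 is lost. */
  #define data_dir(flags) ((flags) & 1)

  int main(void)
  {
      int flags = WRITE_SYNC;             /* an O_DIRECT write */
      int rw = data_dir(flags);           /* == 1; the sync bit is gone */

      /* Old check in cfq_set_request(): never true for any write. */
      printf("old is_sync = %d\n", rw == READ || rw == WRITE_SYNC);

      /* With the sync flag propagated, the classification can use it,
       * which is what rq_is_sync() does on the request side. */
      printf("new is_sync = %d\n",
             rw == READ || (flags & (1 << BIO_RW_SYNC)) != 0);
      return 0;
  }

Compiled and run, it prints old is_sync = 0 and new is_sync = 1 for the
sync write.
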
Before the introduction of the above patch, all async I/Os were funneled
into a single cfq_queue (the scheduling entity of CFQ).  Thus, two dd
processes accessing two separate areas of the disk (say one per partition)
would be scheduled together (meaning that during a dispatch cycle, up to 4
I/Os would be sent to the device).  All of the I/O went into the same queue,
so there was no waiting on other queues in the system.  So long as the storage
could keep up with the resulting stream of fairly random I/O, there was no
problem.

With the introduction of the patch to get rid of the cfq_queue hash,
there is no longer a single cfq_queue for all async I/O on the system;
instead, there is one per process.  Couple that with the fact that O_DIRECT
I/O was being treated as async I/O and we have the source of the regression.
Each cfqq (even an async one) gets a time slice in which to submit I/O to
the device.  By splitting the I/O across two cfqqs, we were essentially
limiting the number of I/Os submitted to the device at a time to 1 (instead
of the 2 it would have been prior to the hash-removal patch).

Unfortunately, that's not all that is wrong!  In my backport of the
patch to get rid of the cfqq hash, I kept treating O_DIRECT writes as
async I/O when a request is submitted to the block layer (as they had
been treated all along).  However, when a subsequent bio comes in, we
look for a merge candidate, and that code path treated O_DIRECT writes
as sync I/O, so no merge candidate would ever be found.  The fix for this is
to backport the following upstream commit:

  commit 7749a8d423c483a51983b666613acda1a4dd9c1b
  Author: Jens Axboe <jens.axboe@oracle.com>
  Date:   Wed Dec 13 13:02:26 2006 +0100

    [PATCH] Propagate down request sync flag

    We need to do this, otherwise the io schedulers don't get access to the
    sync flag. Then they cannot tell the difference between a regular write
    and an O_DIRECT write, which can cause a performance loss.

    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Be warned: with both patches in place we still have this performance
regression!  However, now the regression is expected behaviour.  The bug
reporter cited this as a possible source of problems for database workloads.
However, databases don't drive queue depths of 1.  I provided tuning
information in case the bug reporter actually has a use case that matches
the dd test provided.  I also (indirectly) encouraged debate on the subject,
but the reporter has gone silent.

So, I submit this patch to resolve the issue of O_DIRECT WRITES not being
treated as sync I/O by CFQ.  This resolves bug 574285.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 24d1875..5fb1fc2 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2366,7 +2366,7 @@ cfq_set_request(request_queue_t *q, struct request *rq, struct bio *bio,
 	struct task_struct *tsk = current;
 	struct cfq_io_context *cic;
 	const int rw = rq_data_dir(rq);
-	const int is_sync = (rw == READ || rw == WRITE_SYNC);
+	const int is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
 	struct cfq_rq *crq;
 	unsigned long flags;
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 4dfcb52..a0f37a3 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -2171,15 +2171,16 @@ static void freed_request(request_queue_t *q, int rw, int priv)
  * Returns NULL on failure, with queue_lock held.
  * Returns !NULL on success, with queue_lock *not held*.
  */
-static struct request *get_request(request_queue_t *q, int rw, struct bio *bio,
-				   gfp_t gfp_mask)
+static struct request *get_request(request_queue_t *q, int rw_flags,
+				   struct bio *bio, gfp_t gfp_mask)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
+	const int rw = rw_flags & 0x01;
 	int may_queue, priv;
 
-	may_queue = elv_may_queue(q, rw, bio);
+	may_queue = elv_may_queue(q, rw_flags, bio);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
@@ -2226,10 +2227,10 @@ static struct request *get_request(request_queue_t *q, int rw, struct bio *bio,
 		rl->elvpriv++;
 
 	if (blk_queue_io_stat(q))
-		rw |= REQ_IO_STAT;
+		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw, bio, priv, gfp_mask);
+	rq = blk_alloc_request(q, rw_flags, bio, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -2278,12 +2279,13 @@ out:
  *
  * Called with q->queue_lock held, and returns with it unlocked.
  */
-static struct request *get_request_wait(request_queue_t *q, int rw,
+static struct request *get_request_wait(request_queue_t *q, int rw_flags,
 					struct bio *bio)
 {
+	const int rw = rw_flags & 0x01;
 	struct request *rq;
 
-	rq = get_request(q, rw, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;
@@ -2291,7 +2293,7 @@ static struct request *get_request_wait(request_queue_t *q, int rw,
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
 				TASK_UNINTERRUPTIBLE);
 
-		rq = get_request(q, rw, bio, GFP_NOIO);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO);
 
 		if (!rq) {
 			struct io_context *ioc;
@@ -3005,6 +3007,7 @@ static int __make_request(request_queue_t *q, struct bio *bio)
 	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, err, sync;
 	unsigned short prio;
 	sector_t sector;
+	int rw_flags;
 
 	sector = bio->bi_sector;
 	nr_sectors = bio_sectors(bio);
@@ -3087,10 +3090,19 @@ static int __make_request(request_queue_t *q, struct bio *bio)
 
 get_rq:
 	/*
+	 * This sync check and mask will be re-done in init_request_from_bio(),
+	 * but we need to set it earlier to expose the sync flag to the
+	 * rq allocator and io schedulers.
+	 */
+	rw_flags = bio_data_dir(bio);
+	if (sync)
+		rw_flags |= REQ_RW_SYNC;
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
-	req = get_request_wait(q, rw, bio);
+	req = get_request_wait(q, rw_flags, bio);
 
 	/*
 	 * After dropping the lock and possibly sleeping here, our request