Sophie: kernel-2.6.18-194.11.1.el5 src

kernel-2.6.18-194.11.1.el5.src.rpm

From: Jeff Moyer <jmoyer@redhat.com>
Date: Fri, 10 Jul 2009 11:01:13 -0400
Subject: [block] protect the per-gendisk partition array with rcu
Message-id: x491voo7fba.fsf@segfault.boston.devel.redhat.com
O-Subject: [rhel5.4 patch] block: protect the per-gendisk partition array with rcu
Bugzilla: 495866
RH-Acked-by: Jerome Marchand <jmarchan@redhat.com>

Hi,

In bugzilla 495866, a race between reading /proc/partitions and
rescanning partitions was seen in the wild (in fact, there are 5 Issue
Tracker tickets associated with this bugzilla), which results in an oops
with the following signature:

Unable to handle kernel NULL pointer dereference at 0000000000000008
RIP:
<ffffffff80255e23>{show_partition+149}
PML4 10b3ef067 PGD 10ac36067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: 8021q nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc
lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core
ide_dump cciss_dump scsi_dump diskdump zlib_deflate emcpdm(U) emcpgpx(U)
emcpmpx(U) emcp(U) button battery ac ohci_hcd hw_random k8_edac edac_mc
tg3 e1000(U) bonding(U) floppy dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod qla6312 qla2400 qla2300 qla2xxx scsi_transport_fc cciss sd_mod
scsi_mod
Pid: 6732, comm: cmaperfd Tainted: P      2.6.9-67.0.22.ELsmp
RIP: 0010:[<ffffffff80255e23>] <ffffffff80255e23>{show_partition+149}
RSP: 0018:000001010b39fe68  EFLAGS: 00010293
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffff8033c3f7 RSI: fffffffffffffffe RDI: 000001010b39fe68
RBP: 00000102fc711a00 R08: 00000000ffffffff R09: 0000000000000006
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 000001010cae5280 R14: 00000000000001a8 R15: 0000000000000010
FS:  0000002a95b767e0(0000) GS:ffffffff804f3700(005b)
knlGS:00000000f7fba6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000002fc7be000 CR4: 00000000000006e0
Process cmaperfd (pid: 6732, threadinfo 000001010b39e000, task
000001010b5697f0)
Stack: 0000000031796400 0000000000000246 00000000f7fb9000
0000000000000246
       0000000000000246 00000102fc711a00 000001010cae5280
       0000000000000000
       00000000000003f0 ffffffff80198a96
Call Trace:<ffffffff80198a96>{seq_read+445}
<ffffffff8017b0f4>{vfs_read+207}
       <ffffffff8017b350>{sys_read+69}
       <ffffffff80126737>{cstar_do_call+27}

The posted reproducer does the following:

# while true; do cat /proc/partitions > /dev/null; done &
# while true; do hdparm -z /dev/sda; done

I was able to reproduce this issue on my test system within seconds.

Upon further inspection of the code, it appears to me that
/proc/diskstats suffers from the same potential race as
/proc/partitions, and I verified this empirically, so the attached patch
protects that path as well.

The full upstream commit for this is this:

---[cut here]---
commit e71bf0d0ee89e51b92776391c5634938236977d5
Author: Tejun Heo <tj@kernel.org>
Date:   Wed Sep 3 09:03:02 2008 +0200

    block: fix disk->part[] dereferencing race

    disk->part[] is protected by its matching bdev's lock.  However,
    non-critical accesses like collecting stats and printing out sysfs
    and proc information used to be performed without any locking.  As
    partitions can come and go dynamically, partitions can go away
    underneath those non-critical accesses.  As some of those accesses
    are writes, this theoretically can lead to silent corruption.

    This patch fixes the race by using RCU for the partition array and
    dev reference counter to hold partitions.

    * Rename disk->part[] to disk->__part[] to make sure no one outside
      genhd layer proper accesses it directly.

    * Use RCU for disk->__part[] dereferencing.

    * Implement disk_{get|put}_part() which can be used to get and put
      partitions from gendisk respectively.

    * Iterators are implemented to help iterate through all partitions
      safely.

    * Functions which require RCU readlock are marked with _rcu suffix.

    * Use disk_put_part() in __blkdev_put() instead of directly putting
      the contained kobject.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

---[cut here]---

Tejun's patch seems a bit heavy-handed for 5.4 this late in the cycle,
so I'd prefer to go with the attached minimal backport.  Running a
kernel with the attached patch, I was unable to reproduce the problem
after 16 hours.

Comments, as always, are appreciated.

Cheers,
Jeff

diff --git a/block/genhd.c b/block/genhd.c
index 1684f86..655f444 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -245,6 +245,7 @@ static int show_partition(struct seq_file *part, void *v)
 	struct gendisk *sgp = v;
 	int n;
 	char buf[BDEVNAME_SIZE];
+	struct hd_struct *hd_part;
 
 	if (&sgp->kobj.entry == block_subsys.kset.list.next)
 		seq_puts(part, "major minor  #blocks  name\n\n");
@@ -262,14 +263,19 @@ static int show_partition(struct seq_file *part, void *v)
 		(unsigned long long)get_capacity(sgp) >> 1,
 		disk_name(sgp, 0, buf));
 	for (n = 0; n < sgp->minors - 1; n++) {
-		if (!sgp->part[n])
-			continue;
-		if (sgp->part[n]->nr_sects == 0)
+		rcu_read_lock();
+		hd_part = rcu_dereference(sgp->part[n]);
+
+		if ((!hd_part) || (hd_part->nr_sects == 0)) {
+			rcu_read_unlock();
 			continue;
+		}
 		seq_printf(part, "%4d  %4d %10llu %s\n",
 			sgp->major, n + 1 + sgp->first_minor,
-			(unsigned long long)sgp->part[n]->nr_sects >> 1 ,
+			(unsigned long long)hd_part->nr_sects >> 1 ,
 			disk_name(sgp, n + 1, buf));
+
+		rcu_read_unlock();
 	}
 
 	return 0;
@@ -579,10 +585,14 @@ static int diskstats_show(struct seq_file *s, void *v)
 
 	/* now show all non-0 size partitions of it */
 	for (n = 0; n < gp->minors - 1; n++) {
-		struct hd_struct *hd = gp->part[n];
+		struct hd_struct *hd;
 
-		if (!hd || !hd->nr_sects)
+		rcu_read_lock();
+		hd = rcu_dereference(gp->part[n]);
+		if (!hd || !hd->nr_sects) {
+			rcu_read_unlock();
 			continue;
+		}
 
 		preempt_disable();
 		part_round_stats(hd);
@@ -603,6 +613,7 @@ static int diskstats_show(struct seq_file *s, void *v)
 			   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
 			   jiffies_to_msecs(part_stat_read(hd, time_in_queue))
 			);
+		rcu_read_unlock();
 	}
  
 	return 0;
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index b8c4ab0..c343118 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -343,6 +343,9 @@ void delete_partition(struct gendisk *disk, int part)
 		return;
 	if (!p->nr_sects)
 		return;
+	rcu_assign_pointer(disk->part[part-1], NULL);
+	synchronize_rcu();
+
 	disk->part[part-1] = NULL;
 	p->start_sect = 0;
 	p->nr_sects = 0;