From: Jeff Moyer <jmoyer@redhat.com> Date: Fri, 10 Jul 2009 11:01:13 -0400 Subject: [block] protect the per-gendisk partition array with rcu Message-id: x491voo7fba.fsf@segfault.boston.devel.redhat.com O-Subject: [rhel5.4 patch] block: protect the per-gendisk partition array with rcu Bugzilla: 495866 RH-Acked-by: Jerome Marchand <jmarchan@redhat.com> Hi, In bugzilla 495866, a race between reading /proc/partitions and rescanning partitions was seen in the wild (in fact, there are 5 Issue Tracker tickets associated with this bugzilla), which results in an oops with the following signature: Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: <ffffffff80255e23>{show_partition+149} PML4 10b3ef067 PGD 10ac36067 PMD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: 8021q nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ide_dump cciss_dump scsi_dump diskdump zlib_deflate emcpdm(U) emcpgpx(U) emcpmpx(U) emcp(U) button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 e1000(U) bonding(U) floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla6312 qla2400 qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod Pid: 6732, comm: cmaperfd Tainted: P 2.6.9-67.0.22.ELsmp RIP: 0010:[<ffffffff80255e23>] <ffffffff80255e23>{show_partition+149} RSP: 0018:000001010b39fe68 EFLAGS: 00010293 RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffffffff8033c3f7 RSI: fffffffffffffffe RDI: 000001010b39fe68 RBP: 00000102fc711a00 R08: 00000000ffffffff R09: 0000000000000006 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 000001010cae5280 R14: 00000000000001a8 R15: 0000000000000010 FS: 0000002a95b767e0(0000) GS:ffffffff804f3700(005b) knlGS:00000000f7fba6c0 CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b CR2: 0000000000000008 CR3: 00000002fc7be000 CR4: 00000000000006e0 Process cmaperfd (pid: 6732, threadinfo 000001010b39e000, task 000001010b5697f0) Stack: 0000000031796400 0000000000000246 00000000f7fb9000 0000000000000246 0000000000000246 00000102fc711a00 000001010cae5280 0000000000000000 00000000000003f0 ffffffff80198a96 Call Trace:<ffffffff80198a96>{seq_read+445} <ffffffff8017b0f4>{vfs_read+207} <ffffffff8017b350>{sys_read+69} <ffffffff80126737>{cstar_do_call+27} The posted reproducer does the following: # while true; do cat /proc/partitions > /dev/null; done & # while true; do hdparm -z /dev/sda; done I was able to reproduce this issue on my test system within seconds. Upon further inspection of the code, it appears to me that /proc/diskstats suffers from the same potential race as /proc/partitions, and I verified this empirically, so the attached patch protects that path as well. The full upstream commit for this is this: ---[cut here]--- commit e71bf0d0ee89e51b92776391c5634938236977d5 Author: Tejun Heo <tj@kernel.org> Date: Wed Sep 3 09:03:02 2008 +0200 block: fix disk->part[] dereferencing race disk->part[] is protected by its matching bdev's lock. However, non-critical accesses like collecting stats and printing out sysfs and proc information used to be performed without any locking. As partitions can come and go dynamically, partitions can go away underneath those non-critical accesses. As some of those accesses are writes, this theoretically can lead to silent corruption. This patch fixes the race by using RCU for the partition array and dev reference counter to hold partitions. * Rename disk->part[] to disk->__part[] to make sure no one outside genhd layer proper accesses it directly. * Use RCU for disk->__part[] dereferencing. * Implement disk_{get|put}_part() which can be used to get and put partitions from gendisk respectively. * Iterators are implemented to help iterate through all partitions safely. * Functions which require RCU readlock are marked with _rcu suffix. * Use disk_put_part() in __blkdev_put() instead of directly putting the contained kobject. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> ---[cut here]--- Tejun's patch seems a bit heavy-handed for 5.4 this late in the cycle, so I'd prefer to go with the attached minimal backport. Running a kernel with the attached patch, I was unable to reproduce the problem after 16 hours. Comments, as always, are appreciated. Cheers, Jeff diff --git a/block/genhd.c b/block/genhd.c index 1684f86..655f444 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -245,6 +245,7 @@ static int show_partition(struct seq_file *part, void *v) struct gendisk *sgp = v; int n; char buf[BDEVNAME_SIZE]; + struct hd_struct *hd_part; if (&sgp->kobj.entry == block_subsys.kset.list.next) seq_puts(part, "major minor #blocks name\n\n"); @@ -262,14 +263,19 @@ static int show_partition(struct seq_file *part, void *v) (unsigned long long)get_capacity(sgp) >> 1, disk_name(sgp, 0, buf)); for (n = 0; n < sgp->minors - 1; n++) { - if (!sgp->part[n]) - continue; - if (sgp->part[n]->nr_sects == 0) + rcu_read_lock(); + hd_part = rcu_dereference(sgp->part[n]); + + if ((!hd_part) || (hd_part->nr_sects == 0)) { + rcu_read_unlock(); continue; + } seq_printf(part, "%4d %4d %10llu %s\n", sgp->major, n + 1 + sgp->first_minor, - (unsigned long long)sgp->part[n]->nr_sects >> 1 , + (unsigned long long)hd_part->nr_sects >> 1 , disk_name(sgp, n + 1, buf)); + + rcu_read_unlock(); } return 0; @@ -579,10 +585,14 @@ static int diskstats_show(struct seq_file *s, void *v) /* now show all non-0 size partitions of it */ for (n = 0; n < gp->minors - 1; n++) { - struct hd_struct *hd = gp->part[n]; + struct hd_struct *hd; - if (!hd || !hd->nr_sects) + rcu_read_lock(); + hd = rcu_dereference(gp->part[n]); + if (!hd || !hd->nr_sects) { + rcu_read_unlock(); continue; + } preempt_disable(); part_round_stats(hd); @@ -603,6 +613,7 @@ static int diskstats_show(struct seq_file *s, void *v) jiffies_to_msecs(part_stat_read(hd, io_ticks)), jiffies_to_msecs(part_stat_read(hd, time_in_queue)) ); + rcu_read_unlock(); } return 0; diff --git a/fs/partitions/check.c b/fs/partitions/check.c index b8c4ab0..c343118 100644 --- a/fs/partitions/check.c +++ b/fs/partitions/check.c @@ -343,6 +343,9 @@ void delete_partition(struct gendisk *disk, int part) return; if (!p->nr_sects) return; + rcu_assign_pointer(disk->part[part-1], NULL); + synchronize_rcu(); + disk->part[part-1] = NULL; p->start_sect = 0; p->nr_sects = 0;