Skip to content

Commit e836007

Browse files
jbaron-lgtmliu-song-6
authored andcommitted
md/raid0: add discard support for the 'original' layout
We've found that using raid0 with the 'original' layout and discard enabled with different disk sizes (such that at least two zones are created) can result in data corruption. This is due to the fact that the discard handling in 'raid0_handle_discard()' assumes the 'alternate' layout. We've seen this corruption using ext4 but other filesystems are likely susceptible as well. More specifically, while multiple zones are necessary to create the corruption, the corruption may not occur with multiple zones if they layout in such a way the layout matches what the 'alternate' layout would have produced. Thus, not all raid0 devices with the 'original' layout, different size disks and discard enabled will encounter this corruption. The 3.14 kernel inadvertently changed the raid0 disk layout for different size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the same raid0 array could corrupt data. This lead to the creation of the 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout (to match the post 3.14 layout) in the 5.4 kernel time frame and an option to tell the kernel which layout to use (since it couldn't be autodetected). However, when the 'original' layout was added back to 5.4 discard support for the 'original' layout was not added leading this issue. I've been able to reliably reproduce the corruption with the following test case: 1. create raid0 array with different size disks using original layout 2. mkfs 3. mount -o discard 4. create lots of files 5. remove 1/2 the files 6. fstrim -a (or just the mount point for the raid0 array) 7. umount 8. fsck -fn /dev/md0 (spews all sorts of corruptions) Let's fix this by adding proper discard support to the 'original' layout. The fix 'maps' the 'original' layout disks to the order in which they are read/written such that we can compare the disks in the same way that the current 'alternate' layout does. A 'disk_shift' field is added to 'struct strip_zone'. This could be computed on the fly in raid0_handle_discard() but by adding this field, we save some computation in the discard path. Note we could also potentially fix this by re-ordering the disks in the zones that follow the first one, and then always read/writing them using the 'alternate' layout. However, that is seen as a more substantial change, and we are attempting the least invasive fix at this time to remedy the corruption. I've verified the change using the reproducer mentioned above. Typically, the corruption is seen after less than 3 iterations, while the patch has run 500+ iterations. Cc: NeilBrown <neilb@suse.de> Cc: Song Liu <song@kernel.org> Fixes: c84a137 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Cc: stable@vger.kernel.org Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
1 parent 6e34e78 commit e836007

2 files changed

Lines changed: 55 additions & 8 deletions

File tree

drivers/md/raid0.c

Lines changed: 54 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -270,6 +270,18 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
270270
goto abort;
271271
}
272272

273+
if (conf->layout == RAID0_ORIG_LAYOUT) {
274+
for (i = 1; i < conf->nr_strip_zones; i++) {
275+
sector_t first_sector = conf->strip_zone[i-1].zone_end;
276+
277+
sector_div(first_sector, mddev->chunk_sectors);
278+
zone = conf->strip_zone + i;
279+
/* disk_shift is first disk index used in the zone */
280+
zone->disk_shift = sector_div(first_sector,
281+
zone->nb_dev);
282+
}
283+
}
284+
273285
pr_debug("md/raid0:%s: done.\n", mdname(mddev));
274286
*private_conf = conf;
275287

@@ -431,6 +443,20 @@ static int raid0_run(struct mddev *mddev)
431443
return ret;
432444
}
433445

446+
/*
447+
* Convert disk_index to the disk order in which it is read/written.
448+
* For example, if we have 4 disks, they are numbered 0,1,2,3. If we
449+
* write the disks starting at disk 3, then the read/write order would
450+
* be disk 3, then 0, then 1, and then disk 2 and we want map_disk_shift()
451+
* to map the disks as follows 0,1,2,3 => 1,2,3,0. So disk 0 would map
452+
* to 1, 1 to 2, 2 to 3, and 3 to 0. That way we can compare disks in
453+
* that 'output' space to understand the read/write disk ordering.
454+
*/
455+
static int map_disk_shift(int disk_index, int num_disks, int disk_shift)
456+
{
457+
return ((disk_index + num_disks - disk_shift) % num_disks);
458+
}
459+
434460
static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
435461
{
436462
struct r0conf *conf = mddev->private;
@@ -444,7 +470,9 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
444470
sector_t end_disk_offset;
445471
unsigned int end_disk_index;
446472
unsigned int disk;
473+
sector_t orig_start, orig_end;
447474

475+
orig_start = start;
448476
zone = find_zone(conf, &start);
449477

450478
if (bio_end_sector(bio) > zone->zone_end) {
@@ -458,6 +486,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
458486
} else
459487
end = bio_end_sector(bio);
460488

489+
orig_end = end;
461490
if (zone != conf->strip_zone)
462491
end = end - zone[-1].zone_end;
463492

@@ -469,32 +498,49 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
469498
last_stripe_index = end;
470499
sector_div(last_stripe_index, stripe_size);
471500

472-
start_disk_index = (int)(start - first_stripe_index * stripe_size) /
473-
mddev->chunk_sectors;
501+
/* In the first zone the original and alternate layouts are the same */
502+
if ((conf->layout == RAID0_ORIG_LAYOUT) && (zone != conf->strip_zone)) {
503+
sector_div(orig_start, mddev->chunk_sectors);
504+
start_disk_index = sector_div(orig_start, zone->nb_dev);
505+
start_disk_index = map_disk_shift(start_disk_index,
506+
zone->nb_dev,
507+
zone->disk_shift);
508+
sector_div(orig_end, mddev->chunk_sectors);
509+
end_disk_index = sector_div(orig_end, zone->nb_dev);
510+
end_disk_index = map_disk_shift(end_disk_index,
511+
zone->nb_dev, zone->disk_shift);
512+
} else {
513+
start_disk_index = (int)(start - first_stripe_index * stripe_size) /
514+
mddev->chunk_sectors;
515+
end_disk_index = (int)(end - last_stripe_index * stripe_size) /
516+
mddev->chunk_sectors;
517+
}
474518
start_disk_offset = ((int)(start - first_stripe_index * stripe_size) %
475519
mddev->chunk_sectors) +
476520
first_stripe_index * mddev->chunk_sectors;
477-
end_disk_index = (int)(end - last_stripe_index * stripe_size) /
478-
mddev->chunk_sectors;
479521
end_disk_offset = ((int)(end - last_stripe_index * stripe_size) %
480522
mddev->chunk_sectors) +
481523
last_stripe_index * mddev->chunk_sectors;
482524

483525
for (disk = 0; disk < zone->nb_dev; disk++) {
484526
sector_t dev_start, dev_end;
485527
struct md_rdev *rdev;
528+
int compare_disk;
529+
530+
compare_disk = map_disk_shift(disk, zone->nb_dev,
531+
zone->disk_shift);
486532

487-
if (disk < start_disk_index)
533+
if (compare_disk < start_disk_index)
488534
dev_start = (first_stripe_index + 1) *
489535
mddev->chunk_sectors;
490-
else if (disk > start_disk_index)
536+
else if (compare_disk > start_disk_index)
491537
dev_start = first_stripe_index * mddev->chunk_sectors;
492538
else
493539
dev_start = start_disk_offset;
494540

495-
if (disk < end_disk_index)
541+
if (compare_disk < end_disk_index)
496542
dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
497-
else if (disk > end_disk_index)
543+
else if (compare_disk > end_disk_index)
498544
dev_end = last_stripe_index * mddev->chunk_sectors;
499545
else
500546
dev_end = end_disk_offset;

drivers/md/raid0.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ struct strip_zone {
66
sector_t zone_end; /* Start of the next zone (in sectors) */
77
sector_t dev_start; /* Zone offset in real dev (in sectors) */
88
int nb_dev; /* # of devices attached to the zone */
9+
int disk_shift; /* start disk for the original layout */
910
};
1011

1112
/* Linux 3.14 (20d0189b101) made an unintended change to

0 commit comments

Comments
 (0)