
Commit cc25df3

Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage when the inline issue code needs to dip
   back into the full file system code

 - Improve the auto integrity code, making it less deadlock prone

 - Speed up polled IO handling by manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the s390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
     - fix null-ptr-dereference regression for dm-raid0
     - fix IO hang for raid5 when the array is broken with IO inflight
     - remove legacy 1s delay to speed up system shutdown
     - change maintainer's email address
     - data can be lost if an array is created with devices of different
       logical block sizes; fix this problem and record the LBS of the
       array in the metadata
     - fix rcu protection for md_thread
     - fix mddev kobject lifetime regression
     - enable atomic writes for md-linear
     - some cleanups

 - bcache updates via Coly:
     - remove useless discard and cache device code
     - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation improvements, and memory alignment flexibility
   improvements

 - Set of prep and cleanup patches for ublk batching support. The actual
   batching hasn't been added yet, but this helps shrink the workload of
   getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segment offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2 parents 0abcfd8 + d211a28 commit cc25df3

108 files changed

Lines changed: 3019 additions & 1557 deletions


Documentation/ABI/testing/sysfs-block-bcache

Lines changed: 0 additions & 7 deletions

@@ -106,13 +106,6 @@ Description:
 		will be discarded from the cache. Should not be turned off with
 		writeback caching enabled.
 
-What:		/sys/block/<disk>/bcache/discard
-Date:		November 2010
-Contact:	Kent Overstreet <kent.overstreet@gmail.com>
-Description:
-		For a cache, a boolean allowing discard/TRIM to be turned off
-		or back on if the device supports it.
-
 What:		/sys/block/<disk>/bcache/bucket_size
 Date:		November 2010
 Contact:	Kent Overstreet <kent.overstreet@gmail.com>

Documentation/admin-guide/bcache.rst

Lines changed: 2 additions & 11 deletions

@@ -17,8 +17,7 @@ The latest bcache kernel code can be found from mainline Linux kernel:
 It's designed around the performance characteristics of SSDs - it only allocates
 in erase block sized buckets, and it uses a hybrid btree/log to track cached
 extents (which can be anywhere from a single sector to the bucket size). It's
-designed to avoid random writes at all costs; it fills up an erase block
-sequentially, then issues a discard before reusing it.
+designed to avoid random writes at all costs.
 
 Both writethrough and writeback caching are supported. Writeback defaults to
 off, but can be switched on and off arbitrarily at runtime. Bcache goes to
@@ -618,19 +617,11 @@ bucket_size
 cache_replacement_policy
   One of either lru, fifo or random.
 
-discard
-  Boolean; if on a discard/TRIM will be issued to each bucket before it is
-  reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
-  slow).
-
 freelist_percent
   Size of the freelist as a percentage of nbuckets. Can be written to to
   increase the number of buckets kept on the freelist, which lets you
   artificially reduce the size of the cache at runtime. Mostly for testing
-  purposes (i.e. testing how different size caches affect your hit rate), but
-  since buckets are discarded when they move on to the freelist will also make
-  the SSD's garbage collection easier by effectively giving it more reserved
-  space.
+  purposes (i.e. testing how different size caches affect your hit rate).
 
 io_errors
   Number of errors that have occurred, decayed by io_error_halflife.

Documentation/admin-guide/blockdev/zoned_loop.rst

Lines changed: 37 additions & 24 deletions

@@ -68,30 +68,43 @@ The options available for the add command can be listed by reading the
 In more details, the options that can be used with the "add" command are as
 follows.
 
-================ ===========================================================
-id               Device number (the X in /dev/zloopX).
-                 Default: automatically assigned.
-capacity_mb      Device total capacity in MiB. This is always rounded up to
-                 the nearest higher multiple of the zone size.
-                 Default: 16384 MiB (16 GiB).
-zone_size_mb     Device zone size in MiB. Default: 256 MiB.
-zone_capacity_mb Device zone capacity (must always be equal to or lower than
-                 the zone size. Default: zone size.
-conv_zones       Total number of conventioanl zones starting from sector 0.
-                 Default: 8.
-base_dir         Path to the base directory where to create the directory
-                 containing the zone files of the device.
-                 Default=/var/local/zloop.
-                 The device directory containing the zone files is always
-                 named with the device ID. E.g. the default zone file
-                 directory for /dev/zloop0 is /var/local/zloop/0.
-nr_queues        Number of I/O queues of the zoned block device. This value is
-                 always capped by the number of online CPUs
-                 Default: 1
-queue_depth      Maximum I/O queue depth per I/O queue.
-                 Default: 64
-buffered_io      Do buffered IOs instead of direct IOs (default: false)
-================ ===========================================================
+=================== =========================================================
+id                  Device number (the X in /dev/zloopX).
+                    Default: automatically assigned.
+capacity_mb         Device total capacity in MiB. This is always rounded up
+                    to the nearest higher multiple of the zone size.
+                    Default: 16384 MiB (16 GiB).
+zone_size_mb        Device zone size in MiB. Default: 256 MiB.
+zone_capacity_mb    Device zone capacity (must always be equal to or lower
+                    than the zone size. Default: zone size.
+conv_zones          Total number of conventioanl zones starting from
+                    sector 0
+                    Default: 8
+base_dir            Path to the base directory where to create the directory
+                    containing the zone files of the device.
+                    Default=/var/local/zloop.
+                    The device directory containing the zone files is always
+                    named with the device ID. E.g. the default zone file
+                    directory for /dev/zloop0 is /var/local/zloop/0.
+nr_queues           Number of I/O queues of the zoned block device. This
+                    value is always capped by the number of online CPUs
+                    Default: 1
+queue_depth         Maximum I/O queue depth per I/O queue.
+                    Default: 64
+buffered_io         Do buffered IOs instead of direct IOs (default: false)
+zone_append         Enable or disable a zloop device native zone append
+                    support.
+                    Default: 1 (enabled).
+                    If native zone append support is disabled, the block
+                    layer will emulate this operation using regular write
+                    operations.
+ordered_zone_append Enable zloop mitigation of zone append reordering.
+                    Default: disabled.
+                    This is useful for testing file systems file data
+                    mapping (extents), as when enabled, this can
+                    significantly reduce the number of data extents needed
+                    to for a file data mapping.
+=================== =========================================================
 
 3) Deleting a Zoned Device
 --------------------------
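As a usage sketch of the options above (hedged: this assumes the zloop control file is /dev/zloop-control with comma-separated key=value options, as described elsewhere in zoned_loop.rst; run as root with the zloop module loaded):

    # Create a small zloop device with native zone append disabled and
    # append-reordering mitigation enabled, then remove it.
    modprobe zloop
    mkdir -p /var/local/zloop
    echo "add id=0,capacity_mb=4096,zone_size_mb=256,zone_append=0,ordered_zone_append=1" > /dev/zloop-control
    # Zone files for the device then live in /var/local/zloop/0.
    echo "remove id=0" > /dev/zloop-control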

Documentation/admin-guide/md.rst

Lines changed: 10 additions & 0 deletions

@@ -238,6 +238,16 @@ All md devices contain:
      the number of devices in a raid4/5/6, or to support external
      metadata formats which mandate such clipping.
 
+  logical_block_size
+     Configure the array's logical block size in bytes. This attribute
+     is only supported for 1.x meta. Write the value before starting
+     array. The final array LBS uses the maximum between this
+     configuration and LBS of all combined devices. Note that
+     LBS cannot exceed PAGE_SIZE before RAID supports folio.
+     WARNING: Arrays created on new kernel cannot be assembled at old
+     kernel due to padding check, Set module parameter 'check_new_feature'
+     to false to bypass, but data loss may occur.
+
   reshape_position
      This is either ``none`` or a sector number within the devices of
      the array where ``reshape`` is up to. If this is set, the three
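A hedged sketch of the workflow the new attribute describes (the sysfs path follows the usual md layout for the attributes in this section; the exact device name is illustrative):

    # Must be written before the array is started; 1.x metadata only.
    echo 4096 > /sys/block/md0/md/logical_block_size
    # The effective LBS is the maximum of this value and the member
    # devices' own logical block sizes; verify before running the array:
    cat /sys/block/md0/md/logical_block_size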

MAINTAINERS

Lines changed: 11 additions & 2 deletions

@@ -4307,7 +4307,7 @@ F: Documentation/filesystems/befs.rst
 F:	fs/befs/
 
 BFQ I/O SCHEDULER
-M:	Yu Kuai <yukuai3@huawei.com>
+M:	Yu Kuai <yukuai@fnnas.com>
 L:	linux-block@vger.kernel.org
 S:	Odd Fixes
 F:	Documentation/block/bfq-iosched.rst
@@ -4407,6 +4407,8 @@ F: block/
 F:	drivers/block/
 F:	include/linux/bio.h
 F:	include/linux/blk*
+F:	include/uapi/linux/blk*
+F:	include/uapi/linux/ioprio.h
 F:	kernel/trace/blktrace.c
 F:	lib/sbitmap.c
 
@@ -23908,7 +23910,7 @@ F: include/linux/property.h
 
 SOFTWARE RAID (Multiple Disks) SUPPORT
 M:	Song Liu <song@kernel.org>
-M:	Yu Kuai <yukuai3@huawei.com>
+M:	Yu Kuai <yukuai@fnnas.com>
 L:	linux-raid@vger.kernel.org
 S:	Supported
 Q:	https://patchwork.kernel.org/project/linux-raid/list/
@@ -28371,6 +28373,13 @@ L: linux-kernel@vger.kernel.org
 S:	Maintained
 F:	arch/x86/kernel/cpu/zhaoxin.c
 
+ZONED BLOCK DEVICE (BLOCK LAYER)
+M:	Damien Le Moal <dlemoal@kernel.org>
+L:	linux-block@vger.kernel.org
+S:	Maintained
+F:	block/blk-zoned.c
+F:	include/uapi/linux/blkzoned.h
+
 ZONED LOOP DEVICE
 M:	Damien Le Moal <dlemoal@kernel.org>
 R:	Christoph Hellwig <hch@lst.de>

block/bio-integrity-auto.c

Lines changed: 3 additions & 23 deletions

@@ -29,7 +29,7 @@ static void bio_integrity_finish(struct bio_integrity_data *bid)
 {
 	bid->bio->bi_integrity = NULL;
 	bid->bio->bi_opf &= ~REQ_INTEGRITY;
-	kfree(bvec_virt(bid->bip.bip_vec));
+	bio_integrity_free_buf(&bid->bip);
 	mempool_free(bid, &bid_pool);
 }
 
@@ -110,8 +110,6 @@ bool bio_integrity_prep(struct bio *bio)
 	struct bio_integrity_data *bid;
 	bool set_flags = true;
 	gfp_t gfp = GFP_NOIO;
-	unsigned int len;
-	void *buf;
 
 	if (!bi)
 		return true;
@@ -152,19 +150,12 @@ bool bio_integrity_prep(struct bio *bio)
 	if (WARN_ON_ONCE(bio_has_crypt_ctx(bio)))
 		return true;
 
-	/* Allocate kernel buffer for protection data */
-	len = bio_integrity_bytes(bi, bio_sectors(bio));
-	buf = kmalloc(len, gfp);
-	if (!buf)
-		goto err_end_io;
 	bid = mempool_alloc(&bid_pool, GFP_NOIO);
-	if (!bid)
-		goto err_free_buf;
 	bio_integrity_init(bio, &bid->bip, &bid->bvec, 1);
-
 	bid->bio = bio;
-
 	bid->bip.bip_flags |= BIP_BLOCK_INTEGRITY;
+	bio_integrity_alloc_buf(bio, gfp & __GFP_ZERO);
+
 	bip_set_seed(&bid->bip, bio->bi_iter.bi_sector);
 
 	if (set_flags) {
@@ -176,23 +167,12 @@ bool bio_integrity_prep(struct bio *bio)
 		bid->bip.bip_flags |= BIP_CHECK_REFTAG;
 	}
 
-	if (bio_integrity_add_page(bio, virt_to_page(buf), len,
-			offset_in_page(buf)) < len)
-		goto err_end_io;
-
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE && bip_should_check(&bid->bip))
 		blk_integrity_generate(bio);
 	else
 		bid->saved_bio_iter = bio->bi_iter;
 	return true;
-
-err_free_buf:
-	kfree(buf);
-err_end_io:
-	bio->bi_status = BLK_STS_RESOURCE;
-	bio_endio(bio);
-	return false;
 }
 EXPORT_SYMBOL(bio_integrity_prep);

block/bio-integrity.c

Lines changed: 48 additions & 0 deletions

@@ -14,6 +14,45 @@ struct bio_integrity_alloc {
 	struct bio_vec bvecs[];
 };
 
+static mempool_t integrity_buf_pool;
+
+void bio_integrity_alloc_buf(struct bio *bio, bool zero_buffer)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	unsigned int len = bio_integrity_bytes(bi, bio_sectors(bio));
+	gfp_t gfp = GFP_NOIO | (zero_buffer ? __GFP_ZERO : 0);
+	void *buf;
+
+	buf = kmalloc(len, (gfp & ~__GFP_DIRECT_RECLAIM) |
+			__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN);
+	if (unlikely(!buf)) {
+		struct page *page;
+
+		page = mempool_alloc(&integrity_buf_pool, GFP_NOFS);
+		if (zero_buffer)
+			memset(page_address(page), 0, len);
+		bvec_set_page(&bip->bip_vec[0], page, len, 0);
+		bip->bip_flags |= BIP_MEMPOOL;
+	} else {
+		bvec_set_page(&bip->bip_vec[0], virt_to_page(buf), len,
+				offset_in_page(buf));
+	}
+
+	bip->bip_vcnt = 1;
+	bip->bip_iter.bi_size = len;
+}
+
+void bio_integrity_free_buf(struct bio_integrity_payload *bip)
+{
+	struct bio_vec *bv = &bip->bip_vec[0];
+
+	if (bip->bip_flags & BIP_MEMPOOL)
+		mempool_free(bv->bv_page, &integrity_buf_pool);
+	else
+		kfree(bvec_virt(bv));
+}
+
 /**
  * bio_integrity_free - Free bio integrity payload
  * @bio: bio containing bip to be freed
@@ -438,3 +477,12 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	return 0;
 }
+
+static int __init bio_integrity_initfn(void)
+{
+	if (mempool_init_page_pool(&integrity_buf_pool, BIO_POOL_SIZE,
+			get_order(BLK_INTEGRITY_MAX_SIZE)))
+		panic("bio: can't create integrity buf pool\n");
+	return 0;
+}
+subsys_initcall(bio_integrity_initfn);

block/bio.c

Lines changed: 1 addition & 0 deletions

@@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_write_hint = 0;
 	bio->bi_write_stream = 0;
 	bio->bi_status = 0;
+	bio->bi_bvec_gap_bit = 0;
 	bio->bi_iter.bi_sector = 0;
 	bio->bi_iter.bi_size = 0;
 	bio->bi_iter.bi_idx = 0;

block/blk-core.c

Lines changed: 6 additions & 6 deletions

@@ -662,13 +662,13 @@ static void __submit_bio(struct bio *bio)
 * bio_list of new bios to be added. ->submit_bio() may indeed add some more
 * bios through a recursive call to submit_bio_noacct. If it did, we find a
 * non-NULL value in bio_list and re-enter the loop from the top.
-* - In this case we really did just take the bio of the top of the list (no
+* - In this case we really did just take the bio off the top of the list (no
 *   pretending) and so remove it from bio_list, and call into ->submit_bio()
 *   again.
 *
 * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
 * bio_list_on_stack[1] contains bios that were submitted before the current
-* ->submit_bio, but that haven't been processed yet.
+* ->submit_bio(), but that haven't been processed yet.
 */
 static void __submit_bio_noacct(struct bio *bio)
 {
@@ -743,8 +743,8 @@ void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 	/*
 	 * We only want one ->submit_bio to be active at a time, else stack
 	 * usage with stacked devices could be a problem. Use current->bio_list
-	 * to collect a list of requests submited by a ->submit_bio method while
-	 * it is active, and then process them after it returned.
+	 * to collect a list of requests submitted by a ->submit_bio method
+	 * while it is active, and then process them after it returned.
 	 */
 	if (current->bio_list) {
 		if (split)
@@ -901,7 +901,7 @@ static void bio_set_ioprio(struct bio *bio)
 *
 * submit_bio() is used to submit I/O requests to block devices. It is passed a
 * fully set up &struct bio that describes the I/O that needs to be done. The
-* bio will be send to the device described by the bi_bdev field.
+* bio will be sent to the device described by the bi_bdev field.
 *
 * The success/failure status of the request, along with notification of
 * completion, is delivered asynchronously through the ->bi_end_io() callback
@@ -991,7 +991,7 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 	 * point to a freshly allocated bio at this point. If that happens
 	 * we have a few cases to consider:
 	 *
-	 * 1) the bio is beeing initialized and bi_bdev is NULL. We can just
+	 * 1) the bio is being initialized and bi_bdev is NULL. We can just
 	 *    simply nothing in this case
 	 * 2) the bio points to a not poll enabled device. bio_poll will catch
 	 *    this and return 0

block/blk-iocost.c

Lines changed: 2 additions & 4 deletions

@@ -2334,10 +2334,8 @@ static void ioc_timer_fn(struct timer_list *timer)
 		else
 			usage_dur = max_t(u64, now.now - ioc->period_at, 1);
 
-		usage = clamp_t(u32,
-				DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
-						   usage_dur),
-				1, WEIGHT_ONE);
+		usage = clamp(DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, usage_dur),
+			      1, WEIGHT_ONE);
 
 		/*
 		 * Already donating or accumulated enough to start.
