
Commit 0c00ed3

Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updates

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing
   with bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a while related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is
   necessary to do so. An earlier fix for an issue resulted in all
   these completions being punted to task_work, to guarantee that
   completions were only run for a given io_uring ring when it was
   local to that ring. With the new changes, we can detect if it's
   necessary to use task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai:
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in
   preparation for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_depth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: convert to use request_queue->async_depth
  kyber: convert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2 parents 591beb0 + 72f4d6f commit 0c00ed3

151 files changed

Lines changed: 5298 additions & 1664 deletions


Documentation/ABI/stable/sysfs-block

Lines changed: 45 additions & 0 deletions
@@ -609,6 +609,51 @@ Description:
 		enabled, and whether tags are shared.
 
+What:		/sys/block/<disk>/queue/async_depth
+Date:		August 2025
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] Controls how many asynchronous requests may be allocated
+		in the block layer. The value is always capped at nr_requests.
+
+		When no elevator is active (none):
+
+		- async_depth is always equal to nr_requests.
+
+		For bfq scheduler:
+
+		- By default, async_depth is set to 75% of nr_requests.
+		  Internal limits are then derived from this value:
+
+		  * Sync writes: limited to async_depth (≈75% of nr_requests).
+		  * Async I/O: limited to ~2/3 of async_depth (≈50% of
+		    nr_requests).
+
+		  If a bfq_queue is weight-raised:
+
+		  * Sync writes: limited to ~1/2 of async_depth (≈37% of
+		    nr_requests).
+		  * Async I/O: limited to ~1/4 of async_depth (≈18% of
+		    nr_requests).
+
+		- If the user writes a custom value to async_depth, BFQ will
+		  recompute these limits proportionally based on the new value.
+
+		For Kyber:
+
+		- By default async_depth is set to 75% of nr_requests.
+		- If the user writes a custom value to async_depth, then it
+		  overrides the default and directly controls the limit for
+		  writes and async I/O.
+
+		For mq-deadline:
+
+		- By default async_depth is set to nr_requests.
+		- If the user writes a custom value to async_depth, then it
+		  overrides the default and directly controls the limit for
+		  writes and async I/O.
+
 What:		/sys/block/<disk>/queue/nr_zones
 Date:		November 2018
 Contact:	Damien Le Moal <damien.lemoal@wdc.com>
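
The new attribute is plain sysfs text, so it can be exercised from userspace with no special tooling. Below is a minimal C sketch (not part of this commit) that reads async_depth, writes a new value, and reads it back; the disk name "sda" and the value 128 are illustrative assumptions.

#include <stdio.h>

static long read_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

int main(void)
{
	/* Assumed device; adjust to a disk present on the system. */
	const char *path = "/sys/block/sda/queue/async_depth";
	FILE *f;

	printf("async_depth = %ld\n", read_attr(path));

	/* The kernel caps whatever is written here at nr_requests. */
	f = fopen(path, "w");
	if (!f) {
		perror("open for write (needs root)");
		return 1;
	}
	fprintf(f, "%d\n", 128);
	fclose(f);

	printf("async_depth = %ld\n", read_attr(path));
	return 0;
}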

Documentation/block/biovecs.rst

Lines changed: 0 additions & 1 deletion
@@ -135,7 +135,6 @@ Usage of helpers:
 	bio_first_bvec_all()
 	bio_first_page_all()
 	bio_first_folio_all()
-	bio_last_bvec_all()
 
 * The following helpers iterate over single-page segment. The passed 'struct
   bio_vec' will contain a single-page IO vector during the iteration::

Documentation/block/inline-encryption.rst

Lines changed: 6 additions & 0 deletions
@@ -206,6 +206,12 @@ it to a bio, given the blk_crypto_key and the data unit number that will be used
 for en/decryption. Users don't need to worry about freeing the bio_crypt_ctx
 later, as that happens automatically when the bio is freed or reset.
 
+To submit a bio that uses inline encryption, users must call
+``blk_crypto_submit_bio()`` instead of the usual ``submit_bio()``. This will
+submit the bio to the underlying driver if it supports inline crypto, or else
+call the blk-crypto fallback routines before submitting normal bios to the
+underlying drivers.
+
 Finally, when done using inline encryption with a blk_crypto_key on a
 block_device, users must call ``blk_crypto_evict_key()``. This ensures that
 the key is evicted from all keyslots it may be programmed into and unlinked from
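
For orientation, here is a hedged kernel-side sketch of the flow the added paragraph describes. bio_crypt_set_ctx() is the existing helper for attaching a crypto context; blk_crypto_submit_bio() is the entry point this series documents, and its exact signature is assumed here.

#include <linux/bio.h>
#include <linux/blk-crypto.h>

/* Sketch: attach an inline-crypto context and submit the bio. */
static void submit_encrypted_bio(struct bio *bio,
				 const struct blk_crypto_key *key,
				 const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE])
{
	/* The context is freed automatically when the bio is freed or reset. */
	bio_crypt_set_ctx(bio, key, dun, GFP_NOIO);

	/*
	 * Instead of submit_bio(): goes straight to hardware that supports
	 * inline crypto, or runs the blk-crypto software fallback first.
	 */
	blk_crypto_submit_bio(bio);
}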

Documentation/block/ublk.rst

Lines changed: 60 additions & 4 deletions
@@ -260,9 +260,12 @@ The following IO commands are communicated via io_uring passthrough command,
 and each command is only for forwarding the IO and committing the result
 with specified IO tag in the command data:
 
-- ``UBLK_IO_FETCH_REQ``
+Traditional Per-I/O Commands
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-  Sent from the server IO pthread for fetching future incoming IO requests
+- ``UBLK_U_IO_FETCH_REQ``
+
+  Sent from the server I/O pthread for fetching future incoming I/O requests
   destined to ``/dev/ublkb*``. This command is sent only once from the server
   IO pthread for ublk driver to setup IO forward environment.
 
@@ -278,7 +281,7 @@ with specified IO tag in the command data:
   supported by the driver, daemons must be per-queue instead - i.e. all I/Os
   associated to a single qid must be handled by the same task.
 
-- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
 
   When an IO request is destined to ``/dev/ublkb*``, the driver stores
   the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
@@ -293,7 +296,7 @@ with specified IO tag in the command data:
   requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
   is reused for both fetching request and committing back IO result.
 
-- ``UBLK_IO_NEED_GET_DATA``
+- ``UBLK_U_IO_NEED_GET_DATA``
 
   With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
   issued to ublk server without data copy. Then, IO backend of ublk server
@@ -322,6 +325,59 @@ with specified IO tag in the command data:
   ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
   the server buffer (pages) read to the IO request pages.
 
+Batch I/O Commands (UBLK_F_BATCH_IO)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
+I/O handling model that replaces the traditional per-I/O commands with
+per-queue batch commands. This significantly reduces communication overhead
+and enables better load balancing across multiple server tasks.
+
+Key differences from traditional mode:
+
+- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
+- **Batch processing**: Multiple I/Os are handled in single operations
+- **Multishot commands**: Use io_uring multishot for reduced submission overhead
+- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
+- **Better load balancing**: Tasks can adjust their workload dynamically
+
+Batch I/O Commands:
+
+- ``UBLK_U_IO_PREP_IO_CMDS``
+
+  Prepares multiple I/O commands in batch. The server provides a buffer
+  containing multiple I/O descriptors that will be processed together.
+  This reduces the number of individual command submissions required.
+
+- ``UBLK_U_IO_COMMIT_IO_CMDS``
+
+  Commits results for multiple I/O operations in batch, and prepares the
+  I/O descriptors to accept new requests. The server provides a buffer
+  containing the results of multiple completed I/Os, allowing efficient
+  bulk completion of requests.
+
+- ``UBLK_U_IO_FETCH_IO_CMDS``
+
+  **Multishot command** for fetching I/O commands in batch. This is the key
+  command that enables high-performance batch processing:
+
+  * Uses io_uring multishot capability for reduced submission overhead
+  * Single command can fetch multiple I/O requests over time
+  * Buffer size determines maximum batch size per operation
+  * Multiple fetch commands can be submitted for load balancing
+  * Only one fetch command is active at any time per queue
+  * Supports dynamic load balancing across multiple server tasks
+
+  It is a typical multishot io_uring request with a provided buffer, and it
+  is not completed until a failure occurs.
+
+  Each task can submit ``UBLK_U_IO_FETCH_IO_CMDS`` with different buffer
+  sizes to control how much work it handles. This enables sophisticated
+  load balancing strategies in multi-threaded servers.
+
+Migration: Applications using traditional commands (``UBLK_U_IO_FETCH_REQ``,
+``UBLK_U_IO_COMMIT_AND_FETCH_REQ``) cannot use batch mode simultaneously.
+
 Zero copy
 ---------
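
To make the batch flow concrete, here is a heavily hedged userspace sketch of submitting the multishot fetch command. IORING_OP_URING_CMD and sqe->cmd_op are existing io_uring UAPI; UBLK_U_IO_FETCH_IO_CMDS comes from this series, and the exact SQE/buffer layout it expects is an assumption here — the real ABI lives in include/uapi/linux/ublk_cmd.h from this merge.

#include <liburing.h>
#include <stdint.h>
#include <string.h>

#define BATCH_BUF_SIZE 4096	/* assumed: bounds the per-fetch batch size */

/* Queue one multishot batch-fetch command on the ublk char device. */
static int queue_batch_fetch(struct io_uring *ring, int ublk_char_fd,
			     unsigned int cmd_op, void *buf)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ublk_char_fd;		/* /dev/ublkc* */
	sqe->cmd_op = cmd_op;		/* e.g. UBLK_U_IO_FETCH_IO_CMDS */
	/*
	 * Assumed layout: the buffer that receives a batch of I/O
	 * descriptors. A larger buffer lets this task take more work,
	 * which is the load-balancing knob described above.
	 */
	sqe->addr = (unsigned long long)(uintptr_t)buf;
	sqe->len = BATCH_BUF_SIZE;

	/* Multishot: one submission keeps producing completions. */
	return io_uring_submit(ring);
}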

MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -24276,6 +24276,7 @@ F:	include/linux/property.h
 SOFTWARE RAID (Multiple Disks) SUPPORT
 M:	Song Liu <song@kernel.org>
 M:	Yu Kuai <yukuai@fnnas.com>
+R:	Li Nan <linan122@huawei.com>
 L:	linux-raid@vger.kernel.org
 S:	Supported
 Q:	https://patchwork.kernel.org/project/linux-raid/list/

block/bdev.c

Lines changed: 0 additions & 1 deletion
@@ -208,7 +208,6 @@ int set_blocksize(struct file *file, int size)
 
 		inode->i_blkbits = blksize_bits(size);
 		mapping_set_folio_min_order(inode->i_mapping, get_order(size));
-		kill_bdev(bdev);
 		filemap_invalidate_unlock(inode->i_mapping);
 		inode_unlock(inode);
 	}

block/bfq-iosched.c

Lines changed: 28 additions & 37 deletions
@@ -231,7 +231,7 @@ static struct kmem_cache *bfq_pool;
 #define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
 	(get_sdist(last_pos, rq) >		\
 	 BFQQ_SEEK_THR &&			\
-	 (!blk_queue_nonrot(bfqd->queue) ||	\
+	 (blk_queue_rot(bfqd->queue) ||	\
 	  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
@@ -697,7 +697,7 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 	unsigned int limit, act_idx;
 
 	/* Sync reads have full depth available */
-	if (op_is_sync(opf) && !op_is_write(opf))
+	if (blk_mq_is_sync_read(opf))
 		limit = data->q->nr_requests;
 	else
 		limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
@@ -4165,7 +4165,7 @@ static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	/* don't use too short time intervals */
 	if (delta_usecs < 1000) {
-		if (blk_queue_nonrot(bfqd->queue))
+		if (!blk_queue_rot(bfqd->queue))
 			/*
 			 * give same worst-case guarantees as idling
 			 * for seeky
@@ -4487,7 +4487,7 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 					     struct bfq_queue *bfqq)
 {
 	bool rot_without_queueing =
-		!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+		blk_queue_rot(bfqd->queue) && !bfqd->hw_tag,
 		bfqq_sequential_and_IO_bound,
 		idling_boosts_thr;
 
@@ -4521,7 +4521,7 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 	 * flash-based device.
 	 */
 	idling_boosts_thr = rot_without_queueing ||
-		((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
+		((blk_queue_rot(bfqd->queue) || !bfqd->hw_tag) &&
 		 bfqq_sequential_and_IO_bound);
 
 	/*
@@ -4722,7 +4722,7 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
 		 * there is only one in-flight large request
 		 * at a time.
 		 */
-		if (blk_queue_nonrot(bfqd->queue) &&
+		if (!blk_queue_rot(bfqd->queue) &&
 		    blk_rq_sectors(bfqq->next_rq) >=
 		    BFQQ_SECT_THR_NONROT &&
 		    bfqd->tot_rq_in_driver >= 1)
@@ -6340,7 +6340,7 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
 	bfqd->hw_tag_samples = 0;
 
 	bfqd->nonrot_with_queueing =
-		blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
+		!blk_queue_rot(bfqd->queue) && bfqd->hw_tag;
 }
 
 static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
@@ -7112,39 +7112,29 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 static void bfq_depth_updated(struct request_queue *q)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	unsigned int nr_requests = q->nr_requests;
+	unsigned int async_depth = q->async_depth;
 
 	/*
-	 * In-word depths if no bfq_queue is being weight-raised:
-	 * leaving 25% of tags only for sync reads.
+	 * By default:
+	 * - sync reads are not limited
+	 * If bfqq is not being weight-raised:
+	 * - sync writes are limited to 75% (the async_depth default value)
+	 * - async IO is limited to 50%
+	 * If bfqq is being weight-raised:
+	 * - sync writes are limited to ~37%
+	 * - async IO is limited to ~18%
 	 *
-	 * In next formulas, right-shift the value
-	 * (1U<<bt->sb.shift), instead of computing directly
-	 * (1U<<(bt->sb.shift - something)), to be robust against
-	 * any possible value of bt->sb.shift, without having to
-	 * limit 'something'.
+	 * If request_queue->async_depth is updated by the user, all limits
+	 * are updated proportionally.
 	 */
-	/* no more than 50% of tags for async I/O */
-	bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);
-	/*
-	 * no more than 75% of tags for sync writes (25% extra tags
-	 * w.r.t. async I/O, to prevent async I/O from starving sync
-	 * writes)
-	 */
-	bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);
+	bfqd->async_depths[0][1] = async_depth;
+	bfqd->async_depths[0][0] = max(async_depth * 2 / 3, 1U);
+	bfqd->async_depths[1][1] = max(async_depth >> 1, 1U);
+	bfqd->async_depths[1][0] = max(async_depth >> 2, 1U);
 
 	/*
-	 * In-word depths in case some bfq_queue is being weight-
-	 * raised: leaving ~63% of tags for sync reads. This is the
-	 * highest percentage for which, in our tests, application
-	 * start-up times didn't suffer from any regression due to tag
-	 * shortage.
+	 * Due to cgroup QoS, the allowed requests for a bfqq might be 1
 	 */
-	/* no more than ~18% of tags for async I/O */
-	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
-	/* no more than ~37% of tags for sync writes (~20% extra tags) */
-	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-
 	blk_mq_set_min_shallow_depth(q, 1);
 }
 
@@ -7293,7 +7283,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	INIT_HLIST_HEAD(&bfqd->burst_list);
 
 	bfqd->hw_tag = -1;
-	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
+	bfqd->nonrot_with_queueing = !blk_queue_rot(bfqd->queue);
 
 	bfqd->bfq_max_budget = bfq_default_max_budget;
 
@@ -7328,9 +7318,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	 * Begin by assuming, optimistically, that the device peak
 	 * rate is equal to 2/3 of the highest reference rate.
 	 */
-	bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *
-		ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
-	bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;
+	bfqd->rate_dur_prod = ref_rate[!blk_queue_rot(bfqd->queue)] *
+		ref_wr_duration[!blk_queue_rot(bfqd->queue)];
+	bfqd->peak_rate = ref_rate[!blk_queue_rot(bfqd->queue)] * 2 / 3;
 
 	/* see comments on the definition of next field inside bfq_data */
 	bfqd->actuator_load_threshold = 4;
@@ -7365,6 +7355,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
 	wbt_disable_default(q->disk);
 	blk_stat_enable_accounting(q);
+	q->async_depth = (q->nr_requests * 3) >> 2;
 
 	return 0;
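
As a worked example of the arithmetic above (not from the commit): with an assumed nr_requests of 256, bfq_init_queue() sets async_depth to 192, and bfq_depth_updated() derives the four limits from it. This standalone C program prints the resulting depths; max() is open-coded since this runs in userspace.

#include <stdio.h>

static unsigned int umax(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned int nr_requests = 256;	/* assumed queue depth */
	/* default set in bfq_init_queue(): 75% of nr_requests */
	unsigned int async_depth = (nr_requests * 3) >> 2;

	printf("async_depth:                %u\n", async_depth);	/* 192 */
	printf("sync writes:                %u\n", async_depth);	/* 192 (75%) */
	printf("async I/O:                  %u\n",
	       umax(async_depth * 2 / 3, 1));	/* 128 (50%) */
	printf("sync writes, weight-raised: %u\n",
	       umax(async_depth >> 1, 1));	/* 96 (~37%) */
	printf("async I/O, weight-raised:   %u\n",
	       umax(async_depth >> 2, 1));	/* 48 (~18%) */
	return 0;
}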

block/bio-integrity-auto.c

Lines changed: 1 addition & 13 deletions
@@ -52,19 +52,7 @@ static bool bip_should_check(struct bio_integrity_payload *bip)
 
 static bool bi_offload_capable(struct blk_integrity *bi)
 {
-	switch (bi->csum_type) {
-	case BLK_INTEGRITY_CSUM_CRC64:
-		return bi->metadata_size == sizeof(struct crc64_pi_tuple);
-	case BLK_INTEGRITY_CSUM_CRC:
-	case BLK_INTEGRITY_CSUM_IP:
-		return bi->metadata_size == sizeof(struct t10_pi_tuple);
-	default:
-		pr_warn_once("%s: unknown integrity checksum type:%d\n",
-			     __func__, bi->csum_type);
-		fallthrough;
-	case BLK_INTEGRITY_CSUM_NONE:
-		return false;
-	}
+	return bi->metadata_size == bi->pi_tuple_size;
 }
 
 /**

block/bio.c

Lines changed: 4 additions & 1 deletion
@@ -301,9 +301,12 @@ EXPORT_SYMBOL(bio_init);
  */
 void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
+	struct bio_vec *bv = bio->bi_io_vec;
+
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
 	atomic_set(&bio->__bi_remaining, 1);
+	bio->bi_io_vec = bv;
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
 		bio_associate_blkg(bio);
@@ -1196,8 +1199,8 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
 {
 	WARN_ON_ONCE(bio->bi_max_vecs);
 
-	bio->bi_vcnt = iter->nr_segs;
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
+	bio->bi_iter.bi_idx = 0;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = iov_iter_count(iter);
 	bio_set_flag(bio, BIO_CLONED);
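
The bio_reset() hunk above is a save/restore around memset(): the wipe of BIO_RESET_BYTES now reaches bi_io_vec (presumably after a struct layout change elsewhere in the series), so the vector pointer must survive it. A minimal userspace analogue of the pattern, as an illustration rather than kernel code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fake_bio {
	unsigned int flags;
	void *io_vec;	/* separately allocated, like bio->bi_io_vec */
};

static void fake_bio_reset(struct fake_bio *b)
{
	void *vec = b->io_vec;	/* save the owned pointer... */

	memset(b, 0, sizeof(*b));
	b->io_vec = vec;	/* ...and restore it after the wipe */
}

int main(void)
{
	struct fake_bio b = { .flags = 1, .io_vec = malloc(64) };

	fake_bio_reset(&b);
	printf("io_vec survives reset: %p\n", b.io_vec);
	free(b.io_vec);
	return 0;
}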
