Commit 0c40d7c

zhangyi089 authored and brauner committed
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
Currently, disks primarily implement the write zeroes command (aka REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves physically writing zeroes to the disk media (e.g., HDDs), while the second performs an unmap operation on the logical blocks, effectively putting them into a deallocated state (e.g., SSDs). The first method is generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate the write zeroes operation by placing disk blocks into a deallocated state, which opportunistically avoids writing zeroes to media while still guaranteeing that subsequent reads from the specified block range will return zeroed data. This is a best-effort optimization, not a mandatory requirement: some devices may partially fall back to writing physical zeroes due to factors such as misalignment or being asked to clear a block range smaller than the device's internal allocation unit. Therefore, the speed of this operation is not guaranteed.

It is difficult to determine whether a storage device supports the unmap write zeroes operation. We cannot determine this by only querying bdev_limits(bdev)->max_write_zeroes_sectors.

Therefore, first, add a new hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to indicate whether a device supports this unmap write zeroes operation. Then, add two new counterpart software queue limits, max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which allow users to disable this operation if it is very slow on some special devices.

Finally, for the stacked device case, initialize max_hw_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors to UINT_MAX. This operation should be enabled by both the stacking driver and all underlying devices.

Thanks to Martin K. Petersen for optimizing the documentation of the write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
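
The commit message notes that max_write_zeroes_sectors alone cannot tell whether a device unmaps on write zeroes. As a rough user-space illustration, the sketch below reads the sysfs attributes named in this commit and classifies a disk. The helper names (read_sysfs_ull, classify_write_zeroes, probe_write_zeroes) and the enum are hypothetical and not part of the patch; a missing attribute is treated as 0, which is an assumption made for kernels that predate this change.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical classification, not a kernel type. */
enum wz_mode { WZ_UNSUPPORTED, WZ_PLAIN, WZ_UNMAP };

/* Read one unsigned decimal value from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_sysfs_ull(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Pure decision helper: write zeroes must be supported at all, and the
 * effective unmap limit must be non-zero for the fast path to be enabled. */
static enum wz_mode classify_write_zeroes(unsigned long long wz_max_bytes,
					  unsigned long long wz_unmap_max_bytes)
{
	if (wz_max_bytes == 0)
		return WZ_UNSUPPORTED;
	return wz_unmap_max_bytes ? WZ_UNMAP : WZ_PLAIN;
}

/* Query /sys/block/<disk>/queue; an unreadable attribute counts as 0. */
static enum wz_mode probe_write_zeroes(const char *disk)
{
	char path[256];
	unsigned long long wz = 0, unmap = 0;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_max_bytes", disk);
	if (read_sysfs_ull(path, &wz))
		wz = 0;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_unmap_max_bytes", disk);
	if (read_sysfs_ull(path, &unmap))
		unmap = 0;

	return classify_write_zeroes(wz, unmap);
}
```

A caller would pass a disk name such as the `<disk>` component of the sysfs path. WZ_UNMAP only means the optimization is advertised and enabled; as the message above says, the device may still fall back to writing zeroes.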
1 parent e04c78d commit 0c40d7c

4 files changed

Lines changed: 87 additions & 2 deletions

Documentation/ABI/stable/sysfs-block

Lines changed: 33 additions & 0 deletions
@@ -778,6 +778,39 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] This file indicates whether a device supports zeroing data
+		in a specified block range without incurring the cost of
+		physically writing zeroes to the media for each individual
+		block. If this parameter is set to write_zeroes_max_bytes, the
+		device implements a zeroing operation which opportunistically
+		avoids writing zeroes to media while still guaranteeing that
+		subsequent reads from the specified block range will return
+		zeroed data. This operation is a best-effort optimization, a
+		device may fall back to physically writing zeroes to the media
+		due to other factors such as misalignment or being asked to
+		clear a block range smaller than the device's internal
+		allocation unit. If this parameter is set to 0, the device may
+		have to write each logical block media during a zeroing
+		operation.
+
+
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RW] While write_zeroes_unmap_max_hw_bytes is the hardware limit
+		for the device, this setting is the software limit. Since the
+		unmap write zeroes operation is a best-effort optimization, some
+		devices may still physically writing zeroes to media. So the
+		speed of this operation is not guaranteed. Writing a value of
+		'0' to this file disables this operation. Otherwise, this
+		parameter should be equal to write_zeroes_unmap_max_hw_bytes.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
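
The ABI text above says that writing '0' to write_zeroes_unmap_max_bytes disables the operation. A minimal user-space sketch of that write follows. The function names are hypothetical, error handling is deliberately coarse, and the caller is assumed to have permission to write the attribute.

```c
#include <assert.h>
#include <stdio.h>

/* Write a decimal byte count to a sysfs-style attribute file.
 * Returns 0 on success, -1 on any error. With buffered stdio, a value
 * rejected by the kernel typically surfaces when the stream is flushed,
 * so the fclose() result is checked as well. */
static int write_sysfs_ull(const char *path, unsigned long long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%llu\n", value) > 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Set the software limit for one disk: 0 disables unmap write zeroes;
 * per the ABI text, any other value should equal
 * write_zeroes_unmap_max_hw_bytes. */
static int set_unmap_write_zeroes(const char *disk, unsigned long long bytes)
{
	char path[256];

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_unmap_max_bytes", disk);
	return write_sysfs_ull(path, bytes);
}
```

Calling set_unmap_write_zeroes() with 0 corresponds to the "disable" case in the description; re-enabling would pass the value read from write_zeroes_unmap_max_hw_bytes.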

block/blk-settings.c

Lines changed: 18 additions & 2 deletions
@@ -50,6 +50,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_hw_wzeroes_unmap_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	lim->max_hw_zone_append_sectors = UINT_MAX;
 	lim->max_user_discard_sectors = UINT_MAX;
 }
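
Starting the stacked limits at UINT_MAX works because blk_stack_limits() in this commit folds each underlying device in with min(), and UINT_MAX is the identity for min() over unsigned int. The stand-alone model below illustrates that folding. The struct and function names are hypothetical and carry only the three fields relevant here; it is a sketch of the arithmetic, not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical cut-down stand-in for struct queue_limits. */
struct stack_wz_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_hw_wzeroes_unmap_sectors;
	unsigned int max_user_wzeroes_unmap_sectors;
};

static unsigned int stack_min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Mirrors the assignments shown for blk_set_stacking_limits(). */
static void model_set_stacking_limits(struct stack_wz_limits *t)
{
	t->max_write_zeroes_sectors = UINT_MAX;
	t->max_hw_wzeroes_unmap_sectors = UINT_MAX;
	t->max_user_wzeroes_unmap_sectors = UINT_MAX;
}

/* Mirrors the min() folding shown for blk_stack_limits(): the top device
 * keeps the smallest value seen across all bottom devices. */
static void model_stack_limits(struct stack_wz_limits *t,
			       const struct stack_wz_limits *b)
{
	t->max_write_zeroes_sectors =
		stack_min_uint(t->max_write_zeroes_sectors,
			       b->max_write_zeroes_sectors);
	t->max_user_wzeroes_unmap_sectors =
		stack_min_uint(t->max_user_wzeroes_unmap_sectors,
			       b->max_user_wzeroes_unmap_sectors);
	t->max_hw_wzeroes_unmap_sectors =
		stack_min_uint(t->max_hw_wzeroes_unmap_sectors,
			       b->max_hw_wzeroes_unmap_sectors);
}
```

In this model a single bottom device that reports 0 for max_hw_wzeroes_unmap_sectors pulls the stacked value down to 0, which matches the commit message's statement that all underlying devices must enable the operation.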
@@ -333,6 +335,12 @@ int blk_validate_limits(struct queue_limits *lim)
 	if (!lim->max_segments)
 		lim->max_segments = BLK_MAX_SEGMENTS;
 
+	if (lim->max_hw_wzeroes_unmap_sectors &&
+	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
+		return -EINVAL;
+	lim->max_wzeroes_unmap_sectors = min(lim->max_hw_wzeroes_unmap_sectors,
+			lim->max_user_wzeroes_unmap_sectors);
+
 	lim->max_discard_sectors =
 		min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors);
 
@@ -418,10 +426,11 @@ int blk_set_default_limits(struct queue_limits *lim)
 {
 	/*
 	 * Most defaults are set by capping the bounds in blk_validate_limits,
-	 * but max_user_discard_sectors is special and needs an explicit
-	 * initialization to the max value here.
+	 * but these limits are special and need an explicit initialization to
+	 * the max value here.
 	 */
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	return blk_validate_limits(lim);
 }
 
@@ -708,6 +717,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_user_wzeroes_unmap_sectors =
+			min(t->max_user_wzeroes_unmap_sectors,
+			    b->max_user_wzeroes_unmap_sectors);
+	t->max_hw_wzeroes_unmap_sectors =
+			min(t->max_hw_wzeroes_unmap_sectors,
+			    b->max_hw_wzeroes_unmap_sectors);
+
 	t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
 					b->max_hw_zone_append_sectors);
 

block/blk-sysfs.c

Lines changed: 26 additions & 0 deletions
@@ -161,6 +161,8 @@ static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_wzeroes_unmap_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_wzeroes_unmap_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_zone_append_sectors)
@@ -205,6 +207,24 @@ static int queue_max_discard_sectors_store(struct gendisk *disk,
 	return 0;
 }
 
+static int queue_max_wzeroes_unmap_sectors_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long max_zeroes_bytes, max_hw_zeroes_bytes;
+	ssize_t ret;
+
+	ret = queue_var_store(&max_zeroes_bytes, page, count);
+	if (ret < 0)
+		return ret;
+
+	max_hw_zeroes_bytes = lim->max_hw_wzeroes_unmap_sectors << SECTOR_SHIFT;
+	if (max_zeroes_bytes != 0 && max_zeroes_bytes != max_hw_zeroes_bytes)
+		return -EINVAL;
+
+	lim->max_user_wzeroes_unmap_sectors = max_zeroes_bytes >> SECTOR_SHIFT;
+	return 0;
+}
+
 static int
 queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count,
 		struct queue_limits *lim)
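
The store handler above accepts only two values: 0 or exactly the hardware limit in bytes. The user-space model below combines that rule with the check and min() from blk_validate_limits() in this commit. It assumes the sysfs store path re-validates the limits after the handler returns, which is not shown in this diff. The struct and function names are hypothetical, and the model widens to unsigned long long before shifting, whereas the diff shifts the unsigned int field directly.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical cut-down stand-in for struct queue_limits. */
struct wz_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_wzeroes_unmap_sectors;		/* effective */
	unsigned int max_hw_wzeroes_unmap_sectors;	/* set by driver */
	unsigned int max_user_wzeroes_unmap_sectors;	/* set via sysfs */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Models the new lines in blk_validate_limits(): the hardware limit is
 * either 0 or equal to max_write_zeroes_sectors, and the effective limit
 * is the smaller of the hardware and user limits. */
static int model_validate(struct wz_limits *lim)
{
	if (lim->max_hw_wzeroes_unmap_sectors &&
	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
		return -EINVAL;
	lim->max_wzeroes_unmap_sectors =
		min_uint(lim->max_hw_wzeroes_unmap_sectors,
			 lim->max_user_wzeroes_unmap_sectors);
	return 0;
}

/* Models queue_max_wzeroes_unmap_sectors_store() followed by the assumed
 * re-validation step. */
static int model_store(struct wz_limits *lim, unsigned long long bytes)
{
	unsigned long long hw_bytes =
		(unsigned long long)lim->max_hw_wzeroes_unmap_sectors
			<< SECTOR_SHIFT;

	if (bytes != 0 && bytes != hw_bytes)
		return -EINVAL;
	lim->max_user_wzeroes_unmap_sectors =
		(unsigned int)(bytes >> SECTOR_SHIFT);
	return model_validate(lim);
}
```

Under this model the attribute behaves as an on/off switch: storing 0 drives the effective limit to 0, and storing the hardware value restores it.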
@@ -514,6 +534,10 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_hw_bytes");
+QUEUE_LIM_RW_ENTRY(queue_max_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -662,6 +686,8 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_max_hw_wzeroes_unmap_sectors_entry.attr,
+	&queue_max_wzeroes_unmap_sectors_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,

include/linux/blkdev.h

Lines changed: 10 additions & 0 deletions
@@ -383,6 +383,9 @@ struct queue_limits {
 	unsigned int		max_user_discard_sectors;
 	unsigned int		max_secure_erase_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_wzeroes_unmap_sectors;
+	unsigned int		max_hw_wzeroes_unmap_sectors;
+	unsigned int		max_user_wzeroes_unmap_sectors;
 	unsigned int		max_hw_zone_append_sectors;
 	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
@@ -1042,6 +1045,7 @@ static inline void blk_queue_disable_secure_erase(struct request_queue *q)
 static inline void blk_queue_disable_write_zeroes(struct request_queue *q)
 {
 	q->limits.max_write_zeroes_sectors = 0;
+	q->limits.max_wzeroes_unmap_sectors = 0;
 }
 
 /*
@@ -1378,6 +1382,12 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
 }
 
+static inline unsigned int
+bdev_write_zeroes_unmap_sectors(struct block_device *bdev)
+{
+	return bdev_limits(bdev)->max_wzeroes_unmap_sectors;
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));
