
Commit b14c5e0

boryaskdave authored and committed
btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
I have been observing a number of systems aborting at insert_dev_extents() in btrfs_create_pending_block_groups(). The following is a sample stack trace of such an abort coming from forced chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any DUP chunk allocation.

[81.801] ------------[ cut here ]------------
[81.801] BTRFS: Transaction aborted (error -17)
[81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
[81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
[81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
[81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
[81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
[81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
[81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
[81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
[81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
[81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
[81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
[81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
[81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
[81.810] Call Trace:
[81.810] <TASK>
[81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
[81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
[81.811] kernfs_fop_write_iter+0x15f/0x240
[81.812] vfs_write+0x264/0x500
[81.812] ksys_write+0x6c/0xe0
[81.812] do_syscall_64+0x66/0x770
[81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[81.813] RIP: 0033:0x7fec6be66197
[81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
[81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
[81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
[81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
[81.817] </TASK>
[81.817] irq event stamp: 20039
[81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
[81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
[81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.820] ---[ end trace 0000000000000000 ]---
[81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists

Inspecting these aborts with drgn, I observed a pattern of overlapping chunk_maps. Note how stripe 1 of the first chunk overlaps in physical address with stripe 0 of the second chunk.

Physical Start     Physical End       Length Logical            Type     Stripe
-------------------------------------------------------------------------------
0x0000000102500000 0x0000000142500000 1.0G   0x0000000641d00000 META|DUP 0/2
0x0000000142500000 0x0000000182500000 1.0G   0x0000000641d00000 META|DUP 1/2
0x0000000142500000 0x0000000182500000 1.0G   0x0000000601d00000 META|DUP 0/2
0x0000000182500000 0x00000001c2500000 1.0G   0x0000000601d00000 META|DUP 1/2

Now how could this possibly happen? All chunk allocation is protected by the chunk_mutex, so racing allocations should see a consistent view of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree (device->alloc_state, as set by chunk_map_device_set_bits()). The tree itself is protected by a spin lock, and clearing/setting the bits is always protected by fs_info->mapping_tree_lock, so no race is apparent.
It turns out that there is a subtle bug in the logic regarding chunk allocations that have happened in the current transaction, known as "pending extents". The chunk allocation as defined in find_free_dev_extent() is a loop which searches the commit root of the dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it then checks the alloc_state bitmap for any pending extents and adjusts the hole that it finds accordingly.

However, the logic in that adjustment assumes that the first pending extent is the only one in that range. E.g., given a layout with two non-consecutive pending extents in a hole passed to dev_extent_hole_check() via *hole_start and *hole_size:

|----pending A----|  real hole  |----pending B----|
|               candidate hole                    |
*hole_start                     *hole_start + *hole_size

the code incorrectly returns a "hole" from the end of pending extent A until the passed in hole end, failing to account for pending B.

However, it is not entirely obvious that it is actually possible to produce such a layout. I was able to reproduce it, but with some contortions: I continued to use the force chunk allocation sysfs file, I introduced a long delay (10 seconds) into the start of the cleaner thread, and I prevented the unused bgs cleaning logic from ever deleting metadata bgs. These make it easier to deterministically produce the condition but shouldn't really matter if you imagine the conditions happening by race/luck. Allocations/frees can happen concurrently with the cleaner thread preparing to process an unused extent, and both create some used chunks with an unused chunk interleaved, all during one transaction. Then btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a range with several pending chunk allocations and a gap in the middle.
The basic idea is that the unused_bgs cleanup work happens on a worker, so if we allocate 3 block groups in one transaction, and then the cleaner work kicked off by the previous transaction comes through and deletes the middle one of the 3, the commit root shows no dev extents and we have the bad pattern in the extent-io-tree. One final consideration is that the code happens to loop to the next hole if there are no more extents at all, so we need one more dev extent way past the area we are working in. Something like the following demonstrates the technique:

# push the BG frontier out to 20G
fallocate -l 20G $mnt/foo
# allocate one more that will prevent the "no more dev extents" luck
fallocate -l 1G $mnt/sticky
# sync
sync
# clear out the allocation area
rm $mnt/foo
sync
_cleaner
# let everything quiesce
sleep 20
sync
# dev tree should have one bg 20G out and the rest at the beginning..
# sort of like an empty FS but with a random sticky chunk.
# kick off the cleaner in the background, remember it will sleep 10s
# before doing interesting work
_cleaner &
sleep 3
# create 3 trivial block groups, all empty, all immediately marked as unused.
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
# let the cleaner thread definitely finish, it will remove the data bg
sleep 10
# this allocation sees the non-consecutive pending metadata chunks with a
# data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

As for the fix, it is not that obvious. I could not see a trivial way to do it even by adding backup loops into find_free_dev_extent(), so I opted to change the semantics of dev_extent_hole_check() to not stop looping until it finds a sufficiently big hole.
For clarity, this also required changing the helper function contains_pending_extent() into two new helpers which find the first pending extent and the first suitable hole in a range. I attempted to clean up the documentation and range calculations to be as consistent and clear as possible for the future.

I also looked at the zoned case and concluded that the loop there is different and not to be unified with this one. As far as I can tell, the zoned check will only further constrain the hole, so looping back to find more holes is acceptable. Though given that zoned really only appends, I find it highly unlikely that it is susceptible to this bug.

Fixes: 1b98450 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
1 parent b291ad4 commit b14c5e0

1 file changed

Lines changed: 183 additions & 60 deletions


fs/btrfs/volumes.c

@@ -1509,30 +1509,158 @@ struct btrfs_device *btrfs_scan_one_device(const char *path,
 }
 
 /*
- * Try to find a chunk that intersects [start, start + len] range and when one
- * such is found, record the end of it in *start
+ * Find the first pending extent intersecting a range.
+ *
+ * @device:        the device to search
+ * @start:         start of the range to check
+ * @len:           length of the range to check
+ * @pending_start: output pointer for the start of the found pending extent
+ * @pending_end:   output pointer for the end of the found pending extent (inclusive)
+ *
+ * Search for a pending chunk allocation that intersects the half-open range
+ * [start, start + len).
+ *
+ * Return: true if a pending extent was found, false otherwise.
+ * If the return value is true, store the first pending extent in
+ * [*pending_start, *pending_end]. Otherwise, the two output variables
+ * may still be modified, to something outside the range and should not
+ * be used.
 */
-static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
-				    u64 len)
+static bool first_pending_extent(struct btrfs_device *device, u64 start, u64 len,
+				 u64 *pending_start, u64 *pending_end)
 {
-	u64 physical_start, physical_end;
-
 	lockdep_assert_held(&device->fs_info->chunk_mutex);
 
-	if (btrfs_find_first_extent_bit(&device->alloc_state, *start,
-					&physical_start, &physical_end,
+	if (btrfs_find_first_extent_bit(&device->alloc_state, start,
+					pending_start, pending_end,
 					CHUNK_ALLOCATED, NULL)) {
 
-		if (in_range(physical_start, *start, len) ||
-		    in_range(*start, physical_start,
-			     physical_end + 1 - physical_start)) {
-			*start = physical_end + 1;
+		if (in_range(*pending_start, start, len) ||
+		    in_range(start, *pending_start, *pending_end + 1 - *pending_start)) {
 			return true;
 		}
 	}
 	return false;
 }
 
+/*
+ * Find the first real hole accounting for pending extents.
+ *
+ * @device:        the device containing the candidate hole
+ * @start:         input/output pointer for the hole start position
+ * @len:           input/output pointer for the hole length
+ * @min_hole_size: the size of hole we are looking for
+ *
+ * Given a potential hole specified by [*start, *start + *len), check for pending
+ * chunk allocations within that range. If pending extents are found, the hole is
+ * adjusted to represent the first true free space that is large enough when
+ * accounting for pending chunks.
+ *
+ * Note that this function must handle various cases involving non consecutive
+ * pending extents.
+ *
+ * Returns: true if a suitable hole was found and false otherwise.
+ * If the return value is true, then *start and *len are set to represent the hole.
+ * If the return value is false, then *start is set to the largest hole we
+ * found and *len is set to its length.
+ * If there are no holes at all, then *start is set to the end of the range and
+ * *len is set to 0.
+ */
+static bool find_hole_in_pending_extents(struct btrfs_device *device, u64 *start,
+					 u64 *len, u64 min_hole_size)
+{
+	u64 pending_start, pending_end;
+	u64 end;
+	u64 max_hole_start = 0;
+	u64 max_hole_len = 0;
+
+	lockdep_assert_held(&device->fs_info->chunk_mutex);
+
+	if (*len == 0)
+		return false;
+
+	end = *start + *len - 1;
+
+	/*
+	 * Loop until we either see a large enough hole or check every pending
+	 * extent overlapping the candidate hole.
+	 * At every hole that we observe, record it if it is the new max.
+	 * At the end of the iteration, set the output variables to the max hole.
+	 */
+	while (true) {
+		if (first_pending_extent(device, *start, *len, &pending_start, &pending_end)) {
+			/*
+			 * Case 1: the pending extent overlaps the start of
+			 * candidate hole. That means the true hole is after the
+			 * pending extent, but we need to find the next pending
+			 * extent to properly size the hole. In the next loop,
+			 * we will reduce to case 2 or 3.
+			 * e.g.,
+			 *
+			 * |----pending A----| real hole |----pending B----|
+			 * |           candidate hole          |
+			 * *start                            end
+			 */
+			if (pending_start <= *start) {
+				*start = pending_end + 1;
+				goto next;
+			}
+			/*
+			 * Case 2: The pending extent starts after *start (and
+			 * overlaps [*start, end)), so the first hole just goes
+			 * up to the start of the pending extent.
+			 * e.g.,
+			 *
+			 * |  real hole  |----pending A----|
+			 * |       candidate hole     |
+			 * *start                    end
+			 */
+			*len = pending_start - *start;
+			if (*len > max_hole_len) {
+				max_hole_start = *start;
+				max_hole_len = *len;
+			}
+			if (*len >= min_hole_size)
+				break;
+			/*
+			 * If the hole wasn't big enough, then we advance past
+			 * the pending extent and keep looking.
+			 */
+			*start = pending_end + 1;
+			goto next;
+		} else {
+			/*
+			 * Case 3: There is no pending extent overlapping the
+			 * range [*start, *start + *len - 1], so the only remaining
+			 * hole is the remaining range.
+			 * e.g.,
+			 *
+			 * |   candidate hole   |
+			 * |      real hole     |
+			 * *start              end
+			 */
+
+			if (*len > max_hole_len) {
+				max_hole_start = *start;
+				max_hole_len = *len;
+			}
+			break;
+		}
+next:
+		if (*start > end)
+			break;
+		*len = end - *start + 1;
+	}
+	if (max_hole_len) {
+		*start = max_hole_start;
+		*len = max_hole_len;
+	} else {
+		*start = end + 1;
+		*len = 0;
+	}
+	return max_hole_len >= min_hole_size;
+}
+
 static u64 dev_extent_search_start(struct btrfs_device *device)
 {
 	switch (device->fs_devices->chunk_alloc_policy) {
@@ -1597,59 +1725,57 @@ static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
 }
 
 /*
- * Check if specified hole is suitable for allocation.
+ * Validate and adjust a hole for chunk allocation
+ *
+ * @device:     the device containing the candidate hole
+ * @hole_start: input/output pointer for the hole start position
+ * @hole_size:  input/output pointer for the hole size
+ * @num_bytes:  minimum allocation size required
 *
- * @device:	the device which we have the hole
- * @hole_start:	starting position of the hole
- * @hole_size:	the size of the hole
- * @num_bytes:	the size of the free space that we need
+ * Check if the specified hole is suitable for allocation and adjust it if
+ * necessary. The hole may be modified to skip over pending chunk allocations
+ * and to satisfy stricter zoned requirements on zoned filesystems.
 *
- * This function may modify @hole_start and @hole_size to reflect the suitable
- * position for allocation. Returns 1 if hole position is updated, 0 otherwise.
+ * For regular (non-zoned) allocation, if the hole after adjustment is smaller
+ * than @num_bytes, the search continues past additional pending extents until
+ * either a sufficiently large hole is found or no more pending extents exist.
+ *
+ * Return: true if a suitable hole was found and false otherwise.
+ * If the return value is true, then *hole_start and *hole_size are set to
+ * represent the hole we found.
+ * If the return value is false, then *hole_start is set to the largest
+ * hole we found and *hole_size is set to its length.
+ * If there are no holes at all, then *hole_start is set to the end of the range
+ * and *hole_size is set to 0.
 */
 static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
 				  u64 *hole_size, u64 num_bytes)
 {
-	bool changed = false;
-	u64 hole_end = *hole_start + *hole_size;
+	bool found = false;
+	const u64 hole_end = *hole_start + *hole_size - 1;
 
-	for (;;) {
-		/*
-		 * Check before we set max_hole_start, otherwise we could end up
-		 * sending back this offset anyway.
-		 */
-		if (contains_pending_extent(device, hole_start, *hole_size)) {
-			if (hole_end >= *hole_start)
-				*hole_size = hole_end - *hole_start;
-			else
-				*hole_size = 0;
-			changed = true;
-		}
+	ASSERT(*hole_size > 0);
 
-		switch (device->fs_devices->chunk_alloc_policy) {
-		default:
-			btrfs_warn_unknown_chunk_allocation(device->fs_devices->chunk_alloc_policy);
-			fallthrough;
-		case BTRFS_CHUNK_ALLOC_REGULAR:
-			/* No extra check */
-			break;
-		case BTRFS_CHUNK_ALLOC_ZONED:
-			if (dev_extent_hole_check_zoned(device, hole_start,
-							hole_size, num_bytes)) {
-				changed = true;
-				/*
-				 * The changed hole can contain pending extent.
-				 * Loop again to check that.
-				 */
-				continue;
-			}
-			break;
-		}
+again:
+	*hole_size = hole_end - *hole_start + 1;
+	found = find_hole_in_pending_extents(device, hole_start, hole_size, num_bytes);
+	if (!found)
+		return found;
+	ASSERT(*hole_size >= num_bytes);
 
+	switch (device->fs_devices->chunk_alloc_policy) {
+	default:
+		btrfs_warn_unknown_chunk_allocation(device->fs_devices->chunk_alloc_policy);
+		fallthrough;
+	case BTRFS_CHUNK_ALLOC_REGULAR:
+		return found;
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		if (dev_extent_hole_check_zoned(device, hole_start, hole_size, num_bytes))
+			goto again;
 		break;
 	}
 
-	return changed;
+	return found;
 }
 
 /*
@@ -1708,7 +1834,7 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 		ret = -ENOMEM;
 		goto out;
 	}
-again:
+
 	if (search_start >= search_end ||
 	    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
 		ret = -ENOSPC;
@@ -1795,11 +1921,7 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 	 */
 	if (search_end > search_start) {
 		hole_size = search_end - search_start;
-		if (dev_extent_hole_check(device, &search_start, &hole_size,
-					  num_bytes)) {
-			btrfs_release_path(path);
-			goto again;
-		}
+		dev_extent_hole_check(device, &search_start, &hole_size, num_bytes);
 
 		if (hole_size > max_hole_size) {
 			max_hole_start = search_start;
@@ -5022,6 +5144,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	u64 diff;
 	u64 start;
 	u64 free_diff = 0;
+	u64 pending_start, pending_end;
 
 	new_size = round_down(new_size, fs_info->sectorsize);
 	start = new_size;
@@ -5067,7 +5190,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	 * in-memory chunks are synced to disk so that the loop below sees them
 	 * and relocates them accordingly.
 	 */
-	if (contains_pending_extent(device, &start, diff)) {
+	if (first_pending_extent(device, start, diff, &pending_start, &pending_end)) {
 		mutex_unlock(&fs_info->chunk_mutex);
 		ret = btrfs_commit_transaction(trans);
 		if (ret)
