Skip to content

Commit a1a4a9c

Browse files
fdmananakdave
authored andcommitted
btrfs: fix race between ordered extent completion and fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc). However by not locking the target extent range for the whole duration of the fiemap call we can race with an ordered extent. This happens like this: 1) The fiemap task finishes processing a file extent item that covers the file range [512K, 1M[, and that file extent item is the last item in the leaf currently being processed; 2) And ordered extent for the file range [768K, 2M[, in COW mode, completes (btrfs_finish_one_ordered()) and the file extent item covering the range [512K, 1M[ is trimmed to cover the range [512K, 768K[ and then a new file extent item for the range [768K, 2M[ is inserted in the inode's subvolume tree; 3) The fiemap task calls fiemap_next_leaf_item(), which then calls btrfs_next_leaf() to find the next leaf / item. This finds that the the next key following the one we previously processed (its type is BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding to the new file extent item inserted by the ordered extent, which has a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K; 4) Later the fiemap code ends up at emit_fiemap_extent() and triggers the warning: if (cache->offset + cache->len > offset) { WARN_ON(1); return -EINVAL; } Since we get 1M > 768K, because the previously emitted entry for the old extent covering the file range [512K, 1M[ ends at an offset that is greater than the new extent's start offset (768K). This makes fiemap fail with -EINVAL besides triggering the warning that produces a stack trace like the following: [1621.677651] ------------[ cut here ]------------ [1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs] [1621.677899] Modules linked in: btrfs blake2b_generic (...) [1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1 [1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678033] Code: 2b 4c 89 63 (...) [1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206 [1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000 [1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90 [1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000 [1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000 [1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850 [1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000 [1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0 [1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1621.678056] Call Trace: [1621.678074] <TASK> [1621.678076] ? __warn+0x80/0x130 [1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678159] ? report_bug+0x1f4/0x200 [1621.678164] ? handle_bug+0x42/0x70 [1621.678167] ? exc_invalid_op+0x14/0x70 [1621.678170] ? asm_exc_invalid_op+0x16/0x20 [1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678253] extent_fiemap+0x766/0xa30 [btrfs] [1621.678339] btrfs_fiemap+0x45/0x80 [btrfs] [1621.678420] do_vfs_ioctl+0x1e4/0x870 [1621.678431] __x64_sys_ioctl+0x6a/0xc0 [1621.678434] do_syscall_64+0x52/0x120 [1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76 There's also another case where before calling btrfs_next_leaf() we are processing a hole or a prealloc extent and we had several delalloc ranges within that hole or prealloc extent. In that case if the ordered extents complete before we find the next key, we may end up finding an extent item with an offset smaller than (or equals to) the offset in cache->offset. So fix this by changing emit_fiemap_extent() to address these three scenarios like this: 1) For the first case, steps listed above, adjust the length of the previously cached extent so that it does not overlap with the current extent, emit the previous one and cache the current file extent item; 2) For the second case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range, and the current file extent item has an offset that matches the offset in the fiemap cache, just discard what we have in the fiemap cache and assign the current file extent item to the cache, since it's more up to date; 3) For the third case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range and the offset of the file extent item we just found is smaller than what we have in the cache, just skip the current file extent item if its range end at or behind the cached extent's end, because we may have emitted (to the fiemap user space buffer) delalloc ranges that overlap with the current file extent item's range. If the file extent item's range goes beyond the end offset of the cached extent, just emit the cached extent and cache a subrange of the file extent item, that goes from the end offset of the cached extent to the end offset of the file extent item. Dealing with those cases in those ways makes everything consistent by reflecting the current state of file extent items in the btree and without emitting extents that have overlapping ranges (which would be confusing and violating expectations). This issue could be triggered often with test case generic/561, and was also hit and reported by Wang Yugui. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/ Fixes: b0ad381 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
1 parent c7bb26b commit a1a4a9c

1 file changed

Lines changed: 96 additions & 7 deletions

File tree

fs/btrfs/extent_io.c

Lines changed: 96 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -2480,6 +2480,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
24802480
struct fiemap_cache *cache,
24812481
u64 offset, u64 phys, u64 len, u32 flags)
24822482
{
2483+
u64 cache_end;
24832484
int ret = 0;
24842485

24852486
/* Set at the end of extent_fiemap(). */
@@ -2489,15 +2490,102 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
24892490
goto assign;
24902491

24912492
/*
2492-
* Sanity check, extent_fiemap() should have ensured that new
2493-
* fiemap extent won't overlap with cached one.
2494-
* Not recoverable.
2493+
* When iterating the extents of the inode, at extent_fiemap(), we may
2494+
* find an extent that starts at an offset behind the end offset of the
2495+
* previous extent we processed. This happens if fiemap is called
2496+
* without FIEMAP_FLAG_SYNC and there are ordered extents completing
2497+
* while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
24952498
*
2496-
* NOTE: Physical address can overlap, due to compression
2499+
* For example we are in leaf X processing its last item, which is the
2500+
* file extent item for file range [512K, 1M[, and after
2501+
* btrfs_next_leaf() releases the path, there's an ordered extent that
2502+
* completes for the file range [768K, 2M[, and that results in trimming
2503+
* the file extent item so that it now corresponds to the file range
2504+
* [512K, 768K[ and a new file extent item is inserted for the file
2505+
* range [768K, 2M[, which may end up as the last item of leaf X or as
2506+
* the first item of the next leaf - in either case btrfs_next_leaf()
2507+
* will leave us with a path pointing to the new extent item, for the
2508+
* file range [768K, 2M[, since that's the first key that follows the
2509+
* last one we processed. So in order not to report overlapping extents
2510+
* to user space, we trim the length of the previously cached extent and
2511+
* emit it.
2512+
*
2513+
* Upon calling btrfs_next_leaf() we may also find an extent with an
2514+
* offset smaller than or equals to cache->offset, and this happens
2515+
* when we had a hole or prealloc extent with several delalloc ranges in
2516+
* it, but after btrfs_next_leaf() released the path, delalloc was
2517+
* flushed and the resulting ordered extents were completed, so we can
2518+
* now have found a file extent item for an offset that is smaller than
2519+
* or equals to what we have in cache->offset. We deal with this as
2520+
* described below.
24972521
*/
2498-
if (cache->offset + cache->len > offset) {
2499-
WARN_ON(1);
2500-
return -EINVAL;
2522+
cache_end = cache->offset + cache->len;
2523+
if (cache_end > offset) {
2524+
if (offset == cache->offset) {
2525+
/*
2526+
* We cached a dealloc range (found in the io tree) for
2527+
* a hole or prealloc extent and we have now found a
2528+
* file extent item for the same offset. What we have
2529+
* now is more recent and up to date, so discard what
2530+
* we had in the cache and use what we have just found.
2531+
*/
2532+
goto assign;
2533+
} else if (offset > cache->offset) {
2534+
/*
2535+
* The extent range we previously found ends after the
2536+
* offset of the file extent item we found and that
2537+
* offset falls somewhere in the middle of that previous
2538+
* extent range. So adjust the range we previously found
2539+
* to end at the offset of the file extent item we have
2540+
* just found, since this extent is more up to date.
2541+
* Emit that adjusted range and cache the file extent
2542+
* item we have just found. This corresponds to the case
2543+
* where a previously found file extent item was split
2544+
* due to an ordered extent completing.
2545+
*/
2546+
cache->len = offset - cache->offset;
2547+
goto emit;
2548+
} else {
2549+
const u64 range_end = offset + len;
2550+
2551+
/*
2552+
* The offset of the file extent item we have just found
2553+
* is behind the cached offset. This means we were
2554+
* processing a hole or prealloc extent for which we
2555+
* have found delalloc ranges (in the io tree), so what
2556+
* we have in the cache is the last delalloc range we
2557+
* found while the file extent item we found can be
2558+
* either for a whole delalloc range we previously
2559+
* emmitted or only a part of that range.
2560+
*
2561+
* We have two cases here:
2562+
*
2563+
* 1) The file extent item's range ends at or behind the
2564+
* cached extent's end. In this case just ignore the
2565+
* current file extent item because we don't want to
2566+
* overlap with previous ranges that may have been
2567+
* emmitted already;
2568+
*
2569+
* 2) The file extent item starts behind the currently
2570+
* cached extent but its end offset goes beyond the
2571+
* end offset of the cached extent. We don't want to
2572+
* overlap with a previous range that may have been
2573+
* emmitted already, so we emit the currently cached
2574+
* extent and then partially store the current file
2575+
* extent item's range in the cache, for the subrange
2576+
* going the cached extent's end to the end of the
2577+
* file extent item.
2578+
*/
2579+
if (range_end <= cache_end)
2580+
return 0;
2581+
2582+
if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
2583+
phys += cache_end - offset;
2584+
2585+
offset = cache_end;
2586+
len = range_end - cache_end;
2587+
goto emit;
2588+
}
25012589
}
25022590

25032591
/*
@@ -2517,6 +2605,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
25172605
return 0;
25182606
}
25192607

2608+
emit:
25202609
/* Not mergeable, need to submit cached one */
25212610
ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
25222611
cache->len, cache->flags);

0 commit comments

Comments
 (0)