Commit 79bd371

fdmanana authored and kdave committed
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way system chunk space no longer stays reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree.

Changes to the chunk btree are by and large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition of comments describing how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
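The effect described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name and size below is hypothetical, the model is single threaded, and it ignores the space the btree update itself consumes. It compares holding a reservation until phase 2 with releasing it inside the same critical section that made it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: these names and sizes are illustrative, not btrfs code. */
#define SYS_CHUNK_BYTES 4096u /* capacity added by one new system chunk */
#define ITEM_RESERVE    1024u /* worst-case reservation for one btree update */

struct sys_space {
	size_t free_bytes;     /* unreserved system space */
	size_t reserved_bytes; /* space held for pending chunk btree updates */
	unsigned int chunks;   /* system chunks allocated so far */
};

/* Reserve space for one update, allocating a system chunk when short. */
static void reserve_item(struct sys_space *s)
{
	if (s->free_bytes < ITEM_RESERVE) {
		s->free_bytes += SYS_CHUNK_BYTES;
		s->chunks++;
	}
	s->free_bytes -= ITEM_RESERVE;
	s->reserved_bytes += ITEM_RESERVE;
}

/* Hand the reservation back once the chunk btree update is done. */
static void release_item(struct sys_space *s)
{
	assert(s->reserved_bytes >= ITEM_RESERVE);
	s->reserved_bytes -= ITEM_RESERVE;
	s->free_bytes += ITEM_RESERVE;
}

/* Old scheme: every task finishes phase 1 before any task runs phase 2. */
unsigned int chunks_with_late_release(unsigned int tasks)
{
	struct sys_space s = { 0, 0, 0 };
	unsigned int i;

	for (i = 0; i < tasks; i++)
		reserve_item(&s);   /* phase 1: reserve only */
	for (i = 0; i < tasks; i++)
		release_item(&s);   /* phase 2: update btree, then release */
	return s.chunks;
}

/* New scheme: reserve, update and release inside one critical section. */
unsigned int chunks_with_early_release(unsigned int tasks)
{
	struct sys_space s = { 0, 0, 0 };
	unsigned int i;

	for (i = 0; i < tasks; i++) {
		reserve_item(&s);   /* would run under the chunk mutex */
		release_item(&s);   /* released before the mutex is dropped */
	}
	return s.chunks;
}
```

With 16 simulated tasks, chunks_with_late_release() ends up allocating 4 system chunks while chunks_with_early_release() allocates 1, which mirrors why long-lived reservations can exhaust the system chunk array.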
1 parent 1cb3db1 commit 79bd371

7 files changed

Lines changed: 546 additions & 184 deletions

fs/btrfs/block-group.c

Lines changed: 249 additions & 36 deletions
Large diffs are not rendered by default.

fs/btrfs/block-group.h

Lines changed: 4 additions & 2 deletions
@@ -97,6 +97,7 @@ struct btrfs_block_group {
 	unsigned int removed:1;
 	unsigned int to_copy:1;
 	unsigned int relocating_repair:1;
+	unsigned int chunk_item_inserted:1;
 
 	int disk_cache_state;
 
@@ -268,8 +269,9 @@ void btrfs_reclaim_bgs_work(struct work_struct *work);
 void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg);
 int btrfs_read_block_groups(struct btrfs_fs_info *info);
-int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
-			   u64 type, u64 chunk_offset, u64 size);
+struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *trans,
+						 u64 bytes_used, u64 type,
+						 u64 chunk_offset, u64 size);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 			     bool do_chunk_alloc);
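The header change above turns btrfs_make_block_group() from an int-returning function into one that returns the block group itself. The body is not rendered on this page, so the following is only a sketch of how such a signature is typically consumed, assuming it follows the kernel's usual error-pointer convention (the same IS_ERR()/PTR_ERR() pattern visible in the ctree.c hunks). The helpers are simplified userspace stand-ins, and struct block_group, make_block_group() and alloc_chunk_demo() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers (simplified). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, cut-down block group; only the new flag is from the diff. */
struct block_group {
	uint64_t start;
	uint64_t length;
	unsigned int chunk_item_inserted:1;
};

/* Hypothetical allocator: returns a block group or an encoded errno. */
struct block_group *make_block_group(uint64_t chunk_offset, uint64_t size)
{
	struct block_group *bg;

	if (size == 0)
		return ERR_PTR(-EINVAL);
	bg = calloc(1, sizeof(*bg));
	if (!bg)
		return ERR_PTR(-ENOMEM);
	bg->start = chunk_offset;
	bg->length = size;
	return bg;
}

/* Caller: with a pointer in hand, phase 1 can record its own progress. */
int alloc_chunk_demo(uint64_t chunk_offset, uint64_t size)
{
	struct block_group *bg = make_block_group(chunk_offset, size);
	int inserted;

	if (IS_ERR(bg))
		return (int)PTR_ERR(bg);
	bg->chunk_item_inserted = 1; /* chunk item added during phase 1 */
	inserted = bg->chunk_item_inserted;
	free(bg);
	return inserted ? 0 : -EIO;
}
```

Returning the object rather than a status code lets the caller act on the new block group immediately, for example to set a flag such as chunk_item_inserted so that a later phase can tell the chunk item is already in place.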

fs/btrfs/ctree.c

Lines changed: 13 additions & 54 deletions
@@ -364,49 +364,6 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-static struct extent_buffer *alloc_tree_block_no_bg_flush(
-					  struct btrfs_trans_handle *trans,
-					  struct btrfs_root *root,
-					  u64 parent_start,
-					  const struct btrfs_disk_key *disk_key,
-					  int level,
-					  u64 hint,
-					  u64 empty_size,
-					  enum btrfs_lock_nesting nest)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct extent_buffer *ret;
-
-	/*
-	 * If we are COWing a node/leaf from the extent, chunk, device or free
-	 * space trees, make sure that we do not finish block group creation of
-	 * pending block groups. We do this to avoid a deadlock.
-	 * COWing can result in allocation of a new chunk, and flushing pending
-	 * block groups (btrfs_create_pending_block_groups()) can be triggered
-	 * when finishing allocation of a new chunk. Creation of a pending block
-	 * group modifies the extent, chunk, device and free space trees,
-	 * therefore we could deadlock with ourselves since we are holding a
-	 * lock on an extent buffer that btrfs_create_pending_block_groups() may
-	 * try to COW later.
-	 * For similar reasons, we also need to delay flushing pending block
-	 * groups when splitting a leaf or node, from one of those trees, since
-	 * we are holding a write lock on it and its parent or when inserting a
-	 * new root node for one of those trees.
-	 */
-	if (root == fs_info->extent_root ||
-	    root == fs_info->chunk_root ||
-	    root == fs_info->dev_root ||
-	    root == fs_info->free_space_root)
-		trans->can_flush_pending_bgs = false;
-
-	ret = btrfs_alloc_tree_block(trans, root, parent_start,
-				     root->root_key.objectid, disk_key, level,
-				     hint, empty_size, nest);
-	trans->can_flush_pending_bgs = true;
-
-	return ret;
-}
-
 /*
  * does the dirty work in cow of a single block. The parent block (if
  * supplied) is updated to point to the new cow copy. The new buffer is marked
@@ -455,8 +412,9 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
 		parent_start = parent->start;
 
-	cow = alloc_tree_block_no_bg_flush(trans, root, parent_start, &disk_key,
-					   level, search_start, empty_size, nest);
+	cow = btrfs_alloc_tree_block(trans, root, parent_start,
+				     root->root_key.objectid, &disk_key, level,
+				     search_start, empty_size, nest);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -2458,9 +2416,9 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 	else
 		btrfs_node_key(lower, &lower_key, 0);
 
-	c = alloc_tree_block_no_bg_flush(trans, root, 0, &lower_key, level,
-					 root->node->start, 0,
-					 BTRFS_NESTING_NEW_ROOT);
+	c = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+				   &lower_key, level, root->node->start, 0,
+				   BTRFS_NESTING_NEW_ROOT);
 	if (IS_ERR(c))
 		return PTR_ERR(c);
 
@@ -2589,8 +2547,9 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 	mid = (c_nritems + 1) / 2;
 	btrfs_node_key(c, &disk_key, mid);
 
-	split = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, level,
-					     c->start, 0, BTRFS_NESTING_SPLIT);
+	split = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+				       &disk_key, level, c->start, 0,
+				       BTRFS_NESTING_SPLIT);
 	if (IS_ERR(split))
 		return PTR_ERR(split);
 
@@ -3381,10 +3340,10 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
 	 * BTRFS_NESTING_SPLIT_THE_SPLITTENING if we need to, but for now just
 	 * use BTRFS_NESTING_NEW_ROOT.
 	 */
-	right = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, 0,
-					     l->start, 0, num_doubles ?
-					     BTRFS_NESTING_NEW_ROOT :
-					     BTRFS_NESTING_SPLIT);
+	right = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
+				       &disk_key, 0, l->start, 0,
+				       num_doubles ? BTRFS_NESTING_NEW_ROOT :
+				       BTRFS_NESTING_SPLIT);
 	if (IS_ERR(right))
 		return PTR_ERR(right);

fs/btrfs/transaction.c

Lines changed: 5 additions & 5 deletions
@@ -254,8 +254,11 @@ static inline int extwriter_counter_read(struct btrfs_transaction *trans)
 }
 
 /*
- * To be called after all the new block groups attached to the transaction
- * handle have been created (btrfs_create_pending_block_groups()).
+ * To be called after doing the chunk btree updates right after allocating a new
+ * chunk (after btrfs_chunk_alloc_add_chunk_item() is called), when removing a
+ * chunk after all chunk btree updates and after finishing the second phase of
+ * chunk allocation (btrfs_create_pending_block_groups()) in case some block
+ * group had its chunk item insertion delayed to the second phase.
  */
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
 {
@@ -264,8 +267,6 @@ void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
 	if (!trans->chunk_bytes_reserved)
 		return;
 
-	WARN_ON_ONCE(!list_empty(&trans->new_bgs));
-
 	btrfs_block_rsv_release(fs_info, &fs_info->chunk_block_rsv,
 				trans->chunk_bytes_reserved, NULL);
 	trans->chunk_bytes_reserved = 0;
@@ -696,7 +697,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	h->fs_info = root->fs_info;
 
 	h->type = type;
-	h->can_flush_pending_bgs = true;
 	INIT_LIST_HEAD(&h->new_bgs);
 
 	smp_mb();
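The updated comment lists several points at which btrfs_trans_release_chunk_metadata() may now be called, and the hunk shows that it returns early when nothing is reserved and zeroes the counter after releasing. A minimal userspace sketch of that pattern follows. The struct and function names are hypothetical stand-ins, and the reserve is reduced to a single counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the transaction handle and chunk reserve. */
struct trans_handle {
	uint64_t chunk_bytes_reserved;
};

struct block_rsv {
	uint64_t reserved;
};

/* Give back whatever this handle still holds; a second call does nothing. */
void release_chunk_metadata(struct block_rsv *rsv, struct trans_handle *trans)
{
	if (!trans->chunk_bytes_reserved)
		return;

	assert(rsv->reserved >= trans->chunk_bytes_reserved);
	rsv->reserved -= trans->chunk_bytes_reserved;
	trans->chunk_bytes_reserved = 0;
}

/* Release right after the phase 1 btree updates, then again after phase 2. */
uint64_t reserve_left_after_two_releases(void)
{
	struct block_rsv rsv = { 300 };
	struct trans_handle trans = { 100 };

	release_chunk_metadata(&rsv, &trans); /* after phase 1 updates */
	release_chunk_metadata(&rsv, &trans); /* after phase 2: no-op */
	return rsv.reserved;
}
```

Because the handle's counter is cleared on the first release, calling the function from more than one place cannot release the same bytes twice.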

fs/btrfs/transaction.h

Lines changed: 1 addition & 1 deletion
@@ -132,7 +132,7 @@ struct btrfs_trans_handle {
 	short aborted;
 	bool adding_csums;
 	bool allocating_chunk;
-	bool can_flush_pending_bgs;
+	bool removing_chunk;
 	bool reloc_reserved;
 	bool in_fsync;
 	struct btrfs_root *root;
