Commit b32e381

Merge tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:

"This fixes multiple problems in the reserve pool sizing functions: an incorrect free space calculation, a pointless infinite loop, and even more braindamage that could result in the pool being overfilled.

The pile of patches from Dave fix myriad races and UAF bugs in the log recovery code that much to our mutual surprise nobody's tripped over. Dave also fixed a performance optimization that had turned into a regression.

Dave Chinner is taking over as XFS maintainer starting Sunday and lasting until 5.19-rc1 is tagged so that I can focus on starting a massive design review for the (feature complete after five years) online repair feature. From then on, he and I will be moving XFS to a co-maintainership model by trading duties every other release.

NOTE: I hope very strongly that the other pieces of the (X)FS ecosystem (fstests and xfsprogs) will make similar changes to spread their maintenance load.

Summary:

- Fix an incorrect free space calculation in xfs_reserve_blocks that could lead to a request for free blocks that will never succeed.
- Fix a hang in xfs_reserve_blocks caused by an infinite loop and the incorrect free space calculation.
- Fix yet a third problem in xfs_reserve_blocks where multiple racing threads can overfill the reserve pool.
- Fix an accounting error that lead to us reporting reserved space as "available".
- Fix a race condition during abnormal fs shutdown that could cause UAF problems when memory reclaim and log shutdown try to clean up inodes.
- Fix a bug where log shutdown can race with unmount to tear down the log, thereby causing UAF errors.
- Disentangle log and filesystem shutdown to reduce confusion.
- Fix some confusion in xfs_trans_commit such that a race between transaction commit and filesystem shutdown can cause unlogged dirty inode metadata to be committed, thereby corrupting the filesystem.
- Remove a performance optimization in the log as it was discovered that certain storage hardware handle async log flushes so poorly as to cause serious performance regressions. Recent restructuring of other parts of the logging code mean that no performance benefit is seen on hardware that handle it well"

* tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: drop async cache flushes from CIL commits.
  xfs: shutdown during log recovery needs to mark the log shutdown
  xfs: xfs_trans_commit() path must check for log shutdown
  xfs: xfs_do_force_shutdown needs to block racing shutdowns
  xfs: log shutdown triggers should only shut down the log
  xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
  xfs: shutdown in intent recovery has non-intent items in the AIL
  xfs: aborting inodes on shutdown may need buffer lock
  xfs: don't report reserved bnobt space as available
  xfs: fix overfilling of reserve pool
  xfs: always succeed at setting the reserve pool size
  xfs: remove infinite loop when reserving free block pool
  xfs: don't include bnobt blocks when reserving free block pool
  xfs: document the XFS_ALLOC_AGFL_RESERVE constant
2 parents 1fdff40 + 919edba commit b32e381

18 files changed

Lines changed: 347 additions & 246 deletions

fs/xfs/libxfs/xfs_alloc.c

Lines changed: 23 additions & 5 deletions
@@ -82,6 +82,24 @@ xfs_prealloc_blocks(
 }

 /*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG. Although the space described by the free space btrees, the
+ * blocks used by the freesp btrees themselves, and the blocks owned by the
+ * AGFL are counted in the ondisk fdblocks, it's a mistake to let the ondisk
+ * free space in the AG drop so low that the free space btrees cannot refill an
+ * empty AGFL up to the minimum level. Rather than grind through empty AGs
+ * until the fs goes down, we subtract this many AG blocks from the incore
+ * fdblocks to ensure user allocation does not overcommit the space the
+ * filesystem needs for the AGFLs. The rmap btree uses a per-AG reservation to
+ * withhold space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+#define XFS_ALLOCBT_AGFL_RESERVE 4
+
+/*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
  * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
  * AGF buffer (PV 947395), we place constraints on the relationship among
  * actual allocations for data blocks, freelist blocks, and potential file data
@@ -93,14 +111,14 @@ xfs_prealloc_blocks(
  * extents need to be actually allocated. To get around this, we explicitly set
  * aside a few blocks which will not be reserved in delayed allocation.
  *
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount *mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
 }

 /*
@@ -124,12 +142,12 @@ xfs_alloc_ag_max_usable(
 	unsigned int blocks;

 	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
-	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += XFS_ALLOCBT_AGFL_RESERVE;
 	blocks += 3; /* AGF, AGI btree root blocks */
 	if (xfs_has_finobt(mp))
 		blocks++; /* finobt root block */
 	if (xfs_has_rmapbt(mp))
-		blocks++; /* rmap root block */
+		blocks++; /* rmap root block */
 	if (xfs_has_reflink(mp))
 		blocks++; /* refcount root block */
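
The set-aside arithmetic in xfs_alloc_set_aside() can be checked in isolation. The following is a minimal userspace sketch, not kernel code: `model_alloc_set_aside`, `AGFL_RESERVE_BLOCKS` and `BMBT_SPLIT_BLOCKS` are hypothetical names introduced here, and the 64-bit return type is an assumption made only to keep the model free of overflow (the kernel function returns `unsigned int`).

```c
#include <assert.h>
#include <stdint.h>

/* Same value as XFS_ALLOCBT_AGFL_RESERVE in the diff. */
#define AGFL_RESERVE_BLOCKS	4
/* Hypothetical name for the literal "+ 4" that covers a bmap btree split. */
#define BMBT_SPLIT_BLOCKS	4

/*
 * Userspace model of xfs_alloc_set_aside(): the number of blocks withheld
 * from the free block count, summed over every allocation group.
 */
uint64_t model_alloc_set_aside(uint32_t agcount)
{
	return (uint64_t)agcount * (AGFL_RESERVE_BLOCKS + BMBT_SPLIT_BLOCKS);
}
```

Under these assumptions, a filesystem with 4 AGs withholds 4 * (4 + 4) = 32 blocks.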

fs/xfs/libxfs/xfs_alloc.h

Lines changed: 0 additions & 1 deletion
@@ -88,7 +88,6 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */

 /* freespace limit calculations */
-#define XFS_ALLOC_AGFL_RESERVE 4
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);

fs/xfs/xfs_bio_io.c

Lines changed: 0 additions & 33 deletions
@@ -9,39 +9,6 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return bio_max_segs(howmany(count, PAGE_SIZE));
 }

-static void
-xfs_flush_bdev_async_endio(
-	struct bio *bio)
-{
-	complete(bio->bi_private);
-}
-
-/*
- * Submit a request for an async cache flush to run. If the request queue does
- * not require flush operations, just skip it altogether. If the caller needs
- * to wait for the flush completion at a later point in time, they must supply a
- * valid completion. This will be signalled when the flush completes. The
- * caller never sees the bio that is issued here.
- */
-void
-xfs_flush_bdev_async(
-	struct bio *bio,
-	struct block_device *bdev,
-	struct completion *done)
-{
-	struct request_queue *q = bdev->bd_disk->queue;
-
-	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
-		complete(done);
-		return;
-	}
-
-	bio_init(bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC);
-	bio->bi_private = done;
-	bio->bi_end_io = xfs_flush_bdev_async_endio;
-
-	submit_bio(bio);
-}
 int
 xfs_rw_bdev(
 	struct block_device *bdev,

fs/xfs/xfs_fsops.c

Lines changed: 27 additions & 33 deletions
@@ -17,6 +17,7 @@
 #include "xfs_fsops.h"
 #include "xfs_trans_space.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 #include "xfs_trace.h"
@@ -347,7 +348,7 @@ xfs_fs_counts(
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+						xfs_fdblocks_unavailable(mp);

 	spin_lock(&mp->m_sb_lock);
 	cnt->freertx = mp->m_sb.sb_frextents;
@@ -430,46 +431,36 @@ xfs_reserve_blocks(
 	 * If the request is larger than the current reservation, reserve the
 	 * blocks before we update the reserve counters. Sample m_fdblocks and
 	 * perform a partial reservation if the request exceeds free space.
+	 *
+	 * The code below estimates how many blocks it can request from
+	 * fdblocks to stash in the reserve pool. This is a classic TOCTOU
+	 * race since fdblocks updates are not always coordinated via
+	 * m_sb_lock. Set the reserve size even if there's not enough free
+	 * space to fill it because mod_fdblocks will refill an undersized
+	 * reserve when it can.
 	 */
-	error = -ENOSPC;
-	do {
-		free = percpu_counter_sum(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
-		if (free <= 0)
-			break;
-
-		delta = request - mp->m_resblks;
-		lcounter = free - delta;
-		if (lcounter < 0)
-			/* We can't satisfy the request, just get what we can */
-			fdblks_delta = free;
-		else
-			fdblks_delta = delta;
-
+	free = percpu_counter_sum(&mp->m_fdblocks) -
+						xfs_fdblocks_unavailable(mp);
+	delta = request - mp->m_resblks;
+	mp->m_resblks = request;
+	if (delta > 0 && free > 0) {
 		/*
 		 * We'll either succeed in getting space from the free block
-		 * count or we'll get an ENOSPC. If we get a ENOSPC, it means
-		 * things changed while we were calculating fdblks_delta and so
-		 * we should try again to see if there is anything left to
-		 * reserve.
+		 * count or we'll get an ENOSPC. Don't set the reserved flag
+		 * here - we don't want to reserve the extra reserve blocks
+		 * from the reserve.
 		 *
-		 * Don't set the reserved flag here - we don't want to reserve
-		 * the extra reserve blocks from the reserve.....
+		 * The desired reserve size can change after we drop the lock.
+		 * Use mod_fdblocks to put the space into the reserve or into
+		 * fdblocks as appropriate.
 		 */
+		fdblks_delta = min(free, delta);
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+		if (!error)
+			xfs_mod_fdblocks(mp, fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
-
-	/*
-	 * Update the reserve counters if blocks have been successfully
-	 * allocated.
-	 */
-	if (!error && fdblks_delta) {
-		mp->m_resblks += fdblks_delta;
-		mp->m_resblks_avail += fdblks_delta;
 	}
-
 out:
 	if (outval) {
 		outval->resblks = mp->m_resblks;
@@ -528,8 +519,11 @@ xfs_do_force_shutdown(
 	int tag;
 	const char *why;

-	if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate))
+
+	if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate)) {
+		xlog_shutdown_wait(mp->m_log);
 		return;
+	}
 	if (mp->m_sb_bp)
 		mp->m_sb_bp->b_flags |= XBF_DONE;
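
The reworked grow path in xfs_reserve_blocks() replaces the retry loop with a single pass: record the requested size unconditionally, then move at most min(free, delta) blocks. The sketch below is a simplified, single-threaded userspace model of that arithmetic only. `struct model_pool` and `model_reserve_grow` are hypothetical names, `unavailable` stands in for the result of xfs_fdblocks_unavailable(mp), and the kernel's take-then-return sequence through xfs_mod_fdblocks() is collapsed into a direct transfer, which is an assumption about its net effect when nothing else races with it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters the kernel keeps in struct xfs_mount. */
struct model_pool {
	int64_t fdblocks;	/* free data blocks */
	int64_t unavailable;	/* blocks that must never be handed out */
	int64_t resblks;	/* desired reserve pool size */
	int64_t resblks_avail;	/* blocks currently held in the reserve */
};

/*
 * Model of the grow path only: always record the requested size, then move
 * as many blocks as are free, up to the shortfall. Returns the number of
 * blocks moved into the reserve. No locking or concurrency is modelled.
 */
int64_t model_reserve_grow(struct model_pool *p, int64_t request)
{
	int64_t free = p->fdblocks - p->unavailable;
	int64_t delta = request - p->resblks;
	int64_t moved = 0;

	p->resblks = request;
	if (delta > 0 && free > 0) {
		moved = free < delta ? free : delta;
		p->fdblocks -= moved;
		p->resblks_avail += moved;
	}
	return moved;
}
```

In this model, a request larger than the free space still succeeds in setting `resblks`; the pool is simply left underfilled, which matches the diff comment that the reserve is refilled later when space becomes available.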

fs/xfs/xfs_icache.c

Lines changed: 1 addition & 1 deletion
@@ -883,7 +883,7 @@ xfs_reclaim_inode(
 	 */
 	if (xlog_is_shutdown(ip->i_mount->m_log)) {
 		xfs_iunpin_wait(ip);
-		xfs_iflush_abort(ip);
+		xfs_iflush_shutdown_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))

fs/xfs/xfs_inode.c

Lines changed: 1 addition & 1 deletion
@@ -3631,7 +3631,7 @@ xfs_iflush_cluster(

 	/*
 	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
-	 * can remove itself from the list.
+	 * will remove itself from the list.
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = (struct xfs_inode_log_item *)lip;
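
The comment in xfs_iflush_cluster() relies on the "safe" iteration idiom: the successor is saved before the loop body runs, so the body may unlink the current entry. Below is a minimal userspace sketch of that idiom using a hand-rolled circular list rather than the kernel's list_head API. `struct item`, `item_abort` and `abort_all` are hypothetical names, with `item_abort` playing the role that the comment assigns to xfs_iflush_abort().

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list node, in the spirit of struct list_head. */
struct item {
	struct item *prev, *next;
};

void item_init_head(struct item *head)
{
	head->prev = head->next = head;
}

void item_add_tail(struct item *head, struct item *it)
{
	it->prev = head->prev;
	it->next = head;
	head->prev->next = it;
	head->prev = it;
}

/* Hypothetical abort routine: the item unlinks itself from its list. */
void item_abort(struct item *it)
{
	it->prev->next = it->next;
	it->next->prev = it->prev;
	it->prev = it->next = it;	/* stale links now point at itself */
}

/*
 * Safe walk: n holds the successor before item_abort() runs, so advancing
 * never reads the links of an entry that has already been unlinked.
 * Returns the number of entries aborted.
 */
int abort_all(struct item *head)
{
	struct item *it, *n;
	int count = 0;

	for (it = head->next, n = it->next; it != head; it = n, n = it->next) {
		item_abort(it);
		count++;
	}
	return count;
}
```

With a plain walk that advanced through `it->next` after the abort, this model would spin on the first entry forever, because the unlinked entry points back at itself.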
