Skip to content

Commit 01728b4

Browse files
dgchinnerDarrick J. Wong
authored andcommitted
xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
I've been chasing a recent resurgence in generic/388 recovery failure and/or corruption events. The events have largely been uninitialised inode chunks being tripped over in log recovery such as: XFS (pmem1): User initiated shutdown received. pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096 XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem. XFS (pmem1): Please unmount the filesystem and rectify the problem(s) XFS (pmem1): Unmounting Filesystem XFS (pmem1): Mounting V5 Filesystem XFS (pmem1): Starting recovery (logdev: internal) XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818) XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify XFS (pmem1): Unmount and run xfs_repair XFS (pmem1): First 128 bytes of corrupted metadata buffer: 00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117 XFS (pmem1): log mount/recovery failed: error -117 XFS (pmem1): log mount failed There have been isolated random other issues, too - xfs_repair fails because it finds some corruption in symlink blocks, rmap inconsistencies, etc - but they are nowhere near as common as the uninitialised inode chunk failure. The problem has clearly happened at runtime before recovery has run; I can see the ICREATE log item in the log shortly before the actively recovered range of the log. This means the ICREATE was definitely created and written to the log, but for some reason the tail of the log has been moved past the ordered buffer log item that tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the log moving past the ICREATE log item before the inode chunk buffer is written to disk. Tracing the fsstress processes that are running when the filesystem shut down immediately pin-pointed the problem: user shutdown marks xfs_mount as shutdown godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received. ..... aild tries to push ordered inode cluster buffer xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae xfs_iflush_cluster() checks xfs_is_shutdown(), returns true, calls xfs_iflush_abort() to kill writeback of the inode. Inode is removed from AIL, drops cluster buffer reference. xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7 ..... All inodes on cluster buffer are aborted, then the cluster buffer itself is aborted and removed from the AIL *without writeback*: xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33 xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL The inode buffer was at 7/20344 when it was removed from the AIL. xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31 xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39 ..... Userspace is still running, doing stuff. an fsstress process runs syncfs() or sync() and we end up in sync_fs_one_sb() which issues a log force. This pushes on the CIL: fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26 fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82 fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c <...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f And the CIL fills up iclogs with pending changes. This picks up the current tail from the AIL: <...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149 <...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e <...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9 <...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448 And it moves the tail of the log to 7/21000 from 7/20344. This *moves the tail of the log beyond the ICREATE transaction* that was at 7/20344 and pinned by the inode cluster buffer that was cancelled above. .... godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500 godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096 godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/ And finally the log itself is now shutdown, stopping all further writes to the log. But this is too late to prevent the corruption that moving the tail of the log forwards after we start cancelling writeback causes. The fundamental problem here is that we are using the wrong shutdown checks for log items. We've long conflated mount shutdown with log shutdown state, and I started separating that recently with the atomic shutdown state changes in commit b36d465 ("xfs: make forced shutdown processing atomic"). The changes in that commit series are directly responsible for being able to diagnose this issue because it clearly separated mount shutdown from log shutdown. Essentially, once we start cancelling writeback of log items and removing them from the AIL because the filesystem is shut down, we *cannot* update the journal because we may have cancelled the items that pin the tail of the log. That moves the tail of the log forwards without having written the metadata back, hence we have corrupt in memory state and writing to the journal propagates that to the on-disk state. What commit b36d465 makes clear is that log item state needs to change relative to log shutdown, not mount shutdown. IOWs, anything that aborts metadata writeback needs to check log shutdown state because log items directly affect log consistency. Having them check mount shutdown state introduces the above race condition where we cancel metadata writeback before the log shuts down. To fix this, this patch works through all log items and converts shutdown checks to use xlog_is_shutdown() rather than xfs_is_shutdown(), so that we don't start aborting metadata writeback before we shut off journal writes. AFAICT, this race condition is a zero day IO error handling bug in XFS that dates back to the introduction of XLOG_IO_ERROR, XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
1 parent 8eda872 commit 01728b4

5 files changed

Lines changed: 70 additions & 15 deletions

File tree

fs/xfs/xfs_buf.c

Lines changed: 32 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@
1414
#include "xfs_trace.h"
1515
#include "xfs_log.h"
1616
#include "xfs_log_recover.h"
17+
#include "xfs_log_priv.h"
1718
#include "xfs_trans.h"
1819
#include "xfs_buf_item.h"
1920
#include "xfs_errortag.h"
@@ -813,7 +814,15 @@ xfs_buf_read_map(
813814
* buffer.
814815
*/
815816
if (error) {
816-
if (!xfs_is_shutdown(target->bt_mount))
817+
/*
818+
* Check against log shutdown for error reporting because
819+
* metadata writeback may require a read first and we need to
820+
* report errors in metadata writeback until the log is shut
821+
* down. High level transaction read functions already check
822+
* against mount shutdown, anyway, so we only need to be
823+
* concerned about low level IO interactions here.
824+
*/
825+
if (!xlog_is_shutdown(target->bt_mount->m_log))
817826
xfs_buf_ioerror_alert(bp, fa);
818827

819828
bp->b_flags &= ~XBF_DONE;
@@ -1177,10 +1186,10 @@ xfs_buf_ioend_handle_error(
11771186
struct xfs_error_cfg *cfg;
11781187

11791188
/*
1180-
* If we've already decided to shutdown the filesystem because of I/O
1181-
* errors, there's no point in giving this a retry.
1189+
* If we've already shutdown the journal because of I/O errors, there's
1190+
* no point in giving this a retry.
11821191
*/
1183-
if (xfs_is_shutdown(mp))
1192+
if (xlog_is_shutdown(mp->m_log))
11841193
goto out_stale;
11851194

11861195
xfs_buf_ioerror_alert_ratelimited(bp);
@@ -1593,8 +1602,23 @@ __xfs_buf_submit(
15931602

15941603
ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
15951604

1596-
/* on shutdown we stale and complete the buffer immediately */
1597-
if (xfs_is_shutdown(bp->b_mount)) {
1605+
/*
1606+
* On log shutdown we stale and complete the buffer immediately. We can
1607+
* be called to read the superblock before the log has been set up, so
1608+
* be careful checking the log state.
1609+
*
1610+
* Checking the mount shutdown state here can result in the log tail
1611+
* moving inappropriately on disk as the log may not yet be shut down.
1612+
* i.e. failing this buffer on mount shutdown can remove it from the AIL
1613+
* and move the tail of the log forwards without having written this
1614+
* buffer to disk. This corrupts the log tail state in memory, and
1615+
* because the log may not be shut down yet, it can then be propagated
1616+
* to disk before the log is shutdown. Hence we check log shutdown
1617+
* state here rather than mount state to avoid corrupting the log tail
1618+
* on shutdown.
1619+
*/
1620+
if (bp->b_mount->m_log &&
1621+
xlog_is_shutdown(bp->b_mount->m_log)) {
15981622
xfs_buf_ioend_fail(bp);
15991623
return -EIO;
16001624
}
@@ -1808,10 +1832,10 @@ xfs_buftarg_drain(
18081832
* If one or more failed buffers were freed, that means dirty metadata
18091833
* was thrown away. This should only ever happen after I/O completion
18101834
* handling has elevated I/O error(s) to permanent failures and shuts
1811-
* down the fs.
1835+
* down the journal.
18121836
*/
18131837
if (write_fail) {
1814-
ASSERT(xfs_is_shutdown(btp->bt_mount));
1838+
ASSERT(xlog_is_shutdown(btp->bt_mount->m_log));
18151839
xfs_alert(btp->bt_mount,
18161840
"Please run xfs_repair to determine the extent of the problem.");
18171841
}

fs/xfs/xfs_icache.c

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323
#include "xfs_reflink.h"
2424
#include "xfs_ialloc.h"
2525
#include "xfs_ag.h"
26+
#include "xfs_log_priv.h"
2627

2728
#include <linux/iversion.h>
2829

@@ -873,7 +874,14 @@ xfs_reclaim_inode(
873874
if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
874875
goto out_iunlock;
875876

876-
if (xfs_is_shutdown(ip->i_mount)) {
877+
/*
878+
* Check for log shutdown because aborting the inode can move the log
879+
* tail and corrupt in memory state. This is fine if the log is shut
880+
* down, but if the log is still active and only the mount is shut down
881+
* then the in-memory log tail movement caused by the abort can be
882+
* incorrectly propagated to disk.
883+
*/
884+
if (xlog_is_shutdown(ip->i_mount->m_log)) {
877885
xfs_iunpin_wait(ip);
878886
xfs_iflush_abort(ip);
879887
goto reclaim;

fs/xfs/xfs_inode.c

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@
3535
#include "xfs_bmap_btree.h"
3636
#include "xfs_reflink.h"
3737
#include "xfs_ag.h"
38+
#include "xfs_log_priv.h"
3839

3940
struct kmem_cache *xfs_inode_cache;
4041

@@ -3678,7 +3679,7 @@ xfs_iflush_cluster(
36783679
* AIL, leaving a dirty/unpinned inode attached to the buffer
36793680
* that otherwise looks like it should be flushed.
36803681
*/
3681-
if (xfs_is_shutdown(mp)) {
3682+
if (xlog_is_shutdown(mp->m_log)) {
36823683
xfs_iunpin_wait(ip);
36833684
xfs_iflush_abort(ip);
36843685
xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -3704,9 +3705,19 @@ xfs_iflush_cluster(
37043705
}
37053706

37063707
if (error) {
3708+
/*
3709+
* Shutdown first so we kill the log before we release this
3710+
* buffer. If it is an INODE_ALLOC buffer and pins the tail
3711+
* of the log, failing it before the _log_ is shut down can
3712+
* result in the log tail being moved forward in the journal
3713+
* on disk because log writes can still be taking place. Hence
3714+
* unpinning the tail will allow the ICREATE intent to be
3715+
* removed from the log an recovery will fail with uninitialised
3716+
* inode cluster buffers.
3717+
*/
3718+
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
37073719
bp->b_flags |= XBF_ASYNC;
37083720
xfs_buf_ioend_fail(bp);
3709-
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
37103721
return error;
37113722
}
37123723

fs/xfs/xfs_inode_item.c

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
#include "xfs_trans_priv.h"
1818
#include "xfs_buf_item.h"
1919
#include "xfs_log.h"
20+
#include "xfs_log_priv.h"
2021
#include "xfs_error.h"
2122

2223
#include <linux/iversion.h>
@@ -720,6 +721,17 @@ xfs_iflush_ail_updates(
720721
if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
721722
continue;
722723

724+
/*
725+
* dgc: Not sure how this happens, but it happens very
726+
* occassionaly via generic/388. xfs_iflush_abort() also
727+
* silently handles this same "under writeback but not in AIL at
728+
* shutdown" condition via xfs_trans_ail_delete().
729+
*/
730+
if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
731+
ASSERT(xlog_is_shutdown(lip->li_log));
732+
continue;
733+
}
734+
723735
lsn = xfs_ail_delete_one(ailp, lip);
724736
if (!tail_lsn && lsn)
725737
tail_lsn = lsn;

fs/xfs/xfs_qm.c

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@
2525
#include "xfs_error.h"
2626
#include "xfs_ag.h"
2727
#include "xfs_ialloc.h"
28+
#include "xfs_log_priv.h"
2829

2930
/*
3031
* The global quota manager. There is only one of these for the entire
@@ -121,8 +122,7 @@ xfs_qm_dqpurge(
121122
struct xfs_dquot *dqp,
122123
void *data)
123124
{
124-
struct xfs_mount *mp = dqp->q_mount;
125-
struct xfs_quotainfo *qi = mp->m_quotainfo;
125+
struct xfs_quotainfo *qi = dqp->q_mount->m_quotainfo;
126126
int error = -EAGAIN;
127127

128128
xfs_dqlock(dqp);
@@ -157,7 +157,7 @@ xfs_qm_dqpurge(
157157
}
158158

159159
ASSERT(atomic_read(&dqp->q_pincount) == 0);
160-
ASSERT(xfs_is_shutdown(mp) ||
160+
ASSERT(xlog_is_shutdown(dqp->q_logitem.qli_item.li_log) ||
161161
!test_bit(XFS_LI_IN_AIL, &dqp->q_logitem.qli_item.li_flags));
162162

163163
xfs_dqfunlock(dqp);
@@ -172,7 +172,7 @@ xfs_qm_dqpurge(
172172
*/
173173
ASSERT(!list_empty(&dqp->q_lru));
174174
list_lru_del(&qi->qi_lru, &dqp->q_lru);
175-
XFS_STATS_DEC(mp, xs_qm_dquot_unused);
175+
XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
176176

177177
xfs_qm_dqdestroy(dqp);
178178
return 0;

0 commit comments

Comments
 (0)