Commit 03786f0

Author: Darrick J. Wong
xfs: document future directions of online fsck
Add the seventh and final chapter of the online fsck documentation, where we talk about future functionality that can tie in with the functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
1 parent af051df commit 03786f0

1 file changed

Lines changed: 210 additions & 0 deletions

Documentation/filesystems/xfs-online-fsck-design.rst

@@ -5103,3 +5103,213 @@ mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.

7. Conclusion and Future Work
=============================

It is hoped that the reader of this document has followed the designs laid out
in this document and now has some familiarity with how XFS performs online
rebuilding of its metadata indices, and how filesystem users can interact with
that functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.

FIEXCHANGE_RANGE
----------------

As discussed earlier, a second frontend to the atomic extent swap mechanism is
a new ioctl call that userspace programs can use to commit updates to files
atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.

Extent Swapping with Regular User Files
```````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to maintain
the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic extent swap mechanism is an
iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

Atomic extent swapping is much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
The extra flexibility enables several new use cases:

- **Atomic commit of file writes**: A userspace process opens a file that it
  wants to update.
  Next, it opens a temporary file and calls the file clone operation to reflink
  the first file's contents into the temporary file.
  Writes to the original file should instead be written to the temporary file.
  Finally, the process calls the atomic extent swap system call
  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
  of the updates to the original file, or none of them.

.. _swapext_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
  changed.
  To make this happen, the calling process snapshots the file modification and
  change timestamps of the original file before reflinking its data to the
  temporary file.
  When the program is ready to commit the changes, it passes the timestamps
  into the kernel as arguments to the atomic extent swap system call.
  The kernel only commits the changes if the provided timestamps match the
  original file.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
  atomic extent swap system call with a flag to indicate that holes in the
  temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.

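The transactional update pattern can be approximated from userspace today,
which may help illustrate what the proposed call would do in the kernel.
The following is a minimal sketch, not the proposed interface: the helper
names ``snapshot`` and ``commit_if_unchanged`` are hypothetical, and
``os.replace()`` stands in for the exchange call.
Unlike the kernel mechanism described above, the freshness check and the
commit are separate steps here and are not atomic with respect to each other.

```python
import os
import tempfile


def snapshot(path):
    """Record the modification and change timestamps of the original file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_ctime_ns)


def commit_if_unchanged(path, snap, new_contents):
    """Stage new_contents in a temporary file, then commit only if fresh.

    Returns True if the update was committed and False if the original
    changed since the snapshot was taken.  Assumption: os.replace() is a
    stand-in for the proposed exchange call, so the check below and the
    commit are NOT atomic with each other as they would be in the kernel.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as staged:
            staged.write(new_contents)
            staged.flush()
            os.fsync(staged.fileno())
        if snapshot(path) != snap:
            return False          # the original was modified; do not commit
        os.replace(tmp, path)     # stand-in for the commit step
        tmp = None
        return True
    finally:
        if tmp is not None:
            os.unlink(tmp)        # discard the staged copy on failure
```

A caller would take the snapshot before staging its writes and treat a
``False`` return as a signal to reread the file and retry.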
Vectorized Scrub
----------------

As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
earlier was a catalyst for enabling a vectorized scrub system call.
Since 2018, the cost of making a kernel call has increased considerably on some
systems to mitigate the effects of speculative execution attacks.
This incentivizes program authors to make as few system calls as possible to
reduce the number of times an execution path crosses a security boundary.

With vectorized scrub, userspace pushes to the kernel the identity of a
filesystem object, a list of scrub types to run against that object, and a
simple representation of the data dependencies between the selected scrub
types.
The kernel executes as much of the caller's plan as it can until it hits a
dependency that cannot be satisfied due to a corruption, and tells userspace
how much was accomplished.
It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectorized scrub system
call to XFS.

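The "run as much of the plan as possible" behavior can be illustrated with a
toy model.
This sketch is not the kernel interface: ``ScrubItem``, ``run_vector``, and
the scrub type strings are hypothetical names chosen for illustration, and a
real implementation would encode dependencies in whatever form the ioctl
structure defines.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScrubItem:
    scrub_type: str                          # illustrative label only
    depends_on: list = field(default_factory=list)
    corrupt: Optional[bool] = None           # filled in when the item runs


def run_vector(items, check):
    """Run items in order; stop at the first unsatisfied dependency.

    check(scrub_type) returns True if that structure is corrupt.
    Returns the number of items that were actually executed, which is
    what the caller needs to know how much of the plan was accomplished.
    """
    results = {}
    for done, item in enumerate(items):
        # A dependency is unsatisfied if it was found corrupt or never ran.
        if any(results.get(dep, True) for dep in item.depends_on):
            return done
        item.corrupt = check(item.scrub_type)
        results[item.scrub_type] = item.corrupt
    return len(items)
```

Items after the stopping point keep ``corrupt`` set to ``None``, so the caller
can distinguish "clean" from "not examined".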
The relevant patchsets are the
`kernel vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
and
`userspace vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
series.

Quality of Service Targets for Scrub
------------------------------------

One serious shortcoming of the online fsck code is that the amount of time that
it can spend in the kernel holding resource locks is basically unbounded.
Userspace is allowed to send a fatal signal to the process which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
be too much work to allow userspace to specify a timeout for a scrub/repair
operation and abort the operation if it exceeds budget.
However, most repair functions have the property that once they begin to touch
ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
timeout is no longer useful.

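The shape of such a time budget, including the point past which it stops
helping, might look like the following sketch.
Everything here is an assumption made for illustration: ``run_repair``,
``BudgetExceeded``, and the split into "gather" and "commit" steps are
hypothetical and do not correspond to existing scrub code.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the time budget runs out before the commit phase."""


def run_repair(gather_steps, commit_steps, budget_s, clock=time.monotonic):
    """Run gather steps under a deadline, then commit without interruption."""
    deadline = clock() + budget_s
    for step in gather_steps:
        # Cancelling here is safe: nothing ondisk has been touched yet.
        if clock() >= deadline:
            raise BudgetExceeded("aborted before touching ondisk metadata")
        step()
    # Point of no return: once ondisk metadata is being rewritten, the
    # operation runs to completion regardless of the deadline.
    for step in commit_steps:
        step()
```

The deadline is only consulted between gather steps, which mirrors the
limitation described above: a timeout cannot bound the commit phase.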
Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that region
to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structure.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again to
collect any newly freed space in the region being cleared.

To clear all the file data out of a portion of the physical storage, clearspace
uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that part
of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
<swapext_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.

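The per-extent decisions described above can be summarized as a planning pass
over reverse mapping records.
The sketch below is a simplified model, not the ``clearspace`` implementation:
the ``Mapping`` record, the three owner labels, and the action names are
hypothetical, whereas real ``FS_IOC_GETFSMAP`` records carry device, owner,
offset, and flag fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    physical: int    # start of the extent, in blocks
    length: int      # extent length, in blocks
    owner: str       # simplified label: "free", "metadata", or "file"


def plan_clear(mappings, start, end):
    """Return (action, physical, length) for extents overlapping [start, end)."""
    actions = {
        "free": "map_to_collector",      # fallocate free space into collector
        "metadata": "force_repair",      # rebuild the structure elsewhere
        "file": "clone_copy_remap",      # freeze, copy, then remap the owner
    }
    plan = []
    for m in sorted(mappings, key=lambda m: m.physical):
        # Clamp each extent to the region being cleared.
        lo = max(m.physical, start)
        hi = min(m.physical + m.length, end)
        if lo < hi:
            plan.append((actions[m.owner], lo, hi - lo))
    return plan
```

In practice the plan would be recomputed after each relocation, since forced
repairs and remaps free space that must then be collected.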
There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize sharing factor after
the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the longest,
most shared data extents in the filesystem, and target them first.

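Given refcount records, choosing the relocation order is a simple sort.
This is a sketch under stated assumptions: the ``(physical, length,
refcount)`` tuple layout is hypothetical because the proposed ioctl's record
format is not described here, and ranking by sharing factor before length is
one possible reading of "longest, most shared".

```python
def relocation_order(extents):
    """Order (physical, length, refcount) tuples for relocation.

    Most shared extents come first; ties are broken by longest extent,
    then by lowest physical address to keep the result deterministic.
    """
    return sorted(extents, key=lambda e: (-e[2], -e[1], e[0]))
```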
**Future Work Question**: How might the filesystem move inode chunks?

*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
that creates a new file with the old contents and then locklessly runs around
the filesystem updating directory entries.
The operation cannot complete if the filesystem goes down.
That problem isn't totally insurmountable: create an inode remapping table
hidden behind a jump label, and a log item that tracks the kernel walking the
filesystem to update directory entries.
The trouble is, the kernel can't do anything about open files, since it cannot
revoke them.

**Future Work Question**: Can static keys be used to minimize the cost of
supporting ``revoke()`` on XFS files?

*Answer*: Yes.
Until the first revocation, the bailout code need not be in the call path at
all.

The relevant patchsets are the
`kernel freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
and
`userspace freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
series.

Shrinking Filesystems
---------------------

Removing the end of the filesystem ought to be a simple matter of evacuating
the data and metadata at the end of the filesystem, and handing the freed space
to the shrink code.
That requires an evacuation of the space at the end of the filesystem, which is
a use of free space defragmentation!
