Commit af051df

Author: Darrick J. Wong

xfs: document the userspace fsck driver program
Add the sixth chapter of the online fsck design documentation, where we
discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

1 parent a26aa25 commit af051df

1 file changed: Documentation/filesystems/xfs-online-fsck-design.rst
Lines changed: 316 additions & 0 deletions

@@ -315,6 +315,9 @@ The seven phases are as follows:

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.

Steps for Each Scrub Item
-------------------------

@@ -4787,3 +4790,316 @@ The proposed patches are in the

`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.

6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work <scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

4811+
a. Filesystem summary counts depend on consistency within the inode indices,
4812+
the allocation group space btrees, and the realtime volume space
4813+
information.
4814+
4815+
b. Quota resource counts depend on consistency within the quota file data
4816+
forks, inode indices, inode records, and the forks of every file on the
4817+
system.
4818+
4819+
c. The naming hierarchy depends on consistency within the directory and
4820+
extended attribute structures.
4821+
This includes file link counts.
4822+
4823+
d. Directories, extended attributes, and file data depend on consistency within
4824+
the file forks that map directory and extended attribute data to physical
4825+
storage media.
4826+
4827+
e. The file forks depends on consistency within inode records and the space
4828+
metadata indices of the allocation groups and the realtime volume.
4829+
This includes quota and realtime metadata files.
4830+
4831+
f. Inode records depends on consistency within the inode metadata indices.
4832+
4833+
g. Realtime space metadata depend on the inode records and data forks of the
4834+
realtime metadata inodes.
4835+
4836+
h. The allocation group metadata indices (free space, inodes, reference count,
4837+
and reverse mapping btrees) depend on consistency within the AG headers and
4838+
between all the AG metadata btrees.
4839+
4840+
i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
4841+
for online fsck functionality.
4842+
4843+
Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:

- Phase 1 checks that the provided path maps to an XFS filesystem and detects
  the kernel's scrubbing abilities, which validates group (i).

- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.

- Phase 3 scans inodes in parallel.
  For each inode, groups (f), (e), and (d) are checked, in that order.

- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
  may run reliably.

- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
  to checking names.

- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
  to read them, and to report which blocks of which files are affected.

- Phase 7 checks group (a), having validated everything else.

Notice that the data dependencies between groups are enforced by the structure
of the program flow, as the sketch below illustrates.

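To make that ordering concrete, the following is an illustrative sketch (the
names are hypothetical and not taken from the ``xfs_scrub`` source) of the
dependency groups each phase validates; each phase relies only on groups
validated by earlier phases:

.. code-block:: c

    /* Hypothetical illustration of the per-phase dependency groups. */
    enum dep_group {
        DEP_SUMMARY = 1 << 0,   /* (a) filesystem summary counts */
        DEP_QUOTA   = 1 << 1,   /* (b) quota resource counts */
        DEP_NAMING  = 1 << 2,   /* (c) naming hierarchy */
        DEP_DIRATTR = 1 << 3,   /* (d) dirs, xattrs, file data */
        DEP_FORKS   = 1 << 4,   /* (e) file forks */
        DEP_INODES  = 1 << 5,   /* (f) inode records */
        DEP_RTSPACE = 1 << 6,   /* (g) realtime space metadata */
        DEP_AGSPACE = 1 << 7,   /* (h) AG space metadata */
        DEP_RUNTIME = 1 << 8,   /* (i) mounted fs, kernel support */
    };

    /* Groups validated by each phase, in program-flow order. */
    static const unsigned int phase_validates[] = {
        [1] = DEP_RUNTIME,
        [2] = DEP_RTSPACE | DEP_AGSPACE,
        [3] = DEP_INODES | DEP_FORKS | DEP_DIRATTR,
        [4] = 0,                /* repairs groups (i) through (d) */
        [5] = DEP_QUOTA | DEP_NAMING,
        [6] = 0,                /* media scan; relies on (i)-(b) */
        [7] = DEP_SUMMARY,
    };
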
Parallel Inode Scans
--------------------

An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.

Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.

Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.

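Below is a minimal sketch of such a bounded workqueue, written with plain
POSIX threads rather than the actual xfsprogs workqueue code; all names here
are hypothetical:

.. code-block:: c

    /*
     * Minimal sketch of a bounded workqueue; not the real implementation.
     * Producers (the per-AG INUMBERS walkers) block in bounded_wq_add()
     * when the backlog hits the bound, which throttles an AG worker that
     * produces inode btree chunks too quickly.
     */
    #include <pthread.h>

    #define WQ_BOUND    64          /* max items waiting to run */

    struct wq_item {
        void        (*fn)(void *arg);
        void        *arg;
    };

    struct bounded_wq {
        pthread_mutex_t lock;       /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  not_full;   /* init with PTHREAD_COND_INITIALIZER */
        pthread_cond_t  not_empty;
        struct wq_item  items[WQ_BOUND];
        unsigned int    head, tail, nr;
    };

    /* First workqueue's AG workers call this with each inode btree chunk. */
    void bounded_wq_add(struct bounded_wq *wq, void (*fn)(void *), void *arg)
    {
        pthread_mutex_lock(&wq->lock);
        while (wq->nr == WQ_BOUND)      /* block until the backlog eases */
            pthread_cond_wait(&wq->not_full, &wq->lock);
        wq->items[wq->tail] = (struct wq_item){ fn, arg };
        wq->tail = (wq->tail + 1) % WQ_BOUND;
        wq->nr++;
        pthread_cond_signal(&wq->not_empty);
        pthread_mutex_unlock(&wq->lock);
    }

    /* Second workqueue's workers: BULKSTAT each chunk, emit scrub items. */
    void *bounded_wq_worker(void *data)
    {
        struct bounded_wq *wq = data;

        for (;;) {
            struct wq_item item;

            pthread_mutex_lock(&wq->lock);
            while (wq->nr == 0)
                pthread_cond_wait(&wq->not_empty, &wq->lock);
            item = wq->items[wq->head];
            wq->head = (wq->head + 1) % WQ_BOUND;
            wq->nr--;
            pthread_cond_signal(&wq->not_full);
            pthread_mutex_unlock(&wq->lock);

            item.fn(item.arg);  /* bulkstat + generate scrub items */
        }
        return NULL;
    }
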
The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.

.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.

During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be completed immediately are scheduled for phase 4.

In the original design of ``xfs_scrub``, it was thought that repairs would be
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
communicate with the kernel could also be used as the primary object to
schedule repairs.
With recent increases in the number of optimizations possible for a given
filesystem object, it became much more memory-efficient to track all eligible
repairs for a given filesystem object with a single repair item.
Each repair item represents a single lockable object -- AGs, metadata files,
individual inodes, or a class of summary information.

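A repair item might be sketched as follows; the ``repair_item`` layout and the
helper are hypothetical, but ``struct xfs_scrub_metadata``,
``XFS_SCRUB_IFLAG_REPAIR``, and ``XFS_IOC_SCRUB_METADATA`` are the actual
kernel scrub interfaces (assuming the xfsprogs development headers):

.. code-block:: c

    /*
     * Sketch of a repair item.  The repair_item layout is hypothetical,
     * but the xfs_scrub_metadata ioctl interface is real.
     */
    #include <xfs/xfs.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    struct repair_item {
        uint64_t    ino;        /* inode number, if applicable */
        uint32_t    gen;        /* inode generation number */
        uint32_t    agno;       /* AG number, if applicable */
        uint64_t    scrub_mask; /* XFS_SCRUB_TYPE_* still needing repair */
    };

    /* Ask the kernel to repair one metadata type of one lockable object. */
    static int repair_one(int fd, const struct repair_item *ri, uint32_t type)
    {
        struct xfs_scrub_metadata meta = {
            .sm_type  = type,
            .sm_flags = XFS_SCRUB_IFLAG_REPAIR,
            .sm_ino   = ri->ino,
            .sm_gen   = ri->gen,
            .sm_agno  = ri->agno,
        };

        return ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
    }
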
Phase 4 is responsible for scheduling a lot of repair work in as quick a
manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
means that ``xfs_scrub`` must try to complete the repair work scheduled by
phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows, and is sketched in code after the list:

1. Start a round of repair with a workqueue and enough workers to keep the CPUs
   as busy as the user desires.

   a. For each repair item queued by phase 2,

      i. Ask the kernel to repair everything listed in the repair item for a
         given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the number
          of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.

   c. For each repair item queued by phase 3,

      i. Ask the kernel to repair everything listed in the repair item for a
         given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the number
          of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.

2. If step 1 made any repair progress of any kind, jump back to step 1 to start
   another round of repair.

3. If there are items left to repair, run them all serially one more time.
   Complain if the repairs were not successful, since this is the last chance
   to repair anything.

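Here is that sketch in C; every type and helper below is a hypothetical
stand-in for the real ``xfs_scrub`` machinery:

.. code-block:: c

    /* Condensed, hypothetical sketch of the phase 4 repair loop. */
    #include <stdbool.h>

    struct repair_list;
    struct scrub_ctx {
        struct repair_list  *phase2_repairs;
        struct repair_list  *phase3_repairs;
    };

    /*
     * One pass over a repair list: ask the kernel to fix each item,
     * revalidate and drop items that come clean, requeue the rest.
     * Returns true if any item made progress.
     */
    bool repair_list_once(struct scrub_ctx *ctx, struct repair_list *list);

    /* Final serial pass; complains about anything still broken. */
    void repair_list_final(struct scrub_ctx *ctx, struct repair_list *list);

    void phase4_repair(struct scrub_ctx *ctx)
    {
        bool progress;

        do {
            progress = false;

            /* Steps 1a-1b: retry the phase 2 items while they improve. */
            while (repair_list_once(ctx, ctx->phase2_repairs))
                progress = true;

            /* Steps 1c-1d: only then work on the phase 3 items. */
            while (repair_list_once(ctx, ctx->phase3_repairs))
                progress = true;
        } while (progress);     /* step 2: repeat after any progress */

        /* Step 3: last chance; run the leftovers serially. */
        repair_list_final(ctx, ctx->phase2_repairs);
        repair_list_final(ctx, ctx->phase3_repairs);
    }
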
Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.

The proposed patchsets are the
`repair warning improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
refactoring of the
`repair data dependency
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
and
`object tracking
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
and the
`repair scheduling
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
improvement series.

Checking Names for Confusable Unicode Sequences
-----------------------------------------------

If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
phase 4, it moves on to phase 5, which checks for suspicious looking names in
the filesystem.
These names consist of the filesystem label, names in directory entries, and
the names of extended attributes.
Like most Unix filesystems, XFS imposes the sparest of constraints on the
contents of a name:

- Slashes and null bytes are not allowed in directory entries.

- Null bytes are not allowed in userspace-visible extended attributes.

- Null bytes are not allowed in the filesystem label.

Directory entries and attribute keys store the length of the name explicitly
ondisk, which means that nulls are not name terminators.
For this section, the term "naming domain" refers to any place where names are
presented together -- all the names in a directory, or all the attributes of a
file.

Although the Unix naming constraints are very permissive, the reality of most
modern-day Linux systems is that programs work with Unicode character code
points to support international languages.
These programs typically encode those code points in UTF-8 when interfacing
with the C library because the kernel expects null-terminated names.
In the common case, therefore, names found in an XFS filesystem are actually
UTF-8 encoded Unicode data.

To maximize its expressiveness, the Unicode standard defines separate code
points for various characters that render similarly or identically in writing
systems around the world.
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
identically to "Latin Small Letter A" U+0061 "a".

The standard also permits characters to be constructed in multiple ways --
either by using a defined code point, or by combining one code point with
various combining marks.
For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
U+030A "◌̊".
Both sequences render identically.

Like the standards that preceded it, Unicode also defines various control
characters to alter the presentation of text.
For example, the character "Right-to-Left Override" U+202E can trick some
programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
A second category of rendering problems involves whitespace characters.
If the character "Zero Width Space" U+200B is encountered in a file name, the
name will render identically to a name that does not have the zero width
space.

If two names within a naming domain have different byte sequences but render
identically, a user may be confused.
The kernel, in its indifference to upper level encoding schemes, permits this.
Most filesystem drivers persist the byte sequence names that are given to them
by the VFS.

Techniques for detecting confusable names are explained in great detail in
sections 4 and 5 of the
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
document.
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
Unicode normalization form NFD in conjunction with the confusable name
detection component of
`libicu <https://github.com/unicode-org/icu>`_
to identify names within a directory or within a file's extended attributes
that could be confused for each other.
Names are also checked for control characters, non-rendering characters, and
mixing of bidirectional characters.
All of these potential issues are reported to the system administrator during
phase 5.

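As a minimal illustration of the detection primitive (error handling trimmed;
the real checks also apply NFD normalization and scan for the problem
characters above), libicu's spoof checker can report whether two names are
confusable:

.. code-block:: c

    /*
     * Minimal confusable-name check with libicu's spoof checker; build
     * with -licui18n -licuuc.  Not the actual xfs_scrub code.
     */
    #include <unicode/uspoof.h>
    #include <stdio.h>
    #include <string.h>

    /* Return nonzero if two names in a naming domain are confusable. */
    static int names_confusable(const char *name1, const char *name2)
    {
        UErrorCode status = U_ZERO_ERROR;
        USpoofChecker *sc = uspoof_open(&status);
        int32_t ret;

        if (U_FAILURE(status))
            return 0;
        ret = uspoof_areConfusableUTF8(sc, name1, strlen(name1),
                                       name2, strlen(name2), &status);
        uspoof_close(sc);
        return U_SUCCESS(status) && ret != 0;
    }

    int main(void)
    {
        /* Cyrillic "а" in the first name renders like Latin "a". */
        printf("confusable: %d\n",
               names_confusable("p\xd0\xb0yroll", "payroll"));
        return 0;
    }
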
Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs after validation of all filesystem metadata (except for the
summary counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media, and records
the error.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.

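A rough sketch of the phase 6 scan follows, assuming a single data device and
eliding the error recording and single-block retry logic; the 64k and 32MB
thresholds mirror the text above, and ``FS_IOC_GETFSMAP`` with its ``struct
fsmap_head`` come from the kernel's ``<linux/fsmap.h>``:

.. code-block:: c

    /* Simplified, unthreaded sketch of the phase 6 media scan. */
    #define _GNU_SOURCE             /* O_DIRECT */
    #include <linux/fsmap.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NR_RECS     128
    #define GAP_LIMIT   (64ULL << 10)   /* merge gaps smaller than 64k */
    #define IO_MAX      (32ULL << 20)   /* at most 32MB per directio read */

    /* Read back a physical byte range of the device, IO_MAX at a time. */
    static void verify(int dev_fd, char *buf, unsigned long long phys,
                       unsigned long long len)
    {
        while (len > 0) {
            unsigned long long count = len > IO_MAX ? IO_MAX : len;

            if (pread(dev_fd, buf, count, phys) < 0)
                ; /* media error: narrow down and record (omitted) */
            phys += count;
            len -= count;
        }
    }

    int main(int argc, char *argv[])
    {
        struct fsmap_head *head = calloc(1, fsmap_sizeof(NR_RECS));
        int fs_fd, dev_fd;
        char *buf = aligned_alloc(4096, IO_MAX);
        unsigned long long start = 0, end = 0;

        if (argc < 3)
            return 1;
        fs_fd = open(argv[1], O_RDONLY);             /* mounted fs */
        dev_fd = open(argv[2], O_RDONLY | O_DIRECT); /* raw data device */

        head->fmh_count = NR_RECS;
        head->fmh_keys[1].fmr_device = UINT_MAX;     /* scan everything */
        head->fmh_keys[1].fmr_flags = UINT_MAX;
        head->fmh_keys[1].fmr_physical = ULLONG_MAX;
        head->fmh_keys[1].fmr_owner = ULLONG_MAX;
        head->fmh_keys[1].fmr_offset = ULLONG_MAX;

        while (!ioctl(fs_fd, FS_IOC_GETFSMAP, head) && head->fmh_entries) {
            for (unsigned int i = 0; i < head->fmh_entries; i++) {
                struct fsmap *rec = &head->fmh_recs[i];

                /* Keep only file data fork extents, not fs metadata. */
                if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
                                      FMR_OF_ATTR_FORK |
                                      FMR_OF_EXTENT_MAP))
                    continue;
                if (rec->fmr_physical > end + GAP_LIMIT) {
                    verify(dev_fd, buf, start, end - start);
                    start = rec->fmr_physical;
                }
                end = rec->fmr_physical + rec->fmr_length;
            }
            if (head->fmh_recs[head->fmh_entries - 1].fmr_flags &
                FMR_OF_LAST)
                break;
            fsmap_advance(head);
        }
        verify(dev_fd, buf, start, end - start);     /* flush last region */
        return 0;
    }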
