The seven phases are as follows:

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.

Steps for Each Scrub Item
-------------------------
The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.

6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode indices,
   the allocation group space btrees, and the realtime volume space
   information.

b. Quota resource counts depend on consistency within the quota file data
   forks, inode indices, inode records, and the forks of every file on the
   system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on consistency within
   the file forks that map directory and extended attribute data to physical
   storage media.

e. The file forks depend on consistency within inode records and the space
   metadata indices of the allocation groups and the realtime volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata indices.

g. Realtime space metadata depend on the inode records and data forks of the
   realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference count,
   and reverse mapping btrees) depend on consistency within the AG headers and
   between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
   for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:

- Phase 1 checks that the provided path maps to an XFS filesystem and detects
  the kernel's scrubbing abilities, which validates group (i).

- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.

- Phase 3 scans inodes in parallel.
  For each inode, groups (f), (e), and (d) are checked, in that order.

- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
  may run reliably.

- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
  to checking names.

- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
  to read them, and to report which blocks of which files are affected.

- Phase 7 checks group (a), having validated everything else.

Notice that the data dependencies between groups are enforced by the structure
of the program flow.
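One way to see why the phase ordering is valid is to model the dependency groups as a small graph and let a topological sort produce a checking order; the phase structure above is a coarser, parallelized version of the same ordering.  The sketch below is illustrative only -- the labels and the exact edge set are invented for this example and are not ``xfs_scrub`` source:

```python
from graphlib import TopologicalSorter

# Approximate model of dependency groups (a) through (i) from the list
# above; each key maps to the groups it depends on.
deps = {
    "a_summary_counts": {"h_ag_btrees", "g_rt_metadata"},
    "b_quota_counts": {"e_file_forks", "f_inode_records"},
    "c_naming_hierarchy": {"d_dirs_attrs"},
    "d_dirs_attrs": {"e_file_forks"},
    "e_file_forks": {"f_inode_records", "g_rt_metadata", "h_ag_btrees"},
    "f_inode_records": {"h_ag_btrees"},
    "g_rt_metadata": {"f_inode_records"},
    "h_ag_btrees": {"i_mounted_fs"},
    "i_mounted_fs": set(),
}

# Any topological order is a valid serial checking sequence: every group
# is checked only after everything it depends on has been checked.
order = list(TopologicalSorter(deps).static_order())
```

Because group (i) is the only group with no prerequisites, it always sorts first, just as phase 1 always runs first.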

Parallel Inode Scans
--------------------

An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.

Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.

Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.
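The two-stage scheme can be modeled with a bounded queue between two sets of workers.  This is a toy sketch, not the real C implementation: ``scan_filesystem`` and the per-AG chunk lists stand in for the INUMBERS/BULKSTAT machinery, and the blocking ``put()`` on the bounded queue plays the role of the backlog throttle described above:

```python
import threading
import queue

CHUNK_BACKLOG = 4   # upper bound on queued inode-chunk work items

def scan_filesystem(ag_chunks):
    """ag_chunks: one list of inode-chunk work items per AG."""
    results = []
    lock = threading.Lock()
    chunk_wq = queue.Queue(maxsize=CHUNK_BACKLOG)

    def stage2_worker():
        # Stands in for the BULKSTAT + scrub-item-generation workqueue.
        while True:
            chunk = chunk_wq.get()
            if chunk is None:       # sentinel: no more work
                break
            with lock:
                results.append(chunk)

    def stage1_worker(agno):
        # Stands in for the per-AG INUMBERS walk; put() blocks whenever
        # stage 2 is backlogged, which is what rebalances the load.
        for chunk in ag_chunks[agno]:
            chunk_wq.put(chunk)

    stage2 = [threading.Thread(target=stage2_worker) for _ in range(2)]
    for w in stage2:
        w.start()
    stage1 = [threading.Thread(target=stage1_worker, args=(agno,))
              for agno in range(len(ag_chunks))]
    for t in stage1:
        t.start()
    for t in stage1:
        t.join()
    for _ in stage2:
        chunk_wq.put(None)          # one sentinel per stage 2 worker
    for w in stage2:
        w.join()
    return results
```

Even if one AG contributes far more chunks than the others, the chunks drain through the shared bounded queue, so the second-stage workers stay evenly loaded.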

The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.

.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.

During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be repaired immediately are scheduled for phase 4.

In the original design of ``xfs_scrub``, it was thought that repairs would be
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
communicate with the kernel could also be used as the primary object to
schedule repairs.
With recent increases in the number of optimizations possible for a given
filesystem object, it became much more memory-efficient to track all eligible
repairs for a given filesystem object with a single repair item.
Each repair item represents a single lockable object -- AGs, metadata files,
individual inodes, or a class of summary information.
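The consolidation can be pictured as one small record per lockable object that carries the set of repairs still pending, rather than one kernel-call object per individual repair.  The class below is a hypothetical model for illustration, not the real C structure:

```python
from dataclasses import dataclass, field

@dataclass
class RepairItem:
    """One repair item per lockable object (an AG, a metadata file, an
    inode, or a class of summary info), tracking every scrub type that
    still needs repair for that object."""
    object_id: tuple                       # e.g. ("ag", 3) or ("inode", 132)
    pending: set = field(default_factory=set)

    def note_needed(self, scrub_type):
        self.pending.add(scrub_type)

    def note_fixed(self, scrub_type):
        self.pending.discard(scrub_type)

    @property
    def done(self):
        # The item can be dropped once nothing remains to repair.
        return not self.pending
```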

Phase 4 is responsible for scheduling a lot of repair work as quickly as is
practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
means that ``xfs_scrub`` must try to complete the repair work scheduled by
phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:

1. Start a round of repair with a workqueue and enough workers to keep the CPUs
   as busy as the user desires.

   a. For each repair item queued by phase 2,

      i. Ask the kernel to repair everything listed in the repair item for a
         given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the number
          of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.

   c. For each repair item queued by phase 3,

      i. Ask the kernel to repair everything listed in the repair item for a
         given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the number
          of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.

2. If step 1 made any repair progress of any kind, jump back to step 1 to start
   another round of repair.

3. If there are items left to repair, run them all serially one more time.
   Complain if the repairs were not successful, since this is the last chance
   to repair anything.
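The steps above amount to a fixed-point retry loop.  The sketch below models it with invented names: each item is a ``{"pending": set_of_scrub_types}`` dict, and ``try_repair(item)`` stands in for the kernel repair calls, pruning the repairs that succeeded and returning how many it fixed.  Revalidation (step iii) is folded into the "still pending" test for brevity:

```python
def run_phase4(phase2_items, phase3_items, try_repair):
    def one_pass(items):
        # Steps i-iii: attempt each item once; requeue anything unfinished.
        progress = False
        remaining = []
        for item in items:
            if try_repair(item) > 0:
                progress = True
            if item["pending"]:
                remaining.append(item)
        return progress, remaining

    def drain(items):
        # Steps 1b / 1d: keep retrying this batch while it improves.
        made_any = False
        while True:
            progress, items = one_pass(items)
            made_any |= progress
            if not progress:
                return made_any, items

    while True:
        # Phase 2 items first, then phase 3, per the data dependencies.
        p2_progress, phase2_items = drain(phase2_items)
        p3_progress, phase3_items = drain(phase3_items)
        if not (p2_progress or p3_progress):
            break               # step 2: no progress at all, stop rounds

    # Step 3: one final serial pass; whatever survives must be reported.
    _, leftovers = one_pass(phase2_items + phase3_items)
    return leftovers
```

The loop terminates because each round either fixes at least one repair (strictly shrinking the total pending work) or makes no progress and exits.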

Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.

The proposed patchsets are the
`repair warning improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
refactoring of the
`repair data dependency
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
and
`object tracking
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
and the
`repair scheduling
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
improvement series.

Checking Names for Confusable Unicode Sequences
-----------------------------------------------

If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
phase 4, it moves on to phase 5, which checks for suspicious-looking names in
the filesystem.
These names consist of the filesystem label, names in directory entries, and
the names of extended attributes.
Like most Unix filesystems, XFS imposes the sparest of constraints on the
contents of a name:

- Slashes and null bytes are not allowed in directory entries.

- Null bytes are not allowed in userspace-visible extended attributes.

- Null bytes are not allowed in the filesystem label.

Directory entries and attribute keys store the length of the name explicitly
ondisk, which means that nulls are not name terminators.
For this section, the term "naming domain" refers to any place where names are
presented together -- all the names in a directory, or all the attributes of a
file.
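The constraints above can be expressed as a few byte-level predicates.  This is an illustrative sketch, not ``xfs_scrub`` source; the length bounds are assumptions chosen for the example (a one-byte ondisk name length suggests 255, and the XFS label field is short):

```python
def valid_dirent_name(name: bytes) -> bool:
    # Directory entries: no slashes, no null bytes.  The length is stored
    # explicitly ondisk, so an embedded null would otherwise be legal.
    return 0 < len(name) <= 255 and b"/" not in name and b"\0" not in name

def valid_xattr_name(name: bytes) -> bool:
    # Userspace-visible extended attributes: no null bytes.
    return 0 < len(name) <= 255 and b"\0" not in name

def valid_label(label: bytes) -> bool:
    # Filesystem label: no null bytes; 12 bytes assumed here for the sketch.
    return len(label) <= 12 and b"\0" not in label
```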

Although the Unix naming constraints are very permissive, the reality of most
modern-day Linux systems is that programs work with Unicode character code
points to support international languages.
These programs typically encode those code points in UTF-8 when interfacing
with the C library because the kernel expects null-terminated names.
In the common case, therefore, names found in an XFS filesystem are actually
UTF-8 encoded Unicode data.

To maximize its expressiveness, the Unicode standard defines separate code
points for various characters that render similarly or identically in writing
systems around the world.
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
identically to "Latin Small Letter A" U+0061 "a".

The standard also permits characters to be constructed in multiple ways --
either by using a defined code point, or by combining one code point with
various combining marks.
For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
U+030A "◌̊".
Both sequences render identically.
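Normalization makes such alternate spellings comparable.  Python's standard library can demonstrate the Angstrom example directly; under NFD, the precomposed sign decomposes into the base letter plus the combining ring, so both spellings normalize to the same sequence:

```python
import unicodedata

precomposed = "\u212b"   # "Å", Angstrom Sign
combining = "A\u030a"    # "A" followed by Combining Ring Above

# NFD fully decomposes U+212B (via U+00C5) into U+0041 U+030A.
assert unicodedata.normalize("NFD", precomposed) == "A\u030a"
assert unicodedata.normalize("NFD", combining) == "A\u030a"
```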

Like the standards that preceded it, Unicode also defines various control
characters to alter the presentation of text.
For example, the character "Right-to-Left Override" U+202E can trick some
programs into rendering "moo\\ xe2\\ x80\\ xaegnp.txt" as "mootxt.png".
A second category of rendering problems involves whitespace characters.
If the character "Zero Width Space" U+200B is encountered in a file name, the
name will render identically to a name that does not have the zero width
space.

If two names within a naming domain have different byte sequences but render
identically, a user may be confused by it.
The kernel, in its indifference to upper level encoding schemes, permits this.
Most filesystem drivers persist the byte sequence names that are given to them
by the VFS.

Techniques for detecting confusable names are explained in great detail in
sections 4 and 5 of the
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
document.
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
Unicode normalization form NFD in conjunction with the confusable name
detection component of
`libicu <https://github.com/unicode-org/icu>`_
to identify names within a directory or within a file's extended attributes
that could be confused for each other.
Names are also checked for control characters, non-rendering characters, and
mixing of bidirectional characters.
All of these potential issues are reported to the system administrator during
phase 5.
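The shape of the check can be sketched with a greatly simplified "skeleton" function: NFD-normalize each name, drop a few zero-width code points, and flag any two names in the same naming domain whose skeletons collide.  The real detector is libicu's confusables engine, which also maps visually similar letters across scripts; everything below is an invented stand-in for illustration:

```python
import unicodedata

# A tiny subset of the zero-width/ignorable characters that render as
# nothing (the Zero Width Space case described above).
IGNORABLE = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def skeleton(name: str) -> str:
    nfd = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in nfd if ch not in IGNORABLE)

def find_confusables(domain_names):
    """Report pairs of distinct names in one naming domain (one directory,
    or one file's xattrs) that reduce to the same skeleton."""
    seen = {}
    clashes = []
    for name in domain_names:
        key = skeleton(name)
        if key in seen and seen[key] != name:
            clashes.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return clashes
```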

Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs as phase 6, after validation of all filesystem metadata except
for the summary counters.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.
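The gap-merging logic can be sketched as a coalescing pass over the sorted extent list.  The thresholds come from the text above; ``coalesce`` and its input format are invented for this example (the real program obtains its extents from ``FS_IOC_GETFSMAP``):

```python
SMALL_GAP = 64 * 1024            # gaps below this are read anyway
REGION_MAX = 32 * 1024 * 1024    # flush a verification request at this size

def coalesce(extents):
    """extents: sorted (start, length) byte ranges of file data.
    Yields (start, length) media-verification read requests."""
    req_start = req_end = None
    for start, length in extents:
        if req_start is None:
            req_start, req_end = start, start + length
        elif (start - req_end < SMALL_GAP
              and req_end - req_start < REGION_MAX):
            # Small gap: absorb it into the current request.
            req_end = start + length
        else:
            yield (req_start, req_end - req_start)
            req_start, req_end = start, start + length
    if req_start is not None:
        yield (req_start, req_end - req_start)
```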

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow the failure down to the specific region of the media, and records
the bad blocks.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.