
Commit a924853

Author: Darrick J. Wong
xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
Back in the old days, the "ascii-ci" feature was created to implement case-insensitive directory entry lookups for latin1-encoded names and remove the large overhead of Samba's case-insensitive lookup code. UTF8 names were not allowed, but nobody explicitly wrote in the documentation that this was only expected to work if the system used latin1 names. The kernel tolower function was selected to prepare names for hashed lookups.

There's a major discrepancy in the function that computes directory entry hashes for filesystems that have ASCII case-insensitive lookups enabled. The root of this is that the kernel and glibc's tolower implementations have differing behavior for extended ASCII accented characters. I wrote a program to spit out characters for which the tolower() return value is different from the input:

glibc tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z

kernel tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z
192:À 193:Á 194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í 206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù 218:Ú 219:Û 220:Ü 221:Ý 222:Þ

Which means that the kernel and userspace do not agree on the hash value for a directory filename that contains those higher values. The hash values are written into the leaf index block of directories that are larger than two blocks in size, which means that xfs_repair will flag these directories as having corrupted hash indexes and rewrite the index with hash values that the kernel now will not recognize.

Because the ascii-ci feature is not frequently enabled and the kernel touches filesystems far more frequently than xfs_repair does, fix this by encoding the kernel's toupper predicate and tolower functions into libxfs. Give the new functions less provocative names to make it really obvious that this is a pre-hash name preparation function, and nothing else. This change makes userspace's behavior consistent with the kernel.

Found by auditing obfuscate_name in xfs_metadump as part of working on parent pointers, wondering how it could possibly work correctly with ci filesystems, writing a test tool to create a directory with hash-colliding names, and watching xfs_repair flag it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
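The program mentioned in the message is not part of this commit. As a rough illustration only, a userspace check along the following lines could list the bytes that a folding function changes. The two `xfs_ascii_ci_*` helpers are copied from the patch; everything prefixed `sketch_` is a hypothetical name invented for this example, and the libc comparison assumes the process stays in the default "C" locale.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Copied from the patch: bytes that get folded before hashing. */
static inline bool xfs_ascii_ci_need_xfrm(unsigned char c)
{
	if (c >= 0x41 && c <= 0x5a)	/* A-Z */
		return true;
	if (c >= 0xc0 && c <= 0xd6)	/* latin A-O with accents */
		return true;
	if (c >= 0xd8 && c <= 0xde)	/* latin O-Y with accents */
		return true;
	return false;
}

/* Copied from the patch: fold one byte by adding 0x20 where needed. */
static inline unsigned char xfs_ascii_ci_xfrm(unsigned char c)
{
	if (xfs_ascii_ci_need_xfrm(c))
		c -= 'A' - 'a';
	return c;
}

/* Hypothetical adapter so the patch helper matches tolower()'s signature. */
static int sketch_xfrm_as_int(int c)
{
	return xfs_ascii_ci_xfrm((unsigned char)c);
}

/*
 * Hypothetical helper: print every byte value that fold() changes and
 * return how many such bytes there are.
 */
static int sketch_report_changed(FILE *out, int (*fold)(int))
{
	int c, changed = 0;

	for (c = 0; c < 256; c++) {
		if (fold(c) != c) {
			fprintf(out, "0x%02x ", c);
			changed++;
		}
	}
	fputc('\n', out);
	return changed;
}
```

Calling `sketch_report_changed(stdout, tolower)` and `sketch_report_changed(stdout, sketch_xfrm_as_int)` from a `main()` shows the gap: in the "C" locale the libc function changes only A-Z, while the patch's transformation also changes the 0xc0-0xde range except 0xd7.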
1 parent 4f5e304 commit a924853

2 files changed

Lines changed: 34 additions & 2 deletions


fs/xfs/libxfs/xfs_dir2.c

Lines changed: 3 additions & 2 deletions
@@ -64,7 +64,7 @@ xfs_ascii_ci_hashname(
 	int i;
 
 	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+		hash = xfs_ascii_ci_xfrm(name->name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
@@ -85,7 +85,8 @@ xfs_ascii_ci_compname(
 	for (i = 0; i < len; i++) {
 		if (args->name[i] == name[i])
 			continue;
-		if (tolower(args->name[i]) != tolower(name[i]))
+		if (xfs_ascii_ci_xfrm(args->name[i]) !=
+		    xfs_ascii_ci_xfrm(name[i]))
 			return XFS_CMP_DIFFERENT;
 		result = XFS_CMP_CASE;
 	}

fs/xfs/libxfs/xfs_dir2.h

Lines changed: 31 additions & 0 deletions
@@ -248,4 +248,35 @@ unsigned int xfs_dir3_data_end_offset(struct xfs_da_geometry *geo,
 		struct xfs_dir2_data_hdr *hdr);
 bool xfs_dir2_namecheck(const void *name, size_t length);
 
+/*
+ * The "ascii-ci" feature was created to speed up case-insensitive lookups for
+ * a Samba product. Because of the inherent problems with CI and UTF-8
+ * encoding, etc, it was decided that Samba would be configured to export
+ * latin1/iso 8859-1 encodings as that covered >90% of the target markets for
+ * the product. Hence the "ascii-ci" casefolding code could be encoded into
+ * the XFS directory operations and remove all the overhead of casefolding from
+ * Samba.
+ *
+ * To provide consistent hashing behavior between the userspace and kernel,
+ * these functions prepare names for hashing by transforming specific bytes
+ * to other bytes. Robustness with other encodings is not guaranteed.
+ */
+static inline bool xfs_ascii_ci_need_xfrm(unsigned char c)
+{
+	if (c >= 0x41 && c <= 0x5a)	/* A-Z */
+		return true;
+	if (c >= 0xc0 && c <= 0xd6)	/* latin A-O with accents */
+		return true;
+	if (c >= 0xd8 && c <= 0xde)	/* latin O-Y with accents */
+		return true;
+	return false;
+}
+
+static inline unsigned char xfs_ascii_ci_xfrm(unsigned char c)
+{
+	if (xfs_ascii_ci_need_xfrm(c))
+		c -= 'A' - 'a';
+	return c;
+}
+
 #endif /* __XFS_DIR2_H__ */
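To see how the transformation feeds the hash loop changed in xfs_dir2.c, here is a minimal userspace sketch. The two `xfs_ascii_ci_*` helpers are copied from the patch. `sketch_rol32` is an assumed stand-in for the kernel's `rol32()`, and `sketch_ascii_ci_hash` is a hypothetical function that takes a plain byte buffer instead of the kernel's name structure; neither name exists in the kernel or in xfsprogs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the patch: bytes that get folded before hashing. */
static inline bool xfs_ascii_ci_need_xfrm(unsigned char c)
{
	if (c >= 0x41 && c <= 0x5a)	/* A-Z */
		return true;
	if (c >= 0xc0 && c <= 0xd6)	/* latin A-O with accents */
		return true;
	if (c >= 0xd8 && c <= 0xde)	/* latin O-Y with accents */
		return true;
	return false;
}

/* Copied from the patch: fold one byte by adding 0x20 where needed. */
static inline unsigned char xfs_ascii_ci_xfrm(unsigned char c)
{
	if (xfs_ascii_ci_need_xfrm(c))
		c -= 'A' - 'a';
	return c;
}

/* Assumed stand-in for rol32(): rotate left, valid for 0 < shift < 32. */
static inline uint32_t sketch_rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/*
 * Hypothetical userspace version of the loop in xfs_ascii_ci_hashname():
 * fold each byte, then mix it into the rotated running hash.
 */
static uint32_t sketch_ascii_ci_hash(const unsigned char *name, size_t len)
{
	uint32_t hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = xfs_ascii_ci_xfrm(name[i]) ^ sketch_rol32(hash, 7);
	return hash;
}
```

Because every byte is folded before it is mixed in, two latin1 names that differ only in case, such as the byte strings `CAF\xc9` and `caf\xe9`, produce the same value from this sketch regardless of which tolower() the C library provides.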
