Commit 3893854

Merge tag 'erofs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:

 "In this cycle, inode page cache sharing among filesystems on the same
  machine is now supported, which is particularly useful for
  high-density hosts running tens of thousands of containers.

  In addition, we fully isolate the EROFS core on-disk format from other
  optional encoded layouts since the core on-disk part is designed to be
  simple, effective, and secure. Users can use the core format to build
  unique golden immutable images and import their filesystem trees
  directly from raw block devices via DMA, page-mapped DAX devices,
  and/or file-backed mounts without having to worry about unnecessary
  intrinsic consistency issues found in other generic filesystems by
  design. However, the full vision is still working in progress and will
  spend more time to achieve final goals.

  There are other improvements and bug fixes as usual, as listed below:

   - Support inode page cache sharing among filesystems

   - Formally separate optional encoded (aka compressed) inode layouts
     (and the implementations) from the EROFS core on-disk aligned plain
     format for future zero-trust security usage

   - Improve performance by caching the fact that an inode does not have
     a POSIX ACL

   - Improve LZ4 decompression error reporting

   - Enable LZMA by default and promote DEFLATE and Zstandard algorithms
     out of EXPERIMENTAL status

   - Switch to inode_set_cached_link() to cache symlink lengths

   - random bugfixes and minor cleanups"

* tag 'erofs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (31 commits)
  erofs: fix UAF issue for file-backed mounts w/ directio option
  erofs: update compression algorithm status
  erofs: fix inline data read failure for ztailpacking pclusters
  erofs: avoid some unnecessary #ifdefs
  erofs: handle end of filesystem properly for file-backed mounts
  erofs: separate plain and compressed filesystems formally
  erofs: use inode_set_cached_link()
  erofs: mark inodes without acls in erofs_read_inode()
  erofs: implement .fadvise for page cache share
  erofs: support compressed inodes for page cache share
  erofs: support unencoded inodes for page cache share
  erofs: pass inode to trace_erofs_read_folio
  erofs: introduce the page cache share feature
  erofs: using domain_id in the safer way
  erofs: add erofs_inode_set_aops helper to set the aops
  erofs: support user-defined fingerprint name
  erofs: decouple `struct erofs_anon_fs_type`
  fs: Export alloc_empty_backing_file
  erofs: tidy up erofs_init_inode_xattrs()
  erofs: add missing documentation about `directio` mount option
  ...
2 parents 4fb7d86 + 1caf50c commit 3893854

21 files changed

Lines changed: 804 additions & 413 deletions

Documentation/ABI/testing/sysfs-fs-erofs

Lines changed: 12 additions & 8 deletions
@@ -3,19 +3,23 @@ Date: November 2021
 Contact:        "Huang Jianan" <huangjianan@oppo.com>
 Description:    Shows all enabled kernel features.
                 Supported features:
-                zero_padding, compr_cfgs, big_pcluster, chunked_file,
-                device_table, compr_head2, sb_chksum, ztailpacking,
-                dedupe, fragments, 48bit, metabox.
+                compr_cfgs, big_pcluster, chunked_file, device_table,
+                compr_head2, sb_chksum, ztailpacking, dedupe, fragments,
+                48bit, metabox.
 
 What:           /sys/fs/erofs/<disk>/sync_decompress
 Date:           November 2021
 Contact:        "Huang Jianan" <huangjianan@oppo.com>
-Description:    Control strategy of sync decompression:
+Description:    Control strategy of synchronous decompression. Synchronous
+                decompression tries to decompress in the reader thread for
+                synchronous reads and small asynchronous reads (<= 12 KiB):
 
-                - 0 (default, auto): enable for readpage, and enable for
-                  readahead on atomic contexts only.
-                - 1 (force on): enable for readpage and readahead.
-                - 2 (force off): disable for all situations.
+                - 0 (auto, default): apply to synchronous reads only, but will
+                                     switch to 1 (force on) if any decompression
+                                     request is detected in atomic contexts;
+                - 1 (force on): apply to synchronous reads and small
+                                asynchronous reads;
+                - 2 (force off): disable synchronous decompression completely.
 
 What:           /sys/fs/erofs/<disk>/drop_caches
 Date:           November 2024
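
Annotation (illustrative only, not part of the commit): the three sync_decompress modes described in the new help text can be read as a small decision function. The sketch below is a user-space model of that wording, not the kernel implementation. All identifiers (`sync_decompress_mode`, `use_sync_decompress`, `note_atomic_request`, `SMALL_ASYNC_READ_LIMIT`) are hypothetical names chosen for this example; only the mode values 0/1/2 and the 12 KiB threshold come from the documentation text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the numeric values mirror the sysfs text above. */
enum sync_decompress_mode {
	SYNC_DECOMPRESS_AUTO = 0,	/* default */
	SYNC_DECOMPRESS_FORCE_ON = 1,
	SYNC_DECOMPRESS_FORCE_OFF = 2,
};

/* "small asynchronous reads (<= 12 KiB)" as stated in the description */
#define SMALL_ASYNC_READ_LIMIT	((size_t)12 * 1024)

/*
 * Returns true if, according to the documented policy, a read of
 * `bytes` bytes would be decompressed in the reader thread.
 */
static inline bool use_sync_decompress(enum sync_decompress_mode mode,
				       bool sync_read, size_t bytes)
{
	assert(mode <= SYNC_DECOMPRESS_FORCE_OFF);

	switch (mode) {
	case SYNC_DECOMPRESS_AUTO:
		/* synchronous reads only */
		return sync_read;
	case SYNC_DECOMPRESS_FORCE_ON:
		/* synchronous reads plus small asynchronous reads */
		return sync_read || bytes <= SMALL_ASYNC_READ_LIMIT;
	case SYNC_DECOMPRESS_FORCE_OFF:
	default:
		return false;
	}
}

/*
 * Models "will switch to 1 (force on) if any decompression request is
 * detected in atomic contexts": only the auto mode is upgraded.
 */
static inline enum sync_decompress_mode
note_atomic_request(enum sync_decompress_mode mode)
{
	return mode == SYNC_DECOMPRESS_AUTO ? SYNC_DECOMPRESS_FORCE_ON : mode;
}
```

Under this model, an 8 KiB asynchronous read is handled synchronously only once the mode is 1, either because it was set explicitly or because auto mode was upgraded after an atomic-context request.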

Documentation/filesystems/erofs.rst

Lines changed: 13 additions & 5 deletions
@@ -63,9 +63,9 @@ Here are the main features of EROFS:
 - Support POSIX.1e ACLs by using extended attributes;
 
 - Support transparent data compression as an option:
-  LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; In
-  addition, inplace decompression is also supported to avoid bounce compressed
-  buffers and unnecessary page cache thrashing.
+  LZ4, MicroLZMA, DEFLATE and Zstandard algorithms can be used on a per-file
+  basis; In addition, inplace decompression is also supported to avoid bounce
+  compressed buffers and unnecessary page cache thrashing.
 
 - Support chunk-based data deduplication and rolling-hash compressed data
   deduplication;
@@ -125,10 +125,18 @@ dax={always,never} Use direct access (no page cache). See
                      Documentation/filesystems/dax.rst.
 dax                  A legacy option which is an alias for ``dax=always``.
 device=%s            Specify a path to an extra device to be used together.
+directio             (For file-backed mounts) Use direct I/O to access backing
+                     files, and asynchronous I/O will be enabled if supported.
 fsid=%s              Specify a filesystem image ID for Fscache back-end.
-domain_id=%s         Specify a domain ID in fscache mode so that different images
-                     with the same blobs under a given domain ID can share storage.
+domain_id=%s         Specify a trusted domain ID for fscache mode so that
+                     different images with the same blobs, identified by blob IDs,
+                     can share storage within the same trusted domain.
+                     Also used for different filesystems with inode page sharing
+                     enabled to share page cache within the trusted domain.
 fsoffset=%llu        Specify block-aligned filesystem offset for the primary device.
+inode_share          Enable inode page sharing for this filesystem. Inodes with
+                     identical content within the same domain ID can share the
+                     page cache.
 ===================  =========================================================
 
 Sysfs Entries
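
Annotation (illustrative only, not part of the commit): the new `inode_share` option and the extended meaning of `domain_id=%s` are ordinary mount options, so they would be passed as the data string of mount(2). The sketch below assumes that both options are given together and that the running kernel was built with CONFIG_EROFS_FS_PAGE_CACHE_SHARE; neither assumption is confirmed by the documentation hunk itself. The helper names `build_erofs_share_opts` and `mount_erofs_shared` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build "domain_id=<id>,inode_share" into buf.
 * Returns 0 on success, -1 if the ID is empty, contains a comma (which
 * would split the option string), or the result does not fit.
 */
static inline int build_erofs_share_opts(char *buf, size_t len,
					 const char *domain_id)
{
	int n;

	if (!domain_id || !*domain_id || strchr(domain_id, ','))
		return -1;
	n = snprintf(buf, len, "domain_id=%s,inode_share", domain_id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount an EROFS image read-only with page cache sharing requested.
 * Needs mount privileges; returns the mount(2) result (-1 with errno
 * set on failure) or -1 if the option string could not be built.
 */
static inline int mount_erofs_shared(const char *source, const char *target,
				     const char *domain_id)
{
	char opts[256];

	assert(source && target);
	if (build_erofs_share_opts(opts, sizeof(opts), domain_id) < 0)
		return -1;
	return mount(source, target, "erofs", MS_RDONLY, opts);
}
```

Two filesystems mounted this way with the same domain ID would, per the documentation above, be candidates for sharing page cache for inodes with identical content.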

fs/erofs/Kconfig

Lines changed: 12 additions & 8 deletions
@@ -112,13 +112,14 @@ config EROFS_FS_ZIP
 config EROFS_FS_ZIP_LZMA
         bool "EROFS LZMA compressed data support"
         depends on EROFS_FS_ZIP
+        default y
         help
           Saying Y here includes support for reading EROFS file systems
           containing LZMA compressed data, specifically called microLZMA. It
           gives better compression ratios than the default LZ4 format, at the
           expense of more CPU overhead.
 
-          If unsure, say N.
+          Say N if you want to disable LZMA compression support.
 
 config EROFS_FS_ZIP_DEFLATE
         bool "EROFS DEFLATE compressed data support"
@@ -129,9 +130,6 @@ config EROFS_FS_ZIP_DEFLATE
           ratios than the default LZ4 format, while it costs more CPU
           overhead.
 
-          DEFLATE support is an experimental feature for now and so most
-          file systems will be readable without selecting this option.
-
           If unsure, say N.
 
 config EROFS_FS_ZIP_ZSTD
@@ -141,10 +139,7 @@ config EROFS_FS_ZIP_ZSTD
           Saying Y here includes support for reading EROFS file systems
           containing Zstandard compressed data. It gives better compression
           ratios than the default LZ4 format, while it costs more CPU
-          overhead.
-
-          Zstandard support is an experimental feature for now and so most
-          file systems will be readable without selecting this option.
+          overhead and memory footprint.
 
           If unsure, say N.
 
@@ -194,3 +189,12 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
           at higher priority.
 
           If unsure, say N.
+
+config EROFS_FS_PAGE_CACHE_SHARE
+        bool "EROFS page cache share support (experimental)"
+        depends on EROFS_FS && EROFS_FS_XATTR && !EROFS_FS_ONDEMAND
+        help
+          This enables page cache sharing among inodes with identical
+          content fingerprints on the same machine.
+
+          If unsure, say N.

fs/erofs/Makefile

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
 erofs-$(CONFIG_EROFS_FS_ZIP_ACCEL) += decompressor_crypto.o
 erofs-$(CONFIG_EROFS_FS_BACKED_BY_FILE) += fileio.o
 erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
+erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += ishare.o

fs/erofs/data.c

Lines changed: 32 additions & 24 deletions
@@ -270,21 +270,23 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 struct erofs_iomap_iter_ctx {
         struct page *page;
         void *base;
+        struct inode *realinode;
 };
 
 static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                 unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
 {
         struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
         struct erofs_iomap_iter_ctx *ctx = iter->private;
-        struct super_block *sb = inode->i_sb;
+        struct inode *realinode = ctx ? ctx->realinode : inode;
+        struct super_block *sb = realinode->i_sb;
         struct erofs_map_blocks map;
         struct erofs_map_dev mdev;
         int ret;
 
         map.m_la = offset;
         map.m_llen = length;
-        ret = erofs_map_blocks(inode, &map);
+        ret = erofs_map_blocks(realinode, &map);
         if (ret < 0)
                 return ret;
 
@@ -297,7 +299,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                 return 0;
         }
 
-        if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(inode)) {
+        if (!(map.m_flags & EROFS_MAP_META) || !erofs_inode_in_metabox(realinode)) {
                 mdev = (struct erofs_map_dev) {
                         .m_deviceid = map.m_deviceid,
                         .m_pa = map.m_pa,
@@ -323,7 +325,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                 void *ptr;
 
                 ptr = erofs_read_metabuf(&buf, sb, map.m_pa,
-                                         erofs_inode_in_metabox(inode));
+                                         erofs_inode_in_metabox(realinode));
                 if (IS_ERR(ptr))
                         return PTR_ERR(ptr);
                 iomap->inline_data = ptr;
@@ -364,12 +366,10 @@ int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
                  u64 start, u64 len)
 {
         if (erofs_inode_is_data_compressed(EROFS_I(inode)->datalayout)) {
-#ifdef CONFIG_EROFS_FS_ZIP
+                if (!IS_ENABLED(CONFIG_EROFS_FS_ZIP))
+                        return -EOPNOTSUPP;
                 return iomap_fiemap(inode, fieinfo, start, len,
                                     &z_erofs_iomap_report_ops);
-#else
-                return -EOPNOTSUPP;
-#endif
         }
         return iomap_fiemap(inode, fieinfo, start, len, &erofs_iomap_ops);
 }
@@ -384,11 +384,15 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
                 .ops = &iomap_bio_read_ops,
                 .cur_folio = folio,
         };
-        struct erofs_iomap_iter_ctx iter_ctx = {};
-
-        trace_erofs_read_folio(folio, true);
+        bool need_iput;
+        struct erofs_iomap_iter_ctx iter_ctx = {
+                .realinode = erofs_real_inode(folio_inode(folio), &need_iput),
+        };
 
+        trace_erofs_read_folio(iter_ctx.realinode, folio, true);
         iomap_read_folio(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+        if (need_iput)
+                iput(iter_ctx.realinode);
         return 0;
 }
 
@@ -398,12 +402,16 @@ static void erofs_readahead(struct readahead_control *rac)
                 .ops = &iomap_bio_read_ops,
                 .rac = rac,
         };
-        struct erofs_iomap_iter_ctx iter_ctx = {};
-
-        trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
-                              readahead_count(rac), true);
+        bool need_iput;
+        struct erofs_iomap_iter_ctx iter_ctx = {
+                .realinode = erofs_real_inode(rac->mapping->host, &need_iput),
+        };
 
+        trace_erofs_readahead(iter_ctx.realinode, readahead_index(rac),
+                              readahead_count(rac), true);
         iomap_readahead(&erofs_iomap_ops, &read_ctx, &iter_ctx);
+        if (need_iput)
+                iput(iter_ctx.realinode);
 }
 
 static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
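
Annotation (illustrative only, not part of the commit): the read paths above resolve a "real" inode, report through `need_iput` whether a reference was taken, and drop that reference after the I/O has been issued. The body of `erofs_real_inode()` is not shown in this excerpt, so the following user-space model is only a guess at the calling convention. Every name in it (`demo_inode`, `demo_real_inode`, `demo_put`, `demo_read`) and the `backing` pointer are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy inode: a reference count plus an optional backing inode. */
struct demo_inode {
	int refcount;
	struct demo_inode *backing;	/* non-NULL only for a shared inode */
};

/*
 * Resolve the inode whose layout should be used for mapping blocks.
 * *need_put tells the caller whether an extra reference was taken and
 * must be released afterwards.
 */
static inline struct demo_inode *demo_real_inode(struct demo_inode *inode,
						 bool *need_put)
{
	*need_put = false;
	if (!inode->backing)
		return inode;	/* ordinary inode: no extra reference */
	inode->backing->refcount++;
	*need_put = true;
	return inode->backing;
}

static inline void demo_put(struct demo_inode *inode)
{
	assert(inode->refcount > 0);
	inode->refcount--;
}

/*
 * Shape of the read path in the hunks above: resolve, do the I/O
 * against the real inode, then release only if a reference was taken.
 * Returns the real inode's reference count observed during the "I/O".
 */
static inline int demo_read(struct demo_inode *inode)
{
	bool need_put;
	struct demo_inode *real = demo_real_inode(inode, &need_put);
	int refs_during_io = real->refcount;

	if (need_put)
		demo_put(real);
	return refs_during_io;
}
```

The out-parameter keeps the common case (an inode that is its own real inode) free of reference count traffic, while the shared case stays balanced.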
@@ -419,12 +427,13 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
         if (!iov_iter_count(to))
                 return 0;
 
-#ifdef CONFIG_FS_DAX
-        if (IS_DAX(inode))
+        if (IS_ENABLED(CONFIG_FS_DAX) && IS_DAX(inode))
                 return dax_iomap_rw(iocb, to, &erofs_iomap_ops);
-#endif
+
         if ((iocb->ki_flags & IOCB_DIRECT) && inode->i_sb->s_bdev) {
-                struct erofs_iomap_iter_ctx iter_ctx = {};
+                struct erofs_iomap_iter_ctx iter_ctx = {
+                        .realinode = inode,
+                };
 
                 return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
                                     NULL, 0, &iter_ctx, 0);
@@ -480,12 +489,11 @@ static loff_t erofs_file_llseek(struct file *file, loff_t offset, int whence)
         struct inode *inode = file->f_mapping->host;
         const struct iomap_ops *ops = &erofs_iomap_ops;
 
-        if (erofs_inode_is_data_compressed(EROFS_I(inode)->datalayout))
-#ifdef CONFIG_EROFS_FS_ZIP
+        if (erofs_inode_is_data_compressed(EROFS_I(inode)->datalayout)) {
+                if (!IS_ENABLED(CONFIG_EROFS_FS_ZIP))
+                        return generic_file_llseek(file, offset, whence);
                 ops = &z_erofs_iomap_report_ops;
-#else
-                return generic_file_llseek(file, offset, whence);
-#endif
+        }
 
         if (whence == SEEK_HOLE)
                 offset = iomap_seek_hole(inode, offset, ops);
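
Annotation (illustrative only, not part of the commit): several hunks in fs/erofs/data.c replace `#ifdef` blocks with a run-time-looking `IS_ENABLED()` test. Because the condition is a compile-time constant, the compiler still parses and type-checks both branches and can then discard the dead one. The sketch below reproduces the shape of the `erofs_fiemap()` change in user space. `DEMO_CONFIG_ZIP`, `DEMO_IS_ENABLED`, and the `demo_*` functions are hypothetical stand-ins; the kernel's real `IS_ENABLED()` additionally handles undefined and modular (`=m`) symbols, which this simplified macro does not.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a Kconfig symbol; build with -DDEMO_CONFIG_ZIP=1 to enable. */
#ifndef DEMO_CONFIG_ZIP
#define DEMO_CONFIG_ZIP 0
#endif

/* Simplified: requires the option to be defined as 0 or 1. */
#define DEMO_IS_ENABLED(option)	(option)

/* Placeholders for the compressed and plain reporting paths. */
static inline int demo_report_compressed(void) { return 1; }
static inline int demo_report_plain(void) { return 0; }

/*
 * Same control flow as the converted erofs_fiemap(): the unsupported
 * case becomes an early return instead of an #else branch.
 */
static inline int demo_fiemap(bool compressed)
{
	if (compressed) {
		if (!DEMO_IS_ENABLED(DEMO_CONFIG_ZIP))
			return -EOPNOTSUPP;
		return demo_report_compressed();
	}
	return demo_report_plain();
}

/* Usage: the plain path never depends on the option. */
static inline void demo_fiemap_selfcheck(void)
{
	assert(demo_fiemap(false) == 0);
}
```

One consequence of this style is that everything referenced in the guarded branch must still be declared when the option is off, since the branch is compiled before it is eliminated.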
