Commit c84bb79

Merge tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs nullfs update from Christian Brauner:

"Add a completely catatonic minimal pseudo filesystem called "nullfs" and make pivot_root() work in the initramfs.

Currently pivot_root() does not work on the real rootfs because it cannot be unmounted. Userspace has to recursively delete initramfs contents manually before continuing boot, using the fragile switch_root sequence (overmount + chroot).

Add nullfs, a minimal immutable filesystem that serves as the true root of the mount hierarchy. The mutable rootfs (tmpfs/ramfs) is mounted on top of it. This allows userspace to simply:

    chdir(new_root);
    pivot_root(".", ".");
    umount2(".", MNT_DETACH);

without the traditional switch_root workarounds.

systemd already handles this correctly. It tries pivot_root() first and falls back to MS_MOVE only when that fails.

This also means rootfs mounts in unprivileged namespaces no longer need MNT_LOCKED, since the immutable nullfs guarantees nothing can be revealed by unmounting the covering mount.

nullfs is a single-instance filesystem (get_tree_single()) marked SB_NOUSER | SB_I_NOEXEC | SB_I_NODEV with an immutable empty root directory. This means sooner or later it can be used to overmount other directories to hide their contents without any additional protection needed.

We enable it unconditionally. If we see any real regression we'll hide it behind a boot option.

nullfs has extensions beyond this in the future. It will serve as a concept to support the creation of completely empty mount namespaces - which is work coming up in the next cycle"

* tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: use nullfs unconditionally as the real rootfs
  docs: mention nullfs
  fs: add immutable rootfs
  fs: add init_pivot_root()
  fs: ensure that internal tmpfs mount gets mount id zero
2 parents 7e01a69 + 313c47f commit c84bb79
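
The merge message describes a three-call sequence and notes that systemd tries pivot_root() first and falls back to MS_MOVE. A minimal userspace sketch of that try-then-fall-back pattern might look like the following. The helper names `do_pivot_root()` and `switch_to_new_root()` are hypothetical and are not taken from this commit or from systemd. The sketch assumes the caller has CAP_SYS_ADMIN and that `new_root` is already a mount point. It goes through `syscall(2)` because glibc provides no `pivot_root()` wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* pivot_root(2) has no glibc wrapper, so invoke it through syscall(2). */
static int do_pivot_root(const char *new_root, const char *put_old)
{
	return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/*
 * Hypothetical helper: make new_root the root of the calling process.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int switch_to_new_root(const char *new_root)
{
	/* Enter the new root so that "." names it in the calls below. */
	if (chdir(new_root) < 0)
		return -1;

	/* Preferred path: the sequence quoted in the merge message. */
	if (do_pivot_root(".", ".") == 0)
		return umount2(".", MNT_DETACH);

	/* Fallback: move the new root over "/" and chroot into it. */
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return -1;
	if (chroot(".") < 0)
		return -1;
	return chdir("/");
}
```

An init program would call `switch_to_new_root("/sysroot")` (the path is only an example) and then exec the real init. Unlike the pivot path, the fallback leaves the old initramfs contents in memory unless they were deleted beforehand.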

10 files changed

Lines changed: 216 additions & 74 deletions

Documentation/filesystems/ramfs-rootfs-initramfs.rst

Lines changed: 12 additions & 14 deletions
@@ -76,10 +76,10 @@ What is rootfs?
 ---------------

 Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
-always present in 2.6 systems. You can't unmount rootfs for approximately the
-same reason you can't kill the init process; rather than having special code
-to check for and handle an empty list, it's smaller and simpler for the kernel
-to just make sure certain lists can't become empty.
+always present in Linux systems. The kernel uses an immutable empty filesystem
+called nullfs as the true root of the VFS hierarchy, with the mutable rootfs
+(tmpfs/ramfs) mounted on top of it. This allows pivot_root() and unmounting
+of the initramfs to work normally.

 Most systems just mount another filesystem over rootfs and ignore it. The
 amount of space an empty instance of ramfs takes up is tiny.
@@ -121,16 +121,14 @@ All this differs from the old initrd in several ways:
     program. See the switch_root utility, below.)

   - When switching another root device, initrd would pivot_root and then
-    umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
-    rootfs, nor unmount it. Instead delete everything out of rootfs to
-    free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
-    with the new root (cd /newmount; mount --move . /; chroot .), attach
-    stdin/stdout/stderr to the new /dev/console, and exec the new init.
-
-    Since this is a remarkably persnickety process (and involves deleting
-    commands before you can run them), the klibc package introduced a helper
-    program (utils/run_init.c) to do all this for you. Most other packages
-    (such as busybox) have named this command "switch_root".
+    umount the ramdisk. With nullfs as the true root, pivot_root() works
+    normally from the initramfs. Userspace can simply do::
+
+        chdir(new_root);
+        pivot_root(".", ".");
+        umount2(".", MNT_DETACH);
+
+    This is the preferred method for switching root filesystems.

 Populating initramfs:
 ---------------------

fs/Makefile

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ obj-y := open.o read_write.o file_table.o super.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_dirent.o fs_context.o fs_parser.o fsopen.o init.o \
 		kernel_read_file.o mnt_idmapping.o remap_range.o pidfs.o \
-		file_attr.o fserror.o
+		file_attr.o fserror.o nullfs.o

 obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
 obj-$(CONFIG_PROC_FS) += proc_namespace.o

fs/init.c

Lines changed: 17 additions & 0 deletions
@@ -13,6 +13,23 @@
 #include <linux/security.h>
 #include "internal.h"

+int __init init_pivot_root(const char *new_root, const char *put_old)
+{
+	struct path new_path __free(path_put) = {};
+	struct path old_path __free(path_put) = {};
+	int ret;
+
+	ret = kern_path(new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new_path);
+	if (ret)
+		return ret;
+
+	ret = kern_path(put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old_path);
+	if (ret)
+		return ret;
+
+	return path_pivot_root(&new_path, &old_path);
+}
+
 int __init init_mount(const char *dev_name, const char *dir_name,
 		const char *type_page, unsigned long flags, void *data_page)
 {
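
init_pivot_root() returns early on each lookup failure without any explicit path_put() calls, because the `__free(path_put)` annotation releases both paths when they go out of scope. The kernel's `__free()` helper is built on the compiler's `cleanup` variable attribute. The following userspace sketch illustrates the same control flow with that attribute directly. Everything prefixed `toy_` is a hypothetical stand-in and not kernel API, and the sketch requires GCC or Clang.

```c
/* Toy stand-in for struct path: only tracks whether a reference is held. */
struct toy_path {
	int held;
};

/* Counts releases so the effect of the cleanup handler can be observed. */
static int toy_put_calls;

/* Cleanup handler: runs automatically when a tagged variable leaves scope. */
static void toy_path_put(struct toy_path *p)
{
	if (p->held) {
		p->held = 0;
		toy_put_calls++;
	}
}

/* Userspace analogue of the __free(path_put) annotation. */
#define toy_free_path __attribute__((cleanup(toy_path_put)))

/* Toy lookup: fails for a NULL name, otherwise "takes a reference". */
static int toy_lookup(const char *name, struct toy_path *out)
{
	if (!name)
		return -1;
	out->held = 1;
	return 0;
}

/* Mirrors the shape of init_pivot_root(): early returns, no goto labels. */
static int toy_pivot(const char *new_root, const char *put_old)
{
	struct toy_path new_path toy_free_path = {0};
	struct toy_path old_path toy_free_path = {0};
	int ret;

	ret = toy_lookup(new_root, &new_path);
	if (ret)
		return ret;

	ret = toy_lookup(put_old, &old_path);
	if (ret)
		return ret;	/* new_path is still released on this path */

	return 0;
}
```

Zero-initialising both variables matters: the handler runs for every tagged variable on every exit path, including ones where the lookup never happened, so it must tolerate an empty value.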

fs/internal.h

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ extern bool may_mount(void);
 int path_mount(const char *dev_name, const struct path *path,
 		const char *type_page, unsigned long flags, void *data_page);
 int path_umount(const struct path *path, int flags);
+int path_pivot_root(struct path *new, struct path *old);

 int show_path(struct seq_file *m, struct dentry *root);

fs/mount.h

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 #include <linux/ns_common.h>
 #include <linux/fs_pin.h>

+extern struct file_system_type nullfs_fs_type;
 extern struct list_head notify_list;

 struct mnt_namespace {

fs/namespace.c

Lines changed: 102 additions & 57 deletions
@@ -221,7 +221,7 @@ static int mnt_alloc_id(struct mount *mnt)
 	int res;

 	xa_lock(&mnt_id_xa);
-	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, XA_LIMIT(1, INT_MAX), GFP_KERNEL);
+	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, xa_limit_31b, GFP_KERNEL);
 	if (!res)
 		mnt->mnt_id_unique = ++mnt_id_ctr;
 	xa_unlock(&mnt_id_xa);
@@ -4498,65 +4498,27 @@ bool path_is_under(const struct path *path1, const struct path *path2)
 }
 EXPORT_SYMBOL(path_is_under);

-/*
- * pivot_root Semantics:
- * Moves the root file system of the current process to the directory put_old,
- * makes new_root as the new root file system of the current process, and sets
- * root/cwd of all processes which had them on the current root to new_root.
- *
- * Restrictions:
- * The new_root and put_old must be directories, and must not be on the
- * same file system as the current process root. The put_old must be
- * underneath new_root, i.e. adding a non-zero number of /.. to the string
- * pointed to by put_old must yield the same directory as new_root. No other
- * file system may be mounted on put_old. After all, new_root is a mountpoint.
- *
- * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
- * See Documentation/filesystems/ramfs-rootfs-initramfs.rst for alternatives
- * in this situation.
- *
- * Notes:
- *  - we don't move root/cwd if they are not at the root (reason: if something
- *    cared enough to change them, it's probably wrong to force them elsewhere)
- *  - it's okay to pick a root that isn't the root of a file system, e.g.
- *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
- *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
- *    first.
- */
-SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
-		const char __user *, put_old)
+int path_pivot_root(struct path *new, struct path *old)
 {
-	struct path new __free(path_put) = {};
-	struct path old __free(path_put) = {};
 	struct path root __free(path_put) = {};
 	struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent;
 	int error;

 	if (!may_mount())
 		return -EPERM;

-	error = user_path_at(AT_FDCWD, new_root,
-			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
-	if (error)
-		return error;
-
-	error = user_path_at(AT_FDCWD, put_old,
-			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
-	if (error)
-		return error;
-
-	error = security_sb_pivotroot(&old, &new);
+	error = security_sb_pivotroot(old, new);
 	if (error)
 		return error;

 	get_fs_root(current->fs, &root);

-	LOCK_MOUNT(old_mp, &old);
+	LOCK_MOUNT(old_mp, old);
 	old_mnt = old_mp.parent;
 	if (IS_ERR(old_mnt))
 		return PTR_ERR(old_mnt);

-	new_mnt = real_mount(new.mnt);
+	new_mnt = real_mount(new->mnt);
 	root_mnt = real_mount(root.mnt);
 	ex_parent = new_mnt->mnt_parent;
 	root_parent = root_mnt->mnt_parent;
@@ -4568,23 +4530,23 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 		return -EINVAL;
 	if (new_mnt->mnt.mnt_flags & MNT_LOCKED)
 		return -EINVAL;
-	if (d_unlinked(new.dentry))
+	if (d_unlinked(new->dentry))
 		return -ENOENT;
 	if (new_mnt == root_mnt || old_mnt == root_mnt)
 		return -EBUSY; /* loop, on the same file system */
 	if (!path_mounted(&root))
 		return -EINVAL; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
 		return -EINVAL; /* absolute root */
-	if (!path_mounted(&new))
+	if (!path_mounted(new))
 		return -EINVAL; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
 		return -EINVAL; /* absolute root */
 	/* make sure we can reach put_old from new_root */
-	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, &new))
+	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, new))
 		return -EINVAL;
 	/* make certain new is below the root */
-	if (!is_path_reachable(new_mnt, new.dentry, &root))
+	if (!is_path_reachable(new_mnt, new->dentry, &root))
 		return -EINVAL;
 	lock_mount_hash();
 	umount_mnt(new_mnt);
@@ -4603,10 +4565,55 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	unlock_mount_hash();
 	mnt_notify_add(root_mnt);
 	mnt_notify_add(new_mnt);
-	chroot_fs_refs(&root, &new);
+	chroot_fs_refs(&root, new);
 	return 0;
 }

+/*
+ * pivot_root Semantics:
+ * Moves the root file system of the current process to the directory put_old,
+ * makes new_root as the new root file system of the current process, and sets
+ * root/cwd of all processes which had them on the current root to new_root.
+ *
+ * Restrictions:
+ * The new_root and put_old must be directories, and must not be on the
+ * same file system as the current process root. The put_old must be
+ * underneath new_root, i.e. adding a non-zero number of /.. to the string
+ * pointed to by put_old must yield the same directory as new_root. No other
+ * file system may be mounted on put_old. After all, new_root is a mountpoint.
+ *
+ * The immutable nullfs filesystem is mounted as the true root of the VFS
+ * hierarchy. The mutable rootfs (tmpfs/ramfs) is layered on top of this,
+ * allowing pivot_root() to work normally from initramfs.
+ *
+ * Notes:
+ *  - we don't move root/cwd if they are not at the root (reason: if something
+ *    cared enough to change them, it's probably wrong to force them elsewhere)
+ *  - it's okay to pick a root that isn't the root of a file system, e.g.
+ *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
+ *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
+ *    first.
+ */
+SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
+		const char __user *, put_old)
+{
+	struct path new __free(path_put) = {};
+	struct path old __free(path_put) = {};
+	int error;
+
+	error = user_path_at(AT_FDCWD, new_root,
+			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
+	if (error)
+		return error;
+
+	error = user_path_at(AT_FDCWD, put_old,
+			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
+	if (error)
+		return error;
+
+	return path_pivot_root(&new, &old);
+}
+
 static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt)
 {
 	unsigned int flags = mnt->mnt.mnt_flags;
@@ -5969,24 +5976,62 @@ struct mnt_namespace init_mnt_ns = {

 static void __init init_mount_tree(void)
 {
-	struct vfsmount *mnt;
-	struct mount *m;
+	struct vfsmount *mnt, *nullfs_mnt;
+	struct mount *mnt_root;
 	struct path root;

+	/*
+	 * We create two mounts:
+	 *
+	 * (1) nullfs with mount id 1
+	 * (2) mutable rootfs with mount id 2
+	 *
+	 * with (2) mounted on top of (1).
+	 */
+	nullfs_mnt = vfs_kern_mount(&nullfs_fs_type, 0, "nullfs", NULL);
+	if (IS_ERR(nullfs_mnt))
+		panic("VFS: Failed to create nullfs");
+
 	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", initramfs_options);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");

-	m = real_mount(mnt);
-	init_mnt_ns.root = m;
-	init_mnt_ns.nr_mounts = 1;
-	mnt_add_to_ns(&init_mnt_ns, m);
+	VFS_WARN_ON_ONCE(real_mount(nullfs_mnt)->mnt_id != 1);
+	VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 2);
+
+	/* The namespace root is the nullfs mnt. */
+	mnt_root = real_mount(nullfs_mnt);
+	init_mnt_ns.root = mnt_root;
+
+	/* Mount mutable rootfs on top of nullfs. */
+	root.mnt = nullfs_mnt;
+	root.dentry = nullfs_mnt->mnt_root;
+
+	LOCK_MOUNT_EXACT(mp, &root);
+	if (unlikely(IS_ERR(mp.parent)))
+		panic("VFS: Failed to mount rootfs on nullfs");
+	scoped_guard(mount_writer)
+		attach_mnt(real_mount(mnt), mp.parent, mp.mp);
+
+	pr_info("VFS: Finished mounting rootfs on nullfs\n");
+
+	/*
+	 * We've dropped all locks here but that's fine. Not just are we
+	 * the only task that's running, there's no other mount
+	 * namespace in existence and the initial mount namespace is
+	 * completely empty until we add the mounts we just created.
+	 */
+	for (struct mount *p = mnt_root; p; p = next_mnt(p, mnt_root)) {
+		mnt_add_to_ns(&init_mnt_ns, p);
+		init_mnt_ns.nr_mounts++;
+	}
+
 	init_task.nsproxy->mnt_ns = &init_mnt_ns;
 	get_mnt_ns(&init_mnt_ns);

-	root.mnt = mnt;
-	root.dentry = mnt->mnt_root;
-
+	/* The root and pwd always point to the mutable rootfs. */
+	root.mnt = mnt;
+	root.dentry = mnt->mnt_root;
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
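
One way to see why the new layout lets pivot_root() succeed from the initramfs is to look at the "absolute root" check kept in path_pivot_root(): `if (!mnt_has_parent(root_mnt)) return -EINVAL;`. The toy model below is a simplified userspace sketch, not kernel code. The `toy_` names are hypothetical, and it assumes, as in the kernel's mount tree, that a mount with no parent points to itself. It compares the old layout (rootfs is the namespace root) with the new one (rootfs mounted on nullfs).

```c
#include <errno.h>

/* Toy stand-in for struct mount; the real structure has many more fields. */
struct toy_mount {
	const char *name;
	struct toy_mount *parent;	/* a tree root points to itself */
};

/* Same test shape as mnt_has_parent(): a root is its own parent. */
static int toy_has_parent(const struct toy_mount *m)
{
	return m != m->parent;
}

/* Simplified version of the "absolute root" precondition. */
static int toy_check_root(const struct toy_mount *root_mnt)
{
	if (!toy_has_parent(root_mnt))
		return -EINVAL;	/* absolute root */
	return 0;
}

/* Old layout: rootfs itself is the root of the mount tree. */
static int toy_old_layout(void)
{
	struct toy_mount rootfs = { .name = "rootfs" };

	rootfs.parent = &rootfs;
	return toy_check_root(&rootfs);
}

/* New layout: nullfs is the tree root and rootfs is mounted on top. */
static int toy_new_layout(void)
{
	struct toy_mount nullfs = { .name = "nullfs" };
	struct toy_mount rootfs = { .name = "rootfs", .parent = &nullfs };

	nullfs.parent = &nullfs;
	return toy_check_root(&rootfs);
}
```

In this model `toy_old_layout()` fails the check and `toy_new_layout()` passes it. The real function performs further checks (mountpoint, reachability, MNT_LOCKED) that the sketch deliberately leaves out.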

fs/nullfs.c

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#include <linux/fs/super_types.h>
+#include <linux/fs_context.h>
+#include <linux/magic.h>
+
+static const struct super_operations nullfs_super_operations = {
+	.statfs = simple_statfs,
+};
+
+static int nullfs_fs_fill_super(struct super_block *s, struct fs_context *fc)
+{
+	struct inode *inode;
+
+	s->s_maxbytes = MAX_LFS_FILESIZE;
+	s->s_blocksize = PAGE_SIZE;
+	s->s_blocksize_bits = PAGE_SHIFT;
+	s->s_magic = NULL_FS_MAGIC;
+	s->s_op = &nullfs_super_operations;
+	s->s_export_op = NULL;
+	s->s_xattr = NULL;
+	s->s_time_gran = 1;
+	s->s_d_flags = 0;
+
+	inode = new_inode(s);
+	if (!inode)
+		return -ENOMEM;
+
+	/* nullfs is permanently empty... */
+	make_empty_dir_inode(inode);
+	simple_inode_init_ts(inode);
+	inode->i_ino = 1;
+	/* ... and immutable. */
+	inode->i_flags |= S_IMMUTABLE;
+
+	s->s_root = d_make_root(inode);
+	if (!s->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/*
+ * For now this is a single global instance. If needed we can make it
+ * mountable by userspace at which point we will need to make it
+ * multi-instance.
+ */
+static int nullfs_fs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, nullfs_fs_fill_super);
+}
+
+static const struct fs_context_operations nullfs_fs_context_ops = {
+	.get_tree = nullfs_fs_get_tree,
+};
+
+static int nullfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &nullfs_fs_context_ops;
+	fc->global = true;
+	fc->sb_flags = SB_NOUSER;
+	fc->s_iflags = SB_I_NOEXEC | SB_I_NODEV;
+	return 0;
+}
+
+struct file_system_type nullfs_fs_type = {
+	.name = "nullfs",
+	.init_fs_context = nullfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
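
Two properties of nullfs stand out in the file above: it is a single global instance (get_tree_single()), and its root directory is empty and immutable. The userspace toy below sketches both ideas. It is not kernel code and does not call any kernel API. All `toy_` names are hypothetical, and the `-EPERM` result is an assumption of the model, chosen to match how the VFS generally refuses writes to immutable inodes.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a superblock with a single root directory. */
struct toy_super {
	unsigned long root_ino;
	int root_immutable;
};

static struct toy_super toy_nullfs_sb;
static struct toy_super *toy_nullfs_instance;

/*
 * Single-instance semantics: the first caller fills in the superblock,
 * every later caller gets the very same object back.
 */
static struct toy_super *toy_get_tree_single(void)
{
	if (!toy_nullfs_instance) {
		toy_nullfs_sb.root_ino = 1;		/* like inode->i_ino = 1 */
		toy_nullfs_sb.root_immutable = 1;	/* like S_IMMUTABLE */
		toy_nullfs_instance = &toy_nullfs_sb;
	}
	return toy_nullfs_instance;
}

/* Creating anything under an immutable root is refused in this model. */
static int toy_create(const struct toy_super *sb, const char *name)
{
	(void)name;
	if (sb->root_immutable)
		return -EPERM;
	return 0;
}
```

Because nothing can ever be created in the root, unmounting whatever covers a nullfs mount cannot reveal hidden files, which is the property the merge message relies on when it drops the MNT_LOCKED requirement.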
