Commit 3b3f874

Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Rename and export helpers that get write access to a mount. They
     are used in overlayfs to get write access to the upper mount.

   - Print the pretty name of the root device on boot failure. This
     helps in scenarios where we would usually only print
     "unknown-block(1,2)".

   - Add an internal SB_I_NOUMASK flag. This is another part in the
     endless POSIX ACL saga in a way.

     When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
     the umask because if the relevant inode has POSIX ACLs set it
     might take the umask from there. But if the inode doesn't have any
     POSIX ACLs set then we apply the umask in the filesystem itself.
     So we end up with:

      (1) no SB_POSIXACL -> strip umask in vfs
      (2) SB_POSIXACL    -> strip umask in filesystem

     The umask semantics associated with SB_POSIXACL allowed
     filesystems that don't even support POSIX ACLs at all to raise
     SB_POSIXACL purely to avoid umask stripping. That specifically
     means NFS v4 and Overlayfs. NFS v4 does it because it delegates
     this to the server and Overlayfs because it needs to delegate
     umask stripping to the upper filesystem, i.e., the filesystem used
     as the writable layer.

     This went so far that SB_POSIXACL is raised even on kernels that
     don't even have POSIX ACL support at all.

     Stop this blatant abuse and add SB_I_NOUMASK which is an internal
     superblock flag that filesystems can raise to opt out of umask
     handling. That should really only be the two mentioned above. It's
     not that we want any filesystems to do this. Ideally we have all
     umask handling always in the vfs.

   - Make overlayfs use SB_I_NOUMASK too.

   - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
     IS_POSIXACL() if the kernel doesn't have support for it. This is a
     very old patch but it's only possible to do this now with the
     wider cleanup that was done.
   - Follow-up work on fake path handling from last cycle. Citing
     mostly from Amir:

     When overlayfs was first merged, overlayfs files of regular files
     and directories, the ones that are installed in file table, had a
     "fake" path, namely, f_path is the overlayfs path and f_inode is
     the "real" inode on the underlying filesystem.

     In v6.5, we took another small step by introducing the
     backing_file container and the file_real_path() helper. This
     change allowed vfs and filesystem code to get the "real" path of
     an overlayfs backing file. With this change, we were able to make
     fsnotify work correctly and report events on the "real" filesystem
     objects that were accessed via overlayfs.

     This method works fine, but it still leaves the vfs vulnerable to
     new code that is not aware of files with fake path. A recent
     example is commit db1d1e8 ("IMA: use vfs_getattr_nosec to get the
     i_version"). This commit uses direct referencing to f_path in IMA
     code that otherwise uses file_inode() and file_dentry() to
     reference the filesystem objects that it is measuring.

     This contains work to switch things around: instead of having
     filesystem code opt-in to get the "real" path, have generic code
     opt-in for the "fake" path in the few places that it is needed.

     It is far more likely that new filesystems code that does not use
     the file_dentry() and file_real_path() helpers will end up causing
     crashes or averting LSM/audit rules if we keep the "fake" path
     exposed by default.

     This change already makes file_dentry() moot, but for now we did
     not change this helper, just added a WARN_ON() in ovl_d_real() to
     catch if we have made any wrong assumptions.

     After the dust settles on this change, we can make file_dentry() a
     plain accessor and we can drop the inode argument to ->d_real().

   - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a
     small change but it really isn't and I would like to see everyone
     on their tippie toes for any possible bugs from this work.
     Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU
     does for files for a very long time because of the nasty
     interactions between the SCM_RIGHTS file descriptor garbage
     collection. So extending it makes a lot of sense but it is a
     subtle change. There are almost no places that fiddle with file
     rcu semantics directly and the ones that did mess around with
     struct file internal under rcu have been made to stop doing that
     because it really was always dodgy.

     I forgot to put in the link tag for this change and the discussion
     in the commit so adding it into the merge message:

       https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com

  Cleanups:

   - Various smaller pipe cleanups including the removal of a spin lock
     that was only used to protect against writes without pipe_lock()
     from O_NOTIFICATION_PIPE aka watch queues. As that was never
     implemented remove the additional locking from pipe_write().

   - Annotate struct watch_filter with the new __counted_by attribute.

   - Clarify do_unlinkat() cleanup so that it doesn't look like an
     extra iput() is done that would cause issues.

   - Simplify file cleanup when the file has never been opened.

   - Use module helper instead of open-coding it.

   - Predict error unlikely for stale retry.

   - Use WRITE_ONCE() for mount expiry field instead of just commenting
     that one hopes the compiler doesn't get smart.

  Fixes:

   - Fix readahead on block devices.

   - Fix writeback when lazytime is enabled and inodes whose timestamp
     is the only thing that changed reside on wb->b_dirty_time. This
     caused excessively many zombie memory cgroups when lazytime was
     enabled as such inodes weren't handled fast enough.
   - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  file, i915: fix file reference for mmap_singleton()
  vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
  writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
  chardev: Simplify usage of try_module_get()
  ovl: rely on SB_I_NOUMASK
  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
  fs: store real path instead of fake path in backing file f_path
  fs: create helper file_user_path() for user displayed mapped file path
  fs: get mnt_writers count for an open backing file's real path
  vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
  vfs: predict the error in retry_estale as unlikely
  backing file: free directly
  vfs: fix readahead(2) on block devices
  io_uring: use files_lookup_fd_locked()
  file: convert to SLAB_TYPESAFE_BY_RCU
  vfs: shave work on failed file open
  fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
  watch_queue: Annotate struct watch_filter with __counted_by
  fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
  fs/pipe: remove unnecessary spinlock from pipe_write()
  ...
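The umask rules described in the SB_I_NOUMASK item of the merge message reduce to a small decision: the vfs strips the umask unless the superblock has SB_POSIXACL set or has opted out. The following is a userspace model of that decision only, not kernel code. `struct sb_model`, the `MODEL_*` flag values and every function name are hypothetical and chosen for illustration; the kernel's own helper may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; these are not the kernel's values. */
#define MODEL_SB_POSIXACL  (1u << 0)
#define MODEL_SB_I_NOUMASK (1u << 1)

struct sb_model {
	unsigned int flags;
};

/* True when the generic layer should apply the umask itself. */
bool model_vfs_strips_umask(const struct sb_model *sb)
{
	if (sb->flags & MODEL_SB_POSIXACL)
		return false;	/* case (2): the filesystem handles the umask */
	if (sb->flags & MODEL_SB_I_NOUMASK)
		return false;	/* explicit opt-out, e.g. NFS v4 or overlayfs */
	return true;		/* case (1): strip in the vfs */
}

/* Apply the decision to a requested creation mode. */
unsigned int model_apply_umask(const struct sb_model *sb,
			       unsigned int mode, unsigned int mask)
{
	if (model_vfs_strips_umask(sb))
		mode &= ~mask;
	return mode;
}

/* Exercise the three cases from the merge message. */
void model_umask_selftest(void)
{
	struct sb_model plain = { 0 };
	struct sb_model acl = { MODEL_SB_POSIXACL };
	struct sb_model noumask = { MODEL_SB_I_NOUMASK };

	assert(model_apply_umask(&plain, 0666, 022) == 0644);
	assert(model_apply_umask(&acl, 0666, 022) == 0666);
	assert(model_apply_umask(&noumask, 0666, 022) == 0666);
}
```

Calling `model_umask_selftest()` from a `main()` runs the three cases. The point of the separate opt-out bit is that a filesystem without ACL support no longer has to claim SB_POSIXACL just to keep its mode bits untouched.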
2 parents 0d63d8b + 61d4fb0 commit 3b3f874

39 files changed

Lines changed: 479 additions & 270 deletions

Documentation/filesystems/files.rst

Lines changed: 24 additions & 29 deletions
@@ -62,51 +62,30 @@ the fdtable structure -
    be held.
 
 4. To look up the file structure given an fd, a reader
-   must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
+   must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
    take care of barrier requirements due to lock-free lookup.
 
    An example::
 
 	struct file *file;
 
 	rcu_read_lock();
-	file = lookup_fd_rcu(fd);
-	if (file) {
-		...
-	}
-	....
+	file = lookup_fdget_rcu(fd);
 	rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
-   of the fd (fget()/fget_light()) are lock-free, it is possible
-   that look-up may race with the last put() operation on the
-   file structure. This is avoided using atomic_long_inc_not_zero()
-   on ->f_count::
-
-	rcu_read_lock();
-	file = files_lookup_fd_rcu(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
-			*fput_needed = 1;
-		else
-		/* Didn't get the reference, someone's freed */
-			file = NULL;
+		...
+		fput(file);
 	}
-	rcu_read_unlock();
 	....
-	return file;
-
-   atomic_long_inc_not_zero() detects if refcounts is already zero or
-   goes to zero during increment. If it does, we fail
-   fget()/fget_light().
 
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
    API. If they are looked up lock-free, rcu_dereference()
    must be used. However it is advisable to use files_fdtable()
-   and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
+   and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+   issues.
 
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
    holding files->file_lock. If ->file_lock is dropped, then
    another thread expand the files thereby creating a new
    fdtable and making the earlier fdtable pointer stale.
@@ -126,3 +105,19 @@ the fdtable structure -
    Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
    the fdtable pointer (fdt) must be loaded after locate_fd().
 
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __fget_files_rcu().
+
+In addition, it isn't possible to access or check fields in struct file
+without first acquiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
+either first acquire a reference or they must hold the file_lock of the
+fdtable.
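The "check the pointer before and after the increment" rule from the documentation text can be sketched in userspace with C11 atomics. This is a single-file model under stated assumptions, not the kernel implementation: `struct obj`, `obj_try_get()` and the `GET_*` results are hypothetical names, and the model assumes the object's memory stays valid and of the same type while it is being inspected, which is the guarantee SLAB_TYPESAFE_BY_RCU gives inside an RCU read-side section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_long refcount;
};

/* Take a reference unless the count has already dropped to zero. */
bool obj_inc_not_zero(struct obj *o)
{
	long old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

void obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
}

enum get_result { GET_OK, GET_EMPTY, GET_RETRY };

/* Try to take a reference on whatever object the slot currently names. */
enum get_result obj_try_get(struct obj *_Atomic *slot, struct obj **out)
{
	struct obj *o = atomic_load(slot);

	*out = NULL;
	if (!o)
		return GET_EMPTY;
	if (!obj_inc_not_zero(o))
		return GET_RETRY;	/* dying object, not replaced yet */
	/* The memory may have been recycled: confirm the slot still names it. */
	if (atomic_load(slot) != o) {
		obj_put(o);		/* we bumped a newer user's count */
		return GET_RETRY;
	}
	*out = o;
	return GET_OK;
}

/* Single-threaded walk through the empty, live and dying cases. */
void obj_selftest(void)
{
	struct obj a = { .refcount = 1 };
	struct obj *_Atomic slot = NULL;
	struct obj *got;

	assert(obj_try_get(&slot, &got) == GET_EMPTY);
	atomic_store(&slot, &a);
	assert(obj_try_get(&slot, &got) == GET_OK && got == &a);
	assert(atomic_load(&a.refcount) == 2);
}
```

A caller that must not fail spuriously would loop while the result is `GET_RETRY`, which mirrors the difference between a looping getter and a single-attempt one. The selftest cannot provoke the recycle race on a single thread; it only checks the non-racing paths.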

arch/arc/kernel/troubleshoot.c

Lines changed: 4 additions & 2 deletions
@@ -90,10 +90,12 @@ static void show_faulting_vma(unsigned long address)
 	 */
 	if (vma) {
 		char buf[ARC_PATH_MAX];
-		char *nm = "?";
+		char *nm = "anon";
 
 		if (vma->vm_file) {
-			nm = file_path(vma->vm_file, buf, ARC_PATH_MAX-1);
+			/* XXX: can we use %pD below and get rid of buf? */
+			nm = d_path(file_user_path(vma->vm_file), buf,
+				    ARC_PATH_MAX-1);
 			if (IS_ERR(nm))
 				nm = "?";
 		}

arch/powerpc/platforms/cell/spufs/coredump.c

Lines changed: 7 additions & 4 deletions
@@ -66,18 +66,21 @@ static int match_context(const void *v, struct file *file, unsigned fd)
  */
 static struct spu_context *coredump_next_context(int *fd)
 {
-	struct spu_context *ctx;
+	struct spu_context *ctx = NULL;
 	struct file *file;
 	int n = iterate_fd(current->files, *fd, match_context, NULL);
 	if (!n)
 		return NULL;
 	*fd = n - 1;
 
 	rcu_read_lock();
-	file = lookup_fd_rcu(*fd);
-	ctx = SPUFS_I(file_inode(file))->i_ctx;
-	get_spu_context(ctx);
+	file = lookup_fdget_rcu(*fd);
 	rcu_read_unlock();
+	if (file) {
+		ctx = SPUFS_I(file_inode(file))->i_ctx;
+		get_spu_context(ctx);
+		fput(file);
+	}
 
 	return ctx;
 }

drivers/gpu/drm/i915/gem/i915_gem_mman.c

Lines changed: 1 addition & 5 deletions
@@ -916,11 +916,7 @@ static struct file *mmap_singleton(struct drm_i915_private *i915)
 {
 	struct file *file;
 
-	rcu_read_lock();
-	file = READ_ONCE(i915->gem.mmap_singleton);
-	if (file && !get_file_rcu(file))
-		file = NULL;
-	rcu_read_unlock();
+	file = get_file_active(&i915->gem.mmap_singleton);
 	if (file)
 		return file;
 
926922

fs/char_dev.c

Lines changed: 1 addition & 1 deletion
@@ -350,7 +350,7 @@ static struct kobject *cdev_get(struct cdev *p)
 	struct module *owner = p->owner;
 	struct kobject *kobj;
 
-	if (owner && !try_module_get(owner))
+	if (!try_module_get(owner))
 		return NULL;
 	kobj = kobject_get_unless_zero(&p->kobj);
 	if (!kobj)
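Dropping the `owner &&` test preserves behaviour only if the helper already treats a NULL module as "nothing to pin, succeed", which is what the merge message means by using the module helper instead of open-coding it. The userspace model below illustrates that equivalence. `struct module_model` and all function names are hypothetical stand-ins, and the NULL-tolerant behaviour of the getter is the assumption being modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a module: alive or going away, plus a count. */
struct module_model {
	bool live;
	int refs;
};

/* NULL-tolerant getter: no owner means there is nothing to pin. */
bool model_try_module_get(struct module_model *m)
{
	if (!m)
		return true;
	if (!m->live)
		return false;
	m->refs++;
	return true;
}

/* Old shape: the caller checks for NULL before calling the helper. */
bool pin_owner_open_coded(struct module_model *owner)
{
	if (owner && !model_try_module_get(owner))
		return false;
	return true;
}

/* New shape: the NULL check lives inside the helper. */
bool pin_owner_simplified(struct module_model *owner)
{
	if (!model_try_module_get(owner))
		return false;
	return true;
}

/* Both shapes agree for a missing, a live and a dying owner. */
void model_module_selftest(void)
{
	struct module_model live = { true, 0 };
	struct module_model dying = { false, 0 };

	assert(pin_owner_open_coded(NULL) && pin_owner_simplified(NULL));
	assert(pin_owner_open_coded(&live) && pin_owner_simplified(&live));
	assert(live.refs == 2);
	assert(!pin_owner_open_coded(&dying) && !pin_owner_simplified(&dying));
	assert(dying.refs == 0);
}
```

Calling `model_module_selftest()` from a `main()` checks all three owner states. The two shapes differ only in where the NULL check lives.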

fs/file.c

Lines changed: 135 additions & 18 deletions
@@ -604,6 +604,9 @@ void fd_install(unsigned int fd, struct file *file)
 	struct files_struct *files = current->files;
 	struct fdtable *fdt;
 
+	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
+		return;
+
 	rcu_read_lock_sched();
 
 	if (unlikely(files->resize_in_progress)) {
@@ -853,8 +856,104 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
+static struct file *__get_file_rcu(struct file __rcu **f)
+{
+	struct file __rcu *file;
+	struct file __rcu *file_reloaded;
+	struct file __rcu *file_reloaded_cmp;
+
+	file = rcu_dereference_raw(*f);
+	if (!file)
+		return NULL;
+
+	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+		return ERR_PTR(-EAGAIN);
+
+	file_reloaded = rcu_dereference_raw(*f);
+
+	/*
+	 * Ensure that all accesses have a dependency on the load from
+	 * rcu_dereference_raw() above so we get correct ordering
+	 * between reuse/allocation and the pointer check below.
+	 */
+	file_reloaded_cmp = file_reloaded;
+	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
+
+	/*
+	 * atomic_long_inc_not_zero() above provided a full memory
+	 * barrier when we acquired a reference.
+	 *
+	 * This is paired with the write barrier from assigning to the
+	 * __rcu protected file pointer so that if that pointer still
+	 * matches the current file, we know we have successfully
+	 * acquired a reference to the right file.
+	 *
+	 * If the pointers don't match the file has been reallocated by
+	 * SLAB_TYPESAFE_BY_RCU.
+	 */
+	if (file == file_reloaded_cmp)
+		return file_reloaded;
+
+	fput(file);
+	return ERR_PTR(-EAGAIN);
+}
+
+/**
+ * get_file_rcu - try to get a reference to a file under rcu
+ * @f: the file to get a reference on
+ *
+ * This function tries to get a reference on @f carefully verifying that
+ * @f hasn't been reused.
+ *
+ * This function should rarely have to be used and only by users who
+ * understand the implications of SLAB_TYPESAFE_BY_RCU. Try to avoid it.
+ *
+ * Return: Returns @f with the reference count increased or NULL.
+ */
+struct file *get_file_rcu(struct file __rcu **f)
+{
+	for (;;) {
+		struct file __rcu *file;
+
+		file = __get_file_rcu(f);
+		if (unlikely(!file))
+			return NULL;
+
+		if (unlikely(IS_ERR(file)))
+			continue;
+
+		return file;
+	}
+}
+EXPORT_SYMBOL_GPL(get_file_rcu);
+
+/**
+ * get_file_active - try to get a reference to a file
+ * @f: the file to get a reference on
+ *
+ * In contrast to get_file_rcu() the pointer itself isn't part of the
+ * reference counting.
+ *
+ * This function should rarely have to be used and only by users who
+ * understand the implications of SLAB_TYPESAFE_BY_RCU. Try to avoid it.
+ *
+ * Return: Returns @f with the reference count increased or NULL.
+ */
+struct file *get_file_active(struct file **f)
+{
+	struct file __rcu *file;
+
+	rcu_read_lock();
+	file = __get_file_rcu(f);
+	rcu_read_unlock();
+	if (IS_ERR(file))
+		file = NULL;
+	return file;
+}
+EXPORT_SYMBOL_GPL(get_file_active);
+
 static inline struct file *__fget_files_rcu(struct files_struct *files,
-	unsigned int fd, fmode_t mask)
+					    unsigned int fd, fmode_t mask)
 {
 	for (;;) {
 		struct file *file;
@@ -865,12 +964,6 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 			return NULL;
 
 		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
-		file = rcu_dereference_raw(*fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(file->f_mode & mask))
-			return NULL;
 
 		/*
 		 * Ok, we have a file pointer. However, because we do
@@ -879,10 +972,15 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * Such a race can take two forms:
 		 *
-		 *  (a) the file ref already went down to zero,
-		 *      and get_file_rcu() fails. Just try again:
+		 *  (a) the file ref already went down to zero and the
+		 *      file hasn't been reused yet or the file count
+		 *      isn't zero but the file has already been reused.
 		 */
-		if (unlikely(!get_file_rcu(file)))
+		file = __get_file_rcu(fdentry);
+		if (unlikely(!file))
+			return NULL;
+
+		if (unlikely(IS_ERR(file)))
 			continue;
 
 		/*
@@ -893,12 +991,20 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
-		    unlikely(rcu_dereference_raw(*fdentry) != file)) {
+		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
 
+		/*
+		 * This isn't the file we're looking for or we're not
+		 * allowed to get a reference to it.
+		 */
+		if (unlikely(file->f_mode & mask)) {
+			fput(file);
+			return NULL;
+		}
+
 		/*
 		 * Ok, we have a ref to the file, and checked that it
 		 * still exists.
@@ -948,7 +1054,14 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
-struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
+struct file *lookup_fdget_rcu(unsigned int fd)
+{
+	return __fget_files_rcu(current->files, fd, 0);
+
+}
+EXPORT_SYMBOL_GPL(lookup_fdget_rcu);
+
+struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd)
 {
 	/* Must be called with rcu_read_lock held */
 	struct files_struct *files;
@@ -957,13 +1070,13 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
 	task_lock(task);
 	files = task->files;
 	if (files)
-		file = files_lookup_fd_rcu(files, fd);
+		file = __fget_files_rcu(files, fd, 0);
 	task_unlock(task);
 
 	return file;
 }
 
-struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret_fd)
+struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *ret_fd)
 {
 	/* Must be called with rcu_read_lock held */
 	struct files_struct *files;
@@ -974,7 +1087,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 	files = task->files;
 	if (files) {
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
-			file = files_lookup_fd_rcu(files, fd);
+			file = __fget_files_rcu(files, fd, 0);
 			if (file)
 				break;
 		}
@@ -983,7 +1096,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 	*ret_fd = fd;
 	return file;
 }
-EXPORT_SYMBOL(task_lookup_next_fd_rcu);
+EXPORT_SYMBOL(task_lookup_next_fdget_rcu);
 
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
@@ -1272,12 +1385,16 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
 {
 	if (unlikely(newfd == oldfd)) { /* corner case */
 		struct files_struct *files = current->files;
+		struct file *f;
 		int retval = oldfd;
 
 		rcu_read_lock();
-		if (!files_lookup_fd_rcu(files, oldfd))
+		f = __fget_files_rcu(files, oldfd, 0);
+		if (!f)
 			retval = -EBADF;
 		rcu_read_unlock();
+		if (f)
+			fput(f);
 		return retval;
 	}
 	return ksys_dup3(oldfd, newfd, 0);
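The corner case handled in this hunk is visible from userspace: `dup2(fd, fd)` returns `fd` when the descriptor is open and fails with EBADF when it is not. A minimal check using only POSIX calls is sketched below. The function name `dup2_same_fd_demo()` is hypothetical, and the check assumes no other thread opens descriptors concurrently.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Exercise dup2(fd, fd) on an open descriptor and on a closed one. */
void dup2_same_fd_demo(void)
{
	int p[2];
	int rc;

	rc = pipe(p);
	assert(rc == 0);

	/* Open descriptor: the call is a validity check and returns fd. */
	rc = dup2(p[0], p[0]);
	assert(rc == p[0]);

	close(p[0]);
	close(p[1]);

	/* Closed descriptor: the lookup fails and EBADF is reported. */
	errno = 0;
	rc = dup2(p[0], p[0]);
	assert(rc == -1);
	assert(errno == EBADF);
}
```

Calling `dup2_same_fd_demo()` from a `main()` runs both cases. The kernel-side change does not alter this visible behaviour; it only changes how the lookup obtains and then drops its temporary reference.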
