Skip to content

Commit 3831170

Browse files
fdmananagregkh
authored andcommitted
btrfs: fix corruption after buffer fault in during direct IO append write
commit 939b656 upstream. During an append (O_APPEND write flag) direct IO write if the input buffer was not previously faulted in, we can corrupt the file in a way that the final size is unexpected and it includes an unexpected hole. The problem happens like this: 1) We have an empty file, with size 0, for example; 2) We do an O_APPEND direct IO with a length of 4096 bytes and the input buffer is not currently faulted in; 3) We enter btrfs_direct_write(), lock the inode and call generic_write_checks(), which calls generic_write_checks_count(), and that function sets the iocb position to 0 with the following code: if (iocb->ki_flags & IOCB_APPEND) iocb->ki_pos = i_size_read(inode); 4) We call btrfs_dio_write() and enter into iomap, which will end up calling btrfs_dio_iomap_begin() and that calls btrfs_get_blocks_direct_write(), where we update the i_size of the inode to 4096 bytes; 5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access the page of the write input buffer (at iomap_dio_bio_iter(), with a call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets returned to btrfs at btrfs_direct_write() via btrfs_dio_write(); 6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode, fault in the write buffer and then goto to the label 'relock'; 7) We lock again the inode, do all the necessary checks again and call again generic_write_checks(), which calls generic_write_checks_count() again, and there we set the iocb's position to 4K, which is the current i_size of the inode, with the following code pointed above: if (iocb->ki_flags & IOCB_APPEND) iocb->ki_pos = i_size_read(inode); 8) Then we go again to btrfs_dio_write() and enter iomap and the write succeeds, but it wrote to the file range [4K, 8K), leaving a hole in the [0, 4K) range and an i_size of 8K, which goes against the expectations of having the data written to the range [0, 4K) and get an i_size of 4K. Fix this by not unlocking the inode before faulting in the input buffer, in case we get -EFAULT or an incomplete write, and not jumping to the 'relock' label after faulting in the buffer - instead jump to a location immediately before calling iomap, skipping all the write checks and relocking. This solves this problem and it's fine even in case the input buffer is memory mapped to the same file range, since only holding the range locked in the inode's io tree can cause a deadlock, it's safe to keep the inode lock (VFS lock), as was fixed and described in commit 51bd956 ("btrfs: fix deadlock due to page faults during direct IO reads and writes"). A sample reproducer provided by a reporter is the following: $ cat test.c #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <fcntl.h> #include <stdio.h> #include <sys/mman.h> #include <sys/stat.h> #include <unistd.h> int main(int argc, char *argv[]) { if (argc < 2) { fprintf(stderr, "Usage: %s <test file>\n", argv[0]); return 1; } int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT | O_APPEND, 0644); if (fd < 0) { perror("creating test file"); return 1; } char *buf = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); ssize_t ret = write(fd, buf, 4096); if (ret < 0) { perror("pwritev2"); return 1; } struct stat stbuf; ret = fstat(fd, &stbuf); if (ret < 0) { perror("stat"); return 1; } printf("size: %llu\n", (unsigned long long)stbuf.st_size); return stbuf.st_size == 4096 ? 0 : 1; } A test case for fstests will be sent soon. Reported-by: Hanna Czenczek <hreitz@redhat.com> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/ Fixes: 8184620 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb") CC: stable@vger.kernel.org # 6.1+ Tested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1 parent 31a679a commit 3831170

2 files changed

Lines changed: 43 additions & 13 deletions

File tree

fs/btrfs/ctree.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -457,6 +457,7 @@ struct btrfs_file_private {
457457
void *filldir_buf;
458458
u64 last_index;
459459
struct extent_state *llseek_cached_state;
460+
bool fsync_skip_inode_lock;
460461
};
461462

462463
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)

fs/btrfs/file.c

Lines changed: 42 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1550,21 +1550,37 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
15501550
* So here we disable page faults in the iov_iter and then retry if we
15511551
* got -EFAULT, faulting in the pages before the retry.
15521552
*/
1553+
again:
15531554
from->nofault = true;
15541555
dio = btrfs_dio_write(iocb, from, written);
15551556
from->nofault = false;
15561557

1557-
/*
1558-
* iomap_dio_complete() will call btrfs_sync_file() if we have a dsync
1559-
* iocb, and that needs to lock the inode. So unlock it before calling
1560-
* iomap_dio_complete() to avoid a deadlock.
1561-
*/
1562-
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
1563-
1564-
if (IS_ERR_OR_NULL(dio))
1558+
if (IS_ERR_OR_NULL(dio)) {
15651559
ret = PTR_ERR_OR_ZERO(dio);
1566-
else
1560+
} else {
1561+
struct btrfs_file_private stack_private = { 0 };
1562+
struct btrfs_file_private *private;
1563+
const bool have_private = (file->private_data != NULL);
1564+
1565+
if (!have_private)
1566+
file->private_data = &stack_private;
1567+
1568+
/*
1569+
* If we have a synchoronous write, we must make sure the fsync
1570+
* triggered by the iomap_dio_complete() call below doesn't
1571+
* deadlock on the inode lock - we are already holding it and we
1572+
* can't call it after unlocking because we may need to complete
1573+
* partial writes due to the input buffer (or parts of it) not
1574+
* being already faulted in.
1575+
*/
1576+
private = file->private_data;
1577+
private->fsync_skip_inode_lock = true;
15671578
ret = iomap_dio_complete(dio);
1579+
private->fsync_skip_inode_lock = false;
1580+
1581+
if (!have_private)
1582+
file->private_data = NULL;
1583+
}
15681584

15691585
/* No increment (+=) because iomap returns a cumulative value. */
15701586
if (ret > 0)
@@ -1591,10 +1607,12 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
15911607
} else {
15921608
fault_in_iov_iter_readable(from, left);
15931609
prev_left = left;
1594-
goto relock;
1610+
goto again;
15951611
}
15961612
}
15971613

1614+
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
1615+
15981616
/*
15991617
* If 'ret' is -ENOTBLK or we have not written all data, then it means
16001618
* we must fallback to buffered IO.
@@ -1793,6 +1811,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
17931811
*/
17941812
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
17951813
{
1814+
struct btrfs_file_private *private = file->private_data;
17961815
struct dentry *dentry = file_dentry(file);
17971816
struct inode *inode = d_inode(dentry);
17981817
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1802,6 +1821,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
18021821
int ret = 0, err;
18031822
u64 len;
18041823
bool full_sync;
1824+
const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
18051825

18061826
trace_btrfs_sync_file(file, datasync);
18071827

@@ -1829,7 +1849,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
18291849
if (ret)
18301850
goto out;
18311851

1832-
btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
1852+
if (skip_ilock)
1853+
down_write(&BTRFS_I(inode)->i_mmap_lock);
1854+
else
1855+
btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
18331856

18341857
atomic_inc(&root->log_batch);
18351858

@@ -1853,7 +1876,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
18531876
*/
18541877
ret = start_ordered_ops(inode, start, end);
18551878
if (ret) {
1856-
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
1879+
if (skip_ilock)
1880+
up_write(&BTRFS_I(inode)->i_mmap_lock);
1881+
else
1882+
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
18571883
goto out;
18581884
}
18591885

@@ -1982,7 +2008,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
19822008
* file again, but that will end up using the synchronization
19832009
* inside btrfs_sync_log to keep things safe.
19842010
*/
1985-
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
2011+
if (skip_ilock)
2012+
up_write(&BTRFS_I(inode)->i_mmap_lock);
2013+
else
2014+
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
19862015

19872016
if (ret == BTRFS_NO_LOG_SYNC) {
19882017
ret = btrfs_end_transaction(trans);

0 commit comments

Comments
 (0)