
Commit fa8d4e6

Mike Snitzer authored and Chuck Lever committed
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

    /sys/kernel/debug/nfsd/io_cache_read
    /sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls). Future
updates will provide more specific guidance and howto information to
help others use and evaluate NFSD's IO modes: BUFFERED, DONTCACHE and
DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
1 parent 06c5c97 commit fa8d4e6

1 file changed

Lines changed: 144 additions & 0 deletions

.. SPDX-License-Identifier: GPL-2.0

=============
NFSD IO MODES
=============

Overview
========

NFSD has historically always used buffered IO when servicing READ and
WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
to override that default to use either the DONTCACHE or DIRECT IO mode.

Experimental NFSD debugfs interfaces are available to allow the NFSD IO
mode used for READ and WRITE to be configured independently. See both:

- /sys/kernel/debug/nfsd/io_cache_read
- /sys/kernel/debug/nfsd/io_cache_write

The default value for both io_cache_read and io_cache_write reflects
NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).

Based on the configured settings, NFSD's IO will either be:

- cached using the page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache on completion (NFSD_IO_DONTCACHE=1)
- not cached, issued with stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)

To set an NFSD IO mode, write a supported value (0 - 2) to the
corresponding IO operation's debugfs interface, e.g.::

    echo 2 > /sys/kernel/debug/nfsd/io_cache_read
    echo 2 > /sys/kernel/debug/nfsd/io_cache_write

To check which IO mode NFSD is using for READ or WRITE, simply read the
corresponding IO operation's debugfs interface, e.g.::

    cat /sys/kernel/debug/nfsd/io_cache_read
    cat /sys/kernel/debug/nfsd/io_cache_write
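The numeric values map to the mode names introduced above. A small helper
(illustrative only; the function name is hypothetical and not part of the
kernel or nfs-utils) can make scripts built around these interfaces more
readable:

```shell
# Map an io_cache_read/io_cache_write value to its NFSD IO mode name.
# Illustrative helper only; not part of NFSD itself.
nfsd_io_mode_name() {
    case "$1" in
        0) echo BUFFERED ;;
        1) echo DONTCACHE ;;
        2) echo DIRECT ;;
        *) echo "UNKNOWN($1)"; return 1 ;;
    esac
}

# Example use on a live system (requires debugfs to be mounted):
# nfsd_io_mode_name "$(cat /sys/kernel/debug/nfsd/io_cache_read)"
```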

If you experiment with NFSD's IO modes on a recent kernel and have
interesting results, please report them to linux-nfs@vger.kernel.org.

NFSD DONTCACHE
==============

DONTCACHE offers a hybrid approach to servicing IO that aims to offer
the benefits of using DIRECT IO without any of the strict alignment
requirements that DIRECT IO imposes. To achieve this, buffered IO is
used but the IO is flagged to "drop behind" (meaning the associated
pages are dropped from the page cache) when the IO completes.

DONTCACHE aims to avoid what has proven to be a fairly significant
limitation of Linux's memory management subsystem if/when large amounts
of data are infrequently accessed (e.g. read once _or_ written once but
not read until much later). Such use-cases are particularly problematic
because the page cache will eventually become a bottleneck to servicing
new IO requests.

For more context on DONTCACHE, please see these Linux commit headers:

- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
  to take a struct kiocb")
- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
  RWF_DONTCACHE")
- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")

NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
filesystem doesn't indicate support by setting FOP_DONTCACHE.

NFSD DIRECT
===========

DIRECT IO doesn't make use of the page cache; as such, it is able to
avoid the Linux memory management subsystem's page reclaim scalability
problems without resorting to the hybrid use of the page cache that
DONTCACHE does.

Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. The pathological worst-case workload that NFSD DIRECT has
proven to help most is: an NFS client issuing large sequential IO to a
file that is 2-3 times larger than the NFS server's available system
memory. The reason for such improvement is that NFSD DIRECT eliminates a
lot of work that the memory management subsystem would otherwise be
required to perform (e.g. page allocation, dirty writeback, page
reclaim). When using NFSD DIRECT, kswapd and kcompactd no longer command
CPU time trying to find adequate free pages so that forward IO progress
can be made.

The performance win associated with using NFSD DIRECT was previously
discussed on linux-nfs, see:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/

But in summary:

- NFSD DIRECT can significantly reduce memory requirements
- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
- NFSD DIRECT can offer more deterministic IO performance

As always, your mileage may vary, so it is important to carefully
consider if/when it is beneficial to make use of NFSD DIRECT. When
assessing the comparative performance of your workload, be sure to log
relevant performance metrics during testing (e.g. memory usage, CPU
usage, IO performance). Using perf to collect profile data, and
generating a "flamegraph" of the work Linux must perform on behalf of
your test, is a meaningful way to compare the relative health of the
system and how switching NFSD's IO mode changes what is observed.

If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
NFSD's debugfs interfaces, ideally the IO will be aligned relative to
the underlying block device's logical_block_size. Also, the memory
buffer used to store the READ or WRITE payload must be aligned relative
to the underlying block device's dma_alignment.
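The offset/length half of that rule can be sketched as a simple modulus
check. This is an illustrative helper, not NFSD code; the in-kernel
checks also cover buffer dma_alignment, which a script cannot see:

```shell
# True when both the file offset and the IO length are multiples of the
# block device's logical_block_size. Illustrative only; not part of NFSD.
is_dio_aligned() {
    off=$1; len=$2; lbs=$3
    [ $(( off % lbs )) -eq 0 ] && [ $(( len % lbs )) -eq 0 ]
}

# A device's logical_block_size can typically be read from sysfs, e.g.:
# cat /sys/block/sda/queue/logical_block_size
```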

But NFSD DIRECT does handle misaligned IO, in terms of O_DIRECT, as best
it can:

Misaligned READ:
  If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
  DIO-aligned block (on either end of the READ). The expanded READ is
  verified with proper offset/len (logical_block_size) and
  dma_alignment checks.
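The expansion described above amounts to rounding the READ's offset down,
and its end up, to the nearest logical-block boundary. A sketch of that
arithmetic (not NFSD's actual code; the function name is hypothetical):

```shell
# Expand a possibly misaligned READ so both ends are DIO-aligned:
# round the offset down and the end of the READ up to multiples of
# the logical block size. Prints "aligned_offset aligned_length".
# Illustrative sketch only; not NFSD's in-kernel implementation.
expand_read() {
    off=$1; len=$2; lbs=$3
    start=$(( off - off % lbs ))
    end=$(( off + len ))
    rem=$(( end % lbs ))
    if [ "$rem" -ne 0 ]; then
        end=$(( end + lbs - rem ))
    fi
    echo "$start $(( end - start ))"
}

# e.g. a 1000-byte READ at offset 100 with 512-byte blocks becomes a
# 1536-byte READ at offset 0:
# expand_read 100 1000 512
```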

Misaligned WRITE:
  If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
  middle and end segment as needed. The large middle segment is
  DIO-aligned, and the start and/or end segments are misaligned.
  Buffered IO is used for the misaligned segments and O_DIRECT is used
  for the middle DIO-aligned segment. DONTCACHE buffered IO is _not_
  used for the misaligned segments because using normal buffered IO
  offers a significant RMW performance benefit when handling streaming
  misaligned WRITEs.
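The split described above can be sketched as follows: carve out the
largest logical-block-aligned middle segment for O_DIRECT and leave any
misaligned leading/trailing leftovers to buffered IO. This is an
illustrative model of the strategy, not NFSD's in-kernel code:

```shell
# Split a WRITE of (offset, length) into up to three segments, printing
# one "<io-mode> <offset> <length>" line per segment. Illustrative only.
split_write() {
    off=$1; len=$2; lbs=$3
    end=$(( off + len ))
    mid_start=$(( (off + lbs - 1) / lbs * lbs ))   # round offset up
    mid_end=$(( end / lbs * lbs ))                 # round end down
    if [ "$mid_end" -le "$mid_start" ]; then
        echo "buffered $off $len"                  # no aligned middle
        return 0
    fi
    if [ "$mid_start" -gt "$off" ]; then
        echo "buffered $off $(( mid_start - off ))"
    fi
    echo "direct $mid_start $(( mid_end - mid_start ))"
    if [ "$end" -gt "$mid_end" ]; then
        echo "buffered $mid_end $(( end - mid_end ))"
    fi
    return 0
}
```

For example, a 2000-byte WRITE at offset 100 with 512-byte blocks splits
into a 412-byte buffered head, a 1536-byte O_DIRECT middle at offset
512, and a 52-byte buffered tail, while a fully aligned WRITE yields a
single direct segment.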

Tracing:
  The nfsd_read_direct trace event shows how NFSD expands any
  misaligned READ to the next DIO-aligned block (on either end of the
  original READ, as needed).

  This combination of trace events is useful for READs::

    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

  The nfsd_write_direct trace event shows how NFSD splits a given
  misaligned WRITE into a DIO-aligned middle segment.

  This combination of trace events is useful for WRITEs::

    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
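The per-event echo commands above can be wrapped in a small loop. The
helper below takes the tracing root as a parameter so it can be tried
against a scratch directory; on a live system, pass /sys/kernel/tracing
(root privileges required). The helper itself is illustrative and not
part of the kernel:

```shell
# Enable a list of trace events under a given tracing root by writing 1
# to each event's "enable" file. Illustrative helper; not kernel code.
enable_events() {
    root=$1; shift
    for ev in "$@"; do
        echo 1 > "$root/events/$ev/enable"
    done
}

# Usage on a live system, mirroring the WRITE event list above:
# enable_events /sys/kernel/tracing \
#     nfsd/nfsd_write_opened nfsd/nfsd_write_direct \
#     nfsd/nfsd_write_io_done xfs/xfs_file_direct_write
```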
