| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +============= |
| 4 | +NFSD IO MODES |
| 5 | +============= |
| 6 | + |
| 7 | +Overview |
| 8 | +======== |
| 9 | + |
| 10 | +NFSD has historically always used buffered IO when servicing READ and |
| 11 | +WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible |
| 12 | +to override that default to use either DONTCACHE or DIRECT IO modes. |
| 13 | + |
| 14 | +Experimental NFSD debugfs interfaces are available to allow the NFSD IO |
| 15 | +mode used for READ and WRITE to be configured independently. See both: |
| 16 | +- /sys/kernel/debug/nfsd/io_cache_read |
| 17 | +- /sys/kernel/debug/nfsd/io_cache_write |
| 18 | + |
| 19 | +The default value for both io_cache_read and io_cache_write reflects |
| 20 | +NFSD's default IO mode (which is NFSD_IO_BUFFERED=0). |
| 21 | + |
| 22 | +Based on the configured settings, NFSD's IO will either be: |
| 23 | +- cached using page cache (NFSD_IO_BUFFERED=0) |
| 24 | +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1) |
| 25 | +- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2) |
| 26 | + |
| 27 | +To set an NFSD IO mode, write a supported value (0 - 2) to the |
| 28 | +corresponding IO operation's debugfs interface, e.g.: |
| 29 | + echo 2 > /sys/kernel/debug/nfsd/io_cache_read |
| 30 | + echo 2 > /sys/kernel/debug/nfsd/io_cache_write |
| 31 | + |
| 32 | +To check which IO mode NFSD is using for READ or WRITE, simply read the |
| 33 | +corresponding IO operation's debugfs interface, e.g.: |
| 34 | + cat /sys/kernel/debug/nfsd/io_cache_read |
| 35 | + cat /sys/kernel/debug/nfsd/io_cache_write |
| 36 | + |
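The set-then-check steps above can be combined into one small shell helper. This is only a sketch: the function name `nfsd_set_io_mode` and the `NFSD_DEBUGFS_DIR` override are hypothetical and not part of NFSD. It assumes debugfs is mounted at /sys/kernel/debug and that the caller has permission to write there.

```shell
# Hypothetical helper (not part of NFSD): set the IO mode for READ or WRITE
# and confirm the setting by reading the value back.
# NFSD_DEBUGFS_DIR is an assumed override so the helper can be dry-run
# against a scratch directory.
nfsd_set_io_mode() {
    _dir=${NFSD_DEBUGFS_DIR:-/sys/kernel/debug/nfsd}
    _op=$1
    _mode=$2

    case "$_op" in
        read|write) ;;
        *) echo "usage: nfsd_set_io_mode read|write MODE" >&2; return 2 ;;
    esac

    # Write the requested mode; give up if the write is refused.
    printf '%s\n' "$_mode" > "$_dir/io_cache_$_op" || return 1

    # Read the value back and succeed only if it matches the request.
    _cur=$(cat "$_dir/io_cache_$_op") || return 1
    [ "$_cur" = "$_mode" ]
}
```

For example, `nfsd_set_io_mode write 1` would request DONTCACHE for WRITE and return non-zero if the read-back value differs from what was written.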
| 37 | +If you experiment with NFSD's IO modes on a recent kernel and have |
| 38 | +interesting results, please report them to linux-nfs@vger.kernel.org. |
| 39 | + |
| 40 | +NFSD DONTCACHE |
| 41 | +============== |
| 42 | + |
| 43 | +DONTCACHE offers a hybrid approach to servicing IO that aims to offer |
| 44 | +the benefits of using DIRECT IO without any of the strict alignment |
| 45 | +requirements that DIRECT IO imposes. To achieve this, buffered IO is used |
| 46 | +but the IO is flagged to "drop behind" (meaning associated pages are |
| 47 | +dropped from the page cache) when IO completes. |
| 48 | + |
| 49 | +DONTCACHE aims to avoid what has proven to be a fairly significant |
| 50 | +limitation of Linux's memory management subsystem if/when large amounts of |
| 51 | +data are infrequently accessed (e.g. read once _or_ written once but not |
| 52 | +read until much later). Such use-cases are particularly problematic |
| 53 | +because the page cache will eventually become a bottleneck to servicing |
| 54 | +new IO requests. |
| 55 | + |
| 56 | +For more context on DONTCACHE, please see these Linux commit headers: |
| 57 | +- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio() |
| 58 | + to take a struct kiocb") |
| 59 | +- for READ: 8026e49bff9b1 ("mm/filemap: add read support for |
| 60 | + RWF_DONTCACHE") |
| 61 | +- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE") |
| 62 | + |
| 63 | +NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying |
| 64 | +filesystem doesn't indicate support by setting FOP_DONTCACHE. |
| 65 | + |
| 66 | +NFSD DIRECT |
| 67 | +=========== |
| 68 | + |
| 69 | +DIRECT IO doesn't make use of the page cache; as such, it is able to |
| 70 | +avoid the Linux memory management's page reclaim scalability problems |
| 71 | +without resorting to the hybrid use of page cache that DONTCACHE does. |
| 72 | + |
| 73 | +Some workloads benefit from NFSD avoiding the page cache, particularly |
| 74 | +those with a working set that is significantly larger than available |
| 75 | +system memory. The pathological worst-case workload that NFSD DIRECT has |
| 76 | +proven to help most is: NFS client issuing large sequential IO to a file |
| 77 | +that is 2-3 times larger than the NFS server's available system memory. |
| 78 | +The reason for such improvement is NFSD DIRECT eliminates a lot of work |
| 79 | +that the memory management subsystem would otherwise be required to |
| 80 | +perform (e.g. page allocation, dirty writeback, page reclaim). When |
| 81 | +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU |
| 82 | +time trying to find adequate free pages so that forward IO progress can |
| 83 | +be made. |
| 84 | + |
| 85 | +The performance win associated with using NFSD DIRECT was previously |
| 86 | +discussed on linux-nfs, see: |
| 87 | +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/ |
| 88 | +But in summary: |
| 89 | +- NFSD DIRECT can significantly reduce memory requirements |
| 90 | +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work |
| 91 | +- NFSD DIRECT can offer more deterministic IO performance |
| 92 | + |
| 93 | +As always, your mileage may vary and so it is important to carefully |
| 94 | +consider if/when it is beneficial to make use of NFSD DIRECT. When |
| 95 | +assessing comparative performance of your workload please be sure to log |
| 96 | +relevant performance metrics during testing (e.g. memory usage, cpu |
| 97 | +usage, IO performance). Collecting data with perf and rendering it as a |
| 98 | +"flamegraph" of the work Linux performs on behalf of your test is a |
| 99 | +useful way to compare the relative health of the system and to see how |
| 100 | +switching NFSD's IO mode changes what is observed. |
| 101 | + |
| 102 | +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to |
| 103 | +NFSD's debugfs interfaces, ideally the IO will be aligned relative to |
| 104 | +the underlying block device's logical_block_size. Also the memory buffer |
| 105 | +used to store the READ or WRITE payload must be aligned relative to the |
| 106 | +underlying block device's dma_alignment. |
| 107 | + |
| 108 | +But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best |
| 109 | +it can: |
| 110 | + |
| 111 | +Misaligned READ: |
| 112 | + If NFSD_IO_DIRECT is used, expand any misaligned READ to the next |
| 113 | + DIO-aligned block (on either end of the READ). The expanded READ is |
| 114 | + then checked for proper offset/len (logical_block_size) and |
| 115 | + dma_alignment. |
| 116 | + |
| 117 | +Misaligned WRITE: |
| 118 | + If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start, |
| 119 | + middle and end as needed. The large middle segment is DIO-aligned |
| 120 | + and the start and/or end are misaligned. Buffered IO is used for the |
| 121 | + misaligned segments and O_DIRECT is used for the middle DIO-aligned |
| 122 | + segment. DONTCACHE buffered IO is _not_ used for the misaligned |
| 123 | + segments because using normal buffered IO offers significant RMW |
| 124 | + performance benefit when handling streaming misaligned WRITEs. |
| 125 | + |
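The offset/length arithmetic behind the misaligned READ expansion and the misaligned WRITE split can be illustrated with a short shell sketch. This is not NFSD's code: the function names `expand_read` and `split_write` are hypothetical, the 4096-byte block size in the examples is an assumed logical_block_size, dma_alignment of the memory buffer is not modelled, and treating a WRITE with no aligned middle as entirely buffered is an assumption.

```shell
# expand_read OFFSET LEN BLOCK
# Prints "START LENGTH" of the READ after rounding the start down and the
# end up to a multiple of BLOCK (illustration only, not kernel code).
expand_read() {
    _off=$1; _len=$2; _blk=$3
    _start=$(( _off / _blk * _blk ))
    _end=$(( (_off + _len + _blk - 1) / _blk * _blk ))
    echo "$_start $(( _end - _start ))"
}

# split_write OFFSET LEN BLOCK
# Prints "HEAD MIDDLE TAIL" byte counts: HEAD and TAIL are the misaligned
# ends (buffered IO), MIDDLE is the block-aligned segment (O_DIRECT).
split_write() {
    _off=$1; _len=$2; _blk=$3
    _end=$(( _off + _len ))
    _mid_start=$(( (_off + _blk - 1) / _blk * _blk ))
    _mid_end=$(( _end / _blk * _blk ))
    if [ "$_mid_end" -le "$_mid_start" ]; then
        # Assumption: with no aligned middle, everything stays buffered.
        echo "$_len 0 0"
        return 0
    fi
    echo "$(( _mid_start - _off )) $(( _mid_end - _mid_start )) $(( _end - _mid_end ))"
}

expand_read 1000 5000 4096     # prints "0 8192"
split_write 1000 10000 4096    # prints "3096 4096 2808"
```

In the WRITE example the three segments add up to the original 10000 bytes, with only the 4096-byte middle eligible for O_DIRECT under these assumptions.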
| 126 | +Tracing: |
| 127 | + The nfsd_read_direct trace event shows how NFSD expands any |
| 128 | + misaligned READ to the next DIO-aligned block (on either end of the |
| 129 | + original READ, as needed). |
| 130 | + |
| 131 | + This combination of trace events is useful for READs: |
| 132 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable |
| 133 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable |
| 134 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable |
| 135 | + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable |
| 136 | + |
| 137 | + The nfsd_write_direct trace event shows how NFSD splits a given |
| 138 | + misaligned WRITE into a DIO-aligned middle segment. |
| 139 | + |
| 140 | + This combination of trace events is useful for WRITEs: |
| 141 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable |
| 142 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable |
| 143 | + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable |
| 144 | + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable |
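The enable commands above can also be driven from a small loop, which makes it easier to skip events that are absent on a given system (for instance, the xfs events when XFS is not in use). This is a sketch only: the function name `enable_trace_events` and the `TRACING_DIR` override are hypothetical, and tracefs is assumed to be mounted at /sys/kernel/tracing.

```shell
# Hypothetical helper: enable each listed trace event, skipping any whose
# "enable" file is missing or not writable.
# TRACING_DIR is an assumed override for dry runs against a scratch tree.
enable_trace_events() {
    _root=${TRACING_DIR:-/sys/kernel/tracing}
    for _ev in "$@"; do
        _f="$_root/events/$_ev/enable"
        if [ -w "$_f" ]; then
            echo 1 > "$_f"
        else
            echo "skipping $_ev (no writable $_f)" >&2
        fi
    done
}

# Example (as root), using the WRITE events listed above:
#   enable_trace_events nfsd/nfsd_write_opened nfsd/nfsd_write_direct \
#       nfsd/nfsd_write_io_done xfs/xfs_file_direct_write
```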