Commit dd81dc0

Darrick J. Wong committed
Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction commits on large CPU count machines. My 32p machine hits contention limits in xlog_cil_commit() at about 700,000 transaction commits a second. It hits this at 16 thread workloads, and 32 thread workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points in the xlog_cil_commit() path. It does this by moving to a combination of per-cpu counters, unordered per-cpu lists and post-ordered per-cpu lists.

This results in transaction commit rates exceeding 1.4 million commits/s under certain unlink workloads, and while the log lock contention is largely gone, there is still significant lock contention in the VFS (dentry cache, inode cache and security layers) at >600,000 transactions/s that still limits scalability.

The changes to the CIL accounting and behaviour, combined with the structural changes to xlog_write() in prior patchsets, make the per-cpu restructuring possible and sane. This allows us to move to precalculated reservation requirements that allow for reservation stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders on a "growth" basis, we pre-calculate how many iclogs we'll need to write out a maximally sized CIL checkpoint and steal that reserved space one commit at a time until the CIL has a full reservation. If we ever run a commit when we are already at the hard limit (because post-throttling) we simply take an extra reservation from each commit that is run when over the limit. Hence we don't need to do space usage math in the fast path and so never need to sum the per-cpu counters in this fast path.

Similarly, per-cpu lists have the problem of ordering - we can't remove an item from a per-cpu list if we want to move it forward in the CIL. We solve this problem by using an atomic counter to give every commit a sequence number that is copied into the log items in that transaction. Hence relogging items just overwrites the sequence number in the log item, and does not move it in the per-cpu lists. Once we reaggregate the per-cpu lists back into a single list in the CIL push work, we can run it through list_sort() and reorder it back into a globally ordered list. This costs a bit of CPU time, but now that the CIL can run multiple works and pipelines properly, this is not a limiting factor for performance. It does increase fsync latency when the CIL is full, but workloads issuing large numbers of fsync()s or sync transactions end up with very small CILs and so the latency impact of sorting is not measurable for such workloads.

Overall, this pushes the transaction commit bottleneck out to the lockless reservation grant head updates. These atomic updates don't start to be a limiting factor until > 1.5 million transactions/s are being run, at which point the accounting functions start to show up in profiles as the highest CPU users. Still, this series doubles transaction throughput without increasing CPU usage before we get to that cacheline contention breakdown point...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks
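
The ordering scheme described in the message (stamp each commit with a sequence number, never move a relogged item between per-cpu lists, sort once at push time) can be illustrated with a small user-space sketch. This is not the kernel code: `demo_item`, `demo_commit()`, `demo_push_sort()` and the `cpu` field are hypothetical names invented here, C11 atomics stand in for the kernel's atomic counter, and `qsort()` on an array stands in for `list_sort()` on the reaggregated list.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a log item; only the order ID matters here. */
struct demo_item {
	uint32_t order_id;	/* sequence number of the last commit touching it */
	int      cpu;		/* per-cpu list the item sits on (illustrative) */
};

/* Global commit sequence counter (assumption: one counter for all CPUs). */
static atomic_uint demo_commit_seq;

/*
 * Commit or relog an item: overwrite its order ID in place.
 * The item is never unlinked from, or moved within, its per-cpu list.
 */
static void demo_commit(struct demo_item *item)
{
	item->order_id = atomic_fetch_add(&demo_commit_seq, 1) + 1;
}

/* Three-way comparison on order ID, written to avoid unsigned wrap in a subtraction. */
static int demo_order_cmp(const void *a, const void *b)
{
	const struct demo_item *ia = a;
	const struct demo_item *ib = b;

	return (ia->order_id > ib->order_id) - (ia->order_id < ib->order_id);
}

/*
 * Push time: the items have already been gathered from every per-cpu
 * list into one array; a single sort restores global commit order.
 */
static void demo_push_sort(struct demo_item *items, size_t n)
{
	qsort(items, n, sizeof(*items), demo_order_cmp);
}
```

In this sketch, committing three items and then relogging the first one leaves the relogged item last after `demo_push_sort()`, without it ever having been moved while commits were running. The sorting cost is paid once per push rather than on every commit, which is the trade-off the message describes.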
2 parents 88084a3 + 51a117e commit dd81dc0

9 files changed

Lines changed: 768 additions & 190 deletions

Documentation/filesystems/xfs-delayed-logging-design.rst

Lines changed: 322 additions & 39 deletions
Large diffs are not rendered by default.

fs/xfs/xfs_log.c

Lines changed: 35 additions & 20 deletions
@@ -57,7 +57,8 @@ xlog_grant_push_ail(
 STATIC void
 xlog_sync(
 	struct xlog *log,
-	struct xlog_in_core *iclog);
+	struct xlog_in_core *iclog,
+	struct xlog_ticket *ticket);
 #if defined(DEBUG)
 STATIC void
 xlog_verify_grant_tail(
@@ -567,7 +568,8 @@ xlog_state_shutdown_callbacks(
 int
 xlog_state_release_iclog(
 	struct xlog *log,
-	struct xlog_in_core *iclog)
+	struct xlog_in_core *iclog,
+	struct xlog_ticket *ticket)
 {
 	xfs_lsn_t tail_lsn;
 	bool last_ref;
@@ -614,7 +616,7 @@ xlog_state_release_iclog(
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
 
 	spin_unlock(&log->l_icloglock);
-	xlog_sync(log, iclog);
+	xlog_sync(log, iclog, ticket);
 	spin_lock(&log->l_icloglock);
 	return 0;
 }
@@ -881,7 +883,7 @@ xlog_force_iclog(
 	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
 	if (iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
-	return xlog_state_release_iclog(iclog->ic_log, iclog);
+	return xlog_state_release_iclog(iclog->ic_log, iclog, NULL);
 }
 
 /*
@@ -944,6 +946,8 @@ xlog_write_unmount_record(
 		.lv_niovecs = 1,
 		.lv_iovecp = &reg,
 	};
+	LIST_HEAD(lv_chain);
+	list_add(&vec.lv_list, &lv_chain);
 
 	BUILD_BUG_ON((sizeof(struct xlog_op_header) +
 		      sizeof(struct xfs_unmount_log_format)) !=
@@ -952,7 +956,7 @@ xlog_write_unmount_record(
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(unmount_rec);
 
-	return xlog_write(log, NULL, &vec, ticket, reg.i_len);
+	return xlog_write(log, NULL, &lv_chain, ticket, reg.i_len);
 }
 
 /*
@@ -2025,7 +2029,8 @@ xlog_calc_iclog_size(
 STATIC void
 xlog_sync(
 	struct xlog *log,
-	struct xlog_in_core *iclog)
+	struct xlog_in_core *iclog,
+	struct xlog_ticket *ticket)
 {
 	unsigned int count; /* byte count of bwrite */
 	unsigned int roundoff; /* roundoff to BB or stripe */
@@ -2037,12 +2042,20 @@ xlog_sync(
 
 	count = xlog_calc_iclog_size(log, iclog, &roundoff);
 
-	/* move grant heads by roundoff in sync */
-	xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
-	xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	/*
+	 * If we have a ticket, account for the roundoff via the ticket
+	 * reservation to avoid touching the hot grant heads needlessly.
+	 * Otherwise, we have to move grant heads directly.
+	 */
+	if (ticket) {
+		ticket->t_curr_res -= roundoff;
+	} else {
+		xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
+		xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	}
 
 	/* put cycle number in every block */
-	xlog_pack_data(log, iclog, roundoff);
+	xlog_pack_data(log, iclog, roundoff);
 
 	/* real byte length */
 	size = iclog->ic_offset;
@@ -2275,7 +2288,7 @@ xlog_write_get_more_iclog_space(
 	spin_lock(&log->l_icloglock);
 	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC);
 	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
@@ -2471,13 +2484,13 @@ int
 xlog_write(
 	struct xlog *log,
 	struct xfs_cil_ctx *ctx,
-	struct xfs_log_vec *log_vector,
+	struct list_head *lv_chain,
 	struct xlog_ticket *ticket,
 	uint32_t len)
 
 {
 	struct xlog_in_core *iclog = NULL;
-	struct xfs_log_vec *lv = log_vector;
+	struct xfs_log_vec *lv;
 	uint32_t record_cnt = 0;
 	uint32_t data_cnt = 0;
 	int error = 0;
@@ -2505,7 +2518,7 @@ xlog_write(
 	if (ctx)
 		xlog_cil_set_ctx_write_state(ctx, iclog);
 
-	while (lv) {
+	list_for_each_entry(lv, lv_chain, lv_list) {
 		/*
 		 * If the entire log vec does not fit in the iclog, punt it to
 		 * the partial copy loop which can handle this case.
@@ -2526,7 +2539,6 @@ xlog_write(
 			xlog_write_full(lv, ticket, iclog, &log_offset,
 					 &len, &record_cnt, &data_cnt);
 		}
-		lv = lv->lv_next;
 	}
 	ASSERT(len == 0);
 
@@ -2538,7 +2550,7 @@ xlog_write(
 	 */
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, 0);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
@@ -2958,7 +2970,7 @@ xlog_state_get_iclog_space(
 	 * reference to the iclog.
 	 */
 	if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
@@ -3406,7 +3418,8 @@ xfs_log_ticket_get(
 static int
 xlog_calc_unit_res(
 	struct xlog *log,
-	int unit_bytes)
+	int unit_bytes,
+	int *niclogs)
 {
 	int iclog_space;
 	uint num_headers;
@@ -3486,6 +3499,8 @@ xlog_calc_unit_res(
 	/* roundoff padding for transaction data and one for commit record */
 	unit_bytes += 2 * log->l_iclog_roundoff;
 
+	if (niclogs)
+		*niclogs = num_headers;
 	return unit_bytes;
 }
 
@@ -3494,7 +3509,7 @@ xfs_log_calc_unit_res(
 	struct xfs_mount *mp,
 	int unit_bytes)
 {
-	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+	return xlog_calc_unit_res(mp->m_log, unit_bytes, NULL);
 }
 
 /*
@@ -3512,7 +3527,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_cache, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xlog_calc_unit_res(log, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task = current;
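
The xlog_sync() hunk in this file charges the iclog roundoff to the caller's ticket when one is passed, and only falls back to moving the shared grant heads when the ticket is NULL (as in the xlog_force_iclog() path). A minimal user-space model of that branch follows, assuming heavily simplified types: `demo_ticket`, `demo_log` and `demo_account_roundoff()` are hypothetical names, and the real grant heads are not plain counters as they are here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct xlog_ticket. */
struct demo_ticket {
	int curr_res;		/* bytes still reserved by this ticket */
};

/* Hypothetical stand-in for the shared grant heads (plain counters here). */
struct demo_log {
	long reserve_head;
	long write_head;
};

/*
 * Account for iclog roundoff padding.
 *
 * With a ticket, the padding is taken out of space the ticket has
 * already reserved, so the shared counters are not touched. Without
 * one, both shared heads have to be advanced directly.
 */
static void demo_account_roundoff(struct demo_log *log,
				  struct demo_ticket *ticket, int roundoff)
{
	if (ticket) {
		ticket->curr_res -= roundoff;
	} else {
		log->reserve_head += roundoff;
		log->write_head += roundoff;
	}
}
```

The point of the sketch is the access pattern: the ticket is private to its owner, while the grant heads are shared by every CPU, so the ticket path avoids writes to contended cachelines.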

fs/xfs/xfs_log.h

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@
 struct xfs_cil_ctx;
 
 struct xfs_log_vec {
-	struct xfs_log_vec *lv_next; /* next lv in build list */
+	struct list_head lv_list; /* CIL lv chain ptrs */
+	uint32_t lv_order_id; /* chain ordering info */
 	int lv_niovecs; /* number of iovecs in lv */
 	struct xfs_log_iovec *lv_iovecp; /* iovec array */
 	struct xfs_log_item *lv_item; /* owner */
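
Replacing the `lv_next` pointer with an embedded `struct list_head` is what lets xlog_write() walk the chain with list_for_each_entry() instead of a hand-rolled `while (lv)` loop. The sketch below imitates that layout in user space. It is an illustration only: `demo_list_init()`, `demo_list_add_tail()`, `demo_log_vec`, `demo_entry()` and `demo_chain_bytes()` are hypothetical names, the `lv_bytes` field is invented for the example, and the kernel's own list helpers are not used.

```c
#include <stddef.h>

/* Minimal user-space imitation of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
static void demo_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Append at the tail so iteration order matches insertion order. */
static void demo_list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical, cut-down log vector: chain linkage plus a byte count. */
struct demo_log_vec {
	struct list_head lv_list;	/* embedded chain linkage */
	int              lv_bytes;	/* illustrative payload size */
};

/* Recover the containing demo_log_vec from a pointer to its lv_list member. */
#define demo_entry(ptr) \
	((struct demo_log_vec *)((char *)(ptr) - \
				 offsetof(struct demo_log_vec, lv_list)))

/* Walk the whole chain, as a list_for_each_entry()-style loop would. */
static int demo_chain_bytes(struct list_head *lv_chain)
{
	int total = 0;

	for (struct list_head *p = lv_chain->next; p != lv_chain; p = p->next)
		total += demo_entry(p)->lv_bytes;
	return total;
}
```

Because the linkage is embedded and doubly linked, an empty chain needs no NULL special case, and a chain like this can be spliced or sorted with generic list code, which is what the rest of the series relies on.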
