fix(audit): shard audit ledger per user, drop retries#286
Closed
kptdobe wants to merge 2 commits into
Closed
Conversation
Cross-user concurrency on a single audit.txt key drove writeAuditEntry's 412 contention: every writer reads the same etag, only one wins per generation, and burst load (6814 failed/24h on COR-39 Day 3, worse than the 5614 baseline) stays in the losing-etag generation across all retries. The retry-budget bump (#282, 4->6 retries / ~3050 ms) did not converge. A prior attempt to use one R2 object per audit entry was rejected on COR-50 by the board for blowing up file count. This change shards the live audit by user: each writer hits {org}/{repo}/.da-versions/{fileId}/audit-{userHash}.txt. Cross-user contention disappears by construction (different keys); same-user concurrent writes fall inside AUDIT_TIME_WINDOW_MS and therefore dedup to the same observable audit sequence whether the loser is retried or dropped, so the 6-retry budget is removed entirely. A single GET-modify-PUT-with-If-Match runs per call; 412 surfaces as 500 immediately and is then absorbed on read. userHash is a privacy-preserving 16-hex SHA-256 of the normalized emails (emails are NOT readable via ListObjectsV2). Anonymous payloads land on a stable 'audit-anon.txt' shard. Per-user archive rotation continues to fire at AUDIT_MAX_ENTRIES, keyed by user. readAuditLines is unchanged: it pages ListObjectsV2 under the same audit prefix, so legacy audit.txt + audit-{ts}.txt archives + new per-user shards all merge transparently. No backfill, no schema change. Acceptance: writeAuditEntry-failed < 50/24h sustained for 3 consecutive daily reviews after deploy. The only remaining failure modes on the write path are rare same-user concurrent races (folded by dedup) and transient network/R2 failures. - src/storage/version/paths.js: add auditUserKey, auditUserArchiveKey. - src/storage/version/audit.js: usersHash; writeAuditEntry targets the per-user shard; no retry on 412. - test/storage/version/audit.test.js: drop the four 412-retry tests; add cross-user, same-user, anonymous, and no-retry-on-412 tests; update the archive test to assert per-user key shapes. - test/storage/version/paths.test.js: cover auditUserKey and auditUserArchiveKey. - test/storage/object/conditionals.test.js: PutObjectCommand count drops from 8 to 2 (1 main + 1 audit, no retry); elevated mocha timeout removed. Supersedes the per-entry append-only ledger from PR #285 (closed). Refs COR-50, COR-39. Co-Authored-By: Paperclip <noreply@paperclip.ing>
Today's `console.error('writeAuditEntry failed', e)` log only carries the
SDK error and the parent Cloudflare worker URL — there is no way to pivot
Coralogix by audit shard. Adding the three fields the per-user ledger
key is derived from lets us tell, post-deploy, whether residual 412s are
concentrated on a single user shard (true same-user race) or spread
across many shards (something else we missed).
No functional change: log-shape only.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Contributor
Author
|
Closing per board feedback on COR-50. Coralogix shows the dominant writeAuditEntry-failed pattern is same-user parallel-fire from one client (translation tool in adobe/da-nx/nx/blocks/loc, single user New direction: file a follow-up on da-nx to serialize / batch the translation tool writes (root cause), and optionally downgrade the retry-exhausted 412 to |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Shard the audit ledger per user. Cross-user concurrency on a single
audit.txtwas the root of the COR-39 contention; with per-user shards the cross-user race is impossible by construction, and the residual same-user race folds into the existing dedup. The 6-retry budget is removed.Why
writeAuditEntry-failedreached 6814 / 24h, worse than the 5614 pre-deploy baseline.Design
{org}/{repo}/.da-versions/{fileId}/audit-{userHash}.txt.userHash= 16 hex chars of SHA-256 of the normalized emails (privacy preserving — emails are NOT readable viaListObjectsV2).audit-anon.txt.AUDIT_MAX_ENTRIES, keyed by user (audit-{userHash}-{ts}.txt).If-Match. 412 surfaces as 500 immediately. No retry.readAuditLinesalready pagesListObjectsV2under the audit prefix, so legacyaudit.txt+audit-{ts}.txtarchives + new per-user shards all merge transparently.Why no retry is safe
AUDIT_TIME_WINDOW_MS, so dedup folds them into one observable entry whether the 412-loser is retried or dropped. No audit-entry loss as observed by the reader.Changes
src/storage/version/paths.js: addauditUserKey,auditUserArchiveKey.src/storage/version/audit.js: addusersHash;writeAuditEntrytargets per-user shard; no retry on 412.test/storage/version/audit.test.js:audit-anon.txt, 412 returns 500 with no retry.audit-{hash}-{ts}.txtshape.test/storage/version/paths.test.js: coverauditUserKeyandauditUserArchiveKey.test/storage/object/conditionals.test.js:PutObjectCommandcount drops 8 → 2 (1 main + 1 audit, no retry); elevated 8000 ms mocha timeout removed.Verification
npm test— 396 passing, 0 failing.npx eslinton touched files — clean.Coralogix post-deploy:
Target: < 50 / 24h sustained for 3 consecutive daily reviews.
Migration
None. Legacy
audit.txt+audit-{ts}.txtfiles stay in place; new writes go to per-user shards.readAuditLinesmerges everything via the existing prefix scan.Supersedes / refs