fix(audit): append-only ledger eliminates writeAuditEntry contention#285
Closed
kptdobe wants to merge 1 commit into
Closed
fix(audit): append-only ledger eliminates writeAuditEntry contention#285kptdobe wants to merge 1 commit into
kptdobe wants to merge 1 commit into
Conversation
Replace audit.txt read-modify-write-with-If-Match + 6-retry loop with one
unconditional PUT per entry under {org}/{repo}/.da-versions/{fileId}/audit/{ts}-{rand}.txt.
COR-39's retry-budget bump (#282, 4->6 retries / ~3050 ms) did not converge
(Day 3 post-deploy: 6814 writeAuditEntry-failed / 24h vs 5614 baseline);
the contention window outlasts the retry budget under burst. Removing the
shared key removes the contention by construction.
Read path already merges everything under the audit prefix via ListObjectsV2,
so legacy audit.txt + audit-{ts}.txt archives + new per-entry objects are
transparently consumed. Same-user same-window collapse moves from write-side
to read-side post-processing (AUDIT_TIME_WINDOW_MS unchanged) so the observable
audit sequence is preserved.
- src/storage/version/audit.js: writeAuditEntry single unconditional PUT,
readAuditLines applies the collapse pass, paginates ListObjectsV2.
- src/storage/version/paths.js: add auditEntryKey(repo, fileId, ts, rand).
- test/storage/version/audit.test.js: replace write-side collapse/retry/archive
tests with read-side collapse + per-entry PUT contract tests; keep mixed-file
read tests to prove transparent migration.
- test/storage/object/conditionals.test.js: PUT count drops from 8 to 2 and the
elevated 8000 ms mocha timeout is no longer needed.
- test/storage/version/paths.test.js: cover auditEntryKey.
- test/storage/version/put.test.js: comment update only.
Acceptance: writeAuditEntry-failed < 50/24h sustained for 3 consecutive daily
reviews post-deploy. Zero PreconditionFailed on the audit-write path by
construction. Compaction deferred until measured (existing archive cycle still
applies; per-entry objects throttle naturally via ListObjectsV2 page size).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Contributor
Author
|
Closing per board feedback on COR-50: this design (one R2 object per audit entry) is the historically-rejected model and reintroduces the file-count problem the current audit.txt + dedup approach was meant to solve. Next direction is per-user audit-file sharding (audit-{userHash}.txt) — bounds R2 file count by distinct users while breaking the cross-user If-Match contention that exhausted retries in COR-39. Awaiting board confirmation on COR-50 before opening a fresh PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
{org}/{repo}/.da-versions/{fileId}/audit/{ts}-{rand}.txt.AUDIT_TIME_WINDOW_MS(30 min) preserved so the observable sequence is unchanged.audit.txt+audit-{ts}.txtarchive files keep their entries;readAuditLinescontinues to merge everything under theauditprefix viaListObjectsV2.Why
COR-39 (#282) bumped the audit-write retry budget 4 to 6 (~750 ms to ~3050 ms) but did not converge:
Concurrent writers race the etag on a single key; sustained burst keeps everyone in the losing-etag generation across all retries. Removing the shared key removes the contention by construction.
Changes
Verification
Local:
Coralogix (post-deploy): writeAuditEntry-failed Coralogix rate < 50 / 24h sustained for 3 consecutive daily reviews after deploy. The only failure mode left on the write path is a network/R2 outage (zero PreconditionFailed by construction).
Out of scope
Refs Cor COR-50. Supersedes #282.