Skip to content

fix(audit): append-only ledger eliminates writeAuditEntry contention#285

Closed
kptdobe wants to merge 1 commit into
mainfrom
fix/audit-append-only-ledger
Closed

fix(audit): append-only ledger eliminates writeAuditEntry contention#285
kptdobe wants to merge 1 commit into
mainfrom
fix/audit-append-only-ledger

Conversation

@kptdobe
Copy link
Copy Markdown
Contributor

@kptdobe kptdobe commented May 26, 2026

Summary

  • Replace audit.txt read-modify-write-with-If-Match + 6-retry loop with a single unconditional PUT per entry under {org}/{repo}/.da-versions/{fileId}/audit/{ts}-{rand}.txt.
  • Move the same-user same-window collapse logic from write-side to read-side; AUDIT_TIME_WINDOW_MS (30 min) preserved so the observable sequence is unchanged.
  • Transparent migration: existing audit.txt + audit-{ts}.txt archive files keep their entries; readAuditLines continues to merge everything under the audit prefix via ListObjectsV2.

Why

COR-39 (#282) bumped the audit-write retry budget 4 to 6 (~750 ms to ~3050 ms) but did not converge:

Date writeAuditEntry-failed / 24h
Pre-deploy baseline 5614
Day 1 (post-deploy) 0
Day 2 (post-deploy) 273
Day 3 (post-deploy) 6814 (worse than baseline)
Day 4 (post-deploy) 830

Concurrent writers race the etag on a single key; sustained burst keeps everyone in the losing-etag generation across all retries. Removing the shared key removes the contention by construction.

Changes

  • src/storage/version/audit.js
    • writeAuditEntry: one unconditional PUT to auditEntryKey(repo, fileId, ts, rand). No GET, no etag, no retry.
    • readAuditLines: paginates ListObjectsV2, sorts ascending by timestamp, applies a read-side collapse pass that matches the historical write-side behavior (same user, both edits, within window).
  • src/storage/version/paths.js
    • Add auditEntryKey(repo, fileId, ts, rand). The auditDirPrefix doc is updated to note per-entry objects.
  • test/storage/version/audit.test.js
    • Replace write-side retry / collapse / archive tests with: per-entry PUT contract tests (no GET, no If-Match, no retry on 412) and read-side collapse tests (within window, cross-window, cross-user, version-breaks-window, malformed-users-JSON).
    • Add a merges legacy audit.txt, archive files, and per-entry objects test that proves the migration is transparent.
    • Add a ContinuationToken pagination test for readAuditLines.
  • test/storage/object/conditionals.test.js
    • PUT with If-Match: returns 412 ... no longer needs the elevated 8000 ms mocha timeout; expected PutObjectCommand count drops from 8 to 2 (1 main + 1 audit).
  • test/storage/version/paths.test.js
    • Cover auditEntryKey (suffix format, lives under auditDirPrefix).
  • test/storage/version/put.test.js
    • Comment update on the writeAuditEntry-called-exactly-once test to reflect the append-only contract (no retries).

Verification

Local:

  • npm test -- 394 passing, 0 failing (includes all new tests).
  • npx eslint on touched files -- clean.

Coralogix (post-deploy): writeAuditEntry-failed Coralogix rate < 50 / 24h sustained for 3 consecutive daily reviews after deploy. The only failure mode left on the write path is a network/R2 outage (zero PreconditionFailed by construction).

Out of scope

  • Compaction. With per-entry objects the existing archive cycle no longer fires. Deferred until measured.
  • No backfill. Legacy audit.txt and audit-{ts}.txt files remain in place; readAuditLines continues to merge them in.

Refs Cor COR-50. Supersedes #282.

Replace audit.txt read-modify-write-with-If-Match + 6-retry loop with one
unconditional PUT per entry under {org}/{repo}/.da-versions/{fileId}/audit/{ts}-{rand}.txt.
COR-39's retry-budget bump (#282, 4->6 retries / ~3050 ms) did not converge
(Day 3 post-deploy: 6814 writeAuditEntry-failed / 24h vs 5614 baseline);
the contention window outlasts the retry budget under burst. Removing the
shared key removes the contention by construction.

Read path already merges everything under the audit prefix via ListObjectsV2,
so legacy audit.txt + audit-{ts}.txt archives + new per-entry objects are
transparently consumed. Same-user same-window collapse moves from write-side
to read-side post-processing (AUDIT_TIME_WINDOW_MS unchanged) so the observable
audit sequence is preserved.

- src/storage/version/audit.js: writeAuditEntry single unconditional PUT,
  readAuditLines applies the collapse pass, paginates ListObjectsV2.
- src/storage/version/paths.js: add auditEntryKey(repo, fileId, ts, rand).
- test/storage/version/audit.test.js: replace write-side collapse/retry/archive
  tests with read-side collapse + per-entry PUT contract tests; keep mixed-file
  read tests to prove transparent migration.
- test/storage/object/conditionals.test.js: PUT count drops from 8 to 2 and the
  elevated 8000 ms mocha timeout is no longer needed.
- test/storage/version/paths.test.js: cover auditEntryKey.
- test/storage/version/put.test.js: comment update only.

Acceptance: writeAuditEntry-failed < 50/24h sustained for 3 consecutive daily
reviews post-deploy. Zero PreconditionFailed on the audit-write path by
construction. Compaction deferred until measured (existing archive cycle still
applies; per-entry objects throttle naturally via ListObjectsV2 page size).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@kptdobe
Copy link
Copy Markdown
Contributor Author

kptdobe commented May 26, 2026

Closing per board feedback on COR-50: this design (one R2 object per audit entry) is the historically-rejected model and reintroduces the file-count problem the current audit.txt + dedup approach was meant to solve.

Next direction is per-user audit-file sharding (audit-{userHash}.txt) — bounds R2 file count by distinct users while breaking the cross-user If-Match contention that exhausted retries in COR-39. Awaiting board confirmation on COR-50 before opening a fresh PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant