Skip to content

fix(audit): shard audit ledger per user, drop retries#286

Closed
kptdobe wants to merge 2 commits into
mainfrom
fix/audit-per-user-shard
Closed

fix(audit): shard audit ledger per user, drop retries#286
kptdobe wants to merge 2 commits into
mainfrom
fix/audit-per-user-shard

Conversation

@kptdobe
Copy link
Copy Markdown
Contributor

@kptdobe kptdobe commented May 26, 2026

Summary

Shard the audit ledger per user. Cross-user concurrency on a single audit.txt was the root of the COR-39 contention; with per-user shards the cross-user race is impossible by construction, and the residual same-user race folds into the existing dedup. The 6-retry budget is removed.

Why

  • COR-39 retry-bump (fix(audit): grow writeAuditEntry 412 backoff to 6 retries / ~3050 ms #282) did not converge: Day 3 post-deploy writeAuditEntry-failed reached 6814 / 24h, worse than the 5614 pre-deploy baseline.
  • A prior attempt to use one R2 object per audit entry was rejected on COR-50 by the board for blowing up file count.
  • Per-user shards eliminate the cross-user etag race while bounding file count by distinct users (not by entries).

Design

  • New live shard key: {org}/{repo}/.da-versions/{fileId}/audit-{userHash}.txt.
  • userHash = 16 hex chars of SHA-256 of the normalized emails (privacy preserving — emails are NOT readable via ListObjectsV2).
  • Anonymous identities land on audit-anon.txt.
  • Per-user archive rotation continues at AUDIT_MAX_ENTRIES, keyed by user (audit-{userHash}-{ts}.txt).
  • Write path: single GET → modify → PUT with If-Match. 412 surfaces as 500 immediately. No retry.
  • Read path: unchanged. readAuditLines already pages ListObjectsV2 under the audit prefix, so legacy audit.txt + audit-{ts}.txt archives + new per-user shards all merge transparently.

Why no retry is safe

  • Cross-user concurrent writes: zero contention (different keys).
  • Same-user concurrent writes: rare (multi-tab autosave is the realistic case). Both writes are by the same user inside AUDIT_TIME_WINDOW_MS, so dedup folds them into one observable entry whether the 412-loser is retried or dropped. No audit-entry loss as observed by the reader.
  • Same-user version + edit at the same instant: not realistic — version entries fire on explicit user actions, not autosave. If one is dropped, the next normal write self-heals the file state.
  • Network / R2 transient failure: 500, logged. Same as today.

Changes

  • src/storage/version/paths.js: add auditUserKey, auditUserArchiveKey.
  • src/storage/version/audit.js: add usersHash; writeAuditEntry targets per-user shard; no retry on 412.
  • test/storage/version/audit.test.js:
    • Drop the four 412-retry tests (retries-on-412, persists-7-attempts, exponential-jitter-backoff, succeeds-on-5th-attempt).
    • Add: cross-user writes use distinct keys, same-user writes share a key, anonymous → audit-anon.txt, 412 returns 500 with no retry.
    • Update: archive test asserts per-user audit-{hash}-{ts}.txt shape.
  • test/storage/version/paths.test.js: cover auditUserKey and auditUserArchiveKey.
  • test/storage/object/conditionals.test.js: PutObjectCommand count drops 8 → 2 (1 main + 1 audit, no retry); elevated 8000 ms mocha timeout removed.

Verification

  • npm test — 396 passing, 0 failing.
  • npx eslint on touched files — clean.

Coralogix post-deploy:

source logs last 24h
| filter $d.ScriptName == da-admin
| explode $d.Logs into l
| filter l.Level == error
| explode l.Message into m
| filter m.contains(writeAuditEntry)
| count

Target: < 50 / 24h sustained for 3 consecutive daily reviews.

Migration

None. Legacy audit.txt + audit-{ts}.txt files stay in place; new writes go to per-user shards. readAuditLines merges everything via the existing prefix scan.

Supersedes / refs

kptdobe and others added 2 commits May 26, 2026 13:04
Cross-user concurrency on a single audit.txt key drove writeAuditEntry's 412
contention: every writer reads the same etag, only one wins per generation,
and burst load (6814 failed/24h on COR-39 Day 3, worse than the 5614 baseline)
stays in the losing-etag generation across all retries. The retry-budget bump
(#282, 4->6 retries / ~3050 ms) did not converge. A prior attempt to use one
R2 object per audit entry was rejected on COR-50 by the board for blowing up
file count.

This change shards the live audit by user: each writer hits
{org}/{repo}/.da-versions/{fileId}/audit-{userHash}.txt. Cross-user
contention disappears by construction (different keys); same-user concurrent
writes fall inside AUDIT_TIME_WINDOW_MS and therefore dedup to the same
observable audit sequence whether the loser is retried or dropped, so the
6-retry budget is removed entirely. A single GET-modify-PUT-with-If-Match
runs per call; 412 surfaces as 500 immediately and is then absorbed on read.

userHash is a privacy-preserving 16-hex SHA-256 of the normalized emails
(emails are NOT readable via ListObjectsV2). Anonymous payloads land on a
stable 'audit-anon.txt' shard. Per-user archive rotation continues to fire
at AUDIT_MAX_ENTRIES, keyed by user.

readAuditLines is unchanged: it pages ListObjectsV2 under the same audit
prefix, so legacy audit.txt + audit-{ts}.txt archives + new per-user shards
all merge transparently. No backfill, no schema change.

Acceptance: writeAuditEntry-failed < 50/24h sustained for 3 consecutive daily
reviews after deploy. The only remaining failure modes on the write path are
rare same-user concurrent races (folded by dedup) and transient network/R2
failures.

- src/storage/version/paths.js: add auditUserKey, auditUserArchiveKey.
- src/storage/version/audit.js: usersHash; writeAuditEntry targets the
  per-user shard; no retry on 412.
- test/storage/version/audit.test.js: drop the four 412-retry tests; add
  cross-user, same-user, anonymous, and no-retry-on-412 tests; update the
  archive test to assert per-user key shapes.
- test/storage/version/paths.test.js: cover auditUserKey and
  auditUserArchiveKey.
- test/storage/object/conditionals.test.js: PutObjectCommand count drops
  from 8 to 2 (1 main + 1 audit, no retry); elevated mocha timeout removed.

Supersedes the per-entry append-only ledger from PR #285 (closed).
Refs COR-50, COR-39.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Today's `console.error('writeAuditEntry failed', e)` log only carries the
SDK error and the parent Cloudflare worker URL — there is no way to pivot
Coralogix by audit shard. Adding the three fields the per-user ledger
key is derived from lets us tell, post-deploy, whether residual 412s are
concentrated on a single user shard (true same-user race) or spread
across many shards (something else we missed).

No functional change: log-shape only.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@kptdobe
Copy link
Copy Markdown
Contributor Author

kptdobe commented May 26, 2026

Closing per board feedback on COR-50. Coralogix shows the dominant writeAuditEntry-failed pattern is same-user parallel-fire from one client (translation tool in adobe/da-nx/nx/blocks/loc, single user syt@adobe.com firing many version entries within milliseconds, sometimes literally identical). Per-user sharding does not help that case — same-user writes hash to the same shard and race the same etag. Board confirmed no audit-data-model redesign for this edge case.

New direction: file a follow-up on da-nx to serialize / batch the translation tool writes (root cause), and optionally downgrade the retry-exhausted 412 to console.warn in da-admin (small follow-up patch, pending board acceptance).

@kptdobe kptdobe closed this May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant