fix(webapp): deliver realtime changes with current content when the read replica lags#3910
Conversation
|
WalkthroughThis PR adds replica lag measurement and "read-your-writes" consistency gating to the realtime backend. It introduces new replica-lag source abstractions (Aurora and PostgreSQL dialects) and an estimator that samples lag over time. All event publishing paths—metadata service, route handlers, and run engine handlers—now capture and propagate 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint install timed out. The project may have too many dependencies for the sandbox. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ead replica lags The realtime feed hydrates change rows from the read replica, and that read can race the replica's apply of the very write that triggered it. Subscribers then receive the previous change's content, and an isolated final change is only corrected by the backstop poll. Publishers now stamp change records with the committed row's updatedAt (taken from writes they already perform, no extra queries), the router waits out the measured replica lag before hydrating, and a tripwire retries stale reads, feeding observations back into the lag estimate. Exhausted retries deliver anyway (liveness over freshness) while echo re-hydrates emit the fresh row through the normal diff once the replica catches up. Also: metadata updates that write nothing no longer publish a change record, and buffered parent/root metadata operations now publish when the flusher writes them.
A streaming standby behind the "replica" compose profile with a configurable recovery_min_apply_delay, so replica-lag behavior can be reproduced deterministically on a laptop. Also raises the primary's max_connections so multi-instance local testing has connection headroom.
5a91d14 to
0e8088d
Compare
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/plugins
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
commit: |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts (1)
256-263: 💤 Low valueConsider adding
unit: "ms"for consistency with other latency metrics.Other
_msmetrics in this file (e.g.,runSetQueryMsat line 31,deliveryLagMsat line 61) includeunit: "ms"in their options. Adding it here would maintain consistency, though be aware this will cause Prometheus to export it asrealtime_native_replica_lag_estimate_ms_milliseconds.Suggested change
if (lagEstimator) { meter .createObservableGauge("realtime_native.replica_lag_estimate_ms", { description: "The read-your-writes gate's current replica-lag estimate (max sample in the window). Wake hydrates are delayed by roughly this much past each change's commit.", + unit: "ms", }) .addCallback((result) => result.observe(lagEstimator.getLagMs())); }Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: cf1d7ce6-4985-43a4-985f-3d8500989b9d
📒 Files selected for processing (13)
.server-changes/realtime-replica-read-consistency.mdapps/webapp/app/env.server.tsapps/webapp/app/routes/api.v1.runs.$runId.metadata.tsapps/webapp/app/routes/api.v1.runs.$runId.tags.tsapps/webapp/app/services/metadata/updateMetadata.server.tsapps/webapp/app/services/metadata/updateMetadataInstance.server.tsapps/webapp/app/services/realtime/envChangeRouter.server.tsapps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.tsapps/webapp/app/services/realtime/replicaLagEstimator.server.tsapps/webapp/app/v3/runEngineHandlers.server.tsapps/webapp/test/realtime/envChangeRouter.test.tsapps/webapp/test/realtime/replicaLagEstimator.test.tsdocker/docker-compose.yml
✅ Files skipped from review due to trivial changes (1)
- .server-changes/realtime-replica-read-consistency.md
🚧 Files skipped from review as they are similar to previous changes (11)
- apps/webapp/app/routes/api.v1.runs.$runId.tags.ts
- apps/webapp/app/routes/api.v1.runs.$runId.metadata.ts
- apps/webapp/app/v3/runEngineHandlers.server.ts
- apps/webapp/test/realtime/replicaLagEstimator.test.ts
- docker/docker-compose.yml
- apps/webapp/app/env.server.ts
- apps/webapp/app/services/metadata/updateMetadataInstance.server.ts
- apps/webapp/app/services/realtime/envChangeRouter.server.ts
- apps/webapp/app/services/realtime/replicaLagEstimator.server.ts
- apps/webapp/test/realtime/envChangeRouter.test.ts
- apps/webapp/app/services/metadata/updateMetadata.server.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
- GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
- GitHub Check: typecheck / typecheck
- GitHub Check: Build and publish previews
- GitHub Check: 🛡️ E2E Auth Tests (full)
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects insteadImport from
@trigger.dev/sdkwhen writing Trigger.dev tasks. Never use@trigger.dev/sdk/v3or deprecatedclient.defineJob
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamicimport()when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only frompackages/core(@trigger.dev/core), never import from the root
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: Access environment variables through theenvexport ofenv.server.tsinstead of directly accessingprocess.env
Use subpath exports from@trigger.dev/corepackage instead of importing from the root@trigger.dev/corepathUse named constants for sentinel/placeholder values (e.g.
const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
apps/webapp/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
apps/webapp/**/*.server.ts: Never userequest.signalfor detecting client disconnects. UsegetRequestAbortSignal()fromapp/services/httpAsyncStorage.server.tsinstead, which is wired directly to Expressres.on('close')and fires reliably
Access environment variables viaenvexport fromapp/env.server.ts. Never useprocess.envdirectly
Always usefindFirstinstead offindUniquein Prisma queries.findUniquehas an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance).findFirstis never batched and avoids this entire class of issues
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
**/*.{js,ts,tsx,jsx,css,json,md}
📄 CodeRabbit inference engine (AGENTS.md)
Use Prettier for code formatting and run
pnpm run formatbefore committing
Files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
🧠 Learnings (9)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.
Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.
Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.
Applied to files:
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts
🔇 Additional comments (5)
apps/webapp/app/services/realtime/nativeRealtimeClientInstance.server.ts (5)
8-8: LGTM!Also applies to: 13-13
87-90: LGTM!
129-142: LGTM!
144-156: LGTM!
158-177: LGTM!
Summary
When the realtime runs feed (the backend behind the
realtimeBackendfeature flag) hydrates a change from a Postgres read replica, the read can race the replica's apply of the very write that triggered it. The delivered row then carries the previous change's content, and an isolated final change (for example a lastmetadata.setbefore a run goes quiet) is not corrected until the roughly 20 second backstop poll. Measured against a replica with deliberate apply delay, every delivery trailed exactly one change behind and a final change stranded for the full backstop interval.Fix
Publishers stamp each change record with the committed row's
updatedAt, taken from writes they already perform, so the stamp costs no extra queries. The router delays its wake hydrate until the replica's measured lag has passed, anchored to that timestamp: a record that has already spent longer than the lag in transit is hydrated immediately, so only the racing leading edge ever waits. After hydrating, a tripwire compares each row against its record's watermark. Still-stale rows are withheld and retried briefly, and each detection feeds the lag estimate. If retries run out, the rows are delivered anyway (liveness over freshness) and follow-up re-hydrates emit the fresh version through the normal working-set diff once the replica catches up, with the backstop as the terminal net.Replica lag is sampled reader-side only, and only while feeds are active. Aurora reports live lag via
aurora_replica_status(); vanilla Postgres can only report "caught up or not" (mid-apply lag is not honestly measurable from a replica), so tripwire observations floor the estimate there. Deployments without a replica resolve to zero lag and skip the gate entirely. Tunables live underREALTIME_BACKEND_NATIVE_REPLICA_LAG_*, andrealtime_native.stale_hydratesplusrealtime_native.replica_lag_estimate_msmake replica health observable.Two adjacent fixes: a metadata update that writes nothing no longer publishes a change record, and buffered parent and root metadata operations now publish when the flusher writes them, so those changes wake live feeds instead of waiting for the backstop.
For local testing,
docker-composegains an opt-indatabase-replicaservice (compose profilereplica) with a configurablerecovery_min_apply_delay, which reproduces replica-lag behavior deterministically. With the gate disabled this rig reproduces the one-change-behind delivery exactly; with it enabled, deliveries arrive with current content at roughly the true replica lag, across write rates faster and slower than the lag itself.