Fix crash on writable standby coordinator when writing to a coordinator-local shard (#8426)#8561
Conversation
Codecov Report

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

```
@@           Coverage Diff            @@
##             main    #8561    +/-  ##
=======================================
+ Coverage   88.78%   88.79%   +0.01%
=======================================
  Files         287      287
  Lines       64046    64052       +6
  Branches     8051     8054       +3
=======================================
+ Hits        56865    56878      +13
+ Misses       4850     4848       -2
+ Partials     2331     2326       -5
```
Pull request overview
This PR prevents a backend crash on a hot-standby coordinator when citus.writable_standby_coordinator=on and a write targets a shard placement on the local (coordinator) group. It does so by detecting the condition during local-vs-remote routing and raising a clear read_only_sql_transaction error instead of entering the broken local-execution path.
Changes:
- Add a recovery-aware guard in TaskAccessesLocalNode() to error out for local-group MODIFY/DDL tasks on hot standby before local execution.
- Update follower-cluster regression coverage to exercise INSERT/UPDATE/DELETE/modifying-CTE behavior and verify the session remains usable.
- Refresh expected outputs to match the new error text at all impacted sites.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/backend/distributed/executor/local_executor.c | Adds the recovery-mode error in TaskAccessesLocalNode() to prevent local execution of writes on hot standby. |
| src/test/regress/sql/multi_follower_dml.sql | Adds a dedicated regression block for #8426 covering multiple write forms plus a follow-up read. |
| src/test/regress/expected/multi_follower_dml.out | Updates expected error output to match the new, more actionable message. |
```c
if (RecoveryInProgress() &&
	(task->taskType == MODIFY_TASK ||
	 task->taskType == DDL_TASK))
{
	ereport(ERROR,
			(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
			 errmsg("cannot execute %s on a hot standby for a "
					"shard placement that resides on this node",
					task->taskType == DDL_TASK ?
					"DDL command" : "modification"),
			 errdetail("citus.writable_standby_coordinator forwards "
```
@copilot apply changes based on this feedback
@copilot apply changes based on the comments in this thread
Done in 6eee7f7.
onurctirtir
left a comment
I wonder whether we should instead have this check in ExecuteLocalTaskListExtended(), the central function responsible for executing the given tasks locally, rather than having it in a function like TaskAccessesLocalNode().
But if we decide to follow that path, then we should also add an explicit check to InitializeCopyShardState(). The function already tracks hasRemoteCopy; similarly, we can add hasLocalCopy and throw an error accordingly for a local copy when needed.
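The suggested alternative can be sketched as follows. This is a minimal, self-contained model, not the code that was merged: the `hasLocalCopy` flag is the hypothetical addition proposed above (modeled on the existing `hasRemoteCopy` bookkeeping), and `recoveryInProgress` stands in for PostgreSQL's `RecoveryInProgress()`.

```c
#include <stdbool.h>

/* Stand-in for PostgreSQL's RecoveryInProgress(); assume a hot standby. */
static bool recoveryInProgress = true;

/* hasRemoteCopy is already tracked by Citus' copy-shard bookkeeping;
 * hasLocalCopy is the hypothetical new flag suggested in the review. */
typedef struct CopyShardState
{
	bool hasRemoteCopy;
	bool hasLocalCopy;
} CopyShardState;

/* Returns false when the COPY must be rejected: a local placement would
 * be written through the read-only storage layer during recovery.
 * In the real server this would be an ereport(ERROR, ...) call. */
static bool
LocalCopyAllowed(const CopyShardState *state)
{
	return !(state->hasLocalCopy && recoveryInProgress);
}
```

Remote-only COPY is still forwarded to primaries and allowed; only a COPY touching a local placement on the standby would be rejected.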
Force-pushed e25534e to d3bc667.
Addressed @onurctirtir's review (d3bc667):
Force-pushed d3bc667 to 8e4e0b5.
Update (8e4e0b5): dropped the multi_copy.c change. COPY in this scenario is already protected by
colm-mchugh
left a comment
LGTM, a couple of minor suggestions.
…or-local shard (#8426)

When citus.writable_standby_coordinator is enabled on a hot standby, modifying queries are forwarded to primary nodes via libpq. However, if the targeted shard has a placement on the local (coordinator) group, the local executor was invoked anyway. The local plan was then finalized against a read-only storage layer, leading to a NULL custom_private dereference in IsCitusCustomScan and a backend crash.

Detect this case in TaskAccessesLocalNode -- the single chokepoint that all local-vs-remote routing decisions pass through -- and raise a clear ERROR (errcode read_only_sql_transaction) before entering the local execution path. Read tasks remain unaffected and continue to execute locally, which is the intended use of a hot standby.

Includes a regression test in multi_follower_dml.

Fixes #8426.
Force-pushed 8e4e0b5 to d91ecd6.
(mistakenly asked for reviews from Copilot, please nvm)
```sql
SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', false);
SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', false);
SELECT 1 FROM master_set_node_property('localhost', :master_port, 'shouldhaveshards', true);
SET citus.shard_count TO 1;
SET citus.shard_replication_factor TO 1;
CREATE TABLE dist_on_coord (a int, b int);
SELECT create_distributed_table('dist_on_coord', 'a');
SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', true);
SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', true);
```
```c
/*
 * On a hot standby with citus.writable_standby_coordinator enabled,
 * Citus forwards modifying queries to primaries via libpq. A task that
 * targets a placement on the *local* group, however, is routed into
 * local execution -- and the storage layer is read-only during
 * recovery. Letting such a task proceed causes a downstream crash when
 * the local plan is finalized (see GitHub issue #8426).
 *
 * Detect this here -- the central function that executes tasks
 * locally -- and raise a clear error before the broken local execution
```
Summary
Fixes #8426.
When `citus.writable_standby_coordinator` is enabled on a hot standby, modifying queries are forwarded to primary nodes via libpq. However, if the targeted shard has a placement on the local (coordinator) group, the local executor was invoked anyway. The local plan was then finalized against a read-only storage layer, leading to a NULL `custom_private` dereference in `IsCitusCustomScan` and a backend crash.

This patch detects the situation in `TaskAccessesLocalNode` -- the single chokepoint that all local-vs-remote routing decisions pass through -- and raises a clear `ERROR` (`errcode read_only_sql_transaction`) before entering the broken local-execution path. Read tasks remain unaffected and continue to execute locally, which is the intended use of a hot standby.
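The routing decision at that chokepoint can be modeled as a small, self-contained sketch. All names here (`RouteTask`, `RoutingDecision`, the boolean parameters) are illustrative only; the real check lives in `TaskAccessesLocalNode()` in local_executor.c and inspects task placements and server state directly.

```c
#include <stdbool.h>

/* Simplified model of the local-vs-remote routing decision. */
typedef enum TaskType { READ_TASK, MODIFY_TASK, DDL_TASK } TaskType;

typedef enum RoutingDecision
{
	ROUTE_LOCAL,   /* execute through the local executor */
	ROUTE_REMOTE,  /* forward to a primary node via libpq */
	ROUTE_ERROR    /* raise read_only_sql_transaction instead of crashing */
} RoutingDecision;

static RoutingDecision
RouteTask(TaskType type, bool placementOnLocalGroup, bool inRecovery)
{
	if (!placementOnLocalGroup)
	{
		/* writable_standby_coordinator path: writes go to a primary */
		return ROUTE_REMOTE;
	}

	/* The guard this patch adds: a write that would otherwise enter
	 * local execution on a hot standby errors out cleanly. */
	if (inRecovery && (type == MODIFY_TASK || type == DDL_TASK))
	{
		return ROUTE_ERROR;
	}

	return ROUTE_LOCAL;  /* reads keep running locally on the standby */
}
```

Before the patch, the `ROUTE_ERROR` branch did not exist, so a local-group write on a standby fell through to `ROUTE_LOCAL` and crashed downstream.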
Behavior change
The only user-visible change is the error message text. Previously, INSERT/UPDATE/DELETE against a coordinator-local shard on a writable standby would either crash (the original report) or surface a misleading PostgreSQL-level error like `cannot execute INSERT in a read-only transaction`. After this patch:
```
ERROR: cannot execute modification on a hot standby for a shard placement that resides on this node
DETAIL: citus.writable_standby_coordinator forwards writes to primary nodes, but the targeted shard has a placement on the local node, whose storage is read-only during recovery.
HINT: Place the affected shards on primary worker nodes only, or fail over this node to perform writes.
```
The session remains usable after the error.
Tests
`make -C src/test/regress check-follower-cluster` passes all 8 tests locally.
Files