Skip to content

Fix crash on writable standby coordinator when writing to a coordinator-local shard (#8426)#8561

Merged
emelsimsek merged 1 commit intomainfrom
fix/8426-writable-standby-local-shard-crash
May 6, 2026
Merged

Fix crash on writable standby coordinator when writing to a coordinator-local shard (#8426)#8561
emelsimsek merged 1 commit intomainfrom
fix/8426-writable-standby-local-shard-crash

Conversation

@emelsimsek
Copy link
Copy Markdown
Contributor

@emelsimsek emelsimsek commented May 4, 2026

Summary

Fixes #8426.

When citus.writable_standby_coordinator is enabled on a hot standby, modifying queries are forwarded to primary nodes via libpq. However, if the targeted shard has a placement on the local (coordinator) group, the local executor was invoked anyway. The local plan was then finalized against a read-only storage layer, leading to a NULL `custom_private` dereference in `IsCitusCustomScan` and a backend crash.

This patch detects the situation in `TaskAccessesLocalNode` -- the single chokepoint that all local-vs-remote routing decisions pass through -- and raises a clear `ERROR` (`errcode read_only_sql_transaction`) before entering the broken local-execution path. Read tasks remain unaffected and continue to execute locally, which is the intended use of a hot standby.

Behavior change

The only user-visible change is the error message text. Previously, INSERT/UPDATE/DELETE against a coordinator-local shard on a writable standby would either crash (the original report) or surface a misleading PostgreSQL-level error like `cannot execute INSERT in a read-only transaction`. After this patch:

```
ERROR: cannot execute modification on a hot standby for a shard placement that resides on this node
DETAIL: citus.writable_standby_coordinator forwards writes to primary nodes, but the targeted shard has a placement on the local node, whose storage is read-only during recovery.
HINT: Place the affected shards on primary worker nodes only, or fail over this node to perform writes.
```

The session remains usable after the error.

Tests

  • Updated `multi_follower_dml` expected output: every pre-existing `citus_local_table` modification line that previously bottomed out at PG's read-only check now emits the new error (8 sites).
  • Added an explicit regression block tagged with the issue number, exercising INSERT, UPDATE, DELETE, modifying CTE, and a follow-up SELECT to assert the session is still usable.

`make -C src/test/regress check-follower-cluster` passes all 8 tests locally.

Files

@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.79%. Comparing base (c2b3005) to head (d91ecd6).

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8561      +/-   ##
==========================================
+ Coverage   88.78%   88.79%   +0.01%     
==========================================
  Files         287      287              
  Lines       64046    64052       +6     
  Branches     8051     8054       +3     
==========================================
+ Hits        56865    56878      +13     
+ Misses       4850     4848       -2     
+ Partials     2331     2326       -5     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents a backend crash on a hot-standby coordinator when citus.writable_standby_coordinator=on and a write targets a shard placement on the local (coordinator) group by detecting the condition during local-vs-remote routing and raising a clear read_only_sql_transaction error instead of entering the broken local-execution path.

Changes:

  • Add a recovery-aware guard in TaskAccessesLocalNode() to error out for local-group MODIFY/DDL tasks on hot standby before local execution.
  • Update follower-cluster regression coverage to exercise INSERT/UPDATE/DELETE/modifying-CTE behavior and verify the session remains usable.
  • Refresh expected outputs to match the new error text at all impacted sites.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/backend/distributed/executor/local_executor.c Adds the recovery-mode error in TaskAccessesLocalNode() to prevent local execution of writes on hot standby.
src/test/regress/sql/multi_follower_dml.sql Adds a dedicated regression block for #8426 covering multiple write forms plus a follow-up read.
src/test/regress/expected/multi_follower_dml.out Updates expected error output to match the new, more actionable message.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1037 to +1047
if (RecoveryInProgress() &&
(task->taskType == MODIFY_TASK ||
task->taskType == DDL_TASK))
{
ereport(ERROR,
(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
errmsg("cannot execute %s on a hot standby for a "
"shard placement that resides on this node",
task->taskType == DDL_TASK ?
"DDL command" : "modification"),
errdetail("citus.writable_standby_coordinator forwards "
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/backend/distributed/executor/local_executor.c Outdated
@emelsimsek
Copy link
Copy Markdown
Contributor Author

@copilot apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented May 4, 2026

@copilot apply changes based on the comments in this thread

Done in 6eee7f7. The RecoveryInProgress() guard in TaskAccessesLocalNode() is now also gated on WritableStandbyCoordinator, so the detailed error (referencing citus.writable_standby_coordinator forwarding writes to primaries) is only raised when that GUC is actually ON. When the GUC is OFF, the existing PostgreSQL read-only machinery handles the error as before.

@emelsimsek emelsimsek requested a review from colm-mchugh May 4, 2026 15:25
@onurctirtir onurctirtir self-requested a review May 4, 2026 16:17
Copy link
Copy Markdown
Member

@onurctirtir onurctirtir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder should we instead have this check in ExecuteLocalTaskListExtended(), the central function that's responsible for executing given tasks locally, rather having it in a function like TaskAccessesLocalNode().

But if we decide following that path, then we should also add an explicit check into InitializeCopyShardState(). The function already tracks hasRemoteCopy - similarly we can add hasLocalCopy and throw an error accordingly for local copy when needed.

@emelsimsek emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch 2 times, most recently from e25534e to d3bc667 Compare May 5, 2026 11:09
@emelsimsek
Copy link
Copy Markdown
Contributor Author

Addressed @onurctirtir's review (d3bc667):

  • Moved the recovery guard out of TaskAccessesLocalNode() and into ExecuteLocalTaskListExtended() — the central function responsible for executing tasks locally.
  • Added symmetric hasLocalCopy tracking in InitializeCopyShardState() and a matching error. This is defense-in-depth: EnsureCopyCanRunOnRelation() already blocks all Citus COPY in this scenario today, but the new check documents intent at the central local-COPY setup site.

make check-follower-cluster still passes (8/8).

@emelsimsek emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch from d3bc667 to 8e4e0b5 Compare May 5, 2026 11:16
@emelsimsek
Copy link
Copy Markdown
Contributor Author

Update (8e4e0b5): dropped the multi_copy.c change. COPY in this scenario is already protected by EnsureCopyCanRunOnRelation(), which errors out for any Citus COPY when RecoveryInProgress() && WritableStandbyCoordinator -- including the local-placement case -- so the additional hasLocalCopy check would have been dead code. Only the ExecuteLocalTaskListExtended() guard remains.

Comment thread src/backend/distributed/executor/local_executor.c
Comment thread src/test/regress/sql/multi_follower_dml.sql
Copy link
Copy Markdown
Contributor

@colm-mchugh colm-mchugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, couple of minor suggestions.

…or-local shard (#8426)

When citus.writable_standby_coordinator is enabled on a hot standby, modifying
queries are forwarded to primary nodes via libpq. However, if the targeted
shard has a placement on the local (coordinator) group, the local executor
was invoked anyway. The local plan was then finalized against a read-only
storage layer, leading to a NULL custom_private dereference in
IsCitusCustomScan and a backend crash.

Detect this case in TaskAccessesLocalNode -- the single chokepoint that
all local-vs-remote routing decisions pass through -- and raise a clear
ERROR (errcode read_only_sql_transaction) before entering the local
execution path. Read tasks remain unaffected and continue to execute
locally, which is the intended use of a hot standby.

Includes a regression test in multi_follower_dml.

Fixes #8426.
@emelsimsek emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch from 8e4e0b5 to d91ecd6 Compare May 5, 2026 12:31
@ihalatci ihalatci added this to the 14.1 milestone May 5, 2026
@onurctirtir onurctirtir requested review from Copilot and onurctirtir May 6, 2026 09:00
@onurctirtir
Copy link
Copy Markdown
Member

(mistakenly asked reviews from Copilot, please nvm)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +27 to +35
SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', false);
SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', false);
SELECT 1 FROM master_set_node_property('localhost', :master_port, 'shouldhaveshards', true);
SET citus.shard_count TO 1;
SET citus.shard_replication_factor TO 1;
CREATE TABLE dist_on_coord (a int, b int);
SELECT create_distributed_table('dist_on_coord', 'a');
SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', true);
SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', true);
Comment on lines +237 to +246
/*
* On a hot standby with citus.writable_standby_coordinator enabled,
* Citus forwards modifying queries to primaries via libpq. A task that
* targets a placement on the *local* group, however, is routed into
* local execution -- and the storage layer is read-only during
* recovery. Letting such a task proceed causes a downstream crash when
* the local plan is finalized (see GitHub issue #8426).
*
* Detect this here -- the central function that executes tasks
* locally -- and raise a clear error before the broken local execution
@ihalatci ihalatci removed this from the 14.1 milestone May 6, 2026
@emelsimsek emelsimsek merged commit 8b99e07 into main May 6, 2026
416 of 565 checks passed
@emelsimsek emelsimsek deleted the fix/8426-writable-standby-local-shard-crash branch May 6, 2026 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error occurred when executing insert sql on secondary worker node

6 participants