Fix crash on writable standby coordinator when writing to a coordinator-local shard (#8426) by emelsimsek · Pull Request #8561 · citusdata/citus

emelsimsek · 2026-05-04T13:41:48Z

Summary

When citus.writable_standby_coordinator is enabled on a hot standby, modifying queries are forwarded to primary nodes via libpq. However, if the targeted shard has a placement on the local (coordinator) group, the local executor was invoked anyway. The local plan was then finalized against a read-only storage layer, leading to a NULL `custom_private` dereference in `IsCitusCustomScan` and a backend crash.

This patch detects the situation in `TaskAccessesLocalNode` -- the single chokepoint that all local-vs-remote routing decisions pass through -- and raises a clear `ERROR` (`errcode read_only_sql_transaction`) before entering the broken local-execution path. Read tasks remain unaffected and continue to execute locally, which is the intended use of a hot standby.

Behavior change

The only user-visible change is the error message text. Previously, INSERT/UPDATE/DELETE against a coordinator-local shard on a writable standby would either crash (the original report) or surface a misleading PostgreSQL-level error like `cannot execute INSERT in a read-only transaction`. After this patch:

```
ERROR: cannot execute modification on a hot standby for a shard placement that resides on this node
DETAIL: citus.writable_standby_coordinator forwards writes to primary nodes, but the targeted shard has a placement on the local node, whose storage is read-only during recovery.
HINT: Place the affected shards on primary worker nodes only, or fail over this node to perform writes.
```

The session remains usable after the error.

Tests

Updated `multi_follower_dml` expected output: every pre-existing `citus_local_table` modification line that previously bottomed out at PG's read-only check now emits the new error (8 sites).
Added an explicit regression block tagged with the issue number, exercising INSERT, UPDATE, DELETE, modifying CTE, and a follow-up SELECT to assert the session is still usable.

`make -C src/test/regress check-follower-cluster` passes all 8 tests locally.

Files

`src/backend/distributed/executor/local_executor.c` — added `#include "access/xlog.h"` and the recovery-aware error in `TaskAccessesLocalNode`.
`src/test/regress/sql/multi_follower_dml.sql` — explicit Error occurred when executing insert sql on secondary worker node #8426 regression block.
`src/test/regress/expected/multi_follower_dml.out` — refreshed expected output.

codecov · 2026-05-04T13:45:27Z

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.79%. Comparing base (c2b3005) to head (d91ecd6).

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8561      +/-   ##
==========================================
+ Coverage   88.78%   88.79%   +0.01%     
==========================================
  Files         287      287              
  Lines       64046    64052       +6     
  Branches     8051     8054       +3     
==========================================
+ Hits        56865    56878      +13     
+ Misses       4850     4848       -2     
+ Partials     2331     2326       -5

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR prevents a backend crash on a hot-standby coordinator when citus.writable_standby_coordinator=on and a write targets a shard placement on the local (coordinator) group by detecting the condition during local-vs-remote routing and raising a clear read_only_sql_transaction error instead of entering the broken local-execution path.

Changes:

Add a recovery-aware guard in TaskAccessesLocalNode() to error out for local-group MODIFY/DDL tasks on hot standby before local execution.
Update follower-cluster regression coverage to exercise INSERT/UPDATE/DELETE/modifying-CTE behavior and verify the session remains usable.
Refresh expected outputs to match the new error text at all impacted sites.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
src/backend/distributed/executor/local_executor.c	Adds the recovery-mode error in `TaskAccessesLocalNode()` to prevent local execution of writes on hot standby.
src/test/regress/sql/multi_follower_dml.sql	Adds a dedicated regression block for #8426 covering multiple write forms plus a follow-up read.
src/test/regress/expected/multi_follower_dml.out	Updates expected error output to match the new, more actionable message.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

emelsimsek · 2026-05-04T14:32:10Z

+			if (RecoveryInProgress() &&
+				(task->taskType == MODIFY_TASK ||
+				 task->taskType == DDL_TASK))
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+						 errmsg("cannot execute %s on a hot standby for a "
+								"shard placement that resides on this node",
+								task->taskType == DDL_TASK ?
+								"DDL command" : "modification"),
+						 errdetail("citus.writable_standby_coordinator forwards "


@copilot apply changes based on this feedback

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

emelsimsek · 2026-05-04T14:54:01Z

@copilot apply changes based on the comments in this thread

Copilot · 2026-05-04T14:55:59Z

@copilot apply changes based on the comments in this thread

Done in 6eee7f7. The RecoveryInProgress() guard in TaskAccessesLocalNode() is now also gated on WritableStandbyCoordinator, so the detailed error (referencing citus.writable_standby_coordinator forwarding writes to primaries) is only raised when that GUC is actually ON. When the GUC is OFF, the existing PostgreSQL read-only machinery handles the error as before.

onurctirtir

I wonder should we instead have this check in ExecuteLocalTaskListExtended(), the central function that's responsible for executing given tasks locally, rather having it in a function like TaskAccessesLocalNode().

But if we decide following that path, then we should also add an explicit check into InitializeCopyShardState(). The function already tracks hasRemoteCopy - similarly we can add hasLocalCopy and throw an error accordingly for local copy when needed.

emelsimsek · 2026-05-05T11:11:11Z

Addressed @onurctirtir's review (d3bc667):

Moved the recovery guard out of TaskAccessesLocalNode() and into ExecuteLocalTaskListExtended() — the central function responsible for executing tasks locally.
Added symmetric hasLocalCopy tracking in InitializeCopyShardState() and a matching error. This is defense-in-depth: EnsureCopyCanRunOnRelation() already blocks all Citus COPY in this scenario today, but the new check documents intent at the central local-COPY setup site.

make check-follower-cluster still passes (8/8).

emelsimsek · 2026-05-05T11:16:39Z

Update (8e4e0b5): dropped the multi_copy.c change. COPY in this scenario is already protected by EnsureCopyCanRunOnRelation(), which errors out for any Citus COPY when RecoveryInProgress() && WritableStandbyCoordinator -- including the local-placement case -- so the additional hasLocalCopy check would have been dead code. Only the ExecuteLocalTaskListExtended() guard remains.

colm-mchugh

Lgtm, couple of minor suggestions.

…or-local shard (#8426) When citus.writable_standby_coordinator is enabled on a hot standby, modifying queries are forwarded to primary nodes via libpq. However, if the targeted shard has a placement on the local (coordinator) group, the local executor was invoked anyway. The local plan was then finalized against a read-only storage layer, leading to a NULL custom_private dereference in IsCitusCustomScan and a backend crash. Detect this case in TaskAccessesLocalNode -- the single chokepoint that all local-vs-remote routing decisions pass through -- and raise a clear ERROR (errcode read_only_sql_transaction) before entering the local execution path. Read tasks remain unaffected and continue to execute locally, which is the intended use of a hot standby. Includes a regression test in multi_follower_dml. Fixes #8426.

onurctirtir · 2026-05-06T09:01:00Z

(mistakenly asked reviews from Copilot, please nvm)

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

+SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', false);
+SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', false);
+SELECT 1 FROM master_set_node_property('localhost', :master_port, 'shouldhaveshards', true);
+SET citus.shard_count TO 1;
+SET citus.shard_replication_factor TO 1;
+CREATE TABLE dist_on_coord (a int, b int);
+SELECT create_distributed_table('dist_on_coord', 'a');
+SELECT 1 FROM master_set_node_property('localhost', :worker_1_port, 'shouldhaveshards', true);
+SELECT 1 FROM master_set_node_property('localhost', :worker_2_port, 'shouldhaveshards', true);


+	/*
+	 * On a hot standby with citus.writable_standby_coordinator enabled,
+	 * Citus forwards modifying queries to primaries via libpq. A task that
+	 * targets a placement on the *local* group, however, is routed into
+	 * local execution -- and the storage layer is read-only during
+	 * recovery. Letting such a task proceed causes a downstream crash when
+	 * the local plan is finalized (see GitHub issue #8426).
+	 *
+	 * Detect this here -- the central function that executes tasks
+	 * locally -- and raise a clear error before the broken local execution


emelsimsek requested a review from Copilot May 4, 2026 13:45

Copilot started reviewing on behalf of emelsimsek May 4, 2026 13:46 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

Copilot stopped work on behalf of emelsimsek due to an error May 4, 2026 14:32
Before you can use Copilot cloud agent, you need to pick a "Usage billed to" option in your Copilot settings. Update your settings at https://github.com/settings/copilot/features, then try again.

emelsimsek requested a review from Copilot May 4, 2026 14:41

Copilot started reviewing on behalf of emelsimsek May 4, 2026 14:41 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

Comment thread src/backend/distributed/executor/local_executor.c Outdated

Copilot started work on behalf of emelsimsek May 4, 2026 14:54 View session

Copilot finished work on behalf of emelsimsek May 4, 2026 14:56

emelsimsek requested a review from colm-mchugh May 4, 2026 15:25

onurctirtir self-requested a review May 4, 2026 16:17

onurctirtir reviewed May 4, 2026

View reviewed changes

emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch 2 times, most recently from e25534e to d3bc667 Compare May 5, 2026 11:09

emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch from d3bc667 to 8e4e0b5 Compare May 5, 2026 11:16

colm-mchugh reviewed May 5, 2026

View reviewed changes

Comment thread src/backend/distributed/executor/local_executor.c

colm-mchugh reviewed May 5, 2026

View reviewed changes

Comment thread src/test/regress/sql/multi_follower_dml.sql

colm-mchugh approved these changes May 5, 2026

View reviewed changes

emelsimsek force-pushed the fix/8426-writable-standby-local-shard-crash branch from 8e4e0b5 to d91ecd6 Compare May 5, 2026 12:31

ihalatci added this to the 14.1 milestone May 5, 2026

colm-mchugh approved these changes May 5, 2026

View reviewed changes

onurctirtir requested review from Copilot and onurctirtir May 6, 2026 09:00

onurctirtir approved these changes May 6, 2026

View reviewed changes

Copilot started reviewing on behalf of onurctirtir May 6, 2026 09:02 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

ihalatci removed this from the 14.1 milestone May 6, 2026

emelsimsek merged commit 8b99e07 into main May 6, 2026
416 of 565 checks passed

emelsimsek deleted the fix/8426-writable-standby-local-shard-crash branch May 6, 2026 14:12

Conversation

emelsimsek commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior change

Tests

Files

Uh oh!

codecov Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

emelsimsek May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

emelsimsek commented May 4, 2026

Uh oh!

Copilot AI commented May 4, 2026

Uh oh!

onurctirtir left a comment

Choose a reason for hiding this comment

Uh oh!

emelsimsek commented May 5, 2026

Uh oh!

emelsimsek commented May 5, 2026

Uh oh!

Uh oh!

Uh oh!

colm-mchugh left a comment

Choose a reason for hiding this comment

Uh oh!

onurctirtir commented May 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

emelsimsek commented May 4, 2026 •

edited

Loading

codecov Bot commented May 4, 2026 •

edited

Loading