Skip to content

fix(opensearch): stop Phase 3 reindex from orphaning ES rows in indicies table (#36077)#36191

Open
fabrizzio-dotCMS wants to merge 3 commits into
mainfrom
fix/issue-36077-phase3-reindex-orphan-es-rows
Open

fix(opensearch): stop Phase 3 reindex from orphaning ES rows in indicies table (#36077)#36191
fabrizzio-dotCMS wants to merge 3 commits into
mainfrom
fix/issue-36077-phase3-reindex-orphan-es-rows

Conversation

@fabrizzio-dotCMS

Copy link
Copy Markdown
Member

Proposed Changes

Fixes #36077.

During OpenSearch migration Phase 3 (OS primary, ES decommissioned), a full reindex left stale ES-side rows in the indicies table — the old live/working pair plus the transient reindex_live/reindex_working pair (all NULL version) — alongside the correctly-promoted .os (v3.X) pair. The table accumulated 6 rows instead of 2 on every Phase-3 reindex.

Root cause

An asymmetry in ContentletIndexAPIImpl:

  • initAndPointReindex wrote the legacy ES store unconditionally (pointES had no phase gate), so a Phase-3 reindex still inserted ES reindex_* rows and re-persisted the old ES live/working rows.
  • The Phase-3 switchover (fullReindexSwitchoverOS) only touches the OS store (versionedIndicesAPI) — it never promotes or clears those ES rows. So they orphaned.

The fix (DB-only — never contacts the ES cluster, which may be down in Phase 3)

  1. Gate pointES(...) in initAndPointReindex on !isMigrationComplete() so a Phase-3 reindex no longer writes ES pointers (mirrors the existing isMigrationStarted() gate on the OS write).
  2. Add VersionedIndicesAPI.removeLegacyContentIndices() (delegating to IndicesFactory): deletes the NULL-version live/working/reindex_live/reindex_working rows, preserves the unmigrated site_search pointer and all OS (non-NULL version) rows, and flushes the index caches. Invoked best-effort by fullReindexSwitchoverOS after promoting OS (a housekeeping failure must not undo a successful switchover).

Physical index deletion is intentionally out of scope — in Phase 3 ES may not be running, so touching the cluster is fragile. Orphaned physical indices are cleaned by the existing scheduled DeleteInactiveLiveWorkingIndicesJob.

Acceptance criteria

  • After a Phase-3 reindex, no leftover ES live/working (NULL version) rows.
  • Transient reindex_live/reindex_working rows removed.
  • Physical indices deleted — deferred to DeleteInactiveLiveWorkingIndicesJob (see scope note above).
  • Re-running reindex in Phase 3 does not accumulate orphan rows.
  • Integration coverage asserts the final indicies table state.

Testing

  • ContentletIndexAPIImplMigrationIntegrationTest#test_phase3_removeLegacyContentIndices_purgesEsRowsPreservesSiteSearchAndOs — seeds the issue's exact 6-row orphan state and asserts the legacy ES content rows are purged while site_search and OS rows survive.
  • ContentletIndexAPIImplPhaseTest — fake updated for the new API method.
  • ./mvnw test-compile -pl :dotcms-core,:dotcms-integration passes on JDK 25.

Note: the new IT still needs a run against a live OpenSearch Phase-3 environment (opensearch-upgrade profile) — a full-reindex E2E was deliberately avoided due to the known inferIndexToHit gap (#36054); the test validates the cleanup guarantee deterministically at the DB level.

🤖 Generated with Claude Code

@github-actions github-actions Bot added Area : Backend PR changes Java/Maven backend code Area : CLI PR changes dotCMS CLI code Area : Documentation PR changes documentation files labels Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

🤖 Bedrock Review — deepseek.v3.2

[🟡 Medium] dotCMS/src/main/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImpl.java:1225newActiveOs is initialized as an empty list but later conditionally assigned; however, the condition osWorking.isPresent() && osLive.isPresent() could be false even if one is present, leading to an empty list when one index name exists. This might skip optimization for a partially present pair, which could be intentional but should be verified against the requirement that both must be present for promotion.

[🟡 Medium] dotCMS/src/main/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImpl.java:1350 — In optimizeNewActiveIndicesAsync, the method checks esNames != null && !esNames.isEmpty() and osNames != null && !osNames.isEmpty(). However, the parameters are documented as "or empty to skip", and the callers pass List.of() which is non-null. The null check is defensive but unnecessary; if a caller passes null, it would skip optimization silently, which might hide bugs. Consider removing the null checks or documenting that null is allowed.

[🟠 High] dotCMS/src/main/java/com/dotcms/content/index/IndicesFactoryImpl.java:46 — SQL query DELETE_LEGACY_CONTENT_INDICES_SQL uses string concatenation for the IN clause values. This is safe because the values are hardcoded literals, not user input, so SQL injection is not a risk. However, it's a minor convention deviation; ensure no dynamic values are ever appended here.

[🟡 Medium] dotCMS/src/main/java/com/dotcms/content/index/IndexAPIImpl.java:131 — The optimize method now groups indices by vendor tag. It uses IndexTag.resolve which may throw an exception for an unrecognized tag; if an index name doesn't match expected patterns, it could cause the optimization to fail. Ensure IndexTag.resolve handles all possible index names gracefully or that the input list is guaranteed to be valid.

[🟡 Medium] dotCMS/src/main/java/com/dotcms/content/index/VersionedIndicesAPIImpl.java:228 — The removeLegacyContentIndices method is annotated with @WrapInTransaction. This is appropriate for a delete operation, but ensure that the method is only called in contexts where a transaction is available (e.g., not from within an existing transaction that might have different isolation requirements). The call in fullReindexSwitchoverOS is within a transactional context, which is fine.

[🟡 Medium] dotcms-integration/src/test/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImplMigrationIntegrationTest.java:777 — Integration test test_phase3_removeLegacyContentIndices_purgesEsRowsPreservesSiteSearchAndOs inserts rows directly into the database but does not clean up in case of test failure before the finally block. The @After method restores ES rows, but if the test fails before the finally block, the seeded rows may remain. However, the test uses a unique RUN_ID to avoid collisions, so this is low risk.

[🟡 Medium] dotcms-integration/src/test/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImplMigrationIntegrationTest.java:846 — Helper method countIndiciesRows uses string concatenation to build the SQL IN clause placeholders. This is safe because the input is a varargs of strings that are index names from test data, not user input. However, it's a pattern that could be misused elsewhere; ensure no untrusted data is passed.


Run: #27656287969 · tokens: in: 6816 · out: 822 · total: 7638

@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Claude finished @fabrizzio-dotCMS's task in 5m 15s —— View job


Rollback Safety Analysis

  • Read rollback-unsafe categories reference
  • Get full PR diff
  • Analyze diff against unsafe categories
  • Apply appropriate label

Verdict: ✅ Safe To Rollback

All three commits in this PR were analyzed against every category in the rollback-unsafe reference. None of the unsafe categories apply.

What the PR does

Commit Change
8395244f Phase 3 fix: gate pointES() on !isMigrationComplete(); add removeLegacyContentIndices() to purge orphan NULL-version ES rows from indicies table
4a2a6a83 Fix optimize-after-switchover to call each provider with its own index names directly
303c6c19 Partition IndexAPIImpl.optimize() by IndexTag so mixed ES/OS name lists don't misroute

Category-by-category assessment

Category Verdict Reasoning
C-1 Structural Data Model Change ✅ Safe No schema change to core tables (contentlet, identifier, etc.). The indicies table is not restructured.
C-2 Elasticsearch Mapping Change ✅ Safe No changes to ESMappingAPIImpl, ESMappingUtilHelper, ESMappingConstants, or putMapping().
C-3 Content JSON Model Version Bump ✅ Safe No changes to ImmutableContentlet, CURRENT_MODEL_VERSION, or serialization logic.
C-4 DROP TABLE / DROP Column ✅ Safe DELETE FROM indicies WHERE ... removes rows (orphan pointers), not schema. The table and all its columns are preserved.
H-1 One-Way Data Migration ✅ Safe The DELETE removes orphaned NULL-version live/working/reindex_* rows that Phase 3 already made semantically dead. Rolling back to N-1 is safe: N-1 will simply re-create these rows on the next Phase-3 reindex (restoring the pre-fix behavior). No irreversible transformation of live content.
H-2 RENAME TABLE/COLUMN ✅ Safe No renames.
H-3 PK Restructuring ✅ Safe No constraint changes.
H-4 New ContentType Field Type ✅ Safe No new field types.
H-5 Storage Provider Change ✅ Safe No changes to storage provider configuration or getBinaryMetadataVersion().
H-6 DROP PROCEDURE/FUNCTION ✅ Safe No stored procedures affected.
H-7 NOT NULL Column without Default ✅ Safe No DDL at all.
H-8 VTL Viewtool Contract Change ✅ Safe No viewtool classes changed; no public method signatures on VTL-exposed objects altered.
M-1 Non-Broadening Column Type Change ✅ Safe No DDL.
M-2 Push Publishing Bundle Format ✅ Safe No bundler or bundle XML changes.
M-3 REST/GraphQL API Contract ✅ Safe No REST endpoint signatures or response shapes changed.
M-4 OSGi Plugin API Breakage ✅ Safe VersionedIndicesAPI and IndicesFactory are not exported in the OSGi manifest (MANIFEST.MF has no matching export-package entries). Adding a new method to an unexported internal interface does not break any OSGi consumer.

Key rollback-safety note

The DELETE in removeLegacyContentIndices() targets only indicies rows where index_version IS NULL AND index_type IN ('live', 'working', 'reindex_live', 'reindex_working'). Rolling back to N-1:

  • N-1 does not call removeLegacyContentIndices() — it simply won't clean up orphan rows (pre-fix behavior is restored).
  • The indicies table structure is unchanged, so N-1 reads and writes work exactly as before.
  • No content data, ES mappings, or OS indices are touched.

@fabrizzio-dotCMS fabrizzio-dotCMS force-pushed the fix/issue-36077-phase3-reindex-orphan-es-rows branch from cbc3cf8 to fe85a17 Compare June 16, 2026 16:59
@github-actions github-actions Bot removed AI: Safe To Rollback Area : CLI PR changes dotCMS CLI code Area : Documentation PR changes documentation files labels Jun 16, 2026
fabrizzio-dotCMS and others added 3 commits June 16, 2026 17:57
…ies (#36077)

During OpenSearch migration Phase 3 (OS primary, ES decommissioned), a full
reindex left stale ES-side rows in the `indicies` table: the old live/working
pair plus the transient reindex_live/reindex_working pair (all NULL version),
alongside the correctly-promoted .os (v3.X) pair — 6 rows instead of 2.

Root cause: an asymmetry in ContentletIndexAPIImpl. initAndPointReindex wrote
the legacy ES store unconditionally (pointES had no phase gate), but the Phase 3
switchover (fullReindexSwitchoverOS) only touches the OS store — so the ES rows
were never promoted or cleared.

Fix (DB-only — never contacts the ES cluster, which may be down in Phase 3):
- Gate pointES on !isMigrationComplete() so Phase 3 reindex no longer writes ES
  pointers (mirrors the existing isMigrationStarted() gate on the OS write).
- Add VersionedIndicesAPI.removeLegacyContentIndices() (delegating to
  IndicesFactory) which deletes the NULL-version live/working/reindex_* rows,
  preserving the unmigrated site_search pointer and all OS (non-NULL) rows, and
  flushes the index caches. Invoked by fullReindexSwitchoverOS after promoting OS.
- Physical ES index deletion is intentionally left to the scheduled
  DeleteInactiveLiveWorkingIndicesJob (ES may not be running in Phase 3).

Tests:
- ContentletIndexAPIImplMigrationIntegrationTest: seeds the issue's 6-row orphan
  state and asserts the legacy ES content rows are purged while site_search and
  OS rows survive.
- ContentletIndexAPIImplPhaseTest: fake updated for the new API method.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e 2 misrouting)

After a reindex switchover, the async optimize used the phase-aware optimize(),
which routes through router.read() — the read provider. In Phase 2 that is OS,
so the ES physical names (bare, from IndiciesInfo) were sent to OpenSearch, hit
index_not_found_exception (the real OS index carries the .os tag), logged a
misleading ERROR, and only completed via the Phase-2 ES fallback. The OS indices
were never optimized.

Optimize each provider directly with the names it actually holds — ES bare names
from newInfo, OS .os-tagged names from VersionedIndices — via
operationsES/OS.indexAPI().optimize(), bypassing the read-provider router. New
private helper optimizeNewActiveIndicesAsync(esNames, osNames): async,
best-effort per provider (a force-merge failure never affects the completed
switchover), skips a provider when its name list is empty (so Phase 3 never
contacts the decommissioned ES cluster).

Both switchover paths updated: the ES path (Phases 0/1/2) now also optimizes the
promoted OS indices in dual-write phases; the OS path (Phase 3) routes explicitly
to OS instead of relying on router.read resolving to OS.

The public optimize()/IndexAPIImpl routing is unchanged — other callers keep
their behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Phase 2 REST /optimize misrouting)

The REST /optimize endpoint (ESIndexResource.optimizeIndices) calls
ContentletIndexAPIImpl.optimize(listDotCMSIndices()), and listDotCMSIndices()
-> getIndices() aggregates BOTH providers in dual-write phases: ES bare names
plus OS .os-tagged names. IndexAPIImpl.optimize() routed that mixed list through
router.read() to a single provider, so the foreign-tagged names hit
index_not_found_exception (ES has no .os indices), surfacing as a noisy ERROR.

Partition the names by IndexTag.resolve (tag-dispatch, the canonical pattern):
force-merge ES with its bare names and OS with its .os-tagged names, each via the
provider directly. Skip a provider when its subset is empty, so Phase 0 never
contacts OS and Phase 3 never contacts the decommissioned ES cluster.

Companion to the switchover-path fix (optimizeNewActiveIndicesAsync); same bug
class, different caller — the public optimize() path was explicitly left unchanged
there and is fixed here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabrizzio-dotCMS fabrizzio-dotCMS force-pushed the fix/issue-36077-phase3-reindex-orphan-es-rows branch from 996e826 to 303c6c1 Compare June 17, 2026 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Phase 3 reindex leaves orphan ES entries in the indicies table

1 participant