fix(opensearch): stop Phase 3 reindex from orphaning ES rows in indicies table (#36077) #36191
fabrizzio-dotCMS wants to merge 3 commits into
🤖 Bedrock Review
Claude finished @fabrizzio-dotCMS's task in 5m 15s
Rollback Safety Analysis
Verdict: ✅ Safe To Rollback
All three commits in this PR were analyzed against every category in the rollback-unsafe reference. None of the unsafe categories apply.
What the PR does
Category-by-category assessment
Key rollback-safety note
The
cbc3cf8 to fe85a17
…ies (#36077)

During OpenSearch migration Phase 3 (OS primary, ES decommissioned), a full reindex left stale ES-side rows in the `indicies` table: the old live/working pair plus the transient reindex_live/reindex_working pair (all NULL version), alongside the correctly-promoted .os (v3.X) pair, for 6 rows instead of 2.

Root cause: an asymmetry in ContentletIndexAPIImpl. initAndPointReindex wrote the legacy ES store unconditionally (pointES had no phase gate), but the Phase 3 switchover (fullReindexSwitchoverOS) only touches the OS store, so the ES rows were never promoted or cleared.

Fix (DB-only: never contacts the ES cluster, which may be down in Phase 3):
- Gate pointES on !isMigrationComplete() so Phase 3 reindex no longer writes ES pointers (mirrors the existing isMigrationStarted() gate on the OS write).
- Add VersionedIndicesAPI.removeLegacyContentIndices() (delegating to IndicesFactory), which deletes the NULL-version live/working/reindex_* rows, preserves the unmigrated site_search pointer and all OS (non-NULL) rows, and flushes the index caches. Invoked by fullReindexSwitchoverOS after promoting OS.
- Physical ES index deletion is intentionally left to the scheduled DeleteInactiveLiveWorkingIndicesJob (ES may not be running in Phase 3).

Tests:
- ContentletIndexAPIImplMigrationIntegrationTest: seeds the issue's 6-row orphan state and asserts the legacy ES content rows are purged while site_search and OS rows survive.
- ContentletIndexAPIImplPhaseTest: fake updated for the new API method.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
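The two rules in this commit (the phase gate on the ES write and the row filter used by the purge) can be sketched in isolation. This is a minimal in-memory illustration, not the PR's code: `IndexRow`, `LegacyIndexCleanupSketch`, the field names, and the sample index names are all hypothetical, and the real `removeLegacyContentIndices()` operates on the database and flushes caches, which is not modelled here.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical in-memory stand-in for one row of the indicies table. */
record IndexRow(String indexName, String indexType, String indexVersion) {}

final class LegacyIndexCleanupSketch {

    // Pointer types that belong to the legacy ES content store (from the commit message).
    private static final Set<String> LEGACY_CONTENT_TYPES =
            Set.of("live", "working", "reindex_live", "reindex_working");

    /** Phase gate: ES pointers are written only while the migration is not complete. */
    static boolean shouldWriteEsPointers(boolean migrationComplete) {
        return !migrationComplete;
    }

    /** A row is purged only when its version is NULL and its type is a content pointer. */
    static boolean isLegacyContentRow(IndexRow row) {
        return row.indexVersion() == null
                && LEGACY_CONTENT_TYPES.contains(row.indexType());
    }

    /** Returns the rows that survive the purge: site_search and all versioned (OS) rows. */
    static List<IndexRow> removeLegacyContentIndices(List<IndexRow> rows) {
        return rows.stream()
                .filter(row -> !isLegacyContentRow(row))
                .collect(Collectors.toList());
    }
}
```

Keying the filter on both the NULL version and the pointer type is what keeps the unmigrated site_search row safe: it also has a NULL version, but its type is not in the legacy content set.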
…e 2 misrouting)

After a reindex switchover, the async optimize used the phase-aware optimize(), which routes through router.read(), the read provider. In Phase 2 that is OS, so the ES physical names (bare, from IndiciesInfo) were sent to OpenSearch, hit index_not_found_exception (the real OS index carries the .os tag), logged a misleading ERROR, and only completed via the Phase-2 ES fallback. The OS indices were never optimized.

Optimize each provider directly with the names it actually holds (ES bare names from newInfo, OS .os-tagged names from VersionedIndices) via operationsES/OS.indexAPI().optimize(), bypassing the read-provider router.

New private helper optimizeNewActiveIndicesAsync(esNames, osNames): async, best-effort per provider (a force-merge failure never affects the completed switchover), skips a provider when its name list is empty (so Phase 3 never contacts the decommissioned ES cluster).

Both switchover paths updated: the ES path (Phases 0/1/2) now also optimizes the promoted OS indices in dual-write phases; the OS path (Phase 3) routes explicitly to OS instead of relying on router.read resolving to OS.

The public optimize()/IndexAPIImpl routing is unchanged; other callers keep their behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
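The behaviour described for the helper (one call per provider, skip on an empty name list, failures contained per provider) can be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: `PerProviderOptimizeSketch` and its method names are hypothetical, the providers are modelled as plain `Consumer` callbacks, and the sketch runs synchronously, whereas the real helper is asynchronous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class PerProviderOptimizeSketch {

    /** Optimizes one provider best-effort; returns true only if the call ran and succeeded. */
    static boolean optimizeBestEffort(String provider, List<String> names,
                                      Consumer<List<String>> optimizer, List<String> log) {
        if (names.isEmpty()) {
            // Nothing to optimize: the provider is never contacted.
            log.add(provider + ": skipped (no indices)");
            return false;
        }
        try {
            optimizer.accept(names);
            log.add(provider + ": optimized " + names);
            return true;
        } catch (RuntimeException e) {
            // A force-merge failure is recorded but never propagated to the caller.
            log.add(provider + ": failed (" + e.getMessage() + ")");
            return false;
        }
    }

    /** Sends each provider only the names it holds and returns a log of what happened. */
    static List<String> optimizeNewActiveIndices(List<String> esNames, List<String> osNames,
                                                 Consumer<List<String>> esOptimizer,
                                                 Consumer<List<String>> osOptimizer) {
        List<String> log = new ArrayList<>();
        optimizeBestEffort("ES", esNames, esOptimizer, log);
        optimizeBestEffort("OS", osNames, osOptimizer, log);
        return log;
    }
}
```

Because each provider is handled in its own try block, an ES failure in a dual-write phase does not prevent the OS indices from being optimized, and an empty ES list in Phase 3 means the ES callback is never invoked.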
…Phase 2 REST /optimize misrouting)

The REST /optimize endpoint (ESIndexResource.optimizeIndices) calls ContentletIndexAPIImpl.optimize(listDotCMSIndices()), and listDotCMSIndices() -> getIndices() aggregates BOTH providers in dual-write phases: ES bare names plus OS .os-tagged names. IndexAPIImpl.optimize() routed that mixed list through router.read() to a single provider, so the foreign-tagged names hit index_not_found_exception (ES has no .os indices), surfacing as a noisy ERROR.

Partition the names by IndexTag.resolve (tag-dispatch, the canonical pattern): force-merge ES with its bare names and OS with its .os-tagged names, each via the provider directly. Skip a provider when its subset is empty, so Phase 0 never contacts OS and Phase 3 never contacts the decommissioned ES cluster.

Companion to the switchover-path fix (optimizeNewActiveIndicesAsync); same bug class, different caller. The public optimize() path was explicitly left unchanged there and is fixed here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
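The partitioning step can be illustrated with a small sketch. It assumes, for illustration only, that an OS index is recognizable by a trailing `.os` in its name; the PR delegates that decision to `IndexTag.resolve`, whose signature is not shown here, so `isOsTagged` below is a hypothetical stand-in, as are the class name and the sample index names.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TagPartitionSketch {

    // Assumption: OS indices carry this suffix; the real check lives in IndexTag.resolve.
    static final String OS_TAG = ".os";

    /** Hypothetical tag check standing in for IndexTag.resolve. */
    static boolean isOsTagged(String indexName) {
        return indexName.endsWith(OS_TAG);
    }

    /**
     * Splits a mixed name list by provider, preserving input order.
     * Key true holds the OS names, key false holds the ES (bare) names.
     */
    static Map<Boolean, List<String>> partitionByProvider(List<String> mixedNames) {
        return mixedNames.stream()
                .collect(Collectors.partitioningBy(TagPartitionSketch::isOsTagged));
    }
}
```

`Collectors.partitioningBy` always returns entries for both keys, so a caller can test each subset with `isEmpty()` and skip the corresponding provider without a null check.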
996e826 to 303c6c1
Proposed Changes
Fixes #36077.
During OpenSearch migration Phase 3 (OS primary, ES decommissioned), a full reindex left stale ES-side rows in the `indicies` table: the old `live`/`working` pair plus the transient `reindex_live`/`reindex_working` pair (all NULL version), alongside the correctly-promoted `.os` (v3.X) pair. The table accumulated 6 rows instead of 2 on every Phase-3 reindex.

Root cause
An asymmetry in `ContentletIndexAPIImpl`:
- `initAndPointReindex` wrote the legacy ES store unconditionally (`pointES` had no phase gate), so a Phase-3 reindex still inserted ES `reindex_*` rows and re-persisted the old ES `live`/`working` rows.
- The Phase 3 switchover (`fullReindexSwitchoverOS`) only touches the OS store (`versionedIndicesAPI`); it never promotes or clears those ES rows, so they were orphaned.

The fix (DB-only: never contacts the ES cluster, which may be down in Phase 3)
- Gate `pointES(...)` in `initAndPointReindex` on `!isMigrationComplete()` so a Phase-3 reindex no longer writes ES pointers (mirrors the existing `isMigrationStarted()` gate on the OS write).
- Add `VersionedIndicesAPI.removeLegacyContentIndices()` (delegating to `IndicesFactory`): deletes the NULL-version `live`/`working`/`reindex_live`/`reindex_working` rows, preserves the unmigrated `site_search` pointer and all OS (non-NULL version) rows, and flushes the index caches. Invoked best-effort by `fullReindexSwitchoverOS` after promoting OS (a housekeeping failure must not undo a successful switchover).

Physical index deletion is intentionally out of scope: in Phase 3 ES may not be running, so touching the cluster is fragile. Orphaned physical indices are cleaned by the existing scheduled `DeleteInactiveLiveWorkingIndicesJob`.

Acceptance criteria
- `live`/`working` (NULL version) rows.
- `reindex_live`/`reindex_working` rows removed.
- `DeleteInactiveLiveWorkingIndicesJob` (see scope note above).
- `indicies` table state.

Testing
- `ContentletIndexAPIImplMigrationIntegrationTest#test_phase3_removeLegacyContentIndices_purgesEsRowsPreservesSiteSearchAndOs`: seeds the issue's exact 6-row orphan state and asserts the legacy ES content rows are purged while `site_search` and OS rows survive.
- `ContentletIndexAPIImplPhaseTest`: fake updated for the new API method.
- `./mvnw test-compile -pl :dotcms-core,:dotcms-integration` passes on JDK 25.

🤖 Generated with Claude Code