Skip to content

perf(relationships): bulk tree persistence, lean hydration and delta-only reindex when saving related content#36092

Closed
jcastro-dotcms wants to merge 8 commits into
mainfrom
support-36004-save-content-with-related-content-performance-improvements
Closed

perf(relationships): bulk tree persistence, lean hydration and delta-only reindex when saving related content#36092
jcastro-dotcms wants to merge 8 commits into
mainfrom
support-36004-save-content-with-related-content-performance-improvements

Conversation

@jcastro-dotcms

@jcastro-dotcms jcastro-dotcms commented Jun 9, 2026

Copy link
Copy Markdown
Member

This PR fixes: #36147

🎯 Problem

Saving a contentlet whose Content Type has several MANY_TO_MANY relationship fields became unusably slow as related-content volume grew: on a real-world dataset (~1,500 parents × ~20 children each, so each child relates back to ~350 parents), a single Save/Publish took from 40 seconds up to several minutes and fired tens of thousands of JDBC queries. Three compounding root causes:

  1. Per-row relationship persistence — every related record cost a getTree + saveTree pair (2–3 queries each) plus one DELETE per row.
  2. Unnecessary full hydration — the save path loaded complete contentlets (fields, permissions, version info) when it only needed identifiers or a count.
  3. Over-broad dependency reindexingdependenciesLeftToReindex() re-queued every currently-related contentlet for reindexing on every save, because the old-state/new-state comparison read the new DB state on both sides. Saving one child re-indexed hundreds of unchanged parents, each fully hydrated and remapped.

✨ Proposed Changes

Relationship persistence — bulk, batched, and permission-aware (ESContentletAPIImpl, TreeFactory)

  • deleteRelatedContent now splits into two paths: a fast identifier-only path for the system user / CMS Admins (single bulk DELETE, zero hydration, only removed parents loaded for reindex), and a permission-scoped path for limited users that preserves the legacy contract — relationships to content the user cannot READ are never deleted (bulk delete when the filter removes nothing; chunked IN deletes otherwise).
  • relateContent collects all rows and persists them through a new TreeFactory.insertTrees: a duplicate-safe batched upsert (dedup by the (child, parent, relation_type) PK, batched DELETE+INSERT wrapped in LocalTransaction so it is atomic for any caller). Duplicate records in the payload (same identifier twice, multiple languages) no longer risk a PK violation.
  • The tree-order base comes from a single MAX(tree_order)+1 lookup over surviving rows (replacing a full getRelatedContentFromIndex hydration), queried only on the parent branch where it is actually used.

Hydration elimination (ESContentletAPIImpl, VersionableFactory/Impl)

  • Related identifiers are read straight from the tree table via new TreeFactory identifier lookups — no contentlet hydration to compute delete scopes or counts.
  • Removed parents that genuinely need reindexing are loaded through a new batched, uncached VersionableFactory.findAllContentletVersionInfos(Collection) (chunked IN, 500 ids/statement) + the existing batch findContentlets, instead of one lookup per identifier. All SQL lives in factory classes per house architecture.

Delta-only dependency reindexing (ESMappingAPIImpl)

  • dependenciesLeftToReindex() now compares the search index (pre-save state) against the tree table (post-save state) and returns their true symmetric difference: only relationships actually added or removed. Unchanged parents are never re-queued — this is the biggest win for heavily related datasets.
  • Identifiers are deduped (the index holds one document per language/variant), so multilingual content no longer triggers spurious reindexing; a scroll-search fallback removes the silent 10,000-document truncation.

Hardening (review feedback)

  • assertTreeColumn allowlist guard on every TreeFactory helper that interpolates a column name into SQL (only parent/child accepted) — enforces what the Semgrep findings flagged conventionally.
  • Javadoc pass: class-level doc for TreeFactory, contracts on the bulk deletes (caller-enforced permissions), and a full explanation of the new reindex-delta model on dependenciesLeftToReindex.

✅ Checklist

  • Tests
  • Translations (N/A — backend only)
  • Security Implications Contemplated (add notes if applicable)

Security notes: permission semantics are preserved and now covered by an integration test (a limited user's save cannot delete relationships to content they cannot READ — this also fixes a data-loss risk present in the first draft of this PR). The Semgrep CUSTOM_INJECTION-2 findings are false positives (concatenated fragments are compile-time literals or generated ?,?,... placeholders; all user data is bound parameters) and are additionally guarded by the new column-name allowlist.

🧪 Testing

11 new integration tests, all green:

  • TreeFactoryTest (new class, 7 tests): batch upsert dedup, pre-existing row replacement, scoped deletes, >500-id chunking, ordered id lookups, next-tree-order.
  • ESContentletAPIImplTest (+3): limited-user permission preservation, duplicate-records save, append-after-preserved ordering. Full class: 90/90 passing.
  • ESMappingAPITest (+2): reindex delta returns only added+removed parents; multilingual parents are not spuriously re-indexed.

ℹ️ Additional Info

  • Verified against a production-scale dataset: re-saving heavily related content drops from minutes to seconds, with query counts reduced by orders of magnitude (bulk statements + delta-only reindex instead of per-row queries + full-graph reindex).
  • A manual backport to 24.12.27 LTS is prepared in parallel (adapted for -source 11), including the test suite.
  • Behavioral notes for reviewers: orphan tree rows (pointing to fully-deleted content) are now cleaned up by the bulk delete; tree_order values are rebased per save (relative ordering preserved — consumers only ORDER BY tree_order).

Screenshots

N/A — backend-only performance change.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Jun 9, 2026
@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Claude finished @jcastro-dotcms's task in 2m 17s —— View job


Rollback Safety Analysis

  • Reading rollback-unsafe categories reference
  • Getting full PR diff
  • Analyzing diff against all unsafe categories
  • Posting result and applying label

Result: ✅ Safe To Rollback

The PR was analyzed against every category in the rollback-unsafe reference. No unsafe patterns were found.

Files changed:

  • ESContentletAPIImpl.java — Logic refactor: deleteRelatedContent split into fast path (deleteAllRelatedContent) and legacy permission path (deleteReadableRelatedContent); findAllWorkingVersions helper added. No schema, API contract, or index mapping changes.
  • ESMappingAPIImpl.javadependenciesLeftToReindex rewritten to use searchIndex instead of getRelatedContent for stale-dependency detection. No ES field type changes, no putMapping calls, no mapping constant changes.
  • VersionableFactory.java / VersionableFactoryImpl.java — New abstract method findAllContentletVersionInfos(Collection<String>) + implementation. Chunked SELECT * FROM contentlet_version_info WHERE identifier IN (...) — purely additive, read-only, no schema change.
  • TreeFactory.java — New utility methods (insertTrees, getRelatedIds*, deleteTreesByParent/ChildAndRelationType, etc.) added. All are pure Java additions; no DDL, no migration tasks, no data transformation.
  • Test files — Purely additive.

Categories checked and cleared:

Category Assessment
C-1 Structural Data Model Change ❌ No DDL, no column/table changes
C-2 ES Mapping Change ❌ No putMapping, no field type changes, no mapping constant changes in ESMappingAPIImpl
C-3 Content JSON Model Version Bump ❌ No CURRENT_MODEL_VERSION change
C-4 DROP TABLE / DROP COLUMN ❌ None
H-1 One-Way Data Migration ❌ No UPDATE … SET backfill, no TRUNCATE
H-2 RENAME TABLE / COLUMN ❌ None
H-3 PK Restructuring ❌ None
H-4 New Field Type ❌ None
H-5 Storage Provider Change ❌ None
H-6 DROP PROCEDURE / FUNCTION ❌ None
H-7 NOT NULL column without default ❌ None
H-8 VTL Viewtool Contract Change ❌ No viewtool or returned-type surface changes
M-1 Column Type Change ❌ None
M-2 Push Bundle Format Change ❌ None
M-3 REST / GraphQL Contract Change ❌ No @Path, @GET/POST/PUT/DELETE, or GraphQL type changes
M-4 OSGi Interface Change ❌ New method on abstract VersionableFactory is internal; no OSGi-exported interface changed

Label applied: AI: Safe To Rollback

@jcastro-dotcms jcastro-dotcms marked this pull request as ready for review June 11, 2026 15:52
@semgrep-code-dotcms-test

Copy link
Copy Markdown
Contributor

Semgrep found 10 CUSTOM_INJECTION-2 findings:

The method identified is susceptible to injection. The input should be validated and properly
escaped.

If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.

Comment on lines +510 to +511
.setSQL("SELECT COALESCE(MAX(tree_order), 0) + 1 AS next_position FROM tree"
+ " WHERE " + whereColumn + " = ? AND relation_type = ?")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep identified a blocking 🔴 issue in your code:
The method identified is susceptible to injection. The input should be validated and properly
escaped.

Why this might be safe to ignore:

This query concatenates only the column name, and in this method that value is constrained by internal callers to the fixed literals "parent" or "child" while user-controlled values are still passed as parameters. The rule matched a generic string-built SQL pattern, but in this context there is no realistic injection path to the database.

To resolve this comment:

🔧 No guidance has been designated for this issue. Fix according to your organization's approved methods.

💬 Ignore this finding

Reply with Semgrep commands to ignore this finding.

  • /fp <comment> for false positive
  • /ar <comment> for acceptable risk
  • /other <comment> for all other reasons

Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by CUSTOM_INJECTION-2.

If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.

You can view more details about this finding in the Semgrep AppSec Platform.

Comment on lines +476 to +477
"SELECT * FROM contentlet_version_info WHERE identifier IN ("
+ DotConnect.createParametersPlaceholder(chunk.size()) + ")");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep identified a blocking 🔴 issue in your code:
The method identified is susceptible to injection. The input should be validated and properly
escaped.

Why this might be safe to ignore:

This query builds only the number of parameter placeholders for an IN clause, and the actual identifier values are passed separately with addParam, so user input is not concatenated into SQL. The rule matched a SQL string concatenation pattern, but in this context it does not create a meaningful injection risk.

To resolve this comment:

🔧 No guidance has been designated for this issue. Fix according to your organization's approved methods.

💬 Ignore this finding

Reply with Semgrep commands to ignore this finding.

  • /fp <comment> for false positive
  • /ar <comment> for acceptable risk
  • /other <comment> for all other reasons

Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by CUSTOM_INJECTION-2.

If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.

You can view more details about this finding in the Semgrep AppSec Platform.

@semgrep-code-dotcms-test

Copy link
Copy Markdown
Contributor

Semgrep found 3 CUSTOM_INJECTION-2 findings:

The method identified is susceptible to injection. The input should be validated and properly
escaped.

If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.

@jcastro-dotcms

jcastro-dotcms commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Regarding Semgrep's analysis, all three are false positives:

Finding analysis

  • Findings 842408061 + 842408062 — TreeFactory L510–511 (both point at the same two lines, in getNextTreeOrder):
  .setSQL("SELECT COALESCE(MAX(tree_order), 0) + 1 AS next_position FROM tree"
          + " WHERE " + whereColumn + " = ? AND relation_type = ?")

The rule fires on the string concatenation, but the only concatenated variable is whereColumn, which is a private method parameter whose only two callers pass the hardcoded literals "parent" and "child". Everything user-controllable (id, relationType) goes through addParam bound parameters. No injection path exists.

  • Finding 842408064 — VersionableFactoryImpl L476–477:
  .setSQL("SELECT * FROM contentlet_version_info WHERE identifier IN ("
          + DotConnect.createParametersPlaceholder(chunk.size()) + ")");

The concatenated value is the output of createParametersPlaceholder(int) — a generated ?,?,?... string derived from a list size, never from content. All identifiers are bound via addParam. This is also the established house pattern: ContentletIndexAPIImpl.SELECT_CONTENTLET_VERSION_INFO and ESContentFactoryImpl build chunked IN clauses with the exact same helper.

@jcastro-dotcms

Copy link
Copy Markdown
Member Author

@claude regarding your feedback:

1. SQL injection via dynamic column names — ✅ VALID (as hardening) — FIXED
Agreed there's no current vector (both private methods are only called with the literals "parent"/"child", and all user-controlled values are bound parameters), but the contract was conventional rather than enforced. Added a
private assertTreeColumn(String) allowlist guard that throws IllegalArgumentException for anything other than parent/child, called at the top of all three SQL-building helpers (getRelatedIds, getNextTreeOrder,
deleteTreesScopedByRelationType).

2. Non-atomic delete-then-insert in insertTrees⚠️ PARTIALLY VALID — FIXED
For the actual production path the concern doesn't apply: relateContent opens an explicit local transaction before deleteRelatedContent and commits after insertTrees, and @WrapInTransaction/LocalTransaction joins an
existing transaction (see LocalTransaction.wrapReturn javadoc), so both batches were already atomic on the same thread-local connection. However, the point is fair for hypothetical standalone callers with no open transaction.
insertTrees now wraps both batches in LocalTransaction.wrap(...), making it self-contained for any caller. Verified by TreeFactoryTest, which calls it outside any transaction.

3. isExemptFromPermissionFiltering null-user contract — ❌ NOT VALID
Returning false for a null user routes into deleteReadableRelatedContent, which is exactly the legacy code path — the same permissionAPI.filterCollection(…, null) call the pre-refactor code made. PermissionBitAPIImpl explicitly handles null users (treated as anonymous), so behavior is unchanged. Additionally, a null user fails the PERMISSION_EDIT check at the top of deleteRelatedContent before any filtering is reached. Throwing IllegalArgumentException would change legacy behavior, which is the opposite of this PR's intent.

4. Fast path skips non-relationship-field cache eviction — ❌ NOT VALID
The audit requested was done when the identifier-only overload was written: the entire body of the Contentlet-based invalidateRelatedContentCache is inside if (relationship.isRelationshipField()) — for legacy (non-relationship-field) relationships the old method is already a no-op, identical to the new overload's early return. The only behavioral delta is the skipped per-instance setRelated(null), which is intentional and documented in the overload's javadoc: readers always resolve related content through the API, which consults the identifier-keyed RelationshipCache that is evicted.

5. Scroll fallback triggered incorrectly — ❌ NOT VALID (one fair nit)
The fallback fires only when totalResults > results.size(), i.e., only when truncation actually occurred — never when the total is ≤ MAX_LIMIT. When truncation has occurred, a complete re-fetch is mandatory regardless of whether the overage is 1 document or 100k, so there is no "unconditional scroll for edge cases". The fair nit is that limit > MAX_LIMIT → scroll is internal branching of searchIndex; that dependence is documented with an inline comment at the call site. Exposing the scroll path through a dedicated public API would be a nice follow-up but is out of scope for this PR.

6. Wasted getNextTreeOrder DB read in the child branch — ✅ VALID — FIXED
Correct: when the contentlet is the child side, trees are positioned via positionInParent and treePosition was fetched but never used. The query is now gated to related.isHasParent(); the child branch uses a constant. As a result, getNextTreeOrderByChildAndRelationType lost its only caller and was removed (introduced in this PR, never released). Added a TreeFactoryTest case covering the remaining getNextTreeOrderByParentAndRelationType (returns 1 when empty, max+1 otherwise).

7. No transaction boundary around delete-then-insert in relateContent — ❌ NOT VALID
The relateContent explicitly wraps the whole sequence: HibernateUtil.startLocalTransactionIfNeeded() before deleteRelatedContent, commitTransaction() after TreeFactory.insertTrees, and rollbackTransaction() in the catch block. deleteRelatedContent's @WrapInTransaction joins that transaction rather than creating its own, so a failure in insertTrees rolls back the deletes. The pair is atomic both in the standalone-API case and inside the checkin flow.

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Claude finished @jcastro-dotcms's task in 1m 53s —— View job


Review Follow-up

  • Gathering context — reading current state of changed files
  • Verifying fixes for findings 1, 2, and 6
  • Evaluating rejections for findings 3, 4, 5, and 7
  • Posting updated review

Verification Results

I read the current state of TreeFactory.java and ESContentletAPIImpl.java. Here's the verdict on each point:


Finding 1 — SQL injection allowlist ✅ Fix verified

assertTreeColumn (L453–458) is correct and is called at the tops of all three private SQL-building helpers: getRelatedIds (L473), getNextTreeOrder (L515), and deleteTreesScopedByRelationType (L569–570). The guard is tight (!"parent".equals(columnName) && !"child".equals(columnName)) and throws IllegalArgumentException with a clear message. No issues.


Finding 2 — Non-atomic delete-then-insert ✅ Fix verified

insertTrees now wraps both batches in LocalTransaction.wrap(...) (L619–633). The inline comment at L620–622 explains the semantics correctly: joins an existing transaction when called from within relateContent, provides its own atomic context for standalone callers. One interaction worth noting: relateContent also calls HibernateUtil.startLocalTransactionIfNeeded() — when insertTrees is then invoked inside that outer transaction, LocalTransaction.wrap() correctly detects the open connection and joins it without creating an extra commit point. No double-commit risk.


Finding 3 — isExemptFromPermissionFiltering null-user contract ✅ Rejection accepted

The argument is correct. A null user failing isExemptFromPermissionFiltering returns false and routes to deleteReadableRelatedContent, which is exactly the pre-refactor path. More importantly, relateContent (L5092–5098) performs the PERMISSION_EDIT check before reaching any of this — a null user without permission throws there, so the null-user case never reaches the delete path in practice. The behavior is unchanged.


Finding 4 — Fast path skips non-relationship-field cache eviction ✅ Rejection accepted

After reading both overloads (L4786–4800 and L4803–4833), the audit is confirmed:

  • Both are entirely gated on relationship.isRelationshipField() — for legacy (non-relationship-field) relationships both return early, so no behavioral difference exists.
  • The only delta in the sameParentAndChild branch is that the contentlet overload uses two sequential if blocks (may call removeRelatedContentMap twice) vs. the identifier-only overload using || (calls it once). Calling removeRelatedContentMap twice for the same key is idempotent, so the cache semantics are identical.
  • The setRelated(null) calls on the in-memory contentlet are intentionally absent in the identifier-only path, as documented in the javadoc.

Finding 5 — Scroll fallback ✅ Rejection accepted

The original finding was wrong about "unconditional scroll for edge cases": the condition totalResults > results.size() is a truncation check and only fires when there actually are more results than were returned. The nit about relying on internal branching (limit > MAX_LIMIT → scroll) is acknowledged and documented with an inline comment. Acceptable as a follow-up item.


Finding 6 — Wasted DB read in child branch ✅ Fix verified

relateContent L5144–5147 now uses related.isHasParent() ? TreeFactory.getNextTreeOrderByParentAndRelationType(...) : 1. The child branch uses a constant, no DB read. getNextTreeOrderByChildAndRelationType is gone (confirmed — it's not in the file). Clean fix.


Finding 7 — Transaction boundary around delete-then-insert ✅ Rejection accepted

Reading relateContent (L5123–5192): startLocalTransactionIfNeeded() at L5127 opens the outer transaction, deleteRelatedContent's @WrapInTransaction joins it, insertTrees via LocalTransaction.wrap() joins it, and commit/rollback are only done when localTransaction == true. The pair is atomic. The rejection is correct.


Summary

All three fixes (1, 2, 6) look correct and complete. All four rejections (3, 4, 5, 7) are well-reasoned and verified by reading the code. The PR is in good shape — no remaining blocking issues from my prior review.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

❌ Codex Review failed — openai.gpt-5.5

The review job failed before producing output. See the run for details.

Run: #27432404422

@jcastro-dotcms jcastro-dotcms changed the title Creating draft PR with potential fix perf(relationships): bulk tree persistence, lean hydration and delta-only reindex when saving related content Jun 12, 2026
@jcastro-dotcms jcastro-dotcms deleted the support-36004-save-content-with-related-content-performance-improvements branch June 12, 2026 18:03
@jcastro-dotcms

Copy link
Copy Markdown
Member Author

This PR was automatically closed by GitHub when its head branch was renamed to match issue #36147. It is superseded by #36149, which contains the exact same commits and description. All review feedback from this PR (Semgrep findings and bot review) has been addressed in the code — see #36149.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Saving contentlets with many-to-many relationship fields and large related-content volumes is extremely slow

1 participant