perf(relationships): bulk tree persistence, lean hydration and delta-only reindex when saving related content#36092
Conversation
|
Claude finished @jcastro-dotcms's task in 2m 17s —— View job Rollback Safety Analysis
Result: ✅ Safe To RollbackThe PR was analyzed against every category in the rollback-unsafe reference. No unsafe patterns were found. Files changed:
Categories checked and cleared:
Label applied: AI: Safe To Rollback |
…ent-performance-improvements
|
Semgrep found 10
The method identified is susceptible to injection. The input should be validated and properly If this is a critical or high severity finding, please also link this issue in the #security channel in Slack. |
…ent-performance-improvements
…ent-performance-improvements
| .setSQL("SELECT COALESCE(MAX(tree_order), 0) + 1 AS next_position FROM tree" | ||
| + " WHERE " + whereColumn + " = ? AND relation_type = ?") |
There was a problem hiding this comment.
Semgrep identified a blocking 🔴 issue in your code:
The method identified is susceptible to injection. The input should be validated and properly
escaped.
Why this might be safe to ignore:
This query concatenates only the column name, and in this method that value is constrained by internal callers to the fixed literals "parent" or "child" while user-controlled values are still passed as parameters. The rule matched a generic string-built SQL pattern, but in this context there is no realistic injection path to the database.
To resolve this comment:
🔧 No guidance has been designated for this issue. Fix according to your organization's approved methods.
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding.
/fp <comment>for false positive/ar <comment>for acceptable risk/other <comment>for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by CUSTOM_INJECTION-2.
If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.
You can view more details about this finding in the Semgrep AppSec Platform.
| "SELECT * FROM contentlet_version_info WHERE identifier IN (" | ||
| + DotConnect.createParametersPlaceholder(chunk.size()) + ")"); |
There was a problem hiding this comment.
Semgrep identified a blocking 🔴 issue in your code:
The method identified is susceptible to injection. The input should be validated and properly
escaped.
Why this might be safe to ignore:
This query builds only the number of parameter placeholders for an IN clause, and the actual identifier values are passed separately with addParam, so user input is not concatenated into SQL. The rule matched a SQL string concatenation pattern, but in this context it does not create a meaningful injection risk.
To resolve this comment:
🔧 No guidance has been designated for this issue. Fix according to your organization's approved methods.
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding.
/fp <comment>for false positive/ar <comment>for acceptable risk/other <comment>for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by CUSTOM_INJECTION-2.
If this is a critical or high severity finding, please also link this issue in the #security channel in Slack.
You can view more details about this finding in the Semgrep AppSec Platform.
|
Semgrep found 3
The method identified is susceptible to injection. The input should be validated and properly If this is a critical or high severity finding, please also link this issue in the #security channel in Slack. |
|
Regarding Semgrep's analysis, all three are false positives: Finding analysis
.setSQL("SELECT COALESCE(MAX(tree_order), 0) + 1 AS next_position FROM tree"
+ " WHERE " + whereColumn + " = ? AND relation_type = ?")The rule fires on the string concatenation, but the only concatenated variable is whereColumn, which is a private method parameter whose only two callers pass the hardcoded literals "parent" and "child". Everything user-controllable (id, relationType) goes through addParam bound parameters. No injection path exists.
.setSQL("SELECT * FROM contentlet_version_info WHERE identifier IN ("
+ DotConnect.createParametersPlaceholder(chunk.size()) + ")");The concatenated value is the output of createParametersPlaceholder(int) — a generated ?,?,?... string derived from a list size, never from content. All identifiers are bound via addParam. This is also the established house pattern: ContentletIndexAPIImpl.SELECT_CONTENTLET_VERSION_INFO and ESContentFactoryImpl build chunked IN clauses with the exact same helper. |
|
@claude regarding your feedback: 1. SQL injection via dynamic column names — ✅ VALID (as hardening) — FIXED 2. Non-atomic delete-then-insert in 3. 4. Fast path skips non-relationship-field cache eviction — ❌ NOT VALID 5. Scroll fallback triggered incorrectly — ❌ NOT VALID (one fair nit) 6. Wasted 7. No transaction boundary around delete-then-insert in |
…ent-performance-improvements
|
Claude finished @jcastro-dotcms's task in 1m 53s —— View job Review Follow-up
Verification ResultsI read the current state of Finding 1 — SQL injection allowlist ✅ Fix verified
Finding 2 — Non-atomic delete-then-insert ✅ Fix verified
Finding 3 —
|
❌ Codex Review failed —
|
This PR fixes: #36147
🎯 Problem
Saving a contentlet whose Content Type has several MANY_TO_MANY relationship fields became unusably slow as related-content volume grew: on a real-world dataset (~1,500 parents × ~20 children each, so each child relates back to ~350 parents), a single Save/Publish took from 40 seconds up to several minutes and fired tens of thousands of JDBC queries. Three compounding root causes:
getTree+saveTreepair (2–3 queries each) plus oneDELETEper row.dependenciesLeftToReindex()re-queued every currently-related contentlet for reindexing on every save, because the old-state/new-state comparison read the new DB state on both sides. Saving one child re-indexed hundreds of unchanged parents, each fully hydrated and remapped.✨ Proposed Changes
Relationship persistence — bulk, batched, and permission-aware (
ESContentletAPIImpl,TreeFactory)deleteRelatedContentnow splits into two paths: a fast identifier-only path for the system user / CMS Admins (single bulkDELETE, zero hydration, only removed parents loaded for reindex), and a permission-scoped path for limited users that preserves the legacy contract — relationships to content the user cannot READ are never deleted (bulk delete when the filter removes nothing; chunkedINdeletes otherwise).relateContentcollects all rows and persists them through a newTreeFactory.insertTrees: a duplicate-safe batched upsert (dedup by the(child, parent, relation_type)PK, batchedDELETE+INSERTwrapped inLocalTransactionso it is atomic for any caller). Duplicate records in the payload (same identifier twice, multiple languages) no longer risk a PK violation.MAX(tree_order)+1lookup over surviving rows (replacing a fullgetRelatedContentFromIndexhydration), queried only on the parent branch where it is actually used.Hydration elimination (
ESContentletAPIImpl,VersionableFactory/Impl)treetable via newTreeFactoryidentifier lookups — no contentlet hydration to compute delete scopes or counts.VersionableFactory.findAllContentletVersionInfos(Collection)(chunkedIN, 500 ids/statement) + the existing batchfindContentlets, instead of one lookup per identifier. All SQL lives in factory classes per house architecture.Delta-only dependency reindexing (
ESMappingAPIImpl)dependenciesLeftToReindex()now compares the search index (pre-save state) against thetreetable (post-save state) and returns their true symmetric difference: only relationships actually added or removed. Unchanged parents are never re-queued — this is the biggest win for heavily related datasets.Hardening (review feedback)
assertTreeColumnallowlist guard on everyTreeFactoryhelper that interpolates a column name into SQL (onlyparent/childaccepted) — enforces what the Semgrep findings flagged conventionally.TreeFactory, contracts on the bulk deletes (caller-enforced permissions), and a full explanation of the new reindex-delta model ondependenciesLeftToReindex.✅ Checklist
Security notes: permission semantics are preserved and now covered by an integration test (a limited user's save cannot delete relationships to content they cannot READ — this also fixes a data-loss risk present in the first draft of this PR). The Semgrep
CUSTOM_INJECTION-2findings are false positives (concatenated fragments are compile-time literals or generated?,?,...placeholders; all user data is bound parameters) and are additionally guarded by the new column-name allowlist.🧪 Testing
11 new integration tests, all green:
TreeFactoryTest(new class, 7 tests): batch upsert dedup, pre-existing row replacement, scoped deletes, >500-id chunking, ordered id lookups, next-tree-order.ESContentletAPIImplTest(+3): limited-user permission preservation, duplicate-records save, append-after-preserved ordering. Full class: 90/90 passing.ESMappingAPITest(+2): reindex delta returns only added+removed parents; multilingual parents are not spuriously re-indexed.ℹ️ Additional Info
-source 11), including the test suite.treerows (pointing to fully-deleted content) are now cleaned up by the bulk delete;tree_ordervalues are rebased per save (relative ordering preserved — consumers onlyORDER BY tree_order).Screenshots
N/A — backend-only performance change.
🤖 Generated with Claude Code