[feature](cloud) Add table-level event-driven warm up #63832
TPC-H: Total hot run time: 31875 ms
TPC-DS: Total hot run time: 172324 ms
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up change adds a `table_id` argument before `sync_wait_timeout_ms` in `CloudWarmUpManager::warm_up_rowset`. After rebasing onto the latest master, the existing `CloudWarmUpManagerTest` calls still used the old two-argument form, so the positive-timeout test passed 1000 as `table_id` and left `sync_wait_timeout_ms` at its default of -1. That made the test take the async non-positive-timeout branch, so the before-wait sync point was never reached and the spurious notify assertion failed. Update the test calls to pass `table_id` and `sync_wait_timeout_ms` explicitly.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: `./run-be-ut.sh --run --filter=CloudWarmUpManagerTest.* -j100`
- Behavior changed: No.
- Does this need documentation: No.
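The failure described in this commit message is a general C++ hazard: inserting a parameter in front of a defaulted one keeps stale call sites compiling while silently changing what their arguments mean. The sketch below illustrates it with a hypothetical free function, `warm_up_rowset_sketch`; it is not the real `CloudWarmUpManager::warm_up_rowset` signature, and the `Path` and `CallInfo` types exist only for the illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in; only the parameter order mirrors the description.
//   Old shape: warm_up(rowset, sync_wait_timeout_ms = -1)
//   New shape: warm_up(rowset, table_id, sync_wait_timeout_ms = -1)
enum class Path { kAsync, kSyncWait };

struct CallInfo {
    int64_t table_id;
    int64_t sync_wait_timeout_ms;
    Path path;
};

CallInfo warm_up_rowset_sketch(const std::string& rowset, int64_t table_id,
                               int64_t sync_wait_timeout_ms = -1) {
    (void)rowset;  // unused in this sketch
    // A non-positive timeout selects the asynchronous branch.
    Path path = sync_wait_timeout_ms > 0 ? Path::kSyncWait : Path::kAsync;
    return {table_id, sync_wait_timeout_ms, path};
}
```

A stale call such as `warm_up_rowset_sketch("rs", 1000)` still compiles, but `1000` now binds to `table_id` and the timeout falls back to -1, which is why the test ended up on the asynchronous branch. Passing both arguments explicitly, as the fix does, removes the ambiguity.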
TPC-H: Total hot run time: 31958 ms
TPC-DS: Total hot run time: 172417 ms
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up table filter performance tests used tight wall-clock thresholds for the 200K and 500K wildcard match-all cases. CI machines can run these scale tests slightly slower than local runs even though the matching implementation remains efficient. Relax the 200K threshold from 1s to 1.5s and the 500K threshold from 2s to 3s while keeping the existing functional assertions and the smaller or more selective performance checks.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: `./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest`
- Behavior changed: No.
- Does this need documentation: No.
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up table filter performance test for 200K tables with 15 include/exclude rules still used a tight 2s wall-clock threshold. CI can exceed that threshold under load while the matcher remains functionally correct. Relax the threshold to 3s and keep the matched-table assertion unchanged.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: `./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest`
- Behavior changed: No.
- Does this need documentation: No.
    static constexpr int WINDOW_30M = 1800;
    static constexpr int WINDOW_1H = 3600;

    MBvarWindowedAdder g_warmup_ed_finish_segment_num("warmup_ed_finish_segment_num", {"job_id"},
Are there any memory issues if there are many jobs?
How does bvar implement "windows"? Does it record every sample of the adder every second?
I checked the bvar implementation again.
bvar::Window does not record every update written to the Adder. For bvar::Adder, the underlying sampler samples the cumulative adder value roughly once per second, and the window value is calculated from the difference between the latest sampled cumulative value and the oldest sampled cumulative value in the requested window.
The 5m/30m/1h windows created for the same Adder also share the same underlying sampler. The sampler queue is sized by the largest window, so here it keeps about 3600 + 1 samples, not 300 + 1800 + 3600 samples and not one sample per warm-up event.
Rough estimate:
- One `Sample<int64_t>` stores `data` and `time_us`, so it is about 16 bytes.
- The largest window is 1h, so one sampler queue is about `(3600 + 1) * 16 ~= 56KB`.
- Source-side stats have 4 windowed adders, about `4 * 56KB ~= 224KB/job` for sampler queues.
- Target-side stats have 8 windowed adders, about `8 * 56KB ~= 448KB/job` for sampler queues.
- If the same BE process observes both sides, the sampler queue storage is roughly `(4 + 8) * 56KB ~= 672KB/job`, plus small object/map/string overhead.
So this is proportional to the number of job_id dimensions seen by a BE process, not proportional to the number of rowsets/segments/events. The overall memory usage should be small for the expected number of warm-up jobs. This state is also BE-process-local memory only; it is not persisted and will be released after BE restart.
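The sampling model described in this reply can be sketched as follows. This is a simplified illustration and not the bvar implementation: `Sample`, `SharedSampler`, and their members are hypothetical names, and the only assumptions are the ones stated above (one cumulative sample per second, one queue sized by the largest window, window value computed as a difference of two samples).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Simplified model of a sampled window over a cumulative counter.
struct Sample {
    int64_t data;     // cumulative adder value at sampling time
    int64_t time_us;  // sampling timestamp in microseconds
};

class SharedSampler {
public:
    // One queue serves every window; it is sized by the largest one.
    explicit SharedSampler(size_t max_window_s) : _capacity(max_window_s + 1) {}

    // Called roughly once per second with the current cumulative value.
    void take_sample(int64_t cumulative, int64_t now_us) {
        if (_queue.size() == _capacity) _queue.pop_front();
        _queue.push_back({cumulative, now_us});
    }

    // Window value = latest sample minus the sample `window_s` ticks earlier
    // (or the oldest retained sample if fewer have been collected).
    int64_t window_value(size_t window_s) const {
        if (_queue.empty()) return 0;
        size_t last = _queue.size() - 1;
        size_t first = last > window_s ? last - window_s : 0;
        return _queue[last].data - _queue[first].data;
    }

    // Upper bound on sample storage, independent of the update rate.
    size_t queue_bytes() const { return _capacity * sizeof(Sample); }

private:
    size_t _capacity;
    std::deque<Sample> _queue;
};
```

With `max_window_s = 3600`, the 5m, 30m, and 1h values all come from the same 3601-entry queue, so the per-adder cost stays near the 56KB estimate regardless of how many warm-up events update the counter.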
TPC-H: Total hot run time: 31398 ms
TPC-H: Total hot run time: 31974 ms
TPC-DS: Total hot run time: 172895 ms
TPC-DS: Total hot run time: 171939 ms
TPC-H: Total hot run time: 29683 ms
TPC-DS: Total hot run time: 168203 ms
I found one issue in the target-BE event-driven stats accounting. The PR now allows multiple table-level load-event warmup jobs for the same source/target when their ON TABLES filters differ; when those filters overlap, the target BE can receive a duplicate rowset request while another job is still warming the same rowset. The new stats path currently records that skipped request as finished immediately, so SyncStats can report catch-up before the cache warmup has completed or after the original attempt later fails.
Critical checkpoint conclusions:
- Goal and tests: The feature path has broad FE/BE/regression coverage, but the overlapping table-filter duplicate/in-flight stats case is not covered.
- Scope: The changes are broad but mostly focused on table-level event-driven warmup, protocol plumbing, and observability.
- Concurrency and locks: The existing tablet warmup-state lock path was inspected; the issue is semantic accounting across an in-flight duplicate, not a lock ordering problem.
- Lifecycle and static initialization: New process-global metrics/tracker definitions are straightforward; no additional static initialization dependency issue was found.
- Config and dynamic behavior: New intervals/display limits are wired as dynamic FE configs; no additional config issue was found.
- Compatibility and rolling upgrade: The old-BE `table_ids` compatibility concern was already raised in an existing thread, so I did not duplicate it.
- Parallel code paths: `warm_up_rowset` table-id filtering was reviewed; the recycle-cache table-id gap was already raised in an existing thread, so I did not duplicate it.
- Observability: The submitted issue directly affects the correctness of the new `SyncStats`/warmup metrics.
- Persistence and replay: Persisted table-filter rules and replay rebuild paths were inspected; no additional issue was found.
- FE/BE variable passing: `table_ids`, `job_id`, and `upstream_trigger_ts_ms` were reviewed; no new issue beyond the already-known rolling-upgrade concern was found.
- Performance: HTTP aggregation and bvar usage were inspected; the existing bvar-memory discussion is already covered by prior review context.
- User focus: No additional user-provided review focus was present.
    << ", skip it";
    g_warmup_ed_downstream_progress_tracker.record_task_done(job_id_str,
                                                             upstream_trigger_ts_ms);
    record_warmup_ed_skipped_rowset_as_finished(rs_meta, job_id_str);
add_rowset_warmup_state returns false both when the rowset is already DONE and when an EVENT_DRIVEN warmup is still DOING. This branch treats both cases as finished by advancing downstream progress and recording all skipped files as finished, but the in-flight case is now reachable when two table-level jobs have overlapping filters, or when the same rowset request is retried while the first download is still running. In that case SHOW WARM UP JOB/FE metrics can report zero trigger gap and full destination size even though the original downloads may still be pending or may later fail. Please only mark the duplicate request as finished when the existing warmup state is already complete, or attach the duplicate job/progress accounting to the in-flight completion instead of completing it immediately.
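One way to read the second suggestion (attach the duplicate's accounting to the in-flight completion) is sketched below. Every name here is hypothetical: `RowsetWarmupTracker`, `request`, and `complete` are not Doris APIs, and the real tablet warmup-state code may be structured quite differently. The sketch only shows how a duplicate request that arrives during `DOING` can be deferred, while one that arrives after `DONE` is counted immediately.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

enum class WarmupState { kNone, kDoing, kDone };

// Hypothetical tracker illustrating deferred accounting for duplicates.
class RowsetWarmupTracker {
public:
    using OnFinish = std::function<void(const std::string& job_id, bool ok)>;
    explicit RowsetWarmupTracker(OnFinish cb) : _on_finish(std::move(cb)) {}

    // Returns true if the caller should start a download for this rowset.
    bool request(const std::string& rowset_id, const std::string& job_id) {
        Entry& e = _entries[rowset_id];
        switch (e.state) {
        case WarmupState::kNone:
            e.state = WarmupState::kDoing;
            e.waiting_jobs.push_back(job_id);
            return true;
        case WarmupState::kDoing:
            // Duplicate while in flight: defer accounting until completion.
            e.waiting_jobs.push_back(job_id);
            return false;
        case WarmupState::kDone:
            // Already cached: safe to count as finished right away.
            _on_finish(job_id, true);
            return false;
        }
        return false;
    }

    // Called when the single in-flight download succeeds or fails.
    void complete(const std::string& rowset_id, bool ok) {
        auto it = _entries.find(rowset_id);
        if (it == _entries.end() || it->second.state != WarmupState::kDoing) return;
        Entry& e = it->second;
        for (const auto& job : e.waiting_jobs) _on_finish(job, ok);
        e.waiting_jobs.clear();
        e.state = ok ? WarmupState::kDone : WarmupState::kNone;
    }

private:
    struct Entry {
        WarmupState state = WarmupState::kNone;
        std::vector<std::string> waiting_jobs;
    };
    OnFinish _on_finish;
    std::unordered_map<std::string, Entry> _entries;
};
```

Under this scheme a second job with an overlapping filter reports progress only when the shared download actually finishes, and it reports failure if that download fails, which avoids the premature catch-up described in the comment.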
PR approved by at least one committer and no changes requested.
## Summary

apache/doris#63832

- Update the read/write separation File Cache warm-up guide in both English and Chinese with table-level event-driven warm-up usage.
- Document `ON TABLES` syntax, `INCLUDE`/`EXCLUDE` matching rules, examples, refresh behavior, `SHOW WARM UP JOB` fields, detailed `SyncStats` JSON, BE Bvar metrics, and FE Prometheus metrics.
- Clarify that compute-group-level load-event warm-up and table-level `ON TABLES` load-event warm-up should not be configured together for the same source and destination compute groups.

## Validation

- `git diff --check`
- Front matter JSON parsing and Markdown code-fence/admonition pairing check

Note: Docusaurus/docs-governance checks were not run because this checkout does not have `node_modules`; the docs governance scripts fail on missing `gray-matter`.
### What problem does this PR solve?

Issue Number: None
Problem Summary:
This PR adds table-level event-driven cloud warm-up support and improves
active incremental warm-up progress observability.
Before this change, event-driven warm-up was only controlled at
compute-group granularity. Once a load-event warm-up job was enabled for
a source and target compute group pair, all source-side table writes
could trigger warm-up to the target compute group. That is inefficient
for workloads where only selected core tables, high-frequency query
tables, or selected async materialized views need to stay warm.
This PR lets users define the warm-up scope with `ON TABLES` when
creating an event-driven load warm-up job. FE persists the normalized
table filter in the warm-up job, resolves matched table ids dynamically,
sends the table ids to BE, and lets BE filter warm-up rowsets by table
id.
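The BE-side step in this flow (filter warm-up rowsets by table id) might look roughly like the sketch below. `RowsetInfo` and `filter_by_table_ids` are hypothetical names, and the use of `std::optional` is an assumption made for the illustration: an absent set stands for a compute-group-level job with no table filter, while a present but empty set stands for a table-level job whose filter currently matches nothing.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical rowset descriptor; real rowset metadata carries far more.
struct RowsetInfo {
    int64_t rowset_seq;
    int64_t table_id;
};

// Keeps only rowsets whose table id is in the job's resolved id set.
// Assumption: std::nullopt means "no table filter attached to this job".
std::vector<RowsetInfo> filter_by_table_ids(
        const std::vector<RowsetInfo>& rowsets,
        const std::optional<std::unordered_set<int64_t>>& table_ids) {
    if (!table_ids.has_value()) return rowsets;
    std::vector<RowsetInfo> kept;
    for (const auto& rs : rowsets) {
        if (table_ids->count(rs.table_id) > 0) kept.push_back(rs);
    }
    return kept;
}
```

Filtering on ids rather than names keeps the BE check to a hash lookup; the name-based wildcard rules only need to be evaluated on the FE when the matched set is refreshed.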
User-visible behavior:
- `WARM UP ... ON TABLES` supports table-level event-driven warm-up.
- Table filters support `INCLUDE` and `EXCLUDE` rules.
- Rules support `*` and `?` wildcards, for example `db.table`, `db.*`,
`*.orders_*`, and `log_db.log_?`.
- `INCLUDE` defines the candidate warm-up scope, and `EXCLUDE` removes
tables from that included scope.
- Rules are canonicalized before duplicate checks, so semantically
equivalent filters do not create duplicate jobs just because rule order
differs.
- Matching covers both regular OLAP tables and async materialized views.
- Matched table ids are refreshed as tables or async materialized views
are created, dropped, or renamed.
- The same source compute group can create independent table-level
warm-up jobs to different target compute groups with different table
filters.
- `SHOW WARM UP JOB` exposes the table-level job type, table filter,
matched tables, and SyncStats.
- `SHOW WARM UP JOB` list output keeps compact SyncStats, while
single-job lookup keeps detailed windowed SyncStats.
Example:
```sql
WARM UP COMPUTE GROUP query_cg WITH COMPUTE GROUP write_cg
ON TABLES (
INCLUDE 'core_db.config',
INCLUDE 'report_db.monthly_*',
INCLUDE '*.sales_*',
EXCLUDE '*.*_archive'
)
PROPERTIES (
"sync_mode" = "event_driven",
"sync_event" = "load"
);
```
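To make the `INCLUDE`/`EXCLUDE` semantics concrete, here is a minimal sketch of how such a filter could be evaluated and canonicalized. It is written in C++ for illustration only (the FE matcher is not shown in this conversation), and all names (`glob_match`, `TableFilter`, `should_warm_up`, `canonical_key`) are hypothetical. It also assumes that each rule has the form `db_pattern.table_pattern` and that the two halves are matched separately.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Glob match: '*' matches any run of characters, '?' exactly one character.
bool glob_match(const std::string& pattern, const std::string& text) {
    size_t p = 0, t = 0, star = std::string::npos, mark = 0;
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;  // remember the star and try matching zero characters
            mark = t;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (star != std::string::npos) {
            p = star + 1;  // backtrack: let the last star absorb one more character
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

struct TableFilter {
    std::vector<std::string> includes;
    std::vector<std::string> excludes;
};

// Assumption: rules look like "db_pattern.table_pattern".
bool rule_matches(const std::string& rule, const std::string& db, const std::string& table) {
    size_t dot = rule.find('.');
    if (dot == std::string::npos) return false;
    return glob_match(rule.substr(0, dot), db) && glob_match(rule.substr(dot + 1), table);
}

// INCLUDE defines the candidate scope; EXCLUDE removes tables from it.
bool should_warm_up(const TableFilter& f, const std::string& db, const std::string& table) {
    bool included = false;
    for (const auto& r : f.includes) included = included || rule_matches(r, db, table);
    if (!included) return false;
    for (const auto& r : f.excludes) {
        if (rule_matches(r, db, table)) return false;
    }
    return true;
}

// Order-insensitive key so that reordered rule lists compare equal.
std::string canonical_key(TableFilter f) {
    auto normalize = [](std::vector<std::string>& v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    };
    normalize(f.includes);
    normalize(f.excludes);
    std::string key;
    for (const auto& r : f.includes) key += "+" + r + ";";
    for (const auto& r : f.excludes) key += "-" + r + ";";
    return key;
}
```

Applied to the example filter, `shop.sales_eu` would be warmed up through `*.sales_*`, while `shop.sales_archive` would be included by the same rule and then removed by `*.*_archive`.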
Conflict and virtual compute group (VCG) behavior:
- Table-level load-event warm-up and cluster-level load-event warm-up
are mutually exclusive for the same source and target compute group
pair.
- If a conflicting job already exists, creation returns an error that
includes the conflicting job id; table-level conflicts also include the
table filter.
- Duplicate checks within the same job type still follow the existing
duplicate-check logic.
- VCG-managed cluster-level load-event warm-up creation does not fail on
conflict. Because VCG jobs are created by the MS HTTP API path, FE
cancels existing table-level load-event warm-up jobs with the same
source and target first, then recreates the VCG-managed cluster-level
job.
- Manually creating a table-level load-event warm-up job is rejected
only when both source and target compute groups are owned by the same
VCG.
- SQL still cannot use a virtual compute group directly as the source or
target compute group.
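The first two rules in this list can be sketched as a lookup over existing jobs. The code below is illustrative only: `Job`, `JobKind`, and `find_conflict` are hypothetical names, the VCG-specific paths are left out, and the same-type duplicate check is an assumption (same pair for cluster-level jobs, same canonical filter key for table-level jobs), since the text only says that it follows the existing logic.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

enum class JobKind { kClusterLevel, kTableLevel };

// Hypothetical job record; filter_key is a canonicalized ON TABLES filter.
struct Job {
    int64_t id;
    std::string src;
    std::string dst;
    JobKind kind;
    std::string filter_key;
};

// Returns the id of a conflicting job for the same source/target pair, if any.
std::optional<int64_t> find_conflict(const std::vector<Job>& existing, const Job& candidate) {
    for (const auto& j : existing) {
        if (j.src != candidate.src || j.dst != candidate.dst) continue;
        // Table-level and cluster-level jobs exclude each other on one pair.
        if (j.kind != candidate.kind) return j.id;
        // Assumed same-type duplicate check (see the note above).
        if (candidate.kind == JobKind::kClusterLevel || j.filter_key == candidate.filter_key) {
            return j.id;
        }
    }
    return std::nullopt;
}
```

Returning the conflicting job id matches the described error message, which names the job that blocks creation.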
Warm-up progress observation:
- BE records per-job windowed requested, finished, and failed warm-up
statistics.
- BE exposes per-job warm-up statistics through
`/api/warmup_event_driven_stats`.
- FE aggregates BE statistics and caches the aggregated result in the
warm-up job.
- SyncStats includes source-side and target-side warm-up size/count
progress across windows.
- SyncStats includes trigger-time progress, so users can observe whether
the target compute group is behind the latest source-side warm-up
trigger.
- FE `/metrics` exposes per-job active warm-up metadata, synchronized
size, and trigger gap metrics for cloud event-driven warm-up jobs.
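As a rough illustration of the trigger-gap idea, the sketch below tracks, per job, the newest source-side trigger time and the newest trigger time that the target side has finished, and reports the difference. `TriggerProgressSketch` and its methods are hypothetical; in the real system the two sides live on different BEs and the FE aggregates them, which this single-object sketch deliberately collapses.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-job trigger progress, collapsed into one object.
class TriggerProgressSketch {
public:
    // Source side: a load event triggered warm-up at trigger_ts_ms.
    void record_trigger(const std::string& job_id, int64_t trigger_ts_ms) {
        auto& p = _jobs[job_id];
        p.latest_trigger_ts_ms = std::max(p.latest_trigger_ts_ms, trigger_ts_ms);
    }

    // Target side: the task carrying this upstream trigger time has finished.
    void record_task_done(const std::string& job_id, int64_t upstream_trigger_ts_ms) {
        auto& p = _jobs[job_id];
        p.latest_done_trigger_ts_ms =
                std::max(p.latest_done_trigger_ts_ms, upstream_trigger_ts_ms);
    }

    // How far the target is behind the newest source-side trigger, in ms.
    int64_t trigger_gap_ms(const std::string& job_id) const {
        auto it = _jobs.find(job_id);
        if (it == _jobs.end()) return 0;
        return std::max<int64_t>(
                0, it->second.latest_trigger_ts_ms - it->second.latest_done_trigger_ts_ms);
    }

private:
    struct Progress {
        int64_t latest_trigger_ts_ms = 0;
        int64_t latest_done_trigger_ts_ms = 0;
    };
    std::unordered_map<std::string, Progress> _jobs;
};
```

A gap of zero means the target has completed work for the most recent trigger it was sent; a growing gap suggests the target compute group is falling behind the source.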
### Release note

Support table-level event-driven cloud warm-up with `ON TABLES` filters
and per-job warm-up sync statistics.

### Check List (For Author)

- Test
    - [x] Regression test
    - [x] Unit Test
    - [x] Manual test
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [ ] No.
    - [x] Yes. `WARM UP` supports table-level `ON TABLES` filters for
      event-driven load warm-up, and warm-up job output/metrics expose table
      filter, matched tables, SyncStats, and trigger-gap information.
- Does this need documentation?
    - [ ] No.
    - [x] Yes. apache/doris-website#3829
…3832) (#64602)

## Proposed changes

Backport #63832 to `branch-4.1`: add table-level event-driven warm-up support and related FE/BE metrics, parsing, show output, and regression coverage.

This backport also includes a small `doris-compose` compatibility fix for this branch so the cloud docker regression runner can accept `up --env` and initialize `external_ms_cluster` before cloud cluster setup. Without it, the docker suites fail during compose argument handling before product code is exercised.

## Validation

- `git diff HEAD^ HEAD --check`
- `./run-be-ut.sh --clean --run --coverage --filter=CloudWarmUpManagerTest.*:CloudWarmUpManagerFilterTest.*:MBvarWindowedAdderTest.* -j100`
- `./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest,org.apache.doris.cloud.CloudWarmUpJobTableFilterTest,org.apache.doris.cloud.OnTablesFilterTest,org.apache.doris.cloud.WarmUpClusterOnTablesParseTest,org.apache.doris.cloud.WarmUpStatsTest,org.apache.doris.cloud.catalog.CloudInstanceStatusCheckerTest,org.apache.doris.metric.MetricsTest`
- `./build.sh --be --fe --cloud -j100`
- `docker build -f docker/runtime/doris-compose/Dockerfile -t bh-cluster-2 .`
- `DORIS_FDB_IMAGE=foundationdb/foundationdb:7.1.26-single-layer ./run-regression-test.sh --run -d regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/on_tables -runMode=cloud -dockerSuiteParallel 1 -image bh-cluster-2`
- Result: `Test 19 suites, failed 0 suites, fatal 0 scripts, skipped 0 scripts`