test: gate memory-path benchmarks in baselines (parse + ZIP export)#97
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe PR adds tracked peak-memory benchmarks for parse and ZIP export, updates regression tooling to read and print peak-byte metrics, and refreshes baseline data plus benchmark workflow instructions. ChangesMemory-sensitive benchmark gating
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/benchmarks/conftest.py (1)
22-33: 🩺 Stability & Availability | 🔵 TrivialMake
measure()preserve any existingtracemallocsession
tracemalloc.start()is fine when tracing is already active, butclear_traces()still wipes the current trace history andstop()ends the caller’s session. Track whether this helper started tracing, stop only in that case, and usereset_peak()for the peak-only reset.scripts/check_benchmark_regression.py (1)
25-40: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueMetric selection and display format use two different heuristics.
benchmark_entry_meanselects the gated value byextra_info.peak_bytespresence, butis_memory_metricdecides byte-vs-seconds formatting from the name substring"peak_memory". These agree for the current benchmarks, but any future memory benchmark whose name omitspeak_memorywould have its byte value rendered as…s. Consider deriving the format from the same signal used for selection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 25 - 40, The metric selection and display format logic are using different signals, so update the formatting path to derive bytes-vs-seconds from the same source used by benchmark_entry_mean. In particular, align is_memory_metric with the extra_info.peak_bytes check (or refactor the shared decision into a single helper) so future memory benchmarks without "peak_memory" in the name still render in bytes. Keep benchmark_entry_mean and the formatting logic consistent by using one shared heuristic for both selection and display.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-21: The committed benchmark baselines were captured on Windows
even though the gating process expects ubuntu-latest values, so update the
baselines source to match the documented CI environment before relying on them.
Refresh the entries in baselines.json from an ubuntu-latest CI artifact,
especially the gated metrics under the parse, export, and search groups, and
keep the reducer’s _note in sync with the intended source machine. If an
ubuntu-latest artifact is not yet available, temporarily widen the slack in the
gating setup rather than committing Windows-derived values.
In `@tests/benchmarks/test_parse_memory.py`:
- Around line 34-49: The `test_parse_large_peak_memory` benchmark mixes
`tracemalloc` overhead into the `parse` group timing, so its measured latency
should not be treated as the baseline. Keep the memory sampling in
`tracemalloc_peak.measure` and the `benchmark(_run)` flow as-is, but make sure
the regression tooling uses `benchmark.extra_info["peak_bytes"]` for this test
and that the actual `parse` latency baseline comes from the non-memory parse
benchmark (`parse_session` elsewhere), not from this inflated benchmark mean.
---
Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 25-40: The metric selection and display format logic are using
different signals, so update the formatting path to derive bytes-vs-seconds from
the same source used by benchmark_entry_mean. In particular, align
is_memory_metric with the extra_info.peak_bytes check (or refactor the shared
decision into a single helper) so future memory benchmarks without "peak_memory"
in the name still render in bytes. Keep benchmark_entry_mean and the formatting
logic consistent by using one shared heuristic for both selection and display.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f7d84d69-a5e1-48b4-af3d-9c987fc946c0
📒 Files selected for processing (9)
benchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/benchmarks/conftest.pytests/benchmarks/test_export_bench.pytests/benchmarks/test_parse_memory.pytests/test_check_benchmark_regression.pytests/test_reduce_baselines.py
|
@CodeRabbit, review this PR in more depth |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== tests/benchmarks/conftest.py ==="
cat tests/benchmarks/conftest.py
echo ""
echo "=== tests/benchmarks/test_parse_memory.py ==="
cat tests/benchmarks/test_parse_memory.py
echo ""
echo "=== tests/benchmarks/test_export_bench.py ==="
cat tests/benchmarks/test_export_bench.py
echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py
echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py
echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json
echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md
echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py
echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.pyLength of output: 33906 🏁 Script executed: #!/bin/bash
# Check if Makefile has update-baselines target
if [ -f Makefile ]; then
grep -n "update-baselines" Makefile || echo "target not found in Makefile"
else
echo "No Makefile found"
fi
# Check if tracemalloc is used elsewhere that could conflict
rg -n "tracemalloc" --type py -C2
# Check BenchmarkDataError inheritance
rg -n "class BenchmarkDataError" --type py -C3
# Check if coverage/pytest plugins use tracemalloc
rg -rn "tracemalloc" requirements*.txt pyproject.toml || trueLength of output: 895 🏁 Script executed: #!/bin/bash
# Check if Makefile has update-baselines target
echo "=== Makefile update-baselines ==="
grep -n "update-baselines" Makefile 2>/dev/null || echo "not found"
echo ""
echo "=== BenchmarkDataError class definition ==="
rg -n "class BenchmarkDataError" --type py -A3
echo ""
echo "=== tracemalloc usages ==="
rg -n "tracemalloc" --type py
echo ""
echo "=== requirements / pyproject tracemalloc refs ==="
rg -n "tracemalloc\|coverage\|pytest-cov" pyproject.toml requirements*.txt 2>/dev/null || trueLength of output: 1015 🏁 Script executed: #!/bin/bash
# Run each command separately and ignore errors
echo "=== Makefile update-baselines target ==="
grep -n -A5 "update-baselines" Makefile 2>/dev/null || echo "not found"
echo ""
echo "=== BenchmarkDataError class ==="
grep -rn "class BenchmarkDataError" --include="*.py" . 2>/dev/null || echo "not found"
echo ""
echo "=== tracemalloc usages ==="
grep -rn "tracemalloc" --include="*.py" . 2>/dev/null || echo "not found"
echo ""
echo "=== pytest-cov / coverage in deps ==="
grep -rn "pytest-cov\|coverage" pyproject.toml 2>/dev/null || echo "not found"
grep -rn "pytest-cov\|coverage" requirements*.txt 2>/dev/null || echo "not found"Length of output: 3008
Deep Review: Memory-Path Benchmarks & CI Regression Gate🔴 High Severity1.
|
| # | File | Severity | Issue |
|---|---|---|---|
| 1 | conftest.py |
🔴 High | tracemalloc.stop() kills pre-existing tracing session |
| 2 | check_benchmark_regression.py |
🔴 High | BenchmarkDataError caught and re-wrapped with generic message |
| 3 | check_benchmark_regression.py |
🔴 High | assert disabled by python -O |
| 4 | test_export_bench.py |
🟡 Medium | Assertion inside benchmark loop inflates timing |
| 5 | test_check_benchmark_regression.py |
🟡 Medium | Test doesn't truly isolate extra_info path |
| 6 | baselines.json |
🟡 Medium | test_search_full_corpus baseline is never compared |
| 7 | Makefile |
🟡 Medium | make update-baselines runs locally, risks cross-OS skew |
| 8 | reduce_baselines.py |
🔵 Low | Fragile dual-import pattern |
| 9 | baselines.json |
🔵 Low | 1.5× slack factor not documented in _note |
The overall architecture (fixture → extra_info.peak_bytes → reduce_baselines → gate script) is sound and the test coverage is good. The items above are correctness/safety issues that should be resolved before the CI gate is relied upon, with item 1 (tracemalloc state corruption) being the most important to address.
…ct / f"session_{i:04d}.jsonl", 20) # Unique first_timestamp per session so export filenames do not collide in ZIP benches. first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" @timon0305 timon0305 1 minute ago Member first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" is only unique for count ≤ 120 (collides at lcm(24,60)). Add assert count <= 120 (or a wider format) so a future corpus bump can't produce colliding ZIP entries.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/benchmarks/conftest.py (1)
104-106: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winGuard timestamp generation against invalid hour values for larger corpora.
first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z"becomes non-ISO oncei // 60 >= 24(e.g.,T24:00:00Z). If session counts increase, this will produce invalid timestamps and degrade timestamp-based metadata behavior.Suggested fix
+from datetime import datetime, timedelta, timezone ... - first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z" + base = datetime(2026, 6, 12, tzinfo=timezone.utc) + first_ts = (base + timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:%SZ")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/benchmarks/conftest.py` around lines 104 - 106, The timestamp generation in the benchmark session loop can become invalid once the computed hour exceeds 23, so update the `first_ts` logic in `conftest.py` to use a rollover-safe datetime calculation instead of formatting `i // 60` directly. Use a real base date/time plus a `timedelta` derived from `i` and pass the resulting ISO timestamp into `write_jsonl`, keeping the unique-per-session behavior while ensuring `first_timestamp` always stays valid for larger corpora.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/README.md`:
- Around line 63-68: The manual baseline-refresh commands in the README are
missing the same slack factor used in the recommended workflow, which can lead
to inconsistent baselines. Update the benchmark refresh instructions so the
`pytest` command in the manual flow includes the same `--slack 1.5` option as
the main benchmark command, keeping the documented flow consistent with the
baseline generation path.
In `@Makefile`:
- Line 9: The seed-baselines-local target is missing the documented slack value,
so update the Makefile command that invokes reduce_baselines.py to pass the same
1.5x slack used elsewhere. Use the seed-baselines-local recipe and the
reduce_baselines.py invocation as the reference points, and make sure the
command includes the --slack 1.5 argument so local baseline generation matches
the documented behavior.
---
Duplicate comments:
In `@tests/benchmarks/conftest.py`:
- Around line 104-106: The timestamp generation in the benchmark session loop
can become invalid once the computed hour exceeds 23, so update the `first_ts`
logic in `conftest.py` to use a rollover-safe datetime calculation instead of
formatting `i // 60` directly. Use a real base date/time plus a `timedelta`
derived from `i` and pass the resulting ISO timestamp into `write_jsonl`,
keeping the unique-per-session behavior while ensuring `first_timestamp` always
stays valid for larger corpora.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 23f3e4da-51a2-403a-aa00-906177a5490e
📒 Files selected for processing (11)
.gitignoreMakefilebenchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/benchmarks/conftest.pytests/benchmarks/test_export_bench.pytests/benchmarks/test_parse_memory.pytests/test_check_benchmark_regression.pytests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (5)
- benchmarks/baselines.json
- tests/test_reduce_baselines.py
- scripts/reduce_baselines.py
- scripts/check_benchmark_regression.py
- tests/benchmarks/test_export_bench.py
|
@CodeRabbit @timon0305 comments are addressed? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== tests/benchmarks/conftest.py (tracemalloc stop fix) ==="
cat tests/benchmarks/conftest.py
echo ""
echo "=== scripts/check_benchmark_regression.py (BenchmarkDataError + assert fix) ==="
cat scripts/check_benchmark_regression.py
echo ""
echo "=== tests/benchmarks/test_export_bench.py (assertion inside loop fix) ==="
cat tests/benchmarks/test_export_bench.py
echo ""
echo "=== tests/benchmarks/test_parse_memory.py (tracemalloc stop fix) ==="
cat tests/benchmarks/test_parse_memory.py
echo ""
echo "=== tests/test_check_benchmark_regression.py (test isolation fix) ==="
cat tests/test_check_benchmark_regression.py
echo ""
echo "=== benchmarks/baselines.json (dead entry + slack note fix) ==="
cat benchmarks/baselines.json
echo ""
echo "=== Makefile (update-baselines warning fix) ==="
cat Makefile
echo ""
echo "=== scripts/reduce_baselines.py (dual-import fix) ==="
cat scripts/reduce_baselines.pyLength of output: 31027
✅ Fully Addressed (7/9)
🟡 Partially Addressed (2/9)Issue 6 — The entry is still present in both
The intent is now clear to future maintainers, so this is acceptable. The structural ambiguity is resolved by the documentation even if the entry is never compared. Issue 9 —
There is no mention of Net result: 7 of 9 issues are cleanly resolved. Issue 6 is documented away acceptably. Issue 9 is a tooling fix — the question is whether the current |
Closes #96
Summary
Integrates memory-sensitive code paths into the tracked pytest-benchmark suite and populates
benchmarks/baselines.jsonfor CI regression gating (>20%).tests/benchmarks/conftest.py:tracemalloc_peakfixture for peak Python heap measurementtests/benchmarks/test_parse_memory.py: newtest_parse_large_peak_memoryrecordsextra_info.peak_bytes; existingtest_large_parse_peak_memory_under_ceilingunchangedtests/benchmarks/test_export_bench.py: newtest_bulk_export_zip_peak_memoryfor 10/50/100 session corpora (ZIP sink + tracemalloc)scripts/check_benchmark_regression.py/reduce_baselines.py: preferextra_info.peak_bytesoverstats.meanwhen present; memory metrics reported in bytesbenchmarks/baselines.json: populated parse (latency + memory), export (latency + memory), search (latency); 1.5× slack from local reference run — refresh from ubuntu-latest CI artifact after first green jobbenchmarks/README.md: documents memory benchmarks and gated groupsSprint item #1 (June Week 4, Wednesday).
Test plan
pytest tests/benchmarks/ --benchmark-only -o addopts=pytest tests/benchmarks/test_parse_memory.py(ceiling test passes; benchmark variant skipped via--benchmark-skip)python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.jsonpytest tests/test_check_benchmark_regression.py tests/test_reduce_baselines.py --no-cov -o addopts=benchmarksCI job green on ubuntu-latestSummary by CodeRabbit