test: gate memory-path benchmarks in baselines (parse + ZIP export) by clean6378-max-it · Pull Request #97 · cppalliance/claude-code-chat-browser

clean6378-max-it · 2026-06-24T19:29:08Z

Closes #96

Summary

Integrates memory-sensitive code paths into the tracked pytest-benchmark suite and populates benchmarks/baselines.json for CI regression gating (>20%).

tests/benchmarks/conftest.py: tracemalloc_peak fixture for peak Python heap measurement
tests/benchmarks/test_parse_memory.py: new test_parse_large_peak_memory records extra_info.peak_bytes; existing test_large_parse_peak_memory_under_ceiling unchanged
tests/benchmarks/test_export_bench.py: new test_bulk_export_zip_peak_memory for 10/50/100 session corpora (ZIP sink + tracemalloc)
scripts/check_benchmark_regression.py / reduce_baselines.py: prefer extra_info.peak_bytes over stats.mean when present; memory metrics reported in bytes
benchmarks/baselines.json: populated parse (latency + memory), export (latency + memory), search (latency); 1.5× slack from local reference run — refresh from ubuntu-latest CI artifact after first green job
benchmarks/README.md: documents memory benchmarks and gated groups

Sprint item #1 (June Week 4, Wednesday).

Test plan

pytest tests/benchmarks/ --benchmark-only -o addopts=
pytest tests/benchmarks/test_parse_memory.py (ceiling test passes; benchmark variant skipped via --benchmark-skip)
python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
pytest tests/test_check_benchmark_regression.py tests/test_reduce_baselines.py --no-cov -o addopts=
Confirm benchmarks CI job green on ubuntu-latest

Summary by CodeRabbit

New Features
- Added peak-memory tracking to parsing and exporting benchmarks, including new large-parse peak-heap and ZIP export peak-memory coverage.
- Added search benchmark baseline coverage.
Bug Fixes
- Improved benchmark regression gating to consistently use the intended metrics and to fail when required gated results are missing.
Documentation
- Updated CI gate and baseline refresh docs, plus the new local baseline seeding workflow.
Tests
- Expanded benchmark/regression tests for peak-byte metrics and missing gated-result behavior.
Chores
- Added/updated benchmark-related Makefile targets for seeding and cleaning artifacts.

coderabbitai · 2026-06-24T19:29:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1168aa3d-77d4-46dc-b035-b76922419e36

📥 Commits

Reviewing files that changed from the base of the PR and between 2fb39bc and 1c66666.

📒 Files selected for processing (3)

Makefile
benchmarks/README.md
tests/benchmarks/conftest.py

🚧 Files skipped from review as they are similar to previous changes (3)

tests/benchmarks/conftest.py
Makefile
benchmarks/README.md

📝 Walkthrough

Walkthrough

The PR adds tracked peak-memory benchmarks for parse and ZIP export, updates regression tooling to read and print peak-byte metrics, and refreshes baseline data plus benchmark workflow instructions.

Changes

Memory-sensitive benchmark gating

Layer / File(s)	Summary
Tracemalloc fixture `tests/benchmarks/conftest.py`	Adds a helper and pytest fixture that measure peak traced memory for a callable, and updates JSONL generation timestamps for benchmark corpora.
Parse and export peak benchmarks `tests/benchmarks/test_parse_memory.py`, `tests/benchmarks/test_export_bench.py`	Adds the tracked parse peak-memory benchmark and the ZIP export peak-memory benchmark, and centralizes export project setup for benchmark runs.
Regression checks and metric formatting `scripts/check_benchmark_regression.py`, `tests/test_check_benchmark_regression.py`	Updates benchmark result loading, metric classification, regression output, and regression tests to handle peak-bytes metrics and missing gated results.
Baseline reduction and data `scripts/reduce_baselines.py`, `benchmarks/baselines.json`, `tests/test_reduce_baselines.py`	Updates baseline reduction, baseline metadata, populated baseline groups, and reducer tests for memory and latency values.
Benchmark workflow commands `Makefile`, `benchmarks/README.md`, `.gitignore`	Updates benchmark workflow commands, ignore rules, and README instructions for generating, reducing, and cleaning benchmark artifacts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

cppalliance/claude-code-chat-browser#76: Changes the same benchmark harness pieces, including tests/benchmarks/conftest.py and tests/benchmarks/test_parse_memory.py.
cppalliance/claude-code-chat-browser#91: Updates the benchmark regression gate and baseline tooling that this PR extends for peak-memory metrics.

Suggested reviewers

timon0305
wpak-ai

Poem

A rabbit counted bytes by moonlight glow,
Then zipped and parsed the garden row.
Peak heaps hopped neatly into line,
And baselines now hum “all is fine.”
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.92% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title concisely describes the main change: gating memory-path benchmarks for parse and ZIP export baselines.
Linked Issues check	✅ Passed	The PR adds the requested parse and ZIP-export peak-memory benchmarks, populates parse/export/search baselines, and updates the CI gate to fail >20% regressions.
Out of Scope Changes check	✅ Passed	The supporting script, Makefile, README, and .gitignore changes all directly serve the benchmark gating work.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch test/memory-path-benchmarks

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

tests/benchmarks/conftest.py (1)

22-33: 🩺 Stability & Availability | 🔵 Trivial

Make measure() preserve any existing tracemalloc session
tracemalloc.start() is fine when tracing is already active, but clear_traces() still wipes the current trace history and stop() ends the caller’s session. Track whether this helper started tracing, stop only in that case, and use reset_peak() for the peak-only reset.
scripts/check_benchmark_regression.py (1)
25-40: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Metric selection and display format use two different heuristics.

benchmark_entry_mean selects the gated value by extra_info.peak_bytes presence, but is_memory_metric decides byte-vs-seconds formatting from the name substring "peak_memory". These agree for the current benchmarks, but any future memory benchmark whose name omits peak_memory would have its byte value rendered as …s. Consider deriving the format from the same signal used for selection.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 25 - 40, The metric
selection and display format logic are using different signals, so update the
formatting path to derive bytes-vs-seconds from the same source used by
benchmark_entry_mean. In particular, align is_memory_metric with the
extra_info.peak_bytes check (or refactor the shared decision into a single
helper) so future memory benchmarks without "peak_memory" in the name still
render in bytes. Keep benchmark_entry_mean and the formatting logic consistent
by using one shared heuristic for both selection and display.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-21: The committed benchmark baselines were captured on Windows
even though the gating process expects ubuntu-latest values, so update the
baselines source to match the documented CI environment before relying on them.
Refresh the entries in baselines.json from an ubuntu-latest CI artifact,
especially the gated metrics under the parse, export, and search groups, and
keep the reducer’s _note in sync with the intended source machine. If an
ubuntu-latest artifact is not yet available, temporarily widen the slack in the
gating setup rather than committing Windows-derived values.

In `@tests/benchmarks/test_parse_memory.py`:
- Around line 34-49: The `test_parse_large_peak_memory` benchmark mixes
`tracemalloc` overhead into the `parse` group timing, so its measured latency
should not be treated as the baseline. Keep the memory sampling in
`tracemalloc_peak.measure` and the `benchmark(_run)` flow as-is, but make sure
the regression tooling uses `benchmark.extra_info["peak_bytes"]` for this test
and that the actual `parse` latency baseline comes from the non-memory parse
benchmark (`parse_session` elsewhere), not from this inflated benchmark mean.

---

Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 25-40: The metric selection and display format logic are using
different signals, so update the formatting path to derive bytes-vs-seconds from
the same source used by benchmark_entry_mean. In particular, align
is_memory_metric with the extra_info.peak_bytes check (or refactor the shared
decision into a single helper) so future memory benchmarks without "peak_memory"
in the name still render in bytes. Keep benchmark_entry_mean and the formatting
logic consistent by using one shared heuristic for both selection and display.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7d84d69-a5e1-48b4-af3d-9c987fc946c0

📥 Commits

Reviewing files that changed from the base of the PR and between 2c1fd9d and 93cb445.

📒 Files selected for processing (9)

benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/benchmarks/conftest.py
tests/benchmarks/test_export_bench.py
tests/benchmarks/test_parse_memory.py
tests/test_check_benchmark_regression.py
tests/test_reduce_baselines.py

clean6378-max-it · 2026-06-24T19:45:05Z

@CodeRabbit, review this PR in more depth

coderabbitai · 2026-06-24T19:47:56Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== tests/benchmarks/conftest.py ==="
cat tests/benchmarks/conftest.py

echo ""
echo "=== tests/benchmarks/test_parse_memory.py ==="
cat tests/benchmarks/test_parse_memory.py

echo ""
echo "=== tests/benchmarks/test_export_bench.py ==="
cat tests/benchmarks/test_export_bench.py

echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.py

Length of output: 33906

🏁 Script executed:

#!/bin/bash
# Check if Makefile has update-baselines target
if [ -f Makefile ]; then
    grep -n "update-baselines" Makefile || echo "target not found in Makefile"
else
    echo "No Makefile found"
fi

# Check if tracemalloc is used elsewhere that could conflict
rg -n "tracemalloc" --type py -C2

# Check BenchmarkDataError inheritance
rg -n "class BenchmarkDataError" --type py -C3

# Check if coverage/pytest plugins use tracemalloc
rg -rn "tracemalloc" requirements*.txt pyproject.toml || true

Length of output: 895

🏁 Script executed:

#!/bin/bash
# Check if Makefile has update-baselines target
echo "=== Makefile update-baselines ==="
grep -n "update-baselines" Makefile 2>/dev/null || echo "not found"

echo ""
echo "=== BenchmarkDataError class definition ==="
rg -n "class BenchmarkDataError" --type py -A3

echo ""
echo "=== tracemalloc usages ==="
rg -n "tracemalloc" --type py

echo ""
echo "=== requirements / pyproject tracemalloc refs ==="
rg -n "tracemalloc\|coverage\|pytest-cov" pyproject.toml requirements*.txt 2>/dev/null || true

Length of output: 1015

🏁 Script executed:

#!/bin/bash
# Run each command separately and ignore errors

echo "=== Makefile update-baselines target ==="
grep -n -A5 "update-baselines" Makefile 2>/dev/null || echo "not found"

echo ""
echo "=== BenchmarkDataError class ==="
grep -rn "class BenchmarkDataError" --include="*.py" . 2>/dev/null || echo "not found"

echo ""
echo "=== tracemalloc usages ==="
grep -rn "tracemalloc" --include="*.py" . 2>/dev/null || echo "not found"

echo ""
echo "=== pytest-cov / coverage in deps ==="
grep -rn "pytest-cov\|coverage" pyproject.toml 2>/dev/null || echo "not found"
grep -rn "pytest-cov\|coverage" requirements*.txt 2>/dev/null || echo "not found"

Length of output: 3008

@clean6378-max-it Here is an in-depth review of PR #97.

Deep Review: Memory-Path Benchmarks & CI Regression Gate

🔴 High Severity

1. `TracemallocPeak.measure()` unconditionally stops a pre-existing tracemalloc session

File: tests/benchmarks/conftest.py, Lines 26–33

tracemalloc.start()
tracemalloc.clear_traces()
try:
    result = func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    return result, peak
finally:
    tracemalloc.stop()   # ← always fires, even if tracing was active before the call

tracemalloc.stop() in finally unconditionally tears down the global tracing state. If any outer scope (another measure() call, a future plugin, or the ceiling test running in the same process) had already started tracing, this silently kills it. The correct pattern is to snapshot state, conditionally start, and restore on exit:

def measure(self, func: Callable[..., T], /, *args: Any, **kwargs: Any) -> tuple[T, int]:
    was_tracing = tracemalloc.is_tracing()
    tracemalloc.start()
    tracemalloc.clear_traces()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
        return result, peak
    finally:
        if not was_tracing:
            tracemalloc.stop()

The same issue exists in test_large_parse_peak_memory_under_ceiling (Lines 22–29 of test_parse_memory.py).

2. `BenchmarkDataError` is silently re-wrapped in `load_results`

File: scripts/check_benchmark_regression.py, Lines 62–70

BenchmarkDataError extends ValueError. In load_results, the broad except clause catches it:

try:
    name = entry["name"]
    mean = benchmark_entry_mean(entry)   # can raise BenchmarkDataError(ValueError)
except (KeyError, TypeError, ValueError) as exc:
    raise BenchmarkDataError(
        f"{path} benchmarks[{index}] missing 'name' or measurable value"  # ← generic; original lost
    ) from exc

A specific error like "benchmark 'test_parse_large_peak_memory' missing 'stats.mean' or extra_info.peak_bytes" becomes the less informative "benchmarks[2] missing 'name' or measurable value" in the primary message. The __cause__ chain preserves it, but users running CI only see the outer message. Fix: catch BenchmarkDataError separately and re-raise without wrapping, or narrow the except clause:

except BenchmarkDataError:
    raise
except (KeyError, TypeError, ValueError) as exc:
    raise BenchmarkDataError(
        f"{path} benchmarks[{index}] missing 'name' or measurable value"
    ) from exc

3. `assert` used for runtime safety in a production script

File: scripts/check_benchmark_regression.py, Line 44

assert isinstance(extra, dict)  # disabled by python -O

Python optimized mode (python -O) silently skips assert statements. This is a CI script — replace with an explicit guard:

if not isinstance(extra, dict):
    raise BenchmarkDataError(f"extra_info for {entry.get('name')!r} is not a dict")

🟡 Medium Severity

4. Assertion inside benchmark loop inflates measurements

File: tests/benchmarks/test_export_bench.py, Line 80

def _run() -> None:
    ...
    result, peak = tracemalloc_peak.measure(_export)
    assert result.exported_session_count > 0   # ← runs every iteration
    peaks.append(peak)

benchmark(_run)

The assert executes on every warm-up and timed round, adding noise. Move it outside the benchmark loop:

results: list[BulkExportResult] = []

def _run() -> None:
    result, peak = tracemalloc_peak.measure(_export)
    results.append(result)
    peaks.append(peak)

benchmark(_run)
assert results and results[-1].exported_session_count > 0

5. `test_metric_is_bytes_uses_extra_info_without_name_hint` doesn't isolate the `extra_info` path

File: tests/test_check_benchmark_regression.py, Lines 177–185

def test_metric_is_bytes_uses_extra_info_without_name_hint() -> None:
    entry = {
        "name": "test_custom_memory",   # ← contains "memory"; name heuristic also fires
        ...
        "extra_info": {"peak_bytes": 1_000_000},
    }
    assert metric_is_bytes("test_custom_memory", entry)

Because "peak_memory" in "test_custom_memory" is False but "memory" is not the substring being checked, this actually works — "peak_memory" is the exact substring. Let me be more precise: "peak_memory" in "test_custom_memory" → False. But to truly test that extra_info drives the result independently of the name, use a name with no memory-related substring at all:

entry = {
    "name": "test_export_latency",   # no "peak_memory" in name
    "stats": {"mean": 0.05},
    "extra_info": {"peak_bytes": 1_000_000},
}
assert metric_is_bytes("test_export_latency", entry)

6. `test_search_full_corpus` baseline is dead data

File: benchmarks/baselines.json

"search": {
    "test_search_full_corpus": 0.001091
}

EXCLUDED_FROM_GATE in check_benchmark_regression.py contains "test_search_full_corpus", so this baseline is loaded, iterated, and then silently skipped every CI run. Either remove it from the JSON (if it will never be gated) or remove it from EXCLUDED_FROM_GATE (if it's stable enough to gate). The current state can mislead future maintainers into thinking it's being enforced.

7. `make update-baselines` seeds baselines from the local machine, not `ubuntu-latest`

File: Makefile, Lines 3–5

update-baselines:
    pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
    python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json

The README correctly warns that baselines must be captured on ubuntu-latest, but the Makefile target does the opposite — it runs locally. A developer running make update-baselines on macOS or Windows will commit baselines that differ from CI timing, causing spurious failures or masking real regressions. Consider adding a guard or renaming the target to seed-baselines-local with a prominent warning comment.

🔵 Low Severity

8. Fragile dual-import pattern in `reduce_baselines.py`

File: scripts/reduce_baselines.py, Lines 13–16

try:
    from scripts.check_benchmark_regression import BenchmarkDataError, benchmark_entry_mean
except ModuleNotFoundError:
    from check_benchmark_regression import BenchmarkDataError, benchmark_entry_mean

This breaks silently if both paths exist but one is stale. The standard fix is to add scripts/ to PYTHONPATH when invoking the script, or to make it a proper package with __init__.py and use python -m scripts.reduce_baselines.

9. `_note` in `baselines.json` omits the slack factor

File: benchmarks/baselines.json

The PR says baselines were seeded with 1.5× slack, but the _note field only says "Gated means from ubuntu-latest CI" without mentioning the multiplier. Future maintainers won't know the values already include headroom. Add it:

"_note": "... Values multiplied by 1.5× slack at seeding time. ..."

Summary Table

#	File	Severity	Issue
1	`conftest.py`	🔴 High	`tracemalloc.stop()` kills pre-existing tracing session
2	`check_benchmark_regression.py`	🔴 High	`BenchmarkDataError` caught and re-wrapped with generic message
3	`check_benchmark_regression.py`	🔴 High	`assert` disabled by `python -O`
4	`test_export_bench.py`	🟡 Medium	Assertion inside benchmark loop inflates timing
5	`test_check_benchmark_regression.py`	🟡 Medium	Test doesn't truly isolate `extra_info` path
6	`baselines.json`	🟡 Medium	`test_search_full_corpus` baseline is never compared
7	`Makefile`	🟡 Medium	`make update-baselines` runs locally, risks cross-OS skew
8	`reduce_baselines.py`	🔵 Low	Fragile dual-import pattern
9	`baselines.json`	🔵 Low	1.5× slack factor not documented in `_note`

The overall architecture (fixture → extra_info.peak_bytes → reduce_baselines → gate script) is sound and the test coverage is good. The items above are correctness/safety issues that should be resolved before the CI gate is relied upon, with item 1 (tracemalloc state corruption) being the most important to address.

@timon0305

…ct / f"session_{i:04d}.jsonl", 20) # Unique first_timestamp per session so export filenames do not collide in ZIP benches. first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" @timon0305 timon0305 1 minute ago Member first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" is only unique for count ≤ 120 (collides at lcm(24,60)). Add assert count <= 120 (or a wider format) so a future corpus bump can't produce colliding ZIP entries.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

tests/benchmarks/conftest.py (1)

104-106: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Guard timestamp generation against invalid hour values for larger corpora.

first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z" becomes non-ISO once i // 60 >= 24 (e.g., T24:00:00Z). If session counts increase, this will produce invalid timestamps and degrade timestamp-based metadata behavior.

Suggested fix

+from datetime import datetime, timedelta, timezone
...
-        first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z"
+        base = datetime(2026, 6, 12, tzinfo=timezone.utc)
+        first_ts = (base + timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:%SZ")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmarks/conftest.py` around lines 104 - 106, The timestamp
generation in the benchmark session loop can become invalid once the computed
hour exceeds 23, so update the `first_ts` logic in `conftest.py` to use a
rollover-safe datetime calculation instead of formatting `i // 60` directly. Use
a real base date/time plus a `timedelta` derived from `i` and pass the resulting
ISO timestamp into `write_jsonl`, keeping the unique-per-session behavior while
ensuring `first_timestamp` always stays valid for larger corpora.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/README.md`:
- Around line 63-68: The manual baseline-refresh commands in the README are
missing the same slack factor used in the recommended workflow, which can lead
to inconsistent baselines. Update the benchmark refresh instructions so the
`pytest` command in the manual flow includes the same `--slack 1.5` option as
the main benchmark command, keeping the documented flow consistent with the
baseline generation path.

In `@Makefile`:
- Line 9: The seed-baselines-local target is missing the documented slack value,
so update the Makefile command that invokes reduce_baselines.py to pass the same
1.5x slack used elsewhere. Use the seed-baselines-local recipe and the
reduce_baselines.py invocation as the reference points, and make sure the
command includes the --slack 1.5 argument so local baseline generation matches
the documented behavior.

---

Duplicate comments:
In `@tests/benchmarks/conftest.py`:
- Around line 104-106: The timestamp generation in the benchmark session loop
can become invalid once the computed hour exceeds 23, so update the `first_ts`
logic in `conftest.py` to use a rollover-safe datetime calculation instead of
formatting `i // 60` directly. Use a real base date/time plus a `timedelta`
derived from `i` and pass the resulting ISO timestamp into `write_jsonl`,
keeping the unique-per-session behavior while ensuring `first_timestamp` always
stays valid for larger corpora.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 23f3e4da-51a2-403a-aa00-906177a5490e

📥 Commits

Reviewing files that changed from the base of the PR and between d4dee52 and 2fb39bc.

📒 Files selected for processing (11)

.gitignore
Makefile
benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/benchmarks/conftest.py
tests/benchmarks/test_export_bench.py
tests/benchmarks/test_parse_memory.py
tests/test_check_benchmark_regression.py
tests/test_reduce_baselines.py

✅ Files skipped from review due to trivial changes (1)

.gitignore

🚧 Files skipped from review as they are similar to previous changes (5)

benchmarks/baselines.json
tests/test_reduce_baselines.py
scripts/reduce_baselines.py
scripts/check_benchmark_regression.py
tests/benchmarks/test_export_bench.py

clean6378-max-it · 2026-06-24T21:26:57Z

@CodeRabbit @timon0305 comments are addressed?

coderabbitai · 2026-06-24T21:28:31Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== tests/benchmarks/conftest.py (tracemalloc stop fix) ==="
cat tests/benchmarks/conftest.py

echo ""
echo "=== scripts/check_benchmark_regression.py (BenchmarkDataError + assert fix) ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== tests/benchmarks/test_export_bench.py (assertion inside loop fix) ==="
cat tests/benchmarks/test_export_bench.py

echo ""
echo "=== tests/benchmarks/test_parse_memory.py (tracemalloc stop fix) ==="
cat tests/benchmarks/test_parse_memory.py

echo ""
echo "=== tests/test_check_benchmark_regression.py (test isolation fix) ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== benchmarks/baselines.json (dead entry + slack note fix) ==="
cat benchmarks/baselines.json

echo ""
echo "=== Makefile (update-baselines warning fix) ==="
cat Makefile

echo ""
echo "=== scripts/reduce_baselines.py (dual-import fix) ==="
cat scripts/reduce_baselines.py

Length of output: 31027

@clean6378-max-it Here's a full status report on all 9 items.

✅ Fully Addressed (7/9)

#	Issue	Status
1	`tracemalloc.stop()` kills pre-existing tracing session	✅ Fixed — `was_tracing` guard added; `finally` only calls `tracemalloc.stop()` when tracing was not already active
2	`BenchmarkDataError` caught and re-wrapped with generic message	✅ Fixed — `except BenchmarkDataError: raise` is now the first clause before the generic handler
3	`assert isinstance(extra, dict)` disabled by `python -O`	✅ Fixed — replaced with `if not isinstance(extra, dict): raise BenchmarkDataError(...)`
4	Assertion inside benchmark loop inflates timing	✅ Fixed — `results` list collects inside `_run()`, assertion `results[-1].exported_session_count > 0` is outside `benchmark(_run)`
5	Test name `test_custom_memory` triggers name heuristic, masking `extra_info` path	✅ Fixed — renamed to `test_export_latency` which has no `"peak_memory"` substring
7	`make update-baselines` runs locally, risks cross-OS skew	✅ Fixed — renamed to `seed-baselines-local` with a `@echo "WARNING: ..."` stderr notice; `update-baselines` is now an explicit deprecated alias
8	Fragile dual-import `try/except ModuleNotFoundError` in `reduce_baselines.py`	✅ Fixed — single direct `from scripts.check_benchmark_regression import ...`

🟡 Partially Addressed (2/9)

Issue 6 — test_search_full_corpus baseline is dead data

The entry is still present in both baselines.json and EXCLUDED_FROM_GATE. The _note now explicitly documents the intent:

"Excluded from gate (recorded for reference): test_parse_session_small, test_search_full_corpus (sub-ms CI noise)."

The intent is now clear to future maintainers, so this is acceptable. The structural ambiguity is resolved by the documentation even if the entry is never compared.

Issue 9 — _note omits the 1.5× slack factor

reduce_baselines.py now correctly embeds "Values multiplied by {slack}× slack at generation time." when --slack is passed. However, the current benchmarks/baselines.json _note reads:

"Gated means from ubuntu-latest CI benchmark-results.json (PR `#97`, run 28126772276)..."

There is no mention of 1.5×. This suggests the file was refreshed from CI with slack=1.0 (or written manually), so the on-disk values may not have headroom applied. If the baselines were intentionally seeded without a slack multiplier (relying on CI natural variance), that's fine — but worth confirming. If they were meant to include 1.5× slack, a regeneration with --slack 1.5 is needed.

Net result: 7 of 9 issues are cleanly resolved. Issue 6 is documented away acceptably. Issue 9 is a tooling fix — the question is whether the current baselines.json on disk actually has 1.5× slack baked in (the _note doesn't say so). Would you like to regenerate the baselines with --slack 1.5 to make it explicit?

test: gate memory-path benchmarks in baselines (parse + ZIP export)

93cb445

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread benchmarks/baselines.json Outdated

Comment thread tests/benchmarks/test_parse_memory.py

fix(ci): seed ubuntu baselines and fix ruff/mypy benchmark regressions

d4dee52

clean6378-max-it added 2 commits June 25, 2026 04:09

fix(test): harden benchmark gate and tracemalloc fixtures (PR review)

717d1ea

fix(ci): ruff format check_benchmark_regression.py

0f5e1ab

timon0305 requested changes Jun 24, 2026

View reviewed changes

Comment thread benchmarks/baselines.json

Comment thread scripts/reduce_baselines.py Outdated

Comment thread benchmarks/baselines.json Outdated

Comment thread tests/benchmarks/conftest.py Outdated

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread benchmarks/README.md

Comment thread Makefile Outdated

fix(test): align baseline refresh slack and export timestamp rollover

1c66666

clean6378-max-it requested a review from timon0305 June 24, 2026 21:33

timon0305 approved these changes Jun 24, 2026

View reviewed changes

timon0305 requested a review from wpak-ai June 24, 2026 21:41

wpak-ai approved these changes Jun 24, 2026

View reviewed changes

wpak-ai merged commit 7f1e0cf into master Jun 24, 2026
16 checks passed

wpak-ai deleted the test/memory-path-benchmarks branch June 24, 2026 21:47

Uh oh!

Conversation

clean6378-max-it commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

clean6378-max-it commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Deep Review: Memory-Path Benchmarks & CI Regression Gate

🔴 High Severity

1. TracemallocPeak.measure() unconditionally stops a pre-existing tracemalloc session

2. BenchmarkDataError is silently re-wrapped in load_results

3. assert used for runtime safety in a production script

🟡 Medium Severity

4. Assertion inside benchmark loop inflates measurements

5. test_metric_is_bytes_uses_extra_info_without_name_hint doesn't isolate the extra_info path

6. test_search_full_corpus baseline is dead data

7. make update-baselines seeds baselines from the local machine, not ubuntu-latest

🔵 Low Severity

8. Fragile dual-import pattern in reduce_baselines.py

9. _note in baselines.json omits the slack factor

Summary Table

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

clean6378-max-it commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026

✅ Fully Addressed (7/9)

🟡 Partially Addressed (2/9)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clean6378-max-it commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

1. `TracemallocPeak.measure()` unconditionally stops a pre-existing tracemalloc session

2. `BenchmarkDataError` is silently re-wrapped in `load_results`

3. `assert` used for runtime safety in a production script

5. `test_metric_is_bytes_uses_extra_info_without_name_hint` doesn't isolate the `extra_info` path

6. `test_search_full_corpus` baseline is dead data

7. `make update-baselines` seeds baselines from the local machine, not `ubuntu-latest`

8. Fragile dual-import pattern in `reduce_baselines.py`

9. `_note` in `baselines.json` omits the slack factor