Skip to content

test: gate memory-path benchmarks in baselines (parse + ZIP export)#97

Merged
wpak-ai merged 6 commits into
masterfrom
test/memory-path-benchmarks
Jun 24, 2026
Merged

test: gate memory-path benchmarks in baselines (parse + ZIP export)#97
wpak-ai merged 6 commits into
masterfrom
test/memory-path-benchmarks

Conversation

@clean6378-max-it

@clean6378-max-it clean6378-max-it commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Closes #96

Summary

Integrates memory-sensitive code paths into the tracked pytest-benchmark suite and populates benchmarks/baselines.json for CI regression gating (>20%).

  • tests/benchmarks/conftest.py: tracemalloc_peak fixture for peak Python heap measurement
  • tests/benchmarks/test_parse_memory.py: new test_parse_large_peak_memory records extra_info.peak_bytes; existing test_large_parse_peak_memory_under_ceiling unchanged
  • tests/benchmarks/test_export_bench.py: new test_bulk_export_zip_peak_memory for 10/50/100 session corpora (ZIP sink + tracemalloc)
  • scripts/check_benchmark_regression.py / reduce_baselines.py: prefer extra_info.peak_bytes over stats.mean when present; memory metrics reported in bytes
  • benchmarks/baselines.json: populated parse (latency + memory), export (latency + memory), search (latency); 1.5× slack from local reference run — refresh from ubuntu-latest CI artifact after first green job
  • benchmarks/README.md: documents memory benchmarks and gated groups

Sprint item #1 (June Week 4, Wednesday).

Test plan

  • pytest tests/benchmarks/ --benchmark-only -o addopts=
  • pytest tests/benchmarks/test_parse_memory.py (ceiling test passes; benchmark variant skipped via --benchmark-skip)
  • python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
  • pytest tests/test_check_benchmark_regression.py tests/test_reduce_baselines.py --no-cov -o addopts=
  • Confirm benchmarks CI job green on ubuntu-latest

Summary by CodeRabbit

  • New Features
    • Added peak-memory tracking to parsing and exporting benchmarks, including new large-parse peak-heap and ZIP export peak-memory coverage.
    • Added search benchmark baseline coverage.
  • Bug Fixes
    • Improved benchmark regression gating to consistently use the intended metrics and to fail when required gated results are missing.
  • Documentation
    • Updated CI gate and baseline refresh docs, plus the new local baseline seeding workflow.
  • Tests
    • Expanded benchmark/regression tests for peak-byte metrics and missing gated-result behavior.
  • Chores
    • Added/updated benchmark-related Makefile targets for seeding and cleaning artifacts.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1168aa3d-77d4-46dc-b035-b76922419e36

📥 Commits

Reviewing files that changed from the base of the PR and between 2fb39bc and 1c66666.

📒 Files selected for processing (3)
  • Makefile
  • benchmarks/README.md
  • tests/benchmarks/conftest.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/benchmarks/conftest.py
  • Makefile
  • benchmarks/README.md

📝 Walkthrough

Walkthrough

The PR adds tracked peak-memory benchmarks for parse and ZIP export, updates regression tooling to read and print peak-byte metrics, and refreshes baseline data plus benchmark workflow instructions.

Changes

Memory-sensitive benchmark gating

Layer / File(s) Summary
Tracemalloc fixture
tests/benchmarks/conftest.py
Adds a helper and pytest fixture that measure peak traced memory for a callable, and updates JSONL generation timestamps for benchmark corpora.
Parse and export peak benchmarks
tests/benchmarks/test_parse_memory.py, tests/benchmarks/test_export_bench.py
Adds the tracked parse peak-memory benchmark and the ZIP export peak-memory benchmark, and centralizes export project setup for benchmark runs.
Regression checks and metric formatting
scripts/check_benchmark_regression.py, tests/test_check_benchmark_regression.py
Updates benchmark result loading, metric classification, regression output, and regression tests to handle peak-bytes metrics and missing gated results.
Baseline reduction and data
scripts/reduce_baselines.py, benchmarks/baselines.json, tests/test_reduce_baselines.py
Updates baseline reduction, baseline metadata, populated baseline groups, and reducer tests for memory and latency values.
Benchmark workflow commands
Makefile, benchmarks/README.md, .gitignore
Updates benchmark workflow commands, ignore rules, and README instructions for generating, reducing, and cleaning benchmark artifacts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • timon0305
  • wpak-ai

Poem

A rabbit counted bytes by moonlight glow,
Then zipped and parsed the garden row.
Peak heaps hopped neatly into line,
And baselines now hum “all is fine.”
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title concisely describes the main change: gating memory-path benchmarks for parse and ZIP export baselines.
Linked Issues check ✅ Passed The PR adds the requested parse and ZIP-export peak-memory benchmarks, populates parse/export/search baselines, and updates the CI gate to fail >20% regressions.
Out of Scope Changes check ✅ Passed The supporting script, Makefile, README, and .gitignore changes all directly serve the benchmark gating work.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/memory-path-benchmarks

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/benchmarks/conftest.py (1)

22-33: 🩺 Stability & Availability | 🔵 Trivial

Make measure() preserve any existing tracemalloc session
tracemalloc.start() is fine when tracing is already active, but clear_traces() still wipes the current trace history and stop() ends the caller’s session. Track whether this helper started tracing, stop only in that case, and use reset_peak() for the peak-only reset.

scripts/check_benchmark_regression.py (1)

25-40: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Metric selection and display format use two different heuristics.

benchmark_entry_mean selects the gated value by extra_info.peak_bytes presence, but is_memory_metric decides byte-vs-seconds formatting from the name substring "peak_memory". These agree for the current benchmarks, but any future memory benchmark whose name omits peak_memory would have its byte value rendered as …s. Consider deriving the format from the same signal used for selection.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 25 - 40, The metric
selection and display format logic are using different signals, so update the
formatting path to derive bytes-vs-seconds from the same source used by
benchmark_entry_mean. In particular, align is_memory_metric with the
extra_info.peak_bytes check (or refactor the shared decision into a single
helper) so future memory benchmarks without "peak_memory" in the name still
render in bytes. Keep benchmark_entry_mean and the formatting logic consistent
by using one shared heuristic for both selection and display.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-21: The committed benchmark baselines were captured on Windows
even though the gating process expects ubuntu-latest values, so update the
baselines source to match the documented CI environment before relying on them.
Refresh the entries in baselines.json from an ubuntu-latest CI artifact,
especially the gated metrics under the parse, export, and search groups, and
keep the reducer’s _note in sync with the intended source machine. If an
ubuntu-latest artifact is not yet available, temporarily widen the slack in the
gating setup rather than committing Windows-derived values.

In `@tests/benchmarks/test_parse_memory.py`:
- Around line 34-49: The `test_parse_large_peak_memory` benchmark mixes
`tracemalloc` overhead into the `parse` group timing, so its measured latency
should not be treated as the baseline. Keep the memory sampling in
`tracemalloc_peak.measure` and the `benchmark(_run)` flow as-is, but make sure
the regression tooling uses `benchmark.extra_info["peak_bytes"]` for this test
and that the actual `parse` latency baseline comes from the non-memory parse
benchmark (`parse_session` elsewhere), not from this inflated benchmark mean.

---

Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 25-40: The metric selection and display format logic are using
different signals, so update the formatting path to derive bytes-vs-seconds from
the same source used by benchmark_entry_mean. In particular, align
is_memory_metric with the extra_info.peak_bytes check (or refactor the shared
decision into a single helper) so future memory benchmarks without "peak_memory"
in the name still render in bytes. Keep benchmark_entry_mean and the formatting
logic consistent by using one shared heuristic for both selection and display.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7d84d69-a5e1-48b4-af3d-9c987fc946c0

📥 Commits

Reviewing files that changed from the base of the PR and between 2c1fd9d and 93cb445.

📒 Files selected for processing (9)
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_export_bench.py
  • tests/benchmarks/test_parse_memory.py
  • tests/test_check_benchmark_regression.py
  • tests/test_reduce_baselines.py

Comment thread benchmarks/baselines.json Outdated
Comment thread tests/benchmarks/test_parse_memory.py
@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit, review this PR in more depth

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== tests/benchmarks/conftest.py ==="
cat tests/benchmarks/conftest.py

echo ""
echo "=== tests/benchmarks/test_parse_memory.py ==="
cat tests/benchmarks/test_parse_memory.py

echo ""
echo "=== tests/benchmarks/test_export_bench.py ==="
cat tests/benchmarks/test_export_bench.py

echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.py

Length of output: 33906


🏁 Script executed:

#!/bin/bash
# Check if Makefile has update-baselines target
if [ -f Makefile ]; then
    grep -n "update-baselines" Makefile || echo "target not found in Makefile"
else
    echo "No Makefile found"
fi

# Check if tracemalloc is used elsewhere that could conflict
rg -n "tracemalloc" --type py -C2

# Check BenchmarkDataError inheritance
rg -n "class BenchmarkDataError" --type py -C3

# Check if coverage/pytest plugins use tracemalloc
rg -rn "tracemalloc" requirements*.txt pyproject.toml || true

Length of output: 895


🏁 Script executed:

#!/bin/bash
# Check if Makefile has update-baselines target
echo "=== Makefile update-baselines ==="
grep -n "update-baselines" Makefile 2>/dev/null || echo "not found"

echo ""
echo "=== BenchmarkDataError class definition ==="
rg -n "class BenchmarkDataError" --type py -A3

echo ""
echo "=== tracemalloc usages ==="
rg -n "tracemalloc" --type py

echo ""
echo "=== requirements / pyproject tracemalloc refs ==="
rg -n "tracemalloc\|coverage\|pytest-cov" pyproject.toml requirements*.txt 2>/dev/null || true

Length of output: 1015


🏁 Script executed:

#!/bin/bash
# Run each command separately and ignore errors

echo "=== Makefile update-baselines target ==="
grep -n -A5 "update-baselines" Makefile 2>/dev/null || echo "not found"

echo ""
echo "=== BenchmarkDataError class ==="
grep -rn "class BenchmarkDataError" --include="*.py" . 2>/dev/null || echo "not found"

echo ""
echo "=== tracemalloc usages ==="
grep -rn "tracemalloc" --include="*.py" . 2>/dev/null || echo "not found"

echo ""
echo "=== pytest-cov / coverage in deps ==="
grep -rn "pytest-cov\|coverage" pyproject.toml 2>/dev/null || echo "not found"
grep -rn "pytest-cov\|coverage" requirements*.txt 2>/dev/null || echo "not found"

Length of output: 3008



@clean6378-max-it Here is an in-depth review of PR #97.


Deep Review: Memory-Path Benchmarks & CI Regression Gate

🔴 High Severity


1. TracemallocPeak.measure() unconditionally stops a pre-existing tracemalloc session

File: tests/benchmarks/conftest.py, Lines 26–33

tracemalloc.start()
tracemalloc.clear_traces()
try:
    result = func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    return result, peak
finally:
    tracemalloc.stop()   # ← always fires, even if tracing was active before the call

tracemalloc.stop() in finally unconditionally tears down the global tracing state. If any outer scope (another measure() call, a future plugin, or the ceiling test running in the same process) had already started tracing, this silently kills it. The correct pattern is to snapshot state, conditionally start, and restore on exit:

def measure(self, func: Callable[..., T], /, *args: Any, **kwargs: Any) -> tuple[T, int]:
    was_tracing = tracemalloc.is_tracing()
    tracemalloc.start()
    tracemalloc.clear_traces()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
        return result, peak
    finally:
        if not was_tracing:
            tracemalloc.stop()

The same issue exists in test_large_parse_peak_memory_under_ceiling (Lines 22–29 of test_parse_memory.py).


2. BenchmarkDataError is silently re-wrapped in load_results

File: scripts/check_benchmark_regression.py, Lines 62–70

BenchmarkDataError extends ValueError. In load_results, the broad except clause catches it:

try:
    name = entry["name"]
    mean = benchmark_entry_mean(entry)   # can raise BenchmarkDataError(ValueError)
except (KeyError, TypeError, ValueError) as exc:
    raise BenchmarkDataError(
        f"{path} benchmarks[{index}] missing 'name' or measurable value"  # ← generic; original lost
    ) from exc

A specific error like "benchmark 'test_parse_large_peak_memory' missing 'stats.mean' or extra_info.peak_bytes" becomes the less informative "benchmarks[2] missing 'name' or measurable value" in the primary message. The __cause__ chain preserves it, but users running CI only see the outer message. Fix: catch BenchmarkDataError separately and re-raise without wrapping, or narrow the except clause:

except BenchmarkDataError:
    raise
except (KeyError, TypeError, ValueError) as exc:
    raise BenchmarkDataError(
        f"{path} benchmarks[{index}] missing 'name' or measurable value"
    ) from exc

3. assert used for runtime safety in a production script

File: scripts/check_benchmark_regression.py, Line 44

assert isinstance(extra, dict)  # disabled by python -O

Python optimized mode (python -O) silently skips assert statements. This is a CI script — replace with an explicit guard:

if not isinstance(extra, dict):
    raise BenchmarkDataError(f"extra_info for {entry.get('name')!r} is not a dict")

🟡 Medium Severity


4. Assertion inside benchmark loop inflates measurements

File: tests/benchmarks/test_export_bench.py, Line 80

def _run() -> None:
    ...
    result, peak = tracemalloc_peak.measure(_export)
    assert result.exported_session_count > 0   # ← runs every iteration
    peaks.append(peak)

benchmark(_run)

The assert executes on every warm-up and timed round, adding noise. Move it outside the benchmark loop:

results: list[BulkExportResult] = []

def _run() -> None:
    result, peak = tracemalloc_peak.measure(_export)
    results.append(result)
    peaks.append(peak)

benchmark(_run)
assert results and results[-1].exported_session_count > 0

5. test_metric_is_bytes_uses_extra_info_without_name_hint doesn't isolate the extra_info path

File: tests/test_check_benchmark_regression.py, Lines 177–185

def test_metric_is_bytes_uses_extra_info_without_name_hint() -> None:
    entry = {
        "name": "test_custom_memory",   # ← contains "memory"; name heuristic also fires
        ...
        "extra_info": {"peak_bytes": 1_000_000},
    }
    assert metric_is_bytes("test_custom_memory", entry)

Because "peak_memory" in "test_custom_memory" is False but "memory" is not the substring being checked, this actually works — "peak_memory" is the exact substring. Let me be more precise: "peak_memory" in "test_custom_memory"False. But to truly test that extra_info drives the result independently of the name, use a name with no memory-related substring at all:

entry = {
    "name": "test_export_latency",   # no "peak_memory" in name
    "stats": {"mean": 0.05},
    "extra_info": {"peak_bytes": 1_000_000},
}
assert metric_is_bytes("test_export_latency", entry)

6. test_search_full_corpus baseline is dead data

File: benchmarks/baselines.json

"search": {
    "test_search_full_corpus": 0.001091
}

EXCLUDED_FROM_GATE in check_benchmark_regression.py contains "test_search_full_corpus", so this baseline is loaded, iterated, and then silently skipped every CI run. Either remove it from the JSON (if it will never be gated) or remove it from EXCLUDED_FROM_GATE (if it's stable enough to gate). The current state can mislead future maintainers into thinking it's being enforced.


7. make update-baselines seeds baselines from the local machine, not ubuntu-latest

File: Makefile, Lines 3–5

update-baselines:
    pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
    python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json

The README correctly warns that baselines must be captured on ubuntu-latest, but the Makefile target does the opposite — it runs locally. A developer running make update-baselines on macOS or Windows will commit baselines that differ from CI timing, causing spurious failures or masking real regressions. Consider adding a guard or renaming the target to seed-baselines-local with a prominent warning comment.


🔵 Low Severity


8. Fragile dual-import pattern in reduce_baselines.py

File: scripts/reduce_baselines.py, Lines 13–16

try:
    from scripts.check_benchmark_regression import BenchmarkDataError, benchmark_entry_mean
except ModuleNotFoundError:
    from check_benchmark_regression import BenchmarkDataError, benchmark_entry_mean

This breaks silently if both paths exist but one is stale. The standard fix is to add scripts/ to PYTHONPATH when invoking the script, or to make it a proper package with __init__.py and use python -m scripts.reduce_baselines.


9. _note in baselines.json omits the slack factor

File: benchmarks/baselines.json

The PR says baselines were seeded with 1.5× slack, but the _note field only says "Gated means from ubuntu-latest CI" without mentioning the multiplier. Future maintainers won't know the values already include headroom. Add it:

"_note": "... Values multiplied by 1.5× slack at seeding time. ..."

Summary Table

# File Severity Issue
1 conftest.py 🔴 High tracemalloc.stop() kills pre-existing tracing session
2 check_benchmark_regression.py 🔴 High BenchmarkDataError caught and re-wrapped with generic message
3 check_benchmark_regression.py 🔴 High assert disabled by python -O
4 test_export_bench.py 🟡 Medium Assertion inside benchmark loop inflates timing
5 test_check_benchmark_regression.py 🟡 Medium Test doesn't truly isolate extra_info path
6 baselines.json 🟡 Medium test_search_full_corpus baseline is never compared
7 Makefile 🟡 Medium make update-baselines runs locally, risks cross-OS skew
8 reduce_baselines.py 🔵 Low Fragile dual-import pattern
9 baselines.json 🔵 Low 1.5× slack factor not documented in _note

The overall architecture (fixture → extra_info.peak_bytesreduce_baselines → gate script) is sound and the test coverage is good. The items above are correctness/safety issues that should be resolved before the CI gate is relied upon, with item 1 (tracemalloc state corruption) being the most important to address.

Comment thread benchmarks/baselines.json
Comment thread scripts/reduce_baselines.py Outdated
Comment thread benchmarks/baselines.json Outdated
Comment thread tests/benchmarks/conftest.py Outdated
…ct / f"session_{i:04d}.jsonl", 20) # Unique first_timestamp per session so export filenames do not collide in ZIP benches. first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" @timon0305 timon0305 1 minute ago Member first_ts = f"2026-06-12T{i % 24:02d}:{i % 60:02d}:00Z" is only unique for count ≤ 120 (collides at lcm(24,60)). Add assert count <= 120 (or a wider format) so a future corpus bump can't produce colliding ZIP entries.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/benchmarks/conftest.py (1)

104-106: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Guard timestamp generation against invalid hour values for larger corpora.

first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z" becomes non-ISO once i // 60 >= 24 (e.g., T24:00:00Z). If session counts increase, this will produce invalid timestamps and degrade timestamp-based metadata behavior.

Suggested fix
+from datetime import datetime, timedelta, timezone
...
-        first_ts = f"2026-06-12T{i // 60:02d}:{i % 60:02d}:00Z"
+        base = datetime(2026, 6, 12, tzinfo=timezone.utc)
+        first_ts = (base + timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:%SZ")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmarks/conftest.py` around lines 104 - 106, The timestamp
generation in the benchmark session loop can become invalid once the computed
hour exceeds 23, so update the `first_ts` logic in `conftest.py` to use a
rollover-safe datetime calculation instead of formatting `i // 60` directly. Use
a real base date/time plus a `timedelta` derived from `i` and pass the resulting
ISO timestamp into `write_jsonl`, keeping the unique-per-session behavior while
ensuring `first_timestamp` always stays valid for larger corpora.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/README.md`:
- Around line 63-68: The manual baseline-refresh commands in the README are
missing the same slack factor used in the recommended workflow, which can lead
to inconsistent baselines. Update the benchmark refresh instructions so the
`pytest` command in the manual flow includes the same `--slack 1.5` option as
the main benchmark command, keeping the documented flow consistent with the
baseline generation path.

In `@Makefile`:
- Line 9: The seed-baselines-local target is missing the documented slack value,
so update the Makefile command that invokes reduce_baselines.py to pass the same
1.5x slack used elsewhere. Use the seed-baselines-local recipe and the
reduce_baselines.py invocation as the reference points, and make sure the
command includes the --slack 1.5 argument so local baseline generation matches
the documented behavior.

---

Duplicate comments:
In `@tests/benchmarks/conftest.py`:
- Around line 104-106: The timestamp generation in the benchmark session loop
can become invalid once the computed hour exceeds 23, so update the `first_ts`
logic in `conftest.py` to use a rollover-safe datetime calculation instead of
formatting `i // 60` directly. Use a real base date/time plus a `timedelta`
derived from `i` and pass the resulting ISO timestamp into `write_jsonl`,
keeping the unique-per-session behavior while ensuring `first_timestamp` always
stays valid for larger corpora.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 23f3e4da-51a2-403a-aa00-906177a5490e

📥 Commits

Reviewing files that changed from the base of the PR and between d4dee52 and 2fb39bc.

📒 Files selected for processing (11)
  • .gitignore
  • Makefile
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_export_bench.py
  • tests/benchmarks/test_parse_memory.py
  • tests/test_check_benchmark_regression.py
  • tests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (5)
  • benchmarks/baselines.json
  • tests/test_reduce_baselines.py
  • scripts/reduce_baselines.py
  • scripts/check_benchmark_regression.py
  • tests/benchmarks/test_export_bench.py

Comment thread benchmarks/README.md
Comment thread Makefile Outdated
@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit @timon0305 comments are addressed?

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== tests/benchmarks/conftest.py (tracemalloc stop fix) ==="
cat tests/benchmarks/conftest.py

echo ""
echo "=== scripts/check_benchmark_regression.py (BenchmarkDataError + assert fix) ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== tests/benchmarks/test_export_bench.py (assertion inside loop fix) ==="
cat tests/benchmarks/test_export_bench.py

echo ""
echo "=== tests/benchmarks/test_parse_memory.py (tracemalloc stop fix) ==="
cat tests/benchmarks/test_parse_memory.py

echo ""
echo "=== tests/test_check_benchmark_regression.py (test isolation fix) ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== benchmarks/baselines.json (dead entry + slack note fix) ==="
cat benchmarks/baselines.json

echo ""
echo "=== Makefile (update-baselines warning fix) ==="
cat Makefile

echo ""
echo "=== scripts/reduce_baselines.py (dual-import fix) ==="
cat scripts/reduce_baselines.py

Length of output: 31027



@clean6378-max-it Here's a full status report on all 9 items.


✅ Fully Addressed (7/9)

# Issue Status
1 tracemalloc.stop() kills pre-existing tracing session ✅ Fixed — was_tracing guard added; finally only calls tracemalloc.stop() when tracing was not already active
2 BenchmarkDataError caught and re-wrapped with generic message ✅ Fixed — except BenchmarkDataError: raise is now the first clause before the generic handler
3 assert isinstance(extra, dict) disabled by python -O ✅ Fixed — replaced with if not isinstance(extra, dict): raise BenchmarkDataError(...)
4 Assertion inside benchmark loop inflates timing ✅ Fixed — results list collects inside _run(), assertion results[-1].exported_session_count > 0 is outside benchmark(_run)
5 Test name test_custom_memory triggers name heuristic, masking extra_info path ✅ Fixed — renamed to test_export_latency which has no "peak_memory" substring
7 make update-baselines runs locally, risks cross-OS skew ✅ Fixed — renamed to seed-baselines-local with a @echo "WARNING: ..." stderr notice; update-baselines is now an explicit deprecated alias
8 Fragile dual-import try/except ModuleNotFoundError in reduce_baselines.py ✅ Fixed — single direct from scripts.check_benchmark_regression import ...

🟡 Partially Addressed (2/9)

Issue 6 — test_search_full_corpus baseline is dead data

The entry is still present in both baselines.json and EXCLUDED_FROM_GATE. The _note now explicitly documents the intent:

"Excluded from gate (recorded for reference): test_parse_session_small, test_search_full_corpus (sub-ms CI noise)."

The intent is now clear to future maintainers, so this is acceptable. The structural ambiguity is resolved by the documentation even if the entry is never compared.

Issue 9 — _note omits the 1.5× slack factor

reduce_baselines.py now correctly embeds "Values multiplied by {slack}× slack at generation time." when --slack is passed. However, the current benchmarks/baselines.json _note reads:

"Gated means from ubuntu-latest CI benchmark-results.json (PR `#97`, run 28126772276)..."

There is no mention of 1.5×. This suggests the file was refreshed from CI with slack=1.0 (or written manually), so the on-disk values may not have headroom applied. If the baselines were intentionally seeded without a slack multiplier (relying on CI natural variance), that's fine — but worth confirming. If they were meant to include 1.5× slack, a regeneration with --slack 1.5 is needed.


Net result: 7 of 9 issues are cleanly resolved. Issue 6 is documented away acceptably. Issue 9 is a tooling fix — the question is whether the current baselines.json on disk actually has 1.5× slack baked in (the _note doesn't say so). Would you like to regenerate the baselines with --slack 1.5 to make it explicit?

@timon0305 timon0305 requested a review from wpak-ai June 24, 2026 21:41
@wpak-ai wpak-ai merged commit 7f1e0cf into master Jun 24, 2026
16 checks passed
@wpak-ai wpak-ai deleted the test/memory-path-benchmarks branch June 24, 2026 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

claude-code-chat-browser: Benchmarks for memory-sensitive code paths — gate in baselines

3 participants