evo2 SAE steering analysis: dose-response / selectivity harness + tested metrics#1635
evo2 SAE steering analysis: dose-response / selectivity harness + tested metrics#1635polinabinder1 wants to merge 4 commits into
Conversation
The analysis layer steering lacked (steer.py used to only print). Adds steer_analysis.py — pure, CPU-tested metrics: divergence (continuation departs from baseline), dose_response (effect vs clamp strength), selectivity (target vs control features). steer.py reuses the production Evo2SAE.generate clamp, scores with these, and writes steering_results.json so the evidence is persisted/reproducible — the steering analog of probe.py annotate. Stacked on #1622. Signed-off-by: Polina Binder <pbinder@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a complete SAE steering harness for Evo2: a pure metrics module for quantifying feature-driven text changes, a CLI script that orchestrates encoding, generation sweeps, and causal analysis, and corresponding unit tests. The workflow measures dose-response and selectivity of target feature clamping against baseline and control features. ChangesSAE Steering Harness Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…nder/evo2-steering-analysis Signed-off-by: Polina Binder <pbinder@nvidia.com>
…nder/evo2-steering-analysis Signed-off-by: Polina Binder <pbinder@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py (1)
27-50: ⚡ Quick winConsider adding edge case tests.
The current tests cover happy paths well. Consider adding tests for edge cases such as:
- Empty strings in
divergence(e.g.,divergence("", "")ordivergence("A", ""))- Empty
control_steereddict inselectivity(verifies the edge case flagged insteer_analysis.py)dose_responsewith emptysteered_by_strengthdict🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py` around lines 27 - 50, Add unit tests covering edge cases: call divergence with empty strings and with one empty vs non-empty (e.g., divergence("", "") and divergence("A", "")) and assert expected behavior; add a test for selectivity that passes an empty control_steered dict to ensure the function handles/raises as intended; and add a test for dose_response with an empty steered_by_strength dict to verify it returns an empty/appropriate response. Place these new tests in the existing test_steer_analysis.py alongside the other tests and reference the same functions divergence, selectivity, and dose_response so edge behaviors are validated.bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py (1)
33-68: ⚡ Quick winExpand docstrings to Google-style format.
The coding guidelines require Google-style docstrings with Args, Returns, and (if applicable) Raises sections. The current brief docstrings are functional but could be more explicit about parameter types, return structure, and edge cases.
📚 Example expansion for `divergence`
def divergence(a: str, b: str) -> tuple[int, float]: """Return divergence metrics between two strings over their shared prefix. Args: a: First string to compare. b: Second string to compare. Returns: A tuple of (first_differing_index, fraction_differing) where: - first_differing_index is the index of the first character that differs, or the shared prefix length if strings are identical up to that point. - fraction_differing is the fraction of characters that differ within the shared prefix length (min of the two string lengths). """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py` around lines 33 - 68, Update the short docstrings for divergence, dose_response, and selectivity to full Google-style docstrings: for divergence, add an Args section documenting a: str and b: str, a Returns section describing the tuple (first_differing_index: int, fraction_differing: float) and explain edge cases (identical prefix => index == shared prefix length, fraction computed over min(len(a), len(b)) and uses max(1, n) to avoid division by zero); for dose_response, document Args (baseline: str, steered_by_strength: dict[float, str]) and Returns (list[dict] with keys "strength": float, "first_divergence": int, "frac_changed": float) and note rows are sorted by ascending strength; for selectivity, document Args (baseline: str, target_steered: str, control_steered: dict[int, str]) and Returns (dict with keys "target_frac_changed", "control_frac_changed", "mean_control_frac_changed", "selectivity_ratio") and mention handling of empty controls (mean_control_frac_changed==0 and selectivity_ratio uses denominator max(mean_c, 1e-9)). Ensure each docstring uses Google-style sections (Args, Returns, and mention Raises if applicable) and references the existing function names divergence, dose_response, and selectivity so reviewers can locate the changes.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`:
- Around line 54-68: The selectivity() function currently divides by a tiny
epsilon when control_steered is empty, producing a misleading huge
selectivity_ratio; modify selectivity() to detect when controls is empty (e.g.,
if not control_steered or if len(controls)==0) and set "selectivity_ratio" to
None (or another sentinel) instead of computing target / max(mean_c, 1e-9); keep
the other fields the same (you may also set "mean_control_frac_changed" to None
if you prefer), and update any type hints or callers of selectivity to accept a
nullable selectivity_ratio.
- Line 32: The top-level imports in steer_analysis.py currently have only one
blank line following them; update the module by inserting a second blank line
after the import block so there are exactly two blank lines before the first
subsequent statement (e.g., before any module-level constants, function or class
definitions) to satisfy the isort configuration.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.py`:
- Line 38: The import block in steer.py currently has only one blank line after
the imports which violates the isort rule requiring two blank lines; update the
top-of-file imports section by adding one additional newline so there are two
blank lines separating the imports from the rest of the code (i.e., ensure the
imports block in steer.py is followed by two blank lines).
- Around line 77-78: The parsing of comma-separated inputs for controls and
strengths (variables controls and strengths derived from a.controls and
a.strengths) lacks error handling and will raise ValueError on malformed tokens;
wrap the list comprehensions in a try/except that catches ValueError, validate
each token with .strip() before converting (skip empty tokens), convert controls
with int() and strengths with float() inside the try block, verify the two lists
have the expected matching lengths and raise or exit with a clear error message
including the offending token and the original input (use the argparse object a
for context) so the script fails gracefully instead of crashing.
- Around line 71-76: The code can leave target as None when no labeled features
are in the top activations; after the loop that inspects vals, ids and sets
target from a.feature or eng.labels, add a validation: if target is still None,
either set a sensible fallback (e.g., target = int(ids[0]) to pick the
highest-activation feature) or raise a clear error asking the user to provide
--feature; ensure the check references target, a.feature, vals, ids and
eng.labels so downstream code that uses target as a dict key/for formatting
won't get a TypeError.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py`:
- Line 21: The file test_steer_analysis.py has only one blank line after the
import block which violates the isort config requiring two blank lines; open
test_steer_analysis.py, locate the import section at the top of the module (the
import statements in the test_steer_analysis module), and add one additional
blank line so there are two consecutive blank lines immediately following the
imports before the first test or code block.
---
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`:
- Around line 33-68: Update the short docstrings for divergence, dose_response,
and selectivity to full Google-style docstrings: for divergence, add an Args
section documenting a: str and b: str, a Returns section describing the tuple
(first_differing_index: int, fraction_differing: float) and explain edge cases
(identical prefix => index == shared prefix length, fraction computed over
min(len(a), len(b)) and uses max(1, n) to avoid division by zero); for
dose_response, document Args (baseline: str, steered_by_strength: dict[float,
str]) and Returns (list[dict] with keys "strength": float, "first_divergence":
int, "frac_changed": float) and note rows are sorted by ascending strength; for
selectivity, document Args (baseline: str, target_steered: str, control_steered:
dict[int, str]) and Returns (dict with keys "target_frac_changed",
"control_frac_changed", "mean_control_frac_changed", "selectivity_ratio") and
mention handling of empty controls (mean_control_frac_changed==0 and
selectivity_ratio uses denominator max(mean_c, 1e-9)). Ensure each docstring
uses Google-style sections (Args, Returns, and mention Raises if applicable) and
references the existing function names divergence, dose_response, and
selectivity so reviewers can locate the changes.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py`:
- Around line 27-50: Add unit tests covering edge cases: call divergence with
empty strings and with one empty vs non-empty (e.g., divergence("", "") and
divergence("A", "")) and assert expected behavior; add a test for selectivity
that passes an empty control_steered dict to ensure the function handles/raises
as intended; and add a test for dose_response with an empty steered_by_strength
dict to verify it returns an empty/appropriate response. Place these new tests
in the existing test_steer_analysis.py alongside the other tests and reference
the same functions divergence, selectivity, and dose_response so edge behaviors
are validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ea342482-73ae-48e5-b854-6dc5e460dff3
📒 Files selected for processing (3)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py
|
|
||
| from __future__ import annotations | ||
|
|
||
|
|
There was a problem hiding this comment.
Add a second blank line after imports.
The isort configuration requires 2 blank lines after imports. Currently only 1 blank line is present.
📝 Proposed fix
from __future__ import annotations
+
def divergence(a: str, b: str) -> tuple[int, float]:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from __future__ import annotations | |
| def divergence(a: str, b: str) -> tuple[int, float]: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`
at line 32, The top-level imports in steer_analysis.py currently have only one
blank line following them; update the module by inserting a second blank line
after the import block so there are exactly two blank lines before the first
subsequent statement (e.g., before any module-level constants, function or class
definitions) to satisfy the isort configuration.
Source: Coding guidelines
| def selectivity(baseline: str, target_steered: str, control_steered: dict[int, str]) -> dict: | ||
| """Target effect vs control features clamped to the same strength. | ||
|
|
||
| ``selectivity_ratio`` > 1 means the target feature moves generation more than the average | ||
| control — evidence the steering is feature-specific, not a generic "any clamp perturbs output". | ||
| """ | ||
| target = divergence(baseline, target_steered)[1] | ||
| controls = {int(c): round(divergence(baseline, seq)[1], 4) for c, seq in control_steered.items()} | ||
| mean_c = sum(controls.values()) / len(controls) if controls else 0.0 | ||
| return { | ||
| "target_frac_changed": round(target, 4), | ||
| "control_frac_changed": controls, | ||
| "mean_control_frac_changed": round(mean_c, 4), | ||
| "selectivity_ratio": round(target / max(mean_c, 1e-9), 2), | ||
| } |
There was a problem hiding this comment.
Guard against misleading selectivity_ratio when controls are empty.
When control_steered is empty, Line 62 sets mean_c = 0.0, and Line 67 computes target / max(0.0, 1e-9) = target / 1e-9, which can produce an artificially large ratio (e.g., if target is 0.5, ratio becomes 50 million). This may mislead users into thinking the target is highly selective when in fact there were no controls to compare against.
🛡️ Proposed fix
def selectivity(baseline: str, target_steered: str, control_steered: dict[int, str]) -> dict:
"""Target effect vs control features clamped to the same strength.
``selectivity_ratio`` > 1 means the target feature moves generation more than the average
control — evidence the steering is feature-specific, not a generic "any clamp perturbs output".
"""
+ if not control_steered:
+ raise ValueError("control_steered must contain at least one control feature")
target = divergence(baseline, target_steered)[1]
controls = {int(c): round(divergence(baseline, seq)[1], 4) for c, seq in control_steered.items()}
mean_c = sum(controls.values()) / len(controls) if controls else 0.0
return {
"target_frac_changed": round(target, 4),
"control_frac_changed": controls,
"mean_control_frac_changed": round(mean_c, 4),
"selectivity_ratio": round(target / max(mean_c, 1e-9), 2),
}Alternatively, return None or a sentinel value for selectivity_ratio when there are no controls.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`
around lines 54 - 68, The selectivity() function currently divides by a tiny
epsilon when control_steered is empty, producing a misleading huge
selectivity_ratio; modify selectivity() to detect when controls is empty (e.g.,
if not control_steered or if len(controls)==0) and set "selectivity_ratio" to
None (or another sentinel) instead of computing target / max(mean_c, 1e-9); keep
the other fields the same (you may also set "mean_control_frac_changed" to None
if you prefer), and update any type hints or callers of selectivity to accept a
nullable selectivity_ratio.
| import sys | ||
| from pathlib import Path | ||
|
|
||
|
|
There was a problem hiding this comment.
Add a second blank line after imports.
The isort configuration requires 2 blank lines after imports. Currently only 1 blank line is present.
📝 Proposed fix
_HERE = Path(__file__).resolve().parent
sys.path.insert(0, str(_HERE))
sys.path.insert(0, str(_HERE.parent / "src")) # recipes/evo2/src -> evo2_sae package
sys.path.insert(0, str(_HERE.parents[2] / "sae" / "src"))
+
from steer_analysis import dose_response, selectivity # noqa: E402📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _HERE = Path(__file__).resolve().parent | |
| sys.path.insert(0, str(_HERE)) | |
| sys.path.insert(0, str(_HERE.parent / "src")) # recipes/evo2/src -> evo2_sae package | |
| sys.path.insert(0, str(_HERE.parents[2] / "sae" / "src")) | |
| from steer_analysis import dose_response, selectivity # noqa: E402 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.py`
at line 38, The import block in steer.py currently has only one blank line after
the imports which violates the isort rule requiring two blank lines; update the
top-of-file imports section by adding one additional newline so there are two
blank lines separating the imports from the rest of the code (i.e., ensure the
imports block in steer.py is followed by two blank lines).
Source: Coding guidelines
| target = a.feature | ||
| for v, i in zip(vals.tolist(), ids.tolist()): | ||
| lab = eng.labels.get(int(i)) | ||
| print(f" feat {int(i):6d} {str(lab):18s} max_act {v:7.2f}") | ||
| if target is None and lab: | ||
| target = int(i) |
There was a problem hiding this comment.
Validate that a target feature is selected.
If the sequence has no labeled features in its top 10 activations and the user did not specify --feature, Line 76 leaves target = None. This will cause a TypeError later when target is used as a dictionary key (Line 93) or in string formatting (Line 95).
🛡️ Proposed fix
for v, i in zip(vals.tolist(), ids.tolist()):
lab = eng.labels.get(int(i))
print(f" feat {int(i):6d} {str(lab):18s} max_act {v:7.2f}")
if target is None and lab:
target = int(i)
+ if target is None:
+ print("Error: No labeled feature found in top activations. Specify --feature explicitly.", file=sys.stderr)
+ sys.exit(1)
controls = [int(c) for c in a.controls.split(",") if c.strip()]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.py`
around lines 71 - 76, The code can leave target as None when no labeled features
are in the top activations; after the loop that inspects vals, ids and sets
target from a.feature or eng.labels, add a validation: if target is still None,
either set a sensible fallback (e.g., target = int(ids[0]) to pick the
highest-activation feature) or raise a clear error asking the user to provide
--feature; ensure the check references target, a.feature, vals, ids and
eng.labels so downstream code that uses target as a dict key/for formatting
won't get a TypeError.
| controls = [int(c) for c in a.controls.split(",") if c.strip()] | ||
| strengths = [float(s) for s in a.strengths.split(",")] |
There was a problem hiding this comment.
Add error handling for parsing controls and strengths.
Lines 77-78 parse comma-separated strings into integers and floats without error handling. If the user provides malformed input (e.g., --controls "12345,abc"), the script will crash with a ValueError.
🛡️ Proposed fix
- controls = [int(c) for c in a.controls.split(",") if c.strip()]
- strengths = [float(s) for s in a.strengths.split(",")]
+ try:
+ controls = [int(c) for c in a.controls.split(",") if c.strip()]
+ except ValueError as e:
+ print(f"Error parsing --controls: {e}", file=sys.stderr)
+ sys.exit(1)
+ try:
+ strengths = [float(s) for s in a.strengths.split(",")]
+ except ValueError as e:
+ print(f"Error parsing --strengths: {e}", file=sys.stderr)
+ sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.py`
around lines 77 - 78, The parsing of comma-separated inputs for controls and
strengths (variables controls and strengths derived from a.controls and
a.strengths) lacks error handling and will raise ValueError on malformed tokens;
wrap the list comprehensions in a try/except that catches ValueError, validate
each token with .strip() before converting (skip empty tokens), convert controls
with int() and strengths with float() inside the try block, verify the two lists
have the expected matching lengths and raise or exit with a clear error message
including the offending token and the original input (use the argparse object a
for context) so the script fails gracefully instead of crashing.
| import sys | ||
| from pathlib import Path | ||
|
|
||
|
|
There was a problem hiding this comment.
Add a second blank line after imports.
The isort configuration requires 2 blank lines after imports. Currently only 1 blank line is present.
📝 Proposed fix
from pathlib import Path
+
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from pathlib import Path | |
| sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts")) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py`
at line 21, The file test_steer_analysis.py has only one blank line after the
import block which violates the isort config requiring two blank lines; open
test_steer_analysis.py, locate the import section at the top of the module (the
import statements in the test_steer_analysis module), and add one additional
blank line so there are two consecutive blank lines immediately following the
imports before the first test or code block.
Source: Coding guidelines
…-commit CI) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Summary
Steering analysis — tested effect-metrics + a harness that persists results. Stacked on the engine (#1622).
Contents
scripts/steer_analysis.py— pure, CPU-tested metrics:divergence,dose_response,selectivityscripts/steer.py— harness: reuses the productionEvo2SAE.generatedecode-onlysae.steeringclamp, scores with the above, writessteering_results.jsontests/test_steer_analysis.py(CPU)How to run
Writes
steering_results.json(target, dose-response curve, target-vs-control selectivity) — the REQ-6 steering evidence once run on 7B/L26.Summary by CodeRabbit
Release Notes
New Features
Tests
Expected output
steer.py … --out steering_results.jsonprints: the sequence's top active features, then a dose-response table (strength → diverges@N, %changed) where%changedrises with strength for a real steering feature, then selectivity (target%changed≫ mean control;ratio > 1). Writes the same to JSON.pytest tests/test_steer_analysis.py→ 3 passed.The dose-response / selectivity numbers are pending the 7B/L26 GPU run (REQ-6 evidence). Expected shape: monotone dose-response + target ratio clearly above controls.