feat: Goe rewrite #5

Open: jacobseunglee wants to merge 9 commits into main from goe-rewrite

@jacobseunglee

Collaborator

No description provided.

Akshay-Rohatgi and others added 8 commits May 20, 2026 19:37
Implements the GoE v2 foundation: Pydantic v2 models for the full entity
graph + procedure DSL, a step-by-step procedure executor with interpolation/
assertions/output capture, and a thin TestEnvironment adapter over v1
TestEnvironmentTool. 74 tests passing (3 browser/listen xfailed with
documented root causes).
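
As a rough illustration of what "step-by-step execution with interpolation, assertions, and output capture" can look like, here is a minimal sketch. It is not the executor from this PR: the step dictionary shape (`id`, `run`, `expect`), the `${steps.<id>.output}` reference syntax, and the `execute` function are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List

# Assumed reference syntax: ${steps.<step_id>.output}
STEP_REF = re.compile(r"\$\{steps\.(\w+)\.output\}")


def execute(steps: List[dict], run: Callable[[str], str]) -> Dict[str, str]:
    """Run steps in order, feeding captured outputs into later commands."""
    outputs: Dict[str, str] = {}
    for step in steps:
        # Interpolation: substitute outputs captured from earlier steps.
        command = STEP_REF.sub(lambda m: outputs[m.group(1)], step["run"])
        output = run(command)
        # Assertion: an optional substring the output must contain.
        expected = step.get("expect")
        if expected is not None and expected not in output:
            raise AssertionError(
                f"step {step['id']!r}: expected {expected!r} in output"
            )
        # Output capture: make the result available to later steps.
        outputs[step["id"]] = output
    return outputs
```

The `run` callable is injected so the same loop can drive a real environment adapter or a fake one in tests.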

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g, and SUID privesc

Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…ations implemented and tests for attack procedures added

Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…ng working: detached background processes, attacker container reset on retry, chromium browser installed via PPA

Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…self-review to address commonly seen custom app development pitfalls

Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…g tests and visualizers

Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
* feat: model change

* delete game_of_everything/goe/jacobtest.yaml

* fix: workflow

* fix: switch off plan determined runtime
Adds the v2 evaluation suite (goe/eval), metrics instrumentation
(goe/metrics), single-system orchestration/packaging (goe/flow,
goe/packaging), workflow artifacts (goe/artifacts), and their tests
and fixtures.

Correctness fixes from code review:
- build.py: a non-zero deploy exit no longer falls through to a
  possible PASS; it routes into the retry loop as a design_flaw.
- runtimes/registry.py: create parent dirs for nested source-file
  paths (set -e no longer aborts); raise on unknown db_type.
- eval/llm_judge.py: print_judge_result tolerates missing keys.
- eval/golden.py: edge coverage requires a real connecting edge,
  not independent provides/requires matches.
- eval/runner.py: capture real run start time (durations were ~0).
- flow/orchestrator.py + checkpoint.py: persist and restore
  failure_category on the resume path.
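
To make the eval/golden.py item concrete, here is a small sketch of the difference between matching provides/requires independently and requiring a real connecting edge. The `Edge` dataclass and both functions are hypothetical and only show one plausible reading of the fix, not the code in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    source: str      # entity that provides the capability
    target: str      # entity that requires it
    capability: str


def covered_loose(expected: Edge, edges: List[Edge]) -> bool:
    """Independent matching: the provide and the require may sit on unrelated edges."""
    provides = any(
        e.source == expected.source and e.capability == expected.capability
        for e in edges
    )
    requires = any(
        e.target == expected.target and e.capability == expected.capability
        for e in edges
    )
    return provides and requires


def covered_strict(expected: Edge, edges: List[Edge]) -> bool:
    """A single edge must connect the expected source to the expected target."""
    return expected in edges
```

With two unrelated edges that happen to share a capability, the loose check reports coverage while the strict check does not.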

Cleanups:
- metrics/collector.py: drop dead capture_artifacts ternary.
- bedrock.py: cache the bedrock-runtime client per region/creds
  instead of rebuilding it on every call.
- runtimes: consolidate per-runtime knowledge into the template
  YAMLs (target_image, deps_install_template, pre_start); deploy()
  is now table-driven and _RUNTIME_IMAGES is removed.
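
The bedrock.py cleanup follows a common pattern: build a client once per (region, credentials) key and reuse it. A minimal sketch under assumptions; `ClientCache` and its `factory` argument are hypothetical names, and the factory is injected so the example does not depend on boto3.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple


class ClientCache:
    """Caches one client per (region, credentials key) pair."""

    def __init__(self, factory: Callable[[str, Optional[Hashable]], object]) -> None:
        # In real code the factory might wrap something like
        # boto3.client("bedrock-runtime", region_name=region).
        self._factory = factory
        self._clients: Dict[Tuple[str, Optional[Hashable]], object] = {}

    def get(self, region: str, creds_key: Optional[Hashable] = None) -> object:
        key = (region, creds_key)
        if key not in self._clients:
            # Only build a new client the first time this key is seen.
            self._clients[key] = self._factory(region, creds_key)
        return self._clients[key]
```

Keying on credentials as well as region avoids handing a client built for one identity to a caller using another.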

Co-authored-by: Jacob Lee <66867022+jacobseunglee@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@Akshay-Rohatgi requested a review from Copilot on June 14, 2026 20:25

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Combines planner improvements (atom catalog + grading step) with Phase 4
increment 1 (multi-entity chain test + multi-system packaging).

## Planner Improvements

Fixes the planner choosing wrong runtimes (e.g. apache_php for an SSH entity)
and inventing atoms, by grounding all prompts in the actual atom inventory and
adding few-shot examples.

**New pipeline structure:**
- Step 1: plan_entities → includes runtime + atoms in stubs
- Step 1.5: grade_stubs → LLM validates/corrects stubs (5-check rubric)
- Step 2: specify_entities → adds edges with validated runtime/atoms
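
The three-step structure above can be sketched as a simple composition. The step names come from this description, but the `Stub` shape, the callable signatures, and the deterministic `fix_ssh_runtime` grader are assumptions for illustration; the real grading step is an LLM call.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stub:
    name: str
    runtime: str
    atoms: Tuple[str, ...]


def run_pipeline(
    prompt: str,
    plan_entities: Callable[[str], List[Stub]],
    grade_stub: Callable[[Stub], Stub],
    specify_entities: Callable[[List[Stub]], object],
) -> object:
    stubs = plan_entities(prompt)              # Step 1: stubs carry runtime + atoms
    graded = [grade_stub(s) for s in stubs]    # Step 1.5: validate or correct each stub
    return specify_entities(graded)            # Step 2: add edges using validated stubs


def fix_ssh_runtime(stub: Stub) -> Stub:
    """Toy stand-in for the grader: SSH entities should not run on a web runtime."""
    if "ssh" in stub.name and stub.runtime == "apache_php":
        return replace(stub, runtime="ubuntu")
    return stub
```

Placing the grading step between planning and specification means edges are only ever built on top of corrected runtimes and atoms.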

**Atom catalog** (goe/planner/_atom_catalog.py):
- Parses 13 web vuln atoms from atoms/web_vulnerabilities/*.md
- Extracts descriptions, compatible runtimes (from code examples), capabilities
- Provides rich markdown table for prompt injection
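
A rough sketch of how such a catalog could be derived from atom markdown files. This is not the code in goe/planner/_atom_catalog.py: the fence-language-to-runtime mapping (including the `python_flask` name), the dictionary shape, and both function names are assumptions.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # markdown code-fence marker

# Hypothetical mapping from code-example language to a compatible runtime.
FENCE_TO_RUNTIME: Dict[str, str] = {"php": "apache_php", "python": "python_flask"}

# Matches opening fences that carry a language tag, e.g. a "php" fence.
OPEN_FENCE = re.compile(r"^" + re.escape(FENCE) + r"(\w+)\s*$", re.MULTILINE)


def parse_atom(name: str, markdown: str) -> dict:
    """Extract a one-line description and the runtimes implied by code examples."""
    description = ""
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            break  # stop before the first code example
        if stripped and not stripped.startswith("#"):
            description = stripped
            break
    langs = OPEN_FENCE.findall(markdown)
    runtimes = sorted({FENCE_TO_RUNTIME[l] for l in langs if l in FENCE_TO_RUNTIME})
    return {"name": name, "description": description, "runtimes": runtimes}


def render_table(atoms: List[dict]) -> str:
    """Render the catalog as a markdown table suitable for pasting into a prompt."""
    rows = ["| Atom | Description | Runtimes |", "| --- | --- | --- |"]
    for atom in atoms:
        runtimes = ", ".join(atom["runtimes"])
        rows.append(f"| {atom['name']} | {atom['description']} | {runtimes} |")
    return "\n".join(rows)
```

Deriving runtimes from the code examples keeps the catalog in sync with the atom files rather than relying on a hand-maintained list.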

**Prompts rewritten** (all 4 steps + grading):
- design_systems: port-to-runtime mapping + 2 examples
- plan_entities: atom catalog injection, runtime selection rules, 2 examples
- grade_stubs (new): 5-check rubric (atom exists, runtime matches, web vs
  system, single responsibility, chain logic)
- specify_entities: rich atom table, runtime affinity rules, edge consistency
- connect_edges: edge type selection guide, 2 examples

**Fixes:**
- resolve.py: match structural port to exposed ports (not always first)
- topology_environment.py: create containers on network directly (not none→connect)

## Phase 4: Multi-System Orchestration

**Chain test** (goe/flow/chain_test.py):
- L3 validation after all entities pass L2
- Gates overall run (FAILED → RunResult.success=False, CLI exits non-zero)
- TopologyEnvironment: one ubuntu:22.04 container per system + shared Kali attacker
- Chain attacker agent (Opus) synthesizes end-to-end procedure
- Retries up to 2× on failure

**Multi-system packaging** (goe/packaging/packager.py):
- Single-system: unchanged (deploy.sh + playbook.yaml)
- Multi-system: per-system deploy scripts + docker-compose.yml + chain_playbook.yaml
- Port collision detection scoped per system_id

**Cross-system addressing** (goe/executor/interpolation.py):
- ${system.<system_id>.host} / ${system.<system_id>.port}
- Existing ${target_host}, ${edge.*}, ${steps.*} unchanged
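
For readers unfamiliar with the new placeholders, here is a small sketch of how `${system.<system_id>.host}` and `${system.<system_id>.port}` could be resolved. It is an illustration only, not the code in goe/executor/interpolation.py; the `systems` mapping shape and the allowed characters in a system id are assumptions.

```python
import re
from typing import Dict

# Assumes system ids consist of letters, digits, underscores, and hyphens.
SYSTEM_REF = re.compile(r"\$\{system\.([A-Za-z0-9_-]+)\.(host|port)\}")


def interpolate_systems(text: str, systems: Dict[str, dict]) -> str:
    """Replace ${system.<id>.host|port}; leave every other ${...} form untouched."""

    def resolve(match: "re.Match[str]") -> str:
        system_id, field = match.group(1), match.group(2)
        try:
            return str(systems[system_id][field])
        except KeyError:
            raise KeyError(f"unknown system reference: {match.group(0)}") from None

    return SYSTEM_REF.sub(resolve, text)
```

Usage: `interpolate_systems("ssh root@${system.ssh_box.host} -p ${system.ssh_box.port}", {"ssh_box": {"host": "10.0.0.5", "port": 22}})` yields `ssh root@10.0.0.5 -p 22`, while a string containing only `${target_host}` passes through unchanged.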

**Orchestrator** (goe/flow/orchestrator.py):
- Runs chain test when len(built) > 1
- Chain test result gates success
- Persists chain_test in checkpoint
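
The gating behavior described for the chain test and orchestrator can be sketched as follows. `RunResult.success` is named in this description; everything else here (the `chain_test` field type, `finalize`, `exit_code`, and reading "retries up to 2×" as one initial attempt plus two retries) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

MAX_RETRIES = 2  # assumption: one initial attempt plus up to two retries


@dataclass
class RunResult:
    success: bool
    chain_test: Optional[str] = None  # "PASSED", "FAILED", or None when skipped


def run_chain_test(attempt: Callable[[], bool], max_retries: int = MAX_RETRIES) -> str:
    """Call attempt() until it succeeds or the retry budget is exhausted."""
    for _ in range(1 + max_retries):
        if attempt():
            return "PASSED"
    return "FAILED"


def finalize(built: Sequence[str], entities_ok: bool, attempt: Callable[[], bool]) -> RunResult:
    # The chain test only applies when more than one entity was built.
    if not entities_ok or len(built) <= 1:
        return RunResult(success=entities_ok)
    status = run_chain_test(attempt)
    # The chain test result gates overall success.
    return RunResult(success=(status == "PASSED"), chain_test=status)


def exit_code(result: RunResult) -> int:
    """Map the run result to a process exit status (non-zero on failure)."""
    return 0 if result.success else 1
```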

**CLI** (goe/flow/__main__.py):
- goe flow test <output_dir> — replays packaged runs
- Auto-detects chain_playbook.yaml for multi-system replay

## Verification

End-to-end test: "SQLi → SSH" scenario that was failing before:
- ✓ SSH entity now has runtime=ubuntu (was apache_php)
- ✓ Both entities build successfully
- ✓ L3 chain test completes (synthesizes SQLi→SSH attack chain)
- ✓ Output includes chain_playbook.yaml with cross-system addressing

Co-Authored-By: Jacob Lee <66867022+jacobseunglee@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
3 participants