Commit 81c5e7d

Authored by ChrisJBurns and claude

Add dry-run doc validation skill (#724)
* Add dry-run doc validation skill

  New /test-docs-dryrun skill for fast CRD schema validation of all YAML blocks in K8s and vMCP documentation. Extracts YAML from docs, runs kubectl apply --dry-run=server, and reports pass/fail per doc. No resources created. Completes in under 5 minutes for all docs.

  Includes:
  - SKILL.md with scope selection (single page, section, all)
  - Python extraction script for YAML blocks from .mdx files
  - Lightweight prereqs checker (only needs CRDs, not operator)

* Improve dry-run skill with modular procedures

  - Split skill into SKILL.md (workflow) + 4 procedure files: cluster-setup.md, extract.md, validate.md, report.md
  - Fix duplicate filename bug: extraction script now takes --prefix flag to namespace output files by section (k8s-, vmcp-)
  - Fix cluster lifecycle: check for existing cluster first, default to keeping it after the run
  - Fix result tracking: use CSV file instead of bash associative arrays for zsh compatibility
  - Require per-doc breakdown table as primary output every run

* Fix validation procedure to use script file

  Long-running for loops with nested if/elif and subshells get backgrounded or time out when run as inline Bash tool calls. The procedure now instructs writing the loop to a .sh file first, then executing with bash. Also uses || true after kubectl and checks the output string instead of the exit code for reliability.

* Address PR review feedback for dry-run skill

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2d1a73a commit 81c5e7d

7 files changed

Lines changed: 479 additions & 0 deletions


Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
---
name: test-docs-dryrun
description: >
  Fast CRD schema validation for ToolHive documentation. Extracts all YAML blocks from K8s and vMCP docs, runs kubectl apply --dry-run=server to catch field name errors, type mismatches, and schema drift. No cluster resources are created. Use for: "dry-run the docs", "validate the YAML", "check for schema issues", "run a quick doc check", or after any CRD/API change to catch doc rot. Requires a Kubernetes cluster with ToolHive CRDs installed.
---

# Test Docs - Dry Run

Fast CRD schema validation for all ToolHive documentation YAML blocks. Extracts every YAML block containing `toolhive.stacklok.dev` resources, runs `kubectl apply --dry-run=server` on each, and reports pass/fail in a per-doc table. No resources are created. Completes in under 5 minutes for all docs.

## Workflow

Follow these steps in order. Each step references a procedure file in the `procedures/` directory - read that file and follow its instructions exactly. Do NOT improvise inline bash when a procedure exists.

### 1. Determine scope

Ask what to validate:

- **Kubernetes Operator** - all `.mdx` in `docs/toolhive/guides-k8s/`
- **Virtual MCP Server** - all `.mdx` in `docs/toolhive/guides-vmcp/`
- **All** - both sections

### 2. Cluster setup

Follow `procedures/cluster-setup.md`. Key points:

- Check for an existing `toolhive` kind cluster FIRST
- Only create one if missing
- Only CRDs are needed, not the operator
- Default to KEEPING the cluster after the run

### 3. Extract YAML

Follow `procedures/extract.md`. Key points:

- Always use `scripts/extract-yaml.py` with the `--prefix` flag
- Use `--prefix k8s` for K8s docs, `--prefix vmcp` for vMCP docs
- This prevents filename collisions between sections

### 4. Validate

Follow `procedures/validate.md`. Key points:

- Write results to a CSV file (not bash variables)
- Classify as pass/fail/expected/skip per the procedure
- One CSV line per YAML block

### 5. Report

Follow `procedures/report.md`. Key points:

- Always output the per-doc breakdown table to the user
- The table is the PRIMARY output of this skill
- Bold non-zero fail counts
- List real failures with error messages and fix suggestions
- Write results to `TEST_DRYRUN_*.md`
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Cluster setup

How to manage the kind cluster for dry-run validation. The goal is to avoid unnecessary cluster creation and deletion between runs.

## Check for existing cluster

Always check first. Do NOT create a cluster without checking.

```bash
kind get clusters 2>/dev/null | grep -q "^toolhive$"
```

If the cluster exists, verify that CRDs are installed:

```bash
kubectl get crd --context kind-toolhive 2>/dev/null | grep -c toolhive.stacklok.dev
```

If CRDs are present (count > 0), skip straight to extraction. No setup is needed.

## Create cluster only if missing

If no `toolhive` cluster exists:

```bash
kind create cluster --name toolhive
helm upgrade --install toolhive-operator-crds \
  oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds \
  -n toolhive-system --create-namespace
```

The operator is NOT needed for dry-run validation. Only install the CRDs.

## After the run: keep the cluster

Default to keeping the cluster after the run. Do NOT delete it unless the user explicitly asks. This avoids the 2-minute cluster creation penalty on the next run.

If the user asks to clean up, delete with:

```bash
kind delete cluster --name toolhive
```
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Extract YAML blocks

How to extract ToolHive CRD YAML blocks from documentation files. Always use the extraction script - do NOT write inline bash to parse YAML.

## Commands

Use the `--prefix` flag to avoid filename collisions between sections. This is critical because both sections have files with the same name (e.g., `telemetry-and-metrics.mdx`).

```bash
SKILL_PATH="<skill-path>"
YAML_DIR="$(mktemp -d)/yaml"
mkdir -p "$YAML_DIR"

# K8s Operator docs
for f in docs/toolhive/guides-k8s/*.mdx; do
  python3 "$SKILL_PATH/scripts/extract-yaml.py" "$f" "$YAML_DIR" --prefix k8s
done

# vMCP docs
for f in docs/toolhive/guides-vmcp/*.mdx; do
  python3 "$SKILL_PATH/scripts/extract-yaml.py" "$f" "$YAML_DIR" --prefix vmcp
done
```

Replace `<skill-path>` with the absolute path to this skill's directory.

## Output

The script writes one file per YAML block:

```text
k8s-auth-k8s_0.yaml
k8s-auth-k8s_1.yaml
k8s-run-mcp-k8s_0.yaml
vmcp-authentication_0.yaml
vmcp-telemetry-and-metrics_0.yaml
```

The prefix ensures `k8s-telemetry-and-metrics_0.yaml` and `vmcp-telemetry-and-metrics_0.yaml` don't collide.

## For a single section

If only validating one section, run just the relevant loop:

```bash
# K8s only
for f in docs/toolhive/guides-k8s/*.mdx; do
  python3 "$SKILL_PATH/scripts/extract-yaml.py" "$f" "$YAML_DIR" --prefix k8s
done

# vMCP only
for f in docs/toolhive/guides-vmcp/*.mdx; do
  python3 "$SKILL_PATH/scripts/extract-yaml.py" "$f" "$YAML_DIR" --prefix vmcp
done
```

## For a single page

```bash
python3 "$SKILL_PATH/scripts/extract-yaml.py" <file-path> "$YAML_DIR" --prefix single
```
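The `extract-yaml.py` script itself is not part of this diff hunk. As a rough sketch of the behavior described above - pull fenced YAML blocks that mention `toolhive.stacklok.dev` out of an `.mdx` file and write them as `<prefix>-<doc>_<N>.yaml` - the core logic could look like this; function names and the regex are illustrative, not the real script:

```python
"""Sketch of the extraction logic described above (illustrative, not the
shipped scripts/extract-yaml.py)."""
import re
from pathlib import Path

# Match fenced ```yaml / ```yml blocks; non-greedy body, DOTALL for newlines.
FENCE_RE = re.compile(r"```(?:yaml|yml)[^\n]*\n(.*?)```", re.DOTALL)

def extract_blocks(text: str) -> list:
    """Return fenced YAML blocks that reference ToolHive CRDs."""
    return [m for m in FENCE_RE.findall(text) if "toolhive.stacklok.dev" in m]

def write_blocks(mdx_path: Path, out_dir: Path, prefix: str) -> int:
    """Write each block as <prefix>-<docname>_<i>.yaml; return the count."""
    stem = mdx_path.stem  # e.g. 'telemetry-and-metrics'
    blocks = extract_blocks(mdx_path.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, block in enumerate(blocks):
        (out_dir / f"{prefix}-{stem}_{i}.yaml").write_text(block)
    return len(blocks)
```

This naming scheme is what makes the `--prefix` deduplication work: the section prefix and the source doc's stem are both baked into every output filename.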
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# Report results

How to produce the per-doc breakdown table from the results CSV.

## Build the table from CSV

The validate step writes a CSV at `$RESULTS` with columns: `doc,filename,result,detail`

Aggregate it into the per-doc table using this bash:

```bash
RESULTS="<path-to-results.csv>"

echo "| Doc | Section | Blocks | Pass | Fail | Expected | Skip |"
echo "|-----|---------|--------|------|------|----------|------|"

TBLOCKS=0; TPASS=0; TFAIL=0; TEXPECTED=0; TSKIP=0

for doc in $(cut -d',' -f1 "$RESULTS" | sort -u); do
  blocks=$(grep -c "^$doc," "$RESULTS")
  pass=$(grep -c "^$doc,[^,]*,pass" "$RESULTS" || true)
  fail=$(grep -c "^$doc,[^,]*,fail" "$RESULTS" || true)
  expected=$(grep -c "^$doc,[^,]*,expected" "$RESULTS" || true)
  skip=$(grep -c "^$doc,[^,]*,skip" "$RESULTS" || true)

  # Determine section from prefix
  section="K8s"
  echo "$doc" | grep -q "^vmcp-" && section="vMCP"

  # Bold non-zero fail counts
  fail_display="$fail"
  [ "$fail" -gt 0 ] && fail_display="**$fail**"

  echo "| ${doc} | ${section} | ${blocks} | ${pass} | ${fail_display} | ${expected} | ${skip} |"

  TBLOCKS=$((TBLOCKS + blocks))
  TPASS=$((TPASS + pass))
  TFAIL=$((TFAIL + fail))
  TEXPECTED=$((TEXPECTED + expected))
  TSKIP=$((TSKIP + skip))
done

echo "| **TOTAL** | | **$TBLOCKS** | **$TPASS** | **$TFAIL** | **$TEXPECTED** | **$TSKIP** |"
```

## Table rules

- Include every doc that had at least 1 YAML block extracted
- Omit docs with 0 blocks
- Group K8s docs first (prefix `k8s-`), then vMCP docs (prefix `vmcp-`)
- Sort alphabetically within each section
- Bold the TOTAL row
- Bold any non-zero Fail count

## After the table

List each real failure with:

```text
### Real failures

**k8s-mcp-server-entry_1.yaml**: `strict decoding error: unknown field "spec.remoteURL"`
Fix: Change `remoteURL` to `remoteUrl`

**vmcp-optimizer_3.yaml**: `strict decoding error: unknown field "spec.modelCache.storageSize"`
Fix: Change `storageSize` to `size`
```

## Write to file

Always write the table and failure details to a markdown file:

- Single page: `TEST_DRYRUN_<PAGE_NAME>.md`
- Section: `TEST_DRYRUN_<SECTION_NAME>.md`
- All: `TEST_DRYRUN_ALL.md`
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Validate YAML blocks

How to dry-run validate extracted YAML blocks efficiently.

## Important: use a script file, not inline bash

Long-running `for` loops with nested `if/elif/else` and subshells do NOT work reliably as inline Bash tool calls - they get backgrounded or time out. Always write the validation loop to a temporary `.sh` file and execute it with `bash`.

## Step 1: Write the validation script

Write the following to a temp file (e.g., `$YAML_DIR/../validate.sh`). Replace `<yaml-dir>` with the actual YAML directory path.

```bash
#!/bin/bash
set -euo pipefail

YAML_DIR="<yaml-dir>"
RESULTS="$YAML_DIR/../results.csv"
> "$RESULTS"

for f in "$YAML_DIR"/*.yaml; do
  bn=$(basename "$f")
  doc=$(echo "$bn" | sed 's/_[0-9]*\.yaml$//')

  # Skip incomplete resources (snippets without kind/apiVersion/name)
  if ! grep -q 'kind:' "$f" || ! grep -q 'apiVersion:' "$f" || ! grep -q 'name:' "$f"; then
    echo "$doc,$bn,skip,incomplete fragment" >> "$RESULTS"
    continue
  fi

  OUT=$(kubectl apply --dry-run=server -f "$f" 2>&1) || true

  if echo "$OUT" | grep -q 'created (server dry run)'; then
    echo "$doc,$bn,pass," >> "$RESULTS"
  elif echo "$OUT" | grep -q 'namespaces.*not found'; then
    echo "$doc,$bn,pass,namespace missing but schema valid" >> "$RESULTS"
  elif echo "$OUT" | grep -qE '<SERVER_NAME>|<NAMESPACE>|<SERVER_URL>'; then
    echo "$doc,$bn,expected,placeholder name" >> "$RESULTS"
  elif echo "$OUT" | grep -qE 'incomingAuth: Required|compositeTools.*steps: Required'; then
    echo "$doc,$bn,expected,partial snippet" >> "$RESULTS"
  else
    ERR=$(echo "$OUT" | tr '\n' ' ' | head -c 200)
    echo "$doc,$bn,fail,$ERR" >> "$RESULTS"
  fi
done

echo "Done: $(wc -l < "$RESULTS") lines"
```

## Step 2: Run the script

Use the Write tool to create the script file, then execute it:

```bash
bash "$YAML_DIR/../validate.sh"
```

This runs in a single process and completes reliably regardless of how many YAML blocks are validated.

## Result classification

| Result     | Meaning                                  | Action needed?    |
| ---------- | ---------------------------------------- | ----------------- |
| `pass`     | Schema valid (or only namespace missing) | No                |
| `expected` | Placeholder name or partial snippet      | No                |
| `fail`     | Real schema error                        | Yes - fix the doc |
| `skip`     | Incomplete fragment (no kind/name)       | No                |

## Important notes

- ALWAYS write to a script file first, then execute with `bash`
- Use `|| true` after `kubectl apply` to prevent `set -e` from exiting on validation failures
- Check for `created (server dry run)` in the output rather than the exit code, because some versions of kubectl return non-zero even on dry-run success with warnings
- Use a CSV file for results, not bash variables or associative arrays
- Write one line per YAML block with: doc name, filename, result, detail
- The CSV is consumed by the report procedure to build the table
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright 2026 Stacklok, Inc.
# SPDX-License-Identifier: Apache-2.0

# Check prerequisites for dry-run documentation validation.
# Only requires kubectl and a cluster with CRDs installed (no operator needed).

set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
NC='\033[0m'

ERRORS=0
WARNINGS=0

echo "=== Dry-Run Validation Prerequisites ==="
echo ""

# Check kubectl
if command -v kubectl &>/dev/null; then
  echo -e "${GREEN}[OK]${NC} kubectl available"
else
  echo -e "${RED}[MISSING]${NC} kubectl (required)"
  ERRORS=$((ERRORS + 1))
fi

# Check cluster connectivity
echo ""
echo "--- Kubernetes cluster ---"
CLUSTER_REACHABLE=false
if kubectl cluster-info &>/dev/null; then
  CONTEXT=$(kubectl config current-context 2>/dev/null || echo "unknown")
  echo -e "${GREEN}[OK]${NC} Cluster reachable (context: $CONTEXT)"
  CLUSTER_REACHABLE=true
else
  echo -e "${RED}[MISSING]${NC} No Kubernetes cluster reachable"
  ERRORS=$((ERRORS + 1))
fi

# Check ToolHive CRDs installed (only if cluster is reachable)
echo ""
echo "--- ToolHive CRDs ---"
if [ "$CLUSTER_REACHABLE" = true ]; then
  CRD_COUNT=$(kubectl get crd 2>/dev/null | grep -c toolhive.stacklok.dev || true)
  if [ "$CRD_COUNT" -gt 0 ]; then
    echo -e "${GREEN}[OK]${NC} $CRD_COUNT ToolHive CRDs installed"
  else
    echo -e "${RED}[MISSING]${NC} No ToolHive CRDs found"
    echo "  Install with: helm upgrade --install toolhive-operator-crds oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds -n toolhive-system --create-namespace"
    ERRORS=$((ERRORS + 1))
  fi
else
  echo -e "${YELLOW}[SKIPPED]${NC} Cannot check CRDs (cluster unreachable)"
fi

# Check python3 for extraction script
echo ""
echo "--- Tools ---"
if command -v python3 &>/dev/null; then
  echo -e "${GREEN}[OK]${NC} python3 available"
else
  echo -e "${RED}[MISSING]${NC} python3 (required for YAML extraction)"
  ERRORS=$((ERRORS + 1))
fi

echo ""
echo "=== Summary ==="
if [ "$ERRORS" -gt 0 ]; then
  echo -e "${RED}$ERRORS error(s)${NC}"
  exit 1
else
  echo -e "${GREEN}All prerequisites met${NC}"
  exit 0
fi
