Commit 634fa0d

danmoseley and Copilot authored
Fix code review agent timeouts caused by gpt-5.4 (#126783)
> [!NOTE]
> This PR was developed with Copilot assistance based on analysis of workflow run logs and duration data.

## Problem

~20% of custom code review agent runs hit the 20-minute workflow timeout. Analysis of all 43 timeout runs from the last 1000 workflow executions shows:

- **93% of timeouts** are caused by GPT-5.4 sub-agents that never return
- GPT-5.4 is present in **100% of timeout runs** (24/24 checked in detail)
- GPT-5.2-only runs have **0 timeouts** in 6+ successful runs
- 86% of timed-out runs had already posted the review — they time out waiting for hung sub-agents
- Each timeout causes the agent job to fail, making the overall workflow **red in CI** (even though the `conclusion` job succeeds) — PR authors must manually rerun

The current SKILL.md rule "pick the highest version number" causes gpt-5.4 to always be selected when available.

## Changes (SKILL.md only)

1. **Block gpt-5.4** — it has known reliability issues. Recommend `gpt-5.3-codex` as the GPT-family pick instead. If that also exhibits hangs, we can block the GPT family entirely with no expected quality loss.
2. **Exit after posting** — the agent was lingering 2-3 minutes after successfully posting the review comment, waiting for hung sub-agents. Now it exits immediately once the comment is visible.
3. **Reduce max sub-agents from 4 to 3** — with only 2-3 model families available in practice, 4 was never fully utilized.
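The selection rules after change 1 can be sketched roughly as follows. This is an illustrative model, not code from the PR: the `pick_models` helper, the family/model names, and the tier handling are all hypothetical stand-ins for how the agent interprets the SKILL.md rules.

```python
import re

# Hypothetical sketch of the updated SKILL.md selection rules.
# All names here are illustrative; the real agent follows prose rules.
BLOCKED = {"gpt-5.4"}               # known reliability issues
EXCLUDED_LABELS = ("mini", "fast", "cheap")
MAX_SUBAGENTS = 3                   # reduced from 4

def version_key(name):
    # Order by numeric version; on ties, prefer "-codex" variants.
    nums = tuple(int(n) for n in re.findall(r"\d+", name))
    return (nums, "-codex" in name)

def pick_models(available, primary):
    """available maps family -> models the environment explicitly lists."""
    picks = []
    for family, models in available.items():
        eligible = [
            m for m in models
            if m not in BLOCKED
            and m != primary                                # diverse perspectives
            and not any(lbl in m for lbl in EXCLUDED_LABELS)
        ]
        if eligible:
            picks.append(max(eligible, key=version_key))
    return picks[:MAX_SUBAGENTS]
```

Under these rules, a list containing `gpt-5.2`, `gpt-5.3-codex`, and `gpt-5.4` yields `gpt-5.3-codex`: the blocked model is skipped and the codex variant wins the tie-break.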
## What this does NOT change

- The 10-minute sub-agent timeout instruction (already in place, appropriate for agents that do return)
- The overall workflow `timeout-minutes: 20` (hardcoded in the compiled `.lock.yml`)
- The review methodology, severity definitions, or quality bar
- Any CCR (Copilot Code Review) configuration

## Expected impact

- Eliminates the dominant timeout cause (GPT-5.4 hangs)
- Saves 2-3 min per run from exit-after-post
- No expected quality regression: GPT contributed unique blocking findings in 0% of sampled runs

## Data

| Metric | Value |
|--------|-------|
| Runs analyzed | 1000 workflow runs, 420 non-skipped, 218 with CLI data |
| Timeout rate | 19.7% (43/218) |
| GPT-5.4 in timeouts | 100% (24/24 detailed) |
| GPT-5.2 timeouts | 0% (0/6+ successful GPT-5.2 runs) |
| Reviews with GPT-unique findings | <8% |
| GPT-only blocking bugs found | 0 |
| MCP add_comment missing | 12/43 timeout runs (~6% of all runs) — platform issue, not addressed here |

## Why not increase the 20-minute timeout?

The GPT-5.4 sub-agent hangs indefinitely — there is no evidence it would eventually complete if given more time. Increasing the timeout would just delay the inevitable and waste more compute.

## Duration distribution (218 runs with CLI data)

| Bucket | Runs | % |
|--------|------|---|
| 0–2m | 18 | 8.3% |
| 2–4m | 24 | 11.0% |
| 4–6m | 29 | 13.3% |
| 6–8m | 25 | 11.5% |
| 8–10m | 26 | 11.9% |
| 10–12m | 21 | 9.6% |
| 12–14m | 9 | 4.1% |
| 14–16m | 8 | 3.7% |
| 16–18m | 9 | 4.1% |
| 18–20m | 6 | 2.8% |
| 20m+ (timeout) | 43 | 19.7% |

The bimodal distribution (healthy hump at 4–10m, spike at 20m wall) confirms these are hangs, not slow completions.

Related: #126779 (not a fix for that, but general efficiency improvement to the review agent)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
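A bucketing like the duration table above can be reproduced with a few lines of Python. This is a sketch only: the 2-minute bucket width and the label format are taken from the table, while the input durations would come from per-run CLI data that is not shown here.

```python
from collections import Counter

# Sketch of the 2-minute bucketing behind the duration table.
# Input durations (in minutes) are hypothetical, not real run data.
def bucket_label(minutes, timeout=20.0):
    if minutes >= timeout:
        return "20m+ (timeout)"
    lo = int(minutes // 2) * 2
    return f"{lo}-{lo + 2}m"

def distribution(durations):
    counts = Counter(bucket_label(m) for m in durations)
    total = len(durations)
    # label -> (run count, percentage of all runs)
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}
```

Runs that hit the wall land in a single `20m+ (timeout)` bucket, which is what makes the hang spike visually distinct from the healthy 4–10m hump.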
1 parent 25f7804 commit 634fa0d

1 file changed: .github/skills/code-review/SKILL.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -82,15 +82,17 @@ Now read the PR description, labels, linked issues (in full), author information
 When the environment supports launching sub-agents with different models (e.g., the `task` tool with a `model` parameter), run the review in parallel across multiple model families to get diverse perspectives. Different models catch different classes of issues. If the environment does not support this, proceed with a single-model review.
 
 **How to execute (when supported):**
-1. Inspect the available model list and select one model from each distinct model family (e.g., one Anthropic Claude, one Google Gemini, one OpenAI GPT). Use at least 2 and at most 4 models. **Model selection rules:**
+1. Inspect the available model list and select models from 2-3 distinct model families, up to 3 sub-agent models total. If fewer than 2 eligible families are available, use what is available. **Model selection rules:**
 - Pick only from models explicitly listed as available in the environment. Do not guess or assume model names.
-- From each family, pick the model with the highest capability tier (prefer "premium" or "standard" over "fast/cheap").
+- From each selected family, pick the model with the highest capability tier (prefer "premium" or "standard" over "fast/cheap").
 - Never pick models labeled "mini", "fast", or "cheap" for code review.
-- If multiple standard-tier models exist in the same family (e.g., `gpt-5` and `gpt-5.1`), pick the one with the highest version number.
 - Do not select the same model that is already running the primary review (i.e., your own model). The goal is diverse perspectives from different model families.
+- **Do not use `gpt-5.4`** — it has known reliability issues causing sub-agent timeouts in >90% of affected runs. For the OpenAI/GPT family, prefer `gpt-5.3-codex` if it is explicitly listed as available; otherwise, fall back to the highest-version non-blocked GPT model that satisfies the other rules here.
+- If multiple standard-tier models exist in the same family (excluding blocked models above), pick the one with the highest version number. Prefer "-codex" variants over general-purpose for code review tasks.
 2. Launch a sub-agent for each selected model in parallel, giving each the same review prompt: the PR diff, the review rules from this skill, and instructions to produce findings in the severity format defined above.
 3. Wait for all agents to complete, then synthesize: deduplicate findings that appear across models, elevate issues flagged by multiple models (higher confidence), and include unique findings from individual models that meet the confidence bar. **Timeout handling:** If a sub-agent has not completed after 10 minutes and you have results from other agents, proceed with the results you have. Do not block the review indefinitely waiting for a single slow model. Note in the output which models contributed.
-4. Present a single unified review to the user, noting when an issue was flagged by multiple models.
+4. Present a single unified review to the user, noting when an issue was flagged by multiple models. **After posting the review, immediately exit.** Do not wait for any remaining sub-agents. Do not attempt retries if the comment was posted successfully. The review is complete once the post operation succeeds or returns a comment URL.
+
 
 ---
 
```
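The two timing behaviors in the skill, the 10-minute per-sub-agent wait and the new exit-after-post rule, compose roughly like the sketch below. `run_subagent` and `post_review` are hypothetical stand-ins for the real tools; the agent itself follows prose instructions, not this code.

```python
import concurrent.futures as cf

SUBAGENT_TIMEOUT_S = 10 * 60  # per the skill's timeout-handling instruction

def review(models, run_subagent, post_review, timeout=SUBAGENT_TIMEOUT_S):
    # run_subagent and post_review are hypothetical stand-ins.
    pool = cf.ThreadPoolExecutor(max_workers=max(len(models), 1))
    futures = {pool.submit(run_subagent, m): m for m in models}

    # Wait up to the sub-agent timeout, then proceed with whatever finished.
    done, pending = cf.wait(futures, timeout=timeout)
    findings = [f.result() for f in done]

    # Abandon hung sub-agents instead of blocking the review on them;
    # wait=False means we do not join their threads before returning.
    pool.shutdown(wait=False, cancel_futures=True)

    url = post_review(findings)
    # Exit as soon as the post succeeds; no retries, no further waiting.
    return url
```

The key design choice mirrored here is that a hung sub-agent costs at most the wait window, never the whole workflow: once the comment is posted, nothing else keeps the job alive.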
