Commit 634fa0d
Fix code review agent timeouts caused by gpt-5.4 (#126783)
> [!NOTE]
> This PR was developed with Copilot assistance based on analysis of
> workflow run logs and duration data.
## Problem
~20% of custom code review agent runs hit the 20-minute workflow
timeout. Analysis of all 43 timeout runs from the last 1000 workflow
executions shows:
- **93% of timeouts** are caused by GPT-5.4 sub-agents that never return
- GPT-5.4 is present in **100% of timeout runs** (24/24 checked in
detail)
- GPT-5.2-only runs show **0 timeouts** across the 6+ such runs observed
- 86% of timed-out runs had already posted the review — they time out
waiting for hung sub-agents
- Each timeout causes the agent job to fail, making the overall workflow
**red in CI** (even though the `conclusion` job succeeds) — PR authors
must manually rerun
The current SKILL.md rule "pick the highest version number" causes
gpt-5.4 to always be selected when available.
## Changes (SKILL.md only)
1. **Block gpt-5.4** — it has known reliability issues. Recommend
`gpt-5.3-codex` as the GPT-family pick instead. If that also exhibits
hangs, we can block the GPT family entirely with no expected quality
loss.
2. **Exit after posting** — the agent was lingering 2-3 minutes after
successfully posting the review comment, waiting for hung sub-agents.
Now it exits immediately once the comment is visible.
3. **Reduce max sub-agents from 4 to 3** — with only 2-3 model families
available in practice, 4 was never fully utilized.
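The revised selection rule can be sketched as follows. This is a hypothetical illustration of the SKILL.md prose rule, not actual agent code; the model names come from this PR, but `pick_gpt_model` and its blocklist structure are assumptions.

```python
# Hypothetical sketch of the revised model-selection rule described above.
# Old rule: pick the highest version number, which always selected gpt-5.4.
# New rule: filter a blocklist first, prefer gpt-5.3-codex within the GPT family.
BLOCKED = {"gpt-5.4"}  # known to hang indefinitely as a sub-agent

def pick_gpt_model(available):
    """Return the preferred GPT-family model, or None if all are blocked."""
    candidates = [m for m in available if m not in BLOCKED]
    if "gpt-5.3-codex" in candidates:
        return "gpt-5.3-codex"
    # Fall back to the old highest-version rule over the remaining candidates.
    return max(candidates, default=None)

print(pick_gpt_model(["gpt-5.2", "gpt-5.3-codex", "gpt-5.4"]))
# → gpt-5.3-codex (gpt-5.4 is filtered out before the version comparison)
```

Returning `None` when every GPT model is blocked corresponds to the fallback mentioned above: blocking the GPT family entirely.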
## What this does NOT change
- The 10-minute sub-agent timeout instruction (already in place,
appropriate for agents that do return)
- The overall workflow `timeout-minutes: 20` (hardcoded in the compiled
`.lock.yml`)
- The review methodology, severity definitions, or quality bar
- Any CCR (Copilot Code Review) configuration
## Expected impact
- Eliminates the dominant timeout cause (GPT-5.4 hangs)
- Saves 2-3 min per run from exit-after-post
- No expected quality regression: GPT contributed unique blocking
findings in 0% of sampled runs
## Data
| Metric | Value |
|--------|-------|
| Runs analyzed | 1000 workflow runs, 420 non-skipped, 218 with CLI data |
| Timeout rate | 19.7% (43/218) |
| GPT-5.4 in timeouts | 100% (24/24 detailed) |
| GPT-5.2 timeouts | 0% (0/6+ successful GPT-5.2 runs) |
| Reviews with GPT-unique findings | <8% |
| GPT-only blocking bugs found | 0 |
| MCP add_comment missing | 12/43 timeout runs (~6% of all runs); platform issue, not addressed here |
## Why not increase the 20-minute timeout?
The GPT-5.4 sub-agent hangs indefinitely — there is no evidence it would
eventually complete if given more time. Increasing the timeout would
just delay the inevitable and waste more compute.
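A back-of-envelope check makes the cost concrete (hypothetical arithmetic based on the figures above; the 30-minute alternative is an assumed comparison point, not a proposed value):

```python
# If hung sub-agents never complete, raising timeout-minutes from 20 to 30
# adds compute to every timed-out run while recovering zero reviews.
timed_out_runs = 43            # timeouts among the 218 runs with CLI data
extra_minutes_per_run = 30 - 20
wasted = timed_out_runs * extra_minutes_per_run
print(wasted)  # → 430 extra compute-minutes, no additional reviews posted
```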
## Duration distribution (218 runs with CLI data)
| Bucket | Runs | % |
|--------|------|---|
| 0–2m | 18 | 8.3% |
| 2–4m | 24 | 11.0% |
| 4–6m | 29 | 13.3% |
| 6–8m | 25 | 11.5% |
| 8–10m | 26 | 11.9% |
| 10–12m | 21 | 9.6% |
| 12–14m | 9 | 4.1% |
| 14–16m | 8 | 3.7% |
| 16–18m | 9 | 4.1% |
| 18–20m | 6 | 2.8% |
| 20m+ (timeout) | 43 | 19.7% |
The bimodal distribution (healthy hump at 4–10m, spike at 20m wall)
confirms these are hangs, not slow completions.
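The bucketing used for the table above can be sketched like this. The `durations` list is hypothetical sample data for illustration, not the real 218-run dataset.

```python
# Sketch of how per-run durations (in minutes) map to the 2-minute buckets
# in the table above, with 20m+ treated as the timeout bucket.
from collections import Counter

def bucket(minutes):
    if minutes >= 20:
        return "20m+ (timeout)"
    lo = int(minutes // 2) * 2
    return f"{lo}–{lo + 2}m"

durations = [1.5, 5.0, 5.2, 7.1, 9.4, 20.0, 20.0]  # hypothetical sample
counts = Counter(bucket(m) for m in durations)
for name, n in counts.items():
    print(f"{name}: {n} ({n / len(durations):.1%})")
```

With real data, a healthy distribution tails off smoothly; a second spike pinned exactly at the 20-minute wall is the signature of hangs cut off by the workflow timeout.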
Related: #126779 (this PR is not a fix for that issue, but a general
efficiency improvement to the review agent)
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 file changed: 6 additions & 4 deletions