Handle GitHub rate limits with specialized error and Prometheus metrics#6306
Handle GitHub rate limits with specialized error and Prometheus metrics#6306krrish175-byte wants to merge 4 commits intomindersec:mainfrom
Conversation
evankanderson
left a comment
There was a problem hiding this comment.
A few small comments, but this is looking reasonable, thank you!
| if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) { | ||
| // If the hook is not found, we can ignore the | ||
| // error, user might have deleted it manually. | ||
| // We also ignore deleting webhooks that we're not | ||
| // allowed to touch. This is usually the case | ||
| // with repository transfer. | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I'm not sure these are in the "err is set" case. (They weren't in such a block before.) Are you sure that these set err?
There was a problem hiding this comment.
Still wondering about these being in the err != nil case, since they weren't previously in such a block.
| // or if it's an abuse rate limit. | ||
| isThrottled := false | ||
| if remainingHeader != "" { | ||
| if remaining, err := strconv.ParseInt(remainingHeader, 10, 64); err == nil && remaining == 0 { |
There was a problem hiding this comment.
I'd prefer to parse the remainingHeader once rather than re-processing it here. Maybe use a sentinel value like math.MinInt64 or the like to detect that the value couldn't be found?
| if resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusTooManyRequests { | ||
| // Verify it's actually a rate limit error by checking the remaining quota (if 0) | ||
| // or if it's an abuse rate limit. | ||
| isThrottled := false |
There was a problem hiding this comment.
You can simplify this with the case on 78-80:
| isThrottled := false | |
| isThrottled := resp.StatusCode == http.StatusTooManyRequests | |
| // then add the case where we don't get a 429, but we have 0 quota remaining |
| return m.resp, m.err | ||
| } | ||
|
|
||
| func TestInstrumentedRoundTripper_RateLimitMetrics(t *testing.T) { |
There was a problem hiding this comment.
Missing a call to t.Parallel here
| }, | ||
| } | ||
|
|
||
| for _, tt := range tests { |
There was a problem hiding this comment.
And t.Parallel here as well.
- common.go: use errors.AsType[T] (Go 1.26) in isRateLimitError - go.mod: bump Go version to 1.26 - telemetry.go: parse remainingHeader once using math.MinInt64 sentinel, simplify isThrottled initialization to resp.StatusCode == 429 - telemetry_test.go: add t.Parallel() to test function and subtest loop
|
@evankanderson PTAL |
- datasources/rest/handler.go: drain and close response body before retrying on 429 to fix gosec G104 unhandled error warnings - security.yml: add helm dependency update step before Helm scan so the common chart dependency is present for Trivy to render the chart
evankanderson
left a comment
There was a problem hiding this comment.
Looks like you have a merge conflict in internal/datasources/rest/handler.go (just discard your changes), and one question in common.go where you moved a block inside an if err != nil block where it wasn't before.
| Msg("rate limited, retrying") | ||
| retryCount++ | ||
| // Drain and close the body before retrying to allow connection reuse. | ||
| _, _ = io.Copy(io.Discard, resp.Body) |
There was a problem hiding this comment.
I think these were handled in #6312 -- maybe sync / discard these changes?
(Sorry, between a few CI flakes and about 5-6 PRs merging each day, things have gotten less clean than I'd like.)
| if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) { | ||
| // If the hook is not found, we can ignore the | ||
| // error, user might have deleted it manually. | ||
| // We also ignore deleting webhooks that we're not | ||
| // allowed to touch. This is usually the case | ||
| // with repository transfer. | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Still wondering about these being in the err != nil case, since they weren't previously in such a block.
Summary
This PR implements robust handling for GitHub API throttling and exposes relevant Prometheus metrics to monitor rate limit usage, resolving Issue #4616.
Key changes:
pkg/engine/errorsthat wraps GitHub rate limit information (limit, remaining, reset time), giving upper layers structured context for throttling events.internal/providers/telemetryto automatically parse GitHub-specificX-RateLimit-*headers from every response.github.http.ratelimit.remaining(Histogram): Records the available quota to monitor usage patterns.github.http.throttled.count(Counter): Tracks the frequency of 403/429 responses specifically caused by rate limiting.CreateHook,DeleteHook,UpdateCheckRun, etc.) to consistently use the new error type and the existing rate-limit wait logic.pkg/engine/errorsto match the latest upstream repository organization.Fixes #4616
Testing
internal/providers/telemetry/telemetry_test.gousing OpenTelemetry'sManualReaderto verify accurate header parsing and metric recording across normal and throttled scenarios.internal/providers/github/common_test.goto ensure correct propagation of the new specialized error type.make buildand a full test run.