Skip to content

Handle GitHub rate limits with specialized error and Prometheus metrics#6306

Open
krrish175-byte wants to merge 4 commits intomindersec:mainfrom
krrish175-byte:fix/issue-4616
Open

Handle GitHub rate limits with specialized error and Prometheus metrics#6306
krrish175-byte wants to merge 4 commits intomindersec:mainfrom
krrish175-byte:fix/issue-4616

Conversation

@krrish175-byte
Copy link
Copy Markdown
Contributor

Summary

This PR implements robust handling for GitHub API throttling and exposes relevant Prometheus metrics to monitor rate limit usage, resolving Issue #4616.

Key changes:

  • Specialized RateLimitError: Introduced a new error type in pkg/engine/errors that wraps GitHub rate limit information (limit, remaining, reset time), giving upper layers structured context for throttling events.
  • Telemetry Instrumentation: Extended the HTTP client instrumentation in internal/providers/telemetry to automatically parse GitHub-specific X-RateLimit-* headers from every response.
  • Observability Metrics:
    • github.http.ratelimit.remaining (Histogram): Records the available quota to monitor usage patterns.
    • github.http.throttled.count (Counter): Tracks the frequency of 403/429 responses specifically caused by rate limiting.
  • Provider Consistency: Updated core GitHub provider methods (CreateHook, DeleteHook, UpdateCheckRun, etc.) to consistently use the new error type and the existing rate-limit wait logic.
  • Structural Alignment: Migrated the engine errors package to pkg/engine/errors to match the latest upstream repository organization.

Fixes #4616

Testing

  • Automated Verification: Created a new test suite in internal/providers/telemetry/telemetry_test.go using OpenTelemetry's ManualReader to verify accurate header parsing and metric recording across normal and throttled scenarios.
  • Integration Stability: Updated and verified existing tests in internal/providers/github/common_test.go to ensure correct propagation of the new specialized error type.
  • Full Build: Verified project stability with a clean make build and a full test run.

@krrish175-byte krrish175-byte requested a review from a team as a code owner April 8, 2026 17:07
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 8, 2026

Coverage Status

coverage: 58.753% (+0.1%) from 58.616% — krrish175-byte:fix/issue-4616 into mindersec:main

Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments, but this is looking reasonable, thank you!

Comment on lines +595 to +602
if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) {
// If the hook is not found, we can ignore the
// error, user might have deleted it manually.
// We also ignore deleting webhooks that we're not
// allowed to touch. This is usually the case
// with repository transfer.
return nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure these are in the "err is set" case. (They weren't in such a block before.) Are you sure that these set err?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wondering about these being in the err != nil case, since they weren't previously in such a block.

// or if it's an abuse rate limit.
isThrottled := false
if remainingHeader != "" {
if remaining, err := strconv.ParseInt(remainingHeader, 10, 64); err == nil && remaining == 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to parse the remainingHeader once rather than re-processing it here. Maybe use a sentinel value like math.MinInt64 or the like to detect that the value couldn't be found?

if resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusTooManyRequests {
// Verify it's actually a rate limit error by checking the remaining quota (if 0)
// or if it's an abuse rate limit.
isThrottled := false
Copy link
Copy Markdown
Member

@evankanderson evankanderson Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this with the case on 78-80:

Suggested change
isThrottled := false
isThrottled := resp.StatusCode == http.StatusTooManyRequests
// then add the case where we don't get a 429, but we have 0 quota remaining

return m.resp, m.err
}

func TestInstrumentedRoundTripper_RateLimitMetrics(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a call to t.Parallel here

},
}

for _, tt := range tests {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And t.Parallel here as well.

- common.go: use errors.AsType[T] (Go 1.26) in isRateLimitError
- go.mod: bump Go version to 1.26
- telemetry.go: parse remainingHeader once using math.MinInt64 sentinel,
  simplify isThrottled initialization to resp.StatusCode == 429
- telemetry_test.go: add t.Parallel() to test function and subtest loop
@krrish175-byte
Copy link
Copy Markdown
Contributor Author

@evankanderson PTAL

- datasources/rest/handler.go: drain and close response body before
  retrying on 429 to fix gosec G104 unhandled error warnings
- security.yml: add helm dependency update step before Helm scan so
  the common chart dependency is present for Trivy to render the chart
Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you have a merge conflict in internal/datasources/rest/handler.go (just discard your changes), and one question in common.go where you moved a block inside an if err != nil block where it wasn't before.

Msg("rate limited, retrying")
retryCount++
// Drain and close the body before retrying to allow connection reuse.
_, _ = io.Copy(io.Discard, resp.Body)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these were handled in #6312 -- maybe sync / discard these changes?

(Sorry, between a few CI flakes and about 5-6 PRs merging each day, things have gotten less clean than I'd like.)

Comment on lines +595 to +602
if resp != nil && (resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusForbidden) {
// If the hook is not found, we can ignore the
// error, user might have deleted it manually.
// We also ignore deleting webhooks that we're not
// allowed to touch. This is usually the case
// with repository transfer.
return nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wondering about these being in the err != nil case, since they weren't previously in such a block.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle GitHub throttling errors properly

3 participants