Collect Pull Requests Phase 2 triggers persistent 502s #8824
Replies: 1 comment
Great write-up — your analysis is spot-on. Here's what I found in the codebase and existing issues that directly relates:

- The PR #8791 [2] provides the comprehensive fix for this: it replaces both panics with graceful error handling and also wires up the GraphQL client to use …
- On your suggestion #1 (reducing InputStep): …
- One additional workaround until these fixes land: …
- @zaiddialpad reported very similar 502 issues at scale in discussion #8821. The recommended mitigations are: …
- On making Phase 2 optional: there's no direct toggle, but setting …
Environment

- Plugin: github_graphql
- Version: v1.0.3-beta9

Problem Summary
The "Collect Pull Requests" subtask in the github_graphql plugin has been failing consistently for our largest repository since late March 2026. The failures are caused by two separate issues that compound each other:

1. Phase 2's compound GraphQL queries consistently trigger 502 Bad Gateway responses from GitHub.
2. updateRateRemaining crashes the entire pod when GitHub returns a 401 during the retry loop.

Detailed Description
Phase 2 behavior
Looking at pr_collector.go, the "Collect Pull Requests" subtask operates in two phases:

1. Phase 1 collects PRs incrementally, picking up any PR whose updatedAt is newer than the collector's cursor.
2. Phase 2 queries the local DB for all PRs with state = 'OPEN', then refetches them from GitHub in batches of 100 using individual pullRequest(number: $number) GraphQL queries.

For our repo with ~1,200 OPEN PRs, Phase 2 generates 12+ batches of 100 PRs each. These compound GraphQL queries (each requesting 100 individual PRs by number in a single request) are heavy enough to consistently trigger 502 Bad Gateway responses from GitHub's GraphQL API.
The retry logic (retry #0 through retry #9, 120s apart) does not help because the 502s are not transient for these heavy queries — they fail on every attempt. After exhausting retries on one batch, the collector moves to the next batch, which also fails.

Observed log pattern
This pattern repeats for every batch. The task runs for hours exhausting retries across all batches before finally failing.
Panic crash (separate but compounding issue)
During the retry loops, the updateRateRemaining function calls GitHub's /rate_limit REST endpoint to check remaining quota. If GitHub returns a non-200 response (e.g., 401 Unauthorized due to a transient token issue during heavy load), the code panics instead of handling the error gracefully.

This crashes the pod (our lake pod is now at 42 restarts). After the pod restarts, the task is retried from scratch but hits the same 502 pattern again.
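A simplified sketch of the failure mode (the real function in graphql_async_client.go works on an HTTP response; the signature below is an assumption for illustration):

```go
package main

import "fmt"

// updateRateRemainingSketch mimics the reported behavior (simplified
// assumption, not the actual source): any non-200 from /rate_limit panics.
// In the real pod nothing recovers this panic, so the whole process dies.
func updateRateRemainingSketch(statusCode int) int {
	if statusCode != 200 {
		panic(fmt.Sprintf("failed to get rate limit: status %d", statusCode))
	}
	return 5000 // parsed "remaining" quota (parsing elided)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Only this demo recovers; the production code path does not.
			fmt.Println("panic:", r)
		}
	}()
	updateRateRemainingSketch(401) // one transient 401 is enough
}
```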
Why this is getting worse over time
The number of OPEN PRs in the DB only grows:
We attempted to trim stale OPEN PRs in the DB (setting state = 'CLOSED' for PRs not updated since Jan 1, 2026), which temporarily reduced the count to 516. However, the next successful run re-collected those PRs from GitHub and set them back to OPEN, undoing the fix entirely.

Success rate history
Suggested Fixes
1. Reduce InputStep for Phase 2 (high impact, easy fix)

In pr_collector.go, the Phase 2 InputStep is hardcoded to 100. Reducing this to 10-20 would create lighter GraphQL queries that are far less likely to trigger 502s. For our repo, this would mean 60-120 smaller batches instead of 12 heavy ones. Each individual request would be 5-10x lighter on GitHub's backend.
Could this be made configurable per-connection or per-scope?
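The batch-count arithmetic above can be sketched as follows (numBatches is a hypothetical helper for illustration, not part of the plugin):

```go
package main

import "fmt"

// numBatches returns how many Phase 2 requests it takes to cover all OPEN
// PRs at a given InputStep (ceiling division).
func numBatches(totalPrs, inputStep int) int {
	return (totalPrs + inputStep - 1) / inputStep
}

func main() {
	// ~1,200 OPEN PRs, as in the report.
	for _, step := range []int{100, 20, 10} {
		fmt.Printf("InputStep=%d -> %d batches of <=%d PRs each\n",
			step, numBatches(1200, step), step)
	}
}
```

This reproduces the numbers above: 12 heavy batches at InputStep=100 versus 60-120 much lighter ones at 10-20.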
2. Fix the panic in updateRateRemaining (critical stability fix)

In graphql_async_client.go:129, the panic should be replaced with proper error handling. This panic crashes the entire pod, affecting all running tasks — not just the one that triggered it.
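A minimal sketch of the proposed error-returning shape, assuming a simplified signature (the real function's parameters, parsing, and caller differ):

```go
package main

import "fmt"

// updateRateRemaining returns an error instead of panicking, so the caller
// can log the failure and keep its previous rate estimate rather than
// taking down the pod. Simplified illustration, not the actual fix in
// PR #8791.
func updateRateRemaining(statusCode int) (int, error) {
	if statusCode != 200 {
		return 0, fmt.Errorf("fetch rate limit failed: status %d", statusCode)
	}
	return 5000, nil // parsed "remaining" quota (parsing elided)
}

func main() {
	if _, err := updateRateRemaining(401); err != nil {
		// Degrade gracefully: warn and reuse the last known value.
		fmt.Println("warn:", err, "- keeping previous rate estimate")
	}
}
```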
3. Consider making Phase 2 optional or configurable
For repos with very large numbers of OPEN PRs, Phase 2 is the bottleneck. Phase 1 already catches any OPEN PR that gets updated (since updated PRs have a newer updatedAt that Phase 1's cursor picks up). Phase 2 only adds value for PRs whose state changes without any other activity — which is relatively rare.

An option to disable Phase 2 per-scope, or to cap the number of OPEN PRs refetched, would help large repos significantly.
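A hypothetical sketch of what such a per-scope option could look like; these field names do not exist in the plugin today and are invented purely for illustration:

```go
package main

import "fmt"

// scopeConfig sketches the requested knobs: skip the Phase 2 refetch
// entirely, or cap how many OPEN PRs it touches. Hypothetical names.
type scopeConfig struct {
	SkipOpenPrRefetch bool // disable Phase 2 for this scope
	MaxOpenPrRefetch  int  // 0 = no cap; otherwise refetch at most N PRs
}

// openPrsToRefetch applies the config to the list of OPEN PR numbers
// queried from the DB before Phase 2 batches them.
func openPrsToRefetch(cfg scopeConfig, openPrs []int) []int {
	if cfg.SkipOpenPrRefetch {
		return nil
	}
	if cfg.MaxOpenPrRefetch > 0 && len(openPrs) > cfg.MaxOpenPrRefetch {
		return openPrs[:cfg.MaxOpenPrRefetch]
	}
	return openPrs
}

func main() {
	prs := []int{1, 2, 3, 4, 5}
	fmt.Println(len(openPrsToRefetch(scopeConfig{SkipOpenPrRefetch: true}, prs))) // 0
	fmt.Println(len(openPrsToRefetch(scopeConfig{MaxOpenPrRefetch: 3}, prs)))     // 3
}
```

Either knob would bound Phase 2's load on large repos while leaving Phase 1's incremental collection untouched.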
Reproduction

1. Set up a github_graphql connection for a repository that has 1,000+ PRs with state = 'OPEN' in the DB.
2. Run the "Collect Pull Requests" subtask; Phase 2's batches of 100 return 502 on every retry attempt.
3. If GitHub returns a non-200 (e.g., 401) on /rate_limit during the retry loop, the pod crashes.
Related

- The updateRateRemaining panic also exists on the main branch (not just v1.0.3-beta9).