Collect Pull Requests Phase 2 triggers persistent 502s #8824
Replies: 1 comment
Great write-up — your analysis is spot-on. Here's what I found in the codebase and existing issues that directly relates:

- The PR #8791 [2] provides the comprehensive fix for this: it replaces both panics with graceful error handling and also wires up the GraphQL client to use …
- On your suggestion #1 (reducing InputStep): …
- One additional workaround until these fixes land: …
- @zaiddialpad reported very similar 502 issues at scale in discussion #8821. The recommended mitigations are: …
- On making Phase 2 optional: there's no direct toggle, but setting …
Environment

- Plugin: github_graphql
- Version: v1.0.3-beta9

Problem Summary
The "Collect Pull Requests" subtask in the github_graphql plugin has been failing consistently for our largest repository since late March 2026. The failures are caused by two separate issues that compound each other:

1. Phase 2's compound GraphQL queries consistently trigger 502 Bad Gateway responses from GitHub.
2. updateRateRemaining crashes the entire pod when GitHub returns a 401 during the retry loop.

Detailed Description
Phase 2 behavior
Looking at pr_collector.go, the "Collect Pull Requests" subtask operates in two phases:

1. Phase 1 collects PRs incrementally, picking up any PR whose updatedAt is newer than the collector's cursor.
2. Phase 2 queries the local DB for all PRs with state = 'OPEN', then refetches them from GitHub in batches of 100 using individual pullRequest(number: $number) GraphQL queries.

For our repo with ~1,200 OPEN PRs, Phase 2 generates 12+ batches of 100 PRs each. These compound GraphQL queries (each requesting 100 individual PRs by number in a single request) are heavy enough to consistently trigger 502 Bad Gateway responses from GitHub's GraphQL API.
The retry logic (retry #0 through retry #9, 120s apart) does not help because the 502s are not transient for these heavy queries — they fail on every attempt. After exhausting retries on one batch, the collector moves to the next batch, which also fails.

Observed log pattern
This pattern repeats for every batch. The task runs for hours exhausting retries across all batches before finally failing.
Panic crash (separate but compounding issue)
During the retry loops, the updateRateRemaining function calls GitHub's /rate_limit REST endpoint to check remaining quota. If GitHub returns a non-200 response (e.g., 401 Unauthorized due to a transient token issue during heavy load), the code panics instead of handling the error gracefully.

This crashes the pod (our lake pod is now at 42 restarts). After the pod restarts, the task is retried from scratch but hits the same 502 pattern again.
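A simplified sketch of the failure mode (the real function in graphql_async_client.go works on an HTTP response; the signature below is an assumption for illustration):

```go
package main

import "fmt"

// updateRateRemainingSketch mimics the reported behavior (simplified
// assumption, not the actual source): any non-200 from /rate_limit panics.
// In the real pod nothing recovers this panic, so the whole process dies.
func updateRateRemainingSketch(statusCode int) int {
	if statusCode != 200 {
		panic(fmt.Sprintf("failed to get rate limit: status %d", statusCode))
	}
	return 5000 // parsed "remaining" quota (parsing elided)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Only this demo recovers; the production code path does not.
			fmt.Println("panic:", r)
		}
	}()
	updateRateRemainingSketch(401) // one transient 401 is enough
}
```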
Why this is getting worse over time
The number of OPEN PRs in the DB only grows:
We attempted to trim stale OPEN PRs in the DB (setting state = 'CLOSED' for PRs not updated since Jan 1, 2026), which temporarily reduced the count to 516. However, the next successful run re-collected those PRs from GitHub and set them back to OPEN, undoing the fix entirely.

Success rate history
Suggested Fixes
1. Reduce InputStep for Phase 2 (high impact, easy fix)

In pr_collector.go, the Phase 2 InputStep is hardcoded to 100. Reducing this to 10-20 would create lighter GraphQL queries that are far less likely to trigger 502s. For our repo, this would mean 60-120 smaller batches instead of 12 heavy ones. Each individual request would be 5-10x lighter on GitHub's backend.
Could this be made configurable per-connection or per-scope?
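The batch-count arithmetic above can be sketched as follows (numBatches is a hypothetical helper for illustration, not part of the plugin):

```go
package main

import "fmt"

// numBatches returns how many Phase 2 requests it takes to cover all OPEN
// PRs at a given InputStep (ceiling division).
func numBatches(totalPrs, inputStep int) int {
	return (totalPrs + inputStep - 1) / inputStep
}

func main() {
	// ~1,200 OPEN PRs, as in the report.
	for _, step := range []int{100, 20, 10} {
		fmt.Printf("InputStep=%d -> %d batches of <=%d PRs each\n",
			step, numBatches(1200, step), step)
	}
}
```

This reproduces the numbers above: 12 heavy batches at InputStep=100 versus 60-120 much lighter ones at 10-20.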
2. Fix the panic in updateRateRemaining (critical stability fix)

In graphql_async_client.go:129, the panic should be replaced with proper error handling. This panic crashes the entire pod, affecting all running tasks — not just the one that triggered it.
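A minimal sketch of the proposed error-returning shape, assuming a simplified signature (the real function's parameters, parsing, and caller differ):

```go
package main

import "fmt"

// updateRateRemaining returns an error instead of panicking, so the caller
// can log the failure and keep its previous rate estimate rather than
// taking down the pod. Simplified illustration, not the actual fix in
// PR #8791.
func updateRateRemaining(statusCode int) (int, error) {
	if statusCode != 200 {
		return 0, fmt.Errorf("fetch rate limit failed: status %d", statusCode)
	}
	return 5000, nil // parsed "remaining" quota (parsing elided)
}

func main() {
	if _, err := updateRateRemaining(401); err != nil {
		// Degrade gracefully: warn and reuse the last known value.
		fmt.Println("warn:", err, "- keeping previous rate estimate")
	}
}
```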
3. Consider making Phase 2 optional or configurable
For repos with very large numbers of OPEN PRs, Phase 2 is the bottleneck. Phase 1 already catches any OPEN PR that gets updated (since updated PRs have a newer updatedAt that Phase 1's cursor picks up). Phase 2 only adds value for PRs whose state changes without any other activity — which is relatively rare.

An option to disable Phase 2 per-scope, or to cap the number of OPEN PRs refetched, would help large repos significantly.
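A hypothetical sketch of what such a per-scope option could look like; these field names do not exist in the plugin today and are invented purely for illustration:

```go
package main

import "fmt"

// scopeConfig sketches the requested knobs: skip the Phase 2 refetch
// entirely, or cap how many OPEN PRs it touches. Hypothetical names.
type scopeConfig struct {
	SkipOpenPrRefetch bool // disable Phase 2 for this scope
	MaxOpenPrRefetch  int  // 0 = no cap; otherwise refetch at most N PRs
}

// openPrsToRefetch applies the config to the list of OPEN PR numbers
// queried from the DB before Phase 2 batches them.
func openPrsToRefetch(cfg scopeConfig, openPrs []int) []int {
	if cfg.SkipOpenPrRefetch {
		return nil
	}
	if cfg.MaxOpenPrRefetch > 0 && len(openPrs) > cfg.MaxOpenPrRefetch {
		return openPrs[:cfg.MaxOpenPrRefetch]
	}
	return openPrs
}

func main() {
	prs := []int{1, 2, 3, 4, 5}
	fmt.Println(len(openPrsToRefetch(scopeConfig{SkipOpenPrRefetch: true}, prs))) // 0
	fmt.Println(len(openPrsToRefetch(scopeConfig{MaxOpenPrRefetch: 3}, prs)))     // 3
}
```

Either knob would bound Phase 2's load on large repos while leaving Phase 1's incremental collection untouched.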
Reproduction

1. Set up a github_graphql connection for a repository that has 1,000+ PRs with state = 'OPEN' in the DB.
2. Run the "Collect Pull Requests" subtask; Phase 2's batches of 100 return 502 on every retry attempt.
3. If GitHub returns a non-200 (e.g., 401) on /rate_limit during the retry loop, the pod crashes.
Related

- The updateRateRemaining panic also exists on the main branch (not just v1.0.3-beta9).