[Bug][github] "Collect Workflow Runs" fails with HTTP 422 on large repos due to GitHub.com pagination cap #8842

@yamoyamoto

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When collecting workflow runs from a large GitHub.com (Enterprise Cloud) repository, the Collect Workflow Runs subtask fails with HTTP 422 once the collector crosses an undocumented pagination boundary, and the whole pipeline aborts. A representative error log:

subtask Collect Workflow Runs ended unexpectedly
Wraps: (2) Error waiting for async Collector execution
  | combined messages: 
  | {
  |   Retry exceeded 3 times calling repos/{owner}/{repo}/actions/runs.
  |   The last error was: Http DoAsync error calling
  |   [method:GET path:repos/{owner}/{repo}/actions/runs
  |    query:map[page:[1340] per_page:[30]]].
  |   Response: {"message":"In order to keep the API fast for everyone,
  |              pagination is limited for this resource.",
  |              "documentation_url":"https://docs.github.com/v3/#pagination",
  |              "status":"422"} (422)
  | }

page=1340 × per_page=30 = 40,200 items, which exceeds a hard cap of roughly page × per_page ≤ 40,000 items that github.com enforces on unfiltered /actions/runs pagination. The cap is easy to probe directly with gh api: per_page=100&page=400 returns HTTP 200 with total_count: 40000 and a Link rel="last" pointing at page=400, while per_page=100&page=401 (and anything beyond) returns HTTP 422 with the same "pagination is limited for this resource" message. total_count is clamped at 40,000 even though the repository has significantly more runs.

Neither the 40k boundary nor the 422 response is described in the official docs for this endpoint, which mention only a separate cap for filtered queries ("up to 1,000 results for each search when using actor, branch, check_suite_id, created, event, head_sha, or status").
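To make the arithmetic above concrete, here is a minimal sketch that computes the first page expected to be rejected for a given per_page. The constant observedItemCap and the function firstRejectedPage are names made up for this illustration; the 40,000 figure is the boundary observed above, not a documented limit.

```go
package main

import "fmt"

// observedItemCap is the ~40,000-item boundary observed on github.com for
// unfiltered /actions/runs pagination. It is an observation from probing,
// not a documented limit.
const observedItemCap = 40000

// firstRejectedPage returns the lowest page number for which
// page*perPage exceeds itemCap, i.e. the first page expected to return 422.
func firstRejectedPage(itemCap, perPage int) int {
	return itemCap/perPage + 1
}

func main() {
	for _, perPage := range []int{30, 100} {
		fmt.Printf("per_page=%d: first rejected page = %d\n",
			perPage, firstRejectedPage(observedItemCap, perPage))
	}
}
```

For per_page=100 this gives 401, matching the gh api probe; for per_page=30 it gives 1334, so the failing page=1340 in the log is already past the boundary.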

On the DevLake side, the root cause is in backend/plugins/github/tasks/cicd_run_collector.go L77-84:

Query: func(reqData *helper.RequestData, createdAfter *time.Time) (url.Values, errors.Error) {
    query := url.Values{}
    query.Set("page", fmt.Sprintf("%v", reqData.Pager.Page))
    query.Set("per_page", fmt.Sprintf("%v", reqData.Pager.Size))
    return query, nil
},

createdAfter is received but never forwarded to the server; time filtering happens purely on the client side. With Concurrency=10 and PageSize=30, concurrent workers advance page by page with no awareness of the cap until some of them cross the 40k boundary and receive 422. Once the 3 retries are exhausted for any single page, the whole subtask fails and no partial data is salvaged (Extract Workflow Runs and Convert Workflow Runs in the same task do not run either).

GraphQL is not a viable fallback: the Repository object in GitHub's GraphQL schema has no workflowRuns, workflows, or actions field, and WorkflowRun is only reachable as a singular field on CheckSuite (not a connection), so there is no way to list a repository's workflow runs via GraphQL. DevLake's github_graphql plugin already acknowledges this by importing the REST collector (plugins/github_graphql/impl/impl.go:97).

What do you expect to happen

Collect Workflow Runs should successfully collect the full set of workflow runs within the blueprint's timeAfter window regardless of total volume. Large repositories on github.com with more than 40,000 runs in-window should not cause the pipeline to abort.

How to reproduce

Configure a GitHub connection pointing at a github.com repository that has more than 40,000 workflow runs, set the blueprint's timeAfter to a date far enough back that the range contains >40,000 runs, and trigger "Collect Data". The pipeline fails on Collect Workflow Runs once the collector reaches page × per_page > 40,000 (with default PageSize=30 this happens near page 1,334). Any sufficiently busy CI repository reaches the boundary eventually; no specific GitHub feature beyond volume is required.

Anything else

Existing workarounds are insufficient in isolation. Narrowing timeAfter works once but the same initial bootstrap fails again later as the repository accumulates runs, and historical data is forfeited. skipOnFail=true lets unrelated plugins (Jira, DORA, etc.) keep running, but _tool_github_runs still never gets populated for the affected repo. Raising per_page to 100 reduces the number of requests but does not raise the 40,000-item ceiling.

The only way to read past the 40k boundary is to use a filter parameter. Adding a created filter switches the endpoint into the filtered mode (up to 1,000 results per search). Fixed-size windows (e.g. monthly) are not enough because a single month on a busy repo can exceed 1,000 runs, so the fix needs to bisect windows adaptively. Roughly:

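// Pseudocode
func collectRunsAdaptive(from, to time.Time) {
    // GET .../actions/runs?created=<from>..<to>&per_page=100
    items, reachedCap := fetchWindow(from, to)
    // Bisect only while the window can still shrink. minWindow (e.g. one second)
    // is a hypothetical guard against unbounded recursion.
    if reachedCap && to.Sub(from) > minWindow {
        mid := from.Add(to.Sub(from) / 2)
        collectRunsAdaptive(from, mid)
        collectRunsAdaptive(mid, to) // mid belongs to both halves: deduplicate by run ID
    } else {
        persist(items)
    }
}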

The query syntax is created:YYYY-MM-DD..YYYY-MM-DD (ISO 8601; supports >=, <=, and .., with an optional THH:MM:SSZ suffix for sub-day granularity; see search syntax). createdAfter is already passed to the Query hook, so no interface change is needed; for incremental runs it simply becomes the lower bound of the outer window.

The problem reproduces deterministically on every bootstrap of a large github.com repository. For reference, none of #8028, #8614, #3642, #3688, or #3199 address the github.com 40k item cap on unfiltered /actions/runs pagination, although they touch adjacent areas (large-repo GraphQL timeouts, PageSize tunables, time/workflow filters, payload size).

Version

v1.0.3-beta10

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

  • component/plugins: This issue or PR relates to plugins
  • priority/high: This issue is very important
  • severity/p0: This bug blocks key user journey and function
  • type/bug: This issue is a bug
