Search before asking
What happened
When collecting workflow runs from a large GitHub.com (Enterprise Cloud) repository, the Collect Workflow Runs subtask fails with HTTP 422 once the collector crosses an undocumented pagination boundary, and the whole pipeline aborts. A representative error log:
subtask Collect Workflow Runs ended unexpectedly
Wraps: (2) Error waiting for async Collector execution
| combined messages:
| {
| Retry exceeded 3 times calling repos/{owner}/{repo}/actions/runs.
| The last error was: Http DoAsync error calling
| [method:GET path:repos/{owner}/{repo}/actions/runs
| query:map[page:[1340] per_page:[30]]].
| Response: {"message":"In order to keep the API fast for everyone,
| pagination is limited for this resource.",
| "documentation_url":"https://docs.github.com/v3/#pagination",
| "status":"422"} (422)
| }
page=1340 × per_page=30 = 40,200 items, which crosses an approximate hard cap of per_page × page ≤ 40,000 items that github.com enforces on unfiltered /actions/runs pagination.
The cap is easy to probe directly with gh api: per_page=100&page=400 returns HTTP 200 with total_count: 40000 and Link rel="last" page=400, while per_page=100&page=401 (and anything beyond) returns HTTP 422 with the same "pagination is limited for this resource" message. total_count is clamped at 40,000 even though the repository has significantly more runs. Neither this 40k boundary nor the 422 response is described in the official docs for this endpoint, which only mention a separate cap for filtered queries ("up to 1,000 results for each search when using actor, branch, check_suite_id, created, event, head_sha, or status").
On the DevLake side, the root cause is in backend/plugins/github/tasks/cicd_run_collector.go L77-84:
Query: func(reqData *helper.RequestData, createdAfter *time.Time) (url.Values, errors.Error) {
    query := url.Values{}
    query.Set("page", fmt.Sprintf("%v", reqData.Pager.Page))
    query.Set("per_page", fmt.Sprintf("%v", reqData.Pager.Size))
    return query, nil
},
createdAfter is received but never forwarded to the server — time filtering happens purely on the client side. With Concurrency=10 and PageSize=30, concurrent workers blindly advance until some of them cross the 40k boundary and hit 422. Once 3 retries are exhausted for any one page the whole subtask fails, and no partial data is salvaged (Extract Workflow Runs / Convert Workflow Runs in the same task do not run either).
GraphQL is not a viable fallback: the Repository object in GitHub's GraphQL schema has no workflowRuns, workflows, or actions field, and WorkflowRun is only reachable as a singular field on CheckSuite (not a connection), so there is no way to list a repository's workflow runs via GraphQL. DevLake's github_graphql plugin already acknowledges this by importing the REST collector (plugins/github_graphql/impl/impl.go:97).
What do you expect to happen
Collect Workflow Runs should successfully collect the full set of workflow runs within the blueprint's timeAfter window regardless of total volume. Large repositories on github.com with more than 40,000 runs in-window should not cause the pipeline to abort.
How to reproduce
Configure a GitHub connection pointing at a github.com repository that has more than 40,000 workflow runs, set the blueprint's timeAfter to a date far enough back that the range contains >40,000 runs, and trigger "Collect Data". The pipeline fails on Collect Workflow Runs once the collector reaches page × per_page > 40,000 (with default PageSize=30 this happens near page 1,334). Any sufficiently busy CI repository reaches the boundary eventually; no specific GitHub feature beyond volume is required.
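The boundary arithmetic above can be sketched directly, assuming the observed 40,000-item cap (this constant is an empirical observation from the probing described earlier, not a documented limit):

```go
package main

import "fmt"

// lastReadablePage computes the last page that stays within the observed
// page × per_page ≤ 40,000 cap on unfiltered /actions/runs pagination;
// requesting the next page returns HTTP 422.
func lastReadablePage(perPage int) int {
	const observedCap = 40000 // empirical, undocumented
	return observedCap / perPage
}

func main() {
	fmt.Println(lastReadablePage(30))  // default PageSize: page 1334 crosses the cap
	fmt.Println(lastReadablePage(100)) // matches the gh api probe (page 401 → 422)
}
```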
Anything else
Existing workarounds are insufficient in isolation. Narrowing timeAfter works once but the same initial bootstrap fails again later as the repository accumulates runs, and historical data is forfeited. skipOnFail=true lets unrelated plugins (Jira, DORA, etc.) keep running, but _tool_github_runs still never gets populated for the affected repo. Raising per_page to 100 reduces the number of requests but does not raise the 40,000-item ceiling.
The only way to read past the 40k boundary is to use a filter parameter. Adding a created filter switches the endpoint into the filtered mode (up to 1,000 results per search). Fixed-size windows (e.g. monthly) are not enough because a single month on a busy repo can exceed 1,000 runs, so the fix needs to bisect windows adaptively. Roughly:
// Pseudocode
func collectRunsAdaptive(from, to time.Time) {
    items, reachedCap := fetchWindow(from, to) // GET .../actions/runs?created=<from>..<to>&per_page=100
    if reachedCap {
        mid := from.Add(to.Sub(from) / 2)
        collectRunsAdaptive(from, mid)
        collectRunsAdaptive(mid, to)
    } else {
        persist(items)
    }
}
The query syntax is created:YYYY-MM-DD..YYYY-MM-DD (ISO 8601, supports >=, <=, .., with optional THH:MM:SSZ for sub-day granularity — see search syntax). createdAfter is already passed to the Query hook so no interface change is needed, and for incremental runs it simply becomes the lower bound of the outer window.
The problem reproduces deterministically on every bootstrap of a large github.com repository. For reference, none of #8028, #8614, #3642, #3688, or #3199 address the github.com 40k item cap on unfiltered /actions/runs pagination, although they touch adjacent areas (large-repo GraphQL timeouts, PageSize tunables, time/workflow filters, payload size).
Version
v1.0.3-beta10
Are you willing to submit PR?
Code of Conduct