fix(test): macOS local image, plugin tests, and test suite restructuring#9610

Merged
mlwelles merged 20 commits into main from fix/macos-local-image-and-plugin-tests
Feb 25, 2026

Conversation

@mlwelles
Contributor

@mlwelles mlwelles commented Feb 25, 2026

Summary

This PR fixes macOS compatibility for local image builds and plugin tests, restructures the test suite system for better ergonomics and resource management, and fixes a flaky integration2 test.

macOS / Local Build Fixes

  • Enable CGO cross-compilation for plugin support on macOS (Darwin → Linux)
  • Fix integration test routing, gotestsum PATH resolution, and cross-compilation issues
  • Add cluster pause/resume and resilient restore waits for stability
  • Resolve shellcheck SC2312 in check-cross-compiler.sh

Test Suite Restructuring

  • New integration suite (new default): runs everything except ldbc, load, and systest-heavy — replaces old unit,systest,core default
  • New unit suite behavior: true unit tests only — no Docker cluster, no --tags=integration, skips custom-cluster packages
  • New systest-baseline / systest-heavy sub-suites: separates lightweight systests from resource-intensive ones (minio, encryption, tracing, online-restore) that can OOM on Docker
  • systest still runs both sub-suites for backward compatibility
  • New Make targets: test-integration, test-integration-heavy
  • Renamed: test-full → test-all
  • Removed: test-suite, test-ldbc, test-load (use make test SUITE=... instead)

Import Client Retry Fix

  • Fixed flaky TestImportApis/SingleGroupShutOneAlpha in the dgraph-integration2-tests CI job
  • Root cause: initiateSnapshotStream made a single attempt to start the snapshot stream, but the Dgraph server can return "overloaded with pending proposals" during Raft membership changes (e.g., when an alpha is shut down and the cluster is rebalancing)
  • Fix: Added exponential backoff retry (1s → 10s cap, 60s max total) for transient "overloaded" errors
  • Only retries on "overloaded" — connectivity errors ("connection refused", "unable to connect to the leader") still fail immediately so negative test cases don't block unnecessarily

Quick Reference

| Command | What it runs |
| --- | --- |
| make test | integration suite + integration2 (~30 min) |
| make test-unit | True unit tests (no Docker) |
| make test-integration | Integration tests via t/ runner with Docker |
| make test-integration-heavy | systest-heavy + ldbc + load |
| make test-all | Every test: all suites + integration2 + upgrade + fuzz |

Test plan

  • make test-unit runs without starting Docker cluster and discovers only non-integration tests
  • make test (integration suite) correctly excludes heavy/ldbc/load packages
  • make test-integration-heavy (systest-heavy) correctly runs only heavy packages (11 packages)
  • make test SUITE=systest runs both systest-baseline and systest-heavy (30 packages)
  • make test SUITE=systest-baseline runs only lightweight systests (18 packages)
  • TestImportApis/SingleGroupShutOneAlpha passes reliably with retry logic (CI)
  • Existing CI workflows continue to pass

Shiva and others added 14 commits February 19, 2026 18:19
When running `make test` (--suite=all), the load and ldbc download
blocks in t/t.go shared the same *tmp directory. The ldbc block's
MakeDirEmpty call wiped files downloaded by the load block, causing
systest/1million to fail with missing schema files.

Hoist directory initialization above both download blocks so
MakeDirEmpty runs exactly once. Both datasets coexist in the same
directory since their filenames don't overlap. Also use a dedicated
subdirectory (dgraph-test-data) instead of bare os.TempDir() to
avoid wiping the system temp directory.

Add testSuiteContainsAny() helper to replace repeated
testSuiteContains("x") || testSuiteContains("y") patterns.
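
A helper of this kind might look roughly like the sketch below. The real `testSuiteContains` signature in t/t.go is not shown in this PR page, so `suiteContains` and `suiteContainsAny` here are hypothetical stand-ins that take the selected suites as an explicit slice.

```go
package main

import "fmt"

// suiteContains reports whether name is among the selected suites.
// Hypothetical stand-in for the runner's testSuiteContains.
func suiteContains(selected []string, name string) bool {
	for _, s := range selected {
		if s == name {
			return true
		}
	}
	return false
}

// suiteContainsAny reports whether at least one of names is selected,
// replacing chains of suiteContains(a) || suiteContains(b).
func suiteContainsAny(selected []string, names ...string) bool {
	for _, n := range names {
		if suiteContains(selected, n) {
			return true
		}
	}
	return false
}

func main() {
	selected := []string{"unit", "load"}
	fmt.Println(suiteContainsAny(selected, "ldbc", "load"))    // true
	fmt.Println(suiteContainsAny(selected, "ldbc", "systest")) // false
}
```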
30 Docker Compose files hardcoded $GOPATH/bin as the binary mount
source. On macOS, this mounts the native macOS binary into Linux
containers, causing them to fail on startup.

Replace all 78 occurrences with ${LINUX_GOBIN:-$GOPATH/bin} to match
the pattern already used in dgraph/docker-compose.yml. On Linux,
LINUX_GOBIN defaults to $GOPATH/bin (no change). On macOS, it points
to the cross-compiled Linux binary directory.
Add a configurable per-package test timeout flag to the t/ runner.
Previously the timeout was hardcoded to 30m (or 180m with --race),
which caused the 21million/live test to time out on slower machines.

Usage:
  make test TIMEOUT=90m
  cd t && ./t --suite=all --timeout=60m

Defaults remain unchanged: 30m normal, 180m with --race. An explicit
--timeout overrides both.
Remove empty if-branch flagged by staticcheck SA9003 in t/t.go and
fix markdown table alignment in TESTING.md for prettier compliance.
The default `make test` (no args) previously ran --suite=all (~60+ min).
Now it runs unit,systest,core suites plus integration2 tests (~30 min)
for a faster local feedback loop.

Changes:
- Split the else branch in `test` target: SUITE set → explicit suite;
  nothing set → default (unit,systest,core + integration2)
- Add $(origin) guards on all test-* targets to prevent confusing
  variable conflicts (e.g. `make test-unit SUITE=ldbc` now errors)
- Add `test-suites` target (runs all t/ runner suites via SUITE=all)
- Add `test-everything` target (all suites + integration + integration2
  + upgrade + fuzz)
- Update TESTING.md, CONTRIBUTING.md, AGENTS.md to reflect new defaults
- Update `make help` output with new default description
Resolve conflicts in Makefile and CONTRIBUTING.md, keeping the PR's
dual-command default (unit,systest,core + integration2) and the
SUITE=all example line.
Shorter, clearer name for the target that runs every test in the repo
(all suites + integration + integration2 + upgrade + fuzz).
test-suite now accepts an optional SUITE= argument (defaults to all),
making it a flexible entry point for running any t/ runner suite.
AGENTS.md is a local Claude Code config file that should not be
tracked in version control.
…s-compilation

- Remove broken test-integration target: TAGS=integration bypassed the t/
  runner, skipping Docker Compose orchestration and plugin compilation that
  integration tests require. Use SUITE= to route through the t/ runner instead.
- Fix gotestsum PATH resolution in t/t.go: add gotestsumBin() that resolves
  to $GOPATH/bin/gotestsum instead of relying on PATH lookup, which fails on
  machines where $GOPATH/bin is not in PATH.
- Add cross-compilation support for Go plugins on macOS: detect non-Linux
  hosts and set CGO_ENABLED=1, CC to the appropriate cross-compiler, and
  use the BFD linker (both testutil/plugin.go and dgraphtest/local_cluster.go).
- Add check-cross-compiler.sh dependency check script.
- Add top-level make deps and make setup targets for dependency management.
- Update TESTING.md and CONTRIBUTING.md to document new targets and recommend
  make setup for first-time onboarding.
- Remove test-integration from test-full (redundant: SUITE=all covers it).
The local-image and install targets cross-compiled the dgraph binary
without CGO, producing a statically linked binary that cannot dlopen()
Go plugin .so files. This caused all plugin tests to fail on macOS
with "Invalid tokenizer anagram".

Changes:
- Add LINUX_CC variable for architecture-aware cross-compiler selection
- Enable CGO_ENABLED=1 with cross-compiler in install and local-image
- Use BFD linker (-fuse-ld=bfd) since gold is not in cross-toolchains
- Skip jemalloc (BUILD_TAGS=) for cross-compilation (headers unavailable)
- Add EXTLDFLAGS support to dgraph/Makefile for external linker flags
- Split t/Makefile check into deps (tools) + check (tools + binary)
- Top-level setup now calls t/deps so it no longer requires the binary
@mlwelles mlwelles requested a review from a team as a code owner February 25, 2026 01:11
@github-actions github-actions Bot added the area/testing, area/documentation, and go labels Feb 25, 2026
…ent restore waits

- Add deploy memory limits to all docker-compose services (zeros: 512M,
  alphas: 2-4GB, minio: 512M) to prevent OOM kills on macOS Docker Desktop
- Add --cache "size-mb=1024" to alpha commands for explicit cache sizing
- Implement pause/resume of the default cluster during custom-cluster tests
  so the full Docker memory budget is available for custom clusters
- Make WaitForRestore resilient to transient errors (connection reset,
  unavailable, transport errors) with a 10-minute deadline instead of
  infinite loop
- Simplify dgraph-installed Make target to always rebuild
- Ensure $GOPATH/bin is in PATH for subprocess tool discovery
…compose

Add deploy memory limits (zeros: 512M, alphas: 2048M, minio: 512M) and
--cache "size-mb=1024" to all alpha commands in the main dgraph test
cluster docker-compose file, matching the changes in the online-restore
compose file. Prevents OOM kills on memory-constrained Docker Desktop VMs.
Capture get_os() output in a variable before using it in conditionals,
preventing the return value from being masked by [[ ]]. Reuse the
variable in the later case statement.
@xqqp
Contributor

xqqp commented Feb 25, 2026

> Add Docker memory limits to all docker-compose services to prevent OOM kills on macOS Docker Desktop (zeros: 512M, alphas: 2-4GB, minio: 512M)

This just shifts OOM kills from the OS to docker, not sure if this really accomplishes much. Setting a memory limit on docker containers does not make applications use less memory.

…rue unit mode

Restructure test suites to separate lightweight from resource-intensive tests:

- Add `integration` suite as the new default (replaces old `unit,systest,core`),
  excluding ldbc, load, and systest-heavy packages
- Add `systest-baseline` and `systest-heavy` sub-suites; `systest` runs both
- Add `heavyPackages` list for resource-intensive tests (minio, encryption,
  tracing, online-restore) that can OOM on macOS Docker Desktop
- Make `unit` suite truly unit-only: no Docker cluster, no `--tags=integration`,
  skips custom-cluster packages entirely
- Add `make test-integration` and `make test-integration-heavy` targets
- Rename `make test-full` to `make test-all`; remove `test-suite`, `test-ldbc`,
  `test-load` targets (use `make test SUITE=...` instead)
- Update CONTRIBUTING.md, TESTING.md, and Makefile help text to match
@mlwelles mlwelles changed the title from "fix(test): make test defaults to fast suite + macOS compat" to "fix(test): macOS local image, plugin tests, and test suite restructuring" Feb 25, 2026
The TestImportApis/SingleGroupShutOneAlpha integration2 test was
flaking because initiateSnapshotStream would fail immediately when
the Dgraph server returned "overloaded with pending proposals" during
Raft membership changes. Add exponential backoff retry (1s→10s cap,
60s max) for transient "overloaded" errors while preserving fast
failure for non-retryable errors like connectivity loss.
Remove deploy memory limits and --cache "size-mb=1024" flags from
dgraph/docker-compose.yml and systest/online-restore/docker-compose.yml.
These just shift OOM kills from the OS to Docker without actually
solving the underlying resource problem, and they add noise to the
compose files.
@mlwelles
Contributor Author

@xqqp: Good point. While they did seem to improve stability, it was a purely subjective seeming, and you're right -- they really just shift where the OOM kill comes from, and the amount of noise they add to the compose files is icky. Removed.

@mlwelles mlwelles merged commit 793fe9c into main Feb 25, 2026
26 checks passed
@mlwelles mlwelles deleted the fix/macos-local-image-and-plugin-tests branch February 25, 2026 20:13

Labels

area/documentation (Documentation related issues), area/testing (Testing related issues), go (Pull requests that update Go code)

3 participants