GOMAXPROCS=128: Dgraph v25.0.0 sets GOMAXPROCS from the node CPU count (128) rather than from the container's CPU limit (6), which increases scheduling overhead and memory fragmentation.
Alphas OOMKill in rotation: alpha-2 was killed on 2026-03-06 at 19:10 UTC, alpha-0 on 2026-03-02, and alpha-1's GC goal currently sits at 21.00GB (above the 20Gi limit), making it the next to be killed. This has been happening repeatedly.
To Reproduce
Deploy Dgraph v25.0.0 alpha StatefulSet with --cache size-mb=4096, memory limit of 20Gi, and no GOMEMLIMIT/GOGC env vars.
Allow normal production query and mutation workload to run over days.
Observe go_memstats_heap_alloc_bytes growing steadily because the posting cache exceeds its configured budget.
With GOGC=100, the GC goal (go_memstats_next_gc_bytes) reaches 2× live heap (~18-22GB), exceeding the 20Gi container limit.
Kubernetes OOMKills the alpha (exit code 137, reason: OOMKilled). The kills rotate across alphas because load shifts to the remaining alphas after each restart.
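The arithmetic behind the GC goal step can be sketched as follows. This is a simplified model, not the exact pacer in the Go runtime: it assumes the goal is the live heap times (1 + GOGC/100) and ignores GC roots such as stacks and globals. The function name `gcGoalBytes` and the sample live-heap values are illustrative, and GiB is used throughout for simplicity.

```go
package main

import "fmt"

// GiB is one gibibyte in bytes.
const GiB = 1 << 30

// gcGoalBytes estimates the heap size at which the next GC cycle triggers,
// using the simplified model: goal = live * (1 + GOGC/100).
// The real runtime also accounts for GC roots, so treat this as an approximation.
func gcGoalBytes(liveHeap uint64, gogc int) uint64 {
	return liveHeap + liveHeap*uint64(gogc)/100
}

func main() {
	limit := uint64(20 * GiB) // container memory limit from the report

	// Illustrative live-heap sizes; only the 2x relationship comes from the report.
	for _, liveGiB := range []uint64{4, 9, 11} {
		goal := gcGoalBytes(liveGiB*GiB, 100)
		fmt.Printf("live=%dGiB goal=%dGiB exceedsLimit=%t\n",
			liveGiB, goal/GiB, goal > limit)
	}
}
```

Under this model, any live heap above roughly half the container limit puts the GC goal past the limit, so the kernel can kill the process before the collector runs.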
Expected behavior
The Go garbage collector should trigger frequently enough to keep heap usage well within the 20Gi container memory limit. The posting list cache should respect the configured --cache size-mb=4096 (4GB) budget and not grow unboundedly.
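One way a Go process can be made to collect before reaching a container limit is a soft memory limit, set either through the GOMEMLIMIT environment variable or through `runtime/debug.SetMemoryLimit` (available since Go 1.19). The sketch below is not Dgraph code. The helper `softLimitBytes` and the 90% headroom figure are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// softLimitBytes returns a soft memory limit as a percentage of the
// container limit, leaving headroom for memory the Go heap does not cover.
func softLimitBytes(containerLimit, percent int64) int64 {
	return containerLimit * percent / 100
}

func main() {
	const containerLimit = 20 << 30 // 20Gi, as in the report

	// 90% is an assumed headroom value, not a recommendation from the report.
	limit := softLimitBytes(containerLimit, 90)

	// SetMemoryLimit returns the previously configured limit.
	prev := debug.SetMemoryLimit(limit)
	fmt.Printf("soft limit set to %d bytes (previous: %d)\n", limit, prev)
}
```

A soft limit only makes the collector run more aggressively as the heap approaches it. If the live heap itself grows past the limit, as an over-budget cache would cause, the process can still be OOMKilled, so this would mitigate the symptom rather than fix the cache growth.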
Screenshots
Prometheus metrics captured on 2026-03-06 ~20:55 UTC:
Describe the bug
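Dgraph Alpha nodes running v25.0.0 experience unbounded heap growth leading to repeated OOMKill events in Kubernetes. The Go garbage collector's target heap size (next_gc) grows beyond the container's 20Gi memory limit because the posting cache exceeds its configured budget and, with GOGC=100 and no GOMEMLIMIT, the GC goal is roughly twice the live heap.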
Environment
• OS: Linux (GKE nodes: Container-Optimized OS, c4d-standard-8 — 8 vCPU, 31GB RAM)
• Orchestration: Kubernetes (GKE cluster)
• Language: Go (toolchain v1.24, bundled with Dgraph v25.0.0)
• Dgraph Version: v25.0.0
• Go runtime config: GOGC=100 (default), GOMEMLIMIT=not set, GOMAXPROCS=128 (hardcoded by Dgraph from node CPUs)
• Container resources: requests cpu=4 / memory=16Gi, limits cpu=6 / memory=20Gi
• Dgraph flags: --cache size-mb=4096, --raft snapshot-after-entries=100000, --limit mutations=strict
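For comparison with the GOMAXPROCS value listed above, here is a minimal sketch of deriving a processor count from a container CPU limit instead of the node CPU count. It assumes cgroup v2, where the limit is exposed in a `cpu.max` file as `<quota> <period>` or `max <period>`. The function `procsFromCPUMax` is hypothetical and is not part of Dgraph. The sample string corresponds to the `cpu=6` limit in this report.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// procsFromCPUMax parses the content of a cgroup v2 cpu.max file and returns
// the CPU limit rounded up to a whole number, capped at fallback.
// It returns fallback when no quota is set or the content cannot be parsed.
func procsFromCPUMax(content string, fallback int) int {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return fallback
	}
	quota, errQ := strconv.ParseFloat(fields[0], 64)
	period, errP := strconv.ParseFloat(fields[1], 64)
	if errQ != nil || errP != nil || quota <= 0 || period <= 0 {
		return fallback
	}
	n := int(math.Ceil(quota / period))
	if n > fallback {
		return fallback
	}
	return n
}

func main() {
	// "600000 100000" is what a 6-CPU limit looks like with a 100ms period.
	// 128 stands in for the node CPU count described in the report.
	fmt.Println(procsFromCPUMax("600000 100000", 128))
}
```

In a real process the input would be read from the cgroup filesystem and the result passed to `runtime.GOMAXPROCS`. Both steps are left out so the sketch stays runnable outside a container.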
Additional context
• Posting cache hit ratios: posting list 70.6%, block cache 93.5%. The cache is actively used, but its unbounded growth defeats the purpose of the configured size limit.