The Pangolin Kubernetes Controller synchronizes dynamic Traefik Custom Resource Definitions (CRDs) — e.g., IngressRoute, Middleware, TraefikService — from a Pangolin Traefik Config API (commonly used as an HTTPFileProvider for Traefik) into a Kubernetes namespace.
- Quick Start: see the "Quick Start" section below.
- Key Benefits:
- Safe, minimal writes via ETag/body-hash change detection
- Optional high-availability leader election
- Robust observability: structured logs, metrics, traces
- Read-only (dry-run) mode for validation/audit
- Typical Use Cases: Automated Traefik routing, safe configuration validation, CI/CD deployment pipelines
- Running Kubernetes cluster with Traefik v3 installed as the Ingress Controller
- Pangolin Config API endpoint accessible
- Traefik CRD support enabled
- Appropriate RBAC permissions for Traefik CRDs (see examples below)
- (Optional): mTLS certificates and keys (via Secrets) for secure API access
- Ensure you have a running Kubernetes cluster with Traefik v3 as IngressController and CRD support enabled.

- Install the controller using Helm (recommended) or your own deployment manifests:

  ```shell
  # Recommended: install the published Helm chart
  helm repo add fossorial https://charts.fossorial.io
  helm repo update
  helm install pangolin fossorial/pangolin

  # Or apply your own Kubernetes manifests (not included in this repository)
  ```
- Configure the controller via environment variables. The examples below show three common deployment approaches:

- Local development (shell): set env vars in your shell session for quick testing. These do not persist across shells.

  ```shell
  # local shell (development only)
  export CONFIG_ENDPOINT=https://your-pangolin:3001/api/v1/traefik-config
  export TARGET_NAMESPACE=pangolin
  # then run the binary locally
  ./pangolin-kube-controller
  ```
- Helm install (recommended for clusters): set chart values via `--set` or `values.yaml`. Example using `--set`:

  ```shell
  helm install pangolin fossorial/pangolin \
    --set env.CONFIG_ENDPOINT="https://your-pangolin:3001/api/v1/traefik-config" \
    --set env.TARGET_NAMESPACE="pangolin"
  ```

  Or add the values to your `values.yaml`:

  ```yaml
  env:
    CONFIG_ENDPOINT: https://your-pangolin:3001/api/v1/traefik-config
    TARGET_NAMESPACE: pangolin
  ```
- Kubernetes manifest (Deployment): add the env values under the `containers[].env` section in your Deployment manifest:

  ```yaml
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: pangolin-kube-controller
    namespace: pangolin
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: pangolin-kube-controller
    template:
      metadata:
        labels:
          app: pangolin-kube-controller
      spec:
        containers:
          - name: controller
            image: ghcr.io/fosrl/pangolin-kube-controller:0.1.0-alpha.1
            env:
              - name: CONFIG_ENDPOINT
                value: "https://your-pangolin:3001/api/v1/traefik-config"
              - name: TARGET_NAMESPACE
                value: "pangolin"
  ```
- Verify the controller is running:

  ```shell
  kubectl get pods -n pangolin
  kubectl logs -n pangolin deployment/pangolin-kube-controller
  ```
- Check the metrics endpoint:

  ```shell
  kubectl port-forward -n pangolin svc/pangolin-kube-controller 9090 &
  curl http://localhost:9090/metrics
  ```
For a production deployment, review and adjust the RBAC scope, resource limits, and environment variables in your deployment manifests (not included in this repository).
- Change detection: Uses ETag (`If-None-Match` / `304 Not Modified`) when available, falling back to `SHA256(body)` when not. Weak ETags (`W/`) are treated as equivalent when the body content is unchanged.
- Garbage collection: Deletes only resources it manages (label: `app.kubernetes.io/managed-by=pangolin-kube-controller`).
- Leader election (optional) for safe high availability across replicas.
- Exponential backoff with jitter for safe retries.
- Read-only (dry-run) mode for production-safe testing and CI validation.
- Prometheus & OpenTelemetry metrics, traces, and optional profiling (pprof).
- Minimized writes: Uses ETag/body-hash fallback to avoid unnecessary updates.
- Safe & resilient: Robust reconciliation with exponential backoff and optional leader election.
- Observability: Structured diff logging, metrics, traces.
- Extensible: Easy support for additional Traefik CRDs.
- Zero required changes to Pangolin source code.
- Production-ready with full high-availability (HA) support.
Out of the box the controller targets common Traefik CRDs. Example list (expandable via configuration/code):
- IngressRoute
- Middleware
- TraefikService
- Fetch loop: Poll JSON config from `CONFIG_ENDPOINT` via HTTP.
- Change detection: Compare ETag or fall back to SHA256(body).
- Parse raw JSON to a simplified `TraefikConfig`.
- Reconcile: For each resource kind:
  - Only objects labeled `app.kubernetes.io/managed-by=pangolin-kube-controller`.
  - Apply with Server-Side Apply (SSA) and a stable `fieldManager: "pangolin-kube-controller"`.
  - Garbage Collect (GC): Remove labeled resources not in the desired set.
- Metrics & logging: Record durations, changes, errors, GC events.
- Error handling: Exponential backoff + jitter to avoid hot-looping.
- (Optional) leader election: Prevents concurrent writes in multi-replica setups.
Note: All durations use Go’s `time.Duration` syntax (e.g., `30s`). Leader election duration values are interrelated; see the client-go leaderelection docs.
Core behavior
- CONFIG_ENDPOINT (string, required): Pangolin config API URL
- READ_ONLY (bool, default=false): Dry-run mode (no mutations)
- POLL_INTERVAL (duration, default=15s): Base polling/backoff interval
- MAX_BACKOFF (duration, default=2m): Maximum backoff wait
- TARGET_NAMESPACE (string, default=pangolin): Namespace to manage Traefik CRDs
- ON_LOSE (string, default=exit): Behavior on leadership loss: exit | pause
HTTP/TLS fetch
- FETCH_TIMEOUT (duration, default=30s)
- CONFIG_AUTH_HEADER (string): Authorization header (e.g., "Bearer ...")
- CONFIG_CA_FILE (path)
- CONFIG_CLIENT_CERT_FILE (path)
- CONFIG_CLIENT_KEY_FILE (path)
- CONFIG_TLS_SKIP_VERIFY (bool, default=false) — do not use in production
HTTP transport tuning
- HTTP_MAX_IDLE_CONNS (int, default=100)
- HTTP_MAX_IDLE_CONNS_PER_HOST (int, default=100)
- HTTP_IDLE_CONN_TIMEOUT (duration, default=90s)
client-go tuning (Kubernetes)
- CLIENT_QPS (float, default=0=disabled)
- CLIENT_BURST (int, default=0=disabled)
Traefik specifics
- INGRESS_CLASS (string, default=traefik)
- TRAEFIK_INSTANCE_LABEL_KEY / TRAEFIK_INSTANCE_LABEL_VALUE (optional): Explicit instance label pair applied to all managed resources
- TRAEFIK_INSTANCE_LABEL (optional): Combined form "key=value" used if KEY/VALUE are not set
- INGRESS_CLASS_LABEL_VERIFY_INTERVAL (duration, default=3h): Periodic verification of the selected IngressClass having the instance label
- INGRESS_CLASS_LABEL_STRICT (bool, default=false): If true, a verification mismatch is fatal (CrashLoop)
- CONFIG_FILE (string, optional): Path to YAML/JSON with the same fields; precedence is ENV > file > defaults
- TRAEFIK_LB_URL (string): Full URL used to fill empty TraefikService specs
- TRAEFIK_LB_IP (string), TRAEFIK_LB_SCHEME (default=http), TRAEFIK_LB_PORT (string): Alternative for building the URL
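The precedence between `TRAEFIK_LB_URL` and the scheme/IP/port triple can be sketched as below; `buildLBURL` is a hypothetical helper illustrating the documented behavior, not code from the controller:

```go
package main

import "fmt"

// buildLBURL mirrors the documented precedence: an explicit TRAEFIK_LB_URL
// wins; otherwise the URL is composed from scheme (default http), IP, and port.
func buildLBURL(fullURL, ip, scheme, port string) string {
	if fullURL != "" {
		return fullURL
	}
	if ip == "" {
		return "" // nothing to build from
	}
	if scheme == "" {
		scheme = "http" // TRAEFIK_LB_SCHEME default
	}
	if port != "" {
		return fmt.Sprintf("%s://%s:%s", scheme, ip, port)
	}
	return fmt.Sprintf("%s://%s", scheme, ip)
}

func main() {
	fmt.Println(buildLBURL("", "10.0.0.5", "", "8080")) // http://10.0.0.5:8080
}
```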
Logging
- CONFIG_LOG_PREVIEW (bool, default=false): When true, logs a redacted preview of the fetched Traefik configuration. Intended strictly for debugging. The preview passes through a redaction pipeline that replaces values of any JSON keys whose names contain (case-insensitive) "auth", "pass", "secret", "token", or "key" with "redacted".
- LOG_TRAEFIK_CONFIG (bool, default=false): Backward-compatible alias for CONFIG_LOG_PREVIEW. Also debug-only and goes through the same redaction pipeline.
- MAX_CONFIG_LOG_BYTES (int, default=0=no cap): Maximum number of bytes to include in the preview; when the preview exceeds the cap it is truncated and "..." is appended.
- FETCH_LOG_INTERVAL (duration, default=5m, max=24h): Emit INFO-level polling status on this cadence. Set to `0` to suppress periodic fetch logs and only log on startup or when changes occur.
At INFO level the controller reports configuration polls at the configured interval and always emits change detections. DEBUG level retains per-cycle fetch chatter (including "no change" messages) to aid troubleshooting without spamming production logs.
Reconcilers & GC
- RECONCILE_PARALLEL (bool, default=false)
- RECONCILE_MAX (int, default=3)
- GC_GRACE_PERIOD (duration, default=0)
- GC_WORKERS (int, default=2)
Leader election
- ENABLE_LEADER_ELECTION (bool, default=false)
- LEASE_LOCK_NAME (string, default=pangolin-kube-controller-leader)
- LEASE_LOCK_NAMESPACE (string, default=TARGET_NAMESPACE)
- LEASE_DURATION (duration, default=30s)
- RENEW_DEADLINE (duration, default=20s)
- RETRY_PERIOD (duration, default=5s)
Metrics & debug
- METRICS_ADDR (string, default=:9090)
- DISABLE_LIVEZ (bool, default=false)
- ENABLE_PPROF (bool, default=false)
- DISABLE_PPROF (bool, default=false)
Always mount secrets as files via Kubernetes Secrets; never as environment variables in production.
Local:

```shell
# Example: standalone local run (HTTP-only mode)
STANDALONE_HTTP_ONLY=true \
METRICS_ADDR=:9090 \
LOG_TRAEFIK_CONFIG=false \
./pangolin-kube-controller

# Example: in-cluster style (not starting reconcile here)
CONFIG_ENDPOINT="https://config.example.com" \
CONFIG_AUTH_HEADER="Bearer abc" \
FETCH_TIMEOUT=15s \
RECONCILE_PARALLEL=true RECONCILE_MAX=3 \
LOG_TRAEFIK_CONFIG=false MAX_CONFIG_LOG_BYTES=2048 \
./pangolin-kube-controller
```

Kubernetes Deployment (env section):

```yaml
env:
  - name: CONFIG_ENDPOINT
    value: "https://config.example.com"
  - name: FETCH_TIMEOUT
    value: "30s"
  - name: READ_ONLY
    value: "true"
  - name: METRICS_ADDR
    value: ":9090"
  - name: CONFIG_CLIENT_CERT_FILE
    value: "/etc/pki/tls/client.crt"
  - name: CONFIG_CLIENT_KEY_FILE
    value: "/etc/pki/tls/client.key"
```

Security Tip: If using mTLS, mount cert/key files as Kubernetes Secrets, not env var blobs.
- `parseBody`: Unmarshal and validate the remote JSON config.
- `sleepWithBackoff` / `sleepWithContext`: Sleep using the context for cancellation/reactivity.
```go
for {
	select {
	case <-ctx.Done():
		return
	default:
	}

	var etag, body string
	var status int
	var err error
	if lastETag != "" {
		etag, status, body, err = fetchConditional(ctx, lastETag)
	} else {
		etag, status, body, err = fetchConditional(ctx, "")
	}
	if err != nil {
		handleError(err)
		sleepWithBackoff()
		continue
	}

	if status == http.StatusNotModified {
		sleepWithContext(ctx, pollInterval)
		continue
	}

	traefikCfg, err := parseBody(body)
	if err != nil {
		handleError(err)
		sleepWithBackoff()
		continue
	}

	if err := reconcileAll(ctx, traefikCfg); err != nil {
		handleError(err)
		sleepWithBackoff()
		continue
	}

	if etag != "" {
		lastETag = etag
	} else {
		lastETag = computeSHA256(body)
	}
	observeSuccessMetrics()
	sleepWithContext(ctx, pollInterval)
}
```

- HTTP client uses pooling, timeout, and TLS settings from env/config.
- If `CONFIG_AUTH_HEADER` is set, add it to requests as `Authorization`. Risk: avoid secrets in env vars if possible.
- If both `CONFIG_CLIENT_CERT_FILE` and `CONFIG_CLIENT_KEY_FILE` are present, enable mTLS using the mounted files (via Secret).
- If `CONFIG_CA_FILE` is present, load it as an additional root CA.
- On `200`, prefer the ETag, else SHA256 of the body as the signature.
- On `304 Not Modified`, skip parsing/reconcile.
- Weak ETags (`W/`): consider unchanged if the body matches.
- Avoid `CONFIG_TLS_SKIP_VERIFY` in production.
The controller keeps two signatures to detect changes robustly:

- `lastETag` — the last ETag header value received from the server when present (tracked only when the server returns an ETag header).
- `lastHash` — SHA256(body) hex of the last successfully-processed body.

Decision rules used by the controller when fetching a new response:

- Only send `If-None-Match` when `lastETag` was previously set from a header (do not send a conditional header when only a body hash exists).
- If the server returns `304 Not Modified`, skip parsing and reconcile.
- If the server returns a strong ETag and it equals `lastETag`, skip (no change).
- If the server returns a weak ETag (`W/...`), compute SHA256(body) and compare with `lastHash`; if equal, skip (no change). If different, treat as changed.
- If the server returns any ETag that changed (strong or weak) but the body SHA256 equals `lastHash`, treat as no-change and update `lastETag` to the new value without reconciling.
- If no ETag header is present, rely on SHA256(body) vs `lastHash` to detect changes.
Notes:

- The controller only updates `lastETag`/`lastHash` after a successful parse and reconcile (or a successful diff in read-only mode). This prevents advancing signatures on transient parse errors.
- This approach prevents unnecessary applies when the server changes ETag semantics but not the body.
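The decision rules above collapse into a small predicate. The following is an illustrative reimplementation, not the controller's source:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// changed reports whether a fetched response should trigger a reconcile.
// bodyHash is SHA256(body); lastETag/lastHash are the stored signatures.
func changed(status int, etag, bodyHash, lastETag, lastHash string) bool {
	if status == http.StatusNotModified {
		return false // 304: skip parsing and reconcile
	}
	if etag != "" && !strings.HasPrefix(etag, "W/") && etag == lastETag {
		return false // same strong ETag: no change
	}
	// Weak ETag, changed ETag, or no ETag at all: fall back to the body hash.
	return bodyHash != lastHash
}

func main() {
	// Weak ETag changed, but the body is identical: treated as no-change.
	fmt.Println(changed(200, `W/"abc"`, "h1", `W/"xyz"`, "h1")) // false
}
```

Note the caller still updates `lastETag` when the ETag churned but the body did not, so the next conditional request uses the fresh value.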
- Use exponential backoff with full jitter:
  - base = POLL_INTERVAL
  - MAX_BACKOFF = e.g., 5 * base
  - consecutiveErrors increments on each error
  - wait = max(200ms, rand(0, min(MAX_BACKOFF, base * 2^(consecutiveErrors-1))))
- Reset `consecutiveErrors` on a successful reconcile.
- Increment the metric `pangolin_kube_controller_reconcile_errors_total` on each error.
- Build an `unstructured.Unstructured` object with `apiVersion`, `kind`, metadata, and spec.
- Use SSA (`types.ApplyPatchType`) and set the stable `FieldManager: "pangolin-kube-controller"`.
- On NotFound: create. On Conflict: re-fetch and retry.
- Use `force` only to recover from severe field ownership conflicts (with audit/documentation).
Example apply function:

```go
func applyResource(ctx context.Context, dynamicClient dynamic.Interface, gvr schema.GroupVersionResource, ns, name string, unstructuredObj *unstructured.Unstructured, readOnly bool) error {
	if readOnly {
		log.Infof("[READ-ONLY] would apply %s/%s", unstructuredObj.GetKind(), name)
		return nil
	}
	patch := marshalForApply(unstructuredObj)
	_, err := dynamicClient.Resource(gvr).Namespace(ns).Patch(ctx, name, types.ApplyPatchType, patch, metav1.PatchOptions{
		FieldManager: "pangolin-kube-controller",
	})
	return err
}
```

For each resource kind:
- Build the desired set keyed by name (and an optional stable key).
- List existing objects in the namespace labeled `app.kubernetes.io/managed-by=pangolin-kube-controller`.
- For each existing object not in the desired set: delete (or log if `READ_ONLY=true`) and increment `pangolin_kube_controller_objects_deleted_total{kind=...}`.
- Perform the apply loop (create/update) before GC to avoid downtime on renames: apply -> GC.
- Process items deterministically (sort by name) to ensure stable behavior and logs.
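The stale-set step can be sketched as a pure function; `staleNames` is a hypothetical helper showing the deterministic ordering:

```go
package main

import (
	"fmt"
	"sort"
)

// staleNames returns names of existing managed objects that are absent from
// the desired set, sorted so deletions happen in a deterministic order.
func staleNames(desired map[string]struct{}, existing []string) []string {
	stale := []string{}
	for _, name := range existing {
		if _, ok := desired[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	desired := map[string]struct{}{"route-a": {}, "route-b": {}}
	existing := []string{"route-c", "route-a", "route-old"}
	fmt.Println(staleNames(desired, existing)) // [route-c route-old]
}
```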
The controller supports an optional GC_GRACE_PERIOD duration. When set, stale objects are scheduled for deletion after the grace period instead of being deleted immediately. This provides an operator window to inspect or recover from transient upstream configuration problems.
- If `READ_ONLY=true`, GC runs in dry-run mode and only logs deletions.
- Deletions performed after the grace period are recorded in `pangolin_kube_controller_gc_deleted_total{reason="grace"}`. Immediate deletions use `reason="immediate"`.
- The controller emits Kubernetes Events for GC deletions when possible (best-effort).
Set READ_ONLY=true to:

- Skip mutating API calls (create/patch/delete).
- Log instead: `[READ-ONLY] would apply IngressRoute my-route`
- Still track ETag/SHA256 so future writes skip unchanged resources.
- Record duration/error metrics but skip mutation counters.
Use Cases:
- Pre‑deployment validation in CI/CD.
- Audit in production without touching resources.
To avoid unnecessary or noisy updates to Kubernetes objects, the controller uses semantic diffing with normalization before applying changes.
- Semantic comparison: Compare resources logically, not by raw JSON strings.
- Normalization steps:
- Sort maps (for order-independent comparison).
- Strip server-defaulted fields.
- Remove metadata fields that should not trigger updates (e.g., timestamps, UIDs, resourceVersion, managedFields).
- Canonicalize numeric types to avoid type-related false positives.
- Comparison tools:
  - `apiequality.Semantic.DeepEqual` for Kubernetes-aware structural equality checks.
  - `go-cmp` with appropriate canonicalization options for stable, reproducible diffs.
- INFO level: Log concise, structured summaries of changed fields.
- DEBUG level: Log full before/after payloads for deep troubleshooting.
- Logged diffs should be order-independent to avoid noise from serialization order changes.
- No JSON string comparison: String-based diffing is order-sensitive and will generate unnecessary "changes" when only serialization order varies.
- Avoid including high-cardinality data (names, UIDs) in metric labels to prevent metric cardinality issues.
Tip: This semantic diffing process significantly reduces unnecessary patches, stabilizes reconciliation behavior, and improves observability when combined with structured logging.
Keep labels low-cardinality; recommended sets:

- `kind`: `IngressRoute`, `Middleware`, `TraefikService`
- `action`: `create`, `patch`
| Metric | Type | Description |
|---|---|---|
| `pangolin_kube_controller_reconcile_seconds` | Histogram | Duration of reconcile loop |
| `pangolin_kube_controller_reconcile_errors_total` | Counter | Number of reconcile errors |
| `pangolin_kube_controller_objects_applied_total{kind,action}` | Counter | Count of created/patched resources |
| `pangolin_kube_controller_objects_deleted_total{kind}` | Counter | Count of deleted resources |
| `pangolin_kube_controller_leader` | Gauge | 1 if leader |
| `pangolin_kube_controller_ready` | Gauge | 1 if ready |
| `up` | Gauge | Exporter reachability |
| `pangolin_kube_controller_build_info{version,git_sha}` | Gauge | Always 1; with build metadata |
Additional metrics and observability:
- `pangolin_kube_controller_consecutive_errors` (gauge): number of consecutive fetch/reconcile errors.
- `pangolin_kube_controller_last_fetch_success_timestamp_seconds` (gauge): unix timestamp of last successful fetch+reconcile.
- `pangolin_kube_controller_desired_objects_count{kind}` (gauge): desired objects per kind observed in last config.
- `pangolin_kube_controller_gc_deleted_total{kind,reason}` (counter): GC deletions annotated by reason (e.g., "immediate", "grace").
- `pangolin_kube_controller_gc_runs_total{result}` (counter): GC run results (start/success/fail/dryrun).
The controller also exposes additional metrics via the OpenTelemetry Prometheus exporter on the same /metrics endpoint. These series are safe to scrape with Prometheus and complement the legacy metrics above.
- `pangolin_controller_reconcile_phase_duration_seconds` (Histogram, unit: s)
  - Duration of each reconcile phase.
  - Labels:
    - `phase`: `middlewares` | `routers` | `serversTransports` | `services` | `tcp` | `udp`
    - `result`: `success` | `error`
- `pangolin_controller_fetch_duration_seconds` (Histogram, unit: s)
  - Duration of remote fetch cycle HTTP requests.
  - Labels:
    - `status_code`: e.g., `"200"`, `"304"`, `"401"`, `"403"`, `"404"`, `"5xx"`
    - `status_class`: `2xx` | `3xx` | `4xx` | `5xx`
- `pangolin_controller_k8s_request_duration_seconds` (Histogram, unit: s)
  - Duration of Kubernetes API requests.
  - Labels:
    - `verb`: `get` | `create` | `patch` | `update` | `delete` | `list`
    - `resource_kind`: low-cardinality kind (e.g., `IngressRoute`, `Middleware`, `TraefikService`, `ServersTransport`, `ServersTransportTCP`, `Service`, `EndpointSlice`)
    - `result`: `success` | `error` | `conflict`
    - `forced`: `true` | `false`
- `pangolin_controller_k8s_requests_total` (Counter)
  - Total Kubernetes API requests.
  - Labels: same as `pangolin_controller_k8s_request_duration_seconds`
- `pangolin_controller_retries_total` (Counter)
  - Total retry attempts by reason in the SSA apply loop.
  - Labels:
    - `reason`: `conflict` | `transient` | `timeout`
    - `operation`: `get` | `create` | `patch` | `delete` | `apply`
    - `resource_kind`: Kubernetes kind
- `pangolin_controller_active_reconcile_routines` (UpDownCounter)
  - Number of active reconcile routines by phase (parallel mode).
  - Labels:
    - `phase`: `middlewares` | `routers` | `serversTransports` | `services` | `tcp` | `udp`
- `pangolin_controller_gc_run_duration_seconds` (Histogram, unit: s)
  - Duration of GC runs.
  - Labels:
    - `result`: `success` | `fail` | `dryrun`
- `pangolin_controller_config_parse_duration_seconds` (Histogram, unit: s)
  - Duration of configuration parsing.
  - Labels:
    - `section`: `full`
- `pangolin_controller_loop_iterations_total` (Counter)
  - Number of controller loop iterations by outcome.
  - Labels:
    - `outcome`: `success` | `nochange` | `error`
| Endpoint | Purpose | Description |
|---|---|---|
| `/healthz` | Readiness | 200 after one successful reconcile and if recent (< 5 * POLL_INTERVAL), else 503 |
| `/readyz` | Readiness | Alias to `/healthz` |
| `/livez` | Liveness | 200 if server/process is up; can disable with `DISABLE_LIVEZ=true` |
| `/metrics` | Metrics | Prometheus endpoint |
| `/debug/pprof/` | Profiling | Available if `ENABLE_PPROF=true` and not disabled by `DISABLE_PPROF` |
Adapt readiness/liveness probe delays and timeouts to `POLL_INTERVAL` for reliability.
- If enabled, only the elected leader executes reconciliation.
- On losing leadership, default behavior is to log and exit (OnStoppedLeading → warn + exit). Alternatively, you can pause reconciliation until leadership is regained.
- Requires RBAC for `coordination.k8s.io` `leases`.

You can control controller behavior when leadership is lost via the `ON_LOSE` environment variable (or CLI flag `--on-lose`) with the following values:

- `exit` (default): log a warning and exit the process. This is suitable when relying on a Pod restart to trigger re-election.
- `pause`: stop reconciling but keep the process alive; useful for graceful handovers and rolling updates where automatic restarts are not desired.

If `pause` is used, the controller stops executing the main reconcile loop while still serving metrics and health endpoints.
Adjust scope (Role vs ClusterRole) and namespace bindings as needed.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pangolin-kube-controller
rules:
- apiGroups: ["traefik.io"]
  resources: ["ingressroutes", "middlewares", "traefikservices"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingressclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "get", "list"]
```

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pangolin-kube-controller
  labels:
    app: pangolin-kube-controller
spec:
  selector:
    app: pangolin-kube-controller
  ports:
  - name: http-metrics
    port: 9090
    targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pangolin-kube-controller
spec:
  selector:
    matchLabels:
      app: pangolin-kube-controller
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
```

If pprof is enabled and should not be scraped, use relabeling/`metric_relabel_configs` in Prometheus to restrict access.
- Add support for more Traefik CRDs: `TLSOptions`, `ServersTransport`, `MiddlewareChain`.
- Report CRD status conditions or emit Kubernetes Events.
- ConfigMap-based configuration with hot reload.
- Additional error metrics by CRD kind/action and step-level latency histograms.
- Webhook/streaming updates to reduce polling (Pangolin source code changes needed).
- Use namespace-scoped RBAC when possible.
- Run as non-root with least privilege ServiceAccount.
- Mount certificates/keys/CA as files via Secrets; never as env vars.
- Avoid `CONFIG_TLS_SKIP_VERIFY` in production.
- Keep metric labels low cardinality (avoid object names, namespaces, UIDs).
- Prefer SSA for safe field ownership.
- Default to SSA with `FieldManager: "pangolin-kube-controller"`. Use `force` sparingly.
- Normalize and sort resources before comparison for stable diffs.
- Apply then GC (create/patch first, delete later) to handle renames without traffic loss.
- Ensure `FETCH_TIMEOUT <= POLL_INTERVAL` to avoid overlapping fetches.
- Use structured JSON logging for production; log diffs at INFO, full payloads at DEBUG.