GCP Cloud Run + Kubernetes/GKE hardening, CLI lifecycle, docs, tests#57

Open
jivanb7 wants to merge 33 commits into dev from feat/gcp

@jivanb7 jivanb7 commented May 31, 2026

Summary

Brings deployml to a production-ready state on GCP across all three deployment targets, Cloud Run, GKE, and minikube, and makes the exact same CLI run on native Windows in addition to macOS and
Linux. The command surface never changes; the engine detects the operating system and adapts underneath. This branch hardens the deploy and destroy lifecycle, adds full Windows support, and brings
docs, examples, and tests up to date.

Closes #53
Closes #54
Closes #56

GCP Cloud Run security and reliability

  • Lock down Cloud SQL, remove public 0.0.0.0/0 access, and mark credential outputs sensitive
  • Inject MLflow and Grafana database and admin credentials through Secret Manager instead of plaintext env, and stop the Grafana entrypoint from logging the database URL
  • Add MLflow min-instances, a longer request timeout, and a startup health probe
  • Drop the unused plaintext database env from FastAPI, inject the Feast password through Secret Manager, and default Feast to private

CLI deploy and destroy lifecycle

  • Preflight gcloud auth and Application Default Credentials before deploy, so a missing ADC fails fast with guidance instead of at terraform apply. Fixes #54 (On deploy step (5), google config can't find default credentials)
  • Validate the config stack shape, so a malformed or set-typed entry exits cleanly instead of crashing. Fixes #53 (config dictionary treated as a set)
  • gcloud auth, ADC, and region preflight, with a consistent Terraform workspace name across deploy, destroy, get-urls, and status
  • Surface terraform stderr on failure, preserve state on a failed destroy, and clean up the Artifact Registry repo and Cloud Build staging bucket on destroy

BigQuery and teardown

  • Day-partition the drift, ground-truth, and prediction tables, and allow delete_contents_on_destroy
  • Narrow the auto-teardown service account privileges

Kubernetes, minikube, and GKE

  • minikube and GKE commands with namespace support and persistent storage
  • MLflow PVC with fsGroup and a Recreate strategy, so experiment data survives pod restarts on both minikube and GKE
  • Host-architecture local builds, so images run on an arm64 minikube node
  • GKE LoadBalancer wait targets the named service, with a retry on the transient cluster-delete conflict
  • gke-destroy is now fully self-cleaning. Deleting the cluster used to tear down the in-cluster CSI driver before it reclaimed the PVC backing disk, leaving an orphaned PersistentDisk that kept accruing charges.
    Teardown now captures that disk before deletion and removes it after the cluster is gone, along with the PVC and gcr.io image, so teardown ends at zero residual

Windows compatibility (Closes #56)

The same CLI now runs on native Windows. All operating-system awareness is centralized in one new module, src/deployml/utils/platform_compat.py, which exposes IS_WINDOWS, resolve_tool, run_tool,
configure_console_encoding, robust_rmtree, find_windows_bash, and terraform_env. Command code calls these helpers instead of branching on the OS. The changes below address the four blockers from #56,
plus two more found in the audit:

  • External tool execution. gcloud, bq, and gsutil ship as .cmd wrappers on Windows that subprocess could not launch by bare name. Every external tool call now routes through
    run_tool, which resolves the real executable and runs it.
  • Console encoding. Emoji and box glyphs crashed the cp1252 console. UTF-8 is forced at the CLI entry point and captured subprocess output is decoded as UTF-8, so neither the write nor the
    read side raises.
  • Cloud SQL Terraform provisioner. The readiness check used bash-only syntax and previously ran under cmd.exe. An explicit bash interpreter was added, and on Windows Terraform resolves to
    Git bash rather than the WSL launcher that mangles quoting.
  • Workspace cleanup. robust_rmtree clears the read-only bit and retries, so destroy cleanup survives read-only files and OneDrive locks.
  • GKE auth plugin. An actionable warning when gke-gcloud-auth-plugin is missing from PATH, so kubectl-to-GKE auth fails with guidance instead of a cryptic error.
  • minikube runtime. Documented the docker-driver service URL tunnel and the MLflow memory needs on Windows.

Cross-platform hardening and tests

  • doctor skips the docker permission check when docker is absent, instead of emitting a misleading permission failure on top of the docker-not-found result
  • Merged a duplicate markdown_extensions key in mkdocs.yml, so attr_list and pymdownx.emoji load again
  • Added unit tests for platform_compat covering tool resolution, the run_tool kwarg passthrough and Windows decode branch, robust_rmtree, and console encoding
  • Removed duplicate imports

Docs, examples, tests

  • Rewrote docs for the supported init, build-images, deploy, get-urls, and destroy flow, documented the new flags and behaviors, and added native Windows setup and platform notes
  • Example scripts guard env vars with actionable errors
  • 63 unit tests covering config validation, the stack and ADC preflight checks, helper probes, teardown cron timing, the platform_compat helpers, and the gke-destroy disk cleanup

Validation

Run live end to end on both operating systems against a GCP test project, with zero residual after every teardown.

macOS

  • 63 unit tests pass, doctor runs clean, mkdocs build --strict passes
  • Live GKE cycle proving the orphaned-disk fix: a real PersistentDisk was provisioned, the cluster was deleted, and the disk that would have been orphaned was removed automatically, ending at zero
    clusters and zero disks

native Windows

  • All six blockers proven live, with the original failures reproduced and the fixes shown to resolve them
  • Full Cloud Run cycle: init, build-images via Cloud Build, deploy to Apply complete, get-urls, all three services returning HTTP 200, destroy, zero residual
  • Full GKE cycle including the orphaned-disk fix, ending at zero residual
  • minikube with FastAPI and MLflow, PVC bound and data persisted across a pod restart
  • 63 unit tests pass, doctor clean, mkdocs build --strict passes

#53 and #54 were verified both with unit tests and live: a malformed stack exits cleanly, and removing ADC makes deploy fail at the preflight before any Terraform. #56 was verified live on native
Windows across all three deployment targets.

jivanb7 and others added 30 commits May 30, 2026 17:16
Remove public 0.0.0.0/0 access from Cloud SQL and mark credential outputs sensitive. Inject the MLflow backend DSN and the Grafana DB URL and admin password via Secret Manager value_from instead of plaintext env, and stop the Grafana entrypoint echoing the DB URL to logs. Add MLflow min-instances, a longer request timeout, and a startup health probe.
Drop the unused plaintext DATABASE_URL, BACKEND_STORE_URI, and USE_POSTGRES env from FastAPI since the app never reads them. Inject the Feast online-store password via Secret Manager and default the internal Feast feature server to private.
Partition drift_metrics, ground_truth, and predictions by day to bound query scan cost, and set delete_contents_on_destroy so teardown does not fail on populated tables.
Tighten the teardown service account role set and remove project IAM admin.
Add gcloud auth, ADC, and region preflight checks, keep the Terraform workspace name consistent across deploy, destroy, get-urls, and status, surface terraform stderr on failure, preserve state on a failed destroy, and clean up the Artifact Registry repo and Cloud Build staging bucket. Fix the generate overwrite flag and add the minikube and GKE commands with namespace support, persistent storage, and self-cleaning teardown.
Add a PersistentVolumeClaim with fsGroup and a Recreate strategy so MLflow data survives pod restarts on minikube and GKE, isolate deploys by namespace, target the named service when waiting for the GKE LoadBalancer IP, and harden image loading.
Probe the Docker daemon before local builds and build for the host architecture by default with an optional platform override. In the FastAPI app, load the model in a background task so startup is fast and report a degraded but healthy status when MLflow is unreachable.
Document the supported init, build-images, deploy, get-urls, destroy flow with auth, IAM, billing, and cost guidance, add config.example.yaml, and align the Python version and command references.
Each example script checks for the environment variables written by get-urls and, if any are missing, exits with a clear message pointing to the right command instead of failing cryptically.
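A minimal sketch of such a guard, for illustration only. The variable names MLFLOW_URL and FASTAPI_URL and the exact wording of the message are assumptions; the real scripts check whatever get-urls actually writes.

```python
import os
import sys

# Hypothetical variable names; the real ones are whatever get-urls writes.
REQUIRED_VARS = ("MLFLOW_URL", "FASTAPI_URL")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def require_env(env=None):
    """Exit with guidance if any required variable is missing."""
    env = os.environ if env is None else env
    missing = missing_vars(env)
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(
            "Missing environment variables: " + ", ".join(missing)
            + ". Run 'deployml get-urls' first and load the values it writes."
        )
```

Calling require_env() at the top of a script turns a later KeyError or connection failure into one message that names the missing variables.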
Add offline unit tests for config validation, helper probes, and teardown cron timing. Update packaging metadata and normalize shell and Dockerfile line endings.
deploy only checked 'gcloud auth list', so a user who was logged in but had not run application-default login passed the check and then failed opaquely at terraform apply with 'default credentials not found'. Add a GCP credentials preflight, mirroring init and doctor, that verifies both auth and ADC before any deploy work and exits with actionable guidance. Covers the Cloud Run and GKE deploy paths. Includes unit tests for the preflight.
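The two-part check described above could look roughly like this. It is a sketch, not the code in this branch: the function names and messages are hypothetical, and it assumes that `gcloud auth application-default print-access-token` exiting non-zero is an acceptable signal that ADC is missing.

```python
import subprocess


def _succeeds(cmd, run=subprocess.run):
    """True if the command exits 0 and prints something on stdout."""
    try:
        result = run(cmd, capture_output=True, text=True)
    except OSError:  # gcloud is not installed or cannot be launched
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


def check_gcp_credentials(run=subprocess.run):
    """Return a list of problems; an empty list means the preflight passed."""
    problems = []
    # Check 1: an active gcloud account (what 'gcloud auth login' provides).
    if not _succeeds(
        ["gcloud", "auth", "list", "--filter=status:ACTIVE", "--format=value(account)"],
        run,
    ):
        problems.append("No active gcloud account. Run: gcloud auth login")
    # Check 2: Application Default Credentials, which Terraform's provider uses.
    if not _succeeds(
        ["gcloud", "auth", "application-default", "print-access-token"], run
    ):
        problems.append(
            "Application Default Credentials not found. "
            "Run: gcloud auth application-default login"
        )
    return problems
```

A caller would print the returned problems and exit before any Terraform work if the list is non-empty. The `run` parameter exists only so the check can be unit-tested without gcloud installed.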
A stack tool value parsed as a set or any non-mapping crashed the deploy loop with 'set' object has no attribute 'get'. Validate in _validate_deploy_config_or_exit that stack is a list, each stage is a mapping, and each tool is a mapping, exiting with an actionable message that points at a stray !!set tag. Validation runs only when stack is present so existing configs are unaffected. Includes unit tests for set, non-list, and non-dict stage cases.
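The shape rules above (stack is a list, each stage is a mapping, each tool is a mapping) can be sketched as a small pure function. This is an illustration only: validate_stack is a hypothetical name, not the body of _validate_deploy_config_or_exit, and the stage key in the usage example is assumed.

```python
def validate_stack(stack):
    """Return an error message for a malformed stack, or None if the shape is valid."""
    if not isinstance(stack, list):
        return f"'stack' must be a list of stages, got {type(stack).__name__}."
    for i, stage in enumerate(stack):
        if not isinstance(stage, dict):
            return (
                f"stack[{i}] must be a mapping of stage name to tool, "
                f"got {type(stage).__name__}."
            )
        for stage_name, tool in stage.items():
            if not isinstance(tool, dict):
                # A YAML !!set tag is one way a tool ends up parsed as a set.
                hint = (
                    " Check the config for a stray !!set tag."
                    if isinstance(tool, (set, frozenset))
                    else ""
                )
                return (
                    f"Tool under '{stage_name}' must be a mapping, "
                    f"got {type(tool).__name__}.{hint}"
                )
    return None
```

For example, validate_stack([{"experiment_tracking": {"mlflow"}}]) returns a message ending in the !!set hint, because {"mlflow"} is a Python set rather than a dict. That is the same value that previously crashed with 'set' object has no attribute 'get'.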
Document build-images --platform, the deploy auth/ADC preflight and config-shape validation, destroy removing the Cloud Build staging bucket, generate --force, gke-cluster-create, the --namespace flag on the minikube and GKE deploy commands, MLflow PVC persistence on minikube and GKE, and gke-destroy cleaning the PVC and gcr.io image with --keep-images to opt out. Fixes the stale note that deployml does not manage GKE clusters.
destroy now removes the Artifact Registry repo and the Cloud Build staging bucket, so the old 'does not delete Artifact Registry images' line was inaccurate. Correct it and add an Other deployment targets section pointing to the local minikube and GKE flows.
api/overview lists the Kubernetes and GKE commands, features/overview documents gke and minikube as deployment targets selected by deployment.type, tutorials/overview adds a Kubernetes section, and costs notes the GKE PersistentDisk and ties the SQLite tip to the minikube and GKE paths. Brings the architecture and reference pages in line with the now-real Kubernetes functionality.
Centralize OS awareness so CLI commands stay identical across Windows, macOS, and Linux. resolve_tool returns an executable tool path and prefers a real wrapper over the extensionless launcher scripts that ship beside gcloud.cmd, bq.cmd, gsutil.cmd, and docker.exe on Windows. run_tool launches the resolved path and falls back to the command interpreter only if a direct .cmd launch raises OSError, so batch wrappers work on any Windows build. configure_console_encoding forces UTF-8 with replacement to stop cp1252 UnicodeEncodeError crashes. robust_rmtree clears the read only bit and retries, supporting both the onerror and onexc rmtree callbacks.
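A simplified sketch of the resolve-then-run idea, assuming a fixed extension preference order on Windows. The real resolve_tool and run_tool in platform_compat may differ in details, such as how they rank candidates or build the fallback command.

```python
import os
import shutil
import subprocess

IS_WINDOWS = os.name == "nt"


def resolve_tool(name):
    """Return the full path of an executable, or raise FileNotFoundError."""
    candidates = [name]
    if IS_WINDOWS:
        # Try real wrappers first so an extensionless launcher script that
        # sits beside gcloud.cmd is not picked up by accident.
        candidates = [name + ext for ext in (".cmd", ".exe", ".bat")] + [name]
    for candidate in candidates:
        path = shutil.which(candidate)
        if path:
            return path
    raise FileNotFoundError(f"{name} not found on PATH")


def run_tool(name, args, **kwargs):
    """Run a tool by its resolved path, passing subprocess.run kwargs through."""
    cmd = [resolve_tool(name), *args]
    try:
        return subprocess.run(cmd, **kwargs)
    except OSError:
        if not IS_WINDOWS:
            raise
        # Fallback for batch wrappers that cannot be launched directly.
        return subprocess.run(["cmd", "/c", *cmd], **kwargs)
```

On macOS and Linux the candidate list is just the bare name, so the resolved path is the one a bare-name launch would have used.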
On Windows gcloud, bq, and gsutil are .cmd batch wrappers that subprocess cannot launch by bare name, failing with WinError 2. Route every external tool invocation, gcloud, bq, gsutil, terraform, docker, kubectl, minikube, infracost, git, through run_tool from platform_compat, which resolves the real executable path and works for both .cmd wrappers and .exe binaries. Streaming Popen call sites resolve the tool with resolve_tool and keep Popen. Remove now-dead subprocess imports where only run calls remained. Behavior is preserved on macOS and Linux because resolve_tool returns the same path a bare name resolves to, and all subprocess kwargs pass through unchanged. Update the helpers unit tests to patch run_tool instead of subprocess.run to match the new call path; they assert only on return values, so coverage is unchanged.
The default Windows console code page is cp1252, so emoji and box glyphs in CLI output raise UnicodeEncodeError and crash the command. Call configure_console_encoding first thing in main(), before cli(), to reconfigure stdout and stderr to UTF-8 with errors=replace. Proven: on a cp1252 stream the lightbulb glyph raised UnicodeEncodeError, and after the call stdout is UTF-8 and the glyph prints. deployml doctor now runs to completion with exit code 0 and no UnicodeEncodeError.
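The failure and the fix can be reproduced without a Windows console by using an in-memory cp1252 text stream. The sketch below assumes the helper simply calls reconfigure on each stream; the `streams` parameter is added here for the demonstration and may not exist in the real function.

```python
import io
import sys


def configure_console_encoding(streams=None):
    """Switch text streams to UTF-8, replacing characters that cannot be encoded."""
    streams = (sys.stdout, sys.stderr) if streams is None else streams
    for stream in streams:
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:  # absent on some replaced or wrapped streams
            reconfigure(encoding="utf-8", errors="replace")


# Demonstration on an in-memory stream that starts out as cp1252.
demo = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
try:
    demo.write("\U0001F4A1")  # lightbulb glyph, not representable in cp1252
    failed_before = False
except UnicodeEncodeError:
    failed_before = True

configure_console_encoding([demo])
demo.write("\U0001F4A1")  # now encodes as UTF-8 instead of raising
demo.flush()
```

After the call, failed_before is True and the stream's underlying buffer holds the UTF-8 bytes of the glyph.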
The local-exec readiness script uses bash only syntax: set +e, brace expansion {1..30}, command -v, POSIX test brackets, and sleep. On Windows the default local-exec interpreter is cmd.exe, which cannot parse this and fails the provisioner, killing the deploy. Add interpreter = [bash, -c] so it always runs under bash, provided by Git for Windows or WSL. This is also a portability improvement on Ubuntu, whose /bin/sh is dash and does not expand {1..30}. The script self-guards with command -v gcloud and always exits 0, so it degrades gracefully regardless of which bash resolves.
On Windows, shutil.rmtree of the .deployml workspace can fail with PermissionError when a file is marked read only or briefly locked, for example by a sync client. Route the destroy workspace cleanup through robust_rmtree from platform_compat, which clears the read only bit and retries. Removed the now-unused import shutil; this was its only use in cli.py. The Terraform module copy rmtree calls in helpers.py are left as-is per the brief, to be revisited only if they fail.
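A sketch of the clear-and-retry pattern, based on the read-only recipe in the shutil documentation. The retry count, the delay, and the helper name _clear_readonly_and_retry are assumptions rather than the values used in platform_compat.

```python
import os
import shutil
import stat
import sys
import time


def _clear_readonly_and_retry(func, path, _exc):
    """rmtree error callback: make the path writable, then retry the failed call."""
    os.chmod(path, stat.S_IWRITE)
    func(path)


def robust_rmtree(path, attempts=3, delay=0.5):
    """Remove a directory tree, tolerating read-only files and brief locks."""
    # Python 3.12 added the onexc callback and deprecated onerror.
    key = "onexc" if sys.version_info >= (3, 12) else "onerror"
    for attempt in range(attempts):
        try:
            shutil.rmtree(path, **{key: _clear_readonly_and_retry})
            return
        except FileNotFoundError:
            return  # already gone, nothing left to do
        except PermissionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # give a sync client time to release the file
```

The callback handles the read-only bit within a single pass, and the outer loop covers files that are locked only briefly.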
kubectl cannot authenticate to GKE without gke-gcloud-auth-plugin, and on Windows the gcloud SDK bin directory holding it is not always on the PATH a subprocess inherits. connect_to_gke_cluster now checks shutil.which for the plugin after connecting and, if absent, prints an actionable message: how to install it and, on Windows, the SDK bin directory to add to PATH. This replaces the cryptic kubectl executable-not-found failure with a clear fix, covering every GKE flow since deploy and destroy both connect through this function.
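The check itself is small. Here is a hedged sketch in which the function name, the parameters, and the message text are all hypothetical; only the plugin name and the `gcloud components install` command come from the Google Cloud SDK.

```python
import os
import shutil


def gke_auth_plugin_warning(which=shutil.which, is_windows=(os.name == "nt")):
    """Return a warning if gke-gcloud-auth-plugin is not on PATH, else None."""
    if which("gke-gcloud-auth-plugin"):
        return None
    lines = [
        "gke-gcloud-auth-plugin was not found on PATH; "
        "kubectl cannot authenticate to GKE without it.",
        "Install it with: gcloud components install gke-gcloud-auth-plugin",
    ]
    if is_windows:
        lines.append("Then add the Google Cloud SDK bin directory to PATH.")
    return "\n".join(lines)
```

Returning the text instead of printing it keeps the check easy to test; the caller decides how to display it.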
Expand the installation platform notes for native Windows: the required toolchain, using the py launcher to build the venv instead of the Microsoft Store python stub, Git for Windows as a hard requirement for the bash that the Cloud SQL readiness step needs, keeping the working directory off OneDrive to avoid workspace cleanup PermissionErrors, and the gcloud .cmd and PowerShell execution-policy note. The GKE auth plugin PATH note for Windows already exists in the Cloud Run tutorial. mkdocs build --strict passes.
The interpreter = [bash, -c] fix is not enough on a Windows host that also has WSL: terraform resolves bare bash to C:\Windows\System32\bash.exe, the WSL launcher, which re-translates the Windows command line and strips embedded quotes. The Cloud SQL readiness script gcloud --format=value(state) then reaches bash as an unquoted value(state) and fails with a syntax error on the parenthesis, killing the apply. Proven with subprocess: WSL bash errors on the quoted parens, Git bash returns value(state) cleanly. Fix in the engine: find_windows_bash locates a real Git for Windows bash and terraform_env prepends its directory to PATH for the terraform apply subprocess, so the bare bash interpreter resolves to Git bash. No effect off Windows, so macOS and Linux are unchanged. Caught by the live W9 Cloud Run deploy.
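A rough sketch of the PATH-prepend approach. The install locations searched and the extra parameters are assumptions made for illustration; the real find_windows_bash may look elsewhere, for example relative to git.exe.

```python
import os


def find_windows_bash(is_windows=(os.name == "nt"), exists=os.path.isfile):
    """Return the path to a Git for Windows bash.exe, or None if not applicable."""
    if not is_windows:
        return None
    # Assumed default install roots for Git for Windows.
    roots = [
        os.environ.get("ProgramFiles", r"C:\Program Files"),
        os.environ.get("ProgramFiles(x86)", r"C:\Program Files (x86)"),
    ]
    for root in roots:
        candidate = os.path.join(root, "Git", "bin", "bash.exe")
        if exists(candidate):
            return candidate
    return None


def terraform_env(base_env=None, bash_path=None):
    """Return an environment with Git bash first on PATH, or None if unchanged."""
    bash_path = bash_path if bash_path is not None else find_windows_bash()
    if bash_path is None:
        return None  # subprocess treats env=None as "inherit the parent environment"
    env = dict(os.environ if base_env is None else base_env)
    env["PATH"] = os.path.dirname(bash_path) + os.pathsep + env.get("PATH", "")
    return env
```

Passing the result as env= to the terraform subprocess means a bare `bash` is looked up in the Git directory before C:\Windows\System32.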
configure_console_encoding fixed the write side, but captured subprocess output was still decoded with the legacy cp1252 code page, which raised UnicodeDecodeError on bytes invalid in cp1252, for example 0x9d emitted by minikube. run_tool now sets encoding=utf-8, errors=replace when the caller requests text mode on Windows, and the terraform streaming Popen and its log file in helpers use the same. Caught live by minikube-deploy, where minikube service --url output crashed a subprocess reader thread; after the fix minikube-deploy runs clean with no traceback. No effect off Windows, since utf-8 is already the default there.
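The read-side failure is easy to reproduce on any platform, because Python's cp1252 codec has no mapping for byte 0x9d. This sketch shows the strict decode failing and the same bytes surviving with encoding="utf-8", errors="replace", both directly and through subprocess.run. The child command is a stand-in for minikube.

```python
import subprocess
import sys

raw = b"ok \x9d done"  # 0x9d is undefined in Python's cp1252 codec

try:
    raw.decode("cp1252")
    legacy_failed = False
except UnicodeDecodeError:
    legacy_failed = True

# errors="replace" maps each undecodable byte to U+FFFD instead of raising.
text = raw.decode("utf-8", errors="replace")

# The same settings applied to captured subprocess output.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\x9d done')"],
    capture_output=True,
    encoding="utf-8",
    errors="replace",
)
```

Here legacy_failed is True, and both text and result.stdout contain a replacement character where the invalid byte was.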
Document the Windows minikube nuances found during live validation: the Docker Desktop driver service URL is not reachable from the host without minikube tunnel, minikube service --url, or kubectl port-forward; MLflow on minikube needs at least 4 GB which may not fit on an 8 GB machine; and gcloud components install may need CLOUDSDK_PYTHON via copy-bundled-python in a non interactive shell.
subprocess was imported twice and shutil was re-imported locally in cleanup_terraform_files; both already exist at module top. No behavior change.
The doctor Docker permission check previously went straight to run_tool('docker', ['ps']); with docker missing, resolve_tool raised FileNotFoundError, which was caught and reported as a misleading FAIL on top of the real Docker-not-found FAIL. Guard with shutil.which and emit a clean SKIP, matching the other checks.
Two markdown_extensions blocks meant YAML kept only the second, silently dropping attr_list and pymdownx.emoji. Merge into one block so all extensions load. mkdocs build --strict passes.
Cover resolve_tool resolve and raise, run_tool kwarg passthrough and the Windows utf-8 decode branch, robust_rmtree, and configure_console_encoding. Closes the gap where mocking run_tool wholesale skipped resolve_tool.
Deleting the cluster tore down the in-cluster CSI driver before it reclaimed the PVC backing PersistentDisk, orphaning a billing disk. Capture the disk before teardown and remove it after the cluster is deleted, touching only the disk our own PVC created. Validated live on GKE: reproduced the orphan and the fix removed it, zero residual.
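The capture-then-delete ordering can be sketched as follows. This is an illustration under stated assumptions, not the code in this branch: the function names are hypothetical, only zonal disks are handled, and the volume handle format projects/<project>/zones/<zone>/disks/<disk> is the one used by the GCE Persistent Disk CSI driver.

```python
import json
import subprocess


def disks_from_pv_json(pv_json, namespace, claim):
    """Return (zone, disk) pairs for PVs bound to the given PVC."""
    found = []
    for pv in json.loads(pv_json).get("items", []):
        spec = pv.get("spec", {})
        ref = spec.get("claimRef") or {}
        if ref.get("namespace") != namespace or ref.get("name") != claim:
            continue  # only touch the disk our own PVC created
        handle = (spec.get("csi") or {}).get("volumeHandle", "")
        parts = handle.split("/")
        if len(parts) == 6 and parts[2] == "zones" and parts[4] == "disks":
            found.append((parts[3], parts[5]))
    return found


def destroy_cluster_and_disks(cluster, zone, project, namespace, claim,
                              run=subprocess.run):
    """Delete the cluster, then the disks that backed the named PVC."""
    # 1. Capture the disks while the cluster API is still reachable.
    pvs = run(["kubectl", "get", "pv", "-o", "json"],
              capture_output=True, text=True, check=True)
    disks = disks_from_pv_json(pvs.stdout, namespace, claim)
    # 2. Delete the cluster; the CSI driver goes away with it.
    run(["gcloud", "container", "clusters", "delete", cluster,
         "--zone", zone, "--project", project, "--quiet"], check=True)
    # 3. Remove the disks that would otherwise be orphaned.
    for disk_zone, disk in disks:
        run(["gcloud", "compute", "disks", "delete", disk,
             "--zone", disk_zone, "--project", project, "--quiet"], check=True)
    return disks
```

Step 1 must come first: once the cluster is gone, there is no PersistentVolume object left to read the disk name from.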
jivanb7 and others added 3 commits June 1, 2026 00:03
…roken manifest

gke-init ignored the push_image_to_gcr return value, so a failed docker push (Docker not running, auth, or any error) silently produced a manifest referencing a gcr.io image that does not exist; the user only discovered it later as an ImagePullBackOff at deploy time. Now check the result and emit a clear, actionable warning naming the image and the consequence and how to fix it. Applies to both the fastapi and mlflow gke-init generators. 63 unit tests pass.
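A minimal sketch of the return-value check, assuming push_image_to_gcr returns a truthy value on success. The helper name and the warning text are hypothetical, and the suggested remedies are general Docker and gcloud steps rather than the exact wording of the new warning.

```python
def image_push_warning(pushed, image):
    """Return a warning for a failed push, or None if the push succeeded."""
    if pushed:
        return None
    return (
        f"WARNING: pushing {image} failed. The generated manifest references "
        "an image that does not exist, so pods will fail with ImagePullBackOff "
        "at deploy time. Check that Docker is running, run "
        "'gcloud auth configure-docker', and push the image again."
    )
```

In gke-init the result of the push would be passed in as `pushed`, and any non-None message printed before the manifest is written.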
Windows Compatibility - GCP + GKE (Kubernetes, Minikube)
…and dead notebook.py removal (PR #59), keep the Windows compat and run_tool work

# Conflicts:
#	src/deployml/cli/cli.py
#	src/deployml/enum/cloud_provider.py