Remove public 0.0.0.0/0 access from Cloud SQL and mark credential outputs sensitive. Inject the MLflow backend DSN and the Grafana DB URL and admin password via Secret Manager value_from instead of plaintext env, and stop the Grafana entrypoint echoing the DB URL to logs. Add MLflow min-instances, a longer request timeout, and a startup health probe.
Drop the unused plaintext DATABASE_URL, BACKEND_STORE_URI, and USE_POSTGRES env from FastAPI since the app never reads them. Inject the Feast online-store password via Secret Manager and default the internal Feast feature server to private.
Partition drift_metrics, ground_truth, and predictions by day to bound query scan cost, and set delete_contents_on_destroy so teardown does not fail on populated tables.
Tighten the teardown service account role set and remove project IAM admin.
Add gcloud auth, ADC, and region preflight checks, keep the Terraform workspace name consistent across deploy, destroy, get-urls, and status, surface terraform stderr on failure, preserve state on a failed destroy, and clean up the Artifact Registry repo and Cloud Build staging bucket. Fix the generate overwrite flag and add the minikube and GKE commands with namespace support, persistent storage, and self-cleaning teardown.
Add a PersistentVolumeClaim with fsGroup and a Recreate strategy so MLflow data survives pod restarts on minikube and GKE, isolate deploys by namespace, target the named service when waiting for the GKE LoadBalancer IP, and harden image loading.
Probe the Docker daemon before local builds and build for the host architecture by default with an optional platform override. In the FastAPI app, load the model in a background task so startup is fast and report a degraded but healthy status when MLflow is unreachable.
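The background-load and degraded-status behaviour can be sketched with plain asyncio. This is a minimal illustration, not the app's actual code: `ModelState`, `load_model`, `health`, and the response fields are hypothetical names, and the real service presumably wires this into FastAPI startup rather than calling it directly.

```python
import asyncio


class ModelState:
    """Holds the loaded model, or the reason it could not be loaded."""

    def __init__(self):
        self.model = None
        self.error = None


async def load_model(state, fetch):
    """Run the blocking fetch in a worker thread so startup is not delayed."""
    try:
        state.model = await asyncio.to_thread(fetch)
    except Exception as exc:  # for example, the tracking server is unreachable
        state.error = str(exc)


def health(state):
    """Always answer the health check; flag degradation instead of failing."""
    if state.model is not None:
        return {"status": "healthy", "model_loaded": True}
    return {
        "status": "degraded",
        "model_loaded": False,
        "detail": state.error or "model loading",
    }
```

The health check stays cheap because it only reads state; the slow work happens in the task scheduled at startup.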
Document the supported init, build-images, deploy, get-urls, destroy flow with auth, IAM, billing, and cost guidance, add config.example.yaml, and align the Python version and command references.
Each script checks for the variables written by get-urls and exits with a clear message pointing to the right command instead of failing cryptically.
Add offline unit tests for config validation, helper probes, and teardown cron timing. Update packaging metadata and normalize shell and Dockerfile line endings.
deploy only checked 'gcloud auth list', so a user who was logged in but had not run application-default login passed the check and then failed opaquely at terraform apply with 'default credentials not found'. Add a GCP credentials preflight, mirroring init and doctor, that verifies both auth and ADC before any deploy work and exits with actionable guidance. Covers the Cloud Run and GKE deploy paths. Includes unit tests for the preflight.
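A preflight of this kind might look like the sketch below. The function name, the message wording, and the injectable `run` parameter are assumptions made for illustration; the two gcloud invocations are standard CLI commands, but the real preflight may check credentials differently.

```python
import subprocess

AUTH_CMD = ["gcloud", "auth", "list", "--filter=status:ACTIVE", "--format=value(account)"]
ADC_CMD = ["gcloud", "auth", "application-default", "print-access-token"]


def gcp_credentials_preflight(run=subprocess.run):
    """Return actionable problems; an empty list means auth and ADC both look fine.

    `run` is injectable so the check can be unit tested without gcloud installed.
    On Windows a bare "gcloud" needs path resolution before it can be launched.
    """
    problems = []
    auth = run(AUTH_CMD, capture_output=True, text=True)
    if auth.returncode != 0 or not auth.stdout.strip():
        problems.append("No active gcloud account. Run: gcloud auth login")
    adc = run(ADC_CMD, capture_output=True, text=True)
    if adc.returncode != 0:
        problems.append(
            "Application Default Credentials not found. "
            "Run: gcloud auth application-default login"
        )
    return problems
```

Checking both before any deploy work means a logged-in user without ADC sees the second message up front, not a Terraform credential error later.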
A stack tool value parsed as a set or any non-mapping crashed the deploy loop with 'set' object has no attribute 'get'. Validate in _validate_deploy_config_or_exit that stack is a list, each stage is a mapping, and each tool is a mapping, exiting with an actionable message that points at a stray !!set tag. Validation runs only when stack is present so existing configs are unaffected. Includes unit tests for set, non-list, and non-dict stage cases.
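The shape check can be illustrated roughly as follows. This sketch assumes each stage maps a stage name to a tool mapping; `validate_stack_shape` and the message text are hypothetical and are not the body of `_validate_deploy_config_or_exit`.

```python
def validate_stack_shape(config):
    """Return a list of shape errors for config['stack']; empty means valid.

    Validation is skipped when 'stack' is absent, so existing configs pass.
    """
    if "stack" not in config:
        return []
    stack = config["stack"]
    if not isinstance(stack, list):
        return [f"'stack' must be a list, got {type(stack).__name__}"]
    errors = []
    for index, stage in enumerate(stack):
        if not isinstance(stage, dict):
            errors.append(f"stack[{index}] must be a mapping, got {type(stage).__name__}")
            continue
        for stage_name, tool in stage.items():
            if not isinstance(tool, dict):
                # A YAML !!set tag parses to a Python set, which has no .get().
                hint = " (check for a stray !!set tag)" if isinstance(tool, set) else ""
                errors.append(
                    f"stack[{index}].{stage_name} must be a mapping, "
                    f"got {type(tool).__name__}{hint}"
                )
    return errors
```

A caller would print the errors and exit non-zero before entering the deploy loop.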
Document build-images --platform, the deploy auth/ADC preflight and config-shape validation, destroy removing the Cloud Build staging bucket, generate --force, gke-cluster-create, the --namespace flag on the minikube and GKE deploy commands, MLflow PVC persistence on minikube and GKE, and gke-destroy cleaning the PVC and gcr.io image with --keep-images to opt out. Fixes the stale note that deployml does not manage GKE clusters.
destroy now removes the Artifact Registry repo and the Cloud Build staging bucket, so the old 'does not delete Artifact Registry images' line was inaccurate. Correct it and add an Other deployment targets section pointing to the local minikube and GKE flows.
api/overview lists the Kubernetes and GKE commands, features/overview documents gke and minikube as deployment targets selected by deployment.type, tutorials/overview adds a Kubernetes section, and costs notes the GKE PersistentDisk and ties the SQLite tip to the minikube and GKE paths. Brings the architecture and reference pages in line with the now-real Kubernetes functionality.
Centralize OS awareness so CLI commands stay identical across Windows, macOS, and Linux. resolve_tool returns an executable tool path and prefers a real wrapper over the extensionless launcher scripts that ship beside gcloud.cmd, bq.cmd, gsutil.cmd, and docker.exe on Windows. run_tool launches the resolved path and falls back to the command interpreter only if a direct .cmd launch raises OSError, so batch wrappers work on any Windows build. configure_console_encoding forces UTF-8 with replacement to stop cp1252 UnicodeEncodeError crashes. robust_rmtree clears the read-only bit and retries, supporting both the onerror and onexc rmtree callbacks.
On Windows gcloud, bq, and gsutil are .cmd batch wrappers that subprocess cannot launch by bare name, failing with WinError 2. Route every external tool invocation, gcloud, bq, gsutil, terraform, docker, kubectl, minikube, infracost, git, through run_tool from platform_compat, which resolves the real executable path and works for both .cmd wrappers and .exe binaries. Streaming Popen call sites resolve the tool with resolve_tool and keep Popen. Remove now-dead subprocess imports where only run calls remained. Behavior is preserved on macOS and Linux because resolve_tool returns the same path a bare name resolves to, and all subprocess kwargs pass through unchanged. Update the helpers unit tests to patch run_tool instead of subprocess.run to match the new call path; they assert only on return values, so coverage is unchanged.
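The resolve-then-run idea can be sketched as below. This is a simplified stand-in for the helpers in platform_compat, not their actual code: the candidate extension order is an assumption, and the command-interpreter fallback mentioned above is omitted.

```python
import shutil
import subprocess
import sys

IS_WINDOWS = sys.platform == "win32"


def resolve_tool(name):
    """Return a launchable path for `name`, preferring real wrappers on Windows."""
    candidates = [name]
    if IS_WINDOWS:
        # Try .exe and .cmd before the extensionless launcher script.
        candidates = [f"{name}.exe", f"{name}.cmd", f"{name}.bat", name]
    for candidate in candidates:
        path = shutil.which(candidate)
        if path:
            return path
    raise FileNotFoundError(f"{name} was not found on PATH")


def run_tool(name, args, **kwargs):
    """Run an external tool by resolved path; subprocess kwargs pass through."""
    return subprocess.run([resolve_tool(name), *args], **kwargs)
```

On macOS and Linux `resolve_tool` returns the same path a bare name would resolve to, so behaviour there is unchanged.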
The default Windows console code page is cp1252, so emoji and box glyphs in CLI output raise UnicodeEncodeError and crash the command. Call configure_console_encoding first thing in main(), before cli(), to reconfigure stdout and stderr to UTF-8 with errors=replace. Proven: on a cp1252 stream the lightbulb glyph raised UnicodeEncodeError, and after the call stdout is UTF-8 and the glyph prints. deployml doctor now runs to completion with exit code 0 and no UnicodeEncodeError.
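The reconfiguration step can be reproduced with a small sketch. The optional `streams` parameter is an assumption added here so the behaviour can be exercised on an in-memory cp1252 stream; the real helper may take no arguments.

```python
import io
import sys


def configure_console_encoding(streams=None):
    """Switch text streams to UTF-8, replacing anything that cannot be encoded."""
    for stream in streams or (sys.stdout, sys.stderr):
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:  # redirected or wrapped streams may lack it
            reconfigure(encoding="utf-8", errors="replace")


# A cp1252 stream standing in for the default Windows console.
console = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
configure_console_encoding([console])
console.write("\U0001F4A1 tip")  # lightbulb glyph, not representable in cp1252
console.flush()
```

Calling it once before the CLI entry point covers every later print.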
The local-exec readiness script is written for a Unix shell: set +e, bash brace expansion {1..30}, command -v, POSIX test brackets, and sleep. On Windows the default local-exec interpreter is cmd.exe, which cannot parse this and fails the provisioner, killing the deploy. Add interpreter = [bash, -c] so it always runs under bash, provided by Git for Windows or WSL. This is also a portability improvement on Ubuntu, whose /bin/sh is dash and does not expand {1..30}. The script self-guards with command -v gcloud and always exits 0, so it degrades gracefully regardless of which bash resolves.
On Windows, shutil.rmtree of the .deployml workspace can fail with PermissionError when a file is marked read-only or briefly locked, for example by a sync client. Route the destroy workspace cleanup through robust_rmtree from platform_compat, which clears the read-only bit and retries. Removed the now-unused import shutil; this was its only use in cli.py. The Terraform module copy rmtree calls in helpers.py are left as-is per the brief, to be revisited only if they fail.
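A helper along these lines could look like the following sketch. The retry count and delay are illustrative assumptions, and the handler follows the read-only idiom from the shutil documentation; the real robust_rmtree may differ in detail.

```python
import os
import shutil
import stat
import sys
import time


def _clear_readonly_and_retry(func, path, _exc):
    """rmtree error handler: clear the read-only bit, then retry the failed call."""
    os.chmod(path, stat.S_IWRITE)
    func(path)


def robust_rmtree(path, retries=3, delay=0.2):
    """Remove a directory tree, tolerating read-only files and brief locks."""
    for attempt in range(retries):
        if not os.path.lexists(path):
            return
        try:
            if sys.version_info >= (3, 12):
                # onexc replaced the deprecated onerror callback in Python 3.12.
                shutil.rmtree(path, onexc=_clear_readonly_and_retry)
            else:
                shutil.rmtree(path, onerror=_clear_readonly_and_retry)
            return
        except PermissionError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # give a sync client time to release its lock
```

Both callbacks receive three positional arguments, which is why one handler can serve either keyword.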
kubectl cannot authenticate to GKE without gke-gcloud-auth-plugin, and on Windows the gcloud SDK bin directory holding it is not always on the PATH a subprocess inherits. connect_to_gke_cluster now checks shutil.which for the plugin after connecting and, if absent, prints an actionable message: how to install it and, on Windows, the SDK bin directory to add to PATH. This replaces the cryptic kubectl executable-not-found failure with a clear fix, covering every GKE flow since deploy and destroy both connect through this function.
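The plugin check amounts to a PATH lookup plus a platform-specific hint, roughly as sketched here. `auth_plugin_hint`, its parameters, and the message wording are hypothetical; only the plugin name and the `gcloud components install` command come from the standard tooling.

```python
import shutil
import sys


def auth_plugin_hint(which=shutil.which, platform=sys.platform):
    """Return guidance when gke-gcloud-auth-plugin is missing, else None."""
    if which("gke-gcloud-auth-plugin"):
        return None
    lines = [
        "gke-gcloud-auth-plugin was not found on PATH; "
        "kubectl cannot authenticate to GKE without it.",
        "Install it with: gcloud components install gke-gcloud-auth-plugin",
    ]
    if platform == "win32":
        lines.append("On Windows, also add the Google Cloud SDK bin directory to PATH.")
    return "\n".join(lines)
```

Returning the message leaves the caller free to choose how it is displayed.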
Expand the installation platform notes for native Windows: the required toolchain, using the py launcher to build the venv instead of the Microsoft Store python stub, Git for Windows as a hard requirement for the bash that the Cloud SQL readiness step needs, keeping the working directory off OneDrive to avoid workspace cleanup PermissionErrors, and the gcloud .cmd and PowerShell execution-policy note. The GKE auth plugin PATH note for Windows already exists in the Cloud Run tutorial. mkdocs build --strict passes.
The interpreter = [bash, -c] fix is not enough on a Windows host that also has WSL: terraform resolves bare bash to C:\Windows\System32\bash.exe, the WSL launcher, which re-translates the Windows command line and strips embedded quotes. The Cloud SQL readiness script gcloud --format=value(state) then reaches bash as an unquoted value(state) and fails with a syntax error on the parenthesis, killing the apply. Proven with subprocess: WSL bash errors on the quoted parens, Git bash returns value(state) cleanly. Fix in the engine: find_windows_bash locates a real Git for Windows bash and terraform_env prepends its directory to PATH for the terraform apply subprocess, so the bare bash interpreter resolves to Git bash. No effect off Windows, so macOS and Linux are unchanged. Caught by the live W9 Cloud Run deploy.
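The PATH-prepend approach can be sketched as follows. The candidate install locations, the parameters, and the function bodies are assumptions for illustration only; the real find_windows_bash and terraform_env may locate bash differently.

```python
import os

# Assumed default Git for Windows install locations; real detection may differ.
DEFAULT_GIT_BASH_CANDIDATES = (
    r"C:\Program Files\Git\bin\bash.exe",
    r"C:\Program Files (x86)\Git\bin\bash.exe",
)


def find_windows_bash(candidates=DEFAULT_GIT_BASH_CANDIDATES, is_windows=(os.name == "nt")):
    """Return the first existing Git bash path on Windows, otherwise None."""
    if not is_windows:
        return None
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def terraform_env(bash_path, base_env=None):
    """Copy the environment, putting the bash directory first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    if bash_path:
        bash_dir = os.path.dirname(bash_path)
        env["PATH"] = bash_dir + os.pathsep + env.get("PATH", "")
    return env
```

The returned mapping would be passed as `env=` to the terraform apply subprocess, so only that child process sees the reordered PATH.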
configure_console_encoding fixed the write side, but captured subprocess output was still decoded with the legacy cp1252 code page, which raised UnicodeDecodeError on bytes invalid in cp1252, for example 0x9d emitted by minikube. run_tool now sets encoding=utf-8, errors=replace when the caller requests text mode on Windows, and the terraform streaming Popen and its log file in helpers use the same. Caught live by minikube-deploy, where minikube service --url output crashed a subprocess reader thread; after the fix minikube-deploy runs clean with no traceback. No change off Windows, since utf-8 is already the default there.
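The decode failure and the fix are easy to reproduce in isolation. In this sketch `with_utf8_decoding` is a hypothetical helper showing how the extra subprocess keyword arguments might be applied; it is not the actual run_tool code.

```python
def with_utf8_decoding(kwargs, is_windows):
    """Return subprocess kwargs with explicit UTF-8 decoding for Windows text mode."""
    text_mode = kwargs.get("text") or kwargs.get("universal_newlines")
    if is_windows and text_mode and "encoding" not in kwargs:
        return {**kwargs, "encoding": "utf-8", "errors": "replace"}
    return dict(kwargs)


raw = b"http://127.0.0.1:30500 \x9d"  # 0x9d has no mapping in cp1252

try:
    raw.decode("cp1252")
    legacy_decode_failed = False
except UnicodeDecodeError:
    legacy_decode_failed = True

# With errors="replace" the stray byte becomes U+FFFD and decoding never raises.
decoded = raw.decode("utf-8", errors="replace")
```

Leaving the kwargs untouched when the caller did not ask for text mode keeps bytes-mode callers unaffected.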
Document the Windows minikube nuances found during live validation: the Docker Desktop driver service URL is not reachable from the host without minikube tunnel, minikube service --url, or kubectl port-forward; MLflow on minikube needs at least 4 GB, which may not fit on an 8 GB machine; and gcloud components install may need CLOUDSDK_PYTHON via copy-bundled-python in a non-interactive shell.
subprocess was imported twice and shutil was re-imported locally in cleanup_terraform_files; both already exist at module top. No behavior change.
The doctor docker permission check previously went straight to run_tool('docker', ['ps']); with docker missing, resolve_tool raised FileNotFoundError, which was caught and reported as a misleading FAIL on top of the real Docker-not-found FAIL. Guard with shutil.which and emit a clean SKIP, matching the other checks.
Two markdown_extensions blocks meant YAML kept only the second, silently dropping attr_list and pymdownx.emoji. Merge into one block so all extensions load. mkdocs build --strict passes.
Cover resolve_tool resolve and raise, run_tool kwarg passthrough and the Windows utf-8 decode branch, robust_rmtree, and configure_console_encoding. Closes the gap where mocking run_tool wholesale skipped resolve_tool.
Deleting the cluster tore down the in-cluster CSI driver before it reclaimed the PVC backing PersistentDisk, orphaning a billing disk. Capture the disk before teardown and remove it after the cluster is deleted, touching only the disk our own PVC created. Validated live on GKE: reproduced the orphan and the fix removed it, zero residual.
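The ordering that avoids the orphan can be expressed as a small sketch. The three callables are hypothetical stand-ins for the kubectl and gcloud calls the real teardown makes; only the sequence is the point.

```python
def destroy_cluster_and_disk(find_pvc_disk, delete_cluster, delete_disk):
    """Capture the PVC-backed disk first, delete the cluster, then the disk.

    find_pvc_disk must run while the cluster API is still reachable; it returns
    the name of the disk our own PVC created, or None when there is none.
    """
    disk = find_pvc_disk()
    delete_cluster()
    if disk is not None:
        # Only the captured disk is touched, never other disks in the project.
        delete_disk(disk)
    return disk
```

Looking the disk up after the cluster is gone would be too late, since the PVC that names it no longer exists.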
…roken manifest gke-init ignored the push_image_to_gcr return value, so a failed docker push (Docker not running, auth, or any error) silently produced a manifest referencing a gcr.io image that does not exist; the user only discovered it later as an ImagePullBackOff at deploy time. Now check the result and emit a clear, actionable warning naming the image and the consequence and how to fix it. Applies to both the fastapi and mlflow gke-init generators. 63 unit tests pass.
Windows Compatibility - GCP + GKE (Kubernetes, Minikube)
…and dead notebook.py removal (PR #59), keep the Windows compat and run_tool work # Conflicts: # src/deployml/cli/cli.py # src/deployml/enum/cloud_provider.py
Summary
Brings deployml to a production-ready state on GCP across all three deployment targets, Cloud Run, GKE, and minikube, and makes the exact same CLI run on native Windows in addition to macOS and
Linux. The command surface never changes; the engine detects the operating system and adapts underneath. This branch hardens the deploy and destroy lifecycle, adds full Windows support, and brings
docs, examples, and tests up to date.
Closes #53
Closes #54
Closes #56
GCP Cloud Run security and reliability
Remove public 0.0.0.0/0 access from Cloud SQL, and mark credential outputs sensitive
CLI deploy and destroy lifecycle
BigQuery and teardown
Set delete_contents_on_destroy
Kubernetes, minikube, and GKE
Add a PersistentVolumeClaim for MLflow with fsGroup and a Recreate strategy, so experiment data survives pod restarts on both minikube and GKE
gke-destroy is now fully self-cleaning. Deleting the cluster used to tear down the in-cluster CSI driver before it reclaimed the PVC backing disk, leaving a billing PersistentDisk behind. Teardown now captures that disk before deletion and removes it after the cluster is gone, plus the PVC and gcr.io image, so teardown ends at zero residual
Windows compatibility (Closes #56)
The same CLI now runs on native Windows. All operating-system awareness is centralized in one new module,
src/deployml/utils/platform_compat.py, with IS_WINDOWS, resolve_tool, run_tool, configure_console_encoding, robust_rmtree, find_windows_bash, and terraform_env. Command code calls these helpers instead of branching on the OS. The four blockers from #56, plus two more found in the audit:
gcloud, bq, and gsutil ship as .cmd wrappers on Windows that subprocess could not launch by bare name. Every external tool call now routes through run_tool, which resolves the real executable and runs it.
Console output and captured subprocess output both use UTF-8 with replacement, so neither the write side nor the read side raises.
An explicit bash interpreter was added for the Cloud SQL readiness script, and on Windows Terraform resolves to Git bash rather than the WSL launcher that mangles quoting.
robust_rmtree clears the read-only bit and retries, so destroy cleanup survives read-only files and OneDrive locks.
A clear message is printed when gke-gcloud-auth-plugin is missing from PATH, so kubectl-to-GKE auth fails with guidance instead of a cryptic error.
Cross-platform hardening and tests
doctor skips the docker permission check when docker is absent, instead of emitting a misleading permission failure on top of the docker-not-found result
Merge the duplicate markdown_extensions key in mkdocs.yml, so attr_list and pymdownx.emoji load again
Add unit tests for platform_compat covering tool resolution, the run_tool kwarg passthrough and Windows decode branch, robust_rmtree, and console encoding
Docs, examples, tests
Validation
Run live end to end on both operating systems against a GCP test project, with zero residual after every teardown.
macOS
doctor runs clean, mkdocs build --strict passes
clusters and zero disks
native Windows
doctor clean, mkdocs build --strict passes
#53 and #54 were verified both with unit tests and live: a malformed stack exits cleanly, and removing ADC makes deploy fail at the preflight before any Terraform. #56 was verified live on native Windows across all three deployment targets.