Serverless GPU monocular depth estimation on Runpod. Backed by the Depth-Anything-V2 family (and DPT as an alternative). Given a single image you get back any of:

- `depth`: the raw depth map (compressed 16-bit PNG, with optional float32 `.npz`)
- `colorized`: the depth map rendered with a matplotlib-style colormap
- `normals`: surface normals derived from the depth gradient
- `disparity`: reciprocal depth ($1/(d+\varepsilon)$), normalized
- `point_cloud`: a top-down (x, depth) preview rendered to a PNG (no raw XYZ)

Features:

- Six supported models, from the very fast `Depth-Anything-V2-Small` to the high-accuracy `Large`, plus two metric variants (indoor + outdoor) and a DPT fallback.
- 1, 2, or N images per request via `image_url`, `image_urls`, `image_b64`, or a heterogeneous `images: [{type, data}, ...]` list.
- Per-image errors so one bad URL never blocks a batch.
- Pipeline LRU cache keyed by `(model, device, dtype)`, so warm GPU containers serve subsequent requests instantly.
- JSON-safe outputs (no raw numpy in the response unless you set `include_raw: true`).
- Six colormaps: `viridis`, `magma`, `inferno`, `plasma`, `turbo`, `gray`.
- Output formats: `png`, `webp`, `jpg`.
- Automatic FP16 on GPU, FP32 on CPU. Override with `dtype: "float16"|"float32"`.
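
The pipeline cache in the feature list can be pictured as a small LRU map keyed by `(model, device, dtype)`. The following is a minimal sketch of that idea, not the worker's actual code: `PipelineCache`, `load_fn`, `fake_loader`, and the capacity of 2 are hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple

Key = Tuple[str, str, str]  # (model, device, dtype)


class PipelineCache:
    """Tiny LRU cache; a hypothetical stand-in for the worker's pipeline cache."""

    def __init__(self, load_fn: Callable[[str, str, str], Any], capacity: int = 2):
        self._load_fn = load_fn      # assumed loader, e.g. one that builds an HF pipeline
        self._capacity = capacity
        self._entries: "OrderedDict[Key, Any]" = OrderedDict()

    def get(self, model: str, device: str, dtype: str) -> Any:
        key = (model, device, dtype)
        if key in self._entries:
            self._entries.move_to_end(key)          # mark as most recently used
            return self._entries[key]
        pipe = self._load_fn(model, device, dtype)  # cold start: pay the load cost
        self._entries[key] = pipe
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)       # evict the least recently used entry
        return pipe


loads = []  # records every cold load so the cache behavior is visible


def fake_loader(model: str, device: str, dtype: str) -> str:
    loads.append(model)
    return f"pipeline:{model}"


cache = PipelineCache(fake_loader, capacity=2)
cache.get("small", "cuda", "float16")  # cold: calls the loader
cache.get("small", "cuda", "float16")  # warm: served from the cache
```

Because the key includes `dtype`, requesting the same model in FP16 and FP32 would occupy two cache slots in this sketch.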

| `model` | Type | Size (params) | Best for |
|---|---|---|---|
| `depth-anything/Depth-Anything-V2-Small-hf` | Relative | ~25M | Fast preview, default |
| `depth-anything/Depth-Anything-V2-Base-hf` | Relative | ~97M | Balanced quality |
| `depth-anything/Depth-Anything-V2-Large-hf` | Relative | ~335M | Best generic depth |
| `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf` | Metric | ~335M | Real-world meters, indoor scenes |
| `depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf` | Metric | ~335M | Real-world meters, outdoor scenes |
| `Intel/dpt-large` | Relative | ~344M | DPT alternative, well-tested |

Metric models return depth in meters (or close to it) and set `metric: true` on
each result. Relative models return arbitrary-scale, inverse-depth-like
floats; `depth_min` and `depth_max` are reported for every result so you can
rescale on the client.
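
As an illustration of client-side rescaling, the sketch below maps 16-bit PNG codes back into the reported depth range. It assumes the uint16 PNG stores depth linearly, with code 0 at `depth_min` and code 65535 at `depth_max`; this README does not state the encoding, so treat that as an assumption to verify. The helper name `rescale_codes` is hypothetical.

```python
from typing import Iterable, List

UINT16_MAX = 65535


def rescale_codes(codes: Iterable[int], depth_min: float, depth_max: float) -> List[float]:
    """Map uint16 codes back to the [depth_min, depth_max] range.

    Assumes a linear encoding where 0 -> depth_min and 65535 -> depth_max.
    """
    span = depth_max - depth_min
    return [depth_min + (code / UINT16_MAX) * span for code in codes]


# depth_min and depth_max as they appear in this README's example response.
restored = rescale_codes([0, 65535], depth_min=0.124, depth_max=28.5)
```

With NumPy the same mapping is a single vectorized expression over the decoded image array.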

| Task | Output key | Notes |
|---|---|---|
| `depth` | `depth_png_b64` (uint16 PNG) + optional `depth_npz_b64` | The raw map. Set `include_raw: true` for float32 `.npz`. |
| `colorized` | `colorized_b64` + `colormap` | Normalized depth -> RGB via the selected colormap. |
| `normals` | `normals_b64` | Surface normals from depth gradient, encoded as RGB. |
| `disparity` | `disparity_b64` | Reciprocal depth, $1/(d+\varepsilon)$, normalized. |
| `point_cloud` | `point_cloud_b64` + `point_cloud_grid` | Top-down (x, depth) projection rendered to a PNG preview. |

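
To make the `normals` task concrete, here is a minimal sketch of deriving normals from a depth gradient and encoding them as RGB. It uses central differences and the common `(n + 1) / 2` encoding; the worker's actual kernel, gradient scale, and sign convention are not documented here, so all of those are assumptions, and `normals_rgb` is a hypothetical name.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def normals_rgb(depth: Grid) -> List[List[Tuple[int, ...]]]:
    """Per-pixel normals from finite differences, encoded as 8-bit RGB."""
    h, w = len(depth), len(depth[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Central differences, with neighbor indices clamped at the border.
            dzdx = (depth[y][min(x + 1, w - 1)] - depth[y][max(x - 1, 0)]) / 2.0
            dzdy = (depth[min(y + 1, h - 1)][x] - depth[max(y - 1, 0)][x]) / 2.0
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            # Map each unit-vector component from [-1, 1] to [0, 255].
            row.append(tuple(int((c / norm + 1.0) * 127.5 + 0.5) for c in (nx, ny, nz)))
        out.append(row)
    return out


# A flat surface has zero gradient, so every normal points at the camera.
flat = [[2.0] * 3 for _ in range(3)]
flat_normals = normals_rgb(flat)  # every pixel is (128, 128, 255)
```

The bluish `(128, 128, 255)` tone for flat regions is the familiar look of normal maps under this encoding.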

| Field | Type | Default | Description |
|---|---|---|---|
| `image_url` | string | - | Single image URL. |
| `image_urls` | string[] | - | Multiple image URLs. |
| `image_b64` | string | - | Single raw base64 (or data URI) image. |
| `images` | object[] | - | List of `{"type": "url"\|"b64", "data": "..."}`. |
| `model` | string | `Depth-Anything-V2-Small-hf` (or `$DEPTH_MODEL`) | One of the supported models. |
| `tasks` | string[] | `["depth"]` | Any subset of `depth`, `colorized`, `normals`, `disparity`, `point_cloud`. |
| `colormap` | string | `viridis` | One of `viridis`, `magma`, `inferno`, `plasma`, `turbo`, `gray`. |
| `normalize` | bool | `true` | Normalize depth to [0, 1] for visualization. |
| `output_format` | string | `png` | `png`, `webp`, or `jpg`. |
| `quality` | int | `95` | JPEG/WebP quality. |
| `max_size` | int | `1024` | Longest edge for inference (downscaled before forward pass). |
| `invert` | bool | `false` | Invert depth (near<->far swap before colorization). |
| `include_raw` | bool | `false` | Include float32 depth as `.npz` b64. |
| `point_cloud_grid` | int | `512` | Grid size for `point_cloud` preview. |
| `dtype` | string | `float16` on GPU, `float32` on CPU | `float16`, `bfloat16`, or `float32`. |

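
The sketch below shows one way a client might validate these fields and prepare a call. The allowed values come from the table above; the helper names `build_input` and `build_runsync_request` are hypothetical, and the URL shape follows Runpod's serverless `runsync` endpoint as commonly documented, so check it against your account before relying on it. The request is only constructed, never sent.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

TASKS = {"depth", "colorized", "normals", "disparity", "point_cloud"}
COLORMAPS = {"viridis", "magma", "inferno", "plasma", "turbo", "gray"}
FORMATS = {"png", "webp", "jpg"}


def build_input(image_url: str, tasks: Optional[List[str]] = None,
                colormap: str = "viridis", output_format: str = "png",
                **extra: Any) -> Dict[str, Any]:
    """Validate against the documented values and return a request body."""
    tasks = tasks or ["depth"]
    unknown = set(tasks) - TASKS
    if unknown:
        raise ValueError(f"unknown tasks: {sorted(unknown)}")
    if colormap not in COLORMAPS:
        raise ValueError(f"unknown colormap: {colormap}")
    if output_format not in FORMATS:
        raise ValueError(f"unknown output_format: {output_format}")
    body = {"image_url": image_url, "tasks": tasks,
            "colormap": colormap, "output_format": output_format}
    body.update(extra)  # pass-through for fields such as max_size or invert
    return {"input": body}


def build_runsync_request(endpoint_id: str, api_key: str,
                          payload: Dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the synchronous endpoint."""
    return urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


payload = build_input("https://example.com/photo.jpg",
                      tasks=["depth", "colorized"], colormap="turbo")
request = build_runsync_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", payload)
# Send with urllib.request.urlopen(request) once real credentials are in place.
```

Validating on the client avoids spending a GPU cold start on a request that would be rejected anyway.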

```json
{
  "results": [
    {
      "index": 0,
      "input": {"kind": "url", "url": "..."},
      "model": "depth-anything/Depth-Anything-V2-Small-hf",
      "metric": false,
      "width": 1024,
      "height": 768,
      "inference_size": [1024, 768],
      "depth_min": 0.124,
      "depth_max": 28.5,
      "depth_mean": 4.31,
      "tasks": ["depth", "colorized"],
      "depth_png_b64": "<uint16 PNG, base64>",
      "depth_png_mime": "image/png",
      "colorized_b64": "<PNG/WebP/JPG, base64>",
      "colorized_mime": "image/png",
      "colormap": "viridis",
      "elapsed_sec": 0.412
    }
  ],
  "model": "depth-anything/Depth-Anything-V2-Small-hf",
  "tasks": ["depth", "colorized"],
  "colormap": "viridis",
  "output_format": "png",
  "normalize": true,
  "invert": false,
  "include_raw": false,
  "max_size": 1024,
  "point_cloud_grid": 512,
  "device": "cuda",
  "cuda_available": true,
  "count": 1,
  "elapsed_sec": 0.418
}
```

```json
{
  "input": {
    "image_url": "https://example.com/photo.jpg",
    "tasks": ["depth"]
  }
}
```

```json
{
  "input": {
    "image_url": "https://example.com/photo.jpg",
    "tasks": ["colorized"],
    "colormap": "viridis"
  }
}
```

```json
{
  "input": {
    "image_url": "https://example.com/photo.jpg",
    "tasks": ["depth", "colorized", "normals", "disparity"],
    "colormap": "turbo",
    "model": "depth-anything/Depth-Anything-V2-Large-hf"
  }
}
```

```json
{
  "input": {
    "image_url": "https://example.com/room.jpg",
    "model": "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
    "tasks": ["depth"],
    "include_raw": true,
    "normalize": false
  }
}
```

The returned `depth_npz_b64` decodes to an `np.savez_compressed` archive with a
single `depth` array (float32, in meters).
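
Decoding that field on the client could look like the sketch below. It assumes NumPy is installed and that the array is stored under the key `depth`, as described above; `decode_depth_npz` is a hypothetical helper, and the demo builds a synthetic archive in place of a real response.

```python
import base64
import io

import numpy as np


def decode_depth_npz(depth_npz_b64: str) -> np.ndarray:
    """Decode the base64 .npz payload and return its `depth` array."""
    raw = base64.b64decode(depth_npz_b64)
    with np.load(io.BytesIO(raw)) as archive:
        return archive["depth"]


# Round-trip demo with a synthetic array standing in for a real response field.
buf = io.BytesIO()
np.savez_compressed(buf, depth=np.full((2, 3), 1.5, dtype=np.float32))
demo_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

depth_m = decode_depth_npz(demo_b64)
```

For a real call, pass `result["depth_npz_b64"]` from an entry in `results` instead of `demo_b64`.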

```json
{
  "input": {
    "images": [
      {"type": "url", "data": "https://example.com/a.jpg"},
      {"type": "b64", "data": "iVBORw0KGgo..."}
    ],
    "tasks": ["colorized", "normals"],
    "colormap": "magma",
    "output_format": "webp"
  }
}
```

```json
{
  "input": {
    "image_url": "https://example.com/scene.jpg",
    "tasks": ["point_cloud"],
    "point_cloud_grid": 768
  }
}
```

The result's `point_cloud_b64` is a 768x768 PNG showing each pixel projected
onto an (x, depth) plane.
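
The projection can be pictured as binning every pixel by its image column and its depth. The sketch below is one plausible reading of that description, not the worker's rendering code: the binning scheme, the orientation (row 0 holds the nearest depths), and the name `top_down_grid` are all assumptions.

```python
from typing import List


def top_down_grid(depth: List[List[float]], grid: int) -> List[List[int]]:
    """Count pixels per (depth bin, x bin) cell of a grid x grid image.

    Row 0 holds the nearest depths; this orientation is an assumption.
    """
    w = len(depth[0])
    d_min = min(min(row) for row in depth)
    d_max = max(max(row) for row in depth)
    span = (d_max - d_min) or 1.0  # avoid division by zero on flat maps
    cells = [[0] * grid for _ in range(grid)]
    for row in depth:
        for x, d in enumerate(row):
            gx = min(x * grid // w, grid - 1)                   # image column -> x bin
            gz = min(int((d - d_min) / span * grid), grid - 1)  # depth -> depth bin
            cells[gz][gx] += 1
    return cells


sample_depth = [[1.0, 2.0], [1.0, 4.0]]
cells = top_down_grid(sample_depth, grid=4)
```

A renderer would then map the per-cell counts to pixel intensities to produce the PNG preview.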

```json
{
  "input": {
    "image_url": "https://example.com/photo.jpg",
    "tasks": ["colorized"],
    "colormap": "magma",
    "invert": true
  }
}
```

- The first call to a given `(model, device, dtype)` triplet pays a cold-start cost: HF weights are downloaded and a pipeline is constructed. Subsequent calls in the same container reuse the cached pipeline via `_PIPELINE_CACHE`.
- `max_size` controls inference resolution. The depth map is upsampled back to the original image size for all output renderings, so increasing `max_size` trades latency for sharpness.
- FP16 (`dtype: "float16"`) on GPU is ~2x faster than FP32 with negligible quality impact on this family.
- The point-cloud preview is intentionally rasterized (`HxWx3 uint8`) rather than returned as raw XYZ; full XYZ for a 1024x768 image would be ~9MB of JSON payload per output.
- Metric models are roughly the size of the Large relative model. Use the Small relative model when you only need ordinal depth.

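
The ~9MB figure in the point-cloud note can be checked with a little arithmetic. The sketch assumes one XYZ point per pixel stored as three float32 values, and that a JSON response would carry it as base64; both are assumptions made for the estimate, not documented behavior.

```python
import math

width, height = 1024, 768
points = width * height                   # one XYZ point per pixel
raw_bytes = points * 3 * 4                # three float32 coordinates per point
b64_bytes = 4 * math.ceil(raw_bytes / 3)  # base64 emits 4 characters per 3 bytes

print(f"{raw_bytes / 1e6:.1f} MB raw, {b64_bytes / 1e6:.1f} MB as base64")
# prints "9.4 MB raw, 12.6 MB as base64"
```

So the quoted figure matches the raw float32 size, and a base64-wrapped payload would be about a third larger.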

| Name | Style | Good for |
|---|---|---|
| `viridis` | dark blue -> green -> yellow | Perceptually uniform, default |
| `magma` | black -> purple -> orange -> yellow | High-contrast scenes |
| `inferno` | black -> red -> orange -> yellow | Heat-map look |
| `plasma` | dark blue -> magenta -> orange | Vibrant alternative to viridis |
| `turbo` | blue -> green -> yellow -> red | Maximum perceptual range |
| `gray` | linear grayscale | Raw-looking, no false color |


| Variable | Default | Purpose |
|---|---|---|
| `DEPTH_MODEL` | `depth-anything/Depth-Anything-V2-Small-hf` | Default model when `model` is not supplied. |
| `HF_HOME` | `/root/.cache/huggingface` | HuggingFace cache directory. |
| `PYTHONUNBUFFERED` | `1` | Stdout flushing. |


```bash
python3 test_handler.py
```

The test harness injects light-weight mocks for `torch`, `transformers`,
`cv2`, and `matplotlib.cm`, so no GPU and no model download is needed. The
mock pipeline returns a radial-gradient grayscale "depth" image that the
helper functions process end-to-end.

Expected output: `ALL TESTS PASSED`.
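
The mock-injection technique can be sketched with the standard library alone. This is an illustration of the general approach, not the contents of `test_handler.py`: the fake modules, the canned return value, and the model ID below are hypothetical.

```python
import sys
import types
from unittest import mock

# Hypothetical stand-in for torch: only what this demo touches.
fake_torch = types.ModuleType("torch")
fake_torch.cuda = mock.MagicMock()
fake_torch.cuda.is_available.return_value = False

# Hypothetical stand-in for transformers.pipeline; returns a canned result.
fake_transformers = types.ModuleType("transformers")
fake_transformers.pipeline = mock.MagicMock(
    return_value=lambda image: {"depth": "fake-depth-image"}
)

# While the patch is active, imports resolve to the fakes instead of real packages.
with mock.patch.dict(sys.modules, {"torch": fake_torch,
                                   "transformers": fake_transformers}):
    import torch
    from transformers import pipeline

    estimator = pipeline("depth-estimation", model="any-model-id")
    result = estimator("image.png")
    cuda_available = torch.cuda.is_available()
```

`mock.patch.dict` restores `sys.modules` on exit, so the fakes do not leak into code that runs after the `with` block.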

```bash
docker build -t runpod-depth .
```

The Docker image is built on `nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04`,
pre-installs PyTorch 2.3.1 + CUDA 12.1 wheels, then layers in `transformers`,
`opencv-python-headless`, `matplotlib`, `accelerate`, `runpod`, and the rest of
`requirements.txt`. The container runs `python3 handler.py`, which starts the
Runpod serverless worker.