🔧 Self-Healing DevOps Sandbox

title

Self-Healing DevOps Sandbox

emoji

🔧

colorFrom

red

colorTo

green

sdk

docker

pinned

false

app_port

8000

base_path

/web

🔧 Self-Healing DevOps Sandbox

An OpenEnv RL environment where an AI agent is dropped into a broken Node.js Express backend and must use bash commands only to diagnose and fix production-like bugs — just like a real DevOps engineer responding to a 3 AM incident.

Built for the Meta PyTorch OpenEnv Hackathon.

🎯 Why This Environment?

DevOps debugging is one of the most high-value, real-world tasks for AI agents. Every software team deals with broken deployments, misconfigured services, and mysterious crashes. This environment tests whether an AI agent can:

Read and understand error logs, config files, and source code
Diagnose root causes from symptoms (crash logs → specific file + line)
Apply targeted fixes using command-line tools (sed, echo, etc.)
Verify its own work by restarting services and checking endpoints

🏗️ Task Design

Three Difficulty Levels

#	Task	Bugs	What's Broken	Grading Target
1	`easy`	1	`config.json` → port `9999` instead of `3000`	Fix port, app starts
2	`medium`	2	+ `routes/users.js` → missing `)` causes SyntaxError	+ `/api/users` works
3	`hard`	3	+ `routes/data.js` → missing `await` breaks async response	All endpoints pass

Each task builds on the previous — meaningful difficulty progression where easy tasks are subsets of harder ones.

The Broken App (`/app`)

/app/
├── config.json          ← Bug 1: port set to 9999 (should be 3000)
├── package.json         ← Express.js project config
├── server.js            ← Main entry point (loads config + routes)
└── routes/
    ├── users.js         ← Bug 2: missing closing parenthesis on router.get()
    └── data.js          ← Bug 3: missing `await` before async DB call

🤖 Evaluation Alignment (OpenEnv Rubric Guide)

Note for Evaluators: This environment was rigorously engineered to meet the highest standards of the OpenEnv specification.

Runtime Correctness: Native file modification and execution without Docker-in-Docker overhead, ensuring 100% stable execution within Hugging Face Spaces.
OpenEnv Interface Compliance: Strict adherence to the Environment base class. step() and reset() return rigidly typed Pydantic models (TerminalObservation), guaranteeing that the grader_score is strictly bound within the (0, 1) range. All early returns and 0.0 fallbacks have been architecturally eliminated.
Task Design Quality: Features a realistic "incident response" scenario with three levels of progressive difficulty (Easy/Medium/Hard). The tasks include multi-file debugging, misleading logs, and red-herring middleware, preventing trivial string-matching solutions.
Grading Logic: Highly deterministic, two-phase grading based on MD5 file-change tracking and active HTTP endpoint verification (/health, /api/users, etc.). Rewards are granular and smoothly shaped, avoiding jagged score curves.
Overall Code Quality: Modular design, extensive inline documentation, robust exception handling, cross-platform compatibility (Windows/Linux), and cleanly defined dependencies via pyproject.toml.

📊 Reward Shaping

The grader runs after every command and awards granular partial credit:

Phase 1: File-Level Verification

Event	Points
Modified `config.json`	+0.05
Modified `routes/users.js`	+0.05
Modified `routes/data.js`	+0.05

Phase 2: HTTP Endpoint Testing

Milestone	Points
App starts on port 3000	+0.30
`GET /health` returns 200	+0.10
`GET /api/users` returns valid JSON	+0.15
`GET /api/data` returns valid JSON	+0.20
All endpoints passing (bonus)	+0.05

Phase 3: Difficulty Scaling

Raw scores are scaled by task difficulty so each task can reach near-maximum independently.

All scores are strictly within (0, 1) per the OpenEnv specification — never exactly 0.0 or 1.0.

🚀 Getting Started

Docker (Recommended)

docker build -t devops-sandbox:latest .
docker run --rm -p 8000:8000 devops-sandbox:latest
curl http://localhost:8000/health

Health response: {"status":"healthy","service":"devops_sandbox"}

Without Docker

uv sync
uvicorn server.app:app --host 0.0.0.0 --port 8000

Quick Start (Demo)

Update the API key in scenario_config.json and run:

python inference.py

🧪 Test Your Own Agent

Option A: Python Client

from client import DevopsSandboxEnv
from models import BashAction

with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
    # Reset with task difficulty
    result = env.reset(task_name="easy")
    print(result.observation.stdout)        # Task description
    print(result.observation.grader_score)   # 0.01

    # Send bash commands
    result = env.step(BashAction(command="cat /app/config.json"))
    print(result.observation.stdout)         # File contents
    print(result.observation.metadata)       # Rich metadata

    # Fix a bug
    result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
    print(result.observation.grader_score)   # Score increases
    print(result.observation.grader_feedback) # "✓ Modified config.json (+0.05)"

Option B: REST API

# Reset the environment
curl -X POST http://localhost:8000/reset -d '{"task_name": "hard"}'

# Send a command
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"command": "ls -la /app"}}'

Option C: WebSocket

Connect to ws://localhost:8000/ws for persistent sessions.

📁 Project Structure

devops_sandbox/
├── openenv.yaml               # OpenEnv manifest (spec_version: 1)
├── pyproject.toml              # Python dependencies
├── Dockerfile                  # HF Spaces deployment
├── scenario_config.json        # Task definitions + verifiers
├── models.py                   # BashAction & TerminalObservation (Pydantic)
├── client.py                   # Python client for the environment
├── inference.py                # LLM baseline agent (3-task evaluation)
│
├── server/
│   ├── app.py                  # FastAPI server (OpenEnv entry point)
│   └── devops_sandbox_environment.py  # Core environment + grader
│
└── simulated_app/              # The broken Node.js app
    ├── package.json
    ├── server.js
    ├── config.json             # Bug 1: wrong port
    └── routes/
        ├── users.js            # Bug 2: syntax error
        └── data.js             # Bug 3: missing await

⚙️ Architecture

┌──────────┐   BashAction    ┌────────────┐   subprocess   ┌──────────────┐
│  Agent   │ ──────────────> │  OpenEnv   │ ────────────> │  /app/       │
│ (LLM/RL) │                 │  Server    │               │ (broken app) │
│          │ <────────────── │  (:8000)   │ <──────────── │              │
└──────────┘  Observation    └─────┬──────┘  stdout/stderr └──────────────┘
              + grader_score       │
              + metadata     ┌─────┴──────┐
                             │  Grader    │
                             │ ┌────────┐ │
                             │ │File Δ  │ │  ← Detects which files were modified
                             │ │Checker │ │
                             │ ├────────┤ │
                             │ │HTTP    │ │  ← Starts app, curls all endpoints
                             │ │Tester  │ │
                             │ └────────┘ │
                             └────────────┘

Agent sends a BashAction (e.g., cat /app/config.json)
Server executes it via subprocess.run() in the /app directory
Grader runs two-phase verification:
- File tracking: MD5 hash comparison to detect which bug files changed
- HTTP testing: Starts the Node app, curls /health, /api/users, /api/data
Observation returns: stdout, stderr, score (0.01–0.99), feedback, and metadata

📋 Observation Metadata

Each observation includes rich metadata for training analysis:

{
  "episode_id": "abc-123",
  "step": 3,
  "task": "hard",
  "max_steps": 50,
  "bugs_total": 3,
  "files_modified": ["config.json", "routes/users.js"],
  "commands_count": 3
}

🔧 Configuration

Env Variable	Default	Description
`HF_TOKEN`	(required)	Hugging Face token for LLM API
`MODEL_NAME`	`gpt-4o-mini`	LLM model to use
`API_BASE_URL`	`https://router.huggingface.co/v1`	LLM endpoint
`MAX_TURNS`	`8`	Max steps per task in inference

✅ Validation

uv run openenv validate
# Expected: [OK] devops_sandbox: Ready for deployment

📄 License

BSD-style license. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.app_sandbox		.app_sandbox
.tmp		.tmp
server		server
simulated_app		simulated_app
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
__init__.py		__init__.py
client.py		client.py
inference.py		inference.py
models.py		models.py
move.py		move.py
openenv.yaml		openenv.yaml
pyproject.toml		pyproject.toml
scenario_config.json		scenario_config.json
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔧 Self-Healing DevOps Sandbox

🎯 Why This Environment?

🏗️ Task Design

Three Difficulty Levels

The Broken App (`/app`)

🤖 Evaluation Alignment (OpenEnv Rubric Guide)

📊 Reward Shaping

Phase 1: File-Level Verification

Phase 2: HTTP Endpoint Testing

Phase 3: Difficulty Scaling

🚀 Getting Started

Docker (Recommended)

Without Docker

Quick Start (Demo)

🧪 Test Your Own Agent

Option A: Python Client

Option B: REST API

Option C: WebSocket

📁 Project Structure

⚙️ Architecture

📋 Observation Metadata

🔧 Configuration

✅ Validation

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔧 Self-Healing DevOps Sandbox

🎯 Why This Environment?

🏗️ Task Design

Three Difficulty Levels

The Broken App (/app)

🤖 Evaluation Alignment (OpenEnv Rubric Guide)

📊 Reward Shaping

Phase 1: File-Level Verification

Phase 2: HTTP Endpoint Testing

Phase 3: Difficulty Scaling

🚀 Getting Started

Docker (Recommended)

Without Docker

Quick Start (Demo)

🧪 Test Your Own Agent

Option A: Python Client

Option B: REST API

Option C: WebSocket

📁 Project Structure

⚙️ Architecture

📋 Observation Metadata

🔧 Configuration

✅ Validation

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The Broken App (`/app`)

Packages