| title | Self-Healing DevOps Sandbox | |
|---|---|---|
| emoji | 🔧 | |
| colorFrom | red | |
| colorTo | green | |
| sdk | docker | |
| pinned | false | |
| app_port | 8000 | |
| base_path | /web | |
| tags |
|
An OpenEnv RL environment where an AI agent is dropped into a broken Node.js Express backend and must use bash commands only to diagnose and fix production-like bugs — just like a real DevOps engineer responding to a 3 AM incident.
Built for the Meta PyTorch OpenEnv Hackathon.
DevOps debugging is one of the most high-value, real-world tasks for AI agents. Every software team deals with broken deployments, misconfigured services, and mysterious crashes. This environment tests whether an AI agent can:
- Read and understand error logs, config files, and source code
- Diagnose root causes from symptoms (crash logs → specific file + line)
- Apply targeted fixes using command-line tools (sed, echo, etc.)
- Verify its own work by restarting services and checking endpoints
| # | Task | Bugs | What's Broken | Grading Target |
|---|---|---|---|---|
| 1 | easy |
1 | config.json → port 9999 instead of 3000 |
Fix port, app starts |
| 2 | medium |
2 | + routes/users.js → missing ) causes SyntaxError |
+ /api/users works |
| 3 | hard |
3 | + routes/data.js → missing await breaks async response |
All endpoints pass |
Each task builds on the previous — meaningful difficulty progression where easy tasks are subsets of harder ones.
/app/
├── config.json ← Bug 1: port set to 9999 (should be 3000)
├── package.json ← Express.js project config
├── server.js ← Main entry point (loads config + routes)
└── routes/
├── users.js ← Bug 2: missing closing parenthesis on router.get()
└── data.js ← Bug 3: missing `await` before async DB call
Note for Evaluators: This environment was rigorously engineered to meet the highest standards of the OpenEnv specification.
- Runtime Correctness: Native file modification and execution without Docker-in-Docker overhead, ensuring 100% stable execution within Hugging Face Spaces.
- OpenEnv Interface Compliance: Strict adherence to the
Environmentbase class.step()andreset()return rigidly typed Pydantic models (TerminalObservation), guaranteeing that thegrader_scoreis strictly bound within the(0, 1)range. All early returns and0.0fallbacks have been architecturally eliminated. - Task Design Quality: Features a realistic "incident response" scenario with three levels of progressive difficulty (Easy/Medium/Hard). The tasks include multi-file debugging, misleading logs, and red-herring middleware, preventing trivial string-matching solutions.
- Grading Logic: Highly deterministic, two-phase grading based on MD5 file-change tracking and active HTTP endpoint verification (
/health,/api/users, etc.). Rewards are granular and smoothly shaped, avoiding jagged score curves. - Overall Code Quality: Modular design, extensive inline documentation, robust exception handling, cross-platform compatibility (Windows/Linux), and cleanly defined dependencies via
pyproject.toml.
The grader runs after every command and awards granular partial credit:
| Event | Points |
|---|---|
Modified config.json |
+0.05 |
Modified routes/users.js |
+0.05 |
Modified routes/data.js |
+0.05 |
| Milestone | Points |
|---|---|
| App starts on port 3000 | +0.30 |
GET /health returns 200 |
+0.10 |
GET /api/users returns valid JSON |
+0.15 |
GET /api/data returns valid JSON |
+0.20 |
| All endpoints passing (bonus) | +0.05 |
Raw scores are scaled by task difficulty so each task can reach near-maximum independently.
All scores are strictly within (0, 1) per the OpenEnv specification — never exactly 0.0 or 1.0.
docker build -t devops-sandbox:latest .
docker run --rm -p 8000:8000 devops-sandbox:latest
curl http://localhost:8000/healthHealth response: {"status":"healthy","service":"devops_sandbox"}
uv sync
uvicorn server.app:app --host 0.0.0.0 --port 8000Update the API key in scenario_config.json and run:
python inference.pyfrom client import DevopsSandboxEnv
from models import BashAction
with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
# Reset with task difficulty
result = env.reset(task_name="easy")
print(result.observation.stdout) # Task description
print(result.observation.grader_score) # 0.01
# Send bash commands
result = env.step(BashAction(command="cat /app/config.json"))
print(result.observation.stdout) # File contents
print(result.observation.metadata) # Rich metadata
# Fix a bug
result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
print(result.observation.grader_score) # Score increases
print(result.observation.grader_feedback) # "✓ Modified config.json (+0.05)"# Reset the environment
curl -X POST http://localhost:8000/reset -d '{"task_name": "hard"}'
# Send a command
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action": {"command": "ls -la /app"}}'Connect to ws://localhost:8000/ws for persistent sessions.
devops_sandbox/
├── openenv.yaml # OpenEnv manifest (spec_version: 1)
├── pyproject.toml # Python dependencies
├── Dockerfile # HF Spaces deployment
├── scenario_config.json # Task definitions + verifiers
├── models.py # BashAction & TerminalObservation (Pydantic)
├── client.py # Python client for the environment
├── inference.py # LLM baseline agent (3-task evaluation)
│
├── server/
│ ├── app.py # FastAPI server (OpenEnv entry point)
│ └── devops_sandbox_environment.py # Core environment + grader
│
└── simulated_app/ # The broken Node.js app
├── package.json
├── server.js
├── config.json # Bug 1: wrong port
└── routes/
├── users.js # Bug 2: syntax error
└── data.js # Bug 3: missing await
┌──────────┐ BashAction ┌────────────┐ subprocess ┌──────────────┐
│ Agent │ ──────────────> │ OpenEnv │ ────────────> │ /app/ │
│ (LLM/RL) │ │ Server │ │ (broken app) │
│ │ <────────────── │ (:8000) │ <──────────── │ │
└──────────┘ Observation └─────┬──────┘ stdout/stderr └──────────────┘
+ grader_score │
+ metadata ┌─────┴──────┐
│ Grader │
│ ┌────────┐ │
│ │File Δ │ │ ← Detects which files were modified
│ │Checker │ │
│ ├────────┤ │
│ │HTTP │ │ ← Starts app, curls all endpoints
│ │Tester │ │
│ └────────┘ │
└────────────┘
- Agent sends a
BashAction(e.g.,cat /app/config.json) - Server executes it via
subprocess.run()in the/appdirectory - Grader runs two-phase verification:
- File tracking: MD5 hash comparison to detect which bug files changed
- HTTP testing: Starts the Node app, curls
/health,/api/users,/api/data
- Observation returns: stdout, stderr, score (0.01–0.99), feedback, and metadata
Each observation includes rich metadata for training analysis:
{
"episode_id": "abc-123",
"step": 3,
"task": "hard",
"max_steps": 50,
"bugs_total": 3,
"files_modified": ["config.json", "routes/users.js"],
"commands_count": 3
}| Env Variable | Default | Description |
|---|---|---|
HF_TOKEN |
(required) | Hugging Face token for LLM API |
MODEL_NAME |
gpt-4o-mini |
LLM model to use |
API_BASE_URL |
https://router.huggingface.co/v1 |
LLM endpoint |
MAX_TURNS |
8 |
Max steps per task in inference |
uv run openenv validate
# Expected: [OK] devops_sandbox: Ready for deploymentBSD-style license. See LICENSE for details.