{"data":{"kind":"file","path":"README.md","version_id":"yujsd2g6tldtszuu6h1dd95j","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7728,"modified_at":"2026-05-18T13:56:28.369000","content_hash":"22377b335555a8b6fbc2914b27eddd891cdc2f4458fe0328c65a929f04c8b3ed"},"entries":[],"content":"# devops-gym\n\nVerifiers environment that turns [DevOps-Gym](https://github.com/ucsb-mlsec/DevOps-Gym) tasks into **single-turn** train/eval examples: each row is the `instruction` from `task.yaml` (plus metadata), with optional reference text extracted from `solution.sh` for scoring.\n\nThis is **not** a drop-in replacement for the official Docker + [Terminal-Bench](https://github.com/laude-institute/terminal-bench) harness. Use those for leaderboard-style numbers. Use this environment when you want **Prime `eval` / hosted RL** on open models with a clear prompt–reward loop.\n\n### Overview\n\n- **Environment ID**: `devops-gym`\n- **Type**: Single-turn, XML-structured assistant output (`<reasoning>`, `<answer>`)\n- **Rubric (default)**: `oracle_overlap` — no extra API; scores how many of the reference's `+`/`-` patch lines appear in the model reply. Best when `solution.sh` embeds a diff.\n- **Optional**: `reward_mode=\"judge\"` — LLM judge via `JudgeRubric`; requires `OPENAI_API_KEY`.\n\n### Data sources (priority)\n\n1. **`hf_dataset_id`** — Hugging Face dataset id **or** absolute path to a **`.parquet`** file (from the export script). Use this for **hosted** runs so workers do not need the full GitHub tree.\n2. **Local clone** — `dataset_root` or `DEVOPS_GYM_ROOT` pointing at a clone of `ucsb-mlsec/DevOps-Gym`.\n3. 
**Embedded smoke** — if neither is available and `fallback_embedded=true`.\n\n### Export from the real DevOps-Gym repo (for HF / hosted)\n\nFrom your lab workspace (after cloning DevOps-Gym locally):\n\n```bash\npython scripts/export_devops_gym_dataset.py \\\n  --repo-root /ABSOLUTE/PATH/TO/DevOps-Gym \\\n  --category build \\\n  --max-tasks 500 \\\n  --output ./devops-gym-build.parquet\n```\n\nUse the **repository root** (the folder that contains `tasks/`), not a subfolder. If you see `No records exported`, the script prints diagnostics; you can also run `--list-categories` to see valid `--category` values:\n\n```bash\npython scripts/export_devops_gym_dataset.py \\\n  --repo-root /ABSOLUTE/PATH/TO/DevOps-Gym \\\n  --list-categories\n```\n\nUpload the Parquet to [Hugging Face Datasets](https://huggingface.co/new-dataset) (any org you control). Then pass `hf_dataset_id` in env args.\n\n**Parquet / HF row schema**\n\n- **This repo’s export** (`scripts/export_devops_gym_dataset.py`): `task`, `category`, `user_content`, `answer`, plus optional metadata columns (`project_name`, `language`, etc.).\n- **Upstream DevOps-Gym table dumps** (e.g. `devops-gym-build.parquet` from their pipeline): `task_id` + `instruction` are accepted as aliases for `task` + `user_content`. An `answer` column is optional; without it, `oracle_overlap` has no reference diff (rewards stay low unless you add a judge).\n\nIf your file mixes several `category` values (not only `build`), either pass `\"category\": \"<one-of-them>\"` to match a slice, or set `\"hf_filter_category\": false` to keep every row.\n\n### Environments Hub (push)\n\n1. Smoke locally: `prime env install devops-gym` then `prime eval run devops-gym ...`.\n2. Push:\n\n   ```bash\n   prime env push devops-gym --path ./environments/devops_gym --visibility PRIVATE\n   ```\n\n   Use `PUBLIC` if you intend to share broadly.\n\n3. 
In hosted training TOML, reference the Hub slug like `yourusername/devops-gym` (see [configs/rl/devops-gym-hosted.example.toml](../../configs/rl/devops-gym-hosted.example.toml)).\n\n### Setup (local install)\n\n```bash\ngit clone https://github.com/ucsb-mlsec/DevOps-Gym.git   # optional if using HF export only\nexport DEVOPS_GYM_ROOT=\"$PWD/DevOps-Gym\"                  # optional\nprime env install devops-gym\n```\n\n**Install error: `project must not contain {'tags'}`** — Your `pyproject.toml` is out of date. The `[project]` table must be valid PEP 621 (setuptools enforces this). This package does **not** use a `tags` key under `[project]`; pull the latest `environments/devops_gym/pyproject.toml` from the repo or remove any `tags = [...]` line from `[project]`.\n\n### Policy optimization (Prime-RL)\n\nRL trains the **model weights** using rewards from this environment (same pattern as other Verifiers envs on [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl)). You need **Linux + NVIDIA GPUs** for a typical `uv sync` (vLLM). On a single GPU you can split VRAM between trainer and inference (see [deployment](https://docs.primeintellect.ai/prime-rl/deployment)).\n\n1. Clone and install **prime-rl** (see its README for `deps/verifiers`, `deps/renderers`, `deps/research-environments`; private submodules can be skipped if you do not have access).\n2. From the prime-rl repo: `uv pip install -e /path/to/this/workspace/environments/devops_gym`\n3. Run RL:\n\n   ```bash\n   cd /path/to/prime-rl\n   uv run rl @\"/path/to/Prime Intellect/configs/rl/devops-gym.toml\" --output-dir ./outputs/devops-gym-rl\n   ```\n\n   Or: `./scripts/run_devops_gym_rl.sh` from this workspace (`PRIME_RL_ROOT` optional).\n\n4. 
Set **`hf_dataset_id`** in `[orchestrator.train.env].args` where possible; otherwise rely on `DEVOPS_GYM_ROOT` or the embedded smoke tasks, as in [configs/rl/devops-gym.toml](../../configs/rl/devops-gym.toml).\n\n### Quickstart\n\nEvaluate from a Hugging Face dataset, or from a **real** `.parquet` file you created with `scripts/export_devops_gym_dataset.py` (do not use the string `/path/to/export.parquet` — that was only a documentation placeholder).\n\n```bash\nprime eval run devops-gym -m gpt-4.1-mini -n 5 -r 1 \\\n  -a '{\"hf_dataset_id\": \"your-org/your-dataset\", \"hf_split\": \"train\", \"reward_mode\": \"oracle_overlap\"}'\n```\n\n**Local Parquet (absolute path on your machine)** — example path where your export lives:\n\n```bash\nprime eval run devops-gym -m gpt-4.1-mini -n 5 -r 1 \\\n  -a '{\"hf_dataset_id\": \"/Users/vinayakrastogi/Desktop/SkySurf/Repo/DevOps-Gym-main/devops-gym-build.parquet\", \"hf_split\": \"train\"}'\n```\n\nOr run `./scripts/eval_devops_gym_parquet.sh` (defaults to that path; override with `DEVOPS_GYM_PARQUET` and `MODEL`).\n\nLocal clone + oracle overlap:\n\n```bash\nprime eval run devops-gym -m gpt-4.1-mini -n 10 -r 1 \\\n  -a '{\"category\": \"build\", \"max_tasks\": 32, \"reward_mode\": \"oracle_overlap\"}'\n```\n\nLLM judge (requires `OPENAI_API_KEY`):\n\n```bash\nprime eval run devops-gym -m gpt-4.1-mini -n 10 -r 1 \\\n  -a '{\"category\": \"build\", \"max_tasks\": 32, \"reward_mode\": \"judge\"}'\n```\n\nEmbedded smoke only (no clone, no HF):\n\n```bash\nprime eval run devops-gym -m gpt-4.1-mini -n 2 -r 1 \\\n  -a '{\"fallback_embedded\": true}'\n```\n\n### Environment arguments (`-a` JSON)\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `hf_dataset_id` | str | `null` | HF dataset id or absolute path to `.parquet` (export schema or upstream `task_id` + `instruction`) |\n| `hf_split` | str | `\"train\"` | Split passed to `datasets.load_dataset` |\n| `hf_revision` | str | `null` | Optional dataset revision |\n| `hf_filter_category` | bool | 
`true` | If rows include `category`, keep only rows matching `category` |\n| `dataset_root` | str | `DEVOPS_GYM_ROOT` | Path to a local DevOps-Gym clone (defaults to the `DEVOPS_GYM_ROOT` environment variable) |\n| `category` | str | `\"build\"` | Task subdirectory for the local scan; row filter for HF data when `hf_filter_category` is enabled |\n| `max_tasks` | int | `64` | Cap on tasks loaded (`-1` = all) |\n| `seed` | int | `42` | Seed for shuffling and the train/eval split |\n| `eval_holdout` | float | `0.15` | Fraction held out for `eval_dataset` |\n| `reward_mode` | str | `\"oracle_overlap\"` | `oracle_overlap` or `judge` |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | OpenAI-compatible judge id |\n| `fallback_embedded` | bool | `true` | Use built-in tasks if no HF id and no local root |\n| `task_names` | list | `null` | Optional allow-list of task ids |\n\n### Metrics\n\n- Main reward: oracle overlap in `[0,1]`, or judge yes/no → `1.0` / `0.0`.\n- Standard verifiers rollout metrics apply.\n\n### Citation\n\nDevOps-Gym paper: [arXiv:2601.20882](https://arxiv.org/abs/2601.20882).\n","encoding":"utf-8","truncated":false,"total_bytes":7728},"status":null}