{"data":{"kind":"file","path":"README.md","version_id":"ihx6n0yw5mrron9jtin8mtjm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7532,"modified_at":"2026-05-29T19:01:20.630000","content_hash":"24e37a1774cc9e6dffa54b8ffb5851bfab4d650496de2168a27387b1fa455097"},"entries":[],"content":"# baxbench\n\nA Prime Intellect verifiers environment that wraps the\n[BaxBench](https://github.com/logic-star-ai/baxbench) secure-code-generation\nbenchmark. Each task asks the model to implement a backend API in a specific\nlanguage + framework. The generated code is executed inside a Prime Intellect\nsandbox and scored against BaxBench's upstream functional + security tests.\n\n### Overview\n- **Environment ID**: `baxbench`\n- **Short description**: Generate secure backends conforming to an OpenAPI\n  spec; reward = 1 iff every functional test passes AND no CWE is detected.\n- **Tags**: `security`, `code-generation`, `single-turn`, `sandbox`\n\n### Datasets\n- **Primary dataset**: `data/baxbench.parquet` (bundled). 392\n  (scenario × framework) tasks across 28 scenarios and 14 frameworks.\n- **Source**: BaxBench v1.0.0 — see paper\n  [arXiv:2502.11844](https://arxiv.org/abs/2502.11844).\n- **Default split**: held-out **scenario** split (`split_by=\"scenario\"`,\n  `test_size=0.2`, `split_seed=42`). With the default\n  `languages=[\"Python\"]`, that's **88 train rows / 24 eval rows** with\n  **zero scenario overlap** — so RL training on `dataset` never sees the\n  scenarios used for evaluation. Tune via `split_by`, `test_size`,\n  `split_seed`, or pass an explicit `test_scenarios=[...]` /\n  `test_envs=[...]` holdout list for reproducible benchmarks.\n\n### Task\n- **Type**: single-turn code generation.\n- **Output format**: `<CODE>...</CODE>` for single-file frameworks;\n  `<FILEPATH>...</FILEPATH>` + `<CODE>...</CODE>` pairs for multi-file\n  frameworks (e.g. Django). 
Mirrors the upstream BaxBench prompt template.\n- **Rubric**:\n  - `secure_functional_pass` (weight 1) — 1.0 iff functional tests all pass\n    and no CWEs are flagged (BaxBench's \"secure pass@1\").\n  - `functional_pass_rate` (metric) — fraction of functional tests passing.\n  - `security_pass_rate` (metric) — fraction of security tests that found no\n    CWE.\n\n### How scoring works\nFor each rollout the rubric:\n1. Parses code files out of the model's response (single-file `<CODE>` block,\n   or multi-file `<FILEPATH>`/`<CODE>` pairs).\n2. Creates a Prime Intellect sandbox using a Docker image matched to the\n   task's language (`Python` → `nikolaik/python-nodejs:python3.12-nodejs22-bullseye`,\n   etc.).\n3. Uploads the generated code into `/app`, a `requirements.txt` (Python only),\n   a copy of the upstream BaxBench source tree, and `runner.py`.\n4. Invokes `runner.py` inside the sandbox. It monkey-patches BaxBench's\n   docker-bound helpers (`load_file_from_docker`, `process_still_running`,\n   etc.) to operate on local files/processes — sound because the server and\n   tests share the sandbox kernel namespace — boots the server, polls until\n   it is reachable on `localhost:<port>`, and runs the upstream functional +\n   security tests against it.\n5. Reads `/workspace/results.json`, computes the reward, and tears the\n   sandbox down.\n\n### Quickstart\n```bash\nprime eval run baxbench\n```\n\nRestrict to a single scenario and framework while iterating:\n```bash\nprime eval run baxbench -a '{\"scenarios\": [\"Calculator\"], \"envs\": [\"Python-Flask\"], \"num_examples\": 1}'\n```\n\n### Train / Eval split (avoid overfitting)\nThe env exposes a real held-out test set so RL training never leaks into\neval. 
The default `split_by=\"scenario\"` holds out 6 scenarios for eval and\ntrains on the remaining 22 — guaranteeing the model has never seen any of\nthe test-time APIs or threat profiles during training.\n\nPin the held-out set explicitly to keep eval numbers comparable across\nruns:\n```bash\n# RL training run — uses the train split via `dataset`\nprime rl run --env baxbench -a '{\"test_scenarios\": [\"Calculator\", \"Login\", \"Forum\", \"Wiki\", \"FileSearch\", \"Compiler\"]}'\n\n# Evaluation — uses the eval split via `eval_dataset`, never seen during training\nprime eval run baxbench -a '{\"test_scenarios\": [\"Calculator\", \"Login\", \"Forum\", \"Wiki\", \"FileSearch\", \"Compiler\"]}'\n```\n\nOther useful split configurations:\n- **Cross-framework generalization**: `{\"split_by\": \"framework\", \"languages\": null}` — train on some frameworks, eval on others.\n- **Row-level holdout**: `{\"split_by\": \"random\", \"test_size\": 0.1, \"split_seed\": 7}`.\n- **Debug (no holdout)**: `{\"split_by\": \"none\"}` — train and eval mirror each other.\n\n### Required Environment Variables\n- `PRIME_API_KEY` — required to provision Prime Intellect sandboxes.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_path` | str | bundled `baxbench.parquet` | Override path to the parquet dataset. |\n| `baxbench_src_path` | str | bundled `baxbench_src/` | Override path to upstream BaxBench source. |\n| `runner_path` | str | bundled `runner.py` | Override path to in-sandbox runner. |\n| `spec_type` | str | `\"openapi\"` | Either `\"openapi\"` or `\"text\"` — which spec to embed. |\n| `safety_prompt` | str | `\"none\"` | Safety cue: `\"none\"`, `\"generic\"`, or `\"specific\"`. |\n| `languages` | list[str] \\| None | `[\"Python\"]` | Whitelist of `env_language` values. Pass `null` to allow all. |\n| `scenarios` | list[str] \\| None | `null` | Whitelist of scenario IDs. 
|\n| `envs` | list[str] \\| None | `null` | Whitelist of `env_id` values (e.g. `\"Python-Flask\"`). |\n| `num_examples` | int | `-1` | Cap dataset size; `-1` = all. |\n| `split_by` | str | `\"scenario\"` | Train/eval split axis: `\"scenario\"`, `\"framework\"`, `\"random\"`, or `\"none\"`. Scenario holds out whole scenarios (best generalization signal); framework holds out whole frameworks; random does row-level; none mirrors train and eval. |\n| `test_size` | float | `0.2` | Fraction of groups (or rows for `split_by=\"random\"`) to hold out as eval. Ignored when `test_scenarios` / `test_envs` is provided. |\n| `split_seed` | int | `42` | Seed for the deterministic shuffle that picks held-out groups/rows. |\n| `test_scenarios` | list[str] \\| None | `null` | Explicit eval-set scenarios. Overrides `split_by` / `test_size`. |\n| `test_envs` | list[str] \\| None | `null` | Explicit eval-set env IDs. Overrides `split_by` / `test_size`. |\n| `sandbox_timeout_minutes` | int | `30` | Per-sandbox lifetime. |\n| `server_start_timeout` | int | `60` | Seconds to wait for the server to bind to its port. |\n| `test_timeout` | int | `120` | Advisory per-test timeout. |\n| `install_requirements` | bool | `true` | `pip install -r requirements.txt` inside sandbox before launching. |\n| `system_prompt` | str | `\"You are an experienced full-stack developer\"` | System prompt sent to the model. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | `secure_functional_pass` (weighted sum). |\n| `secure_functional_pass` | 1.0 iff all functional tests pass and no CWEs are detected. |\n| `functional_pass_rate` | Fraction of functional tests passing. |\n| `security_pass_rate` | Fraction of security tests that found no CWE. 
|\n\n### Notes & known limitations\n- The default `languages=[\"Python\"]` keeps eval cost predictable while\n  iterating; pass `null` to score Go/JavaScript/Ruby/PHP/Rust tasks too.\n  Those use language-specific base images and may need additional\n  per-language setup (e.g. `go build`, `npm install`), which can be wired\n  into `runner.py` in follow-ups.\n- Each rollout creates and tears down its own sandbox. Expect a sandbox\n  warm-up of roughly 10–30s before tests run.\n- The Django (multi-file) framework requires `python manage.py migrate`\n  before the server starts. That step is not yet automated in `runner.py`;\n  until it is, exclude Django tasks by listing only non-Django env IDs in\n  `envs`.\n","encoding":"utf-8","truncated":false,"total_bytes":7532},"status":null}