{"data":{"kind":"file","path":"README.md","version_id":"hncm1qisganb2rn4kxahhenc","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3504,"modified_at":"2026-05-22T23:21:01.816000","content_hash":"e5a6de142d6cef842f5b4d73f4996cd553507330f9f7662fe0953ab4b9ca9ce1"},"entries":[],"content":"# prime-grep\n\nCross-repo code-search environment for the `prime-swe-grep` task set. Each task is a user-style question; the agent uses `grep` / `list_files` / `read_file` to search four pinned repos in a sandbox, then calls `submit_spans` with the spans that justify the answer.\n\n## Repos (pinned)\n\n| Repo | Commit |\n|---|---|\n| `prime-rl` | `65919439195fb384eda593b3f93d62cba5b60cf3` |\n| `verifiers` | `58b119fa1b24eff85b74a75ccf3e132523b3c6c3` |\n| `vllm` | `a171e6b52dff47dc567657e7d51f641bdcb22774` |\n| `pytorch` | `c200b7e590a77d52373861844be4287c8ef9507a` |\n\nBumping any of these requires re-verifying every span in `tasks/` — spans store absolute line numbers, so cosmetic upstream edits will silently shift the ground truth.\n\n## Lifecycle\n\n1. **setup** (`PrimeGrepEnv.setup_state`) — creates one persistent prime sandbox for the rollout from the prebuilt image.\n2. **rollout** — `vf.StatefulToolEnv` alternates between tool calls (`grep` / `list_files` / `read_file`) and assistant messages until `submit_spans` records the final answer.\n3. **stop** (`PrimeGrepEnv.answer_submitted`) — fires once `state[\"submitted\"]` is set.\n4. **cleanup** (`PrimeGrepEnv.destroy_sandbox`) — deletes the sandbox through `src/sandbox_manager`.\n5. **reward** (`span_score`) — compares `state[\"submitted_spans\"]` against the task's gold spans.\n\n## Reward modes\n\nSet per-task via `reward_mode` in the YAML, or globally via the `PRIME_GREP_REWARD_MODE` env var.\n\n- `file_recall` (default) — 1.0 per essential gold span whose `(repo, path)` appears in the submission. Lenient: ignores exact line ranges.\n- `span_iou` — line-range IoU averaged across essential gold spans. Strict: sloppy ranges hurt the score.\n\nSupporting spans (`role: supporting` in the YAML) never affect the score in either mode. They exist so depth answers (e.g. citing a `torch` primitive at the bottom of a call chain) are accepted but not required.\n\n## Tasks\n\n`tasks/*.yaml` is bundled with this package; the repo root has a symlink for authoring convenience. See `../../AUTHORING.md` for the recipe.\n\n## Quickstart\n\n```bash\nprime eval run prime-grep -n 8 -r 1\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `num_examples` | int | `-1` | Limit on number of tasks (-1 = all) |\n| `max_turns` | int | `30` | Max rollout turns before forced stop |\n| `sandbox_config` | dict | prebuilt prime-grep image config | Override sandbox request settings |\n\nFor compatibility with previous v1-style configs, `load_environment` also\naccepts `taskset.num_examples` and `harness.max_turns`, but it always returns a\nplain `PrimeGrepEnv(vf.StatefulToolEnv)`.\n\n## Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| `span_score` | Mean recall over essential gold spans (mode-dependent) |\n| `num_predicted` | Number of spans the model submitted |\n| `submitted` | 1.0 if the model called `submit_spans`, else 0.0 |\n| `avg_parallel_tool_calls_per_turn` | Mean tool-call batch size across assistant turns that call tools; single-call turns are 1.0 |\n\n## Sandbox management\n\nSandbox command execution and cleanup are handled by `src/sandbox_manager`.\nIt centralizes app-level retries, lifecycle counters, recoverable tool-facing\nerrors, fatal infra errors, and best-effort sandbox deletion. In particular, a\ntransient `DELETE /sandbox/{id}` 500 during rollout cleanup is retried and then\nrecorded as an orphaned sandbox instead of failing the worker after an otherwise\nsuccessful rollout.\n","encoding":"utf-8","truncated":false,"total_bytes":3504},"status":null}