{"data":{"kind":"file","path":"README.md","version_id":"pr5ltbnkm63grirw8ofpw49v","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6257,"modified_at":"2026-05-12T21:08:45.802000","content_hash":"3a5caf67f66d1e7b4c010180e4a136cd31e6902a6d922de9364c3f424a81f87b"},"entries":[],"content":"# swe\n\nSWE tasks inside Prime Sandboxes via ComposableEnv.\n\n### Overview\n- **Environment ID**: `swe`\n- **Agent**: Sandbox CLI agent wired through ComposableEnv — defaults include bash plus the bundled **edit** skill (see `load_environment` / harness defaults).\n- **TaskSet**: R2E-Gym (default), SWE-bench, Multi-SWE, OpenSWE via `task_type` arg\n- **Scoring**: Test-based evaluation via the SWE taskset's rubric\n\n### Quickstart\n\n```bash\n# From research-environments root\nuv pip install -e ./environments/swe\n\n# Single debug rollout (GH_TOKEN may be required when the host must populate the agent checkout cache)\nGH_TOKEN=... uv run vf-eval swe -a '{\"task_type\":\"r2e\"}' -d -v -n1 -r1\n```\n\n### Environment Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `task_type` | `\"r2e\"` | SWE backend: `r2e`, `swebench`, `multiswe`, `openswe` |\n| `dataset_name` | (taskset default) | Override dataset name |\n| `filter_repos` | None | Filter to specific repos |\n| `ds_keep_in_memory` | None | Forwarded to the upstream SWE taskset dataset loader |\n| `ds_num_proc` | None | Forwarded to the upstream SWE taskset dataset loader |\n| `gh_token` | `$GH_TOKEN` | GitHub token used on the host only when cloning/checking out the agent bundle into the local cache |\n| `**kwargs` | — | Forwarded as-is to the composable sandbox harness (install/run/tool/env wiring). Includes knobs such as exec timeouts, summarize/auto-compaction thresholds, checkout ref and repo URL, tool allowlists, append-to-system-prompt text, local checkout overrides (`local_checkout`), etc. See the upstream harness implementation in verifiers for names and defaults |\n| `max_turns` | 200 | Max interception server turns |\n| `timeout_seconds` | 5400 | Sandbox timeout (90min) |\n| `poll_interval` | 1.0 | Seconds between `CliAgentEnv` intercept-queue polls / liveness checks |\n| `sandbox_cpu_cores` | 4 | CPU cores per sandbox |\n| `sandbox_memory_gb` | 4 | Memory per sandbox |\n| `sandbox_disk_size_gb` | 2 | Disk per sandbox |\n| `sandbox_guaranteed` | false | Request guaranteed Prime sandbox capacity for created rollouts |\n| `sandbox_client_max_workers` | None | Max worker threads in the shared sandbox client |\n| `labels` | `[\"swe\"]` | Sandbox labels attached to created rollouts |\n\n### Changelog\n\n#### v0.3.4\n- Default `sandbox_client_max_workers` to `None` so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.\n\n#### v0.3.3\n- Add `sandbox_guaranteed` to request Prime sandbox guaranteed capacity while preserving the default non-guaranteed sandbox behavior. Require `prime-sandboxes>=0.2.23` for `CreateSandboxRequest.guaranteed`.\n\n#### v0.3.2\n- Declare `multi-swe-bench>=1.1.2` as a direct dep. `MultiSWERubric` calls `multi_swe_bench.harness.report.generate_report` to score `task_type=\"multiswe\"` rollouts; without it the rubric raises `ModuleNotFoundError` and silently zeros every reward (verified during a gpt-5.4 vf-eval run).\n\n#### v0.3.1\n- Declare `swebench==4.1.0` as a direct dep — needed when `task_type=\"swebench\"` routes through `verifiers`' composable `swe_bench` taskset (which imports `swebench` at module top level without declaring it).\n\n#### v0.3.0\n- Stop enumerating harness kwargs on `load_environment`; everything except `gh_token` now flows through `**kwargs` directly to the composable sandbox harness. Rename: `*_local_checkout` style aliases consolidated to `local_checkout` where applicable. No runtime default changes beyond upstream harness defaults.\n- Drop duplicated max-turn / exec-timeout keys from the env's bare `environment_vars` dict — the harness merges those into the sandbox via `Harness.environment_vars`.\n- Require `verifiers>=0.1.13.dev5`.\n\n#### v0.2.9\n- Re-add tool-list forwarding (previously removed in v0.2.7 as a no-op). It now fans out through the composable harness to both `Harness.tool_names` (drives `ToolMonitorRubric`) and the sandbox tool allowlist. Documented defaults included interactive REPL plus compaction helpers; other tool names such as bash/edit remain available depending on harness support.\n\n#### v0.2.8\n- Replace branch-only checkout with ref (branch, tag, or full commit SHA) and make the default host cache commit-keyed.\n- Clarify auto-materialized host cache vs existing-checkout override paths.\n\n#### v0.2.7\n- Remove the unused tool-list forwarding argument and stop exporting obsolete verbosity-related sandbox env vars.\n- Require `verifiers>=0.1.13.dev3`.\n- Refresh the README argument table to match the current `load_environment()` signature.\n\n#### v0.2.6\n- Add host-side checkout path override for the uploaded agent bundle.\n- Cache checkouts on the host and upload into each sandbox, reducing repeated clone pressure during large runs.\n\n#### v0.2.5\n- Bump verifiers to `>=0.1.13.dev1`.\n\n#### v0.2.4\n- Add per-tool exec timeout parameter (default 300s); forwarded into the sandbox to cap individual tool runs inside the agent loop.\n- Unify timeout knob: `timeout_seconds` now drives both the rollout deadline and the sandbox container lifetime (`sandbox_timeout_minutes` is derived via `math.ceil`), preventing sandbox teardown before the agent reaches its deadline.\n- Expose `poll_interval` kwarg; forwarded to `ComposableEnv` / `CliAgentEnv` to tune the intercept-queue poll cadence.\n\n#### v0.2.3\n- Ship the `edit` skill with this environment (under `swe/skills/edit/`); auto-uploaded via `ComposableEnv`'s skills-upload mechanism.\n\n#### v0.2.2\n- Simplify to use `ComposableEnv` directly; metrics and `GH_TOKEN` handling are driven by upstream harness configuration.\n- Surface session metrics from agent meta logs instead of a fixed whitelist.\n\n#### v0.2.1\n- Add configurable upstream repo URL and branch/ref so installs can track a chosen remote.\n\n#### v0.1.3\n- Add optional cap on retained assistant turns in live context.\n- Add `append_to_system_prompt` to append environment-specific instructions.\n\n#### v0.1.2\n- Extract session metrics from `meta.json` after each rollout and surface as top-level state keys (turn counts, token usage per turn, stop reasons, etc.).\n\n#### v0.1.1\n- Scope `gh_token` / `GH_TOKEN` to install/checkout steps only, without exporting it as a sandbox runtime environment variable.\n\n#### v0.1.0\n- Initial release\n","encoding":"utf-8","truncated":false,"total_bytes":6257},"status":null}