{"data":{"kind":"file","path":"README.md","version_id":"rt01nwvnu0kx8llpoavdj1fh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3425,"modified_at":"2026-06-01T19:55:31.462000","content_hash":"d5afacac71f170aa2a18a9e7f0f89a12de42d4722b6f3fe4596dea5fc5ea8d24"},"entries":[],"content":"# harbor\n\nHarbor (terminal-bench-style) tasks inside Prime Sandboxes via ComposableEnv.\n\n### Overview\n- **Environment ID**: `harbor`\n- **Agent**: Sandbox CLI agent wired through ComposableEnv (bash + edit tools by default)\n- **TaskSet**: `HarborDatasetTaskSet` over the bundled `harbor/tasks/` directory (11 sample tasks; `task_names=` to subset)\n- **Scoring**: `HarborDatasetRubric` — reads `/logs/verifier/reward.txt|.json` after `bash test.sh`\n\n### Quickstart\n\n```bash\n# From research-environments root\nuv pip install -e ./environments/harbor\n\n# Single debug rollout against the bundled task set\nGH_TOKEN=... uv run vf-eval harbor -d -v -n1 -r1\n\n# Restrict to a specific task\nGH_TOKEN=... uv run vf-eval harbor -d -v -n1 -r1 -a '{\"tasks\":[\"hello-world\"]}'\n```\n\n### Bundled tasks\n\nThe 11 terminal-bench sample tasks shipped under `harbor/tasks/`:\n\n| Task | Notes |\n|---|---|\n| `build-cython-ext` | Build a Cython extension |\n| `chess-best-move` | Determine best move from a FEN |\n| `configure-git-webserver` | Stand up a git-over-HTTP server |\n| `fix-code-vulnerability` | Patch a security bug |\n| `hello-world` | Smoke test |\n| `log-summary-date-ranges` | Summarize logs over date ranges |\n| `polyglot-c-py` | Polyglot C / Python source |\n| `qemu-alpine-ssh` | SSH into a QEMU Alpine VM |\n| `qemu-startup` | Start a QEMU VM |\n| `regex-log` | Extract structured fields from logs |\n| `sqlite-with-gcov` | Build SQLite with gcov instrumentation |\n\nEach task directory provides `task.toml`, `instruction.md`, a `tests/` harness, and a gold `solution/`.\n\n### Environment Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `dataset_path` | `harbor/tasks` | Directory of Harbor task subdirectories |\n| `tasks` | `None` | Subset of task directory names to run |\n| `gh_token` | `$GH_TOKEN` | Token forwarded to the harness for private agent-bundle clones |\n| `max_turns` | `200` | Max interception server turns |\n| `timeout_seconds` | `5400.0` | Sandbox timeout (90min) |\n| `poll_interval` | `1.0` | Seconds between intercept-queue polls |\n| `sandbox_cpu_cores` | `2` | CPU cores per sandbox |\n| `sandbox_memory_gb` | `4` | Memory per sandbox |\n| `sandbox_disk_size_gb` | `10` | Disk per sandbox |\n| `sandbox_client_max_workers` | `50` | Max worker threads in the shared sandbox client |\n| `labels` | `[\"harbor\"]` | Sandbox labels attached to created rollouts |\n| `**kwargs` | — | Forwarded to the harness (e.g. `workdir`, `append_to_system_prompt`, …) |\n\n### Changelog\n\n#### v0.1.5\n- Restore explicit `sandbox_client_max_workers=50` default instead of falling through to the verifiers sandbox client default.\n\n#### v0.1.4\n- Default `sandbox_client_max_workers` to `None` so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.\n\n#### v0.1.3\n- Republish with the v0.1.2 fixes after the wheel-content drift between the in-flight push and the follow-up commit. No code changes vs v0.1.2.\n\n#### v0.1.2\n- Default tool set to `[\"bash\",\"edit\"]` so the agent gets shell + edit tools instead of falling back to the harness default. Documented \"bash + edit by default\" claim was previously not enforced.\n\n#### v0.1.1\n- Publish env publicly on the Environments Hub.\n\n#### v0.1.0\n- Initial release. Mirrors the `swe` environment wiring (`ComposableEnv` + agent harness) but routes through `HarborDatasetTaskSet`. Ships the 11 terminal-bench sample tasks in `harbor/tasks/`.\n","encoding":"utf-8","truncated":false,"total_bytes":3425},"status":null}