{"data":{"kind":"file","path":"README.md","version_id":"dyqe2miumfrvn3p0mrngoopx","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6158,"modified_at":"2026-05-24T17:14:37.990000","content_hash":"e5883ae0cf190484fc98176d6a40512dcf7df6d11610fbdcaaefa28c8d019738"},"entries":[],"content":"# SkillsBench\n\nSkillsBench in the [Prime Intellect Environments Hub](https://docs.primeintellect.ai/environments-hub/overview).\nWraps the [SkillsBench](https://github.com/benchflow-ai/skillsbench) benchmark\nas a `verifiers.v1` environment so it can be evaluated and trained against\nthrough `prime eval run` and the hosted training stack.\n\n### Overview\n- **Environment ID**: `skillsbench`\n- **Short description**: Benchmark for evaluating how well AI agents use *skills*\n  (modular folders of instructions, scripts, and resources). Bundles 94 tasks.\n- **Tags**: skills, agents, tool-use, harbor, eval, train\n\n### Datasets\n- **Primary dataset**: 94 SkillsBench tasks vendored under `tasks/`. Each task\n  follows the Harbor task layout (`task.toml`, `instruction.md`,\n  `environment/`, `solution/solve.sh`, `tests/test.sh`).\n- **Source**: https://github.com/benchflow-ai/skillsbench\n- **Split sizes**: 94 examples, single split.\n\n### Task\n- **Type**: multi-turn, tool-use coding agent.\n- **Output expectations**: each task's `instruction.md` names the file path(s)\n  the agent must produce. Examples: `/root/mass_report.json`,\n  `/root/answer.txt`. 
The bundled `tests/test_outputs.py` scores them.\n- **Rubric**: a single binary `harbor_reward` — `1.0` if the task's test\n  script writes a non-zero value to `/logs/verifier/reward.txt`, else `0.0`.\n\n### How it differs from upstream SkillsBench\n\n| Concern | Upstream SkillsBench (`bench eval`) | This port (`prime eval run`) |\n| --- | --- | --- |\n| Sandbox | Daytona, per-task built Docker image | One shared image (`ubuntu:24.04`) + per-task setup script |\n| Workdir | `/root` (Dockerfile `WORKDIR /root`) | `/root` (set on sandbox config) |\n| Skills mount | `/root/.claude/skills/<skill>/` | Same: `/root/.claude/skills/<skill>/` |\n| Agent | `claude-agent-acp`, etc. | `verifiers.v1.OpenCode` by default; `Pi`, `MiniSWEAgent`, base harness all wireable |\n| Scoring | `bench eval` runs `tests/test.sh`, reads `/logs/verifier/reward.txt` | `harbor_reward` does exactly the same |\n\nThe port intentionally trades per-task Dockerfiles for a shared base image\nplus a setup hook. For a task that needs a different image (e.g. 
CUDA, a\nlocked compiler version), set `[environment].docker_image` in its\n`task.toml` — `HarborTaskset` honors it.\n\n### Quickstart\n\n```bash\n# Install\nprime env install benchflow/skillsbench   # or `prime env install .` after cloning\n\n# List the tasks the env can run\npython -c \"from skillsbench import SkillsBenchTaskset; \\\n  print(len(SkillsBenchTaskset().rows()))\"\n\n# Run one task end-to-end with the default OpenCode harness\n# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)\nprime eval run skillsbench \\\n  -m anthropic/claude-haiku-4.5 -p prime \\\n  -n 1 -r 1 --max-concurrent 1 --timeout 1800 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"dialogue-parser\"]}}}'\n\n# Run a curated subset\nprime eval run skillsbench \\\n  -m anthropic/claude-sonnet-4.6 -p prime \\\n  -n 5 -r 1 --max-concurrent 4 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"3d-scan-calc\",\"edit-pdf\",\"dialogue-parser\",\"exam-block-sequencing\",\"find-topk-similiar-chemicals\"]}}}'\n```\n\n### Environment Arguments\n\n`load_environment` takes the standard `vf.EnvConfig` envelope. The taskset\nand harness child configs accept:\n\n| Section | Arg | Type | Default | Description |\n| --- | --- | --- | --- | --- |\n| `taskset` | `task_names` | `list[str] \\| None` | `None` (all 94) | Subset filter |\n| `taskset` | `docker_image` | `str` | `ubuntu:24.04` | Image used when task.toml does not override |\n| `taskset` | `workdir` | `str` | `/root` | Sandbox workdir; also where instruction.md / data is uploaded |\n| `taskset` | `skills_remote_dir` | `str` | `/root/.claude/skills` | Where the per-task `environment/skills/` tree is mounted |\n| `taskset` | `apt_packages` | `list[str]` | python3, pip, ripgrep, jq, … | Installed once per rollout via `apt-get install` |\n| `taskset` | `timeout_minutes` | `int` | `480` | Sandbox lifetime (8h) |\n| `taskset` | `agent_timeout_seconds` | `float` | `7200` | Per-command timeout for the agent. 
Foreground HTTP is capped at 900s, so anything above 600s is auto-routed through `start_background_job` (polled, ~24h cap). |\n| `taskset` | `verifier_timeout_seconds` | `float` | `1800` | Test script timeout (also background-routed) |\n| `harness` | `max_turns` | `int` | `60` | OpenCode turn budget |\n| `harness` | `system_prompt` | `str` | SkillsBench prompt | Override the system prompt |\n| `harness` | `disabled_tools` | `list[str]` | narrow list | OpenCode tool gating |\n| `harness` | `agent_workdir` | `str` | `/root` | Where OpenCode `cd`s before running |\n| `harness` | `install_ripgrep` | `bool` | `false` | Skipped because our apt setup installs ripgrep |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | `1.0` iff the task's `tests/test.sh` wrote a non-zero value to `/logs/verifier/reward.txt`, else `0.0` |\n| `harbor_tests` (state) | dict with `returncode`, `stdout`, `stderr` from the test script (useful for debugging failed rollouts) |\n| `harbor_error` (state) | exception string if test upload/exec itself failed |\n\n### Caveats\n\n1. **Single base image.** Tasks that hard-depend on an Ubuntu 20.04 base or\n   custom images (e.g. `jasonish/suricata`, `bugswarm/cached-images`,\n   `oss-fuzz-base/base-builder-python`) will not match upstream behavior\n   unless their `task.toml` sets `[environment].docker_image`.\n2. **Network on by default.** SkillsBench setup steps run `apt-get` / `pip\n   install`; the port enables `network_access = True` per rollout. Tasks\n   that need to be sealed off should set `[environment].allow_internet =\n   false`.\n3. **`tests/test.sh` runs inside the same sandbox.** It uses pytest from\n   the task's setup commands. Harbor's reward function uploads\n   `solution/` and `tests/` from the host into the live sandbox at score\n   time, so the agent cannot peek at the oracle.\n4. 
**No bench-specific judge agents.** SkillsBench's `bench eval`\n   wrapper provides things like trajectory inspection and skill-coverage\n   metrics; this port stays inside the `verifiers.v1` contract and only\n   emits the single binary reward.\n","encoding":"utf-8","truncated":false,"total_bytes":6158},"status":null}