{"data":{"kind":"file","path":"README.md","version_id":"cuxn6s7k93p1l01lw0tlx7qe","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6992,"modified_at":"2026-06-11T02:45:42.424000","content_hash":"e233ed516454b234e3eb121cd40727ada84fd7ea857ca75e7afe465a47bd5f8e"},"entries":[],"content":"# skillsbench\n\nskillsbench in the [Prime Intellect Environments Hub](https://docs.primeintellect.ai/environments-hub/overview).\nWraps the [skillsbench](https://github.com/benchflow-ai/skillsbench) benchmark\nas a `verifiers.v1` environment so it can be evaluated and trained against\nthrough `prime eval run` and the hosted training stack.\n\n### Overview\n- **Environment ID**: `skillsbench`\n- **Short description**: Benchmark for evaluating how well AI agents use *skills*\n  (modular folders of instructions, scripts, and resources). Bundles 100 task\n  dirs: an 86-task scored grid plus 14 opt-in extras.\n- **Tags**: skills, agents, tool-use, harbor, eval, train\n\n### Datasets\n- **Scored grid**: 86 skillsbench tasks under `tasks/`, loaded by default.\n- **Extras**: 14 opt-in tasks under `tasks-extra/`, loaded only with\n  `extras=true` (credential/external-dep tasks upstream parks in `tasks-extra/`,\n  plus `drone-planning-control` — see Caveats).\n- Each task follows the Harbor layout (`task.toml`, `instruction.md`,\n  `environment/`, `solution/solve.sh`, `tests/test.sh`).\n- **Source**: https://github.com/benchflow-ai/skillsbench — trees re-vendored\n  verbatim from `origin/main` (through #916, 2026-06-05).\n- **Split sizes**: 86 scored examples, single split.\n\n### Divergence from upstream main\n- Upstream keeps `drone-planning-control` in the **scored** `tasks/` grid (87\n  total). This port moves it to `tasks-extra/` because it is compose-only (no\n  `/app`, references `${ANTHROPIC_API_KEY}`) and fails under this port's\n  Dockerfile-on-Prime-sandbox harness for non-model reasons. So our scored grid\n  is 86, not 87. 
Everything else mirrors main exactly.\n\n### Task\n- **Type**: multi-turn, tool-use coding agent.\n- **Output expectations**: each task's `instruction.md` names the file path(s)\n  the agent must produce. Examples: `/root/mass_report.json`,\n  `/root/answer.txt`. The bundled `tests/test_outputs.py` scores them.\n- **Rubric**: a single binary `harbor_reward` — `1.0` if the task's test\n  script writes a non-zero value to `/logs/verifier/reward.txt`, else `0.0`.\n\n### How it differs from upstream skillsbench\n\n| Concern | Upstream skillsbench (`bench eval`) | This port (`prime eval run`) |\n| --- | --- | --- |\n| Sandbox | Daytona, per-task built Docker image | One shared image (`ubuntu:24.04`) + per-task setup script |\n| Workdir | `/root` (Dockerfile `WORKDIR /root`) | `/root` (set on sandbox config) |\n| Skills mount | `/root/.claude/skills/<skill>/` | Same: `/root/.claude/skills/<skill>/` |\n| Agent | `claude-agent-acp`, etc. | `verifiers.v1.OpenCode` by default; `Pi`, `MiniSWEAgent`, base harness all wireable |\n| Scoring | `bench eval` runs `tests/test.sh`, reads `/logs/verifier/reward.txt` | `harbor_reward` does exactly the same |\n\nThe port intentionally trades per-task Dockerfiles for a shared base image\nplus a setup hook. For a task that needs a different image (e.g. 
CUDA, a\nlocked compiler version), set `[environment].docker_image` in its\n`task.toml` — `HarborTaskset` honors it.\n\n### Quickstart\n\n```bash\n# Install\nprime env install benchflow/skillsbench   # or `prime env install .` after cloning\n\n# List the tasks the env can run\npython -c \"from skillsbench import skillsbenchTaskset; \\\n  print(len(skillsbenchTaskset().rows()))\"\n\n# Run one task end-to-end with the default OpenCode harness\n# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)\nprime eval run skillsbench \\\n  -m anthropic/claude-haiku-4.5 -p prime \\\n  -n 1 -r 1 --max-concurrent 1 --timeout 1800 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"dialogue-parser\"]}}}'\n\n# Run a curated subset\nprime eval run skillsbench \\\n  -m anthropic/claude-sonnet-4.6 -p prime \\\n  -n 5 -r 1 --max-concurrent 4 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"3d-scan-calc\",\"edit-pdf\",\"dialogue-parser\",\"exam-block-sequencing\",\"find-topk-similiar-chemicals\"]}}}'\n```\n\n### Environment Arguments\n\n`load_environment` takes the standard `vf.EnvConfig` envelope. 
The taskset\nand harness child configs accept:\n\n| Section | Arg | Type | Default | Description |\n| --- | --- | --- | --- | --- |\n| `taskset` | `task_names` | `list[str] \\| None` | `None` (all 86) | Subset filter |\n| `taskset` | `extras` | `bool` | `false` | Merge the 14 `tasks-extra/` tasks into the grid |\n| `taskset` | `docker_image` | `str` | `ubuntu:24.04` | Image used when task.toml does not override |\n| `taskset` | `workdir` | `str` | `/root` | Sandbox workdir; also where instruction.md / data is uploaded |\n| `taskset` | `skills_remote_dir` | `str` | `/root/.claude/skills` | Where the per-task `environment/skills/` tree is mounted |\n| `taskset` | `apt_packages` | `list[str]` | python3, pip, ripgrep, jq, … | Installed once per rollout via `apt-get install` |\n| `taskset` | `timeout_minutes` | `int` | `480` | Sandbox lifetime (8h) |\n| `taskset` | `agent_timeout_seconds` | `float` | `7200` | Per-command timeout for the agent. Foreground HTTP is capped at 900s, so anything above 600s is auto-routed through `start_background_job` (polled, ~24h cap). 
|\n| `taskset` | `verifier_timeout_seconds` | `float` | `1800` | Test script timeout (also background-routed) |\n| `harness` | `max_turns` | `int` | `60` | OpenCode turn budget |\n| `harness` | `system_prompt` | `str` | skillsbench prompt | Override the system prompt |\n| `harness` | `disabled_tools` | `list[str]` | narrow list | OpenCode tool gating |\n| `harness` | `agent_workdir` | `str` | `/root` | Where OpenCode `cd`s before running |\n| `harness` | `install_ripgrep` | `bool` | `false` | Skipped because our apt setup installs ripgrep |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | `1.0` iff the task's `tests/test.sh` wrote a non-zero value to `/logs/verifier/reward.txt`, else `0.0` |\n| `harbor_tests` (state) | dict with `returncode`, `stdout`, `stderr` from the test script, useful for debugging failed rollouts |\n| `harbor_error` (state) | exception string if test upload/exec itself failed |\n\n### Caveats\n\n1. **Single base image.** Tasks that hard-depend on an Ubuntu 20.04 base or\n   custom images (e.g. `jasonish/suricata`, `bugswarm/cached-images`,\n   `oss-fuzz-base/base-builder-python`) will not match upstream behavior\n   unless their `task.toml` sets `[environment].docker_image`.\n2. **Network on by default.** skillsbench setup steps run `apt-get` / `pip\n   install`; the port enables `network_access = True` per rollout. Tasks\n   that need to be sealed off should set `[environment].allow_internet =\n   false`.\n3. **`tests/test.sh` runs inside the same sandbox.** It uses pytest from\n   the task's setup commands. Harbor's reward function uploads\n   `solution/` and `tests/` from the host into the live sandbox at score\n   time, so the agent cannot peek at the oracle.\n4. 
**No bench-specific judge agents.** skillsbench's `bench eval`\n   wrapper provides things like trajectory inspection and skill-coverage\n   metrics; this port stays inside the `verifiers.v1` contract and only\n   emits the single binary reward.\n","encoding":"utf-8","truncated":false,"total_bytes":6992},"status":null}