{"data":{"kind":"file","path":"README.md","version_id":"jyvzzr8c1v2iumhjcc7jkceh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7893,"modified_at":"2026-06-16T22:36:32.539000","content_hash":"811a3949f0d03c0139174fa090a25313db1b92c86a02548338c4dcd6a3d5232d"},"entries":[],"content":"# SkillsBench\n\nSkillsBench in the Prime Intellect Environments Hub. Wraps the\n[SkillsBench](https://github.com/benchflow-ai/skillsbench) benchmark as a\n`verifiers.v1` environment so that models can be evaluated and trained on it through\n`prime eval run` and the hosted training stack.\n\nThis is the **v1.1 release** of SkillsBench — native BenchFlow **`task.md`**\npackages (GitHub release [`v1.1`](https://github.com/benchflow-ai/skillsbench/releases/tag/v1.1),\ncommit `27738384`). The task list is **faithful to the v1.1 release** — no tasks\nwere moved, dropped, or renamed.\n\n## Overview\n\n- **Environment ID**: `skillsbench`\n- **Short description**: Benchmark for evaluating how well AI agents use *skills*\n  (modular folders of instructions, scripts, and resources). 
Bundles **101 task\n  dirs**: an **87-task scored grid** plus **14 opt-in extras**.\n- **Tags**: skills, agents, tool-use, harbor, eval, train\n\n## Datasets\n\n- **Scored grid**: **87** SkillsBench tasks under `tasks/`, loaded by default —\n  the full v1.1 default grid.\n- **Extras**: **14** opt-in tasks under `tasks-extra/`, loaded only with\n  `extras=true` (credential-/external-dep or integration-incompatible tasks that\n  upstream parks in `tasks-extra/`).\n- Each task is a native **`task.md`** package (BenchFlow v1.1 layout):\n\n  ```text\n  tasks/<task-id>/\n    task.md                  # YAML frontmatter (schema_version 1.3:\n                             #   metadata / verifier / agent / environment) + instruction body\n    environment/\n      Dockerfile             # baked recipe (ignored by this port)\n      skills/                # uploaded to /root/.claude/skills/\n      <data files...>        # uploaded to /root/\n    oracle/\n      solve.sh               # reference solution (held back, scoring only)\n    verifier/\n      test.sh                # scorer: pytest verifier/test_outputs.py -> /logs/verifier/reward.txt\n      test_outputs.py\n  ```\n\n- **Source**: <https://github.com/benchflow-ai/skillsbench> — `tasks/` and\n  `tasks-extra/` trees vendored verbatim from the **v1.1 release** (`27738384`),\n  matching the release's exact split.\n- **Split sizes**: 87 scored examples, single split.\n\n## Task\n\n- **Type**: multi-turn, tool-use coding agent.\n- **Output expectations**: each task's `task.md` instruction names the file\n  path(s) the agent must produce. Examples: `/root/mass_report.json`,\n  `/root/answer.txt`. 
The task's `verifier/test_outputs.py` scores them.\n- **Rubric**: a single binary `skillsbench_taskmd_reward` — `1.0` if the task's\n  verifier writes a non-zero value to `/logs/verifier/reward.txt`, else `0.0`.\n\n### How it differs from upstream SkillsBench\n\n| Concern | Upstream SkillsBench (`bench eval`) | This port (`prime eval run`) |\n|---|---|---|\n| Sandbox | Daytona, per-task built Docker image | One shared image (`ubuntu:24.04`) + per-task setup script |\n| Workdir | `/root` (Dockerfile `WORKDIR /root`) | `/root` (set on sandbox config) |\n| Skills mount | `/root/.claude/skills/<skill>/` | Same: `/root/.claude/skills/<skill>/` |\n| Agent | `claude-agent-acp`, etc. | `verifiers.v1.OpenCode` by default; Pi, MiniSWEAgent, and the base harness can also be wired in |\n| Scoring | runs `verifier/test.sh`, reads `/logs/verifier/reward.txt` | `skillsbench_taskmd_reward` does exactly the same |\n\nBecause the stock `HarborTaskset` in `verifiers` (0.1.14) only parses the old\n`task.toml` layout, this env ships a native **`task.md`** loader: it parses the\nYAML frontmatter (sandbox sizing from `environment.memory_mb` / `storage_mb` /\n`cpus`, timeouts from `verifier`/`agent`), uses the markdown body as the\ninstruction, and scores via the v1.1 `verifier/` protocol (mounts `oracle/` →\n`/oracle`, `verifier/` → `/verifier`, runs `bash /verifier/test.sh`).\n\nThe port intentionally trades per-task Dockerfiles for a shared base image plus\na setup hook. For a task that needs a different image (e.g. 
CUDA, a locked\ncompiler version), set `docker_image` under `environment:` in its `task.md`\nfrontmatter — the loader honors it.\n\n## Quickstart\n\n```bash\n# Install\nprime env install benchflow/skillsbench      # or `prime env install .` after cloning\n\n# List the tasks the env can run\npython -c \"from skillsbench import SkillsBenchTaskset; \\\nprint(len(SkillsBenchTaskset().load_rows()))\"\n\n# Run one task end-to-end with the default OpenCode harness\n# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)\nprime eval run skillsbench \\\n  -m anthropic/claude-haiku-4.5 -p prime \\\n  -n 1 -r 1 --max-concurrent 1 --timeout 1800 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"dialogue-parser\"]}}}'\n\n# Run a curated subset\nprime eval run skillsbench \\\n  -m anthropic/claude-sonnet-4.6 -p prime \\\n  -n 5 -r 1 --max-concurrent 4 \\\n  -a '{\"config\":{\"taskset\":{\"task_names\":[\"3d-scan-calc\",\"edit-pdf\",\"dialogue-parser\",\"exam-block-sequencing\",\"find-topk-similiar-chemicals\"]}}}'\n```\n\n## Environment Arguments\n\n`load_environment` takes the standard `vf.EnvConfig` envelope. 
The `taskset` and\n`harness` child configs accept:\n\n| Section | Arg | Type | Default | Description |\n|---|---|---|---|---|\n| taskset | `task_names` | `list[str] \\| None` | `None` (all 87) | Subset filter |\n| taskset | `extras` | `bool` | `false` | Merge the 14 `tasks-extra/` tasks into the grid |\n| taskset | `docker_image` | `str` | `ubuntu:24.04` | Image used when a task's `task.md` does not override |\n| taskset | `workdir` | `str` | `/root` | Sandbox workdir; also where `instruction.md` / data is uploaded |\n| taskset | `skills_remote_dir` | `str` | `/root/.claude/skills` | Where the per-task `environment/skills/` tree is mounted |\n| taskset | `with_skills` | `bool` | `true` | Set `false` for the without-skills half of a paired sweep |\n| taskset | `apt_packages` | `list[str]` | python3, pip, ripgrep, jq, … | Installed once per rollout via `apt-get install` |\n| taskset | `timeout_minutes` | `int` | `480` | Sandbox lifetime (8h) |\n| taskset | `agent_timeout_seconds` | `float` | `7200` | Per-command timeout for the agent. Foreground HTTP caps at 900s, so anything above 600s is auto-routed through `start_background_job` (polled, ~24h cap). 
|\n| taskset | `verifier_timeout_seconds` | `float` | `1800` | Verifier script timeout (also background-routed) |\n| harness | `max_turns` | `int` | `60` | OpenCode turn budget |\n| harness | `system_prompt` | `str` | SkillsBench prompt | Override the system prompt |\n| harness | `disabled_tools` | `list[str]` | narrow list | OpenCode tool gating |\n| harness | `agent_workdir` | `str` | `/root` | Where OpenCode `cd`s before running |\n| harness | `install_ripgrep` | `bool` | `false` | Off by default because the taskset's apt setup already installs ripgrep |\n\n## Metrics\n\n| Metric | Meaning |\n|---|---|\n| `reward` | `1.0` iff the task's `verifier/test.sh` wrote a non-zero value to `/logs/verifier/reward.txt`, else `0.0` |\n| `skillsbench_tests` (state) | dict with `returncode`, `stdout`, `stderr` from the verifier — useful for debugging failed rollouts |\n| `skillsbench_error` (state) | exception string if verifier upload/exec itself failed |\n\n## Caveats\n\n- **Single base image.** Tasks that hard-depend on a custom base image (e.g. a\n  specific CUDA/compiler image, or compose-based tasks) will not match upstream\n  behavior unless their `task.md` frontmatter sets `environment.docker_image`.\n- **Network on by default.** SkillsBench setup steps run `apt-get` / `pip install`,\n  so the port enables `network_access = True` per rollout.\n- **`verifier/test.sh` runs inside the same sandbox.** At score time\n  `skillsbench_taskmd_reward` uploads `oracle/` → `/oracle` and `verifier/` →\n  `/verifier` from the host into the live sandbox. Because the upload happens\n  only then, the agent cannot peek at the oracle during its rollout.\n- **No bench-specific judge agents.** SkillsBench's `bench eval` wrapper provides\n  trajectory inspection and skill-coverage metrics; this port stays inside the\n  `verifiers.v1` contract and emits the single binary reward.\n","encoding":"utf-8","truncated":false,"total_bytes":7893},"status":null}