{"data":{"kind":"file","path":"README.md","version_id":"ivt4j7ypqu92xn9vu5yisvg0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3052,"modified_at":"2026-01-13T23:59:49.029000","content_hash":"85f080dd1720ca269d4fb8a1cb825dc4edac4378ad5345687fc02a2571a34c69"},"entries":[],"content":"# shipd-harness\n\n### Overview\n- **Environment ID**: `shipd-harness`\n- **Short description**: Software engineering environment using XML-based tool calling for the dataish/shipd dataset\n- **Tags**: swe, multi-turn, sandbox, shipd\n\n### Datasets\n- **Primary dataset(s)**: `dataish/shipd` - Software engineering tasks with Docker images\n- **Source links**: [HuggingFace](https://huggingface.co/datasets/dataish/shipd)\n- **Split sizes**: train (primary evaluation split)\n\n### Task\n- **Type**: multi-turn, tool use\n- **Tool Format**: XML-based (`<tool:bash>`, `<tool:replace_file_contents>`, `<tool:submit/>`)\n- **Parser**: Custom XML parser (no OpenAI function calling)\n- **Rubric overview**: \n  - `solved`: Reward based on test pass/fail rates (PASS_TO_PASS + FAIL_TO_PASS)\n  - `has_error`: Detects infrastructure failures\n\n### Available Tools\n\n1. **Bash**: Execute shell commands\n   ```xml\n   <tool:bash>ls -la</tool:bash>\n   ```\n\n2. **Replace File Contents**: Edit files by string replacement\n   ```xml\n   <tool:replace_file_contents file=\"path/to/file.py\">\n   <replace><![CDATA[old code]]></replace>\n   <insert><![CDATA[new code]]></insert>\n   </tool:replace_file_contents>\n   ```\n\n3. **Submit**: Submit solution for evaluation\n   ```xml\n   <tool:submit/>\n   ```\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval shipd-harness\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval shipd-harness \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 8000 -T 0.7 \\\n  -a '{\"max_turns\": 50, \"repo_path\": \"/app\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"dataish/shipd\"` | HuggingFace dataset name |\n| `max_turns` | int | `50` | Maximum turns per episode |\n| `total_timeout_minutes` | int | `120` | Total sandbox timeout |\n| `sandbox_client_max_workers` | int | `10` | Max concurrent sandbox workers |\n| `test_timeout` | int | `300` | Timeout for test execution |\n| `turn_timeout` | int | `180` | Timeout per tool call |\n| `repo_path` | str | `\"/app\"` | Working directory in sandbox |\n| `start` | int | `0` | Starting index in dataset |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Combined pass rate: (FAIL_TO_PASS + PASS_TO_PASS) / 2 |\n| `has_error` | 1 if sandbox infrastructure error occurred |\n| `instance_completed` | True if agent called submit |\n\n### Architecture\n\nThis environment uses:\n- **Prime Sandboxes** (`prime-sandboxes`) for container management\n- **verifiers** (`vf.SandboxEnv`) as the base environment class\n- **pydantic-ai** for AI-powered test output parsing\n- Custom XML tool parsing (not OpenAI function calling)\n\n### Differences from deepswe-1\n\n| Aspect | deepswe-1 | shipd-harness |\n|--------|-----------|---------------|\n| Tool format | OpenAI function calling | XML tags |\n| Tools | file_editor, search, execute_bash, submit | bash, replace_file_contents, submit |\n| Prompt style | SWE-bench style | Shipd/custom style |\n| Dataset | Multiple (R2E, SWE-bench, swesmith, shipd) | shipd only |\n","encoding":"utf-8","truncated":false,"total_bytes":3052},"status":null}