{"data":{"kind":"file","path":"README.md","version_id":"id8myk0pcprrwdmf0wdkpl7e","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4642,"modified_at":"2026-06-03T15:08:16.844000","content_hash":"6d8cf45463ff1fe8df84911b48da7e57183e0634292120edd76a6da52d6e9bac"},"entries":[],"content":"# pi_apex_agents\n\nThis is a minimal Verifiers v1 taskset for\n[mercor/apex-agents](https://huggingface.co/datasets/mercor/apex-agents).\nThe environment uses the generic Verifiers v1 harness interface, so any\ncompatible harness can be supplied through `[eval.harness]`.\n\nSandbox setup downloads the task's `world_files_zipped/<world_id>.zip` snapshot\nfrom Hugging Face, extracts it, and then overlays `task_files/<task_id>/` in the\nsame order as the Archipelago example runner. Agent-visible files are therefore\navailable at:\n\n- `/workspace/filesystem`\n- `/workspace/.apps_data`\n- `/workspace/input_manifest.txt`\n\nTask prompts are passed through from the dataset without adding a taskset system\nprompt, so each harness keeps its own prompt format.\n\nThe reward reads `/workspace/final_answer.txt` when a harness writes it, falling\nback to the harness completion otherwise. It grades with the same reference\nrubric shape used by\n[Mercor-Intelligence/archipelago](https://github.com/Mercor-Intelligence/archipelago):\neach rubric criterion is judged independently with Verifiers' `JudgeRubric`,\nthe judge JSON is parsed with the v1 judge utilities, and each criterion passes\nonly when its score is at least `0.99`.\n\nThe weight-1 `task_reward` is binary: `1.0` only when every criterion passed.\nThe partial result `passed_count / total_count` is emitted as `partial_reward`\nwith weight `0.0`, and `passed_count` and `total_count` are emitted as metrics\nfor auditing. 
For file-output tasks, the reward also extracts changed workspace\ndocuments,\nspreadsheets, slide decks, PDFs, and text files and appends their readable\ncontents to the solution shown to the judge.\n\n## Tooling Assumption\n\nArchipelago exposes calendar, chat, code, document, filesystem, mail, PDF,\npresentation, and spreadsheet MCP servers. I checked the public tool packages\nbefore implementing this version. The code/document/PDF/spreadsheet/presentation\ntools are mostly wrappers around standard Python libraries such as `pandas`,\n`openpyxl`, `python-docx`, `python-pptx`, `pypdf`, `pdfplumber`, `pymupdf`, and\n`reportlab`; this matches Epoch's note that many benchmark tools wrap common Python\npackages.\n\nThis taskset therefore makes those Python packages available directly in the\nsandbox instead of recreating each MCP wrapper. The default sandbox starts from\n`python:3.11-slim` and installs the library set in `harness.program.setup`\nafter the selected command harness has installed itself and before task setup\ndownloads the Hugging Face world files. Installing this way makes the packages\navailable to the image's normal `python3`, which is what coding harnesses tend\nto expose to the agent inside shell commands.\nCalendar, chat, and mail are not pure library wrappers; they expose structured\napp state. This implementation exposes the raw `.apps_data` files and assumes a\ncoding harness can inspect and modify them directly when needed.\n\nOptional Archipelago code extras for medicine/scientific-computing worlds\n(`pydicom`, `biopython`, `openmm`, `pyhmmer`, `particle`) are not installed by\ndefault because the released domains are law, investment banking, and\nmanagement consulting. 
Add them under `[eval.harness.program.sandbox].packages`\nif a future shard needs them.\n\n## Running\n\nInstall the environment from the repository root:\n\n```bash\nuv pip install -e ./environments/pi_apex_agents\n```\n\nRun with a local TOML config, for example:\n\n```toml\n[[eval]]\nenv_id = \"pi_apex_agents\"\n\n[eval.harness]\nid = \"harnesses.mini_swe_agent\"\nmax_turns = 100\n```\n\nThe dataset is gated. The host must expose one of `HF_TOKEN`,\n`HUGGINGFACE_HUB_TOKEN`, or `~/.cache/huggingface/token`; the environment passes\nthat token only to the sandbox setup download step and removes the temporary\ntoken file before the agent starts.\n\nThe grader uses Pinference at `https://dev-inference.pinference.ai/api/v1` with\n`google/gemini-2.5-flash` and reads its API key from `PRIME_API_KEY`.\n\nArchipelago's example config names the same judge as\n`gemini/gemini-2.5-flash`; Pinference currently exposes it as\n`google/gemini-2.5-flash`.\n\n## Useful Knobs\n\nTOML examples:\n\n```toml\n[eval.taskset]\nmax_tasks = 3\ndomains = [\"Law\"]\ntask_ids = [\"task_0b9134a634c14f24a6c256d034a6c130\"]\n\n[eval.harness]\nmax_turns = 100\n```\n\nTo use a different harness in TOML, add an `id` under `[eval.harness]` and pass\nthat harness's config fields there:\n\n```toml\n[eval.harness]\nid = \"harnesses.terminus_2\"\nmax_turns = 1\n```\n\n### Changelog\n\n- `0.1.0`: Initial taskset with hardcoded\n  `mercor/apex-agents` train split, generic Verifiers v1 harness wiring,\n  preinstalled Python sandbox libraries, and artifact-aware rubric grading.\n","encoding":"utf-8","truncated":false,"total_bytes":4642},"status":null}