{"data":{"kind":"file","path":"README.md","version_id":"yc61glia9aroxm5pc3faoprs","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1729,"modified_at":"2026-02-03T16:17:15.948000","content_hash":"e8b219f0cdb7a4b65e8d19172a2888f9480f92a963f1a9a1ecf2ee0b4b3a8831"},"entries":[],"content":"# apex-prime-demo\n\nThis is a small demo environment packaged for Prime-compatible evaluation.\n\nWhat it provides:\n- A tiny subset of tasks (3) referencing 1 embedded world snapshot (zip file).\n- A minimal tool harness for browsing and reading files inside the extracted world.\n- Two scoring modes:\n  - `scoring=\"gold\"` (default): self-contained scoring using `gold_response` text\n  - `scoring=\"judge\"`: rubric criteria scoring via an LLM judge\n\n## Quickstart (local)\n\nFrom a Prime workspace that has this environment in `./environments/`:\n\n```bash\nprime env install apex-prime-demo\nprime eval run apex-prime-demo -m openai/gpt-4.1-mini\n```\n\n## Environment Arguments\n\nPass args with `-a/--env-args` (JSON).\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_examples` | int | `-1` | Limit dataset size (`-1` = all). |\n| `max_turns` | int | `30` | Max tool-using turns per rollout. |\n| `scoring` | str | `\"gold\"` | `\"gold\"` (no external judge) or `\"judge\"` (judge rubric). |\n| `tool_choice` | str | `\"none\"` | `\"none\"` disables tool calling (works on hosted RL); `\"auto\"` enables tool calling (requires tool-call support in the inference server). |\n| `judge_model` | str | `\"openai/gpt-4.1-mini\"` | Model used to grade rubric criteria. |\n\n## Embedded Worlds\n\nThis demo embeds the following world zip files in the package:\n- `world_95fe2c7d53ae4120b830d30539506334`\n\n## Notes / limitations\n\n- The harness currently supports reading common file types (txt/json/ics/docx/xlsx/pdf/pptx/mbox) and basic writes for text files.\n- If you need grading to incorporate filesystem artifacts (e.g. created/edited files), we should extend the rubric to read and summarize those artifacts into the judge prompt.\n","encoding":"utf-8","truncated":false,"total_bytes":1729},"status":null}