{"data":{"kind":"file","path":"README.md","version_id":"mdw4vod58mctn509cyctkid0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3332,"modified_at":"2026-04-23T21:56:36.788000","content_hash":"131aabfb0693c18c93242b385f472d25bc66ab58246f16ccb2d71ccd80a132c7"},"entries":[],"content":"# awm-harness-1k\n\nA Prime Intellect / `verifiers` environment that turns the Hugging Face\ndataset [`arcee-train/awm_1k_fixed_harness`](https://huggingface.co/datasets/arcee-train/awm_1k_fixed_harness)\ninto an interactive tool-use sandbox. The HF dataset bundles 1,000\nindependent AWM scenarios — each with its own FastAPI server code, SQLite\nschema, sample data, API spec and passing validation report — and this\npackage wraps them into one isolated environment per rollout.\n\n## What a rollout looks like\n\nFor every example the environment:\n\n1. Materialises a fresh SQLite DB from the scenario's `db` + `sample`.\n2. Spawns the scenario's FastAPI server (the `env` column) against that DB\n   on a random port, in a subprocess that is torn down at the end of the\n   rollout.\n3. Builds OpenAI-native tool specs from the scenario `spec` (`api_groups`\n   → function schemas) and exposes them to the agent.\n4. Dispatches each `tool_calls` entry as an HTTP request and streams the\n   response body back as the tool message.\n\nTasks are generated automatically from each scenario's `val_report`: we\npick one (or more) endpoints that passed validation and ask the agent to\naccomplish that endpoint's natural-language description. Primary reward\nis `endpoint_hit = 1.0` iff the target `operation_id` appears in the\ntrajectory's tool-call sequence.\n\n## Rubric\n\n| Metric | Weight | Notes |\n| --- | --- | --- |\n| `endpoint_hit` | 1.0 | Primary reward: did the agent call the target operation? |\n| `tool_call_count` | 0.0 | Monitor-only total tool calls. |\n| `llm_judge` (optional) | 0.0 | LLM judge over `(task, trajectory)`. |\n\n## Quickstart\n\n```bash\nprime env install awm-harness-1k\nexport OPENAI_API_KEY=...   # or OPENROUTER_API_KEY\nprime eval run awm-harness-1k -n 5\n```\n\nOr from Python:\n\n```python\nimport verifiers as vf\nenv = vf.load_environment(\"awm-harness-1k\", max_scenarios=5, tasks_per_scenario=1)\nresults = env.evaluate(client=my_openai_client, model=\"gpt-4.1-mini\", num_examples=5)\n```\n\n## Arguments (`load_environment`)\n\n| Arg | Default | Description |\n| --- | --- | --- |\n| `tasks_per_scenario` | `1` | Endpoint-level tasks per scenario (total tasks ≈ `tasks_per_scenario × num_scenarios`). |\n| `max_scenarios` | `None` | Cap on scenarios loaded from HF. |\n| `scenario_filter` | `None` | Optional list of scenario names to keep (e.g. `[\"e_commerce_34\"]`). |\n| `max_turns` | `20` | Agent turn budget per rollout. |\n| `seed` | `0` | Controls which endpoint(s) are sampled per scenario. |\n| `include_llm_judge` | `False` | Add LLM judge metric (monitor-only). |\n| `judge_model` | `\"gpt-4.1-nano\"` | Judge model. |\n| `judge_base_url` | `None` | Custom base URL for judge. |\n| `dataset_name` | `\"arcee-train/awm_1k_fixed_harness\"` | HF dataset to load. |\n\n## Notes\n\n* The HF dataset is streamed at load time (no bundled data), so the first\n  run will cache ~500 MB under `HF_DATASETS_CACHE`.\n* Each rollout spawns one FastAPI subprocess on a random port; subprocess\n  stdout/stderr are redirected to a per-rollout log file to avoid pipe\n  deadlock, and `PYTHONWARNINGS=ignore` is set to silence Pydantic v2\n  deprecation spam in the scenario code.\n* Related environment for a single scenario with richer DAG-based tasks:\n  [`awm-ecom34-motifs`](https://app.primeintellect.ai/dashboard/environments/anushka/awm-ecom34-motifs).\n","encoding":"utf-8","truncated":false,"total_bytes":3332},"status":null}