{"data":{"kind":"file","path":"README.md","version_id":"olvqp5bub38a4wbck1sq0vzz","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3947,"modified_at":"2026-06-19T00:42:47.850000","content_hash":"6fa074bfd9975ad5baa6045a1ba40836338acddcd98ea6bf95ff603ed2d01bcb"},"entries":[],"content":"# quest-deepresearch\n\n> **Attribution.** This environment vendors and builds on QUEST by the OSU-NLP-Group\n> ([repo](https://github.com/OSU-NLP-Group/QUEST), [arXiv:2605.24218](https://arxiv.org/abs/2605.24218)):\n> the `obj_task_eval` framework and the `QUEST-RL-Data` tasks + rubric scripts, both MIT-licensed\n> (© 2026 OSU-NLP-Group; see `obj_task_eval/QUEST_LICENSE.txt`). The only substantive change to\n> their code is replacing `tool_visit` (which required `verl` + Playwright) with a dependency-light\n> live fetcher. All credit for the agent, tasks, and rewards is theirs.\n\n### Overview\n- **Environment ID**: `quest-deepresearch`\n- **Short description**: Runs OSU-NLP **QUEST**'s released RL deep-research tasks against their original rubric-tree rewards, but re-grounds the agent-as-a-judge on the **live web** so the tasks are usable without QUEST's withheld cached corpus.\n- **Tags**: multi-turn, tools, deep-research, agent-as-judge, eval, train\n\n### Why this exists (the honest part)\nQUEST ([arXiv:2605.24218](https://arxiv.org/abs/2605.24218), [repo](https://github.com/OSU-NLP-Group/QUEST)) open-sourced 864 objective RL tasks, each with a ~740-line rubric-tree reward script, **and** the evaluation framework (`obj_task_eval`). But it **withheld the cached web corpus** those rewards were verified against (legal review). A static audit of all 864 scripts shows why that matters:\n\n- **~93%** of tasks have reward leaf-checks that re-verify the agent's **cited URLs** with an agent-as-a-judge (`evaluator.verify(claim, sources)`), not a static string match.\n- **~78%** of tasks hinge on **time-sensitive facts** (prices, rates, \"effective\"/\"current\" dates); **~22%** hardcode specific dollar amounts.\n- **65%** of all leaf checks are agent-judged; 35% are deterministic Python checks.\n\nSo as shipped, the RL data is **not directly trainable**: the rewards need a corpus that isn't released. This environment supplies the bridge by re-fetching cited URLs live (a dependency-light `visit` tool replaces QUEST's `verl`+Playwright chain, and the judge runs against an empty `CacheFileSys`). **Scores here therefore include live-web drift** since the snapshot, which is itself the quantity this env makes measurable.\n\n### Datasets\n- **Source**: `osunlp/QUEST-RL-Data` (HuggingFace). Vendored at `quest_tasks/rl_train.parquet` + `quest_tasks/eval_scripts/*.py`.\n- **Splits**: `objective` (864 tasks, executable rubric scripts) and `open-ended` (269 tasks; no tree2py script, not yet wired).\n\n### Task\n- **Type**: multi-turn tool use (`vf.ToolEnv`). Policy uses `search` + `visit`, then writes a final report with cited source URLs.\n- **Rubric**: the task's original QUEST verification tree (gate-then-average over critical/soft nodes), scored on the final answer. Reward = `final_score` in [0, 1].\n\n### Quickstart\n```bash\nprime env install quest_deepresearch\n# requires runtime keys (see below)\nprime eval run quest_deepresearch -m openai/gpt-4.1-mini -n 5 -a '{\"n\": 5, \"judge_model\": \"gpt-4.1-mini\"}'\n```\n\n### Runtime keys\n| Var | Needed for | Notes |\n| --- | --- | --- |\n| `OPENAI_API_KEY` | agent-as-a-judge reward | set `OPENAI_BASE_URL` to use any OpenAI-compatible judge |\n| `SERPER_API_KEY` | `search` tool | google.serper.dev |\n| `JINA_API_KEY` | higher-quality `visit` | optional; falls back to httpx + html2text |\n\nThe env **imports and installs without keys**; keys are only checked when an eval actually runs.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"objective\"` | `objective` (has rubric scripts) or `open-ended` |\n| `n` | int\\|None | `None` | cap number of tasks |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | model for the agent-as-a-judge reward |\n| `max_turns` | int | `12` | tool-use budget for the policy |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | QUEST rubric-tree `final_score` in [0,1] (gate-then-average) |\n","encoding":"utf-8","truncated":false,"total_bytes":3947},"status":null}