{"data":{"kind":"file","path":"README.md","version_id":"f1nyvrt42q0wlqkslpc5l26m","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2561,"modified_at":"2026-01-29T01:00:08.073000","content_hash":"0753e9c3719ecfb7f6be0c21951d4e41c482a495fda43d253983581efe498c18"},"entries":[],"content":"# research-plan-gen\n\n### Overview\n- **Environment ID**: `research-plan-gen`\n- **Short description**: Benchmark for evaluating AI research planning capabilities using rubric-based LLM judging\n- **Tags**: research, planning, llm-judge, eval\n\n### Datasets\n- **Primary dataset(s)**: [facebook/research-plan-gen](https://huggingface.co/datasets/facebook/research-plan-gen) - Research Plan Generation dataset with evaluation rubrics\n- **Source links**: [HuggingFace](https://huggingface.co/datasets/facebook/research-plan-gen), [Paper](https://arxiv.org/abs/2512.23707)\n- **Subsets**:\n  - `ml`: 6,872 train / 685 test samples (Machine Learning papers)\n  - `arxiv`: 6,573 train / 1,496 test samples (arXiv papers)\n  - `pubmed`: 6,423 train / 464 test samples (PubMed papers)\n\n### Task\n- **Type**: single-turn\n- **Parser**: Default (plain text response)\n- **Rubric overview**: An LLM judge evaluates generated research plans against paper-specific rubric criteria\n\nEach sample contains:\n- **Goal**: A research task/objective to be accomplished\n- **Rubric**: A list of evaluation criteria for assessing the generated plan\n- **Reference solution**: A reference solution from the original paper\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval research-plan-gen\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval research-plan-gen \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 2048 -T 0.7 \\\n  -a '{\"subset\": \"ml\"}'\n```\n\nRun the script directly:\n\n```bash\nuv run python research_plan_gen.py --model gpt-4.1 --subset ml --num-examples 10\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `subset` | str | `\"ml\"` | Dataset subset: `\"ml\"`, `\"arxiv\"`, or `\"pubmed\"` |\n| `split` | str | `\"test\"` | Dataset split: `\"train\"` or `\"test\"` |\n| `num_samples` | int | `None` | Maximum number of samples to load (`None` loads all) |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | Model used as the LLM judge |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | LLM judge score (0.0-1.0) based on rubric satisfaction |\n| `mean_reward` | Average reward across all samples |\n| `std_reward` | Standard deviation of rewards |\n\n### Citation\n\n```bibtex\n@article{goel2025training,\n  title={Training AI Co-Scientists Using Rubric Rewards},\n  author={Goel, Shashwat and Hazra, Rishi and Jayalath, Dulhan and Willi, Timon and Jain, Parag and Shen, William F and Leontiadis, Ilias and Barbieri, Francesco and Bachrach, Yoram and Geiping, Jonas and Whitehouse, Chenxi},\n  journal={arXiv preprint arXiv:2512.23707},\n  year={2025}\n}\n```","encoding":"utf-8","truncated":false,"total_bytes":2561},"status":null}