{"data":{"kind":"file","path":"README.md","version_id":"jpo9b9ogqe539w5jx8bd5sb4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1994,"modified_at":"2026-02-20T04:18:39.512000","content_hash":"080385b234a2e1b5aea2d584f99c72fc1af661ed93c43971804e8ebb500705bd"},"entries":[],"content":"# Bio-Literature Search — Level 2\n\n**Multi-abstract comparison and structural reasoning**\n\nAn RL environment where agents must call multiple tools and reason over structured metadata.\n\n## Task\n\nGiven multi-step biomedical research questions, the agent must retrieve and compare information across multiple abstracts.\n\n## Question Types\n\n| Type | Example | What agent must do |\n|------|---------|-------------------|\n| `section_exists` | \"Does abstract X have a METHODS section?\" | Search → get_abstract → check labels |\n| `mesh_overlap` | \"Do papers A and B share MeSH terms?\" | Get both → check mesh intersection |\n| `decision_match` | \"Do studies A and B reach the same conclusion?\" | Get both → compare final_decision |\n| `context_count` | \"How many paragraphs does abstract X have?\" | Get abstract → count sections |\n| `search_section` | \"Find paper about [topic]; does it have RESULTS?\" | Search → get_abstract → check |\n\n## Dataset\n\nProgrammatically generated from PubMedQA `pqa_labeled` — ~900 synthetic questions.\n\n- Train: ~880 questions (from 700 base records)\n- Test: ~230 questions (from 300 base records)\n\n## Reward\n\nDeterministic binary: exact match for yes/no answers, exact match for integer counts.\nParsing uses `str.find()` — **no regex**.\n\n## Tools\n\n```python\nsearch_abstracts(query: str) -> list[{pmid, question_stub, relevance_score}]\nget_abstract(pmid: str) -> {pmid, abstract, section_labels, mesh_terms, num_paragraphs}\nget_mesh_terms(pmid: str) -> {pmid, mesh_terms}\n```\n\n## Eval Results\n\n| Model | pass@1 | Notes |\n|-------|--------|-------|\n| gpt-4.1-mini (n=20) | 0.85 | Initial run |\n| gpt-4.1-mini (n=50, r=2) | 0.92 | More samples |\n| Qwen3-30B-A3B (base, step 0) | 0.83 | Training baseline |\n\n## Install & Run\n\n```bash\nprime env install d42me/bio-search-l2\nprime eval run d42me/bio-search-l2 --endpoint gpt-4.1-mini -n 20\n```\n\n## Required Environment Variables\n\nNone.\n\n## Training\n\n```bash\nprime rl run configs/rl/bio-search-l2-run1.toml\n```\n","encoding":"utf-8","truncated":false,"total_bytes":1994},"status":null}