{"data":{"kind":"file","path":"README.md","version_id":"ecdxf66zcwgz9vsivf04aggg","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2340,"modified_at":"2026-02-20T04:18:53.033000","content_hash":"52858fe9c4f6a37b636a99399eb06b004a323ab23cf341f6a7742e780f4a16e0"},"entries":[],"content":"# Bio-Literature Search — Level 3\n\n**Open-ended scientific synthesis with LLM judge**\n\nAn RL environment where agents must synthesise findings across multiple biomedical papers and produce evidence-grounded scientific narratives.\n\n## Task\n\nGiven a synthesis question requiring sustained multi-paper reasoning, the agent must:\n1. Search for relevant abstracts\n2. Read them carefully\n3. Produce a factually accurate, well-evidenced scientific response\n\n## Question Types\n\n| Type | Example |\n|------|---------|\n| `findings_summary` | \"Summarize the key findings from the study on X in 2-3 sentences\" |\n| `method_identification` | \"What study design is used in research on Y?\" |\n| `clinical_implication` | \"What are the clinical implications of the findings on Z?\" |\n| `cross_study_comparison` | \"Compare the approaches in these two studies on X and Y\" |\n\n## Dataset\n\n~170 synthetic synthesis questions generated from PubMedQA `pqa_labeled`.\n\n- Train: ~140 questions\n- Test: ~30 questions\n\n## Reward\n\nLLM judge (GPT-4.1-mini) using a **universal rubric**:\n\n| Criterion | Points | What it catches |\n|-----------|--------|----------------|\n| Factually accurate relative to abstract | 3 | Wrong findings |\n| Free of hallucinated information | 3 | Made-up results, citations |\n| Covers key points | 2 | Missing important content |\n| Directly answers the question | 1 | Off-topic responses |\n| References specific abstract content | 1 | Unsupported assertions |\n\n**Hallucination penalty:** score × 0.2 if hallucination detected.\n\nOutput parsed with `json.loads()` — **no regex**.\n\n## Tools\n\n```python\nsearch_abstracts(query: str) -> list[{pmid, question_stub, relevance_score}]\nget_abstract(pmid: str) -> {pmid, abstract, section_labels, mesh_terms, overall_conclusion}\nget_mesh_terms(pmid: str) -> {pmid, mesh_terms}\ncompare_abstracts(pmid_list: str) -> list of abstract objects  # comma-separated PMIDs\n```\n\n## Install & Run\n\n```bash\nprime env install d42me/bio-search-l3\nexport OPENAI_API_KEY=your-key\nprime eval run d42me/bio-search-l3 --endpoint gpt-4.1-mini -n 10\n```\n\n## Required Environment Variables\n\n| Variable | Purpose |\n|----------|---------|\n| `OPENAI_API_KEY` | LLM judge (GPT-4.1-mini) for response scoring |\n\n## Training\n\n```bash\n# Set OPENAI_API_KEY first\nprime rl run configs/rl/bio-search-l3.toml --env-file secrets.env\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2340},"status":null}