{"data":{"kind":"file","path":"README.md","version_id":"gemx5p9z3f2wxiin3yffsf8i","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7551,"modified_at":"2026-02-06T16:04:10.937000","content_hash":"0c64be965f114591d1665bf96d8334ee3843acc3db24ffefaa72e3981abd544b"},"entries":[],"content":"# Predictive Validity in Drug Discovery\n\nAn RL environment for budgeted drug candidate screening, based on [Scannell et al. (2022)](https://doi.org/10.1038/s41573-022-00552-x) *\"Predictive validity in drug discovery: what it is, why it matters and how to improve it\"* (Nature Reviews Drug Discovery).\n\n## Core Idea\n\nDrug candidates have hidden **clinical utility** — how well they'd work in patients. Screening tools score candidates, but scores are only *imperfectly correlated* with true utility. The correlation (\"**predictive validity**\", ρ) is the key parameter: small improvements in ρ can matter more than testing 10× or 100× as many candidates.\n\n> **\"Quality beats quantity\"** — the central thesis of the paper.\n\n## Data Sources\n\nThe environment supports two data modes:\n\n### Synthetic Mode (`data_source=\"synthetic\"`, default)\nGaussian latent variable model from the paper. Candidate pools are generated from `y ~ N(0,1)` with controlled correlations. Good for rapid iteration and controlled experiments.\n\n### ChEMBL Mode (`data_source=\"chembl\"`)\n**Real compound data** from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. Each episode presents real molecules tested against a real biological target:\n\n| Target | Compounds | Clinical | ρ (binding) | ρ (functional) |\n|--------|-----------|----------|-------------|----------------|\n| D(2) dopamine receptor | 464 | 26 | 0.43 | 0.97 |\n| Cannabinoid receptor 1 | 446 | 10 | 0.60 | 0.97 |\n| EGFR | 347 | 10 | 0.74 | 0.97 |\n| VEGFR2 | 206 | 6 | 0.59 | 0.93 |\n| Src kinase | 156 | 3 | 0.53 | 0.91 |\n| Serotonin transporter | 74 | 5 | 0.77 | 0.92 |\n| HERG channel | 60 | 51 | 0.82 | 0.97 |\n| HER2/erbB-2 | 53 | 11 | 0.56 | 0.96 |\n\nTool mapping:\n- `screen_low` → Binding assay pIC50 (cheaper, less predictive)\n- `screen_high` → Functional/cell-based assay pIC50 (more expensive, more predictive)\n- `y` → Hidden clinical utility (functional potency or clinical phase outcome)\n\nRealistic measurement noise (σ_low=0.5, σ_high=0.3 log units) ensures neither tool perfectly reveals the true utility, preserving the quality-vs-quantity trade-off.\n\n## Environment Design\n\n### Tools\n| Tool | Cost | Predictive Validity (ρ) | Description |\n|------|------|------------------------|-------------|\n| `screen_low(n)` | Low (default: 1/candidate) | 0.2–0.5 (synthetic) / 0.43–0.82 (chembl) | Cheap, noisy screening of n NEW candidates |\n| `screen_high(n)` | High (default: 5/candidate) | 0.55–0.85 (synthetic) / 0.91–0.97 (chembl) | Expensive, more accurate screening of n NEW candidates |\n| `rescreen(candidate_ids, tool)` | Same as tool used | Same as tool used | Re-screen PREVIOUSLY SEEN candidates with either tool |\n\n### Action\nThe agent must select **one** candidate to advance (must have been screened):\n```xml\n<answer>C0042</answer>\n```\n\n### Reward\nDense sigmoid reward based on the chosen candidate's hidden utility:\n```\nreward = σ((y_chosen - threshold) / temperature) - cost_penalty\n```\n\n## What the Agent Should Learn\n\n1. **When to pay for quality**: Higher-ρ tools find better candidates despite screening fewer\n2. **Budget allocation**: Balance cheap broad screening vs. expensive targeted screening\n3. **Funnel strategies**: Broad low-ρ screen → shortlist → targeted high-ρ rescreen\n4. **Avoiding over-testing**: Don't waste budget on diminishing returns\n\n## Metrics\n\n| Metric | Description |\n|--------|-------------|\n| `pick_reward` | Primary reward (sigmoid utility − cost penalty) |\n| `binary_success` | 1 if chosen candidate above utility threshold |\n| `chosen_utility` | Raw clinical utility of selected candidate |\n| `oracle_regret` | y_max − y_chosen (lower is better; 0 = optimal) |\n| `percentile_rank` | Percentile of chosen candidate within pool (0–100) |\n| `chose_screened` | 1 if candidate was actually screened (sanity check) |\n| `budget_efficiency` | Fraction of budget remaining |\n| `total_screened` | Total candidates screened |\n| `num_low_screens` | Number of `screen_low` calls |\n| `num_high_screens` | Number of `screen_high` calls |\n| `num_rescreens` | Number of `rescreen` calls |\n\n## Installation\n\n```bash\nprime env install predictive-validity\n```\n\n## Usage\n\n### Local Evaluation (Synthetic)\n```bash\nprime eval run predictive-validity -m gpt-4.1-mini\n```\n\n### Local Evaluation (ChEMBL)\n```bash\nprime eval run predictive-validity -m gpt-4.1-mini -- --data_source chembl\n```\n\n### Custom Parameters\n```python\nfrom predictive_validity import load_environment\n\n# Synthetic mode (default)\nenv = load_environment(\n    num_examples=200,\n    data_source=\"synthetic\",\n    n_pool=5000,\n    rho_low_range=(0.2, 0.5),\n    rho_high_range=(0.55, 0.85),\n    budget=100.0,\n    cost_low=1.0,\n    cost_high=5.0,\n    threshold=1.5,\n    temperature=0.5,\n    max_turns=15,\n)\n\n# ChEMBL mode (real drug data)\nenv = load_environment(\n    num_examples=200,\n    data_source=\"chembl\",\n    utility_mode=\"assay_potency\",  # or \"clinical_phase\"\n    budget=100.0,\n    cost_low=1.0,\n    cost_high=5.0,\n    temperature=0.5,\n    max_turns=15,\n)\n```\n\n### ChEMBL Utility Modes\n- `\"assay_potency\"` (default): `y` = functional pIC50. Dense, continuous signal — recommended for training.\n- `\"clinical_phase\"`: `y` = clinical phase score (Phase I=1, II=2, III=3, Approved=4) or sigmoid proxy. Sparse but more realistic.\n\n### Difficulty Tuning\n- **Threshold / base_rate** (synthetic): Higher threshold = rarer good candidates.\n- **ρ ranges** (synthetic): Narrower gap makes quality-vs-quantity subtler.\n- **Budget**: Lower budgets force harder allocation decisions.\n- **Pool size** (synthetic): Larger pools mean more candidates but same budget.\n- **Temperature**: Lower = sharper reward around threshold.\n\n## Data Curation (ChEMBL)\n\nThe bundled `data/chembl_pools.parquet` was curated from ChEMBL by:\n1. Finding targets with 200+ compounds in binding assays AND 80+ in functional assays\n2. Extracting median pIC50 per compound per assay type (binding + functional)\n3. Looking up max clinical phase per compound\n4. Filtering for 50+ compounds with both assay types\n\nTo re-run curation (requires `chembl-webresource-client`):\n```bash\nuv add --dev chembl-webresource-client\nuv run python environments/predictive_validity/data/curate_chembl.py\n```\n\n## Design Rationale\n\n### Why StatefulToolEnv?\nThe agent interacts via tool calls (`screen_low`, `screen_high`, `rescreen`), matching the verifiers `StatefulToolEnv` pattern. Per-rollout hidden state is injected via `update_tool_args`.\n\n### Why Measurement Noise (ChEMBL mode)?\nReal assay measurements have run-to-run variability (0.3–0.5 log units for pIC50). Adding noise ensures neither tool perfectly reveals the hidden utility, maintaining the quality-vs-quantity decision the agent must learn.\n\n### Why Rescreen?\nThe paper's R&D model is sequential: cheap broad screen → shortlist → expensive re-screen. Without `rescreen`, the funnel strategy is impossible.\n\n### Why Dense Reward?\nBase rates in drug discovery are very low (~0.1%). Sparse binary reward would be brutal for RL. The sigmoid provides continuous gradient while staying faithful to the threshold concept.\n\n## References\n\n- Scannell, J.W. et al. (2022). Predictive validity in drug discovery: what it is, why it matters and how to improve it. *Nature Reviews Drug Discovery*, 21, 915–931.\n- Scannell, J.W. & Bosley, J. (2016). When quality beats quantity: decision theory, drug discovery, and the reproducibility crisis. *PLoS ONE*, 11(2), e0147215.\n- [ChEMBL Database](https://www.ebi.ac.uk/chembl/) — Curated database of bioactive molecules.\n","encoding":"utf-8","truncated":false,"total_bytes":7551},"status":null}