{"data":{"kind":"file","path":"README.md","version_id":"k9egryj1g0380yngl05fdngo","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3684,"modified_at":"2026-02-03T17:47:09.559000","content_hash":"ddfbc22dcf26ed0904e946feae0bd45d25bce1b054c46ff16c2800ed0671baf8"},"entries":[],"content":"# Variant Impact Prediction\n\nHigh-signal single-turn RL environment for clinical variant classification. The model predicts clinical significance (five-class) and an optional mechanism label from gene/variant context. Designed for strong, objective reward signals with partial credit.\n\n## Task\nGiven a gene and variant context, predict:\n- `clinical_significance`: one of `pathogenic`, `likely_pathogenic`, `uncertain`, `likely_benign`, `benign`\n- `mechanism`: one of `loss_of_function`, `gain_of_function`, `dominant_negative`, `unknown`\n\n## Output Format\nReturn **only JSON**:\n```json\n{\"clinical_significance\":\"pathogenic\",\"mechanism\":\"loss_of_function\"}\n```\n\n## Scoring\n- Exact label match: `1.0`\n- Correct polarity (pathogenic vs benign): `0.7`\n- Predicting `uncertain` when not uncertain (or vice versa): `0.3`\n- Mechanism bonus: `+0.1` if correct (capped at `1.0` total)\n\n## Dataset Schema (JSONL)\nEach line is a JSON object with these fields:\n- `gene` (string)\n- `variant` (string, typically protein HGVS)\n- `hgvs_c` (string, optional)\n- `variant_type` (string, optional)\n- `clinical_significance` (string, required)\n- `mechanism` (string, optional, default `unknown`)\n- `condition` (string, optional)\n- `review_status` (string, optional)\n- `num_submitters` (string or int, optional)\n- `protein_sequence` (string, optional)\n\nThe environment will also accept datasets with different column names and attempt to map them to the schema.\n\n## Build a Dataset (Local)\nA helper script is provided to convert a TSV/CSV into JSONL:\n```bash\npython scripts/build_dataset.py \\\n  --input path/to/variant_summary.txt.gz \\\n  --output data/clinvar.jsonl \\\n  --delimiter '\\t'\n```\n\nFor high-precision labels, filter by review status and create a gene-level split:\n```bash\npython scripts/build_dataset.py \\\n  --input path/to/variant_summary.txt.gz \\\n  --output data/clinvar_train.jsonl \\\n  --eval-output data/clinvar_eval.jsonl \\\n  --split-by-gene \\\n  --eval-ratio 0.1 \\\n  --review-status \"practice_guideline,reviewed_by_expert_panel,criteria_provided_multiple_submitters_no_conflicts\"\n```\n\nIf your source uses different column names, pass them explicitly:\n```bash\npython scripts/build_dataset.py \\\n  --input my_variants.tsv \\\n  --output data/variants.jsonl \\\n  --col-gene GeneSymbol \\\n  --col-variant HGVS \\\n  --col-clinical_significance ClinicalSignificance\n```\n\n## Hosted Dataset\nThe recommended hosted dataset is on Hugging Face:\n`tylergolato/clinvar-variant-impact`\n\n## Quick Start\nInstall the environment locally:\n```bash\nprime env install variant-impact -p ./environments\n```\n\nRun a local evaluation:\n```bash\nprime eval run variant-impact -m gpt-5-nano\n```\n\n## Configuration\n`load_environment` accepts these parameters:\n- `dataset_name`: Hugging Face dataset name to load\n- `data_path`: JSON/JSONL path for training data\n- `eval_data_path`: JSON/JSONL path for evaluation data\n- `split`: training split name when using `dataset_name`\n- `eval_split`: evaluation split name when using `dataset_name`\n- `eval_ratio`: fallback evaluation ratio if `eval_split` is missing\n- `max_train`, `max_eval`: cap the number of examples\n- `include_sequence`: include protein sequence in the prompt when available\n- `max_sequence_len`: truncate sequence to this length\n- `lazy_load`: if true, pass dataset builders instead of loading immediately\n- `system_prompt`: override the system prompt\n\n## Notes\n- Default dataset: `tylergolato/clinvar-variant-impact` on Hugging Face with `train` and `validation` splits.\n- If no dataset is provided, the environment falls back to a small synthetic dataset for smoke tests.\n- For real training, provide ClinVar or other curated variant datasets with clinical labels.\n","encoding":"utf-8","truncated":false,"total_bytes":3684},"status":null}