{"data":{"kind":"file","path":"README.md","version_id":"t71lsjgrs1blj9v29pqb1kys","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5965,"modified_at":"2026-04-09T16:24:41.492000","content_hash":"b84b827fa9beb780d8e2bbd0c4b4e6fd3f47bab50da573b1a609a3017a2abc25"},"entries":[],"content":"# geo-rl\n\n### Overview\n- **Environment ID**: `geo-rl`\n- **Short description**: Single-turn geoscience QA with strict binary eval and answer-only dense train reward.\n- **Tags**: `single-turn`, `geoscience`, `qa`, `train`, `eval`\n\n### Dataset\n- **Primary dataset**: `diicell/geo-sci-qa`\n- **Default splits**: `train` for training, `eval` for evaluation\n- **Task coverage**: petroleum geology, reservoir engineering, drilling engineering, production engineering, geophysics, well logging, and mineral exploration\n- **Answer types**: `term`, `classification`, `numeric`, `yes_no`\n\nThe environment loads the dataset through Hugging Face `datasets.load_dataset(...)`. For local debugging you can point `dataset_name` at a local HF dataset directory if needed.\n\n### Task Contract\n- **Type**: `SingleTurnEnv`\n- **Output format**: XML with both `<reasoning>` and `<answer>`\n- **Reasoning expectations**: solve privately, then provide only a concise final justification in 2-5 sentences and under 120 words\n- **Answer expectations**: canonical term/value only; `yes`/`no` exactly for boolean questions\n\n### Reward Profiles\n- `eval`: strict binary correctness only\n- `train`: deterministic dense shaping from answer quality plus format compliance\n\nTrain profile weights:\n\n| Reward | Weight | Signal |\n| --- | --- | --- |\n| `answer_quality` | `0.90` | Exact generic match or capped partial credit |\n| `format_score` | `0.10` | XML format adherence |\n\nEval profile weights:\n\n| Reward | Weight |\n| --- | --- |\n| `correct_answer` | `1.0` |\n\n### Answer Matching\n- The environment uses only generic normalization in code:\n  - lowercase\n  - whitespace collapse\n  - dash normalization\n  - outer punctuation trimming\n  - apostrophe and possessive folding\n  - parenthetical clarifier stripping\n  - numeric parsing and unit normalization\n- There are no answer-specific equivalence tuples or geology-specific alias tables in the rubric.\n- `term` and `classification` partial credit uses token-F1 against accepted answers, capped at `0.60`.\n- `numeric` partial credit uses value closeness plus unit policy from `answer_spec`, capped at `0.60`.\n- `yes_no` answers are exact-only.\n- Train and eval both score the last well-formed `<reasoning>...</reasoning><answer>...</answer>` block.\n- Visible scratchpad or duplicate XML blocks can still receive answer credit if the final block is valid, but they receive a graded `format_score` penalty instead of a single fallback bucket.\n\n### Format Scoring\n- `1.0`: exactly one final XML block pair with no extra assistant text outside it\n- intermediate values in `(0, 1)`: a valid final XML block exists, but the score is reduced deterministically by:\n  - visible `<think>` / scratchpad leakage\n  - duplicate `<answer>` / `<reasoning>` tags\n  - leading or trailing assistant text outside the final XML block\n  - final `<reasoning>` blocks that violate the 2-5 sentence / 120-word contract\n- `0.0`: no usable final XML block exists\n\n### Dataset-Side Variants\n- Current datasets with only `answer` still work.\n- The environment also supports optional variant metadata inside `info`:\n  - `accepted_answers: list[str]`\n  - `answer_spec: { canonical_answer, accepted_answers?, numeric_value?, numeric_unit?, units_required?, absolute_tolerance?, relative_tolerance? }`\n- Valid variants should live in data, not in rubric code.\n\n### Quickstart\nInstall locally:\n\n```bash\nprime env install geo-rl\n```\n\nSmoke eval with strict binary scoring:\n\n```bash\nprime eval run geo-rl -m gpt-4.1-mini -n 10\n```\n\nSmoke eval with train-profile shaping:\n\n```bash\nprime eval run geo-rl -m gpt-4.1-mini -n 10 -a '{\"reward_profile\":\"train\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | --- | --- | --- |\n| `dataset_name` | `str` | `\"diicell/geo-sci-qa\"` | HF dataset repo or local HF dataset path |\n| `train_split` | `str` | `\"train\"` | Training split name |\n| `eval_split` | `str` | `\"eval\"` | Evaluation split name |\n| `reward_profile` | `str` | `\"eval\"` | `eval` for binary reward, `train` for dense answer-only shaping |\n| `min_train_total_quality` | `int \\| null` | `18` | Minimum `total_quality` retained for train |\n| `min_eval_total_quality` | `int \\| null` | `null` | Optional eval quality floor |\n| `domains` | `str \\| list[str] \\| null` | `null` | Optional domain filter |\n| `difficulties` | `str \\| list[str] \\| null` | `null` | Optional difficulty filter |\n| `answer_types` | `str \\| list[str] \\| null` | `null` | Optional answer-type filter |\n| `num_examples` | `int` | `-1` | Limit train examples after filtering |\n| `num_eval_examples` | `int` | `-1` | Limit eval examples after filtering |\n| `seed` | `int` | `42` | Seed used when sampling a limited subset |\n\n### Metrics\n\n| Metric | Meaning |\n| --- | --- |\n| `correct_answer` | Strict binary normalized correctness |\n| `answer_quality` | Main dense train-answer score before weighting |\n| `answer_partial` | Partial-credit component used when exact correctness fails |\n| `parse_success` | Whether strict XML parsing of `<answer>...</answer>` succeeded |\n| `exact_surface_match` | Case-insensitive raw answer match against accepted answers |\n| `normalized_match` | Match after answer-type-aware generic normalization |\n| `unit_match` | Numeric-unit compatibility result |\n| `format_score` | XML format adherence score |\n| `answer_tag_count` | Number of `<answer>` openings in the raw assistant output |\n| `reasoning_tag_count` | Number of `<reasoning>` openings in the raw assistant output |\n| `has_visible_think` | Whether visible `<think>` / `<thinking>` tags were present |\n| `used_final_xml_block` | Whether the scorer used a final well-formed XML block |\n| `leading_text_before_final_xml` | Whether extra assistant text appeared before the final XML block |\n| `trailing_text_after_final_xml` | Whether extra assistant text appeared after the final XML block |\n| `reasoning_length` | Reasoning word count |\n| `completion_length` | Assistant completion character count |\n","encoding":"utf-8","truncated":false,"total_bytes":5965},"status":null}