{"data":{"kind":"file","path":"README.md","version_id":"pqi2i2nbsrox80rlghtk8xku","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3459,"modified_at":"2025-10-06T00:21:07.914000","content_hash":"afbcd6c0c52a15b3ed83c763bda0719f5e4e1fdc495aeed636052ec2cbd05866"},"entries":[],"content":"# chem-QA\n\n### Overview\n- **Environment ID**: `chem-qa`\n- **Short description**: Tolerance-scored numeric chemistry problems from SciBench Chemistry dataset\n- **Tags**: chemistry, science, numeric-reasoning, single-turn, tolerance-scoring\n\n### Datasets\n- **Primary dataset(s)**: SciBench Chemistry - numeric chemistry problems requiring quantitative reasoning\n- **Source links**: [lmms-lab/SciBench_Chemistry](https://huggingface.co/datasets/lmms-lab/SciBench_Chemistry)\n- **Split sizes**: Test: 266 examples\n\n### Task\n- **Type**: Single-turn\n- **Parser**: Custom numeric parser with LaTeX/scientific notation support\n- **Rubric overview**:\n  - **numeric_correctness** (weight: 1.0): Relative tolerance-based scoring (default ±3%)\n  - **unit_match** (weight: 0.05): Bonus for including correct units\n  - **parseable** (weight: 0.0): Metric tracking if answer is parseable\n  - **mape** (weight: 0.0): Mean Absolute Percentage Error metric\n\n### Quickstart\n\nInstall the environment:\n```bash\nprime env install mxz/chem-qa\n```\n\nRun an evaluation with default settings:\n```bash\nuv run vf-eval chem-qa -m gpt-4o -n 100 -r 3\n```\n\nConfigure model and sampling:\n```bash\nuv run vf-eval chem-qa -m gpt-4o -n 100 -r 3 -t 1024 -T 0.7\n```\n\nWith custom environment arguments:\n```bash\nuv run vf-eval chem-qa -m gpt-4o -n 100 -r 3 -a '{\"tolerance\": 0.05, \"unit_bonus\": 0.1}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"test\"` | Dataset split to use |\n| `max_examples` | int | `None` | Limit number of examples (None = all) |\n| `tolerance` | float | `0.03` | Relative tolerance for numeric correctness (0.03 = ±3%) |\n| `absolute_tol` | float | `1e-8` | Absolute tolerance for near-zero values |\n| `unit_bonus` | float | `0.05` | Bonus reward for including correct units |\n| `require_units` | bool | `False` | Whether to require units for correctness |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward = numeric_correctness + unit_match * unit_bonus (typically 0-1.05) |\n| `numeric_correctness` | 1.0 if answer within tolerance, else 0.0 |\n| `unit_match` | 1.0 if correct units mentioned, else 0.0 (bonus only) |\n| `parseable` | 1.0 if a number was successfully parsed from completion |\n| `mape` | Mean Absolute Percentage Error of parsed answer vs. ground truth |\n\n### Example Problem\n\n**Question:** Suppose that 10.0 mol C₂H₆(g) is confined to 4.860 dm³ at 27°C. Predict the pressure exerted by the ethane from the perfect gas.\n\n**Ground Truth:** 50.7 atm\n\n**Expected Answer Format:** The model should provide a numeric answer (e.g., \"50.7 atm\" or \"The pressure is approximately 50.7 atm\")\n\n### Model Performance (100 examples, 3 rollouts)\n\n| Model | Avg Reward | Numeric Correctness | Unit Match |\n|-------|------------|---------------------|------------|\n| GPT-4o (2024-08-06) | 40.1% | 39.3% | 16.0% |\n| GPT-4.1 | 39.5% | 38.7% | 16.0% |\n| GPT-4 Turbo | 37.8% | 37.0% | 16.0% |\n\nAll models achieve 100% parseability, demonstrating reliable numeric output formatting.\n\n### Notes\n\n- The parser handles LaTeX notation (e.g., `\\boxed{50.7}`) and scientific notation\n- Answers are evaluated with relative tolerance (default ±3%) for numerical accuracy\n- Units are extracted and normalized (e.g., \"kJ mol^{-1}\" → \"kj mol^-1\") for matching\n- If multiple numbers appear in the response, the last number is typically used (robust to showing work)\n","encoding":"utf-8","truncated":false,"total_bytes":3459},"status":null}