{"data":{"kind":"file","path":"README.md","version_id":"h4es8670kyjbgzazeg8ne07c","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8296,"modified_at":"2025-12-05T08:43:08.903000","content_hash":"24149f79bf5e62bc7416838691817db3df4522ceab45e7aa101a8dd681482dda"},"entries":[],"content":"# dolci-think\n\nVerifiers environment for the **math subset** of the [allenai/Dolci-Think-RL-7B](https://huggingface.co/datasets/allenai/Dolci-Think-RL-7B) dataset with hybrid verification.\n\n## About the Dataset\n\n**Dolci-Think-RL-7B** is the reinforcement learning dataset used to train [Olmo-3-7B-Think](https://huggingface.co/allenai/Olmo-3-7B-Think), released by Allen AI. It contains **102,014 prompts** designed to elicit deep reasoning across four domains:\n\n| Domain | Samples | Description |\n|--------|---------|-------------|\n| **Math RLVR** | ~30,180 | Mathematical reasoning problems |\n| Instruction Following | ~29,813 | Multi-constraint IF tasks |\n| Code RLVR | ~21,385 | Code generation with test cases |\n| General RLVR | ~20,636 | Long-form reasoning & chat |\n\n### Math Sources (used by this environment)\n\nThis environment filters to **math tasks only** (`dataset = [\"math\"]`), which includes problems from:\n\n| Source | Count | Description |\n|--------|-------|-------------|\n| [OMEGA Math](https://arxiv.org/abs/2506.18880) | 15,000 | Competition-level math |\n| [AceReason-Math](https://arxiv.org/abs/2505.16400) | 6,598 | Reasoning chains |\n| [Multi-Subject RLVR](https://arxiv.org/abs/2503.23829v1) | 7,106 | Cross-domain math |\n| [MathSub-30K](https://arxiv.org/abs/2508.07629) | 2,999 | KlearReasoner math |\n| [ORZ Math](https://arxiv.org/abs/2503.24290) | 2,999 | Olympiad-style problems |\n| [DAPO-Math](https://arxiv.org/abs/2503.14476) | 2,584 | Diverse algebra/geometry |\n\n**Domains covered:** geometry, algebra, combinatorics, number theory, proofs, word problems, and more.\n\n## Overview\n\nThis environment implements a two-stage verification 
pipeline for math problems:\n\n1. **Rule-based verification** (`math_verify`): Fast symbolic comparison of `\\boxed{}` answers\n2. **LLM Judge fallback**: When rule-based fails, uses an LLM to evaluate semantic equivalence\n\nThe judge is only called when math_verify fails, making evaluations efficient while maintaining accuracy.\n\n**Based on:**\n- [open-instruct](https://github.com/allenai/open-instruct)'s MathVerifier logic\n- [PrimeIntellect/i3-math](https://github.com/PrimeIntellect-ai/verifiers) hybrid rubric pattern\n\n## Installation\n\n```bash\n# Using vf-install (recommended)\nvf-install dolci_think\n\n# Or from local directory\ncd environments/dolci_think\npip install -e .\n```\n\n**Dependencies:** `datasets`, `httpx`, `math-verify>=0.8.0`, `openai`, `sympy`, `verifiers>=0.1.8.post1`\n\n## Quick Start\n\n### Rule-based only (no judge)\n\n```bash\nvf-eval dolci_think \\\n  -m gpt-4.1-mini \\\n  -n 20 \\\n  -r 4 \\\n  -s\n```\n\n### With LLM Judge (Gemini)\n\n```bash\nvf-eval dolci_think \\\n  -m gpt-4.1-mini \\\n  -n 128 \\\n  -r 4 \\\n  -a '{\"judge_model\": \"gemini-2.5-flash\", \"judge_base_url\": \"https://generativelanguage.googleapis.com/v1beta/openai/\", \"judge_api_key_var\": \"GEMINI_API_KEY\"}' \\\n  -C \"judge_reasoning,judge_result\" \\\n  -s\n```\n\n## Example Output Dataset\n\nSee a complete evaluation run at [**pmahdavi/dolci-math_olmo3-32b_gemini-2.5-flash**](https://huggingface.co/datasets/pmahdavi/dolci-math_olmo3-32b_gemini-2.5-flash)\n\nThis dataset was generated using:\n\n```bash\nvf-eval dolci_think \\\n  -m allenai/Olmo-3-32B-Think \\\n  -b http://localhost:8000/v1 \\\n  -n 128 \\\n  -r 4 \\\n  -t 24576 \\\n  -c 256 \\\n  -S '{\"temperature\": 1.0, \"top_p\": 0.7}' \\\n  -a '{\"use_think\": false, \"judge_model\": \"gemini-2.5-flash\", \"judge_base_url\": \"https://generativelanguage.googleapis.com/v1beta/openai/\", \"judge_api_key_var\": \"GEMINI_API_KEY\"}' \\\n  -s -H \\\n  -D pmahdavi/dolci-math_olmo3-32b_gemini-2.5-flash \\\n  
-v \\\n  -C judge_reasoning,judge_result\n```\n\n### Output Columns\n\n| Column | Type | Description |\n|--------|------|-------------|\n| `example_id` | int | Example index |\n| `prompt` | list | Input messages |\n| `completion` | list | Model response |\n| `task` | str | Always `\"math\"` |\n| `answer` | str | Ground truth answer |\n| `reward` | float | Final score (0.0 or 1.0) |\n| `math_verify_score` | float | Rule-based verification result |\n| `judge_score` | float | LLM judge score (equals math_verify if skipped) |\n| `correct_answer` | float | `math_verify_score OR judge_score` |\n| `generation_ms` | float | Generation time (ms) |\n| `scoring_ms` | float | Scoring time (ms) |\n| `info` | dict | Original dataset metadata |\n\n**Optional columns** (use `-C judge_reasoning,judge_result`):\n\n| Column | Description |\n|--------|-------------|\n| `judge_reasoning` | Full CoT response from judge, or `null` if skipped |\n| `judge_result` | `A` (correct), `B` (incorrect), `C` (invalid), `SKIPPED`, `EMPTY_RESPONSE`, `ERROR` |\n\n### Sample Rows\n\n**Example 1: math_verify passed → judge SKIPPED**\n\n| Field | Value |\n|-------|-------|\n| Prompt | *\"(Hungary 2004). A palace has the shape of a square divided into 2003×2003 rooms...\"* |\n| Answer | `\"99\"` |\n| math_verify_score | `1.0` |\n| judge_result | `\"SKIPPED\"` |\n| judge_reasoning | `null` |\n| reward | `1.0` |\n\n**Example 2: math_verify failed → judge called**\n\n| Field | Value |\n|-------|-------|\n| Prompt | *\"5. 
All three-digit numbers from 100 to 999 are written in a row...\"* |\n| Answer | `\"5\"` |\n| math_verify_score | `0.0` |\n| judge_result | `\"B\"` |\n| judge_reasoning | `\"Final Judgment: \\boxed{B} - INCORRECT\"` |\n| reward | `0.0` |\n\n## Environment Arguments\n\n| Argument | Type | Default | Description |\n|----------|------|---------|-------------|\n| `dataset_name` | `str` | `\"allenai/Dolci-Think-RL-7B\"` | HuggingFace dataset name |\n| `dataset_split` | `str` | `\"train\"` | Dataset split to use |\n| `system_prompt` | `str` | `None` | Optional system prompt to prepend |\n| `include_all_tasks` | `bool` | `False` | Include non-math tasks (rewards=0) |\n| `use_think` | `bool` | `True` | Require `<think></think>` tags (set `False` for Qwen3/DeepSeek-R1) |\n| `judge_model` | `str` | `None` | LLM judge model name (enables judge) |\n| `judge_base_url` | `str` | `None` | Base URL for judge API |\n| `judge_api_key_var` | `str` | `None` | Env var name for judge API key |\n| `judge_sampling_args` | `dict` | `None` | Sampling args for judge calls |\n\n## Dataset Format\n\nThe environment converts the Dolci dataset to verifiers format:\n\n**Input (Dolci):**\n```python\n{\n    \"prompt\": \"user: Solve for x: 2x + 5 = 15\",\n    \"ground_truth\": [\"5\"],\n    \"dataset\": [\"math\"],\n    \"original_dataset\": \"gsm8k\"\n}\n```\n\n**Output (Verifiers):**\n```python\n{\n    \"question\": \"Solve for x: 2x + 5 = 15\",\n    \"answer\": \"5\",\n    \"task\": \"math\",\n    \"info\": {\n        \"all_ground_truths\": [\"5\"],\n        \"all_datasets\": [\"math\"],\n        \"original_dataset\": \"gsm8k\"\n    }\n}\n```\n\n## Configuration Examples\n\n### Disable think parsing\n\nUse `use_think: false` when:\n- The model doesn't produce `<think>` tags (standard instruction-tuned models)\n- The chat template automatically removes `<think>` tags (e.g., **Qwen3**, **DeepSeek-R1**)\n\n> **Note:** When `use_think=True` (default), the parser **requires** `</think>` in the 
response. If it is not found, parsing fails and returns an empty string.\n\n```bash\nvf-eval dolci_think \\\n  -m gpt-4.1-mini \\\n  -a '{\"use_think\": false}' \\\n  -n 20\n```\n\n### Custom dataset split\n\n```bash\nvf-eval dolci_think \\\n  -m gpt-4.1-mini \\\n  -a '{\"dataset_split\": \"test\"}' \\\n  -n 100\n```\n\n### With OpenAI judge\n\n```bash\nvf-eval dolci_think \\\n  -m gpt-4.1-mini \\\n  -a '{\"judge_model\": \"gpt-4o-mini\", \"judge_api_key_var\": \"OPENAI_API_KEY\"}' \\\n  -n 50\n```\n\n## Logging\n\nControl verbosity with the `DOLCI_THINK_LOG_LEVEL` environment variable:\n\n```bash\n# Debug mode (shows parsed responses, scores, etc.)\nDOLCI_THINK_LOG_LEVEL=DEBUG vf-eval dolci_think -n 5\n\n# Quiet mode\nDOLCI_THINK_LOG_LEVEL=WARNING vf-eval dolci_think -n 100\n```\n\n## Default Eval Config\n\nFrom `pyproject.toml`:\n```toml\n[tool.verifiers.eval]\nnum_examples = 20\nrollouts_per_example = 4\n```\n\nThese defaults are used when `-n` and `-r` are not specified.\n\n## License\n\nThe [Dolci-Think-RL-7B dataset](https://huggingface.co/datasets/allenai/Dolci-Think-RL-7B) is licensed under **ODC-BY** by Allen AI.\n\n## References\n\n- **Dataset:** [allenai/Dolci-Think-RL-7B](https://huggingface.co/datasets/allenai/Dolci-Think-RL-7B)\n- **Model:** [allenai/Olmo-3-7B-Think](https://huggingface.co/allenai/Olmo-3-7B-Think)\n- **Related Papers:**\n  - [OMEGA Math](https://arxiv.org/abs/2506.18880)\n  - [AceCoder](https://arxiv.org/abs/2502.01718)\n  - [Tulu 3](https://arxiv.org/abs/2411.15124)\n","encoding":"utf-8","truncated":false,"total_bytes":8296},"status":null}