{"data":{"kind":"file","path":"README.md","version_id":"xhpmfja4gl93av81kpeeudl7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3708,"modified_at":"2026-01-31T05:58:39.517000","content_hash":"8a4a49fccc0bb301aedbb29e6ebe59d9c56e672353fa6cf21a6048f1fd4b6205"},"entries":[],"content":"# MathChain Environment\n\nA reinforcement learning environment that teaches language models to solve arithmetic word problems through explicit multi-step reasoning chains.\n\n## Overview\n\nMathChain focuses on improving structured mathematical reasoning by rewarding models for:\n- Breaking down problems into clear logical steps\n- Showing intermediate calculations\n- Arriving at correct final answers\n\nThis environment demonstrates how RL can improve not just accuracy, but the quality and clarity of reasoning itself.\n\n## Dataset\n\n50 carefully crafted math word problems across three difficulty levels:\n\n- **Two-step problems (20)**: Basic arithmetic chains (addition, subtraction, multiplication, division)\n- **Three-step problems (20)**: Problems involving fractions, percentages, and multiple operations\n- **Four-step problems (10)**: Complex multi-operation chains requiring careful sequential reasoning\n\n## Reward Structure\n\nTotal reward = 1.0, distributed as:\n\n| Component | Weight | Description |\n|-----------|--------|-------------|\n| Correct Answer | 0.6 | Binary reward for final numerical answer |\n| Valid Structure | 0.3 | Has proper \"Step X:\" format with answer section |\n| Step Count | 0.1 | Bonus for reasonable number of steps (close to expected) |\n\nThis reward design encourages models to:\n1. Prioritize correctness (60% weight ensures accuracy is paramount)\n2. Learn structured reasoning (30% guides toward proper formatting)\n3. 
Be concise but thorough (10% prevents excessive or insufficient steps)\n\n## Expected Format\n\nModels should generate responses like:\n\n```\nStep 1: Calculate 1/3 of 30 candies = 30 ÷ 3 = 10 candies given away\nStep 2: Remaining candies = 30 - 10 = 20 candies\nStep 3: Add new candies = 20 + 12 = 32 candies\nAnswer: 32\n```\n\n## Installation\n\n```bash\n# Install from Environments Hub\nprime env install bruno/mathchain\n\n# Or install locally from source\ncd environments/mathchain\nuv add verifiers\n```\n\n## Usage\n\n### Quick Evaluation\n\nTest the environment with any OpenAI-compatible model:\n\n```bash\n# Evaluate with GPT-4.1-mini (baseline)\nprime eval mathchain -m openai/gpt-4.1-mini -n 20 -r 3\n\n# Evaluate with Claude\nprime eval mathchain -m anthropic/claude-3-5-sonnet-20241022 -n 20 -r 3\n```\n\n### Training Configuration\n\nExample configuration for RL training:\n\n```toml\nmodel = \"Qwen/Qwen3-4B-Instruct-2507\"\nmax_steps = 100\nbatch_size = 256\nrollouts_per_example = 8\n\n[sampling]\nmax_tokens = 512\n\n[[env]]\nid = \"bruno/mathchain\"\n```\n\n### Launch Training\n\n```bash\nprime rl run configs/lab/mathchain.toml\n```\n\n## Why This Environment?\n\n**Novel approach**: Rather than using existing benchmarks (GSM8K, MATH), MathChain creates custom problems focused specifically on step-by-step decomposition.\n\n**Clear learning signal**: The multi-component reward function provides rich feedback beyond simple correctness, helping models learn *how* to reason, not just *what* to answer.\n\n**Demonstrates RL value**: Shows improvement in both accuracy and reasoning quality; you can see the model learning to structure its thoughts more clearly over time.\n\n**Interview-ready**: Easy to explain, yields impressive results, and showcases an understanding of reward function design.\n\n## Expected Results\n\n**Baseline** (before RL training):\n- Most models can solve many problems but struggle with consistent formatting\n- Expected accuracy: 40-60%\n- Step structure 
compliance: 20-40%\n\n**Post-training** (after RL):\n- Improved accuracy: +10-20 percentage points\n- Dramatically improved step formatting: +40-60 percentage points\n- More concise, clearer reasoning chains\n\n## Development\n\nBuilt with [verifiers](https://github.com/willccbb/verifiers) - a modular library for creating RL environments.\n\n## License\n\nMIT\n","encoding":"utf-8","truncated":false,"total_bytes":3708},"status":null}