{"data":{"kind":"file","path":"README.md","version_id":"um9k1uegr57nk89a106dn7st","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3948,"modified_at":"2025-11-30T10:49:05.662000","content_hash":"4c27feaa01539bc1fef2b6d892a2b9205421e55753f83a1b8a91211063a75da9"},"entries":[],"content":"# proof-verifier-env\n\n### Overview\n- **Environment ID**: `proof-verifier-env`\n- **Short description**: A two-turn environment for training mathematical proof generators with self-verification capabilities\n- **Tags**: mathematics, proof-generation, self-verification, multi-turn\n\n### Datasets\n- **Primary dataset(s)**: ProofBench - A mathematical proof benchmark dataset with expert ratings\n- **Source links**: `wenjiema02/ProofBench` on HuggingFace\n- **Split sizes**: 1000 examples from train split (configurable)\n\n### Task\n- **Type**: Multi-turn (generator → verifier)\n- **Parser**: ProofParser (custom parser for extracting proofs, self-scores, and verifier scores)\n- **Rubric overview**: \n  - **Generator rewards**: Format compliance, proof quality (R_Y), self-calibration (R_score)\n  - **Verifier rewards**: Format compliance, scoring accuracy against expert ratings\n  - **Total reward**: Weighted combination (76% proof quality + 24% calibration for generator; format × accuracy for verifier)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval proof-verifier-env\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval proof-verifier-env \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"dataset_name\": \"wenjiema02/ProofBench\", \"dataset_split\": \"train\", \"num_examples\": 100}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The environment performs two turns per example: first the generator creates a proof with self-analysis, then the verifier evaluates it.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"wenjiema02/ProofBench\"` | HuggingFace dataset identifier |\n| `dataset_split` | str | `\"train\"` | Dataset split to use (train/test/validation) |\n| `num_examples` | int | `1000` | Number of examples to evaluate (subset size) |\n\n### Metrics\n\nThe environment tracks separate metrics for both the generator and verifier models:\n\n#### Generator Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `generator_format_reward` | Binary reward for following the required format (Proof, Self-Analysis, Self-Score) |\n| `R_Y` | Proof quality score from verifier (0.0, 0.5, or 1.0) - normalized from expert ratings |\n| `r_score` | Self-calibration accuracy: 1.0 - abs(self_score - verifier_score) |\n| `generator_total_reward` | Combined reward: format × (0.76 × R_Y + 0.24 × r_z) [weight: 0.5] |\n\n#### Verifier Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `verifier_format_reward` | Binary reward for following evaluation format (evaluation text + boxed score) |\n| `verifier_r_score` | Scoring accuracy: 1.0 - abs(predicted_score - expert_score) |\n| `verifier_total_reward` | Combined reward: format × r_score [weight: 0.5] |\n\n#### Scoring Scale\nExpert scores are normalized to a 0.0-1.0 scale:\n- **1.0**: Complete and rigorous proof (expert rating ≥ 85% of max)\n- **0.5**: Sound logic with minor issues (expert rating ≥ 40% of max)\n- **0.0**: Fatal errors or critical gaps (expert rating < 40% of max)\n\n### How It Works\n\n1. **Turn 1 - Generator**: The model receives a mathematical problem and must produce:\n   - A detailed proof with all steps and justifications\n   - A self-analysis critically evaluating its own proof\n   - A self-score (0.0, 0.5, or 1.0) based on the rubric criteria\n\n2. **Turn 2 - Verifier**: The same model then acts as a verifier and must:\n   - Evaluate the proof for correctness, logical completeness, and clarity\n   - Provide detailed analysis identifying gaps or confirming validity\n   - Assign a score in boxed notation: `\\boxed{[0 or 0.5 or 1]}`\n\n3. **Reward Calculation**: \n   - Generator is rewarded for proof quality (as judged against expert scores) and self-calibration accuracy\n   - Verifier is rewarded for accurately matching expert ratings from the dataset\n   - Both roles are penalized for format non-compliance\n","encoding":"utf-8","truncated":false,"total_bytes":3948},"status":null}