{"data":{"kind":"file","path":"README.md","version_id":"infl0s2ovhf3tvtgwmwf5lp8","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2787,"modified_at":"2025-12-10T01:54:44.385000","content_hash":"faa1b795e830a1e34bda01ac0f7da297779afcbd8d8e4ce3f10ab0711ae8f05d"},"entries":[],"content":"# PutnamBench\n\n### Overview\n- **Environment ID**: `PutnamBench`\n- **Short description**: Theorem proving environment for Putnam Competition problems using Lean 4 verification\n- **Tags**: theorem-proving, lean4, putnam, mathematics, eval\n\n### Datasets\n- **Primary dataset(s)**: PutnamBench Lean 4 subset - 672 Putnam Mathematical Competition problems (1962-2024)\n- **Source links**: [https://github.com/trishullab/PutnamBench](https://github.com/trishullab/PutnamBench)\n- **Split sizes**: Configurable filtering by year, mathematical domain, problem count\n\n### Task\n- **Type**: single-turn\n- **Parser**: Custom Lean 4 code block extractor (```lean4 or ```lean)\n- **Rubric overview**: Binary verification (1.0/0.0) via kimina-lean-server compilation\n\n### Prerequisites\n- **Docker**: Required for kimina-lean-server (default) OR\n- **Local Setup**: kimina-lean-server installed locally (set `use_docker=false`)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval PutnamBench\n```\n\nConfigure model, sampling, and filtering:\n\n```bash\nuv run vf-eval PutnamBench \\\n  -m gpt-4o-mini \\\n  -n 20 -r 1 -t 4096 -T 0.7 \\\n  -a '{\"year_range\": [2020, 2024], \"max_examples\": 10, \"tags\": [\"algebra\"]}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `cache_path` | str | `~/.cache/verifiers/environments/putnambench` | Local cache for PutnamBench repo |\n| `year_range` | tuple | `None` | Filter problems by year range, e.g., `[2010, 2020]` |\n| `tags` | list | `None` | Filter by mathematical domain tags, e.g., `[\"algebra\", \"analysis\"]` |\n| `max_examples` | int | `-1` | Limit dataset size (-1 for all) |\n| `server_port` | int | `8000` | Port for kimina-lean-server |\n| `server_memory` | str | `\"4g\"` | Memory limit for kimina-lean-server |\n| `use_docker` | bool | `True` | Use Docker for kimina-lean-server (vs local install) |\n\n### Available Filters\n**Years**: 1962-2024 (Putnam Competition years)\n\n**Mathematical Domains**:\n- `algebra` (253 problems)\n- `analysis` (229 problems) \n- `number_theory` (113 problems)\n- `geometry` (71 problems)\n- `linear_algebra` (53 problems)\n- `combinatorics`, `abstract_algebra`, etc.\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 if proof compiles successfully, 0.0 otherwise |\n| `verification_success` | Boolean compilation success |\n| `parse_success` | Boolean code block extraction success |\n| `error` | Compilation error message if failed |\n| `compilation_output` | Full compiler output for debugging |\n\n### Setup Notes\n- First run automatically clones PutnamBench repository\n- kimina-lean-server starts automatically via Docker\n- Requires internet connection for initial setup\n- Each problem includes informal statement and formal Lean 4 theorem\n","encoding":"utf-8","truncated":false,"total_bytes":2787},"status":null}