{"data":{"kind":"file","path":"README.md","version_id":"k9th32rpt90kc27k2lvrfn4i","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2285,"modified_at":"2025-10-16T16:02:42","content_hash":"a840ef5b14b4fa622db4647d60d2e5f2052304b81c2d0cc04a862ee26d8a3465"},"entries":[],"content":"# MathVision\n\nA comprehensive benchmark for evaluating Large Multimodal Models on mathematical reasoning tasks with visual inputs.\n\nFor more information about the MathVision benchmark, visit the [official project page](https://mathllm.github.io/mathvision/).\n\n### Overview\n- **Environment ID**: `MathVision`\n- **Short description**: A benchmark for evaluating Large Multimodal Models on mathematical reasoning tasks with visual inputs\n- **Tags**: `mathematics`, `multimodal`, `reasoning`, `visual`\n\n### Datasets\n- **Primary dataset(s)**: MathVision dataset from MathLLMs\n- **Source links**: [HF Dataset](https://huggingface.co/datasets/MathLLMs/MathVision)\n- **Split sizes**: test (`3.04K`)\n\n### Task\n- **Type**: `single-turn`\n- **Parser**: Custom parser for extracting answers wrapped in `**`\n- **Rubric overview**: Exact-match accuracy on multiple-choice answers; the dataset can be filtered by difficulty level and subject\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval MathVision\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval MathVision -k <API_KEY> -b <BASE_URL> \\\n  -m 'mistralai/mistral-small-3.2-24b-instruct' \\\n  -n 5 -r 1 -s -a '{\"num_samples\": 5}'\n```\n\nExclude specific subjects and difficulty levels:\n\n```bash\nuv run vf-eval MathVision -k <API_KEY> -b <BASE_URL> \\\n  -m 'qwen/qwen3-vl-8b-instruct' \\\n  -n 5 -r 1 -s -a '{\"num_samples\": 5, \"exclude_subjects\": [\"metric geometry - length\", \"arithmetic\", \"logic\", \"counting\", \"graph theory\"], \"exclude_levels\": [1]}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration 
as a JSON object.\n\n### Environment Arguments\nThe environment accepts the following arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_samples` | int | `-1` | Maximum number of dataset examples to use (`-1` uses all of them) |\n| `exclude_levels` | List | `[]` | Difficulty levels to exclude |\n| `exclude_subjects` | List | `[]` | Subjects to exclude |\n\n### Metrics\nThe rubric emits the following metrics:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (1 for correct answer, 0 for incorrect) |\n| `accuracy` | Exact match on target answer |","encoding":"utf-8","truncated":false,"total_bytes":2285},"status":null}