{"data":{"kind":"file","path":"README.md","version_id":"h2fkepcydkpf5mog8ic24eyq","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2221,"modified_at":"2025-10-17T03:20:13.469000","content_hash":"378598253bb72b5a0edcd57226edc06d39c2eea4779561bb656464f3ef48ea67"},"entries":[],"content":"# MMStar\n\nMMStar is an elite vision-indispensable multi-modal benchmark with 1,500 challenge samples meticulously selected by humans to evaluate the multi-modal capabilities of Large Vision-Language Models (LVLMs). It addresses prevalent issues in current benchmarks, such as unnecessary visual content and unintentional data leakage, by ensuring visual dependency and providing metrics for data leakage and actual multi-modal performance gains.\n\nFor more information about the MMStar benchmark, visit the [official project page](https://mmstar-benchmark.github.io/).\n\n### Overview\n- **Environment ID**: `MMStar`\n- **Short description**: A benchmark for evaluating Large Vision-Language Models (LVLMs) on vision-indispensable tasks\n- **Tags**: `multimodal`, `vision`, `reasoning`, `benchmark`\n\n### Datasets\n- **Primary dataset(s)**: MMStar dataset from MMStar Benchmark\n- **Source links**: [HF Dataset](https://huggingface.co/datasets/Lin-Chen/MMStar)\n- **Split sizes**: val (`1.5K`)\n\n### Task\n- **Type**: `single-turn`\n- **Parser**: Custom parser for extracting answers wrapped in `**`\n- **Rubric overview**: Exact-match accuracy on multiple-choice answers; supports filtering by `category` and `l2_category`\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval MMStar\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval MMStar -k <API_KEY> -b <BASE_URL> \\\n  -m 'z-ai/glm-4.5v' \\\n  -n 5 -r 1 -s -a '{\"num_samples\": 5}'\n\nuv run vf-eval MMStar -k <API_KEY> -b <BASE_URL> \\\n  -m 'x-ai/grok-4-fast' \\\n  -n 5 -r 1 -s -a '{\"num_samples\": 5, \"exclude_cats\": [\"coarse perception\"], \"exclude_l2_cats\": 
[\"localization\"]}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_samples` | int | `-1` | Maximum number of samples to evaluate (`-1` uses the full dataset) |\n| `exclude_cats` | List | `[]` | Categories to exclude, e.g. `coarse perception` |\n| `exclude_l2_cats` | List | `[]` | L2 categories (`l2_category` values) to exclude, e.g. `localization` |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (1 for correct answer, 0 for incorrect) |\n| `accuracy` | Exact match on target answer |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2221},"status":null}