{"data":{"kind":"file","path":"README.md","version_id":"bgul13zot801l11ula2071q7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2521,"modified_at":"2025-10-16T16:06:04.023000","content_hash":"b9c2424ed1d8d0f6568eae16f433b85782c427a45b2b543d148bbd39ac97649c"},"entries":[],"content":"# BLINK\n\n[BLINK Benchmark](https://zeyofu.github.io/blink/).\n\n### Overview\n- **Environment ID**: `BLINK`\n- **Short description**:  BLINK is a benchmark for evaluating multi-modal reasoning and visual understanding in large vision-language models.\n- **Tags**: vision, language, multimodal, benchmark, evaluation\n\n### Datasets\n- **Primary dataset(s)**: BLINK-Benchmark/BLINK — a collection of diverse visual reasoning tasks with images and multiple-choice questions.\n- **Source links**: [HF Dataset](https://huggingface.co/datasets/BLINK-Benchmark/BLINK)\n- **Split sizes**: val and test both have the same size across subsets\n\n### Task\n- **Type**: single-turn\n- **Parser**: Custom parser for extracting answers wrapped in `**`\n- **Rubric overview**: Exact Match\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval BLINK\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval BLINK -k <API_KEY> -b <BASE_URL> \\\n  -m 'z-ai/glm-4.5v' \\\n  -n 5 -r 1 -s -a '{\"num_samples\": 5}'\n\nuv run vf-eval BLINK -k <API_KEY> -b <BASE_URL> \\\n  -m 'google/gemini-2.0-flash-exp:free' \\\n  -n 2 -r 1 -s -a '{\"num_samples\": 1, \"subset_include\": [\"Visual_Similarity\"]}'\n\nuv run vf-eval BLINK -k <API_KEY> -b <BASE_URL> \\\n  -m 'z-ai/glm-4.5v' \\\n  -n 10 -r 1 -s -a '{\"num_samples\": 5, \"subset_include\": [\"IQ_Test\"]}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n| Arg             | Type            | Default     | Description                                                                                  |\n|-----------------|-----------------|-------------|----------------------------------------------------------------------------------------------|\n| `splits`          | str             | `\"val,test\"` | Comma-separated list of splits to evaluate on (val, test, or both)                           |\n| `num_samples`     | int             | `-1`          | Number of samples per subset to use (-1 for all)                                             |\n| `subset_include`  | str or List[str]| `\"all\"`       | Which dataset subsets to include; \"all\" for all, or a list of subset names                  |\n| `subset_dict`     | dict            | `{}`          | Custom mapping of subset names to sample counts (overrides num_samples if provided)          |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (1 for correct answer, 0 for incorrect) |\n| `accuracy` | Exact match on target answer |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2521},"status":null}