{"data":{"kind":"file","path":"README.md","version_id":"q2kn2upfg8or4xsmxhzw0x9l","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3504,"modified_at":"2026-03-02T23:53:31.146000","content_hash":"4073d8842f982b989ec6f855b8878103aa0b934a02eb7cf90132387398df1a54"},"entries":[],"content":"# crate-bench\n\nAlbum cover identification benchmark for multimodal models. Given an album cover image, the model must identify the artist and album title.\n\n## Dataset\n\n300 test / 3,000 train examples sourced from Discogs, balanced across popularity tiers, decades, genres, and regions. Images are embedded directly in the HuggingFace dataset: [`zpn/crate-bench`](https://huggingface.co/datasets/zpn/crate-bench).\n\n## Modes\n\n- **closed** (default): Single-turn, no tools. Pure visual recognition from the cover image alone.\n- **open**: Multi-turn with a `web_search` tool (Exa API). The model can search for details it sees on the cover.\n\n## Scoring\n\nUses an LLM judge (`gpt-4.1-mini` by default) to evaluate artist and album identification separately:\n\n- `artist_match` (weight 0.5): 1.0 if the artist is correct, 0.0 otherwise\n- `album_match` (weight 0.5): 1.0 if the album title is correct, 0.0 otherwise\n- `format_reward` (weight 1.0): 0.1 if the model uses the `\\boxed{}` format\n\nThe judge handles Discogs formatting quirks (e.g. 
disambiguation suffixes like \"(4)\"), transliterations, and common name variants.\n\n## Usage\n\n```bash\n# Install\nprime env install zanussbaum/crate-bench\n\n# Closed-book eval\nprime eval crate-bench -m openai/gpt-4.1-mini -n 300 -r 1 -a '{\"mode\": \"closed\", \"split\": \"test\"}'\n\n# Open-book eval (requires EXA_API_KEY)\nEXA_API_KEY=<key> prime eval crate-bench -m openai/gpt-4.1-mini -n 300 -r 1 -a '{\"mode\": \"open\", \"split\": \"test\"}'\n```\n\n## Environment Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `mode` | `\"closed\"` | `\"closed\"` for single-turn, `\"open\"` for multi-turn with search |\n| `split` | `\"train\"` | Dataset split (`\"train\"` or `\"test\"`) |\n| `num_examples` | `-1` | Limit examples (-1 for all) |\n| `judge_model` | `\"gpt-4.1-mini\"` | Model for LLM-as-judge scoring |\n| `max_turns` | `10` | Max turns for open-book mode |\n\n## Baseline Results (300 test examples)\n\n### Closed-book\n\n| Model | Both Correct | Artist Match | Album Match |\n|---|---|---|---|\n| GPT-4.1-mini | 73.3% | 79.0% | 83.3% |\n| Qwen3-VL-8B | 60.5% | 69.9% | 75.9% |\n| GPT-5-mini | 59.7% | 67.3% | 78.7% |\n\n### Open-book (with web search)\n\n| Model | Both Correct | Artist Match | Album Match | Avg Turns |\n|---|---|---|---|---|\n| GPT-4.1-mini | 75.7% | 82.0% | 84.0% | 2.0 |\n| GPT-5-mini | 71.3% | 76.3% | 84.7% | 3.3 |\n| Qwen3-VL-8B | 61.3% | 69.3% | 81.3% | 2.1 |\n\n### Search Analysis\n\nSearch provides modest gains overall (+0.8 to +11.6pp on both-correct, comparing the open-book and closed-book tables above). Key findings:\n\n- **GPT-5-mini benefits most** (+11.6pp), primarily on album identification. It searches aggressively (avg 2.4 queries) with long, specific queries including quoted phrases and catalog numbers.\n- **GPT-4.1-mini benefits moderately** (+2.4pp) with efficient single-query searches.\n- **Qwen3-VL-8B barely benefits** (+0.8pp). It uses a one-and-done strategy (92% single search), constructing queries from text already read off the cover. 
When its text extraction is correct, search is redundant; when wrong, the search query is also wrong.\n\nCommon failure patterns with search:\n- **Confirmation bias**: Search confirms incorrect text readings from the cover\n- **Edition confusion**: Search returns a different release/edition by the same artist\n- **Classical music attribution**: Search results emphasize composers over performers, causing artist mismatches with Discogs ground truth\n- **Search spirals**: GPT-5-mini sometimes exhausts all turns searching without producing a final answer (23/300 cases hit max_turns)\n","encoding":"utf-8","truncated":false,"total_bytes":3504},"status":null}