{"data":{"kind":"file","path":"README.md","version_id":"r1znievdndrdm1158gai56cn","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2051,"modified_at":"2025-10-24T00:47:51.817000","content_hash":"3e1ec6d3469f90b44bc8e775952c265671925fa2ac88c019c21efcbade041e73"},"entries":[],"content":"# FenBench\n\n### Overview\n- **Environment ID**: `fenbench`\n- **Short description**: Multi-task chessboard visual understanding covering piece identification, square location, piece counting, and FEN generation from board images.\n- **Tags**:  eval, multimodal, single-turn\n\n### Datasets\n- **Primary dataset(s)**: `puccibets/fenbench` via `load_dataset(\"puccibets/fenbench\", split=\"test\")`\n- **Source links**: [Hugging Face](https://huggingface.co/datasets/puccibets/fenbench), [GitHub](https://github.com/puccibets/fenbench)\n- **Split sizes**: Default split `test` (N=200), evenly mixing the four tasks listed above.\n\n### Task\n- **Type**: single-turn\n- **Parser**: No special parser; rubric-specific scorers read the assistant reply directly for each task.\n- **Rubric overview**: Single reward. 
For each example, the rubric applies the task-specific exact-match rule (piece label, square set, piece counts, or FEN) and returns `1.0` on success and `0.0` otherwise.\n\n### Evaluation Categories\n- **Category 1: Piece Identification** – Identify which piece is on a given square – Tasks 1–50\n- **Category 2: Square Location** – Find all squares containing a specific piece type – Tasks 51–100\n- **Category 3: Piece Counting** – Count and categorize all pieces on the board – Tasks 101–150\n- **Category 4: FEN Generation** – Generate the complete board position in FEN notation – Tasks 151–200\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval fenbench\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval fenbench \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `tasks` | list[str] or `null` | `null` | Optional subset of tasks (`piece_identification`, `square_location`, `piece_counting`, `fen_generation`). |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Task-specific exact match (returns `1.0` when the response satisfies the task’s rule, otherwise `0.0`). |\n","encoding":"utf-8","truncated":false,"total_bytes":2051},"status":null}