{"data":{"kind":"file","path":"README.md","version_id":"pwjbu3esjhg5m7debqy4buch","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2388,"modified_at":"2025-08-31T20:44:09.791000","content_hash":"be21c2bb18b651b1abb27a4e5f5ad43833cbc9cfbff7866015188cfcf601811e"},"entries":[],"content":"# sudoku\n\n### Overview\n- **Environment ID**: `sudoku`\n- **Short description**: Sudoku puzzles split into \"easy\" and \"hard\" difficulties.\n- **Tags**: sudoku, single-turn, game, puzzle, grid\n\n### Datasets\n- **Primary dataset(s)**: sapientinc/sudoku-extreme\n- **Source links**: [https://huggingface.co/datasets/sapientinc/sudoku-extreme](https://huggingface.co/datasets/sapientinc/sudoku-extreme)\n- **Split sizes**: 3.8M / 423k\n\n### Task\n- **Type**: single-turn\n- **Parser**: Standard parser with a custom regex extraction function for boxed answers\n- **Rubric overview**: Weighted sum of `correctness` (1.0), `validity` (0.5), and `formatting` (0.5); a `length` metric is also listed under Metrics.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval sudoku\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval sudoku \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"difficulty\": \"easy\", \"num_train_examples\": 10, \"num_eval_examples\": 10, \"correctness_decay\": 8}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nDocument any supported environment arguments and their meaning. 
Example:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `difficulty` | str (\"easy\", \"hard\") | `None` | Selects either \"easy\" or \"hard\" puzzles |\n| `num_train_examples` | int | `-1` | Limit on training dataset size (use -1 for all) |\n| `num_eval_examples` | int | `-1` | Limit on eval dataset size (use -1 for all) |\n| `correctness_decay` | int | `8` | Half-life of the exponential decay applied to the correctness value; lower values drive it toward 0 more aggressively |\n| `eval_type` | str (\"solve\", \"generate\") | `solve` | Evaluates the model on either solving existing puzzles or generating valid puzzles |\n\n### Metrics\nKey metrics emitted by the rubric and how to interpret them:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward: weighted sum of criteria, with weights correctness 1.0, validity 0.5, formatting 0.5 |\n| `correctness` | Score between 0 and 1 based on the number of correct cell entries (solve only) |\n| `validity` | Whether the entries are valid, though not necessarily correct. Checks that the answer has 81 characters; that each row, column, and 3x3 box sums to 45; and that no row, column, or 3x3 box contains duplicate values |\n| `formatting` | Whether the response returns its answer in `\\boxed{...}` format |\n| `length` | Whether the response has 81 characters |\n","encoding":"utf-8","truncated":false,"total_bytes":2388},"status":null}