{"data":{"kind":"file","path":"README.md","version_id":"no360clmz5wq8f26zpc9dgim","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4627,"modified_at":"2025-08-25T10:17:23.385000","content_hash":"becfc6653c117c485b64d8e984ff0adb43668455a5582a76c39a730c7c199be9"},"entries":[],"content":"# Enigmata\n\n## Overview\n\n- **Environment ID**: `enigmata`\n- **Short description**: Synthetic, verifiable puzzle tasks with rule-based scoring across 36 tasks in 7 categories\n- **Tags**: enigmata, single-turn, reasoning, puzzles, verifiable, generator-verifier\n\nThis environment exposes the Enigmata suite as a self-contained evaluation and data-generation environment. Problems are programmatically generated and scored with task-specific, rule-based verifiers. It is designed for training and evaluating reasoning models without external LLM judges.\n\nNotes:\n\n- Python ≥ 3.11; dependencies are declared in `pyproject.toml`.\n- If the embedded `Enigmata` submodule is missing, the environment will automatically clone `BytedTsinghua-SIA/Enigmata` on first use.\n\n## Datasets\n\n- **Primary dataset(s)**: Enigmata-Data (synthetic) and Enigmata-Eval (benchmark)\n- **Source links**:\n  - Enigmata: [GitHub Repo](https://github.com/BytedTsinghua-SIA/Enigmata)\n  - Enigmata-Eval: [HuggingFace Dataset](https://huggingface.co/datasets/BytedTsinghua-SIA/Enigmata-Eval)\n- **Split sizes (Eval)**: 4,758 puzzle instances across Easy/Medium/Hard\n\n## Task\n\n- **Type**: single-turn\n- **Parser**: Identity parser (returns the raw completion)\n- **Rubric overview**: Single numeric score from a task-specific verifier (`verify`) with unit weight\n\n## Quickstart\n\nRun an evaluation with defaults (no API keys required):\n\n```bash\nuv run vf-eval enigmata\n```\n\nEvaluate with a fixed number of examples and specific tasks:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 200, \"num_eval_examples\": 200, \"tasks\": [\"sudoku\", \"maze\"]}'\n```\n\nUse the 
predefined benchmark split (downloads Enigmata-Eval from HuggingFace) and evaluate only `sudoku`:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"use_predefined_eval_dataset\": true, \"tasks\": \"sudoku\"}'\n```\n\nNotes:\n\n- Use `-a` / `--env-args` to pass environment configuration as JSON.\n- When `use_predefined_eval_dataset` is `false`, both train and eval sets are generated on the fly.\n- You can also generate and evaluate offline using the scripts in `Enigmata/` (see that README for details).\n- Reproducibility: pass `seed` in env args, or set env var `ENIGMATA_SEED` to seed RNGs without editing code under `Enigmata/`. Eval generation uses `seed + 1` automatically when `seed` is provided.\n\n### Seeding and Reproducibility\n\nMinimal seeding is applied to stabilize generation without touching code under `Enigmata/`:\n\n- Python `random`\n- NumPy (if available)\n- `PYTHONHASHSEED`\n\n#### Examples\n\nDeterministic generation with a fixed seed:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100, \"seed\": 42}'\n```\n\nUse environment variable instead of args:\n\n```bash\nENIGMATA_SEED=123 uv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100}'\n```\n\nDifferent seeds for train vs eval:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100, \"seed\": 7}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `-1` | Number of generated training examples per run (-1 means generator-chosen) |\n| `num_eval_examples` | int | `-1` | Number of generated evaluation examples per run (-1 means generator-chosen) |\n| `use_predefined_eval_dataset` | bool | `false` | If `true`, loads `BytedTsinghua-SIA/Enigmata-Eval` from HF for eval |\n| `tasks` | str or list | `\"all\"` | Filter to a task or list of tasks (e.g., `\"sudoku\"`, 
`[\"sudoku\",\"maze\"]`) |\n| `system_prompt` | str | `\"\"` | Optional system prompt propagated to the environment |\n| `seed` | Optional[int] | `None` | Global seed for reproducible generation (Python, NumPy). Eval uses `seed+1` |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Verifier score per example (typically 0 or 1; aggregated as mean) |\n\n### Example Structure\n\nNormalized examples produced by this environment follow this schema:\n\n```txt\nquestion: str\nanswer: str\ntask_name: str\ndifficulty: str\nsplit: str\nlanguage: str\nmeta_json: Optional[str]  # JSON-encoded metadata including the fields above\n```\n\n### Verifier Integration\n\nPer-example scoring dynamically imports `verifiable_tasks.tasks.<task_name>.verifier` and calls `verify(solution: str, answer: str, meta: dict) -> float|int`. If a verifier cannot be resolved, the reward defaults to `0.0` to fail closed.\n\n## Evaluation Reports\n<!-- Do not edit below this line. Content is auto-generated. -->\n<!-- vf:begin:reports -->\nNo reports found. Run `uv run vf-eval enigmata` to generate one.\n<!-- vf:end:reports -->","encoding":"utf-8","truncated":false,"total_bytes":4627},"status":null}