{"data":{"kind":"file","path":"README.md","version_id":"dgqlcevooxdpf9uo3tcphcxy","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4588,"modified_at":"2025-10-03T00:18:44.922000","content_hash":"f652a8b5a51daf96f2880f2de7502ad7c8ce50ddecbcc36a2070e5963db79d21"},"entries":[],"content":"# Enigmata\n\n## Overview\n\n- **Environment ID**: `enigmata`\n- **Short description**: Synthetic, verifiable puzzle tasks with rule-based scoring across 36 tasks in 7 categories\n- **Tags**: enigmata, single-turn, reasoning, puzzles, verifiable, generator-verifier\n\nThis environment adapts the [Enigmata](https://github.com/BytedTsinghua-SIA/Enigmata) suite as a self-contained evaluation and data-generation environment. Problems are programmatically generated and scored with task-specific, rule-based verifiers. It is designed for training and evaluating reasoning models without external LLM judges.\n\nAdapted by Fido Wang - [GitHub](https://github.com/popfido), [X](https://x.com/fido20222)\n\nNotes:\n\n- Python ≥ 3.11; dependencies are declared in `pyproject.toml`.\n- If the embedded `Enigmata` submodule is missing, the environment will automatically clone `BytedTsinghua-SIA/Enigmata` on first use.\n\n## Datasets\n\n- **Primary dataset(s)**: Enigmata-Data (synthetic) and Enigmata-Eval (benchmark)\n- **Source link**:\n  - Enigmata-Eval: [HuggingFace Dataset](https://huggingface.co/datasets/BytedTsinghua-SIA/Enigmata-Eval)\n- **Split sizes (Eval)**: 4,758 puzzle instances across Easy/Medium/Hard\n\n## Task\n\n- **Type**: single-turn\n- **Parser**: Identity parser (returns the raw completion)\n- **Rubric overview**: Single numeric score from a task-specific verifier (`verify`) with unit weight\n\n## Quickstart\n\nRun an evaluation with defaults (no API keys required):\n\n```bash\nuv run vf-eval enigmata\n```\n\nEvaluate with a fixed number of examples and specific tasks:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 200, 
\"num_eval_examples\": 200, \"tasks\": [\"sudoku\", \"maze\"]}'\n```\n\nUse the predefined benchmark split (downloads Enigmata-Eval from HuggingFace) and evaluate only `sudoku`:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"use_predefined_eval_dataset\": true, \"tasks\": \"sudoku\"}'\n```\n\nNotes:\n\n- Use `-a` / `--env-args` to pass environment configuration as JSON.\n- When `use_predefined_eval_dataset` is `false`, both train and eval sets are generated on the fly.\n- You can also generate and evaluate offline using the scripts in `Enigmata/` (see that README for details).\n- Reproducibility: pass `seed` in env args, or set env var `ENIGMATA_SEED` to seed RNGs without editing code under `Enigmata/`. Eval generation uses `seed + 1` automatically when `seed` is provided.\n\n### Seeding and Reproducibility\n\nMinimal seeding is applied to stabilize generation without touching code under `Enigmata/`:\n\n- Python `random`\n- NumPy (if available)\n- `PYTHONHASHSEED`\n\n#### Examples\n\nDeterministic generation with a fixed seed:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100, \"seed\": 42}'\n```\n\nUse environment variable instead of args:\n\n```bash\nENIGMATA_SEED=123 uv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100}'\n```\n\nDifferent seeds for train vs eval:\n\n```bash\nuv run vf-eval enigmata \\\n  -a '{\"num_train_examples\": 100, \"num_eval_examples\": 100, \"seed\": 7}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `-1` | Number of generated training examples per run (-1 means generator-chosen) |\n| `num_eval_examples` | int | `-1` | Number of generated evaluation examples per run (-1 means generator-chosen) |\n| `use_predefined_eval_dataset` | bool | `false` | If `true`, loads `BytedTsinghua-SIA/Enigmata-Eval` from HF for eval |\n| `tasks` | str or list | 
`\"all\"` | Filter to a task or list of tasks (e.g., `\"sudoku\"`, `[\"sudoku\",\"maze\"]`) |\n| `difficulties` | Optional[list] | `None` | List of difficulties to include in the environment. |\n| `system_prompt` | str | `\"\"` | Optional system prompt propagated to the environment |\n| `seed` | Optional[int] | `None` | Global seed for reproducible generation (Python, NumPy). Eval uses `seed+1` |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Score of evaluated problems (typically 0 or 1; aggregated as mean) |\n\n### Example Structure\n\nNormalized examples produced by this environment follow this schema:\n\n```txt\nquestion: str\nanswer: str\ninfo: \n  - task_name: str\n  - difficulty: str\n  - split: str\n  - language: str\n  - meta_json: Optional[str]  # JSON-encoded metadata\n```\n\n### Verifier Integration\n\nPer-example scoring dynamically imports `verifiable_tasks.tasks.<task_name>.verifier` and calls `verify(solution: str, answer: str, meta: dict) -> float|int`. If a verifier cannot be resolved, the reward defaults to `0.0` to fail closed.\n","encoding":"utf-8","truncated":false,"total_bytes":4588},"status":null}