{"data":{"kind":"file","path":"README.md","version_id":"r2qp46bsbdbjv8nxb1cgir71","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1645,"modified_at":"2025-09-12T00:53:08.147000","content_hash":"35c87acbd598d925c96e020b01a5645913d524bb43fec4061f12a9cc6f4d16ac"},"entries":[],"content":"# doublecheck\n\n### Overview\n- **Environment ID**: `doublecheck`\n- **Short description**: Two-turn math QA that asks the model to answer, then prompts “Are you sure?”; scored with a math rubric.\n- **Tags**: math, multi-turn, xml, think-answer, verification\n\n### Datasets\n- **Primary dataset(s)**: `math` (example dataset loaded via `load_example_dataset`)\n- **Source links**: Uses the example loader in `verifiers.utils.data_utils`\n- **Split sizes**: Configurable via args; defaults to `train` split and all examples\n\n### Task\n- **Type**: multi-turn\n- **Parser**: XMLParser with fields `think`, `answer` (from `MathRubric`)\n- **Rubric overview**: `MathRubric` combining exact/equivalence math grading and a small format component\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval doublecheck\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval doublecheck \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"dataset_name\": \"math\", \"dataset_split\": \"train\", \"num_train_examples\": -1}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"math\"` | Example dataset name for math problems |\n| `dataset_split` | str | `\"train\"` | Dataset split to load |\n| `num_train_examples` | int | `-1` | Limit on dataset size (`-1` for all) |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Math answer correctness (symbolic/numeric equivalence) |\n| `format_reward` | Adherence to `<think>`/`<answer>` XML format |\n","encoding":"utf-8","truncated":false,"total_bytes":1645},"status":null}