{"data":{"kind":"file","path":"README.md","version_id":"a01s8e7jfc5j7cf9q0zilpvm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2726,"modified_at":"2025-10-28T20:36:04.803000","content_hash":"ba4076cd5c3467b1e709a646d93c57d24565366e3f43a902389ef853e86b9ca8"},"entries":[],"content":"# WhoDunItHolmes\r\n\r\n## Overview\r\n\r\n**Environment ID:** WhoDunItHolmes\r\n**Short description:** Culprit detection on *kjgpta/WhoDunIt*, rewarding both correct identification and adherence to the concise Victorian “Holmesian” reasoning style.\r\n**Tags:** mystery, culprit-detection, single-turn, think, holmes\r\n\r\n## Datasets\r\n\r\n**Primary dataset(s):** *kjgpta/WhoDunIt* (train/test via Hugging Face Hub)\r\n**Source links:** [`datasets.load_dataset(\"kjgpta/WhoDunIt\")`](https://huggingface.co/datasets/kjgpta/WhoDunIt)\r\n**Split sizes:** Defaults to full train/test (320 train, 80 test)\r\n**Fields:** `text`, `title`, `author`, `length`, `culprit_ids`, and optional `metadata`\r\n\r\n## Task\r\n\r\n**Type:** single-turn\r\n**Parser:** `XMLParser(fields=[\"ponder\", \"verdict\"], answer_field=\"verdict\")`\r\n**Rubric overview:**\r\n\r\n* *Correctness:* strict match between parsed `<verdict>` and gold culprit.\r\n* *Format check:* verifies adherence to `<ponder>` and `<verdict>` tags.\r\n* *Style reward:* evaluates “Holmesian” diction vs modern style using a zero-shot classifier (`facebook/bart-large-mnli`).\r\n\r\n**Scoring Weights:**\r\n\r\n| Metric              | Weight | Description                             |\r\n| ------------------- | ------ | --------------------------------------- |\r\n| correct_answer      | 2.5    | 1.0 if culprit correctly identified     |\r\n| format_reward       | 0.4    | XML tag correctness                     |\r\n| holmes_style_reward | 0.6    | Confidence of Victorian reasoning style |\r\n\r\n## Quickstart\r\n\r\nRun an evaluation with default settings:\r\n\r\n```bash\r\nuv run vf-eval WhoDunItHolmes\r\n```\r\n\r\nConfigure model and sampling:\r\n\r\n```bash\r\nuv run vf-eval WhoDunItHolmes \\\r\n  -m deepseek-ai/DeepSeek-V3.1 \\\r\n  -n 5 -r 1 -t 160 -T 0.2\r\n```\r\n\r\n## Environment Arguments\r\n\r\n| Arg                | Type | Default | Description                            |\r\n| ------------------ | ---- | ------- | -------------------------------------- |\r\n| num_train_examples | int  | -1      | Limit training set size (-1 for all)   |\r\n| num_eval_examples  | int  | -1      | Limit evaluation set size (-1 for all) |\r\n\r\n## Metrics\r\n\r\n| Metric              | Meaning                                                                                      |\r\n| ------------------- | -------------------------------------------------------------------------------------------- |\r\n| correct_answer      | 1.0 if predicted `<verdict>` equals gold culprit                                             |\r\n| format_reward       | Adherence to required XML structure `<ponder>` + `<verdict>`                                 |\r\n| holmes_style_reward | Reward for “Holmesian” (Victorian detective) linguistic style, based on zero-shot classifier |\r\n","encoding":"utf-8","truncated":false,"total_bytes":2726},"status":null}