{"data":{"kind":"file","path":"README.md","version_id":"uvy1uzlih4p30f4l91kchxyz","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3848,"modified_at":"2025-09-11T16:08:37.711000","content_hash":"a226cc4f1bf2cd638a6094797efa1326cf07ae829c622bbc87cd6b523555439a"},"entries":[],"content":"# Twenty Questions\r\n\r\n> ⚠️ **Work in Progress**: This environment is currently under active development and may have breaking changes or unexpected behavior.\r\n\r\n### Overview\r\n- **Environment ID**: `twenty-questions`\r\n- **Short description**: Multi-turn game where models try to guess a secret word/object by asking strategic yes/no questions within 20 turns.\r\n- **Tags**: game, multi-turn\r\n- **Socials**: [GitHub @ljt019](https://github.com/ljt019), [Hf @ljt019](https://huggingface.co/ljt019), [X @Ljt019117161](https://x.com/Ljt019117161)\r\n\r\n### Datasets\r\n- **Primary dataset(s)**: \r\n  - [ljt019/twenty-questions-600](https://huggingface.co/datasets/ljt019/twenty-questions-600): 600 diverse objects/concepts for guessing\r\n\r\n### Task & Scoring\r\n- **Type**: multi-turn reasoning game\r\n- **Parser**: XMLParser with fields `[\"think\", (\"question\", \"guess\")]` and `answer_field=\"guess\"` to extract final guesses\r\n- **Rubric overview**: Weighted scoring based on victory, efficiency, and format adherence\r\n\r\n**Game Mechanics:**\r\n\r\nThis environment uses an LLM to generate environment responses:\r\n- **Answerer LLM** (environment): Knows the answer, provides consistent yes/no answers\r\n\r\nModels receive:\r\n1. System prompt explaining the twenty questions game rules and format requirements\r\n2. Previous conversation history\r\n3. Responses from the answerer LLM (\"Yes\", \"No\", or \"I don't know\")\r\n4. 
Turn limit information (up to 20 questions/guesses)\r\n\r\nModels must respond with either:\r\n- `<question>Is it alive?</question>` for yes/no questions\r\n- `<guess>elephant</guess>` for final answer attempts\r\n\r\n**Expected Response Format:**\r\n```\r\n<think>\r\nBased on the answers so far, it seems to be a living creature that's large...\r\n</think>\r\n\r\n<question>Does it live in Africa?</question>\r\n```\r\n\r\nOr when ready to guess:\r\n```\r\n<think>\r\nAll the clues point to this being an elephant.\r\n</think>\r\n\r\n<guess>elephant</guess>\r\n```\r\n\r\nThe game continues until:\r\n- **Victory**: Model correctly guesses the answer word/object\r\n- **Turn Limit**: 20 questions/guesses are used up\r\n- **Wrong Guess**: Model makes an incorrect final guess\r\n\r\n### Quickstart\r\n\r\nRun an evaluation (requires answerer model configuration):\r\n```bash\r\nuv run vf-eval twenty-questions -a '{\"answerer_model\": \"gpt-4o-mini\", \"answerer_base_url\": \"https://api.openai.com/v1\"}'\r\n```\r\n\r\nOr with an explicit API key:\r\n```bash\r\nuv run vf-eval twenty-questions -a '{\"answerer_model\": \"gpt-4o-mini\", \"answerer_base_url\": \"https://api.openai.com/v1\", \"answerer_api_key\": \"your-api-key\"}'\r\n```\r\n\r\nBrowse results:\r\n```bash\r\nuv run vf-tui\r\n```\r\n\r\n## Environment Arguments\r\n\r\n| Arg                  | Type | Default      | Description                                    |\r\n| -------------------- | ---- | ------------ | ---------------------------------------------- |\r\n| `answerer_model`     | str  | **Required** | Model name for answerer LLM (knows the answer) |\r\n| `answerer_base_url`  | str  | **Required** | Base URL for answerer LLM API                 |\r\n| `answerer_api_key`   | str  | `OPENAI_API_KEY` env var | API key for answerer LLM |\r\n\r\n**Environment Variables:**\r\n- `OPENAI_API_KEY`: API key for the answerer LLM (used when `answerer_api_key` is not provided)\r\n\r\n---\r\n\r\n## Metrics\r\n\r\n| Metric       
                    | Weight | Meaning                                               |\r\n| -------------------------------- | ------ | ----------------------------------------------------- |\r\n| `reward`                         | -      | Final weighted rubric score (0.0 to 1.8)             |\r\n| `victory_reward`                 | 1.0    | Full reward (1.0) if answer is correctly guessed     |\r\n| `efficiency_reward`              | 0.5    | Bonus reward for guessing with fewer questions       |\r\n| `format_reward`                  | 0.3    | Reward for proper XML format usage                   |\r\n\r\n---\r\n","encoding":"utf-8","truncated":false,"total_bytes":3848},"status":null}