{"data":{"kind":"file","path":"README.md","version_id":"lzj7j7j9xb542itsvdyfpp2f","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4262,"modified_at":"2025-09-16T08:13:53.141000","content_hash":"9a768f27adb66c66dcc08cf978e19fcaf77b3f0232c4875cbc730dce2ccc38ed"},"entries":[],"content":"# Hurdle Wordle Environment\n\nA custom environment implementing Hurdle Wordle, a challenging variant of Wordle designed for testing LLMs (Large Language Models).\n\n### Overview\n- **Environment ID**: `hurdle-wordle`\n- **Short description**: A Wordle variant that provides only counts of green/yellow letters, not their positions\n- **Tags**: hurdle-wordle, wordle, word-game, reasoning, llm-testing\n\n### Datasets\n- **Primary dataset(s)**: Uses standard English 5-letter word dictionary from TextArena Wordle\n- **Source links**: Integrated with TextArena framework\n- **Split sizes**: Configurable (default: 2000 train, 20 eval)\n\n### Task\n- **Type**: Multi-turn interactive word guessing game\n- **Parser**: XMLParser with `<think>` and `<guess>` fields\n- **Rubric overview**: \n  - Exact match reward (1.0 for correct guess)\n  - Partial credit based on green/yellow counts (0.2 per green, 0.1 per yellow)\n  - Turn efficiency reward (1.0 / (turns + 1))\n  - Format compliance reward\n\n### Game Rules\n\n- **Objective**: Guess a secret 5-letter word\n- **Attempts**: 8 chances (compared to 6 in regular Wordle)\n- **Feedback**: After each guess, you receive:\n  - `greens`: Number of letters that are correct and in the correct position\n  - `yellows`: Number of letters that are correct but in the wrong position\n- **No Position Information**: Unlike regular Wordle, exact positions of green/yellow letters are not revealed\n- **Word Validation**: All guesses must be valid 5-letter words from the game dictionary\n\n### Key Differences from Regular Wordle\n\n| Feature | Regular Wordle | Hurdle Wordle |\n|---------|---------------|---------------|\n| Feedback | Position-specific colors | Only counts |\n| Attempts | 6 | 8 |\n| Difficulty | Moderate | High |\n| Information | Full position data | Minimal information |\n\n### Example Gameplay\n\n```\nSecret word: PLANT (hidden)\n\nGuess 1: [CRANE]\nFeedback: greens: 2, yellows: 0\n\nGuess 2: [AUDIO] \nFeedback: greens: 0, yellows: 1\n\nGuess 3: [PLANT]\nFeedback: greens: 5, yellows: 0\n🎉 Congratulations! You guessed the word correctly!\n```\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval hurdle-wordle\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval hurdle-wordle -m gpt-4o-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{\"use_think\": true}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `2000` | Number of training examples |\n| `num_eval_examples` | int | `20` | Number of evaluation examples |\n| `use_think` | bool | `true` | Whether to use thinking step before guessing |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of criteria) |\n| `check_answer_reward_func` | 1.0 if correct guess, 0.0 otherwise |\n| `partial_credit_reward_func` | 0.2 × greens + 0.1 × yellows |\n| `count_turns_reward_func` | Efficiency reward: 1.0 / (turns + 1) |\n| `format_reward` | Compliance with XML format requirements |\n\n### Implementation Features\n\n- **Proper Feedback Calculation**: Handles repeated letters correctly using standard Wordle logic\n- **Input Validation**: Checks word format, length, and dictionary validity  \n- **Game State Management**: Tracks guess history and game progress\n- **Win/Lose Conditions**: Detects wins and enforces 8-guess limit\n- **LLM Integration**: Compatible with verifiers framework for testing language models\n\n### Testing Coverage\n\nThe environment includes comprehensive unit tests covering:\n\n- ✅ Guess with all wrong letters (greens=0, yellows=0)\n- ✅ Guess with some yellows but no greens  \n- ✅ Guess with repeated letters\n- ✅ Winning on first attempt\n- ✅ Reaching 8 attempts without success\n- ✅ Input validation (format, length)\n- ✅ Complex feedback scenarios\n\n### Technical Notes\n\nThe feedback calculation follows standard Wordle logic:\n1. First pass: Count exact position matches (greens) and mark positions as used\n2. Second pass: Count letter matches in wrong positions (yellows) from remaining letters\n\nThis ensures proper handling of repeated letters and matches the behavior players expect from Wordle-style games.\n\n---\n\n## Evaluation Reports\n\n*Evaluation reports will be automatically rendered below when available*","encoding":"utf-8","truncated":false,"total_bytes":4262},"status":null}