{"data":{"kind":"file","path":"README.md","version_id":"tu14a4pbwqkuhch45s8qhwye","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8916,"modified_at":"2025-11-25T04:15:54.121000","content_hash":"7c569b1d3e1c349097b298a066e2a3dd99cfc9a208d8287d90ec9a1ef546174b"},"entries":[],"content":"# Catan RL Environment\n\nA Settlers of Catan environment for training LLMs to play strategic board games using reinforcement learning.\n\n## Overview\n\nThis environment enables LLMs to learn to play Settlers of Catan through RL training. The LLM plays as the BLUE player in a 1v1 game against a `VictoryPointPlayer` bot (greedy VP maximizer). Games are played to 7 Victory Points instead of the standard 10 for faster training iterations.\n\n### Features\n\n- **Multi-turn gameplay**: Full Catan game mechanics including building, trading, and development cards\n- **Dense reward structure**: Per-turn rewards for VP gains, buildings, and resources plus win/loss bonuses\n- **Invalid action handling**: Automatic retry with accumulated penalties for invalid actions\n- **Configurable difficulty**: Adjustable VP targets and opponent types\n- **Structured action format**: Clear numbered action presentation for reliable LLM parsing\n\n## Installation\n\n### Prerequisites\n\n- Python 3.12+\n- `uv` package manager (install from [astral.sh/uv](https://docs.astral.sh/uv/))\n- Access to a GPU for training\n\n### Install from Source\n\nFrom the `examples/catan/` directory:\n\n```bash\n# Sync all dependencies (including catanatron from GitHub)\ncd examples/catan\nuv sync\n```\n\nThis will automatically:\n- Create a virtual environment (`.venv`)\n- Install catanatron from GitHub\n- Install verifiers and all required packages\n- Install the catan-env package in editable mode\n\n### Verify Installation\n\n```bash\ncd examples/catan\nuv run python -c \"from catan_env import CatanEnv; print('✓ Installation successful!')\"\n```\n\n## Quick Start\n\n### Using vf-eval (Native Verifiers 
Evaluation)\n\nThe easiest way to evaluate:\n\n```bash\ncd examples/catan\n\n# With OpenAI API\nexport OPENAI_API_KEY=\"your-key\"\nuv run vf-eval catan -m gpt-4o -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 5 -r 2\n\n# With local vLLM server (run in another terminal: vllm serve <model> --port 8000)\nuv run vf-eval catan -m Qwen/Qwen2.5-1.5B-Instruct -b http://localhost:8000/v1 -k EMPTY -n 10 -r 4\n\n# With custom environment args\nuv run vf-eval catan -a '{\"vps_to_win\": 5, \"max_turns\": 50}' -m gpt-4o -b <url> -k <key> -n 3 -r 1\n```\n\nSee [EVALUATION.md](EVALUATION.md) for complete vf-eval documentation.\n\n### Using with verifiers (Python API)\n\n```bash\ncd examples/catan\n\n# Run with uv\nuv run python -c \"\nimport asyncio\nfrom openai import AsyncOpenAI\nfrom catan_env import CatanEnv\n\nasync def test():\n    env = CatanEnv(vps_to_win=7)\n    dataset = env.get_dataset(n=10)\n    \n    client = AsyncOpenAI(\n        api_key='your-api-key',\n        base_url='http://localhost:8000/v1'\n    )\n    \n    results = await env.evaluate(\n        client=client,\n        model='your-model-name',\n        num_examples=10,\n        rollouts_per_example=4,\n    )\n    \n    print(f'Average reward: {results.metadata.avg_reward}')\n\nasyncio.run(test())\n\"\n```\n\n### Training with prime-rl\n\nFrom the prime-rl root directory:\n\n```bash\n# Start inference server\nuv run inference @ configs/your-inference-config.toml\n\n# In another terminal, start orchestrator\nuv run orchestrator @ examples/catan/configs/train.toml\n\n# In another terminal, start trainer  \nuv run trainer @ configs/your-train-config.toml\n```\n\n## Environment Configuration\n\n### Constructor Parameters\n\n```python\nCatanEnv(\n    vps_to_win=7,                        # Victory points to win (default: 7)\n    opponent_type=\"VictoryPointPlayer\",  # Opponent bot type\n    max_turns=100,                       # Max turns before timeout\n    max_invalid_actions=5,               # Max invalid 
actions before episode ends\n)\n```\n\n### Opponent Types\n\n- **`VictoryPointPlayer`** (recommended): Greedily maximizes VP each turn - medium difficulty\n- **`WeightedRandomPlayer`**: Random with preference for high-value actions - easy-medium\n- **`RandomPlayer`**: Completely random - easy\n\n## Reward Structure\n\nThe environment provides dense rewards to guide learning:\n\n### Per-Turn Rewards\n\n- **+0.5 per VP gained**: Encourages progress toward winning\n- **+0.2 per building placed**: Rewards infrastructure development  \n- **+0.05 per resource gained**: Incentivizes resource collection\n- **-0.1 per invalid action**: Small penalty for mistakes (accumulated)\n\n### Game Outcome Rewards (Final Turn)\n\n- **+10.0 for winning**: Strong positive signal for victory\n- **-5.0 for losing**: Moderate negative signal for defeat\n\n### Example Scenarios\n\n**Close Loss (6 VP vs 7 VP):**\n- Per-turn rewards: ~6 VP × 0.5 + buildings + resources = +4.0\n- Loss penalty: -5.0\n- **Total: ~-1.0** (not catastrophic for a close game)\n\n**Dominant Win (7 VP vs 3 VP):**\n- Per-turn rewards: ~7 VP × 0.5 + buildings + resources = +5.0  \n- Win bonus: +10.0\n- **Total: ~+15.0** (strongly positive)\n\n## Action Format\n\nThe LLM receives game state in a structured format:\n\n```\n=== GAME STATE ===\n\nVictory Points: YOU=3 | OPPONENT=2 | Goal: 7 VP\n\nYour Resources: WOOD=2, BRICK=1, SHEEP=3, WHEAT=1, ORE=0\n\nYour Buildings: Settlements=2, Cities=0, Roads=3\n\nDevelopment Cards in Hand: 1\n\nTurn Status: YOUR TURN (already rolled)\n\n=== VALID ACTIONS ===\n\n[0] BUILD_SETTLEMENT node=15\n[1] BUILD_ROAD edge=(5,7)\n[2] MARITIME_TRADE give=4xWOOD get=ORE\n[3] PLAY_KNIGHT_CARD\n[4] END_TURN\n\nSelect your action by responding: ACTION: <number>\n```\n\nThe LLM should respond with: `ACTION: 2` (for example)\n\n## Dataset\n\nThe environment generates datasets with different game seeds for varied board layouts:\n\n```python\nfrom catan_env import generate_dataset, get_train_dataset, 
get_eval_dataset\n\n# Generate custom dataset\ncustom_dataset = generate_dataset(n=100, seed=1234, split=\"custom\")\n\n# Use standard datasets\ntrain_dataset = get_train_dataset()  # 500 examples, seed=42\neval_dataset = get_eval_dataset()    # 100 examples, seed=10000\n```\n\nEach dataset example contains:\n- `example_id`: Unique identifier\n- `prompt`: Initial game prompt\n- `info`: Metadata including `game_seed` for reproducibility\n\n## Game Flow\n\n1. **Initialization**: The board is set up with a random layout determined by the game seed\n2. **Initial Placement**: Starting pieces are auto-placed randomly (simplified for now; the LLM does not choose them)\n3. **Main Game Loop**:\n   - LLM receives current game state\n   - LLM selects an action by number\n   - Action is validated and executed\n   - If invalid: penalty applied, retry with error message\n   - If valid: opponent takes turn(s), new state presented\n4. **Game End**: The game ends when a player reaches the VP target (`vps_to_win`, 7 by default) or `max_turns` is exceeded\n\n## Development\n\n### Running Tests\n\n```bash\ncd examples/catan\n\n# Sync dev dependencies first\nuv sync --group dev\n\n# Run tests\nuv run pytest tests/ -v\n\n# Run integration test\nuv run python tests/test_integration.py\n```\n\n### File Structure\n\n```\nexamples/catan/\n├── catan_env/\n│   ├── __init__.py          # Exports and dataset generation\n│   ├── environment.py       # Main CatanEnv class\n│   ├── prompts.py           # Prompt formatting\n│   ├── parser.py            # Action parsing\n│   ├── rubric.py            # Reward calculation\n│   └── utils.py             # Helper functions\n├── configs/\n│   ├── train.toml           # Training configuration\n│   └── eval.toml            # Evaluation configuration\n├── tests/\n│   └── test_env.py          # Unit tests\n├── pyproject.toml           # Package configuration\n└── README.md                # This file\n```\n\n## Tips for Training\n\n1. **Start with lower VP targets** (5-7) for faster iterations\n2. **Use temperature 0.7-0.8** for training to maintain exploration\n3. 
**Monitor invalid action rates** - should decrease during training\n4. **Track VP progression** - models should learn to prioritize VP-gaining actions\n5. **Use curriculum learning** - start with easier opponents, increase difficulty\n\n## Troubleshooting\n\n### Environment not found\n\n```bash\n# Re-sync dependencies\ncd examples/catan\nuv sync\n```\n\n### Import errors from catanatron\n\n```bash\n# Clean and re-sync\ncd examples/catan\nrm -rf .venv\nuv sync\n```\n\n### Games taking too long\n\n- Reduce `vps_to_win` to 5 or 6\n- Decrease `max_turns` if games aren't finishing\n- Check for infinite retry loops on invalid actions\n\n## Citation\n\nIf you use this environment in your research, please cite:\n\n```bibtex\n@misc{catan-rl-env,\n  title={Catan RL Environment for LLM Training},\n  author={Prime Intellect},\n  year={2025},\n  url={https://github.com/PrimeIntellect-ai/prime-rl}\n}\n```\n\n## License\n\nThis environment is part of the prime-rl project and is licensed under the Apache 2.0 license.\n\nThe catanatron dependency is licensed under GPL-3.0.\n\n## Contributing\n\nContributions are welcome! Areas for improvement:\n\n- [ ] Let LLM choose initial settlement placements\n- [ ] Add support for domestic (player-to-player) trading\n- [ ] Implement curriculum learning configs\n- [ ] Add visualization of game states\n- [ ] Support 3-4 player games\n- [ ] Add more opponent types (MCTS, AlphaBeta)\n\nPlease open issues or PRs on the main prime-rl repository.\n\n","encoding":"utf-8","truncated":false,"total_bytes":8916},"status":null}