{"data":{"kind":"file","path":"README.md","version_id":"vx02le2a4fu705i9omf44yj0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5515,"modified_at":"2026-01-14T14:27:10.232000","content_hash":"8b8b01ca75f04566fecdc32de2582f6cef452ea82c5c80125d7628f8f5e7b2f9"},"entries":[],"content":"# Chess Environment for RL Training\n\nAlphaZero-style chess environment for reinforcement learning, compatible with Gymnasium and Prime Intellect's Environments Hub.\n\n## What & Intent\n\n1. **Tensor environment first** - `ChessTensorEnv` uses AlphaZero-style encoding (8x8x119 observations, 4672 discrete actions). This gives us a verifiably correct chess game that RL can interact with.\n\n2. **Naive PPO doesn't work** - Vanilla PPO with Stable-Baselines3 barely beats random play. Learning chess from scratch with sparse rewards is hard when your action space is 4672 and games last 100+ moves.\n\n3. **MCTS makes it work** - AlphaZero-style tree search provides much better training signal. The network learns from visit counts, not just win/loss. This gives us a \"zero-knowledge\" learning setup that actually learns.\n\n4. **Text/LLM interface is next** - The same game core will get a text wrapper (`ChessTextEnv`) for LLM agents and `prime-rl` integration.\n\n**An interesting TODO:** PPO might actually work in the LLM setting. Unlike a randomly-initialized CNN, an LLM already knows what chess *is*. PPO fine-tuning could work where it failed on tensors because the model has priors, the \"action space\" is just generating short text, and it can reason about positions. 
Worth testing.\n\n## Status\n\n**v0.1** - Tensor-based environment for CNN/RL training (AlphaZero-style).\n\n| Component | Status |\n|-----------|--------|\n| `ChessTensorEnv` | Complete - 8x8x119 obs, 4672 actions |\n| PPO Training | Complete - via Stable-Baselines3 |\n| MCTS Training | Complete - parallel self-play |\n| `ChessTextEnv` | Experimental - LLM interface for `prime-rl` planned for v0.2 |\n\n## Features\n\n- **Tensor Observations** (8x8x119): AlphaZero-style board encoding with 8-step history\n- **Discrete Actions** (4672): Full AlphaZero action space encoding\n- **MCTS Self-Play**: AlphaZero-style tree search training\n- **SB3 Integration**: PPO training with Stable-Baselines3\n- **Deterministic**: Same inputs always produce same outputs\n- **PI-Compatible**: Designed for Prime Intellect's RL infrastructure\n\n## Installation\n\n```bash\n# With uv (recommended)\nuv pip install -e .\n\n# With pip\npip install -e .\n```\n\n## Quick Start\n\n```python\nimport numpy as np\n\nfrom chess_env import ChessTensorEnv\n\nenv = ChessTensorEnv()\nobs, info = env.reset()\n\n# Get the legal-action mask for the current position\nlegal_actions = info[\"action_mask\"]\n\n# Take a random legal action\naction = np.random.choice(np.where(legal_actions)[0])\nobs, reward, terminated, truncated, info = env.step(action)\n\n# Convert between SAN and action indices\naction = env.san_to_action(\"e4\")\nsan = env.action_to_san(action)\n```\n\n## Training\n\n### MCTS Self-Play (AlphaZero-style)\n\n```bash\n# Quick test\nuv run python scripts/train_mcts.py --iterations 10 --games 5 --simulations 50 --network tiny\n\n# Longer training\nuv run python scripts/train_mcts.py --iterations 100 --games 20 --simulations 100 --network small\n```\n\n### PPO with Stable-Baselines3\n\n```bash\nuv run python scripts/train.py --timesteps 100000 --network small --n-envs 12\n```\n\n### Benchmarking\n\n```bash\n# Benchmark MCTS model\nuv run python scripts/benchmark_mcts.py checkpoints/mcts_final.pt --simulations 100\n\n# Benchmark 
PPO model\nuv run python scripts/benchmark.py checkpoints/final_model.zip --games 20\n```\n\n## Running Tests\n\n```bash\nuv run pytest\nuv run python scripts/smoke_local.py\n```\n\n## Architecture\n\n```\nChessTensorEnv (Gymnasium)\n├── ObservationEncoder (8x8x119 tensor)\n├── ActionEncoder (4672 discrete actions)\n└── python-chess (game logic)\n        │\n        ├── MCTS Training\n        │   ├── Tree search with UCB/PUCT\n        │   ├── Parallel self-play\n        │   └── Policy + Value targets\n        │\n        └── PPO Training (SB3)\n            ├── CNN feature extractor\n            └── VecNormalize\n```\n\n## Roadmap\n\n- [x] v0.1: TensorEnv + MCTS + PPO training\n- [ ] v0.2: ChessTextEnv for LLM agents (`prime-rl` integration)\n- [ ] v0.3: Curriculum learning (MateIn1, KQK endgames)\n\n## LLM Interface Concept (v0.2 Planning)\n\nThe text/LLM interface opens up several possible directions:\n\n**Benchmarking (no training)** - Evaluate how well various LLMs play chess out of the box. Metrics: legal move rate, win rate vs random, win rate vs Stockfish. Answers \"how good is GPT-4 at chess?\"\n\n**RL Fine-tuning** - Take a base LLM (Llama 8B, Qwen, etc.) and RL fine-tune it to play better. LoRA is the typical approach - efficient, doesn't destroy base capabilities. This is what `prime-rl` / `vf-rl` are designed for.\n\n**Curriculum** - Makes RL tractable. Start with \"just output legal moves\" (high reward for legality), then \"solve mate-in-1\", then full games. Prevents the death spiral where everything is illegal and there's no learning signal.\n\n### The Format\n\n`ChessTextEnv` presents the board as text:\n```\n=== Chess ===\nr n b q k b n r\np p p p p p p p\n. . . . . . . .\n. . . . . . . .\n. . . . P . . .\n. . . . . . . .\nP P P P . P P P\nR N B Q K B N R\n\nFEN: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1\nSide to move: Black\nLegal moves (20): a6, a5, b6, b5, ... 
Nf6, Nh6\n\nRespond with your move in <move>SAN</move> format.\n```\n\nThe LLM responds with `<move>e5</move>`. Rewards: +1 for a win, -1 for a loss, -0.2 for an illegal move.\n\n### Workflow\n\n1. `prime env install apullin/chess`\n2. Configure `prime-rl` with `ChessTextEnv`\n3. Pick a base model (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`)\n4. Run RL training with LoRA\n5. The model learns legal moves first, then strategy\n\n## License\n\nMIT\n","encoding":"utf-8","truncated":false,"total_bytes":5515},"status":null}