{"data":{"kind":"file","path":"README.md","version_id":"rn88u8mtpt022dj6gzmx8tjz","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":12174,"modified_at":"2025-11-07T17:11:00.155000","content_hash":"329799873bbebe9ae96a10626301d995856711fe53b5f9513f065d167c93ed54"},"entries":[],"content":"# alphazero-llm-trainer\n\nmcts-guided training environment for llms on mathematical reasoning with adaptive depth control\n\n## overview\n\n- **environment id**: `alphazero-llm-trainer`\n- **task type**: multi-turn mcts tree search\n- **dataset**: gsm8k (openai/gsm8k)\n- **base class**: verifiers `MultiTurnEnv` with adaptive termination\n- **tags**: math, reasoning, mcts, grpo, eval, train\n\n## quickstart\n\n### evaluate with vf-eval\n\n```bash\n# default evaluation (uses depth=12 from config)\nuv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 5\n\n# laptop eval with shallow depth (faster)\nMAX_TREE_DEPTH=5 uv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 5\n\n# deep search with more compute\nMAX_TREE_DEPTH=50 NUM_MCTS_ITERATIONS=200 uv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 10\n```\n\n### train the student model\n\n```bash\n# train with mcts + grpo (default settings)\npython train_vllm.py --num-examples 50\n\n# train with custom depth\nMAX_TREE_DEPTH=20 python train_vllm.py --num-examples 100 --use-grpo\n\n# train with adaptive depth and token budget\nMAX_TREE_DEPTH=15 MAX_TOKENS=5000 python train_vllm.py --num-examples 200\n```\n\n## key feature: adaptive depth control\n\n**what changed:** the environment now uses `MultiTurnEnv` with adaptive depth control via environment variables instead of hardcoded config values.\n\n**why:** allows scaling mcts search depth based on available hardware without modifying config files.\n\n### depth scaling by hardware\n\n| hardware | recommended depth | iterations | expected time |\n|----------|------------------|------------|---------------|\n| laptop / cpu | 5 | 30-40 | 2-5 min/sample |\n| single gpu | 12 (default) | 60 | 5-10 min/sample |\n| multi-gpu / a100 | 20-30 | 100-150 | 10-15 min/sample |\n| h100 cluster | 50+ | 200+ | 15-20 min/sample |\n\n### environment variables\n\n| variable | default | description |\n|----------|---------|-------------|\n| `MAX_TREE_DEPTH` | 12 (from config) | maximum tree depth - overrides config |\n| `NUM_MCTS_ITERATIONS` | 60 (from config) | iterations per search - overrides config |\n| `MAX_TOKENS` | 0 (unlimited) | optional token budget for cost control |\n\n**important:** these env vars override config values, enabling compute-adaptive scaling without editing yaml files.\n\n### usage examples\n\n```bash\n# quick laptop test (shallow)\nMAX_TREE_DEPTH=5 python train_vllm.py --num-examples 10\n\n# standard workstation training\nMAX_TREE_DEPTH=12 python train_vllm.py --num-examples 50\n\n# deep search on h100\nMAX_TREE_DEPTH=50 NUM_MCTS_ITERATIONS=200 python train_vllm.py --num-examples 1000\n\n# cost-controlled with token budget\nMAX_TREE_DEPTH=15 MAX_TOKENS=3000 python train_vllm.py --num-examples 100\n```\n\n## command reference\n\n### vf-eval commands\n\n```bash\n# basic evaluation\nuv run vf-eval alphazero-llm-trainer -m <model> -n <samples>\n\n# with adaptive depth\nMAX_TREE_DEPTH=<depth> uv run vf-eval alphazero-llm-trainer -m <model> -n <samples>\n\n# multiple rollouts\nuv run vf-eval alphazero-llm-trainer -m <model> -n <samples> -r <rollouts>\n```\n\n### training commands\n\n```bash\n# basic training\npython train_vllm.py --num-examples <n>\n\n# with adaptive depth\nMAX_TREE_DEPTH=<depth> python train_vllm.py --num-examples <n>\n\n# with all options\nMAX_TREE_DEPTH=<depth> NUM_MCTS_ITERATIONS=<iters> MAX_TOKENS=<budget> \\\n  python train_vllm.py --num-examples <n> --use-grpo --save-every <freq>\n```\n\n### environment arguments (load_environment)\n\n| argument | type | default | description |\n|----------|------|---------|-------------|\n| `tier` | str | `\"free\"` | model tier for teacher ensemble |\n| `use_student_model` | bool | `false` | enable student model for perplexity rewards |\n\n### training script arguments (train_vllm.py)\n\n| argument | type | default | description |\n|----------|------|---------|-------------|\n| `--num-examples` | int | 50 (config) | number of training examples |\n| `--checkpoint-dir` | str | ./checkpoints | checkpoint save directory |\n| `--save-every` | int | 300 (config) | checkpoint save frequency |\n| `--log-interval` | int | 10 (config) | logging frequency |\n| `--use-grpo` | flag | config | enable grpo training mode |\n| `--use-legacy` | flag | false | use legacy supervised learning |\n\n## metrics\n\n| metric | description | range |\n|--------|-------------|-------|\n| `reward` | combined weighted reward (0.4*hre + 0.6*pre) | 0.0 - 1.0 |\n| `correctness_reward` | binary correctness from hre | 0.0 or 1.0 |\n| `perplexity_reward` | trajectory quality from pre (if enabled) | 0.0 - 1.0 |\n| `depth` | actual tree depth reached | 0 - MAX_TREE_DEPTH |\n| `terminal` | whether problem was solved | true/false |\n\n## implementation details\n\n### what was changed\n\n**1. singleturnenv → multiturnenv**\n- converted from `vf.SingleTurnEnv` to `vf.MultiTurnEnv`\n- enables multi-turn mcts tree search with proper state management\n- implements `env_response()` (returns empty string for single-agent)\n- implements `is_completed()` with multi-condition termination\n\n**2. adaptive depth via environment variables**\n- `MAX_TREE_DEPTH` overrides config for compute-adaptive scaling\n- no hardcoded depth limits from config files\n- scales from 5 (laptop) to 50+ (h100 cluster)\n\n**3. proper vllm backend**\n- replaced placeholder terminal checker with `VLLMTerminalChecker`\n- uses qwen 2.5 0.5b for fast terminal state detection\n- proper vllm initialization for efficient inference\n\n**4. verifiers-compatible dataset schema**\n- changed dataset column from `'question'` to `'prompt'`\n- matches verifiers expected schema for seamless integration\n- format: `{'prompt': question, 'answer': solution, 'info': {...}}`\n\n**5. multi-condition termination**\n- safety limit: turn >= 100 (base class)\n- adaptive depth: depth >= MAX_TREE_DEPTH\n- terminal state: problem solved\n- mcts exhausted: no more promising branches\n- token budget: total_tokens >= MAX_TOKENS (if configured)\n\n### mcts search system\n\n- monte carlo tree search with adaptive depth (5-50+ levels)\n- ucb1 selection with exploration constant √2\n- binary expansion (right/wrong token branches)\n- terminates when any condition above is met\n\n### reward system\n\n**hard reward estimator (hre) - 40% weight:**\n- validates numerical answers against ground truth\n- uses vllm terminal checker for fast inference\n- binary reward: 1.0 if correct, 0.0 otherwise\n\n**perplexity reward estimator (pre) - 60% weight:**\n- measures trajectory quality via student model perplexity\n- normalized against baseline perplexity\n- only active if `use_student_model=True`\n\n**combined rewards:**\n- weighted combination: 0.4 * hre + 0.6 * pre\n- length penalties for efficiency (-0.01 * path_length)\n- direction bonuses (0.1 * right_count - 0.05 * wrong_count)\n\n### components\n\n- **student model**: llama 3.2 3b (4-bit quantized, lora r=32)\n- **teacher ensemble**: deepseek r1, gpt-oss, llama 3.3, qwen3 (free tier)\n- **terminal checker**: qwen 2.5 0.5b (vllm backend, 5% gpu memory)\n- **training mode**: grpo (group relative policy optimization)\n\n## dataset\n\n- **source**: openai/gsm8k from huggingface\n- **format**: mathematical word problems with step-by-step solutions\n- **schema**: `{'prompt': question, 'answer': solution}`\n- **size**: configurable (default: 50 train, 20 eval)\n\n## configuration\n\nall hyperparameters are in yaml config files:\n\n- `config/training.yaml` - training hyperparameters\n- `config/models.yaml` - model configurations\n- `config/teacher_models.yaml` - teacher ensemble models\n\nkey settings:\n- mcts iterations: 60 (override with `NUM_MCTS_ITERATIONS`)\n- max tree depth: 12 (override with `MAX_TREE_DEPTH`)\n- reward weights: 40% hre, 60% pre\n- batch size: 8 (gradient accumulation: 4)\n- learning rate: 2e-4 (warmup: 50 steps)\n\n## complete examples\n\n### 1. quick laptop test (shallow depth)\n\n```bash\n# evaluation\nMAX_TREE_DEPTH=5 uv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 5\n\n# training\nMAX_TREE_DEPTH=5 python train_vllm.py --num-examples 10\n```\n\n### 2. standard workstation training (default depth)\n\n```bash\n# evaluation\nuv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 10\n\n# training with grpo\nMAX_TREE_DEPTH=12 python train_vllm.py --num-examples 50 --use-grpo\n```\n\n### 3. deep search on h100 cluster\n\n```bash\n# evaluation with deep search\nMAX_TREE_DEPTH=50 NUM_MCTS_ITERATIONS=200 uv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 20\n\n# training with maximum quality\nMAX_TREE_DEPTH=50 NUM_MCTS_ITERATIONS=200 python train_vllm.py --num-examples 1000 --use-grpo\n```\n\n### 4. cost-controlled with token budget\n\n```bash\n# evaluation with token limit\nMAX_TREE_DEPTH=15 MAX_TOKENS=3000 uv run vf-eval alphazero-llm-trainer -m gpt-4o-mini -n 50\n\n# training with token limit\nMAX_TREE_DEPTH=15 MAX_TOKENS=5000 python train_vllm.py --num-examples 100\n```\n\n## architecture\n\n```\nalphazero_llm_trainer/\n├── alphazero_llm_trainer.py  # entry point (load_environment)\n├── core/\n│   ├── environment.py         # multiturnenv with adaptive depth\n│   ├── mcts.py               # mcts search system\n│   └── tree.py               # tree node implementation\n├── models/\n│   ├── student_model.py      # llama 3.2 3b with lora\n│   ├── teacher_ensemble_vllm.py  # teacher ensemble\n│   └── terminal_checker_vllm.py  # vllm terminal checker\n├── rewards/\n│   ├── hre.py                # hard reward estimator\n│   ├── pre.py                # perplexity reward estimator\n│   └── combined.py           # combined reward system\n├── config/\n│   ├── training.yaml         # training hyperparameters\n│   ├── models.yaml           # model configurations\n│   └── teacher_models.yaml   # teacher ensemble config\n└── train_vllm.py             # training script\n```\n\n## requirements\n\n- python >= 3.11\n- verifiers >= 0.1.5.post0\n- vllm >= 0.11.0\n- torch == 2.8.0\n- unsloth >= 2024.8\n- transformers, datasets, numpy, scikit-learn\n\n## testing\n\nrun configuration tests:\n\n```bash\ncd /home/pradheep/alphazero-llm-trainer/environments/alphazero_llm_trainer\npython test_config_only.py\n```\n\nexpected output:\n```\n✓ default depth: 12 (from config)\n✓ laptop depth: 5 (from MAX_TREE_DEPTH env var)\n✓ h100 depth: 50 (from MAX_TREE_DEPTH env var)\n✓ mcts iterations: 200 (from NUM_MCTS_ITERATIONS env var)\n✓ token budget: 1000 (from MAX_TOKENS env var)\n✓ dataset schema: prompt/answer columns present\n✓ is_completed: adaptive depth check works\n✓ is_completed: terminal state check works\n✓ is_completed: mcts exhaustion check works\n✓ is_completed: token budget check works\n```\n\n## results\n\n### actual training results\n\n**latest training run (minimal configuration):**\n\n```bash\nMAX_TREE_DEPTH=30 NUM_MCTS_ITERATIONS=100 python train_vllm.py \\\n  --num-examples 10 \\\n  --eval-size 1319 \\\n  --checkpoint-dir ./checkpoints/a100_depth30_10examples \\\n  --save-every 5 \\\n  --log-interval 2\n```\n\n**training configuration:**\n- training examples: 10\n- tree depth: 30\n- mcts iterations: 100 per example\n- teacher ensemble: 4 models (10% gpu each)\n  - qwen2.5-7b-instruct-bnb-4bit\n  - meta-llama-3.1-8b-instruct-bnb-4bit\n  - mistral-7b-instruct-v0.3-bnb-4bit\n  - gemma-2-9b-it-bnb-4bit\n- terminal checker: qwen2.5-0.5b-instruct (5% gpu, max_len=512)\n- hardware: a100 gpu\n\n**results:**\n\n```\n[TRAINED] Final Accuracy: 14.63% (193/1319)\n\n===================================================\nCOMPARISON SUMMARY\n===================================================\nBase Model Accuracy:    12.89%\nTrained Model Accuracy: 14.63%\nImprovement:            +1.74%\n===================================================\n```\n\nResults saved to [gsm8k_comparison_results.json](gsm8k_comparison_results.json)\n\n**key metrics:**\n- base model: 12.89% accuracy on gsm8k\n- trained model: 14.63% accuracy on gsm8k\n- absolute improvement: +1.74 percentage points\n- relative improvement: +13.5% over base model\n- evaluation set: 1,319 problems\n- total training examples: 10 only\n\n**notes:**\n- this is a minimal proof-of-concept run with only 10 training examples\n- demonstrates +1.74% improvement even with minimal training data\n- 4 quantized teacher models (bitsandbytes 4-bit) with vllm backend\n- efficient gpu utilization: 40% (teachers) + 5% (checker) = 45% total\n","encoding":"utf-8","truncated":false,"total_bytes":12174},"status":null}