{"data":{"kind":"file","path":"README.md","version_id":"ei0lvtkfymols78gq2y6ilen","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3092,"modified_at":"2025-10-20T00:56:03.109000","content_hash":"c8b8dced310ae659795a754460cd6dade3b1c3abc5f557a3a5a655af924527a5"},"entries":[],"content":"# multi-armed-bandit\n\n### Overview\n- **Environment ID**: `multi-armed-bandit`\n- **Short description**: A Gaussian multi-armed bandit environment for training agents to balance exploration and exploitation\n- **Tags**: reinforcement-learning, bandit, exploration-exploitation, decision-making\n\n### Datasets\n- **Primary dataset(s)**: Procedurally generated bandit instances with random mean rewards\n- **Source links**: Generated on-the-fly using `create_mab_dataset()`\n- **Split sizes**: Default 1000 train / 100 eval (configurable via `num_train_examples`, `num_eval_examples`)\n\n### Task\n- **Type**: multi-turn\n- **Parser**: `Parser` (basic parser, actions are \"pull {arm_number}\")\n- **Rubric overview**: Simple total reward - sum of all bandit rewards received (matches ICRL implementation)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval multi-armed-bandit\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval multi-armed-bandit \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_arms\": 10, \"max_turns\": 100, \"sigma\": 0.3}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- Each episode runs for `max_turns` arm pulls (default 100)\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `1000` | Number of training bandit instances |\n| `num_eval_examples` | int | `100` | Number of evaluation bandit instances |\n| `num_arms` | int | `10` | Number of bandit arms |\n| `sigma` | float | `0.3` | Standard deviation for Gaussian reward sampling |\n| `max_turns` | int | `100` | Maximum number of arm pulls per episode |\n| `seed` | int | `None` | Random seed for dataset generation |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Total reward accumulated (sum of all bandit rewards received) |\n| `reward_mean` | Average total reward across episodes (matches ICRL's `average_reward`) |\n| `reward_std` | Standard deviation of total rewards (matches ICRL's `std_reward`) |\n\n### Additional State Tracking\n\nThe environment also tracks (accessible in state, not part of reward):\n- `regret`: Cumulative regret (optimal_reward - chosen_reward for each pull)\n- `arm_counts`: Number of times each arm was pulled\n- `arm_rewards`: Cumulative reward from each arm\n- `step_count`: Number of arm pulls made\n\n### Game Rules\n\nThe agent plays a multi-armed bandit game where:\n1. There are `num_arms` arms numbered 0 to `num_arms-1`\n2. Each arm has a fixed mean reward drawn from U(0, 1)\n3. Pulling an arm returns a reward sampled from N(mean, sigma²)\n4. The agent's goal is to maximize total reward over `max_turns` pulls\n5. Actions must be in the format: `pull {arm_number}`\n6. Invalid actions receive a penalty of -1.0\n\n### Reward Function\n\n**Total Reward**: Simple sum of all rewards received from pulling arms\n- No normalization or weighting\n- Directly matches ICRL implementation\n- Formula: `sum(all rewards from arm pulls)`\n- This allows comparing average performance across different bandit instances\n","encoding":"utf-8","truncated":false,"total_bytes":3092},"status":null}