{"data":{"kind":"file","path":"README.md","version_id":"rjgqeoy0303ad2gimfyxvj6w","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":11191,"modified_at":"2025-09-23T02:24:39.673000","content_hash":"cbe26c7da808c1789d010ed7afbc11021e91137ad51d64d4801f4ab574fc2c8d"},"entries":[],"content":"# Blackjack Basic Strategy Env\n\n**Source Code Repository:** [https://github.com/peterN908/blackjack_env](https://github.com/peterN908/blackjack_env)\n\n### Overview\n- **Environment ID**: `blackjack-env`\n- **Short description**: Multi-turn heads-up Blackjack with EV shaping, plus a single-turn mode that scores the marginal EV of the chosen move.\n- **Tags**: blackjack, games, multi-turn, single-turn, eval\n\n### Datasets\n- **Primary dataset**: Programmatically generated randomized Blackjack states (shoe size, S17/H17, DAS).\n  - Multi-turn: initial deal states.\n  - Single-turn: initial or mid-hand states (double only on two cards; pairs/split constraints respected).\n- **Source**: On-load generator with basic-strategy continuation policy for EV estimation (see `strategy.py`).\n- **Size**: Controlled by `max_examples` (defaults to 200 if unspecified). 
Each evaluation uses fresh random scenarios (optionally fixed by `seed`).\n\n### Task\n- **Type**: multi-turn and single-turn (chat)\n- **Parser**: XMLParser expecting `<think>` and `<answer>` tags\n- **Rubric overview**:\n  - Multi-turn: EV of the first action (logged), marginal EV shaping across turns (main), plus format reward.\n  - Single-turn: marginal EV of the chosen move relative to basic strategy (main), plus format reward; absolute EV is logged.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval blackjack-env\n```\n\nDefaults when no flags are provided:\n- Model: `gpt-4.1-mini`\n- Provider: `https://api.openai.com/v1` using `OPENAI_API_KEY`\n- Examples (`-n`): `5`\n- Repeats (`-r`): `3`\n- Max concurrent (`-c`): `32`\n- Max tokens (`-t`): unset (use model default)\n- Temperature (`-T`): unset (use model default)\n- Save: not saved unless `-s` (local) or `-H` (HF Hub) is provided\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval blackjack-env \\\n  -m gpt-4.1-mini \\\n  -n 10 -r 3 -t 1024 -T 0.5 \\\n  -a '{\"max_examples\": 50, \"ev_samples\": 200, \"randomize_rules\": true}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The model should output the action inside `<answer>...</answer>` and may include reasoning in `<think>...</think>`.\n\n#### Single-turn Quickstart\n\nRun a single-turn evaluation (one state → one action; reward = marginal EV of the move):\n\n```bash\nuv run vf-eval blackjack-env \\\n  -m gpt-4.1-mini \\\n  -n 10 -r 3 -t 1024 -T 0.5 \\\n  -a '{\"mode\":\"single\", \"ev_samples\": 200, \"randomize_rules\": true}'\n```\nor equivalently:\n```bash\nuv run vf-eval blackjack-env -a '{\"single_turn\": true, \"ev_samples\": 200}'\n```\n\nParameters (configure model and sampling):\n- `-m/--model`: model name on your OpenAI-compatible endpoint (e.g., `gpt-4.1-mini`).\n- `-n/--num`: number of examples to evaluate.\n- `-r/--repeats`: rollouts per 
example (sampling repeats; results averaged).\n- `-t/--tokens`: max output tokens per generation.\n- `-T/--temperature`: sampling temperature.\n- `-a/--env-args`: JSON dict of environment-specific args (see table below).\n\nNote: Some models restrict certain knobs (e.g., fixed temperature). If you see a 400 about `temperature`, omit `-T` or set an allowed value.\n\n### Installation\n- From repo root, install the environment for local development:\n\n```bash\nvf-install blackjack-env\n```\n\n- Set your model provider credentials (OpenAI-compatible):\n  - Export in shell or create a `.env` and source it.\n\n```bash\nexport OPENAI_API_KEY=sk-...                 # required\n# Optional: custom OpenAI-compatible endpoint\nexport OPENAI_BASE_URL=https://api.openai.com/v1\n```\n\n- Re-run `vf-install blackjack-env` whenever you modify files to pick up changes.\n\n### CLI Play (For Fun)\nInstall the environment, then launch the interactive CLI:\n\n```bash\nuv run blackjack-play\n```\n\nOptions:\n- `--decks 6` set number of decks (default 6)\n- `--s17` or `--h17` dealer stands/hits soft 17 (default S17)\n- `--das` or `--no-das` double after split allowed (default allowed)\n- `--seed N` fix randomness\n\nGameplay:\n- You’ll see the rules, your hand, dealer upcard, allowed actions, and a reminder of constraints:\n  - Double only on two cards; Split only on identical pairs; one split max; DAS gates double-after-split; No surrender; Blackjack pays 3:2.\n- Type `HIT`, `STAND`, `DOUBLE`, or `SPLIT` (or `q` to quit the hand). 
Bankroll change is reported in bets.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_examples` | int | `-1` | Limit dataset size (use `-1` for all examples) |\n| `rules.s17` | bool | `true` | Dealer stands on soft 17 (S17) if true; otherwise H17 |\n| `rules.das` | bool | `true` | Double after split allowed (affects pair strategy for 2s/3s/4s/6s) |\n| `rules.double_11_vs_ace` | bool | `false` | If true, double hard 11 vs Ace; otherwise hit |\n| `use_think` | bool | `true` | Require `<think>` tag before `<answer>` (Wordle-style) |\n| `ev_samples` | int | `200` | Monte Carlo samples per EV estimate (run at each assistant turn) |\n| `rules.num_decks` | int | `6` | Force number of decks when `randomize_rules=false` |\n| `randomize_rules` | bool | `true` | Randomize S17/H17, DAS, and num decks per example; if `false`, use `rules.*` values |\n| `max_turns` | int | `12` | Safety cap: end rollout after this many assistant turns |\n| `max_format_retries` | int | `3` | After N invalid/malformed answers in a turn, auto-apply baseline action and continue |\n| `mode` | string | unset | When set to `\"single\"`, run single-turn mode; otherwise multi-turn |\n| `single_turn` | bool | `false` | Convenience flag; equivalent to `mode=\"single\"` when true |\n\nAllowed actions are: `HIT`, `STAND`, `DOUBLE`, `SPLIT`.\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of metrics (EV + format) |\n| `delta_ev_sum` | Sum over turns of EV(action\\|state) − EV(baseline\\|state) |\n| `ev_reward` | Monte Carlo expected value of the first action (bets) |\n| `realized_return_metric` | Realized total result of the entire hand (bets) |\n| `format_reward_func` | Parser’s format reward for well-formed tags |\n\nSingle-turn mode:\n- `reward` = `marginal_ev_reward + 0.1 × strict_format_reward`\n- `marginal_ev_reward`: EV(action|state) − EV(baseline|state) using a common random seed\n- `chosen_action_ev`: 
Absolute EV of the chosen action (logged)\n- `strict_format_reward`: Parser format score\n\nReward computation:\n- Main reward: `reward = delta_ev_sum + 0.1 × format_reward_func`.\n- `delta_ev_sum`: For each assistant turn t, we compute Q_t = EV(action|state_t) and V_t = EV(baseline|state_t) using Monte Carlo with the same random stream (low-variance). We add (Q_t − V_t) across all turns (including split hands). Baseline is the basic‑strategy policy adjusted to allowed actions for that state.\n- Malformed answers: The env accepts lenient forms (e.g., `<answer-STAND</answer>`); if it must salvage formatting, the format bonus is set to 0 for that turn. After `max_format_retries` invalid attempts in a single turn, the env auto-applies the baseline action and moves on.\n- `ev_reward`: Still logged (weight 0) — EV of the first action only from the initial state (continuation via basic strategy). Typical ranges: about `−2.0` to `+3.0` in bets for doubles/splits; most spots `−1.0` to `+1.5`.\n- `realized_return_metric`: The actual one‑off outcome of the hand from the environment’s deal; a useful “overall score” but not included in the main reward by default (weight 0).\n\nPerformance note:\n- Per‑turn EV uses `ev_samples` simulations at each assistant turn; runtime scales with turns × `ev_samples` × examples × repeats. Use smaller `ev_samples` for speed or increase for tighter estimates.\n\n### Prompt Format\nEach example starts with a state prompt (rules, your hand, dealer upcard). The model responds with an action; the environment updates the state and continues until the hand is resolved. Respond using:\n\n```\n<think>Optional brief reasoning.</think>\n<answer>HIT|STAND|DOUBLE|SPLIT</answer>\n```\n\n### Example Transcript (2–3 turns)\n\n```\nSystem: You are a competitive game player. 
In each turn, think in <think>…</think> and put the action in <answer>…</answer>.\n\nUser:\nBlackjack — dealer stands on soft 17; DAS allowed; shoe: 6 deck(s).\nYour active hand: 9, 3 (total: 12). Dealer upcard: 2.\nAllowed actions: HIT, STAND, DOUBLE. Respond with one of these inside <answer>...</answer>.\nRules details: Double only on two cards; Split only on identical pairs; one split max; Double after split only if DAS; No surrender; Blackjack pays 3:2.\n\nAssistant:\n<think>Hard 12 vs 2 is a hit.</think>\n<answer>HIT</answer>\n\nUser:\nBlackjack — dealer stands on soft 17; DAS allowed; shoe: 6 deck(s).\nYour active hand: 9, 3, 5 (total: 17). Dealer upcard: 2.\nAllowed actions: HIT, STAND. Respond with one of these inside <answer>...</answer>.\nRules details: Double only on two cards; Split only on identical pairs; one split max; Double after split only if DAS; No surrender; Blackjack pays 3:2.\n\nAssistant:\n<think>Stand on 17.</think>\n<answer>STAND</answer>\n\nUser:\nStanding. Dealer: 2, 10, 6. Result: -1.0 bets. Hand over.\n```\n\nGame details shown each turn:\n- Dealer stands/hits on soft 17, DAS setting, and shoe size.\n- Allowed actions for the current hand (respecting two-card/double rules and pair/split constraints).\n- Rule reminders: double on two cards, split pairs only, one split max, DAS gating, no surrender, blackjack pays 3:2.\n\n### Notes on Rules\n- Encodes a standard multi-deck basic strategy. Default assumes S17 and DAS.\n- Some rule variations (e.g., double 11 vs Ace) are parameterized via `env-args.rules`.\n\n### Terminology\n- Bet (unit): base wager per hand. 
Typical outcomes: win `+1`, lose `-1`, push `0`; blackjack pays `+1.5`; doubles pay `±2`; split hands sum their results.\n- Dealer upcard: the dealer’s face-up card; the face-down card is the “hole” and is revealed when the dealer plays.\n\n### How Evaluation Runs\n- Multi-turn:\n  - The dataset pre-generates randomized scenarios (initial hands) and provides a `question` per example with the initial state.\n  - Chat loop per example:\n    - System message instructs formatting; user shows current state and allowed actions.\n    - Assistant replies with `<answer>...</answer>`; the environment applies the action, deals cards, and posts the updated state.\n    - Continues until the hand is finished (bust/stand/double for all hands including split).\n  - Scoring: marginal EV shaping across turns (main), EV of first action logged, plus format reward.\n- Single-turn:\n  - The dataset provides a single state (initial or mid-hand). The model outputs exactly one action.\n  - Scoring: marginal EV of the chosen move relative to baseline (main), absolute EV logged, plus format reward.\n\n### Saving Results\n- By default, results are not saved to disk.\n- To save locally, pass `-s/--save-dataset`. Outputs are written to:\n  - `environments/blackjack_env/outputs/evals/blackjack-env--<model>/<uuid>/` if the env directory is present, otherwise `./outputs/evals/...`\n  - Files: `results.jsonl` (prompts, completions, answers, rewards, metrics) and `metadata.json`.\n- To push to Hugging Face Hub, pass `-H/--save-to-hf-hub` and optionally `-D/--hf-hub-dataset-name`.\n","encoding":"utf-8","truncated":false,"total_bytes":11191},"status":null}