{"data":{"kind":"file","path":"README.md","version_id":"hbl1darcvbw7fzrnisn6vzr6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6416,"modified_at":"2026-02-04T19:00:21.567000","content_hash":"a74bd4eba24a71bae84f5056756ba06a88ddebed6f26ea4ed58328ae5433b1e0"},"entries":[],"content":"# F1 Strategy Environment\n\nTrain and evaluate LLMs as an F1 race strategist using (1) historical OpenF1-derived scenarios and (2) deterministic stress tests.\n\n## IDs\n\n- **Hub ID**: `herr-professor/f1-strategy`\n- **Local module**: `f1_strategy.py`\n\n## What This Environment Is Optimizing\n\nEach example is a multiple-choice strategy decision (A/B/C/D). The rubric rewards:\n\n- picking the correct option (by a deterministic parser)\n- giving structured reasoning that references key race signals\n- optionally: tool use + deep reasoning evidence (numeric option-time estimates)\n\n## Deterministic Verifier Spec (Read This First)\n\n### 1) What counts as a “choice” (A/B/C/D)?\n\nWe parse the model output with `_extract_final_choice`:\n\n- We search the last ~12 non-empty lines (from bottom) for an explicit decision line:\n  - `Final: A` / `Decision - B` / `Answer: C` / `Choice: D`\n  - If multiple appear, **the last one wins**.\n- If none exist, we only accept a fallback if the **last non-empty line** begins with `A`/`B`/`C`/`D` (e.g. `C) Stay out ...`).\n- We explicitly **do not** treat time-estimate lines like `A: 120.3` as a decision.\n\n### 2) How `correct_strategy` is defined\n\n`correct_strategy = 1.0` iff the parsed final choice exactly matches the example’s `answer` (case-insensitive). 
Otherwise `0.0`.\n\n### 3) Missing/ambiguous decisions\n\n- If no decision is parsed, `correct_strategy = 0.0`.\n- `final_choice_present` applies a deterministic penalty: `-0.3` when missing.\n\n### 4) Label source and tie-breaks\n\nThere are two label regimes:\n\n- **Historical label (OpenF1 dataset, non-deep mode)**: `answer` is derived by `scripts/build_openf1_dataset.py` using race-phase heuristics + simple outcome proxies (stored in `info`, e.g. `outcome_score`). This is intentionally conservative and “policy-like”, not oracle-optimal.\n- **Strategy-model label (deep reasoning mode)**: when `deep_reasoning=True`, we recompute `answer` on load by minimizing the environment’s internal strategy model over options A/B/C/D (expected-time over a short horizon). This makes “deep reasoning mode” self-consistent and deterministic.\n  - Override can be disabled per-example by setting `info.lock_answer=true` (used for stress tests).\n\n## Dataset + Reproducibility\n\n- **OpenF1 cache**: `data/openf1_scenarios.jsonl`\n- **Stress tests**: `data/stress_scenarios.jsonl` (hand-authored, deterministic)\n- Every example’s `info` includes:\n  - `dataset_sha256` (hash of the JSONL file that was loaded)\n  - `dataset_variant` (`openf1` or `stress`)\n\nTo hard-pin drift in experiments:\n\n- pass `expected_dataset_sha256=<sha256>` to `load_environment(...)`\n- pass `seed=<int>` when subselecting `num_examples` to make selection deterministic\n\n## Modes (Environment Args)\n\n| Arg | Default | Meaning |\n|---|---:|---|\n| `dataset_variant` | `\"openf1\"` | `\"openf1\"` (historical) or `\"stress\"` (adversarial) |\n| `dataset_path` | `None` | Optional override path to a JSONL dataset |\n| `expected_dataset_sha256` | `None` | Optional guardrail against dataset drift |\n| `seed` | `None` | Deterministic subsampling when `num_examples > 0` |\n| `eval_season` | `2024` | Season held out for eval (set `None` to disable split) |\n| `eval_tracks` | `None` | Tracks held out for eval |\n| 
`use_tools` | `False` | Enable tool environment and tool-use reward |\n| `multi_turn` | `False` | Enable pit-wall follow-up environment |\n| `deep_reasoning` | `True` | Add strategy-model block + numeric scoring + label recomputation |\n| `multi_env` | `False` | Route examples to per-track envs via an EnvGroup |\n| `max_tracks` | `4` | Track count cap for `multi_env` |\n| `max_tokens` | `900` | Generation cap |\n\n## Tools\n\nWhen `use_tools=True`, the environment exposes 3 tools and tracks the tool-call count:\n\n- `tire_deg_estimator(info)` -> textual severity\n- `pit_delta_lookup(info)` -> pit loss + undercut window\n- `weather_confidence(info)` -> coarse wet/dry assessment\n\nThe rubric includes `uses_tools`, which:\n\n- rewards `+0.1` if at least one tool call happened\n- penalizes `-0.2` if tools are enabled but none were called\n\n## Rubric (Weights + Outputs)\n\nRubrics are **additive**: total reward is `sum(metric_value * metric_weight)`.\n\nNon-deep mode (active when `deep_reasoning=False`):\n\n| metric | value range | weight | intent |\n|---|---:|---:|---|\n| `correct_strategy` | {0, 1} | 1.0 | must be correct |\n| `final_choice_present` | {0, -0.3} | 1.0 | enforce a decision |\n| `has_reasoning` | {0, 0.2} | 0.2 | non-trivial explanation |\n| `mentions_key_factors` | [0, 0.35] | 0.35 | uses signals (tires/weather/gaps/pit/SC/traffic) |\n| `acknowledges_uncertainty` | {0, 0.1} | 0.1 | tradeoffs/contingencies |\n| `outcome_aligned` | {0, 0.05, 0.15} | 0.15 | small bonus if correct + good outcome proxy |\n| `uses_tools` | {0.1, -0.2, 0} | 0.1 | tool discipline (only when enabled) |\n\nDeep reasoning mode adds:\n\n| metric | value range | weight | intent |\n|---|---:|---:|---|\n| `option_times_present` | {0, 0.1, 0.3} | 0.3 | prints A/B/C/D time estimates |\n| `option_time_accuracy` | [0, 0.6] | 0.6 | estimates match strategy-model ground truth |\n\n## “Obviously Elite” Evaluation Protocol\n\nThis repo includes scripts to produce baselines, CIs, and ablations 
without handwaving.\n\n### Baselines (models x modes, with 95% CIs)\n\n1) Run a baseline matrix:\n\n```bash\ncd environments/f1_strategy\npython3 scripts/run_baselines.py \\\n  --env herr-professor/f1-strategy \\\n  --model Qwen/Qwen3-4B-Instruct-2507 \\\n  --model Qwen/Qwen3-4B-Thinking-2507 \\\n  --num-examples 150 --rollouts 4\n```\n\n2) Render a markdown table with bootstrap CIs:\n\n```bash\npython3 scripts/render_baselines.py --manifest reports/baselines_manifest.json --out reports/baselines.md\n```\n\n### Ablations (same model, levers on/off)\n\n```bash\ncd environments/f1_strategy\npython3 scripts/run_ablations.py --model Qwen/Qwen3-4B-Instruct-2507 --num-examples 150 --rollouts 4\npython3 scripts/render_ablations.py --manifest reports/ablations_manifest.json --out reports/ablations.md\n```\n\n### Stress Tests (adversarial, deterministic)\n\n```bash\nprime eval run herr-professor/f1-strategy \\\n  -m Qwen/Qwen3-4B-Instruct-2507 \\\n  -n 50 -r 2 -s --skip-upload \\\n  -a '{\"dataset_variant\":\"stress\",\"deep_reasoning\":false,\"use_tools\":false}'\n```\n\n## Regenerating the OpenF1 Dataset\n\n```bash\ncd environments/f1_strategy\nuv run python scripts/build_openf1_dataset.py --years 2023 2024 --max-sessions 20 --max-scenarios 400\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6416},"status":null}