{"data":{"kind":"file","path":"README.md","version_id":"dmwmuhhyhtu5in2lrpvv4kf6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8168,"modified_at":"2026-04-08T16:38:09.013000","content_hash":"6508fc66f26279c533c4e1271101c6e4c0322707a450c68ad6c08db2bf0b3154"},"entries":[],"content":"# first_price_auction\n\n`first_price_auction` is a compact Prime / `verifiers` environment for sealed-bid first-price auctions. Each example gives the model a private value and a fully specified auction setting, asks for one bid, and scores that bid with deterministic Monte Carlo expected utility. No LLM judge is used.\n\nIn many modes the prompt explicitly reveals the opponent policy class, so the benchmark should be read mainly as \"compute a good best response against a known simulator\" rather than \"infer hidden strategic structure from sparse auction data.\"\n\n## What It Tests\n\nThe environment targets strategic reasoning under incomplete information:\n\n- first-price bid shading in textbook symmetric IPV settings\n- adaptation to perturbed opponent policies\n- robustness to value-distribution shift\n- reasoning under reserve prices and tie rules\n- best-response adaptation to disclosed heuristic opponents, including noisy, mixed, threshold-jump, and capped-bid policies\n\nThe strongest signal comes from how well a model chooses a bid against the disclosed simulator for the current instance.\n\n## Environment Arguments\n\n| Argument | Type | Default | Description |\n| --- | --- | --- | --- |\n| `num_examples` | `int` | `200` | Number of synthetic dataset rows to generate. |\n| `seed` | `int` | `7` | Master seed for instance generation and rollout simulation. |\n| `task_modes` | `tuple[str, ...]` | `(\"textbook\", \"perturbed_opponents\", \"distribution_shift\", \"reserve_price\")` | Which task families to include. |\n| `mode_weights` | `dict[str, float]` | `{}` | Optional weighted mode mix. Empty means balanced cycling. 
|\n| `n_bidders_options` | `tuple[int, ...]` | `(2, 3, 4, 5)` | Possible total bidder counts. |\n| `max_bid` | `float` | `100.0` | Maximum legal bid and default support upper bound. |\n| `require_json_output` | `bool` | `True` | If true, prompts request `{\"bid\": ..., \"reasoning_summary\": ...}`. |\n| `num_mc_samples` | `int` | `256` | Monte Carlo samples used to estimate expected utility. |\n| `best_response_grid_size` | `int` | `41` | Grid size for approximate empirical best-response search. |\n| `best_response_refinement_rounds` | `int` | `2` | Number of local refinement rounds run after the coarse best-response grid search. |\n| `best_response_refinement_grid_size` | `int` | `9` | Grid size for each local best-response refinement round. |\n| `compute_best_response_baseline` | `bool` | `True` | Whether to compute best-response diagnostics. |\n| `invalid_bid_penalty` | `float` | `-1.0` | Score used for parse failures or out-of-range bids. |\n| `overbid_penalty` | `float` | `0.0` | Optional score penalty when `bid > private_value`. |\n| `normalize_rewards` | `bool` | `False` | Logs `normalized_score = score / max(private_value, 1)`. |\n| `tie_break_rule` | `str` | `\"random\"` | Either `\"random\"` or `\"lose\"`. |\n| `reserve_price_fraction_range` | `tuple[float, float]` | `(0.08, 0.35)` | Reserve range, as a fraction of `max_bid`, for reserve-price tasks. 
|\n\n## Scoring\n\nFor a valid submitted bid `b` with private value `v`, the environment simulates `num_mc_samples` auctions against the configured opponent simulator and estimates:\n\n`expected_utility = E[(v - b) * 1{agent wins and clears reserve}]`\n\nReward behavior:\n\n- invalid parse or out-of-range bid: `score = invalid_bid_penalty`\n- otherwise: `score = expected_utility`\n- if `overbid_penalty > 0`, the score is reduced when `bid > private_value`\n\nThe environment also computes:\n\n- `reference_bid` and `reference_expected_utility`\n- `best_response_bid` and `best_response_expected_utility` from an empirical bid grid search with local refinement around the best coarse bid\n- regret-style gaps against those baselines\n\nBecause the environment exposes the auction assumptions directly in the prompt, these baselines are best interpreted as \"how close was the submitted bid to the simulator's best response?\" rather than as a hidden-theory benchmark.\n\nThe default `overbid_penalty` is `0.0` on purpose: raw expected utility already makes overbidding weakly dominated in a first-price auction because every win above value produces negative payoff. If you want an extra shaping signal during training, you can turn on a positive `overbid_penalty`.\n\n## Logged Metrics\n\nNumeric rollout metrics exposed through the rubric:\n\n| Metric | Meaning |\n| --- | --- |\n| `score` | Primary environment reward. |\n| `expected_utility` | Monte Carlo estimate of raw expected utility. |\n| `submitted_bid` | Parsed bid, or `-1.0` if no numeric bid was parsed. |\n| `bid_to_value_ratio` | `bid / private_value` when defined. |\n| `win_rate_estimate` | Estimated probability of winning. |\n| `overbid_flag` | `1.0` if `bid > private_value`, else `0.0`. |\n| `parse_success` | `1.0` if a numeric bid was extracted, else `0.0`. |\n| `json_valid` | `1.0` if strict JSON parsing succeeded with a usable `bid`. |\n| `bid_in_range` | `1.0` if the parsed bid lies in `[0, max_bid]`. 
|\n| `reference_bid` | Theoretical reference bid when available, else `-1.0`. |\n| `reference_expected_utility` | Utility estimate for the reference bid, else `-1.0`. |\n| `utility_gap_to_reference` | Submitted expected utility minus reference expected utility. |\n| `best_response_bid` | Approximate empirical best response on the configured grid. Logged when `compute_best_response_baseline=True`. |\n| `best_response_expected_utility` | Utility estimate of the best-response bid. Logged when `compute_best_response_baseline=True`. |\n| `regret_to_best_response` | `best_response_expected_utility - expected_utility`, clipped at zero. Logged when `compute_best_response_baseline=True`. |\n| `bid_error_to_best_response` | Absolute bid gap to the best-response bid when a bid is available. Logged when `compute_best_response_baseline=True`. |\n| `normalized_score` | Optional normalized score. Logged when `normalize_rewards=True`. |\n| `number_count` | How many numeric substrings were found during regex fallback. |\n| `n_bidders` | Total bidder count as a numeric metric. |\n\nEach dataset row stores `prompt` plus structured `info`. 
There is intentionally no `answer` string field, because scoring is programmatic and fully determined by the simulator-backed rubric rather than by string matching against a canonical answer.\n\nCategorical auction metadata is stored in each example's `info` field and duplicated in dataset columns where useful, including `instance_id`, `task_mode`, `distribution_type`, `difficulty`, opponent-policy details, reserve price, and the simulation seed.\n\nThe opponent models are still stylized heuristics rather than a full empirical auction model, but they now include non-linear policies in addition to linear and noisy-linear bidders.\n\n## Example Row\n\nExample prompt:\n\n```text\nYou are bidding in a sealed-bid first-price auction.\nYour private value for the item is 64.20.\nThere are 4 bidders total, including you.\nPrivate values are drawn independently from Uniform[0.0, 100.0].\nOpponents are modeled as using standard equilibrium-inspired shading for uniform private values (approximately alpha=0.75).\nThere is no reserve price.\nIf the highest eligible bids tie, the winner is chosen uniformly at random among tied bidders.\nYour goal is to maximize expected payoff. If you win, you pay your own bid, so payoff = private_value - bid. If you lose, payoff = 0.\nValid bids are numbers between 0 and 100.00, inclusive.\nBids below 0 or above the maximum are invalid.\nReturn JSON exactly in the form {\"bid\": 47.5, \"reasoning_summary\": \"optional short explanation\"}. 
Scoring only uses \"bid\".\n```\n\nExample `info` sketch:\n\n```json\n{\n  \"instance_id\": \"first_price_auction_00017\",\n  \"task_mode\": \"textbook\",\n  \"private_value\": 64.2,\n  \"n_bidders\": 4,\n  \"distribution_type\": \"uniform\",\n  \"reserve_price\": 0.0,\n  \"tie_break_rule\": \"random\",\n  \"opponent_policy_type\": \"equilibrium\",\n  \"max_bid\": 100.0,\n  \"reference_bid\": 48.15\n}\n```\n\n## Local Usage\n\n```python\nfrom auction_env import load_environment\n\nenv = load_environment(\n    num_examples=8,\n    num_mc_samples=64,\n    compute_best_response_baseline=True,\n)\n```\n\nFor a quick local sanity check without `datasets` or `verifiers`, run:\n\n```powershell\npython smoke_test.py\n```\n","encoding":"utf-8","truncated":false,"total_bytes":8168},"status":null}