{"data":{"kind":"file","path":"README.md","version_id":"oj5hc705n3mew1j9qreeew6q","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4723,"modified_at":"2026-06-12T03:59:31.223000","content_hash":"2d2aac68ab047e11ab3a20b4d5048903b3703da9f37a62ad4785046d266245c4"},"entries":[],"content":"# ecom-bench\n\n### Overview\n- **Environment ID**: `ecom-bench`\n- **Short description**: Long-horizon ecommerce simulation — manage a small online store over 60 simulated days, with demand driven by an explicit panel of synthetic customers, scored on capital accumulation (final vs. initial net worth).\n- **Tags**: long-horizon, multi-turn, tools, business, capital-accumulation, pricing\n\n### Task\n- **Type**: multi-turn stateful tool environment (chat + OpenAI function calling, `StatefulToolEnv`).\n- **Goal**: maximize end-of-horizon net worth (cash + inventory at cost) relative to starting net worth.\n- **Key mechanics**: fixed catalog with unit costs, prices, stock, and automatic reorder points; a customer panel split into segments with per-customer visit rates, purchase rates, and price sensitivities; daily demand simulated per customer; pricing and timing are the agent's only strategic levers.\n- **Scenarios** (interleaved across the dataset; select via `scenario_ids`):\n\n| Id | Scenario | Character | Static-play baseline (60d net-worth ratio) |\n|----|----------|-----------|--------------------------------------------|\n| 0 | Base apparel store | Comfortable margins, steady demand | 1.97× |\n| 1 | Discount grocery corner | Thin margins, high volume, very price-sensitive customers — blanket discounting *loses* money (0.84×) | 1.29× |\n| 2 | Boutique electronics shop | High-ticket items, sparse volatile demand, small customer base | 1.68× |\n\n### Interface\n- Exposed tools (function calling): `get_financials`, `list_catalog`, `change_price`, `advance_days`, `get_customer_panel`.\n- State advances only via `advance_days`; each elapsed day simulates 
customer visits, purchases, inventory depletion, and reorders.\n- A per-rollout `session_id` is injected by the environment via `update_tool_args` — the model never sees or controls it, keeping tool schemas strict-JSON-schema clean.\n- Dataset: 100 scenario instances by default (base scenarios × random seeds), each row carrying scenario config in `info`.\n\n### Scoring\n- Primary reward: final net worth ratio, mapped linearly from [0.5×, 2.0×] initial net worth to [0, 1] (clipped).\n- The score is scaled by horizon progress (0.5 + 0.5 × fraction of days elapsed) to penalize agents that never advance time.\n- Tool-call counts per tool are logged as auxiliary metrics.\n\n### Quickstart\nEvaluate with defaults (5 examples × 3 rollouts per the packaged eval config):\n```bash\nuv run vf-eval ecom-bench -n 5 -r 3\n```\n\nEnvironment arguments:\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_scenarios` | int | `100` | Number of scenario instances (dataset rows) to generate |\n| `max_turns` | int | `64` | Safety cap on dialogue turns per rollout |\n| `scenario_ids` | list[int] | all | Base scenarios to include (0 apparel, 1 grocery, 2 electronics). Use `[0]` to reproduce the v0.1 dataset |\n\n### Leaderboard\n\nSettings: 5 scenario instances × 3 rollouts per model (`-n 5 -r 3`), 60-day horizon, `max_turns=64`, via OpenRouter, 2026-06-11, on the **v0.1 dataset** (apparel scenario only — reproduce with `scenario_ids=[0]`). Reward is the capital-accumulation score in [0, 1] (final/initial net-worth ratio, scaled by horizon progress). 
The v0.2 default dataset interleaves all three scenarios and is harder (e.g., gpt-4.1-mini drops from 0.75 to ~0.62).\n\n| Rank | Model | Mean reward | Std | Rollouts | Mean turns |\n|------|-------|-------------|-----|----------|------------|\n| 1 | openai/gpt-5.4 | 0.989 | 0.023 | 15 | 13.6 |\n| 2 | minimax/minimax-m2.5 | 0.983 | 0.026 | 15 | 20.3 |\n| 3 | deepseek/deepseek-v4-flash | 0.958 | 0.065 | 15 | 25.7 |\n| 4 | openai/gpt-5-mini | 0.946 | 0.049 | 15 | 33.0 |\n| 5 | z-ai/glm-4.7-flash | 0.939 | 0.077 | 15 | 15.1 |\n| 6 | qwen/qwen3-30b-a3b-instruct-2507 | 0.776 | 0.284 | 15 | 6.9 |\n| 7 | openai/gpt-4.1-mini | 0.747 | 0.266 | 16 | 18.3 |\n| 8 | google/gemini-2.5-flash-lite | 0.653 | 0.412 | 15 | 7.1 |\n| 9 | meta-llama/llama-4-maverick | 0.451 | 0.378 | 15 | 6.7 |\n\nObservations: top models cluster near the reward ceiling, while weaker models separate on *horizon management* rather than tool mechanics. Three of the four lowest scorers average about 7 turns (they under-explore the 60-day horizon or stop engaging), and the reward standard deviation of the bottom four is roughly 10-18× that of the top two.\n\n### Notes\n- Demand is stochastic but fully seeded (per-scenario seed + day index), so runs are reproducible.\n- The environment deliberately omits marketing, logistics, and supplier negotiation so that pricing and timing decisions are isolated and failure traces stay interpretable.\n- Requires `verifiers>=0.1.14` (the `setup_state` contract changed in recent versions: it now mutates state in place and returns `None`).\n","encoding":"utf-8","truncated":false,"total_bytes":4723},"status":null}