{"data":{"kind":"file","path":"README.md","version_id":"ssnv96ya3emo4j22tf055q41","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5460,"modified_at":"2026-06-05T22:08:34.426000","content_hash":"9b2b63d51cc73b5d680b3895f9fb9c10b96a18dd49db239da4cc5024dfb28c06"},"entries":[],"content":"# inventory-reorder\n\nA single-turn, verifiable (RLVR) environment for the per-SKU replenishment decision an\ninventory / supply-chain planner makes at every reorder review.\n\n## Overview and motivation\n\nAt each review a planner answers three questions for a SKU: where is the reorder\npoint, does the current on-hand position trigger an order, and how much to order. The\ntextbook starting point (constant-demand Wilson EOQ with demand-only safety stock) is\nnot what a competent planner runs. Reorder points have to absorb *lead-time*\nvariability as well as demand variability, demand is seasonal so the lead-time demand\nmust be scaled by the active season factor, and the purchase quantity is constrained\nby supplier minimum-order quantities and is pulled toward price-break thresholds when\na quantity discount lowers total cost. 
This environment trains and evaluates the\nplanner's computation with labels that are exactly reproducible from the printed\ninputs.\n\n## Task spec\n\nInput: one SKU's profile (average daily demand, demand standard deviation, supplier\nlead time and its standard deviation, season factor, current on-hand, ordering cost,\nholding rate, base unit price, a supplier price-break schedule, MOQ, and target\nservice level with its z-value).\n\nOutput: `<answer>{\"reorder_point\": 412.7, \"order_now\": true, \"order_qty\": 540.0}</answer>`\nafter `<think>` reasoning.\n\n## Domain grounding\n\nThe task is built on standard inventory-theory concepts, named here so a domain\nreviewer can map them directly:\n\n- **Wilson EOQ model.** The economic order quantity is the starting point for the\n  recommended buy.\n- **Reorder point (ROP).** Expected demand over lead time plus safety stock.\n- **Safety stock and cycle service level.** Safety stock is sized to a target cycle\n  service level via its one-sided normal **z-score** (90 percent to 1.28, 95 to 1.65,\n  97.5 to 1.96, 99 to 2.33).\n- **Variable-demand, variable-lead-time ROP.** Safety stock uses the combined formula\n  `z * sqrt(LT * sigma_d^2 + (avg_d * sigma_LT)^2)`, so unreliable supplier lead time\n  raises the reorder point, not just demand noise.\n- **Newsvendor tradeoff.** The service-level z-score encodes the underage (stockout) vs\n  overage (holding) cost balance: higher service buys more safety stock.\n- **Quantity-discount / price-break EOQ.** The order quantity is chosen to minimize\n  total annual cost (purchase + ordering + holding, holding proportional to unit\n  price) across the supplier price-break schedule, not the raw EOQ.\n- **Minimum order quantity (MOQ).** The recommendation is floored at the supplier MOQ.\n- **ABC-style service differentiation.** Different SKUs carry different target service\n  levels, the way higher-value items are held to tighter service.\n\n## Reward design rationale\n\nReward is 
a weighted rubric: reorder point closeness (0.4), order-now exact match\n(0.3), and order quantity closeness (0.3). Closeness is 1.0 within about 2 percent\nrelative error and falls linearly to 0 at about 30 percent, giving partial credit for\nnear-correct arithmetic while still punishing a wrong model. Order-now is a hard\nboolean because it is a discrete trigger, not a magnitude. Weighting the reorder point\nhighest reflects that it is the primary safety-stock decision and the input to the\norder trigger. Every label is computed deterministically, so the reward is continuous,\nbounded in [0, 1], and exactly verifiable.\n\n## Edge cases handled\n\n- Lead-time standard deviation of 0 (reliable supplier) collapses cleanly to the\n  demand-only safety stock term.\n- Season factors above 1.0 (promo / peak month) inflate lead-time demand.\n- Price-break schedules where ordering up to a discount threshold lowers total annual\n  cost even though it raises the order quantity above the raw EOQ; the cost-minimizing\n  quantity is selected, not the raw EOQ.\n- MOQ floor that overrides a smaller cost-minimizing quantity.\n- On-hand sampled near the reorder point so order-now is genuinely split across the\n  dataset rather than trivially always-order.\n- Malformed / non-JSON answers and a non-boolean order-now score 0 on the affected\n  component.\n\n## Evaluation\n\nOn real gpt-4o-mini rollouts (n=20), the mean reward is 0.81, versus 0.46 for a naive\nbaseline. The gap confirms that the reward discriminates competence from guessing.\n\n## Limitations and intended use\n\nThis is a synthetic, single-shot planning computation: it does not model multi-echelon\nnetworks, correlated demand across SKUs, perishability / shelf-life, capacity or budget\nconstraints across the catalog, or lost-sales vs backorder dynamics. The normal-demand\nz-score assumption is the standard cycle-service-level approximation, not a fitted\ndemand distribution. 
Intended use is as an RLVR reward signal for the replenishment-math\nsub-skill inside a larger operations agent. The observation schema and reward apply\nunchanged to real ERP / inventory data: swap the synthetic SKUs for live SKUs where the\nground-truth label is the replenishment parameter actually run.\n\n## Format\n\n```\n<think> reasoning </think>\n<answer>{\"reorder_point\": 412.7, \"order_now\": true, \"order_qty\": 540.0}</answer>\n```\n\n## Usage\n\n```bash\nuv run vf-install inventory-reorder\nuv run vf-eval inventory-reorder -m gpt-4.1-mini\n```\n\n`load_environment(num_examples=300, seed=7)` builds synthetic SKU scenarios with known\nground truth. Standalone separation (correct-formula optimal vs a demand-only,\nno-seasonality, fixed-quantity naive baseline): optimal 1.000, naive 0.460, gap 0.540.\n","encoding":"utf-8","truncated":false,"total_bytes":5460},"status":null}