{"data":{"kind":"file","path":"README.md","version_id":"fk9vepjtzrg70s53inpf4aon","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6827,"modified_at":"2026-06-05T22:08:34.427000","content_hash":"70017348726f52a295003c36b8a66ad3acaa7d1d6030ba31e718a919b837209f"},"entries":[],"content":"# invoice-reconciliation\n\nA single-turn, verifiable (RLVR) environment for accounts payable three-way matching. The\nagent receives a purchase order (PO), a goods-receipt record, and a vendor invoice (line\nitems plus freight and tax header lines, with payment terms), then computes the net\napproved payment and flags every discrepant line. Ground truth is computed deterministically\nby the generator, so every reward is exactly checkable.\n\n## Overview and motivation\nThe three-way match is the central control in accounts payable: an invoice is not paid\nuntil it is reconciled against the PO that authorized the spend and the goods receipt that\nconfirms delivery. Whether to pay, how much, and which lines to hold is a deterministic\nfunction of those three documents plus the buyer's tolerance and payment-terms policy. That\nmakes AP reconciliation a clean verifiable-RL target: one correct approved amount and one\ncorrect set of flagged lines per case. The environment trains an agent to apply real AP\ncontrols (authorization, quantity and price tolerance, tax and freight handling, early-\npayment discounts, and duplicate detection) in the correct order.\n\n## Task spec\nInput (per case): a vendor and invoice number with optional duplicate marker; payment\nterms (e.g. `2/10 net 30`) and whether the case is paid within the discount window; a PO\n(sku, ordered_qty, unit_price per line); a goods-receipt record (sku, received_qty); and a\nvendor invoice with line items (sku, billed_qty, billed_unit_price) plus header freight and\ninvoiced tax at a stated rate.\n\nOutput: JSON `{\"approved_amount\": <number>, \"flagged_skus\": [\"<sku>\"]}`. 
The flag set may\ninclude SKU strings and the literal tokens `TAX` (tax mismatch) and `DUPLICATE`.\n\nPolicy, applied in order (tolerances and rates are illustrative defaults):\n1. **Duplicate check:** if the invoice duplicates an already-paid invoice for the same\n   vendor, pay 0 and flag `DUPLICATE`; stop.\n2. **Authorization:** only SKUs on the PO are payable; an off-PO invoice SKU is never paid\n   and is always flagged.\n3. **Quantity:** payable quantity is the lesser of received and billed quantity; an over-\n   bill beyond the 2% quantity tolerance is flagged (and held to the received quantity).\n4. **Price tolerance:** a billed price within 2% of the PO price is treated as a match and\n   paid at the billed price with no flag; outside tolerance the line is paid at the PO unit\n   price and flagged.\n5. **Header:** add freight as billed; recompute tax at 7% of the approved goods subtotal,\n   and if the invoiced tax differs from the recomputed tax by more than a cent, pay the\n   recomputed tax and flag `TAX`.\n6. 
**Early-payment discount:** with terms like `2/10 net 30` paid inside the window,\n   subtract 2% of the goods-plus-tax subtotal (freight excluded from the discount base).\n\nNet approved = goods subtotal + freight + tax - discount (0 for a duplicate).\n\n## Domain grounding\nThe environment uses standard AP and procurement concepts, named only as real concepts (no\ninvented citations):\n- The **three-way match** of **PO vs goods receipt vs invoice**.\n- **Accounts-payable internal controls** and segregation between authorization (PO),\n  receipt confirmation, and payment.\n- **Tolerance matching** with separate **price tolerance** and **quantity tolerance** bands,\n  so trivial drift is auto-approved instead of routed to a human.\n- **Partial receipts**, paid for what arrived rather than what was ordered.\n- **Duplicate-payment prevention**, one of the highest-value AP controls because duplicate\n  payments are a leading source of recoverable cash leakage.\n- **Freight and tax** header handling, with tax recomputed on the approved goods base.\n- **Payment terms** and early-payment discounts expressed in conventional `2/10 net 30`\n  form, with the discount taken on the goods-plus-tax base.\n\nTolerance percentages, the tax rate, and discount terms are illustrative defaults, not any\nspecific company's AP policy.\n\n## Reward design rationale\nTwo weighted, verifiable components:\n- **approved_amount closeness** (weight 0.7): 1.0 within ~1% relative error, scaling\n  linearly to 0 by ~30% error; a true amount of exactly 0 (a duplicate hold) is matched\n  only by a near-zero prediction. 
The amount is the payment that actually moves cash, so it\n  carries the larger weight; graded closeness keeps the dollar signal smooth.\n- **flagged-token F1** (weight 0.3): F1 of the predicted flag set (SKUs plus `TAX` /\n  `DUPLICATE`) against ground truth, rewarding both catching real exceptions (recall) and\n  not over-flagging clean lines (precision).\n\nThe components separate cleanly: a correct reconciliation scores ~1.0, while a naive policy\nthat pays the full invoice total (including off-PO lines, over-bills, out-of-tolerance\nprices, and the invoiced rather than recomputed tax, with no discount taken) and flags\nnothing scores far lower, with a gap well above 0.3.\n\n## Edge cases handled\n- In-tolerance price drift: paid at billed price, not flagged (true AP auto-approve), so an\n  agent that flags every nonzero variance loses precision.\n- Out-of-tolerance price: paid at PO price and flagged.\n- Over-billed quantity beyond tolerance: held to received quantity and flagged.\n- Partial receipts where the invoice correctly bills only what arrived: paid in full, not\n  flagged.\n- Off-PO (maverick) invoice lines: never paid, always flagged.\n- Tax mismatch: recomputed tax is paid and `TAX` is flagged.\n- Early-payment discount applied only inside the window and only on the goods-plus-tax base\n  (freight excluded).\n- Duplicate invoice: whole case held to 0 with a single `DUPLICATE` flag.\n\n## Evaluation\nOn real gpt-4o-mini rollouts (n=20), the mean reward is 0.72, versus 0.502 for a naive\nbaseline. The gap indicates that the reward discriminates competence from guessing.\n\n## Limitations and intended use\n- The policy is a simplified, internally consistent approximation, not a production AP\n  policy; it omits many real-world wrinkles (multi-currency, retainage, blanket POs,\n  consignment, line-level tax exemptions, vendor-specific tolerance overrides).\n- Synthetic cases are independently sampled and do not model real vendor or spend\n  distributions.\n- 
Intended use is training and evaluating control-application and arithmetic-reconciliation\n  behavior for AP automation agents. The same schema and reward transfer to real AP data\n  (PO, goods-receipt, and invoice records paired to approved-payment outcomes) from any ERP.\n\n## Usage\n```bash\nuv run vf-install invoice-reconciliation\nuv run vf-eval invoice-reconciliation -m gpt-4.1-mini\n```\n`load_environment(num_examples=300, seed=7)` builds synthetic cases with known ground truth.\n\n## Format\n```\n<think> reasoning </think>\n<answer>{\"approved_amount\": 4210.50, \"flagged_skus\": [\"BOLT-12\", \"PANEL-X\", \"TAX\"]}</answer>\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6827},"status":null}