{"data":{"kind":"file","path":"README.md","version_id":"fa784p26i7297r6xemurir60","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6600,"modified_at":"2026-06-05T22:08:34.427000","content_hash":"a799132c2ac15266a99e67503b729035def18626b57dce3dca29861ec2879eb6"},"entries":[],"content":"# financial-statement-reasoning\n\nA single-turn, verifiable (RLVR) environment for financial statement analysis. The\nagent reads an internally consistent set of company financials (a multi-step income\nstatement and a classified balance sheet) and answers one question with a single\ncorrect numeric answer.\n\n## Overview and motivation\n\nComputing the right metric from a set of financial statements is a clean RLVR\ntarget: the inputs are structured line items, the question has exactly one correct\nanswer, and that answer follows from a fixed definition rather than judgment. The\ndifficulty is not arithmetic; it is knowing which line items feed which metric and\nbeing able to recover a withheld figure by inverting an identity. The quick ratio\nexcludes inventory but the current ratio does not; interest coverage uses operating\nincome, not net income; debt-to-equity divides total liabilities by equity, not by\nassets. An agent that confuses two similar ratios, or solves the accounting\nidentity in the wrong direction, produces a wrong-but-plausible number, which is\nexactly what a verifiable reward should distinguish from the correct one.\n\n## Task specification\n\nEach example generates a coherent financial bundle from a deterministic seed and\nthen asks one question, showing only the line items needed to answer it. The agent\nreturns the numeric answer as JSON (ratios and margins as plain decimals, dollar\namounts as plain numbers). 
The question set covers:\n\n- Liquidity: current ratio, quick (acid-test) ratio, working capital\n- Leverage / solvency: debt-to-equity, interest coverage\n- Profitability / efficiency: gross margin, net profit margin, operating income,\n  inventory turnover\n- Cash flow: operating cash flow by the indirect method (net income plus the\n  depreciation add-back)\n- Identity recovery: a withheld line item solved via Assets = Liabilities + Equity\n  (Total Equity, Total Assets) or via Revenue - Gross Profit (COGS)\n- Multi-step recovery: walking the multi-step income statement from gross profit\n  down to a withheld net income, or summing current liabilities and long-term debt\n  before inverting the accounting identity for equity\n\nThe ground truth is computed by the generator with the same code path that fills\nthe statements, so it is exact and verifiable.\n\n## Domain grounding\n\nThe statements follow standard GAAP structure, named explicitly:\n\n- A multi-step income statement: Revenue, Cost of Goods Sold, Gross Profit,\n  Operating Expenses, Operating Income, Interest Expense, Pretax Income, Income\n  Taxes, Net Income. 
Each subtotal is derived from the line above it, so the\n  statement foots.\n- A classified balance sheet: current versus non-current assets, current versus\n  long-term liabilities, and equity, tied together by the fundamental accounting\n  identity Assets = Liabilities + Equity, which holds to the cent because equity is\n  derived as the plug.\n- The standard ratio families from financial-statement analysis: liquidity (current\n  ratio, quick / acid-test ratio, working capital), leverage / solvency\n  (debt-to-equity, interest coverage), and profitability / efficiency (gross\n  margin, net profit margin, inventory turnover).\n- An indirect-method operating cash-flow line (net income plus depreciation),\n  reflecting the start of a cash-flow statement.\n\nAll figures are synthetic and internally consistent; no real company data is used.\n\n## Reward design rationale\n\nReward is numeric closeness to the ground truth in [0, 1]: 1.0 within about 1\npercent relative error, decaying linearly to 0 by about 30 percent relative error.\nFor ground-truth values at or near zero (for example a working capital that nets to\nroughly zero), it falls back to a scaled absolute tolerance so the relative-error\ndenominator does not explode. 
The tolerance band rewards the correct method even\nwhen intermediate rounding differs slightly, while any structural mistake (the\nwrong ratio definition, the wrong direction of an identity, including inventory in\nthe quick ratio) moves the answer far outside the band.\n\n## Edge cases handled\n\n- Withheld line items are recovered by inverting an identity rather than being read\n  off directly, and multi-step variants withhold an intermediate subtotal so the\n  agent must chain two steps.\n- The quick ratio strips inventory while the current ratio does not, and both can be\n  asked of the same balance sheet, so the agent cannot collapse them.\n- Near-zero ground truth is judged on absolute error instead of relative error.\n- The answer parser tolerates a missing or non-dict JSON payload and returns no\n  credit rather than crashing.\n- Every derived figure is computed from primitives, so the statements always foot\n  and the identity always holds, leaving no inconsistent example that could earn\n  reward by accident.\n\n## Evaluation\n\nEVAL (gpt-4o-mini, n=20): mean reward 0.97 on real gpt-4o-mini rollouts, versus 0.05\nfor a naive baseline. The gap indicates that the reward discriminates competence from\nguessing.\n\nSeparation check (deterministic policies, n=300): an optimal policy that returns the\nexact ground truth scores 1.0000; a naive baseline that always answers a fixed value\n(1.0) scores 0.0500, a gap of 0.9500.\n\n## Limitations and intended use\n\nThe statements are synthetic and single-period, so cross-period metrics that need\naverages (for example inventory turnover on average inventory, or return on average\nequity) are approximated with the single-period figure. There is no segment detail,\nno deferred-tax or minority-interest nuance, and no footnote disclosure. The\nenvironment is a reasoning task over statement structure and ratio definitions, not\nan audit tool. 
Because the generator and reward share one code path, the task tests\nwhether an agent applies the stated definitions correctly, not whether it can detect\nfraud or misstatement.\n\nThe same schema and reward transfer to real financials (audited filings,\ngeneral-ledger exports, or accounting-system pulls paired to computed metrics): once\na real data source is available, the generator is swapped for real statements while\nthe parser, question set, and reward stay the same.\n\n## Usage\n\n```bash\nuv run vf-install financial-statement-reasoning\nuv run vf-eval financial-statement-reasoning -m gpt-4.1-mini\n```\n\n`load_environment(num_examples=300, seed=7)` builds synthetic, internally consistent\nstatements with known ground truth. The pure helpers (`_build_financials`,\n`make_example`, `closeness_reward`, `extract_value`) are testable without importing\nverifiers.\n\n## Format\n\n```\n<think> reasoning </think>\n<answer>{\"value\": 0.184}</answer>\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6600},"status":null}