{"data":{"kind":"file","path":"README.md","version_id":"ifalhyy0zxutnbzgcxq59ze3","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4734,"modified_at":"2026-05-06T20:40:13.086000","content_hash":"3bbf997751756847f38db0d3bedde569425bc77d63d0a5f32986c9dd1a4382f4"},"entries":[],"content":"# AR Credit Command Post Evals\n\n**AR Credit Command Post Evals** is developed by **Cognida.ai** as a **realistic enterprise-style mock ERP workflow for Accounts Receivable credit decisions**—orders blocked on credit hold are reviewed using **structured data only** (no PDFs or invoice images).\n\nAgents behave like embedded credit analysts: pull exposure from AR aging, credit limits, open orders, and payment-behavior signals, then submit an auditable decision.\n\n## Enterprise Context\n\nThis environment models a common AR control point:\n\n1. A sales order is **on credit hold** pending review.\n2. The analyst **queries ERP facts** (customer master, AR aging, order economics).\n3. The analyst issues **RELEASE**, **HOLD**, **CONDITIONAL_RELEASE** (for example deposit/prepay), or **ESCALATE** for manual credit review—**with reason codes** aligned to a fixed vocabulary.\n\nThe benchmark uses deterministic policy labels so scores are reproducible across runs and models.\n\n## Environment Profile\n\n- **Environment ID**: `ar-credit-release-v1`\n- **Owner/Organization**: Cognida.ai\n- **Type**: Multi-turn tool-use environment (`StatefulToolEnv`)\n- **Primary use**: RL training and evaluation of AR / order-to-cash automation agents\n- **Decision labels**: `RELEASE`, `HOLD`, `CONDITIONAL_RELEASE`, `ESCALATE`\n\n## Tool Contract\n\nExpected tools:\n\n- `get_credit_hold_case(case_ref)`\n- `get_customer_credit_profile(customer_id)`\n- `get_ar_aging_summary(customer_id)`\n- `list_open_orders(customer_id)`\n- `get_order_financials(order_id)`\n- `submit_credit_decision(decision, reason_codes, conditions_json)`\n\nOperating pattern:\n\n1. Open the hold case by `case_ref` from the prompt.\n2. Pull supporting structured facts as needed.\n3. Call `submit_credit_decision` **exactly once** with the final decision.\n\nFor `CONDITIONAL_RELEASE`, populate `conditions_json` when deposits apply (for example `{\"deposit_pct\": 30, \"currency\": \"USD\"}`).\n\n## Dataset\n\nBundled file:\n\n- `data/cases/ar_credit_release_v1_150.jsonl` (**150** scenarios)\n\nCoverage themes include clean releases, strategic near-limit paths, new-customer utilization guardrails, over-limit holds, strategic conditioned releases inside tolerance, past-due escalations, disputed-balance blocks, NSF / high-risk escalations, medium-risk edge escalations, strict policy packs, and multi-line order economics.\n\n## Scoring Model\n\nWeighted rubric:\n\n`reward = 0.55 * decision_accuracy + 0.25 * reason_code_accuracy + 0.10 * conditions_accuracy + 0.10 * submitted_decision_format`\n\n- **conditions_accuracy** applies when the reference decision is `CONDITIONAL_RELEASE` and expects structured deposit metadata; otherwise it scores neutral (`1.0`).\n\nAuxiliary metric: `submitted_once_metric`.\n\n## Quickstart\n\nSmoke test (hosted-safe defaults):\n\n```bash\nprime --plain eval run ar-credit-release-v1 \\\n  -n 8 -r 1 \\\n  -a '{\"include_case_hints\": false, \"create_eval_split\": false, \"max_examples\": -1}'\n```\n\nFull 150-case benchmark:\n\n```bash\nprime --plain eval run ar-credit-release-v1 \\\n  -n 150 -r 1 \\\n  -a '{\"include_case_hints\": false, \"create_eval_split\": false, \"max_examples\": -1}'\n```\n\nLocal dev with explicit JSONL path:\n\n```bash\nprime --plain eval run ar-credit-release-v1 \\\n  -a '{\"external_cases_path\": \"/home/primeintellect/environments/ar_credit_release_v1/data/cases/ar_credit_release_v1_150.jsonl\", \"include_case_hints\": false, \"create_eval_split\": false, \"max_examples\": -1}'\n```\n\n## Configuration Arguments\n\n- `max_examples` (int, default `-1`): cap loaded examples\n- `external_cases_path` (str, default `\"\"`): optional JSONL path; if omitted, packaged `ar_credit_release_v1_150.jsonl` is used\n- `max_turns` (int, default `10`): rollout turn cap\n- `include_case_hints` (bool, default `false`): include scenario labels in the prompt (benchmark mode keeps this off)\n- `create_eval_split` (bool, default `false`): deterministic train/eval split\n- `eval_split_fraction` (float, default `0.5`): eval fraction when split enabled\n\n## JSONL Case Schema\n\nEach line is one JSON object:\n\n- `question` (optional string)\n- `info.case_id` — stable identifier\n- `info.case_ref` — queue reference shown to the agent\n- `info.mock_erp` — full structured ERP snapshot (customer, order with lines, AR aging, payment behavior, open orders, policy knobs)\n- `info.expected_decision`\n- `info.expected_reason_codes`\n- `info.expected_conditions` — object; often `{}`, populated for conditional releases\n\nSee `cases_schema_example.jsonl`.\n\n## Production Notes\n\n- Deterministic labels align with `evaluate_credit_policy()` in `ar_credit_release_v1.py` (reference policy for the benchmark).\n- Re-run `scripts/generate_ar_credit_cases.py` after changing policy logic; validate bundled JSONL before publishing.\n","encoding":"utf-8","truncated":false,"total_bytes":4734},"status":null}