{"data":{"kind":"file","path":"README.md","version_id":"fzlpu9jnu58wn4ild2ia7rtv","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3075,"modified_at":"2026-05-28T12:45:04.460000","content_hash":"f7b68b9bfa16f6671bf262b74b49281d5f9db4fb28de8076de14ff8d198fd3d4"},"entries":[],"content":"# api-coding-eval\n\nEvaluate an LLM's ability to code against a real third-party API by reading its documentation — not by solving algorithmic puzzles.\n\n## Why this eval exists\n\nMost coding benchmarks measure **data-structure and algorithm** ability: sort this list, find the shortest path, reverse a string. Those skills matter for interviews but they do not predict whether an agent can actually build software in the real world.\n\nReal software development looks nothing like LeetCode. A developer opening a ticket for the first time must:\n\n1. Read unfamiliar documentation they have never seen before.\n2. Understand the API's data model, authentication scheme, and error conventions.\n3. Translate that understanding into working code.\n4. Handle edge cases the docs mention only in passing.\n\nStandard coding evals skip all of step 1. The model gets a pre-digested problem statement with every detail already spelled out. That is not how agents work in production.\n\n**api-coding-eval closes that gap.** The model receives a goal (\"implement these Stripe payment functions\") and two live research tools:\n\n- `fetch_url(url)` — reads any documentation page via Jina AI Reader.\n- `search_docs(query)` — searches Stripe's docs with DuckDuckGo.\n\nThe model must explore the Stripe API reference on its own, form a mental model of how charges, customers, payment intents, and refunds work, and write a correct Python module from scratch — exactly as a developer would.\n\n## Task\n\nImplement eight functions against the [Stripe API](https://stripe.com/docs/api):\n\n| Function | Description |\n|---|---|\n| `create_charge(amount, currency, source)` | Create a charge; raise on amount ≤ 0 |\n| `get_charge(charge_id)` | Retrieve a charge by ID; raise if missing |\n| `create_refund(charge_id, amount=None)` | Full or partial refund; raise if charge missing |\n| `create_customer(email, name)` | Create a customer record |\n| `get_customer(customer_id)` | Retrieve a customer; raise if missing |\n| `create_payment_intent(amount, currency, customer_id=None)` | Create a PaymentIntent; raise on amount ≤ 0 |\n| `get_payment_intent(payment_intent_id)` | Retrieve a PaymentIntent; raise if missing |\n| `list_charges(limit=10)` | Return up to `limit` recent charges |\n\n## Scoring\n\nA local Flask server mocks the Stripe API. The model's submitted code is tested against 25 pytest cases covering:\n\n- Correct return shapes and field values for each resource type.\n- Validation errors (zero/negative amounts, missing required fields).\n- Not-found errors (missing charge, customer, or payment intent IDs).\n- Partial refund amounts.\n- No hardcoded API keys in the solution.\n\nReward = fraction of tests passed (0.0–1.0). No real Stripe account or network access to Stripe is required.\n\n## Running\n\n```bash\nprime eval run api-coding-eval -m <model-id>\n\n# Examples\nprime eval run api-coding-eval -m or/deepseek-r1\nprime eval run api-coding-eval -m sonnet\nprime eval run api-coding-eval -m or/kimi-k2\n```\n\nThe model is allowed up to 15 turns to research documentation and write its implementation.\n","encoding":"utf-8","truncated":false,"total_bytes":3075},"status":null}