{"data":{"kind":"file","path":"README.md","version_id":"qltsflio0hi66e0zurov07fr","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4352,"modified_at":"2026-06-20T20:27:45.615000","content_hash":"e0bd9599ed0e2d1da8fd28f54155ad2a72a7f20ecc984190072396b51b35474f"},"entries":[],"content":"# RedlineBench v2\n\nMulti-term contract negotiation, scored by a verifiable outcome instead of an AI judge.\n\n## The idea\n\nHow do you tell whether an AI did a good job at negotiating? With math or code you can check the answer. A negotiation has no obvious right answer, so normally a person or another AI reads it and gives an opinion. That is slow, and a little subjective.\n\nThe first version scored a negotiation as a single number. That worked, but it was the easy case. This version handles a whole contract with eight terms and no single right answer. There is still nobody judging it. The score comes from the deal itself, by measuring how close the two sides got to the best deal they both could have accepted. The one human step is deciding up front how much each term is worth. 
After that, the negotiation scores itself.\n\n## What's here so far\n\n- A scorekeeper that grades a finished contract from the client's side, with no judge\n- Eight terms the two sides weight differently, so the skill being tested is trading across them (logrolling), not splitting them\n- An opposing-counsel vendor that counters offers and walks away from flat, untraded offers\n- A six-model frontier baseline, with every model measured by the same verifiable reward\n- A verified skill gradient (no API, no judge) showing exactly what the reward rewards\n\n![Six frontier models all score below a 20-line counter-reading bot (0.46); gpt-5 scores 0.02, closing almost no deals; the optimal-play ceiling is 1.0](redline_v2_baseline.png)\n\n## The finding: frontier models negotiate worse than a 20-line script\n\nSix frontier models, n=32 each, scored by the same verifiable reward, with reasoning models given an\n8000-token budget:\n\n| model | buyer reward | closes |\n| --- | --- | --- |\n| gpt-5 | **0.02** | 3% |\n| gpt-4.1-nano | 0.16 | 44% |\n| deepseek-v4-flash | 0.16 | 28% |\n| claude-sonnet-4.5 | 0.21 | 38% |\n| gpt-4.1-mini | 0.23 | 44% |\n| claude-haiku-4.5 | 0.23 | 50% |\n\nEvery model scores at or below a naive 50/50 split (0.24) and far below a 20-line\nrule-based bot (0.46) that just reads the vendor's counters and trades. The most\nadvanced model, gpt-5, is the **worst**: it anchors so aggressively (its own\nreasoning says \"hold firm at 0.9-0.95\") that it almost never concedes enough to\nclose, walking away from 31 of 32 deals. The other models adapt and close 28-50%,\nso the environment is clearly winnable; gpt-5 specifically fails by refusing to\ntrade. Every model received the same prompt.\n\n## What the reward actually requires\n\nThe vendor's walkaway is set above what a flat, untraded offer yields, so the only\nway to close a good deal is to infer the vendor's priorities from its counters and\ntrade. 
Four reference policies, scored through the real vendor loop and reward with\nno API and no judge (`python baselines.py`, n=3000):\n\n| policy | buyer reward | closes | what it is |\n| --- | --- | --- | --- |\n| blind constant `[0.6]×8` | **0.00** | 0% | ignores every counter; the vendor just walks |\n| naive split `[0.5]×8` | 0.24 | 51% | split every term down the middle |\n| **rule-based logroller** | **0.46** | 62% | ~20 lines: reads the counters, concedes what the vendor wants and it values least, holds the rest |\n| logroll oracle (full info) | 1.00 | 100% | optimal trade; the ceiling |\n\nThe point: not trading is near-worthless (the vendor walks), a simple counter-reading\nheuristic captures real value, and optimal logrolling is the 1.0 ceiling. The reward\nhas a genuine, climbable skill gradient with no judge anywhere.\n\n## Status\n\nEnvironment, the verified skill gradient, and the six-model frontier baseline are\ncomplete. The reward has a real, climbable gradient (a rule-based logroller reaches\n0.46, optimal is 1.0), and a small base model sits in the trainable range (44% close\nrate, non-degenerate reward spread), so the next step is RL training a model to beat\nthe frontier baseline.\n\nCaveats: the opposing counsel is a fixed rule-based policy and the scenarios are\nsynthetic, so this is a research probe, not a claim that contract negotiation is solved. n=32 per\nmodel; the deal-or-no-deal reward makes per-model variance high, so treat the\nranking among the mid-pack models as approximate (gpt-5's collapse and the gap to\nthe bot are the robust results).\n\nLive on the Prime Intellect Hub: `prime env install fa1zvn/redline-v2`. Next:\nself-play / RL training a model to clear the 20-line bot's bar.\n","encoding":"utf-8","truncated":false,"total_bytes":4352},"status":null}