{"data":{"kind":"file","path":"README.md","version_id":"qcc8h1576w46c9pb1t2x94w1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5616,"modified_at":"2026-03-24T20:34:26.866000","content_hash":"45ee27447fa0f9ac652943778a481fac0420ab4e07172f8486664e8343622bdd"},"entries":[],"content":"# SDR-Arena\n\nA [verifiers](https://github.com/primeintellect-ai/verifiers) environment for **SDR-Bench** — benchmarking LLM personalization capabilities on B2B sales research tasks.\n\n**Paper & resources:**\n- [SDR-Bench paper](https://behavior-in-the-wild.github.io/SDR-Bench.html) | [Dataset (HuggingFace)](https://huggingface.co/datasets/behavior-in-the-wild/SDR-Bench) | [Leaderboard](https://huggingface.co/spaces/behavior-in-the-wild/SDR-Arena)\n\n---\n\n## Background\n\nSDR-Bench evaluates how well LLMs can generate personalized value propositions for B2B sales prospects. The benchmark is grounded in a Bayesian Persuasion framework that formally defines personalization, using real customer success stories as ground truth.\n\n**Dataset**: 6,279 verified business success stories across 203 domains, 134 industries, and 361 seller companies. Each story contains verified pitch points extracted from real deal evidence.\n\n**Key insight from the paper**: Deep research agents (STORM, ODR) that use multi-turn web search outperform standard LLMs, but a substantial gap remains before human-level sales proficiency. The benchmark uses **time-restricted web search** (results before the story's publication date) to prevent data leakage.\n\n**Top results from the paper**: STORM-QWEN-2.5 achieved the highest aggregate Weighted Coverage Score (42.51 on success stories), with significant variance across industries — deep research agents excel in Healthcare while standard models plateau in competitive Tech sectors.\n\n---\n\n## This Environment\n\nThis verifiers environment implements SDR-Bench with:\n- **Time-filtered web search** via Brightdata SERP API (date-restricted to before each story's publication)\n- **LLM-as-judge scoring** (Azure GPT-4o) with a 0-5 Likert scale per pitch point\n- **Two modes**: `SDRArenaEnv` for standard LLM eval/training, `AgentArena` for benchmarking custom agents (STORM, ODR, etc.)\n\n### How Scoring Works\n\nFor each ground-truth pitch point, the judge assigns:\n\n| Score | Meaning |\n|-------|---------|\n| 0 | Miss — concept absent from candidate pitch |\n| 1 | Marketing fluff — vague mention, no substance |\n| 2 | Topic match — correct product/pain, wrong solution |\n| 3 | Soft match — correct value prop, missing hero evidence |\n| 4 | Strong — core value + key mechanism captured |\n| 5 | Bullseye — product + pain + value + specific metric |\n\nReward = `avg(scores) / 5.0` normalized to [0, 1].\n\n---\n\n## Quick Start\n\n### Install\n\n```bash\nprime env install sdr-arena\n# or\npip install git+https://github.com/h4shk4t/sdr-arena-verifiers.git\n```\n\n### Setup credentials\n\nCreate a `.env` file in the environment directory:\n\n```bash\n# Azure GPT-4o (judge model)\nENDPOINT=https://<resource>.cognitiveservices.azure.com\nSUBSCRIPTION_KEY=<azure-api-key>\nAPI_VERSION=2025-01-01-preview\n\n# Brightdata SERP (web search)\nBRIGHTDATA_API_TOKEN=<brightdata-token>\nBRIGHTDATA_SERP_ZONE=serp_api_bdr\n```\n\n### Run evaluation\n\n```bash\n# Default (5 examples, 3 rollouts)\nprime eval run sdr-arena\n\n# With Azure GPT-4o as the policy model\nset -a && source .env && set +a\nprime eval run sdr-arena -m azure-gpt4o -n 45 -r 3 -s -v \\\n  -b \"https://<resource>.cognitiveservices.azure.com/openai/deployments/gpt-4o?api-version=2025-01-01-preview\" \\\n  -k SUBSCRIPTION_KEY\n```\n\n> **Azure note**: Always pass both `-b` and `-k` explicitly — the prime bridge otherwise injects its own inference URL and API key.\n\n---\n\n## Custom Agents (AgentArena)\n\nBenchmark any Python agent (STORM, ODR, custom pipelines) against the same dataset and judge:\n\n```python\nfrom sdr_arena import build_dataset, build_rubric, AgentArena\n\nasync def my_agent(topic, search_fn, client=None, model=\"\", info={}):\n    results = await search_fn([f\"{info['seller']} {info['customer']} case study\"])\n    # ... your multi-step research pipeline ...\n    return \"pitch points string\"\n\ndef load_environment(**kwargs):\n    return AgentArena(agent_fn=my_agent, dataset=build_dataset(), rubric=build_rubric())\n```\n\nRegister in `pyproject.toml` and run:\n\n```toml\n[tool.verifiers.environments]\nmy-agent = \"my_agent_module\"\n```\n\n```bash\nprime eval run my-agent -m azure-gpt4o -n 15 -r 3 -b \"...\" -k SUBSCRIPTION_KEY\n```\n\nThe included `storm-arena` environment implements the full [STORM](https://arxiv.org/abs/2402.14207) pipeline (perspective generation, knowledge curation via parallel Q&A with web search, two-pass outline, pitch synthesis).\n\n---\n\n## Dataset\n\n| Field | Description |\n|-------|-------------|\n| `prompt` | BDR task: seller, customer, products to pitch |\n| `answer` | Ground-truth pitch points with evidence |\n| `info.seller` | Seller company |\n| `info.customer` | Target customer |\n| `info.products` | Products being pitched |\n| `info.published_date` | Story publication date (search cutoff) |\n\n7,283 matched samples derived from [SDR-Bench](https://huggingface.co/datasets/behavior-in-the-wild/SDR-Bench). The original dataset contains 6,279 success stories; the matched samples pair prompts with extracted pitch point ground truth.\n\n---\n\n## Environment Arguments\n\n| Arg | Default | Description |\n|-----|---------|-------------|\n| `websearch_concurrency` | `20` | Max parallel Brightdata SERP requests |\n\n---\n\n## Citation\n\n```bibtex\n@misc{sdrbench2026,\n  author = {Srivastava, Ashutosh and Yedlapati, Siddharth and Aggarwal, Vinay and Dixit, Shashwat and Singla, Yaman Kumar},\n  title = {SDR-Bench: Benchmarking the Personalization Capabilities of Large Language Models},\n  year = {2026},\n  publisher = {Behavior in the Wild},\n  howpublished = {\\url{https://behavior-in-the-wild.github.io/SDR-Bench.html}},\n}\n```\n\n## License\n\nMIT\n","encoding":"utf-8","truncated":false,"total_bytes":5616},"status":null}