{"data":{"kind":"file","path":"README.md","version_id":"xz1qtt9nsc0h4s4zzrfzh8e0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2931,"modified_at":"2026-02-03T13:29:02","content_hash":"8b12ba03d77936ac0de2435851ab86db940186d16771672c0dd288c6f143a073"},"entries":[],"content":"# crm-sql\n\n### Overview\n- **Environment ID**: `crm-sql`\n- **Short description**: Train models to answer natural language business questions by querying a synthetic CRM database\n- **Tags**: tool-use, sql, multi-turn, crm, train, eval\n\n### Datasets\n- **Primary dataset(s)**: Synthetically generated CRM data (companies, contacts, deals, activities)\n- **Source**: Built-in SQLite database with procedural generation\n- **Split sizes**: 500 train / 20 eval (configurable)\n\n### Task\n- **Type**: multi-turn tool use (`ToolEnv`)\n- **Tools**: `list_tables`, `describe_table`, `run_sql`\n- **Max turns**: 5 (configurable)\n- **Rubric**: Correct answer match, efficiency bonus (fewer queries), optional LLM judge\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime eval run crm-sql\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run crm-sql \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_train_examples\": 500, \"num_eval_examples\": 20, \"judge_model\": \"gpt-4.1\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `500` | Number of training questions |\n| `num_eval_examples` | int | `20` | Number of evaluation questions |\n| `max_turns` | int | `5` | Maximum tool-calling turns per rollout |\n| `seed` | int | `42` | Random seed for data generation |\n| `judge_model` | str \\| None | `None` | Model for LLM judge (e.g. 
`gpt-4.1`) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `correct_answer` | 1.0 if the model's final answer matches the expected answer |\n| `efficiency_bonus` | 1.0/num_tool_turns, which rewards fewer queries for correct answers |\n| `judge` (optional) | LLM judge score for SQL quality and answer clarity |\n\n### Database Schema\n\nThe synthetic CRM database contains 4 tables:\n\n- **companies** (30 rows): id, name, industry, size, annual_revenue, country, created_at\n- **contacts** (120 rows): id, first_name, last_name, email, role, company_id, created_at\n- **deals** (200 rows): id, title, value, stage, owner, company_id, contact_id, created_at, closed_at\n- **activities** (500 rows): id, type, subject, deal_id, contact_id, owner, created_at\n\n### Question Difficulty Levels\n\n- **Easy**: Single-table lookups and counts (e.g. \"How many companies are in Technology?\")\n- **Medium**: Aggregations and simple joins (e.g. \"Which company has the highest deal value?\")\n- **Hard**: Multi-table joins with complex aggregations (e.g. \"What is the win rate for each account manager?\")\n\n### Training Configuration\n\nExample `rl.toml` for hosted training with Qwen3-30B-A3B:\n\n```toml\n[[orchestrator.env]]\nid = \"mahir/crm-sql\"\nargs = { max_turns = 5, judge_model = \"gpt-4.1\" }\n```\n\n### Changelog\n\n#### v0.1.0\n\n- Initial release\n- 3 tools: list_tables, describe_table, run_sql\n- 14 question templates across 3 difficulty levels\n- Correct answer + efficiency bonus rubric\n- Optional LLM judge support\n","encoding":"utf-8","truncated":false,"total_bytes":2931},"status":null}