{"data":{"kind":"file","path":"README.md","version_id":"u7ys3cy95cg7y2va3a8if3pt","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5187,"modified_at":"2026-03-03T07:39:20.777000","content_hash":"2579d508b58c4c73753615fd303026b76b72910e41f1041f685337cc42b86a57"},"entries":[],"content":"# duckdb-qa\n\n### Overview\n\n- **Environment ID**: `duckdb-qa`\n- **Short description**: Data analyst assistant that answers questions by writing and executing SQL queries against a DuckDB database with 4 schemas (IMDB, COVID, HackerNews, UK property data)\n- **Tags**: sql, duckdb, data-analysis, tool-use, multi-turn\n\n### Datasets\n\n- **Primary dataset**: `diicell/duckdb-qa-v3` — 578 question-answer pairs across 4 DuckDB schemas\n- **Source links**: [HuggingFace Dataset](https://huggingface.co/datasets/diicell/duckdb-qa-v3)\n- **Database**: `duckdb_data.db` downloaded from the same HuggingFace repo\n- **Split**: `train` / `eval` (falls back to 85/15 split with seed 42 if no eval split exists)\n\n### Task\n\n- **Type**: Multi-turn tool use\n- **Parser**: XMLParser with fields `sql_query`, `query_result`, `response`\n- **Rubric**: Single composite reward function (`sql_composite_reward`) combining soft execution accuracy, syntax, schema linking, format, multipliers, and behavioral bonuses\n\n### Tools\n\n| Tool               | Description                                              |\n| ------------------ | -------------------------------------------------------- |\n| `list_tables`      | List all tables and views in the database with comments  |\n| `list_columns`     | List all columns of a table/view with types and comments |\n| `sample_rows`      | Preview sample rows from a table (default limit 5)       |\n| `run_select_query` | Execute a SELECT SQL query (limited to 1024 rows)        |\n\n### Reward Function\n\nThe reward is a single composite function with weighted components, multipliers, and behavioral bonuses:\n\n**Components** (normalized by total weight 7.0):\n\n| Component   | Weight | Description                                                                        |\n| ----------- | ------ | ---------------------------------------------------------------------------------- |\n| `soft_exec` | 5.0    | Continuous execution accuracy (sharpened F1^1.5, value F1, column Jaccard)         |\n| `syntax`    | 1.0    | 1.0 if query executes without error, 0.0 otherwise                                 |\n| `schema`    | 0.5    | Jaccard similarity of schema items (tables/columns) between predicted and gold SQL |\n| `format`    | 0.5    | 1.0 if all 3 XML tags present, 0.0 otherwise                                       |\n\n**Multipliers** (applied to the entire base reward):\n\n| Multiplier  | Range    | Description                                          |\n| ----------- | -------- | ---------------------------------------------------- |\n| `tool_use`  | 0.15–1.0 | 1.0 if `run_select_query` was called, 0.15 otherwise |\n| `reasoning` | 0.4–1.0  | Based on reasoning text length between tool calls    |\n\n**Behavioral bonuses** (additive, post-multiplier):\n\n| Bonus             | Value | Trigger                                        |\n| ----------------- | ----- | ---------------------------------------------- |\n| `exploration`     | 0.08  | Called `sample_rows` before `run_select_query` |\n| `self_correction` | 0.05  | Made 2+ query attempts                         |\n| `error_recovery`  | 0.08  | Failed query → reasoning → successful query    |\n| `completion`      | 0.03  | Last tool call was `run_select_query`          |\n\n**LLM judge fallback**: GPT-4.1-mini fires when value F1 and row F1 disagree (high values, low row match), catching cases where the model got the right answer in a different structure.\n\n### Expected Output Format\n\n```xml\n<sql_query>\nSELECT * FROM schema.table LIMIT 10\n</sql_query>\n<query_result>\n| col1 | col2 |\n|------|------|\n| val1 | val2 |\n</query_result>\n<response>\nNatural language interpretation of results\n</response>\n```\n\n### Quickstart\n\n```bash\nprime eval run duckdb-qa\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run duckdb-qa \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"max_turns\": 5}'\n```\n\n### Environment Arguments\n\n| Arg                 | Type  | Default                       | Description                             |\n| ------------------- | ----- | ----------------------------- | --------------------------------------- |\n| `corpus_dataset`    | str   | `\"diicell/duckdb-qa-v3\"`      | HuggingFace dataset ID for questions    |\n| `corpus_split`      | str   | `\"train\"`                     | Dataset split for training              |\n| `eval_split`        | str   | `\"eval\"`                      | Dataset split for evaluation            |\n| `eval_fraction`     | float | `0.15`                        | Fallback eval fraction if no eval split |\n| `split_seed`        | int   | `42`                          | Seed for train/eval split               |\n| `max_turns`         | int   | `10`                          | Maximum number of tool-use turns        |\n| `judge_model`       | str   | `\"gpt-4.1-mini\"`              | Model for LLM judge fallback            |\n| `judge_base_url`    | str   | `\"https://api.openai.com/v1\"` | API base URL for judge                  |\n| `judge_api_key_var` | str   | `\"OPENAI_API_KEY\"`            | Env var name for judge API key          |\n\n### Dependencies\n\n- `verifiers>=0.1.10`\n- `duckdb>=1.4.4`\n- `OPENAI_API_KEY` env var (for LLM judge fallback)\n","encoding":"utf-8","truncated":false,"total_bytes":5187},"status":null}