{"data":{"kind":"file","path":"README.md","version_id":"cxrwktdvuiphl8eoh6fml95t","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4560,"modified_at":"2025-09-14T20:20:10.523000","content_hash":"e52925d2e281b444ca127a14712d950196a1fbacfd048814d76d16fe81882eb0"},"entries":[],"content":"# csv-qa\n\n### Overview\n- **Environment ID**: `csv-qa`\n- **Short description**: Synthetic CSV QA over a Faker-generated product catalog. Two single-turn tasks (sum per category, top-K expensive) with optional tool use (`csv_filter`, `csv_agg`)\n- **Tags**: synthetic, tool-use, csv, tabular, arithmetic, hello-world\n\n### Datasets\n- **Primary dataset(s)**: Generated on-the-fly (Faker).\n- **Split sizes**: N/A (custom-generated; TODO)\n\n### Task\n- **Type:** Filtering, summing, sorting, top-K selection on a CSV table. Single-turn (or multi-turn with tool use).\n- **Parsers:** `XMLParser(fields=[\"think\", \"answer\"], answer_field=\"answer\")`\n- **Rubric overview:** `reward_exact_numeric_match` (parses `<answer>...</answer>` and compares to ground-truth). 
No ToolRubric used.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval csv-qa\n```\n\nDefaults used in the published eval:\n\n```bash\nvf-eval csv_qa \\\n  --model gpt-4.1-mini \\\n  --num-examples 20 \\\n  --temperature 0.7 \\\n  --rollouts-per-example 3 \\\n  --save-dataset \\\n  --env-args '{\n        \"think\": true,\n        \"allow_tool_use\":true,\n        \"num_items\":100,\n        \"num_categories\":10,\n        \"locale\":\"en\",\n        \"seed\":42,\n        \"task_configs\":{\n            \"sum_price_by_category\":{},\n            \"top_k_expensive_by_category_sorted\":{\"top_k\":3}\n       }}'\n```\nThese settings were not specifically tuned.\n\n<details> \n\n<summary>GPT-4.1-mini ablation</summary>\n\n| Setting      | Think | Tool Use | Avg Reward (= Accuracy) | Std Dev | Correct / Total | Saved Dataset                                 |\n| ------------ | ----- | -------- | ----------------------- | ------- | --------------- | --------------------------------------------- |\n| GPT-4.1-mini | ✅     | ✅        | **0.950** (95.0%)       | 0.218   | 57 / 60         | `outputs/evals/csv_qa--gpt-4.1-mini/dd172a05` |\n| GPT-4.1-mini | ❌     | ✅        | **0.217** (21.7%)       | 0.412   | 13 / 60         | `outputs/evals/csv_qa--gpt-4.1-mini/7fc45f47` |\n| GPT-4.1-mini | ✅     | ❌        | **0.983** (98.3%)       | 0.128   | 59 / 60         | `outputs/evals/csv_qa--gpt-4.1-mini/d06eca6c` |\n| GPT-4.1-mini | ❌     | ❌        | **0.167** (16.7%)       | 0.373   | 10 / 60         | `outputs/evals/csv_qa--gpt-4.1-mini/7fb58b6b` |\n\n</details>\n\n\n### Environment Arguments\n\n| Arg              | Type | Default     | Description                                                                                            |\n| ---------------- | ---- | ----------- | ------------------------------------------------------------------------------------------------------ |\n| `think`          | bool | `true`      | If true, adds 
think prompt.                                                                            |\n| `allow_tool_use` | bool | `true`      | If true, runs as `ToolEnv` with `csv_filter` & `csv_agg`, else behaves like single-turn.               |\n| `num_items`      | int  | `100`       | Number of products generated.                                                                          |\n| `num_categories` | int  | `10`        | Distinct categories.                                                                                   |\n| `locale`         | str  | `\"en\"`      | Faker locale.                                                                                          |\n| `seed`           | int  | `42`        | RNG/Faker seed for reproducibility.                                                                    |\n| `task_configs`   | dict | see example | Adjust task configs. If a task is not provided, it is excluded from the dataset.                       |\n\n**How many examples are evaluated?**\nWith the current category-based tasks, each included task yields **one example per category**.\nTotal examples in the dataset = `num_categories * (#enabled_tasks)`.\nThe CLI `--num-examples` flag caps this total; the number of rollouts is the capped total multiplied by `--rollouts-per-example`.\n\n\n### Metrics\n\n| Metric                       | Meaning                                                                                                       |\n| ---------------------------- | ------------------------------------------------------------------------------------------------------------- |\n| `reward`                     | The scalar reward (weighted sum of rubric fns). In this env it currently equals `reward_exact_numeric_match`. |\n| `reward_exact_numeric_match` | 1.0 iff the parser’s extracted `<answer>` exactly equals the ground-truth string; else 0.0.                   |\n\n","encoding":"utf-8","truncated":false,"total_bytes":4560},"status":null}