{"data":{"kind":"file","path":"README.md","version_id":"pqs92moj393yjupewhae0x9d","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1911,"modified_at":"2025-10-04T16:34:33.840000","content_hash":"db642c56c5bc56ce131234662637c8aa4918440a920a030c97151bc85a510996"},"entries":[],"content":"# taarof-bench\n\nThis environment is inspired by [this paper](https://arxiv.org/pdf/2509.01035). Taarof is a cultural phenomenon within the Iranian community, and the authors have created a dataset for benchmarking.\n\n### Overview\n- **Environment ID**: `taarof-bench`\n- **Short description**: Loads the taarof-related data that can be used for benchmarking.\n- **Tags**: Linguistic, culture, Persian, LLM-as-Judge\n\n### Datasets\n- **Primary dataset(s)**: Nikta/TAAROFBENCH\n- **Source links**: [HF](https://huggingface.co/datasets/Nikta/TAAROFBENCH/viewer/default/taarof_expected?views%5B%5D=taarof_expected&row=0)\n\n### Task\n- **Type**: single-turn\n- **Rubric overview**: Checks if the returned value is correct\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval taarof-bench\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval taarof-bench \\\n  -n 5 -r 1 -s \\\n  -a '{\"judge_model\": \"openai/gpt-oss-20b:free\",\n    \"judge_base_url\": \"https://openrouter.ai/api/v1\",\n    \"judge_api_key_var\": \"AK\", \"seed\": 11}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nSupported environment arguments and their meaning:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | The model to use for judging responses |\n| `judge_base_url` | str | `\"https://api.openai.com/v1\"` | The base URL for the judge API |\n| `judge_api_key_var` | str | `\"OPENAI_API_KEY\"` | Environment variable name for the API key |\n| `split_size` | float | `0.9` | Proportion of data to use for training (0-1) |\n| `seed` | int | `13` | Random seed for reproducibility |\n\n### Metrics\nKey metrics emitted by the rubric and how they are interpreted:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Binary response |\n| `accuracy` | Exact match on target answer |\n\n","encoding":"utf-8","truncated":false,"total_bytes":1911},"status":null}