{"data":{"kind":"file","path":"README.md","version_id":"dny2mdsn5twlq19lgwwmxhb8","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2329,"modified_at":"2025-09-25T09:48:30.585000","content_hash":"c2822e53af62c9dab9f50b73a4558a18328d45e94854bfc46e52159709fa0457"},"entries":[],"content":"# multi-challenge\n\n### Overview\n- **Environment ID**: `multi-challenge`\n- **Short description**: MultiChallenge is a benchmark developed by Scale AI to evaluate large language models (LLMs) on their ability to handle multi-turn conversations with human users.\n- **Tags**: eval, train, multi-turn\n\n### Datasets\n- **Primary dataset(s)**: The `MultiChallenge` dataset of multi-turn conversation stems, along with per-example judgement criteria used to determine whether a final response adheres to guidelines.\n- **Source links**: [Dataset](https://github.com/ekwinox117/multi-challenge/tree/main/data), [main repo](https://github.com/ekwinox117/multi-challenge/tree/main)\n- **Split sizes**: `273` test items\n\n### Task\n- **Type**: multi-turn\n- **Parser**: Custom\n- **Rubric overview**: The reward directly implements the original `MultiChallenge` criteria, which rely on an LLM judge to determine whether a constraint imposed at some point in the conversation (e.g. `Make sure the walk is under 5 minutes...`) matches a particular passing criterion (i.e. 
`YES` or `NO`).\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval multi-challenge\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval multi-challenge -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{\"system_prompt\": \"You are an assistant, particularly skilled at long-horizon conversations and remembering facts and requests from users.\"}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --------- | ---- | ------------- | ----------- |\n| `system_prompt` | str | `DEFAULT_SYSTEM_PROMPT` | The system prompt used when the model generates its final response, conditioned on the static conversation history. |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | The model used to evaluate the responses. |\n| `judge_base_url` | str | `\"https://api.openai.com/v1\"` | The base URL for the judge API. |\n| `judge_api_key_var` | str | `\"OPENAI_API_KEY\"` | The environment variable storing the API key for the judge. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `criteria_passed` | As judged by `judge_model`, did the model's response satisfy the constraints imposed by the `target_question, pass_criteria` pair? |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2329},"status":null}