{"data":{"kind":"file","path":"README.md","version_id":"vf8bq8f3v9t3ud0wsqibde74","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2160,"modified_at":"2025-12-03T16:40:02.010000","content_hash":"eeace2686a131b47d74091558bd71e89ab2b98d9ee533c86812d9bdec53688f6"},"entries":[],"content":"# MultiCrit\n\n### Overview\n- **Environment ID**: `MultiCrit`\n- **Short description**: Based on the [Multi-Crit paper](https://www.alphaxiv.org/abs/2511.21662), a benchmark designed to evaluate multimodal judges on their ability to follow pluralistic evaluation criteria and produce reliable criterion-level judgments rather than single overall preferences. It addresses the limitations of existing benchmarks by incorporating multi-criterion human annotations and novel metrics to assess how well large multimodal models recognize trade-offs and resolve conflicts between diverse evaluation dimensions.\n- **Tags**: multi-modal, eval, image\n\n### Datasets\n- **Primary dataset(s)**: [txiong23/Multi-Crit](https://huggingface.co/datasets/txiong23/Multi-Crit)\n- **Split sizes**: open_ended (1k), reasoning (425)\n\n### Task\n- **Type**: single-turn\n- **Rubric overview**:\n  - `p_acc`: Computes Pluralistic Accuracy (whether all responses match the expected answers).\n  - `tos`: Returns the `Trade off sensitivity` value from the state (rate at which both answers are **not correct** when there is a conflict).\n  - `cmr`: Returns the `Conflict matching rate` value from the state (rate at which both answers are **correct** when there is a conflict).\n\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval MultiCrit\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `splits` | str | `\"bar\"` | Comma-separated list of dataset splits to use (e.g., \"reasoning, open_ended\"). |\n| `add_reverse` | bool | `False` | Whether to add reverse examples to the dataset. |\n| `cap` | int | `1000` | Maximum number of examples to load from the dataset (use -1 for all). |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `p_acc` | Pluralistic Accuracy: whether all responses match the expected answers. |\n| `tos` | Trade-off sensitivity: rate at which both answers are not correct when there is a conflict. |\n| `cmr` | Conflict matching rate: rate at which both answers are correct when there is a conflict. |\n","encoding":"utf-8","truncated":false,"total_bytes":2160},"status":null}