{"data":{"kind":"file","path":"README.md","version_id":"hu4tyc18yzh27jos0c0282p0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2355,"modified_at":"2025-10-15T07:22:58.716000","content_hash":"9f8d04232be3d9d8688d9c549d472d7e37a27d571c345b54b0ddb76b7b86a3d0"},"entries":[],"content":"# RealWorldQA\n\nRealWorldQA is an xAI benchmark of 765 images that evaluates real-world spatial understanding through question answering. Released with Grok-1.5 Vision, it contains anonymized images taken from vehicles and other real-world scenes, paired with questions that are easy for humans but challenging for frontier models. This environment evaluates vision-language models on image-based reasoning tasks and expects the final answer wrapped in `**`.\n\n### Overview\n- **Environment ID**: `RealWorldQA`\n- **Short description**: Real-world question answering with image-based reasoning\n- **Tags**: vlm, image, question-answering, real-world\n\n### Datasets\n- **Primary dataset(s)**: RealWorldQA dataset containing questions with corresponding images\n- **Source links**: [visheratin/realworldqa](https://huggingface.co/datasets/visheratin/realworldqa)\n- **Split sizes**: Test is the only split, with `765` instances\n\n### Task\n- **Type**: single-turn\n- **Parser**: Custom parser for extracting answers wrapped in `**`\n- **Rubric overview**: Exact match on the correct answer extracted from the model response\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval RealWorldQA\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval RealWorldQA -m google/gemini-2.5-flash-image-preview -n 5 -r 1 -s -t 1024 -T 0.7 -a '{\"num_samples\": 5}'\n```\n\nExamples with different environment arguments:\n\n```bash\n# Use only 10 evaluation samples\nuv run vf-eval RealWorldQA -a '{\"num_samples\": 10}'\n\n# Use all available samples\nuv run vf-eval RealWorldQA -a '{\"num_samples\": -1}'\n\n# Use a different model via OpenRouter\nuv run vf-eval RealWorldQA -k AK -b https://openrouter.ai/api/v1 -m 'google/gemini-2.5-flash-image-preview' -a '{\"num_samples\": 5}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nSupported environment arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_samples` | int | `-1` | Number of samples to use (`-1` for all available samples) |\n\n### Metrics\nKey metrics emitted by the rubric:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Binary reward (1 for a correct answer, 0 for an incorrect one) |\n| `accuracy` | Exact match on the target answer extracted from the `**`-wrapped response |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2355},"status":null}