{"data":{"kind":"file","path":"README.md","version_id":"xmi8haytufw7jydvlhtme42t","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1782,"modified_at":"2025-10-16T16:02:41.853000","content_hash":"5711bc46019d4c8392a14f868df1c9b35b8c95ead67a37122368d629c464e5c9"},"entries":[],"content":"# EmbSpatial-Bench\n\n### Overview\n- **Environment ID**: `EmbSpatial-Bench`\n- **Short description**: EmbSpatial-Bench is a benchmark for evaluating the embodied spatial understanding of large vision-language models (LVLMs).\n- **Tags**: spatial, LVLM, RLVR\n\n### Datasets\n- **Primary dataset(s)**: `FlagEval/EmbSpatial-Bench`\n- **Source links**: [HF Dataset](https://huggingface.co/datasets/FlagEval/EmbSpatial-Bench)\n- **Split sizes**: test (3,640 QA pairs)\n\n### Task\n- **Type**: `single-turn`\n- **Parser**: Custom parser that extracts the answer wrapped in `**`\n- **Rubric overview**: Exact-match accuracy on multiple-choice answers; supports filtering by relation and data source\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval EmbSpatial-Bench\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval EmbSpatial-Bench -k <API_KEY> -b <BASE_URL> \\\n  -m 'google/gemma-3-12b-it' \\\n  -n 5 -r 3 -s -a '{\n        \"num_samples\": 5,\n        \"exclude_datasources\": [\"ai2thor\"],\n        \"exclude_relations\": [\"under\",\"left\",\"close\",\"right\"]\n    }'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nThe environment accepts the following arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_samples` | int | `-1` | Limit on dataset size (`-1` uses the full dataset) |\n| `exclude_relations` | List | `[]` | Relations to exclude |\n| `exclude_datasources` | List | `[]` | Data sources to exclude |\n\n### Metrics\nThe rubric emits the following metrics:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (1 for a correct answer, 0 for an incorrect one) |\n| `accuracy` | Exact match on the target answer |\n\n","encoding":"utf-8","truncated":false,"total_bytes":1782},"status":null}