{"data":{"kind":"file","path":"README.md","version_id":"csz6a2u1hfdeuoiewm67ikd7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4282,"modified_at":"2026-05-18T11:52:34.848000","content_hash":"26a83b0ad49899d040c2b2589a13f3e73258dee2dca6379124e0ac32c3af71e3"},"entries":[],"content":"# CurveBench Environment\n\nA [Prime Intellect Environments Hub](https://app.primeintellect.ai/dashboard/environments/amirmohseni/curvebench-env) environment for evaluating vision-language models on exact topological reasoning over nested Jordan curves.\n\n- **Paper:** [CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves](https://arxiv.org/abs/2605.14068)\n- **Dataset:** [AmirMohseni/CurveBench-Easy](https://huggingface.co/datasets/AmirMohseni/CurveBench-Easy)\n- **Collection:** [AmirMohseni/curvebench](https://huggingface.co/collections/AmirMohseni/curvebench)\n- **GitHub:** [Amir-Mohseni/CurveBench](https://github.com/Amir-Mohseni/CurveBench)\n\n---\n\n## Task Description\n\nGiven an image of pairwise non-intersecting Jordan curves, the model must recover the full rooted containment tree, in which each node is a region of the plane and each edge connects a region to the one that directly contains it.\n\n- The root node (0) represents the unbounded outer region\n- Each curve-enclosed region is assigned a unique node number\n- Edges represent containment: parent region directly contains child region\n\n---\n\n## Dataset\n\nThis environment uses the [CurveBench-Easy](https://huggingface.co/datasets/AmirMohseni/CurveBench-Easy) dataset:\n\n| Split | Level 1 (easy) | Level 2 (medium) | Level 3 (hard) | Total |\n|-------|----------------|------------------|----------------|-------|\n| Train | 84             | 78               | 47             | 210   |\n| Val   | 18             | 16               | 10             | 45    |\n| Test  | 18             | 18               | 11             | 45    |\n\nAvailable splits: `total_train`, 
`total_validation`, `total_test`, `level_1_train`, `level_1_test`, `level_2_train`, `level_2_test`, `level_3_train`, `level_3_test`\n\n---\n\n## Expected Response Format\n\nModels should respond inside `<answer>...</answer>` tags. The first line is the number of nodes (excluding the root), followed by one edge per line as `u v` (child, parent):\n\n```\n<answer>\n3\n1 0\n2 0\n3 1\n</answer>\n```\n\n---\n\n## Scoring\n\n| Reward              | Weight | Description |\n|---------------------|--------|-------------|\n| `tree_reward`       | 0.7    | 1.0 if the predicted tree is isomorphic to the ground truth (up to node relabelling) |\n| `node_count_reward` | 0.3    | 1.0 if the predicted node count matches the ground truth |\n\n---\n\n## Running Evaluations\n\n### Install the environment\n\n```bash\nprime env install amirmohseni/curvebench-env\n```\n\n### Evaluate a model\n\nRun against the full test set with 4 rollouts per example:\n\n```bash\nprime eval run amirmohseni/curvebench-env \\\n  -m \"google/gemma-3-27b-it\" \\\n  -n -1 \\\n  -a '{\"split\": \"total_test\"}' \\\n  -r 4\n```\n\nOther models to try:\n\n```bash\n# Qwen3-VL-235B (thinking mode)\nprime eval run amirmohseni/curvebench-env \\\n  -m \"qwen/qwen3-vl-235b-a22b-thinking\" \\\n  -n -1 \\\n  -a '{\"split\": \"total_test\"}' \\\n  -r 4\n```\n\n### Evaluate a custom endpoint\n\n```bash\nprime eval run amirmohseni/curvebench-env \\\n  -m \"your-model-name\" \\\n  -b \"https://your-endpoint.example.com/v1\" \\\n  -k \"YOUR_API_KEY_ENV_VAR\" \\\n  -n -1 \\\n  -a '{\"split\": \"total_test\"}' \\\n  -r 4\n```\n\n### Evaluate a specific difficulty level\n\n```bash\nprime eval run amirmohseni/curvebench-env \\\n  -m \"google/gemma-3-27b-it\" \\\n  -n -1 \\\n  -a '{\"split\": \"level_3_test\"}' \\\n  -r 4\n```\n\n### CLI flags\n\n| Flag | Description |\n|------|-------------|\n| `-m` | Model name |\n| `-b` | API base URL (for custom endpoints) |\n| `-k` | Environment variable name containing the API key |\n| `-n` 
| Number of examples (`-1` for all) |\n| `-r` | Number of rollouts per example |\n| `-a` | JSON string of arguments passed to `load_environment()` (e.g. `split`) |\n\n---\n\n## Using as a Python Library\n\n```python\nfrom curvebench_env import load_environment\n\nenv = load_environment(split=\"total_test\")\n```\n\n---\n\n## Citation\n\nIf you use this environment in your research, please cite:\n\n```bibtex\n@misc{mohseni2026curvebench,\n      title={CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves},\n      author={Amirreza Mohseni and Mona Mohammadi and Morteza Saghafian and Naser Talebizadeh Sardari},\n      year={2026},\n      eprint={2605.14068},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2605.14068},\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4282},"status":null}