{"data":{"kind":"file","path":"README.md","version_id":"py97jm2zorx80a8s6rw7q6t7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1783,"modified_at":"2025-08-28T21:18:16.267000","content_hash":"18adab41dc67fb4791ce0116800af8c39e5f1b313b31f75d90f172bff0c63351"},"entries":[],"content":"# graphwalks\n\n### Overview\n\n- **Environment ID**: `graphwalks`\n- **Short description**: Evaluate graph operations (BFS depth, parents) on the OpenAI GraphWalks benchmark.\n- **Tags**: reasoning, long-context, graphs, single-turn\n\n### Dataset\n\n- **Primary dataset**: `openai/graphwalks` In Graphwalks, the model is given a graph represented by its edge list and asked to perform an operation. Graphwalks is a multi hop reasoning long context benchmark.\n- **Source links**: <https://huggingface.co/datasets/openai/graphwalks>\n- **Split sizes**: full train split 1.5k rows\n\n### Task\n\n- **Type**: single-turn\n- **Parser**: simple extractor expecting the model to end with: `Final Answer: [node1, node2, ...]`\n- **Rubric**:\n  - `exact`: set equality between predicted nodes and truth\n  - `f1`: partial credit using overlap-based F1\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval graphwalks\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval graphwalks -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{\"key\": \"value\"}'\n```\n\nNotes:\n\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"train\"` | HF split to load |\n| `scoring` | str | `\"exact\"` | One of `exact` or `f1` |\n| `num_examples` | int | `-1` | Limit dataset size (>0 to enable) |\n\n### Metrics\n\n- **exact**: 1.0 if sets match exactly, else 0.0\n- **f1**: uses OpenAI's GraphWalks overlap F1\n\n## Evaluation Reports\n\n<!-- Do not edit below this line. Content is auto-generated. -->\n<!-- vf:begin:reports -->\n<p>No reports found. Run <code>uv run vf-eval ifeval -a '{\"key\": \"value\"}'</code> to generate one.</p>\n<!-- vf:end:reports -->","encoding":"utf-8","truncated":false,"total_bytes":1783},"status":null}