{"data":{"kind":"file","path":"README.md","version_id":"jlecesrsit8x2fpi0x9sx6nh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2783,"modified_at":"2026-06-01T20:27:36.666000","content_hash":"1391e64b85384d29bd81feda9a6cbdf7a874afbac4a057df85fac0713d1c3755"},"entries":[],"content":"# graphwalks\n\n### Overview\n\n- **Environment ID**: `graphwalks`\n- **Short description**: GraphWalks graph traversal benchmark (single-turn)\n- **Tags**: single-turn, long-context, graph\n\n### How It Works\n\nThis environment implements the [GraphWalks benchmark](https://huggingface.co/datasets/openai/graphwalks) for evaluating graph traversal capabilities in a single-turn setting. The full prompt (instructions + graph edge list + operation question) is passed directly to the model.\n\n### Dataset\n\n- [openai/graphwalks](https://huggingface.co/datasets/openai/graphwalks) — 1,150 samples in the training split\n- Columns: `prompt`, `answer_nodes`, `prompt_chars` (2.6k–1.75M), `problem_type`, `date_added`\n\n### Quickstart\n\n```bash\n# Basic evaluation (exact scoring)\nuv run vf-eval graphwalks -m gpt-5-mini -n 5\n\n# F1 scoring\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"scoring\": \"f1\"}'\n\n# Filter by problem type\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"problem_type\": \"parents\"}'\n\n# Only prompts with >1M characters\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"prompt_chars_filter\": \">1000000\"}'\n\n# Only prompts with <10k characters\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"prompt_chars_filter\": \"<10000\"}'\n\n# Only prompts between 128k and 256k characters (inclusive range)\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"prompt_chars_filter\": \"128000-256000\"}'\n\n# Combine filters: BFS problems with >100k chars\nuv run vf-eval graphwalks -m gpt-5-mini -n 5 -a '{\"problem_type\": \"BFS\", \"prompt_chars_filter\": \">100000\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"train\"` | Dataset split |\n| `scoring` | str | `\"exact\"` | Scoring mode: `\"exact\"` (set equality) or `\"f1\"` (set overlap) |\n| `prompt_chars_filter` | str \\| None | `None` | Filter on `prompt_chars` column using comparison operators (`\">\"`, `\"<\"`, `\">=\"`, `\"<=\"`, `\"==\"`) or an inclusive range (`\"128000-256000\"`). Example: `\">1000000\"` for entries with >1M prompt chars |\n| `problem_type` | str \\| None | `None` | Filter by `problem_type` (e.g., `\"parents\"`, `\"BFS\"`) |\n| `shuffle` | bool | `False` | Whether to shuffle the dataset |\n| `seed` | int \\| None | `None` | Random seed for shuffling |\n| `num_examples` | int | `-1` | Maximum number of examples (-1 = all) |\n\n### Metrics\n\n- **Exact** (`scoring=\"exact\"`): 1.0 if predicted node set exactly matches truth set, 0.0 otherwise.\n- **F1** (`scoring=\"f1\"`): Harmonic mean of precision and recall based on set overlap of predicted vs truth nodes.\n\nThe model's final answer is expected in the format: `Final Answer: [node1, node2, node3]`\n\n### Changelog\n\n- 0.1.0: Initial version with `prompt_chars_filter`, `problem_type` filter, exact/F1 scoring\n","encoding":"utf-8","truncated":false,"total_bytes":2783},"status":null}