{"data":{"kind":"file","path":"README.md","version_id":"nps0ub64mu63kxt84fmz7574","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3463,"modified_at":"2025-10-26T00:09:52.448000","content_hash":"848a4ae589e1ccdbf1649e5bcaa70466a5e0d5c8a23e53b23578299b573adeec"},"entries":[],"content":"# regex-golf\n\n### Overview\n- **Environment ID**: `regex-golf`\n- **Short description**: Generate regular expressions from natural language descriptions with test cases\n- **Tags**: regex, single-turn, train, eval\n\n### Datasets\n- **Primary dataset(s)**: RegexEval - 762 regex patterns with human-written prompts and test cases\n- **Source links**: [s2e-lab/RegexEval on HuggingFace](https://huggingface.co/datasets/s2e-lab/RegexEval)\n- **Split sizes**: 700 train / 62 eval (default, configurable)\n\n### Task\n- **Type**: single-turn\n- **Parser**: XMLParser with `<regex>...</regex>` tags\n- **Rubric overview**:\n  - `correctness_reward_func`: Test regex against all match/non-match examples (weight: 5.0)\n  - `syntax_validity_reward_func`: Valid Python regex syntax (weight: 1.0)\n  - `length_efficiency_reward_func`: Brevity compared to reference solution (weight: 2.0)\n  - `format_reward_func`: Proper XML tag format (weight: 0.5)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval regex-golf\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval regex-golf \\\n  -m gpt-4o-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_train_examples\": 500, \"num_eval_examples\": 50}'\n```\n\nTest with a small sample:\n\n```bash\nuv run vf-eval regex-golf -n 2 -a '{\"num_eval_examples\": 2}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `700` | Number of training examples |\n| `num_eval_examples` | int | `62` | Number of evaluation examples |\n| `seed` | int | `42` | Random seed for reproducibility |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of all reward functions |\n| `correctness_reward_func` | Percentage of test cases passed (match examples must match, non-match examples must not match) |\n| `syntax_validity_reward_func` | Binary reward: 1.0 if regex compiles without errors, 0.0 otherwise |\n| `length_efficiency_reward_func` | Efficiency score: 1.0 if shorter than or equal to reference, decreasing for longer patterns |\n| `format_reward_func` | Reward for proper `<regex>...</regex>` XML tag formatting |\n\n### Dataset Details\n\nEach example includes:\n- **Natural language description**: Clear specification of what the regex should match\n- **Match examples**: 5+ strings the regex must match\n- **Non-match examples**: 5+ strings the regex must NOT match\n- **Reference solution**: Expert regex pattern for comparison\n\n### Benchmarking\n\nThis environment can be used to:\n1. Benchmark model performance on regex generation tasks\n2. Compare against reference solutions from the RegexEval dataset\n3. Measure both correctness and efficiency (pattern length)\n4. Evaluate understanding of regex syntax and edge cases\n\nExample benchmark metrics:\n- **Correctness rate**: Percentage of examples where regex passes all test cases\n- **Average length ratio**: Model's regex length vs reference solution length\n- **Syntax error rate**: Percentage of invalid regex patterns generated\n\n### Citation\n\nIf you use this environment or the RegexEval dataset, please cite:\n\n```bibtex\n@dataset{regexeval2024,\n  title={RegexEval: A framework for evaluating regex expressions against natural language descriptions},\n  author={S2E Lab},\n  year={2024},\n  publisher={Hugging Face},\n  url={https://huggingface.co/datasets/s2e-lab/RegexEval}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3463},"status":null}