{"data":{"kind":"file","path":"README.md","version_id":"ugnmw4es5mjnlio1rubvy3k9","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6862,"modified_at":"2025-10-19T12:01:45.955000","content_hash":"07345c712662e1928e603ea2fb64ca721a1d4775686e3fcaeefe48dea5502650"},"entries":[],"content":"# IFEval Environment\n\nAn instruction-following evaluation environment based on the IFEval benchmark, testing models' ability to follow 25 different types of constraints.\n\n## Overview\n\nThis environment evaluates language models on their ability to strictly follow instructions across diverse constraint types including:\n\n- **Content Constraints**: Required/forbidden keywords, language requirements\n- **Format Constraints**: JSON format, title/postscript markers, quotations\n- **Length Constraints**: Word/sentence/paragraph counts with quantifiers (at least, around, at most)\n- **Structure Constraints**: Bullet points, sections, placeholders\n- **Case Constraints**: Uppercase, lowercase, capitalized words frequency\n- **Character Constraints**: Letter frequency, no commas\n\n## Installation\n\n```bash\n# From the environments directory\nvf-install vf-ifeval\n\n# Or with uv\nuv pip install -e ./environments/vf_ifeval\n```\n\n## Usage\n\n### Basic Evaluation\n\n```python\nimport verifiers as vf\nfrom openai import AsyncOpenAI\n\n# Load environment\nenv = vf.load_environment(\"vf-ifeval\")\n\n# Evaluate a model\nclient = AsyncOpenAI()\nresults = await env.a_evaluate(\n    client,\n    model=\"gpt-4o-mini\",\n    num_examples=10,\n    rollouts_per_example=1\n)\n\nprint(f\"Average reward: {sum(results.reward) / len(results.reward):.2f}\")\n```\n\n### CLI Evaluation\n\n```bash\n# Quick test\nvf-eval vf-ifeval -m gpt-4o-mini -n 5\n\n# Full evaluation with multiple rollouts\nvf-eval vf-ifeval -m gpt-4o-mini -n 100 -r 3 -s\n```\n\n### Custom Dataset\n\n```python\nfrom datasets import Dataset\n\n# Create custom examples\nexamples = [\n    {\n        \"prompt\": 
\"Write about AI. Include the words 'neural' and 'network'.\",\n        \"answer\": '{\"func_name\": \"verify_keywords\", \"keyword_list\": [\"neural\", \"network\"]}',\n    },\n    {\n        \"prompt\": \"Respond in around 25 words.\",\n        \"answer\": '{\"func_name\": \"validate_word_constraint\", \"N\": 25, \"quantifier\": \"around\"}',\n    }\n]\n\ndataset = Dataset.from_list(examples)\nenv = vf.load_environment(\"vf-ifeval\", dataset_path=\"path/to/dataset.json\")\n```\n\n## Dataset Format\n\nEach example requires:\n\n```python\n{\n    \"prompt\": str,      # The task with embedded constraint instructions\n    \"answer\": str,      # JSON string specifying the constraint to verify\n    \"info\": dict,       # Optional metadata\n}\n```\n\nThe `answer` field should be a JSON string with format:\n```json\n{\n    \"func_name\": \"constraint_function_name\",\n    \"arg1\": \"value1\",\n    \"arg2\": \"value2\"\n}\n```\n\n## Available Constraints\n\n### Content Constraints\n\n- `verify_keywords`: Check for required keywords\n  - Args: `keyword_list` (list of strings)\n\n- `validate_forbidden_words`: Ensure words are absent\n  - Args: `forbidden_words` (list of strings)\n\n- `validate_response_language`: Verify language\n  - Args: `language` (language code like \"en\", \"es\")\n\n### Format Constraints\n\n- `validate_json_format`: Entire output must be valid JSON\n\n- `validate_title`: Must contain title in `<<title>>` format\n\n- `validate_quotation`: Entire response wrapped in double quotes\n\n- `verify_postscript`: Must end with postscript marker\n  - Args: `postscript_marker` (string)\n\n- `validate_repeat_prompt`: Must repeat prompt before answering\n  - Args: `original_prompt` (string)\n\n### Length/Count Constraints\n\n- `validate_word_constraint`: Word count constraint\n  - Args: `N` (int), `quantifier` (\"at least\"|\"around\"|\"at most\")\n\n- `verify_sentence_constraint`: Sentence count constraint\n  - Args: `N` (int), `quantifier` (\"at 
least\"|\"around\"|\"at most\")\n\n- `verify_paragraph_count`: Paragraph count (separated by `* * *`)\n  - Args: `N` (int)\n\n- `verify_keyword_frequency`: Keyword appears N times\n  - Args: `word` (string), `N` (int)\n\n- `verify_letter_frequency`: Letter appears N times\n  - Args: `letter` (string), `N` (int)\n\n### Structure Constraints\n\n- `verify_bullet_points`: Exactly N bullet points\n  - Args: `N` (int)\n\n- `validate_placeholders`: Minimum N placeholders like `[address]`\n  - Args: `N` (int)\n\n- `validate_sections`: N sections with splitter\n  - Args: `N` (int), `section_splitter` (string)\n\n- `validate_highlighted_sections`: Minimum N `*highlighted*` sections\n  - Args: `N` (int)\n\n- `validate_paragraphs`: N paragraphs with i-th starting with word\n  - Args: `N` (int), `i` (int), `first_word` (string)\n\n### Case Constraints\n\n- `validate_uppercase`: Entire response in uppercase\n\n- `validate_lowercase`: Entire response in lowercase\n\n- `validate_frequency_capital_words`: All-caps words frequency\n  - Args: `N` (int), `quantifier` (\"at least\"|\"around\"|\"at most\")\n\n### Special Constraints\n\n- `validate_two_responses`: Two different responses separated by `******`\n\n- `validate_choice`: Response must match one of options\n  - Args: `options` (list of strings)\n\n- `validate_end`: Must end with exact phrase\n  - Args: `end_phrase` (string)\n\n- `validate_no_commas`: No commas allowed\n\n## Example Prompts\n\n```python\n# Keyword constraint\n{\n    \"prompt\": \"Explain machine learning. 
Use the keywords 'data', 'model', and 'training'.\",\n    \"answer\": '{\"func_name\": \"verify_keywords\", \"keyword_list\": [\"data\", \"model\", \"training\"]}'\n}\n\n# Word count constraint\n{\n    \"prompt\": \"Summarize photosynthesis in around 30 words.\",\n    \"answer\": '{\"func_name\": \"validate_word_constraint\", \"N\": 30, \"quantifier\": \"around\"}'\n}\n\n# Format constraint\n{\n    \"prompt\": \"Create a title for a science fiction story using <<title>> format.\",\n    \"answer\": '{\"func_name\": \"validate_title\"}'\n}\n\n# Combined constraints (using multiple in rubric)\n{\n    \"prompt\": \"Write 3 bullet points about climate change. Use at least 50 words total.\",\n    \"answer\": '{\"func_name\": \"verify_bullet_points\", \"N\": 3}'  # Could add word constraint separately\n}\n```\n\n## Training\n\nTrain models to better follow instructions:\n\n```python\nimport verifiers as vf\n\nenv = vf.load_environment(\"vf-ifeval\")\nmodel, tokenizer = vf.get_model_and_tokenizer(\"Qwen/Qwen2.5-1.5B-Instruct\")\n\nargs = vf.grpo_defaults(run_name=\"ifeval-training\")\nargs.num_generations = 16\nargs.max_tokens = 512\n\ntrainer = vf.GRPOTrainer(\n    model=model,\n    processing_class=tokenizer,\n    env=env,\n    args=args,\n)\n\ntrainer.train()\n```\n\n## Performance Notes\n\n- Different constraint types have varying difficulty levels\n- Some models struggle with precise counting (letter frequency, word counts)\n- Format constraints (JSON, titles) are generally easier\n- Language constraints require multilingual capabilities\n- Combining multiple constraints increases difficulty significantly\n\n## Citation\n\nIf you use this environment, please cite the original IFEval paper:\n\n```bibtex\n@article{zhou2023instructionfollowing,\n  title={Instruction-Following Evaluation for Large Language Models},\n  author={Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le},\n  
journal={arXiv preprint arXiv:2311.07911},\n  year={2023}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6862},"status":null}