{"data":{"kind":"file","path":"README.md","version_id":"jjxev07c6yomw5zkxgpeevsr","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8722,"modified_at":"2025-08-24T21:38:42.125000","content_hash":"b06a75fe40bec96cebcc5deadd6ce681d14b8b0ac28c85e84b2e89cd6697c4c3"},"entries":[],"content":"# IFBench\n\nProfessional IFBench evaluation environment with a **single configuration argument** for automatic dataset selection.\n\n## **Single Configuration Argument**\n\n**One `mode` argument controls everything:**\n```bash\nuv run vf-eval ifbench -a '{\"mode\": \"test\"}'           # → IFBench_test\nuv run vf-eval ifbench -a '{\"mode\": \"multi_test\"}'     # → IFBench_multi-turn\nuv run vf-eval ifbench -a '{\"mode\": \"train\"}'          # → IF_multi_constraints_upto5\n```\n\n### **What You Get:**\n- **Zero Configuration**: Works out of the box\n- **Automatic Dataset Selection**: Each mode uses its optimal dataset\n- **Professional Interface**: Clean, consistent behavior\n- **Full Dataset Access**: 294 test + 1,387 multi-turn + 95,373 training examples\n\n## Overview\n**Environment ID**: ifbench  \n**Short description**: Instruction-following evaluation with verifiable constraints from the IFBench benchmark  \n**Tags**: ifbench, single-turn, instruction-following, constraints, verification, evaluation\n\n## Datasets\n**Primary dataset(s)**: IFBench  \n**Source links**: [IFBench GitHub](https://github.com/allenai/IFBench)  \n**Split sizes**: 294 test, 1,387 multi-turn, and 95,373 training examples (selected via `mode`)\n\n## Task\n**Type**: single-turn  \n**Parser**: Custom IFBenchParser for constraint verification  \n**Rubric overview**: Three reward functions focusing on constraint adherence, format quality, and parsing success\n\n## Quickstart\nRun an evaluation with default settings (test dataset):\n\n```bash\nuv run vf-eval ifbench\n```\n\nConfigure model and sampling with different datasets:\n\n```bash\n# Test dataset (default)\nuv run vf-eval ifbench \\\n  -m gpt-4 
\\\n  -n 20 -r 3 -t 1024 \\\n  -a '{\"constraint_filter\": \"keyword\"}'\n\n# Multi-turn evaluation mode\nuv run vf-eval ifbench \\\n  -a '{\"mode\": \"multi_test\"}'\n\n# Training evaluation mode\nuv run vf-eval ifbench \\\n  -a '{\"mode\": \"train\", \"constraint_filter\": \"word\"}'\n```\n\n## Available Modes\n| Mode | Description | Dataset Source | Examples | Output Message |\n|------|-------------|----------------|----------|----------------|\n| `test` | Standard test evaluation (default) | Local IFBench data + HF fallback | 294 | \"Mode: Test evaluation - using IFBench test dataset\" |\n| `multi_test` | Multi-turn evaluation | IFBench multi-turn test dataset | 1,387 | \"Mode: Multi-turn evaluation - using IFBench multi-turn test dataset\" |\n| `train` | Training evaluation | IF-RLVR training dataset | 95,373 | \"Mode: Training evaluation - using IF-RLVR training dataset (multi-constraint)\" |\n\n**Notes**:\n- Use `-a / --env-args` to pass environment-specific configuration as a JSON object\n- Each mode automatically selects its optimal dataset with zero configuration\n- Test mode prioritizes local GitHub data when available\n\n## Environment Arguments\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `mode` | str | `\"test\"` | Dataset mode: \"test\" (IFBench_test), \"multi_test\" (IFBench_multi-turn), or \"train\" (IF_multi_constraints_upto5) |\n| `constraint_filter` | str | `None` | Filter constraints by type (e.g., \"keyword\", \"word\", \"paragraph\") |\n| `system_prompt` | str | `(default provided)` | Custom system prompt for the task |\n\n## Metrics\n| Metric | Meaning | Weight |\n|--------|---------|---------|\n| `reward` | Final score (weighted sum of rubric functions) | - |\n| `constraint_adherence_reward` | Score for following specific constraints (0.0-0.9) | 0.6 |\n| `format_quality_reward` | Score for response quality and formatting (0.0-1.0) | 0.3 |\n| `format_reward` | Score for parsing success and content 
quality (0.0-1.0) | 0.1 |\n\n## Constraint Types Supported\n- **Keyword counting**: Evaluates keyword frequency requirements\n- **Word counting**: Assesses response length based on word count constraints\n- **Paragraph counting**: Checks paragraph structure and count\n- **Generic constraints**: Fallback scoring for other constraint types\n\n## Implementation Details\n- **Custom Parser**: `IFBenchParser` class handles constraint verification\n- **Safe Data Extraction**: Robust handling of various completion formats\n- **Flexible Scoring**: Adapts scoring based on constraint type\n- **Error Handling**: Graceful degradation for malformed responses\n\n## Example Usage\n```python\nfrom ifbench import load_environment\n\n# Load test dataset (default)\nenv = load_environment()\n\n# Load multi-turn evaluation mode\nenv = load_environment(mode=\"multi_test\")\n\n# Load training evaluation mode with keyword filter\nenv = load_environment(mode=\"train\", constraint_filter=\"keyword\")\n\n# Load limited examples from test set\nenv = load_environment(mode=\"test\", num_examples=100)\n\n# Run evaluation\nresults = env.evaluate(...)\n```\n\n## **Verified Working Modes**\n\nAll three modes have been tested and verified to work correctly:\n\n### **Test Mode** (`mode: \"test\"`)\n```bash\nuv run vf-eval ifbench -a '{\"mode\": \"test\"}'\n# Loads 294 examples from IFBench_test\n# Uses local GitHub data when available\n# Falls back to Hugging Face seamlessly\n```\n\n### **Multi-Test Mode** (`mode: \"multi_test\"`)\n```bash\nuv run vf-eval ifbench -a '{\"mode\": \"multi_test\"}'\n# Loads 1,387 examples from IFBench_multi-turn\n# Handles multi-turn conversation format\n# Includes conversation history in prompts\n```\n\n### **Train Mode** (`mode: \"train\"`)\n```bash\nuv run vf-eval ifbench -a '{\"mode\": \"train\"}'\n# Loads 95,373 examples from IF_multi_constraints_upto5\n# IF-RLVR training dataset format\n# Multi-constraint examples up to 5 constraints\n```\n\n## Automatic GitHub 
Integration\nThe IFBench environment automatically integrates with the official repository:\n\n### **No Setup Required:**\n```bash\n# Just run - automatically downloads from GitHub\nuv run vf-eval ifbench\n```\n\n### **How It Works:**\n1. **Automatic Detection**: Checks for local IFBench data\n2. **GitHub Download**: Automatically clones `https://github.com/allenai/IFBench.git`\n3. **Local Caching**: Saves data locally for future use\n4. **Fallback**: Uses the Hugging Face dataset if GitHub is unavailable\n\n### **Manual Setup (Optional):**\nIf you prefer manual control:\n```bash\n# Clone IFBench repository\ngit clone https://github.com/allenai/IFBench.git\n\n# Copy data to environment\ncp -r IFBench/data environments/ifbench/temp_ifbench/\n\n# Run evaluation\nuv run vf-eval ifbench\n```\n\nThe environment automatically provides the best available data source with real constraint verification.\n\n## Related Resources\n- **Official IFBench Repository**: [https://github.com/allenai/IFBench.git](https://github.com/allenai/IFBench.git)\n- **IFBench Paper**: [Precise IF Generalization Abilities](https://github.com/allenai/IFBench/blob/main/Precise_IF_Generalization_Abilities.pdf)\n- **IF-RLVR Training**: [open-instruct IFEvalG](https://github.com/allenai/open-instruct/tree/main/open_instruct/IFEvalG)\n\n## **Key Benefits**\n\n- **Single Configuration**: One `mode` argument controls everything\n- **Zero Configuration**: Works out of the box with automatic dataset selection\n- **Professional Interface**: Clean, consistent behavior across all modes\n- **Full Dataset Access**: All three major IFBench datasets are available\n- **Automatic Optimization**: Each mode uses its optimal dataset\n\n## Enhanced Features\n- **Local Data Integration**: Automatically detects and uses local IFBench test data when available\n- **Real Constraint Verification**: Uses actual IFBench constraint parameters for accurate scoring\n- **Keyword Counting**: Implements precise keyword frequency verification 
(e.g., kaleidoscope×1, nebula×2, whisper×3, labyrinth×5, paradox×7)\n- **Streamlined Modes**: Simple mode-based configuration (test, multi_test, train) using reliable datasets\n- **Fallback Support**: Gracefully falls back to Hugging Face datasets when local data is unavailable\n\n## **Complete Dataset Comparison**\n\n| Mode | Dataset | Examples | Purpose | Format | Source |\n|------|---------|----------|---------|---------|---------|\n| `test` | IFBench_test | 294 | Standard evaluation | IFBench format | Local GitHub + HF |\n| `multi_test` | IFBench_multi-turn | 1,387 | Multi-turn evaluation | Multi-turn conversation | Hugging Face |\n| `train` | IF_multi_constraints_upto5 | 95,373 | Training evaluation | IF-RLVR format | Hugging Face |\n\n### **Dataset Details**\n- **Local (GitHub)**: `IFBench_test.jsonl` - 294 test examples with out-of-distribution (OOD) constraints\n- **Hugging Face**:\n  - `allenai/IFBench_test` - Test dataset (fallback for test mode)\n  - `allenai/IFBench_multi-turn` - Multi-turn test dataset (1,387 examples)\n  - `allenai/IF_multi_constraints_upto5` - IF-RLVR training dataset (95,373 examples, up to 5 constraints)\n- **Note**: Each mode automatically uses its specialized dataset for optimal evaluation\n\n## Evaluation Reports\nNo reports found. Run `uv run vf-eval ifbench -a '{\"key\": \"value\"}'` to generate one.\n","encoding":"utf-8","truncated":false,"total_bytes":8722},"status":null}