{"data":{"kind":"file","path":"README.md","version_id":"v4jd4j4pkeuoqszir8inkrr5","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7837,"modified_at":"2025-09-22T02:03:00.242000","content_hash":"c26462893f6eff8a0d4f1488b7f4152ddbef3c4dcc40cd013f94a304113c2c69"},"entries":[],"content":"# vf-react-scene-eval\n\nEnvironment for evaluating language models on React component generation for video scenes. This environment generates React components for Remotion video scenes, renders them to images, and uses multimodal LLM evaluation to assess design quality.\n\n## Overview\n\nThis environment tests models' ability to:\n- Generate syntactically correct React/TSX components for Remotion\n- Create visually appealing scene layouts with proper spacing and alignment\n- Follow Remotion animation patterns and component structure\n- Include specified text content accurately\n- Export components with proper structure (`export default ComponentName`)\n\nThe evaluation pipeline consists of:\n1. **Component Generation**: Model creates React component code in XML tags\n2. **Code Parsing**: Tree-sitter extracts and validates component structure\n3. **Remotion Rendering**: Component is rendered to PNG image at final frame\n4. **Design Evaluation**: GPT-4o-mini evaluates rendered image quality\n\n## Architecture\n\n### Core Components\n\n- **`ReactSceneEnvironment`**: Main environment class inheriting from `verifiers.Environment`\n- **`SceneComponentParser`**: Tree-sitter based parser for React/TSX code extraction\n- **`DesignJudgeRubric`**: Multimodal LLM judge for design quality assessment\n- **`RemotionRenderer`**: Handles component rendering using Remotion CLI\n\n### Evaluation Pipeline\n\n```\nUser Query → Model Generation → XML Extraction → Component Parsing → Remotion Rendering → Design Evaluation\n```\n\n## Scoring System\n\nThe environment uses a **binary reward system**:\n\n### Parser Validation (Weight: 1.0)\n- **Score**: `0.0` for valid syntax, `-1.0` for invalid\n- **Checks**: Imports, exports, syntax errors via tree-sitter\n- **Requirement**: Code must parse successfully and have proper structure\n\n### Design Quality Evaluation (Weight: 1.0)\nThree binary criteria evaluated by GPT-4o-mini on rendered images:\n\n1. **Space Utilization** (Binary: 0 or 1)\n   - Effective use of whitespace and layout areas\n   - Avoids overcrowding and excessive empty space\n   - Maintains comfortable margins and visual weight distribution\n\n2. **Alignment & Spatial Relationships** (Binary: 0 or 1)  \n   - Consistent alignment (horizontal and vertical)\n   - Coherent grid/layout structure\n   - No overlapping or partially visible elements\n\n3. **Text Correctness** (Binary: 0 or 1)\n   - Required text is present and correctly spelled\n   - Text is not truncated, clipped, or overlapped\n   - Sufficient legibility with proper size and contrast\n   - Scores 1 if no specific text was requested\n\n### Final Scoring\n- **Success**: `reward = 1.0` (all criteria met)\n- **Failure**: `reward = -1.0` (any failure: parsing, rendering, or design)\n\n## System Prompts\n\nThe environment uses YAML-configured prompts:\n\n### Generation Prompt\n- Instructs models to create Remotion components\n- Requires `<code></code>` XML wrapper for extraction\n- Specifies `totalFrames` variable and export structure\n- Emphasizes final frame completeness for evaluation\n\n### Judge Prompt  \n- Evaluates rendered image against design criteria\n- Structured JSON output with reasoning and binary scores\n- Uses multimodal GPT-4o-mini with image input\n\n## Usage\n\n### Basic Setup\n```python\nfrom single_turn_env import ReactSceneEnvironment\nfrom datasets import Dataset\n\n# Create test dataset\nqueries = [\n    \"Create a simple title scene with big bold text 'Hello World' centered.\",\n    \"Design a lower-third banner with the name 'Alex Doe' and a subtitle.\",\n]\ndataset = Dataset.from_dict({\"question\": queries})\n\n# Initialize environment\nenv = ReactSceneEnvironment(dataset=dataset)\n```\n\n### Evaluation\n```python\nimport asyncio\nfrom openai import AsyncOpenAI\n\nasync def evaluate():\n    client = AsyncOpenAI()\n    \n    # Format inputs\n    formatted = env.get_dataset()\n    inputs = {\n        \"prompt\": list(formatted[\"prompt\"]),\n        \"info\": [{} for _ in range(len(formatted))],\n    }\n    \n    # Run evaluation\n    results = await env.a_generate(\n        inputs=inputs,\n        client=client,\n        model=\"gpt-4o-mini\",\n        sampling_args={\"temperature\": 0.2},\n        score_rollouts=True,\n    )\n    \n    return results\n\nresults = asyncio.run(evaluate())\n```\n\n### Results Structure\n```python\n# Results contain:\nresults.completion  # Generated component code\nresults.reward      # Binary rewards: 1.0 (success) or -1.0 (failure)\nresults.metrics     # Detailed metrics:\n#   - parser_reward_func: Parser validation score\n#   - space_utilization: Space usage score  \n#   - alignment: Alignment quality score\n#   - text_correctness: Text accuracy score\n```\n\n## Dependencies\n\n### Required\n- `verifiers>=0.1.2` - Base environment framework\n- `openai>=1.0.0` - LLM evaluation\n- `pydantic>=2.0.0` - Data validation\n- `datasets>=2.0.0` - Dataset handling\n- `tree-sitter-typescript` - Code parsing\n- `python-dotenv>=1.0.0` - Environment variables\n- `pyyaml>=6.0.0` - Configuration loading\n\n### External Tools\n- **Node.js & npm** - Required for Remotion CLI\n- **Remotion CLI** - Component rendering (`npx remotion`)\n- **Chrome/Chromium** - Headless browser for rendering\n\n### Installation\n```bash\n# Install package\npip install -e .\n\n# Install Remotion globally\nnpm install -g remotion\n\n# Set up environment variables\necho \"OPENAI_API_KEY=your_key_here\" > .env\n```\n\n## File Structure\n\n```\nvf-react-scene-eval/\n├── single_turn_env.py          # Main environment class\n├── parser.py                   # React component parser  \n├── rubrics.py                  # Design evaluation rubric\n├── remotion_utils.py           # Remotion rendering utilities\n├── yaml_prompt_loader.py       # YAML prompt loading\n├── prompts.yaml                # System prompts configuration\n├── templates/                  # Remotion configuration templates\n│   ├── Image.template.tsx      # Composition wrapper\n│   ├── index.template.ts       # Entry point\n│   ├── remotion.config.template.ts\n│   └── tsconfig.template.json\n├── tests/                      # Comprehensive test suite\n│   ├── test_env.py            # End-to-end integration tests\n│   ├── test_parser.py         # Parser unit tests\n│   └── test_rubric.py         # Rubric unit tests\n└── test_data/                 # Sample components for testing\n```\n\n## Testing\n\n```bash\n# Run all tests\npython -m pytest tests/\n\n# Run integration tests\npython -m pytest tests/test_env.py -v\n\n# Run quick integration test\npython tests/test_env.py\n```\n\n## Configuration\n\n### Environment Variables\n- `OPENAI_API_KEY` - Required for design evaluation\n- `OPENAI_MODEL` - Model for generation (default: gpt-4o-mini)\n\n### Customization\n- **System prompts**: Edit `prompts.yaml`\n- **Rendering settings**: Modify `remotion_utils.py`\n- **Evaluation criteria**: Update `rubrics.py`\n- **Parser behavior**: Adjust `parser.py`\n\n## Common Issues\n\n### Rendering Failures\n- **Chrome not found**: Install Chrome/Chromium browser\n- **Remotion not installed**: Run `npm install -g remotion`\n- **Component syntax errors**: Check import/export structure\n- **Missing `totalFrames`**: Ensure variable is defined at top level\n\n### Parser Issues\n- **XML extraction fails**: Verify `<code></code>` wrapper is present\n- **Tree-sitter errors**: Check TypeScript/TSX syntax validity\n- **Component name missing**: Ensure proper `export default` statement\n\n## Contributing\n\n### Adding New Evaluation Criteria\n1. Update `ScoreReason` model in `rubrics.py`\n2. Modify judge prompt in `prompts.yaml`\n3. Add corresponding test cases\n4. Update scoring logic in `DesignJudgeRubric`\n\n### Improving Parser\n1. Extend tree-sitter queries in `parser.py`\n2. Add new validation patterns\n3. Update reward function logic\n4. Add comprehensive test coverage\n\n## License\n\nMIT License - see `LICENSE` file for details.","encoding":"utf-8","truncated":false,"total_bytes":7837},"status":null}