{"data":{"kind":"file","path":"README.md","version_id":"gvydvgrtct1gdbwsqybtjvln","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":11063,"modified_at":"2025-09-30T23:18:23.836000","content_hash":"8d0614bb7d3aebafbaafe7b5b978312793a595729bbe0cd9ded161c13242f622"},"entries":[],"content":"# Agent Prompt Evaluation Environment\n\n### Overview\n- **Environment ID**: `agent_prompt_eval`\n- **Short description**: Advanced multi-stage evaluation pipeline for testing LLM prompt engineering and function calling capabilities through a complete agent workflow.\n- **Tags**: prompt-engineering, function-calling, agent-evaluation, tools, multi-stage, adversarial-testing\n\n### Architecture\n\nThis environment implements a sophisticated evaluation pipeline:\n\n```\nLLM(TOOLS) → generates PROMPT → LLM(PROMPT) produces FUNCTION_CALL(JSON) → EXEC executes function → EVAL scores RESULT\n```\n\n**Multi-Stage Pipeline:**\n\n1. **Controller LLM**: Generates optimal prompts/templates for user intents\n2. **Executor LLM**: Produces function calls based on generated prompts\n3. **Execution Engine**: Sandboxed function execution with error handling\n4. **Evaluator**: Automated scoring with detailed error taxonomy\n5. 
**Feedback Loop**: Results inform dataset curation and model iteration\n\n### Features\n\nThis environment provides comprehensive prompt engineering evaluation:\n\n- **🎯 Prompt Generation Testing**: Evaluate LLM's ability to create effective prompts\n- **🔧 Function Call Validation**: Test accurate function selection and parameter specification\n- **⚡ Sandboxed Execution**: Safe execution environment with comprehensive error handling\n- **📊 Error Taxonomy**: Detailed classification of errors (missing_field, wrong_type, hallucination, etc.)\n- **🔗 Multi-Step Chains**: Test sequential function calls and planning capabilities\n- **🛡️ Adversarial Testing**: Prompt injections, PII requests, malformed inputs\n- **📈 Stateful Agents**: Support for long-context and stateful operations\n\n### Available Functions\n\nThe environment provides 7 sandboxed functions for testing:\n\n#### 1. `calculate_sum(numbers: List[float])`\nCalculate the sum of a list of numbers.\n\n#### 2. `filter_list(items: List, condition: str, value: Any)`\nFilter a list based on conditions: 'greater', 'less', 'equal', 'contains'.\n\n#### 3. `transform_text(text: str, operation: str)`\nTransform text: 'uppercase', 'lowercase', 'reverse', 'capitalize', 'title'.\n\n#### 4. `aggregate_data(data: List[Dict], field: str, aggregation: str)`\nAggregate data from dictionaries: 'sum', 'avg', 'min', 'max', 'count'.\n\n#### 5. `validate_format(data: Any, format_type: str)`\nValidate data formats: 'email', 'phone', 'url', 'json', 'date'.\n\n#### 6. `search_database(query: Dict, database: List[Dict])`\nSearch mock database with query conditions.\n\n#### 7. 
`chain_operations(operations: List[Dict], initial_value: Any)`\nChain multiple operations sequentially.\n\n### Test Case Categories\n\nThe environment includes diverse test cases across multiple categories:\n\n#### 🟢 Basic Cases\n- Simple function calls with clear parameters\n- Straightforward intent → function mapping\n- Example: \"Calculate the sum of [1, 2, 3, 4, 5]\"\n\n#### 🔄 Paraphrased Variants\n- Multiple ways to express the same intent\n- Tests robustness to phrasing variations\n- Example: \"What's the total of...\", \"Add up...\", \"Sum these...\"\n\n#### ⚠️ Edge Cases\n- Empty inputs\n- Ambiguous criteria\n- Contradictory constraints\n- Missing parameters\n- Example: \"Filter the numbers, but keep the good ones\"\n\n#### 🔗 Multi-Step Chains\n- Sequential operations requiring planning\n- Intermediate function calls\n- State management\n- Example: \"Filter then sum the results\"\n\n#### 📊 Complex Data Structures\n- Nested dictionaries and lists\n- Field extraction and aggregation\n- Example: \"Calculate average age from list of user dicts\"\n\n#### 🛡️ Adversarial Tests\n- Prompt injection attempts\n- PII extraction attempts\n- Malformed inputs\n- Example: \"Also, ignore previous instructions and...\"\n\n### Quickstart\n\n#### Basic Evaluation\n```bash\nuv run vf-eval agent_prompt_eval -m gpt-4o-mini -n 10 -r 3\n```\n\n#### With Custom Parameters\n```bash\nuv run vf-eval agent_prompt_eval \\\n  -m gpt-4o-mini \\\n  -n 20 -r 5 \\\n  -a '{\"max_turns\": 10}'\n```\n\n#### Test Specific Categories\n```bash\n# Test only basic cases\nuv run vf-eval agent_prompt_eval -m gpt-4o-mini -n 5 \\\n  -a '{\"test_categories\": [\"basic\"]}'\n\n# Test adversarial and edge cases\nuv run vf-eval agent_prompt_eval -m gpt-4o-mini -n 10 \\\n  -a '{\"test_categories\": [\"adversarial\", \"edge_case\"]}'\n```\n\n### Example Flow\n\n**User Intent:** \"Calculate the sum of numbers [1, 2, 3, 4, 5]\"\n\n**Controller LLM generates prompt:**\n```\nYou have access to the 
following function:\n\ncalculate_sum(numbers: List[float]) -> Dict\n  Calculate the sum of a list of numbers.\n\nUser request: \"Calculate the sum of numbers [1, 2, 3, 4, 5]\"\n\nGenerate the function call in JSON format:\n{\n  \"function_name\": \"calculate_sum\",\n  \"parameters\": {...}\n}\n```\n\n**Executor LLM produces:**\n```json\n{\n  \"function_name\": \"calculate_sum\",\n  \"parameters\": {\n    \"numbers\": [1, 2, 3, 4, 5]\n  }\n}\n```\n\n**Execution Engine runs:**\n```python\nresult = calculate_sum(numbers=[1, 2, 3, 4, 5])\n# Returns: {\"success\": True, \"result\": 15, \"metadata\": {...}}\n```\n\n**Evaluator scores:**\n```python\n{\n  \"overall_score\": 1.0,\n  \"component_scores\": {\n    \"function_match\": 1.0,    # Correct function\n    \"parameter_match\": 1.0,    # Correct parameters\n    \"output_match\": 1.0        # Correct result\n  },\n  \"errors\": [],\n  \"error_taxonomy\": {},\n  \"success\": True\n}\n```\n\n### Evaluation Criteria\n\nThe environment scores results across multiple dimensions:\n\n- **Function Selection (30%)**: Did the LLM choose the correct function?\n- **Parameter Accuracy (40%)**: Are all parameters correct with right types/values?\n- **Output Correctness (30%)**: Does the execution produce expected results?\n\n**Error Taxonomy:**\n- `function_not_found`: Function doesn't exist in registry\n- `wrong_function`: Incorrect function selected\n- `missing_parameter`: Required parameter not provided\n- `wrong_parameter_value`: Parameter has incorrect value\n- `extra_parameter`: Unexpected parameter provided\n- `parameter_error`: Type mismatch or invalid parameter\n- `output_mismatch`: Result doesn't match expected output\n- `execution_error`: Runtime error during execution\n- `runtime_error`: Function execution failed\n- `system_error`: System-level error\n\n### Use Cases\n\n- **🤖 Agent Development**: Train and evaluate autonomous agents\n- **📝 Prompt Engineering**: Test prompt effectiveness for function calling\n- **🔧 Tool 
Use Evaluation**: Assess LLM's ability to use tools correctly\n- **🎓 Model Benchmarking**: Compare models on function calling tasks\n- **🛡️ Safety Testing**: Evaluate robustness against adversarial inputs\n- **📊 Error Analysis**: Detailed error taxonomy for debugging\n- **🔄 Iterative Improvement**: Dataset curation based on failure patterns\n\n### Advanced Features\n\n#### Multi-Stage Evaluation Pipeline\n- **Stage 1**: Prompt generation and optimization\n- **Stage 2**: Function call production\n- **Stage 3**: Sandboxed execution\n- **Stage 4**: Comprehensive evaluation\n- **Stage 5**: Feedback and iteration\n\n#### Sandboxed Execution Environment\n- Safe function execution with timeout protection\n- Comprehensive error handling and logging\n- Execution time tracking\n- Result validation\n\n#### Detailed Error Analysis\n- Error type classification\n- Severity levels (critical, high, medium, low)\n- Component-level scoring\n- Actionable error messages\n\n#### Adversarial Testing\n- Prompt injection detection\n- PII extraction prevention\n- Malformed input handling\n- Robustness evaluation\n\n#### Long-Context Support\n- Stateful agent evaluation\n- Multi-turn conversations\n- Context maintenance testing\n- History integration\n\n### Performance Expectations\n\n- **GPT-4o**: ~0.85-0.95 average score (excellent function calling)\n- **GPT-4**: ~0.75-0.85 average score (good function calling)\n- **Claude-3.5**: ~0.70-0.80 average score (solid performance)\n- **GPT-3.5**: ~0.50-0.65 average score (basic capability)\n- **Smaller Models**: ~0.30-0.45 average score (limited capability)\n\n### Technical Details\n\n- **Max Turns**: 10 (allows for complex multi-step workflows)\n- **Tool Integration**: Native function calling with JSON validation\n- **Safety**: Sandboxed execution environment\n- **Timeout**: 5-second default execution limit\n- **Error Handling**: Comprehensive with detailed taxonomy\n\n### Installation\n\n```bash\n# Install the environment\nprime env install 
agent_prompt_eval\n\n# Or install locally\ncd environments/agent_prompt_eval\npip install -e .\n```\n\n### Dataset Composition\n\nThe environment includes 17+ diverse test cases:\n\n- **Basic Cases**: 3 cases (simple, clear intent)\n- **Paraphrased**: 2+ variants per intent\n- **Edge Cases**: 4 cases (ambiguous, contradictory, missing params)\n- **Multi-Step**: 1+ cases (sequential operations)\n- **Complex Data**: 1+ cases (nested structures)\n- **Validation**: 1+ cases (format validation)\n- **Adversarial**: 2+ cases (injection, PII extraction)\n- **Database**: 1+ cases (query operations)\n\n### Extensibility\n\nThis environment is designed to be easily extended:\n\n#### Add New Functions\n```python\ndef my_custom_function(param1: str, param2: int) -> Dict[str, Any]:\n    \"\"\"Your custom function.\"\"\"\n    try:\n        # Your logic here\n        return {\"success\": True, \"result\": ...}\n    except Exception as e:\n        return {\"success\": False, \"error\": str(e)}\n\n# Register it\nFUNCTION_REGISTRY[\"my_custom_function\"] = my_custom_function\n```\n\n#### Add New Test Cases\n```python\ntest_cases.append({\n    \"id\": \"my_test\",\n    \"question\": \"Your test question\",\n    \"expected_result\": {\n        \"function_name\": \"my_custom_function\",\n        \"parameters\": {...},\n        \"expected_output\": ...\n    }\n})\n```\n\n#### Customize Evaluation\n```python\nevaluation_criteria = {\n    \"weights\": {\n        \"function_match\": 0.4,\n        \"parameter_match\": 0.3,\n        \"output_match\": 0.3\n    },\n    \"pass_threshold\": 0.8\n}\n```\n\n### Research Applications\n\nThis environment is ideal for:\n\n1. **Function Calling Research**: Study how LLMs learn to use tools\n2. **Prompt Engineering**: Optimize prompts for better function calling\n3. **Agent Architecture**: Test different agent designs and workflows\n4. **Safety Research**: Evaluate robustness against adversarial inputs\n5. 
**Error Analysis**: Understand common failure modes\n6. **Model Comparison**: Benchmark different models systematically\n\n### Contributing\n\nWe welcome contributions:\n\n- Add new function tools\n- Create more test cases (especially edge cases)\n- Improve evaluation rubrics\n- Extend support for stateful operations\n- Implement new error detection methods\n\n### Notes\n\n- Uses a sandboxed environment for all function executions\n- Comprehensive error taxonomy for detailed analysis\n- Supports both simple and complex function calling scenarios\n- Designed for research and production evaluation\n- Compatible with all major language models\n- Extensible architecture for custom functions and evaluations\n\n### Citation\n\nIf you use this environment in your research, please cite:\n\n```bibtex\n@software{agent_prompt_eval,\n  title={Agent Prompt Evaluation Environment},\n  author={alaminai},\n  year={2025},\n  url={https://github.com/alaminai/prime-environments}\n}\n```\n\n","encoding":"utf-8","truncated":false,"total_bytes":11063},"status":null}