{"data":{"kind":"file","path":"README.md","version_id":"c6rte66mjcuttgod5xy9g5zi","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":17298,"modified_at":"2025-10-01T14:22:23.834000","content_hash":"842589dc53b5df6e6b8238d60225fa27e9cefe75123a71426df6ce7766143567"},"entries":[],"content":"# Agent Prompt Evaluation Environment V3 - The Complete Solution\n\n### Overview\n- **Environment ID**: `agent_prompt_eval_xlam_60k_v3`\n- **Short description**: The definitive agent prompt evaluation environment that successfully evaluates the entire pipeline from prompt generation through tool execution using real-world datasets.\n- **Tags**: prompt-engineering, function-calling, agent-evaluation, multiturn-env, xlam-60k, production-ready, schema-bypass\n\n## 🎯 The Problem We Solved\n\n**The Challenge**: Evaluating agent prompt engineering requires testing the complete flow:\n1. **Prompt Generation** → LLM receives tools and generates prompts for itself\n2. **Tool Generation** → Same LLM uses its own generated prompt to produce function calls  \n3. **Tool Execution** → Functions are executed and results evaluated\n4. 
**Self-Evaluation** → Model tests how well its own prompt engineering works\n\n**The Limitation**: Standard `ToolEnv` approaches fail due to hardcoded schema validation that crashes on parameter mismatches (e.g., `breedName` vs `breed_name`).\n\n**The Solution**: V3 uses `MultiTurnEnv` with custom multi-stage flow to evaluate the complete agent prompt engineering pipeline while bypassing schema validation.\n\n## 🚀 What Makes V3 Revolutionary\n\n### ✅ **Complete Pipeline Evaluation**\nUnlike V1 and V2, V3 evaluates the **entire agent prompt engineering workflow**:\n- **Stage 1 - Prompt Generation**: LLM receives tools and generates prompts for itself\n- **Stage 2 - Tool Generation**: Same LLM uses its own generated prompt to produce function calls\n- **Stage 3 - Tool Execution**: Functions are executed and results evaluated\n- **Self-Evaluation**: Model tests how well its own prompt engineering works\n- **Multi-Stage Scoring**: Each stage is scored independently with comprehensive evaluation\n- **Parameter Mismatch Handling**: Gracefully handles mismatches (scores them, doesn't crash)\n\n### ✅ **Real-World Dataset Integration**\n- **60,000+ Examples**: Uses the actual `Salesforce/xlam-function-calling-60k` dataset\n- **Dynamic Tool Registration**: Automatically creates mock functions from dataset schemas\n- **Production Complexity**: Real-world edge cases, ambiguous queries, and multi-domain scenarios\n\n### ✅ **Schema Validation Bypass**\n- **MultiTurnEnv Success**: Uses custom tool parsing instead of OpenAI tool calling\n- **No Crashes**: Handles parameter mismatches gracefully (scores them, doesn't crash)\n- **Full Control**: Complete control over tool execution and error handling\n\n## 📊 Evolution: V1 → V2 → V3\n\n| Feature | V1 | V2 | V3 |\n|---------|----|----|----|\n| **Dataset** | Synthetic | Real (xLAM-60K) | Real (xLAM-60K) |\n| **Tool Execution** | ✅ Mock tools | ❌ Text only | ✅ Real execution |\n| **Schema Validation** | ✅ Custom | ❌ Crashes | ✅ Bypassed 
|\n| **Parameter Mismatches** | ✅ Handled | ✅ Handled | ✅ Handled |\n| **Full Pipeline** | ✅ Complete | ❌ Partial | ✅ Complete |\n| **Production Ready** | ✅ Yes | ✅ Yes | ✅ **Best** |\n\n### **V1: The Foundation**\n- **Purpose**: Proof of concept with synthetic data\n- **Strengths**: Custom tools, full pipeline evaluation\n- **Limitations**: Synthetic dataset, limited real-world complexity\n\n### **V2: Real Dataset Integration**\n- **Purpose**: Real-world dataset with text-based evaluation\n- **Strengths**: Real dataset, parameter mismatch handling\n- **Limitations**: No actual tool execution, partial pipeline evaluation\n\n### **V3: The Complete Solution**\n- **Purpose**: Full pipeline evaluation with real-world complexity\n- **Strengths**: Real dataset + real execution + schema bypass\n- **Achievement**: **The only environment that successfully evaluates the complete agent prompt engineering pipeline**\n\n## 🏗️ Architecture\n\n```\nLLM(TOOLS) → generates PROMPT → LLM(PROMPT) produces FUNCTION_CALL(JSON) \n→ PARSE extracts tool call → EXEC executes function → EVAL scores RESULT\n```\n\n**Multi-Stage Pipeline:**\n\n1. **Dataset Loader**: Loads Salesforce xlam-function-calling-60k dataset\n2. **Dynamic Function Registry**: Creates function registry from dataset tools\n3. **Stage 1 - Prompt Generation**: LLM receives tools and generates prompts for itself\n4. **Stage 2 - Tool Generation**: Same LLM uses its own generated prompt to produce function calls\n5. **Stage 3 - Tool Execution**: Functions are executed and results evaluated\n6. **Self-Evaluation**: Model tests how well its own prompt engineering works\n7. **Multi-Stage Scoring**: Each stage is scored independently with comprehensive evaluation\n8. **Parameter Mismatch Handling**: Gracefully handles mismatches (scores them, doesn't crash)\n9. 
**Custom Tool Execution**: Manual tool execution with full control over error handling\n\n#### 🔄 MultiTurnEnv Approach\n\n**V3 solves the schema validation problem** by using `MultiTurnEnv` with custom tool parsing:\n\n- **Bypasses Schema Validation**: Uses `MultiTurnEnv` without `oai_tools` to avoid hardcoded validation\n- **Custom Tool Parsing**: Extracts tool calls from model text output using custom parsing\n- **Manual Tool Execution**: Direct tool execution with full control over error handling\n- **Parameter Mismatch Handling**: Detects and scores parameter name mismatches (e.g., `breedName` vs `breed_name`)\n- **Graceful Error Handling**: Continues evaluation even if tool execution fails\n- **State Management**: Tracks parameter mismatches and execution results in state\n- **Comprehensive Scoring**: Combines text quality, parameter accuracy, and execution success\n\n## 📈 Dataset: Salesforce/xlam-function-calling-60k\n\nThis environment leverages the **xlam-function-calling-60k** dataset, which contains:\n\n- **60,000+ Examples**: Diverse function calling scenarios\n- **Real-World Tools**: Actual API definitions and tool schemas\n- **Multi-Domain**: Weather, email, calendar, database, calculations, and more\n- **Complex Queries**: Natural language queries with implicit parameters\n- **Edge Cases**: Ambiguous requests, missing information, multi-step operations\n\n**Dataset Structure:**\n```python\n{\n    \"query\": \"What's the weather in San Francisco?\",\n    \"tools\": [{\n        \"name\": \"get_weather\",\n        \"description\": \"Get current weather for a location\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"location\": {\"type\": \"string\"},\n                \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n            },\n            \"required\": [\"location\"]\n        }\n    }],\n    \"answers\": {\n        \"name\": \"get_weather\",\n        
\"arguments\": {\"location\": \"San Francisco\", \"unit\": \"fahrenheit\"}\n    }\n}\n```\n\n## 🎯 Features\n\n#### 🎯 Real-World Function Calling\n- **MultiTurnEnv**: Uses `MultiTurnEnv` with custom tool parsing to bypass schema validation\n- **Flexible Evaluation**: Handles parameter mismatches gracefully (scores them, doesn't crash)\n- **Dynamic Schema Validation**: Validates against actual function definitions\n- **Type Checking**: Ensures parameter types match schema requirements\n- **Required Parameters**: Checks all required parameters are present\n- **Extra Parameters**: Detects and warns about unexpected parameters\n\n#### 📊 Comprehensive Evaluation\n- **Function Selection (40%)**: Did the model choose the correct function?\n- **Parameter Extraction (40%)**: Are all parameters correctly extracted and formatted?\n- **JSON Structure (20%)**: Is the output valid JSON with proper structure?\n\n#### 🛡️ Production-Ready Testing\n- **60K Examples**: Extensive coverage of real-world scenarios\n- **Multi-Domain**: Tests across different API types and use cases\n- **Edge Cases**: Natural ambiguity, implicit parameters, complex queries\n- **Error Taxonomy**: Detailed classification of failure modes\n\n## 🚀 Quickstart\n\n#### Basic Evaluation (10 examples, ~20 tools)\n```bash\nuv run vf-eval agent_prompt_eval_xlam_60k_v3 -m gpt-4o-mini -n 5 -r 3\n```\n\n#### Medium-Scale Evaluation (100 examples, ~200 tools)\n```bash\nuv run vf-eval agent_prompt_eval_xlam_60k_v3 \\\n  -m gpt-4o-mini \\\n  -n 20 -r 5 \\\n  -a '{\"num_examples\": 100}'\n```\n\n#### Full Dataset Evaluation\n```bash\nuv run vf-eval agent_prompt_eval_xlam_60k_v3 \\\n  -m gpt-4 \\\n  -n 100 -r 10 \\\n  -a '{\"num_examples\": 60000}'\n```\n\n#### Save Results and Browse\n```bash\n# Run evaluation with saving enabled\nuv run vf-eval agent_prompt_eval_xlam_60k_v3 -s -m gpt-4o-mini -n 10 -r 3\n\n# Browse saved results with the terminal UI\nuv run vf-tui\n```\n\n### Environment Arguments\n\n| Arg | Type | Default 
| Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `10` | Number of examples to load from dataset (controls dynamic tool count) |\n| `seed` | int | `42` | Random seed for dataset sampling |\n| `filter_complexity` | str | `None` | Filter by complexity level (if available) |\n\n**Note:** V3 uses 10 examples by default to keep the number of dynamically created tools manageable. Each example may have 1-5 tools, so 10 examples typically creates 20-50 mock functions.\n\n## 📋 Example Evaluation Flow\n\n**User Query:** \"Send an email to john@example.com with subject 'Meeting' and message 'See you at 3pm'\"\n\n**Available Tools from Dataset:**\n```json\n[{\n  \"name\": \"send_email\",\n  \"description\": \"Send an email message\",\n  \"parameters\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"to\": {\"type\": \"string\", \"description\": \"Recipient email\"},\n      \"subject\": {\"type\": \"string\", \"description\": \"Email subject\"},\n      \"body\": {\"type\": \"string\", \"description\": \"Email body\"}\n    },\n    \"required\": [\"to\", \"subject\", \"body\"]\n  }\n}]\n```\n\n**Model Output:**\n```json\n{\n  \"name\": \"send_email\",\n  \"arguments\": {\n    \"to\": \"john@example.com\",\n    \"subject\": \"Meeting\",\n    \"body\": \"See you at 3pm\"\n  }\n}\n```\n\n**V3 Evaluation Process:**\n1. ✅ **Parse Tool Call**: Extract function call from model text\n2. ✅ **Parameter Validation**: Check all required parameters present\n3. ✅ **Tool Execution**: Execute `send_email` function with parameters\n4. ✅ **Result Scoring**: Score based on execution success and accuracy\n5. 
✅ **Mismatch Detection**: Handle parameter name mismatches gracefully\n\n## 📊 Evaluation Metrics\n\n#### Component Scores\n\n- **Execution Success (20%)**: Did the function call execute successfully?\n- **Function Match (40%)**: Is the correct function selected?\n- **Parameter Match (40%)**: Are all parameters correctly extracted?\n\n#### Error Taxonomy\n\n- `function_name_mismatch`: Wrong function selected\n- `missing_parameters`: Required parameters not provided\n- `wrong_parameter_value`: Parameter has incorrect value\n- `parameter_name_mismatch`: Parameter name doesn't match schema (e.g., `breedName` vs `breed_name`)\n- `execution_error`: Tool execution failed\n- `schema_validation_error`: Parameters don't match expected schema\n\n## 🎯 Use Cases\n\n### **Research Applications**\n- **Prompt Engineering**: Evaluate different prompt strategies\n- **Function Calling**: Test model accuracy on real-world APIs\n- **Parameter Extraction**: Study how models handle complex parameter schemas\n- **Error Analysis**: Understand common failure modes in agent systems\n\n### **Production Applications**\n- **Agent Development**: Test agent capabilities before deployment\n- **API Integration**: Validate function calling accuracy\n- **Model Comparison**: Benchmark different models on agent tasks\n- **Quality Assurance**: Ensure agent reliability in production\n\n### **Educational Applications**\n- **Agent Training**: Teach models to use tools effectively\n- **Curriculum Learning**: Progressive difficulty in agent tasks\n- **Error Correction**: Learn from parameter mismatch examples\n\n## ⚡ Performance Expectations\n\n- **Small Scale** (10 examples): ~2-5 minutes\n- **Medium Scale** (100 examples): ~20-50 minutes  \n- **Large Scale** (1000+ examples): ~2-8 hours\n- **Memory Usage**: ~1-4GB depending on dataset size\n- **Tool Creation**: ~1-2 seconds per example\n\n## 🏆 Advantages Over V1 and V2\n\n### **vs V1 (Synthetic Dataset)**\n- ✅ **Real-World Complexity**: 60K real examples 
vs synthetic data\n- ✅ **Production Relevance**: Actual API schemas and edge cases\n- ✅ **Scalability**: Handles large-scale evaluation\n- ✅ **Domain Coverage**: Multi-domain function calling scenarios\n\n### **vs V2 (Text-Only Evaluation)**\n- ✅ **Complete Pipeline**: Evaluates entire workflow, not just text\n- ✅ **Real Execution**: Actually executes functions, not just text evaluation\n- ✅ **Schema Bypass**: Successfully handles parameter mismatches\n- ✅ **Production Ready**: Full agent evaluation capabilities\n\n### **V3: The Definitive Solution**\n- 🎯 **Complete Evaluation**: Only environment that evaluates full pipeline\n- 🎯 **Real-World Data**: Uses actual production datasets\n- 🎯 **Schema Bypass**: Successfully overcomes validation limitations\n- 🎯 **Production Ready**: Comprehensive agent evaluation platform\n\n## 🔧 Technical Details\n\n### **MultiTurnEnv Implementation**\n```python\nclass AgentPromptMultiTurnEnv(vf.MultiTurnEnv):\n    \"\"\"\n    Custom MultiTurnEnv that successfully bypasses schema validation.\n    \"\"\"\n    def parse_tool_call(self, content: str) -> dict:\n        \"\"\"Parse tool call from model text output\"\"\"\n        \n    def call_tool(self, tool_name: str, tool_args: dict) -> str:\n        \"\"\"Execute tool with custom error handling\"\"\"\n        \n    def env_response(self, messages, state, **kwargs):\n        \"\"\"Handle tool execution and return environment response\"\"\"\n```\n\n### **Dynamic Tool Creation**\n```python\ndef create_mock_function_from_schema(tool_schema: Dict[str, Any]) -> Any:\n    \"\"\"Create mock Python function from dataset tool schema\"\"\"\n    \ndef extract_and_create_tools_from_dataset(dataset: Dataset) -> List[Any]:\n    \"\"\"Extract unique tools from dataset and create mock functions\"\"\"\n```\n\n## 📦 Installation\n\n```bash\n# Install from Prime Intellect Hub\nprime env install alaminai/agent_prompt_eval_xlam_60k_v3\n\n# Or install locally\ncd environments/agent_prompt_eval_v3\nuv pip 
install -e .\n```\n\n## 📚 Dataset Access\n\nThe environment automatically downloads the `Salesforce/xlam-function-calling-60k` dataset on first use:\n\n```python\nfrom datasets import load_dataset\n\n# Load the dataset\ndataset = load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n```\n\n**Dataset Requirements:**\n- Internet connection for first download\n- ~2GB disk space for dataset storage\n- Cached locally after first download\n\n## 🔬 Advanced Usage\n\n#### Custom Dataset Size\n```bash\n# Use specific number of examples\nuv run vf-eval agent_prompt_eval_xlam_60k_v3 \\\n  -a '{\"num_examples\": 500}'\n```\n\n#### Batch Evaluation\n```bash\n# Run multiple evaluations with different parameters\nfor examples in 10 50 100 500; do\n  uv run vf-eval agent_prompt_eval_xlam_60k_v3 \\\n    -a \"{\\\"num_examples\\\": $examples}\" \\\n    -n 5 -r 3 -s\ndone\n```\n\n## 🎓 Research Applications\n\n### **Agent Prompt Engineering**\n- Evaluate different prompt strategies for function calling\n- Study the impact of prompt engineering on tool usage accuracy\n- Develop better prompting techniques for agent systems\n\n### **Function Calling Research**\n- Analyze parameter extraction accuracy across different model sizes\n- Study error patterns in real-world function calling scenarios\n- Develop better function calling training methods\n\n### **Agent Evaluation**\n- Benchmark agent capabilities across different domains\n- Study the relationship between prompt quality and execution success\n- Develop comprehensive agent evaluation frameworks\n\n## 🤝 Contributing\n\nWe welcome contributions to improve V3:\n\n1. **New Evaluation Metrics**: Add more sophisticated scoring methods\n2. **Additional Datasets**: Support for other function calling datasets\n3. **Error Analysis**: Enhanced error taxonomy and analysis tools\n4. 
**Performance Optimization**: Faster tool creation and execution\n\n## 📖 Citation\n\nIf you use V3 in your research, please cite:\n\n```bibtex\n@misc{agent_prompt_eval_v3,\n  title={Agent Prompt Evaluation Environment V3: Complete Pipeline Evaluation with Real-World Datasets},\n  author={Prime Intellect Team},\n  year={2024},\n  url={https://hub.primeintellect.com/alaminai/agent_prompt_eval_xlam_60k_v3}\n}\n```\n\n## 📝 Notes\n\n- **V3 is production-ready** and evaluates the entire agent prompt engineering flow\n- **MultiTurnEnv approach** bypasses schema validation by parsing tool calls from model text instead of passing `oai_tools`\n- **Full pipeline evaluation** covers every stage from prompt generation through tool execution\n- **Parameter mismatch handling** scores mismatches instead of crashing\n- Requires an internet connection for the first dataset download; the dataset is cached locally afterwards\n- Falls back to mock data if the dataset is unavailable\n- Compatible with all major language models\n\n## 🔄 Comparison with Other Environments\n\n| Environment | Examples | Data Source | Tool Type | Execution | Schema Bypass |\n|-------------|----------|-------------|-----------|-----------|---------------|\n| **V1** | 100 | Synthetic | Custom | ✅ Mock | ✅ Custom |\n| **V2** | 60K | Real (xLAM) | Dataset | ❌ Text | ✅ Text-based |\n| **V3** | 60K | Real (xLAM) | Dataset | ✅ Real | ✅ MultiTurnEnv |\n\n**Of the three versions, V3 is the only one that combines a real-world dataset with complete pipeline evaluation while bypassing schema validation.**","encoding":"utf-8","truncated":false,"total_bytes":17298},"status":null}