{"data":{"kind":"file","path":"README.md","version_id":"ctkmhaluwamwazkkoz6dft0b","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10663,"modified_at":"2025-09-30T23:59:25.561000","content_hash":"d630b40bc35eb3b08d6f0c01f03b84cb19a0aef3fe1243b21472a0218180e309"},"entries":[],"content":"# Agent Prompt Evaluation Environment V2\n\n### Overview\n- **Environment ID**: `agent_prompt_eval_v2`\n- **Short description**: Production-ready agent evaluation environment using the real-world Salesforce/xlam-function-calling-60k dataset with 60,000+ function calling examples.\n- **Tags**: prompt-engineering, function-calling, agent-evaluation, xlam, real-world-data, salesforce, production-ready\n\n### 🆕 What's New in V2\n\n**Major Upgrade:** This version uses the **Salesforce/xlam-function-calling-60k** dataset, providing:\n\n- ✅ **60,000+ Real-World Examples**: Function calling scenarios built on executable real-world APIs and verified by Salesforce's APIGen pipeline\n- ✅ **Multi-Domain Coverage**: APIs, databases, calculations, file operations, and more\n- ✅ **Dynamic Function Execution**: Validates against actual function schemas from the dataset\n- ✅ **Production Complexity**: Real-world edge cases, ambiguous queries, and complex parameter extraction\n- ✅ **Comprehensive Evaluation**: Parameter validation, type checking, and schema compliance\n\n### Architecture\n\n```\nLLM(TOOLS) → generates PROMPT → LLM(PROMPT) produces FUNCTION_CALL(JSON)\n→ DYNAMIC_EXEC validates against schema → EVAL scores against dataset answer → feedback\n```\n\n**Multi-Stage Pipeline:**\n\n1. **Dataset Loader**: Loads the Salesforce/xlam-function-calling-60k dataset\n2. **Dynamic Function Registry**: Builds a function registry from the dataset's tool definitions\n3. **Prompt Generation**: LLM generates prompts for function calling\n4. **Execution Engine**: Validates function calls against real schemas\n5. **Evaluator**: Scores against ground truth from dataset\n6. 
**Feedback Loop**: Detailed error taxonomy and improvement suggestions\n\n### Dataset: Salesforce/xlam-function-calling-60k\n\nThis environment uses the **xlam-function-calling-60k** dataset, which contains:\n\n- **60,000+ Examples**: Diverse function calling scenarios\n- **Real-World Tools**: Actual API definitions and tool schemas\n- **Multi-Domain**: Weather, email, calendar, database, calculations, and more\n- **Complex Queries**: Natural language queries with implicit parameters\n- **Edge Cases**: Ambiguous requests, missing information, multi-step operations\n\n**Dataset Structure** (simplified and parsed for readability; the raw dataset stores `tools` and `answers` as JSON-encoded strings, and `answers` holds a list of calls):\n```python\n{\n    \"query\": \"What's the weather in San Francisco?\",\n    \"tools\": [{\n        \"name\": \"get_weather\",\n        \"description\": \"Get current weather for a location\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"location\": {\"type\": \"string\"},\n                \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n            },\n            \"required\": [\"location\"]\n        }\n    }],\n    \"answers\": {\n        \"name\": \"get_weather\",\n        \"arguments\": {\"location\": \"San Francisco\", \"unit\": \"fahrenheit\"}\n    }\n}\n```\n\n### Features\n\n#### 🎯 Real-World Function Calling\n- **Dynamic Schema Validation**: Validates against actual function definitions\n- **Type Checking**: Ensures parameter types match schema requirements\n- **Required Parameters**: Checks that all required parameters are present\n- **Extra Parameters**: Detects and warns about unexpected parameters\n\n#### 📊 Comprehensive Evaluation\n- **Function Selection (40%)**: Did the model choose the correct function?\n- **Parameter Extraction (40%)**: Are all parameters correctly extracted and formatted?\n- **JSON Structure (20%)**: Is the output valid JSON with proper structure?\n\n#### 🛡️ Production-Ready Testing\n- **60K Examples**: Extensive coverage of real-world scenarios\n- 
**Multi-Domain**: Tests across different API types and use cases\n- **Edge Cases**: Natural ambiguity, implicit parameters, complex queries\n- **Error Taxonomy**: Detailed classification of failure modes\n\n### Quickstart\n\n#### Basic Evaluation (default 100 examples loaded)\n```bash\nuv run vf-eval agent_prompt_eval_v2 -m gpt-4o-mini -n 10 -r 3\n```\n\n#### Large-Scale Evaluation (1000 examples loaded)\n```bash\nuv run vf-eval agent_prompt_eval_v2 \\\n  -m gpt-4o-mini \\\n  -n 50 -r 5 \\\n  -a '{\"num_examples\": 1000}'\n```\n\n#### Full Dataset Evaluation\n```bash\nuv run vf-eval agent_prompt_eval_v2 \\\n  -m gpt-4 \\\n  -n 100 -r 10 \\\n  -a '{\"num_examples\": 60000}'\n```\n\nIn these commands, `-n` is the number of loaded examples to evaluate and `-r` is the number of rollouts per example, while `num_examples` controls how many dataset rows are loaded. To score every loaded example, raise `-n` to match `num_examples`.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `100` | Number of examples to load from dataset |\n| `seed` | int | `42` | Random seed for dataset sampling |\n| `filter_complexity` | str | `None` | Filter by complexity level (if available) |\n\n### Example Evaluation Flow\n\n**User Query:** \"Send an email to john@example.com with subject 'Meeting' and message 'See you at 3pm'\"\n\n**Available Tools from Dataset:**\n```json\n[{\n  \"name\": \"send_email\",\n  \"description\": \"Send an email message\",\n  \"parameters\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"to\": {\"type\": \"string\", \"description\": \"Recipient email\"},\n      \"subject\": {\"type\": \"string\", \"description\": \"Email subject\"},\n      \"body\": {\"type\": \"string\", \"description\": \"Email body\"}\n    },\n    \"required\": [\"to\", \"subject\", \"body\"]\n  }\n}]\n```\n\n**Expected Output (from dataset):**\n```json\n{\n  \"name\": \"send_email\",\n  \"arguments\": {\n    \"to\": \"john@example.com\",\n    \"subject\": \"Meeting\",\n    \"body\": \"See you at 3pm\"\n  }\n}\n```\n\n**Validation Process:**\n1. ✅ Check function name matches: `send_email`\n2. ✅ Validate all required parameters present: `to`, `subject`, `body`\n3. 
✅ Check parameter types: all strings\n4. ✅ Verify parameter values match the expected values\n5. ✅ Compute the overall score as the weighted sum of the component scores below\n\n### Evaluation Metrics\n\n#### Component Scores\n\n- **Execution Success (20%)**: Did the function call validate against schema?\n- **Function Match (40%)**: Is the correct function selected?\n- **Parameter Match (40%)**: Are all parameters correctly extracted?\n\n#### Error Taxonomy\n\n- `function_name_mismatch`: Wrong function selected\n- `missing_parameters`: Required parameters not provided\n- `type_mismatch`: Parameter type doesn't match schema\n- `parameter_value_mismatch`: Parameter value incorrect\n- `extra_parameters`: Unexpected parameters provided\n- `parse_error`: Could not parse function call JSON\n- `runtime_error`: Execution error\n\n### Use Cases\n\n- **🤖 Production Agent Testing**: Test agents on real-world scenarios before deployment\n- **📊 Model Benchmarking**: Compare models on 60K diverse examples\n- **🎓 Function Calling Training**: Fine-tune models on real function calling data\n- **🔬 Research**: Study LLM function calling capabilities at scale\n- **📈 Performance Analysis**: Identify common failure patterns across domains\n- **🛡️ Robustness Testing**: Evaluate handling of edge cases and ambiguity\n\n### Performance Expectations\n\n- **GPT-4o**: ~0.90-0.95 average score (excellent real-world performance)\n- **GPT-4**: ~0.85-0.90 average score (strong real-world performance)\n- **Claude-3.5**: ~0.80-0.85 average score (good real-world performance)\n- **GPT-3.5**: ~0.60-0.70 average score (moderate performance)\n- **Smaller Models**: ~0.40-0.55 average score (limited capability)\n\n### Advantages Over V1\n\n| Feature | V1 (Mock Data) | V2 (Real Dataset) |\n|---------|----------------|-------------------|\n| **Examples** | 7 hand-crafted | 60,000+ real-world |\n| **Domains** | Limited | Multi-domain |\n| **Complexity** | Basic | Production-level |\n| **Functions** | 7 custom | Dynamic from dataset |\n| **Validation** 
| Simple | Schema-based |\n| **Edge Cases** | Few | Thousands |\n| **Realism** | Synthetic | Actual usage |\n\n### Technical Details\n\n- **Max Turns**: 10 (allows for complex interactions)\n- **Tool Integration**: Dynamic schema validation\n- **Dataset Size**: Up to 60,000 examples\n- **Validation**: Comprehensive parameter and type checking\n- **Error Handling**: Detailed error taxonomy and reporting\n\n### Installation\n\n```bash\n# Install the environment\nprime env install alaminai/agent_prompt_eval_v2\n\n# Or install locally\ncd environments/agent_prompt_eval_v2\npip install -e .\n```\n\n### Dataset Access\n\nThe environment automatically downloads the Salesforce/xlam-function-calling-60k dataset from Hugging Face. If you encounter authentication issues:\n\n```bash\n# Login to Hugging Face\nhuggingface-cli login\n\n# Or set token in environment\nexport HF_TOKEN=\"your_token_here\"\n```\n\n**Fallback Mode:** If dataset loading fails, the environment falls back to mock examples for testing.\n\n### Advanced Usage\n\n#### Custom Dataset Size\n```bash\n# Use 5000 examples for comprehensive evaluation\nuv run vf-eval agent_prompt_eval_v2 -m gpt-4 \\\n  -a '{\"num_examples\": 5000, \"seed\": 123}'\n```\n\n#### Batch Evaluation\n```python\nfrom openai import OpenAI\n\nfrom agent_prompt_eval_v2 import load_environment\n\n# Environment arguments are passed as keyword arguments\nenv = load_environment(num_examples=10000, seed=42)\n\n# Run evaluation; argument names assume the verifiers Environment.evaluate API,\n# so check them against your installed version\nresults = env.evaluate(\n    client=OpenAI(),\n    model=\"gpt-4o-mini\",\n    rollouts_per_example=5,\n)\n```\n\n### Research Applications\n\nThis environment is ideal for:\n\n1. **Large-Scale Benchmarking**: Evaluate models on 60K real examples\n2. **Domain Analysis**: Study performance across different API types\n3. **Error Pattern Analysis**: Identify systematic failure modes\n4. **Model Comparison**: Compare function calling capabilities at scale\n5. 
**Fine-Tuning**: Use as training data for function calling models\n\n### Contributing\n\nWe welcome contributions:\n\n- Report issues with specific dataset examples\n- Suggest improvements to validation logic\n- Add support for additional dataset features\n- Improve evaluation metrics\n\n### Citation\n\nIf you use this environment in your research, please cite:\n\n```bibtex\n@software{agent_prompt_eval_v2,\n  title={Agent Prompt Evaluation Environment V2},\n  author={alaminai},\n  year={2025},\n  url={https://github.com/alaminai/prime-environments},\n  note={Using Salesforce/xlam-function-calling-60k dataset}\n}\n\n@dataset{xlam_function_calling_60k,\n  title={xlam-function-calling-60k},\n  author={Salesforce Research},\n  year={2024},\n  url={https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k}\n}\n```\n\n### Notes\n\n- Requires an internet connection for dataset download (first time only)\n- Dataset is cached locally after first download\n- Falls back to mock data if the dataset is unavailable\n- Compatible with all major language models\n- Production-ready for real-world agent evaluation\n\n### Comparison with Other Environments\n\n| Environment | Examples | Data Source | Complexity |\n|-------------|----------|-------------|------------|\n| agent_prompt_eval (V1) | 7 | Synthetic | Basic |\n| **agent_prompt_eval_v2** | **60,000+** | **Real-world** | **Production** |\n| Other function calling envs | 100-1000 | Mixed | Moderate |\n\n**Why V2?** Real-world data, scale, and production-level complexity make this a far broader function calling evaluation than V1.\n\n","encoding":"utf-8","truncated":false,"total_bytes":10663},"status":null}