{"data":{"kind":"file","path":"README.md","version_id":"g0g7ejwrws7egfgh4i7o272z","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2917,"modified_at":"2026-01-30T15:51:25.330000","content_hash":"4c5a10d2473d6ef5fff93101dadf295405e2780778429d327f4f000a0d127299"},"entries":[],"content":"# APEX-Agents Environment\n\nA Verifiers environment for the [APEX-Agents benchmark](https://huggingface.co/datasets/mercor/apex-agents) - 480 professional services tasks across Law, Investment Banking, and Management Consulting.\n\n## Overview\n\nAPEX-Agents evaluates whether AI agents can execute long-horizon, cross-application professional tasks. Tasks were created by investment banking analysts, management consultants, and corporate lawyers, and require agents to navigate realistic work environments with files and tools.\n\n## Dataset Statistics\n\n| Metric | Value |\n|--------|-------|\n| Total Tasks | 480 |\n| Domains | 3 (Law, Investment Banking, Management Consulting) |\n| Tasks per Domain | 160 |\n| Total Rubric Criteria | 1,948 |\n| Avg Criteria per Task | 4.06 |\n| Worlds | 33 |\n\n## Environment Features\n\n- **Multi-turn agent interaction** with up to 20 turns\n- **Tool use** for filesystem, document reading, email, chat, and calendar\n- **LLM-based rubric evaluation** using binary criteria\n- **Per-task workspace isolation** with world and task-specific files\n\n## Tools Available to Agents\n\n| Tool | Description |\n|------|-------------|\n| `list_directory` | List files and directories in workspace |\n| `read_file` | Read text file contents |\n| `read_pdf` | Extract text from PDF documents |\n| `read_docx` | Extract text from Word documents |\n| `read_xlsx` | Read Excel spreadsheets |\n| `search_files` | Search for files by name or content |\n| `read_emails` | Read email inbox (.mbox) |\n| `read_chat_messages` | Read chat message history |\n| `read_calendar` | View calendar events |\n\n## Usage\n\n### Basic Evaluation\n\n```bash\n# Evaluate on 50 random tasks\nprime eval run apex-agents -m claude-sonnet-4-20250514\n\n# Evaluate specific domain\nprime eval run apex-agents -m gpt-5 --args '{\"domain\": \"Law\"}'\n\n# Limit number of examples\nprime eval run apex-agents -m claude-sonnet-4-20250514 --args '{\"num_examples\": 10}'\n```\n\n### Programmatic Usage\n\n```python\nfrom environments.apex_agents import load_environment\n\n# Load full environment\nenv = load_environment()\n\n# Load specific domain\nenv = load_environment(domain=\"Investment Banking\")\n\n# Load limited examples\nenv = load_environment(num_examples=50)\n```\n\n## Evaluation Metrics\n\n- **Pass@1**: Probability that all rubric criteria pass in a single run\n- **Mean Score**: Average fraction of criteria passed per task\n- **Per-criterion scores**: Individual criterion pass rates\n\n## Output Types\n\nTasks expect different output types:\n\n| Type | Count | Description |\n|------|-------|-------------|\n| `message_in_console` | 417 | Text response |\n| `make_new_doc` | 20 | Create Word document |\n| `make_new_sheet` | 14 | Create Excel spreadsheet |\n| `edit_existing_sheet` | 16 | Modify Excel spreadsheet |\n| `make_new_slide_deck` | 5 | Create PowerPoint |\n| `edit_existing_doc` | 2 | Modify Word document |\n\n## License\n\nThe APEX-Agents benchmark is released under CC-BY 4.0.\n","encoding":"utf-8","truncated":false,"total_bytes":2917},"status":null}