{"data":{"kind":"file","path":"README.md","version_id":"ljwqf3s5x6uym9dcx5tsyc0y","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6399,"modified_at":"2026-05-12T21:08:45.577000","content_hash":"148cfcf09190396524f8419ce349464b055d2d84e46f488fed36afaf6f4c3836"},"entries":[],"content":"# mcp_atlas\n\n### Overview\n- **Environment ID**: `mcp_atlas`\n- **Short description**: MCP-Atlas tool-use benchmark wired into `verifiers`.\n- **Tags**: `tool-use`, `mcp`, `llm-as-judge`, `multi-turn`, `sandbox`\n\n### Datasets\n- **Primary dataset**: [ScaleAI/MCP-Atlas](https://huggingface.co/datasets/ScaleAI/MCP-Atlas) (`train`, 500 tasks).\n- **Optional local dataset**: a CSV with Atlas columns like [`sample_tasks.csv`](https://github.com/scaleapi/mcp-atlas/blob/main/services/mcp_eval/sample_tasks.csv).\n\n### Task\n- **Type**: multi-turn tool use against a live MCP-Atlas `agent-environment` service.\n- **Runtime shape**: the env launches the official Atlas container inside a Prime sandbox, waits for the in-sandbox Atlas service to answer `/list-tools`, and then forwards model tool calls to that service.\n- **Rubric**: `vf.JudgeRubric` over an OpenAI-compatible judge endpoint, defaulting to Prime Intellect's pinference service.\n\n### Setup\n\nSet a Prime judge key before scoring:\n\n```bash\nexport PRIME_API_KEY=...\n```\n\nEnable the upstream-style MCP-Atlas system prompt only if you want it:\n\n```bash\nexport USE_SYSTEM_PROMPT_IN_COMPLETION=true\n```\n\nInstall the environment from this repository root:\n\n```bash\nuv pip install -e ./environments/mcp_atlas\n```\n\nClone Atlas only if you want to use its local sample CSV:\n\n```bash\ngh repo clone scaleapi/mcp-atlas /path/to/mcp-atlas\n```\n\n### Quickstart\n\nRun one filtered debug rollout against the default Hugging Face dataset:\n\n```bash\nuv run vf-eval mcp_atlas --debug --verbose --num-examples 1 --rollouts-per-example 1\n```\n\nRun against the cloned sample CSV instead:\n\n```bash\nuv run vf-eval mcp_atlas --debug --verbose --num-examples 1 --rollouts-per-example 1 -a '{\n  \"dataset_file\": \"/path/to/mcp-atlas/services/mcp_eval/sample_tasks.csv\"\n}'\n```\n\nPass Atlas API-backed server secrets into the sandbox if you want more than the default no-key servers:\n\n```bash\nuv run vf-eval mcp_atlas --num-examples 1 --rollouts-per-example 1 -a '{\n  \"atlas_environment_vars\": {\n    \"ENABLED_SERVERS\": \"calculator,wikipedia,filesystem,git,fetch,github\",\n    \"GITHUB_TOKEN\": \"...\"\n  }\n}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"ScaleAI/MCP-Atlas\"` | Hugging Face dataset name |\n| `dataset_split` | str | `\"train\"` | Hugging Face split |\n| `dataset_file` | str \\| None | `None` | Optional local CSV path instead of Hugging Face |\n| `filter_unavailable_tasks` | bool | `True` | Skip tasks whose `ENABLED_TOOLS` are not present in the current Atlas sandbox |\n| `max_turns` | int | `20` | Maximum agent turns |\n| `list_tools_timeout` | float | `900.0` | Total seconds to wait for Atlas startup and a successful `/list-tools` probe |\n| `tool_call_timeout` | float | `60.0` | Max seconds for each Atlas `/call-tool` request |\n| `atlas_docker_image` | str | `\"ghcr.io/scaleapi/mcp-atlas:1.2.5\"` | Atlas container image used for each sandbox |\n| `atlas_start_command` | str | `\"bash -lc 'cd /agent-environment && exec /agent-environment/entrypoint.sh uv run python -m uvicorn agent_environment.main:app --host 0.0.0.0 --port 1984'\"` | Command started inside the Atlas container |\n| `atlas_environment_vars` | dict[str, str] \\| None | `None` | Extra environment variables injected into the Atlas sandbox |\n| `sandbox_cpu_cores` | int | `4` | CPU cores for each Atlas sandbox |\n| `sandbox_memory_gb` | int | `10` | Memory in GB for each Atlas sandbox |\n| `sandbox_disk_size_gb` | int | `20` | Disk size in GB for each Atlas sandbox |\n| `sandbox_gpu_count` | int | `0` | GPU count for each Atlas sandbox |\n| `sandbox_timeout_minutes` | int | `60` | Overall lifetime for each Atlas sandbox |\n| `sandbox_labels` | list[str] \\| None | `None` | Optional sandbox labels; defaults to `[\"mcp-atlas\"]` |\n| `sandbox_client_max_workers` | int \\| None | `None` | Worker count for the threaded sandbox client |\n| `sandbox_creations_per_minute` | float \\| None | `128` | Prime sandbox creation rate limit, matching the `SandboxMixin` pattern used by the OpenCode envs |\n| `judge_model` | str | `\"openai/gpt-5-nano\"` | Judge model routed through pinference |\n| `judge_api_key_var` | str | `\"PRIME_API_KEY\"` | Env var used for the judge API key |\n| `judge_base_url` | str \\| None | `\"https://api.pinference.ai/api/v1\"` | OpenAI-compatible Prime Intellect endpoint |\n| `system_prompt` | str \\| None | `None` | Optional explicit system prompt; if unset, `USE_SYSTEM_PROMPT_IN_COMPLETION=true` injects the built-in MCP-Atlas prompt |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main MCP-Atlas coverage score |\n| `coverage_score` | Same as `reward` |\n| `total_tool_calls` | Number of tool calls made by the model |\n| `num_turns` | Number of assistant turns before stopping |\n\n### Notes\n- With `filter_unavailable_tasks=True`, `load_environment()` probes Atlas once with a short-lived sandbox and keeps only tasks whose `ENABLED_TOOLS` are present in the current image and sandbox env configuration.\n- Rollout sandboxes now use `SandboxMixin`, so sandbox creation is rate-limited with `sandbox_creations_per_minute` the same way the OpenCode envs are.\n- Atlas startup uses short repeated `/list-tools` probes inside the longer `list_tools_timeout` window so one stuck request does not consume the whole startup budget.\n- The eval rows themselves stay small: they only store each task's `enabled_tool_names`.\n- The model only sees each task's own `ENABLED_TOOLS` subset through `state[\"tool_defs\"]`, and that subset is built at rollout time from the live `/list-tools` response of the sandboxed Atlas service.\n- Filesystem-like tool arguments are constrained to `/data` before the env forwards them into Atlas.\n- The judge path is intentionally the same OpenAI-compatible pinference route used by other envs in this repository.\n- The upstream MCP-Atlas system prompt is available, but disabled by default. Set `USE_SYSTEM_PROMPT_IN_COMPLETION=true` to enable it without changing eval args.\n\n### Changelog\n\n#### v0.1.2\n- Default `sandbox_client_max_workers` to `None` so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.\n\n#### v0.1.1\n- Harden filesystem path confinement so absolute inputs are also rejected unless they resolve to `/data` or a descendant, preventing escapes like `/etc/passwd`\n\n#### v0.1.0\n- Initial release\n","encoding":"utf-8","truncated":false,"total_bytes":6399},"status":null}