{"data":{"kind":"file","path":"README.md","version_id":"cqk9obstevdsuf0hr00zw0x3","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4355,"modified_at":"2026-03-24T20:32:11.155000","content_hash":"70999e7258476dc69118385b022df6c9e6a9b03d1b40f23cb27014fa38f0aa73"},"entries":[],"content":"# OpenRCA Environment\n\nA Verifiers environment implementing the [OpenRCA benchmark](https://github.com/microsoft/OpenRCA) (ICLR 2025) for evaluating and training LLMs on root cause analysis of software failures.\n\n## Overview\n\nOpenRCA assesses LLMs' ability to identify root causes of failures in enterprise software systems by analyzing telemetry data (metrics, traces, and logs). The benchmark includes 335 failures across three systems:\n\n- **Bank** — A banking microservice platform\n- **Market** — An e-commerce platform with two cloudbeds\n- **Telecom** — A telecom database system\n\nTasks range from easy (identify one root cause element) to hard (identify all three: datetime, component, and reason).\n\n## Setup\n\n### Data\n\nThe telemetry dataset (~65 GB) is hosted on [HuggingFace](https://huggingface.co/datasets/cdreetz/OpenRCA) and downloaded automatically — no manual setup needed. Each sandbox downloads the relevant system's data on rollout start.\n\nThe host also auto-downloads `query.csv` files on first run for dataset building. 
To pre-download:\n\n```bash\npython -c \"from src.download import download_dataset; download_dataset('dataset')\"\n```\n\n### Installation\n\n```bash\nprime env install openrca-env\n```\n\n## Usage\n\n### Evaluation\n\n```bash\n# Quick smoke test (1 system, few examples)\nvf-eval openrca_env -m gpt-4.1-mini -n 5 -a '{\"systems\": [\"Telecom\"]}'\n\n# Full evaluation on a specific system\nvf-eval openrca_env -m gpt-4.1-mini -n 50 -r 1 -s -a '{\"systems\": [\"Bank\"]}'\n\n# All systems\nvf-eval openrca_env -m gpt-4.1-mini -n 20 -r 1 -s\n```\n\n### Configuration\n\nPass these via `-a` (env args):\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `data_dir` | `\"dataset\"` | Path to query.csv files (auto-downloaded) |\n| `systems` | All systems | `\"Bank\"`, `\"Market/cloudbed-1\"`, `\"Market/cloudbed-2\"`, `\"Telecom\"` |\n| `max_turns` | `30` | Maximum tool-use turns per rollout |\n| `num_examples` | `-1` | Number of examples (-1 for all) |\n| `docker_image` | `\"python:3.11-slim\"` | Docker image for sandbox containers |\n| `sandbox_data_dir` | `\"/data/openrca\"` | Path inside sandbox where telemetry data is placed |\n\n## How It Works\n\nEach rollout runs in its own **isolated sandbox container** (via `PythonEnv` / `SandboxEnv`).\n\nPer-rollout flow:\n1. A sandbox container is created (4 GB RAM, 50 GB disk)\n2. `huggingface_hub` is installed and the relevant system's telemetry data is downloaded from [cdreetz/OpenRCA](https://huggingface.co/datasets/cdreetz/OpenRCA)\n3. A persistent Python REPL is initialized with pandas, numpy, pytz, and `DATA_DIR`\n\nThe environment exposes two tools:\n\n1. **`python`** — Persistent Python REPL with common imports pre-loaded. `DATA_DIR` points to the telemetry data root.\n\n2. 
**`list_directory`** — Lists files and directories within the telemetry data directory.\n\nThe model iteratively analyzes metrics, traces, and logs to identify root causes, then provides a structured JSON answer.\n\n### Scoring\n\nPredictions are scored using the original OpenRCA evaluation methodology:\n- **Component match**: Exact string match against ground truth\n- **Reason match**: Exact string match against ground truth\n- **Time match**: Within 1 minute of ground truth timestamp\n- For multiple root causes, all permutations are evaluated to find the best matching\n\n### Metrics\n\n- `openrca_score`: Primary reward (0.0-1.0) based on scoring criteria match\n- `difficulty_metric`: Task difficulty level (0=easy, 1=medium, 2=hard)\n\n## Project Structure\n\n```\nopenrca_env/\n  openrca_env.py       # load_environment() + OpenRCAEnv class\n  src/\n    dataset.py          # Dataset builder (query.csv -> HF Dataset)\n    download.py         # Auto-download from HuggingFace\n    evaluation.py       # Scoring logic (ported from OpenRCA)\n    prompts.py          # System prompts, telemetry schemas, candidates\n  pyproject.toml\n  README.md\n```\n\n## Reference\n\n```bibtex\n@inproceedings{xu2025openrca,\n  title={OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?},\n  author={Xu, Junjielong and Zhang, Qinan and Zhong, Zhiqing and He, Shilin and Zhang, Chaoyun and Lin, Qingwei and Pei, Dan and He, Pinjia and Zhang, Dongmei and Zhang, Qi},\n  booktitle={The Thirteenth International Conference on Learning Representations},\n  year={2025},\n  url={https://openreview.net/forum?id=M4qNIzQYpd}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4355},"status":null}