# deeprepo-orchestration

Multi-turn REPL environment for training LLMs to orchestrate codebase analysis via sub-LLM delegation. Part of the [DeepRepo](https://github.com/Leonwenhao/deeprepo) project by Dolores Research.

### Overview
- **Environment ID**: `deeprepo-orchestration`
- **Type**: Multi-turn tool-use (custom REPL loop)
- **Tags**: multi-turn, code-analysis, tool-use, delegation, security-audit

### What It Trains

The model acts as a **root orchestrator** in a Python REPL, analyzing codebases for security vulnerabilities. It must learn to:

1. **Explore** the file tree to understand project structure
2. **Delegate** focused analysis tasks to sub-LLM workers via `llm_batch()` / `llm_query()`
3. **Synthesize** findings from workers into a structured security report
4. **Report** via `set_answer()` with specific vulnerability details

The environment rewards efficient delegation over brute-force self-analysis.

### Dataset

3 mini codebases with planted security bugs:

| Codebase | Type | Files | Planted Bugs |
|----------|------|-------|-------------|
| `calculator` | Python CLI | ~10 | Command injection, `eval()` usage |
| `todo-api` | REST API | ~12 | SQL injection, hardcoded secrets |
| `blog` | Web app | ~15 | XSS, MD5 password hashing |

### Reward Function

Composite reward with three components:

| Component | Weight | What It Measures |
|-----------|--------|-----------------|
| `coverage` | 0.5 | Did the model find the planted bugs? |
| `finding_quality` | 0.3 | Are findings specific and actionable, with remediation? |
| `efficiency` | 0.2 | Did it delegate rather than self-analyze? Fewer turns score higher |

### GRPO Training Results

Trained with [GRPO on Prime Intellect](https://app.primeintellect.ai/dashboard/training/uo6bmz5849wujlwu8286j2r8) (100 steps, batch 128, 8 rollouts/example, LoRA on Qwen3-30B-A3B).

| Metric | Qwen3-8B (untrained) | Qwen3-30B-A3B (untrained) | Qwen3-30B-A3B (GRPO trained) |
|--------|----------------------|---------------------------|-------------------------------|
| Avg Reward | 0.551 (σ 0.176) | 0.785 (σ 0.100) | **0.987** (σ 0.018) |
| Reward Range | 0.200–0.729 | 0.636–0.900 | 0.962–1.000 |
| Avg Turns | 3.67 (2–5) | 8.67 (5–18) | **6.00** (6–6) |
| Input Tokens (avg) | 4,899 | 14,599 | **8,585** |

**Key findings** (Qwen3-30B-A3B, untrained vs. GRPO trained):
- RL training effect: **+25.7%** relative reward gain (0.785 → 0.987)
- Turn variance collapsed to **zero** (σ 4.42 → σ 0.00)
- Average input tokens dropped **41%** (14,599 → 8,585)
- 4/6 rollouts of the trained model achieved a **perfect 1.0 reward**
- The model converged on a consistent 6-turn workflow: explore → delegate → read → synthesize → report → answer

All evals: 3 examples × 2 rollouts each, independently run via `prime eval run`.

### Quickstart

Run an evaluation:

```bash
prime eval run deeprepo-orchestration -m qwen/qwen3-8b -n 3 -r 2
```

Run against the trained adapter (requires deployment):

```bash
prime eval run deeprepo-orchestration \
  -m Qwen/Qwen3-30B-A3B-Instruct-2507:t2nj7mvwh0qs5h1szlfekqoh \
  -n 3 -r 2
```

### Metrics

| Metric | Meaning |
|--------|---------|
| `delegation_reward` | Main composite reward (coverage + quality + efficiency) |
| `num_turns` | Number of REPL turns used by the model |
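
The composite `delegation_reward` can be sketched as a weighted sum of its three components. This is a minimal illustration, not the environment's actual scoring code: the weights come from the Reward Function table, while the function name `composite_reward`, the plain weighted-sum form, and the assumption that each component score lies in [0, 1] are assumptions made here.

```python
# Weights as listed in the README's Reward Function table.
WEIGHTS = {"coverage": 0.5, "finding_quality": 0.3, "efficiency": 0.2}


def composite_reward(coverage: float, finding_quality: float, efficiency: float) -> float:
    """Combine three component scores into one reward.

    Hypothetical helper: assumes each component is already normalized
    to [0, 1] and that the components are combined by a weighted sum.
    """
    scores = {
        "coverage": coverage,
        "finding_quality": finding_quality,
        "efficiency": efficiency,
    }
    # Reject out-of-range inputs so the result stays within [0, 1].
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Under these assumptions, a rollout that finds every planted bug and writes high-quality findings but scores zero on efficiency would top out at 0.5 + 0.3 = 0.8, which is why delegation behavior still matters for reaching a perfect reward.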