{"data":{"kind":"file","path":"README.md","version_id":"ryrdh7pb6tcl0qcomd3bztdu","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6770,"modified_at":"2026-05-13T00:23:54.293000","content_hash":"34d7fad3d85a0d648501479d821a443f67318b0a8eb6a5823f0dfbb60e58c657"},"entries":[],"content":"# SWE-grep\n\nThis recipe is inspired by Cognition’s [SWE-grep](https://cognition.ai/blog/swe-grep): a reinforcement learning setup for training a model to retrieve the right code context quickly.\n\nInstead of optimizing for open-ended code generation, this environment optimizes for **efficient code search**. The model is rewarded for finding the right files, answering correctly, and using parallel tool calls well.\n\n## Why this environment exists\n\n`grep`-style search is still one of the most reliable ways to navigate a large codebase.\n\nCompared with embedding-heavy retrieval pipelines, grep-based search has a few advantages:\n\n- no vector database to manage\n- direct access to exact code matches\n- fast iteration on search patterns\n- easy grounding in real file paths and line-level evidence\n\nThe challenge is that the model must learn to search **efficiently**, not just eventually. A strong agent should turn a high-level question like:\n\n> How is the panning and zooming functionality implemented?\n\ninto a small number of targeted, parallel search operations that surface the right files quickly.\n\n## Environment overview\n\nThe environment is implemented in `swe_grep.py` as `SweGrepEnv`, which extends `vf.SandboxEnv`.\n\nThe stack looks like this:\n\n- `StatefulToolEnv`: gives the model tool access and preserves rollout state\n- `SandboxEnv`: provisions a Prime sandbox for each rollout\n- `SweGrepEnv`: customizes the sandbox and tools for grep-centric retrieval\n\nSee the Verifiers docs for more on stateful environments:\nhttps://docs.primeintellect.ai/verifiers/environments#stateful-tool-environments\n\n## Tools exposed to the model\n\n`SweGrepEnv` removes the default bash tool and replaces it with three task-specific tools:\n\n- `grep_tool`: search for text patterns with `ripgrep`\n- `list_files`: inspect directory contents\n- `read_file`: read bounded line ranges from a file\n\n```python\nself.remove_tool(self.bash)\nself.add_tool(self.grep_tool, args_to_skip=[\"sandbox_id\"])\nself.add_tool(self.list_files, args_to_skip=[\"sandbox_id\"])\nself.add_tool(self.read_file, args_to_skip=[\"sandbox_id\"])\n```\n\nThis keeps the action space narrow and focuses learning on search behavior rather than arbitrary shell usage.\n\n## Stateful tool pattern\n\nEach rollout gets its own Prime sandbox. The environment injects `sandbox_id` into tool calls so the model does not have to manage sandbox state itself.\n\n```python\ndef update_tool_args(self, tool_name: str, tool_args: dict[str, Any], messages, state, **kwargs):\n    updated_args = dict(tool_args)\n    if tool_name in [\"grep_tool\", \"list_files\", \"read_file\"]:\n        updated_args[\"sandbox_id\"] = state[\"sandbox_id\"]\n    return updated_args\n```\n\nThis is the core `StatefulToolEnv` pattern: keep persistent rollout state in `state`, and let the environment handle internal bookkeeping.\n\n## Sandbox setup\n\nFor each rollout, the sandbox is prepared by:\n\n1. installing `git` and `ripgrep`\n2. cloning the VS Code repository\n3. verifying that the clone succeeded\n\nThe model then searches that repo to answer questions.\n\n## Dataset\n\nThe dataset is loaded from `cdreetz/swe-grep-v2` and filtered to examples where `check == \"Yes\"`.\n\nDuring preprocessing:\n\n- `user_query` is renamed to `question`\n- `ground_truth` is renamed to `answer`\n- `file_path` and `file_path_2` are preserved for reward computation\n- the dataset is split into train and eval sets\n\nThe examples are synthetic but grounded in real code from Microsoft’s VS Code repository. The goal is to train retrieval behavior on realistic developer questions paired with technical explanations and source files.\n\nFor more detail on the dataset generation pipeline, see:\nhttps://app.primeintellect.ai/dashboard/environments/prime/swe-grep/files/frt126ew7h8p1fud3bwl9ceu/src/create_dataset.py\n\n## Reward design\n\nThis recipe uses a `vf.JudgeRubric` with three active rewards and one tracking metric:\n\n- **Correct answer** (`0.4`): did the model produce the right technical explanation?\n- **Correct file paths** (`0.4`): did it identify the relevant file or files?\n- **Parallel tool calls** (`0.2`): did it use available tool parallelism effectively?\n- **Efficiency bonus** (`0.0`): among correct rollouts, reward fewer turns\n\n```python\nrubric = vf.JudgeRubric(judge_prompt=JUDGE_PROMPT)\nrubric.add_reward_func(correct_answer_reward_func, weight=0.4)\nrubric.add_reward_func(correct_file_paths_reward_func, weight=0.4)\nrubric.add_reward_func(parallel_tool_calls_reward_func, weight=0.2)\nrubric.add_reward_func(efficiency_bonus_for_correct, weight=0.0)\n```\n\nA few notable design choices:\n\n- correctness is judged semantically, not by exact string match\n- multi-file tasks are supported via `file_path` and `file_path_2`\n- the environment explicitly encourages parallelism\n- the default system prompt constrains the agent to **2 turns**, increasing pressure to search well\n\n## Agent behavior being optimized\n\nThe system prompt pushes the model toward a very specific behavior profile:\n\n- use tools aggressively\n- make multiple tool calls per turn\n- gather evidence from all relevant files\n- return both file paths and a final answer\n\nExpected response format:\n\n```text\nFiles:\n- <path/to/file1>\n- <path/to/file2>\nAnswer: <your answer here>\n```\n\n## Quick start\n\nFrom this recipe directory, install dependencies and run eval through Verifiers or Prime tooling.\n\n### Environment entrypoint\n\n`pyproject.toml` should point at:\n\n```toml\n[tool.verifiers.environment]\nentrypoint = \"swe_grep:load_environment\"\n```\n\n### Eval defaults currently present\n\n```toml\n[tool.verifiers.eval]\nnum_examples = 5\nrollouts_per_example = 3\n```\n\n### Python usage\n\n```python\nfrom swe_grep import load_environment\n\nenv = load_environment()\n```\n\n## Files\n\n- `swe_grep.py`: environment, tools, prompt, dataset loading, and rewards\n- `src/create_dataset.py`: dataset generation pipeline\n- `src/sandbox_metrics.py`: sandbox execution metrics and retry helpers\n\n## Notes and limitations\n\n- The current eval defaults are very small (`5 x 3`) and seem intended for quick iteration rather than robust benchmarking.\n- Reward quality depends on judge quality, so score stability may vary across judge models.\n- The environment is intentionally opinionated: it trains search behavior under strict turn limits rather than general software engineering performance.\n\n## Environment Hub\n\nPrime Environment Hub:\nhttps://app.primeintellect.ai/dashboard/environments/prime/swe-grep\n\n## Summary\n\nThis recipe is a compact example of RL for retrieval behavior:\n\n- ground the model in a real repository\n- give it a small, focused tool set\n- reward correctness, coverage, and speed\n- encourage parallel search under tight constraints\n\nIt is not the only way to train a strong grep agent, but it is a clear and practical starting point.\n","encoding":"utf-8","truncated":false,"total_bytes":6770},"status":null}