{"data":{"kind":"file","path":"README.md","version_id":"e8q0xr77i4dh34hndmrnsrbh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5240,"modified_at":"2026-02-27T06:53:05","content_hash":"6c66aeeac54691bfec0f0a52f0ef708c3e85c4829a27d35634140900f90ab401"},"entries":[],"content":"# RSAgent\n\nEnd-to-end RL training for **Recursive Self-Aggregation (RSA)** — a multi-step inference algorithm where a language model generates candidate solutions, then iteratively aggregates them to improve quality.\n\n**Paper**: [Recursive Self-Aggregation for LLMs](https://arxiv.org/abs/2509.26626)\n\nRSAgent wraps any [verifiers](https://github.com/PrimeIntellect-ai/verifiers) environment and runs the full RSA loop as one rollout. The model learns to both generate good initial answers *and* effectively aggregate candidate solutions, trained end-to-end with GRPO.\n\n## What is RSA?\n\nRSA is a multi-step inference algorithm that improves LLM output quality through iterative self-aggregation. Given a problem, the model:\n\n1. **Generates** a population of N candidate solutions\n2. **Aggregates** by sampling K candidates, presenting them alongside the original problem, and generating an improved solution\n3. **Repeats** aggregation for T steps, with each candidate seeing different subsets of prior solutions\n\nThe key insight is that training the model end-to-end on the full RSA loop (via RL) teaches it to both produce diverse initial candidates and effectively synthesize information across candidates during aggregation.\n\n```\n              RSA Loop (T=3, M=1, N=4, K=2)\n\nStep 0:  Question --> Gen 0 --> \"answer A\"\n(initial)         --> Gen 1 --> \"answer B\"\n                  --> Gen 2 --> \"answer C\"\n                  --> Gen 3 --> \"answer D\"\n\nStep 1:  Agg(K=2) --> Gen 0' --> \"refined A\"\n(aggregate)       --> Gen 1' --> \"refined B\"\n                  --> Gen 2' --> \"refined C\"\n                  --> Gen 3' --> \"refined D\"\n\nStep 2:  Agg(K=2) --> Gen 0'' --> \"final A\"\n(aggregate)       --> Gen 1'' --> \"final B\"\n                  --> Gen 2'' --> \"final C\"\n                  --> Gen 3'' --> \"final D\"\n\nReward = score(final population) --> all 12 trajectory steps\n```\n\nEach individual generation becomes an independent training sample via prime-rl's **branching trajectory strategy**.\n\n## Quick Start\n\nInstall from the Environments Hub:\n\n```bash\nprime env install hyperpotatoneo/rsagent\n```\n\nOr manually:\n\n```bash\nuv pip install rsagent --extra-index-url https://hub.primeintellect.ai/hyperpotatoneo/simple/\n```\n\n### Training Config\n\n```toml\n[orchestrator]\ntrajectory_strategy = \"branching\"   # REQUIRED: each trajectory step = independent sample\nrollouts_per_example = 4            # >= 2 for GRPO signal\nbatch_size = 64\n\n[orchestrator.sampling]\nmax_tokens = 4096\ntemperature = 1.0\n\n[[orchestrator.env]]\nid = \"rsagent\"\nargs = { inner_env_id = \"countdown\", inner_env_args = { num_train_examples = 5000, seed = 42 }, M = 1, N = 4, K = 2, T = 3, task = \"rg\" }\n```\n\nAny [reasoning_gym](https://github.com/reasoning-machines/reasoning-gym) environment can be used as `inner_env_id` (e.g. `\"countdown\"`, `\"basic_arithmetic\"`, `\"sokoban-env\"`).\n\n### Launch Training\n\n```bash\nuv run rl @ experiments/rsagent/rl.toml \\\n  --inference_gpu_ids 0,1 --trainer_gpu_ids 2,3 \\\n  --inference.parallel.dp 2\n```\n\n## Parameters\n\n| Param | Description | Default |\n|-------|-------------|---------|\n| `inner_env_id` | Base verifiers/reasoning_gym environment | `\"sokoban-env\"` |\n| `M` | Number of islands (must be power of 2) | `1` |\n| `N` | Population per island | `4` |\n| `K` | Candidates sampled per aggregation step | `2` |\n| `T` | Total RSA steps | `3` |\n| `task` | Prompt format: `\"rg\"`, `\"math\"`, `\"supergpqa\"` | `\"rg\"` |\n| `final_reward_metric` | `\"mean_accuracy\"`, `\"majority_vote\"`, `\"pass_at_n\"` | `\"mean_accuracy\"` |\n| `greedy_reward_weight` | Weight for per-candidate greedy reward in rollout reward | `0.0` |\n| `seed` | Base random seed for RSA candidate sampling | `1234` |\n\n## Reward Structure\n\nThe rollout-level reward is computed from the final population after T steps:\n\n- **`mean_accuracy`**: fraction of final candidates that are correct\n- **`majority_vote`**: 1.0 if the most common parsed answer is correct\n- **`pass_at_n`**: 1.0 if any final candidate is correct\n\nWith branching strategy, all T x M x N training samples from one rollout share the same reward and advantage. GRPO signal comes from comparing multiple independent RSA runs (`rollouts_per_example >= 2`) on the same problem.\n\n## Island Merging\n\nWith M > 1 islands, candidates within each island only see each other during aggregation. Islands merge progressively over T steps following a pairwise schedule based on `log2(M)`, enabling diverse exploration before convergence.\n\n## Sequence Length\n\nAggregation prompts embed K previous candidate responses. With `max_tokens=4096` and `K=2`, step 1+ prompts can reach ~12K tokens. Set `seq_len >= 16384`.\n\n## Logged Metrics\n\nRSAgent automatically logs per-step metrics to wandb (no prime-rl changes needed):\n\n- `metrics/final_reward`, `metrics/final_mean_accuracy`, `metrics/final_majority_vote`, `metrics/final_pass_at_n`\n- `metrics/mean_greedy` -- average greedy score across all candidates and steps\n- `metrics/step_0_accuracy`, `metrics/step_1_accuracy`, ... -- per-RSA-step accuracy\n\n## Citation\n\n```bibtex\n@article{venkatraman2025rsa,\n  title={Recursive Self-Aggregation for LLMs},\n  author={Venkatraman, Siddarth and Bae, Juhan and Pineau, Joelle},\n  journal={arXiv preprint arXiv:2509.26626},\n  year={2025}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5240},"status":null}