{"data":{"kind":"file","path":"README.md","version_id":"qcjzv5wl8lwtn1ujx21jjcbe","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":16555,"modified_at":"2026-03-28T00:30:10.040000","content_hash":"ab45dff385343e1eb42b101d72ecdfa9fee0ff9f73940e859016a25b3a50e031"},"entries":[],"content":"# Asteroid\n\n**A Qdrant-native agent toolkit for multi-hop search. Works with any LLM. Open source.**\n\nEveryone building search agents is focused on the model. Train a better model, get better search results. This gets the causality backwards. The retrieval infrastructure underneath the model determines what the agent can do. A better model issuing queries against a limited search backend is still limited by that backend.\n\nAsteroid is a set of agent tools built directly on Qdrant's retrieval primitives: server-side hybrid search, payload filtering, full-text search, grouped retrieval, relevance feedback, ColBERT reranking, and self-editing context management. These tools work with any LLM that supports tool calling and any Qdrant collection you already have.\n\nWe also train a reference model on [Prime Intellect](https://www.primeintellect.ai/) using [INTELLECT-3.1](https://huggingface.co/PrimeIntellect/INTELLECT-3.1) (106B MoE, 12B active) that shows what happens when you optimize a model specifically for these tools. The reference model is the ceiling. An off-the-shelf LLM using the same tools is the floor. The gap between them is the research contribution. Both are useful.\n\n## How it works\n\nThe agent gets seven tools. 
Each one maps to a Qdrant capability that existing search agents lack.\n\n```\nUser Query\n    |\n    v\nAny LLM with tool calling (INTELLECT-3.1, Claude, GPT-4, Llama, etc.)\n    |\n    |--- search_corpus -----> Qdrant hybrid search (BM25 + dense + RRF)\n    |--- filter_search -----> Qdrant payload-filtered hybrid search\n    |--- text_search -------> Qdrant full-text index (phrase, all/any tokens)\n    |--- read_document -----> Qdrant document retrieval by ID\n    |--- group_search ------> Qdrant grouped results by payload field\n    |--- refine_search -----> Qdrant relevance feedback (mark good/bad results)\n    |--- prune_chunks ------> context window management (self-editing)\n    |\n    v\nFinal Answer + Retrieved Evidence\n```\n\nThe harness speaks OpenAI-compatible chat completions with tool use. Swap the model by changing a URL.\n\n### What makes these tools different\n\n**Server-side hybrid search.** BM25 and dense vector fusion in a single API call through Qdrant's Query API. Most search agents run two queries and merge results in application code. Qdrant handles the fusion server-side with reciprocal rank fusion. Weighted RRF (new in v1.17) lets the agent control how much weight goes to lexical vs semantic results.\n\n```python\nfrom qdrant_client import models\n\n# Assumes `client` is a connected QdrantClient and that `sparse_vec` and\n# `dense_vec` are the sparse and dense vectors for the current query.\nclient.query_points(\n    collection_name=\"corpus\",\n    # Fetch 50 candidates from each index, then fuse them server-side.\n    prefetch=[\n        models.Prefetch(query=sparse_vec, using=\"bm25\", limit=50),\n        models.Prefetch(query=dense_vec, using=\"dense\", limit=50),\n    ],\n    query=models.FusionQuery(fusion=models.Fusion.RRF),\n    limit=10,\n)\n```\n\n**Payload filtering with nested field support.** The agent issues structured metadata queries using dot notation for nested fields: \"10-K filings from Tesla filed after 2023\" or \"patents in semiconductor manufacturing classified under H01L.\" Qdrant supports keyword, integer, float, datetime, and geo filters. 
No existing search agent exposes this.\n\n**Full-text search on indexed payloads.** Qdrant maintains a text index with configurable tokenizers (word, whitespace, prefix) and three query modes: AllTokens (AND), AnyTokens (OR), and Phrase (exact sequence). This replaces regex grep with something faster and more flexible, while still running server-side.\n\n**Grouped search results.** The agent can group results by a payload field. Searching SEC filings grouped by `company` returns the top result per company instead of five results from the same company. This changes how the agent explores a corpus: it sees breadth first, then dives deep.\n\n**Relevance feedback.** New in Qdrant v1.17. The agent marks retrieved results as \"more relevant\" or \"less relevant,\" and Qdrant adjusts scoring on the fly without retraining any model. This creates a self-improving search loop within a single query session. The agent searches, reads, judges, refines, searches again. Each iteration is better than the last.\n\n**ColBERT reranking.** Native multivector support for token-level late interaction. A second-stage precision boost running inside Qdrant rather than in a separate service. Applied automatically during `search_corpus` when the collection has multivectors.\n\n**Single collection architecture.** All corpora live in one collection with a `corpus` payload field. The agent scopes searches through payload filtering. This is simpler, uses fewer resources, and makes filtering a core pattern rather than an afterthought.\n\n## Works with what you have\n\nAsteroid does not require a specific embedding model or a specific LLM.\n\n**Any LLM.** The harness calls the model through an OpenAI-compatible API with tool definitions. If your model supports tool calling, it can drive Asteroid. We test with INTELLECT-3.1, Claude, and GPT-4. Local models work too.\n\n**Any Qdrant collection.** Dense search uses whatever embedding model the collection was built with. 
BM25 sparse search does not depend on pre-computed vectors. Qdrant computes BM25 scores server-side from text stored in payloads, so any collection that stores its original text gets hybrid search automatically. Payload filtering, full-text search, grouping, and relevance feedback all operate on payloads and point IDs. The only feature that requires additional indexing is ColBERT reranking, which needs pre-computed multivectors.\n\nA Qdrant user with an existing OpenAI-embedded collection and text payloads gets hybrid search, full-text search, payload filtering, grouped results, and relevance feedback on day one. No re-embedding. No migration.\n\n## The reference model\n\nThe toolkit is the product. The reference model is the proof that the toolkit is worth optimizing for.\n\nWe fine-tune [INTELLECT-3.1](https://huggingface.co/PrimeIntellect/INTELLECT-3.1) (106B MoE, 12B active) on [Prime Intellect](https://www.primeintellect.ai/) to produce a model specifically trained to use Asteroid's Qdrant tools. This is a joint effort:\n\n- **Prime Intellect** contributes the model, the training infrastructure (prime-rl, open source), and the serving layer (inference API with LoRA adapter deployment). Nobody has published agentic benchmarks for INTELLECT-3.1. Asteroid produces the first.\n- **Qdrant** contributes the retrieval infrastructure that the model is trained against.\n\n### Training\n\nAll training runs on Prime Intellect via prime-rl. No supervised fine-tuning. We go straight to RL with a three-level curriculum, following Prime's recommended approach for training search agents.\n\n**Level 1: Metadata retrieval.** Single tool calls with exact-match rewards. The agent learns basic tool use: issue a query, read the result, extract an answer. Tools available: `search_corpus`, `filter_search`. Expected to saturate near 1.0 reward within 15 steps. If it doesn't, the tools or data are broken.\n\n**Level 2: Multi-step reasoning.** Two or more tool calls with binary rewards. 
The agent learns to chain searches, compare results, and compute answers from multiple sources. Tools added: `text_search`. Expected to reach ~0.70 reward by step 50.\n\n**Level 3: Open-ended analysis.** Full tool set with LLM-judged rewards. The agent learns synthesis, cross-document reasoning, context management, and when to stop searching. Tools added: `read_document`, `group_search`, `refine_search`, `prune_chunks`. All seven tools available. Reward includes hallucination penalty (0.2x multiplier) and efficiency bonus. Expected to reach ~0.50 reward by step 100 with high variance.\n\nEach level loads the LoRA checkpoint from the previous level. The curriculum progressively unlocks tools and increases question difficulty. Training configs live in `configs/level1.toml`, `configs/level2.toml`, `configs/level3.toml`.\n\n**After RL: Embedding fine-tuning.** Fine-tune BGE-M3 on successful trajectories from Level 3. Joint optimization of dense, sparse, and ColBERT representations. Reindex corpora with the improved embeddings.\n\n## Evaluation\n\n### The core question\n\nHow much does a model trained for these specific tools outperform an off-the-shelf model using the same tools? We run every benchmark twice: once with the fine-tuned reference model, once with an unmodified LLM. The delta is what training buys you. 
The baseline is what the Qdrant tools buy you for free.\n\n### Public benchmarks (comparability)\n- **HotpotQA**: multi-hop question answering (saturated by SID-1 and Context-1, included for comparability)\n- **FRAMES**: multi-hop factuality, retrieval, and reasoning over Wikipedia\n- **BrowseComp**: complex retrieval requiring multiple search steps\n- **FinQA**: financial question answering (SID-1 benchmark)\n- **BARExamQA**: legal reasoning (SID-1 benchmark)\n- **SciFact**: scientific claim verification\n\n### First agentic benchmarks for INTELLECT-3.1\n- **TAU2-Bench**: agentic tool use\n- **BFCL-V4**: function calling\n\nNo published scores exist for this model on either benchmark.\n\n### Domain evals across enterprise verticals\n\nEach eval uses Prime Intellect's verifier format and targets scenarios where Qdrant's retrieval features provide a measurable difference.\n\n*Finance:* SEC filing search across 10-K, 10-Q, and S-1 forms with temporal reasoning scoped by `filter_search`. Quarterly earnings comparison filtered by ticker, quarter, and year.\n\n*Healthcare:* PubMed multi-hop medical literature retrieval, where hybrid search handles the overlap between biomedical terminology and clinical semantics.\n\n*Legal and government:* Patent prior-art discovery. US Legal Code cross-referencing with exact citation matching via `text_search`. Congressional Daily Digest search scoped by date, committee, and topic.\n\n*Technical:* arXiv paper retrieval. Codebase navigation combining `text_search` for symbol lookup with `filter_search` on language and file path. Gutenberg literary analysis testing context management through pruning.\n\n*Multi-hop reasoning:* Latent Two-Hop Reasoning, a synthetic benchmark that isolates the core capability.\n\n### Out-of-domain evals (generalization)\n\nThe toolkit should work beyond the domains we train on. These evals test that:\n\n- **SpreadsheetBench**: tabular data reasoning. 
Structurally different from document search, tests whether the agent adapts its tool use to non-prose data.\n- **New News**: current events retrieval with temporal reasoning. Tests freshness and time-scoped search via `filter_search`.\n- **CLRS (Introduction to Algorithms)**: textbook Q&A across a full algorithms textbook. Tests long-form technical retrieval where precise definitions matter. The `text_search` tool should excel at finding exact theorem statements and pseudocode.\n- **Deep Learning (Goodfellow et al.)**: textbook Q&A across a dense mathematical text. Tests whether hybrid search handles notation-heavy content where lexical matching on symbols matters as much as semantic similarity.\n\n### The ablation\n\nThe most important evaluation. Same model, same queries. Add one Qdrant feature at a time and measure:\n\n1. Dense search only\n2. Add BM25 hybrid via server-side RRF\n3. Add full-text search via `text_search`\n4. Add payload filtering via `filter_search`\n5. Add grouped results via `group_search`\n6. Add relevance feedback via `refine_search`\n7. Add ColBERT reranking\n8. 
Add fine-tuned BGE-M3 embeddings\n\nEach row answers a concrete question: how much does this piece of retrieval infrastructure improve agent performance?\n\n## Timeline\n\n| Week | Phase | Deliverable |\n|------|-------|-------------|\n| 1-2 | Agent toolkit | Seven Qdrant-native tools, harness, token budget |\n| 2-3 | Indexing pipeline | 10 domain corpora in Qdrant |\n| 3-4 | Baseline evals | Off-the-shelf LLMs on toolkit (floor) |\n| 4-5 | Level 1 RL | Metadata retrieval, basic tool use |\n| 5-6 | Level 2 RL | Multi-step reasoning, tool chaining |\n| 6-8 | Level 3 RL | Open-ended analysis, full tool set |\n| 7-8 | Embedding fine-tuning | Fine-tuned BGE-M3 |\n| 8-9 | Full eval and ablation | Head-to-head: reference model vs baseline, feature ablation |\n| 9-10 | Publication and release | Technical report, toolkit, model weights, ablation |\n\n## Project structure\n\n```\nasteroid/\n  src/asteroid/\n    agent/          # harness, system prompt, token budget, context tracking\n    tools/          # search, filter_search, text_search, read, group, refine, prune\n    retrieval/      # Qdrant client, BGE-M3 embeddings, ColBERT reranker\n    indexing/       # chunking, ingestion pipeline, corpus-specific loaders\n    serving/        # model client (OpenAI-compatible, any provider)\n  training/\n    rl/             # verifiers ToolEnv, 3-level curriculum, reward functions\n    embedding/      # BGE-M3 fine-tuning from trajectory signal\n  configs/          # level1.toml, level2.toml, level3.toml for prime-rl\n  eval/\n    benchmarks/     # HotpotQA, FRAMES, BrowseComp, TAU2, BFCL, Latent Two-Hop\n    domains/        # SEC, PubMed, patents, legal, congressional, arxiv, codebase, gutenberg\n    ablation/       # feature isolation study\n  scripts/          # CLI tools for indexing, eval, deployment\n```\n\n## Stack\n\n- [Qdrant](https://qdrant.tech/) for retrieval\n- [Prime Intellect](https://www.primeintellect.ai/) for training and serving the reference model\n- 
[INTELLECT-3.1](https://huggingface.co/PrimeIntellect/INTELLECT-3.1) (106B MoE, 12B active)\n- [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) for RL training\n- [verifiers](https://github.com/PrimeIntellect-ai/verifiers) for environments and evals\n- [BGE-M3](https://huggingface.co/BAAI/bge-m3) for embeddings\n\n## What we open-source vs what exists today\n\nExisting agentic search projects release model weights while keeping the infrastructure closed. Context-1 released weights under Apache 2.0 and a data generation pipeline, but the agent harness, evaluation code, and training pipeline remain unreleased. Without the harness, the weights cannot reproduce the reported results. SID-1 is available via API only.\n\nAsteroid releases everything:\n\n| Component | Context-1 | SID-1 | Asteroid |\n|-----------|-----------|-------|----------|\n| Model weights | Apache 2.0 | API only | Apache 2.0 |\n| Agent harness | Unreleased | Closed | Open source |\n| Tool implementations | Unreleased | Closed | Open source |\n| Training pipeline | Closed (Tinker) | Closed | Open source (prime-rl) |\n| Training data gen | Public (no license) | Closed | Apache 2.0 |\n| Eval harness | Unreleased | Closed | Open source (verifiers) |\n| Embedding model | Unspecified | Closed | Open source (fine-tuned BGE-M3) |\n\nThe harness is the product. Without it, model weights are an artifact you can download and not use. Asteroid ships the harness, the tools, the training recipe, and the eval suite. A researcher can reproduce every result. A developer can run it against their own Qdrant collection today.\n\nAsteroid also serves as a reference implementation for using Qdrant's composable vector search primitives in model training. The training pipeline shows how to wire hybrid search, payload filtering, relevance feedback, and full-text search into an RL reward loop. 
Developers building their own retrieval agents can use this as a starting point for integrating Qdrant into their training infrastructure.\n\n## Prior work\n\n- [SID-1](https://sid.ai/blog) (December 2025): pioneered RL-trained agentic retrieval. First to demonstrate that small specialized models outperform frontier LLMs on retrieval tasks, achieving 0.84 recall on a 14B model. Much of the field's current understanding of agentic search traces back to their work.\n- [Context-1](https://www.trychroma.com/research/context-1) (March 2026): scaled agentic search to 20B parameters and introduced self-editing context management via pruning. Released model weights and data generation pipeline.\n- [Search-R1](https://github.com/PeterGriffinJin/Search-R1): open RL framework for training models to interleave reasoning with search. Demonstrated 41% performance gains over RAG baselines.\n- [Qdrant Hybrid Search](https://qdrant.tech/articles/hybrid-search/): introduced server-side fusion of sparse and dense retrieval via the Query API, eliminating client-side orchestration.\n- [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl): open source async RL framework that enabled globally distributed training of 100B+ parameter models. Powers INTELLECT-3 and the Environments Hub ecosystem.\n- [Thinking Machines Lab](https://thinkingmachines.ai/): built Tinker, the training API used to train Context-1. Shared key ideas about RL for retrieval that advanced the field.\n\n## License\n\nApache 2.0\n","encoding":"utf-8","truncated":false,"total_bytes":16555},"status":null}