{"data":{"kind":"file","path":"README.md","version_id":"plbpd4dc00jq0mp7j55ix8t9","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3660,"modified_at":"2025-08-24T23:00:25.781000","content_hash":"8972df8e54cb1410b9afbe922d1001236289277f0c26fd6d00c2e606333e1464"},"entries":[],"content":"# quality-ds-search\n\n> Quality search dataset environment for evaluating agentic RAG capabilities with challenging short-form content.\n\n### Overview\n- **Environment ID**: `quality-ds-search`\n- **Short description**: Agentic RAG system that learns to search within specific articles to answer multiple-choice questions about short stories and articles\n- **Tags**: rag, search, quality, train, eval\n\n### Motivation\nThis environment presents a particularly challenging and interesting RAG task. The dataset contains questions about short stories and articles, but with a unique constraint: the text chunks are extremely small (only ~200 characters each). This creates a fascinating challenge for agentic RAG systems. They must learn to:\n\n1. **Navigate fragmented information**: With such small chunks, relevant information is often split across multiple pieces\n2. **Make precise tool calls**: The LLM must extract the correct `article_id` from the question and use it to search within the right article\n3. **Synthesize from multiple chunks**: Success requires retrieving and combining information from several small chunks to answer complex questions\n4. 
**Handle ambiguity**: Small chunks can be ambiguous out of context, requiring the system to understand relationships between chunks\n\nThis setup tests whether LLMs can learn truly agentic behavior - not just retrieving information, but strategically using tools to navigate a complex, fragmented knowledge space.\n\n### Datasets\n- **Primary dataset**: `bhogan/quality-search-dataset` - Collection of multiple-choice questions about short stories and articles\n- **Source**: [HuggingFace Dataset](https://huggingface.co/datasets/bhogan/quality-search-dataset)\n- **Original paper**: [QuALITY: Question Answering with Long Input Texts, Yes!](https://arxiv.org/abs/2112.08608) by Pang et al. (2021)\n- **Split sizes**: 453 training examples\n- **Format**: Questions with 4 multiple-choice options, each tied to a specific article_id\n\n### Task\n- **Type**: Multi-turn tool use\n- **Parser**: ThinkParser\n- **Rubric overview**: Judge-based evaluation using GPT-4.1-mini to assess answer correctness\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval quality-ds-search\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval quality-ds-search   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"key\": \"value\"}'  # env-specific args as JSON\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nConfiguration options for the quality search environment:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | int | `10` | Maximum number of tool calls allowed per question |\n| `judge_model` | str | `\"gpt-4.1-mini\"` | Model used for judging answer correctness |\n| `embed_model` | str | `\"text-embedding-3-small\"` | OpenAI embedding model for similarity search |\n| `judge_base_url` | str | `\"https://api.openai.com/v1\"` | Base URL for judge model API |\n| `embed_base_url` | str | `\"https://api.openai.com/v1\"` | Base 
URL for embedding model API |\n\n### Tools\nThe environment provides one main tool:\n\n- **`search_article_chunks(article_id, query)`**: Searches within a specific article for the top 5 most similar chunks to the query, using cosine similarity on embeddings. Returns chunks with similarity scores.\n\n### Metrics\nKey evaluation metrics for the quality search task:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (1.0 for correct answers, 0.0 otherwise) |\n| `judge_score` | Binary score from GPT-4.1-mini judge evaluating answer correctness |\n\n","encoding":"utf-8","truncated":false,"total_bytes":3660},"status":null}