{"data":{"kind":"file","path":"README.md","version_id":"xyakhyu50tctf0vywb4mxvj6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5690,"modified_at":"2026-05-28T06:03:11.876000","content_hash":"3de3eb3084bc1da74b63c2ecc3d6c6f6b58ccaa5d5e050d6f93995301abeeb0a"},"entries":[],"content":"# docs-qa environment\n\n`docs-qa` is a `verifiers` v1 environment for training a docs agent on\nLangChain documentation Q&A. It is published as:\n\n```text\nvtrivedy/docs-qa@0.28.0\n```\n\n## What the environment does\n\nEach rollout:\n\n1. Starts from a packaged Q&A row in `docs_qa/questions.jsonl`.\n2. Runs in a sandbox with a pinned `langchain-ai/docs` archive at `/workspace/docs`.\n3. Lets the model search and read the docs with three tools.\n4. Scores the final answer with an LLM judge and deterministic shaping rewards.\n\nThe default corpus root inside the sandbox is `/workspace/docs/src`.\nThe default docs ref is `2d6e8b128c0746b451810b38badbb94610e58ad3`.\n\n## Packaged files\n\n- `docs_qa/env.py`: environment implementation\n- `docs_qa/questions.jsonl`: 466 Q&A rows\n- `docs_qa/docs_nav.txt`: compact docs navigation summary used in prompts\n\nQuestion rows use this shape:\n\n```json\n{\"question\": \"...\", \"answer\": \"...\", \"gold_path\": \"oss/langchain/middleware.mdx\"}\n```\n\nRows are assigned to `train` or `validation` deterministically at load time with\npage-level hashing by default, so validation pages do not overlap training pages.\nUse `taskset.split = \"train\"`, `\"validation\"`, or `\"all\"` to select the source.\n\n## Tools\n\nThe environment exposes two Mintlify-MCP-compatible tools plus one direct page\nreader:\n\n- `search_docs_by_lang_chain(query, language=None)`\n- `read_doc_docs_by_lang_chain(path, start_line=1, num_lines=120)`\n- `query_docs_filesystem_docs_by_lang_chain(command)`\n\n`search_docs_by_lang_chain` performs weighted lexical search over MDX files,\nexcluding snippets and reference stubs from broad search results. It returns\npage titles, docs URLs, page paths, descriptions, and keywords.\n\n`read_doc_docs_by_lang_chain` reads one docs page with bounded line-numbered\noutput and a continuation marker when more lines are available.\n\n`query_docs_filesystem_docs_by_lang_chain` runs read-only shell-style commands\nagainst the docs tree. Allowed commands are:\n\n```text\nrg grep find tree ls cat head tail stat wc sort uniq cut sed awk jq\n```\n\nPipes and command chains are allowed. Common no-op redirects to `/dev/null` are\nignored; other redirects and write operations are rejected.\n\n## Rewards\n\n| Reward | Weight | Purpose |\n|--------|--------|---------|\n| `judge_correct` | 1.0 | LLM judgment against the gold answer. |\n| `path_citation` | 0.25 | Rewards citing the gold docs path or basename; incorrect answers receive no citation credit. |\n| `research_process` | 0.05 | Low-weight shaping for search, read, docs citations, and substantive answers; incorrect answers are capped. |\n| `tool_usage_required` | 0.05 | Low-weight shaping for using search and read before answering; incorrect answers are capped. |\n| `answer_specificity` | 0.03 | Adds small source-specific shaping and reward tie-breaking. |\n| `concise_final_answer` | 0.02 | Rewards complete final answers without runaway length. |\n| `gold_page_in_search_top_k` | 0.02 | Small evidence-progress reward: gold page appeared in search output. |\n| `gold_page_read` | 0.03 | Small evidence-progress reward: rollout read the gold page. |\n| `correct_but_no_gold_cite` | 0.0 | Diagnostic: judge accepted answer but final answer did not cite the gold page. |\n| `wrong_but_full_process` | 0.0 | Diagnostic: rollout searched, read, and answered but judge rejected it. |\n\nThe judge and path-citation rewards require at least one docs read call, except\nfor the `retrieve` diagnostic objective, which scores whether the final answer\ncites the gold page. Direct answers without tool calls get no correctness\ncredit. Process and evidence-progress rewards are intentionally too small for\nan incorrect answer to outrank a correct one.\n\n## Diagnostic objectives\n\nUse `taskset.objective` to isolate failure modes:\n\n- `answer`: Full search, read, answer, and cite task.\n- `retrieve`: Search for the best page and cite the exact docs path.\n- `read_given_page`: Read the provided gold page, then answer the question.\n\nThe full task is the training objective. The diagnostic objectives are for\nshort held-out evals before spending training compute.\n\n## Important config\n\nHosted Training should use:\n\n```toml\n[[env]]\nid = \"vtrivedy/docs-qa\"\nargs = { taskset = { max_turns = 6 } }\n```\n\nFor training, use the train split explicitly:\n\n```toml\n[[env]]\nid = \"vtrivedy/docs-qa\"\nargs = { taskset = { split = \"train\", max_turns = 6 } }\n```\n\nFor held-out evaluation, use:\n\n```toml\nenv_args = { taskset = { split = \"validation\", max_turns = 6, max_examples = 20 } }\n```\n\nFor diagnostic evaluation, set `taskset.objective`:\n\n```toml\nenv_args = { taskset = { split = \"validation\", objective = \"retrieve\", max_turns = 4, max_examples = 20 } }\n```\n\nThe env loader accepts nested args directly:\n\n```python\nload_environment(taskset={\"max_turns\": 6})\n```\n\n`OPENAI_API_KEY` is required for the judge. Pass it to Hosted Training by name:\n\n```bash\nprime --plain train configs/rl/docs-qa-smoke.toml --yes --output json --env-var OPENAI_API_KEY\n```\n\n## Regenerating data\n\nOnly rerun these when intentionally refreshing the packaged dataset:\n\n```bash\ncd environments/docs_qa\npython build_nav.py --structure-only --max-bytes 16000\nANTHROPIC_API_KEY=... python generate_questions.py --total 500\n```\n\nReview and commit only intended changes to:\n\n- `docs_qa/questions.jsonl`\n- `docs_qa/docs_nav.txt`\n\n## Compatibility notes\n\nThe environment pins `verifiers>=0.1.13,<0.1.15`. `env.py` includes small\nHosted Training compatibility shims for chat-message normalization, multi-turn\ntool-use fallback, top-level `sampling_args`, and Prime's expected timing shape.\nThese shims are local to serialization/client boundaries and do not change the\ntool contract or reward objective.\n","encoding":"utf-8","truncated":false,"total_bytes":5690},"status":null}