{"data":{"kind":"file","path":"README.md","version_id":"evahyq8s2dpjvmds2cca0f17","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5700,"modified_at":"2026-03-31T08:32:40.314000","content_hash":"c292c8fa2da4875f5feae88282f347cb1695cbd6f0f0667fd4fc1914bc540c48"},"entries":[],"content":"# Long-context retrieval (`long_context_retrieval`)\n\nVerifiers **RLM** environment for question answering over one or more research-paper PDFs in a local workspace. The root model uses a Python REPL with SQL (registry), vectors, graphs, scoped filesystem tools, artifact registration, and `llm_batch()` for delegation.\n\n**Environment ID (Prime / verifiers / Hub):** `long-context-retrieval` — matches `pyproject.toml` `[project].name`. Verifiers still imports the module `long_context_retrieval` (hyphens map to underscores).\n\n## What this environment does\n\n1. Resolves each example’s `workspace_dir` (or builds from `pdf_dir` / `pdf_paths`) and stages `context_dir` for the RLM sandbox.\n2. Initializes a SQLite registry (documents, artifacts, provenance, namespaces) under `.workspace_state/`.\n3. Exposes shared tools to root and sub-LLMs; workspace-scoped reads are confined to the paper tree (not the host `tasks/` JSONL bundle).\n4. Scores rollouts with the default rubric in `core/rewards.py` (expects a structured final answer; see **Answer contract**).\n\n## Repository layout\n\n| Path | Purpose |\n|------|---------|\n| `long_context_retrieval.py` | Entrypoint: `load_environment()` for Prime / verifiers |\n| `core/` | `Config`, dataset build, `LongContextRetrievalEnv`, tools, workspace init, rubric |\n| `scripts/build_dataset.py` | Fetch arXiv PDFs, write `workspace/` + `tasks/` bundle |\n| `configs/eval/eval.toml` | Local eval preset (model, `[eval.env_args]`) |\n| `configs/rl/long-context-retrieval.toml` | Hosted RL smoke (`[env.args]`) |\n\n## Setup\n\nRequires **Python 3.11+**. Use **`uv`** from this directory so `long_context_retrieval` and `core` resolve correctly.\n\n```bash\ncd long_context_retrieval\nuv sync\nuv run python -c \"import long_context_retrieval; print(long_context_retrieval.load_environment)\"\n```\n\n**Dev:** `uv sync --group dev` (Ruff, pytest).\n\n## Running evaluations\n\nRun commands from **`long_context_retrieval/`** (the directory that contains this package’s `pyproject.toml`).\n\n### Local\n\nBuild data once (default output `./contexts/`):\n\n```bash\nuv run python scripts/build_dataset.py --query \"cat:cs.IR\" --max-papers 10\nuv run prime eval run configs/eval/eval.toml --skip-upload\n```\n\nSet the API key expected by the TOML (commonly `PRIME_API_KEY` and `OPENAI_API_KEY` for Prime Inference). `configs/eval/eval.toml` sets `model` and `api_base_url`.\n\n### CLI (`env_id`)\n\n```bash\nprime eval run long-context-retrieval -a '{\"dataset_path\": \"contexts/tasks/dataset.jsonl\", \"max_examples\": 2}'\n```\n\n### Hosted RL (smoke)\n\n1. Push the environment: `prime env push --path .`\n2. Set `[[env]].id` in `configs/rl/long-context-retrieval.toml` to your Hub slug (e.g. `YOUR_USERNAME/long-context-retrieval`).\n3. Run: `prime rl run configs/rl/long-context-retrieval.toml -e WANDB_API_KEY -e OPENAI_API_KEY`\n\n## On-disk layout (`build_dataset.py`)\n\nDefault `--output-dir` is `./contexts`.\n\n| Path | Role |\n|------|------|\n| `contexts/workspace/` | PDFs, `papers.json`, `.workspace_state/` — filesystem tools use this tree |\n| `contexts/tasks/` | `dataset.jsonl`, `hf/` (Hugging Face `save_to_disk`), `manifest.json` — harness only |\n\n## Environment config\n\n`load_environment(config, **kwargs)` merges dict kwargs into **`Config`** (`core/config.py`). Fields match the **`lhaw`** pattern for RLM + sandbox (e.g. `sub_model`, `max_turns`, `repl_language`, `sub_llm_max_turns`, `max_sub_llm_parallelism`, `max_output_length`, `pip_install_packages`, `code_execution_timeout`, `max_startup_wait_seconds`, `abort_on_code_timeout`, sandbox CPU/RAM/disk/GPU/timeout/image). Prompt verbosity is fixed in `core/config.py` constants.\n\n**Passthrough:** any key **not** on `Config` is forwarded to **`RLMEnv`** (e.g. `sandbox_labels`, `sub_max_completion_tokens`).\n\n**Aliases:** `rlm_model` in JSON is accepted as `sub_model`.\n\n**TOML:** same shape as sibling **`lhaw/configs/eval/*.toml`** in the Athena repo — `[eval.env_args]` here, `[env.args]` in `configs/rl/long-context-retrieval.toml`.\n\n## Python API\n\n```python\nfrom long_context_retrieval import load_environment\n\nenv = load_environment({\"dataset_path\": \"contexts/tasks/dataset.jsonl\"})\nenv = load_environment({\"workspace_dir\": \"/abs/path/to/workspace\"})\n```\n\nAdvanced (custom rows): `create_environment(cfg=..., dataset=...)` in `core/environment.py` with a `datasets.Dataset` (`prompt`, `answer`, `info.workspace_dir`).\n\n## Tools and answer contract\n\nREPL tools: **SQLite** (`sql_query` / `sql_write`: one statement per call, any DQL/DDL/DML on registry or scratch/state DBs), **Chroma** (`vector_search`, `vector_get`, upsert, delete), **NetworkX** (`graph_query` helpers plus `op=\"algo\"` with an allowlisted `networkx` function name), scoped filesystem IO, artifact helpers — see `core/tools.py`.\n\nFinal model answer should be JSON:\n\n```json\n{\n  \"answer\": \"short answer text\",\n  \"citations\": [\n    {\n      \"document_id\": \"doc-id\",\n      \"path\": \"pdfs/paper.pdf\",\n      \"page\": 1,\n      \"excerpt\": \"supporting text\"\n    }\n  ]\n}\n```\n\n## Development\n\n```bash\ncd long_context_retrieval\nuv sync --group dev\nuv run pytest -q\nuv run ruff check .\n```\n\n## Packaging\n\n```bash\ncd long_context_retrieval\nuv build\nprime env push --path .\n```\n\nPyPI distribution name: **`long-context-retrieval`**. After install, `import long_context_retrieval` (module `long_context_retrieval.py` + `core/` in the wheel per Hatch `only-include`).\n\nIf **Environments Hub** lists both `long-context-retrieval` and `long_context_retrieval`, the latter usually came from older eval configs or a mismatched `env_id`. Use the hyphenated name everywhere (this repo now does); you can remove the stray underscore entry from the Hub UI if you no longer need it.\n","encoding":"utf-8","truncated":false,"total_bytes":5700},"status":null}