{"data":{"kind":"file","path":"README.md","version_id":"soer7gc42rme35atj9fg4gi6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4443,"modified_at":"2026-04-13T17:37:42.150000","content_hash":"761cd7be848fe3fff0bc3d3fd29452287cfda2941b1127f2ac0a1e613cf4c929"},"entries":[],"content":"# drop\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/drop\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n- **Environment ID**: `drop`\n- **Short description**: DROP (Discrete Reasoning Over Paragraphs) reading-comprehension evaluation environment\n- **Tags**: `drop`, `reasoning`, `eval`, `single-turn`\n\n### Provenance\n- Reproduces the OLMES `drop::llama3` task used by the `tulu_3_dev` evaluation suite (Dua et al. 2019).\n- Integrated into prime-rl for evaluating OTA Tulu-3 specialist SFT checkpoints under a unified vLLM serving stack.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **63.0%** | 61.3% | +1.7pp |\n\n- **Metric**: F1\n- **Sampling**: greedy, n=9535 (best of 3 reruns)\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. 
Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: `ucinlp/drop`, `validation` split\n- **Source links**: [HF](https://huggingface.co/datasets/ucinlp/drop), [DROP paper](https://arxiv.org/abs/1903.00161)\n- **Split sizes**: ~9,500 validation problems\n- **Few-shot source**: `environments/olmes_fewshot.json` under key `drop_OLMES:drop` (3 examples)\n\n### Task\n- **Type**: single-turn, reading-comprehension, eval\n- **Decoding**: greedy (temperature 0); answers are short, deterministic spans/numbers/dates\n- **Parser**: `MaybeThinkParser`; the prediction is taken as the first non-empty line of the model's parsed answer\n- **Rubric overview**: token-level F1 against the set of gold answer spans (weighted 1.0) plus a binary exact-match metric (weighted 0.0, reported only)\n\n### OLMES methodology faithfulness\n- **3-shot prompting** using the OLMES `drop::llama3` few-shot examples.\n- OLMES few-shot entries carry DROP's structured `{spans, number, date}` answer dict; `_answer_to_text` in `drop.py` flattens these into plain strings so they render correctly in the prompt (`number` > `spans` joined by `\", \"` > `month day year`).\n- Prompt format mirrors OLMES: `Passage: ...\\n\\nQuestion: ...\\n\\nAnswer: ...` per shot, followed by the target passage/question with a trailing `Answer:` cue.\n- Scoring uses the official DROP normalization (lowercase, strip punctuation, drop articles `a|an|the`, collapse whitespace) and computes token-overlap F1 against every gold span, keeping the best-matching one.\n\n### Known caveat\n- OLMES runs `drop::llama3` as a **completion-style** task without a chat template. Our vLLM serving path always applies the model's chat template, which may alter how the 3-shot prefix and `Answer:` cue are tokenized.\n- Our current tulu3_ref scores sit roughly 2–6pp below the Tulu-3 paper number (~55–59% vs 61.3%); the chat-template mismatch is the leading hypothesis. 
Investigation pending — see `project_ota_eval_outstanding.md`.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run drop\n```\n\nOverride the system prompt:\n\n```bash\nprime eval run drop -a '{\"system_prompt\": \"Answer the question based on the passage.\"}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | System prompt shown to the model |\n| `num_shots` | int | `3` | Number of OLMES few-shot examples to prepend (set `0` to disable) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `f1` | Token-level F1 between the prediction and the best-matching gold span (DROP-normalized) |\n| `exact_match` | Binary: whether the normalized prediction matches any gold span exactly |\n\n### Reference results\n\n| Checkpoint | f1 | Notes |\n| ---------- | -- | ----- |\n| `tulu3_ref` | ~55–59% | From an older `/full/` eval run; v4 OLMES-env rerun pending |\n| Specialists (math / coding / KR / precise_if / general) | TBD | DROP was not part of the specialist eval sweep |\n\nPaper reference: Tulu-3 reports 61.3% F1 on DROP under the `drop::llama3` configuration.\n","encoding":"utf-8","truncated":false,"total_bytes":4443},"status":null}