{"data":{"kind":"file","path":"README.md","version_id":"li1rlrprnuri34a4goioke42","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5316,"modified_at":"2026-04-13T17:37:42.154000","content_hash":"390276612c1b19b0098eed49e98d4f454fe61a79da703430403265173b7f673d"},"entries":[],"content":"# popqa\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/popqa\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n- **Environment ID**: `popqa`\n- **Short description**: Open-domain factual QA over entity-relation triples (Mallen et al. 2022). Reproduces the OLMES `popqa::tulu` task configuration used in the `tulu_3_dev` suite.\n- **Tags**: `popqa`, `knowledge`, `factual-qa`, `eval`, `single-turn`\n\n### Provenance\nFaithful reproduction of the OLMES `popqa::tulu` task, wired into prime-rl via verifiers for evaluating the OTA Tulu-3 specialist SFT checkpoints. PopQA targets tail-entity knowledge: most questions ask about long-tail entities that models are unlikely to have memorized.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **32.4%** | 29.3% | +3.1pp |\n\n- **Metric**: substring match against alias list\n- **Sampling**: greedy, n=14267\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: `akariasai/PopQA`, `test` split\n- **Source links**: [HF](https://huggingface.co/datasets/akariasai/PopQA), [Paper](https://arxiv.org/abs/2212.10511)\n- **Size**: 14,267 questions across 16 relations (capital, country, occupation, place_of_birth, religion, sport, screenwriter, director, producer, composer, author, color, mother, father, genre, capital_of).\n\n### Task\n- **Type**: single-turn eval (with a 15-turn fewshot prefix materialized as prior multiturn messages)\n- **Parser**: `MaybeThinkParser`\n- **Rubric**: `exact_match` (substring match against any alias in `possible_answers`)\n\n### OLMES methodology faithfulness\nThe env mirrors OLMES `popqa::tulu` exactly:\n- **15-shot, `fewshot_as_multiturn=True`**: each fewshot example is a separate user/assistant turn rather than concatenated into one prompt.\n- **User turn**: `Q: {question}` (OLMES strips the trailing ` A:` when rendering multiturn).\n- **Assistant turn**: `\" {answer}\"` — note the leading space, matching OLMES `doc_to_target`.\n- **Answer aliases**: gold answers come from the `obj` field, which is a JSON-encoded list of valid surface forms. A prior version of this env read `obj` as a raw string, missing alias matches; this is now parsed via `json.loads` and scored as a substring match against any alias (case-insensitive and capitalized variants).\n- **Greedy decoding** (temperature 0) recommended for reproducibility with OLMES numbers.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run popqa\n```\n\nOverride the system prompt:\n\n```bash\nprime eval run popqa -a '{\"system_prompt\": \"Answer the question concisely.\"}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | Optional system prompt prepended to the conversation. OLMES `popqa::tulu` uses none. |\n| `num_shots` | int | `15` | Number of fewshot examples drawn from `olmes_fewshot.json[\"popqa\"]`. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `correct` | 1.0 if any alias from `possible_answers` appears in the model response (first-line, substring, case/capitalization variants), else 0.0. Aggregated as mean accuracy across the 14,267 test questions. |\n\n### Reference results (OTA Tulu-3 specialists)\n\n| Checkpoint | PopQA (%) |\n| ---------- | --------- |\n| tulu3_ref (paper) | 29.3 |\n| **tulu3_ref (ours)** | **32.4** (+3.1 over paper) |\n| knowledge_recall specialist | 32.0 |\n| math_reasoning specialist | ~31 |\n| coding specialist | ~31 |\n| precise_if specialist | ~31 |\n| general specialist | ~31 |\n\nAll five OTA specialists land within ~1pp of each other, and `knowledge_recall` barely edges the rest. This suggests \"knowledge specialization\" behaves as a generic SFT-style effect rather than injecting unique factual knowledge — consistent with PopQA's tail-entity profile.\n\n### Known limitations\n- **Substring scoring is lenient toward verbose outputs.** Models that emit additional Q/A pairs after the answer can be over-rewarded when an alias happens to appear downstream, or penalized when the first line produces a wrong answer and the correct alias appears only later. The env mitigates this by parsing only the first line (`response.strip().split(\"\\n\")[0]`).\n- **Format-vs-knowledge audit**: on a sample of failures, only 0.86% contained the correct alias as a substring anywhere in the response. The remaining 99% are genuine knowledge gaps, not scoring artifacts. KR-tank merges showing 3–7% accuracy reflect real generation degradation, not a scoring bug.\n- PopQA measures memorized tail-entity recall and is not a good proxy for reasoning, instruction following, or in-context retrieval quality.\n","encoding":"utf-8","truncated":false,"total_bytes":5316},"status":null}