{"data":{"kind":"file","path":"README.md","version_id":"njmq27nn09lcibvja6bjqcoe","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5290,"modified_at":"2026-04-20T09:46:06.731000","content_hash":"b45e36ddb987016654f3a6ca99ed2d06d3a5f6c5381a9f8465c865c6e2c3eca6"},"entries":[],"content":"# longcot\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/longcot\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\nLongCoT is a long-horizon reasoning benchmark evaluated with deterministic programmatic verifiers across logic, computer science, chemistry, chess, and math tasks.\n\n### Overview\n- **Environment ID**: `longcot`\n- **Short description**: Hugging Face-backed LongCoT benchmark port with deterministic verification.\n- **Tags**: single-turn, eval, reasoning, long-context, deterministic\n\n### Datasets\n- **Primary dataset(s)**: [`LongHorizonReasoning/longcot`](https://huggingface.co/datasets/LongHorizonReasoning/longcot) on Hugging Face\n- **Domains**: `logic`, `cs`, `chemistry`, `chess`, `math`\n- **Difficulties**: `easy`, `medium`, `hard`\n- **Default benchmark split**: `longcot` -> `medium` + `hard`\n- **Easy split**: `easy`\n- **Verifier metadata**: The environment carries a compact packaged metadata artifact for the structured fields needed by deterministic logic and chess verifiers because the public HF rows are flattened for browsing.\n\n### Task\n- **Type**: single-turn\n- **Parser**: Custom `LongCoTParser` that preserves the `solution = ...` contract\n- **Rubric overview**: Deterministic benchmark verification via the vendored LongCoT verifier stack; also records whether the response included a parseable `solution = ...` line\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run longcot\n```\n\nRun a small smoke test on the easy split:\n\n```bash\nprime eval run longcot \\\n  -n 1 -r 1 -t 512 \\\n  -a '{\"split\":\"easy\",\"domain\":\"cs\",\"max_examples\":5}'\n```\n\nEquivalent `vf-eval` flow during local development:\n\n```bash\nuv pip install -e ./environments/longcot\nuv run vf-eval longcot -n 1 -r 1 -d -v -t 512 -a '{\"split\":\"easy\",\"domain\":\"cs\",\"max_examples\":5}'\n```\n\nNotes:\n- This environment loads benchmark examples from the Hugging Face dataset release rather than reading prompt/answer rows from packaged per-domain JSON files.\n- Grading is deterministic-only in the environment path; it does not include any LLM fallback judges or direct URL-based verifier calls.\n- The environment relies on the standard Prime-backed `vf-eval` / `prime eval run` model path and therefore expects `PRIME_API_KEY` to be set for real model execution.\n- Eval rows keep structured verifier metadata in `info`, including `info.reference_source` to indicate whether the verifier grades against top-level `answer` or structured `problem.*` fields.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | `str \\| None` | `None` | Optional system prompt prepended to each rollout |\n| `split` | `str` | `\"longcot\"` | Benchmark split selector: `longcot`, `easy`, `medium`, `hard`, or `all` |\n| `domain` | `str \\| None` | `None` | Optional domain filter: `logic`, `cs`, `chemistry`, `chess`, or `math` |\n| `max_examples` | `int \\| None` | `None` | Optional cap for the number of loaded examples |\n| `shuffle` | `bool` | `False` | Shuffle examples before applying `max_examples` |\n| `seed` | `int \\| None` | `None` | Random seed used when `shuffle=true` |\n\n### Record Metadata\n- `info.problem`: serialized structured verifier metadata used by deterministic grading\n- `info.reference_source`: where the verifier reads its gold target from\n  - `answer` for standard answer-based templates\n  - `problem.solution` for scalar logic templates\n  - `problem.instance` for state-based templates such as `BlocksWorld`, `RandomHanoi`, `Sokoban`, and `Sudoku`\n  - `problem.final_fen+forced_moves` for chess `reconstruct_moves`\n- Top-level `answer` may therefore be `null` for templates whose success condition is defined by structured metadata rather than a single canonical answer string\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `_deterministic_verify_response` | 1.0 if the bundled deterministic verifier marks the response correct, else 0.0 |\n| `has_solution_line` | 1.0 if the response contains a parseable `solution = ...` line, else 0.0 |\n| `reward` | Weighted benchmark reward; equal to deterministic correctness |\n\n### Changelog\n\n### v0.1.0\n- Initial publishable `research-environments` port of LongCoT\n- Loads benchmark rows from the public Hugging Face dataset and keeps only verifier-side structured metadata in-package\n- Supports canonical split selectors (`longcot`, `easy`, `medium`, `hard`, `all`)\n\n### Citation\n\nIf you use this environment, please cite:\n\n```bibtex\n@article{motwani2026longcot,\n  title         = {LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning},\n  author        = {Motwani, Sumeet Ramesh and Nichols, Daniel and London, Charles and Li, Peggy and Pizzati, Fabio and Blake, Acer and Hammoud, Hasan and McDonald, Tavish and Naik, Akshat and Ivanova, Alesia and Baskaran, Vignesh and Laptev, Ivan and Glatt, Ruben and Ben-Nun, Tal and Torr, Philip and Jaques, Natasha and Prabhu, Ameya and Bartoldson, Brian and Kailkhura, Bhavya and Schroeder de Witt, Christian},\n  year          = {2026},\n  eprint        = {2604.14140},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.LG},\n  url           = {https://arxiv.org/abs/2604.14140}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5290},"status":null}