{"data":{"kind":"file","path":"README.md","version_id":"heb3vvhvdljyouomsuzh574b","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4313,"modified_at":"2026-04-20T09:20:54.612000","content_hash":"b423be5335757be1b16915bf8bec4b25353603b9ece22b8f27721532df665402"},"entries":[],"content":"# longcot_rlm\n\nLongCoT as an RLM environment. The benchmark, dataset selection, and deterministic verifier behavior are the same as `longcot`; the difference is that examples are presented through an RLM sandbox so the model can use a Python REPL before submitting its final answer.\n\n## Overview\n\n- **Environment ID**: `longcot_rlm`\n- **Task type**: multi-turn RLM / sandboxed REPL\n- **Dataset**: [`LongHorizonReasoning/longcot`](https://huggingface.co/datasets/LongHorizonReasoning/longcot)\n- **Domains**: `logic`, `cs`, `chemistry`, `chess`, `math`\n- **Default split**: `longcot` (`medium` + `hard`)\n- **Easy split**: `easy`\n\nThe original LongCoT prompt is stored in `info.raw_question` and placed in `context.txt`. The model is instructed to solve the task there and must still produce a final response containing a parseable `solution = ...` line so the bundled deterministic verifier can grade it exactly like the single-turn version.\n\nMost domains keep their canonical reference in the top-level `answer` field, but some templates intentionally do not. In particular, public `logic` rows use `answer = null` because correctness is defined by structured verifier metadata such as `problem.solution`, `problem.instance`, or chess-specific fields like `problem.final_fen` and `problem.forced_moves`. The environment surfaces that contract in `info.reference_source` so downstream tooling can tell whether a row is graded from `answer` or from structured metadata.\n\n## Quickstart\n\n```bash\nuv pip install -e ./environments/longcot_rlm\nuv run vf-eval longcot_rlm -n 1 -r 1 -d -v -a '{\"split\":\"easy\",\"domain\":\"cs\",\"max_examples\":2}'\n```\n\n## Environment Arguments\n\n### Dataset selection\n\n- `system_prompt`: optional system prompt prepended to the RLM conversation\n- `split`: one of `longcot`, `easy`, `medium`, `hard`, `all`\n- `domain`: optional filter among `logic`, `cs`, `chemistry`, `chess`, `math`\n- `max_examples`: optional limit after filtering\n- `shuffle`: whether to shuffle the selected examples\n- `seed`: random seed used for shuffling\n- `include_env_tips`: append LongCoT-specific strategy tips to the prompt\n- `prompt_in_context_file`: if `true`, store both the short instruction and the full task inside the context payload\n\n### Record metadata\n\n- `info.problem`: serialized structured verifier metadata used by deterministic grading\n- `info.reference_source`: where the verifier reads its gold target from\n  - `answer` for standard answer-based templates\n  - `problem.solution` for scalar logic templates\n  - `problem.instance` for state-based templates such as `BlocksWorld`, `RandomHanoi`, `Sokoban`, and `Sudoku`\n  - `problem.final_fen+forced_moves` for chess `reconstruct_moves`\n- Top-level `answer` may therefore be `null` for templates whose success condition is defined by structured metadata rather than a single canonical answer string\n\n### RLM options\n\n- `max_turns`\n- `sub_llm_max_turns`\n- `sub_model`\n- `max_sub_llm_parallelism`\n- `max_output_length`\n- `code_execution_timeout`\n- `abort_on_code_timeout`\n- `max_startup_wait_seconds`\n- `pip_install_packages`\n- `repl_language`\n\n### Sandbox options\n\n- `sandbox_docker_image`\n- `sandbox_cpu_cores`\n- `sandbox_memory_gb`\n- `sandbox_disk_size_gb`\n- `sandbox_gpu_count`\n- `sandbox_timeout_minutes`\n\n## Metrics\n\n- `deterministic_verify_final_answer`: 1.0 when the bundled LongCoT verifier marks the final answer correct\n- `has_solution_line_from_state`: 1.0 when the submitted final answer contains a parseable `solution = ...` line\n- `reward`: equal to deterministic correctness\n\n## Citation\n\nIf you use this environment, please cite:\n\n```bibtex\n@article{motwani2026longcot,\n  title         = {LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning},\n  author        = {Motwani, Sumeet Ramesh and Nichols, Daniel and London, Charles and Li, Peggy and Pizzati, Fabio and Blake, Acer and Hammoud, Hasan and McDonald, Tavish and Naik, Akshat and Ivanova, Alesia and Baskaran, Vignesh and Laptev, Ivan and Glatt, Ruben and Ben-Nun, Tal and Torr, Philip and Jaques, Natasha and Prabhu, Ameya and Bartoldson, Brian and Kailkhura, Bhavya and Schroeder de Witt, Christian},\n  year          = {2026},\n  eprint        = {2604.14140},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.LG},\n  url           = {https://arxiv.org/abs/2604.14140}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4313},"status":null}