{"data":{"kind":"file","path":"README.md","version_id":"c5xgvsomsenk7ooow22jhctt","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10647,"modified_at":"2026-04-23T07:33:19.097000","content_hash":"83489b274fa73840f46e9a8eb7976997f9302b06830fb3f54746f6487b067807"},"entries":[],"content":"# longcot-rlm\n\n### Overview\n\n- **Environment ID**: `longcot-rlm`\n- **Short description**: LongCoT long-horizon reasoning benchmark using RLM (Recursive Language Model) with Python REPL\n- **Tags**: reasoning, rlm, python, multi-turn, repl, math, logic, chemistry, chess, cs\n\n### How It Works\n\nThis environment implements the [LongCoT benchmark](https://huggingface.co/datasets/LongHorizonReasoning/longcot) for evaluating long-horizon reasoning, using the `RLMEnv`.\n\nEach question in LongCoT is self-contained: the prompt embeds the full task (a chess position, logic puzzle, chemistry subproblem chain, CS algorithm trace, or chained-math problem) and instructs the model to return its answer in the form `solution = <answer>`.\n\nThe model receives that prompt in the user message and has access to a Python REPL in a sandbox. It can install / use:\n\n- `python-chess` for chess templates (FEN, SAN move generation, `board_fen()`).\n- `rdkit` for chemistry SMILES templates (canonicalization).\n- `sympy` for math templates (symbolic equivalence).\n\nScoring is delegated to the upstream [`longcot`](https://github.com/LongHorizonReasoning/longcot) package's `verify()` — the exact template-dispatched verifier used by the reference harness. The full `problem` dict for each question (needed by logic + some chess verifiers) comes from the JSON files bundled inside the `longcot` package, not from the HF parquet (which omits `problem` metadata).\n\n### Dataset\n\n- [LongHorizonReasoning/longcot](https://huggingface.co/datasets/LongHorizonReasoning/longcot) — 2,502 questions across 5 domains × 3 difficulties. Questions ship inside the `longcot` package, not loaded from HF.\n- Domains: `logic`, `cs`, `chemistry`, `chess`, `math`.\n- Difficulties: `easy`, `medium`, `hard`.\n- Templates (dispatched to domain-specific verifiers):\n  - **logic**: `BlocksWorld`, `Dungeon`, `PackagingMinWaste`, `RandomHanoi`, `Sokoban`, `Sudoku`, `TrapezoidCounting`, `WizardsTotalStrength`\n  - **cs**: `HM`, `MFMC`, `Scheduling`, `TM`, `MCM`, `LLVM`, `Backprop`, `DistMem`, `VLIW`, `CodeTrace`\n  - **chemistry**: `easy1`, `easy2`, `med1`–`med4`, `hard1`–`hard4`\n  - **chess**: `uci_to_fen`, `piece_combinations`, `reconstruct_moves`, `best_3_moves`, `best_move`, `knight_path`, `knight_path_enemy`, `knight_game`, `max_rooks`, `forced_checkmate`\n  - **math**: `linear`, `dag`, `dag_first`, `conditional`, `backtracking`\n\n### Quickstart\n\n```bash\n# GPT-5.2 on longcot-mini (easy split, ~500 questions) — the upstream \"mini\" benchmark\nuv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{\"include_env_tips\": true, \"benchmark\": \"longcot-mini\"}'\n\n# Same model on the fixed 25-question mini slice (5 per domain); IDs are ``MINI_BALANCED_EVAL_QUESTION_IDS`` in ``longcot_rlm.py``\nuv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 25 -r 1 -a '{\"include_env_tips\": true, \"use_mini_balanced_eval\": true}'\n\n# GPT-5.2 on the full longcot benchmark (medium + hard, ~2,000 questions)\nuv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2000 -r 1 -a '{\"include_env_tips\": true, \"benchmark\": \"longcot\"}'\n\n# All splits (easy + medium + hard)\nuv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2500 -r 1 -a '{\"benchmark\": \"all\"}'\n\n# Just math\nuv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{\"include_env_tips\": true, \"benchmark\": \"longcot-mini\", \"domain\": \"math\"}'\n\n# Chess only, medium+hard\nuv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{\"domain\": \"chess\", \"difficulty\": [\"medium\", \"hard\"]}'\n\n# A single template\nuv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{\"template\": \"BlocksWorld\"}'\n\n# With environment tips + shuffling\nuv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{\"include_env_tips\": true, \"shuffle\": true}'\n\n# Enable Gemini fallback judges (needs GEMINI_API_KEY / GOOGLE_API_KEY)\nuv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{\"math_enable_fallback\": true, \"chemistry_enable_fallback\": true}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `benchmark` | `\"longcot-mini\"` \\| `\"longcot\"` \\| `\"all\"` \\| None | `None` | Upstream benchmark alias. `\"longcot-mini\"` = easy (~500), `\"longcot\"` = medium + hard (~2,000), `\"all\"` = every split. Mutually exclusive with `difficulty`. |\n| `use_mini_balanced_eval` | bool | `False` | If `True`, sets `benchmark` to `\"longcot-mini\"` and `question_id` to the fixed 25-question list (five per domain). Mutually exclusive with `question_id` and `difficulty`. |\n| `domain` | str \\| list[str] \\| None | `None` | Domain filter: `\"logic\"`, `\"cs\"`, `\"chemistry\"`, `\"chess\"`, `\"math\"`, or a list. `None` = all. |\n| `difficulty` | str \\| list[str] \\| None | `None` | Difficulty filter: `\"easy\"`, `\"medium\"`, `\"hard\"`, or a list. `None` = all. Mutually exclusive with `benchmark`. |\n| `template` | str \\| list[str] \\| None | `None` | Optional template-name filter (e.g. `\"BlocksWorld\"`, `\"uci_to_fen\"`, `\"linear\"`). |\n| `shuffle` | bool | `False` | Whether to shuffle the dataset. |\n| `seed` | int \\| None | `None` | Random seed for shuffling. |\n| `max_examples` | int \\| None | `None` | Maximum number of examples (None = all). |\n| `include_env_tips` | bool | `False` | Append strategy tips (wrapped in `<env_tips>`) to the prompt. |\n| `prompt_in_context_file` | bool | `False` | If `True`, stash the prompt inside the RLM context file (`{\"query\": prompt, \"context\": \"\"}`) and leave the user message empty. |\n| `exclude_broken_easy_math_ids` | bool | `True` | **Temporary** — drops the 21 easy-math question IDs flagged as wrong/impossible in [LongHorizonReasoning/longcot#4](https://github.com/LongHorizonReasoning/longcot/issues/4) so they don't contaminate longcot-mini scoring. Remove once upstream fixes the dataset. |\n| `math_enable_fallback` | bool | `False` | Enable the upstream Gemini fallback judge for math equivalence. |\n| `chemistry_enable_fallback` | bool | `False` | Enable the upstream Gemini fallback SMILES extractor. |\n| `math_numeric_fallback` | bool | `True` | Local numeric-equivalence fallback for math templates. Runs only when the upstream verifier rejects, and accepts component pairs whose 30-digit SymPy evaluation agrees to 1e-12 relative tolerance — catches formatting differences like `1.01^100` ↔ `(101/100)^100` that `sp.simplify` misses because of Float/Rational type mixing. |\n| `math_textual_judge_model` | str \\| None | `None` | OpenAI-compatible model ID for a per-component textual-equivalence judge (e.g. `\"openai/gpt-5-nano\"`). Invoked only for components where both sides are textual (free-form families of solutions, set descriptions) so the numeric / SymPy paths can't decide. `None` disables the judge. |\n| `math_textual_judge_api_key_var` | str | `\"PRIME_API_KEY\"` | Env var holding the API key for the textual judge. |\n| `math_textual_judge_base_url` | str \\| None | `\"https://api.pinference.ai/api/v1\"` | Base URL for the textual judge. |\n| `repl_language` | Literal[\"bash\", \"python\"] | `\"python\"` | REPL language for the RLM. |\n| `max_turns` | int | `30` | Maximum REPL iterations. |\n| `sub_llm_max_turns` | int | `5` | Max tool-calling turns for each sub-LLM call. |\n| `sub_model` | str \\| None | `None` | Model for sub-LLM calls (defaults to same as root model). |\n| `max_sub_llm_parallelism` | int | `5` | Max concurrent sub-LLM calls. |\n| `max_output_length` | int | `8192` | Maximum code execution output length. |\n| `code_execution_timeout` | int | `600` | Timeout in seconds for a single REPL call. Also bounds the sandbox-side HTTP timeout on `llm_batch` (upstream uses `code_execution_timeout - 5`). LongCoT uses 600 (vs 120 for other RLM envs) because GPT-5.2 with high reasoning regularly takes 90–300s on hard competition-math sub-problems — at 120 roughly one-fifth of `llm_batch` calls time out. |\n| `abort_on_code_timeout` | bool | `False` | If True, abort rollout on code timeout. |\n| `max_startup_wait_seconds` | int | `120` | Max seconds to wait for sandbox startup. |\n| `pip_install_packages` | str | `\"rdkit chess sympy numpy\"` | Packages to install in sandbox. |\n| `sandbox_docker_image` | str | `\"python:3.11-slim\"` | Docker image for sandbox. |\n| `sandbox_cpu_cores` | int | `1` | CPU cores for sandbox. |\n| `sandbox_memory_gb` | int | `2` | Memory in GB for sandbox. |\n| `sandbox_disk_size_gb` | int | `5` | Disk size in GB for sandbox. |\n| `sandbox_gpu_count` | int | `0` | Number of GPUs for sandbox. |\n| `sandbox_timeout_minutes` | int | `60` | Overall sandbox lifetime in minutes. |\n\n### Metrics\n\nThe rubric calls `longcot.verify(question, final_answer, options)` and emits `1.0` for correct, `0.0` otherwise. Per-template scoring:\n\n- **Math** (`linear`, `dag`, `dag_first`, `conditional`, `backtracking`): SymPy-based list equivalence. On upstream rejection, a **per-component** fallback runs, trying in order:\n  1. longcot's own SymPy compare (already the upstream behavior).\n  2. Local numeric equivalence (30-digit precision, 1e-12 relative tolerance) — catches `1.01^100` ↔ `(101/100)^100`, `1/2` ↔ `0.5`, etc., which the upstream rejects because `sp.simplify(Float - Rational)` returns ~1e-15 rather than exact 0.\n  3. If `math_textual_judge_model` is configured, an LLM judge is invoked for textual components (free-form families of solutions, set descriptions) — e.g. gold `\"All polynomials of the form f(x)=x^m for some m∈ℤ^+ and f(x)=c for some c∈ℤ^+ with ω(c)≤2023^{2023}+1\"` vs predicted `\"P(x)=x^k (k∈ℤ_{≥1}) or P(x)=c with c∈ℤ_{>0} and ω(c)≤2023^{2023}+1\"`.\n  4. Optional upstream Gemini fallback (`math_enable_fallback=True`) for the whole list.\n- **Chemistry SMILES** (`easy1`, `easy2`, `med3`, `hard3`): RDKit canonicalization match; optional Gemini fallback to extract SMILES from noisy output.\n- **Chemistry list** (`med1`, `med2`, `med4`, `hard1`, `hard2`, `hard4`): element-wise equality (int/string/mixed).\n- **Chess**: FEN piece-placement equality, SAN token equality, replay-to-final-FEN, or integer equality depending on template.\n- **CS**: strict JSON/dict equality, integer equality, or int-list equality.\n- **Logic**: full simulation of the puzzle against `problem[\"instance\"]` with state verification.\n\nThe model should include `solution = <answer>` somewhere in its final answer (verifiers also have fallbacks that scan the whole response when that marker is missing).\n\n### Changelog\n\n- 0.1.1: Added `MINI_BALANCED_EVAL_QUESTION_IDS` and `use_mini_balanced_eval` for a reproducible 25-question longcot-mini slice (5 per domain).\n- 0.1.0: Initial RLM version using the upstream `longcot.verify` for template-dispatched scoring; supports `domain`, `difficulty`, and `template` filtering.\n","encoding":"utf-8","truncated":false,"total_bytes":10647},"status":null}