{"data":{"kind":"file","path":"README.md","version_id":"wmtip383n1dwokvp91ybhjqu","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":13496,"modified_at":"2026-05-15T16:09:42.385000","content_hash":"e61a350dc266601a903f1140190dcbf53455a2a565866478004ea28635ac45ac"},"entries":[],"content":"# rlm_rlvr\n\n### Overview\n- **Environment ID**: `rlm_rlvr`\n- **Short description**: Prime-native recursive RLVR environment that trains a single local model on root and recursive RLM turns, while keeping plain LLM subcalls trace-only.\n- **Tags**: `rlvr`, `recursive`, `prime-rl`, `multi-turn`\n\n### Datasets\n- **Primary dataset(s)**: `lsteno/BEEG-agents` from Hugging Face.\n- **Source links**: `dataset_id` / split args (`dataset_train_split`, `dataset_eval_split`).\n- **Split sizes**: uses dataset-provided splits; if the eval split is missing, eval falls back to a deterministic 10% train holdout.\n\n### Task\n- **Type**: multi-turn\n- **Output format expectations (optional)**: assistant text with optional ```repl``` blocks and a final `FINAL(...)` or `FINAL_VAR(...)` answer.\n- **Rubric overview**: exact match plus semantic correctness from an LLM judge, with either backward-compatible static token-cost shaping or adaptive GRPO-group cost shaping. 
Monitor metrics track recursion, subcalls, depth, token cost, and adaptive penalty state.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run rlm_rlvr\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run rlm_rlvr -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{\"data_paths\": [\"./data/my_dataset/train.parquet\"], \"eval_data_paths\": [\"./data/my_dataset/eval.parquet\"]}'\n```\n\nUse the Hugging Face dataset defaults explicitly:\n\n```bash\nprime eval run rlm_rlvr -a '{\"dataset_id\": \"lsteno/BEEG-agents\", \"max_examples\": 64, \"max_eval_examples\": 16}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- This environment reuses the published `rlms` package for prompt construction, parsing, and local REPL execution.\n- Install with Prime CLI (`prime env install rlm_rlvr -p /home/coder/prime_run/environments`) to ensure dependencies are available.\n- For Vertex-backed semantic judging and plain subcalls, set `GOOGLE_CLOUD_PROJECT`, use Application Default Credentials, and keep Gemini 3 configs on the `global` location. OpenAI-compatible providers still use `OPENROUTER_API_KEY` or the configured API-key variable.\n- `repl_backend` currently supports only `\"local\"`; remote REPL backends are future work.\n- The default prompt is `sanjaya_text_v1`. `prompt_variant=\"default\"` is kept as a compatibility alias for the same prompt.\n- Local parquet mode is opt-in: pass `data_paths` explicitly. If `eval_data_paths` is omitted, eval defaults to a deterministic 10% holdout from `data_paths`.\n- `inference_mode = \"hosted\"` is for managed hosted training. 
`inference_mode = \"local\"` is the standard setting for self-managed `prime-rl` runs on on-demand GPUs.\n- Active self-managed configs keep root and recursive RLM generations on local vLLM/Prime GPUs, while routing only non-trainable plain `llm_query*` subcalls to Vertex Gemini Flash-Lite.\n- `llm_query_batched` remains concurrent for external API fanout. `rlm_query_batched` executes recursive child RLM calls serially by default because child calls can run local REPL code and the local REPL mutates process-global cwd/stdout/stderr.\n- In the standard self-managed `prime-rl` path, the launcher handles the local inference base URL and API wiring. You do not need to set `RLM_LOCAL_INFERENCE_BASE_URL` or `RLM_LOCAL_INFERENCE_API_KEY` unless you are overriding the default local server.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_id` | `str \\| null` | `\"lsteno/BEEG-agents\"` | Hugging Face dataset identifier (primary mode) |\n| `dataset_train_split` | `str` | `\"train\"` | Training split name for Hugging Face dataset loading |\n| `dataset_eval_split` | `str` | `\"eval\"` | Evaluation split name; falls back to deterministic holdout if absent |\n| `dataset_config` | `str \\| null` | `null` | Optional Hugging Face dataset config |\n| `dataset_revision` | `str \\| null` | `null` | Optional Hugging Face revision/commit pin |\n| `data_paths` | `list[str] \\| null` | `null` | Optional explicit training parquet paths (local mode) |\n| `eval_data_paths` | `list[str] \\| null` | `null` | Optional explicit eval parquet paths; if omitted, a 10% holdout is derived from `data_paths` |\n| `seed` | `int` | `42` | Dataset shuffle seed |\n| `max_examples` | `int` | `-1` | Limit training examples after splitting |\n| `max_eval_examples` | `int` | `-1` | Limit eval examples |\n| `max_iterations` | `int` | `15` | Recursive reasoning turns per call before forcing a final answer |\n| `max_depth` | `int` | `2` | 
Maximum recursion depth |\n| `turn_max_tokens` | `int` | `192` | Max assistant tokens for each recursive reasoning turn |\n| `subcall_max_tokens` | `int` | `128` | Max assistant tokens for leaf/plain subcalls |\n| `temperature` | `float` | `1.0` | Root and recursive sampling temperature |\n| `top_p` | `float` | `1.0` | Root and recursive nucleus sampling |\n| `tokenizer_name` | `str \\| null` | `null` | Optional tokenizer override; defaults to the rollout model name |\n| `prompt_variant` | `str` | `\"sanjaya_text_v1\"` | System prompt variant. Supported values: `sanjaya_text_v1`, `default` where `default` is a compatibility alias |\n| `live_trace_dir` | `str \\| null` | `\"outputs/rlm_rlvr/live_traces\"` | Directory for compact per-sample live traces updated after root and recursive steps. Set to `null` to disable |\n| `subcall_prompt_limit_ratio` | `float` | `0.85` | Blocks `llm_query*` and `rlm_query*` prompts whose estimated character size exceeds this fraction of the configured subcall context window, returning a REPL-visible error instead of truncating context |\n| `efficiency_penalty_mode` | `str` | `\"static_per_1k\"` | Reward shaping mode. Use `\"static_per_1k\"` for backward-compatible per-1k-token penalty or `\"adaptive_group\"` for solve-rate-aware group scoring |\n| `efficiency_penalty_coef` | `float` | `0.02` | Static-mode cost-aware shaping coefficient applied only to correct answers. 
Incorrect/no-answer rollouts receive `0`; correct rollouts receive `max(0, 1 - efficiency_penalty_coef * total_tokens / 1000)` |\n| `adaptive_efficiency_beta_max` | `float` | `0.05` | Maximum adaptive cost coefficient in group mode |\n| `adaptive_efficiency_gamma` | `float` | `2.0` | Exponent for the solve-rate ramp in group mode |\n| `adaptive_efficiency_solve_rate_floor` | `float` | `0.25` | No adaptive cost pressure is applied when group solve rate is at or below this floor |\n| `adaptive_efficiency_cost_basis` | `str` | `\"total_tokens\"` | Token-cost basis for adaptive group penalty. Currently supports total rollout tokens |\n| `inference_mode` | `str` | `\"hosted\"` | Inference routing mode. Use `hosted` for managed hosted training and `local` for self-managed `prime-rl` on local or on-demand GPUs |\n| `inference_base_url` | `str \\| null` | `null` | Override the OpenAI-compatible inference endpoint. Usually unset for self-managed `prime-rl`, which wires the local inference server automatically |\n| `inference_api_key` | `str \\| null` | `null` | Override API key for the inference endpoint. Usually unset for self-managed `prime-rl` local inference |\n| `llm_subcall_provider` | `str` | `\"openai_compatible\"` | Provider for plain non-trainable `llm_query*` calls. Use `\"vertex\"` for Vertex AI Gemini |\n| `llm_subcall_model` | `str \\| null` | `null` | Optional separate model for plain `llm_query*` calls; `null` reuses the rollout endpoint |\n| `llm_subcall_vertex_project_env` | `str` | `\"GOOGLE_CLOUD_PROJECT\"` | Environment variable that stores the Vertex project for plain subcalls |\n| `llm_subcall_vertex_location` | `str \\| null` | `\"global\"` | Vertex location for plain subcalls. 
Active Gemini 3 configs use `global` |\n| `llm_subcall_thinking_level` | `str \\| null` | `\"medium\"` | Gemini thinking level for plain Vertex subcalls |\n| `llm_subcall_empty_response_max_attempts` | `int` | `1` | Maximum number of attempts for a Vertex plain subcall that returns an empty response |\n| `llm_subcall_empty_response_base_retry_seconds` | `float` | `1.0` | Base exponential-backoff delay, in seconds, for empty plain-subcall retries |\n| `llm_subcall_empty_response_max_retry_seconds` | `float` | `30.0` | Maximum backoff delay, in seconds, for empty plain-subcall retries |\n| `llm_subcall_base_url` | `str \\| null` | `null` | OpenAI-compatible plain subcall provider base URL |\n| `llm_subcall_api_key_var` | `str` | `\"OPENROUTER_API_KEY\"` | Environment variable that stores the API key for OpenAI-compatible plain subcalls |\n| `judge_provider` | `str` | `\"openai_compatible\"` | Provider for binary semantic judging. Use `\"vertex\"` for Vertex AI Gemini |\n| `judge_model` | `str` | `\"z-ai/glm-5\"` | Model used for binary semantic judging. Active Vertex configs use `gemini-3-flash-preview` |\n| `judge_vertex_project_env` | `str` | `\"GOOGLE_CLOUD_PROJECT\"` | Environment variable that stores the Vertex project for judging |\n| `judge_vertex_location` | `str \\| null` | `\"global\"` | Vertex location for judging. Gemini 3 Flash requires `global` |\n| `judge_thinking_level` | `str \\| null` | `\"medium\"` | Gemini thinking level for Vertex judging |\n| `judge_base_url` | `str` | `\"https://openrouter.ai/api/v1\"` | OpenAI-compatible judge provider base URL |\n| `judge_api_key_var` | `str` | `\"OPENROUTER_API_KEY\"` | Environment variable that stores the OpenAI-compatible judge API key |\n| `judge_http_referer` | `str \\| null` | `null` | Optional OpenAI-compatible `HTTP-Referer` header |\n| `judge_app_title` | `str \\| null` | `null` | Optional OpenAI-compatible `X-Title` header |\n| `repl_backend` | `str` | `\"local\"` | RLM REPL backend. 
Only `local` is currently supported |\n| `repl_backend_kwargs` | `dict \\| null` | `null` | Reserved for future backend-specific kwargs |\n| `repl_timeout_seconds` | `float \\| null` | `null` | Optional wall-clock timeout for generated REPL code blocks that call `llm_query*` or `rlm_query*` helpers |\n| `repl_fast_timeout_seconds` | `float \\| null` | `null` | Optional shorter wall-clock timeout for generated REPL code blocks with no LLM/RLM subcalls |\n| `recursive_rlm_batch_mode` | `str` | `\"serial\"` | Execution mode for `rlm_query_batched`. Default `\"serial\"` avoids local REPL thread-safety hazards; `\"thread\"` is an explicit advanced override |\n\n### Metrics\nThe rubric emits the following metrics.\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Zero for incorrect/no-answer rollouts; for correct rollouts, 1 minus the optional static or adaptive token-cost penalty, clipped at zero |\n| `correctness` | Raw binary judge score before cost shaping |\n| `judge_score` | Binary judge score for observability |\n| `efficiency_penalty` | Static or adaptive token-cost penalty subtracted from reward |\n| `cost_prompt_tokens` | Total prompt tokens consumed across all root turns, recursive turns, and subcalls |\n| `cost_completion_tokens` | Total completion tokens consumed across all root turns, recursive turns, and subcalls |\n| `cost_total_tokens` | Sum of prompt and completion tokens used for cost shaping |\n| `cost_trainable_tokens` | Token count from trainable root/recursive/finalize RLM turns |\n| `cost_plain_subcall_tokens` | Token count from non-trainable plain `llm_query*` subcalls |\n| `adaptive_group_solve_rate` | Within-prompt group correctness rate used by adaptive group mode |\n| `adaptive_beta` | Solve-rate-dependent adaptive cost coefficient |\n| `adaptive_normalized_cost` | Min-max normalized cost among correct rollouts in the group |\n| `adaptive_cost_penalty` | Adaptive cost penalty applied to this rollout |\n| `used_repl` 
| Fraction of rollouts that executed at least one REPL block |\n| `used_recursion` | Fraction of rollouts that invoked any `llm_query(...)` or `rlm_query(...)` subcall |\n| `used_llm_subcalls` | Fraction of rollouts that used at least one plain `llm_query(...)` subcall |\n| `used_rlm_subcalls` | Fraction of rollouts that used at least one recursive `rlm_query(...)` subcall |\n| `num_subcalls` | Total number of LLM plus RLM subcalls executed in the rollout |\n| `num_llm_subcalls` | Number of plain `llm_query(...)` subcalls executed in the rollout |\n| `num_rlm_subcalls` | Number of recursive `rlm_query(...)` subcalls executed in the rollout |\n| `max_depth_reached` | Deepest aggregate recursion depth reached, with plain LLM subcalls counted as depth 1 |\n\n### Prime-RL Notes\n- The environment emits flattened recursive segments in `rlm_segments` with explicit call provenance. Prime-RL trains only segments marked as RLM-owned turns (`root_turn`, `recursive_turn`, `finalize_turn`) and skips plain `llm_query(...)` subcalls.\n- With `prime eval run -s`, completed rollout rows are written incrementally to `environments/rlm_rlvr/outputs/evals/<env>--<model>/<run_id>/results.jsonl`.\n- Long in-flight rollouts also update compact live trace files under `outputs/rlm_rlvr/live_traces/<prompt_variant>/<source_id>.json`. These include assistant text, executed code blocks, REPL feedback, final answer state, subcall counters, and compact segment token counts without duplicating full token id arrays.\n- Keep `orchestrator.use_token_client = false` for this environment. 
Recursive rollouts use message-based chat completions; the token-in/token-out endpoint is for linear TITO and prefill flows.\n- For local SFT warmup on an 8xH100 node, see `/home/coder/prime_run/configs/rlm_sft/README.md` and `/home/coder/prime_run/configs/rlm_sft/local_h100x8_qwen3_4b.toml`.\n\n### Development\n\nRun the environment tests with uv:\n\n```bash\nuv run --project environments/rlm_rlvr --group dev pytest environments/rlm_rlvr/tests -q\n```\n","encoding":"utf-8","truncated":false,"total_bytes":13496},"status":null}