{"data":{"kind":"file","path":"README.md","version_id":"tygujryzm2f913qtn6p5apbf","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4896,"modified_at":"2026-03-13T21:08:27.820000","content_hash":"ee2264f6fa82ef934807625f5314e78df80337926f93ca0daacf8b0431862f6e"},"entries":[],"content":"# Discover GSM8K\n\nRubric-discovery environment for GSM8K.\n\nLearn a scoring function `rubric_fn(prompt, completion) -> float` from `(prompt, completion, score)` examples. Evaluation is on held-out test examples (agreement rate; Spearman vs GT).\n\n## Overview\n\n- **Environment ID**: `discover_gsm8k`\n- **Type**: multi-turn + tools (RLM)\n- **Goal**: infer and implement `rubric_fn` from train examples, evaluated on held-out test examples\n- **Key tool**: `get_rubric_run_result(fn_code, examples)` to run candidate rubric on examples\n\n## Quickstart\n\n### Run (from repo root)\n\nSet `PRIME_API_KEY` for sandbox-backed runs, then:\n\n```bash\nprime eval run discover-gsm8k -m gpt-4.1-mini -a '{\"dataset_path\": \"data/data.jsonl\"}'\n```\n\n### Run (from `discover_gsm8k/`)\n\n```bash\ncd discover_gsm8k\nuv sync\nuv run prime eval run discover-gsm8k -a '{\"dataset_path\": \"data/data.jsonl\"}'\n```\n\n### Hosted RL (2-example smoke test)\n\nTo run a minimal RL job on 2 examples via [Lab Hosted Training](https://docs.primeintellect.ai):\n\n1. Push the environment to the Hub (once):\n   ```bash\n   prime env push --path . -v PRIVATE\n   ```\n2. In `config/rl_test_2examples.toml`, set `[[env]].id` to your Hub env (e.g. `YOUR_USERNAME/discover_gsm8k`).\n3. Start the run:\n   ```bash\n   prime rl run config/rl_test_2examples.toml\n   ```\n\nThe config uses `max_examples = 2`, `max_steps = 2`, and `batch_size = 2` for a quick smoke test.\n\n## Config\n\n`load_environment(config)` accepts a dict (or `Config`) with:\n\n- `**dataset_path**` (`str`, required): path to JSONL (e.g. 
`data/data.jsonl`)\n- `**max_train_per_task**` (`int | None`, default `2`): max train `(input, response, score)` examples per task in `contexts/<i>/task.json`; use `None` to use all\n- `**max_test_per_task**` (`int | None`, default `5`): max test examples per task (in state for reward); use `None` to use all\n- `**rlm_model**` (`str`, default `\"gpt-4.1-mini\"`): sub-LLM for RLM\n- `**max_turns**` (`int`, default `100`): max RLM iterations\n- `**max_examples**` (`int | None`, default `None`): cap number of tasks (rows)\n- `**timeout_s**` (`int`, default `30`): code execution timeout\n- `**margin**` (`float`, default `0.3`): agreement threshold: `|pred - expected| <= margin`\n- `**parallelism**` (`int`, default `5`): max parallel sub-LLM calls\n\nExample: limit contexts to 3 train and 2 test per task:\n\n```bash\nprime eval run discover-gsm8k -a '{\"dataset_path\": \"data/data.jsonl\", \"max_train_per_task\": 3, \"max_test_per_task\": 2}'\n```\n\n## Data format\n\nJSONL rows contain:\n\n- `**train_examples**`: list of `{ prompt, completion, score }`\n- `**test_examples**`: list of `{ prompt, completion, score }`\n- `**task_hint**` (optional): string\n\nThis matches the verifiers naming convention:\n\n- `**prompt**`: input text (what the model is asked to score / respond to)\n- `**completion**`: model output text being evaluated\n\n## Dataset generation\n\nDataset rows can be generated from verifiers-based source environments using `scripts/generate_dataset.py` and a YAML config.\n\n### Requirements on source environments\n\nAssuming the environment is implemented on top of `verifiers` and can be loaded with `verifiers.load_environment(env_id)`:\n\n- **Dataset access**\n  - The env must implement either `get_dataset(n, seed)` or `get_eval_dataset(n, seed)`.\n  - Each dataset row must expose at least:\n    - `prompt`: either\n      - a string, or\n      - a list of messages like `[{ \"role\": \"user\", \"content\": \"...\"}, ...]` with at least one non-empty user 
message.\n    - Optionally `answer`, `task`, and `info` (used when building the scoring `State`).\n- **Rubric / scoring**\n  - The env must define a rubric such that `env.rubric.score_group([state])`:\n    - runs without error, and\n    - **sets** `state[\"reward\"]` **to a numeric score in [0, 1] for the** `(prompt, completion, answer, info, task)` tuple.\n\n### YAML config (single or multiple source envs)\n\nExample (one single-env config per run):\n\n```bash\nuv run scripts/generate_dataset.py --config config/envs_gsm8k.yaml\nuv run scripts/generate_dataset.py --config config/envs_ifeval.yaml\n```\n\nExample config (multiple envs in one file):\n\n```yaml\nout: data/mixed.jsonl\n\nenvs:\n  - source_env: primeintellect/gsm8k\n    n: 50\n    train_per_task: 2\n    test_per_task: 2\n  - source_env: arcee-ai/ifeval\n    n: 50\n    train_per_task: 2\n    test_per_task: 2\n```\n\nFor each entry under `envs`:\n\n- `**source_env**` (required): environment id string passed to `verifiers.load_environment`.\n- `**n**` (optional, default `50`): number of source examples to sample.\n- `**train_per_task**`, `**test_per_task**` (optional): caps on the number of train / test `(prompt, completion, score)` examples per JSONL row.\n- `**responses_per_example**`, `**train_ratio**`, `**temperatures**`, `**seed**`, `**task_hint**` (optional): advanced knobs; see `scripts/generate_dataset.py` for details and validation rules.\n\n## Development\n\n- **Package manager**: `uv`\n- **Lint**: `ruff`\n\n","encoding":"utf-8","truncated":false,"total_bytes":4896},"status":null}