{"data":{"kind":"file","path":"README.md","version_id":"d21l74twxrwi6bo87oom39rf","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5952,"modified_at":"2026-03-26T03:02:55.211000","content_hash":"16b15200e397f890d7b024d6c2e84aee82cc5996f3248936ed5ff8c35b8a8095"},"entries":[],"content":"# learn-helpsteer3\n\n### Overview\n- **Environment ID**: `learn-helpsteer3`\n- **Short description**: Single-turn A/B preference and rubric-learning environment over the HelpSteer3 DPO-style dataset.\n- **Tags**: `preferences`, `verifiers`, `single-turn`, `xml`, `helpsteer3`\n\n### Dataset\n- **Primary dataset**: `sumuks/helpsteer3-dpo-style` on Hugging Face.\n- **Split sizes**: 23,959 train / 1,288 test (approximate; confirm on the Hub if the author updates the card).\n- **Raw row shape**: each row has a conversation transcript in `prompt` (often multi-turn, with `user:` / `assistant:` turns), a preferred `chosen` assistant response, a less-preferred `rejected` response, and a numeric `difficulty` field in `0` to `1`. The dataset does not ship `baseline_rewards`; optional fields like `baseline_rewards` are read with safe defaults, but `hard_only` requires rows that include a non-empty `baseline_rewards` mapping.\n\n### Task\n- **Type**: `single-turn`\n- **Model input**: the environment supports both direct A/B preference judging and rubric generation for later judging.\n- **Output format**: `standard_rm_preference` expects `<answer>A</answer>` or `<answer>B</answer>`, while `rupo_preference` expects `<rubric>...</rubric>`.\n- **Implemented paths**: `standard_rm_preference`, `rupo_preference`, `rupo_rubric_advantage`\n\n### Standard RM Preference Path\nThe raw dataset stores a preferred `chosen` response and a less-preferred `rejected` response. The environment converts that into a standard reward-model preference task:\n\n1. It assigns whether the preferred response appears as `A` or `B` with a deterministic 50/50 split across the loaded dataset.\n2. 
It renders a single prompt asking the model which response is better overall for the conversation context.\n3. It parses the model output with `vf.XMLParser([\"answer\"])`.\n4. It scores the rollout with exact-match accuracy on the preferred side.\n\nThe rubric also exposes an auxiliary `xml_format_score` metric so malformed answers are easy to spot during evaluation.\n\n### RUPO Preference Path\nThe `rupo_preference` path uses the rollout model to generate a rubric for the conversation context first, then scores that rubric with a separate judge model:\n\n1. The rollout model receives the conversation context and must return a rubric in `<rubric>...</rubric>` XML.\n2. The reward function parses the generated rubric from the rollout completion.\n3. A separate async OpenAI-compatible judge client evaluates the `chosen`/`rejected` pair twice using that rubric:\n   once with `chosen` as `A` and once with `chosen` as `B`.\n4. The final reward is `1.0` if both judge calls prefer the chosen response, `0.5` if exactly one does, and `0.0` otherwise.\n\nThe judge client uses:\n- a shared `AsyncOpenAI` client\n- `judge_model`, `judge_base_url`, and `judge_api_key`\n- a configurable request semaphore\n- a request timeout of `1800` seconds\n- judge completions capped at `512` tokens\n\nThe RUPO path defaults the rollout model to `2048` max completion tokens so rubric generation has enough budget. Truncated rubrics still parse as invalid and receive `0.0`.\n\nIf you pass `hard_only=true`, the RUPO training split is filtered before prompt construction to keep only examples where the stored dataset baseline reward for the resolved baseline model is at most `0.5`. That requires `baseline_rewards` on each row. 
The eval split is left unchanged.\n\n### RUPO Rubric Advantage Path\nThe `rupo_rubric_advantage` path uses the same rubric-generation and two-orientation judging flow as `rupo_preference`, but changes the scalar reward:\n\n`reward = rupo_preference_reward - baseline_model_reward`\n\nThis means the rollout is rewarded for beating the baseline stored on the dataset for the same model. Rows without `baseline_rewards` yield `baseline_model_reward = 0.0`.\n\n### Install\n\n```bash\nuv pip install -e environments/learn_helpsteer3\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | `str` | `\"sumuks/helpsteer3-dpo-style\"` | Hugging Face dataset name. |\n| `train_split` | `str` | `\"train\"` | Training split name. |\n| `test_split` | `str` | `\"test\"` | Evaluation split name. |\n| `run_setting` | `str` | `\"standard_rm_preference\"` | Environment variant to load. |\n| `judge_model` | `str` | `\"Qwen/Qwen3-4B-Instruct-2507\"` or `JUDGE_MODEL` | Judge model used by the RUPO paths. |\n| `judge_base_url` | `str` | `\"http://192.222.59.13:30000/v1/\"` or `JUDGE_BASE_URL` | OpenAI-compatible judge endpoint used by the RUPO paths. |\n| `judge_api_key` | `str` | `\"EMPTY\"` or `JUDGE_API_KEY` | Judge API key used by the RUPO paths. |\n| `judge_max_concurrent_requests` | `int` | `1024` or `JUDGE_MAX_CONCURRENT_REQUESTS` | Semaphore limit for concurrent judge requests in RUPO paths. |\n| `baseline_model_name` | `str \\| None` | `None` | Optional explicit baseline reward key. If omitted, the environment falls back to the judge model key or the sole key in `baseline_rewards`. |\n| `hard_only` | `bool` | `false` | For RUPO paths only, keep only train examples with stored baseline reward `<= 0.5`. Requires `baseline_rewards` on the dataset. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Same as `preferred_side_accuracy` for the direct path. 
|\n| `preferred_side_accuracy` | `1.0` when the parsed XML answer matches the preferred side, else `0.0`. |\n| `xml_format_score` | Parser-based XML formatting score for the expected `<answer>...</answer>` schema. |\n| `rupo_preference_reward` | Two-orientation judge score for a generated rubric: `0.0`, `0.5`, or `1.0`. |\n| `rubric_xml_format_score` | Parser-based XML formatting score for the expected `<rubric>...</rubric>` schema. |\n| `baseline_model_reward` | Baseline score stored in the row's `baseline_rewards` for the resolved baseline model; `0.0` when the row has no `baseline_rewards`. |\n| `rupo_rubric_advantage_reward` | `rupo_preference_reward - baseline_model_reward`. This is the scalar reward used by `rupo_rubric_advantage`. |\n","encoding":"utf-8","truncated":false,"total_bytes":5952},"status":null}