{"data":{"kind":"file","path":"README.md","version_id":"ytu8omlir51l4ag96xek3tf0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5053,"modified_at":"2026-03-26T03:03:00.683000","content_hash":"8ebee38bbe592fe614fb4952c95fc46a25fbf43e2abcc442ac42ab77aa3db4c3"},"entries":[],"content":"# learn-helpsteer3-pointwise\n\n### Overview\n- **Environment ID**: `learn-helpsteer3-pointwise`\n- **Short description**: Single-turn pointwise scoring and rubric-learning environment over the HelpSteer3 DPO-style dataset.\n- **Tags**: `pointwise`, `verifiers`, `single-turn`, `xml`, `helpsteer3`\n\n### Dataset\n- **Primary dataset**: `sumuks/helpsteer3-dpo-style`\n- **Split sizes**: 23,959 train / 1,288 eval (approximate; confirm on the Hub if the author updates the card).\n- **Raw row shape**: each row has a conversation transcript in `prompt`, a preferred `chosen` response, a less-preferred `rejected` response, and `difficulty`. This pointwise environment consumes an optional `pointwise_scores` column when present (model key to `{chosen, rejected}` score maps). Without it, baseline pointwise metrics and `hard_only` filtering are not available in the usual way.\n\n### Task\n- **Type**: `single-turn`\n- **Output format**: direct pointwise scoring expects `<score>...</score>`, while rubric generation expects `<rubric>...</rubric>`.\n- **Implemented paths**: `standard_rm_pointwise`, `rupo_pointwise`, `rupo_pointwise_advantage`, `rupo_pointwise_binary`, `rupo_pointwise_advantage_binary`\n\n### Standard RM Pointwise Path\nThe direct path expands each preference pair into two rows:\n- one row scores the preferred `chosen` response toward a target score of `100`\n- one row scores the less-preferred `rejected` response toward a target score of `0`\n\nThe reward is based on closeness to the target score.\n\n### RUPO Pointwise Path\nThe rubric path asks the rollout model to generate a rubric for the conversation context, then uses a separate judge model to score the `chosen` and `rejected` responses independently on a `0` to `100` scale.\n\nThe scalar reward is:\n\n`(chosen_score - rejected_score) / 100`\n\n### RUPO Pointwise Advantage Path\nThe pointwise advantage path uses the same rubric-generation and judge-scoring flow, but subtracts the stored baseline pointwise margin from the dataset:\n\n`reward = rupo_pointwise_reward - baseline_pointwise_margin`\n\nWhen `pointwise_scores` is missing, the baseline margin reads as `0.0`.\n\n### RUPO Pointwise Binary Path\nThe binary pointwise path uses the same rubric-generation and judge-scoring flow, but collapses the scalar comparison into a win/loss reward:\n\n`reward = 1.0 if chosen_score > rejected_score else 0.0`\n\n### RUPO Pointwise Advantage Binary Path\nThe binary advantage path subtracts the dataset's baseline binary win indicator:\n\n`reward = rupo_pointwise_binary_reward - baseline_pointwise_binary_reward`\n\n### Quickstart\nRun an evaluation with default settings (requires the Prime CLI if you use that workflow):\n\n```bash\nprime eval run learn-helpsteer3-pointwise\n```\n\n### Install\n\n```bash\nuv pip install -e environments/learn_helpsteer3_pointwise\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | `str` | `\"sumuks/helpsteer3-dpo-style\"` | Hugging Face dataset name. |\n| `train_split` | `str` | `\"train\"` | Training split name. |\n| `test_split` | `str` | `\"test\"` | Eval split name. |\n| `run_setting` | `str` | `\"standard_rm_pointwise\"` | Environment variant to load. |\n| `judge_model` | `str` | `\"Qwen/Qwen3-4B-Instruct-2507\"` or `JUDGE_MODEL` | Judge model used by RUPO pointwise paths. |\n| `judge_base_url` | `str` | `\"http://192.222.59.13:30000/v1/\"` or `JUDGE_BASE_URL` | OpenAI-compatible judge endpoint used by RUPO pointwise paths. |\n| `judge_api_key` | `str` | `\"EMPTY\"` or `JUDGE_API_KEY` | Judge API key used by RUPO pointwise paths. |\n| `judge_max_concurrent_requests` | `int` | `1024` or `JUDGE_MAX_CONCURRENT_REQUESTS` | Semaphore limit for concurrent judge requests in RUPO pointwise paths. |\n| `baseline_model_name` | `str \\| None` | `None` | Optional explicit baseline pointwise score key. |\n| `hard_only` | `bool` | `false` | For RUPO pointwise paths only, keep only train examples whose stored baseline chosen-minus-rejected margin is at most `0`. Requires non-empty `pointwise_scores` rows. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward for the selected run setting. |\n| `predicted_pointwise_score` | Parsed score from the direct pointwise scoring path. |\n| `score_xml_format_score` | Parser-based XML formatting score for `<score>...</score>`. |\n| `rupo_pointwise_reward` | Normalized chosen-minus-rejected pointwise margin from the rubric-scored judge path. |\n| `rupo_pointwise_binary_reward` | `1.0` when the rubric-scored chosen response beats the rejected response, else `0.0`. |\n| `rupo_chosen_pointwise_score` | Judge score for the chosen response. |\n| `rupo_rejected_pointwise_score` | Judge score for the rejected response. |\n| `rubric_xml_format_score` | Parser-based XML formatting score for `<rubric>...</rubric>`. |\n| `baseline_pointwise_margin` | Stored dataset chosen-minus-rejected margin for the current baseline model. |\n| `baseline_pointwise_binary_reward` | `1.0` when the stored baseline chosen score beats the rejected score, else `0.0`. |\n","encoding":"utf-8","truncated":false,"total_bytes":5053},"status":null}