{"data":{"kind":"file","path":"README.md","version_id":"v52vmvagzkwfd7fx689or8iy","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9142,"modified_at":"2026-04-06T08:31:14.542000","content_hash":"6bf4316b1c3fcd5eeb84286ed8fabfbeea87cbc91215086c6af5c2a76b9737da"},"entries":[],"content":"# dpo-to-rupo\n\nLearn rubrics from DPO-style preference data.\n\nThis environment turns each `prompt / chosen / rejected` example into a rubric-writing task. The policy can either see the original prompt plus two candidate responses or only the prompt, writes a reusable rubric, and a separate judge scores the hidden `chosen` and `rejected` responses against that rubric.\n\n## At a glance\n\n- Environment ID: `dpo-to-rupo`\n- Task shape: single-turn rubric generation\n- Default dataset: `sumuks/litbench-ha`\n- Required dataset columns: `prompt`, `chosen`, `rejected`\n- Required dataset splits: `train`, `test`\n\n## Flow\n\n1. Load one DPO-style example with `prompt`, `chosen`, and `rejected`.\n2. Show the policy either the original prompt plus two candidate responses or only the prompt, depending on `policy_prompt_mode`.\n3. Ask the policy to return `<analysis>...</analysis>` and `<rubric>...</rubric>`.\n4. Extract the rubric text.\n5. Ask the judge to score the hidden `chosen` response against that rubric.\n6. Ask the judge to score the hidden `rejected` response against that rubric.\n7. Convert those two judge scores into a final reward on a `0..1` scale.\n\nThe judge itself scores each response on a `0..100` integer scale. The environment then maps those two scores into the final rollout reward.\n\n## Reward modes\n\nAll reward modes return a final scalar in `0..1`.\n\n| Mode | Definition | Behavior |\n| --- | --- | --- |\n| `margin` | clipped linear margin with a `±5` deadzone and saturation by `±35` | Ignores tiny judge gaps as noise, rewards moderate separation, then stops paying for ever-larger spreads. 
|\n| `absolute` | `1.0` if `chosen > rejected`, `0.5` if equal, `0.0` otherwise | Winner-only signal with neutral ties. |\n| `sigmoid` | `sigmoid(k * (chosen_score - rejected_score))` | Very steep around ties, then saturates quickly. Tuned so a `+10` margin is about `0.95`. |\n| `criteria_margin` | weighted average of clipped per-criterion margins | Forces a structured rubric, asks the judge to score each criterion separately on a shared `0..100` scale, then applies rubric weights in Python. |\n| `criteria_absolute` | weighted average of per-criterion wins, ties, and losses | More criteria pointing toward `chosen` means higher reward, regardless of how large each winning gap is. |\n| `criteria_absolute_deadzone` | weighted average of per-criterion wins after a `±5` tie band | Same criterion-count objective, but ignores small per-criterion judge differences as noise. |\n| `criteria_total_absolute` | `1.0` if weighted chosen total > weighted rejected total, `0.5` if equal, `0.0` otherwise | Structured-rubric path with per-criterion scoring, but final reward depends only on the overall weighted total comparison. |\n\n## Defaults\n\nThese are the defaults used by `load_environment(...)`.\n\n| Argument | Type | Default | Meaning |\n| --- | --- | --- | --- |\n| `dataset_name` | `str` | `\"sumuks/litbench-ha\"` | Hugging Face dataset with `train` and `test` splits and `prompt/chosen/rejected` columns. |\n| `system_prompt` | `str \\| None` | `None` | Optional override for the policy system prompt. When omitted, the environment picks the built-in system prompt for the selected `policy_prompt_mode`. |\n| `randomize_order` | `bool` | `True` | Randomly swap candidate A/B presentation while preserving the hidden labels. This is ignored in `prompt_only` mode because the policy never sees candidate order. |\n| `policy_prompt_mode` | `\"pair\" \\| \"prompt_only\"` | `\"pair\"` | Whether the policy sees both candidate responses or only the task prompt. 
|\n| `judge_reward_type` | `\"margin\" \\| \"absolute\" \\| \"sigmoid\" \\| \"criteria_margin\" \\| \"criteria_absolute\" \\| \"criteria_absolute_deadzone\" \\| \"criteria_total_absolute\"` | `\"margin\"` | Final reward shaping mode. Any `criteria_*` reward appends a structured criterion format to the policy system prompt automatically. |\n| `judge_model` | `str \\| None` | `None` | Judge model. Uses `OPENAI_MODEL` when set, otherwise `gpt-4.1-mini`. |\n| `judge_system_prompt` | `str` | built-in judge prompt | System prompt passed only to the judge model. |\n| `judge_base_url` | `str \\| None` | `None` | Optional OpenAI-compatible base URL. Uses `OPENAI_BASE_URL` when set, with `OPENAI_API_BASE` as fallback. |\n| `judge_api_key` | `str \\| None` | `None` | Explicit judge API key. |\n| `judge_api_key_env_var` | `str` | `\"OPENAI_API_KEY\"` | Environment variable used when `judge_api_key` is omitted. |\n| `judge_max_concurrent_requests` | `int` | `64` | Maximum in-flight judge requests. |\n| `judge_max_tokens` | `int` | `4096` | Token cap for each judge completion. |\n| `trace_jsonl_path` | `str \\| None` | `\"outputs/dpo_to_rupo_rollouts.jsonl\"` | Optional JSONL trace sink for judge events and final rollout rewards. Set `null` to disable trace logging. 
|\n\n## Quickstart\n\nRun the helper script:\n\n```bash\n./scripts/eval/litbench_basic.sh\n```\n\nRun Prime directly:\n\n```bash\nprime eval run dpo-to-rupo \\\n  --env-dir-path ./environments \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 \\\n  -r 3 \\\n  -a '{\"dataset_name\":\"sumuks/litbench-ha\",\"randomize_order\":true,\"judge_reward_type\":\"margin\"}'\n```\n\nPrompt-only policy variant:\n\n```bash\nprime eval run dpo-to-rupo \\\n  --env-dir-path ./environments \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 \\\n  -r 3 \\\n  -a '{\"dataset_name\":\"sumuks/litbench-ha\",\"policy_prompt_mode\":\"prompt_only\",\"judge_reward_type\":\"margin\"}'\n```\n\nCriterion-wise reward variant:\n\n```bash\nprime eval run dpo-to-rupo \\\n  --env-dir-path ./environments \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 \\\n  -r 3 \\\n  -a '{\"dataset_name\":\"sumuks/litbench-ha\",\"policy_prompt_mode\":\"prompt_only\",\"judge_reward_type\":\"criteria_margin\"}'\n```\n\nCriterion-wise direction variant:\n\n```bash\nprime eval run dpo-to-rupo \\\n  --env-dir-path ./environments \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 \\\n  -r 3 \\\n  -a '{\"dataset_name\":\"sumuks/litbench-ha\",\"policy_prompt_mode\":\"prompt_only\",\"judge_reward_type\":\"criteria_absolute_deadzone\"}'\n```\n\nCriterion-wise overall-absolute variant:\n\n```bash\nprime eval run dpo-to-rupo \\\n  --env-dir-path ./environments \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 \\\n  -r 3 \\\n  -a '{\"dataset_name\":\"sumuks/litbench-ha\",\"policy_prompt_mode\":\"prompt_only\",\"judge_reward_type\":\"criteria_total_absolute\"}'\n```\n\n## Environment variables\n\nThe judge defaults follow the repo-local OpenAI-compatible convention:\n\n- `OPENAI_MODEL`\n- `OPENAI_BASE_URL`\n- `OPENAI_API_KEY`\n\n`OPENAI_API_BASE` is also accepted as a fallback alias for the base URL.\n\n## Output contract\n\nThe policy must return:\n\n```xml\n<analysis>...</analysis>\n<rubric>...</rubric>\n```\n\nWhen `judge_reward_type` starts with `criteria_`, the `<rubric>` 
block must contain only structured `<criterion>` blocks:\n\n```xml\n<rubric>\n  <criterion>\n    <name>Prompt fidelity</name>\n    <weight>40</weight>\n    <description>What strong responses do on this dimension and what weak responses miss.</description>\n  </criterion>\n</rubric>\n```\n\nThe judge must return:\n\n```xml\n<analysis>...</analysis>\n<score>0-100</score>\n```\n\nWhen `judge_reward_type` starts with `criteria_`, the environment makes one judge call per criterion and each call returns the standard shared-scale score XML:\n\n```xml\n<analysis>...</analysis>\n<score>0-100</score>\n```\n\nOptional `<think>...</think>` content is tolerated before XML extraction.\n\n## Trace logging\n\nWhen `trace_jsonl_path` is set, the environment writes one JSONL trace file through `loguru`.\n\nIt records:\n\n- one `judge_response` event for the `chosen` response,\n- one `judge_response` event for the `rejected` response,\n- one `rollout_reward` event with the final reward and rollout summary.\n\nEach record includes a stable `rollout_id` so downstream analysis can join the per-judge events with the final rollout reward.\n\n## Metrics\n\n| Metric | Meaning |\n| --- | --- |\n| `reward` | Final environment reward in `0..1` using the configured reward mode. |\n| `num_turns` | Standard verifiers single-turn metric. |\n| `rubric_parse_success` | `1.0` when the completion contains a non-empty `<rubric>...</rubric>` payload. |\n| `rubric_parse_failure` | `1.0` when rubric extraction fails. |\n| `rubric_char_count` | Character length of the extracted rubric text. |\n\nStructured `criteria_*` runs also log these zero-weight diagnostics:\n\n| Metric | Meaning |\n| --- | --- |\n| `structured_xml_parse_success` | `1.0` when the extracted rubric parses as XML. |\n| `structured_rubric_valid` | `1.0` when the rubric has at least two complete criteria, positive integer weights, and total weight `100`. 
|\n| `structured_rubric_invalid` | `1.0` when rubric text is present but fails strict structured-rubric validation. |\n| `structured_criterion_count` | Number of `<criterion>` blocks found. |\n| `structured_complete_criterion_count` | Number of criteria with non-empty `name`, `weight`, and `description`. |\n| `structured_positive_weight_count` | Number of criteria whose weight parses to a positive integer. |\n| `structured_total_weight` | Sum of declared criterion weights. |\n\n## Requirements\n\n- The dataset must expose `train` and `test` splits.\n- The judge needs an API key, either from `judge_api_key` or from the environment variable named by `judge_api_key_env_var`.\n","encoding":"utf-8","truncated":false,"total_bytes":9142},"status":null}