{"data":{"kind":"file","path":"README.md","version_id":"bm940yxhnjtso0mtn6xwo7ii","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1596,"modified_at":"2026-03-15T21:29:21.810000","content_hash":"9be0ee51c1aab5502b5bca69e6b08044ef2cee9f0b3fb7fa156c7b5d45c49d43"},"entries":[],"content":"# unslop-dpo-reward\n\nPrime Intellect Verifiers environment for the Hugging Face dataset `qfennessy/unslop-dpo`.\n\n## Behavior\n\nEach source dataset row becomes one single-turn pairwise reward-model example:\n\n- the prompt contains the original user prompt plus two candidate responses in neutral `A`/`B` order\n- the better response is randomized between `A` and `B` so position alone is not enough\n- the model is instructed to output JSON with `response_a_score` and `response_b_score`\n\nReward logic:\n\n- parse the two scores from the model output\n- compute the preferred-minus-dispreferred margin\n- use a Bradley-Terry objective: reward is `log(sigmoid(margin))`\n- malformed outputs receive a fixed penalty\n\n## Install\n\n```bash\nprime env install unslop-dpo-reward -p ./environments\n```\n\n## Evaluate\n\n```bash\nprime eval run unslop-dpo-reward\n```\n\n## Optional Arguments\n\n`load_environment()` accepts:\n\n- `split`: Hugging Face split to use for training data, default `train`\n- `num_examples`: limit for training rows, default `-1` for all rows\n- `eval_split`: split to use for evaluation, default `train`\n- `eval_num_examples`: limit for eval rows, default `100`\n- `shuffle`: shuffle training rows before slicing, default `True`\n- `eval_shuffle`: shuffle eval rows before slicing, default `False`\n- `seed`: shuffle seed, default `42`\n\n## Notes\n\nThis environment is for training a text-generating reward model in Verifiers, not for directly scoring a policy response against the original `chosen` and `rejected` strings. The model under training is expected to emit two real-valued scores as JSON text.\n","encoding":"utf-8","truncated":false,"total_bytes":1596},"status":null}