{"data":{"kind":"file","path":"README.md","version_id":"i26667t9e7g7al4jcqsrhpym","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3227,"modified_at":"2026-05-05T01:54:26.853000","content_hash":"4e5e0f652e2dc5455cc5118506f393fc0c02e9f4d5fe7bf576cbe164245a5fc8"},"entries":[],"content":"# semver-diff\n\nSingle-turn classification environment. Given two semver strings (per\n[https://semver.org](https://semver.org)) — `v_from` and `v_to` — the\nmodel must return the single release-bump category that takes one to the\nother. Gold answers are computed deterministically from the official\nSemVer 2.0 regex and §11 precedence rules, so no LLM-judge is needed.\n\n### Overview\n- **Environment ID**: `semver-diff`\n- **Short description**: Classify the release bump between two semver versions (MAJOR / MINOR / PATCH / PRERELEASE / BUILD_METADATA / NO_CHANGE / DOWNGRADE).\n- **Tags**: single-turn, classification, semver, versioning, deterministic, eval\n\n### Datasets\n- **Primary dataset**: synthetic — generated at load time by\n  `_gen_examples(seed, num_examples)`. Examples cover all seven\n  categories (numeric-core bumps, prerelease/build-metadata edits,\n  identical pairs, and downgrades) with a fixed RNG seed for\n  reproducibility.\n- **Source**: SemVer 2.0 spec, https://semver.org. Reference regex from\n  the spec's official suggestion.\n- **Split sizes**: eval-only; `num_examples` defaults to `120`.\n\n### Task\n- **Type**: single-turn\n- **Output format**: free text — the parser pulls the first occurrence of\n  any known category token (`MAJOR`, `MINOR`, `PATCH`, `PRERELEASE`,\n  `BUILD_METADATA`, `NO_CHANGE`, `DOWNGRADE`).\n- **Rubric**:\n  - `exact_match` (weight `1.0`) — model output equals the gold category.\n  - `core_vs_meta` (weight `0.0`, diagnostic) — `1.0` when the prediction\n    agrees on the *family* (numeric-core bump vs. same-precedence change\n    vs. downgrade); useful for spotting models that get the family right\n    but pick the wrong subtype.\n\n### Quickstart\n\n```bash\nuv run vf-install semver-diff\nuv run vf-eval -s semver-diff -m gpt-5-mini -n 10 -r 1\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `120` | Dataset size. Stable for a given `seed`. |\n| `seed` | int | `0` | RNG seed for synthetic example generation. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Equals `exact_match` (the only weighted reward). |\n| `exact_match` | 1.0 iff prediction equals the gold category. |\n| `core_vs_meta` | 1.0 when prediction and gold belong to the same family (`{MAJOR, MINOR, PATCH}` / `{PRERELEASE, BUILD_METADATA, NO_CHANGE}` / `{DOWNGRADE}`). Diagnostic only. |\n\n### Edge cases the gold classifier handles\n\n- A version with a prerelease has lower precedence than the same version\n  without one, so `1.0.0-rc.1` → `1.0.0` is `PRERELEASE` (a prerelease\n  was removed, precedence increased), not `NO_CHANGE`.\n- `+build` metadata never affects precedence per SemVer §10. Changing\n  only the build tag (`1.0.0+sha.abc` → `1.0.0+sha.def`) is\n  `BUILD_METADATA`, distinct from `NO_CHANGE`.\n- Going down on any of major/minor/patch is `DOWNGRADE` regardless of\n  prerelease/build state.\n\n### Notes\n\n- No external credentials, dataset downloads, or API calls during load.\n  The dataset is pure-Python synthesis.\n- `classify(v_from, v_to)` is exposed as a top-level function so it can\n  be reused as a reference implementation by other environments or\n  trainer integrations.\n","encoding":"utf-8","truncated":false,"total_bytes":3227},"status":null}