{"data":{"kind":"file","path":"README.md","version_id":"dmvpyb4n1ts6kzcxr2w0r8bm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6496,"modified_at":"2026-02-26T07:22:45.328000","content_hash":"9fae95dad413bfab8586e2022a01a71ab85ed1a00903076fb7b3442af2c5eb44"},"entries":[],"content":"# SYNUR\n\n### Overview\n- **Environment ID**: `synur`\n- **Short description**: Extract structured nursing flowsheet observations from synthetic clinical transcripts (SYNUR).\n- **Tags**: clinical, information-extraction, structured-output, train, eval\n\n### Datasets\n- **Primary dataset(s)**: `mkieffer/SYNUR` (cleaned), `microsoft/SYNUR` (original)\n- **Source links**:\n  - https://huggingface.co/datasets/mkieffer/SYNUR\n  - https://huggingface.co/datasets/microsoft/SYNUR\n- **Split usage**:\n  - train data: `mediqa_synur_train`\n  - eval data (default): `mediqa_synur_dev`\n  - alternate eval target: `mediqa_synur_test` via `eval_split=\"test\"` or `eval_split=\"mediqa_synur_test\"`\n\n### Task\n- **Type**: single-turn\n- **Output format expectations**: Strict JSON only. The top-level output must be a JSON array of observation objects with keys `id`, `name`, `value_type`, `value`.\n- **Rubric overview**:\n  - Main reward: official ID+value F1, gated by strict output validity.\n  - Official matching logic: a prediction is counted correct when `id` matches and `value` matches (using official `json_values_equal` behavior).\n  - `name` is not used to decide TP/FP/FN for official F1; `value_type` is only used to unroll `MULTI_SELECT` values.\n  - Validity gate: every predicted observation must include `id`, `name`, `value_type`, `value` and be schema-consistent.\n  - No partial shaping bonuses/penalties are applied.\n\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run synur\n```\n\nConfigure model and environment arguments:\n\n```bash\nprime eval run synur \\\n  -m gpt-4.1-mini \\\n  -n 50 \\\n  -r 1 \\\n  -t 1024 \\\n  -T 0.2 \\\n  -a '{\"eval_size\": 50, \"dataset_seed\": 7}'\n```\n\nUse a different endpoint and API key (example: OpenRouter):\n\n```bash\nprime eval run synur \\\n  -m arcee-ai/trinity-mini:free \\\n  -b https://openrouter.ai/api/v1 \\\n  -k OPENROUTER_API_KEY \\\n  -n 50 \\\n  -r 1 \\\n  -t 1024 \\\n  -T 0.2 \\\n  -a '{\"eval_size\": 50, \"dataset_seed\": 7}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- `-k` expects the name of an environment variable containing the API key.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `repo_id` | str | `\"mkieffer/SYNUR\"` | Hugging Face dataset repo id (`mkieffer/SYNUR` cleaned default; `microsoft/SYNUR` original source) |\n| `revision` | str | `\"main\"` | Dataset git revision (branch, tag, or commit) |\n| `train_split` | str | `\"mediqa_synur_train\"` | Source split for training data (fixed to partitioned train split; alias: `\"train\"`) |\n| `eval_split` | str | `\"mediqa_synur_dev\"` | Source split for eval data (`\"mediqa_synur_dev\"` / `\"mediqa_synur_test\"`; aliases: `\"dev\"` / `\"validation\"` / `\"test\"`) |\n| `train_size` | int | `-1` | Max train examples after shuffling (`-1` uses full `train_split`) |\n| `eval_size` | int | `128` | Max eval examples after shuffling (`-1` uses full `eval_split`) |\n| `dataset_seed` | int | `7` | Shuffle seed used independently per split before optional size caps |\n| `max_examples` | int | `-1` | Optional global per-split cap applied before `train_size` / `eval_size` (`-1` for all) |\n| `schema_mode` | str | `\"compact\"` | `\"compact\"` uses a reduced schema in the prompt; `\"full\"` uses the raw schema |\n| `numeric_tolerance` | float | `0.01` | Legacy arg retained for config compatibility (not used by official-style scoring path) |\n| `few_shot_examples` | int | `0` | Number of few-shot demonstrations prepended to each prompt (`0`-`3`, sampled from the shuffled `train_split`) |\n| `curriculum_mode` | str | `\"none\"` | Training-set curriculum mode (`\"none\"` or `\"auto_bins\"`). `\"auto_bins\"` enables automatic difficulty binning by number of gold observations |\n| `curriculum_phase` | str | `\"early\"` | Curriculum phase schedule used by `\"auto_bins\"` (`\"early\"`, `\"mid\"`, `\"late\"`, `\"all\"`) |\n| `curriculum_num_bins` | int | `4` | Target number of automatic quantile-style bins before undersized-bin merging |\n| `curriculum_min_bin_fraction` | float | `0.15` | Minimum fraction of training set each bin must contain after merge |\n| `curriculum_bin_weights` | list[float] \\| null | `null` | Optional custom sampling weights by bin (overrides `curriculum_phase`; must be non-negative) |\n| `curriculum_seed` | int \\| null | `null` | Optional RNG seed for curriculum resampling (defaults to `dataset_seed`) |\n\n### Curriculum Details\n- Curriculum is applied to the **training** split only; eval remains the selected `eval_split` dataset.\n- Difficulty is approximated by `difficulty_count` (the number of gold observations in each sample).\n- In `auto_bins` mode, train examples are sorted by difficulty and split into quantile-style bins from easiest to hardest.\n- Undersized bins are merged into a neighbor until each bin has at least `ceil(curriculum_min_bin_fraction * train_size)` examples.\n- For `early` / `mid` / `late`, the train set is resampled (with replacement) back to the same length using bin sampling weights.\n- For `all`, the train set is traversed once in three staged chunks (early -> mid -> late); sampling is random within bins and each example appears at most once.\n- Phase presets (for 4 bins) are:\n  - `early`: `[0.55, 0.30, 0.12, 0.03]`\n  - `mid`: `[0.35, 0.30, 0.22, 0.13]`\n  - `late`: `[0.20, 0.25, 0.30, 0.25]`\n- `all` runs these presets sequentially in one run (roughly one-third of samples per stage).\n- If you provide `curriculum_bin_weights`, they override phase presets.\n- If bin merging changes the final number of bins, sampling falls back to uniform weights across the final bins.\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main reward (range `[0.0, 1.0]`): equals `official_f1_metric` when strict-valid, else `0.0` |\n| `json_valid_metric` | 1 if output is strict-valid JSON with complete schema-consistent observation fields, else 0 |\n| `exact_match_metric` | 1 if strict-valid and official TP/FP/FN indicates perfect match, else 0 |\n| `official_precision_metric` | Official-style precision from TP/(TP+FP), where matches are based on `id + value` |\n| `official_recall_metric` | Official-style recall from TP/(TP+FN), where matches are based on `id + value` |\n| `pred_gold_ratio_metric` | Ratio of predicted to gold matched-bookkeeping units (`pred_units / gold_units`; if `gold_units=0`, reports `1.0` when `pred_units=0`, else `pred_units`) |\n| `official_f1_metric` | Official-style F1 from TP/FP/FN where matches are based on `id + value` |\n","encoding":"utf-8","truncated":false,"total_bytes":6496},"status":null}