{"data":{"kind":"file","path":"README.md","version_id":"dpatmf4vcegbsd7c1y46do8k","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2800,"modified_at":"2025-08-22T23:43:28.052000","content_hash":"f861a8752eacfdb0142e75d1970adcd95d2700f6ba7a870d849f6daec1b279e3"},"entries":[],"content":"# OpenMed_PubMedQA\n\n### Overview\n- **Environment ID**: `OpenMed_PubMedQA`\n- **Short description**: Single-turn biomedical QA on PubMedQA with chain-of-thought and a final decision inside `\\boxed{yes|no|maybe}`.\n- **Tags**: biomedical, pubmedqa, single-turn, think, boxed-decision\n\n### Dataset\n- **Source**: `qiaojin/PubMedQA` (default config: `pqa_labeled`)\n- **Splits**: If only `train` exists (typical), an eval holdout is created from `train` via `train_test_split` (seed=42). If `test` or `validation` is present, it is used for eval.\n- **Fields used**:\n  - `question` and a recursively-flattened `context` (e.g., from `context[\"contexts\"]`) are combined in the user message.\n  - `final_decision` is normalized to `yes|no|maybe` and used as the target `answer`.\n\n### Prompting & Schema\n- **System message**: Instructs to reason inside `<think>...</think>` and put the final decision in `\\boxed{...}` using exactly one token from {yes,no,maybe}.\n- **User message**: `question` + a flattened `Context:` built from nested dict/list structures (prefers keys like `contexts`, `context`, `passages`, `evidence`, `abstract`).\n- **Example schema per example**:\n  - `prompt`: list of messages `[{\"role\":\"system\",...}, {\"role\":\"user\",...}]`\n  - `answer`: lowercased `final_decision` (`yes|no|maybe`)\n\n### Parser & Rewards\n- **Parser**: `ThinkParser` with `extract_boxed_answer` to read the final decision from `\\boxed{...}`.\n- **Rewards**:\n  - `correct_decision_reward_func` (weight 1.0): 1.0 if parsed decision equals target, else 0.0.\n  - `parser.get_format_reward_func()` (weight 0.0): format adherence only (currently not counted).\n- BLEU shaping against `long_answer` is currently disabled.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `config_name` | str | `\"pqa_labeled\"` | PubMedQA config: `pqa_labeled`, `pqa_artificial`, or `pqa_unlabeled` |\n| `num_train_examples` | int | `-1` | Limit training set size (`-1` for all) |\n| `num_eval_examples` | int | `-1` | Limit eval set size (`-1` for all) |\n\n### Quickstart\n\nEvaluate with defaults (uses the env’s internal dataset handling):\n\n```bash\nuv run vf-eval OpenMed_PubMedQA \\\n  -a '{\"config_name\":\"pqa_labeled\", \"num_train_examples\":-1, \"num_eval_examples\":-1}'\n```\n\nTrain with the provided script:\n\n```bash\npython src/pubmedqa_training_vf_env.py\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- Reports (if produced) will be placed under `./environments/OpenMed_PubMedQA/reports/`.\n\n## Evaluation Reports\n\n<!-- Do not edit below this line. Content is auto-generated. -->\n<!-- vf:begin:reports -->\n<p>No reports found. Run <code>uv run vf-eval OpenMed_PubMedQA -a '{\"key\": \"value\"}'</code> to generate one.</p>\n<!-- vf:end:reports -->","encoding":"utf-8","truncated":false,"total_bytes":2800},"status":null}