{"data":{"kind":"file","path":"README.md","version_id":"i4nw425jlvec0i7da241d3f3","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4871,"modified_at":"2025-10-12T22:55:12.317000","content_hash":"fa83543ae38fa1a77892477a13d63faa348887491ccd1169eae72328e9b2b16e"},"entries":[],"content":"# medredqa\n\n### Overview\n- **Environment ID**: `medredqa`\n- **Short description**: MedRedQA dataset from https://data.csiro.au/collection/csiro:62454\n- **Tags**: \n\n### Datasets\n- **Primary dataset(s)**: MedRedQA, a large non-factoid English consumer Question Answering (QA) dataset containing 51,000 pairs of consumer questions and their corresponding expert answers. The dataset is useful for benchmarking or training systems on harder real-world questions and responses, which may contain spelling or formatting errors, or lexical gaps between consumer and expert vocabularies.\n- **Source links**: https://huggingface.co/datasets/bagga005/medredqa\n- **Split sizes**: 40.7k (train) / 5.1k (val) / 5k (test)\n\n### Task\n- **Type**: single-turn\n- **Parser**: JudgeRubric\n- **Rubric overview and Example**: \nEvaluates the model's opinion and recommendation against the ground-truth opinion and recommendation provided by a medical professional. The judge breaks the ground truth into atomic facts (inspired by FActScore) and assigns a score based on the occurrence of each atomic fact in the model output.\n\n#### Sample Evaluation\n##### User Query\n28M. Small “grain of rice” under the skin on the shaft of my penis\nAge - 28\nSex - M\nHeight - 6’2”\nWeight - 230lbs \nRace - White\nDuration of complaint - 3-4 months\n28M. I’ve already scheduled an appointment with my PCP, but the appointment is not for a couple of weeks. Was hoping to get an opinion or thought about what this may before then. \nI have a small, grain of rice sized lump on the shaft of my penis near the dorsal vein. 
It doesn’t hurt, it popped up around 3-4 months ago as far as I remember, and it does not seem to be getting larger or smaller. I can’t see it while flaccid, so it is definitely under the skin. \nI have no other relevant conditions that I’m aware of. Currently taking medications that include levothyroxine, loratadine, and Baclofen. \nHere is a picture for reference:\n##### Professional Opinion and Recommendation:\nThe small spots may be Fordyce spots but the  lump in the circle is not. If it is firm and non compressible then it is probably an epidermoid cyst. If it can be squashed flat it may be a blood vessel. It doesn’t look worrying and if it has not changed then there is a low likelihood of being serious.\n\n##### Model Opinion and Recommendation:\nBased on the information you provided, the small, painless, stable lump under the skin of your penile shaft is most likely a benign lesion such as an epidermoid cyst or a normal vein irregularity. Since it has been present for several months without change or symptoms, this is reassuring. \n\nYour plan to see your primary care provider is appropriate. 
In the meantime, avoid manipulating or irritating the area, monitor for any changes (growth, pain, ulceration, discharge), and seek urgent care if these occur.\n\nAt your appointment, your provider can examine the lump and, if needed, refer you to a urologist or dermatologist for further evaluation or possible removal.\n\nIf you develop new symptoms such as pain, rapid growth, redness, or ulceration before your appointment, seek medical evaluation sooner.\n\nOverall, keep track of any changes and follow up as scheduled.\n\n##### Judge reasoning and Score:\nAtomic Facts and matching \nFact: The small spots may be Fordyce spots., Score: False\nFact: A firm and non-compressible lump is probably an epidermoid cyst., Score: True\nFact: A firm, noncompressible lump is likely an epidermoid cyst., Match: True\nFact: If the lump doesn’t look worrying and has not changed, there is a low likelihood of it being serious., Score: True\nTotal Matches: 2, Total Facts: 4, Score: **0.5**\n\n\n### Environment Arguments\nUse `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `judge_model` | str | `\"gpt-4o-mini\"` | Model to use for LLM-as-a-Judge evaluation |\n| `judge_base_url` | str | `None` | Optional base URL for judge model API |\n| `judge_api_key` | str | `None` | Optional API key for judge model (defaults to OPENAI_API_KEY env var) |\n| `use_think` | bool | `False` | Whether to use ThinkParser with `<think>...</think>` tags and `\\boxed{}` format for reasoning models |\n\n### Usage Examples\n\nStandard evaluation (free-form responses):\n\n```bash\nuv run vf-eval medredqa -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7\n```\n\nVerbose with saved output:\n\n```bash\nuv run vf-eval medredqa -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -s -v\n```\n\nEvaluation with reasoning model (with think tags and boxed answers):\n\n```bash\nuv run vf-eval medredqa \\\n  -m gpt-4.1-mini 
\\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"use_think\": true}' \\\n  -s\n```\n\nCustom judge model:\n\n```bash\nuv run vf-eval medredqa \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"judge_model\": \"gpt-4o\", \"use_think\": true}' \\\n  -s\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4871},"status":null}