{"data":{"kind":"file","path":"README.md","version_id":"p8wesbj0q5n0k2481a79sju2","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5312,"modified_at":"2026-02-20T23:11:44.201000","content_hash":"fb9e21d2f828272151607d14298097ea115fe8921d14133809d4388492057c27"},"entries":[],"content":"# MEDEC\n\nEvaluation environment for the MEDEC dataset.\n\n## Overview\n- **Environment ID**: `medec`\n- **Short description**: A benchmark for medical error detection, extraction, and correction in clinical notes, based on the MEDIQA-CORR 2024 shared task.\n- **Tags**: medical, clinical, error-detection, error-correction, single-turn, llm-as-judge, evaluation, metrics\n\n## Datasets\n- **Source links**: [Paper](https://arxiv.org/html/2412.19260v1), [Original GitHub](https://github.com/abachaa/MEDEC), [HF Dataset](https://huggingface.co/datasets/sauravlmx/MEDEC-MS)\n- **Split sizes**:\n\n| Split         | Count | Description                         |\n| ------------- | ----- | ----------------------------------- |\n| train_ms      | 2,189 | MS Training Set                     |\n| validation_ms | 574   | MS Validation Set with Ground Truth |\n| test_ms       | 597   | MS Test Set with Ground Truth       |\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Supports three evaluation modes via `eval_method`:\n  - **`\"judge\"` (default)**: LLM-as-a-Judge multi-part rubric (No Free Labels inspired multi-axis judge)\n  - **`\"metrics\"` (replication)**: ROUGE, BERTScore, and BLEURT; primary score is weighted average of `flag_accuracy` and paper metrics\n  - **`\"both\"` (combined)**: Judge mode score + paper metrics at weight 0 for analysis\n\n## Quickstart\n\n### 1. Export API Key\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\"\n```\n\n### 2. 
Run Evaluation (Default Judge Mode)\n\n```bash\nprime eval run medec -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\n```bash\nmedarc-eval medec -m \"openai/gpt-5-mini\" -n 10 -s --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval medec -m \"openai/gpt-5-mini\" -n 10 -s --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\"\n```\n\n### 3. Run Evaluation (Paper Replication Mode)\n\n```bash\nmedarc-eval medec -m \"openai/gpt-5-mini\" --eval-method metrics -n 10 -s\n```\n\n### 4. Evaluate a Different Model (e.g., Anthropic)\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\"\nexport ANTHROPIC_API_KEY=\"your-anthropic-api-key\"\n\nmedarc-eval medec -m gpt-5-mini -b https://api.anthropic.com/v1 -k ANTHROPIC_API_KEY --header 'anthropic-version: 2023-06-01' -n 10 -s\n```\n\n## Environment Arguments\n\n| Argument         | Type | Default         | Description                                                                      |\n| ---------------- | ---- | --------------- | -------------------------------------------------------------------------------- |\n| `judge_model`    | str  | `\"gpt-4o-mini\"` | Model used for judge-based scoring (in `\"judge\"` or `\"both\"` mode).              |\n| `judge_base_url` | str  | `None`          | API endpoint for the judge model (defaults to OpenAI API).                       |\n| `judge_api_key`  | str  | `None`          | API key for the judge model (defaults to `OPENAI_API_KEY`).                      |\n| `eval_method`    | str  | `\"judge\"`       | Evaluation mode (`\"judge\"`, `\"metrics\"`, or `\"both\"`). 
                          |\n| `device`         | str  | `None`          | Device to use for metrics (`cpu`, `cuda:0`, etc.). Defaults to GPU if available. |\n\n## Metrics\n\n### Judge Mode (`eval_method=\"judge\"`)\n\n| Metric             | Weight | Meaning                                                                     |\n| ------------------ | ------ | --------------------------------------------------------------------------- |\n| `error_flag`       | 1/3    | 1.0 if predicted `error_id` matches ground truth; else 0.0.                 |\n| `error_sentence`   | 1/3    | 1.0 if predicted `incorrect_sentence` matches ground truth; else 0.0.       |\n| `error_correction` | 1/3    | 1.0 if the LLM judge deems `correction` medically equivalent; else 0.0.     |\n\n### Metrics Mode (`eval_method=\"metrics\"`)\n\n| Metric           | Weight | Meaning                                                               |\n| ---------------- | ------ | --------------------------------------------------------------------- |\n| `error_flag`     | 1/3    | 1.0 if predicted `error_id` matches ground truth; else 0.0.           |\n| `error_sentence` | 1/3    | 1.0 if predicted `incorrect_sentence` matches ground truth; else 0.0. |\n| `rouge_score`    | 1/6    | ROUGE-1 F1 score.                                                     |\n| `bertscore`      | 1/6    | BERTScore F1.                                                         |\n| `bleurt`         | 1/6    | BLEURT score.                                                         |\n\n### Both Mode (`eval_method=\"both\"`)\n\nSame as Judge mode, plus the paper's evaluation metrics with weight 0 (for analysis only):\n\n| Metric        | Weight | Meaning                       |\n| ------------- | ------ | ----------------------------- |\n| `rouge_score` | 0      | ROUGE-1 F1 score.             |\n| `bertscore`   | 0      | BERTScore F1.                 |\n| `bleurt`      | 0      | BLEURT score.                 
|\n","encoding":"utf-8","truncated":false,"total_bytes":5312},"status":null}