{"data":{"kind":"file","path":"README.md","version_id":"aqt553fxoxtfwq4fipoyueua","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3638,"modified_at":"2025-10-21T05:57:38.444000","content_hash":"0bb8a881815b79ee88140486eab9849739ce0b89a41f6278343267ceb84ad558"},"entries":[],"content":"# k_qa\n\n### Overview\n- **Environment ID**: `k_qa`\n- **Short description**: An evaluation of free-form answers that borrows from FActScore: each answer is broken into atomic statements, and the model is scored on comprehensiveness and hallucination. It works with any LLM.\n\n- **Tags**: medllm, medical, FActScore\n\n### Datasets\n- **Primary dataset(s)**: The evaluation dataset comes from the K-QA paper; the file is named \"question_w_answers.jsonl\".\n- **Source links**:\n  - Paper: https://arxiv.org/abs/2401.14493\n  - Dataset / Project: https://github.com/Itaymanes/K-QA\n- **Split sizes**: 201 examples\n\n### Task\n- **Type**: single-turn\n- **Parser**: Uses `medarc_verifiers.JSONParser` to parse a `{\"claims\": [...]}` JSON object from the extractor LLM output.\n\n- **Rubric overview**:\n  - A `RubricGroup` with two phases that share state:\n    1.  **Extraction**: A `JudgeRubric` extracts claims from the model's free-form answer and stores them in `state`.\n    2.  
**Scoring**: A second `JudgeRubric` reads the claims from `state` and computes:\n        - `comprehensiveness`: fraction of must-have claims entailed by the model’s predicted claims.\n        - `hallucination_rate`: fraction of predicted claims that contradict any gold claim (must-have or nice-to-have).\n  - Both phases rely on an NLI-style judge LLM.\n\n### Quickstart\n\nMake sure an OpenAI API key is available to the process:\n```bash\nexport OPENAI_API_KEY=\"YOUR_KEY\"\n```\n\nRun an evaluation with defaults:\n```bash\nuv run vf-eval k_qa\n```\n\nConfigure the generation model and sampling (example):\n```bash\nuv run vf-eval k_qa \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\nOverride environment arguments (see table below):\n```bash\nuv run vf-eval k_qa \\\n  -a '{\"extractor_model\":\"gpt-4-mini\",\"judge_model\":\"gpt-4-mini\"}'\n```\n\nRun a batched evaluation (`batch` defaults to false):\n```bash\nuv run vf-eval k_qa -a '{\"batch\": true}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### How it works\n1. The LLM agent produces a free-form answer to a `Question`.\n2. An `extractor` rubric prompts the `extractor_model` to decompose the answer into atomic `claims`. These are stored internally.\n3. A `scorer` rubric then uses the `judge_model` to evaluate entailment and contradiction between the model's claims and the gold claims, and computes the metrics from the results.\n   - `comprehensiveness` is calculated based on how many \"must-have\" gold claims are entailed by the model's claims.\n   - `hallucination_rate` is calculated based on how many of the model's claims contradict any of the gold claims.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `extractor_model` | str | `gpt-4-mini` | The model used to extract claims from the free-form answer. 
|\n| `judge_model` | str | `gpt-4-mini` | The model used for NLI-style scoring (entailment and contradiction). |\n| `batch` | bool | `False` | Whether to run evaluation in a single batch call to the judge model. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The primary reward signal, which is equivalent to `comprehensiveness`. |\n| `comprehensiveness` | The fraction of \"must-have\" gold claims that are entailed by the claims made in the generated answer. A value of 1.0 means all essential information was covered. |\n| `hallucination_rate` | The count of claims in the generated answer that contradict any of the gold-standard claims (\"must-have\" or \"nice-to-have\"). A lower value is better. |\n\n","encoding":"utf-8","truncated":false,"total_bytes":3638},"status":null}