{"data":{"kind":"file","path":"README.md","version_id":"cui7fyeuy6zdet3xktmauap9","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5378,"modified_at":"2026-05-18T10:11:40.423000","content_hash":"9ab599ebaa019525dd0b2d52c9ac6998d46e142ecf850ea7e40fbf12cf1fe096"},"entries":[],"content":"# openfarm-dog-pain-triage\n\n<p>\n  <a href=\"https://github.com/ob1-s/happy-farm/tree/main/environments/openfarm_dog_pain_triage\">\n    <img align=\"left\" hspace=\"4\" src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"GitHub\">\n  </a>\n  <a href=\"https://app.primeintellect.ai/dashboard/environments/openfarm/openfarm-dog-pain-triage\">\n    <img align=\"left\" hspace=\"4\" src=\"https://img.shields.io/badge/Prime%20Intellect-Envs%20Hub-181717?style=for-the-badge&labelColor=181717&logoColor=white\" alt=\"Prime Intellect Environments Hub\">\n  </a>\n  <a href=\"https://huggingface.co/datasets/oliveirabruno01/openfarm-dog-pain\">\n    <img align=\"left\" hspace=\"4\" src=\"https://img.shields.io/badge/Hugging%20Face-Dataset-181717?style=for-the-badge&logo=huggingface&logoColor=yellow&labelColor=181717\" alt=\"Hugging Face Dataset\">\n  </a>\n</p>\n<br clear=\"all\" />\n\n### Overview\n- **Environment ID**: `openfarm-dog-pain-triage`\n- **Short description**: A canine clinical text triage environment evaluating whether models can accurately deduce OpenFARM/AWW affective pain states from leakage-safe tabular veterinary metadata.\n- **Tags**: animal-welfare, veterinary, dog, pain, clinical-text, openfarm, llm-judge, single-turn, eval, train\n\n### Datasets\n- **Primary dataset(s)**: `oliveirabruno01/openfarm-dog-pain`\n- **Source links**:\n  - [Hugging Face Datasets](https://huggingface.co/datasets/oliveirabruno01/openfarm-dog-pain)\n  - Dog Pain Database: A Multidimensional Dataset for Investigating Canine Pain, [Zenodo record 15303646](https://zenodo.org/records/15303646), DOI 
[`10.5281/zenodo.15303646`](https://doi.org/10.5281/zenodo.15303646), CC-BY-4.0.\n- **Split sizes**:\n  - `train` (68 rows) / `test` (18 rows): Strictly class-balanced, dog-heldout splits to prevent majority-class memorization during RL training.\n  - `train_raw` (441 rows) / `test_raw` (116 rows): The natural, highly imbalanced dog-heldout distribution.\n\n*Note: Use `test` for baseline reasoning checks. Use `test_raw` with Macro-F1 scoring to evaluate real-world prevalence handling.*\n\n### Task\n- **Type**: single-turn\n- **Target labels**: `Negative_Nociceptive` vs `Neutral_Resting`.\n- **Output format expectations**: XML tags. Depending on `require_explanation`, models must output just `<answer>...</answer>` or both `<explanation>...</explanation>` and `<answer>...</answer>`.\n- **Rubric overview**:\n  - Exact-match classification accuracy (0.0 or 1.0).\n  - XML format adherence.\n  - Optional `AdaptiveJudgeRubric` to score the medical soundness of the model's `<explanation>`.\n\n### Quickstart\nRun an evaluation with default settings (uses the balanced `test` split):\n\n```bash\nprime eval run openfarm-dog-pain-triage\n```\n\nEvaluate on the natural, imbalanced raw distribution to check for mode collapse:\n\n```bash\nprime eval run openfarm-dog-pain-triage \\\n  -m openai/gpt-4.1-mini \\\n  -n 100 -r 1 \\\n  -a '{\"test_split\": \"test_raw\", \"require_explanation\": true}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The public dataset rows may include rich clinical/source metadata for audits and ablations. 
The environment prompt intentionally uses only a selected leakage-safe subset of those fields by default.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_id` | str | `\"oliveirabruno01/openfarm-dog-pain\"` | Hugging Face dataset ID |\n| `dataset_revision` | Optional[str] | `None` | Hugging Face dataset revision |\n| `train_split` | str | `\"train\"` | Split used for the training dataset (`train` or `train_raw`) |\n| `test_split` | str | `\"test\"` | Split used for the evaluation dataset (`test` or `test_raw`) |\n| `max_examples` | int | `-1` | Limit on dataset size (use `-1` for all) |\n| `seed` | int | `42` | Random seed for dataset shuffling |\n| `include_extended_surgery_fields` | bool | `False` | Include `time_since_surgery` and `recovered_anesthesia` if present. Disabled by default because missingness is label-correlated. |\n| `require_explanation` | bool | `True` | If `True`, requires `<explanation>` tags for chain-of-thought reasoning in addition to `<answer>`. |\n| `accuracy_reward_weight` | float | `1.0` | Weight for the exact-match category classification reward. |\n| `judge_reward_weight` | float | `1.0` | Weight for the LLM-as-a-judge medical reasoning reward. |\n| `format_reward_weight` | float | `0.0` | Weight for XML format adherence. |\n| `judge_mode` | str | `\"none\"` | Judge strategy: `\"external\"`, `\"self\"`, or `\"none\"`. |\n| `judge_model` | str | `\"gpt-4o-mini\"` | Model used for the LLM judge. |\n| `judge_api_key_var` | str | `\"PRIME_API_KEY\"` | Environment variable containing the judge API key. |\n| `judge_base_url` | str | `\"https://api.pinference.ai/api/v1\"` | Base URL for the judge model provider. |\n| `system_prompt_override` | Optional[str] | `None` | Optional custom system prompt string. |\n| `user_prompt_override` | Optional[str] | `None` | Optional custom user prompt template string. 
|\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of accuracy, judge, and format rewards) |\n| `accuracy_reward` | (0.0 or 1.0) Exact match against the target label (`Negative_Nociceptive` or `Neutral_Resting`) |\n| `format_reward` | (0.0 or 1.0) Adherence to the requested XML tag format |\n| `hybrid_judge_reward` | (0.0 to 1.0) LLM judge score for the medical soundness of the reasoning in `<explanation>` |\n","encoding":"utf-8","truncated":false,"total_bytes":5378},"status":null}