{"data":{"kind":"file","path":"README.md","version_id":"mjwval05yxfunb5cv77hpysu","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2635,"modified_at":"2025-09-19T09:13:50.009000","content_hash":"a1bb8ff8ff5992d78559a7543116b328e6988f2fbbdc4bdc879ea9fd212b5e69"},"entries":[],"content":"# LegalBench Environment\n\nA Prime Intellect Verifiers environment for evaluating legal reasoning tasks from the LegalBench dataset.  \nPaper - https://arxiv.org/pdf/2308.11462  \nDataset - https://huggingface.co/datasets/nguha/legalbench  \nWebsite - https://hazyresearch.stanford.edu/legalbench/  \nGitHub (Source implementation) - https://github.com/HazyResearch/legalbench  \n\nContributed by - [@omkar-334](https://github.com/omkar-334)\n\n## Overview\n\n- Environment ID: `legalbench`\n- Splits: use the `test` split for evaluation\n\n## Dataset\n\nLegalBench is a comprehensive benchmark for evaluating legal reasoning capabilities in large language models. This environment provides a standardized interface for testing models on 162 legal tasks, including:\n\n- **Binary Classification**: Yes/No questions about legal concepts\n- **Open-ended Questions**: Free-form legal reasoning tasks\n- **Multi-label Tasks**: Tasks requiring multiple legal labels\n- **Numeric Tasks**: Tasks requiring specific numerical answers\n\nSee the [full list of tasks here](https://hazyresearch.stanford.edu/legalbench/tasks/).  \nFor a description of each task, see the website or open the README.md in the respective task folder.\n\n## Usage\n\n### Basic Usage\n\n```bash\nuv run vf-eval -s legalbench --env-args '{\"task_name\":\"abercrombie\", \"split\":\"test\"}'\n```\n\nRun with model arguments:
\n\n```bash\nuv run vf-eval -s legalbench -m gpt-4.1-mini --api-key-var <YOUR_API_KEY> --env-args '{\"task_name\":\"abercrombie\", \"split\":\"test\"}'\n```\n\n### Environment Arguments\n\nUse `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `task_name` | str | `abercrombie` | The `legalbench` task to load |\n| `dataset_split` | str | `train` | The dataset split to use |\n| `num_examples` | int | `None` | The number of examples to use for evaluation |\n| `use_claude` | bool | `false` | Whether to use the Claude-specific user prompts. Set to `true` when evaluating Claude models, otherwise `false` |\n\n### Task Categories\n\n- `classification`: 162 tasks fall under this category, either multi-class or binary classification.\n- `open_ended`: Free-form questions (e.g., `citation_prediction_open`, `rule_qa`)\n- `multi_label`: Multi-label tasks (e.g., `successor_liability`)\n- `numeric`: Numerical answer tasks (e.g., `sara_numeric`)\n- `extraction`: Extraction tasks (e.g., `definition_extraction`, `ssla`)\n\n### List Available Tasks\n\n```python\nfrom legalbench import list_available_tasks\n\ntasks = list_available_tasks()\nprint(f\"Available tasks: {len(tasks)}\")\nprint(tasks)\n```\n\n","encoding":"utf-8","truncated":false,"total_bytes":2635},"status":null}