{"data":{"kind":"file","path":"README.md","version_id":"dd4glmawswmofvtrle511u2i","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1710,"modified_at":"2025-09-26T20:53:50.051000","content_hash":"0bb73093fff39e3fa7dccd15b99388ba272a0199a644d49f57d91e869ce2eb21"},"entries":[],"content":"# oab-bench\n\nBenchmark adapted from [Automatic Legal Writing Evaluation of LLMs](https://arxiv.org/abs/2504.21202), built to evaluate LLMs on the Brazilian Bar Examination using a multi-judge system. This implementation does not exactly reproduce the methodology described in the paper.\n\n### Overview\n- **Environment ID**: `oab-bench`\n- **Short description**: Evaluates LLMs on the Brazilian Bar Examination using a multi-judge system.\n- **Tags**: llm-judge, legal, brazil, single-turn\n\n### Datasets\n- **Primary dataset(s)**: OAB bench from MaritacaAI.\n- **Source links**: [maritaca-ai/oab-bench](https://huggingface.co/datasets/maritaca-ai/oab-bench/viewer/questions?views%5B%5D=questions)\n- **Split sizes**: 75% train, 25% test.\n\n### Task\n- **Type**: single-turn\n- **Parser**: XMLParser\n- **Rubric overview**: The environment automatically generates an LLMJudge rubric for each model passed in the `judge_models` argument.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval oab-bench\n```\n\nConfigure the evaluated model and the judge models:\n\n```bash\nuv run vf-eval oab-bench -s -m gpt-4.1-mini -a '{\"judge_models\": [{\"model\": \"gpt-4.1\", \"base_url\": \"https://api.openai.com/v1\"}, {\"model\": \"gpt-5\", \"base_url\": \"https://api.openai.com/v1\"}]}'  # env-specific args as JSON\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `judge_models` | List[Dict[str, str]] | `[{\"model\": \"gpt-4.1\", \"base_url\": \"https://api.openai.com/v1\"}]` | List of judge models and their base URLs. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Mean of the rewards given by all judges |\n| `reward_judge_<judge_name>` | Reward given by each individual judge |\n\n","encoding":"utf-8","truncated":false,"total_bytes":1710},"status":null}