{"data":{"kind":"file","path":"README.md","version_id":"ftnmztp0u5gepa6g5wdtz05l","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8107,"modified_at":"2025-12-10T00:55:42.541000","content_hash":"25b7ddb795dfd6e8e8900ce6272b7dd357b92285e8e491d63cb4f4bbac400ab9"},"entries":[],"content":"# poetav2-min\n\n### Overview\n- **Environment ID**: `poetav2-min`\n- **Short description**: A minimal, opinionated subset of the PoETa v2 benchmark focusing on native Portuguese tasks (NLI, STS, QA, proverbs, toxicity).\n- **Tags**: portuguese, brazil, benchmark, eval, train, single-turn, enem, unicamp, fuvest, bluex, inferbr, broverbs, hatebr, fasquad, assin2\n\n### Datasets\n- **Primary dataset(s)**: \n  - `assin2` (RTE & STS) [hf](https://huggingface.co/datasets/nilc-nlp/assin2)\n  - `maritaca-ai/enem` (University Exams) [hf](https://huggingface.co/datasets/maritaca-ai/enem)\n  - `portuguese-benchmark-datasets/BLUEX` (University Exams) [hf](https://huggingface.co/datasets/portuguese-benchmark-datasets/BLUEX)\n  - `franciellevargas/HateBR` (Toxicity) [hf](https://huggingface.co/datasets/franciellevargas/HateBR)\n  - `liafacom/faquad` (Extractive QA) [github](https://github.com/liafacom/faquad)\n  - `Tropic-AI/BRoverbs` (Cultural Reasoning) [hf](https://huggingface.co/datasets/Tropic-AI/BRoverbs)\n  - `hapaxlegomenon/InferBR` (NLI) [hf](https://huggingface.co/datasets/hapaxlegomenon/InferBR)\n- **Source links**: [PoETa V2 benchmark suite](https://github.com/PoETaV2/PoETaV2)\n- **Split sizes**: Varies by sub-task. 
Generally uses `train` for training and `test` for evaluation.\n\n### Task\n- **Type**: single-turn (EnvGroup)\n- **Parser**: `Parser` with `extract_boxed_answer`\n- **Rubric overview**: \n  - **Classification/Multiple Choice**: Exact match accuracy (1.0 or 0.0).\n  - **STS (Semantic Text Similarity)**: Linear proximity score (1.0 - error/range).\n  - **Extractive QA**: F1 score.\n  - **Format**: All tasks include a format reward for adhering to `\\boxed{}`.\n\n### Quickstart\nRun an evaluation on all tasks with default settings:\n\n```bash\nuv run vf-eval poetav2-min\n```\n\nRun on a specific task (e.g., ENEM) with a sample limit:\n\n```bash\nuv run vf-eval poetav2-min \\\n  -m gpt-4.1-mini \\\n  -n 10 \\\n  -a '{\"task\": \"enem\", \"examples_per_task\": 10}'\n```\n\n### Notes\n\n#### Training\n\nThe tasks `enem`, `bluex`, `hatebr` and `broverbs_*` have no official training set.\n\nThese tasks currently fall back to the `test` set as their default training set, so training and evaluation data overlap for them. Keep this in mind when training models on this environment or when using it as a benchmark.\n\n#### Compute NPM\n\nTo compute the Normalized Preferred Metric (NPM) for poetav2-min evals, run the following command:\n\n```bash\ncd environments/poetav2_min && uv run scripts/calculate_npm.py --model gpt-4.1-mini\n```\n\n##### NPM results\n\n<details>\n  \n```\n    PoETa v2 (Min) Results: Qwen--Qwen3-4B-Thinking-2507   \n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━━━━━━┳━━━━━━━━┓\n┃ Task                        ┃  N ┃ Raw Score ┃    NPM ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━━━━━━╇━━━━━━━━┩\n│ assin2_rte                  │ 30 │    0.8667 │  73.33 │\n│ assin2_sts                  │ 30 │    0.8704 │  67.60 │\n│ bluex                       │ 30 │    0.8000 │  75.00 │\n│ broverbs_history_to_proverb │ 30 │    1.0000 │ 100.00 │\n│ broverbs_proverb_to_history │ 30 │    0.9667 │  95.83 │\n│ enem                        │ 30 │    0.9000 │  87.50 │\n│ faquad                      │ 30 │    0.9933 │  99.33 │\n│ 
hatebr                      │ 30 │    0.9000 │  80.00 │\n│ inferbr                     │ 30 │    0.9000 │  85.00 │\n├─────────────────────────────┼────┼───────────┼────────┤\n│ AVERAGE (NPM)               │    │           │  84.84 │\n└─────────────────────────────┴────┴───────────┴────────┘\nRun ID: e168bf51\n----------------------------------------\n    PoETa v2 (Min) Results: arcee-ai--AFM-4.5B       \n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━━━━━━┳━━━━━━━┓\n┃ Task                        ┃  N ┃ Raw Score ┃   NPM ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━━━━━━╇━━━━━━━┩\n│ assin2_rte                  │ 30 │    0.8667 │ 73.33 │\n│ assin2_sts                  │ 30 │    0.7471 │ 36.77 │\n│ bluex                       │ 30 │    0.3667 │ 20.83 │\n│ broverbs_history_to_proverb │ 30 │    0.5667 │ 45.83 │\n│ broverbs_proverb_to_history │ 30 │    0.5000 │ 37.50 │\n│ enem                        │ 30 │    0.6333 │ 54.17 │\n│ faquad                      │ 30 │    0.7068 │ 70.68 │\n│ hatebr                      │ 30 │    0.6000 │ 20.00 │\n│ inferbr                     │ 30 │    0.5000 │ 25.00 │\n├─────────────────────────────┼────┼───────────┼───────┤\n│ AVERAGE (NPM)               │    │           │ 42.68 │\n└─────────────────────────────┴────┴───────────┴───────┘\nRun ID: e2a9c1bc\n----------------------------------------\n    PoETa v2 (Min) Results: Qwen--Qwen3-4B-Instruct-2507   \n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━━━━━━┳━━━━━━━━┓\n┃ Task                        ┃  N ┃ Raw Score ┃    NPM ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━━━━━━╇━━━━━━━━┩\n│ assin2_rte                  │ 30 │    0.9000 │  80.00 │\n│ assin2_sts                  │ 30 │    0.8646 │  66.15 │\n│ bluex                       │ 30 │    0.5000 │  37.50 │\n│ broverbs_history_to_proverb │ 30 │    1.0000 │ 100.00 │\n│ broverbs_proverb_to_history │ 30 │    0.9667 │  95.83 │\n│ enem                        │ 30 │    0.8333 │  79.17 │\n│ faquad                      │ 30 │    0.9182 │  91.82 │\n│ hatebr   
                   │ 30 │    1.0000 │ 100.00 │\n│ inferbr                     │ 30 │    0.9000 │  85.00 │\n├─────────────────────────────┼────┼───────────┼────────┤\n│ AVERAGE (NPM)               │    │           │  81.72 │\n└─────────────────────────────┴────┴───────────┴────────┘\nRun ID: 26591c36\n----------------------------------------\n```\n\n</details>\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `task` | str | `\"all\"` | Specific task to load. |\n| `seed` | int | `-1` | Random seed for dataset shuffling. No shuffling if negative. |\n| `examples_per_task` | int | `-1` | Maximum number of examples to load per sub-task (useful for quick evals or balanced training). |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The main scalar score. The sum of `format_reward` and the task-specific reward. The maximum possible value is 2.0. |\n| `format_reward` | 1.0 if the model successfully put the answer inside `\\boxed{}`, 0.0 otherwise. |\n| `*_accuracy` | (Task specific) Exact match accuracy for that specific sub-task. |\n| `assin2_sts_proximity_score` | (STS only) How close the predicted float value was to the gold standard. |\n| `faquad_f1_score` | (FaQuAD only) Token overlap F1 score between prediction and ground truth. |","encoding":"utf-8","truncated":false,"total_bytes":8107},"status":null}