{"data":{"kind":"file","path":"README.md","version_id":"novvnxnr5f7zvhqnan2sujlt","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3643,"modified_at":"2025-11-05T09:21:34.450000","content_hash":"ca39fcd07f2275be230dc9164bd452942953dfd79c1d55e5ddff4ca86a7df385"},"entries":[],"content":"# fox\n\n### Overview\n- **Environment ID**: `fox`\n- **Short description**: A comprehensive, multi-task benchmark for fine-grained, multi-page document understanding.\n- **Tags**: eval, multimodal, document-understanding, ocr, vqa, translation, summarization, vision, single-turn\n\n### Datasets\n- **Primary dataset(s)**: Fox Benchmark Data, a collection of single and multi-page documents in English and Chinese, designed for a variety of fine-grained analysis tasks.\n- **Source links**: [ucaslcl/Fox_benchmark_data on Hugging Face](https://huggingface.co/datasets/ucaslcl/Fox_benchmark_data)\n- **Split sizes**: The environment uses all available examples from the source dataset for each of its 9 sub-tasks. The underlying benchmark consists of 112 English and 100 Chinese document pages. The number of examples can be limited using the `examples_per_subtask` argument.\n\n### Task\n- **Type**: single-turn\n- **Parser**: Primarily uses the default `vf.Parser`. 
The `cross_page_vqa` sub-task uses a specialized parser to extract answers from `\\boxed{}`.\n- **Rubric overview**: Scoring is tailored to the sub-task:\n    - OCR/Translation tasks are scored primarily by **Similarity (1 - normalized edit distance)**.\n    - Summarization/Captioning tasks are scored by **ROUGE-L F1 score**.\n    - The VQA task is scored by **Exact Match Accuracy**.\n    - Auxiliary metrics like BLEU, METEOR, and F1-score are also calculated for OCR tasks.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval fox\n```\n\nConfigure model, number of examples, and sampling parameters:\n\n```bash\nuv run vf-eval fox \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 1 \\\n  -a '{\"examples_per_subtask\": 10}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object. The `-n` argument will be superseded by the total number of examples loaded via `examples_per_subtask`.\n- **Note on the `cross_page_vqa` sub-task:** The evaluation for this task has been adapted to be more robust. Instead of requiring the model's entire output to be just the numeric answer, we instruct it to place the final answer in a `\\boxed{}` block. Our parser then extracts this value, correctly scoring models that provide reasoning or conversational content before their answer.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `examples_per_subtask` | int | `-1` | Limits the number of examples loaded for each of the 9 sub-tasks. Use `-1` to load all available examples. |\n\n### Metrics\nThe rubric emits the following metrics:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The main scalar reward, which is the weighted sum of the primary reward functions for each sub-task (Similarity, ROUGE-L, or Accuracy). |\n| `similarity_reward` | 1.0 minus the normalized Levenshtein (edit) distance. 
The primary score for OCR/translation tasks. |\n| `rouge_l_f_reward` | The F1-score for ROUGE-L, used as the primary score for summarization and captioning. |\n| `accuracy_reward` | Exact match score (1.0 for correct, 0.0 for incorrect), used as the primary score for the VQA task. |\n| `bleu_reward` | Standard NLP metric for evaluating token overlap, calculated as an auxiliary score. |\n| `meteor_reward` | Standard NLP metric considering synonyms and stemming, calculated as an auxiliary score. |\n| `f1_score_reward` | The F1-score based on token-level precision and recall, calculated as an auxiliary score. |\n| `precision_reward` | Token-level precision, calculated as an auxiliary score. |\n| `recall_reward` | Token-level recall, calculated as an auxiliary score. |","encoding":"utf-8","truncated":false,"total_bytes":3643},"status":null}