{"data":{"kind":"file","path":"README.md","version_id":"acyb636vynqkxvxo592whuwv","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2411,"modified_at":"2025-10-14T23:07:43","content_hash":"4a6e0a01f18eb64c5ff532373999e991927bdd54375935fe40c1f1331083490e"},"entries":[],"content":"# LitBench\n\nLitBench is based on the paper \"LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing\" [arXiv:2507.00769](https://arxiv.org/pdf/2507.00769), which introduces a standardized benchmark for creative writing verification comprising 2,480 human-labeled story comparisons from Reddit and a 43,827-pair training corpus. The benchmark evaluates zero-shot LLM judges and trains reward models that achieve 78% accuracy, outperforming off-the-shelf judges while providing a vetted resource for reliable automated evaluation of creative-writing systems.\n\nThis environment implements a literary evaluation benchmark using pair-wise comparison with self-critique prompting and structured scoring.\n\n### Overview\n- **Environment ID**: `LitBench`\n- **Short description**: Literary evaluation benchmark using pair-wise comparison with self-critique prompting\n- **Tags**: GenRM, Pair-wise, Literary, Self-Critique, BRPO\n\n### Datasets\n- **Primary dataset(s)**: LitBench-Train dataset with story generation and pair-wise comparisons\n- **Source links**: [HF LitBench-Train](https://huggingface.co/datasets/SAA-Lab/LitBench-Train)\n- **Split sizes**: Uses streaming train split with configurable indices\n\n### Task\n- **Type**: single-turn\n- **Parser**: Standard response parsing\n- **Rubric overview**: Extracts numerical scores from LaTeX boxed format and compares them based on random flip logic\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval LitBench\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval LitBench   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"indices\": [[432,5], 
[12030,5]]}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The environment uses self-critique prompting (SPC) to generate evaluation prompts.\n- Response order is randomly flipped to reduce position bias in the comparison.\n\n### Environment Arguments\nSupported environment arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `indices` | List[List[int]] | `[[10500, 10], [35000, 5]]` | List of `[start_index, length]` pairs, each selecting a contiguous portion of the dataset |\n\n### Metrics\nThe rubric emits the following metric:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Binary reward based on score comparison (1 if correct, 0 if incorrect) |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2411},"status":null}