{"data":{"kind":"file","path":"README.md","version_id":"hi79hvp2fkmzbbhchut5g5v7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4105,"modified_at":"2025-10-09T18:28:52.553000","content_hash":"5923d74f530b05aa7117006feaaf13c8850a6e094eaf9c67aff5698d06e2b828"},"entries":[],"content":"# backend-bench\n\n### Overview\n- **Environment ID**: `backend-bench`\n- **Short description**: Environment to evaluate LLMs on the ability to generate correct and fast GPU kernels, passing tests provided by `Torch`\n- **Tags**: eval, multi-turn, reasoning, coding, code, cuda, triton\n\n### Datasets\n- **Primary dataset(s)**: `BackendBench` - evaluation suite for testing how well LLMs and humans can write PyTorch backends\n- **Source links**: https://github.com/meta-pytorch/BackendBench\n\n### Task\n- **Type**: multi-turn\n- **Parser**: custom - parses ```python ... ``` blocks from model output\n- **Rubric overview**: The rubric includes reward functions for correctness and efficiency. 
The reward is `correctness_score * performance_score`; both scores are provided by `BackendBench`.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval backend-bench\n```\n\nConfigure environment-specific arguments:\n\n```bash\nuv run vf-eval backend_bench -b http://localhost:8000/v1 -m Qwen/Qwen3-32B -T 0.7 -k EMPTY -a '{\"gpu\":\"H100\", \"suite\":\"torchbench\", \"num_turns\": 10}' --num-examples -1 --rollouts-per-example=1\n```\n\n#### Important notes:\n- Make sure [Modal](https://modal.com/) is set up and has credits available if you want performance to be evaluated correctly. On local GPUs, multiple kernels can run at the same time, which produces incorrect performance numbers.\n- The feedback loop can make the context long: with 3 or more turns, a sequence length of at least 128k may be needed (depending on the model and the task suite).\n- The `facto` suite requires `facto` to be installed.\n\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `suite` | str | `\"torchbench\"` | Controls which suite of tasks the model is evaluated on. Options are `smoke`, `opinfo`, `torchbench` and `facto`. |\n| `ops` | list[str] | `[]` | List of specific ops to evaluate. If empty, all ops in the selected suite are evaluated. See [Torch IR](https://docs.pytorch.org/docs/main/torch.compiler_ir.html) for details. |\n| `gpu` | str | `\"local\"` | Which GPU to use. `local` uses the local GPU for debugging (performance results are unreliable because no scheduling is in place). Any of `T4`, `L4`, `A100`, `H100`, `H200` or `B200` runs the evaluation on that GPU type through [Modal](https://modal.com/), which requires Modal to be set up on the machine and credits to be available. |\n| `num_turns` | int | `1` | Number of turns in the feedback loop. 
If greater than 1, the model gets feedback on the correctness and performance of its previous attempt and can try to improve it. |\n| `feedback_loop` | str | `\"until_max_turns\"` | How to handle the feedback loop. Options are `until_max_turns` (the model gets `num_turns` attempts), `until_correct` (the model gets up to `num_turns` attempts but stops as soon as its kernel is correct) and `none` (the model gets a single attempt with no feedback). |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main reward metric, calculated as `performance_score * correctness_score`. With multiple turns, the best-performing kernel is selected. |\n| `last_reward` | Reward (calculated as above) of the last turn. |\n| `best_reward` | Best reward across all turns. |\n| `test_accuracy` | Fraction of tests passed by the best-performing kernel. |\n| `perf_accuracy` | Fraction of performance tests passed by the best-performing kernel. |\n| `best_run_performance_score` | Performance score of the best-performing kernel. |\n| `best_run_correctness_score` | Correctness score of the best-performing kernel. |\n\n### TODO\n\n- [ ] Use [`FeedbackInfo`](https://github.com/meta-pytorch/BackendBench/blob/4468ee2a6a2d7568abdc270af431d94314a07656/BackendBench/backends/llm.py#L38) from `BackendBench` to stay true to the original implementation (feedback is currently reimplemented because the original can inflate the context length too much)\n- [ ] Better/configurable system prompt\n- [ ] Possibly save kernels to a directory\n- [ ] Add evaluations\n","encoding":"utf-8","truncated":false,"total_bytes":4105},"status":null}