{"data":{"kind":"file","path":"README.md","version_id":"ckpzb9v96jh409yzakztt0g1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3250,"modified_at":"2026-02-25T09:21:49.939000","content_hash":"2e595c87072b8f38e56838b85db762a6a9fd85b5249919fcf613f511c526e274"},"entries":[],"content":"# flashinfer-bench-env\n\n### Overview\n- **Environment ID**: `flashinfer-bench-env`\n- **Short description**: Multi-turn environment that ports FlashInfer-Bench kernel submission/evaluation to verifiers with Modal-backed B200 execution.\n- **Tags**: eval, multi-turn, coding, triton, cuda, flashinfer\n\n### Datasets\n- **Primary dataset(s)**: FlashInfer Trace (Definition + selected workload sets per definition).\n- **Source links**: https://huggingface.co/datasets/flashinfer-ai/flashinfer-trace\n- **Split sizes**: Determined at runtime from selected definitions; each example contains one definition plus a selected workload set optionally bucketed into small/medium/large input sizes.\n\n### Task\n- **Type**: multi-turn code generation\n- **Output format expectations**: The model should return code wrapped in `<python>...</python>` or a fenced `python` block. The environment extracts code then evaluates it as `main.py::run`.\n- **Rubric overview**: Reward is based on the best submitted solution: `correctness_score * performance_score`, where `performance_score = speedup_factor / (1 + speedup_factor)`. Evaluation for one example runs across the selected workloads for that definition; `speedup_factor` is the average speedup across those workloads.\n\n### Quickstart\n\nNote:\n- Modal must be configured on the machine running evaluation, with enough credits for B200 jobs.\n- On larger reasoning models, rollouts can take 30+ minutes with default settings.\n\n```bash\nmodal deploy environments/flashinfer_bench_env/src/modal_runner.py\n```\n\nRun an evaluation:\n\n```bash\nprime eval run flashinfer-bench-env\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `definitions` | list[str] \\| null | `null` | Optional definition-name filter. |\n| `workload_limit_per_definition` | int \\| null | `8` | Maximum workloads included per generated example. With input-size bucketing enabled, this limit applies per size bucket. Set `null` to use all workloads in each generated example. |\n| `input_size_bucketing` | bool | `true` | Split each definition's workloads into input-size buckets (`small`,`medium`,`large`) using axis-based size scoring. |\n| `num_turns` | int | `4` | Maximum multi-turn tool iterations per rollout. |\n| `stop_on_first_pass` | bool | `false` | Stop policy for multi-turn rollouts. `false`: always use full turn budget. `true`: stop once a submission passes. |\n| `warmup_runs` | int | `1` | Warmup runs for benchmark config passed to evaluator. |\n| `iterations` | int | `50` | Iterations per trial. |\n| `num_trials` | int | `2` | Number of benchmark trials. |\n| `rtol` | float | `1e-2` | Relative tolerance for correctness check. |\n| `atol` | float | `1e-2` | Absolute tolerance for correctness check. |\n| `timeout_s` | int | `120` | Timeout for benchmark execution. |\n\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (`correctness_score * performance_score`). |\n| `correctness_score` | Correctness score of the best submission in the rollout. |\n| `performance_score` | Normalized performance score of the best submission: `speedup_factor / (1 + speedup_factor)`. |\n| `is_passed` | `1.0` when the best submission status is `PASSED`, else `0.0`. |\n","encoding":"utf-8","truncated":false,"total_bytes":3250},"status":null}