{"data":{"kind":"file","path":"README.md","version_id":"pymnbrwz2yv64bq381aox2bg","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3658,"modified_at":"2026-02-26T12:09:55.033000","content_hash":"cac5bd535b4137f3e66eaec9fe03f4c77266675caa7baee67b5be96ad0337345"},"entries":[],"content":"# cutile-tilegym-env\n\n### Overview\n- **Environment ID**: `cutile-tilegym-env`\n- **Short description**: Multi-turn kernel-generation environment for TileGym derived ops evaluated with PyGPUBench on Modal B200.\n- **Tags**: eval, multi-turn, coding, cutile, cuda, tilegym, pygpubench\n\n### Datasets\n- **Primary dataset(s)**: Static packaged TileGym-benchmark-style dataset embedded in the environment (`src/dataset.py`).\n- **Source links**: TileGym and PyGPUBench repositories used as guidance for task design.\n- **Split sizes**: Determined at runtime from selected definitions; each example contains one definition plus a selected workload set optionally bucketed into small/medium/large input sizes.\n\n### Task\n- **Type**: multi-turn code generation\n- **Language target**: cutile (`cuda.tile`)\n- **In-context guidance**: The system prompt includes a compact cutile core reference (execution model and key APIs) plus minimal vector-add and matmul launch patterns derived from `cutile-python` docs/samples.\n- **Output format expectations**: The model should return code wrapped in `<python>...</python>` or a fenced `python` block. Code extraction prefers `<python>...</python>` and falls back to fenced code.\n- **Rubric overview**: Reward is based on the best submitted solution: `correctness_score * performance_score`, where `performance_score = speedup_factor / (1 + speedup_factor)`. Evaluation for one example runs across the selected workloads for that definition; `speedup_factor` is the average speedup across those workloads.\n\n### Quickstart\n\nDeploy the Modal runner:\n\n```bash\nmodal deploy environments/cutile_tilegym_env/src/modal_runner.py\n```\n\nRun an evaluation:\n\n```bash\nprime eval run cutile-tilegym-env\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `definitions` | list[str] \\| null | `null` | Optional definition-name filter. |\n| `workload_limit_per_definition` | int \\| null | `8` | Maximum workloads included per generated example. With input-size bucketing enabled, this limit applies per size bucket. Set `null` to use all workloads in each generated example. |\n| `input_size_bucketing` | bool | `true` | Split each definition's workloads into input-size buckets (`small`,`medium`,`large`) using axis-based size scoring. |\n| `num_turns` | int | `4` | Maximum multi-turn iterations per rollout. |\n| `stop_on_first_pass` | bool | `false` | Stop policy for multi-turn rollouts. `false`: always use full turn budget. `true`: stop once a submission passes. |\n| `iterations` | int | `16` | Number of generated test cases per PyGPUBench trial (`>2`). |\n| `num_trials` | int | `2` | Number of PyGPUBench trials per workload. |\n| `seed` | int | `7` | Base RNG seed for benchmark input generation. |\n| `discard_l2` | bool | `true` | Whether PyGPUBench should discard cache lines before timed runs. |\n| `rtol` | float | `1e-2` | Relative tolerance for correctness checks. |\n| `atol` | float | `1e-2` | Absolute tolerance for correctness checks. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (`correctness_score * performance_score`). |\n| `correctness_score` | Correctness score of the best submission in the rollout. |\n| `performance_score` | Normalized performance score of the best submission: `speedup_factor / (1 + speedup_factor)`. |\n| `is_passed` | `1.0` when the best submission status is `PASSED`, else `0.0`. |\n\n### Included Definitions\nThis environment uses definitions derived from the TileGym benchmark repository (including forward and backward-style tasks).\nSource repository: [NVIDIA/TileGym](https://github.com/NVIDIA/TileGym)\n","encoding":"utf-8","truncated":false,"total_bytes":3658},"status":null}