{"data":{"kind":"file","path":"README.md","version_id":"lf8pxj28xdbp5oz9tsbwste4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4814,"modified_at":"2026-02-24T22:50:20.082000","content_hash":"0224085ce4412b36970f5b0367a48b29a3c6dfad31dda4a3d14b8f53158a91d2"},"entries":[],"content":"# kernelbot-env\n\nUnified GPU kernel evaluation environment (`verifiers`-compatible RL env) for evaluating LLMs on writing GPU kernels. Supports multiple problem sets and hardware targets.\n\n## Setup\n\n```bash\nuv sync\n```\n\n## Running Evaluations\n\nAll evaluations use `vf-eval`. The `-a` flag passes JSON kwargs to `load_environment()`.\n\n```bash\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> -a '<json>'\n```\n\n### Run All Competitions\n\n```bash\nuv run vf-eval kernelbot-env -m gpt-5.2 -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\nThis loads every competition YAML (`amd.yaml`, `bioml.yaml`, etc.) and evaluates all problems.\n\n### Run a Specific Competition\n\n```bash\n# AMD problems (MI300X via RunPod)\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"amd\"}'\n\n# BioML problems (H100/A100/B200 via Modal, or MI300 via RunPod)\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"bioml\"}'\n```\n\n### Run a Specific Problem\n\n```bash\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"amd\",\"problems\":[\"amd-identity\"]}'\n```\n\n### Override GPU Target\n\nForce all problems to run on a specific GPU (useful for testing):\n\n```bash\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"bioml\",\"gpu\":\"H100\"}'\n```\n\n### Control Eval Size\n\nOverride defaults from `kernelbot_env/pyproject.toml` (default: `num_examples=3`, `rollouts_per_example=1`):\n\n```bash\n# Evaluate 1 problem with 2 rollouts\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -n 1 -r 2 -a '{\"competition\":\"amd\"}'\n```\n\n### Multi-turn Options\n\n```bash\n# Single-shot (no feedback loop)\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"amd\",\"num_turns\":1,\"feedback_loop\":\"none\"}'\n\n# Stop as soon as correctness passes (up to 5 turns)\nuv run vf-eval kernelbot-env -m <model> -b <base_url> -k <api_key> \\\n  -a '{\"competition\":\"amd\",\"num_turns\":5,\"feedback_loop\":\"until_correct\"}'\n```\n\n### Run via Prime Intellect\n\n```bash\nprime eval run kernelbot-env -m <model>\n```\n\n## Infrastructure Setup\n\n### Modal (NVIDIA GPUs: H100, A100, B200)\n\nModal runs a deployed server that receives eval requests. You need to deploy it before running any NVIDIA problems.\n\n**1. Authenticate:**\n```bash\nuv run modal setup\n```\n\n**2. Deploy the eval server:**\n```bash\nuv run modal deploy kernelbot_env/deploy/modal_app.py\n```\n\nThis deploys `cuda-bench-runner` with functions `eval_kernel_h100`, `eval_kernel_a100`, `eval_kernel_b200`. Each function auto-scales up to 40 containers.\n\n**3. Verify deployment:**\n```bash\nuv run modal run kernelbot_env/deploy/modal_app.py::smoke\n```\n\nNo environment variables needed — Modal uses its own auth token from `modal setup`.\n\n### RunPod (AMD GPUs: MI300X, MI300x8)\n\nRunPod requires an SSH-accessible pod. The runner SCPs a self-contained Python script to the pod and executes it via SSH.\n\n**1. Set up SSH access:**\n\nAdd a host entry in `~/.ssh/config`:\n```\nHost runpod-tcp\n    HostName <pod-ip>\n    User root\n    Port <ssh-port>\n    IdentityFile ~/.ssh/<key>\n```\n\n**2. Set environment variables:**\n\n| Variable | Default | Description |\n|---|---|---|\n| `RUNPOD_SSH_HOST` | `runpod-tcp` | SSH config host alias for the pod |\n| `RUNPOD_PYTHON_CMD` | `conda run -n py_3.10 python3` | Python command on the pod |\n\n```bash\nexport RUNPOD_SSH_HOST=runpod-tcp\n```\n\n**3. Verify connectivity:**\n```bash\nssh runpod-tcp echo ok\n```\n\n## Problems\n\nProblems vendored from [gpu-mode/reference-kernels](https://github.com/gpu-mode/reference-kernels).\n\n### AMD Developer Challenge (`amd.yaml`)\n\n| Problem | GPU |\n|---|---|\n| amd-identity | MI300 |\n| amd-fp8-mm | MI300 |\n| amd-mixture-of-experts | MI300 |\n| amd-mla-decode | MI300 |\n\n### BioML Kernels (`bioml.yaml`)\n\n| Problem | GPUs |\n|---|---|\n| trimul | B200, H100, A100, MI300 |\n\n## Architecture\n\n```\nvf-eval CLI\n  └── KernelEnv (verifiers MultiTurnEnv)\n        ├── Prompt builder → LLM\n        ├── Parser (extracts <answer> tags)\n        ├── Evaluator → Runner\n        │     ├── ModalRunner  → H100, A100, B200 (deployed server)\n        │     └── RunPodRunner → MI300X, MI300x8 (SSH + SCP)\n        └── Rubric (score = correctness * performance)\n```\n\n## `load_environment()` Parameters\n\n| Parameter | Type | Default | Description |\n|---|---|---|---|\n| `competition` | `str` | `\"all\"` | `\"all\"`, `\"amd\"`, `\"bioml\"` |\n| `gpu` | `str \\| None` | `None` | Force GPU (e.g. `\"H100\"`, `\"MI300X\"`) |\n| `num_turns` | `int` | `3` | Max feedback turns |\n| `feedback_loop` | `str` | `\"until_max_turns\"` | `\"until_max_turns\"`, `\"until_correct\"`, `\"none\"` |\n| `problems` | `list[str] \\| None` | `None` | Specific problem names to run |\n","encoding":"utf-8","truncated":false,"total_bytes":4814},"status":null}