{"data":{"kind":"file","path":"README.md","version_id":"nbc9c1as8ozm2xchd9hg9a1n","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4239,"modified_at":"2026-04-26T20:02:14.183000","content_hash":"e4bb96f05f3c67c2c28fa2168bac9dc8c99b83506214c26745d56e80233e6a1a"},"entries":[],"content":"# kv-cache-rl\n\nRL environment for KV-cache eviction policy optimization in LLM inference serving.\n\n## Inspiration\n\nThis env borrows ideas from NVIDIA Dynamo's March 2026 post, \"Full-Stack Optimizations for Agentic Inference with Dynamo\":\n\n- Blog: `https://docs.nvidia.com/dynamo/dev/blog/agentic-inference`\n\nAdapted from the post:\n\n- KV block value differences (`system/context` high value vs `reasoning/ephemeral` lower value)\n- Priority/retention-style cache policy ideas\n- Agent lifecycle angle (ephemeral work should be easier to evict)\n\nThis is a simplified RL simulator inspired by those ideas, not Dynamo itself.\n\n## Problem\n\nThe agent manages GPU KV-cache pressure with **multiple actions per step**:\n\n| Action | Effect |\n|--------|--------|\n| `evict(seq_id)` | Drops the sequence and frees its GPU memory |\n| `compress(seq_id)` | Quantizes down one tier (fp16→int8→int4); each tier halves memory cost |\n| `swap_to_cpu(seq_id)` | Frees GPU memory but adds swap access latency |\n| `keep()` | No action |\n\nThe model can call multiple tools per step. Under high pressure, it should batch 2-3 evictions/compressions to create headroom.\n\nGoal: maximize throughput while minimizing failures, latency, and unsafe evictions.\n\n```\n├── __init__.py            # Compatibility exports for env loading\n└── kv_cache_rl/\n    ├── kv_cache_eviction.py   # StatefulToolEnv + rewards + load_environment()\n    ├── simulator.py           # KVCacheSimulator + dynamics + pressure tracking\n    └── scenarios.py           # Scenario generation + compact prompt builders\n```\n\n## Simulator Highlights\n\n- Episode length: 15 steps\n- Difficulty tiers:\n  - 
Easy: memory budget 0.90-1.00, low/medium arrivals\n  - Medium: memory budget 0.75-0.90, medium/high arrivals\n  - Hard: memory budget 0.65-0.80, high/bursty arrivals\n- Block types: `system`, `context`, `generation`, `reasoning`, `ephemeral`\n- `eviction_value` indicates eviction safety (`0.0` = least safe to evict, `1.0` = safest)\n- Compression uses actual quantization multipliers (fp16=1.0, int8=0.5, int4=0.25)\n\n### Pressure Signals\n\nThe simulator tracks and exposes:\n\n- `time_above_0_95`: number of steps with normalized memory usage above 0.95\n- `pending_queue_growth`: step-to-step change in pending queue size\n\n## Observation Format (Compact)\n\nPrompts include a compact summary instead of full raw cache dumps to reduce token usage.\n\nTop-level fields include:\n\n- `memory_usage`, `memory_budget`, `episode_remaining`\n- `time_above_0_95`, `pending_queue_growth`\n- `cache_overview` (block counts, compressed/swapped counts)\n- `top_memory_entries` (largest entries)\n- `top_eviction_candidates` (ranked by safety/priority/progress)\n- `pending_summary`\n- `visible_seq_ids` (recommended IDs for tool actions)\n\n## Reward Functions\n\n| Function | Weight | Description |\n|----------|--------|-------------|\n| `failure_penalty` | 0.40 | Accelerating penalty: `min(f*0.15 + max(0,f-2)*0.1, 1.0)`, where `f` is the failure count |\n| `throughput_reward` | 0.25 | `min(total_tokens / 500, 1.0)` |\n| `headroom_bonus` | 0.18 | Tracks memory usage per step; rewards staying below the 0.8/0.95/1.0 thresholds |\n| `memory_efficiency` | 0.07 | Step function on final memory state (0.2 if below 0.85, -0.2 on overflow) |\n| `eviction_quality` | 0.05 | Rewards preserving high-value blocks (system, context) |\n| `latency_penalty` | 0.03 | `-min(total_swap_latency * 0.2, 0.3)` |\n| `risky_eviction_penalty` | 0.02 | Penalizes evicting system/high-priority/high-progress blocks |\n\nTracked metrics (weight=0): `total_tokens_metric`, `total_failures_metric`, `total_latency_metric`, `pressure_steps_metric`.\n\n## Usage\n\n```bash\n# 
Install local environment\nprime env install ./environments/kv_cache_eviction\n\n# Run eval\nprime eval run kv_cache_eviction -m gpt-5.4-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 5 -r 1 --max-concurrent 1\n\n# Hard-only scenarios\nprime eval run kv_cache_eviction --env-args '{\"difficulty\":\"hard\"}'\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `difficulty` | str | `\"all\"` | One of `easy`, `medium`, `hard`, `all` |\n| `num_examples` | int | `-1` | Number of scenarios to use (`-1` = all) |\n| `seed` | int | `42` | Scenario generation seed |\n","encoding":"utf-8","truncated":false,"total_bytes":4239},"status":null}