{"data":{"kind":"file","path":"README.md","version_id":"qri0hjh1f5k5fvmnoq9yq37e","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2488,"modified_at":"2026-03-03T23:23:17.746000","content_hash":"a4cc4d9b5c60706bfbdba19721248abf7eebfff8ba4abde221f1c560d68f0bf2"},"entries":[],"content":"# Sandbox Code (PythonEnv — isolated code execution)\n\nThe model generates Python solutions to programming challenges.  \nCode is executed inside a **Prime Sandbox** (isolated container).  \nReward is the fraction of test cases that pass.\n\n## Requirements\n\n```bash\npip install \"verifiers>=0.1.10\"\nexport PRIME_API_KEY=your-key-here\n```\n\n## What it does\n\n- **50 coding challenges** from basic (sum a list) to intermediate (edit distance, Sieve of Eratosthenes)\n- Model outputs a Python `solve()` function\n- Sandbox runs the test harness and reports `PASS_RATE:n/m`\n- Reward = `n / m` (fraction of tests passed)\n\n## Setup & Quick eval\n\n```python\nfrom lab_cookbook.recipes.sandbox_code.sandbox_code import load_environment\nimport verifiers as vf\n\nenv = load_environment(num_examples=10)\nvf.evaluate(env, model=\"gpt-4.1-mini\", rollouts_per_example=2)\n```\n\n## Training run\n\n```bash\nprime rl run config.toml\n```\n\n## Expected metrics\n\n| Model | Starting reward | After 100 steps |\n|-------|----------------|-----------------|\n| GPT-4.1-mini | ~0.85 | ~0.97 |\n| Llama-3.2-3B | ~0.35 | ~0.65 |\n| Llama-3.2-1B | ~0.15 | ~0.40 |\n\n## Key patterns\n\n### PythonEnv lifecycle\n```python\nenv = vf.PythonEnv(\n    dataset=dataset,\n    system_prompt=SYSTEM_PROMPT,\n    rubric=rubric,\n)\n```\n`PythonEnv` manages the sandbox lifecycle:\n- Creates an isolated Python REPL container per rollout\n- Executes code in the container\n- Cleans up after each rollout\n\n### Code extraction (no regex)\nThe `_extract_code()` helper looks for:\n1. ` ```python ... ``` ` fenced blocks\n2. Plain ` ``` ` blocks\n3. 
First line starting with `def `\n\n### Test harness pattern\n```python\ndef _build_test_harness(code, test_cases):\n    # Embeds user code + test runner into a single script\n    # Outputs: \"PASS_RATE:3/5\"\n    ...  # body omitted in this README\n```\n\n## Challenge categories\n\n| Category | Count | Examples |\n|----------|-------|---------|\n| Basic list ops | 10 | sum, max, reverse, count |\n| String manipulation | 8 | palindrome, vowels, capitalize |\n| Math & number theory | 8 | factorial, Fibonacci, prime, GCD |\n| Data structures | 8 | dict ops, group-by, chunk |\n| Algorithms | 8 | binary search, merge sort, BFS islands |\n| Dynamic programming | 8 | coin change, edit distance, Pascal's triangle |\n\n## Adapting to new domains\n\nReplace `_CHALLENGES` with domain-specific coding tasks.  \nTest cases are the key — each problem needs deterministic, verifiable outputs.\n\nFor **harder** problems: increase `max_tokens` to 2048+ for chain-of-thought code generation.\n","encoding":"utf-8","truncated":false,"total_bytes":2488},"status":null}