{"data":{"kind":"file","path":"README.md","version_id":"o4md56qsi9fcoa4uuwjqxu1o","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2590,"modified_at":"2025-09-24T09:27:04.618000","content_hash":"1db21411ff67cf1aedc9c754fc3fc38694052925b42433b62a22a34172bd6505"},"entries":[],"content":"# MBPP Baseline (Verifiers environment)\nBaseline MBPP environment for verifiers that scores compilation and tests via SandboxFusion.\n\n- Task: MBPP (Mostly Basic Python Programs).\n- Entry point: `load_environment(...) -> verifiers.Environment`.\n- Key parameters: `dataset_split`, `eval_split`, `num_examples`, `data_dir`, `prompt_file`, `include_tests_in_prompt`.\n\n## Rubric and SandboxFusion\nThe rubric executes code via SandboxFusion:\n- `compile_reward`: extracts the model's Python code block, removes any `if __name__ == \"__main__\":` section, and executes it.\n- `tests_reward`: builds a test harness from MBPP `test_list` and optional `test_setup_code`, then executes it.\n- Both call `mbpp_baseline.util.sandbox.run_code`, which POSTs to sandbox servers with load balancing and fault tolerance.\n\n### Single Sandbox (default)\n```bash\nexport SANDBOX_URL=http://localhost:8080\nsudo docker run -d --rm --privileged -p 8080:8080 \\\n  --name sandboxfusion volcengine/sandbox-fusion:server-20250609\n```\n\n### Multiple Shards (for higher throughput)\n```bash\nexport SANDBOX_URLS=\"http://localhost:8080,http://localhost:8081,http://localhost:8082\"\n# Each container listens on port 8080 internally, so map a distinct host port to it.\nsudo docker run -d --rm --privileged -p 8080:8080 --name sandbox1 volcengine/sandbox-fusion:server-20250609\nsudo docker run -d --rm --privileged -p 8081:8080 --name sandbox2 volcengine/sandbox-fusion:server-20250609\nsudo docker run -d --rm --privileged -p 8082:8080 --name sandbox3 volcengine/sandbox-fusion:server-20250609\n```\n\nThe system automatically load-balances across shards and fails over when a shard becomes unreachable.\n\n## Data\nProvide MBPP JSONL under `data_dir` or place files at a known location. 
Expected filenames:\n- `mbpp_train.jsonl`, `mbpp_valid.jsonl`, `mbpp_test.jsonl` (or `mbpp.jsonl` for `full`)\n\nDataset available at: **https://huggingface.co/datasets/jash404/mbpp**\n\nYou can pass `data_dir` via code or CLI flags supported by verifiers.\n\n## Quick start (local)\n```bash\nuv sync  # installs deps incl. verifiers[all]\nuv run vf-install mbpp_baseline -p environments\n```\n\n## Minimal usage (programmatic)\n```python\nfrom verifiers import load_environment\nenv = load_environment(\"mbpp_baseline\", dataset_split=\"valid\", num_examples=50)\n```\n\n## Rewards\n- **compile_reward**: 1 if the extracted Python code runs with return code 0; otherwise 0.\n- **tests_reward**: fractional score = passed/total from an isolated harness; falls back to a binary 0/1 score when parsing fails.\n- **format_reward**: 1 if the final output ends with a single fenced `python` code block; otherwise 0.\n- **final**: 0.2·compile_reward + 0.7·tests_reward + 0.1·format_reward.\n\n","encoding":"utf-8","truncated":false,"total_bytes":2590},"status":null}