{"data":{"kind":"file","path":"README.md","version_id":"xms3z1krmj4tx60uq7v3sv1p","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2719,"modified_at":"2026-04-26T20:55:31.373000","content_hash":"d8136e1b74a2902a467016874d38dfbd1cb9d11d5e38178187e503c12d7cb8e7"},"entries":[],"content":"# benchmark-mc\n\n**Behavioral benchmark MCQ environment for testing LLM overgeneralization of library rules.**\n\nEvery question presents a code snippet where a known rule *almost* applies — but doesn't, due to a non-obvious boundary condition. The hard negative distractor is always the output a model confidently applying the rule would predict, making wrong answers a signal of systematic overgeneralization rather than mere knowledge gaps.\n\nAll questions are **execution-verified**: both the confirming snippet and the confounder were actually run before inclusion.\n\n## Benchmarks included\n\n| Library | Tasks | Families |\n|---------|------:|---------|\n| pandas | 10 | groupby_semantics, dtype_coercion, nan_semantics, index_alignment, copy_semantics |\n| scikit-learn | 10 | preprocessing_scaling_behavior, cross_validation_data_leakage, feature_importance_interpretation, random_state_reproducibility |\n\n## Example question\n\n```\nRule: StandardScaler applies training mean/variance to test data.\n\nQuestion — what does this print?\n\n  scaler = StandardScaler()\n  scaler.fit([[10],[20],[30]])\n  scaler.fit([[100],[200],[300]])   ← called fit() again\n  print(scaler.transform([[100]])[0][0])\n\n  A. -1.224744871391589   ← CORRECT (second fit() overwrites statistics)\n  B. -1.0\n  C. 0.0\n  D. 9.797958971132712    ← hard negative (rule naively applied)\n```\n\n## Usage\n\n```bash\n# Evaluate a model on all 20 questions\nprime eval run benchmark_mc -m openai/gpt-4.1-mini -n 20\n\n# Target one library\nprime eval run benchmark_mc -m openai/gpt-4.1-mini --env-args '{\"library\": \"pandas\"}' -n 10\n\n# Evaluate a small open model (via hosted inference)\nprime eval run benchmark_mc -m openai/Qwen/Qwen2.5-1.5B-Instruct -n 20 --hosted\n```\n\n## For RL training\n\n```python\nfrom benchmark_mc.benchmark_mc import load_environment\n\n# Chain-of-thought mode recommended for RL — rewards structured reasoning\nenv = load_environment(library=\"all\", chain_of_thought=True)\n```\n\nThe environment returns `1.0` for a correct letter answer, `0.0` otherwise. The model can reason freely; only the last standalone `A/B/C/D` on the final line is scored.\n\nTo bootstrap training on models that initially abstain, set `format_reward_weight=0.2` for the first ~50 gradient steps, then anneal to `0.0`.\n\n## Generating more questions\n\nThe environment is generated by [benchmark-creator](https://github.com/dinkarjuyal/benchmark-creator). To add more questions or target a new library:\n\n```bash\nANTHROPIC_API_KEY=sk-ant-... python3 -m benchmark_creator \\\n  --repo https://github.com/your-library/repo \\\n  --strategy sgs --n 5 \\\n  --output benchmarks/your_library\n\npython3 scripts/bundle_pi_data.py --benchmarks your_library\nprime env push --auto-bump\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2719},"status":null}