{"data":{"kind":"file","path":"README.md","version_id":"ivzv9dam5ckpxq6wl3kluycr","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4586,"modified_at":"2026-05-10T14:39:16.364000","content_hash":"c5ee59922ef3c96fc0a804bd91880c80dc48bc46548569d2876a3ac501b18ecb"},"entries":[],"content":"# teachingbench\n\nA Verifiers / Prime-Intellect RL environment that evaluates how well an LLM **teaches a specific concept** to a specific student.\n\nEach rollout is a multi-turn dialog (1..N turns, set per task) between the tutor model under test and a simulated student. After the dialog ends, an LLM judge scores the full transcript against a structured per-criterion rubric. Reward = mean of non-null per-criterion scores.\n\n## Status\n\nv0.1 — runnable, two sample tasks (`cs/intro_python_hello_world`, `cs/recursion_base_cases`). Reward path is the LLM judge; quiz / self-rating code in `grader/quiz.py` and `LLMStudent.{answer_quiz,self_rate}` is dormant.\n\n## Local development\n\n```bash\n# from the repo root\ncd environments/teachingbench\npip install -e .\n\n# dry-run (no API calls — verifies dataset + imports):\npython -m teachingbench.smoke_test\n\n# live (one rollout against real models):\npython -m teachingbench.smoke_test --live \\\n    --task cs/intro_python_hello_world \\\n    --tutor-model gpt-5.4-nano \\\n    --student-model gpt-5.4-mini \\\n    --judge-model gpt-5.4-nano\n```\n\nLive runs save outputs to `environments/teachingbench/outputs/runs/<timestamp>__<task>__<model>/{results.jsonl, metadata.json}`.\n\n## Push to Prime\n\nFrom the repo root:\n\n```bash\nprime env push --path environments/teachingbench\n```\n\n`prime env push` builds the wheel locally then uploads. 
The metadata test runs on Prime's side — verify the Prime dashboard shows it passing; CLI exit-success is **not** sufficient.\n\n### Pre-push wheel verification\n\n```bash\npip wheel environments/teachingbench --no-deps -w /tmp/wheel_test\nunzip -l /tmp/wheel_test/teachingbench-0.1.0-py3-none-any.whl   # confirm tasks/* shipped\npip install --force-reinstall --no-deps /tmp/wheel_test/teachingbench-0.1.0-py3-none-any.whl\ncd /tmp && python -m teachingbench.smoke_test                    # no source-tree dep\n```\n\n## Package layout\n\n```\nteachingbench/\n├── env.py              # TeachingEnv (MultiTurnEnv subclass) + load_environment\n├── smoke_test.py       # one-rollout driver, dry-run + live modes\n├── prompts.py          # DEFAULT_TUTOR_SYSTEM_PROMPT (empty), DEFAULT_STUDENT_SYSTEM_PROMPT,\n│                       #   DEFAULT_RUBRIC, TRANSCRIPT_JUDGE_PROMPT\n├── dataset.py          # discovers tasks from tasks/<subject>/<topic>/, builds HF Dataset\n├── student/            # Student protocol + LLMStudent (active) + HumanStudent (stub)\n├── grader/judge.py     # TeachingRubric (subclass of vf.JudgeRubric); transcript_score reward\n├── grader/quiz.py      # dormant — quiz/self-rating reward path, kept for re-enable\n├── tools/image_gen.py  # stub tool; verifiers auto-derives the schema from the function signature\n└── tasks/              # one folder per topic; meta.yaml + optional materials/\n```\n\n## Key design constraints\n\n- **Tutor system prompt defaults to empty** — realistic case is \"user opens ChatGPT, no system prompt.\" Per-task `tutor_system_prompt` in meta.yaml replaces the default.\n- **Student system prompt** = generic behavior rules (50-word cap, \"you are not the tutor\", typos OK, etc.) + per-task situational orientation appended.\n- **Per-task `turns: N`** — `turns: 1` ⇒ no student LLM call (seed → tutor → done). 
Higher N ⇒ N student-tutor exchanges before the judge scores.\n- **Programmable student messages** — task can pin specific student replies via `fixed_student_followups` (turns 2..N+1).\n- **Rubric is structured** — list of `{id, description, anchors}` criteria. Default in `prompts.py`; per-task override possible.\n- **Null criteria** — judge returns null when a criterion doesn't apply (e.g. `bridging` when no materials uploaded). Composite is mean of non-null criteria.\n- **Tutor model is general** — Verifiers passes the rollout `client` and `model`; nothing pinned.\n- **Tool calls supported** — `image_gen` stubbed but the dispatch path is live.\n\n## Configuration via `load_environment`\n\n```python\nload_environment(\n    judge_client=...,           # AsyncOpenAI, defaults to env-var-driven AsyncOpenAI()\n    judge_model=\"gpt-5.4-nano\",\n    student_client=...,         # defaults to judge_client\n    student_model=...,          # defaults to judge_model\n    default_turns=4,            # fallback when meta.yaml has no `turns:`\n    max_turns=32,               # hard ceiling\n    pass_threshold=0.6,\n    task_filter=None,           # restrict to a single task_id\n)\n```\n\n## See also\n\n- Repo-root `PLAN.md` — phase plan + open follow-ups.\n- Repo-root `PROGRESS.md` — checkbox status per phase.\n","encoding":"utf-8","truncated":false,"total_bytes":4586},"status":null}