{"data":{"kind":"file","path":"README.md","version_id":"t37oyok9lgqv32os9qe44gk0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7024,"modified_at":"2026-05-10T11:34:23.528000","content_hash":"8b302c6292bfb1bcba3c5834a3685927fd971bfe075f686834c4f35f0decf85c"},"entries":[],"content":"# teaching-env\n\n### Overview\n- **Environment ID**: `teaching-env`\n- **Short description**: Evaluates LLM explanations of textbook excerpts across pedagogy dimensions including concept coverage, coherence, prerequisite ordering, and originality.\n- **Tags**: single-turn, teaching, pedagogy, nlp, train, eval\n\n---\n\n## Quickstart\n\n### 1. Install\n\n```bash\npip install -e .\n# spaCy model (downloaded automatically on first run, or manually:)\npython -m spacy download en_core_web_sm\n```\n\n### 2. Run an evaluation\n\n```bash\n# Default: 5 examples, 3 rollouts each\nprime eval run teaching-env\n\n# Specify model, scale, and sampling\nprime eval run teaching-env \\\n  --model openai/gpt-4.1-mini \\\n  --num-examples 20 \\\n  --rollouts-per-example 3 \\\n  --max-tokens 1024 \\\n  --temperature 0.7\n```\n\n### 3. Run the test suite\n\n```bash\n# Fast metric-only tests (no model needed, ~1s)\npytest tests/test_adversarial.py -v -m \"not slow\"\n\n# Full adversarial suite including NLI + embedder (~5 min first run)\npytest tests/test_adversarial.py -v\n```\n\n---\n\n## Sample Task\n\n**System prompt:**\n```\nYou are an expert tutor. Read the textbook excerpt provided and explain the concept\nclearly to a student with no prior knowledge of the topic. Introduce prerequisite\nideas before the main concept, use at least one concrete example, and connect each\nidea to the next with explicit reasoning.\n```\n\n**User message (source excerpt):**\n```\nBefore understanding force, one must first grasp the concept of mass — the measure\nof an object's resistance to acceleration. Newton's second law of motion states that\nthe net force acting on an object equals its mass multiplied by its acceleration:\nF = ma. Force is measured in Newtons (N), mass in kilograms (kg), and acceleration\nin metres per second squared (m/s²). For example, a 2 kg object subjected to a 10 N\nforce accelerates at 5 m/s².\n```\n\n---\n\n## Sample Rollout\n\n**Model response (gpt-4.1-mini, reward = 0.82):**\n```\nTo understand Newton's second law, we first need to know what mass is.\n\nMass is a measure of how much matter an object contains, and more practically,\nhow hard it is to get the object moving or to stop it. A bowling ball has much\nmore mass than a tennis ball — it takes a lot more effort to throw.\n\nForce is a push or a pull. When you kick a soccer ball, your foot applies a force\nto the ball. Force is measured in Newtons (N).\n\nAcceleration is the rate at which an object's speed changes. Step on a car's gas\npedal and the car accelerates — its speed increases.\n\nNewton's second law connects all three: F = ma. The net force on an object equals\nits mass times its acceleration.\n\nConcrete example: a 2 kg book is pushed across a table with a 10 N force.\n  acceleration = F / m = 10 N / 2 kg = 5 m/s²\nIf we doubled the force to 20 N, the acceleration would double to 10 m/s².\nIf instead we doubled the mass to 4 kg and kept the force at 10 N, the acceleration\nwould halve to 2.5 m/s².\n\nSo the law tells us two things: more force produces more acceleration, and more mass\nproduces less acceleration for the same force.\n```\n\n---\n\n## Sample Grader Output\n\n```json\n{\n  \"concept_coverage\":    0.92,\n  \"sentence_coverage\":   0.81,\n  \"contradiction\":       0.97,\n  \"entailment_chain\":    0.84,\n  \"order\":               1.00,\n  \"example_grounding\":   1.00,\n  \"information_density\": 0.61,\n  \"readability_curve\":   0.73,\n  \"originality\":         0.78,\n  \"copy_rate\":           0.31,\n  \"composite\":           0.82\n}\n```\n\n**Score interpretation:**\n\n| Score | What it means |\n|---|---|\n| `concept_coverage = 0.92` | mass, force, acceleration, and F=ma all appear in the response |\n| `contradiction = 0.97` | response does not contradict source claims |\n| `order = 1.00` | mass introduced before force, prerequisites respected |\n| `example_grounding = 1.00` | at least one concrete numerical example present |\n| `entailment_chain = 0.84` | most consecutive sentences follow logically from the previous |\n| `originality = 0.78` | response is largely paraphrased rather than copied |\n| `copy_rate = 0.31` | only 31% of response sentences closely match a source sentence |\n| `composite = 0.82` | weighted average (passes threshold of 0.75) |\n\n---\n\n## Metrics\n\n| Metric | Weight (physics) | Meaning |\n|---|---|---|\n| `concept_coverage` | 0.20 | Fraction of KG concepts present in the response (word-boundary matched) |\n| `sentence_coverage` | 0.14 | Semantic similarity: how much of the source each response sentence covers |\n| `contradiction` | 0.22 | Absence of NLI-detected contradictions between source and response |\n| `entailment_chain` | 0.20 | Logical coherence of consecutive response sentences |\n| `order` | 0.13 | Prerequisite concepts introduced before dependent concepts |\n| `example_grounding` | 0.06 | Presence of concrete examples with quantitative anchors or named entities |\n| `information_density` | 0.02 | Content word density (signal-to-noise) |\n| `readability_curve` | 0.01 | Gradual increase in sentence complexity across the explanation |\n| `originality` | 0.02 | Paraphrase distance from source (ROUGE-based, penalises verbatim copying) |\n\nSubject-specific weight profiles are applied automatically from `metadata[\"subject\"]`. Supported subjects: `math`, `chemistry`, `physics`, `biology`, `computer_science`, `business`, `humanities`. The default profile is used when no subject is provided.\n\n### Copy penalty\n\nA multiplicative penalty is applied to the composite when `copy_rate` exceeds 0.90:\n\n```\ncopy_penalty = max(0, (copy_rate - 0.90) / 0.10)\ncomposite    = composite × (1 − copy_penalty × 0.50)\n```\n\nA verbatim copy (`copy_rate ≈ 1.0`) halves the composite regardless of other scores.\n\n---\n\n## Datasets\n\n- **Primary dataset**: Curated textbook excerpts in `data/` covering math, physics, chemistry, biology, CS, business, and humanities.\n- **Formats**: Markdown, PDF, slideshow, and plain text — parsed and cleaned via `parsing.py`.\n- **KG generation**: Knowledge graphs (concepts + prerequisite edges) are auto-generated per excerpt using fastembed embeddings and keyword extraction.\n\n---\n\n## Environment Arguments\n\nThis environment takes no user-facing arguments. Dataset, KG, and subject-weight profiles are loaded automatically.\n\n---\n\n## Known Limitations\n\n- **Numerical contradictions**: NLI models frequently fail to flag wrong numbers (e.g., \"20 m/s²\" vs \"5 m/s²\") because the sentence structure is identical. A dedicated numerical-consistency pass is not yet implemented.\n- **Humanities grounding**: `example_grounding` is STEM-shaped (detects numbers/entities/code) and under-detects entity-free vivid prose — the primary mode of humanities teaching examples.\n- **KG quality**: `concept_coverage` and `order` are only as good as the auto-generated KG. Noisy source text (PDF extraction artifacts) can produce noisy concepts; `_is_valid_concept()` filters the worst cases.\n- **RL suitability**: Validated for offline evaluation. Not yet validated for online RL training — length gaming and adversarial NLI evasion are untested exploit vectors.\n","encoding":"utf-8","truncated":false,"total_bytes":7024},"status":null}