{"data":{"kind":"file","path":"README.md","version_id":"zbxfdhwsfmjbg1j71etcms76","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4291,"modified_at":"2026-05-06T22:12:05.095000","content_hash":"237d9f8587a994bfa0ff303f871151fdaf11eb8213ce8d838d40cc46f1221d80"},"entries":[],"content":"# meeting_intent\n\nReal meeting transcripts. The model has to pull out the action items that people actually committed to. Casual ideas, hedged statements, and parking-lot items are all traps.\n\nThe headline result on Claude Opus 4.6: pass@1 = 0.887 on hand-authored synthetic transcripts, pass@1 = 0.000 on three real public earnings calls. Same model, same rubric, same prompt. The failure pattern is dialogue-confirmation: commitments that get built across multiple speaker turns instead of stated in one clean sentence.\n\n## Motivation\n\nMost meeting datasets are scripted or polished. Real workplace dialogue is messier: people hedge, restate, change topics, and commit through back-and-forth. I built this env to test whether a frontier model could handle real meeting data.\n\nThe answer turned out to be no. The model treats \"I can pair with you if you want\" plus \"yeah\" as an optional offer that fizzled, even when the conversation closed the loop.\n\n## Task\n\nThe model gets a meeting transcript and a system prompt with the rules. It outputs one JSON object:\n\n```json\n{\"action_items\": [{\"owner\": \"Devon\", \"task\": \"...\", \"due\": \"Friday\"}]}\n```\n\nAn empty list is a valid output if there are no real commitments.\n\nThe model is told to filter out:\n- Casual ideas and \"we should someday\" suggestions\n- Hedged statements (maybe, might, probably, I'll see)\n- Parking-lot items\n- Things with no specific owner or no specific deadline\n- Vague follow-ups like \"let's circle back\"\n\n## Rubric\n\nFully deterministic. No LLM judge.\n\n1. Parse the model's JSON. Bad JSON gets a 0.\n2. 
Canonicalize each predicted item's owner (lowercased first name) and due date (weekday names, plus \"today\" and \"tomorrow\", with normalization for \"EOD\", \"by Wednesday\", \"before 4pm\", \"asap\", \"end of week\", and similar).\n3. Match predicted items to ground truth on canonical (owner, due). For a match to count, the predicted task field also has to contain at least one anchor keyword for that ground-truth item. Otherwise the model could guess the owner and date and write something totally wrong.\n4. Reward = F1 over correctly matched items. Pass = F1 >= 0.99.\n\nA second metric, format compliance, tracks whether the output parses to valid JSON of the expected shape. It is reported but not weighted in the reward.\n\n## Dataset\n\n15 transcripts in `data/`:\n\n- 10 synthetic, hand-authored project meetings covering brainstorms, design crits, eng standups, customer calls, exec syncs, pipeline reviews, support escalations, hiring debriefs, and prod triage.\n- 4 real public earnings calls (IIPR Q1 2025, NVDA Q3 FY26, JPM Q1 2026, plus one extra) downloaded from The Motley Fool's free transcripts.\n- 1 real anonymized client product demo (around 80 minutes).\n\nEach transcript has a matching `_ground_truth.json` and an entry in the `ANCHORS` dict in `meeting_intent.py`.\n\n## Results\n\nClaude Opus 4.6, sampling at temperature 1.0.\n\n| Dataset | Source | Transcripts | Samples each | pass@1 | mean F1 | format compliance |\n|---|---|---|---|---|---|---|\n| Synthetic | hand-authored project meetings | 10 | 64 | 0.887 | 0.965 | 1.000 |\n| Real | public earnings calls (IIPR, NVDA, JPM) | 3 | 16 | 0.000 | 0.000 | 1.000 |\n\nFormat compliance is 1.0 in both runs. On the real transcripts the model returns `{\"action_items\": []}` on 47 out of 47 valid samples. The failure is over-conservative refusal, not hallucination.\n\nThe pass@1 to pass@32 gap on the synthetic set is small (0.887 to 0.900) because the failures are deterministic, not stochastic. 
Sampling more doesn't help, so the fix has to come from training.\n\nSonnet 4.6 and Haiku 4.5 also passed none of the real transcripts.\n\n## How to run\n\nInstall and run with the Prime Intellect verifiers stack:\n\n```bash\npip install verifiers\nprime env install <owner>/meeting_intent\n```\n\nThen use `vf-eval` or the Python API. From a script:\n\n```python\nimport verifiers as vf\nenv = vf.load_environment(\"meeting_intent\")\n```\n\nThe env loads all transcripts under `data/` automatically. There are no constructor arguments.\n\n## Files\n\n```\nmeeting_intent.py           # SingleTurnEnv + Rubric + load_environment\npyproject.toml              # package metadata\ntest_rubric.py              # deterministic rubric tests\ndata/                       # transcripts and ground truth JSON\n```\n\n## License\n\nMIT.\n","encoding":"utf-8","truncated":false,"total_bytes":4291},"status":null}