{"data":{"kind":"file","path":"README.md","version_id":"ykwu0wz7f3gwrubl9t8vvhho","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9348,"modified_at":"2026-04-27T09:56:53.850000","content_hash":"ed712a9dd0a03a67e8c096d0ebdc86e1f2c4ab6766dea38a958fbab20bc738ff"},"entries":[],"content":"# geopolitical-claim-verification\n\nA single-turn verifiers environment that tests whether a model can correctly classify a factual geopolitical claim as `TRUE`, `FALSE`, `PARTIALLY_TRUE`, or `UNVERIFIED` by reasoning over a corpus of mixed-reliability sources under realistic adversarial conditions.\n\nThe benchmark targets a real-world skill: separating well-supported claims from belligerent narrative warfare, single-source rumors, partially-true statements where the headline is right but the details are disputed, and source-poisoned propaganda framings.\n\n## Dataset\n\n- **22 hand-curated claims** anchored on the Operation Epic Fury period (US–Iran conflict, February–April 2026).\n- Sources include: Trump Truth Social posts, Iranian Supreme National Security Council statements (Araghchi), Pakistani mediator statements (Sharif), Polymarket market resolutions and rules, contemporaneous wire / think-tank reporting.\n- **Verdict distribution** (intentionally FALSE-skewed — emphasizes the adversarial reasoning patterns where common heuristics fail):\n\n  | Verdict | Count | % |\n  |---|---:|---:|\n  | TRUE | 5 | 23% |\n  | FALSE | 12 | 55% |\n  | PARTIALLY_TRUE | 4 | 18% |\n  | UNVERIFIED | 1 | 5% |\n\nThe full schema for a claim entry is in [`data/SCHEMA.md`](data/SCHEMA.md). Skeletons for future expansion (Russia–Ukraine, oil/tanker OSINT) are preserved under [`data/_skeletons/`](data/_skeletons/).\n\n## Reasoning patterns tested\n\nThe 22 claims were designed to stress seven distinct failure modes that frontier models commonly miss:\n\n1. **Temporal-deadline precision** — distinguishing \"X happened by date D\" from \"X happened after D\" (e.g. 
ceasefire by March 31 vs April 8).\n2. **Definitional gap, broad vs narrow target** — \"Kharg Island was hit\" (broad, TRUE) vs \"Kharg Island oil terminal was hit\" (specific subset, FALSE).\n3. **Threats vs actions** — distinguishing a conditional future threat from an executed order.\n4. **Unilateral action vs mutual agreement** — a head-of-state extending their own pause is not the same as a bilateral ceasefire extension.\n5. **Source-poisoning recognition** — rejecting clusters of one-sided belligerent or state-media sources as insufficient regardless of count.\n6. **Belligerent claim contradicted by counterparty behavior** — Trump's \"Iran is fractured\" is FALSE, not UNVERIFIED, when Iran's documented diplomatic activity contradicts it.\n7. **Prediction-market source calibration** — treating Polymarket resolutions and rules as authoritative evidence (the dataset uses them as primary sources for many event-claims).\n\n## Verdict labels — strict definitions\n\nThe dataset enforces strict label semantics (see [`data/SCHEMA.md`](data/SCHEMA.md) for the full spec):\n\n- **`TRUE`** — well-supported by reliable, independent sources. `supporting_source_ids` lists only sources that affirmatively support the claim.\n- **`FALSE`** — reliable sources contradict the claim. `supporting_source_ids` lists only sources that affirmatively contradict.\n- **`PARTIALLY_TRUE`** — the core event is confirmed but key details (target identity, scope, timing, casualties, attribution) are disputed. `supporting_source_ids` lists only sources supporting the TRUE component; contradicting / qualifying sources go in `caveats`.\n- **`UNVERIFIED`** — *absence* of evidence, not *presence* of contradicting evidence. Reserved for cases of insufficient evidence: single-source belligerent claim with no corroboration, source-poisoning, or genuine epistemic uncertainty. `supporting_source_ids` MUST be `[]`. 
If any source contradicts the claim, the verdict is `FALSE` or `PARTIALLY_TRUE`, not `UNVERIFIED`. *(`UNVERIFIED` = absence of evidence; `FALSE` = presence of contradicting evidence — these are categorically different epistemic states.)*\n\n## Scoring rubric\n\nComposite reward is the weighted sum of four components:\n\n| Component | Weight | Type | What it measures |\n|---|---:|---|---|\n| `verdict_match` | 0.4 | deterministic | Exact match against gold verdict label |\n| `source_weighting` | 0.3 | LLM-judge (+ deterministic shortcut for `UNVERIFIED`) | Did the model select `supporting_source_ids` per the strict per-verdict semantics, weighting reliable / independent sources higher than single-side belligerent statements? |\n| `caveat_quality` | 0.2 | LLM-judge | Especially for `PARTIALLY_TRUE` and `UNVERIFIED`, did caveats reflect the required nuance? |\n| `hallucination_check` | 0.1 | deterministic | All `supporting_source_ids` cited by the model exist in the input source list |\n\nLLM-judge calls go through the Prime Intellect inference router (`https://api.pinference.ai/api/v1`) using **`qwen/qwen3-30b-a3b-instruct-2507`** as judge — chosen for cost-effectiveness and consistency. Routing through pinference also generates a double-signal (env author + inference user) for provenance.\n\n## Multi-model evaluation results\n\nEvaluated **2026-04-27** through the Prime Intellect inference router. 
All models were tested with the same 22-claim dataset, system prompt, judge model (qwen3-30b), and sampling args (`max_tokens=2000, temperature=0.2`).\n\n| Rank | Model | Composite | Verdict match | Source weight | Caveat | Halluc | Cost (22 claims) |\n|---:|---|---:|---:|---:|---:|---:|---:|\n| 1 | `google/gemini-2.5-pro` | **0.892** | **91%** | 0.761 | 1.000 | 1.000 | $0.391 |\n| 2 | `anthropic/claude-opus-4.7` | 0.831 | 82% | 0.689 | 0.986 | 1.000 | $0.391 |\n| 3 | `prime-intellect/intellect-3` | 0.763 | 64% | 0.711 | 0.977 | 1.000 | **$0.027** |\n| 4 | `anthropic/claude-sonnet-4.6` | 0.737 | 64% | 0.734 | 0.814 | 1.000 | $0.204 |\n| 5 | `qwen/qwen3-30b-a3b-instruct-2507` | 0.729 | 64% | 0.605 | 0.964 | 1.000 | $0.014 |\n| 6 | `deepseek/deepseek-r1-0528` | 0.725 | 64% | 0.568 | 1.000 | 1.000 | $0.304 |\n| 7 | `openai/gpt-5.2` | 0.655 | 55% | 0.473 | 0.973 | 1.000 | $0.186 |\n| † | `PrimeIntellect/INTELLECT-3.1` | broken | broken | broken | — | — | — |\n\n† **`INTELLECT-3.1` returns HTTP 500 across all tested configurations** on the pinference router; 8 distinct request_ids are documented in [`data/multi_model_summary.md`](data/multi_model_summary.md). Likely a server-side dispatch crash on this specific model (sibling `prime-intellect/intellect-3` works correctly through the same key + endpoint). Bug report drafted; will re-run on resolution.\n\nThe full per-claim breakdown, per-model uniqueness analysis, and INTELLECT-3.1 diagnostic detail are in [`data/multi_model_summary.md`](data/multi_model_summary.md).\n\n## Notable findings\n\n- **Gemini 2.5 Pro consistently outperforms** other frontier models on definitional-precision cases (ceasefire vs peace deal, broad-vs-narrow target subset, unilateral-vs-mutual agreement).\n- **Most models systematically fail to distinguish FALSE-by-evidence-contradiction from UNVERIFIED-by-evidence-absence** (claims `iran-israel-003`, `005`, `011`). 
They default to `UNVERIFIED` when the corpus is small, even when sources actively contradict the claim through their own framing.\n- **Prediction-market-as-primary-source claims are systematically undervalued** by most models (claims `iran-israel-018`, `021`). Models discount Polymarket resolutions even at high `reliability_prior` and prefer wire-service confirmation that doesn't exist in the corpus.\n- **`prime-intellect/intellect-3` offers the strongest cost / performance ratio** on this benchmark: 3rd place composite at roughly **14× lower cost than Opus 4.7 / Gemini 2.5 Pro** and 7.5× lower cost than Sonnet 4.6 ($0.027 vs $0.391 and $0.204 for the 22 claims).\n- **GPT-5.2 underperforms unexpectedly** (7th of 7 working models), particularly on `source_weighting` (0.473 vs 0.57–0.76 for the others).\n\n## Usage\n\nInstall and run locally:\n\n```bash\nuv venv --python 3.11 .venv\nuv pip install --python .venv/bin/python verifiers datasets openai pytest\n\n# Smoke test on the placeholder dataset (5 claims) — no PRIME_KEY needed if you skip\nPRIME_KEY=$(security find-generic-password -s api_prime -w) \\\n  .venv/bin/pytest tests/test_smoke.py -s\n\n# Full 22-claim run on the curated dataset:\nCLAIMS_PATH=data/claims.json \\\nPRIME_KEY=$(security find-generic-password -s api_prime -w) \\\n  .venv/bin/pytest tests/test_smoke.py -s\n\n# Multi-model comparison (8 models, sequential, ~$1.50 total):\nPRIME_KEY=$(security find-generic-password -s api_prime -w) \\\n  .venv/bin/python scripts/multi_model_eval.py --cap 4\n\n# Aggregate the per-model results into data/multi_model_summary.md:\n.venv/bin/python scripts/aggregate_results.py\n```\n\n`PRIME_KEY` must be available as an env var — typically sourced inline from macOS Keychain via `security find-generic-password -s api_prime -w` so it never lands in shell history or env-dump.\n\nTo validate or extend the dataset:\n\n```bash\n# Validate full dataset envelope or single-claim files\n.venv/bin/python scripts/validate_dataset.py data/claims.json\n.venv/bin/python 
scripts/validate_dataset.py data/_skeletons/ru-ua/<file>.json\n\n# Dry-mode cost estimate (no API calls)\n.venv/bin/python scripts/cost_estimator.py --mode dry data/claims.json\n\n# After filling more skeletons, merge them into the canonical claims.json\n.venv/bin/python scripts/merge_curated.py\n```\n\n## Citation\n\n```\nbitdrongo/geopolitical-claim-verification, April 2026.\nGitHub: https://github.com/bitdrongo/geopolitical-claim-verification\n```\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE).\n","encoding":"utf-8","truncated":false,"total_bytes":9348},"status":null}