{"data":{"kind":"file","path":"README.md","version_id":"ix986jk66i4hno0nxzubwpq7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7373,"modified_at":"2026-03-07T18:58:57.744000","content_hash":"dcbb88db1e2f4c993a648018cdb90e9ca00c3162a23b21e6618b95998cc452a0"},"entries":[],"content":"# hypothesis-forge\n\nA [`verifiers`](https://github.com/kye/verifiers) environment for training models to come up with a **new idea**, back it up with **retrieved evidence**, and end with an **experiment that could prove it wrong**.\n\n## Why\n\nMost public RL environments train models to be correct. They do not put much pressure on models to be **interesting in a disciplined way**: to search over a design space instead of jumping to the nearest familiar answer, to ground claims in evidence that was actually gathered, and to propose the next experiment that could prove the idea wrong.\n\nI built `hypothesis-forge` because I wanted an environment that rewards **novelty with accountability**.\n\n## What it trains\n\nA concrete sequence of behaviors:\n\n1. Search a constrained design problem broadly enough to find multiple plausible directions.\n2. Know what is already known instead of paraphrasing the nearest baseline.\n3. Use tools to inspect evidence instead of bluffing.\n4. Commit to a mechanism that deserves follow-up.\n5. Propose an experiment that distinguishes the idea from what came before.\n\nThis is not a creativity benchmark for nice-sounding prose. The target is exploratory, evidence-grounded, falsifiable ideation.\n\n## Why hybrid synthetic\n\nPure open-web plus judge-model setups are slow, expensive, noisy, and easy to game under RL. 
Pure symbolic novelty puzzles are clean and cheap but too toy-like to say much about real research behavior.\n\n`hypothesis-forge` sits in the middle: synthetic enough to run cheaply and deterministically, structured enough for honest reward functions, open-ended enough that the model still has to search, and tool-based so evidence gathering is part of the task.\n\n## How an episode works\n\nEach episode is a research brief from one of four domains:\n\n- materials discovery\n- biosensor design\n- surrogate model discovery\n- LLM memory systems\n\nThe model sees a research objective, a shortlist of candidate components, and tool access to evidence cards and known baselines. It returns one JSON object with a `title`, `selected_components`, `cited_cards`, `hypothesis`, and `experiment`.\n\nFive tools:\n\n- `list_brief()`\n- `list_components()`\n- `search_evidence(query)`\n- `read_card(card_id)`\n- `list_known_ideas()`\n\n## Hardening\n\nSeveral choices make the reward signal more honest:\n\n- Evidence-card IDs and baseline IDs are opaque per task.\n- Grounding only credits evidence the model actually surfaced through tool use.\n- Realistic-mode grounding cannot max out from search-only snippets; at least one opened card is required.\n- `distinguishes_from` only scores when the referenced baseline was actually revealed.\n- Novelty is constrained by feasibility and baseline-relative improvement, so obviously weak ideas do not win for being merely different.\n- Grounding uses citation precision, not citation stuffing.\n- The default benchmark is a blended hard public dev set: 20 hard synthetic tasks and 20 realistic surrogate tasks.\n\n## Reward design\n\n| Reward | Weight | Rationale |\n|--------|--------|-----------|\n| `feasibility_reward` | 0.25 | Clever nonsense should not win |\n| `grounding_reward` | 0.22 | Ideas need evidence that was actually inspected |\n| `experiment_reward` | 0.20 | Ideas need a discriminating next step |\n| `novelty_reward` | 0.15 | Reward 
movement beyond baseline imitation |\n| `explanation_reward` | 0.13 | Mechanism clarity should matter, but not dominate |\n| `format_reward` | 0.03 | Structure enforcement |\n| `group_diversity_bonus` | 0.02 | Encourage varied proposals only after a quality floor |\n\n## Output format\n\n```json\n{\n  \"title\": \"Adaptive residual surrogate\",\n  \"selected_components\": [\"C1\", \"C4\", \"C7\"],\n  \"cited_cards\": [\"EV-7K2Q9M\", \"EV-R4D8TX\", \"EV-N6P3LC\"],\n  \"hypothesis\": \"Combining the physics residual head with the uncertainty calibrator and graph prior encoder should improve calibration while preserving fast inference.\",\n  \"experiment\": {\n    \"manipulation\": \"Remove the uncertainty calibrator while keeping the rest fixed.\",\n    \"readout\": \"Expected calibration error and OOD performance on shifted inputs.\",\n    \"prediction\": \"The full stack should maintain better uncertainty calibration and robustness than the ablated variant.\",\n    \"distinguishes_from\": \"KI-3M7QPD\"\n  }\n}\n```\n\n## Installation\n\n```bash\n# From source\nuv sync --group dev\n\n# After package release\nuv pip install hypothesis-forge\n```\n\nThe package exposes `load_environment()`.\n\n## Prime Hub\n\nThe public Prime Hub environment lives at `casella/hypothesis-forge`.\n\n```bash\n# Inspect the published environment\nprime env info casella/hypothesis-forge\n\n# Install it from Prime Hub\nprime env install casella/hypothesis-forge\n\n# Run an official Prime eval with Prime Inference\nprime eval run casella/hypothesis-forge --model openai/gpt-4.1-mini\n\n# Run the realistic surrogate-mode track through Prime\nprime eval run casella/hypothesis-forge \\\n  --model openai/gpt-4.1-mini \\\n  --env-args '{\"task_mode\":\"surrogate_realistic\",\"domain_filter\":\"surrogate model discovery\"}'\n```\n\nOfficial Prime evals use Prime Inference by default, so the Hub-hosted path does not need the OpenRouter base URL or API-key flags.\n\n## Quickstart\n\n```bash\n# 
Default local eval\nuv run vf-eval hypothesis-forge\n\n# OpenRouter\nexport OPENROUTER_API_KEY=...\nuv run vf-eval hypothesis-forge -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY --debug\n\n# Bundled realistic surrogate-model pack\nuv run vf-eval hypothesis-forge -a '{\"task_mode\":\"surrogate_realistic\",\"domain_filter\":\"surrogate model discovery\"}'\n\n# Explicit default blended-hard public dev set\nuv run vf-eval hypothesis-forge -n 40 -r 4 -a '{\"task_mode\":\"blended_hard\"}'\n\n# Private holdout pack outside the public repo\nuv run vf-eval hypothesis-forge -a '{\"task_pack_path\":\"./holdout/private_surrogate_holdout.json\"}'\n\n# Calibration ladder helper\npython scripts/calibrate_ladder.py --dry-run\n```\n\n## Environment arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `num_train` | int | `200` | Generated train examples |\n| `num_eval` | int | `40` | Generated eval examples |\n| `seed` | int | `0` | Global seed |\n| `max_turns` | int | `25` | Max tool-use turns |\n| `max_components` | int | `4` | Max selected components |\n| `domain_filter` | `str \\| None` | `None` | Restrict tasks to one domain, such as `\"surrogate model discovery\"` |\n| `task_mode` | `str` | `\"blended_hard\"` | Task source mode: `\"blended_hard\"` (default), `\"synthetic\"`, or the bundled `\"surrogate_realistic\"` pack |\n| `task_pack_path` | `str \\| None` | `None` | Optional JSON or JSONL task pack path for private dev or holdout packs |\n| `system_prompt` | str | built-in | Optional prompt override |\n\n## Limitations\n\n- Only one realistic dev pack exists today, and it is focused on surrogate model discovery.\n- The other public domains are still synthetic.\n- The reward functions are symbolic, not judge-based.\n- There is no cross-episode memory or curriculum yet.\n- This is a benchmark prototype for a harder class of research environments, not the final realism layer.\n\n## Next\n\nThe next direction I care most about is tightening the 
credibility loop around the realistic surrogate track: stronger adversarial evals, private holdout packs loaded through `task_pack_path`, and clearer evidence that training on this environment changes real ideation behavior rather than just benchmark behavior.\n","encoding":"utf-8","truncated":false,"total_bytes":7373},"status":null}