{"data":{"kind":"file","path":"README.md","version_id":"gf7w644ijpdjkqzlcfkegu67","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5900,"modified_at":"2026-05-05T23:48:54.987000","content_hash":"1f01770d2e4c1722cd5bb0fc466e5576d13194eb0e6696d615b37a24ab1127d1"},"entries":[],"content":"# surrogate-discovery\n\n### Overview\n- **Environment ID**: `surrogate-discovery`\n- **Short description**: Tool-use environment for synthetic surrogate-model discovery decisions.\n- **Tags**: tool-use, science, surrogate-modeling, train, eval\n\n### Design Goal\nThis environment trains a model to act like a discovery scientist that must choose a surrogate candidate from hidden target data. The agent has to inspect the target, search candidates, evaluate candidate measurements, compare options, and submit a strict decision.\n\nThe first tool-use training run showed that simple exact-answer quote tasks saturated quickly and produced high zero-advantage filtering. This scaffold therefore includes continuous rank/regret reward so near-miss candidate choices still produce training signal.\n\n### Dataset\n- **Primary dataset**: deterministic synthetic targets and candidate pools generated in `build_synthetic_dataset(...)`.\n- **Default split sizes**: 60 train, 20 eval.\n- **Task families**:\n  - `best_surrogate`: maximize validated score while accounting for budget, uncertainty, and risk.\n  - `cheapest_feasible`: find the lowest-cost feasible candidate above the target threshold.\n  - `uncertainty_constrained`: choose the best candidate inside an uncertainty limit.\n  - `active_learning`: pick a promising informative candidate for the next experiment.\n  - `pareto_tradeoff`: balance score, cost, uncertainty, and risk.\n- **Ranking modes**:\n  - `oracle`: `rank_candidates` returns the true task-specific recommendation. Used by `easy` and `mixed`.\n  - `screening`: `rank_candidates` returns a proxy shortlist based on predicted scores, so the agent must evaluate enough candidates to resolve the objective before submitting. Used by `hard`.\n\n### Tools\n| Tool | Purpose |\n| --- | --- |\n| `inspect_target` | Reveals objective, metric, families, budget, and uncertainty constraints for a target. |\n| `search_candidates` | Searches the hidden candidate pool by family, predicted score, cost, and feasibility. |\n| `rank_candidates` | Returns a recommended candidate and compact task-specific ranking. In hard mode this is a screening shortlist, not the final answer. |\n| `evaluate_candidate` | Reveals final `expected_score`, cost, risk, and feasibility for one candidate. |\n| `submit_decision` | Finalizes the candidate choice with strict decision fields. |\n| `compare_candidates` | Optional shortlist comparison that returns task-specific utility ranks for selected candidates. |\n| `read_research_note` | Distractor tool that does not reveal hidden measurements. |\n\nExpected action economy: inspect the target, run one broad candidate search with `family=\"\"` and `min_predicted_score=0`, rank candidates, evaluate the measured shortlist needed for the objective, optionally compare close measured tradeoffs once, then call `submit_decision`.\n\n### Output Format\nThe final submitted decision has this shape:\n\n```json\n{\"candidate_id\":\"SD-001-C03\",\"decision\":\"select\",\"expected_score\":0.82,\"expected_cost\":240,\"risk_level\":\"low\"}\n```\n\nFor active-learning tasks, `decision` is `run_experiment`; other task families use `select`. `expected_score` is the candidate's evaluated surrogate score returned by `evaluate_candidate`.\n\n### Reward\n| Metric | Meaning |\n| --- | --- |\n| `final_answer_accuracy` | Exact `candidate_id` and `decision`. |\n| `field_accuracy` | Partial credit for candidate, decision, score, cost, and risk fields. |\n| `rank_regret_score` | Continuous credit based on the utility gap between chosen and best candidate. |\n| `required_tools_used` | Fraction of expected discovery tools used: inspect, search, rank, and submit. |\n| `evaluation_efficiency` | Rewards using a ranked shortlist with enough measured candidates to resolve the objective. |\n| `no_distractor_tools` | Penalizes use of `read_research_note`. |\n\n### Quickstart\nInstall locally:\n\n```bash\nprime env install surrogate-discovery\n```\n\nRun a smoke eval:\n\n```bash\nprime eval run surrogate-discovery -m openai/gpt-4.1-mini -n 5 -r 1 -t 1024 -T 0.2 -s\n```\n\nRun a harder eval:\n\n```bash\nprime eval run surrogate-discovery \\\n  -m openai/gpt-4.1-mini \\\n  -n 30 -r 3 -c 6 -t 1024 -T 0.2 -s \\\n  -a '{\"train_size\":80,\"eval_size\":30,\"seed\":23,\"difficulty\":\"hard\",\"max_turns\":10}'\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | --- | --- | --- |\n| `train_size` | int | `60` | Number of synthetic training rows. |\n| `eval_size` | int | `20` | Number of synthetic eval rows. |\n| `seed` | int | `19` | Deterministic dataset seed. |\n| `difficulty` | str | `\"mixed\"` | One of `easy`, `mixed`, or `hard`. |\n| `max_turns` | int | `10` | Maximum tool-use turns per rollout. |\n\n### Implementation Notes\nDashboard-facing `info` is kept as a dictionary for Prime sample upload. Internal rollout-only structured fields are stored as JSON strings: `info_json`, `case_json`, `tool_trace_json`, `evaluated_candidate_ids_json`, `parsed_answer_json`, and `submitted_answer_json`.\n\nIn `0.1.6` and later, `submit_decision` still validates that `candidate_id` exists, but it stores the model-submitted decision fields instead of replacing them with canonical hidden values. This makes incorrect `decision`, `expected_score`, `expected_cost`, and `risk_level` arguments visible to the rubric and gives RL more field-level reward variance.\n\nIn `0.1.8` and later, hard-mode screening is less oracle-like: the screening ranker no longer leaks exact hidden `risk_level`, it can return up to five shortlist candidates, and `evaluate_candidate` now returns `uncertainty` as documented. Screening-mode efficiency reward allows evaluating 2-4 candidates so RL is not pushed into a deterministic top-two recipe, while the prompt still constrains the model to one broad search and one rank call.\n\nIn `0.1.9` and later, the broad-search instruction is explicit about passing `family=\"\"` so models do not spend turns searching once per family.\n","encoding":"utf-8","truncated":false,"total_bytes":5900},"status":null}