{"data":{"kind":"file","path":"README.md","version_id":"mso5dccb6yvoyq4h1aoy5wun","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8244,"modified_at":"2026-05-13T00:33:56.114000","content_hash":"e3979ac9c16d18e95d22f5444840e227157bd9ebfd2affc6a7af4e6c19938a6a"},"entries":[],"content":"# opencode-science\n\n### Overview\n- **Environment ID**: `opencode_science`\n- **Short description**: Solve science problems using an OpenCode agent inside a sandbox, verified with `math_verify`.\n- **Tags**: `science`, `opencode`, `multi-turn`\n\n### Datasets\n- **Primary dataset**: [PrimeIntellect/INTELLECT-3-RL](https://huggingface.co/datasets/PrimeIntellect/INTELLECT-3-RL) (subset `science`, split `train`).\n- Any HuggingFace dataset with question/answer columns can be used.\n\n### Task\n- **Type**: multi-turn (OpenCode CLI agent in a sandbox)\n- **Output format expectations**: Agent output should contain a `\\boxed{}` answer.\n- **Rubric**: `HybridMathRubric` — extracts `\\boxed{}` from the agent's terminal output and verifies against the expected answer using `math_verify`. 
Produces a binary `correct_answer` score (1.0 or 0.0).\n\n### Architecture\n\n`OpenCodeScienceEnv` inherits from base classes in the `verifiers` package:\n\n```\nOpenCodeScienceEnv  (environments/opencode_science/opencode_science.py)\n  └── OpenCodeQAEnv  (verifiers/envs/experimental/opencode_qa_env.py)\n       └── OpenCodeEnv  (verifiers/envs/experimental/opencode_env.py)\n            └── vf.CliAgentEnv  (verifiers/envs/experimental/cli_agent_env.py)\n```\n\n- **`OpenCodeEnv`** — installs and configures the OpenCode CLI agent in a sandbox, handles prompt/config upload.\n- **`OpenCodeQAEnv`** — loads a HuggingFace QA dataset and formats it for the agent.\n- **`OpenCodeScienceEnv`** — sets science-specific defaults (dataset, rubric, instruction prompt).\n\n### Quickstart\n\n```bash\n# install (local development)\nuv pip install -e ./environments/opencode_science\n\n# single debug rollout\nuv run vf-eval --env opencode_science -d -v -n1 -r1\n\n# multiple rollouts, save results\nuv run vf-eval --env opencode_science -n5 -r3 -s\n```\n\n### Environment Arguments\n\nThese are the arguments accepted by `load_environment()`:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"PrimeIntellect/INTELLECT-3-RL\"` | HuggingFace dataset name |\n| `dataset_subset` | str | `\"science\"` | Dataset subset/config |\n| `dataset_split` | str | `\"train\"` | Dataset split |\n| `question_key` | str | `\"question\"` | Column name for questions |\n| `answer_key` | str | `\"answer\"` | Column name for expected answers |\n| `instruction_prompt` | str | `\"Solve the following problem.\\n\\n\"` | Prefix prepended to each question |\n| `instruction_prompt_post` | str | `\"\"` | Suffix appended to each question |\n| `difficulty_key` | str \\| None | `\"avg@16_qwen3_4b_instruct_2507\"` | Column for difficulty filtering |\n| `min_avg_reward` | float | `0.0` | Minimum reward for dataset filtering |\n| `max_avg_reward` | float | `1.0` | 
Maximum reward for dataset filtering |\n| `system_prompt` | str \\| None | *(OpenCode default)* | System prompt for the agent |\n| `disabled_tools` | list[str] \\| None | `[\"question\", \"task\", \"websearch\"]` | OpenCode tools to disable |\n| `agent_workdir` | str | `\"/app\"` | Working directory inside the sandbox |\n| `answer_path` | str | `\"/app/answer.txt\"` | Path where the agent writes its final answer |\n| `score_remotely` | bool | `True` | Whether to read the answer from `answer_path` in the sandbox |\n| `use_judge_fallback` | bool | `True` | Fall back to LLM judge if math_verify fails |\n| `judge_model` | str | `\"openai/gpt-5-nano\"` | Model for the judge fallback |\n| `judge_base_url` | str \\| None | `\"https://api.pinference.ai/api/v1\"` | Base URL for the judge API |\n| `judge_api_key_var` | str \\| None | `\"PRIME_API_KEY\"` | Environment variable for the judge API key |\n| `sandbox_docker_image` | str | `\"...opencode-science:rl2\"` | Docker image for the sandbox (opencode binary baked in) |\n| `timeout_seconds` | float | `3600.0` | Rollout timeout (1h) |\n| `sandbox_cpu_cores` | int | `1` | CPU cores for the sandbox |\n| `sandbox_memory_gb` | int | `2` | Memory (GB) for the sandbox |\n| `sandbox_disk_size_gb` | int | `4` | Disk size (GB) for the sandbox |\n| `sandbox_client_max_workers` | int \\| None | `None` | Max concurrent sandbox workers |\n| `max_turns` | int | `100` | Max conversation turns |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward: 1.0 if `math_verify` confirms correctness, else 0.0 |\n| `correct_answer` | Binary `math_verify` result (same as reward when no other reward functions are added) |\n\n### How it works\n\n1. On init, loads the HuggingFace dataset (science subset) and prepends the instruction prompt to each question.\n2. Each rollout creates a sandbox, installs the OpenCode CLI, uploads the prompt and config, then runs the agent.\n3. 
The agent's API calls are intercepted and routed to the configured LLM.\n4. After the agent finishes, the rubric reads the answer from `/app/answer.txt` in the sandbox (when `score_remotely=True`) or extracts the `\\boxed{}` answer from the conversation, and verifies it against the expected answer using `math_verify`. If verification fails and `use_judge_fallback=True`, an LLM judge provides a fallback score.\n\n### Changelog\n\n### v0.3.11\n- Bump `verifiers` to `>=0.1.15.dev2` for the OpenCode harness config that disables title-generation calls while preserving the `small_model` pin.\n\n### v0.3.10\n- Default `sandbox_client_max_workers` to `None` so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.\n\n### v0.3.9\n- Harden sandbox image bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.\n\n### v0.3.8\n- Fix `sandbox_docker_image` prefix. The `cme8364tg000o1139v84cu0cv/...` prefix carried over from v0.3.7 is a user-scoped ID that the cluster cannot pull from, causing `ImagePullBackOff` on every sandbox creation. Swap to the team-scoped `team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2`.\n\n### v0.3.7\n- Pin `sandbox_docker_image` default to `team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2`. The new image bakes the opencode v1.1.63-rl2 binary into the sandbox so cold sandboxes no longer need to install it at rollout time. Documentation and image table updated to match.\n\n### v0.3.5\n- Bump opencode fork release from `1.1.63-rl1` to `1.1.63-rl2` ([PrimeIntellect-ai/opencode#3](https://github.com/PrimeIntellect-ai/opencode/pull/3)). Fork release surfaces session-level retry exhaustion as a non-zero exit with a structured stderr dump, so hosted RL rollouts that previously returned silent empty trajectories now produce real `AgentError` entries. 
Companion default bump in verifiers: [PrimeIntellect-ai/verifiers#1184](https://github.com/PrimeIntellect-ai/verifiers/pull/1184).\n\n### v0.3.4\n- Bump `verifiers` to stable `>=0.1.12`.\n\n### v0.3.3\n- Bump `verifiers` to `>=0.1.13.dev1`.\n\n### v0.3.2\n- Bump `verifiers` to stable `>=0.1.12`.\n\n### v0.3.1\n- Fix package structure: convert the flat module to a proper package directory so hatchling includes it in the wheel. Fixes `ModuleNotFoundError` in hosted training.\n- Import harness and taskset from `verifiers.envs.experimental.composable` instead of separate packages.\n\n### v0.3.0\n- Import harness and taskset from the `verifiers` package proper (`verifiers>=0.1.12.dev5`).\n\n### v0.2.2\n- Migrate OpenCode fork from `rasdani/opencode` to `PrimeIntellect-ai/opencode`. Bump release from `1.1.63-swe8` to `1.1.63-rl1` (trimmed system prompt for RL training efficiency).\n\n### v0.2.1\n- Bump `verifiers` to `>=0.1.12.dev3`: fixes opencode model ID for LoRA adapter names without `/` in hosted training.\n- Use personal sandbox image for public reproducibility.\n\n### v0.2.0\n- Rewrite to composable architecture. Uses `ComposableEnv` + `MathTaskSet(subset=\"science\")` + `opencode_harness`. Scored by `RemoteHybridMathRubric` with judge fallback. Replaces `OpenCodeScienceEnv` class hierarchy.\n- Verify OpenCode tarball integrity with pinned SHA-256 checksum (via `opencode_harness`).\n\n### v0.1.1\n- Bump `verifiers` to `v0.1.12.dev1`: perf improvements to `MathRubric` (used internally by `HybridMathRubric`), which now uses `extract_boxed_answer` in strict mode. If no `\\boxed{}` answer is found, the parsed answer is `\"\"`, which always scores 0. This prevents false positives where the model is rewarded for containing the correct answer anywhere in the response.\n\n### v0.1.0\n- Initial release\n","encoding":"utf-8","truncated":false,"total_bytes":8244},"status":null}