{"data":{"kind":"file","path":"README.md","version_id":"ol3h7exw7psdmg6nyvw760ih","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3150,"modified_at":"2026-06-01T19:55:35.784000","content_hash":"58a44f0a7c118afa4e4c8a05bdf7de721eadd7020a800c77c02c6b2b881acb9a"},"entries":[],"content":"# scicode\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/scicode\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\nSciCode is a research-grade scientific code-generation benchmark across 16 natural science subfields. Each problem is decomposed into multiple subproblems requiring reasoning, coding, and integration.\n\n### Overview\n- **Environment ID**: `scicode`\n- **Short description**: Scicode evaluation environment\n\n### Datasets\n- **Primary dataset(s)**: `scicode-bench/SciCode`\n- **Source links**: [Paper (arXiv:2407.13168)](https://arxiv.org/abs/2407.13168) · [HF](https://huggingface.co/datasets/scicode-bench/SciCode)\n- **Split sizes**: 80 examples (consisting of 340 subproblems)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run scicode\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `with_background` | bool | `True` | Whether to include step background text in the prompts |\n| `timeout_per_test` | int | `300` | Maximum execution time (in seconds) for each test |\n| `sandbox_docker_image` | str | `\"cmakj7hyo002rz091pdjngniy/scicode:latest\"` | Docker image to use for the sandbox |\n| `sandbox_name` | str | `\"scicode\"` | Name of the sandbox environment |\n| `system_message` | str | `None` | System message to use for the environment |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `subproblem_score` | Fraction of subproblems solved across all problems (weight: 1.0) |\n| `main_problem_score` | 
Fraction of main problems solved end-to-end (all subproblems correct) (weight: 0.0) |\n\n### Changelog\n\n#### v0.1.7 (May 15, 2026)\n\n- Make Docker image configurable via `sandbox_docker_image` argument (defaults to the original image)\n- Make sandbox name configurable via `sandbox_name` argument (defaults to `\"scicode\"`)\n- Make timeout and system message configurable via `timeout_per_test` and `system_message` arguments, with defaults unified between `SciCodeEnv` and `load_environment`\n\n#### v0.1.6 (May 7, 2026)\n\n- Harden sandbox image bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries\n\n#### v0.1.5 (Feb 17, 2026)\n\n- Return `vf.UserMessage` types instead of dicts in `env_response` and `setup_state` to conform to the verifiers custom message types\n\n#### v0.1.4 (Jan 21, 2026)\n\n- Move test execution to sandbox (with custom Docker image)\n- Fix bug with skipped steps (ground-truth code is now loaded so that the model can reference it in later steps)\n- Fix bug where history was not cumulative (i.e. information from previous steps is included in the prompt template, not as conversation history)\n- Align parsing logic with official SciCode implementation (ours was overly strict)\n- Combine `validation` and `test` splits into a single dataset, score using fraction of subproblems solved, and use `with_background=True` by default (matching [Artificial Analysis](https://artificialanalysis.ai/methodology/intelligence-benchmarking))\n- Integrate monitor rubric pattern\n- Add more debug logging\n","encoding":"utf-8","truncated":false,"total_bytes":3150},"status":null}