{"data":{"kind":"file","path":"README.md","version_id":"jsfo5jgicmssj967mjmii0id","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8617,"modified_at":"2026-05-13T03:31:13.247000","content_hash":"32c6052ac40256a9cc2506b8e173cdb27367346eaad235e5b971f3c637fd382c"},"entries":[],"content":"# usaco-bronze\n\n### Overview\n- **Environment ID**: `usaco-bronze`\n- **Short description**: Code-generation environment for USACO problems with official-test rewards.\n- **Tags**: `usaco`, `code`, `train`, `eval`\n\n### Dataset\n- **Problems**: external full-corpus archive loaded at runtime\n- **Source**: USACO result pages and official test-data zips\n- **Split size**: determined by the runtime corpus plus requested statement and solution languages\n- **Official tests**: loaded from `test_data_dir` or `test_data_archive`; official tests are not packaged in the environment wheel\n\nRegenerate a local corpus for development:\n\n```bash\npython3 scripts/scrape_usaco_contest.py --year 2025 --month jan --data-dir data/usaco_jan25_all_divisions_multilingual --languages en\n```\n\nFor local smoke runs, regenerate with a small test cap:\n\n```bash\npython3 scripts/scrape_usaco_contest.py --year 2025 --month jan --data-dir data/usaco_jan25_all_divisions_multilingual --languages en --max-tests-per-problem 3\n```\n\nPackage the full local hidden-test set as an external artifact:\n\n```bash\ntar -C data -cJf data/usaco_2023_2025_all_full.tar.xz usaco_2023_2025_all_full\n```\n\n### Task Modes\n- **Single-turn**: output exactly one fenced code block in the requested language; reward is official `passed_tests / total_tests`.\n- **Agentic**: use native tool calls to interact with `statement.txt`, `edit_file`, `run_bash`, and `submit_file`; reward is the best official pass fraction minus small submission/turn penalties.\n- The agentic harness tolerates common OpenAI-compatible parser artifacts such as leaked Harmony channel suffixes on tool names, while still exposing only the canonical native tool schemas.\n\n### Local Smoke Test\nInstall the environment package before running Prime evals:\n\n```bash\ncd environments/usaco_bronze\nuv pip install -e .\n```\n\nRun a cheap evaluation with one rollout per problem:\n\n```bash\nprime eval run usaco-bronze --skip-upload\n```\n\nFor GPT-OSS-20B agentic evals on full statements, use at least `--max-tokens 16384`. An 8k cap can truncate reasoning before the model reaches `edit_file` on longer Bronze statements.\n\nFor the current local Ollama setup, use the repo script instead of `prime eval`\nso the endpoint registry is honored directly by Verifiers:\n\n```bash\n./configs/local_ollama_smoke.sh\n```\n\nFor Qwen 3.5 models in Ollama, use the native smoke script. Ollama's OpenAI\ncompatibility endpoint returns Qwen 3.5 thinking tokens as `reasoning`, which\ncauses empty completions unless the native API is called with `think:false`.\n\n```bash\n./scripts/ollama_native_smoke.py --model qwen3.5:4b --num-predict 512\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | --- | --- | --- |\n| `max_examples` | int | `-1` | Limit dataset rows; use `1` for the cheapest smoke test. |\n| `timeout_seconds` | float | `10.0` | Per-test execution timeout. |\n| `rollout_timeout_seconds` | float | `180.0` | Wall-clock budget for one agent conversation; separate from submitted-program execution timeout. |\n| `problem_slugs` | str | `\"\"` | Comma-separated subset, for example `its-mooin-time-ii`. |\n| `statement_languages` | str | `\"en\"` | Comma-separated statement languages. Only languages present for each problem are used. |\n| `solution_languages` | str | `\"cpp\"` | Comma-separated solution languages. Use `cpp,python` to create separate C++17 and Python3 episodes. |\n| `solution_language_overrides` | str | `\"\"` | Optional JSON object mapping problem slug to `cpp` or `python`; use it for mixed-language evals without duplicating every problem. |\n| `statement_translation_dir` | str | `\"\"` | Optional custom translation root, e.g. `data_translations/es/<problem>/statement.txt`. |\n| `prompt_style` | str | `\"full\"` | Use `\"compact\"` for faster local smoke tests. |\n| `test_data_dir` | str | `\"\"` | External directory containing `manifest.json`, problem statements, and problem test folders. |\n| `test_data_archive` | str | `\"\"` | Package-relative path, local path, or HTTPS URL for a `.tar.*` or `.zip` corpus archive. |\n| `use_sandbox` | bool | `false` | Run generated solutions inside Prime Sandboxes instead of the local process. |\n| `sandbox_docker_image` | str | `\"gcc:13-bookworm\"` | Compiler image for sandbox judging. |\n| `max_sandbox_concurrency` | int | `8` | Maximum concurrent Prime Sandbox judges per env server. |\n| `sandbox_judge_deadline_seconds` | float | `180.0` | Overall deadline for one completion's full sandbox judge. |\n| `allow_thinking` | bool | `true` | Permit `<think>...</think>` before the final code block. |\n| `log_io` | bool | `false` | Print prompt/completion JSON to run logs for debugging. |\n| `format_reward_weight` | float | `0.0` | Optional weak format shaping; keep `0.0` for production official-test reward. |\n| `single_turn_token_penalty_weight` | float | `0.0` | Optional single-turn length penalty, computed as `-weight * approx_output_tokens / 1000`; use small values to discourage overlong thinking without dominating official-test reward. |\n| `require_external_tests` | bool | `false` | Fail at environment load time unless `test_data_dir` or `test_data_archive` is provided. |\n| `require_full_tests` | bool | `false` | Fail at environment load time if any problem has fewer test pairs than its manifest `test_count`. |\n| `environment_mode` | str | `\"single_turn\"` | Use `\"agentic\"` for the multi-turn contest workspace. |\n| `agent_backend` | str | `\"sandbox\"` | Agent workspace backend for agentic mode; use `\"sandbox\"` for production. |\n| `judge_backend` | str | `\"sandbox\"` | Official judge backend for agentic `submit_file`; use `\"sandbox\"` for production. |\n| `max_turns` | int | `50` | Maximum model turns in agentic mode. |\n| `max_submissions` | int | `3` | Maximum official submissions in agentic mode. |\n| `expose_native_tools` | bool | `true` | Required for agentic mode; exposes native OpenAI-compatible tool schemas. |\n| `wrong_submission_penalty` | float | `0.02` | Penalty per non-perfect official submission in agentic mode. |\n| `no_submission_penalty` | float | `0.01` | Penalty when an agent never reaches an official submission. |\n| `turn_penalty` | float | `0.005` | Penalty per model turn after `free_turns` in agentic mode. |\n| `free_turns` | int | `5` | Number of initial turns exempt from turn penalty. |\n| `agentic_shaping_reward_weight` | float | `0.0` | Optional weak agent-behavior shaping for agentic mode; rewards reading the statement, writing source, compiling, submitting, and passing the sample. Keep small so official tests dominate. |\n| `require_submission_for_shaping` | bool | `true` | In agentic mode, suppress tool-use shaping unless the agent reaches `submit_file`; prevents no-submit trajectories from receiving positive reward for only reading/editing/testing. |\n| `pre_submission_shaping_reward_weight` | float | `0.0` | Optional curriculum signal for weak tool callers. It only applies before any submission, rewards read/edit/run progress, and should stay below `no_submission_penalty` so no-submit trajectories remain negative. |\n\n### Hosted Data\nThe Prime environment package intentionally contains only environment code. It does not package `data/`, `data_capped/`, `data_translations/`, or local outputs. Hosted runs should pass `test_data_archive` or `test_data_dir` and set `require_external_tests = true`.\n\n### Production Judging\nProduction runs should use full official tests, set `require_full_tests = true`, and use sandbox backends when available.\n\nExternal corpora must include the same layout as `data/`: a top-level `manifest.json`, one directory per problem, each problem's `statement.txt`, and official `.in`/`.out` pairs.\n\nThe hosted full-corpus artifact is:\n\n```text\nhttps://huggingface.co/datasets/etetereagsdfg/usaco-2023-2025-official-testdata-mirror/resolve/main/usaco_2023_2025_all_full.tar.gz\n```\n\nThe config `configs/usaco_bronze_sandbox_full_local.toml` shows the full-data sandbox path. For hosted Prime runs, use the Hugging Face artifact URL or an equivalent worker-accessible archive URL.\n\nSandbox judging is fail-closed: sandbox dependency or infrastructure failures raise errors instead of silently converting infrastructure failures into model reward `0`.\n\n### Multilingual Corpora\nMultilingual statement variants share the same official tests. A corpus manifest can list available languages per problem, and training can request them with `statement_languages`.\n\nExample local all-language Jan 2025 load:\n\n```bash\npython3 scripts/scrape_usaco_contest.py --year 2025 --month jan --data-dir data/usaco_jan25_all_divisions_multilingual --languages en ru es fr zh hy\n```\n","encoding":"utf-8","truncated":false,"total_bytes":8617},"status":null}