{"data":{"kind":"file","path":"README.md","version_id":"wz6b3h0vonzuat0gjdcpesej","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5932,"modified_at":"2026-05-19T17:35:50.269000","content_hash":"5d4d287a60a92c67cb9a94df1f03ec2d80b622c30b86b1be259435a0aba6bc85"},"entries":[],"content":"# Algorithms - Sedgewick & Wayne \n\n### Overview\n- **Environment ID**: `algorithms`\n- **Short description**: Multi-turn evaluation of exercises from the Sedgewick-Wayne *Algorithms* textbook. Coding questions are compiled and executed in a Java sandbox (`infinitasium/algorithms-textbook`); theoretical questions are scored by an LLM judge.\n- **Tags** (from `pyproject.toml`): `algorithms`, `train`, `eval`, `coding`, `qa`.\n- **Max Turns**: `8` (default; overridable via `max_turns`).\n\n### Dataset\n- **Source**: `AmitPrakash/algorithms-dataset` on Hugging Face — auto-downloaded into the environment directory on first use if the bundled JSONL is missing.\n- **Upstream**: examples reference the official [Princeton `algs4`](https://algs4.cs.princeton.edu/) materials (URLs stored in `metadata.url`).\n- **Size**: 469 examples covering chapters 1–6 of *Algorithms*, 4th Edition. Shuffled with `seed=42` at load time.\n- **Split by type** (via `metadata.code_execution`):\n  - `true` → 48 coding examples (compile + execute + reference-output match)\n  - `false` → 421 theoretical examples (LLM-judged)\n\n### Evaluation Method\n\nEach example's prompt is a three-message list: a system prompt (overridable via `system_prompt`), the question, and a follow-up user message stating the exact `java <class> <params>` invocation that will be used.\n\n#### Coding questions (`code_execution: true`)\n1. **Per-example sandbox**: in `setup_state`, a fresh `prime_sandboxes.AsyncSandboxClient` is created and a new `infinitasium/algorithms-textbook:latest` sandbox is started with `timeout_minutes` lifetime. 
The sandbox ID is stored in `state[\"_sbid\"]` and the sandbox is deleted by the `@vf.cleanup` handler at rollout end.\n2. **Reference run** (once, in `setup_state`): all entries in `info.support_files` are downloaded into `/tmp` via `curl -fLO`, then the reference solution is compiled with `javac-algs4` and executed with `java-algs4` for each parameter set in `info.params` (defaulting to `[\"\"]`). Per-set stdout becomes `state[\"ref_output\"]`.\n3. **Student run** (each turn, in `env_response`): the assistant's last message is parsed for a Java block (```` ```java ```` fences are preferred, but plain fenced blocks are also accepted), `package` declarations are stripped, `import edu.princeton.cs.algs4.*;` is auto-injected if missing, and the code is then compiled (30 s timeout) and executed (120 s timeout) against the same parameter sets.\n4. **Multi-turn feedback**: the assistant receives a user message describing the failure and may retry after any of these failures:\n   - no parseable code block,\n   - no `public class ClassName` found,\n   - compile error (`javac-algs4` exit ≠ 0),\n   - runtime error (`java-algs4` exit ≠ 0),\n   - execution timeout (`CommandTimeoutError`),\n   - output mismatch (a per-line diff, capped at 10 lines, is returned).\n\n   On a full match across all parameter sets, `state[\"solved\"] = True` is set and the rollout ends via the `@vf.stop` handler `is_solved`. Otherwise the rollout terminates at `max_turns`.\n\n#### Theoretical questions (`code_execution: false`)\nEffectively single-turn: as soon as `env_response` sees an assistant reply for a theoretical example, it sets `state[\"solved\"] = True` and returns `[]`, ending the rollout. 
The judge then scores via `qa_reward`.\n\n### Rewards\nBoth reward functions have weight `1.0` and are mutually exclusive per example, so the per-example maximum is `1.0`.\n\n| Reward function | Weight | Applies to | Description |\n|---|---|---|---|\n| `code_execution_reward` | 1.0 | coding | `1.0` if `state[\"solved\"]` is `True` and `info.code_execution` is `True`, else `0.0`. |\n| `qa_reward` | 1.0 | theoretical | Calls the judge; returns `1.0` iff the response starts with `CORRECT`, else `0.0`. Always `0.0` when `info.code_execution` is not `False`. |\n\n### Auth / required env vars\n- **Sandbox** (coding questions): `prime_sandboxes.AsyncSandboxClient` is instantiated with no arguments, so it relies on the standard Prime SDK auth — typically `PRIME_API_KEY` in the environment (or whatever the SDK auto-resolves).\n- **Judge** (theoretical questions): by default reads the API key from `os.environ[\"PRIME_API_KEY\"]` and calls `https://api.pinference.ai/api/v1/` with `judge_model=\"openai/gpt-5-mini\"`. Override via `judge_api_key`, `judge_base_url`, `judge_model`.\n- **Team header**: passing `team_id` attaches `X-Prime-Team-ID: <team_id>` to judge requests.\n\n### Quickstart\n\n```bash\n# Default: evaluate via the Prime Inference provider\nuv run prime eval algorithms -s -n 10 -r 1 -m openai/gpt-5.5\n\n# Override the judge model via env-module args\nuv run prime eval algorithms -s -n 10 -r 1 -m openai/gpt-5.5 \\\n  -a '{\"judge_model\": \"openai/gpt-5.5\"}'\n\n# Equivalent run via the verifiers CLI\nuv run vf-eval algorithms -s -n 10 -r 1 -m openai/gpt-5.5\n```\n\n### Environment Arguments\nDefaults reflect the current `load_environment` signature in `algorithms.py`.\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | `int` | `8` | Maximum interaction turns per rollout. |\n| `timeout_minutes` | `int \\| None` | `80` | Sandbox lifetime in minutes. 
When unset, resolved to `max_turns * 10` (so `80` with the default `max_turns=8`). |\n| `data_path` | `str \\| None` | `None` | Path to a custom dataset JSONL. Defaults to the bundled `algorithms_dataset.jsonl`; auto-downloads from HF (`AmitPrakash/algorithms-dataset`) if absent. |\n| `system_prompt` | `str \\| None` | `None` | Override the default assistant system prompt. |\n| `judge_model` | `str` | `\"openai/gpt-5-mini\"` | Model used by the LLM judge for theoretical questions. |\n| `judge_base_url` | `str \\| None` | `\"https://api.pinference.ai/api/v1/\"` | Base URL for the judge client. |\n| `judge_api_key` | `str` | `\"PRIME_API_KEY\"` | Name of the env var that holds the judge's API key. |\n| `team_id` | `str \\| None` | `None` | If set, attaches `X-Prime-Team-ID: <team_id>` to judge requests. |\n","encoding":"utf-8","truncated":false,"total_bytes":5932},"status":null}