{"data":{"kind":"file","path":"README.md","version_id":"ibu1bz3obvgwdjqkn1bhyp7a","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4703,"modified_at":"2025-09-13T06:48:20.699000","content_hash":"7d269f8a6f9a78a9fbd89d070f58fd2bf065a9a4fc22668f0e51d361a80c54a1"},"entries":[],"content":"# clrs-algorithms\n\n**Environment ID**: `clrs-algorithms`  \n**Short description**: CLRS-style single-turn Q&A environment where the **FINAL `<answer>`** is graded by an **LLM-as-a-judge** using a detailed rubric (correctness, coverage, CLRS-specific fidelity, and format).  \n**Tags**: algorithms, clrs, single-turn, judge, eval, train\n\n> Source implementation (fork/PR):\n\n## Overview\n\nThis environment evaluates short-form answers to CLRS questions. The model must reply in the protocol:\n\n```text\n<think> ... (hidden) ... </think>\n<answer> FINAL TEXT HERE </answer>\n```\n\nOnly the **FINAL `<answer>`** is judged; `<think>` is ignored. Scoring is **entirely LLM-as-judge** (no EM/F1 hard matchers by default), which avoids brittleness across LaTeX/Markdown/formatting variants.\n\n- **Parsers**: custom `CLRSThinkParser` (extracts `<answer>...</answer>`)\n- **Rubric**: single judge rubric (numeric score in `[0,1]`) with a prompt that weights:\n\n  - Correctness vs. reference (0.50)\n  - Coverage / completeness (0.20)\n  - CLRS-specific fidelity (0.20)\n\n    - Trace/final-array tasks: sorted & permutation-match\n    - Descending insertion sort pseudocode pattern\n    - Linear search + loop-invariant triad mention\n\n  - Format compliance (0.10) — `<answer>` present, no spillover after `</answer>`\n\n## Datasets\n\n- **Primary**: [`Siddharth899/clrs-qa`](https://huggingface.co/datasets/Siddharth899/clrs-qa)\n  Markdown questions/answers aligned to CLRS chapters/sections, sometimes with image refs for trace problems.\n\n**Split usage**: The env defaults to `train`. If a dataset has a different split schema, the loader will pick the sole available split when only one exists.\n\n## Task\n\n- **Type**: single-turn\n- **Parser**: `CLRSThinkParser` (derives from `vf.ThinkParser`; extracts only the `<answer>` span)\n- **Scoring**: **LLM-as-judge only** via `vf.JudgeRubric` (wrapped so the judge returns a numeric score in `[0,1]`)\n\n## Quickstart\n\nRun a small evaluation with defaults:\n\n```bash\nuv run vf-eval clrs-algorithms -s\n```\n\nSpecify model and sampling:\n\n```bash\nuv run vf-eval clrs-algorithms \\\n  -m gpt-4.1-nano \\\n  -n 20 -r 3 -s\n```\n\n> Use `-a / --env-args` to pass environment-specific args as JSON.\n\nRecommended for quick, low-cost smoke tests:\n\n- **API**: `gpt-4.1-nano`, `gpt-4.1-mini`\n\n## Environment Arguments\n\n| Arg            | Type      | Default                   | Description                                                         |\n| -------------- | --------- | ------------------------- | ------------------------------------------------------------------- |\n| `dataset_name` | str       | `\"Siddharth899/clrs-qa\"`  | HF dataset path.                                                    |\n| `split`        | str       | `\"train\"`                 | Dataset split to load; if missing, uses the only available split.   |\n| `num_examples` | int\\|None | `None`                    | If set, selects the first N examples for quick evals.               |\n| `judge_model`  | str       | `\"gpt-5\"`                 | Judge LLM name (passed to `vf.JudgeRubric`).                        |\n| `judge_prompt` | str\\|None | _internal default prompt_ | Override judge instructions (must output a single score in \\[0,1]). |\n\nThe system prompt given to student models enforces the `<think>/<answer>` protocol and warns against trailing content after `</answer>`.\n\n## Metrics\n\nThe environment emits:\n\n| Metric                 | Meaning                                                                                          |\n| ---------------------- | ------------------------------------------------------------------------------------------------ |\n| `reward`               | Final scalar = judge score in \\[0,1].                                                            |\n| `numeric_judge_reward` | The rubric’s internal judge score (same value as `reward`, broken out per rollout for debugging) |\n\n## Implementation Notes\n\n- **Verifiers version**: `verifiers>=0.1.3`\n- **Judge**: `vf.JudgeRubric` with a detailed, CLRS-aware prompt that **grades the FINAL `<answer>`** only.\n- **Parser**: `CLRSThinkParser` pulls the `<answer>` span; `<think>` is ignored by both the parser and the judge.\n- **Formatting**: Markdown/LaTeX is tolerated; the judge is instructed to prefer semantic alignment over literal match.\n- **Code quality**: run `ruff check --fix .` before pushing.\n\n## Example Commands\n\nSmall eval on 20 samples, 3 rollouts each, saving outputs:\n\n```bash\nuv run vf-eval clrs-algorithms \\\n  -m gpt-4.1-nano \\\n  -n 20 -r 3 -s\n```\n\nUse a different judge:\n\n```bash\nuv run vf-eval clrs-algorithms \\\n  -m gpt-4.1 \\\n  -n 20 -r 3 -s \\\n  -a '{\"judge_model\": \"gpt-4.1\"}'\n```\n\n## Evaluation Reports\n\n```\n\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4703},"status":null}