{"data":{"kind":"file","path":"README.md","version_id":"a9qbbuaz7uqyeandqdvls4sh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4254,"modified_at":"2026-03-25T00:12:12.465000","content_hash":"da94274a01990627b66d5edf5abb007ccd7ae70580ab3688c0b1976fc7acfba0"},"entries":[],"content":"# LHAW RLM (`lhaw_rlm`)\n\nVerifiers environment for the **LHAW** clarification loop on the [ScaleAI/lhaw](https://huggingface.co/datasets/ScaleAI/lhaw) dataset. Models may call `ask_user` against underspecified prompts; scoring depends on the chosen reward mode (see below).\n\n## Reward modes\n\n| Mode | Behavior |\n|------|----------|\n| `reconstruction_judge` | Clarification-focused task; an LLM judge scores how well the clarified task was reconstructed. |\n| `native_reward` | Same `ask_user` interaction; reward comes from benchmark-native results in example metadata or linked result files. |\n\nSet `reward_mode` under `[eval.env_args]` in your TOML. For `native_reward`, examples should carry `native_result` or `native_result_path`; optional fields like `native_trials`, `native_baseline_trials`, and `native_summary` support offline metrics (`pass@3`, `Ask%`, `Avg/Traj`, `Gain/Q`, etc.).\n\n## Repository layout\n\n| Path | Purpose |\n|------|---------|\n| `lhaw_rlm.py` | Entrypoint: `load_environment()` for Prime / verifiers |\n| `core/` | Config, dataset transform, `LHAWRLMEnv`, judge rubric, native reward helpers |\n| `configs/eval/` | `*.toml` presets (standard smoke, scale, model tiers, dataset slices, dimensions, reward modes) |\n| `configs/eval/standard.toml` | Moderate local smoke config |\n| `configs/endpoints.toml` | Optional shared endpoint registry |\n| `scripts/launch_hosted_evals.sh` | Runs high-signal presets on Prime **hosted** workers |\n| `docs/hf_dataset_schema.md` | Hugging Face dataset column reference |\n| `tests/` | Pytest suite |\n\n## Setup\n\nRequires **Python 3.11+**. Dependencies are in `pyproject.toml` (`verifiers`, `datasets`).\n\n```bash\ncd lhaw\nuv sync\nuv run python -c \"import lhaw_rlm; print(lhaw_rlm.load_environment)\"\n```\n\n**Dev dependencies:** `uv sync --group dev` (Ruff, pytest).\n\n## Running evaluations\n\nRun commands from the **`lhaw/`** directory so `lhaw_rlm` and `core` resolve on `sys.path`.\n\n### Local\n\n```bash\nuv run prime eval run configs/eval/standard.toml\n```\n\nSet the API key expected by the TOML (commonly `PRIME_API_KEY`). Add or reference `configs/endpoints.toml` if you use a shared endpoint registry.\n\n### Hosted (Prime Evals)\n\nHosted jobs run on Prime infrastructure and show up in **Prime Evals**. You need a **published** environment (`prime env push`) and a Hub slug the launcher can target.\n\nThe script rewrites each preset to set `env_id` from `PRIME_EVAL_ENV_ID` and strips `env_dir_path` so workers load the package from the Hub. It does **not** use `--env-path` (that would pin to a local checkout and break hosted). Secrets are expected from the Environments Hub flow, not from ad-hoc `.env` or `--custom-secrets` in this script.\n\n```bash\ncd lhaw\nexport PRIME_API_KEY=...\nexport PRIME_EVAL_ENV_ID=your-org/lhaw_rlm   # optional; see below\nprime config set-api-key \"$PRIME_API_KEY\"\n# prime env push   # optional: publish before evals\n./scripts/launch_hosted_evals.sh\n```\n\n**`PRIME_EVAL_ENV_ID`:** use the env var if set; else read `owner/name` from `.prime/.env-metadata.json` after a local `prime env push`; else default `stochi0/lhaw_rlm`.\n\n**Presets (in run order):** `slice_outcome_critical`, `slice_benign`, `slice_divergent`, `slice_swe_bench`, `slice_mcp_atlas`, `slice_agent_company`, `dim_goal`, `dim_constraint`, `dim_input`, `dim_context`, `dim_goal_and_constraint`.\n\n**Tuning:** `LHAW_HOSTED_POLL_INTERVAL` (default `30`), `LHAW_HOSTED_TIMEOUT_MINUTES` (default `180`). Each run uses `--allow-sandbox-access`, `--allow-instances-access`, and `--eval-name lhaw-<config-stem>`.\n\nFor CI, run the same script from `lhaw/` with `PRIME_API_KEY` and `PRIME_EVAL_ENV_ID` and the `prime` CLI available.\n\n## Logging and debug UI\n\nWith the default Rich TUI, worker logs are tailed from `<run_dir>/eval.log` (verifiers writes this when `debug = false` in the eval TOML). For tqdm plus plain console logging, add `--debug`, e.g. `prime eval run configs/eval/standard.toml --debug`.\n\n## Development\n\n```bash\ncd lhaw\nuv sync --group dev\nuv run pytest\nuv run ruff check .\n```\n\n## Packaging and Environments Hub\n\n```bash\ncd lhaw\nuv build\nprime env push\n```\n\nPyPI package name: **`lhaw_rlm`**. After install, `import lhaw_rlm`. The wheel includes `lhaw_rlm.py` and `core/` (Hatch `only-include`).\n","encoding":"utf-8","truncated":false,"total_bytes":4254},"status":null}