{"data":{"kind":"file","path":"README.md","version_id":"hd54d1jkqjxm2xxlwy4s6zdn","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5073,"modified_at":"2025-09-10T04:35:39.483000","content_hash":"1e6c1e63db3fb54b8116e928183638041dc4608e7c009cb148d9a40cc0d99fc5"},"entries":[],"content":"# GPUTrust-Bench\n\nAn agentic environment where a model verifies untrusted remote GPU work under a cost/latency budget. The agent chooses among costed tools (Freivalds, spot checks, timed runs), then finalizes with:\n\n```json\n{\"final\": {\"verdict\": \"accept|reject\", \"p\": 0..1}}\n```\n\nReward = decision accuracy + calibration (negative Brier) − cost penalty (+ optional audit bonus).\n\n## Overview\n\n- **Environment ID**: gputrust-bench\n- **Short description**: Budget-aware verification of remote GPU results using Freivalds, spot checks, and timed runs; reports accuracy, calibration, and cost.\n- **Tags**: verification, tool-use, multi-turn, calibration, cost-aware, reliability, systems, RL\n\n## Datasets\n\n- **Primary dataset(s)**: Small, synthetic verification scenarios packaged with the environment (remote job outputs with varying correctness, cost/latency tradeoffs, and hints).\n- **Source links**: (in-repo data; replace with links if you externalize)\n- **Split sizes**: current sample set ≈ 4 examples (eval falls back to train by default)\n\nYou can expand the dataset and add an explicit eval split later; until then, the environment evaluates on the train set whenever no eval split is set.\n\n## Task\n\n- **Type**: multi-turn + tool use\n- **Parser**: simple JSON end-marker parser expecting a final block with a `{\"final\": {...}}` payload\n\n### Rubric overview\n\n- **accuracy_reward**: whether the accept/reject verdict matches the ground truth\n- **neg_brier_reward**: calibration of the reported probability `p`, scored as the negative Brier score (higher is better)\n- **neg_cost_reward**: penalty proportional to tool costs\n- **reward**: weighted sum of the above (primary score)\n\n## Tools and Costs\n\n- 
**freivalds(k)**: probabilistic matrix check; cost = 0.2 × k\n- **spotcheck(n)**: randomized sub-checks; cost = 0.1 × n\n- **timed_run(n)**: performance probe; cost = 0.05 × n\n\nEpisodes continue until the agent emits the final JSON. If it doesn't finalize, the env returns another observation and applies a small shaping penalty.\n\n## Quickstart\n\n### Minimal run (defaults)\n```bash\nuv run vf-eval gputrust-bench\n```\n\n### Configure model & sampling (OpenAI-compatible endpoint)\n```bash\n# Example: local server (Ollama/LM Studio/llama.cpp)\nexport OPENAI_BASE_URL=http://localhost:11434/v1\nexport OPENAI_API_KEY=ollama   # any string works for Ollama\n\nuv run vf-eval gputrust-bench \\\n  -m qwen2.5:3b-instruct \\\n  -b \"$OPENAI_BASE_URL\" \\\n  -k OPENAI_API_KEY \\\n  -n 8 -r 2 -s --max-concurrent 4 \\\n  --sampling-args '{\"temperature\":0.1,\"top_p\":0.3,\"max_tokens\":200}'\n```\n\n### Configure model & sampling (hosted API)\n```bash\n# Example: OpenAI\nexport OPENAI_API_KEY=sk-...\nuv run vf-eval gputrust-bench \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  --sampling-args '{\"key\":\"value\"}'\n```\n\n## Notes\n\n- Use `-a` / `--env-args` to pass environment-specific config as a JSON object (see next section).\n- Add `-s` to save artifacts under `outputs/evals/gputrust-bench--<model>/<run-id>`.\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| max_examples | int | -1 | Limit the dataset size used for the run (-1 = all available). |\n| budget | float | 1.0 | Normalized budget the agent should implicitly respect (used by rubric shaping). |\n| costs | object | tool defaults | Override per-tool unit costs (e.g., `{\"freivalds\":0.2,\"spotcheck\":0.1,\"timed_run\":0.05}`). |\n| allow_tools | list[str] | all | Restrict tool set (e.g., `[\"freivalds\",\"spotcheck\"]`). |\n| require_final | bool | true | Require the final JSON to terminate the episode. 
|\n| seed | int | 123 | RNG seed for deterministic spot checks/sampling. |\n\nIf your local copy accepts a different set of arguments, update this table to match your `load_environment` implementation.\n\nPass environment arguments with `-a`:\n\n```bash\nuv run vf-eval gputrust-bench \\\n  -m qwen2.5:3b-instruct -b \"$OPENAI_BASE_URL\" -k OPENAI_API_KEY \\\n  -a '{\"budget\":1.5,\"allow_tools\":[\"freivalds\",\"spotcheck\"],\"max_examples\":4}'\n```\n\n## Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| reward | Main scalar reward (weighted sum of the criteria below). |\n| accuracy_reward | Exact match on accept/reject vs. ground truth. |\n| neg_brier_reward | Calibration quality of `p` (higher is better; equals -Brier). |\n| neg_cost_reward | Negative of the incurred tool cost (penalizes over-spend). |\n| cost | Raw summed tool cost (lower is better). |\n\n## Development\n\n```bash\n# Install env locally for iteration\nuv pip install -e .\n\n# (optional) install Prime CLI to publish\nuv tool install prime\n\n# Evaluate locally and view results\nuv run vf-eval gputrust-bench -m qwen2.5:3b-instruct \\\n  -b \"$OPENAI_BASE_URL\" -k OPENAI_API_KEY -n 4 -r 2 -s\n# Browse runs\nuv run vf-tui\n```\n\n### Publishing to Prime Intellect (Environment Hub)\n\n```bash\nprime login\nprime env push --path . --name gputrust-bench --visibility PUBLIC\n# to a team:\n# prime env push --team <team-slug>\n```\n\nBump the version in `pyproject.toml` (and/or edit this README) before pushing again so the content hash changes.\n\n## Evaluation Reports\n\nReports saved by `vf-eval -s` render automatically on the Hub for this environment/version.\n","encoding":"utf-8","truncated":false,"total_bytes":5073},"status":null}