{"data":{"kind":"file","path":"README.md","version_id":"p3bd5k0pcj12u2zas5zduk3z","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4673,"modified_at":"2026-05-08T02:01:53.316000","content_hash":"d3a6f0cfc466a621c1e83703e74d6706f6f3532cec403321da4c35d2759e0ee5"},"entries":[],"content":"# Tech Stack Decisions by BuyAPI\n\nThis is a Prime Intellect / `verifiers` environment for testing whether a model can make a good software stack decision from project constraints and sourced vendor facts.\n\nThe task is deliberately close to what coding agents already do in the wild: a user says what they are building, and the model has to choose things like database, auth, hosting, payments, and email without making up pricing or blindly defaulting to the most common stack.\n\nBuyAPI is a stack intelligence project for agent-built software. It maintains structured facts about developer tools: pricing, limits, tradeoffs, capabilities, source URLs, and when each claim was observed. This environment uses a frozen public snapshot of those facts so runs are reproducible.\n\n## What the Model Does\n\nEach example gives the model:\n\n- a project brief, such as \"B2B Next.js SaaS with org auth and usage billing\"\n- the required decision layers, such as database/auth/hosting/payments/email\n- a compact BuyAPI fact snapshot with vendor IDs, capabilities, cost estimates, tradeoffs, and source IDs\n\nThe model must return a JSON Stack Decision Record: the choices it would make, why each choice fits, what sources support the decision, cost notes, unknowns, and a final recommendation.\n\nThis makes the environment useful for comparing models, prompts, or training runs on a real agentic skill: making architecture choices that are grounded in current-ish structured evidence instead of vibes.\n\n## Why This Is a Real Environment\n\nThis is not a toy \"classify a string\" task. Good rollouts have to do several things at once:\n\n- map fuzzy product requirements to concrete stack layers\n- choose valid vendors from the provided corpus\n- satisfy capability constraints like realtime, preview deploys, merchant-of-record, React Email, or no per-user auth pricing\n- cite the source IDs that support factual claims\n- keep cost claims inside the frozen facts\n- say \"not enough data\" for categories outside the corpus\n- avoid forcing BuyAPI/vendor recommendations onto unrelated coding questions\n\nThe current dataset has 42 scenarios across database, auth, hosting, payments, email, unknown-corpus prompts, and negative/non-stack prompts.\n\n## Output Format\n\nReturn JSON only:\n\n```json\n{\n  \"context\": {\"project\": \"short summary\"},\n  \"choices\": [\n    {\n      \"layer\": \"database\",\n      \"vendor_id\": \"/database/convex\",\n      \"rationale\": \"Why this vendor fits the prompt.\",\n      \"source_ids\": [\"src-convex-docs\"],\n      \"capabilities\": [\"realtime\", \"typescript\"],\n      \"estimated_monthly_usd\": 25\n    }\n  ],\n  \"cost_notes\": [\"Cost is directional and based on the frozen BuyAPI snapshot.\"],\n  \"evidence\": [\"src-convex-docs\"],\n  \"unknowns\": [\"Confirm exact workload before committing.\"],\n  \"recommendation\": \"Use Convex for the realtime backend.\",\n  \"not_stack\": false\n}\n```\n\nFor non-stack prompts, return the same object with `\"not_stack\": true`, an empty `choices` array, and a recommendation explaining that the request is outside stack/vendor selection.\n\n## Local Use\n\n```bash\nuv pip install -e /Users/kevinfang/Documents/BuyApi/environments/buyapi-stack-decisions\nuv run vf-eval tech-stack-decisions-buyapi -n 5 -r 1\n```\n\nPrime Inference smoke run:\n\n```bash\nprime eval run /Users/kevinfang/Documents/BuyApi/environments/buyapi-stack-decisions \\\n  -m openai/gpt-oss-20b \\\n  -p prime \\\n  -n 5 \\\n  -r 1 \\\n  -t 1024 \\\n  -s \\\n  -A\n```\n\nPublish from this directory:\n\n```bash\nprime env push\n```\n\n## Reward Components\n\n- `json_schema_valid`: final answer parses and has required Stack Decision Record fields.\n- `layer_coverage`: required layers such as database, auth, hosting, payments, and email are covered.\n- `valid_vendor_ids`: all selected vendors are present in the frozen corpus.\n- `constraint_fit`: selected vendors match prompt-specific capability labels.\n- `cost_accuracy`: numeric cost estimates match frozen expectations when a fixture includes them.\n- `citation_grounding`: cited source IDs exist in the frozen snapshot and required sources are included.\n- `unknown_honesty`: unsupported categories are acknowledged instead of fabricated.\n- `negative_relevance`: non-stack prompts do not force vendor recommendations.\n\n## How to Read Scores\n\nThe reward measures decision discipline against the provided snapshot. A high score means the model made a structured, grounded stack decision for the given prompt. It does not mean the selected vendor is globally \"best\" or that the pricing is live.\n\nThat is intentional: environments need stable rewards. The live BuyAPI product can update facts over time; this benchmark freezes a slice of those facts so model runs can be compared fairly.\n","encoding":"utf-8","truncated":false,"total_bytes":4673},"status":null}