{"data":{"kind":"file","path":"README.md","version_id":"p9jzzs8ysj61dmb4o2slpi68","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7836,"modified_at":"2026-05-07T23:29:11.067000","content_hash":"a6e0cc96949977b8145bb86690233559f59562f714b12ecf3f086a15a495461b"},"entries":[],"content":"# alphastack-greenfield\n\n> Verifiers environment that wraps [AlphaStack](https://github.com/HyperKuvid-Labs/alpha-stack)'s multi-agent code-generation pipeline. Single natural-language prompt -> whole project, scored densely across the seven generation phases.\n\n[![hub](https://img.shields.io/badge/Prime%20Hub-pradheep%2Falphastack--greenfield-7c3aed)](https://app.primeintellect.ai/dashboard/environments/pradheep/alphastack-greenfield)\n[![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE)\n\n## TL;DR\n\n- **Input:** one task prompt (e.g. *\"Implement a custom Rust iterator with step + filter...\"*).\n- **Rollout:** the model under evaluation **replaces AlphaStack's internal LLM** (`provider_name=\"prime_intellect\"`, `model_override=<your model>`). AlphaStack runs its full 7-phase pipeline as the rollout.\n- **Reward:** dense per-phase signal in [0, 1]: architecture emitted, blueprint validated, file gen coherent, DAG clean, project builds, tests pass.\n- **Why dense:** binary pass/fail on a 1500-LOC project produces near-zero gradient. Dense scoring tells you *which* phase the model fell off at.\n\n## Quickstart\n\n```bash\n# 1. Install on your worker\nprime env install pradheep/alphastack-greenfield\n\n# 2. Probe local toolchains (cargo / go / node+tsc / nvcc)\nbash $(python -c \"import alphastack_greenfield, os; print(os.path.dirname(alphastack_greenfield.__file__) + '/../setup.sh')\")\n\n# 3. 
Run an evaluation against any Prime Inference model\nprime env eval pradheep/alphastack-greenfield --model anthropic/claude-haiku-4.5\n\n# Or run one problem locally with full diagnostics:\nalphastack-greenfield-eval --model anthropic/claude-haiku-4.5 --problem-id go_t1_worker_pool\n```\n\n## Environment arguments\n\n`load_environment(...)` accepts:\n\n| Arg | Default | Meaning |\n| --- | --- | --- |\n| `lite` | `True` | Cap AlphaStack's `MAX_ORCHESTRATOR_TURNS` and `MAX_CHILD_TURNS` to 5 (upstream default 8). Roughly halves token cost; trades self-healing depth for predictability. |\n| `caps_file` | `/tmp/alpha_stack_caps.json` | Path to the toolchain-probe JSON written by `setup.sh`. Override via `$ALPHA_STACK_CAPS_FILE` or this kwarg. Missing file → all toolchains enabled, with a warning. |\n| `**kwargs` | — | Forwarded to `vf.Environment` (e.g. `eval_dataset`). |\n\n## Reward function\n\nThe 6 reward components and their weights, ported verbatim from `alphastack_greenfield/scoring/phases.py`:\n\n| Phase | Component | Weight |\n| --- | --- | --- |\n| 1. Architecture | `architecture_emitted` (file present, non-empty) | 0.05 |\n| 2. Blueprint | `blueprint_validates` (project root resolved on disk) | 0.05 |\n| 3. File generation | `files_generated_no_flood` (`>0` files AND `tool_calls.jsonl` ≤ 500 lines) | 0.10 |\n| 4. DAG | `dag_validates` (orchestrator finished, files present; best-effort heuristic) | 0.15 |\n| 5. Build | `build_ok` (`build_cmd` exits 0) | 0.30 |\n| 6. Tests | `tests_pass` (scaled by `passed/total` when parseable; binary fallback) | 0.35 |\n\n**Sum = 1.0.** Reward is the weighted sum; per-component values land in the verifiers `metrics` dict for post-hoc analysis. The `tests_pass` term is gated on `tests_ok` being true (build must succeed first).\n\n## Dataset\n\nFour problems ship in `alphastack_greenfield/data/problems.jsonl`. Adding more is a JSONL edit (see below). 
Languages and problems are filtered at load time against the worker's toolchain capabilities.\n\n| id | language | tier | summary |\n| --- | --- | --- | --- |\n| `rust_t1_step_iterator` | rust | 1 | Custom `Iterator` for stepped numeric ranges with `filter_even()` adaptor and lazy/composing tests |\n| `go_t1_worker_pool` | go | 1 | Bounded HTTP worker pool with `WaitGroup`, channel jobs, context cancellation, mutex-protected stats |\n| `ts_t1_typed_emitter` | typescript | 1 | Generic `TypedEmitter<Events>` with compile-time payload typing, `on/off/once/emit` |\n| `cuda_t1_vector_ops` | cuda | 1 | Element-wise vector ops on 10⁷ elements, three launch configs, CPU validation, Makefile build |\n\n## Toolchain matrix\n\n| Language | Required on PATH |\n| --- | --- |\n| rust | `cargo` |\n| go | `go` |\n| typescript | `node` AND `tsc` |\n| cuda | `nvcc` |\n\n`setup.sh` probes these and writes `/tmp/alpha_stack_caps.json`. The env reads it at `load_environment()` time and drops problems whose language has no toolchain. If *all* are missing, `RuntimeError` is raised.\n\n## Baseline results\n\n| Model | Released | $/Mtok in / out | Problem | Reward | Phases pass | Build | Tests | Failure mode |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| `anthropic/claude-haiku-4.5` | 2025-10-21 | 1.00 / 5.00 | `go_t1_worker_pool` | **0.35** | 4/4 | ✗ | ✗ | Prose preamble + ` ``` ` fence inside `go.mod`. Build: `go.mod:3:4: unexpected newline in string`. Pipeline ran ~31 min. |\n| `qwen/qwen3.6-35b-a3b` | 2026-05-06 | 0.23 / 1.80 | `go_t1_worker_pool` | **0.05** | 1/4 | ✗ | ✗ | Returned `{}` for the structured `ProjectBlueprint` (3 pydantic validation errors). Pipeline short-circuited in ~2 min at blueprint phase; only architecture scored. 
|\n\nFull machine-readable rollout summaries under [`alphastack_greenfield/outputs/evals/`](alphastack_greenfield/outputs/evals/).\nThe dense reward differentiates failure modes — both rollouts \"fail to build\", but they\nfall off at different phases and earn different scores, giving meaningful RL signal.\n\nTo add your own row, run:\n\n```bash\nalphastack-greenfield-eval --model <id> --problem-id <id>\n```\n\nthen commit a JSON summary under `alphastack_greenfield/outputs/evals/` and append a row here.\n\n## Adding a problem\n\nAppend one JSON line per problem to `alphastack_greenfield/data/problems.jsonl`:\n\n```json\n{\n  \"id\": \"go_t2_raft_kv\",\n  \"language\": \"go\",\n  \"tier\": 2,\n  \"prompt\": \"<your task>\",\n  \"build_cmd\": \"go build ./...\",\n  \"test_cmd\": \"go test ./...\",\n  \"build_timeout\": 600,\n  \"test_timeout\": 180,\n  \"perf_check\": null\n}\n```\n\nCUDA problems with throughput requirements use:\n\n```json\n\"perf_check\": {\"baseline_cmd\": \"...\", \"min_speedup\": 2.0}\n```\n\nThen `prime env push --auto-bump --visibility PUBLIC`.\n\n## Security\n\nModel-generated code runs **unsandboxed** on the worker via `subprocess.run`. AlphaStack's CubeSandbox / bwrap path is bypassed because Prime hub workers cannot run nested microVMs. The worker itself is the isolation boundary — deploy on disposable workers, never on a host with secrets or persistent state.\n\n`subprocess.run` calls have explicit timeouts from the problem spec (`build_timeout` / `test_timeout`), but no `setrlimit` enforcement: a runaway model could fork-bomb. 
Caveat emptor.\n\n## Limitations\n\n- **No GUI / interactive UI projects.** Build commands are non-interactive only.\n- **No third-party API integrations.** Workers have no provisioned credentials beyond what you set.\n- **Single language per problem.** No polyglot tasks in v0.2.\n- **Tier 1 only in v0.2.** Tiers 2–4 (multi-component, distributed systems, production) are JSONL edits away.\n- **CUDA build is brittle** — assumes the model emits a Makefile with `make` / `make test`. Override `build_cmd`/`test_cmd` per problem if your CUDA tasks use a different layout.\n\n## Cost guidance\n\n| Mode | Iteration caps | Tokens per rollout (typical) | Wall clock (Go t1, claude-haiku-4.5) |\n| --- | --- | --- | --- |\n| `lite=True` | `MAX_ORCHESTRATOR_TURNS=5`, `MAX_CHILD_TURNS=5` | ~30K–60K | ~30 min |\n| `lite=False` | upstream defaults (8 / 8) | ~50K–100K | ~45–60 min |\n\nCost is dominated by the testing pipeline's self-healing loop. If the model gets the build right on the first or second try, total cost stays at the low end.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\n## Author\n\nPradheep ([@Mantissagithub](https://github.com/Mantissagithub))\n\nAlphaStack itself is from [HyperKuvid Labs](https://github.com/HyperKuvid-Labs/alpha-stack); this repo only wraps it as a verifiers environment.\n","encoding":"utf-8","truncated":false,"total_bytes":7836},"status":null}