{"data":{"kind":"file","path":"README.md","version_id":"yrwtqn77yz5cdb4pkwukfusg","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10316,"modified_at":"2026-03-10T05:50:19.818000","content_hash":"b6d298cfc4dbd03273b6e0c90a634380f1cfb35b5fd59adba9c5b50b54454659"},"entries":[],"content":"# epitaph environment\n\nMulti-turn RL environment for guiding procedural civilizations toward federation membership in the Epitaph simulator.\n\nDefault operating mode is now single-civ training:\n\n- one civilization at episode start\n- no new-civilization respawn during the episode\n- original tech tree, events, and FTL-based invite path preserved\n\nThe older multi-civ open-world behavior still exists as an opt-in loader setting.\n\n## Environment Summary\n\n- Environment ID: `epitaph`\n- Task type: multi-turn text policy with XML-constrained actions\n- Core objective: maximize federation joins while minimizing extinctions\n- Default episode shape: single civilization, no respawn\n- Action format: exactly one `<action>...</action>` tag containing a canonical action:\n  - `teach:<civ_idx>:<tech_name>`\n  - `invite:<civ_idx>`\n  - `end_turn`\n- The env also normalizes common aliases into canonical actions (for robustness):\n  - `end turn`, `wait`, `pass`, `skip`, `noop` -> `end_turn`\n  - `teach civ 0 toolmaking` -> `teach:0:toolmaking` (when valid)\n  - `invite civ 0` -> `invite:0` (when valid)\n\n## Local Run Commands\n\nRun tests with the local virtualenv:\n\n```bash\n.venv/bin/pytest -q tests/test_core.py tests/test_env.py\n```\n\nRun an eval:\n\n```bash\n./scripts/run_eval.sh \\\n  -m gpt-5-mini \\\n  -n 16 \\\n  -r 3 \\\n  --env-args '{\"auto_skip_max_turns\": 50}' \\\n  --save-results\n```\n\nEnvironment default eval scope (from `pyproject.toml`):\n\n- `num_examples = 32`\n- `rollouts_per_example = 3`\n\nRun a benchmark-shaped eval (deterministic sampling + larger defaults):\n\n```bash\n./scripts/run_benchmark.sh \\\n  -m gpt-5-mini \\\n  --env-args 
'{\"auto_skip_max_turns\": 50}'\n```\n\nRun a benchmark matrix across multiple models:\n\n```bash\n./scripts/run_benchmark_matrix.sh \\\n  gpt-5-mini claude-sonnet-openrouter \\\n  -- -e configs/endpoints.toml -n 16 -r 2\n```\n\nRun a fixed seed-suite benchmark:\n\n```bash\n./scripts/run_seed_suite.py \\\n  --suite configs/benchmarks/epitaph.seed-suite.example.json \\\n  --model gpt-5-mini\n```\n\nRun full benchmark CI pipeline in one command:\n\n```bash\n./scripts/run_benchmark_ci.py \\\n  --suite configs/benchmarks/epitaph.seed-suite.example.json \\\n  --gate configs/benchmarks/epitaph.gate.profiles.example.json \\\n  --gate-profile default \\\n  --models gpt-5-mini claude-sonnet-openrouter \\\n  -e configs/endpoints.toml -r 2\n```\n\nRun benchmark regression gates against latest runs per model:\n\n```bash\n./scripts/check_benchmark_gate.py \\\n  --spec configs/benchmarks/epitaph.gate.profiles.example.json \\\n  --profile strict\n```\n\nBenchmark using a non-open-weights model via an OpenAI-compatible endpoint (for example Claude through a compatible gateway):\n\n```bash\n./scripts/run_benchmark.sh \\\n  -m claude-sonnet-4-5 \\\n  -b https://openrouter.ai/api/v1 \\\n  -k OPENROUTER_API_KEY\n```\n\nEndpoint aliases can be managed via:\n\n- `/Users/izzy/epitaph-env/configs/endpoints.example.toml`\n\nSummarize saved eval runs:\n\n```bash\n./scripts/summarize_evals.sh\n```\n\nEmit canonical report artifacts for one run:\n\n```bash\n./scripts/report_eval_run.py environments/epitaph/outputs/evals/epitaph--gpt-5-mini/<run_id>\n```\n\nInspect the static FTL dependency and direct hazard map from a curriculum stage:\n\n```bash\n./scripts/analyze_epitaph_ftl.py --stage networked-pre-ftl --include-invite\n```\n\n## Loader Arguments\n\n`load_environment(...)` arguments and defaults:\n\n| Arg | Type | Default | Description |\n| --- | --- | --- | --- |\n| `split` | `str` | `\"train\"` | Dataset partition to expose from `load_environment(...)`. 
Supports `train`, `eval`/`validation`, and `test`. |\n| `num_train_examples` | `int` | `64` | Number of train seeds in synthetic dataset. |\n| `num_eval_examples` | `int` | `16` | Number of eval seeds in synthetic dataset. |\n| `num_test_examples` | `int \\| None` | `None` | Number of test seeds. Defaults to the effective eval count. |\n| `eval_seeds` | `list[int] \\| None` | `None` | Optional explicit eval seed list (overrides generated eval seeds). |\n| `base_seed` | `int` | `0` | Base RNG seed for deterministic seed generation. |\n| `env_seed_offset` | `int` | `0` | Additional split offset to avoid overlap. |\n| `max_turns` | `int \\| None` | `None` | Decision-turn cap (disabled when `None`). |\n| `idle_turn_limit` | `int` | `200` | End session after this many consecutive idle-only cycles. |\n| `auto_skip_idle_turns` | `bool` | `True` | Fast-forward consecutive `end_turn`-only periods. |\n| `auto_skip_max_turns` | `int \\| None` | `None` | Optional cap on fast-forwarded idle turns per response. |\n| `history_turn_window` | `int \\| None` | `24` | Conversation window retained before summarization/trimming. |\n| `reward_cfg` | `dict \\| None` | `None` | Optional reward override map (partial keys allowed). |\n| `allow_respawn` | `bool` | `False` | When `False`, disables new-civilization spawning and keeps episodes in single-civ mode. Set `True` to recover the original open-world respawn behavior. |\n| `prompt_profile` | `str` | `\"minimal\"` | Prompt surface to use. Current options: `minimal`, `minimal-plain`, `minimal-thinking`, `minimal-thinking-action-first`, `minimal-thinking-action-first-anchored`, `minimal-thinking-qwen-native`, `minimal-thinking-qwen-native-anchored`, `minimal-thinking-action-first-anchored-no-generic-priors`, `minimal-thinking-action-first-anchored-no-generic-priors-lite`, `compact-risk`, `thinking-tight`, `late-risk-gating`, `late-risk-commit`. 
`minimal` is the benchmark-faithful default; `minimal-plain` is the cleanest plain-action variant; `minimal-thinking-qwen-native` is the visible `<think>...</think><action>...</action>` variant; `minimal-thinking-qwen-native-anchored` hard-anchors the reply start at `<think>`. |\n| `show_score_in_observation` | `bool` | `True` | When `False`, hides the cumulative reward score from the user-visible observation. |\n| `show_turn_reward_in_observation` | `bool` | `True` | When `False`, hides the per-turn reward banner from the user-visible observation (invalid-action notices still appear). |\n| `show_status_counters_in_observation` | `bool` | `True` | When `False`, hides top-line bookkeeping counters such as joined/extinct/pending-invite/invalid-move counts from the user-visible observation. |\n| `show_notes_in_observation` | `bool` | `True` | When `False`, hides reward/system notes such as FTL-progress shaping, idle fast-forward summaries, and terminal reward bookkeeping from `Latest developments:` while keeping in-world civilization events. |\n\nSplit behavior:\n\n- `split=\"train\"` returns the train partition and keeps an internal eval holdout.\n- `split=\"eval\"` / `split=\"validation\"` returns the eval holdout as the primary dataset.\n- `split=\"test\"` returns a disjoint test holdout, which is the mode hosted Prime RL uses for `[[eval.env]]`.\n\nTraining-mode behavior:\n\n- default: single-civ mode with `allow_respawn=False`\n- opt-in open-world mode: `allow_respawn=True`\n\n## Eval vs Train Orchestration\n\n- `vf-eval` (used by `scripts/run_eval.sh` / `scripts/run_benchmark.sh`) is evaluation only:\n  - loads env + model client\n  - runs rollouts\n  - computes rubric rewards/metrics\n  - saves artifacts\n  - does **not** update model weights\n- PRIME-RL training is a separate orchestration path (`rl` entrypoint with trainer + orchestrator + inference). 
This repo currently ships eval scripts; PRIME-RL wiring is documented in `docs/prime-rl-compatibility-2026-03-01.md`.\n- Operational deployment/monitoring checklist: `docs/prime-rl-operations-runbook.md`.\n- Starter PRIME-RL config template: `/Users/izzy/epitaph-env/configs/prime-rl/epitaph.rl.toml`\n\n## Flag Quick Reference\n\n- `-m/--model`: model id or endpoint alias\n- `-b/--api-base-url`: OpenAI-compatible API URL\n- `-k/--api-key-var`: environment variable name containing API key\n- `-n/--num-examples`: number of seeds/examples evaluated\n- `-r/--rollouts-per-example`: number of rollouts per example\n- `-a/--env-args`: JSON passed into `load_environment(...)`\n- `-S/--sampling-args`: generation settings JSON (`temperature`, etc)\n- `-C/--state-columns`: state fields to persist into `results.jsonl`\n- `-s/--save-results`: persist outputs to `outputs/evals/...`\n\n## Reward + Metrics\n\nPrimary reward:\n\n- Turn reward includes:\n  - `+invite_reward * joined_delta`\n  - `-extinction_penalty * extinct_delta`\n  - `+tech_milestone_rewards[tech]` when a configured tech is first reached\n  - `-invalid_penalty` for invalid action format/value\n  - `-idle_penalty` when the agent chooses `end_turn` while other actions are available\n  - `-base_step_penalty` per agent decision turn (if configured)\n- Terminal reward includes:\n  - `+victory_bonus` when `joined >= victory_threshold`\n  - `-failure_penalty` if all civilizations are extinct\n  - `+survival_bonus * (alive - joined)` for surviving non-joined civs\n\nAuto-skipped no-choice turns do not incur `idle_penalty` or `base_step_penalty`.\n\nTracked rubric metrics:\n\n- `total_reward`\n- `joined_metric`\n- `extinction_metric`\n- `max_stardate_metric`\n- `max_ftl_required_techs_known_metric`\n- `max_ftl_progress_metric`\n- `alive_metric`\n- `turns_with_actions_metric`\n- `done_reason_metric`\n- parser format reward (`format_reward_func`)\n\n## Observability Artifacts\n\n`scripts/run_eval.sh` now defaults to:\n- 
`--save-results`\n- `--state-columns turn_traces,last_turn_trace,history_summary_text`\n\nAfter a run, `scripts/report_eval_run.py` writes:\n- `run_manifest.json` (config + identity)\n- `summary.json` (aggregated metrics + outcome breakdown; its reward block includes distribution stats: `stddev`, `p10`, `p50`, `p90`)\n- `turns.jsonl` (flattened per-turn trace rows when `turn_traces` is available)\n\nLive replay server (SSE tail for `turns.jsonl`):\n\n```bash\npython viewer/live_turns_server.py \\\n  --turns-file environments/epitaph/outputs/evals/epitaph--gpt-5-mini/<run_id>/turns.jsonl\n```\n\n## Current State (Audit Snapshot: 2026-03-01)\n\n- Historical runs in `outputs/evals/` show weak outcomes:\n  - existing `gpt-5-mini` run: `48/48` episodes ended with `failure_all_extinct`\n  - average joined civilizations: `0.0`\n- The environment runs in the local `.venv`, and the tests pass there.\n- See the deep audit and gap log:\n  - [`/Users/izzy/epitaph-env/docs/audit-2026-03-01.md`](/Users/izzy/epitaph-env/docs/audit-2026-03-01.md)\n  - [`/Users/izzy/epitaph-env/docs/bootstrap-gap-log.md`](/Users/izzy/epitaph-env/docs/bootstrap-gap-log.md)\n","encoding":"utf-8","truncated":false,"total_bytes":10316},"status":null}