{"data":{"kind":"file","path":"README.md","version_id":"y1xknefyzmoc0t278z6c4g0j","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2740,"modified_at":"2026-02-22T18:47:03.492000","content_hash":"5afb8806666949522440515d87991eec76162994d8547cf7d7fccb73ff26d298"},"entries":[],"content":"# entity-personality-style\n\nEnvironment ID: `entity-personality-style`\n\nA single-turn Verifiers environment for training an LLM to write in a target entity's style.\n\n## Reward\n\nThe scalar reward is:\n\n- semantic relevance: cosine similarity between title and generated essay\n- style score: harmonic mean of sentence-level and paragraph-level style scores\n- composition:\n  - `rel = similarity - similarity_threshold`\n  - `reward = w_style * style_harmonic + w_similarity * max(rel, -1)`\n  - style term is gated when similarity is below threshold\n\nSentence and paragraph style scores come from an online-updated discriminator head\ntrained to separate target-corpus units (positive) from rollout completions (negative).\n\n### Discriminator bootstrap / refresh\n\n- No offline bootstrap corpora and no external inference bootstrap.\n- If `disc_bootstrap=true`, the first refresh happens as soon as enough distinct titles\n  have been observed in the rollout buffer (`disc_min_titles`), i.e. effectively “step 0”.\n- When `dry_run_env=false` (training mode), the scalar `reward` is held at `0.0` until the\n  first refresh completes (both sentence + paragraph heads trained). This avoids taking\n  policy updates on similarity-only rewards while the classifier is untrained.\n- Subsequent refreshes happen every `disc_refresh_every_rollouts`.\n\n## Prompt format\n\nEach question asks the model to write an essay for a title while copying style as closely as possible.\nA reference paragraph from the target corpus is included in the prompt by default.\n\n## Key load_environment args\n\n- `dataset_name`, `dataset_split`, `title_field`, `text_field`\n- `target_name`, `target_min_words`, `target_max_words`\n- `reward_w_style`, `reward_w_similarity`, `reward_similarity_threshold`\n- `encoder_model`, `reward_device`\n- `disc_enable`, `disc_bootstrap`, `disc_refresh_every_rollouts`, `disc_num_titles`, `disc_min_titles`\n- `dry_run_env` (default `True`)\n\n## Install from Hub\n\n```bash\nprime env install matthewagi/entity-personality-style\nuv run python -c \"import entity_personality_style; print('ok')\"\n```\n\n## PRIME-RL single-GPU example\n\nThis repo includes a minimal PRIME-RL single-GPU example under `examples/prime_rl_single_gpu/`.\n\n```bash\n# Pull the env source so you have the example configs locally.\nprime env pull matthewagi/entity-personality-style@latest\ncd <target_directory>\nuv pip install -e .\n\nexport CFG_DIR=\"$PWD/examples/prime_rl_single_gpu\"\nexport PYTHONPATH=\"$CFG_DIR:$PYTHONPATH\"  # for advantage_zscore.py\n\ncd ~/prime-rl\nCUDA_VISIBLE_DEVICES=0 uv run rl \\\n  --trainer @ \"$CFG_DIR/trainer.toml\" \\\n  --orchestrator @ \"$CFG_DIR/orchestrator.toml\" \\\n  --inference @ \"$CFG_DIR/inference.toml\" \\\n  --trainer-gpu-ids 0 \\\n  --inference-gpu-ids 0\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2740},"status":null}