{"data":{"kind":"file","path":"README.md","version_id":"rpw91nrfhxjnujzbi4iqzssr","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6050,"modified_at":"2026-04-16T13:09:46.983000","content_hash":"448bfe9a464a37db2ff8f28eb040873474e74a5e990de5706af9bcc9fa80c7c7"},"entries":[],"content":"# Multi-Turn Constitutional Tournament Environment\n\nTournament-style reward system for Constitutional AI training with multi-turn conversation support.\n\n## Concept\n\nThis environment extends the Constitutional Tournament with multi-turn conversation handling:\n\n1. **Loads ShareGPT format datasets** (e.g., `anthracite-org/kalo-opus-instruct-22k-no-refusal`)\n2. **Extracts all conversation turns** (excluding system prompts) with configurable `max_turns`\n3. **Pairs off rollouts** (e.g., 256 rollouts per example)\n4. **Judges pairs** using constitutional principles with full conversation context\n5. **Winners advance** to face other winners\n6. **Every win = reward** - responses satisfying more principles accumulate more wins\n\n## Multi-Turn Configuration\n\nControl how many conversation turns to include:\n\n```python\n# Alternatives: pass exactly one max_turns value per call\nload_environment(max_turns=-1)  # All turns (default)\nload_environment(max_turns=1)   # Single turn (first human message only)\nload_environment(max_turns=3)   # Up to 3 human turns with assistant responses between\n```\n\nThe `max_turns` parameter counts human turns. 
If set to 2, the prompt will include:\n- First human message\n- First assistant response (if present)\n- Second human message\n\nThe model generates the next response in the conversation.\n\n## Multi-Turn Judge Prompt Format\n\nThe judge sees the full conversation context with XML-separated turns:\n\n```xml\n<conversation-context>\n<turn-1 role=\"user\">\nWhat is the capital of France?\n</turn-1>\n\n<turn-2 role=\"assistant\">\nParis is the capital of France.\n</turn-2>\n\n<turn-3 role=\"user\">\nTell me more about it.\n</turn-3>\n</conversation-context>\n\n<response-a>\n[Response A]\n</response-a>\n\n<response-b>\n[Response B]\n</response-b>\n```\n\n## Dataset Format\n\nExpects ShareGPT format with `conversations` field:\n\n```json\n{\n  \"conversations\": [\n    {\"from\": \"system\", \"value\": \"...\"},  // Skipped (not included)\n    {\"from\": \"human\", \"value\": \"...\"},   // Included as user turn\n    {\"from\": \"gpt\", \"value\": \"...\"},     // Included as assistant turn\n    {\"from\": \"human\", \"value\": \"...\"},   // Included as user turn\n    ...\n  ]\n}\n```\n\nSystem prompts are always skipped. 
The last message in the prompt is always a user message (trailing assistant messages are removed so the model generates the response).\n\n## Configuration\n\n```python\nload_environment(\n    # Dataset - ShareGPT format from HuggingFace\n    dataset_name=\"anthracite-org/kalo-opus-instruct-22k-no-refusal\",\n\n    # Constitution - path to file with one principle per line\n    constitution_path=\"./constitution.txt\",\n\n    # Judge model (required)\n    judge_model=\"Kimi-Linear-48B-A3B-Instruct-int8\",\n    judge_base_url=\"http://77.37.65.225:12148/v1\",\n    judge_api_key=\"dummy-key\",\n    judge_temperature=0.3,\n    judge_timeout=120.0,\n\n    # Concurrency\n    max_concurrent_judges=32,\n    max_concurrent_tournaments=8,\n\n    # Dataset size\n    num_train_examples=10000,\n    num_eval_examples=500,\n\n    # Multi-turn configuration\n    max_turns=-1,  # -1 for all turns, or specific number\n)\n```\n\n## Usage\n\n```bash\n# Install the environment\nprime env install multiturn-constitutional-tournament\n\n# Run local evaluation\nprime eval run multiturn-constitutional-tournament \\\n    -m gpt-5-nano \\\n    -n 5 \\\n    --rollouts-per-example 16\n\n# Hosted training (use the config in configs/lab/)\nprime train run configs/lab/multiturn-constitutional-tournament.toml\n```\n\n## Tournament Structure\n\nSame as Constitutional Tournament - for 256 rollouts per example:\n\n```\nRound 1: 256 -> 128 winners (128 get 1 point)\nRound 2: 128 -> 64 winners  (64 get 2 points)\nRound 3: 64 -> 32 winners   (32 get 3 points)\nRound 4: 32 -> 16 winners   (16 get 4 points)\nRound 5: 16 -> 8 winners    (8 get 5 points)\nRound 6: 8 -> 4 winners     (4 get 6 points)\nRound 7: 4 -> 2 winners     (2 get 7 points)\nRound 8: 2 -> 1 winner      (1 gets 8 points)\n```\n\n**Tournament reward = wins / total_rounds** (normalized to 0-1)\n\n## Positive-Pass Bonus (Flipped Framing)\n\nAfter each tournament, rollouts that survived past a configurable round\nthreshold are subjected to a 
second pass with **inverted framing**.\n\n- The pairwise judge prompt asks \"which of these is worse under principle Y\",\n  a negative framing. That signal is useful but, on its own, produces\n  overly strong contrarian pressure.\n- For qualifying rollouts, the positive pass asks the *analyzer* LLM to\n  **charitably explain how the response embodies a sampled principle**. No\n  comparison, no critique, just: \"describe how this correctly does X\".\n- The judge then rates (directly, no tournament) whether that generous\n  analysis is **grounded** in the response or **fabricating** support. If\n  the analyzer had to invent claims that aren't in the response, the score\n  collapses. If the analysis is faithful, the bonus holds.\n\nThis reinforces the signal for already-strong rollouts (those that kept\nwinning under adversarial framing) without adding more contrarian pressure.\nIncomplete or lopsided bracket sizes are fine: the bonus only ever\n*reinforces* a signal we already wanted.\n\n### Scoring\n\n```\ntournament_component = wins / total_rounds           # in [0, 1]\npositive_bonus       = mean grounding score          # in [0, 1], qualifiers only\n                      (0-4 Likert -> normalized)\nfinal_reward         = tournament_component\n                     + positive_pass_weight * positive_bonus\n```\n\nNon-qualifying rollouts (wins < `positive_pass_wins_threshold`) always receive +0.0 bonus.\n\n### Positive-Pass Configuration\n\n```python\nload_environment(\n    # ... 
normal tournament args ...\n\n    # Toggle / tune the positive pass:\n    positive_pass_enabled=True,\n    positive_pass_wins_threshold=3,   # min wins to qualify (~round 3/4)\n    positive_pass_num_principles=2,   # principles sampled per qualifier\n    positive_pass_weight=0.5,         # bonus multiplier\n\n    # Analyzer LLM (defaults to judge model/endpoint if unset):\n    positive_pass_analyzer_temperature=0.7,\n    analyzer_model=None,\n    analyzer_base_url=None,\n    analyzer_api_key=None,\n)\n```\n\nSet `positive_pass_enabled=False` to restore pre-0.2 tournament-only behavior.\n","encoding":"utf-8","truncated":false,"total_bytes":6050},"status":null}