{"data":{"kind":"file","path":"README.md","version_id":"okhmqtjydzqoon94vbb8grgx","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10892,"modified_at":"2026-06-03T14:06:06.661000","content_hash":"ffd8f55abaac4b4b4d787ca625b9bf7fb0b07b8bc9be28ebce59d0e19fb96170"},"entries":[],"content":"# pj — Personal Jurisdiction GRPO Environment\n\n### Overview\n- **Environment ID**: `smolclaims/pj`\n- **Version**: 0.1.1\n- **Description**: Train a small model to classify whether a forum state has\n  personal jurisdiction over a defendant by answering three doctrinal yes/no\n  questions (Q1 domicile, Q2 sufficient contacts, Q3 nexus) and applying a\n  deterministic routing rule.\n- **Tags**: law, single-turn, personal-jurisdiction, civil-procedure, legalbench, GRPO\n- **Base model**: [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)\n- **Trained model**: [DoodDood/Personal-Jurisdiction-GRPO](https://huggingface.co/DoodDood/Personal-Jurisdiction-GRPO) — merged LoRA adapter trained on this env, ready to use\n\n### Headline result\n\nGRPO on Qwen3.5-4B against this env lifted LegalBench `personal_jurisdiction` from **66.0% → 74.0%** (+8.0 overall), with the model's complete Q1=Yes refusal at baseline (Q1 = Yes on 0/50 rows) corrected to fire on actual Domicile cases without leaking into the no-contacts class.\n\nOn the 50-row LegalBench `personal_jurisdiction` test (non-thinking, greedy):\n\n| Slice                       | Base Qwen3.5-4B | + GRPO (step 150) |   Δ   |\n|-----------------------------|:---------------:|:-----------------:|:-----:|\n| Domicile                    | 30.0%           | **40.0%**         | +10.0 |\n| No contacts, no nexus       | 100.0%          | 100.0%            |  0.0  |\n| Yes contacts, no nexus      | 87.5%           | **100.0%**        | +12.5 |\n| Yes contacts, yes nexus     | 25.0%           | **41.7%**         | +16.7 |\n| **Overall**                 | **66.0%**       | **74.0%**         | **+8.0** |\n\nZero slice 
regressions. The dispositive doctrinal lift is on Domicile (+10), which is what the reward stack was designed to attack — the base model literally never produced Q1=Yes regardless of fact pattern.\n\nRun details: 300 GRPO steps · batch 128 · 16 rollouts/example · LR 1e-4 · LoRA r=16 alpha=16 on Qwen3.5-4B · ~$15 hosted on Prime Intellect · checkpoint step 150 selected after sweeping multiple candidates (step 100 had no Domicile lift; step 300 lifted Domicile further but at the cost of NCNN regression to 80%; step 150 caught the Goldilocks point).\n\n### Why this environment looks the way it does\n\nLegalBench's personal_jurisdiction task uses a deliberately simplified\nframework — domicile OR (purposeful contacts AND nexus). It does NOT\ninclude the fair-play / substantial-justice factor, Daimler's \"at home\"\nsophistication, or the Bristol-Myers Squibb / Ford line on \"arise out of\nor relate to.\" This environment matches LegalBench's framework exactly.\nTraining a richer 4- or 5-question doctrine creates train/test mismatch:\nwe'd be teaching the model an element the eval doesn't grade. Deliberate\nsimplification, noted here so it doesn't read as a doctrinal oversight.\n\n### Task\n- **Type**: single-turn binary classification\n- **Output format** (no other text):\n  ```\n  Q1: [Yes/No]\n  Q2: [Yes/No]\n  Q3: [Yes/No]\n  FINAL_CLASSIFICATION: [Yes/No]\n  ```\n- **Routing rule** (frozen, in the system prompt):\n  - Q1 = Yes → Yes (general jurisdiction)\n  - else Q2 = Yes AND Q3 = Yes → Yes (specific jurisdiction)\n  - else → No\n\nThe three questions: Q1 Domicile, Q2 Sufficient Contacts (purposeful\navailment), Q3 Nexus (arise-out-of / relate-to).\n\n### Data\n\n- **Eval**: `data/legalbench_pj_test.tsv` (bundled) — the 50-row LegalBench\n  personal_jurisdiction test set. 
Columns: `index, answer, text, slice`.\n  Slice distribution: 10 `Domicile` (9 Yes + 1 No plan-to-move trap),\n  20 `No contacts, no nexus` (all No), 8 `Yes contacts, no nexus` (all\n  No), 12 `Yes contacts, yes nexus` (all Yes). Attached as\n  `eval_dataset`. Real target metric.\n\n- **Training**: `data/pj_train.jsonl` (bundled, **2,000 rows**) — synthetic\n  corpus built by parallel orchestrated agents (Antigravity / Codex) across\n  100 batches of 20, with a cross-family blind audit pass (Claude relabeling\n  Gemini-generated rows on a 130-row stratified sample) finding 99.2% slice\n  agreement; 4 doctrinally clear errors corrected by hand. Composition is\n  weighted toward the failure modes the base model exhibits:\n\n  | Slice                                  |  Count |\n  |----------------------------------------|-------:|\n  | Domicile / Yes                         |    500 |\n  | Domicile / No (plan-to-move traps)     |    100 |\n  | Yes contacts, yes nexus / Yes          |    500 |\n  | Yes contacts, no nexus / No            |    396 |\n  | No contacts, no nexus / No             |    504 |\n  | **Total**                              | **2,000** |\n\n  Names, states, and facts are synthetic; LegalBench test party names\n  (Dustin / Laura / Ana / David, etc.) are blacklisted to prevent train/test\n  contamination. Defendant names are deterministically partitioned across\n  batches (25-name windows from a 200-name pool) so no name dominates the\n  corpus.\n\n### The (slice, answer) → decisive Q mapping\n\nThis is the only piece of the design that's more nuanced than the\nAbercrombie playbook. Abercrombie's decisive Q is a function of the\ntrue label alone; for PJ, it's a joint function of (slice, answer)\nbecause the `Domicile` slice carries both Yes answers (actual domicile)\nand No answers (plan-to-move trap).\n\n| Slice | Answer | Decisive Q | Why |\n|---|---|---|---|\n| `Domicile` | Yes | Q1 = Yes | Defendant actually domiciled in forum. 
|\n| `Domicile` | No  | Q1 = No  | Plan-to-move trap — not yet domiciled. |\n| `No contacts, no nexus` | No | Q2 = No | Purposeful availment fails. |\n| `Yes contacts, no nexus` | No | Q3 = No | Nexus fails (contacts exist, claim unrelated). |\n| `Yes contacts, yes nexus` | Yes | Q3 = Yes | Both Q2 and Q3 required; Q3 is the doctrinally harder call. |\n\nFor the `Yes contacts, yes nexus` slice, Q2 = Yes is also required for\nthe final answer to route to Yes. The choice to grade Q3 (not Q2) as\nthe decisive question reflects that nexus is the trickier doctrinal\ncall in practice — purposeful availment is usually the easier\ninference, nexus is where models confabulate \"relatedness\" most.\n\n### Rubric\n\nFive reward functions, default weights `(1.0, 0.3, 0.2, 0.15, 0.3)`:\n\n| # | Function | Weight | Range | Description |\n|---|----------|--------|-------|-------------|\n| 1 | `final_accuracy_reward` | 1.0 | [-1, 1] | Dominant. +1 correct final Yes/No, -1 wrong, -1 missing/unparseable. Binary analogue of Abercrombie's ordinal_accuracy. |\n| 2 | `decisive_q_reward` | 0.3 | [-1, 1] | +1/-1 on the dispositive sub-element fixed by (slice, answer). Non-decisive Qs are scaffolding and are **not** graded by this function. |\n| 3 | `consistency_bonus` | 0.2 | [0, 1] | +1 only when final correct **AND** decisive-Q answer matches. Gated to block format/coherence from carrying a wrong answer. |\n| 4 | `routing_consistency_reward` | 0.15 | [-1, 1] | +1 when stated FINAL matches the routing of the model's own Q-answers. Coherence signal, not truth. |\n| 5 | `routed_truth_reward` | 0.3 | [-1, 1] | +1 when the model's Q-answers route to the true label. The doctrine-decomposed-correctly signal — orthogonal to whether the model wrote a coherent FINAL. |\n\nDirect port of the Abercrombie stack with two adaptations:\n1. `ordinal_accuracy` → `final_accuracy` (binary task; no ordinal\n   structure between Yes and No).\n2. 
Decisive-Q derivation is joint on (slice, answer), not pure on\n   label (see the table above).\n\n### Quickstart\n\n```bash\n# install the env from the Hub\nprime env install smolclaims/pj\n\n# evaluate the base model on the bundled LegalBench eval set\nuv run vf-eval pj\n\n# smoke-test on a handful of training rows\nuv run vf-eval pj -a '{\"max_examples\": 20, \"use_legalbench_eval\": false}'\n```\n\nTo use the trained model from this env, see [DoodDood/Personal-Jurisdiction-GRPO](https://huggingface.co/DoodDood/Personal-Jurisdiction-GRPO) — the model card includes the frozen system prompt and the `enable_thinking=False` requirement.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `system_prompt` | str | frozen prompt | Override the system prompt. |\n| `dataset_path` | str | bundled JSONL | Local training JSONL to use instead. |\n| `dataset_repo` | str | `None` | HuggingFace repo for training data (columns `prompt`, `answer`, `slice`). |\n| `dataset_split` | str | `\"train\"` | Split when `dataset_repo` is set. |\n| `eval_path` | str | bundled TSV | Local LegalBench TSV for the eval split. |\n| `use_legalbench_eval` | bool | `True` | Attach LegalBench test as `eval_dataset`. |\n| `max_examples` | int | `-1` | Cap training rows (shuffles seed=42 when set). |\n| `weights` | tuple | `(1.0, 0.3, 0.2, 0.15, 0.3)` | Reward weights for the five functions. 
|\n\n### Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| `reward` | Weighted sum of the five reward functions |\n| `final_accuracy_reward` | Correctness on the final Yes/No label |\n| `decisive_q_reward` | Correctness of the dispositive sub-element |\n| `consistency_bonus` | Right answer reached via the right decisive question |\n| `routing_consistency_reward` | Model's own decomposition routes to its own stated final |\n| `routed_truth_reward` | Model's own decomposition routes to the true label |\n\n### Method context\n\nThis is the third project in a series applying the same GRPO reward shape (final-accuracy dominant + decisive-element gated bonus + routing-coherence signals) to different doctrinal decomposition shapes:\n\n1. **[TOMAGPT](https://huggingface.co/DoodDood/TOMAGPT)** — hearsay 3-element conjunction (binary). +6.4% lift on LegalBench hearsay.\n2. **[Abercrombie-GRPO](https://huggingface.co/DoodDood/abercrombie-grpo)** — trademark distinctiveness ordinal 5-class. +28.4% lift on LegalBench Abercrombie.\n3. **[Personal-Jurisdiction-GRPO](https://huggingface.co/DoodDood/Personal-Jurisdiction-GRPO)** (this) — binary jurisdiction with general/specific routing switch over conjunction. +8.0% lift on LegalBench personal_jurisdiction; complete Q1=Yes refusal at baseline corrected.\n\nEach project targeted a different doctrinal shape; the same reward stack landed all three.\n\n### Checkpoint selection note\n\nPrime's in-training eval scores are not equivalent to the real LegalBench number for this kind of strict-format task — the in-training eval path doesn't pipe `enable_thinking=False` through to inference (no `[eval.sampling]` override in the Hosted Training config schema), so its absolute scores are biased by reasoning-mode generations that don't match the trained model's actual capability. 
For the trained model linked above, the winning checkpoint was selected by downloading multiple candidate checkpoints and evaluating each one locally with `enable_thinking=False`. Step 150 beat step 100 (no Domicile lift yet) and step 300 (further Domicile lift, but a 20-point regression on the `No contacts, no nexus` slice, from 100% to 80%). See the conversation log on the parent project for the full sweep methodology.\n","encoding":"utf-8","truncated":false,"total_bytes":10892},"status":null}