{"data":{"kind":"file","path":"README.md","version_id":"khxqt0ys1z0uhgcvqn7h7dp6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4615,"modified_at":"2026-05-28T02:55:09.676000","content_hash":"c4d05b2575011ea9f175434d616467f46a38aab1d52df274ee1977ba533ce334"},"entries":[],"content":"# abercrombie\n\n### Overview\n- **Environment ID**: `abercrombie`\n- **Version**: 0.1.3\n- **Description**: Train a small model to classify a trademark on the Abercrombie\n  distinctiveness spectrum (Generic → Descriptive → Suggestive → Arbitrary →\n  Fanciful) by answering five fixed yes/no doctrinal questions and applying a\n  deterministic routing rule.\n- **Tags**: law, single-turn, trademark, abercrombie, legalbench, GRPO\n- **Base model**: [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)\n\n### Task\n- **Type**: single-turn classification\n- **Output format** (no other text):\n  ```\n  Q1: [Yes/No]\n  Q2: [Yes/No]\n  Q3: [Yes/No]\n  Q4: [Yes/No]\n  Q5: [Yes/No]\n  FINAL_CLASSIFICATION: [Generic/Descriptive/Suggestive/Arbitrary/Fanciful]\n  ```\n- **Routing rule** (frozen, in the system prompt):\n  - Q1 = Yes → Fanciful\n  - else Q2 = No → Arbitrary\n  - else Q5 = Yes → Generic\n  - else Q4 = Yes → Descriptive\n  - else Q3 = Yes → Suggestive\n\nThe five questions: Q1 Coined Term, Q2 Semantic Relationship, Q3 Imagination,\nQ4 Immediate Conveyance, Q5 Genus.\n\n### Data\n- **Training**: `abercrombie_train.jsonl` (bundled) — 2,100 synthetic examples,\n  ~420 per class, phrased to match the LegalBench test surface form\n  (`The mark \"X\" for Y.`). Generated from the canonical\n  `Mark: …\\nGoods/Services: …` source under `../Abercrombie Data/` via\n  `reformat_to_legalbench.py`; rerun that script to regenerate. Marks are\n  invented and the generator blacklist excludes every LegalBench test mark, so\n  there is no train/test contamination. Each row carries the gold\n  `classification` plus the mechanically-derived `decisive_q` /\n  `decisive_q_answer`.\n- **Eval**: `legalbench_abercrombie_test.tsv` (bundled) — the 95-row LegalBench\n  Abercrombie test set (19 per class), same `The mark \"X\" for Y.` form,\n  attached as `eval_dataset`. This is the real target metric. Train and eval\n  now share the same surface form (the test set uses straight quotes in 92/95\n  rows; training matches that).\n\n### Rubric\nThree reward functions, weights `(1.0, 0.3, 0.2)`:\n\n| # | Function | Weight | Range | Description |\n|---|----------|--------|-------|-------------|\n| 1 | `ordinal_accuracy_reward` | 1.0 | [-1, 1] | Dominant signal. `1 - 2·(|pred_idx − true_idx| / 4)` on the final label. Exact = +1, adjacent = +0.5, opposite end = -1. Missing/unparseable = -1. |\n| 2 | `decisive_q_reward` | 0.3 | [-1, 1] | +1/-1 on the single dispositive question fixed by the true label. Non-decisive Qs are scaffolding and are **not** graded. Missing = -1. |\n| 3 | `consistency_bonus` | 0.2 | [0, 1] | +1 only when the final label is correct **AND** the decisive-Q answer matches. Gated on correctness so format/consistency can never carry a wrong answer. |\n\nDesign rationale (carried from the prior run): the model's failure mode at\nbaseline was a massive Descriptive bias (Q4 = Yes for nearly everything), 0%\non Arbitrary and Fanciful. The ordinal reward gives partial credit for being\nin the right region of the spectrum; the decisive-Q reward pushes the\ndispositive sub-element to flip; the gated bonus rewards reaching the right\nanswer *via* the right reasoning step without letting consistency alone earn\nreward on a wrong label.\n\n### Quickstart\n```bash\n# evaluate the base model on the bundled LegalBench eval set\nuv run vf-eval abercrombie\n\n# smoke-test on a handful of training rows\nuv run vf-eval abercrombie -a '{\"max_examples\": 20, \"use_legalbench_eval\": false}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `system_prompt` | str | frozen prompt | Override the system prompt. |\n| `dataset_path` | str | bundled JSONL | Local training JSONL to use instead. |\n| `dataset_repo` | str | `None` | HuggingFace repo for training data (columns `prompt`, `classification`). |\n| `dataset_split` | str | `\"train\"` | Split when `dataset_repo` is set. |\n| `eval_path` | str | bundled TSV | Local LegalBench TSV for the eval split. |\n| `use_legalbench_eval` | bool | `True` | Attach LegalBench test as `eval_dataset`. |\n| `max_examples` | int | `-1` | Cap training rows (shuffles seed=42 when set). |\n| `weights` | tuple | `(1.0, 0.3, 0.2)` | Reward weights for the three functions. |\n\n### Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| `reward` | Weighted sum of the three reward functions |\n| `ordinal_accuracy_reward` | Distance-based score on the final classification |\n| `decisive_q_reward` | Correctness of the dispositive sub-element |\n| `consistency_bonus` | Right answer reached via the right decisive question |\n","encoding":"utf-8","truncated":false,"total_bytes":4615},"status":null}