{"data":{"kind":"file","path":"README.md","version_id":"vmq5ujs8zvclmz8spdczttbm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6070,"modified_at":"2026-05-25T22:22:04.981000","content_hash":"64f48be1cb5fc77c4f5680146b6ac91ad8857498eee689189818dcf81815e580"},"entries":[],"content":"# humanize-rl-env\n\n**Prime Intellect Verifiers** single-turn environment for Humanize-RL.\n\nPublished at: `jayshah5696/humanize-rl-env`\n\n---\n\n## What this does\n\nEach episode is a single writing task — rewrite a corporate Slack update into\ncasual human prose, compress a meeting recap, or draft a direct workplace\nmessage. The model receives the task prompt, produces one completion, and the\nenvironment scores it.\n\n---\n\n## Final Reward Formula\n\n```\nreward = 0.50 × ridge_rubric + 0.50 × deterministic + penalties\n         clipped to [-1.0, 1.0]\n```\n\n### Ridge Rubric (50%)\n\nMean of 8 regression heads from a TF-IDF Logistic+Ridge scorer trained on\n10,000 Gemini-labelled writing samples (AUROC 0.9988, rubric MSE 0.041).\nThe scorer estimates the same humanness dimensions the LLM judge uses:\n\n| dim | what it penalizes |\n|---|---|\n| `structural_symmetry` | rigid intro/body/conclusion templates, formulaic lists |\n| `specificity` | vague claims without names, numbers, or concrete context |\n| `formality_gradient` | unnatural tone shifts |\n| `voice_consistency` | generic assistant voice instead of a specific writer |\n| `rhetorical_sophistication` | filler analysis, stakes inflation, shallow reasoning |\n| `padding_density` | repeated restatements, zero-information sentences |\n| `personality_presence` | lack of opinion, perspective, friction |\n| `copula_avoidance` | pompous substitutes for simple is/are/was |\n\nThe scorer pkl is **bundled inside the wheel** (`ridge_state.pkl`) — no\nexternal download or API call required.\n\n### Deterministic (50%)\n\nEqual-weight mean of 6 constraint checks that fire on hard task rules:\n\n| component | 
what it checks |\n|---|---|\n| `faithfulness` | no invented details, missing numbers/entities, dropped/added facts |\n| `task_following` | no option menus, wrapper phrases, or refusals |\n| `length` | within `max_words` / `min_words` / `exact_sentences` |\n| `format` | no subject lines, signoffs, or forbidden markdown |\n| `placeholder` | placeholder usage matches task constraints |\n| `clarity` | avg sentence length (≤24 words → 1.0, ≤35 → 0.7, else 0.4) |\n\n### Penalties (additive)\n\n| penalty | value |\n|---|---|\n| `invented_detail` | −0.50 |\n| `forbidden_fact` | −0.50 |\n| `option_menu` | −0.40 |\n| `missing_number` | −0.40 |\n| `missing_entity` | −0.40 |\n| `refusal` | −0.40 |\n| `placeholder_disallowed` | −0.35 |\n| `missing_required_fact` | −0.35 |\n| `too_long` | −0.30 |\n| `placeholder_required` | −0.30 |\n| `ai_tell_phrase` | −0.25 |\n| `wrapper_phrase` | −0.20 |\n| `sentence_count` | −0.20 |\n| `wrong_format_markdown` | −0.20 |\n| `subject_line` | −0.20 |\n| `too_short` | −0.15 |\n| `signoff` | −0.15 |\n\n---\n\n## Metrics logged (zero-weight, diagnostics only)\n\n```\nstyle_metric            — layer1 heuristics + ridge P(human) blend\ntask_following_metric   — option_menu / wrapper / refusal / length checks\nfaithfulness_metric     — invented detail / missing facts / forbidden facts\nlength_metric           — word count within constraints\nformat_metric           — subject / signoff / markdown checks\nclarity_metric          — avg sentence length\nplaceholder_metric      — placeholder constraint compliance\nrisk_penalty_metric     — sum of all triggered penalties\noption_menu_penalty_metric\nwrapper_phrase_penalty_metric\ninvented_detail_penalty_metric\n```\n\n---\n\n## Dataset\n\nDefault: `humanize_tasks_v01_smoke.jsonl` — bundled smoke dataset (train/eval\nsplits). 
Pass `task_path=` to `load_environment()` to use a larger dataset.\n\nTask shape (key fields):\n\n```json\n{\n  \"id\": \"rl_v01_000001\",\n  \"family\": \"rewrite_repair\",\n  \"domain\": \"slack\",\n  \"mode\": \"rewrite\",\n  \"instruction\": \"Clean up this Slack update. Keep it casual and under 30 words.\",\n  \"input_text\": \"Please be advised that staging has been restored...\",\n  \"constraints\": { \"max_words\": 30, \"preserve_numbers\": true, ... },\n  \"reward_profile\": \"rewrite_faithful_concise\",\n  \"required_facts\": [\"staging restored\", \"missing STRIPE_WEBHOOK_SECRET\"],\n  \"forbidden_facts\": [\"database migration\"],\n  \"split\": \"train\"\n}\n```\n\n---\n\n## Usage\n\n```bash\n# Install\nprime env install jayshah5696/humanize-rl-env\n\n# Run evaluation (3 examples, 2 rollouts each)\nprime eval run jayshah5696/humanize-rl-env \\\n  --model google/gemma-3-4b-it \\\n  --api-base-url https://openrouter.ai/api/v1 \\\n  --api-key-var OPENROUTER_API_KEY \\\n  --num-examples 3 \\\n  --rollouts-per-example 2\n```\n\n```python\n# Load the environment from Python\nfrom verifiers import load_environment\n\nenv = load_environment(\"humanize-rl-env\")\n\n# With a custom dataset\nenv = load_environment(\"humanize-rl-env\", task_path=\"data/rl/my_tasks.jsonl\", split=\"train\")\n```\n\n---\n\n## Package structure\n\n```\nhumanize_rl_env/\n├── __init__.py              # load_environment(), preview_dataset_row()\n├── ridge_state.pkl          # bundled TF-IDF+Ridge scorer (1.1 MB)\n├── humanize_tasks_v01_smoke.jsonl  # bundled smoke dataset\n├── reward/\n│   ├── reward.py            # score_response(), RewardResult, load_ridge_scorer()\n│   ├── checks.py            # deterministic penalty checks\n│   ├── tasks.py             # RLTask model, load_tasks()\n│   ├── profiles.py          # legacy reward profiles (reference)\n│   ├── verifiers_adapter.py # Prime Verifiers async reward + metric functions\n│   ├── grpo_rewards.py      # TRL/GRPO reward functions\n│   └── env.py               # HumanizeRLEnv, 
prime_dataset_row()\n└── scoring/\n    ├── aggregator.py        # layer1 score_text()\n    ├── layer1.py            # 8 regex/heuristic dimensions\n    └── patterns.py          # AI-tell regex patterns\n```\n\nAll scoring logic is **self-contained** — no external `humanize-rl` package\ninstall required. The wheel includes the ridge pkl and smoke dataset.\n\n---\n\n## Versions\n\n| version | change |\n|---|---|\n| 0.1.7 | 50/50 ridge-rubric + deterministic formula; 8 rubric dims exposed |\n| 0.1.6 | ridge scorer bundled as state dict pkl |\n| 0.1.5 | self-contained bundle, passes prime eval run end-to-end |\n| 0.1.0 | initial push |\n","encoding":"utf-8","truncated":false,"total_bytes":6070},"status":null}