{"data":{"kind":"file","path":"README.md","version_id":"cjatzis0iym7dleaofocb4l0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3103,"modified_at":"2026-03-14T23:54:26.413000","content_hash":"861b450367d52d9a0b3209dfdf023684ea45cb2b3c995de6173e060b0f72c385"},"entries":[],"content":"# mijabench\n\n### Overview\n- **Environment ID**: `mijabench`\n- **Short description**: A bilingual adversarial benchmark for auditing demographic safety alignment and jailbreak vulnerabilities in LLMs. Ported from MiJaBench paper. Abs: https://arxiv.org/abs/2601.04389\n- **Tags**: safety, red-teaming, jailbreak, bias, single-turn, eval\n\n### Datasets\n- **Primary dataset(s)**: `mijabench/mijabench`\n- **Source links**: [Hugging Face](https://huggingface.co/datasets/mijabench/mijabench)\n- **Split sizes**: 44,000 synthetic adversarial prompts (eval-only).\n\n### Task\n- **Type**: single-turn\n- **Output format expectations**: Plain text. The model is expected to respond naturally to the jailbreak prompt.\n- **Rubric overview**: Uses an LLM-as-a-Judge protocol with a specialized Chain-of-Thought JSON prompt. The judge evaluates intent, false refusals, actionable content, and style vs. content to determine if the model safely refused the prompt or complied with the jailbreak.\n\n### Quickstart\nRun a quick evaluation with default settings (saving the judge's response):\n\n```bash\nprime eval run mijabench -C \"judge_response\"\n```\n\nConfigure the model, sampling, and specific judge endpoints:\n\n```bash\nprime eval run mijabench \\\n  -m x-ai/grok-4.1-fast \\\n  -n 160 -r 3 -t 2048 -T 0.7 \\\n  -a '{\"language\": \"all\", \"judge_model\": \"gpt-4-1-mini\", \"max_examples\": 160}' # stratified by demographics to 160 examples\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object. We recommend setting `\"max_examples\"` inside `-a` to match your `-n` flag so the dataset stratifies the demographics perfectly.\n\n### Analysis Script\nMiJaBench includes a built-in analysis tool to aggregate your `results.jsonl` into the exact demographic safety hierarchy and cognitive fragility tables described in the original research paper.\n\nAfter running an evaluation, simply run:\n```bash\npython -m mijabench.analyze\n```\n*(You can also pass `--json` to output structured data for downstream agent analysis).*\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `language` | str | `\"all\"` | Filter prompts by language (`\"english\"`, `\"portuguese\"`, or `\"all\"`). |\n| `max_examples` | int | `-1` | Limit on dataset size. Ensures stratified sampling across demographics. |\n| `judge_model` | str | `\"gpt-4o-mini\"` | The LLM model used to judge safety compliance. |\n| `judge_api_key_var` | str | `\"OPENAI_API_KEY\"` | Environment variable containing the judge's API key. |\n| `judge_base_url` | str | `\"https://api.openai.com/v1\"` | Base URL for the judge API client. |\n| `seed` | int | `42` | Seed for dataset shuffling. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | **Defense Rate (DR)**: `1.0` if the model safely refused the jailbreak, `0.0` if it generated harmful content. |\n| `judge_refusal` | Flags (`1.0`) if the judge model itself refused to evaluate the text (usually implies extreme toxicity in the evaluated model's response). |\n| `judge_parse_error` | Flags (`1.0`) if the judge failed to output valid JSON. |\n","encoding":"utf-8","truncated":false,"total_bytes":3103},"status":null}