{"data":{"kind":"file","path":"README.md","version_id":"s9g72w2j3nrmbti6p44309a1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5938,"modified_at":"2026-03-11T22:08:30.120000","content_hash":"35b84b828a6fc26abceabf647db264bf634995ad9ea8e23faf0e79b4600a1fd0"},"entries":[],"content":"# indic-ifeval\n\nInstruction-following evaluation for 14 Indic languages plus English, based on [AI4Bharat/IndicIFEval](https://huggingface.co/datasets/ai4bharat/IndicIFEval) ([paper](https://arxiv.org/abs/2602.22125)). Models are prompted in a target language and scored on how well they follow verifiable, rule-based constraints embedded in the instruction, such as word count limits, forbidden words, required keywords, paragraph structure, and first-word constraints.\n\nThe benchmark covers ~800 human-verified examples per language across two subsets. Research shows that models handle formatting constraints reasonably well but struggle significantly with lexical and cross-lingual tasks, making this a meaningful RL training signal beyond English.\n\n## Quickstart\n\n```bash\nprime eval run indic-ifeval\n```\n\nRun with a specific language and subset:\n\n```bash\nprime eval run indic-ifeval \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 \\\n  -a '{\"language\": \"hi\", \"subset\": \"trans\"}'\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `language` | str | `\"hi\"` | ISO 639-1 language code (see supported languages below). |\n| `subset` | str | `\"trans\"` | `\"trans\"`: translated from English IFEval; `\"ground\"`: natively generated Indic instructions. |\n| `num_examples` | int | `-1` | Maximum number of examples to use. `-1` uses all examples. |\n| `seed` | int | `42` | Seed used to shuffle the dataset when `num_examples > 0`. 
|\n\n## Supported Languages\n\n| Code | Language | trans | ground |\n| ---- | -------- | :---: | :----: |\n| `as` | Assamese | ✓ | ✓ |\n| `bn` | Bengali | ✓ | ✓ |\n| `en` | English | ✓ | — |\n| `gu` | Gujarati | ✓ | ✓ |\n| `hi` | Hindi | ✓ | ✓ |\n| `kn` | Kannada | ✓ | ✓ |\n| `ml` | Malayalam | ✓ | ✓ |\n| `mr` | Marathi | ✓ | ✓ |\n| `ne` | Nepali | ✓ | ✓ |\n| `or` | Odia | ✓ | ✓ |\n| `pa` | Punjabi | ✓ | ✓ |\n| `sa` | Sanskrit | ✓ | ✓ |\n| `ta` | Tamil | ✓ | ✓ |\n| `te` | Telugu | ✓ | ✓ |\n| `ur` | Urdu | ✓ | ✓ |\n\n## Example\n\n**Prompt** (`language=\"hi\"`, `subset=\"ground\"`):\n```\nयुवाओं में त्वरित धन कमाने के लिए अपराध की ओर बढ़ते झुकाव के कारणों और समाज पर\nइसके प्रभावों का विश्लेषण करते हुए एक निबंध लिखें।\nअपने निबंध की शुरुआत 'आसानी' शब्द से करना अनिवार्य है।\n```\n*(Write an essay analysing the reasons behind youth turning to crime for quick money and its effects on society. Your essay must begin with the word 'आसानी' (easily).)*\n\n**Constraints**: `length_constraints:nth_paragraph_first_word` — first paragraph must start with `आसानी`\n\n**Compliant response** (passes `strict_score`):\n```\nआसानी से पैसे कमाने की चाह आज के युवाओं को कई बार गलत रास्ते पर ले जाती है...\n```\n\n**Non-compliant response** (fails `strict_score`):\n```\nआज के युवाओं में अपराध की प्रवृत्ति बढ़ रही है...\n```\n*(Doesn't start with the required word)*\n\n## Constraint Categories\n\nThe benchmark covers these constraint types across both subsets:\n\n| Category | Examples |\n| -------- | -------- |\n| `length_constraints` | min/max word count, sentence count, paragraph count, first word of nth paragraph |\n| `keywords` | must include keyword, forbidden words, keyword frequency |\n| `detectable_format` | number of highlighted sections, bullet points, JSON format |\n| `punctuation` | no commas, specific punctuation rules |\n| `language` | respond in a specific language |\n| `startend` | start/end with specific phrase or postscript |\n\n## Metrics\n\nEach response is scored against 
all constraints in the prompt. The score is the **fraction of constraints passed** (0.0–1.0).\n\n| Metric | Weight | Description |\n| ------ | :----: | ----------- |\n| `strict_score` | 1.0 | Checks the raw completion as-is. This is the RL reward signal. |\n| `loose_score` | 0.0 | Checks 8 variants of the completion with whitespace/markdown normalization (e.g. strips `**bold**`, trims spaces). Diagnostic only: used to detect cases where the model follows the intent but formatting trips up the checker. |\n\nA response with 2 out of 3 constraints satisfied scores `0.667`. Both metrics run the same underlying constraint checkers from the IndicIFEval verification code; they differ only in how the response text is preprocessed before checking.\n\n## Prompting & Schema\n\nNo system prompt is used. Each example is a single user message containing the prompt verbatim from the dataset.\n\n**Schema per example:**\n```\nprompt:  [{\"role\": \"user\", \"content\": \"<instruction with embedded constraints>\"}]\nanswer:  \"\"  (unused; constraints are evaluated from instruction_id_list and kwargs in info)\ninfo:    {\"key\": ..., \"instruction_id_list\": [...], \"prompt\": ..., \"kwargs\": [...], \"language\": \"hi\", \"subset\": \"trans\"}\n```\n\nThe full assistant response is evaluated as-is, with no parsing or extraction. `strict_score` checks the raw completion; `loose_score` checks 8 normalized variants (see Metrics).\n\n## Dataset\n\n- **Source**: [`ai4bharat/IndicIFEval`](https://huggingface.co/datasets/ai4bharat/IndicIFEval)\n- **Configs**: `indicifeval-trans` (translated from English IFEval), `indicifeval-ground` (natively generated Indic instructions)\n- **Task type**: Single-turn\n- **Split behavior**: Each language is a separate Hugging Face split (e.g. `split=\"hi\"`). The full split is used for both training and eval by default. 
When `num_examples > 0`, the split is shuffled with `seed` and truncated to that size.\n- **Scale**: ~490 examples per language for `trans`, ~250–450 for `ground` (varies by language)\n","encoding":"utf-8","truncated":false,"total_bytes":5938},"status":null}