{"data":{"kind":"file","path":"README.md","version_id":"vkj3qtomh5zt2raocyz7hvu4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5260,"modified_at":"2026-02-04T01:21:41.530000","content_hash":"bab7ac33dd7bfd53db6b746b155bc9ab48285dcda92f7f2eaf4d666878c58ee0"},"entries":[],"content":"# low-resource-translation\n\n### Overview\n- **Environment ID**: `low-resource-translation`\n- **Short description**: Translation environment for underrepresented languages using FLORES-200, scored with chrF\n- **Tags**: single-turn, translation, multilingual, low-resource, train, eval\n\n### Motivation\n\nLarge language models often perform poorly on low-resource languages because little training data is available for them. This environment enables RL training to improve translation into and out of underrepresented languages, which has high social value for linguistic preservation and accessibility.\n\n### Datasets\n- **Primary dataset**: [FLORES-200](https://huggingface.co/datasets/facebook/flores) - Meta's massively multilingual translation benchmark\n- **Source**: Hugging Face (`facebook/flores`)\n- **Split sizes**: 997 sentences per language in `dev`, 1012 in `devtest`\n\n### Supported Languages\n\n**Source (pivot):**\n| Code | Language |\n|------|----------|\n| `eng_Latn` | English |\n\n**Target (low-resource):**\n| Code | Language | Region |\n|------|----------|--------|\n| `yor_Latn` | Yoruba | West Africa |\n| `ibo_Latn` | Igbo | West Africa |\n| `hau_Latn` | Hausa | West Africa |\n| `amh_Ethi` | Amharic | East Africa |\n| `swh_Latn` | Swahili | East Africa |\n| `zul_Latn` | Zulu | Southern Africa |\n| `wol_Latn` | Wolof | West Africa |\n| `quy_Latn` | Ayacucho Quechua | South America |\n| `ayr_Latn` | Central Aymara | South America |\n| `cym_Latn` | Welsh | Europe |\n| `eus_Latn` | Basque | Europe |\n| `gle_Latn` | Irish | Europe |\n| `mlt_Latn` | Maltese | Europe |\n| `mya_Mymr` | Burmese | Southeast Asia |\n| `khm_Khmr` | Khmer | Southeast Asia |\n| `lao_Laoo` | Lao | Southeast Asia |\n| `npi_Deva` | 
Nepali | South Asia |\n| `sin_Sinh` | Sinhala | South Asia |\n| `kat_Geor` | Georgian | Caucasus |\n| `smo_Latn` | Samoan | Pacific |\n| `mri_Latn` | Maori | Pacific |\n\n### Task\n- **Type**: single-turn\n- **Scoring**: chrF via `sacrebleu` when available, with a deterministic fallback otherwise\n- **Reward**: 0.85 * chrF + 0.15 * threshold bonus + 0.05 * format/length auxiliary reward\n\n### Quickstart\n\n```bash\n# Default: English -> Yoruba\nprime eval run low-resource-translation -m gpt-4.1-mini\n\n# English -> Welsh\nprime eval run low-resource-translation -m gpt-4.1-mini \\\n  -a '{\"target_lang\": \"cym_Latn\"}'\n\n# Swahili -> English (reverse direction)\nprime eval run low-resource-translation -m gpt-4.1-mini \\\n  -a '{\"target_lang\": \"swh_Latn\", \"direction\": \"to_english\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `source_lang` | str | `\"eng_Latn\"` | Must be `\"eng_Latn\"` for this English-pivoted environment |\n| `target_lang` | str | `\"yor_Latn\"` | FLORES-200 code of the low-resource language; it becomes the input side when `direction` is `\"to_english\"` |\n| `direction` | str | `\"from_english\"` | `\"from_english\"` or `\"to_english\"` |\n| `split` | str | `\"devtest\"` | Dataset split (`\"dev\"` or `\"devtest\"`) |\n| `num_examples` | int | `-1` | Maximum number of examples (`-1` for all) |\n| `chrf_threshold` | float | `0.2` | chrF value (0-1 scale) at or above which `threshold_bonus` is awarded |\n\n### Data Source\n\nThe environment reads the FLORES-200 sentence files directly instead of relying on a Hugging Face dataset loading script, so it works with `datasets>=4`, which no longer supports script-based datasets.\n\n- Auto-download cache: `~/.cache/low_resource_translation`\n- Optional override: `FLORES200_DATA_DIR=/path/to/flores200_dataset`\n- Optional cache root: `FLORES200_CACHE_DIR=/custom/cache/root`\n\n### Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| `reward` | Weighted sum: 0.85 * `chrf_score` + 0.15 * `threshold_bonus` + 0.05 * `format_length_reward` |\n| `chrf_score` | Character-level F-score (0-1) |\n| `threshold_bonus` | 1.0 if `chrf_score` >= `chrf_threshold`, else 0.0 |\n| 
`format_length_reward` | Small auxiliary score favoring concise, single-translation outputs |\n| `length_ratio` | Output length / reference length |\n| `exact_match` | Exact string match (strict) |\n\n### Why chrF over BLEU?\n\n- **Character-level**: Handles morphologically rich languages better (agglutinative, fusional)\n- **No tokenization needed**: Avoids tokenizer bias for non-Latin scripts\n- **Better correlation**: Tends to correlate better with human judgment for low-resource languages\n- **More granular**: Gives partial credit for nearly-correct words\n\n### Training Tips\n\n1. **Start with related languages**: If training for Yoruba, warm up on Swahili first\n2. **Bidirectional training**: Train both directions (from/to English) for better representations\n3. **Multi-language**: Train on multiple low-resource languages simultaneously for cross-lingual transfer\n4. **Adjust threshold**: Lower `chrf_threshold` for harder languages, raise it for easier ones\n\n### Example Training Config\n\n```toml\nmodel = \"Qwen/Qwen3-30B-A3B-Instruct-2507\"\nmax_steps = 500\nbatch_size = 128\nrollouts_per_example = 4\n\n[sampling]\nmax_tokens = 256\n\n[[env]]\nid = \"your-username/low-resource-translation\"\nargs = { target_lang = \"yor_Latn\", direction = \"from_english\" }\n\n[[env]]\nid = \"your-username/low-resource-translation\"\nargs = { target_lang = \"yor_Latn\", direction = \"to_english\" }\n```\n\n### Known Limitations\n\n- **Metric mismatch**: Good translations can score low if their phrasing differs from the reference\n- **Script issues**: Some scripts (Ethiopic, Myanmar) may need additional normalization\n- **Dialect variation**: FLORES uses standardized forms, so dialectal variation is not captured\n- **Single reference**: Only one reference translation is available per sentence\n","encoding":"utf-8","truncated":false,"total_bytes":5260},"status":null}