{"data":{"kind":"file","path":"README.md","version_id":"hkb2kwciuv914ljr5ajop7qm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4778,"modified_at":"2025-10-30T00:44:04.982000","content_hash":"5f8785618019b642785bb4b6eeb8f6d14f194e96fd2ddb48e2b81b48dec25eed"},"entries":[],"content":"# Dakota1890\r\n\r\n> Reinforcement learning environment for Dakota language grammar and translation, grounded in the 1890 Dakota-English Dictionary grammar rules.\r\n\r\n## What is Dakota1890?\r\n\r\nDakota1890 is a verifiers-compatible RL environment designed to train language models on Dakota grammar through reinforcement learning. The environment includes:\r\n\r\n- **1,497 grammar rules** extracted from the 1890 Dakota-English Dictionary\r\n- **10,576 training tasks** covering morphology, translation, syntax, and pattern identification\r\n- **Special character preservation** for Dakota orthography (ć, š, ŋ, ḣ, ṡ, á, é, í, ó, ú, etc.)\r\n- **Multi-turn and single-turn** task support for complex grammar learning\r\n\r\n### Overview\r\n- **Environment name**: `dakota1890`\r\n- **Version**: `0.1.0`\r\n- **Source**: 1890 Dakota-English Dictionary grammar section (pages 1-88)\r\n- **Task types**: Morphology, translation, reverse translation, syntax, pattern identification\r\n- **Difficulty levels**: Easy (1,973 tasks), Medium (5,294 tasks), Hard (1,172 tasks), Advanced (2,137 tasks)\r\n\r\n## Source Data\r\n\r\nThe environment is built from:\r\n- **Grammar extraction**: 92 pages from the 1890 Dakota-English Dictionary grammar section\r\n- **Rule extraction**: 697 formal grammar rules, 408 interlinear texts, 395 linguistic terms\r\n- **Task generation**: Automatic conversion of rules to RL training tasks with positive/negative examples\r\n\r\n**Repository**: [Dakota1890 GitHub](https://github.com/HarleyCoops/Dakota1890)\r\n\r\n## Installation\r\n\r\n```bash\r\nprime env install <owner>/dakota1890\r\n# Or\r\nuv pip install dakota1890 --extra-index-url https://hub.primeintellect.ai/<owner>/simple/\r\n```\r\n\r\n## Usage\r\n\r\n### Task\r\n- **Type**: Single-turn and multi-turn chat\r\n- **Parser**: `DakotaTranslationParser` (preserves Dakota orthography)\r\n- **Rubric overview**: Character preservation, affix accuracy, semantic correctness, and pattern-based rule coverage metrics.\r\n\r\n### Quickstart\r\n\r\nRun an evaluation with default settings:\r\n\r\n```bash\r\nuv run vf-eval dakota1890\r\n```\r\n\r\nConfigure model and sampling:\r\n\r\n```bash\r\nuv run vf-eval dakota1890 \\\r\n  -m gpt-4.1-mini \\\r\n  -n 20 -T 0.7 -M 1024 \\\r\n  -a '{\"max_examples\": 200, \"difficulty_filter\": [\"easy\"]}'\r\n```\r\n\r\n### For RL Training\r\n\r\nUse with PrimeIntellect RL training:\r\n\r\n```python\r\nfrom dakota_grammar_translation import load_environment\r\n\r\nenv = load_environment(\r\n    dataset_path=\"path/to/grammar_tasks_complete.jsonl\",\r\n    difficulty_filter=[\"easy\", \"medium\"],  # Curriculum learning\r\n    max_examples=1000\r\n)\r\n```\r\n\r\n### Environment Arguments\r\n\r\n| Argument | Type | Default | Description |\r\n| --- | ---- | ------- | ----------- |\r\n| `dataset_path` | str | Auto-detected | Path to RL task JSONL file (grammar_tasks_complete.jsonl). |\r\n| `eval_path` | str | `None` | Optional separate evaluation JSONL. |\r\n| `max_examples` | int | `-1` | Cap number of training examples (-1 = all). |\r\n| `eval_examples` | int | `-1` | Cap number of evaluation examples. |\r\n| `eval_fraction` | float | `0.1` | Fraction reserved for eval when `eval_path` not supplied. |\r\n| `difficulty_filter` | list[str] | `None` | Filter to difficulty levels: `[\"easy\"]`, `[\"medium\"]`, `[\"hard\"]`, `[\"advanced\"]`. |\r\n| `task_filter` | list[str] | `None` | Filter to task types: `[\"morphology\"]`, `[\"translation\"]`, `[\"syntax\"]`, etc. |\r\n| `system_prompt` | str | Auto | Custom system instruction for the model. |\r\n\r\n### Metrics\r\n\r\n| Metric | Meaning |\r\n| ------ | ------- |\r\n| `character_preservation_reward` | Reward for preserving Dakota special characters (ć, š, ŋ, etc.) |\r\n| `affix_accuracy_reward` | Reward for correct affix application (prefixes/suffixes) |\r\n| `semantic_accuracy_reward` | Reward for semantic correctness (translation quality) |\r\n| `composite_reward` | Weighted combination of all metrics |\r\n\r\n## Dakota Language Notes\r\n\r\n### Special Characters\r\n\r\nThe environment enforces preservation of Dakota orthography. These characters are critical for accurate Dakota language representation:\r\n\r\n- **Glottal stop**: ʼ\r\n- **Acute accents**: á, é, í, ó, ú\r\n- **Caron diacritics**: č, š, ž\r\n- **Special consonants**: ŋ (eng)\r\n- **Dotted characters**: ḣ, ṡ, ė\r\n\r\n### Why This Matters\r\n\r\nDakota is a low-resource language with unique orthography. Character preservation is essential for:\r\n- Maintaining linguistic accuracy\r\n- Preserving cultural authenticity\r\n- Enabling proper language learning\r\n- Supporting language revitalization efforts\r\n\r\n## Citation\r\n\r\nIf you use Dakota1890 in your research, please cite:\r\n\r\n```bibtex\r\n@software{dakota1890,\r\n  title = {Dakota1890: RL Environment for Dakota Grammar},\r\n  author = {Dakota Language Lab},\r\n  year = {2025},\r\n  url = {https://github.com/HarleyCoops/Dakota1890}\r\n}\r\n```\r\n\r\n## License\r\n\r\nApache-2.0\r\n\r\n","encoding":"utf-8","truncated":false,"total_bytes":4778},"status":null}