{"data":{"kind":"file","path":"README.md","version_id":"lr13c0x02mf7v4scpsi7kr42","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2381,"modified_at":"2025-10-11T03:34:51.773000","content_hash":"30968ef6cd87bb17263b2b739467354e4c2f2b7204bea290145aafa64f8ebefe"},"entries":[],"content":"# MMHelix\n\nThis environment is inspired by [this paper](https://arxiv.org/pdf/2510.08540).\n\n### Overview\n- **Environment ID**: `MMHelix`\n- **Short description**: MM-Helix Benchmark and training environment\n- **Tags**: Multi-Modal, Image, RLVR\n\n### Datasets\n- **Primary dataset(s)**: MM-HELIX-100K\n- **Source links**: [MM-HELIX-BENCHMARK](https://huggingface.co/datasets/tianhao2k/MM-HELIX), [MM-HELIX-100K](https://huggingface.co/datasets/mjuicem/MM-HELIX-100K)\n\n### Task\n- **Type**: single-turn\n- **Rubric overview**: Exact match on what the original answer is\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval MMHelix\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval MMHelix   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"key\": \"value\"}'  # env-specific args as JSON\n```\n\nExamples with different environment arguments:\n\n```bash\n# Use only 50 evaluation samples\nuv run vf-eval MMHelix -a '{\"num_eval_samples\": 50}'\n\n# Exclude specific categories from evaluation\nuv run vf-eval MMHelix -a '{\"exclude_cats\": [\"category1\", \"category2\"]}'\n\n# Use only 100 training samples and exclude specific tasks\nuv run vf-eval MMHelix -a '{\"num_train_samples\": 100, \"exclude_task\": [\"task1\", \"task2\"]}'\n\n# Combine multiple arguments\nuv run vf-eval MMHelix -a '{\"num_eval_samples\": 30, \"exclude_cats\": [\"category1\"], \"num_train_samples\": 200}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nDocument any supported environment arguments and their meaning. Example:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_samples` | int | `-1` | Number of training samples to use (-1 for all) |\n| `num_eval_samples` | int | `-1` | Number of evaluation samples to use (-1 for all) |\n| `exclude_task` | list | `[]` | List of tasks to exclude from training |\n| `exclude_indices` | list | `[]` | List of indices to exclude from training |\n| `exclude_cats` | list | `[]` | List of categories to exclude from training/evaluation |\n| `**kwargs` | dict | `{}` | Additional arguments passed to vf.SingleTurnEnv |\n\n### Metrics\nSummarize key metrics your rubric emits and how they're interpreted.\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of criteria) |\n| `accuracy` | Exact match on target answer |\n\n","encoding":"utf-8","truncated":false,"total_bytes":2381},"status":null}