{"data":{"kind":"file","path":"README.md","version_id":"flm829op4zu4owye36btnije","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5396,"modified_at":"2025-09-18T10:33:55.941000","content_hash":"3f9f7d30daea9b01cb2e586bb39b0d79b5279fdc76abf2819f57b4b16ffef263"},"entries":[],"content":"# alphabet-sort-qwen\n\n### Overview\n- **Environment ID**: `alphabet_sort_qwen`\n- **Short description**: Multi-turn alphabet sorting environment for evaluating language models' ability to maintain and update alphabetically sorted lists across conversation turns.\n- **Tags**: alphabet-sort, multi-turn, reasoning, evaluation, xml-parsing\n\n### Datasets\n- **Primary dataset(s)**: `kalomaze/alphabetic-arxiv-authors-it1` (real arXiv author names)\n- **Source links**: [HuggingFace Dataset](https://huggingface.co/datasets/kalomaze/alphabetic-arxiv-authors-it1)\n- **Split sizes**: Uses `train` split with configurable example limits\n\n### Task\n- **Type**: multi-turn\n- **Parser**: `XMLParser` with dynamic tags (`alphabetical_sorted` for first turn, `combined_alphabetical_sorted` for subsequent turns)\n- **Rubric overview**: Sequence similarity scoring with configurable precision emphasis (similarity^power)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval alphabet_sort_qwen\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval alphabet_sort_qwen \\\n  -m Qwen/Qwen3-0.6B \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\nCustomize environment parameters:\n\n```bash\nuv run vf-eval alphabet_sort_qwen \\\n  -m Qwen/Qwen3-0.6B \\\n  -n 50 \\\n  -a '{\"max_turns\": 2, \"min_names_per_turn\": 2, \"max_names_per_turn\": 4, \"num_examples\": 500}'\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"kalomaze/alphabetic-arxiv-authors-it1\"` | HuggingFace dataset name for author names |\n| `dataset_split` | str | `\"train\"` | Dataset split to use 
|\n| `max_turns` | int | `3` | Maximum number of conversation turns |\n| `min_turns` | int | `1` | Minimum number of conversation turns |\n| `min_names_per_turn` | int | `1` | Minimum names to add per turn |\n| `max_names_per_turn` | int | `5` | Maximum names to add per turn |\n| `similarity_power` | int | `4` | Power to raise similarity score (emphasizes precision) |\n| `seed` | int | `42` | Random seed for reproducibility |\n| `num_examples` | int | `1000` | Number of examples to generate |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Average sequence similarity score across all turns, raised to the specified power |\n\n### Example Conversation\n\n**Turn 1:**\n```\nUser: Sort these names in alphabetical order by FIRST name: MarcoEllero, MassimoTessarotto, EnricoFonda\n\nUse exactly this format:\n<alphabetical_sorted>\nName1\nName2\nName3\n</alphabetical_sorted>\n```\n\n**Expected Response:**\n```xml\n<alphabetical_sorted>\nEnricoFonda\nMarcoEllero\nMassimoTessarotto\n</alphabetical_sorted>\n```\n\n**Turn 2:**\n```\nUser: Now sort ALL of these names alphabetically by FIRST name: AliceBrown, BobWilson\n\nThese are in addition to the prior list. Mark any NEW names (that weren't in the prior list) with `// new name!` at the end.\n\nUse exactly this format:\n<combined_alphabetical_sorted>\nName1\nName2\nName3\nName4\nName5\n</combined_alphabetical_sorted>\n```\n\n**Expected Response:**\n```xml\n<combined_alphabetical_sorted>\nAliceBrown // new name!\nBobWilson // new name!\nEnricoFonda\nMarcoEllero\nMassimoTessarotto\n</combined_alphabetical_sorted>\n```\n\n### Evaluation Setup\n\nThis environment is designed for evaluating language models' multi-turn reasoning capabilities. Here's how to set up evaluation:\n\n#### 1. Serve Model with vLLM (Optional)\n```bash\n# Terminal 1: Start vLLM server for local evaluation\nCUDA_VISIBLE_DEVICES=0 vf-vllm --model Qwen/Qwen3-0.6B --enforce-eager --disable-log-requests\n```\n\n#### 2. 
Run Evaluation\n```bash\n# Terminal 2: Run evaluation against local server\nuv run vf-eval alphabet_sort_qwen \\\n  -b http://localhost:8000/v1 \\\n  -k empty \\\n  -m Qwen/Qwen3-0.6B \\\n  -n 10 -r 3 \\\n  -a '{\"num_examples\": 50, \"num_eval_examples\": 10}'\n```\n\n#### 3. Evaluate with Different Models\n```bash\n# Test with different model sizes\nuv run vf-eval alphabet_sort_qwen \\\n  -b http://localhost:8000/v1 \\\n  -k empty \\\n  -m Qwen/Qwen3-1.7B \\\n  -n 5 -r 2\n\n# Test with different parameters\nuv run vf-eval alphabet_sort_qwen \\\n  -b http://localhost:8000/v1 \\\n  -k empty \\\n  -m Qwen/Qwen3-0.6B \\\n  -n 5 -r 2 \\\n  -a '{\"max_turns\": 2, \"min_names_per_turn\": 2, \"max_names_per_turn\": 3}'\n```\n\n### Use Cases\n- **Multi-turn Reasoning Evaluation**: Tests a model's ability to maintain context across conversation turns\n- **Alphabetical Sorting**: Evaluates sub-token level reasoning and sorting capabilities\n- **Format Compliance**: Tests whether the model produces structured output that the XML parser can read\n- **Model Comparison**: Compares different model sizes and architectures on reasoning tasks\n- **Benchmarking**: Provides a standardized evaluation of multi-turn reasoning capabilities\n\n### Performance Expectations\n- **Qwen3-0.6B**: ~0.3-0.5 average reward (baseline performance)\n- **Qwen3-1.7B**: ~0.4-0.6 average reward (improved reasoning)\n- **Larger Models**: Generally show better performance with more parameters\n- **Evaluation Time**: ~1-2 minutes for 10 examples with 3 rollouts each\n- **Memory Requirements**: ~4GB VRAM for GPU evaluation, ~8GB RAM for CPU evaluation\n\n### Notes\n- Uses real arXiv author names for more realistic evaluation data\n- Supports both single-turn and multi-turn conversations\n- Configurable difficulty through turn count and name count parameters\n- Compatible with both CPU and GPU evaluation setups\n- Designed for standardized evaluation across different model 
architectures\n","encoding":"utf-8","truncated":false,"total_bytes":5396},"status":null}