{"data":{"kind":"file","path":"README.md","version_id":"jmafynklqamltogxovppsf7y","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3765,"modified_at":"2025-12-30T03:26:58.341000","content_hash":"fcd8952a754bd30abbdd2f2eacdcad8d3608842fe8b6622632d2491fa398f645"},"entries":[],"content":"# base64bench\n\nBased on the original experiment [Base64Bench- How good are LLMs at base64, and why care about it?](https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about) by Rich Barton-Cooper\n\n### Overview\n- **Environment ID**: `base64bench`\n- **Short description**: Evaluates a model's ability to accurately perform base64 encoding and decoding across a variety of text and data formats.\n- **Tags**: `base64`, `encoding`, `decoding`, `eval`, `train`\n\n### Datasets\n- **Primary dataset(s)**: The environment uses pre-generated `train.jsonl` and `eval.jsonl` files containing diverse data types, such as UUIDs, API keys, JSON, natural language, and cryptographic data representations.\n- **Source links**: The datasets were generated with https://github.com/richcooper95/base-64-bench\n- **Split sizes**: 340 test examples (10 per data type); 1360 train examples (40 per data type). Each example is used twice, once for an encoding task and once for a decoding task.\n\n### Task\n- **Type**: single-turn\n- **Parser**: `vf.XMLParser`\n- **Rubric overview**: The reward is calculated using normalized Levenshtein similarity. For encoding tasks, the model's output is decoded and compared to the original text. 
For decoding tasks, the model's output is directly compared to the original text.\n\n### Quickstart\nRun an evaluation with default settings on both encode and decode tasks:\n\n```bash\nuv run vf-eval base64bench\n```\n\nConfigure the model, restrict the evaluation to the `decode` task, and limit the evaluation set to 3 examples from each data type:\n\n```bash\nuv run vf-eval base64bench \\\n  -m gpt-4.1-mini \\\n  -n -1 \\\n  -a '{\"task\": \"decode\", \"examples_per_data_type\": 3}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The default behavior is to evaluate the model's zero-shot (no reasoning) performance. To train or evaluate with reasoning, use the following `system_prompt`:\n\n`\"Please think step-by-step, then provide the final answer within <answer>...</answer> tags.\"`\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `train_file` | str | `\"train.jsonl\"` | The name of the training dataset file located in the environment's directory. |\n| `eval_file` | str | `\"eval.jsonl\"` | The name of the evaluation dataset file located in the environment's directory. |\n| `system_prompt` | str | `\"Provide just the final answer within <answer>...</answer> tags.\"` | The system prompt to use for the environment. |\n| `seed` | int | `3301` | The seed used for shuffling the dataset. |\n| `task` | str | `None` | Restricts the evaluation to a specific task. Can be `\"encode\"`, `\"decode\"`, or `None` (default) to run both. |\n| `examples_per_data_type` | int | `None` | If set, creates a balanced evaluation set by sampling N examples of each data type from the eval file. |\n\n### Metrics\nThe primary reward is the similarity score itself. The metrics are named after the task they score.\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The main scalar reward. Weighted sum of the three rewards listed below. 
|\n| `encode_similarity_reward` | Normalized Levenshtein similarity between the original text and the model's base64 output after it has been decoded. A score of 1.0 is a perfect match. |\n| `decode_similarity_reward` | Normalized Levenshtein similarity between the original text and the model's decoded output. A score of 1.0 is a perfect match. |\n| `format_reward` | The format reward, measuring whether the generated text is formatted correctly. |\n\n## Changelog\n\n### v0.1.2\n- **Bumped** `verifiers` to `0.1.8.post2`.\n- **Refactored** to use `vf.XMLParser` and a compatible system prompt. Added `format_reward` to the rubric.","encoding":"utf-8","truncated":false,"total_bytes":3765},"status":null}