{"data":{"kind":"file","path":"README.md","version_id":"n3j2cewqhe9566jz5l28slgc","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1911,"modified_at":"2026-02-27T18:42:41.003000","content_hash":"33eb1a3e3df12da4821d2091cf30bb002facbce37e9fe7ace0f43f0fd4cea252"},"entries":[],"content":"# seedeval\nA benchmark for evaluating the diversity of seeds generated using different tool calls.\n### Overview\n- **Environment ID**: `seedeval`\n- **Short description**: A benchmark for evaluating the diversity of seeds generated using different tool calls.\n- **Tags**: tool-use, synthetic-data\n- **Source**: [charvibannur/seed-eval-benchmark](https://github.com/charvibannur/seed-eval-benchmark/tree/main)\n\n### Datasets\n- **Primary dataset(s)**: FineWeb (works with any text dataset)\n\n<!-- ### Task\n- **Type**: <multi-turn | tool use>\n- **Output format expectations (optional)**: <XML>\n- **Rubric overview**: <briefly list reward functions and key metrics> -->\n\n### Architecture\n```\nseed-eval-benchmark/\n└── environments/\n    └── seedeval/\n        ├── README.md\n        ├── pyproject.toml\n        └── seedeval.py\n```\n\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run seedeval\n```\nConfigure model and sampling:\n\n```bash\nprime eval run seedeval -m gpt-4.1-mini -n 20 -r 3\n```\n<!-- \nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object. -->\n\n<!-- ### Environment Arguments\nDocument any supported environment arguments and their meaning. 
Example:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `foo` | str | `\"bar\"` | What this controls |\n| `max_examples` | int | `-1` | Limit on dataset size (use -1 for all) | -->\n\n### Metrics\nKey metrics emitted by the rubric and how to interpret them:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `Diversity Score` | Measures how semantically different the seeds are from each other |\n| `Education Score` | Uses an LLM judge to rate overall seed quality from 0–5 (similar to FineWeb-Edu) |\n| `Not Proper Noun Ratio` | Ratio of words that are not names, dates or places to the total number of generated seeds |\n","encoding":"utf-8","truncated":false,"total_bytes":1911},"status":null}