{"data":{"kind":"file","path":"README.md","version_id":"mtl6dxz8jw5z98yeaigfhdzg","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1990,"modified_at":"2025-09-06T17:56:37.061000","content_hash":"1b0870b3295a07e6e40dd0569bc469930c1bce0e54b12057748bd8aa5cb06b30"},"entries":[],"content":"# ast-as-a-judge\n\n### Overview\n- **Environment ID**: `ast-as-a-judge`\n- **Short description**: Evaluates LLM security analysis capabilities by comparing against Bandit static analysis tool findings\n- **Tags**: security, static-analysis, code-analysis, ast, bandit\n\n### Datasets\n- **Primary dataset(s)**: `wambosec/insecure-code-snippets-text` - Python code snippets with security vulnerabilities\n- **Source links**: [HuggingFace Dataset](https://huggingface.co/datasets/wambosec/insecure-code-snippets-text)\n- **Split sizes**: Variable (configurable via `max_samples` parameter)\n\n### Task\n- **Type**: Single-turn\n- **Parser**: Custom parser for `\\boxed{ANSWER: ...}` format\n- **Rubric overview**: \n  - Judge-based content evaluation (90% weight): OpenAI judge compares LLM analysis with Bandit findings\n  - Format compliance (10% weight): Ensures proper `\\boxed{ANSWER: ...}` formatting\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval ast-as-a-judge\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval ast-as-a-judge -m gpt-4o-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{\"max_samples\": 50}'\n```\n\nNotes:\n- Requires Bandit to be installed: `pip install bandit`\n- Uses OpenAI API for judge evaluation (requires `OPENAI_API_KEY`)\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_split` | str | `\"train\"` | Dataset split to use for evaluation |\n| `size` | int | `50` | Number of samples to evaluate |\n| `max_samples` | int | `None` | Maximum samples to analyze (passed to Bandit analysis) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted combination of judge score (0.9) and format compliance (0.1) |\n| `judge_reward` | OpenAI judge evaluation of LLM vs Bandit analysis similarity (0-1) |\n| `format_reward` | Binary reward for proper `\\boxed{ANSWER: ...}` formatting (0 or 1) |\n","encoding":"utf-8","truncated":false,"total_bytes":1990},"status":null}