{"data":{"kind":"file","path":"README.md","version_id":"nntesm60p3na7pnp3ctrvkz4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7972,"modified_at":"2025-08-23T16:37:54.726000","content_hash":"7532b8b471a1b928c0a411ed059d11b23f03e9d4696f4cf33561c12f02d2a307"},"entries":[],"content":"# deepconf\r\n\r\n### Overview\r\n- **Environment ID**: `deepconf`\r\n- **Short description**: DeepConf environment for confidence-aware LLM reasoning evaluation based on the DeepConf paper [https://arxiv.org/html/2508.15260v1](https://arxiv.org/html/2508.15260v1)\r\n- **Tags**: deepconf, confidence, reasoning, single-turn, math, eval\r\n\r\n### DeepConf Paper Implementation\r\nThis environment implements the confidence-based reasoning trace filtering methods described in the DeepConf paper. It provides:\r\n\r\n- **Confidence Measurement**: Multiple confidence metrics including token confidence, group confidence, bottom-10% confidence, and tail confidence\r\n- **Offline Filtering**: Confidence-based filtering of reasoning traces after generation\r\n- **Online Thinking**: Real-time confidence monitoring during generation (when supported by inference server)\r\n- **Majority Voting**: Standard and confidence-weighted majority voting for answer aggregation\r\n\r\n### Datasets\r\n- **Primary datasets**: Supports all mathematical reasoning datasets from the DeepConf paper:\r\n  - `aime2024` - AIME 2024 competition problems\r\n  - `aime2025` - AIME 2025 competition problems\r\n  - `gpqa_diamond` - GPQA Diamond dataset\r\n  - `amc2023` - AMC 2023 problems\r\n  - Plus all other datasets supported by `verifiers.utils.data_utils.load_example_dataset()`\r\n- **Source links**: Uses the example loader in `verifiers.utils.data_utils`\r\n- **Split sizes**: Configurable via args; defaults to full train/test\r\n\r\n### Task\r\n- **Type**: single-turn with confidence analysis\r\n- **Parser**: `ThinkParser` with boxed answer extraction\r\n- **Rubric overview**: Multi-criteria 
evaluation combining correctness, confidence quality, and format adherence\r\n\r\n### Confidence Metrics Implemented\r\n\r\n1. **Token Confidence**: Negative mean of alternative token logprobs\r\n2. **Group Confidence**: Average confidence across token groups\r\n3. **Bottom 10% Confidence**: Confidence of the lowest 10% of tokens\r\n4. **Lowest Group Confidence**: Minimum confidence among token groups\r\n5. **Tail Confidence**: Confidence of the tail percentage of tokens\r\n6. **Average Trace Confidence**: Overall confidence of the reasoning trace\r\n7. **Entropy-based metrics**: Token entropy and max entropy measurements\r\n\r\n### Quickstart\r\n\r\nRun an evaluation with default DeepConf settings:\r\n\r\n```bash\r\n# Install the environment\r\nvf-install deepconf\r\n\r\n# Quick evaluation with AIME 2025\r\nuv run vf-eval deepconf -m gpt-4.1-mini --env-args '{\"dataset_name\": \"aime2025\"}'\r\n```\r\n\r\nConfigure model and confidence parameters:\r\n\r\n```bash\r\nuv run vf-eval deepconf \\\r\n  -m gpt-4.1-mini \\\r\n  -n 10 -r 3 -t 2048 -T 0.7 \\\r\n  -a '{\r\n    \"dataset_name\": \"aime2025\",\r\n    \"confidence_metric\": \"avg_trace_confidence\",\r\n    \"confidence_threshold\": 0.5,\r\n    \"num_traces\": 32\r\n  }'\r\n```\r\n\r\n### Environment Arguments\r\n\r\n| Arg | Type | Default | Description |\r\n| --- | ---- | ------- | ----------- |\r\n| `dataset_name` | str | `\"aime2025\"` | Dataset to use (aime2024, aime2025, gpqa_diamond, etc.) 
|\r\n| `confidence_metric` | str | `\"avg_trace_confidence\"` | Confidence metric for filtering |\r\n| `confidence_threshold` | float | `0.5` | Threshold for confidence filtering |\r\n| `num_traces` | int | `32` | Number of traces to generate per example |\r\n| `use_think` | bool | `true` | Whether to use ThinkParser for reasoning traces |\r\n| `num_train_examples` | int | `-1` | Limit training set size (-1 for all) |\r\n| `num_eval_examples` | int | `-1` | Limit eval set size (-1 for all) |\r\n\r\n### Confidence Metrics Available\r\n\r\n| Metric | Description |\r\n| ------ | ----------- |\r\n| `token_confidence` | Average confidence across all tokens |\r\n| `group_confidence` | Group-based confidence measurement |\r\n| `bottom_10_percent_confidence` | Confidence of bottom 10% of tokens |\r\n| `lowest_group_confidence` | Minimum group confidence |\r\n| `tail_confidence` | Tail percentage confidence |\r\n| `avg_trace_confidence` | Overall trace confidence |\r\n| `avg_entropy` | Average token entropy |\r\n| `max_entropy` | Maximum token entropy |\r\n\r\n### DeepConf Methods\r\n\r\n#### Offline Filtering\r\n```python\r\nfrom environments.deepconf.deepconf import filter_traces_by_confidence\r\n\r\n# Filter traces below confidence threshold\r\nhigh_confidence_traces = filter_traces_by_confidence(\r\n    traces=generated_traces,\r\n    confidence_threshold=0.5,\r\n    confidence_metric='avg_trace_confidence'\r\n)\r\n```\r\n\r\n#### Confidence-Weighted Majority Voting\r\n```python\r\nfrom environments.deepconf.deepconf import confidence_weighted_majority_vote\r\n\r\n# Get final answer using confidence-weighted voting\r\nfinal_answer = confidence_weighted_majority_vote(\r\n    traces=traces,\r\n    confidence_metric='avg_trace_confidence'\r\n)\r\n```\r\n\r\n#### Confidence Analysis\r\n```python\r\nfrom environments.deepconf.deepconf import analyze_confidence_correlation\r\n\r\n# Analyze correlation between confidence and correctness\r\nanalysis = 
analyze_confidence_correlation(evaluation_results)\r\nprint(f\"Confidence separation: {analysis.get('confidence_separation', 0.0):.3f}\")\r\n```\r\n\r\n### Advanced Usage\r\n\r\n#### Custom Confidence Metrics\r\n```python\r\n# You can implement custom confidence metrics by extending the environment\r\ndef custom_confidence_metric(completion, logprobs_data):\r\n    # Your custom logic here; 0.0 is only a placeholder score\r\n    custom_score = 0.0\r\n    return custom_score\r\n\r\n# Then use it in the environment arguments\r\nenv_args = {\r\n    \"confidence_metric\": \"custom_metric\",\r\n    # ... other args\r\n}\r\n```\r\n\r\n#### Integration with vLLM for Online DeepConf\r\nOnline confidence monitoring during generation requires a vLLM build with the DeepConf modifications:\r\n\r\n```python\r\nfrom openai import OpenAI\r\n\r\n# Enable online confidence monitoring (requires modified vLLM)\r\nclient = OpenAI(\r\n    base_url=\"http://localhost:8000/v1\",\r\n    api_key=\"dummy\"\r\n)\r\n\r\n# Example prompt; replace with your own messages\r\nmessages = [{\"role\": \"user\", \"content\": \"What is 2 + 2?\"}]\r\n\r\nresponse = client.chat.completions.create(\r\n    model=\"your-model\",\r\n    messages=messages,\r\n    extra_body={\r\n        \"vllm_xargs\": {\r\n            \"enable_conf\": True,\r\n            \"window_size\": 2048,\r\n            \"threshold\": 0.5\r\n        }\r\n    }\r\n)\r\n```\r\n\r\n### Paper Reproduction\r\n\r\nThis environment is designed to reproduce the key experiments from the DeepConf paper:\r\n\r\n- **Figure 1**: Parallel thinking vs DeepConf filtering\r\n- **Table 2**: Performance improvements on AIME 2025\r\n- **Figure 3**: Token reduction vs accuracy trade-offs\r\n- **Table 4**: Cross-model performance improvements\r\n\r\n### Performance Expectations\r\n\r\nBased on the paper, you can expect:\r\n- **AIME 2025**: Up to 99.9% accuracy with DeepConf@512 vs 97.0% with majority voting\r\n- **Token Reduction**: Up to 84.7% fewer generated tokens compared to full parallel thinking\r\n- **GPQA-Diamond**: Significant improvements on challenging reasoning tasks\r\n\r\n### Implementation Notes\r\n\r\n- **Logprobs Requirement**: For full confidence analysis, your inference server must 
provide `logprobs` and `top_logprobs` in responses\r\n- **vLLM Integration**: Online DeepConf requires the vLLM modifications described in Appendix G of the paper\r\n- **Memory Usage**: Confidence analysis requires storing additional metadata for each generated token\r\n\r\n### Troubleshooting\r\n\r\n#### No Confidence Data\r\nIf you see all confidence metrics as 0.0:\r\n- Ensure your model/inference server supports `logprobs=True`\r\n- Check that `top_logprobs` is set to a value >= 2\r\n- Verify that the response includes logprobs data\r\n\r\n#### Poor Filtering Performance\r\n- Try different confidence metrics (e.g., `group_confidence` instead of `avg_trace_confidence`)\r\n- Adjust the confidence threshold based on your model's output patterns\r\n- Consider using `confidence_weighted_majority_vote` instead of threshold filtering\r\n\r\n#### Memory Issues with Large Traces\r\n- Reduce `num_traces` parameter\r\n- Use smaller `window_size` for confidence calculations\r\n- Consider offline filtering instead of online monitoring\r\n\r\n## Evaluation Reports\r\n\r\n<!-- Do not edit below this line. Content is auto-generated. -->\r\n<!-- vf:begin:reports -->\r\n<p>No reports found. Run <code>uv run vf-eval deepconf -a '{\"key\": \"value\"}'</code> to generate one.</p>\r\n<!-- vf:end:reports -->\r\n","encoding":"utf-8","truncated":false,"total_bytes":7972},"status":null}