{"data":{"kind":"file","path":"README.md","version_id":"b6vf3bb51b0k3fgiehetks1n","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":12951,"modified_at":"2025-12-14T15:35:53.313000","content_hash":"a41e6ac6782a117e3895a1d1f9d1271a062372dfed46627c5ad8a84e5186bb98"},"entries":[],"content":"# CodeBlue: DataAnalytics\n\n![Version](https://img.shields.io/badge/version-0.1.9-blue)\n![Python](https://img.shields.io/badge/python-3.10+-green)\n![License](https://img.shields.io/badge/license-MIT-blue)\n\n**High-quality RLVR environment for training and evaluating LLMs on pandas data analysis tasks**\n\nA comprehensive benchmark for evaluating language models on real-world data analysis tasks with exact-match verification. Features 172 curated questions across 5 datasets with difficulty levels from basic operations to complex multi-hop analytical queries.\n\n## 🎯 Key Features\n\n- **172 High-Quality Questions** across 9 question sets with validated gold code\n- **5 Real-World Datasets** (banking, healthcare, e-commerce, loans, road safety)\n- **Multi-Level Difficulty** (L3-L5): From simple aggregations to complex multi-hop chains\n- **23 Question Templates** covering business analytics patterns\n- **Exact Match Verification** with code execution sandbox\n- **Comprehensive Evaluation** splits for training and testing\n\n---\n\n## 📊 Datasets\n\n| Dataset | Rows | Columns | Domain | Description |\n|---------|------|---------|--------|-------------|\n| **bank** | 750,000 | 18 | Finance | Bank marketing campaign with customer demographics and subscription outcomes |\n| **shopping** | 3,900 | 18 | E-commerce | Customer shopping behavior with purchase patterns and demographics |\n| **heart** | 918 | 12 | Healthcare | Heart disease clinical records with patient health indicators |\n| **loan** | 32,581 | 12 | Finance | Loan applications with approval status and financial metrics |\n| **road** | 12,316 | 21 | Safety | Road accident data with severity indicators 
and conditions |\n\n---\n\n## 📚 Question Sets & Splits\n\n### Easy Splits (Basic Operations)\n| Split | Questions | Difficulty | Skills | Description |\n|-------|-----------|------------|--------|-------------|\n| `shopping/v1_basic` | 6 | Easy | duplicated, isnull, value_counts | Data quality and basic counts |\n| `heart/v1_basic` | 4 | Easy | groupby, mean, value_counts | Simple aggregations |\n\n### Mixed Difficulty Splits\n| Split | Questions | Difficulty | Skills | Description |\n|-------|-----------|------------|--------|-------------|\n| `bank/v1_validated` | 18 | Mixed | groupby, filtering, percentage | Marketing analytics |\n| `shopping/v1_mined` | 7 | Mixed | groupby, mode, idxmax/min | Shopping behavior patterns |\n| `shopping/v2_validated` | 16 | Mixed | groupby, mean, describe | Enhanced shopping analysis |\n| `loan/v1_mined` | 40 | Mixed | groupby, aggregation | Mined from notebooks |\n| `loan/v1_validated` | 22 | Mixed | groupby, mean, value_counts | Validated loan analytics |\n| `road/v1_validated` | 10 | Mixed | groupby, corr, idxmax | Road safety diagnostics |\n\n### Hard Split (L3-L5 Multi-hop)\n| Split | Questions | Difficulty | Level Distribution | Description |\n|-------|-----------|------------|-------------------|-------------|\n| `bank/v2_hard` | 49 | Hard | L5: 41%, L4: 35%, L3: 24% | Template-generated complex analytics |\n\n**Total:** 172 questions across 9 splits\n\n---\n\n## 🎓 Difficulty Levels & Templates\n\n### L3 Templates (24.5%) - Comparisons & Rankings\nSimple 1-2 step operations focusing on comparisons and filtering:\n- Conversion rate comparisons between segments\n- Top/bottom performing segments\n- Segments above/below thresholds\n- Size-based rankings\n\n**Example:** *\"Which job categories have subscription rates above 15%?\"*\n\n### L4 Templates (34.7%) - Binned Analysis & Cross-Metrics\nQuartile-based analysis and cross-metric operations:\n- Quartile conversion analysis (pd.qcut)\n- Segment breakdown tables\n- 
Cross-segment performance metrics\n- Metric-based breakdowns\n\n**Example:** *\"Divide customers into 4 balance quartiles. What is the subscription percentage in Q2?\"*\n\n### L5 Templates (40.8%) - Multi-hop Chains & Complex Analytics\nComplex multi-step reasoning requiring 3+ operations:\n- Chain conversion (find extrema → analyze subset)\n- Opportunity segment identification\n- Nested extrema analysis\n- Anomaly detection (high metric + low conversion)\n- Impact potential scoring\n\n**Example:** *\"Find the job with lowest subscription rate. Within that group, what is the highest average day?\"*\n\n**23 Unique Templates** generate questions programmatically with slot-filling from dataset schemas.\n\n---\n\n## 🏆 Benchmark Results (bank/v2_hard - 49 Hard Questions)\n\n| Rank | Model | Execution Success | Correct Answers | Accuracy |\n|------|-------|------------------|-----------------|----------|\n| 🥇 | **Qwen3 Max** | 49/49 (100%) | 37/49 | **75.5%** |\n| 🥈 | **Claude Opus 4.5** | 49/49 (100%) | 35/49 | **71.4%** |\n| 🥉 | **GPT-5.1** | 49/49 (100%) | 32/49 | **65.3%** |\n| 4 | **Grok 4 Fast** | 48/49 (98%) | 30/49 | **61.2%** |\n| 5 | **Qwen 2.5 72B Instruct** | 47/49 (96%) | 28/49 | **57.1%** |\n| 6 | **Gemini 2.5 Flash** | 48/49 (98%) | 25/49 | **51.0%** |\n| 7 | **DeepSeek Chat** | 48/49 (98%) | 24/49 | **49.0%** |\n| 8 | **Llama 3.1 70B Instruct** | 47/49 (96%) | 20/49 | **40.8%** |\n\n**Key Findings:**\n- ✅ **Top performers** (Qwen3 Max, Claude Opus 4.5) achieve 71-76% accuracy on the hard split\n- ⚠️ **Mid-tier models** (GPT-5.1, Grok 4 Fast, Qwen 2.5 72B Instruct) achieve 57-65% with near-perfect execution\n- 📊 **Execution success** is high (96-100%) for all models - the main challenge is correctness, not syntax\n- 🎯 **Challenge level**: Even the best model (Qwen3 Max) answers 12 of the 49 hard questions (about 25%) incorrectly\n\n*Full benchmark results and per-question analysis available in `results/bank_v2_hard/`*\n\n---\n\n## 🚀 Quickstart\n\n### Installation\n```bash\n# Install 
environment from Prime Hub\nprime env install codeblue/codeblue-analytics\n\n# Or install locally\ncd environments/codeblue-analytics\npip install -e .\n```\n\n### Running Evaluations\n```bash\n# Run easy evaluation (10 questions)\nuv run vf-eval codeblue-analytics --split eval\n\n# Run hard benchmark (49 questions)\nuv run vf-eval codeblue-analytics --split bank/v2_hard\n\n# Run specific dataset evaluations\nuv run vf-eval codeblue-analytics --split heart/v1_basic\nuv run vf-eval codeblue-analytics --split shopping/v2_validated\nuv run vf-eval codeblue-analytics --split loan/v1_validated\n```\n\n### Using Custom Models\n```bash\n# Using eval_runner.py with Prime backend\npython scripts/eval_runner.py \\\n  --split bank/v2_hard \\\n  --backend prime \\\n  --model google/gemini-2.5-flash \\\n  --output results/my_eval.json\n\n# Using OpenAI models (local backend)\npython scripts/eval_runner.py \\\n  --split bank/v2_hard \\\n  --backend local \\\n  --model gpt-4o \\\n  --output results/gpt4o_eval.json\n```\n\n---\n\n## 📖 Task Format\n\n### Input Structure\nModels receive:\n1. **System Prompt** - Instructions for code generation\n2. **DataFrame Schema** - Column names, types, and sample rows\n3. **Natural Language Question** - Business analytics query\n\n### Expected Output\nModels must generate Python code wrapped in `<code></code>` tags:\n\n```python\n<code>\nresult = df.groupby('job')['y'].apply(lambda x: (x == 1).mean() * 100).round(2)\n</code>\n```\n\n### Requirements\n- ✅ DataFrame pre-loaded as `df`\n- ✅ Store final answer in `result` variable\n- ✅ Only pandas (pd) and numpy (np) available\n- ❌ No imports, prints, or file I/O\n- ❌ No external libraries\n\n---\n\n## 💡 Example Task\n\n**Question:** *\"Show the subscription percentage breakdown by job. 
Return a DataFrame with columns [job, count, rate] where rate is percentage (0-100), sorted by rate descending.\"*\n\n**DataFrame Context:**\n```\nDataset: bank (750,000 rows × 18 columns)\nColumns: id, age, job, marital, education, balance, y, ...\nSample:\n  job         | y\n  ------------|---\n  technician  | 0\n  blue-collar | 0\n  management  | 1\n```\n\n**Expected Gold Code:**\n```python\nbreakdown = df.groupby('job').agg(\n    count=('y', 'size'),\n    rate=('y', lambda x: round((x == 1).mean() * 100, 2))\n).sort_values('rate', ascending=False)\nresult = breakdown.reset_index()\n```\n\n**Expected Output:**\n```python\n[\n  {'job': 'student', 'count': 11767, 'rate': 34.08},\n  {'job': 'retired', 'count': 35185, 'rate': 24.62},\n  {'job': 'unemployed', 'count': 17634, 'rate': 17.98},\n  ...\n]\n```\n\n---\n\n## 🎯 Evaluation Metrics\n\n| Metric | Weight | Description |\n|--------|--------|-------------|\n| **Execution Success** | 0.3 | Code executes without errors in sandbox |\n| **Correctness** | 0.7 | Output matches expected result exactly |\n\n**Scoring:**\n- ✅ **Perfect (1.0)**: Code executes AND output matches exactly\n- ⚠️ **Partial (0.3)**: Code executes BUT wrong output\n- ❌ **Failed (0.0)**: Code fails to execute OR no code extracted\n\n---\n\n## 📁 Repository Structure\n\n```\nenvironments/codeblue-analytics/\n├── codeblue_analytics.py          # Main environment implementation\n├── pyproject.toml                  # Package configuration (v0.1.9)\n├── data/\n│   ├── datasets/                   # CSV files for all datasets\n│   │   ├── bank.csv               # 750k banking records\n│   │   ├── shopping.csv           # 3.9k shopping records\n│   │   ├── heart.csv              # 918 health records\n│   │   ├── loan.csv               # 32k loan records\n│   │   └── road.csv               # 12k accident records\n│   └── questions/                  # Question sets (JSONL)\n│       ├── registry.yaml          # Question set metadata\n│       ├── 
bank/v2_hard.jsonl     # 49 hard questions\n│       ├── shopping/              # Shopping questions\n│       ├── heart/                 # Heart questions\n│       ├── loan/                  # Loan questions\n│       └── road/                  # Road questions\n├── configs/splits/                 # Evaluation split configs\n└── .prime/                         # Prime environment metadata\n\nscripts/\n├── eval_runner.py                  # Unified eval runner (Prime + local)\n├── analyze_failures.py             # Failure analysis by template/level\n├── analyze_by_level.py             # Performance breakdown by L3/L4/L5\n├── run_benchmark_suite.py          # Multi-model benchmark runner\n└── summarize_results.py            # Results aggregation\n\nqa-gen/                             # Question generation pipeline\n└── templates/__init__.py           # 23 question templates (L3-L5)\n```\n\n---\n\n## 🔧 Advanced Usage\n\n### Creating Custom Splits\nCreate a YAML file in `configs/splits/`:\n```yaml\n# configs/splits/my_custom_split.yaml\ninclude:\n  - question_set: bank/v2_hard\n    filters:\n      level: [\"L5\"]  # Only L5 questions\n  - question_set: shopping/v2_validated\n```\n\nRun with:\n```bash\nuv run vf-eval codeblue-analytics --split my_custom_split\n```\n\n### Analyzing Results\n```bash\n# Analyze failures by template and level\npython scripts/analyze_failures.py results/my_eval.json\n\n# Compare performance across difficulty levels\npython scripts/analyze_by_level.py results/my_eval.json\n\n# Aggregate results from multiple runs\npython scripts/summarize_results.py results/bank_v2_hard/*.json\n```\n\n### Question Generation\nThe environment includes a template-based QA generation pipeline:\n```bash\ncd qa-gen\npython run.py --dataset bank --output-dir output/\n```\n\nSee `qa-gen/templates/__init__.py` for 23 template definitions.\n\n---\n\n## 📊 Dataset Statistics\n\n| Metric | bank | shopping | heart | loan | road 
|\n|--------|------|----------|-------|------|------|\n| **Rows** | 750,000 | 3,900 | 918 | 32,581 | 12,316 |\n| **Columns** | 18 | 18 | 12 | 12 | 21 |\n| **Target Column** | y (binary) | - | DEATH_EVENT | Status | Accident_Severity |\n| **Groupby Cols** | job, marital, month | Category, Season | Sex, Smoking | Purpose, Gender | Weather, Road_Type |\n| **Numeric Cols** | age, balance, day | Age, Purchase Amount | Age, Ejection_Fraction | Amount, Income | Speed_limit |\n| **Question Sets** | 2 (67 total) | 3 (29 total) | 1 (4 total) | 2 (62 total) | 1 (10 total) |\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! Areas of interest:\n- Adding new datasets with domain-specific analytics\n- Creating new question templates for different analytical patterns\n- Improving evaluation metrics and verification logic\n- Extending to multi-turn analytical conversations\n\n---\n\n## 📄 License\n\nMIT License - See LICENSE file for details\n\n---\n\n## 🔗 Links\n\n- **Prime Environment Hub**: [codeblue/codeblue-analytics](https://app.primeintellect.ai/dashboard/environments/codeblue/codeblue-analytics)\n- **Install**: `prime env install codeblue/codeblue-analytics`\n- **Version**: 0.1.9\n- **Last Updated**: 2025-12-14\n\n---\n\n## 📧 Citation\n\nIf you use this environment in your research, please cite:\n\n```bibtex\n@software{codeblue_analytics_2025,\n  title = {CodeBlue DataAnalytics: A Comprehensive RLVR Environment for Pandas Code Generation},\n  author = {CodeBlue Team},\n  year = {2025},\n  version = {0.1.9},\n  url = {https://app.primeintellect.ai/dashboard/environments/codeblue/codeblue-analytics}\n}\n```\n\n---\n\n**Built with ❤️ for the RLVR community**\n","encoding":"utf-8","truncated":false,"total_bytes":12951},"status":null}