{"data":{"kind":"file","path":"README.md","version_id":"fmsfrk7edcowolzxp53v58f9","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5665,"modified_at":"2026-02-18T20:13:17.141000","content_hash":"ba6c98f0be83e1f8549709f2368ea2d9d8c98f75fe8a7bd43fe311ae78e58899"},"entries":[],"content":"# Data Tasks Environment\n\nA [Harbor-format](https://github.com/PrimeIntellect-ai/verifiers) environment for evaluating and training LLMs on real-world data science and analytics tasks.\n\nBuilt on the [Prime Intellect Verifiers](https://github.com/PrimeIntellect-ai/verifiers) framework.\n\n## Overview\n\nEach task drops a model into a sandboxed Python environment with synthetic datasets and asks it to produce analytical outputs — charts, JSON reports, notebooks — using a set of bash and file tools. A verifier checks the outputs for correctness and writes a binary reward (1 = pass, 0 = fail).\n\nThe agent has four tools available inside the sandbox:\n\n| Tool | Description |\n|------|-------------|\n| `bash` | Run shell commands, Python scripts, install packages |\n| `read_file` | Read any file by path |\n| `write_file` | Create or overwrite a file |\n| `str_replace` | Make a targeted single-occurrence edit to a file |\n\n## Tasks\n\n### 1. `analyze-company-data` — Investigate Product Engagement Regression\n\n**Difficulty:** Medium  \n**Category:** Product Analytics\n\nThe agent receives two messy CSVs (`users.csv`, `events.csv`) spanning 28 days of product events and must investigate a user engagement regression. It needs to clean the data, identify the affected platform segment, pinpoint when the regression started, and produce:\n\n- `submission/results.json` — structured findings (segment, funnel drop, engagement changes by platform and country)\n- 5 visualizations (engagement trend, funnel comparison, geographic breakdown, mobile event volume, platform gap)\n- `submission/summary.md` — executive summary with key findings and funnel breakdown table\n\n**Verifier:** Checks JSON schema and value correctness, plot file existence.\n\n---\n\n### 2. `customers-spending-behavior` — Customer Value Analysis Notebook\n\n**Difficulty:** Medium  \n**Category:** Business Intelligence / Data Storytelling\n\nThe agent receives a customer spending CSV (~3,900 rows) and must produce an `analysis.ipynb` notebook with three specific visualizations using matplotlib/seaborn:\n\n1. **True Value Scatter** — Age vs. Annual Value (frequency-adjusted purchase amount), colored by subscription status, filtered to customers with >10 purchases\n2. **Category Performance Matrix** — Bubble chart of avg rating vs. total revenue per category, with a dashed average-revenue reference line\n3. **Drivers of Value Heatmap** — Annotated correlation matrix of Age, Previous Purchases, Review Rating, and Annual Value\n\n**Verifier:** Executes the notebook, inspects plot data against ground-truth calculations.\n\n---\n\n### 3. `pandas-memory-optimization-colab` — Memory-Efficient Data Processing\n\n**Difficulty:** Hard  \n**Category:** Performance Engineering / Data Engineering\n\nThe agent receives a Jupyter notebook (`process_data.ipynb`) that loads an entire CSV into memory and crashes with `MemoryError` on large files. The notebook's main function call is commented out. The agent must:\n\n- Rewrite the notebook to use chunked reading (`pd.read_csv(chunksize=...)`) or streaming\n- Uncomment/add the function call so the notebook executes end-to-end\n- Keep peak memory under 400 MB while processing a ~250 MB CSV\n- Produce the same output files: `daily_summary.csv`, `monthly_summary.csv`, `category_summary.csv`, `pivot_daily.csv`, `pivot_monthly.csv`, `data_YYYY.csv`\n\n**Verifier:** Generates a fresh 500K-row test CSV, executes the notebook via `jupyter nbconvert`, monitors peak memory usage, and validates all output files and schemas.\n\n---\n\n## Eval Results (haiku-4.5)\n\n| Task | Avg Reward | Avg Turns | Notes |\n|------|-----------|-----------|-------|\n| `analyze-company-data` | 1.0 | 23 | Passes consistently |\n| `customers-spending-behavior` | 1.0 | 16 | Passes consistently |\n| `pandas-memory-optimization-colab` | — | — | Requires chunking implementation |\n\n## Quick Start\n\n```bash\nprime env install fenil/data-tasks\n```\n\nRun a single task:\n```bash\nprime eval run data-tasks -m haiku -r 1 --env-args '{\"tasks\": [\"analyze-company-data\"]}'\nprime eval run data-tasks -m haiku -r 1 --env-args '{\"tasks\": [\"customers-spending-behavior\"]}'\nprime eval run data-tasks -m haiku -r 1 --env-args '{\"tasks\": [\"pandas-memory-optimization-colab\"]}'\n```\n\nRun all tasks:\n```bash\nprime eval run data-tasks -m haiku\n```\n\n## Environment Design\n\n### How It Works\n\n1. A `python:3.11-slim` Prime Sandbox starts\n2. Base deps are installed (`openai`, `pandas`, `numpy`, `matplotlib`)\n3. Task-specific `prepare_data.py` runs to generate synthetic datasets and install any extra deps\n4. The agent script (`agent.py`) is uploaded and executed\n5. The agent reads `/task/instruction.md` and uses its tools to produce outputs\n6. `tests/test.sh` runs the verifier, writing `1` or `0` to `/logs/verifier/reward.txt`\n\n### Task Structure\n\n```\ntasks/<task-name>/\n├── task.toml              # Metadata (difficulty, timeouts, docker image)\n├── instruction.md         # Problem statement the agent sees\n├── environment/\n│   └── prepare_data.py    # Generates synthetic data + installs extra deps\n├── solution/\n│   └── solve.sh           # Reference solution (reward = 1)\n└── tests/\n    ├── test.sh            # Verifier runner (writes reward.txt)\n    └── test_*.py          # Pytest test logic\n```\n\n### Per-Task Docker Images\n\nEach task can specify its own Docker image in `task.toml`:\n\n```toml\n[environment]\ndocker_image = \"python:3.11-slim\"\n```\n\nThis is resolved at runtime — no Docker-in-Docker required.\n\n## Dependencies\n\n```toml\n[project]\ndependencies = [\n    \"verifiers>=0.1.10.dev4\",\n    \"prime-sandboxes>=0.2.10\",\n    \"tomli>=2.3.0\",\n]\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5665},"status":null}