{"data":{"kind":"file","path":"README.md","version_id":"y4r5o8zyspp0bpo19hrbc0ud","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":14340,"modified_at":"2026-05-20T13:21:49.977000","content_hash":"a58aded203c4f32d3e0bebc6d68d52e8eacf5601d14f7ee83ce1beb754e40dd1"},"entries":[],"content":"# autonomous-skill-evolution\n\n**Self-Improving Agent Environment** for Prime Intellect Environments Hub.\n\nAn RL training environment where agents learn to **create, validate, refine, compose, and evolve reusable skills** from execution traces -- modeling a complete self-improvement loop with persistent memory, sandbox code execution, and cross-episode skill transfer.\n\n---\n\n## Concept\n\nReal-world agents don't just solve tasks -- they **learn from them**. This environment trains models to:\n\n1. **Observe** -- Record execution traces from completed tasks\n2. **Abstract** -- Extract reusable skills (code + documentation)\n3. **Validate** -- Verify skills work via sandbox code execution\n4. **Refine** -- Improve skills based on feedback and new requirements\n5. **Compose** -- Combine multiple skills into pipelines\n6. **Transfer** -- Apply patterns across domains\n7. 
**Evolve** -- Build meta-skills that improve other skills autonomously\n\nThe agent operates on a **SkillRegistry** -- an in-memory store of skills, traces, preferences, and pipelines that persists across tool calls within a rollout.\n\n---\n\n## Architecture\n\n```\n+-------------------------------------------------------------+\n|                    Agent (LLM)                              |\n|  Receives task instruction -> decides which tools to call    |\n+---------+---------------------------------------------------+\n          | tool calls\n          v\n+-------------------------------------------------------------+\n|                   Tool Layer (8 tools)                      |\n|                                                             |\n|  record_trace --> create_skill --> execute_code_sandbox     |\n|       |               |                    |                |\n|       |               v                    v                |\n|       |         execute_skill <-- sandbox_verified          |\n|       |               |                                    |\n|       v               v                                    |\n|  refine_skill <-- feedback              compose_skills      |\n|       |                                   (pipeline)        |\n|       v                                                     |\n|  recall_skills      update_preferences                     |\n|  (cross-episode)    (learn patterns)                        |\n+---------+---------------------------------------------------+\n          | state updates\n          v\n+-------------------------------------------------------------+\n|               SkillRegistry (per-rollout state)             |\n|                                                             |\n|  skills: {id -> Skill}     traces: [{task_id, steps, ...}]  |\n|  preferences: {k -> v}     pipelines: [{name, skills, ...}]  |\n|  sandbox_results: [{success, output, validation, ...}]      
|\n+-------------------------------------------------------------+\n          | reward signals\n          v\n+-------------------------------------------------------------+\n|                 Rubric (8 reward functions)                  |\n|                                                             |\n|  trace_recorded      skills_created     skill_quality       |\n|  skills_used         skills_composed    sandbox_executed    |\n|  preferences_learned skills_refined                         |\n+-------------------------------------------------------------+\n```\n\n### Data Flow per Rollout\n\n1. `setup_state()` creates fresh `SkillRegistry` (optionally preloaded with skills)\n2. Agent receives system prompt + task instruction\n3. Agent calls tools in sequence (up to 12 turns)\n4. Each tool mutates the registry state\n5. Rubric functions read final registry state to compute rewards\n6. Total reward = sum of 8 reward scores (max 8.0)\n\n---\n\n## 5-Tier Evolution System\n\n### T0: Basic Skill Creation (17 tasks)\n**Goal:** Create a single skill from a clean, structured task description.\n\nAgent learns the fundamental loop: trace -> create -> execute -> validate.\n\nExample task:\n> \"You sent marketing emails to 500 subscribers. Steps: load subscriber list, personalize greeting, insert product recommendations, send via SMTP. Create a reusable skill for this.\"\n\n### T1: Noisy Input + Refinement (6 tasks)\n**Goal:** Handle ambiguous/unstructured input and refine existing skills.\n\nAgent learns to parse noisy instructions and iteratively improve skills.\n\nExample task:\n> \"so basically i sent some emails to customers, load the list, make it personal, add some product stuff, and blast it out. can u make a skill for this? 
also need A/B testing\"\n\n### T2: Multi-Skill Composition (6 tasks)\n**Goal:** Create multiple related skills and compose them into pipelines.\n\nAgent learns skill composition via `compose_skills()` tool.\n\nExample task:\n> \"Create multiple skills for employee onboarding: 1) create_accounts, 2) send_welcome, 3) schedule_intro, 4) assign_tasks. Then compose them into an onboarding pipeline.\"\n\n### T3: Cross-Domain Transfer (5 tasks)\n**Goal:** Transfer patterns across domains, resolve conflicting requirements.\n\nAgent learns abstract pattern recognition and adaptive skill design.\n\nExample task:\n> \"Your invoice processing skill works well (extract -> validate -> transform -> approve). Apply this same pattern to create a code review skill.\"\n\n### T4: Meta-Skills + Full Autonomy (5 tasks)\n**Goal:** Build skills that improve other skills. Full autonomous loop.\n\nAgent learns self-improvement: review, evolve, deprecate, and create meta-skills.\n\nExample task:\n> \"Review all your skills. Find the one with lowest success rate. Analyze why it underperforms. Rewrite it. 
Then create a META-SKILL called 'skill_reviewer'.\"\n\n---\n\n## Tools (8)\n\n| Tool | Description | Key Capability |\n|------|-------------|----------------|\n| `record_trace` | Record task execution trace | Foundation for all skills |\n| `create_skill` | Create reusable skill with code | Core skill creation |\n| `execute_skill` | Test a skill, update success rate | Verification |\n| `refine_skill` | Improve skill based on feedback | Iterative improvement |\n| `recall_skills` | Search skills by keyword/domain | Cross-episode recall |\n| `update_preferences` | Learn user preferences | Pattern learning |\n| `compose_skills` | Combine skills into pipeline | Multi-skill composition |\n| `execute_code_sandbox` | Validate Python code (AST) | Sandbox execution |\n\n### Sandbox Execution\n\n`execute_code_sandbox` validates Python code using AST parsing (no subprocess -- safe for eval sandbox):\n\n- **Syntax validation** -- catches errors before execution\n- **Structure analysis** -- extracts functions, imports, complexity\n- **Pattern detection** -- comprehensions, classes, error handling, async functions\n- **Simulated output** -- detects print statements, return values, and code patterns\n- **Complexity scoring** -- simple/moderate/complex based on AST node count\n\n---\n\n## Reward Functions (8)\n\nAll reward functions use **graduated thresholds** -- partial credit for partial completion:\n\n| Reward | What It Measures | 1 action | 2 actions | 3+ actions |\n|--------|------------------|----------|-----------|------------|\n| `trace_recorded` | Execution trace recorded | 0.5 | 0.8 | 1.0 |\n| `skills_created` | Skills created | 0.5 | 0.8 | 1.0 |\n| `skill_quality` | Quality (desc, code, usage, sandbox) | Up to 1.0 (per-check 0.25) |\n| `skills_used` | Skills executed/verified | 0.5 | 0.8 | 1.0 |\n| `skills_composed` | Skills composed into pipelines | 0.6 | 1.0 | -- |\n| `sandbox_executed` | Code validated in sandbox | 0.6 | 0.8 | 1.0 |\n| `preferences_learned` | User 
preferences captured | 0.5 | 0.8 | 1.0 |\n| `skills_refined` | Skills iteratively improved | 0.5 | 0.8 | 1.0 |\n\n**Design philosophy:** A single correct tool call earns partial credit (0.5 or 0.6 in the table above) on the corresponding count-based reward function. Basic actions are rewarded first, and the total grows as the agent adds advanced behavior (composition, sandbox validation, refinement).\n\n**Max possible:** 8.0 per rollout. A well-behaved agent that creates 1 skill, validates it, executes it, and records a trace should score about 3.0.\n\n---\n\n## Difficulty Levels\n\nUse the `level` parameter to control task difficulty:\n\n```bash\n# All tasks (default)\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini\n\n# Only T0 + T1 tasks (basic + noisy)\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 1}'\n\n# T0 + T1 + T2 tasks (add composition)\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 2}'\n\n# T0-T3 tasks (add cross-domain)\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 3}'\n\n# All tasks including meta-skills (same as default)\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 4}'\n```\n\n| Level | Tiers Included | Tasks | Focus |\n|-------|---------------|-------|-------|\n| 0 (default) | T0-T4 | 40 train + 6 eval | All tasks |\n| 1 | T0, T1 | 23 train + 3 eval | Basic creation + noisy input |\n| 2 | T0-T2 | 29 train + 4 eval | Add multi-skill composition |\n| 3 | T0-T3 | 34 train + 5 eval | Add cross-domain transfer |\n| 4 | T0-T4 | 40 train + 6 eval | Full autonomous loop |\n\n---\n\n## Gold Solution Example\n\n**Task:** \"You sent marketing emails to 500 subscribers. Steps: load subscriber list, personalize greeting, insert product recommendations, send via SMTP. 
Create a reusable skill.\"\n\n### Step 1: Record Trace\n```json\n{\n  \"tool\": \"record_trace\",\n  \"args\": {\n    \"task_id\": \"email_marketing\",\n    \"steps\": \"load_subscribers,personalize,add_recommendations,send_via_smtp\",\n    \"outcome\": \"success\"\n  }\n}\n```\n\n### Step 2: Create Skill\n```json\n{\n  \"tool\": \"create_skill\",\n  \"args\": {\n    \"name\": \"send_marketing_email\",\n    \"description\": \"Load subscriber list from CSV, personalize greeting using customer name, insert product recommendations based on purchase history, send via SMTP with tracking pixels\",\n    \"code_snippet\": \"def send_marketing_email(subscribers_csv, template, smtp_config):\\n    subs = load_csv(subscribers_csv)\\n    for sub in subs:\\n        body = template.replace('{name}', sub['name'])\\n        body = add_recommendations(body, sub['history'])\\n        send_smtp(body, sub['email'], smtp_config)\",\n    \"domain\": \"email\"\n  }\n}\n```\n\n### Step 3: Validate in Sandbox\n```json\n{\n  \"tool\": \"execute_code_sandbox\",\n  \"args\": {\n    \"code\": \"def send_marketing_email(subscribers_csv, template, smtp_config):\\n    subs = load_csv(subscribers_csv)\\n    for sub in subs:\\n        body = template.replace('{name}', sub['name'])\\n        send_smtp(body, sub['email'], smtp_config)\"\n  }\n}\n```\nExpected response: `{\"success\": true, \"syntax_valid\": true, \"functions_defined\": [\"send_marketing_email\"], \"complexity\": \"simple\"}`\n\n### Step 4: Execute Skill\n```json\n{\n  \"tool\": \"execute_skill\",\n  \"args\": {\n    \"skill_id\": \"<id from step 2>\"\n  }\n}\n```\n\n### Step 5: Learn Preference\n```json\n{\n  \"tool\": \"update_preferences\",\n  \"args\": {\n    \"key\": \"email_tone\",\n    \"value\": \"professional\"\n  }\n}\n```\n\n**Expected reward for this gold solution:**\n- trace_recorded: 0.5 (1 trace)\n- skills_created: 0.5 (1 skill)\n- skill_quality: 0.75 (desc + code + usage)\n- skills_used: 0.5 (1 executed)\n- 
skills_composed: 0.0 (no pipeline)\n- sandbox_executed: 0.6 (1 valid run)\n- preferences_learned: 0.5 (1 pref)\n- skills_refined: 0.0 (no refinement)\n- **Total: 3.35 / 8.0**\n\n---\n\n## Evaluation\n\n### Running Evals\n\n```bash\n# Install from Hub\nprime env install tonyteo/autonomous-skill-evolution\n\n# Run eval with a specific model\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini\n\n# Run with difficulty filter\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 2}'\n```\n\n### Eval Tasks (6 held-out tasks, all tiers covered)\n\n| Task | Tier | Format | Tests |\n|------|------|--------|-------|\n| `eval_simple` | T0 | Step-based | Single skill creation + sandbox validation |\n| `eval_scenario` | T0 | Scenario-based | Debug Slack bot, figure out steps yourself |\n| `eval_refine` | T1 | Noisy + preloaded | Refine email skill from messy feedback |\n| `eval_multi` | T2 | Multi-skill | 5 skills + pipeline composition |\n| `eval_adapt` | T3 | Preloaded | Cross-language skill adaptation |\n| `eval_meta` | T4 | Preloaded (3 skills) | Meta-skill: skill auditor + improvement |\n\n### Interpreting Results\n\n| Score Range | Agent Behavior |\n|-------------|----------------|\n| 0.0 - 1.0 | Minimal: recorded trace but no skills |\n| 1.0 - 3.0 | Basic: created 1 skill, maybe executed it |\n| 3.0 - 5.0 | Competent: multiple skills, sandbox validation, preferences |\n| 5.0 - 7.0 | Advanced: composition, refinement, cross-domain transfer |\n| 7.0 - 8.0 | Expert: full autonomous loop with meta-skills |\n\n---\n\n## Quick Start\n\n```bash\n# Install\nprime env install tonyteo/autonomous-skill-evolution\n\n# Evaluate\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini\n\n# Evaluate with difficulty filter\nprime eval run tonyteo/autonomous-skill-evolution -m gpt-5-mini --env-args '{\"level\": 2}'\n\n# View results\nprime eval view\n```\n\n## Version History\n\n| Version | Changes |\n|---------|---------|\n| v1.0.0 
| Fixed reward fairness: skills_created/skills_used/skills_refined now exclude preloaded and composed skills. skill_quality uses the average instead of the max and skips composed pipeline skills. Added preloaded_ids tracking to SkillRegistry. Set num_examples=6 to match all eval tasks. |\n| v0.9.0 | Added 5 diverse training tasks (scenario, goal, conversational formats). Expanded eval to 5 tasks covering all tiers (T0-T4). Level filter now applies to eval tasks too. |\n| v0.8.0 | Fixed critical concurrency bug: rewrote to use StatefulToolEnv with update_tool_args for per-rollout registry injection (asyncio.gather was corrupting module-level global). All rollouts now score correctly. Graduated reward thresholds. Added level parameter for difficulty filtering. Improved sandbox simulation. |\n| v0.7.0 | Attempted state management fix (reward functions + level param + graduated rewards) -- partially fixed, but the concurrency bug remained |\n| v0.6.0 | 35 tasks across 5 tiers, 8 tools, 8 rewards, compose_skills, execute_code_sandbox |\n| v0.5.0 | Added sandbox execution, compose_skills, improved rewards, detailed README with gold solutions |\n| v0.4.0 | Simplified to 6 tools, explicit system prompt with examples, generous reward thresholds |\n| v0.3.0 | Initial Hub release with 25 tasks and 5-tier system |\n\n## License\n\nMIT\n","encoding":"utf-8","truncated":false,"total_bytes":14340},"status":null}