{"data":{"kind":"file","path":"README.md","version_id":"bhao6mxqjy1uquiku0slv0f5","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4091,"modified_at":"2026-01-25T02:39:29.661000","content_hash":"fedb5de936853cc1939f0272c952fdcff884da167d34b3940aff9c8cc98a2d36"},"entries":[],"content":"# FluxEM Tools Environment v2\n\nJudgment-based tool calling curriculum for RL training.\n\n## Key Differences from v1\n\n1. **No explicit thresholds** in system prompt - model must learn judgment\n2. **Five problem categories** requiring different types of reasoning\n3. **Multi-component reward** (correctness + tool appropriateness + reasoning quality)\n4. **Curriculum stages**: foundation → judgment → orchestration → mastery\n5. **20% adversarial problems** - traps where tools are wrong\n\n## Problem Categories\n\n### A: Precision Ambiguity\nModel must infer precision requirements from context.\n- A1: Consequential (medical dosing, engineering safety)\n- A2: Estimation-sufficient (road trip budgets, rough planning)\n- A3: Precision pivots (sanity checks to decide if detail needed)\n\n### B: Hidden Complexity\nLooks simple but requires tools.\n- B1: Exponential blowup (lily pad doubling - reasoning, not computation)\n- B2: Iterative requirements (loan payoff with compound interest)\n- B3: Constraint satisfaction (linear programming as word problems)\n\n### C: Multi-Step Reasoning + Selective Computation\nInterleave reasoning and tool use.\n- C1: Reasoning gates computation (figure out WHAT to compute)\n- C2: Computation gates reasoning (compute to unlock reasoning)\n- C3: Tool chaining (output of one feeds another)\n\n### D: Domain Recognition\n- D1: Cross-domain (user growth → compound interest formula)\n- D2: Ambiguous domain (multiple tools could apply)\n- D3: No-tool traps (LOOK computational but are pure reasoning)\n\n### E: Judgment Under Uncertainty\n- E1: Missing information (must ask or state assumptions)\n- E2: Implicit constraints (project scheduling with parallelism)\n- E3: Sensitivity analysis (small changes, big impacts)\n\n## Reward Function\n\n```\nreward = 0.4 * correctness + 0.35 * tool_appropriateness + 0.25 * reasoning_quality\n```\n\n### Tool Appropriateness Scoring\n\n| Expected Tool Use | Used Tools | Score |\n|------------------|-----------|-------|\n| Required | Yes | 1.0 |\n| Required | No | 0.0 |\n| Beneficial | Yes | 1.0 |\n| Beneficial | No | 0.5 |\n| Either Valid | Either | 1.0 |\n| Discouraged | No | 1.0 |\n| Discouraged | Yes | 0.3 |\n| Pure Reasoning | No | 1.0 |\n| Pure Reasoning | Yes | 0.0 |\n\n## Curriculum Stages\n\n1. **Foundation** - Basic tool use without explicit rules\n2. **Judgment** - Context-dependent precision decisions\n3. **Orchestration** - Multi-tool planning and chaining\n4. **Mastery** - All problem types including adversarial\n\n## Usage\n\n### Local Testing\n\n```python\nfrom fluxem_tools_env_v2 import build_dataset, create_tool_registry\n\n# Build dataset\ndataset = build_dataset(n_problems=100, stage=\"judgment\")\n\n# Test tools\ntools = create_tool_registry()\nresult = tools[\"arithmetic.multiply\"](a=70, b=2.5)\n```\n\n### With Verifiers\n\n```python\nfrom fluxem_tools_env_v2 import load_environment\n\nenv = load_environment(\n    n_problems=500,\n    stage=\"mastery\",\n    max_turns=6\n)\n\n# Use with Verifiers training\n```\n\n### With Prime Intellect\n\n```bash\n# Install\nprime env install fluxem-tools-env-v2 --path ./environments/fluxem_tools_env_v2\n\n# Test\nprime eval run fluxem-tools-env-v2 -m openai/gpt-4o-mini -n 10\n\n# Push to Hub\nprime env push --path environments/fluxem_tools_env_v2\n\n# Train\nprime rl run configs/lab/fluxem_tools_v2.toml\n```\n\n## System Prompt Philosophy\n\nThe system prompt explicitly avoids thresholds:\n\n> These tools provide exact, verifiable results. Your role is to decide when computation adds value versus when reasoning or estimation is more appropriate.\n>\n> Consider using tools when:\n> - Precision matters for the task at hand\n> - Mental calculation would likely introduce errors\n> - Multiple steps that compound errors\n>\n> Consider NOT using tools when:\n> - Rough estimation is sufficient\n> - Input values are themselves uncertain\n> - The answer is obvious from reasoning alone\n>\n> **Exercise judgment** about which approach best serves the user's needs.\n\n## Target Baseline\n\nBefore training, the base model should achieve ~50% accuracy on this curriculum (not 87% like v1). This ensures there's room for the model to learn judgment.\n","encoding":"utf-8","truncated":false,"total_bytes":4091},"status":null}