{"data":{"kind":"file","path":"README.md","version_id":"omsso8lfy4g2013sbm5fu2x0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8204,"modified_at":"2025-12-23T23:58:28.641000","content_hash":"1e2b9605cfffa7e0f902160419ca3841b804fe60a766778f07b65c2ea610cd87"},"entries":[],"content":"# Replay Labs Benchmark\n\n**The definitive benchmark for AI trading capabilities** — like ImageNet for vision or GLUE for NLP, but for financial decision-making.\n\n[![Prime Intellect](https://img.shields.io/badge/Prime%20Intellect-Environment-blue)](https://primeintellect.ai)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n## Overview\n\nReplay Labs Benchmark evaluates LLMs on real-world trading skills:\n\n| Category | Tasks | Description |\n|----------|-------|-------------|\n| **Pattern Recognition** | dump_detection, pump_detection, volatility_spike | Predict market events before they happen |\n| **Timing** | entry_timing, exit_timing, direction_15m | Optimal trade execution |\n| **Technical Analysis** | rsi_reversal, macd_crossover, supertrend_flip | Indicator interpretation |\n| **Behavioral** | fomo_meter, paper_hands, hindsight_bias | Bias and psychology testing |\n| **Multi-Asset** | contagion_detector, correlation_hunter | Cross-asset analysis |\n\n## Quick Start\n\n### Install\n\n```bash\n# Install from Prime Intellect\nprime env install chef/replay-labs-benchmark\n\n# Or install locally\npip install -e .\n```\n\n### Run Evaluation\n\n```bash\n# Evaluate on dump detection\nuv run vf-eval replay-labs-benchmark --task dump_detection --num_samples 100\n\n# Evaluate all core tasks\nuv run vf-eval replay-labs-benchmark --task all\n```\n\n### Python API\n\n```python\nfrom replay_labs_benchmark import load_environment\n\n# Create environment\nenv = load_environment(\n    task=\"dump_detection\",\n    num_samples=100,\n    seed=42,  # For reproducibility\n)\n\n# Iterate and evaluate\nfor item in env:\n    # Get model prediction\n    response = your_model.generate(item.prompt)\n    \n    # Score the prediction\n    result = item.verifier.verify(response, item.metadata)\n    print(f\"Correct: {result.correct}, Score: {result.score}\")\n\n# Get aggregate metrics\nmetrics = env.get_metrics()\nprint(f\"Accuracy: {metrics.accuracy:.1%}\")\nprint(f\"Calibration Error: {metrics.calibration_error:.3f}\")\n```\n\n## Available Tasks\n\n### Core Prediction Tasks\n\n#### `dump_detection`\nPredict if price will drop 3%+ in the next 15 minutes.\n\n```json\n{\"prediction\": \"dump\" or \"no_dump\", \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n#### `pump_detection`\nPredict if price will rise 3%+ in the next 15 minutes.\n\n```json\n{\"prediction\": \"pump\" or \"no_pump\", \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n#### `direction_15m`\nPredict if price will be higher or lower in 15 minutes.\n\n```json\n{\"prediction\": \"up\" or \"down\", \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n#### `volatility_spike`\nPredict if volatility will expand 2x+ in the next 15 minutes.\n\n```json\n{\"prediction\": \"spike\" or \"no_spike\", \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n#### `regime_change`\nPredict if market will change from trending to ranging (or vice versa).\n\n```json\n{\"prediction\": \"change\" or \"no_change\", \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n### Timing Tasks\n\n#### `entry_timing`\nFind the optimal entry point after a market event.\n\n```json\n{\"prediction\": \"enter\" or \"wait\", \"entry_price\": 95000, \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n#### `exit_timing`\nFind the optimal exit point from a winning position.\n\n```json\n{\"prediction\": \"exit\" or \"hold\", \"exit_price\": 96500, \"confidence\": 0-100, \"reasoning\": \"...\"}\n```\n\n### Behavioral Tasks\n\n#### `fomo_meter`\nMeasures chase behavior during pump sequences. Tests if the model can resist FOMO.\n\n#### `paper_hands`\nMeasures loss tolerance. Tests if the model sells too early during drawdowns.\n\n#### `hindsight_bias`\nMeasures \"I knew it all along\" bias. Tests if the model overrates predictability after seeing outcomes.\n\n### Technical Analysis Tasks\n\n#### `rsi_reversal`\nPredict if RSI extremes lead to reversal or continuation.\n\n#### `indicator_showdown`\nGiven conflicting indicator signals, pick which one to trust.\n\n### Multi-Asset Tasks\n\n#### `contagion_detector`\nPredict crash propagation timing across assets (BTC → ETH → SOL).\n\n## Metrics\n\n### Primary Metrics\n\n| Metric | Description |\n|--------|-------------|\n| **Accuracy** | Percentage of correct predictions |\n| **F1 Score** | Harmonic mean of precision and recall |\n| **Calibration Error** | How well confidence matches actual accuracy |\n\n### Calibration\n\nWe measure Expected Calibration Error (ECE):\n- 80% confidence predictions should be correct ~80% of the time\n- Lower ECE = better calibrated model\n\n### Skill Radar\n\nMulti-task evaluation produces a skill radar showing strengths/weaknesses:\n\n```\nPattern Recognition: ████████░░ 80/100\nTiming:              ███████░░░ 72/100\nIndicators:          ██████░░░░ 65/100\nBehavioral:          ████████░░ 82/100\nMulti-Asset:         █████░░░░░ 55/100\n```\n\n## Data\n\n### Symbols\n- COINBASE_SPOT_BTC_USD (Bitcoin)\n- COINBASE_SPOT_ETH_USD (Ethereum)\n- COINBASE_SPOT_SOL_USD (Solana)\n- AERODROME_AERO_USD (Aerodrome)\n\n### Input Data\nEach test case includes:\n- **60 OHLCV candles** (1-minute timeframe)\n- **Technical indicators**: RSI, MACD, ATR, ADX, SuperTrend, Stoch RSI, CMF, Bollinger Width\n- **Ground truth**: Pre-computed outcomes for scoring\n\n### Ground Truth Sources\n\nGround truth is derived from:\n1. **Actual future price movement** (for synthetic data)\n2. **Replay Labs annotations** (for production data):\n   - `dump_event`: 3%+ drops detected algorithmically\n   - `pump_event`: 3%+ rises detected algorithmically\n   - `optimal_entry`: Pre-computed best entry points\n   - `optimal_exit`: Pre-computed best exit points\n   - `regime_transition`: Market character changes\n\n## Leaderboard\n\nModels are ranked by overall score across all tasks:\n\n| Rank | Model | Accuracy | F1 | Calibration | Overall |\n|------|-------|----------|-----|-------------|---------|\n| 1 | Claude Sonnet 4 | 74.2% | 0.718 | 0.05 | 75.3 |\n| 2 | GPT-4.1 | 73.5% | 0.712 | 0.07 | 73.5 |\n| 3 | Gemini 2.5 Flash | 72.1% | 0.698 | 0.09 | 72.0 |\n\n## Configuration\n\n### Environment Variables\n\n```bash\n# Use Replay Labs API for real data (optional)\nREPLAY_LAB_API_URL=https://replay-lab.vercel.app\nREPLAY_LAB_API_KEY=your-api-key\n\n# Cache settings\nREPLAY_LABS_CACHE_DIR=~/.cache/replay-labs\n```\n\n### Advanced Options\n\n```python\nenv = load_environment(\n    task=\"dump_detection\",\n    num_samples=100,\n    seed=42,                           # Reproducibility\n    difficulty=\"hard\",                 # Filter by difficulty\n    symbol=\"COINBASE_SPOT_BTC_USD\",   # Filter by symbol\n)\n```\n\n## Multi-Task Evaluation\n\n```python\nfrom replay_labs_benchmark import MultiTaskBenchmark\n\n# Evaluate across all core tasks\nbenchmark = MultiTaskBenchmark(\n    tasks=[\"dump_detection\", \"pump_detection\", \"direction_15m\"],\n    num_samples_per_task=50,\n    seed=42,\n)\n\n# Run evaluations\nfor task in benchmark.tasks:\n    env = benchmark.get_environment(task)\n    for item in env:\n        response = your_model.generate(item.prompt)\n        result = env.evaluate(response, item.metadata)\n        benchmark.record_result(task, result)\n\n# Get skill radar\nradar = benchmark.get_skill_radar()\nprint(radar)\n# {'pattern_recognition': 82.5, 'timing': 71.2, 'behavioral': 68.0}\n\n# Full report\nprint(benchmark.get_full_report())\n```\n\n## Contributing\n\nWe welcome contributions! Areas of interest:\n\n1. **New tasks**: Propose new evaluation scenarios\n2. **Real data integration**: Connect to live market data APIs\n3. **Model baselines**: Submit baseline results for popular models\n4. **Annotations**: Improve ground truth labeling\n\n## Citation\n\n```bibtex\n@software{replay_labs_benchmark,\n  title = {Replay Labs Benchmark: Financial Decision-Making Evaluation for LLMs},\n  author = {Replay Labs},\n  year = {2024},\n  url = {https://github.com/replay-labs/replay-labs-benchmark}\n}\n```\n\n## License\n\nMIT License - see [LICENSE](LICENSE) for details.\n\n## Links\n\n- [This Environment on Prime Intellect](https://app.primeintellect.ai/dashboard/environments/chef/replay-labs-benchmark)\n- [Prime Intellect Environments Hub](https://app.primeintellect.ai/dashboard/environments)\n- [Prime Intellect Documentation](https://docs.primeintellect.ai)\n- [Verifiers Library](https://github.com/PrimeIntellect-ai/verifiers)\n\n","encoding":"utf-8","truncated":false,"total_bytes":8204},"status":null}