{"data":{"kind":"file","path":"README.md","version_id":"iji6g6nr9fkk2ar5glxknvst","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4106,"modified_at":"2026-01-17T22:49:59.220000","content_hash":"04cf961bd506608343aa132df58370b214859bfe06c0c7e184cde495c239ff57"},"entries":[],"content":"# Spatio-Temporal Reasoning Environment\n\nA procedurally generated environment for training models on multi-agent spatial and temporal reasoning tasks.\n\n## Overview\n\nModels must track multiple agents moving on a grid over time, handling:\n- **8-directional movement** (N/S/E/W + diagonals)\n- **Collision detection** - agents block each other\n- **Mandatory handoffs** - packages auto-transfer when carriers become adjacent to non-carriers\n- **Capacity constraints** - each agent can hold at most 1 item\n- **State queries** - positions, item locations, agent meetings\n\n## Problem Format\n\n```\nRules:\n- Agents attempt moves but may fail if blocked by obstacles or other agents\n- Agents cannot occupy the same cell as another agent\n- When an agent with a package becomes adjacent to another agent without one, the package is automatically handed off\n- Each agent can carry at most 1 package\n- Moves are processed in alphabetical order by agent ID\n\nGrid: 30x30\nObstacles: (3, 2), (10, 13)\nAgents:\n  A: starts at (0, 0), carrying [package_1]\n  B: starts at (15, 15)\n\nTimeline:\n  t=0: Initial state\n  t=1: A attempts to move East; B attempts to move West\n  t=2: A attempts to move Northeast; B attempts to move Southwest\n\nQuestion: Who has package_1 at t=2?\n```\n\nAnswer: `\\boxed{Agent A}` or `\\boxed{Agent B}` depending on whether they became adjacent\n\n## Difficulty Scaling\n\n| Parameter | Easy (0.0) | Hard (1.0) |\n|-----------|------------|------------|\n| Grid size | 10x10 | 50x50 |\n| Agents | 2 | 6 |\n| Obstacles | 0 | 20 |\n| Time horizon | t=4 | t=8 |\n| Packages | 1 | 3 |\n| Variable speed | No | Yes |\n\n## Question Types\n\n- **Position**: \"Where is Agent A at t=3?\"\n- **Carrier**: \"Who has package_1 at t=4?\"\n- **Item location**: \"Where is package_2 at t=2?\"\n- **Meeting**: \"Did A and B ever occupy the same cell?\"\n- **Meeting time**: \"When did A and B first meet?\"\n- **Distance**: \"Manhattan distance between A and B at t=3?\"\n\n## Key Mechanics\n\n1. **Attempted moves**: Timeline shows \"A attempts to move East\" - the model must determine if the move succeeds based on obstacles, boundaries, and other agents.\n\n2. **Collision blocking**: Agents cannot occupy the same cell. If A tries to move into B's cell, A stays in place.\n\n3. **Mandatory handoff**: When an agent carrying a package becomes adjacent to an agent without one, the package automatically transfers to the adjacent agent.\n\n4. **Alphabetical ordering**: Moves are processed in agent ID order (A before B before C).\n\n## Scoring\n\nUses **LLM-as-judge** for robust semantic equivalence checking. This handles equivalent phrasings like:\n- \"Agent A\" vs \"A\"\n- \"(5, 0)\" vs \"(5,0)\"\n- \"No one (on ground at (5, 0))\" vs \"package_1 is on the ground at (5, 0)\"\n- \"Never\" vs \"They never met\"\n\nBy default, uses `anthropic/claude-4.5-haiku` via Prime Inference for judging (requires `PRIME_API_KEY`).\n\n## Usage\n\n### Local Evaluation\n```bash\nprime env install spatio_temporal\nprime eval run spatio_temporal -n 20 -m gpt-4.1-mini\n```\n\n### Training Config\n```toml\n[[env]]\nid = \"shyampathak/st-bench\"\nargs = { min_difficulty = 0.2, max_difficulty = 0.8, num_examples = 10000 }\n```\n\n### With Difficulty Filtering\n```toml\n[buffer]\nonline_difficulty_filtering = true\neasy_threshold = 0.8\nhard_threshold = 0.2\neasy_fraction = 0.1\nhard_fraction = 0.1\n```\n\n## Parameters\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `num_examples` | int | 10000 | Number of problems to generate |\n| `min_difficulty` | float | 0.0 | Minimum difficulty (0-1) |\n| `max_difficulty` | float | 1.0 | Maximum difficulty (0-1) |\n| `seed` | int | 42 | Random seed for reproducibility |\n| `judge_model` | str | `anthropic/claude-4.5-haiku` | Model for answer verification |\n| `use_prime_inference` | bool | True | Use Prime Inference for judge (else OpenAI) |\n\n## Architecture\n\n```\nspatio_temporal/\n├── st_bench.py          # Environment wrapper (verifiers integration)\n├── generator.py         # Procedural problem generation\n├── simulator.py         # Deterministic simulation engine\n├── pyproject.toml\n└── README.md\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4106},"status":null}