{"data":{"kind":"file","path":"README.md","version_id":"au0y1a06rm4d6kxg8iy2q8u3","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5243,"modified_at":"2026-04-04T22:13:08.968000","content_hash":"fcc7839e57a939dc17c77380c359f9b4662d9ff28b2b684e2be22fe6059ebd54"},"entries":[],"content":"# hermes-agent-reasoning-traces\n\nReplay RL environment using the original Hermes reasoning-trace dataset for multi-turn tool-calling training.\n\n## Overview\n\nThis environment trains models to imitate and improve Hermes-style agent behavior on real tool-use trajectories. Given the original system prompt, user request, and available tool schema from a recorded trace, the model must produce the next assistant turn in Hermes format, including `<think>` blocks, `<tool_call>` blocks, and final natural-language answers when appropriate.\n\nUnlike a live tool sandbox, this environment replays recorded tool responses offline. If the model emits a sufficiently correct tool-use step, the environment appends the real recorded tool output and continues the rollout. If the step is malformed or drifts too far from the reference action, the rollout terminates early and the model is scored on the partial trajectory.\n\nThis environment name mirrors the original dataset name and is intended specifically for RL training on replayed Hermes agent trajectories.\n\n## Dataset\n\n- **Source**: [`lambda/hermes-agent-reasoning-traces`](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces) on Hugging Face\n- **Size**: 7.6K multi-turn agent trajectories\n- **Task**: Multi-turn next-step replay for tool-calling agents\n- **Trace Format**: Hermes-style conversations with `<think>`, `<tool_call>`, and `<tool_response>` blocks\n\n## Task Format\n\n**Input**: Original system prompt, user request, and replayed tool responses from a real Hermes trace\n\n**Output**: Model emits the next Hermes assistant turn with:\n- reasoning in `<think>...</think>` when expected\n- one or more `<tool_call>` blocks for tool steps\n- a plain final answer for terminal steps\n\n## Example\n\n```text\nSystem:\nYou are a function calling AI model. You are provided with function signatures...\n\nUser:\nLook through past sessions for the deployment config we created for the staging environment.\n\nAssistant:\n<think>\nThe user wants a past deployment config. I should search session history first.\n</think>\n<tool_call>\n{\"name\": \"session_search\", \"arguments\": {\"query\": \"staging deployment config\"}}\n</tool_call>\n\nEnvironment replay:\n<tool_response>\n{\"tool_call_id\": \"functions.session_search:0\", \"name\": \"session_search\", \"content\": {\"results\": []}}\n</tool_response>\n\nAssistant:\n<think>\nThe session search came back empty. I should search the current repository for staging-related config files.\n</think>\n<tool_call>\n{\"name\": \"search_files\", \"arguments\": {\"pattern\": \"staging\", \"target\": \"content\"}}\n</tool_call>\n```\n\n## Reward Structure\n\n- **50% Accuracy Reward**: Correct tool selection, tool arguments, and final-answer matching\n- **10% Thinking Reward**: Proper `<think>...</think>` usage with structured reasoning\n- **40% Format Reward**: Well-formed Hermes tool-call formatting and replay-compatible outputs\n\nReplay-specific accuracy includes:\n\n- **Tool steps**: Action type match, tool-name match, and tool-argument similarity\n- **Final steps**: Final-action match and normalized response similarity to the reference answer\n\nThinking reward includes:\n\n- **Think Presence**: Whether `<think>` tags appear when expected\n- **Think Quality**: Well-formed tags, correct placement, and bounded thought length\n\nFormat reward includes:\n\n- **Hermes Structure**: Proper `<tool_call>` formatting and parseable tool JSON\n- **Replay Compatibility**: Tool steps aligned with traces that have recorded `<tool_response>` blocks\n- **Terminal Cleanliness**: Final steps that cleanly stop with a valid answer\n\n## Replay Scoring\n\nThe environment uses deterministic replay scoring rather than an LLM judge:\n\n- **Tool steps**: Compared against the recorded assistant tool-call set\n- **Parallel calls**: Matched as a set instead of strict order\n- **Arguments**: Canonicalized JSON similarity with penalties for missing, extra, or incorrect fields\n- **Final answers**: Normalized text similarity to the reference terminal response\n- **Continuation gate**: Tool-use steps must clear an accuracy threshold before replay continues\n\n## Use Cases\n\n- Training generalist tool-calling agents with hosted Prime RL\n- Improving Hermes-format compliance for `<think>` and tool-call outputs\n- Teaching agents to recover through long multi-turn coding and planning workflows\n- Warm-starting agent RL from real trajectories without rebuilding live tool sandboxes\n- Benchmarking multi-turn policy quality on realistic terminal, browser, and repo tasks\n\n## Real-World Impact\n\nReal tool-use traces are much closer to production agent behavior than synthetic single-turn instruction data. This environment helps train models that:\n\n- select tools more reliably in multi-step workflows\n- produce cleaner, more structured Hermes-style actions\n- stay aligned with long-horizon task progress\n- improve replay fidelity before investing in expensive live sandbox training\n- benefit from dense deterministic rewards rather than sparse success-only signals\n\n## Environment Arguments\n\n- `max_examples`: Optional cap on train and eval examples after split selection\n- `seed`: Seed used for deterministic hashed train/eval splitting\n\n## License\n\nThe replay data comes from a publicly available Apache-2.0 dataset on Hugging Face.\n","encoding":"utf-8","truncated":false,"total_bytes":5243},"status":null}