{"data":{"kind":"file","path":"README.md","version_id":"vka45wxtpu3l3jtt5xcmuns4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7441,"modified_at":"2025-10-07T23:13:51.078000","content_hash":"79927077186bc084125b3bc91bdc89a2a4489bae39227b28a7b15226c1f58760"},"entries":[],"content":"This environment evaluates and trains multimodal models on speech prosody and emotion understanding. It measures how well a system can extract pitch, speaking rate, energy, pause structure, and an emotion label from raw audio.\r\n- Fully deterministic: rewards derive from reproducible DSP, not human scoring.\r\n- Fine‑grained numeric + categorical feedback in a single JSON contract.\r\n- Auto manifest regeneration keeps percentile scaling stable across dataset changes.\r\n- Dual baselines: near‑perfect DSP oracle and a perfect oracle (gold ceiling) for calibration.\r\n- Hash matching ensures exact reference retrieval when audio is known.\r\n- Supports URL audio delivery to avoid prompt bloat and enable true multimodal evaluation.\r\n\r\nAuthor: Fahim  \r\nTwitter: [@DevModeFahim](https://x.com/DevModeFahim)\r\n\r\n## Overview\r\nEnvironment ID: `prosodyanalyzer`  \r\nShort description: Single-turn prosody and emotion analysis over speech WAV clips. Model returns JSON with four acoustic metrics and a style label.  
\r\nTags: audio, prosody, emotion, single-turn, deterministic, json\r\n\r\n## Datasets\r\n| Field | Value |\r\n|-------|-------|\r\n| Primary dataset | Internal curated subset of CREMA-D style clips (packaged) |\r\n| Original corpus size | 7,442 emotional clips / 91 actors (full CREMA-D reference) |\r\n| Licensed usage | Research (user responsible for upstream license) |\r\n| Strengths | Emotional variability, consistent naming convention |\r\n| Source link | https://github.com/CheyneyComputerScience/CREMA-D |\r\n| Loaded subset | 82 clips (full environment load); use subset helper for speed |\r\n\r\n## Task\r\n| Aspect | Spec |\r\n|--------|------|\r\n| Type | Single-turn evaluation (one prompt, one answer per clip) |\r\n| Input | System + user message; the user message contains the WAV reference |\r\n| Output | Strict JSON: pitch_mean_hz, speaking_rate_sps, energy_rms, pause_ratio, style |\r\n| Allowed styles | ANGRY, DISGUSTED, FEAR, HAPPY, NEUTRAL, SAD |\r\n| Determinism | DSP metrics deterministic; manifest stabilizes scaling |\r\n| Reward weights | regression 0.70, style 0.25, format 0.05 |\r\n\r\n## Audio Input Modes\r\nThe environment currently supports two delivery modes for audio, with a third planned, so you can choose based on model and token constraints:\r\n\r\n| Mode | How It Works | Pros | Cons | When To Use |\r\n|------|--------------|------|------|-------------|\r\n| Inline base64 (default) | WAV bytes are embedded in the prompt inside a fenced block ```audio/wav-base64 ...``` | Self‑contained; works with plain text-only chat completion endpoints | Increases prompt token count; large clips may hit context limits | Quick local experiments; models without URL/audio fetch capability |\r\n| URL reference | Prompt includes a single line `AUDIO_URL: http://.../file.wav` served by `serve_audio` | Minimal tokens; lets genuinely multimodal models fetch/stream; reproducible caching | Requires running an HTTP server & model must be allowed to fetch | Scaling evaluations; large batches; longer clips |\r\n| 
Native multimodal parts (future) | Direct binary/audio part (e.g. OpenAI / Gemini structured message) | Most efficient; no base64 overhead | Requires provider SDK & vf adapter that forwards structured content | When an evaluation adapter supports rich message arrays |\r\n\r\nImplementation detail: if the environment variable `PROSODY_AUDIO_BASE_URL` is set, prompts switch automatically to URL mode. If unset, inline base64 is used. Both produce identical target rewards because the oracle hash computation is independent of delivery mode.\r\n\r\nWhy not always base64? Large audio inflates prompt size, increasing latency + cost and occasionally truncating prompts for LLMs with smaller context windows. URL mode keeps prompts compact while still deterministic (the dataset content is fixed and served locally or from a controlled host). If your model sandbox forbids outbound HTTP, stay with base64.\r\n\r\nFor reinforcement learning, URL mode usually yields faster iteration because the verifier prompt stays small and constant in length, stabilizing token-based batching heuristics.\r\n\r\n## Quickstart\r\n\r\n### 1. Install\r\n```powershell\r\nuv sync\r\n```\r\n\r\n### 2. Evaluate Your Model (OpenAI-compatible endpoint)\r\nAssumes you already have a chat completions style API at `http://127.0.0.1:8000/v1`.\r\n```powershell\r\nuv run vf-eval prosodyanalyzer -s -m \"your-model\" -b http://127.0.0.1:8000/v1\r\n```\r\nThis streams each audio clip (inline base64 by default) and expects a strict JSON response.\r\n\r\n### 3. Use URL Audio Mode\r\nKeeps prompts tiny by referencing audio instead of embedding it.\r\n```powershell\r\nuv run python -m prosodyanalyzer.serve_audio --port 8002\r\n$env:PROSODY_AUDIO_BASE_URL = \"http://127.0.0.1:8002/audio\"\r\nuv run vf-eval prosodyanalyzer -s -m \"your-model\" -b http://127.0.0.1:8000/v1\r\n```\r\n\r\n### 4. 
(Optional) Run Baselines\r\nDeterministic DSP oracle (near-perfect):\r\n```powershell\r\nuv run python -m prosodyanalyzer.baselines.oracle_openai_shim\r\nuv run vf-eval prosodyanalyzer -s -m \"dsp-oracle\" -b http://127.0.0.1:8001/v1 -n 30\r\n```\r\nPerfect oracle (upper ceiling):\r\n```powershell\r\nuv run python -m prosodyanalyzer.baselines.perfect_oracle_openai_shim\r\nuv run vf-eval prosodyanalyzer -s -m \"perfect-oracle\" -b http://127.0.0.1:8003/v1 -n 30\r\n```\r\n\r\n### 5. (Optional) RL Loop Sketch\r\nYour trainer samples an audio prompt from `env.dataset`, queries the model, parses the JSON response, computes the reward as `row['reward'](parsed_json)`, and performs an update. Because the environment is deterministic, the reward for a given prompt and response never changes, so you can safely cache (prompt, response) -> reward entries for off-policy replay.\r\n\r\n### 6. Regenerate Manifest if Adding More WAVs\r\n```powershell\r\nuv run python -m prosodyanalyzer.tools.generate_manifest\r\n```\r\n\r\n## Environment Arguments\r\n| Arg | Type | Default | Description |\r\n|-----|------|---------|-------------|\r\n| (none) | – | – | `load_environment()` always returns the full dataset |\r\n| subset n (helper) | int | – | Use `load_environment_subset(n)` for local tests |\r\n\r\n## Metrics\r\n| Metric | Meaning |\r\n|--------|---------|\r\n| reward_regression | Range‑normalized closeness of the four numeric metrics (p5–p95 scaling when a manifest is available) |\r\n| reward_style | 1.0 if style matches exactly, else 0.0 |\r\n| reward_format | 1.0 if the JSON schema is correct and numeric bounds are valid |\r\n| overall | Weighted sum: 0.70 * regression + 0.25 * style + 0.05 * format |\r\n\r\n## Background\r\nEach audio file is analyzed with deterministic DSP: pitch (pyin → yin → autocorr fallback), speaking rate (onset density), energy (mean square), pause ratio (low-amplitude fraction). The style label comes from the filename emotion code or band inference. The manifest stores per-file metrics, percentile ranges, and SHA256 hashes. If the manifest is missing or stale, it is regenerated automatically. 
Two baselines ship with the environment: a hash-aware DSP oracle (near-perfect) and a perfect oracle returning exact manifest metrics. Unified style inference prevents drift.\r\n\r\n## Adding / Updating Audio\r\n1. Place WAV files under `prosodyanalyzer/data/`.\r\n2. Run any environment load or regenerate explicitly:\r\n```powershell\r\nuv run python -m prosodyanalyzer.tools.generate_manifest\r\n```\r\n3. Commit the new manifest for reproducible scoring.\r\n\r\n## Expected Model Output\r\n```json\r\n{\"pitch_mean_hz\": 210.2, \"speaking_rate_sps\": 4.9, \"energy_rms\": 0.065, \"pause_ratio\": 0.11, \"style\": \"HAPPY\"}\r\n```\r\n\r\n## Citation\r\n```bibtex\r\n@misc{prosodyanalyzer2025,\r\n  title={ProsodyAnalyzer: Deterministic Prosody & Emotion Evaluation Environment},\r\n  author={Fahim},\r\n  year={2025}\r\n}\r\n```\r\n","encoding":"utf-8","truncated":false,"total_bytes":7441},"status":null}