{"data":{"kind":"file","path":"README.md","version_id":"wmwci9ag5svwi43s2urmos93","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3797,"modified_at":"2025-09-09T04:20:50.709000","content_hash":"cd6ea3c740ef1de986f44e0a798913eebcbfc1db690ec90b0ad0110193c57674"},"entries":[],"content":"# meld-acoustic-thinking\n\n\n### Overview\n- **Environment ID**: `meld-acoustic-thinking`\n- **Short description**: Prosody-only, single-sentence acoustic impressions for MELD emotional utterances; scored against discretized pitch/intensity/rate labels with contradiction and formatting checks. This environment is intended to be used with `gpt-4o-audio-preview`. It also requires small modifications to the verifiers library to support audio models; see the changes in [https://github.com/yurpl/verifiers/blob/main/verifiers/envs/environment.py](https://github.com/yurpl/verifiers/blob/main/verifiers/envs/environment.py).\n\nThis environment builds on MELD (see [https://affective-meld.github.io/](https://affective-meld.github.io/)) and uses the MELD textual acoustic descriptions from Wu et al. (2024) [https://arxiv.org/abs/2407.21315](https://arxiv.org/abs/2407.21315).\n\nThe dream: can we get audio (multimodal) LLMs to produce good acoustic impressions? In other words: if you give an audio LLM an audio clip, will it reason about the audio well and produce a good acoustic impression?\n\n- **Tags**: audio, speech, prosody, multimodal, single-turn, MELD, evaluation, XML\n\n### Datasets\n- **Primary dataset(s)**: [https://huggingface.co/datasets/jmurzaku/meld-acoustic-dataset](https://huggingface.co/datasets/jmurzaku/meld-acoustic-dataset). A MELD-derived corpus with 16 kHz mono utterance audio and discretized acoustic labels such as `avg_pitch_category`, `pitch_std_category`, `avg_intensity_category`, `intensity_variation_category`, `pitch_range_category`, and `articulation_rate_category`.
Built from the MELD multi-party conversation dataset (from the Friends TV show!), which provides text/audio/vision modalities and 7-way emotion labels. We ONLY use the text and audio modalities.\n- **Split sizes**: train=4.58k, validation=1.11k, test=2.62k.\n\n### Task\n- **Type**: single-turn\n- **Parser**: XMLParser\n- **Rubric overview**: Impression match (the core of this idea): maps the model's language to 6 acoustic dimensions (pitch, pitch variation, volume, volume variation, speech rate, pitch range). Full credit for an exact category match; partial credit when the gold label is medium and the model hedges to nearby categories; a penalty for contradictions (e.g., “higher” and “lower” pitch together); a small bonus when hedging is appropriate for medium labels.\n\nOne-sentence check (`one_sentence_reward`, weight 0.1):\nRewards outputs with at most one sentence-ending punctuation mark.\n\nFormat check (`parser.get_format_reward_func()`, weight 0.1):\nEnsures the answer is inside `<impression>...</impression>` and nothing else.\n\n### Quickstart\nRun an evaluation on 300 examples from the train split:\n\n```bash\nuv run vf-eval meld-acoustic-thinking -m gpt-4o-audio-preview \\\n        --env-args '{\"hf_name\":\"jmurzaku/meld-acoustic-dataset\",\"split\":\"train\"}' --num-examples 300\n```\n\n\n### Environment Arguments\nSupported environment arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `hf_name` | str | `\"jmurzaku/meld-acoustic-dataset\"` | Hugging Face dataset ID to load. 
|\n| `split` | str | `\"train\"` | Split to use (\"train\" or \"dev\" or \"test\") |\n| `limit` | int or null | `-1` | Limit on dataset size (use -1 for all) |\n\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of criteria) |\n| `impression_reward` | Dimension-aware match score (0–1) with partial credit, contradiction penalty, and hedging adjustment. |\n| `one_sentence_reward` | Output contains at most one sentence-ending punctuation mark |\n| `format_reward` | Output inside `<impression>...</impression>` and nothing else |\n\n## Evaluation Reports\n<!-- Do not edit below this line. Content is auto-generated. -->\n","encoding":"utf-8","truncated":false,"total_bytes":3797},"status":null}