{"data":{"kind":"file","path":"README.md","version_id":"wpuhw0fw55j4t4qyize9f357","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4578,"modified_at":"2025-08-25T07:55:32.057000","content_hash":"64a695bd09b48b7dd189fb55a3893f8b6e409a54111a979e9b283ce0791f4d0c"},"entries":[],"content":"# misguided-attention\n\n## Overview\n\n- **Environment ID**: `misguided-attention`\n- **Short description**: Complex reasoning tasks with multi-criteria weighted evaluation using LLM judges\n- **Tags**: misguided-attention, single-turn, reasoning, logic-puzzles, multi-criteria, llm-judge\n\nThis is a collection of prompts that challenge the reasoning abilities of large language models in the presence of misleading information. They are slight variations of commonly known thought experiments, riddles, or paradoxes (\"trick questions\").\n\nThe expected behavior is that an LLM solves each problem as stated, by logical deduction. However, many LLMs mistakenly recognize the unmodified problem because it occurs frequently in their training data. As a consequence, they respond with a solution to the unmodified problem instead of going through the details step by step to find a solution for the modified problem. In some cases it is also possible to observe intertwined strands of reasoning in which conflicting thoughts alternate within the same text.\n\nParallels can be drawn to human behavior, where recognition of familiar patterns leads to the execution of previously learned routines, even if they are not applicable to the current situation. This is known as the Einstellung effect. 
However, one would expect a computerized reasoning system not to be subject to such a fallacy.\n\n## Datasets\n\n- **Primary dataset(s)**: MisguidedAttention v4 and MisguidedAttention v4 Long\n- **Source links**:\n  - MisguidedAttention v4: [GitHub Repository](https://github.com/cpldcpu/MisguidedAttention)\n  - Dataset files: [v4.scr](https://github.com/cpldcpu/MisguidedAttention/blob/main/eval/harness/misguided_attention_v4.scr), [v4_long.scr](https://github.com/cpldcpu/MisguidedAttention/blob/main/eval/harness/misguided_attention_v4_long.scr)\n- **Split sizes**:\n  - Evaluation:\n    - MisguidedAttention v4: 13 examples\n    - MisguidedAttention v4 Long: 52 examples\n\n## Task\n\n- **Type**: single-turn\n- **Parser**: Identity parser (returns stripped response)\n- **Rubric overview**: Multi-criteria weighted evaluation using an LLM judge with configurable providers (DeepSeek, OpenAI, Gemini)\n\n## Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nexport OPENAI_API_KEY=\"your-api-key\"\nuv run vf-eval misguided-attention\n```\n\nConfigure the judge model and endpoint explicitly, here with an OpenAI judge:\n\n```bash\nexport OPENAI_API_KEY=\"your-api-key\"\nuv run vf-eval misguided-attention \\\n  -a '{\"judge_base_url\": \"https://api.openai.com/v1\", \"judge_api_key_var\": \"OPENAI_API_KEY\", \"judge_model\": \"gpt-4o-mini\"}'\n```\n\nNotes:\n\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The environment automatically downloads and descrambles the datasets if they are not present locally.\n- The default configuration uses OpenAI as the judge provider.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"long\"` | Dataset version (\"normal\" for 13 examples, \"long\" for 52 examples) |\n| `judge_model` | str | `\"gpt-4o-mini\"` | Judge model name |\n| `judge_base_url` | str | `\"https://api.openai.com/v1\"` | Judge API base URL |\n| 
`judge_api_key_var` | str | `\"OPENAI_API_KEY\"` | Environment variable name for judge API key |\n| `judge_temperature` | float | `0.0` | Judge sampling temperature |\n| `judge_max_tokens` | int | `1000` | Judge maximum number of tokens |\n| `max_examples` | int | `-1` | Maximum number of examples to load (-1 for all) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Final weighted score across all criteria (0.0 to 1.0) |\n\n### Example Structure\n\nEach example contains:\n\n- **prompt**: Complex reasoning task or ethical dilemma\n- **criteria**: List of evaluation criteria (1-3 per example)\n- **weights**: Numerical weights for each criterion\n- **metadata**: prompt_id, category, title\n\nExample:\n\n```txt\nprompt_id: trolley_problem_easy\ncriteria: [\n  \"States that pulling the lever is the morally correct choice\",\n  \"Mentions the utilitarian principle\"\n]\nweights: [0.5, 0.5]\n```\n\n### Judge Integration\n\nThe environment supports multiple LLM judge providers:\n\n**OpenAI:**\n\n```json\n{\n  \"judge_base_url\": \"https://api.openai.com/v1\", \n  \"judge_api_key_var\": \"OPENAI_API_KEY\",\n  \"judge_model\": \"gpt-4o-mini\"\n}\n```\n\n## Evaluation Reports\n<!-- Do not edit below this line. Content is auto-generated. -->\n<!-- vf:begin:reports -->\n<p>No reports found. Run <code>uv run vf-eval misguided-attention</code> to generate one.</p>\n<!-- vf:end:reports -->","encoding":"utf-8","truncated":false,"total_bytes":4578},"status":null}