{"data":{"kind":"file","path":"README.md","version_id":"zmobvpddca68kt3k0dscfxyn","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5582,"modified_at":"2025-10-07T13:52:10.213000","content_hash":"c781686eb153c6cffcee16772a81dd73aa413ee21ef60e0de668edc699af00f5"},"entries":[],"content":"# FinQA Reasoning Benchmark for PrimeIntellect.ai\n\nA financial question answering benchmark environment focused on complex multi-step reasoning with tool-calling capabilities, designed for evaluation on the PrimeIntellect.ai platform.\n\n## Overview\n\nThis environment provides access to financial data through structured tools, allowing models to perform complex multi-step reasoning tasks on financial datasets. The benchmark focuses on the **FinQA Reasoning** subset: 79 curated questions that require iterative tool use and multi-table analysis.\n\nModels can:\n\n- Discover available financial tables for companies\n- Query table schemas and metadata\n- Execute SQL queries on financial data\n- Perform calculations on retrieved data\n- Synthesize final answers through iterative tool use\n\n## Quick Start\n\n### Installation\n\n```bash\n# Install the environment\nuv pip install -e .\n```\n\n### Basic Usage\n\n```python\nimport verifiers as vf\nfrom finqa import load_environment\n\n# Load FinQA Reasoning dataset (79 questions - included by default)\nenv = load_environment(data_type=\"finqa_reasoning\")\n\n# Evaluate with a model (top-level `await` needs an async context, e.g. a notebook)\nresult = await vf.evaluate(\n    env=env,\n    model=\"claude-sonnet-4-5\",\n    n_samples=10\n)\n```\n\n### Data Included\n\nThe environment includes the **FinQA Reasoning** dataset with:\n- 79 curated questions focused on complex multi-step reasoning\n- Financial data for 20 companies across multiple tables\n- Questions that require iterative tool use to answer\n\n## Environment Details\n\n### Available Tools\n\n1. **`get_descriptions(company_name)`** - List available tables for a company\n2. 
**`get_table_info(company_name, table_name)`** - Get table schema and metadata\n3. **`sql_query(company_name, table_name, query)`** - Execute SQL queries on financial data\n4. **`calculator(expression)`** - Perform numerical calculations\n\n### Evaluation Tasks\n\nThe benchmark includes the **FinQA Reasoning** dataset (79 questions):\n\n- Curated subset focused on complex multi-step reasoning\n- Requires iterative tool use and strategic planning\n- Multi-table analysis across financial statements\n- Complex calculations with step-by-step explanations\n- Question types include:\n  - Financial ratios and compositions\n  - Growth metrics and year-over-year analysis\n  - Lease analysis and debt structure\n  - Tax analysis and income breakdowns\n\n### Supported Models\n\nThe environment supports multiple LLM providers through the model factory pattern:\n- OpenAI (GPT-4.1, O3, O4)\n- Anthropic (Claude Sonnet/Opus variants)\n- Together AI (Qwen, Llama models)\n- Google (Gemini)\n- Mistral, Cohere, Amazon Bedrock, xAI (Grok)\n\n## Configuration\n\n### Environment Variables\n\nCreate an `env` file with your API keys:\n\n```bash\nOPENAI_API_KEY=your_openai_key\nANTHROPIC_API_KEY=your_anthropic_key\nTOGETHER_API_KEY=your_together_key\n# ... 
other provider keys\n```\n\n### Custom Models\n\nAdd new models to `finqa/src/models.yaml`:\n\n```yaml\nprovider_name:\n  - model-name-1\n  - model-name-2\n```\n\nFor new providers, extend `finqa/model_factory.py`.\n\n## Advanced Usage\n\n### Batch Evaluation\n\n```python\nimport verifiers as vf\nfrom finqa import load_environment\n\n# Load FinQA Reasoning dataset (default)\nenv = load_environment(\n    data_type=\"finqa_reasoning\",\n    max_turns=100\n)\n\n# Or use a custom dataset (this replaces the `env` created above)\nenv = load_environment(\n    dataset_path=\"path/to/your/dataset.csv\",\n    max_turns=20\n)\n\n# Run evaluation across multiple models\nresults = await vf.evaluate(\n    env=env,\n    models=[\"claude-sonnet-4-5\", \"gpt-4\", \"qwen-2.5-72b\"],\n    n_samples=79,  # Full FinQA Reasoning dataset\n    parallel=True\n)\n```\n\n### Custom Dataset Format\n\nCreate a CSV file with the following columns:\n\n```csv\nprompt,ground_truth,company_name\n\"What was Apple's Q3 revenue?\",\"81.8 billion\",\"APPLE INC\"\n\"Calculate Tesla's operating margin for 2023\",\"8.2%\",\"TESLA INC\"\n```\n\nThen load it:\n\n```python\nfrom finqa import load_environment\n\nenv = load_environment(\n    dataset_path=\"path/to/custom_dataset.csv\",\n    max_turns=50\n)\n```\n\n## Evaluation Metrics\n\nThe environment uses an LLM judge (LLMJ) to evaluate accuracy, with Claude scoring each response on two weighted criteria:\n\n1. **Correctness** (80%): Does the model response match the ground truth?\n   - Handles different answer formats (e.g., \"52\" vs. \"\\\\boxed{52}\")\n   - Understands fractional representations\n   - Judges contextual correctness beyond exact string matching\n\n2. 
**Completeness** (20%): Is the reasoning process complete?\n   - The model provided a complete response\n   - It did not get stuck in the middle of reasoning\n   - It arrived at a final answer\n\nThis yields a more nuanced evaluation than simple string matching or rule-based approaches, with a final score between 0 and 1.\n\n## Data Structure\n\n```\nfinqa/\n├── __init__.py          # Package exports\n├── environment.py       # Main environment class\n├── tools.py             # Financial data tools\n├── model_factory.py     # LLM provider factory\n├── rubrics.py           # Evaluation rubrics\n├── data/\n│   ├── benchmark/\n│   │   └── finqa_reasoning.csv  # 79 questions (included)\n│   └── raw/             # Financial data for 20 companies\n│       ├── APPLE INC/\n│       ├── TESLA INC/\n│       ├── ...\n│       └── tables_cleaned_all_companies.json\n└── src/\n    ├── prompts/         # System prompts\n    └── models.yaml      # Model configurations\n```\n\n## License\n\nMIT License\n\n## Links\n\n- Hugging Face Dataset: https://huggingface.co/datasets/snorkelai/agent-finance-reasoning\n- Snorkel Finance Leaderboard: https://leaderboard.snorkel.ai/category/snorkelfinance\n- Snorkel Finance Reasoning Leaderboard: https://leaderboard.snorkel.ai/category/financereasoning","encoding":"utf-8","truncated":false,"total_bytes":5582},"status":null}