{"data":{"kind":"file","path":"README.md","version_id":"v163uwnzqb26kxdpb35qid5q","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3108,"modified_at":"2025-09-10T06:49:58.432000","content_hash":"166d9f50572b5d3a247834a272082396a473951f7aae1fb0373d41d4f8e51325"},"entries":[],"content":"# finalyser-env-market-mood\n\n\n### Overview\n- **Environment ID**: `finalyser-env-market-mood`\n- **Short description**: Single-turn evaluation of financial models predicting NIFTY index movement (Rise/Fall/Neutral) and the magnitude of the change.\n- **Tags**: finance, single-turn, classification, regression, market-mood, nifty, eval, train\n\n### Datasets\n- **Primary dataset(s)**: raeidsaqur/NIFTY, financial news and context labeled with NIFTY market movement direction and percentage change.\n- **Source links**: https://huggingface.co/datasets/raeidsaqur/NIFTY\n- **Split sizes**: If the dataset provides train/test splits, they are used directly. Otherwise, a train/eval split is created via `train_test_split(test_size=0.2, seed=42)`.
The `max_examples` argument limits the dataset size.\n\n### Task\n- **Type**: single-turn\n- **Parser**: `MarketMoodParser` (custom), which extracts the final direction (rise/fall/neutral) and the predicted percentage change\n- **Rubric overview**: \n    - *reward_classification*: exact match on direction.\n    - *reward_regression*: closeness of predicted % change to ground truth (tolerance-based decay).\n    - *reward_format*: enforces correct output format (direction + % on separate lines).\n    - *Weighted sum*: [1.0, 0.5, 0.2].\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval finalyser-env-market-mood\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval finalyser-env-market-mood   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"key\": \"value\"}'  # env-specific args as JSON\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- `use_think=true` enables chain-of-thought style prompting, where the model reasons inside `<think>...</think>` before outputting the final prediction.\n\n### Environment Arguments\nSupported environment arguments and their meanings:\n\n| Arg            | Type   | Default | Description |\n| -------------- | ------ | ------- | ----------- |\n| `max_examples` | int    | `-1`    | Limit on dataset size (use -1 for all). |\n| `test_size`    | float  | `0.2`   | Fraction of data to use for the test split. |\n| `seed`         | int    | `42`    | Random seed for reproducibility of the train/test split. |\n| `use_think`    | bool   | `True`  | Whether to include “thought” reasoning traces in prompts. |\n| `batch_size`   | int    | `1`     | Number of examples evaluated in a single rollout. |\n| `shuffle`      | bool   | `True`  | Whether to shuffle the dataset before splitting. |\n\n\n### Metrics\nKey metrics emitted by the rubric and how to interpret them:
\n\n| Metric       | Meaning |\n| ------------ | ------- |\n| `reward`     | Main scalar reward (weighted sum of criteria). |\n| `accuracy`   | Exact match between model output and target label. |\n| `f1`         | F1 score for classification tasks (more robust to class imbalance than accuracy). |\n| `precision`  | Fraction of predicted positive moods that were correct. |\n| `recall`     | Fraction of actual positive moods that the model correctly predicted. |\n| `coverage`   | Share of dataset examples for which the model produced an output. |\n\n\n","encoding":"utf-8","truncated":false,"total_bytes":3108},"status":null}