{"data":{"kind":"file","path":"README.md","version_id":"ukj5l5q3vdwhpgzf49sh8dq6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":12162,"modified_at":"2026-03-07T00:06:52.412000","content_hash":"5042dc5931f816a2e3e202e85e83e50334caf89b2a716d216cb3b2a70bb6812d"},"entries":[],"content":"# BlicketTest_CausalReasoning\n\n### Overview\n- **Environment ID**: `BlicketTest_CausalReasoning`\n- **Short description**: Multi-turn causal reasoning environment based on the Blicket detector paradigm from developmental psychology. Tests an LLM's ability to explore, reason causally, and identify which objects are Blickets.\n- **Tags**: multi-turn, reasoning, eval, train\n\n### Reference\nBased on [Do LLMs Think Like Scientists? Causal Reasoning and Hypothesis Testing in LLMs](https://arxiv.org/pdf/2505.09614).\n\n### Task\n- **Type**: multi-turn\n- **Parser**: XMLParser (fields: `reasoning`, `action`)\n- **Rubric overview**: Reward is a weighted combination of active components: blicket set Jaccard similarity (0.5), format compliance (0.05), per-step information-seeking efficiency (0.1), and posterior Jaccard (0.35).\n\nThe agent interacts with a simulated \"Blicket-detecting machine\" across two phases:\n1. **Exploration phase** — toggle objects on/off the machine one at a time, observe whether the machine activates, and exit when ready.\n2. **Answer phase** — declare which objects are Blickets. The agent has up to `MAX_ANSWER_ATTEMPTS` (3) retries to produce a correctly-formatted answer before the episode ends with no score.\n\nEach example's configuration — number of objects, which objects are Blickets (at least 2, up to floor(n/2)), and rule type — is fixed at dataset generation time. The hidden rule is either **disjunctive** (machine activates if *any* Blicket is present) or **conjunctive** (machine activates only if *all* Blickets are present). 
Neither the rule type nor the Blicket assignments are revealed to the agent.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run BlicketTest_CausalReasoning\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run BlicketTest_CausalReasoning \\\n  -m openai/gpt-4.1-mini \\\n  -n 50 -r 3 -t 4096 -T 0.7 \\\n  -a '{\"num_examples\": 50}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `250` | Number of training examples (clamped to [100, 500]) |\n\n### Dataset Generation\n\nTraining and eval datasets are generated independently and are guaranteed to be fully disjoint. The eval dataset is independent of `num_examples`. For a fixed `num_examples`, different runs will produce the same training set and the same eval set.\n\n**Training** (`num_examples` configs, n ∈ [4, 10], seed=42):\n- Rule split: ~2/3 conjunctive, ~1/3 disjunctive. Conjunctive problems are oversampled because they are empirically harder for LLMs.\n- Generated by calling `sample_balanced_configs` with `n_conj = round(2 * num_examples / 3)` and `n_disj = num_examples - n_conj`.\n- Sliced from a pre-generated max-size pool (333 conjunctive + 167 disjunctive), so the set of training configs is stable as `num_examples` changes; larger values just take more from the front of each rule-type list.\n\n**Eval** (loaded from HuggingFace: `irfanjamil/BlicketEnv_Eval_Set`, split `eval`):\n- Built by `build_new_eval_dataset.py` and uploaded via `upload_eval_dataset.py`.\n- num_objects sampled from N(9.5, 1.5), clipped to [5, 13], so the distribution is centered around 8–11 objects.\n- 35 conjunctive + 25 disjunctive configs (60 total); num_blickets ∈ [2, 8].\n- Disjoint from the full 250-config training pool (seed=42, n∈[4,10]) by exclusion at generation time.\n\n### Dataset Profiles\n\nVisual breakdowns of the training and eval dataset distributions are saved in 
[`datasets_profile/`](datasets_profile/):\n\n- **`train_profile.png`** — generated by `profile_train_dataset.py`. Shows the 250-example pool used by the developer for RL training: rule-type split, num-objects distribution, (num_objects × num_blickets) heatmap, and (num_blickets × rule_type) grouped bar chart.\n- **`eval2_profile.png`** — generated by `profile_eval_dataset.py`. Loads the eval set directly from HuggingFace (`irfanjamil/BlicketEnv_Eval_Set`, split `eval`) and shows the same four breakdowns as the train profile: rule-type split, num-objects distribution (centered around 8–11), (num_objects × num_blickets) heatmap, and (num_blickets × rule_type) grouped bar chart.\n\n### Architecture\n\n`BlicketEnv` subclasses `vf.MultiTurnEnv`. The verifiers `max_turns` is set to `global_max_steps + 1 + MAX_ANSWER_ATTEMPTS` (the largest step budget across all rows + transition turn + up to 3 answer-phase retries).\n\n**Rollout lifecycle:**\n\n1. `setup_state()` reads per-row config from the dataset `info` field (blickets, rule type, step budget, and pre-computed total hypotheses to eliminate). Initializes zeroed object states, the full hypothesis space (2^N blicket assignments × 2 rule types), and tracking counters.\n2. `env_response()` drives the game loop across both phases. All turns increment `exploration_and_answer_count`:\n   - **Exploration**: calls `parse_response(..., \"exploration\", ...)`, which strips reasoning blocks, requires exactly one `<action>` tag, and delegates to `parse_action`. Validates the action, toggles object state, computes machine activation, filters the hypothesis space against the observation, records hypotheses eliminated this step into `hypotheses_eliminated_per_step`, and returns a compact observation. Invalid and redundant actions still consume a step.\n   - **Answer**: calls `parse_response(..., \"answer\", ...)`, which applies the same strict tag rules, then delegates to `parse_blicket_set`. 
On successful parse, scores with Jaccard similarity and terminates. On failure, sends a reformat message and loops up to `MAX_ANSWER_ATTEMPTS` (3) total attempts; if all exhausted, exits with score 0.\n3. Termination is handled by the base class `has_final_env_response` stop condition.\n\n\n**Machine activation logic:**\n- **Disjunctive** (OR): machine ON if *any* Blicket is on the machine.\n- **Conjunctive** (AND): machine ON only if *all* Blickets are on the machine.\n\n**Transition to answer phase** happens when the agent sends `exit` or exhausts `max_num_steps`. The transition message includes a full observation history recap so the agent can reason over all experiments at once.\n\n### File Structure (`BlicketTest_CausalReasoning.py`)\n\n**Module-level constants:**\n- `MAX_ANSWER_ATTEMPTS = 3` — maximum answer-phase retries before the episode ends with score 0.\n\n**Entry point:**\n- `load_environment(num_examples)` — generates training and eval datasets, builds the parser/rubric, and returns a `BlicketEnv` instance. See [Dataset Generation](#dataset-generation) above.\n\n**Environment class:**\n- `BlicketEnv(vf.MultiTurnEnv)`\n  - `setup_state()` — reads pre-computed per-row config from dataset `info`. Initializes blicket array, zeroed object/machine states, step counter, phase tracker, history log, hypothesis space, and action-tracking counters (`total_action_count`, `exploration_and_answer_count`, `parseable_action_count`, `valid_action_count`, `redundant_action_count`, `out_of_range_count`, `answer_attempt_count`). Also loads `optimal_hypotheses_eliminated` for use by the `hypotheses_eliminated` diagnostic.\n  - `env_response()` — core game loop. 
Handles exploration (parse action via `parse_response`, validate, toggle, compute machine state, filter hypotheses, record per-step eliminations, return observation) and answer phase (parse predictions via `parse_response`, retry loop up to `MAX_ANSWER_ATTEMPTS`, score via Jaccard and signal termination on success). On valid answer, stores both `state[\"final_score\"]` (Jaccard) and `state[\"final_predictions\"]` (the raw predicted set) for reward functions.\n  - `_build_transition_message()` — assembles the observation history recap when moving to answer phase.\n\n\n**Metrics:**\n- `blicket_set_jaccard()` — reads `state[\"final_score\"]`, set by `env_response` as the Jaccard similarity between the predicted Blicket set and the gold set. Returns 0.0 if no valid answer was recorded.\n- `exploration_efficiency()` — `1 - (wasted / parseable_action_count)`, where waste = redundant actions + out-of-range object IDs + non-contiguous configuration revisits. Higher is better. Weight 0.0 (retained for metric logging only).\n- `format_compliance()` — `parseable_action_count / exploration_and_answer_count` across all turns in both phases. Higher is better.\n- `hypotheses_eliminated()` — fraction of total hypotheses eliminated relative to the theoretical maximum (`2^(N+1) - 1`). Weight 0.0 (retained for metric logging only).\n- `per_step_efficiency_dynamic()` — counterfactual per-step information-seeking reward. At each step `t`, reconstructs the agent's hypothesis set H_t from the initial space filtered by all prior observations, then computes: (1) `agent_balance(t)` = min(on_count, off_count) for the agent's actual toggle applied to H_t; (2) `optimal_balance(t)` = max over all N possible single-object toggles of min(on_count, off_count) from H_t. Returns the mean of `agent_balance / optimal_balance` across steps where `optimal_balance > 0`. 
Unlike the old path-dependent oracle comparison, the baseline is always computed from the agent's actual belief state at each step.\n- `posterior_jaccard()` — at the end of exploration, computes the mean Jaccard similarity between each remaining valid hypothesis's implied blicket set and the gold blicket set (rule type ignored). Rewards the agent for narrowing the hypothesis space to hypotheses close to the truth, independently of the final answer.\n- `blicket_precision()` — `TP / |predicted|` over the final predicted blicket set. Weight 0.0 (diagnostic: decompose identification failures into overclaiming vs. missing).\n- `blicket_recall()` — `TP / |gold|` over the final predicted blicket set. Weight 0.0 (diagnostic: decompose identification failures into overclaiming vs. missing).\n\n### Reward Function\n\n| Component | Weight | Meaning |\n| --------- | ------ | ------- |\n| `blicket_set_jaccard` | 0.50 | Jaccard similarity between predicted and gold Blicket sets |\n| `posterior_jaccard` | 0.35 | Mean Jaccard of remaining hypotheses against gold at end of exploration |\n| `per_step_efficiency_dynamic` | 0.10 | Counterfactual info-seeking reward: agent's action balance vs. optimal balance from the same belief state |\n| `format_compliance` | 0.05 | Parseable actions across all turns (both phases) |\n| `exploration_efficiency` | 0.00 | `1 - (wasted / parseable)` — fraction of productive actions (diagnostic) |\n| `hypotheses_eliminated` | 0.00 | Fraction of hypotheses eliminated vs. 
theoretical maximum (diagnostic) |\n| `blicket_precision` | 0.00 | `TP / \\|predicted\\|` — overclaiming diagnostic |\n| `blicket_recall` | 0.00 | `TP / \\|gold\\|` — missing-blicket diagnostic |\n\n### Evaluation Results\n\nEval run on [`irfanjamil/BlicketEnv_Eval_Set`](https://huggingface.co/datasets/irfanjamil/BlicketEnv_Eval_Set) (60 examples: 35 conjunctive, 25 disjunctive; n ∈ [5, 13]).\n\nModels suffixed with **v1, v2, v3, v4** are our trained models — each representing a distinct RL training run of Qwen3-30b-a3b-instruct-2507 on this environment.\n\n**Blicket Set Jaccard** — Jaccard similarity between the predicted and gold Blicket sets, averaged across rollouts per example:\n\n![Blicket Set Jaccard (eval set 2)](eval_plots/blicket_set_jaccard_eval2.png)\n\n### Training Results\n\nThe plots below are from the RL training run that produced **qwen3-30b-a3b-v4** (the v4 model shown in the evaluation results above), using the reward function defined in **environment v0.1.4** (weights: `blicket_set_jaccard` 0.50, `posterior_jaccard` 0.35, `per_step_efficiency_dynamic` 0.10, `format_compliance` 0.05).\n\n**Total reward:**\n\n![Reward](training_plots/reward.png)\n\n**Blicket set Jaccard similarity:**\n\n![Jaccard Similarity](training_plots/jaccard_similarity.png)\n\n**Posterior Jaccard** (hypothesis-space quality at end of exploration):\n\n![Posterior Jaccard](training_plots/posterior_JS.png)\n\n**Per-step information-seeking efficiency:**\n\n![Dynamic Efficiency](training_plots/dynamic_efficiency.png)\n\n**Blicket precision** (`TP / |predicted|`):\n\n![Precision](training_plots/precision.png)\n\n**Blicket recall** (`TP / |gold|`):\n\n![Recall](training_plots/recall.png)\n","encoding":"utf-8","truncated":false,"total_bytes":12162},"status":null}