{"data":{"kind":"file","path":"README.md","version_id":"bu5lenkagjjukj0p1ir9nmms","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5379,"modified_at":"2026-03-26T02:04:26.415000","content_hash":"c5fe2de86a179cb3e143b116821a1a9a7d26b50f94c9420cb1b5504380035fc1"},"entries":[],"content":"# CausalReasoningEnv\n\nSingle-turn causal reasoning benchmark and RL training environment, built with Prime Intellect's [verifiers](https://github.com/PrimeIntellect-ai/verifiers) framework.\n\nThe environment trains models to identify causal effects via backdoor adjustment, frontdoor criterion, or instrumental variables over CPT-based DAGs. The focus is on **causal identification** — selecting the correct method, the correct node set, and the correct probability queries — not on computing numerical ATE/LATE values.\n\n---\n\n## Design\n\n### Single-Turn Rollout\n\nThe model receives a DAG description and produces **one response** containing:\n\n1. A `<declare method=\"...\" nodes=\"...\"/>` tag committing to an identification method and relevant node set.\n2. Between 1 and 3 probability query tags (`<marginal/>` or `<conditional/>`) specifying the queries needed to compute the causal effect under that method.\n\nNo tool-calling API is used. 
All output is plain text with XML self-closing tags parsed directly from the assistant message.\n\n### Output Format\n\n```xml\n<reasoning>\n...your reasoning here...\n</reasoning>\n<declare method=\"backdoor\" nodes=\"1,3\"/>\n<marginal variables=\"1,3\"/>\n<conditional query=\"6\" given=\"4,1,3\"/>\n```\n\n```xml\n<declare method=\"frontdoor\" nodes=\"5\"/>\n<conditional query=\"5\" given=\"3\"/>\n<marginal variables=\"3\"/>\n<conditional query=\"7\" given=\"3,5\"/>\n```\n\n```xml\n<declare method=\"iv\" nodes=\"0\"/>\n<conditional query=\"3\" given=\"0\"/>\n<conditional query=\"7\" given=\"0\"/>\n```\n\n### Problem Types\n\n| Type | Fraction | Identification strategy |\n|------|----------|------------------------|\n| `backdoor_standard` | ~35% | Non-empty minimal adjustment set Z; verified no frontdoor |\n| `backdoor_empty` | ~15% | Empty adjustment set (no backdoor confounding); verified no frontdoor |\n| `frontdoor` | ~40% | Latent confounder; valid mediator set M (possibly multi-node); verified no backdoor |\n| `iv` | ~10% | No backdoor/frontdoor; fresh exogenous IV node Z→X + latent confounder L→X,Y |\n\nEach problem has **exactly one valid identification method** by construction (mutual exclusivity enforced at generation time).\n\n### Reward Rubric\n\n| Component | Weight | Description |\n|-----------|--------|-------------|\n| `format_compliance` | 0.10 | 1.0 if response has exactly 1 valid `<declare/>`, 1–3 probability query tags, and all node IDs are integers |\n| `method_validity` | 0.30 | 1.0 if declared method matches the problem's identification method |\n| `set_validity` | 0.30 | 1.0 if declared node set correctly identifies the effect for the declared method |\n| `minimality` | 0.00 | 1.0 if set equals `minimal_set`; k/\\|declared\\| if valid superset (metric only, zero weight) |\n| `process_correctness` | 0.30 | Graded score for specifying probability queries that target the right distributions; gated on `set_validity` and `format_compliance` 
|\n\n`process_correctness` targets per problem type:\n\n| Problem type | Targets |\n|--------------|---------|\n| `backdoor_empty` | `conditional(Y \\| X)` |\n| `backdoor_standard` | `marginal(Z)` + `conditional(Y \\| X,Z)` |\n| `frontdoor` | `conditional(M \\| X)` + `marginal(X)` + `conditional(Y \\| X,M)` |\n| `iv` | `[marginal(Z,Y) or conditional(Y\\|Z)]` + `[marginal(Z,X) or conditional(X\\|Z)]` |\n\n---\n\n## Results\n\n### Example Problem — DAG Visualization\n\nA `backdoor_standard` problem from the eval set: 8 nodes (one latent, shown as a square), treatment X=3 (green), outcome Y=7 (red), minimal adjustment set {0, 4}.\n\n![Example eval DAG](plots/eval_example_DAG/first_eval_problem.png)\n\n### Evaluation — Baseline & RLFT'd Models\n\nMetric breakdown (method validity, set validity, process correctness, reward) across baseline and RLFT'd models, ordered weakest → strongest by reward. RLFT models are trained in our environment for 200 steps on our training dataset.\n\n![Eval metric breakdown](plots/eval_results/eval_summary_chart.png)\n\n![Reward by model](plots/eval_results/eval_reward_chart.png)\n\n![Process correctness by model](plots/eval_results/eval_process_correctness_chart.png)\n\n### RL Training — qwen/qwen3-30b-a3b-instruct\n\nReward curve over 200 training steps, rising from ~0.4 to ~0.85.\n\n![Training reward](plots/RL_training/qwen3-30b-instruct-reward.png)\n\n---\n\n## Install\n\n```bash\nprime env install CausalReasoningEnv\n```\n\nOr from source:\n\n```bash\nprime env install CausalReasoningEnv -p ./environments\n```\n\n---\n\n## Quickstart\n\n### Run eval\n\n```bash\nprime eval run CausalReasoningEnv\nprime eval run CausalReasoningEnv -m openai/gpt-4.1-mini -n 50 -r 1\n```\n\n### Regenerate dataset\n\n```bash\ncd environments/CausalReasoningEnv\npython data_generation/generate_datasets.py --n 1000 --save-local\npython data_generation/generate_datasets.py --n 1000 --push-hub\n```\n\n### Train\n\n```bash\nprime rl run 
configs/lab/rl_config.toml\n```\n\n---\n\n## File Structure\n\n```\nCausalReasoningEnv/\n  CausalReasoningEnv.py           # load_environment() entry point\n  env.py                          # CausalATEEnv — XML parser, rubric, reward functions\n  prompts.py                      # SYSTEM_PROMPT — XML tag format and examples\n  data_generation/\n    gen.py                        # DAG generation, problem sampling, dataset builder\n    generate_datasets.py          # CLI: regenerate and upload datasets\n  pyproject.toml\n  README.md\n```\n---\n\n## Dependencies\n\nDeclared in `pyproject.toml`: `verifiers>=0.1.9.post3`, `networkx>=3.0`, `datasets`.\n","encoding":"utf-8","truncated":false,"total_bytes":5379},"status":null}