{"data":{"kind":"file","path":"README.md","version_id":"wh0t0srvnc1rpi32z0a1q73j","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8889,"modified_at":"2026-05-14T16:54:59.128000","content_hash":"d4adee40210a6b5b582cef2c50561e6db75b81cd2c2381e74328f36d6747a49f"},"entries":[],"content":"# crystal-relaxation\n\n> Replace the placeholders below, then remove this callout.\n\n### Overview\n- **Environment ID**: `crystal-relaxation`\n- **Short description**: 'Crystal Generation Tasks (Stability, Composition, and validity constraints)'\n- **Tags**: materials,crystals,grpo,energy-of-formation,stability,novel\n\n### Datasets\n- **Primary dataset(s)**: MP-20\n- **Source links**: \n- **Split sizes**: \n\n### Task\n- **Type**: multi-turn \n- **Parser**: custom\n- **Rubric overview**: \n    Rubrics depend on the task, but they follow the structure:\n        - Standard crystal rubrics in a regular `vf.Rubric`\n        - A reasoning-trace `vf.JudgeRubric`\n        - Both aggregated together with `vf.RubricGroup`\n\n### Quickstart\nThe CHGNet server (used for energy/force calculations) is automatically instantiated on environment load. You can configure the port and server type (mock vs real) via the environment arguments. 
\n\nTo run a persistent server manually (e.g., in a separate tmux pane), start it as shown below; the environment detects it when you pass the matching port:\n\n```bash\n# Optional manual server start\nexport CHGNET_PATH=\"./crystal_relaxation/finetuned_chgnet_model.pth\"\nuvicorn crystal_relaxation.server:app --host 0.0.0.0 --port 1111\n```\n\nRun an evaluation with updated settings:\n\n- Space Group Classification Task\n```bash\nuv run vf-eval crystal_relaxation \\\n--num-examples 5 \\\n--rollouts-per-example 3 \\\n-m gpt-5 \\\n--env-args='{\"dataset_path\":\"./data/mp_20/mp_20_group_classification_dataset.json\", \"max_turns\":1, \"enabled_rubrics\":[\"spacegroup_classification\"], \"to_generate_cif\": false}'\n```\n\n- Relaxation Task\n```bash\nuv run vf-eval crystal_relaxation \\\n--num-examples 5 \\\n--rollouts-per-example 3 \\\n-m gpt-5 \\\n--env-args='{\"dataset_path\":\"./data/mp_20/mp_20_relaxation_generation_dataset.json\", \"max_turns\":2, \"enabled_rubrics\":[\"format\",\"composition\",\"formation_energy\",\"bond_lengths\"], \"port\":1111, \"mock\":false}'\n```\n\n- Relaxation Task with reasoning judge grouped with the crystal rubrics\n```bash\nprime eval run crystal_relaxation \\\n    -m \"qwen/qwen3-30b-a3b-instruct-2507\" \\\n    --num-examples 5 \\\n    --rollouts-per-example 3 \\\n    --env-args='{\n        \"dataset_path\":\"./data/mp_20/mp_20_relaxation_generation_dataset.json\",\n        \"max_turns\":2,\n        \"enabled_rubrics\":[\"format\",\"reasoning_judge\",\"composition\",\"formation_energy\",\"bond_lengths\"],\n        \"judge_model\":\"qwen/qwen3-8b\",\n        \"mock\":false,\n        \"testing\":false,\n        \"port\":1111\n    }'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- When `reasoning_judge` is enabled, the environment builds a `vf.JudgeRubric` for the reasoning score and groups it with the regular crystal rubric using `vf.RubricGroup`.\n- The judge model 
should return a single numeric score in `[0, 1]`, which is used directly as the reasoning reward.\n\n### Environment Arguments\nThese arguments are passed to `load_environment(**kwargs)`.\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | int | `1` | Maximum number of refinement turns (1 for single-shot) |\n| `target_composition` | str | `None` | Optional target chemical composition for structures |\n| `formation_energy_threshold` | float | `0.1` | Maximum acceptable formation energy (eV/atom) |\n| `use_summarized_feedback` | bool | `True` | If True, use concise, token-efficient feedback. If False, use detailed feedback |\n| `dataset_path` | str | `None` | Path to dataset JSON file. If None, searches default paths |\n| `to_generate_cif` | bool | `True` | If True, prompt for CIF generation. If False, prompt for space group classification |\n| `enabled_rubrics` | list[str] | `None` | List of enabled rubric names (see below) |\n| `rubric_modes` | dict[str, str] | `None` | Dictionary mapping rubric names to scoring modes (e.g. 
`linear`, `exponential`) |\n| `rubric_weights` | dict[str, float] | `None` | Optional final-scoring weights passed to the rubric builder |\n| `evaluation_weights` | dict[str, float] | `None` | Optional per-turn feedback weights for composition, formation energy, forces, and bond lengths |\n| `mock` | bool | `True` | If True, use a Mock CHGNet model (no real calculations) |\n| `testing` | bool | `True` | If True, loads a small reduced dataset for quick testing |\n| `port` | int | 1111 | Port number for the CHGNet energy/force calculation server |\n| `chgnet_model_path` | str | `None` | Path to fine-tuned CHGNet model checkpoint (.pth file) |\n| `judge_model` | str | `None` | Judge model name used by the `vf.JudgeRubric` when `reasoning_judge` is enabled |\n| `judge_client` | object | `None` | Optional OpenAI-compatible client passed to `vf.JudgeRubric` |\n| `judge_prompt` | str | `None` | Optional override for the reasoning-judge prompt |\n| `judge_sampling_args` | dict[str, Any] | `None` | Optional sampling arguments for judge completions |\n| `**kwargs` | dict | | Additional arguments passed to the `MultiTurnEnv` wrapper |\n\n### Available Rubrics\nYou can enable any combination of these rubrics in `enabled_rubrics`.\n\n| Rubric Name | Weight | Description | Available Modes |\n| :--- | :--- | :--- | :--- |\n| `format` | 0.1 | Checks CIF format validity | N/A |\n| `reasoning_judge` | 0.12 | Uses `vf.JudgeRubric` to score reasoning quality from 0 to 1 | N/A |\n| `composition` | 0.15 | Checks composition matches target | N/A |\n| `formation_energy` | 0.6 | Checks energy below threshold | `discrete`, `linear`, `exponential` |\n| `forces` | 0.24 | Checks atomic forces are reasonable | `discrete`, `continuous_decay`, `continuous_ratio` |\n| `bond_lengths` | 0.15 | Checks bond lengths are reasonable | `discrete`, `continuous` |\n| `stable_bonds` | 0.4 | Checks for stable (low energy) structures with reasonable bonds | N/A |\n\n### Feedback Mechanism\nThe environment 
provides feedback to the agent after each turn, which helps guide the iterative refinement process. The feedback can be detailed or summarized based on the `use_summarized_feedback` argument.\n\n#### Standard Feedback\nPass/fail status for each enabled rubric, along with critical error messages if parsing fails.\nExample:\n```\n✓ Formation energy: -0.150 eV/atom (good)\n✗ Forces: NOT good, atoms too close!\n✓ Bond lengths: check\n✓ Composition: matches target.\n```\n\n#### Detailed Feedback (when `use_summarized_feedback=False`)\nIncludes an overall assessment score, detailed breakdown of each criterion, and specific suggestions for improvement.\nExample:\n```\n============================================================\nFEEDBACK ON YOUR CRYSTAL STRUCTURE\n============================================================\n\nOverall Score: 0.65/1.00\nAssessment: ACCEPTABLE, but there are several areas for improvement.\n\n✓ CIF was successfully extracted and parsed\n\n------------------------------------------------------------\nDETAILED EVALUATION:\n------------------------------------------------------------\n✓ Composition: Matches target 'Fe2O3'\n✓ Formation Energy: -0.1500 eV/atom (below threshold 0.1 eV/atom)\n❌ Atomic Forces: Some forces are too high (structure not well-relaxed)\n   💡 SUGGESTIONS:\n   - Atoms may be too close together\n   - Check for overlapping atomic positions\n   - Adjust positions to minimize forces\n✓ Bond Lengths: Reasonable (score: 0.85/1.00)\n\n============================================================\nSUMMARY:\n============================================================\nYour structure has potential. 
Address the major issues (❌) first.\n============================================================\n```\n\n### Metrics\nKey metrics emitted by the rubric and how to interpret them.\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of criteria: 0.0 to 1.0) |\n| `max_turns` | Number of turns taken to reach the final answer |\n| `rubric_name` | Individual score for each enabled rubric (e.g., `reasoning_judge`, `formation_energy`, `forces`, `bond_lengths`) |\n| `accuracy` | Exact-match or threshold pass (1.0 if all criteria met, 0.0 otherwise) |\n\n### Version Notes\n\n#### 1.1.6 - 2026-05-14\n\n- Fixed concurrent multi-turn state isolation in `CrystalRelaxationMultiTurnEnv`.\n- Each rollout now gets its own `CrystalRelaxationEnv`, keyed by trajectory state, so target composition and feedback state cannot leak across concurrent examples.\n- Added `rubric_weights` support in `load_environment()` and wired it to the final verifier rubric builder.\n- Added `evaluation_weights` support for per-turn feedback scoring, allowing classic runs to use the same weighting scheme as the RLM-native environment.\n- Validated the fix with a `3x3` `openai/gpt-5.5` smoke run at `max_concurrent=9` and corrected `25x3` reruns for `openai/gpt-5.5`, `openai/gpt-5.4-mini`, `google/gemini-2.5-flash`, and `anthropic/claude-haiku-4.5`; all corrected classic runs had `0/75` prompt/answer/feedback target-composition mismatches.\n","encoding":"utf-8","truncated":false,"total_bytes":8889},"status":null}