{"data":{"kind":"file","path":"README.md","version_id":"rsxjwlw63pe80jdphc9kps8n","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10840,"modified_at":"2026-01-09T22:50:44.790000","content_hash":"bd06355e4b3ba3daa418dff90f65bc979578319b4a40c38bf8d9b5612c0546c7"},"entries":[],"content":"# BeautifulSoup RL Environment\n\nAn RL environment for training agents on BeautifulSoup HTML parsing tasks, built for [Prime Intellect's Environments Hub](https://docs.primeintellect.ai/verifiers/source/environments).\n\n## Why This Environment\n\nMost web scraping benchmarks use clean, tutorial-style HTML. Production HTML is malformed, noisy, and often adversarial. Agents trained on sanitized examples fail when they encounter real websites.\n\nThis environment trains models to handle the messy reality of web scraping: extracting data from malformed HTML, avoiding common BS4 API pitfalls, recognizing when static parsing is impossible, and respecting safety boundaries around credential extraction.\n\n## What's Inside\n\nThe environment contains 57 task archetypes spanning core extraction (text, attributes, images, links), table parsing with rowspan/colspan and nested structures, form parsing, structured data extraction (JSON-LD, Open Graph, microdata), DOM traversal, internationalization (Unicode, RTL, CJK), and BS4-specific gotchas like `.string` vs `.get_text()` and the `class_` reserved word. Tasks range from primer-level for bootstrapping models that start at 0% to hard multi-step extraction requiring filtering, aggregation, and chained navigation.\n\nFour \"limitation\" archetypes present content that is genuinely unparseable with static methods—JavaScript-rendered text, image-based content—where the correct behavior is to abstain and cite evidence from the HTML proving why.\n\n## Design Highlights\n\n**Deterministic grading.** All rewards come from exact or normalized matching against ground truth. No LLM judges. Scores are reproducible across runs.\n\n**Anti-reward-hacking.** Limitation tasks require evidence: a literal substring from the HTML proving why abstention is correct. Models can't game the system by always claiming \"can't parse.\"\n\n**Procedural generation.** Each archetype is a parameterized generator. Same seed produces the same task across all runs. Train, eval, and bench splits use non-overlapping seed ranges.\n\n**Fixed benchmark manifest.** The bench split uses 1040 pre-selected (archetype, seed) pairs, ensuring apples-to-apples comparison across environment versions.\n\n**Safety built-in.** Credential extraction is penalized. Sandboxed execution with network disabled by default.\n\n## Results\n\nRL training on Qwen3-8B showed a 30-point improvement: baseline average reward of 61.7% increased to 91.8% after training, with pass rate jumping from 63.8% to approximately 90%.\n\n## Naming\n\n- **Hub name**: `seconds-0/beautiful-soup-env` (use in `prime env eval`)\n- **Python module**: `beautiful_soup_env` (use in `from beautiful_soup_env import ...`)\n\n## Installation\n\n```bash\n# Basic installation\npip install -e .\n\n# With development dependencies\npip install -e \".[dev]\"\n\n# With all extras (dev, verifiers, prime, eval)\npip install -e \".[all]\"\n```\n\n### Optional Dependencies\n\n| Extra | Purpose | When to Use |\n|-------|---------|-------------|\n| `.[dev]` | pytest, ruff, mypy | Local development and testing |\n| `.[eval]` | OpenAI client | Running LLM evaluations |\n| `.[prime]` | Prime sandboxes | Production execution on Prime |\n| `.[verifiers]` | Verifiers framework | Integration with Prime's RL pipeline |\n| `.[all]` | All of the above | Full development environment |\n\n## Quick Start\n\n```python\nfrom beautiful_soup_env import load_environment\n\n# Load the environment (defaults to bench split for fast loading)\nenv = load_environment()\n\n# Or with custom config\nenv = load_environment(\n    split=\"train\",       # Use \"bench\" for quick testing, \"train\" for RL training\n    mode=\"mvp\",\n    difficulty=\"mixed\",\n    executor_backend=\"local\"\n)\n\n# Get a task\ntask = env.get_example()\nprint(task[\"prompt\"])\n```\n\n## Task Types\n\n### Extraction Tasks\nThe agent extracts data from HTML using BeautifulSoup code. Graded against deterministic ground truth.\n\n```json\n{\n  \"status\": \"ok\",\n  \"answer\": \"The extracted text\"\n}\n```\n\n### Limitation Tasks\nSome content is intentionally unparseable (JS-rendered, image-based, etc.). The agent must recognize this and abstain with evidence.\n\n```json\n{\n  \"status\": \"limit\",\n  \"limit\": {\n    \"reason\": \"js_required\",\n    \"evidence\": \"<script>renderContent()\"\n  }\n}\n```\n\n## Reward Structure\n\n| Outcome | Reward |\n|---------|--------|\n| Correct extraction | +1.0 |\n| Correct limitation (with evidence) | +0.5 |\n| Wrong answer | 0.0 |\n| Safety violation | -0.5 |\n\n## PRIME-RL Training Configuration\n\n### Recommended Training Config\n\n```python\nfrom beautiful_soup_env import load_environment\n\n# Training configuration for PRIME-RL\nenv = load_environment(\n    split=\"train\",              # 1000 seeds per archetype\n    mode=\"all\",                 # All 52 archetypes (phase 1 + phase 2)\n    difficulty=\"mixed\",         # Mix all difficulties\n    executor_backend=\"prime\",   # Use Prime's sandboxed executor\n    network_access=False,       # Determinism + safety\n    timeout_s=30.0,\n    seed=42,                    # Reproducibility\n)\n\n# Alternative: Use mode=\"mvp\" for 29 core phase-1 archetypes only\n# Alternative: Use mode=\"tiered\" for difficulty-weighted sampling\n```\n\n### Key Training Parameters\n\n| Parameter | Training | Evaluation | Benchmark |\n|-----------|----------|------------|-----------|\n| `split` | `\"train\"` | `\"eval\"` | `\"bench\"` |\n| Seed range | 0-100k | 100k-110k | 110k+ (fixed) |\n| Examples/archetype | 1000 | 100 | 20 |\n\n### Mode Options\n\n| Mode | Archetypes | Description |\n|------|------------|-------------|\n| `\"mvp\"` | 29 | Phase 1 core archetypes (production-ready) |\n| `\"phase2\"` | 23 | Phase 2 archetypes (advanced) |\n| `\"all\"` | 52 | All archetypes, uniform sampling |\n| `\"tiered\"` | 52 | All archetypes with difficulty-weighted sampling |\n| `\"hard_only\"` | 18 | Only hard difficulty archetypes |\n| `\"bootstrap\"` | ~15 | Primer + easy archetypes for 0% models |\n\n### Bench Split Behavior\n\nThe `bench` split uses a **fixed manifest** for reproducibility across runs:\n- The `mode` parameter is **ignored** for bench - all 52 archetypes are included regardless of mode setting\n- `archetype_filter` and `difficulty` filters still apply if specified\n- Tasks are loaded from `bs4_env/data/bench_manifest.json` (20 seeds × 52 archetypes = 1040 tasks)\n- This ensures benchmark scores are comparable across environment versions\n\n## Running Evaluations\n\n### Local Testing\n```bash\n# Preview dataset\npython -m bs4_env.scripts.preview_dataset\n\n# Smoke test\npython -m bs4_env.scripts.smoke_eval_local\n```\n\n### Prime Evaluation\n```bash\n# Run benchmark evaluation\nprime env eval seconds-0/beautiful-soup-env -m meta-llama/llama-3.1-70b-instruct -n 100\n\n# With specific environment config\nprime env eval seconds-0/beautiful-soup-env \\\n  -a '{\"split\":\"bench\",\"mode\":\"mvp\"}' \\\n  -m <model-name> \\\n  -n 100\n```\n\n### Prime Sandbox Mode\n\nWhen running on Prime's infrastructure, the environment uses sandboxed execution:\n\n```python\nenv = load_environment(\n    executor_backend=\"prime\",  # Uses Prime's sandboxed executor\n    network_access=True,       # Required to install bs4/lxml in default image\n)\n```\n\n**Sandbox Configuration:**\n- **Default image**: Uses `python:3.11-slim` with `network_access=True` to pip install dependencies at runtime\n- **Prebuilt image (recommended for production)**: Use a custom Docker image with dependencies pre-installed and set `network_access=False`\n- Code execution timeout defaults to 30 seconds\n- A starter training config template is available at `configs/prime-rl/default.toml` (see also model-specific examples in the same directory)\n\n### Prebuilt Image Strategy (for `network_access=False`)\n\nFor deterministic training and security-isolated execution, you need a prebuilt Docker image since runtime `pip install` requires network access.\n\n**Required dependencies:**\n```dockerfile\nFROM python:3.11-slim\nRUN pip install --no-cache-dir beautifulsoup4 lxml html5lib\n```\n\n**Using the prebuilt image:**\n```python\nenv = load_environment(\n    executor_backend=\"prime\",\n    network_access=False,      # No runtime network access\n    docker_image=\"your-registry/bs4-prebuilt:latest\",  # Your prebuilt image\n)\n```\n\n**Benefits of prebuilt images:**\n- **Determinism**: No network variability or version drift from pip installs\n- **Speed**: No dependency installation overhead per sandbox\n- **Security**: Complete network isolation during code execution\n- **Reliability**: No failures from network issues or package availability\n\n## Development\n\n```bash\n# Run tests\npytest\n\n# Run with coverage\npytest --cov=bs4_env\n\n# Type checking\nmypy bs4_env\n\n# Linting\nruff check bs4_env\n```\n\n## Environment Card\n\n### Skill Description\n\nThis environment trains agents to use BeautifulSoup (BS4), Python's most popular HTML parsing library. Tasks cover real-world web scraping scenarios including:\n\n- **Core extraction**: Text, attributes, and structured data from HTML elements\n- **Traversal**: Navigating DOM trees, sibling relationships, parent/child\n- **Tables**: Complex tables with rowspan/colspan, nested tables\n- **Forms**: Input fields, labels, validation attributes\n- **Structured data**: JSON-LD, microdata, Open Graph metadata\n- **Error handling**: Malformed HTML, missing elements, API gotchas\n- **Internationalization**: Unicode, RTL text, CJK characters\n- **Limitation detection**: Recognizing when static parsing is impossible\n\n### Task Distribution\n\n| Metric | Count |\n|--------|-------|\n| **Total Archetypes** | 57 |\n| **Solvable Tasks** | 53 |\n| **Limitation Tasks** | 4 |\n\n**By Difficulty:**\n| Primer | Easy | Medium | Hard |\n|--------|------|--------|------|\n| 5 | 10 | 24 | 18 |\n\n**By Category:**\n| Category | Count | Description |\n|----------|-------|-------------|\n| hard | 11 | Complex extraction requiring multiple techniques |\n| advanced | 8 | Advanced BS4 features and edge cases |\n| forms | 6 | Form parsing and input extraction |\n| core_extraction | 5 | Basic text and attribute extraction |\n| primer | 5 | Ultra-simple tasks for 0% model bootstrap |\n| limitations | 4 | Tasks requiring abstention with evidence |\n| table_parsing | 4 | Table structure and cell extraction |\n| i18n | 3 | Internationalization and encoding |\n| bs4_gotchas | 3 | Common API pitfalls and mistakes |\n| error_bait | 3 | Tasks designed to trigger common errors |\n| traversal | 2 | DOM navigation and relationships |\n| structured_data | 2 | JSON-LD, microdata extraction |\n| output_normalization | 1 | Format normalization tasks |\n\n### Benchmark Stability\n\nThe benchmark split uses a fixed manifest (`bs4_env/data/bench_manifest.json`) containing 1040 pre-selected (archetype, seed) pairs. This ensures:\n\n- **Reproducibility**: Same tasks across runs and environment versions\n- **Isolation**: Adding/removing archetypes doesn't affect existing benchmark tasks\n- **Versioning**: Manifest version tracks benchmark changes (currently v1.2.0)\n\n## License\n\nMIT\n","encoding":"utf-8","truncated":false,"total_bytes":10840},"status":null}