{"data":{"kind":"file","path":"README.md","version_id":"bhnl5khblhsxzprr07vd49ah","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9134,"modified_at":"2026-04-03T00:29:00.648000","content_hash":"8a7d6fffa1e617116396eca1170595efa0be1beab66e41b93a132e59c5fc5ffb"},"entries":[],"content":"# chart-extraction\n\n### Overview\n\n- **Environment ID**: `chart-extraction`\n- **Short description**: Extract structured line-chart data from chart images.\n- **Tags**: `single-turn`, `multimodal`, `vision`\n\n### Datasets\n\n- **Primary dataset(s)**: `13point5/line-ex`, a line-chart image dataset with chart text annotations and ground-truth series points.\n- **Source links**:\n  - Dataset: [13point5/line-ex on Hugging Face](https://huggingface.co/datasets/13point5/line-ex \"Hugging Face dataset for the chart-extraction environment\")\n  - Paper: [LineEX: Data Extraction From Scientific Line Charts](https://openaccess.thecvf.com/content/WACV2023/papers/P._LineEX_Data_Extraction_From_Scientific_Line_Charts_WACV_2023_paper.pdf \"Original LineEX paper from WACV 2023\")\n  - Upload and analysis repo: [13point5/line-ex-paper-analysis](https://github.com/13point5/line-ex-paper-analysis \"Scripts and analysis for uploading the original LineEX dataset to Hugging Face\") (includes the scripts used to upload the original LineEX paper dataset to Hugging Face)\n- **Split sizes**: `train` has 30,000 examples and `test` has 20,000 examples.\n\n### Task\n\n- **Type**: `single-turn`\n- **Parser**: depends on `system_prompt`\n- **Output format expectations**:\n  - `system_prompt=\"v1\"`: return a JSON object matching the selected chart extraction schema inside `<answer>...</answer>` tags\n  - `system_prompt=\"v2\"`: return step-by-step reasoning in `<reasoning>...</reasoning>` tags, followed by a JSON object matching the selected chart extraction schema in `<answer>...</answer>` tags\n- **Schema versions**:\n  - `v1`: original compact point format `[[x0, y0], [x1, y1], 
...]`\n  - `v2`: explicit point objects with `index`, `x`, and `y`\n- **System prompt versions**:\n  - `v1`: a standalone prompt that requires only an `<answer>` JSON block\n  - `v2`: a standalone prompt that requires `<reasoning>` followed by `<answer>` and adds explicit guidance to inspect ticks, colors, markers, series identity, and line points step by step before answering\n- **Schema implementation**:\n  - versioned model schemas live in [`schemas/v1.py`](./schemas/v1.py) and [`schemas/v2.py`](./schemas/v2.py)\n  - rewards use a schema-agnostic internal shape from [`schemas/canonical.py`](./schemas/canonical.py)\n  - both `v1` and `v2` parse into typed Pydantic models and then convert via `.to_canonical()` before scoring\n- **Rubric overview**: The main `reward` always includes the format reward plus the task rewards:\n  - `format_reward_func` (`weight = 1.0`): checks that the response follows the output format required by the selected `system_prompt`.\n  - `series_name_f1` (`weight = 1.0`): computes F1 between predicted series names and gold legend names.\n  - `series_point_count_ratio` (`weight = 2.0`): scores agreement on how many points each gold series contains, weighted by series length, but is gated to `0` whenever the raw point-value score falls below `0.3`.\n  - `series_point_value` (`weight = 2.0`): scores matched series points with a point-only OKS criterion, giving credit only when predicted points land close to labeled gold points after chart-scale normalization. 
It does not give credit for landing somewhere along the line segment between gold points.\n  - `series_point_count_ratio_raw` and `series_point_value_raw` (`weight = 0.0` metrics): log the ungated count and point-value scores and cache them in rubric state so later reward functions can reuse them without recomputing.\n- **Info payload**: The dataset `info` JSON includes `schema_version`, `system_prompt`, `expected_answer` in the same schema shown to the model, and the original dataset columns such as `chart_elements` and `lines` so you can compare raw annotations directly in Prime's UI.\n\n### Quickstart\n\nUse this config for now:\n\n```toml\nargs = { schema_version = \"v2\", series_point_value_oks_k = 0.05, series_point_value_oks_threshold = 0.35 }\n```\n\nRun an evaluation with a vision model:\n\n```bash\nprime eval run chart-extraction -m 'qwen/qwen3-vl-8b-instruct' -n 5 -r 1 --env-kwargs '{\"schema_version\":\"v2\",\"series_point_value_oks_k\":0.05,\"series_point_value_oks_threshold\":0.35}'\n```\n\nNotes:\n\n- Use `-n` / `--num-examples` to limit how many examples are evaluated.\n- `max_examples` is different from `-n`: it limits how many rows are loaded and transformed into the environment for each split before evaluation begins.\n\n### Environment Arguments\n\n- `schema_version`: chooses the output schema and matching gold `expected_answer` shape.\n  - default: `\"v1\"`\n  - `\"v1\"`: original compact point pairs\n  - `\"v2\"`: explicit indexed point objects\n- `system_prompt`: chooses which system prompt variant the model sees.\n  - default: `\"v1\"`\n  - `\"v1\"`: standalone prompt with `<answer>...</answer>` output only\n  - `\"v2\"`: standalone prompt with `<reasoning>...</reasoning><answer>...</answer>` output and deliberate chart-reading guidance\n- `max_examples`: caps both the train and eval dataset splits before transformation.\n  - default: `null`\n  - when set, the environment loads `train[:max_examples]` and `test[:max_examples]`\n  - useful for 
quick local evals where full image transformation is unnecessarily slow\n- `series_point_value_oks_k`: widens or tightens the point-value reward tolerance.\n  - default: `0.025`\n  - larger values are more forgiving\n  - must be greater than `0`\n- `series_point_value_oks_threshold`: sets how high the OKS score must be before a point counts as matched.\n  - default: `0.5`\n  - lower values are more forgiving\n  - must be between `0` and `1`\n\nThe environment always uses the dataset `train` split for rollouts and the `test` split for eval.\n\n### Metrics\n\n| Metric                     | Meaning                                                                        |\n| -------------------------- | ------------------------------------------------------------------------------ |\n| `reward`                   | Main scalar reward: format reward plus the task rewards                        |\n| `format_reward_func`       | Output-format adherence score from the XML parser reward                       |\n| `series_name_f1`           | F1 score for predicted series names versus gold legend names                   |\n| `series_point_count_ratio_raw` | Weighted agreement on the number of points in each gold series before gating |\n| `series_point_value_raw`   | Weighted point-only OKS score before any downstream reward gating              |\n| `series_point_count_ratio` | Weighted count reward after the `series_point_value_raw < 0.3` gate            |\n| `series_point_value`       | Weighted point-only OKS score for labeled gold points, without nearby line-segment credit |\n| `num_turns`                | Number of turns taken in the rollout                                           |\n\n### Parsing And Scoring Flow\n\n1. The environment chooses a schema version via `load_environment(schema_version=...)`.\n2. The environment independently chooses a system prompt variant via `load_environment(system_prompt=...)`.\n3. 
The parser format is chosen from the selected `system_prompt`:\n   - `v1`: `XMLParser([\"answer\"], answer_field=\"answer\")`\n   - `v2`: `XMLParser([\"reasoning\", \"answer\"], answer_field=\"answer\")`\n4. The model output is validated against the corresponding typed schema:\n   - `Chart_V1`\n   - `Chart_V2`\n5. The parsed schema object converts itself into `CanonicalChart` via `.to_canonical()`.\n6. All reward functions operate on that canonical internal representation.\n\nThis means the env does not infer schema version from payload shape. The configured `schema_version` is used directly for both model outputs and gold `expected_answer` parsing, while `system_prompt` controls both the prompt wording and the expected XML wrapper format.\n\n### `series_point_value` reward\n\nThis reward is a strict point-matching metric inspired by the OKS portion of the LineEX keypoint metric, but without the relaxed line-segment fallback.\n\nAlgorithm:\n\n1. Match predicted and gold series by exact series name.\n2. Collect all gold points across the chart and normalize `x` and `y` coordinates by the full gold chart span so the tolerance is scale-aware.\n3. For each predicted point in a matched series, find the nearest labeled gold point in that same series.\n4. Convert that normalized point distance `d` into an OKS score:\n\n```text\nOKS(d) = exp(-(d^2) / (2 * k^2)), where k defaults to 0.025\n```\n\n5. Count the predicted point as a match only if `OKS(d) > threshold`, where the default threshold is `0.5`.\n6. Score each series as:\n\n```text\nmatched_unique_gold_points / total_gold_points\n```\n\n7. 
Return the weighted average of those series scores, using the number of gold points in each series as the weight.\n\nImplications:\n\n- Small `x` and `y` errors around a labeled gold point can still earn credit.\n- A prediction does not earn credit just for lying near the curve between labeled points.\n- Extra predicted points do not help unless they land close enough to distinct labeled gold points.\n- To make this reward more forgiving during training, increase `series_point_value_oks_k`, lower `series_point_value_oks_threshold`, or do both through `--env-kwargs`.\n","encoding":"utf-8","truncated":false,"total_bytes":9134},"status":null}