{"data":{"kind":"file","path":"README.md","version_id":"q00lsgkbpwrr7zphelfeseqt","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1955,"modified_at":"2025-09-25T04:09:18.781000","content_hash":"2d44e2a304b38273c9636f7f1884abcb11ce7612d534d4447da7dc99e8a74b74"},"entries":[],"content":"# long-code-edit\n\n### Overview\n- **Environment ID**: `long-code-edit`\n- **Short description**: Long Context environment that asks a model to fix a corrupted code function when provided with the code and test cases.\n- **Tags**: `long-code-edit`, `train`, `eval`, `single-turn`, `long-context`, `code`\n\n### Datasets\n- **Primary dataset(s)**: `nreHieW/BigCodeBench-corrupted-long-context` and `nreHieW/BigCodeBench-corrupted-long-context-no-tests`\n- **Source links**: https://huggingface.co/datasets/nreHieW/BigCodeBench-corrupted-long-context and https://huggingface.co/datasets/nreHieW/BigCodeBench-corrupted-long-context-no-tests\n- **Split sizes**: 250 train \n\n### Task\n- **Type**: Single-turn\n- **Parser**: `CodeParser`\n- **Rubric overview**: The first rubric checks if the model correctly identifies the corrupted function name. The second rubric checks if the model correctly fixes the corrupted function.\n\n### Quickstart\nEnsure that the evaluator sandbox in `evaluator/Dockerfile` is built and running.\n\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval long-code-edit\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval long-code-edit   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"key\": \"value\"}'  # env-specific args as JSON\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\nThe environment supports the following arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_samples_per_bin` | int | `200` | Number of samples to use per bin |\n| `max_seq_len` | int | `128000` | Maximum sequence length |\n| `tests` | bool | `False` | Whether to use test cases |\n\n### Metrics\nThe rubric emits the following metrics:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `execution_score` | Score for the execution of the fixed function |\n| `detection_score` | Score for the detection of the corrupted function name |\n\n","encoding":"utf-8","truncated":false,"total_bytes":1955},"status":null}