Graceful Error Handling for cast_column("image", Image(decode=True)) in Hugging Face Datasets #7632

@ganiket19

Feature request

Currently, when using dataset.cast_column("image", Image(decode=True)), the pipeline raises an error and halts as soon as any image in the dataset turns out to be invalid or corrupted (e.g., truncated files, incorrect formats, unreachable URLs). This disrupts large-scale processing, where a few faulty samples are common.

References:
https://discuss.huggingface.co/t/handle-errors-when-loading-images-404-corrupted-etc/50318/5
https://discuss.huggingface.co/t/handling-non-existing-url-in-image-dataset-while-cast-column/69185

Proposed Feature

Introduce a mechanism (e.g., a continue_on_error=True flag or global error handling mode) in Image(decode=True) that:

Skips invalid images and sets them as None, or

Logs the error but allows the rest of the dataset to be processed without interruption.

Example Usage

from datasets import load_dataset, Image

dataset = load_dataset("my_dataset")
# continue_on_error is the flag proposed in this issue, not an existing parameter.
dataset = dataset.cast_column("image", Image(decode=True, continue_on_error=True))

Benefits

Ensures robust large-scale image dataset processing.

Improves developer productivity by avoiding custom retry/error-handling code.

Aligns with best practices in dataset preprocessing pipelines that tolerate minor data corruption.

Potential Implementation Options

Internally wrap the decoding in a try/except block.

Return None or a placeholder on failure.

Optionally allow custom error callbacks or logging.
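
The first two options above could look roughly like the sketch below. It is not how datasets decodes images internally: decode_or_default and toy_decode are hypothetical names, and toy_decode is a stand-in for a real image decoder so the snippet runs with the standard library alone.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("safe_decode")


def decode_or_default(
    decode: Callable[[bytes], T],
    raw: bytes,
    default: Optional[T] = None,
) -> Optional[T]:
    """Run `decode` on `raw`; on any exception, log a warning and return `default`."""
    try:
        return decode(raw)
    except Exception as exc:  # deliberately broad: any decode failure is tolerated
        logger.warning("decode failed, using fallback: %r", exc)
        return default


def toy_decode(raw: bytes) -> str:
    """Hypothetical stand-in for an image decoder: accepts only a PNG-like header."""
    if not raw.startswith(b"\x89PNG"):
        raise ValueError("unrecognized image header")
    return "decoded image"


good = decode_or_default(toy_decode, b"\x89PNG payload")
bad = decode_or_default(toy_decode, b"not an image")  # falls back to None
placeholder = decode_or_default(toy_decode, b"not an image", default="placeholder")
```

With Pillow, the decode callable would presumably wrap PIL.Image.open; since that call is lazy, a real decoder would likely also need to call load() on the result so that truncated files fail inside the try block rather than later.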

Motivation

Robustness: Large-scale image datasets often contain a small fraction of corrupt files or unreachable URLs. Halting on the first error forces users to write custom workarounds or preprocess externally.

Simplicity: A built-in flag removes boilerplate try/except logic around every decode step.

Performance: Skipping invalid samples inline is more efficient than a two-pass approach (filter then decode).

Your contribution

  1. API Change
    Extend datasets.features.Image(decode=True) to accept continue_on_error: bool = False.

  2. Behavior
    If continue_on_error=False (default), maintain current behavior: any decode error raises an exception.
    If continue_on_error=True, wrap the decode logic in try/except:
      On success: store the decoded image.
      On failure: log a warning (e.g., via logging.warning) and set the field to None (or a sentinel value).

  3. Optional Enhancements
    Allow a callback hook:
      Image(decode=True, continue_on_error=True, on_error=lambda idx, url, exc: ...)
    Emit metrics or counts of skipped images.
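
To make the proposed semantics concrete, here is a minimal standalone sketch of the three points above. It does not touch the datasets library: decode_column, toy_decode, and the (url, raw bytes) sample layout are assumptions made for illustration, and only the names continue_on_error and on_error come from this proposal.

```python
import logging
from collections import Counter
from typing import Any, Callable, List, Optional, Tuple

logger = logging.getLogger("decode_column")

# Callback signature taken from the proposal: (row index, url, exception).
OnError = Callable[[int, str, Exception], None]


def decode_column(
    samples: List[Tuple[str, bytes]],
    decode: Callable[[bytes], Any],
    continue_on_error: bool = False,
    on_error: Optional[OnError] = None,
) -> Tuple[List[Any], Counter]:
    """Decode every sample; optionally tolerate failures and count them."""
    decoded: List[Any] = []
    skipped: Counter = Counter()  # failures grouped by exception type name
    for idx, (url, raw) in enumerate(samples):
        try:
            decoded.append(decode(raw))
        except Exception as exc:
            if not continue_on_error:
                raise  # default: keep today's fail-fast behavior
            skipped[type(exc).__name__] += 1
            logger.warning("row %d (%s): decode failed: %s", idx, url, exc)
            if on_error is not None:
                on_error(idx, url, exc)
            decoded.append(None)  # failed rows are stored as None
    return decoded, skipped


def toy_decode(raw: bytes) -> int:
    """Hypothetical decoder stand-in: rejects empty payloads."""
    if not raw:
        raise ValueError("empty payload")
    return len(raw)


samples = [("https://example.com/a.png", b"ok"), ("https://example.com/b.png", b"")]
failures: List[Tuple[int, str]] = []

decoded, skipped = decode_column(
    samples,
    toy_decode,
    continue_on_error=True,
    on_error=lambda idx, url, exc: failures.append((idx, url)),
)

# With the default continue_on_error=False, the same input raises.
try:
    decode_column(samples, toy_decode)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Returning the counter alongside the decoded values is just one way to surface skip counts; an actual implementation might prefer a logger summary or a dataset-level attribute.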


Labels: enhancement (New feature or request)
