Description
Feature request
Currently, when a column is cast with dataset.cast_column("image", Image(decode=True)), the pipeline raises an error and halts as soon as it hits an invalid or corrupted image (e.g., a truncated file, an unsupported format, or an unreachable URL). This disrupts large-scale processing, where a few faulty samples are common.
References:
https://discuss.huggingface.co/t/handle-errors-when-loading-images-404-corrupted-etc/50318/5
https://discuss.huggingface.co/t/handling-non-existing-url-in-image-dataset-while-cast-column/69185
Proposed Feature
Introduce a mechanism (e.g., a continue_on_error=True flag or global error handling mode) in Image(decode=True) that:
Skips invalid images and sets them as None, or
Logs the error but allows the rest of the dataset to be processed without interruption.
Example Usage
from datasets import load_dataset, Image

dataset = load_dataset("my_dataset")
# continue_on_error is the flag proposed in this issue; it does not exist in datasets today.
dataset = dataset.cast_column("image", Image(decode=True, continue_on_error=True))
Benefits
Ensures robust large-scale image dataset processing.
Improves developer productivity by avoiding custom retry/error-handling code.
Aligns with best practices in dataset preprocessing pipelines that tolerate minor data corruption.
Potential Implementation Options
Internally wrap the decoding in a try/except block.
Return None or a placeholder on failure.
Optionally allow custom error callbacks or logging.
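The first two options above can be sketched roughly as follows. This is a minimal illustration, not the datasets implementation: decode_or_placeholder and toy_decode are hypothetical names, and toy_decode is a stand-in that only checks for the PNG signature instead of performing real image decoding.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def toy_decode(data: bytes) -> bytes:
    """Stand-in decoder (assumption): accepts only payloads starting with the PNG signature."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("payload does not start with the PNG signature")
    return data


def decode_or_placeholder(
    decode: Callable[[bytes], bytes],
    data: bytes,
    placeholder: Optional[bytes] = None,
) -> Optional[bytes]:
    """Run decode(data); on any decoding error, log it and return the placeholder."""
    try:
        return decode(data)
    except Exception as exc:
        logger.warning("Skipping undecodable image: %s", exc)
        return placeholder


samples = [PNG_SIGNATURE + b"rest-of-file", b"garbage"]
decoded = [decode_or_placeholder(toy_decode, sample) for sample in samples]
```

Here the valid payload is returned unchanged and the invalid one becomes None, so one bad sample no longer stops the loop.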
Motivation
Robustness: Large-scale image datasets often contain a small fraction of corrupt files or unreachable URLs. Halting on the first error forces users to write custom workarounds or preprocess externally.
Simplicity: A built-in flag removes boilerplate try/except logic around every decode step.
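Performance: Skipping invalid samples inline can be more efficient than a two-pass approach (filter, then decode), because a validating filter pass typically has to open every image once before the real decode.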
Your contribution
API Change
Extend datasets.features.Image to accept a new keyword argument, continue_on_error: bool = False.
Behavior
If continue_on_error=False (default), maintain current behavior: any decode error raises an exception.
If continue_on_error=True, wrap decode logic in try/except:
On success: store the decoded image.
On failure: log a warning (e.g., via logging.warning) and set the field to None (or a sentinel value).
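The behavior described above could look roughly like this. It is only a sketch under stated assumptions: TolerantImageSketch is a hypothetical stand-in class, not datasets.features.Image, and its _decode_bytes helper uses UTF-8 decoding as a placeholder for real image decoding so that the example runs without extra dependencies.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("tolerant_image_sketch")


@dataclass
class TolerantImageSketch:
    """Hypothetical stand-in for the proposed feature; not the datasets class."""

    decode: bool = True
    continue_on_error: bool = False  # default keeps the current strict behavior

    def _decode_bytes(self, data: bytes) -> str:
        # Placeholder for real image decoding: invalid UTF-8 plays the role of a corrupt file.
        return data.decode("utf-8")

    def decode_example(self, data: bytes) -> Optional[str]:
        if not self.continue_on_error:
            # Strict mode: any decode error propagates to the caller.
            return self._decode_bytes(data)
        try:
            return self._decode_bytes(data)
        except Exception as exc:
            # Tolerant mode: log a warning and store None for this sample.
            logger.warning("Failed to decode image, storing None: %s", exc)
            return None


strict = TolerantImageSketch()
tolerant = TolerantImageSketch(continue_on_error=True)
```

With these two instances, strict.decode_example(b"\xff") raises, while tolerant.decode_example(b"\xff") logs a warning and returns None.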
Optional Enhancements
Allow a callback hook:
Image(decode=True, continue_on_error=True, on_error=lambda idx, url, exc: ...)
Emit metrics or counts of skipped images.
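As a rough illustration of the two enhancements, the sketch below calls an on_error(idx, url, exc) hook for each failure and counts skipped samples by exception type. The decode_all function, the (url, payload) item layout, and the ASCII stand-in decoder are assumptions made for the example; none of them exist in datasets.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

ErrorHook = Callable[[int, str, Exception], None]


def decode_all(
    items: Iterable[Tuple[str, bytes]],
    decode: Callable[[bytes], str],
    on_error: Optional[ErrorHook] = None,
) -> Tuple[List[Optional[str]], Counter]:
    """Decode (url, payload) pairs, reporting failures instead of raising."""
    decoded: List[Optional[str]] = []
    skipped: Counter = Counter()
    for idx, (url, payload) in enumerate(items):
        try:
            decoded.append(decode(payload))
        except Exception as exc:
            skipped[type(exc).__name__] += 1  # metric: skips per exception type
            if on_error is not None:
                on_error(idx, url, exc)  # user-supplied hook
            decoded.append(None)
    return decoded, skipped


errors: List[Tuple[int, str]] = []
items = [("a.png", b"ok"), ("b.png", b"\xff"), ("c.png", b"fine")]
decoded, skipped = decode_all(
    items,
    decode=lambda payload: payload.decode("ascii"),  # stand-in for image decoding
    on_error=lambda idx, url, exc: errors.append((idx, url)),
)
```

Keeping the counter separate from the hook lets callers get aggregate numbers even when they do not pass a callback.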