
🐞 OOM of patchcore and padim on Intel iGPU #2886

@zyz9740

Description


Describe the bug

When training PaDiM and PatchCore on an Intel iGPU, an out-of-memory condition occurs and the training process is killed.

Dataset

MVTecAD

Model

PatchCore

Steps to reproduce the behavior

pip install anomalib[full]==2.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/xpu --force-reinstall
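After installing, it may be worth confirming that the XPU-enabled wheel is the one actually being imported. The `torch.xpu` namespace below is the device API shipped by the PyTorch XPU wheels; treat this as a sanity-check sketch, not part of the original reproduction:

```python
import importlib.util

# Sanity check: is torch installed, and does it expose the XPU backend?
# (torch.xpu is the device namespace provided by the XPU wheels.)
spec = importlib.util.find_spec("torch")
if spec is None:
    print("torch is not installed")
else:
    import torch

    xpu = getattr(torch, "xpu", None)
    available = bool(xpu is not None and xpu.is_available())
    print(f"torch {torch.__version__}, XPU available: {available}")
```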

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

import argparse
from pathlib import Path
from anomalib.data import MVTecAD
from anomalib.models import Patchcore
from anomalib.engine import Engine, SingleXPUStrategy, XPUAccelerator
from anomalib.data.utils import ValSplitMode


def arg_parser():
    parser = argparse.ArgumentParser(description='Train PatchCore model on MVTec bottle dataset', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-o', '--output', type=Path, default='ovmodels', required=False, help='Path to save the output OpenVINO IR.\n(default: %(default)s)')
    return parser.parse_args()

if __name__ == "__main__":
    args = arg_parser()
    
    print("Starting training PatchCore model on MVTec bottle dataset...")
    
    # Initialize the datamodule, model and engine for bottle category
    dataset_path = Path("../datasets/MVTecAD")

    datamodule = MVTecAD(
        root=dataset_path, 
        category="bottle", 
        train_batch_size=1, 
        eval_batch_size=1, 
        val_split_mode=ValSplitMode.FROM_TEST, 
        val_split_ratio=0.0001,
    )

    model = Patchcore()

    engine = Engine(
        strategy=SingleXPUStrategy(),
        accelerator=XPUAccelerator(),
    )
    
    # Train the model
    engine.fit(datamodule=datamodule, model=model)
    
    print("Training completed!")
    

OS information


  • OS: ubuntu 22.04
  • Python version: 3.12
  • Anomalib version: 2.1.0 (v2.0.0 works well; no OOM with either model)
  • PyTorch version: 2.8
  • CUDA/cuDNN version: NA
  • GPU models and configuration: 1270P iGPU
  • Any other relevant information: 32 GB RAM

Expected behavior

Training should complete successfully, just as it does on v2.0.0.
I believe both models should train on the MVTecAD dataset without OOM on an Intel iGPU.

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

main

Configuration YAML

NA

Logs

(anomalib_2_1) intel@intel-O-E-M:~/zyz/bmk_openvino/unsupervised/patchcore$ python train.py 
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Starting training PatchCore model on MVTec bottle dataset...
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Zero subset length encountered during splitting. This means one of your subsets
            might be empty or devoid of either normal or anomalous images.
Zero subset length encountered during splitting. This means one of your subsets
            might be empty or devoid of either normal or anomalous images.
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/lightning/pytorch/core/optimizer.py:183: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name           | Type           | Params | Mode 
----------------------------------------------------------
0 | pre_processor  | PreProcessor   | 0      | train
1 | post_processor | PostProcessor  | 0      | train
2 | evaluator      | Evaluator      | 0      | train
3 | model          | PatchcoreModel | 24.9 M | train
----------------------------------------------------------
24.9 M    Trainable params
0         Non-trainable params
24.9 M    Total params
99.450    Total estimated model params size (MB)
19        Modules in train mode
174       Modules in eval mode
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/lightning/pytorch/utilities/data.py:106: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
Epoch 0:  44%|████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                             | 93/209 [00:10<00:12,  9.28it/s]Killed
Killed
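For context, PatchCore accumulates patch embeddings from every training image into an in-memory bank and only applies coreset subsampling after the last batch, so peak RAM during fit grows with training-set size. A toy back-of-envelope sketch (the patch count and embedding width below are illustrative assumptions; 209 is the batch count from the progress bar, not a measured memory figure):

```python
# Rough estimate of PatchCore memory-bank growth during fit.
# patches_per_image and feature_dim are illustrative assumptions;
# n_images = 209 comes from the progress bar in the log above.
patches_per_image = 784   # e.g. a 28x28 feature map (assumption)
feature_dim = 384         # backbone embedding width (assumption)
n_images = 209

rows = patches_per_image * n_images      # patch embeddings held in RAM
bank_bytes = rows * feature_dim * 4      # float32 -> 4 bytes per value
print(f"memory bank: {rows} x {feature_dim} floats "
      f"~ {bank_bytes / 1e9:.2f} GB before coreset subsampling")
```

Under these assumed shapes the bank itself stays modest, which suggests the v2.0.0 to v2.1.0 regression may come from somewhere other than the raw bank size (e.g. intermediate copies during accumulation).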

Code of Conduct

  • I agree to follow this project's Code of Conduct
