Open
Description
Describe the bug
Training PaDiM and PatchCore on an Intel iGPU runs out of memory, and the training process is killed.
Dataset
MVTecAD
Model
PatchCore
Steps to reproduce the behavior
pip install anomalib[full]==2.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/xpu --force-reinstall
import os

# Must be set before any Hugging Face download is triggered.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

import argparse
from pathlib import Path

from anomalib.data import MVTecAD
from anomalib.data.utils import ValSplitMode
from anomalib.engine import Engine, SingleXPUStrategy, XPUAccelerator
from anomalib.models import Patchcore


def arg_parser():
    parser = argparse.ArgumentParser(
        description='Train PatchCore model on MVTec bottle dataset',
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '-o', '--output', type=Path, default='ovmodels', required=False,
        help='Path to save the output OpenVINO IR.\n(default: %(default)s)',
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = arg_parser()
    print("Starting training PatchCore model on MVTec bottle dataset...")

    # Initialize the datamodule, model and engine for bottle category
    dataset_path = Path("../datasets/MVTecAD")
    datamodule = MVTecAD(
        root=dataset_path,
        category="bottle",
        train_batch_size=1,
        eval_batch_size=1,
        val_split_mode=ValSplitMode.FROM_TEST,
        val_split_ratio=0.0001,
    )
    model = Patchcore()
    engine = Engine(
        strategy=SingleXPUStrategy(),
        accelerator=XPUAccelerator(),
    )

    # Train the model
    engine.fit(datamodule=datamodule, model=model)
    print("Training completed!")
OS information
- OS: Ubuntu 22.04
- Python version: 3.12
- Anomalib version: 2.1.0 (v2.0.0 works well, no OOM with either model)
- PyTorch version: 2.8
- CUDA/cuDNN version: NA
- GPU models and configuration: 1270P iGPU
- Any other relevant information: 32 GB memory
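For anyone trying to reproduce this, the environment details above can be collected with a short script. This is a minimal sketch, assuming a PyTorch build that exposes `torch.xpu`; the helper name `describe_env` is hypothetical and not part of anomalib. It reports `None` for the torch fields if `torch` is not installed.

```python
import platform


def describe_env() -> dict:
    """Collect basic environment facts; torch fields stay None if torch is missing."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": None,
        "xpu_available": None,
    }
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    # Not every PyTorch build ships a torch.xpu module, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    info["xpu_available"] = bool(xpu is not None and xpu.is_available())
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```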
Expected behavior
Training should complete without running out of memory, as it does on v2.0.0.
I believe both models should train successfully on the MVTecAD dataset on an Intel iGPU.
Screenshots
No response
Pip/GitHub
pip
What version/branch did you use?
main
Configuration YAML
NA
Logs
(anomalib_2_1) intel@intel-O-E-M:~/zyz/bmk_openvino/unsupervised/patchcore$ python train.py
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Starting training PatchCore model on MVTec bottle dataset...
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Zero subset length encountered during splitting. This means one of your subsets
might be empty or devoid of either normal or anomalous images.
Zero subset length encountered during splitting. This means one of your subsets
might be empty or devoid of either normal or anomalous images.
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/lightning/pytorch/core/optimizer.py:183: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
| Name | Type | Params | Mode
----------------------------------------------------------
0 | pre_processor | PreProcessor | 0 | train
1 | post_processor | PostProcessor | 0 | train
2 | evaluator | Evaluator | 0 | train
3 | model | PatchcoreModel | 24.9 M | train
----------------------------------------------------------
24.9 M Trainable params
0 Non-trainable params
24.9 M Total params
99.450 Total estimated model params size (MB)
19 Modules in train mode
174 Modules in eval mode
/home/intel/zyz/bmk_openvino/unsupervised/anomalib_2_1/lib/python3.12/site-packages/lightning/pytorch/utilities/data.py:106: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
Epoch 0: 44%|████████████████████████████████████████████████████████████████████████████████████████████████████ | 93/209 [00:10<00:12, 9.28it/s]Killed
Killed
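A bare `Killed` line like the one above is what the shell typically prints when a process receives SIGKILL. On Linux that is often the kernel OOM killer reacting to host RAM exhaustion (an iGPU shares system memory), though the log alone does not confirm it. One way to check is to log the process's peak resident set size while training runs. The following is a minimal sketch using only the standard library; `peak_rss_mib` is a hypothetical helper name, not an anomalib or Lightning API.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    before = peak_rss_mib()
    # Allocate and fill a 64 MiB buffer so the process actually uses the memory.
    buffer = b"\x01" * (64 * 1024 * 1024)
    after = peak_rss_mib()
    print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Calling `peak_rss_mib()` from a Lightning callback hook such as `on_train_batch_end` and printing it every few batches would show whether host memory grows steadily up to the point where the run dies at batch 93. The kernel log (`dmesg`) usually records OOM kills as well.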