Any advice on how to deal with running out of GPU memory? I'm just getting started with PyTorch and this package, and this is what happens when I try an initial test run with 7000 training steps (57,000 training images at 128x128, on a GPU with 15 GB of memory):
>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>>
>>> model = Unet(
... dim = 64,
... dim_mults = (1, 2, 4, 8)
... ).cuda()
>>>
>>> diffusion = GaussianDiffusion(
... model,
... image_size = 128,
... timesteps = 1000, # number of steps
... loss_type = 'l1' # L1 or L2
... ).cuda()
>>> trainer = Trainer(
... diffusion,
... 'training-set-2',
... train_batch_size = 32,
... train_lr = 2e-5,
... train_num_steps = 7000, # total training steps
... gradient_accumulate_every = 2, # gradient accumulation steps
... ema_decay = 0.995, # exponential moving average decay
... amp = True # turn on mixed precision
... )
>>>
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00, 1.90it/s]
loss: 0.2902: 14%|███▊ | 1001/7000 [55:22<5:31:53, 3.32s/it]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
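For what it's worth, here is the lower-memory variant I'm planning to try next: dropping train_batch_size and raising gradient_accumulate_every so the effective batch size stays the same (8 x 8 = 64, matching the original 32 x 2). These particular values are just a guess on my part, not something I've confirmed will fit in 15 GB. I also noticed the error message's suggestion about PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb, but I haven't tried that yet.

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,               # number of steps
    loss_type = 'l1'                # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'training-set-2',
    train_batch_size = 8,           # was 32; smaller batches need less activation memory
    train_lr = 2e-5,
    train_num_steps = 7000,         # total training steps
    gradient_accumulate_every = 8,  # was 2; 8 x 8 = 64 keeps the same effective batch size
    ema_decay = 0.995,              # exponential moving average decay
    amp = True                      # keep mixed precision on
)

trainer.train()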