
Running out of memory #74

@andrewmoise

Description


Any advice on how to deal with running out of GPU memory? I'm just getting started with PyTorch and this package, and this is what happens when I try an initial test run with 7000 training steps (57,000 training images at 128x128, on a GPU with 15 GB of memory):

>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>> 
>>> model = Unet(
...     dim = 64,
...     dim_mults = (1, 2, 4, 8)
... ).cuda()
>>> 
>>> diffusion = GaussianDiffusion(
...     model,
...     image_size = 128,
...     timesteps = 1000,   # number of steps
...     loss_type = 'l1'    # L1 or L2
... ).cuda()
>>> trainer = Trainer(
...     diffusion,
...     'training-set-2',
...     train_batch_size = 32,
...     train_lr = 2e-5,
...     train_num_steps = 7000,           # total training steps
...     gradient_accumulate_every = 2,    # gradient accumulation steps
...     ema_decay = 0.995,                # exponential moving average decay
...     amp = True                        # turn on mixed precision
... )
>>> 
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00,  1.90it/s]
loss: 0.2902:  14%|███▊                       | 1001/7000 [55:22<5:31:53,  3.32s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
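
In case it's useful context for any advice: the change I'd try next (an untested sketch on my part, with the specific values 16, 4, and 128 MB just guesses) is to halve train_batch_size and double gradient_accumulate_every, which should keep the same effective batch size of 64 while lowering peak activation memory per backward pass, plus setting PYTORCH_CUDA_ALLOC_CONF as the error message suggests:

import os
# Must be set before torch initializes CUDA; the OOM message suggests this to reduce fragmentation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,
    loss_type = 'l1'
).cuda()

trainer = Trainer(
    diffusion,
    'training-set-2',
    train_batch_size = 16,            # halved from 32 to cut per-step activation memory
    train_lr = 2e-5,
    train_num_steps = 7000,
    gradient_accumulate_every = 4,    # doubled so the effective batch size stays 16 * 4 = 64
    ema_decay = 0.995,
    amp = True
)

trainer.train()

Would that be the right direction, or is there a better-supported way to trade speed for memory here?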
