Commit 74bd0b8

Docs for diffusion models (#976)
* Applies patch
* Revert changes on README.md
* Fixed typos and added details
* Updated CHANGELOG.md
* Fixed broken link in docs + added links to papers for DDPM++ and NCSN++

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
1 parent 399422a commit 74bd0b8

33 files changed: +1365 -781 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -10,8 +10,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1010

1111
### Added
1212

13+
- Improved documentation for diffusion models and diffusion utils.
14+
1315
### Changed
1416

17+
- physicsnemo.utils.generative renamed to physicsnemo.utils.diffusion
18+
1519
### Deprecated
1620

1721
### Removed

docs/api/physicsnemo.models.rst

Lines changed: 427 additions & 30 deletions
Large diffs are not rendered by default.

docs/api/physicsnemo.utils.rst

Lines changed: 18 additions & 8 deletions
@@ -56,21 +56,23 @@ consistent interfaces for data access.
5656
:members:
5757
:show-inheritance:
5858

59-
Generative utils
59+
.. _diffusion_utils:
60+
61+
Diffusion utils
6062
----------------
6163

62-
Tools for working with generative models, including deterministic and stochastic sampling utilities.
63-
These are particularly useful when implementing diffusion models or other generative approaches.
64+
Tools for working with diffusion models and other generative approaches,
65+
including deterministic and stochastic sampling utilities.
6466

65-
.. automodule:: physicsnemo.utils.generative.deterministic_sampler
67+
.. automodule:: physicsnemo.utils.diffusion.deterministic_sampler
6668
:members:
6769
:show-inheritance:
6870

69-
.. automodule:: physicsnemo.utils.generative.stochastic_sampler
71+
.. automodule:: physicsnemo.utils.diffusion.stochastic_sampler
7072
:members:
7173
:show-inheritance:
7274

73-
.. automodule:: physicsnemo.utils.generative.utils
75+
.. automodule:: physicsnemo.utils.diffusion.utils
7476
:members:
7577
:show-inheritance:
7678

@@ -100,11 +102,19 @@ and atmospheric parameters. These utilities are used extensively in weather pred
100102
.. automodule:: physicsnemo.utils.zenith_angle
101103
:show-inheritance:
102104

105+
.. _patching_utils:
106+
103107
Patching utils
104108
--------------
105109

106-
Utilities for handling data patching operations, particularly useful in image-based deep learning
107-
models where processing needs to be done on patches of the input data.
110+
Patching utilities are particularly useful for *patch-based* diffusion, also called
111+
*multi-diffusion*. This approach is used to scale diffusion to very large images.
112+
The following patching utilities extract patches from 2D images, and typically gather
113+
them in the batch dimension. A batch of patches is therefore composed of multiple
114+
smaller patches extracted from each sample in the original batch of larger
115+
images. Diffusion models can then process these patches independently. These
116+
utilities also support fusing operations to reconstruct the entire predicted
117+
image from the individual predicted patches.
108118

109119
.. automodule:: physicsnemo.utils.patching
110120
:members:
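
As a rough illustration of the patch-and-fuse idea described above, the following sketch extracts non-overlapping patches from a batch of 2D images, gathers them in the batch dimension, and reassembles the images. It uses plain nested lists and hypothetical helper names (`extract_patches`, `fuse_patches`); it is not the API of `physicsnemo.utils.patching`, which operates on tensors and may treat overlap and boundaries differently.

```python
from typing import List

Image = List[List[float]]  # one 2D image as rows of pixel values


def extract_patches(batch: List[Image], ph: int, pw: int) -> List[Image]:
    """Split each image into non-overlapping (ph x pw) patches.

    Patches from all images are gathered along the batch dimension,
    image by image, in row-major patch order.
    """
    patches = []
    for img in batch:
        h, w = len(img), len(img[0])
        if h % ph or w % pw:
            raise ValueError("image size must be a multiple of the patch size")
        for top in range(0, h, ph):
            for left in range(0, w, pw):
                patches.append([row[left:left + pw] for row in img[top:top + ph]])
    return patches


def fuse_patches(patches: List[Image], h: int, w: int) -> List[Image]:
    """Inverse of extract_patches: rebuild (h x w) images from patches."""
    ph, pw = len(patches[0]), len(patches[0][0])
    cols = w // pw
    per_image = (h // ph) * cols
    batch = []
    for start in range(0, len(patches), per_image):
        img = [[0.0] * w for _ in range(h)]
        for k, patch in enumerate(patches[start:start + per_image]):
            top, left = (k // cols) * ph, (k % cols) * pw
            for i, row in enumerate(patch):
                img[top + i][left:left + pw] = row
        batch.append(img)
    return batch


# Two 4x4 images -> eight 2x2 patches -> two 4x4 images again.
images = [
    [[float(b * 16 + r * 4 + c) for c in range(4)] for r in range(4)]
    for b in range(2)
]
patches = extract_patches(images, 2, 2)
restored = fuse_patches(patches, 4, 4)
```

In a patch-based diffusion setting, the model would be applied to `patches` between the two calls, so that the fused result is the full predicted image.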

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -53,7 +53,7 @@
5353
("index", "rst2pdf", "Sample rst2pdf doc", "Your Name"),
5454
]
5555

56-
napoleon_custom_sections = ["Variable Shape"]
56+
napoleon_custom_sections = [("Variable Shape", "notes"), ("Forward", "params_style"), ("Outputs", "returns_style")]
5757

5858
# -- Options for HTML output -------------------------------------------------
5959

@@ -131,4 +131,4 @@
131131
("index", "rst2pdf", "Sample rst2pdf doc", "Your Name"),
132132
]
133133

134-
napoleon_custom_sections = ["Variable Shape"]
134+
# napoleon_custom_sections = ["Variable Shape"]
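
With the new setting, Napoleon should render a `Forward` section in the style of a parameter list and an `Outputs` section in the style of a returns block, while `Variable Shape` is treated like a `Notes` section. A docstring using these sections might look like the sketch below. The class `ToyDenoiser` and its arguments are hypothetical and only illustrate the layout; NumPy-style docstrings are assumed.

```python
class ToyDenoiser:
    """Hypothetical model used only to illustrate the custom sections.

    Parameters
    ----------
    scale : float
        Factor applied to the input.

    Forward
    -------
    x : list of float
        Noisy input values.
    sigma : float
        Noise level.

    Outputs
    -------
    list of float
        Denoised values, same length as ``x``.
    """

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

    def forward(self, x, sigma):
        # Toy rule: shrink the input more as the noise level grows.
        return [self.scale * v / (1.0 + sigma) for v in x]
```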

examples/cfd/flow_reconstruction_diffusion/dataset/dataset.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
2323
import numpy as np
2424
import PIL.Image
2525
import torch
26-
from physicsnemo.utils.generative import EasyDict
26+
from physicsnemo.utils.diffusion import EasyDict
2727

2828
try:
2929
import pyspng

examples/cfd/flow_reconstruction_diffusion/generate.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
2323
import torch
2424
import tqdm
2525
from omegaconf import DictConfig
26-
from physicsnemo.utils.generative.utils import StackedRandomGenerator
26+
from physicsnemo.utils.diffusion.utils import StackedRandomGenerator
2727

2828
from misc import open_url
2929

examples/cfd/flow_reconstruction_diffusion/train.py

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@
2929
import torch
3030
from omegaconf import DictConfig
3131
from training_loop import training_loop
32-
from physicsnemo.utils.generative.utils import EasyDict, construct_class_by_name
32+
from physicsnemo.utils.diffusion.utils import EasyDict, construct_class_by_name
3333

3434
from physicsnemo.distributed import DistributedManager
3535
from physicsnemo.launch.logging import PythonLogger, RankZeroLoggingWrapper

examples/cfd/flow_reconstruction_diffusion/training_loop.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
2727
import torch
2828
from torch.nn.parallel import DistributedDataParallel
2929
from training_stats import default_collector, report, report0
30-
from physicsnemo.utils.generative.utils import (
30+
from physicsnemo.utils.diffusion.utils import (
3131
InfiniteSampler,
3232
check_ddp_consistency,
3333
construct_class_by_name,

examples/cfd/flow_reconstruction_diffusion/training_stats.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
2323

2424
import numpy as np
2525
import torch
26-
from physicsnemo.utils.generative.utils import EasyDict, profiled_function
26+
from physicsnemo.utils.diffusion.utils import EasyDict, profiled_function
2727

2828
# ----------------------------------------------------------------------------
2929
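
The import changes above all follow from the rename of `physicsnemo.utils.generative` to `physicsnemo.utils.diffusion`. Downstream code that has to run against releases from both before and after the rename could resolve the module at import time. The helper below is a hypothetical sketch (`import_first_available` is not part of PhysicsNeMo): it tries each candidate module path in order and returns the first one that imports.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(f"none of these modules could be imported: {list(candidates)}")


# Intended use (requires PhysicsNeMo to be installed):
# utils = import_first_available(
#     ["physicsnemo.utils.diffusion.utils", "physicsnemo.utils.generative.utils"]
# )
# EasyDict = utils.EasyDict
```

Listing the new path first means the old one is only consulted on older installations.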

examples/weather/corrdiff/README.md

Lines changed: 57 additions & 33 deletions
@@ -42,7 +42,9 @@ weather forecasts.
4242

4343
To get started with CorrDiff, we provide a simplified version called CorrDiff-Mini that combines:
4444

45-
1. A smaller neural network architecture that reduces memory usage and training time
45+
1. A smaller neural network architecture that reduces memory usage and training
46+
time.
47+
4648
2. A reduced training dataset, based on the HRRR dataset, that contains fewer samples (available at [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/modulus/resources/modulus_datasets-hrrr_mini))
4749

4850
Together, these modifications reduce training time from thousands of GPU hours to around 10 hours on A100 GPUs. The simplified data loader included with CorrDiff-Mini also serves as a helpful example for training CorrDiff on custom datasets. Note that CorrDiff-Mini is intended for learning and educational purposes only - its predictions should not be used for real applications.
@@ -74,14 +76,21 @@ CorrDiff training is managed through `train.py` and uses YAML configuration file
7476
- `conf/config_generate_gefs_hrrr.yaml` - Settings for generating predictions using GEFS-HRRR models
7577
- `conf/config_generate_custom.yaml` - Template configuration for generation with custom trained models
7678

77-
To select a specific configuration, use the `--config-name` option when running the training script. Each training configuration file defines three main components:
78-
1. Training dataset parameters
79-
2. Model architecture settings
80-
3. Training hyperparameters
79+
To select a specific configuration, use the `--config-name` option when running
80+
the training script. Each training configuration file defines three main
81+
components:
82+
83+
1. Training dataset parameters.
84+
85+
2. Model architecture settings.
86+
87+
3. Training hyperparameters.
8188

8289
You can modify configuration options in two ways:
83-
1. **Direct Editing**: Modify the YAML files directly
84-
2. **Command Line Override**: Use Hydra's `++` syntax to override settings at runtime
90+
91+
1. **Direct Editing**: Modify the YAML files directly.
92+
93+
2. **Command Line Override**: Use Hydra's `++` syntax to override settings at runtime.
8594

8695
For example, to change the training batch size (controlled by `training.hp.total_batch_size`):
8796
```bash
@@ -93,8 +102,10 @@ This modular configuration system allows for flexible experimentation while main
93102
### Training the regression model
94103

95104
CorrDiff uses a two-step training process:
96-
1. Train a deterministic regression model
97-
2. Train a diffusion model using the pre-trained regression model
105+
106+
1. Train a deterministic regression model.
107+
108+
2. Train a diffusion model using the pre-trained regression model.
98109

99110
For the CorrDiff-Mini regression model, we use the following configuration components:
100111

@@ -113,7 +124,7 @@ This configuration automatically loads these specific files from `conf/base`:
113124

114125
These base configuration files contain more detailed settings that are less commonly modified but give fine-grained control over the training process.
115126

116-
To begin training, execute the following command using [`train.py`](train.py):
127+
To begin training, execute the following command using [train.py](train.py):
117128
```bash
118129
python train.py --config-name=config_training_hrrr_mini_regression.yaml
119130
```
@@ -132,7 +143,7 @@ After successfully training the regression model, you can proceed with training
132143

133144
- A pre-trained regression model checkpoint
134145
- The same dataset used for regression training
135-
- Configuration file [`conf/config_training_hrrr_mini_diffusion.yaml`](conf/config_training_hrrr_mini_diffusion.yaml)
146+
- Configuration file [conf/config_training_hrrr_mini_diffusion.yaml](conf/config_training_hrrr_mini_diffusion.yaml)
136147

137148
To start the diffusion model training, execute:
138149
```bash
@@ -145,12 +156,12 @@ The training will generate checkpoints in the `checkpoints_diffusion` directory.
145156

146157
### Generation
147158

148-
Once both models are trained, you can use [`generate.py`](generate.py) to create new predictions. The generation process requires:
159+
Once both models are trained, you can use [generate.py](generate.py) to create new predictions. The generation process requires:
149160

150161
**Required Files:**
151162
- Trained regression model checkpoint
152163
- Trained diffusion model checkpoint
153-
- Configuration file [`conf/config_generate_hrrr_mini.yaml`](conf/config_generate_hrrr_mini.yaml)
164+
- Configuration file [conf/config_generate_hrrr_mini.yaml](conf/config_generate_hrrr_mini.yaml)
154165

155166
Execute the generation command:
156167
```bash
@@ -183,9 +194,11 @@ The Taiwan example demonstrates CorrDiff training on a high-resolution weather d
183194

184195
The Taiwan example supports three types of models, each serving a different purpose:
185196

186-
1. **Regression Model**: Basic deterministic model
187-
2. **Diffusion Model**: Full probabilistic model
188-
3. **Patch-based Diffusion Model**: Memory-efficient variant that processes small spatial regions to improve scalability
197+
1. **Regression Model**: Basic deterministic model.
198+
199+
2. **Diffusion Model**: Full probabilistic model.
200+
201+
3. **Patch-based Diffusion Model**: Memory-efficient variant that processes small spatial regions to improve scalability.
189202

190203
The patch-based approach divides the target region into smaller subsets during both training and generation, making it particularly useful for memory-constrained environments or large spatial domains.
191204

@@ -223,20 +236,20 @@ To switch between model types, simply change the configuration name in the train
223236

224237
The evaluation pipeline for CorrDiff models consists of two main components:
225238

226-
1. **Sample Generation** ([`generate.py`](generate.py)):
227-
Generates predictions and saves them in a netCDF file format. The process uses configuration settings from [`conf/config_generate.yaml`](conf/config_generate.yaml).
239+
1. **Sample Generation** ([generate.py](generate.py)):
240+
Generates predictions and saves them in a netCDF file format. The process uses configuration settings from [conf/config_generate.yaml](conf/config_generate.yaml).
228241
```bash
229242
python generate.py --config-name=config_generate_taiwan.yaml
230243
```
231244

232-
2. **Performance Scoring** ([`score_samples.py`](score_samples.py)):
245+
2. **Performance Scoring** ([score_samples.py](score_samples.py)):
233246
Computes both deterministic metrics (like MSE, MAE) and probabilistic scores for the generated samples.
234247
```bash
235248
python score_samples.py path=<PATH_TO_NC_FILE> output=<OUTPUT_FILE>
236249
```
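
For intuition about the two kinds of scores, the sketch below computes MSE and MAE for a single prediction and a sample-based CRPS estimate for an ensemble at one grid point. This is an illustration in plain Python, not the code in `score_samples.py`: the function names are hypothetical, and CRPS is assumed here as a representative probabilistic score.

```python
from typing import Sequence


def mse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error between a prediction and the ground truth."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error between a prediction and the ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def crps_ensemble(members: Sequence[float], truth: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(members)
    skill = sum(abs(m - truth) for m in members) / n
    spread = sum(abs(a - b) for a in members for b in members) / (n * n)
    return skill - 0.5 * spread
```

With a single ensemble member the CRPS estimate reduces to the absolute error, which is why it is often read as a probabilistic generalization of MAE.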
237250

238251
For visualization and analysis, you have several options:
239-
- Use the plotting scripts in the [`inference`](inference/) directory
252+
- Use the plotting scripts in the [inference](inference/) directory
240253
- Visualize results with [Earth2Studio](https://github.com/NVIDIA/earth2studio)
241254
- Create custom visualizations using the NetCDF4 output structure
242255

@@ -302,7 +315,7 @@ This repository includes examples of **CorrDiff** training on specific datasets,
302315

303316
### Defining a Custom Dataset
304317

305-
To train CorrDiff on a custom dataset, you need to implement a custom dataset class that inherits from `DownscalingDataset` defined in [`datasets/base.py`](./datasets/base.py). This base class defines the interface that all dataset implementations must follow.
318+
To train CorrDiff on a custom dataset, you need to implement a custom dataset class that inherits from `DownscalingDataset` defined in [datasets/base.py](./datasets/base.py). This base class defines the interface that all dataset implementations must follow.
306319

307320
**Required Implementation:**
308321

@@ -348,17 +361,21 @@ def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, Optional[to
348361

349362
**Important Notes:**
350363
- The training script will automatically:
364+
351365
1. Parse the `type` field to locate your dataset file and class
366+
352367
2. Register your custom dataset class using `register_dataset()`
368+
353369
3. Pass all other fields in the `dataset` section as kwargs to your class constructor
370+
354371
- All tensors should be properly normalized (use `normalize_input`/`normalize_output` methods if needed)
355372
- Ensure consistent dimensions across all samples
356373
- Channel metadata should accurately describe your data variables
357374

358375

359376
For reference implementations of dataset classes, look at:
360-
- [`datasets/hrrrmini.py`](./datasets/hrrrmini.py) - Simple example using NetCDF format
361-
- [`datasets/cwb.py`](./datasets/cwb.py) - More complex example
377+
- [datasets/hrrrmini.py](./datasets/hrrrmini.py) - Simple example using NetCDF format
378+
- [datasets/cwb.py](./datasets/cwb.py) - More complex example
362379

363380

364381
### Training configuration
@@ -413,12 +430,19 @@ model. During training, you can fine-tune various parameters. The most commonly
413430
414431
> **Note on Patch Size Selection**
415432
> When implementing a patch-based training, choosing the right patch size is critical for model performance. The patch dimensions are controlled by `patch_shape_x` and `patch_shape_y` in your configuration file. To determine optimal patch sizes:
433+
>
416434
> 1. Train a regression model on the full domain.
417-
> 2. Compute the residuals `x_res = x_data - regression_model(x_data)` on multiple samples, where `x_data` are ground truth samples.
418-
> 3. Calculate the auto-correlation function of your residuals using the provided utilities in [`inference/power_spectra.py`](./inference/power_spectra.py):
435+
>
436+
> 2. Compute the residuals `x_res = x_data - regression_model(x_data)` on
437+
> multiple samples, where `x_data` are ground truth samples.
438+
>
439+
> 3. Calculate the auto-correlation function of your residuals using the provided utilities in [inference/power_spectra.py](./inference/power_spectra.py):
419440
> - `average_power_spectrum()`
420441
> - `power_spectra_to_acf()`
421-
> 4. Set patch dimensions to match or exceed the distance at which auto-correlation approaches zero.
442+
>
443+
> 4. Set patch dimensions to match or exceed the distance at which
444+
> auto-correlation approaches zero.
445+
>
422446
> 5. This ensures each patch captures the full spatial correlation structure of your data.
423447
>
424448
> This analysis helps balance computational efficiency with the preservation of local physical relationships in your data.
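
Steps 3 and 4 of the note above can be illustrated on a 1D transect of residuals. The sketch below does not call `average_power_spectrum()` or `power_spectra_to_acf()`, whose signatures are not shown here. Instead, it computes the auto-correlation directly in plain Python and returns the first lag at which it falls to or below a small threshold. The function names and the 0.05 threshold are assumptions.

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Normalized (biased) auto-correlation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


def decorrelation_lag(acf: Sequence[float], threshold: float = 0.05) -> int:
    """First lag whose auto-correlation is at or below `threshold`."""
    for lag, value in enumerate(acf):
        if value <= threshold:
            return lag
    return len(acf)  # never decorrelates within the lags examined


# Synthetic residual transect with period 8 (stand-in for real residuals).
residuals = [1.0, 1.0, 0.0, -1.0, -1.0, -1.0, 0.0, 1.0] * 8
acf = autocorrelation(residuals, max_lag=8)
min_patch_size = decorrelation_lag(acf)
```

In 2D the same idea applies along each axis, and `patch_shape_x` and `patch_shape_y` would then be chosen at or above the corresponding lags.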
@@ -477,11 +501,11 @@ The generated samples are saved in a NetCDF file with three main components:
477501
478502
Training from scratch is recommended for all other cases.
479503
480-
1. **How many samples are needed to train a CorrDiff model?**
504+
2. **How many samples are needed to train a CorrDiff model?**
481505
The more, the better. As a rule of thumb, at least 50,000 samples are necessary.
482506
*Note: For patch-based diffusion, each patch can be counted as a sample.*
483507
484-
2. **How many GPUs are required to train CorrDiff?**
508+
3. **How many GPUs are required to train CorrDiff?**
485509
A single GPU is sufficient as long as memory is not exhausted, but this may
486510
result in extremely slow training. To accelerate training, CorrDiff
487511
leverages distributed data parallelism. The total training wall-clock time
@@ -491,23 +515,23 @@ The generated samples are saved in a NetCDF file with three main components:
491515
patch-based diffusion models, decrease the patch size—ensuring it remains
492516
larger than the auto-correlation distance.
493517
494-
3. **How long does it take to train CorrDiff on a custom dataset?**
518+
4. **How long does it take to train CorrDiff on a custom dataset?**
495519
Training CorrDiff on the continental United States dataset required
496520
approximately 5,000 A100 GPU hours. This corresponds to roughly 80 hours of
497521
wall-clock time with 64 GPUs. You can expect the cost to scale
498522
linearly with the number of samples available.
499523
500-
4. **What are CorrDiff's current limitations for custom datasets?**
524+
5. **What are CorrDiff's current limitations for custom datasets?**
501525
The main limitation of CorrDiff is the maximum _downscaling ratio_ it can
502526
achieve. For a purely spatial super-resolution task (where input and output variables are the same), CorrDiff can reliably achieve a maximum resolution scaling of ×16. If the task involves inferring new output variables, the maximum reliable spatial super-resolution is ×11.
503527
504-
5. **What does a successful training look like?**
528+
6. **What does a successful training look like?**
505529
In a successful training run, the loss function should decrease monotonically, as shown below:
506530
<p align="center">
507531
<img src="../../../docs/img/corrdiff_training_loss.png"/>
508532
</p>
509533
510-
6. **Which hyperparameters are most important?**
534+
7. **Which hyperparameters are most important?**
511535
One of the most crucial hyperparameters is the patch size for a patch-based
512536
diffusion model (`patch_shape_x` and `patch_shape_y` in the configuration file). A larger
513537
patch size increases computational cost and GPU memory requirements, while a
@@ -530,7 +554,7 @@ The generated samples are saved in a NetCDF file with three main components:
530554
processed in parallel on each GPU. It needs to be reduced if you encounter
531555
an out-of-memory error.
532556
533-
7. **How do I set up validation during training?**
557+
8. **How do I set up validation during training?**
534558
CorrDiff supports validation during training through its configuration system. The validation approach is based on a separate validation configuration that inherits from and selectively overrides the training dataset settings.
535559
536560
**Configuration Example**:
