examples/weather/corrdiff/README.md
To get started with CorrDiff, we provide a simplified version called CorrDiff-Mini that combines:

1. A smaller neural network architecture that reduces memory usage and training time.
2. A reduced training dataset, based on the HRRR dataset, that contains fewer samples (available at [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/modulus/resources/modulus_datasets-hrrr_mini)).

Together, these modifications reduce training time from thousands of GPU hours to around 10 hours on A100 GPUs. The simplified data loader included with CorrDiff-Mini also serves as a helpful example for training CorrDiff on custom datasets. Note that CorrDiff-Mini is intended for learning and educational purposes only; its predictions should not be used for real applications.
- `conf/config_generate_gefs_hrrr.yaml` - Settings for generating predictions using GEFS-HRRR models
- `conf/config_generate_custom.yaml` - Template configuration for generation with custom trained models

To select a specific configuration, use the `--config-name` option when running the training script. Each training configuration file defines three main components:

1. Training dataset parameters.
2. Model architecture settings.
3. Training hyperparameters.

You can modify configuration options in two ways:

1. **Direct Editing**: Modify the YAML files directly.
2. **Command Line Override**: Use Hydra's `++` syntax to override settings at runtime.

For example, to change the training batch size (controlled by `training.hp.total_batch_size`):

```bash
python train.py --config-name=<config> ++training.hp.total_batch_size=64
```
This modular configuration system allows for flexible experimentation while maintaining reproducibility.

### Training the regression model

CorrDiff uses a two-step training process:

1. Train a deterministic regression model.
2. Train a diffusion model using the pre-trained regression model.

For the CorrDiff-Mini regression model, we use the following configuration components:

This configuration automatically loads these specific files from `conf/base`:

These base configuration files contain more detailed settings that are less commonly modified but give fine-grained control over the training process.

To begin training, execute the following command using [train.py](train.py):
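A typical invocation looks like the following; the config name is illustrative here, so substitute the regression training configuration you selected above (with Hydra, `--config-name` takes the file name without the `.yaml` extension):

```bash
# Illustrative: select the regression training config by name
python train.py --config-name=<regression_training_config>
```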
3. **Patch-based Diffusion Model**: Memory-efficient variant that processes small spatial regions to improve scalability.

The patch-based approach divides the target region into smaller subsets during both training and generation, making it particularly useful for memory-constrained environments or large spatial domains.

To switch between model types, simply change the configuration name in the training command.

The evaluation pipeline for CorrDiff models consists of two main components:

Generates predictions and saves them in a netCDF file format. The process uses configuration settings from [conf/config_generate.yaml](conf/config_generate.yaml).

For visualization and analysis, you have several options:

- Use the plotting scripts in the [inference](inference/) directory
- Visualize results with [Earth2Studio](https://github.com/NVIDIA/earth2studio)
- Create custom visualizations using the NetCDF4 output structure
### Defining a Custom Dataset

To train CorrDiff on a custom dataset, you need to implement a custom dataset class that inherits from `DownscalingDataset` defined in [datasets/base.py](./datasets/base.py). This base class defines the interface that all dataset implementations must follow.

1. Parse the `type` field to locate your dataset file and class
2. Register your custom dataset class using `register_dataset()`
3. Pass all other fields in the `dataset` section as kwargs to your class constructor

- All tensors should be properly normalized (use `normalize_input`/`normalize_output` methods if needed)
- Ensure consistent dimensions across all samples
- Channel metadata should accurately describe your data variables

For reference implementations of dataset classes, look at:

- [datasets/hrrrmini.py](./datasets/hrrrmini.py) - Simple example using NetCDF format
- [datasets/cwb.py](./datasets/cwb.py) - More complex example
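A minimal sketch of what such a class might look like, using an in-memory toy dataset; `ToyDownscalingDataset` is a stand-in for a subclass of `DownscalingDataset`, and everything beyond `__len__`/`__getitem__` is an illustrative assumption rather than the repo's exact API:

```python
import numpy as np

# Hedged sketch: a stand-in for a DownscalingDataset subclass. In a real
# implementation you would inherit from datasets.base.DownscalingDataset and
# register the class with register_dataset(); here we only illustrate the
# shape of the interface with a self-contained toy.
class ToyDownscalingDataset:
    """In-memory dataset pairing coarse inputs with fine-resolution targets."""

    def __init__(self, n_samples=8, in_channels=2, out_channels=3, size=16):
        rng = np.random.default_rng(0)
        self.inputs = rng.normal(size=(n_samples, in_channels, size, size)).astype("float32")
        self.targets = rng.normal(size=(n_samples, out_channels, size, size)).astype("float32")
        # Per-channel statistics, analogous to what normalize_input /
        # normalize_output would rely on in the real base class.
        self.mean = self.inputs.mean(axis=(0, 2, 3), keepdims=True)
        self.std = self.inputs.std(axis=(0, 2, 3), keepdims=True)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return a normalized input alongside its target, consistently shaped.
        x = (self.inputs[idx] - self.mean[0]) / self.std[0]
        return x, self.targets[idx]

    def input_channels(self):
        # Channel metadata should accurately describe the variables
        # (hypothetical names for illustration).
        return ["u10m", "v10m"]

ds = ToyDownscalingDataset()
x, y = ds[0]
```

The key points are that every sample has consistent dimensions and that normalization happens inside the dataset, so the training loop never sees raw units.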
### Training configuration
> **Note on Patch Size Selection**
>
> When implementing patch-based training, choosing the right patch size is critical for model performance. The patch dimensions are controlled by `patch_shape_x` and `patch_shape_y` in your configuration file. To determine optimal patch sizes:
>
> 1. Train a regression model on the full domain.
> 2. Compute the residuals `x_res = x_data - regression_model(x_data)` on multiple samples, where `x_data` are ground truth samples.
> 3. Calculate the auto-correlation function of your residuals using the provided utilities in [inference/power_spectra.py](./inference/power_spectra.py):
>    - `average_power_spectrum()`
>    - `power_spectra_to_acf()`
> 4. Set patch dimensions to match or exceed the distance at which auto-correlation approaches zero.
> 5. This ensures each patch captures the full spatial correlation structure of your data.
>
> This analysis helps balance computational efficiency with the preservation of local physical relationships in your data.
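The computation in steps 2-3 can be sketched with plain NumPy; this is a simplified one-dimensional stand-in for the repo's `average_power_spectrum()`/`power_spectra_to_acf()` utilities, not their actual API:

```python
import numpy as np

def residual_acf(residuals):
    """Estimate the lag auto-correlation of residuals (n_samples, n_points)
    via the Wiener-Khinchin theorem: the ACF is the inverse FFT of the
    sample-averaged power spectrum."""
    x = residuals - residuals.mean(axis=1, keepdims=True)
    power = np.abs(np.fft.rfft(x, axis=1)) ** 2  # per-sample power spectrum
    acf = np.fft.irfft(power.mean(axis=0))       # averaged spectrum -> ACF
    return acf / acf[0]                          # normalize so acf[0] == 1

# Synthetic "residuals": white noise smoothed with a Gaussian kernel
# (sigma = 10 grid points), so correlations decay well before lag ~100.
rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 512))
kernel = np.exp(-np.arange(-30, 31) ** 2 / (2 * 10.0**2))
residuals = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, noise)
acf = residual_acf(residuals)
```

In this synthetic case the ACF is near zero by lag ~100, so following step 4 you would choose `patch_shape_x`/`patch_shape_y` of at least roughly 100 grid points.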
The generated samples are saved in a NetCDF file with three main components:

Training from scratch is recommended for all other cases.

2. **How many samples are needed to train a CorrDiff model?**

   The more, the better. As a rule of thumb, at least 50,000 samples are necessary.
   *Note: For patch-based diffusion, each patch can be counted as a sample.*

3. **How many GPUs are required to train CorrDiff?**

   A single GPU is sufficient as long as memory is not exhausted, but this may result in extremely slow training. To accelerate training, CorrDiff leverages distributed data parallelism. The total training wall-clock time [...] patch-based diffusion models, decrease the patch size, ensuring it remains larger than the auto-correlation distance.

4. **How long does it take to train CorrDiff on a custom dataset?**

   Training CorrDiff on the continental United States dataset required approximately 5,000 A100 GPU hours. This corresponds to roughly 80 hours of wall-clock time with 64 GPUs. You can expect the cost to scale linearly with the number of samples available.

5. **What are CorrDiff's current limitations for custom datasets?**

   The main limitation of CorrDiff is the maximum _downscaling ratio_ it can achieve. For a purely spatial super-resolution task (where input and output variables are the same), CorrDiff can reliably achieve a maximum resolution scaling of ×16. If the task involves inferring new output variables, the maximum reliable spatial super-resolution is ×11.

6. **What does a successful training look like?**

   In a successful training run, the loss function should decrease monotonically, as shown below:
   One of the most crucial hyperparameters is the patch size for a patch-based diffusion model (`patch_shape_x` and `patch_shape_y` in the configuration file). A larger patch size increases computational cost and GPU memory requirements, while a [...]

   [...] processed in parallel on each GPU. It needs to be reduced if you encounter an out-of-memory error.

8. **How do I set up validation during training?**

   CorrDiff supports validation during training through its configuration system. The validation approach is based on a separate validation configuration that inherits from and selectively overrides the training dataset settings.
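   As a purely hypothetical illustration of that pattern (the keys below are assumptions, not the exact schema of the shipped configs; check the YAML files under `conf/` for the real field names), such an override might look like:

   ```yaml
   # Hypothetical illustration only: a validation section that inherits the
   # training dataset settings and overrides just the data source.
   validation:
     dataset:
       data_path: ./data/validation.nc   # placeholder path to a held-out file
   ```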