
Fix casting in SongUNetPosEmbd and shape in CorrDiff generation #982


Open
juliusberner wants to merge 6 commits into main from jberner/fix_corrdiff

Conversation

@juliusberner (Contributor) commented Jun 17, 2025

PhysicsNeMo Pull Request

Description

  1. Fix regression output shape in CorrDiff
  2. Only use act if fused_act is True in ApexGroupNorm
  3. Avoid changing the dtype of module attributes (since self.pos_embd can be a buffer or a parameter) and cast the fp32 softmax output back to the output dtype in SongUNetPosEmbd.
  4. Avoid changing the dtype of the .data attribute of self.scalar so that SongUNetPosEmbd works with torch.compile (see the sketch below).
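As a rough illustration of items 3 and 4, the sketch below shows the intended casting pattern with a stand-in parameter (names and shapes are made up, not the actual SongUNetPosEmbd code): cast into a local variable instead of mutating the registered parameter's .data, so the module state stays untouched and torch.compile does not have to trace a side effect.

```python
import torch
from torch import nn

# Illustrative stand-ins, not the actual SongUNetPosEmbd attributes.
scalar = nn.Parameter(torch.ones(4))            # registered fp32 parameter
out = torch.randn(2, 4, dtype=torch.bfloat16)   # e.g. network output under amp-bf16

# Pattern removed by this PR: in-place dtype mutation of the parameter storage,
# a side effect that permanently changes the module.
# scalar.data = scalar.data.to(out.dtype)

# Pattern introduced by this PR: cast into a local variable; the parameter
# registered on the module keeps its original dtype.
local_scalar = scalar
if out.dtype != local_scalar.dtype:
    local_scalar = local_scalar.to(out.dtype)
out = out * local_scalar
```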

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

@juliusberner juliusberner changed the title Fix dtype in SognUNet and shape in CorrDiff generation Fix dtype in SongUNetPosEmbd and shape in CorrDiff generation Jun 18, 2025
@juliusberner juliusberner changed the title Fix dtype in SongUNetPosEmbd and shape in CorrDiff generation Fix dtypes in SongUNetPosEmbd and shape in CorrDiff generation Jun 18, 2025
@juliusberner juliusberner changed the title Fix dtypes in SongUNetPosEmbd and shape in CorrDiff generation Fix casting in SongUNetPosEmbd and shape in CorrDiff generation Jun 18, 2025
@juliusberner juliusberner force-pushed the jberner/fix_corrdiff branch from 92a58a3 to b7e0382 Compare June 18, 2025 00:07
@juliusberner juliusberner force-pushed the jberner/fix_corrdiff branch from b7e0382 to f124ba9 Compare June 18, 2025 00:14
@CharlelieLrt CharlelieLrt self-requested a review June 18, 2025 00:17
@CharlelieLrt CharlelieLrt added the labels bug ("Something isn't working") and 2 - In Progress ("Currently a work in progress") on Jun 18, 2025
@CharlelieLrt (Collaborator) left a comment


Overall looks good! Just need a few clarifications and a few MREs for the bugs that this PR fixes.

@@ -196,7 +196,7 @@ def generate_fn():
             net=net_reg,
             img_lr=img_lr,
             latents_shape=(
-                cfg.generation.seed_batch_size,
+                sum(map(len, rank_batches)),
@CharlelieLrt (Collaborator) commented:

@juliusberner could you explain the reason for this change? AFAIK the batch dimension of latents_shape is never really used, right?

@juliusberner (Contributor, author) replied:

This is the batch size that the output of regression_step is expanded to. Since we later compute image_out = image_reg + image_res, it needs to match the batch size of the output of diffusion_step.
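A hedged sketch of that constraint with made-up shapes (rank_batches here is just a nested list standing in for the real per-rank index batches): the regression output is expanded to latents_shape[0] and later added to the diffusion output, so both batch sizes must agree.

```python
import torch

# Stand-in for the real rank_batches: sample indices split across batches.
rank_batches = [[0, 1], [2, 3], [4]]
n_total = sum(map(len, rank_batches))  # 5

# Regression output computed once, then expanded to the full batch size ...
image_reg = torch.zeros(1, 3, 64, 64).expand(n_total, -1, -1, -1)
# ... so it can be added to the diffusion output of the same batch size.
image_res = torch.randn(n_total, 3, 64, 64)
image_out = image_reg + image_res
print(image_out.shape)  # torch.Size([5, 3, 64, 64])
```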

Comment on lines -883 to +889
-        if self.prob_channels and out.dtype != self.scalar.dtype:
-            self.scalar.data = self.scalar.data.to(out.dtype)
-        if self.prob_channels and (not self.training):
-            out[:, self.prob_channels] = (
-                out[:, self.prob_channels] * self.scalar
-            ).softmax(dim=1)
-        elif self.prob_channels and self.training:
+            scalar = self.scalar
+            if out.dtype != scalar.dtype:
+                scalar = scalar.to(out.dtype)
+            if self.training:
+                out[:, self.prob_channels] = out[:, self.prob_channels] * scalar
+            else:
+                out[:, self.prob_channels] = (
+                    (out[:, self.prob_channels] * scalar)
+                    .softmax(dim=1)
+                    .to(out.dtype)
@CharlelieLrt (Collaborator) commented:

LGTM, but could you just post below an MRE of the bug you encountered with the former casting logic?

@juliusberner (Contributor, author) replied on Jun 27, 2025:

In amp-bf16 training, the output of softmax is float32 while out.dtype is bfloat16, which gives RuntimeError: Index put requires the source and destination dtypes match, got BFloat16 for the destination and Float for the source.
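A minimal sketch of the failing index put (a plain CPU reproduction of the same RuntimeError, not the actual amp-bf16 CorrDiff forward; the float32 tensor below plays the role of the softmax output):

```python
import torch

out = torch.randn(2, 4, dtype=torch.bfloat16)  # destination, e.g. bf16 network output
src = torch.randn(2, 2, dtype=torch.float32)   # e.g. the fp32 softmax result

try:
    # RuntimeError: Index put requires the source and destination dtypes match,
    # got BFloat16 for the destination and Float for the source.
    out[:, [0, 1]] = src
except RuntimeError as e:
    print(e)

# The fix in this PR casts the result back to the destination dtype:
out[:, [0, 1]] = src.to(out.dtype)
```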

Comment on lines +948 to +950
+        pos_embd = self.pos_embd
+        if x.dtype != pos_embd.dtype:
+            pos_embd = pos_embd.to(x.dtype)
@CharlelieLrt (Collaborator) commented:

Two remarks:

  1. Same as above: could you post below an MRE of the bug you would get with the former casting logic? (It can be grouped with the one above.)
  2. Is there a logic problem here? We access pos_embd.dtype, yet right below we check whether pos_embd is not None. If self.pos_embd is None, shouldn't we return None right away?

@juliusberner (Contributor, author) replied on Jun 27, 2025:

  1. The assignment self.pos_embd = self.pos_embd.to(dtype) only works if self.pos_embd is a buffer, not if it is a parameter (which is the case if self.gridtype == "learnable"): nn.Module raises a TypeError when a plain tensor is assigned to a registered parameter. Defining a new local variable works in both cases (see the sketch below).
  2. positional_embedding_indexing is only called in forward if self.pos_embd is not None. If it is called from outside, it would return an empty list []. How should we handle that?
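A hedged sketch of point 1 with a toy module (not SongUNetPosEmbd): overwriting a parameter attribute with a plain tensor raises a TypeError, while a local variable handles both the buffer and the "learnable" (parameter) case.

```python
import torch
from torch import nn

class ToyPosEmbd(nn.Module):
    def __init__(self, learnable: bool):
        super().__init__()
        embd = torch.randn(4, 8, 8)
        if learnable:  # analogous to gridtype == "learnable"
            self.pos_embd = nn.Parameter(embd)
        else:
            self.register_buffer("pos_embd", embd)

m = ToyPosEmbd(learnable=True)
try:
    # .to() returns a plain tensor, and nn.Module refuses to overwrite a
    # registered parameter with it -> TypeError.
    m.pos_embd = m.pos_embd.to(torch.float16)
except TypeError as e:
    print(e)

# Local-variable cast works for both buffers and parameters.
x_dtype = torch.float16
pos_embd = m.pos_embd
if pos_embd.dtype != x_dtype:
    pos_embd = pos_embd.to(x_dtype)
```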

Comment on lines +1088 to +1090
+        embeddings = self.pos_embd
+        if x.dtype != embeddings.dtype:
+            embeddings = embeddings.to(x.dtype)
@CharlelieLrt (Collaborator) commented:

Two remarks:

  1. Same as above, it would be great if you could post an MRE below (it can be grouped with the other MREs for these casting bugs).
  2. Is there a specific reason to call it embeddings here, whereas it was called pos_embd in the positional_embedding_indexing method? If not, let's keep the names consistent.

@juliusberner (Contributor, author) replied:

  1. Copying from above: the assignment self.pos_embd = self.pos_embd.to(dtype) only works if self.pos_embd is a buffer, not if it is a parameter (which is the case if self.gridtype == "learnable"), so we define a new local variable that works in both cases.
  2. I took it from the existing code, but it makes sense to rename it to pos_embd.

CharlelieLrt and others added 2 commits June 26, 2025 16:40
…ding, lead time aware, with compile, apex_gn, etc...

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>