
Changes to support DoMINO Design Sensitivities work + DoMINO model code fixes #973


Open · wants to merge 46 commits into base: main

Conversation

@peterdsharpe (Collaborator) commented Jun 13, 2025

PhysicsNeMo Pull Request

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@peterdsharpe peterdsharpe self-assigned this Jun 13, 2025
```python
newlist.extend(x)
return newlist
```

@peterdsharpe (Collaborator, Author) commented Jun 13, 2025:
Review note: this PR deletes merge(); this function is (a) not used anywhere, (b) can and should be replaced simply by the built-in sum(lists, start=[]), and (c) as written will always raise an error, since newlist is a tuple and hence has no .extend() method.
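
For reference, a minimal sketch of the suggested built-in replacement (the list values here are hypothetical):

```python
# Flatten a list of lists with the built-in sum(); the empty-list
# start value makes list concatenation the accumulation step.
lists = [[1, 2], [3], [4, 5]]
flat = sum(lists, start=[])
assert flat == [1, 2, 3, 4, 5]
```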

Commits (messages truncated in the GitHub view):

  • …Torch documentation in GeoProcessor, GeometryRep, NNBasisFunctions, ParameterModel, and DoMINO classes.
  • … and performance; updates type hints and formatting, and modifies input handling for mesh data.
  • …rdinates; updates DoMINOInference to improve memory management and adds detailed docstrings for clarity.
  • …g random sampling, and adding detailed docstrings for initialization and item retrieval methods.
  • …n; updates DoMINOInference to utilize the new smoothing function and modifies sensitivity calculations accordingly. Enhances type hints and formatting for clarity.
  • …amline input handling. Updates input file processing and enhances results storage for mesh data.
  • …e/physicsnemo into psharpe/domino-sensitivities
  • …MINOInference, including sensitivity analysis and output to text files.
  • …cking.py to save output data in a dedicated gradient_checking_results directory. Adds new raw and smooth drag gradient data files.
  • …ivities (e.g., drag adjoint) with respect to underlying input geometry in CHANGELOG.md.
@peterdsharpe peterdsharpe marked this pull request as ready for review June 19, 2025 19:10
```diff
@@ -18,7 +18,7 @@
 Important utilities for data processing and training, testing DoMINO.
 """

 import os
```

@peterdsharpe peterdsharpe added the 3 - Ready for Review Ready for review by team label Jun 19, 2025
Commit (message truncated): …ge, features, and configuration for aerodynamic design optimization.
@peterdsharpe (Collaborator, Author) commented:

/blossom-ci

@peterdsharpe (Collaborator, Author) commented Jun 20, 2025:

Review note: this introduces changes to the activation functions of the DoMINO model, and hence, existing DoMINO checkpoints will need to be retrained.

Last remaining to-do to pass CI: update the cached DoMINO checkpoint at ./test/models/data/domino_output.pth so that it can be loaded and validated at test-time. Going to hold off on this until review comments come back, in case other changes need to be made before generating this.

Comment on lines +41 to +43:

```yaml
output_dir: /user/aws_data_all/
input_dir: /data/drivaer_aws/drivaer_data_full/
cached_dir: /user/cached/drivaer_aws/drivaer_data_full/
```
Collaborator commented:
We already declare both project_dir and output in this config. Is it possible to reuse those with ${output}? I don't want to leave a config that requires many manual changes; it'd be nice to format all these paths as sensible relative paths.
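
For reference, Hydra configs are backed by OmegaConf, whose ${...} interpolation lets derived paths reuse a declared root; a minimal sketch with hypothetical keys:

```python
from omegaconf import OmegaConf

# Hypothetical keys illustrating ${...} interpolation: derived paths
# reuse a single declared root instead of repeating absolute paths.
cfg = OmegaConf.create(
    {
        "output": "/user",
        "output_dir": "${output}/aws_data_all/",
        "cached_dir": "${output}/cached/drivaer_aws/",
    }
)
print(cfg.output_dir)  # /user/aws_data_all/
```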

@peterdsharpe (Collaborator, Author) replied:

This is a good consideration! (Actually, I hadn't realized that you could use ${}-variables in YAML, so I learned something new today!)

Let me investigate what I can do here and get back to you on this one.

@coreyjadams (Collaborator) left a review:

I think this is a great example use of DoMINO and a really cool extension to its utility. As written, I think it's great, but it will be easier to maintain and extend with a few changes. I know you have another PR open on the utils cleanup, so I won't focus there too much. My main thoughts are:

  • The mesh utility functions should live somewhere they can get unit tested and used more broadly.
  • The config files have a lot of repetition (from the original configs) that could be cleaned up and consolidated
  • relu -> gelu is a good idea, but we should make the change more cautiously so as not to break the model for existing users.
  • The second half of the README is mostly just a list of bullets. Can we elaborate on those where they are useful? And a few are repetitive, perhaps consolidate?
  • The datapipe uses scipy KDTree which is good, but we can also use cuml and do it on the GPUs.

Collaborator commented:

We did a decent amount of work in the last release to accelerate the data pipeline for DoMINO. I see a number of pieces here using CPU-only methods. If we have time to do it before merge, we should isolate the pieces from the training datapipe and reuse them here as CPU/GPU-agnostic code.

Really, I'm staring at the KDTree. That's a CPU op but cuml has a faster implementation on GPU.
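
For illustration, the CPU query with scipy and the cuML alternative look roughly like this (array names are hypothetical, not the datapipe's actual code):

```python
import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(10_000, 3).astype(np.float32)
queries = np.random.rand(100, 3).astype(np.float32)

# CPU path: scipy KDTree, as used here today.
tree = KDTree(points)
dists, idx = tree.query(queries, k=8)

# GPU alternative: cuML nearest neighbors on device arrays.
# from cuml.neighbors import NearestNeighbors
# nn = NearestNeighbors(n_neighbors=8).fit(points)
# dists, idx = nn.kneighbors(queries)
```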

Collaborator commented:

I checked your math, might have a decimal error on line 62 column 17?

Just kidding. But - are these meant to be for unit testing and math checks? Might be better placed in the test area if so?

Collaborator commented:

Also - the way it's implemented here should be pretty easy to do with cupy, too. Worth doing?

Collaborator commented:

This file appears to have some good stuff. I don't want it to get buried and reimplemented later, and anyway it looks like it could use some tests, too.

Could this get reorganized into, say, physicsnemo.utils.mesh or similar?

@peterdsharpe (Collaborator, Author) replied Jun 25, 2025:

Sure! It's actually based on some utilities that @ktangsali and I were working on in PhysicsNeMo-CFD, which I had included here since I needed our neighbor-finding algorithm to implement the Laplacian smoothing kernel, and I couldn't depend on end-users of this specific example having PhysicsNeMo-CFD installed.

Perhaps one alternative strategy (rather than including these utilities here) would be to have this example list PN-CFD in its requirements.txt - but I'd really like to keep the dependency direction one-way: PhysicsNeMo-CFD depends on PhysicsNeMo, but not vice versa. It gets super messy when you have a bidirectional dependency and need to enforce mutual version pins. So, I don't love that alternative.

I do think this would be a good thing to upstream into something like physicsnemo.utils.mesh, since it's just generally useful and not CFD-specific - but I'd like to run it by @ktangsali first.
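
For context, once neighbors are known, the Laplacian smoothing kernel itself is short; a minimal numpy sketch under assumed data layouts (hypothetical names, not the PR's actual implementation):

```python
import numpy as np

def laplacian_smooth(points, neighbors, lam=0.5, iterations=10):
    # points:    (N, 3) array of vertex coordinates
    # neighbors: list where neighbors[i] is an index array of vertex i's neighbors
    pts = points.copy()
    for _ in range(iterations):
        # Blend each vertex toward the mean of its neighbors by a factor lam.
        means = np.stack([pts[nbr].mean(axis=0) for nbr in neighbors])
        pts = (1.0 - lam) * pts + lam * means
    return pts
```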

Collaborator commented:

I think wholesale switching to gelu makes sense for your design sensitivities work, but it's going to be a bumpy shift for everyone who has a pretrained or fine-tuned model. Can I propose, instead, that the activation be configurable at the model level? Default to relu (for backward compatibility) and set it to gelu in your configs. We can target a pivot from relu -> gelu after the next trained model is released, presumably with gelu?
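
A minimal sketch of that pattern (the class and argument names are hypothetical, not DoMINO's actual signature):

```python
import torch.nn as nn

ACTIVATIONS = {"relu": nn.ReLU, "gelu": nn.GELU}

class Block(nn.Module):
    # Hypothetical block with a configurable activation; defaulting to
    # "relu" keeps existing checkpoints behaving as before, while new
    # configs can opt into "gelu".
    def __init__(self, in_dim: int, out_dim: int, activation: str = "relu"):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64),
            ACTIVATIONS[activation](),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```

Since neither activation holds parameters, the state_dict is identical either way; only the forward behavior changes.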

@peterdsharpe (Collaborator, Author) replied Jun 25, 2025:

This is a valid concern!

I agree that there may be growing pains here, and I actually think it's symptomatic of a much larger problem that goes beyond this instance. Namely, the fundamental issue is that saved checkpoints currently do not encode much info about the model architecture, so we're loading fixed checkpoints (via saved files) into a moving-target architecture (via the repository).

  • On the GeLU/ReLU issue: I agree that making activations configurable for DoMINO is a workable fix for this instance of the issue. I'm open to doing this, though I think it's a bit of a band-aid fix to the larger true issue.
  • For the larger issue, it's an open question whether/how we tackle it:
    • One possible fix would be if, when the user saves a model checkpoint, a hash of the PhysicsNeMo repo is also saved (most conveniently, the latest Git commit hash). This has some implementation issues (what if the user isn't using Git?), but something conceptually similar would be nice. This way, when the checkpoint is loaded, we can throw a warning if the architecture differs from when it was saved. (PyTorch will already throw this for some model changes, but not for all.) A rough sketch of this idea follows below this list.
    • I think something like ONNX is not a great option here, since some of our models end up using a lot of bespoke code that doesn't fit nicely into ONNX primitives (e.g., anything with Warp).
    • We could just leave this as-is and expect users to pin commit hashes / versions for serious deployments. In an ideal world this should be happening anyway, but I'm not sure how reliable this solution is "in the wild".
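
A rough sketch of the commit-hash idea (helper names hypothetical; this is not existing PhysicsNeMo behavior):

```python
import subprocess
import torch

def save_checkpoint(model, path):
    # Record the current Git commit alongside the weights, if available.
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # e.g., the user isn't running from a Git checkout
    torch.save({"state_dict": model.state_dict(), "code_commit": commit}, path)

def load_checkpoint(model, path, current_commit):
    ckpt = torch.load(path)
    if ckpt.get("code_commit") not in (None, current_commit):
        print("Warning: checkpoint was saved from a different code revision; "
              "the architecture may have changed since then.")
    model.load_state_dict(ckpt["state_dict"])
```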

More long-term: when it comes to new model architectures, I would recommend defaulting to $C^\infty$-continuous modeling where possible - at least for problems where the underlying PDE is expected to be smooth. This shows up most notably in activation functions, but also in more subtle places like KNN kernels (depending on how the results are then handled). Reworking how we handle KNN-based layers to be higher-order continuous is a challenge, but for the low-hanging fruit (like activations), I think encouraging continuity would be a good default policy.

  • This is because many applications benefit from this continuity property: for example, @ktangsali's current work on adding PDE-residual-based physics loss to DoMINO also fundamentally requires meaningful spatial gradients, which any architecture that passes spatial information through ReLU cannot provide via AD (this is explicitly true for derivatives of order 2+, but even first-derivative accuracy suffers from the loss of smoothness). A small demonstration follows below.
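
To make the ReLU limitation concrete, a small illustrative PyTorch check: the second derivative of ReLU vanishes almost everywhere, while GELU's does not.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.7], requires_grad=True)
for act in (F.relu, F.gelu):
    y = act(x)
    (g,) = torch.autograd.grad(y, x, create_graph=True)
    g2 = torch.autograd.grad(g, x, allow_unused=True)[0]
    print(act.__name__, g.item(), 0.0 if g2 is None else g2.item())
# relu: the first derivative is a step function, so AD reports zero
# curvature everywhere it is defined.
# gelu: both derivatives are smooth and generally nonzero.
```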

@peterdsharpe (Collaborator, Author) added:

My personal preference would be as follows:

  • Make GeLU the default for now, since higher-order continuity is required for multiple downstream applications and we expect the true solution of the RANS PDE to exhibit at least $C^2$-continuity
  • Not add activation as an exposed configuration option, trusting that users should be pinning their PhysicsNeMo repo versions when using a fixed saved checkpoint.
  • Choose whether/how to address the fundamental issue of separation-of-checkpoints-and-architectures later.

However, I'm definitely open to being convinced - can other reviewers weigh in here? @ktangsali @RishikeshRanade
