
Conversation


@tonishi-nv tonishi-nv commented Oct 26, 2025

PhysicsNeMo Pull Request

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
@tonishi-nv tonishi-nv changed the title from Tonishi/xmgn ressim to A new X-MeshGraphNet example for reservoir simulation. Oct 26, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR adds a new reservoir simulation example to PhysicsNeMo using X-MeshGraphNet (XMGN), located in examples/reservoir_simulation/. The implementation provides a complete workflow for training graph neural networks on subsurface flow data from ECLIPSE/OPM/IX simulators. The example includes:

  • a sim_utils/ package for reading binary simulation files and constructing graph representations from reservoir grids,
  • a preprocessing pipeline that converts raw simulator outputs to partitioned PyTorch Geometric graphs with train/val/test splits,
  • training infrastructure with distributed support, MLflow tracking, and early stopping, and
  • autoregressive inference that predicts multiple timesteps and exports results in both HDF5 and GRDECL formats.

The code integrates with PhysicsNeMo's distributed training patterns and checkpoint management while adding domain-specific features such as transmissibility-based edge weights, well completion handling, and nonlinear scaling for physical variables. The architecture mirrors existing PhysicsNeMo examples (Hydra configuration, modular utilities) while addressing reservoir simulation requirements such as handling inactive cells, temporal sequences, and industry-standard file formats.
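
For illustration of the partitioning step only, a METIS-based split of a PyTorch Geometric graph can be sketched with ClusterData; this is not the example's own preprocessor (which additionally attaches halo nodes), and the toy graph below is a placeholder:

    import torch
    from torch_geometric.data import Data
    from torch_geometric.loader import ClusterData

    # Toy graph standing in for a reservoir grid: one node per active cell,
    # undirected connections stored as pairs of directed edges.
    edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                               [1, 0, 2, 1, 3, 2]])
    graph = Data(x=torch.randn(4, 1), edge_index=edge_index)

    # METIS partitioning into 2 parts; each part could then be saved to disk
    # and consumed independently by the training dataloader.
    parts = ClusterData(graph, num_parts=2)
    for i in range(len(parts)):
        part = parts[i]
        print(i, part.num_nodes, part.edge_index.shape)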

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/src/train.py 2/5 Adds complete training pipeline but has critical issues: MLflow run never closed, gradient accumulation logic inside partition loop instead of outside, and potential race conditions in distributed trash cleanup
examples/reservoir_simulation/xmgn/requirements.txt 2/5 Adds dependencies but contains version mismatch between torch (>=2.4.0) and PyG wheel instructions (torch-2.9.0), plus unused torchvision/torchaudio dependencies
examples/reservoir_simulation/xmgn/src/data/__init__.py 2/5 Creates data package API but uses absolute import from sim_utils import ... that will fail since sim_utils is a sibling directory, should use relative import from ...sim_utils import ...
examples/reservoir_simulation/xmgn/src/data/dataloader.py 3/5 Implements graph dataset and partitioning but missing null checks for edge_attr indexing and empty tensor validation before shape access could cause runtime crashes
examples/reservoir_simulation/sim_utils/grid.py 3/5 Adds Grid class for reservoir connectivity but remapping logic uses slow Python loops, creates invalid (0,0) edges in edge cases, and has debug prints that should be converted to logging
examples/reservoir_simulation/xmgn/src/preprocessor.py 3.5/5 Implements METIS-based graph partitioning with fallback, but swallows exceptions during partition organization and shadows loop variable i in nested loops
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py 3/5 Provides MLflow logging utilities but contains bare except clause catching system exits, unsafe attribute access leaving dataset_name unbound, and uses deprecated datetime.utcnow()
examples/reservoir_simulation/sim_utils/ecl_reader.py 4/5 Adds ECLIPSE binary file parser with extensive functionality but loads entire files into memory and uses bare except clauses that could mask critical errors
examples/reservoir_simulation/sim_utils/well.py 4/5 Introduces well/completion data structures with minor documentation inconsistencies (missing stat parameter, references nonexistent set_out_flow_rate method)
examples/reservoir_simulation/xmgn/src/data/graph_builder.py 4/5 Comprehensive graph builder converting simulator outputs to PyTorch Geometric format with extensive validation, minor unused imports and potentially strict 20% failure threshold
examples/reservoir_simulation/xmgn/src/inference.py 4/5 Autoregressive inference script with checkpoint loading and dual output formats (HDF5/GRDECL), contains bare except clause catching all exceptions including keyboard interrupts
examples/reservoir_simulation/xmgn/src/utils/path_utils.py 4/5 Path management utilities with inconsistent validation - checks runspec.job_name but not dataset.sim_dir before access which could raise AttributeError
examples/reservoir_simulation/xmgn/conf/config.yaml 4/5 Main Hydra configuration for XMGN with comprehensive settings, minor trailing whitespace issue and should validate array length constraints at runtime
examples/reservoir_simulation/xmgn/conf/config_norne.yaml 4/5 Norne dataset-specific configuration with relative paths assuming specific directory structure and skip_graphs flag requiring pre-existing files
examples/reservoir_simulation/xmgn/README.md 4/5 Documents complete workflow from data prep to inference with minor config path discrepancy (conf/config.yaml vs config/config.yaml) and incomplete Option 2 section
examples/reservoir_simulation/sim_utils/__init__.py 4/5 Creates sim_utils package API with clean exports, assumes existence of sibling modules (ecl_reader, grid, well)
examples/reservoir_simulation/sim_utils/README.md 5/5 Well-structured documentation for sim_utils package with practical code examples and clear usage patterns
CHANGELOG.md 5/5 Adds single entry documenting new reservoir simulation example in correct format and location
examples/reservoir_simulation/xmgn/.gitignore 5/5 Standard MLflow mlruns/ exclusion, correct and necessary
examples/reservoir_simulation/xmgn/src/__init__.py 5/5 Empty package marker with proper license header, standard Python packaging
examples/reservoir_simulation/xmgn/src/utils/__init__.py 5/5 Empty utils package initialization with license header and empty __all__ list

Confidence score: 2/5

  • This PR requires careful review due to multiple critical issues in core training and data loading logic that could cause runtime failures or incorrect behavior
  • Score reflects critical bugs in train.py gradient accumulation, import path errors in data/__init__.py, version mismatches in requirements.txt, missing null checks in dataloader, and bare except clauses that could mask errors; there are additional concerns about memory usage patterns (loading entire files), potential race conditions in distributed cleanup, and deprecated API usage
  • Pay close attention to examples/reservoir_simulation/xmgn/src/train.py (gradient accumulation logic, MLflow lifecycle), examples/reservoir_simulation/xmgn/src/data/__init__.py (import paths), examples/reservoir_simulation/xmgn/requirements.txt (version alignment), examples/reservoir_simulation/xmgn/src/data/dataloader.py (null safety), and examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py (exception handling)
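
A minimal sketch of the gradient-accumulation pattern flagged above, assuming a per-batch loop over graph partitions (model, loss_fn, and the partition iterable are placeholders): gradients accumulate across all partitions and the optimizer steps once per batch, outside the partition loop.

    import torch

    def train_batch(model, optimizer, loss_fn, partitions):
        """Accumulate gradients over every partition, then step once."""
        optimizer.zero_grad()
        total_loss = 0.0
        for part in partitions:                       # inner partition loop
            pred = model(part.x, part.edge_attr, part)
            loss = loss_fn(pred, part.y) / len(partitions)
            loss.backward()                           # accumulate gradients
            total_loss += loss.item()
        optimizer.step()                              # single step per batch
        # with AMP, scaler.step(optimizer) / scaler.update() would likewise
        # sit here, outside the partition loop
        return total_loss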

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant EclReader
    participant Grid
    participant GraphBuilder
    participant Trainer
    participant Model
    participant Inference

    User->>Preprocessor: python src/preprocessor.py
    Preprocessor->>Preprocessor: validate_config()
    Preprocessor->>Preprocessor: check_existing_data()
    
    Preprocessor->>GraphBuilder: ReservoirGraphBuilder(cfg)
    GraphBuilder->>EclReader: find simulation files (.DATA/.AFI)
    
    loop For each simulation case
        GraphBuilder->>EclReader: read_init(static properties)
        GraphBuilder->>EclReader: read_egrid(grid geometry)
        GraphBuilder->>Grid: Grid(init_data, egrid_data)
        Grid->>Grid: compute cell coordinates
        Grid->>Grid: compute connections & transmissibilities
        GraphBuilder->>EclReader: read_restart(dynamic properties)
        GraphBuilder->>GraphBuilder: build_graph_from_simulation_data()
        GraphBuilder->>GraphBuilder: apply_nonlinear_scaling()
        GraphBuilder->>Preprocessor: return graph
    end
    
    Preprocessor->>Preprocessor: save_graphs()
    Preprocessor->>Preprocessor: create_partitions_from_graphs()
    Preprocessor->>Preprocessor: split_samples_by_case()
    Preprocessor->>Preprocessor: organize_partitions_by_split()
    Preprocessor->>Preprocessor: compute_global_statistics()
    Preprocessor->>Preprocessor: save_dataset_metadata()
    
    User->>Trainer: torchrun src/train.py
    Trainer->>Trainer: InitializeLoggers(cfg)
    Trainer->>Trainer: initialize_dataloaders()
    Trainer->>Trainer: initialize_model(MeshGraphNet)
    Trainer->>Trainer: initialize_optimizer(AdamW)
    
    loop For each epoch
        loop For each batch
            Trainer->>Model: forward(x, edge_attr, graph)
            Model->>Trainer: predictions
            Trainer->>Trainer: compute_weighted_loss()
            Trainer->>Trainer: backward()
            Trainer->>Trainer: optimizer.step()
        end
        
        alt Validation epoch
            Trainer->>Trainer: validate_epoch()
            Trainer->>Trainer: denormalize_predictions()
            Trainer->>Trainer: compute_per_variable_losses()
            Trainer->>Trainer: check_early_stopping()
            Trainer->>Trainer: save_checkpoint()
        end
    end
    
    User->>Inference: python src/inference.py
    Inference->>Inference: load best checkpoint
    Inference->>Inference: load test dataset
    
    loop For each test case
        loop Autoregressive timesteps
            Inference->>Model: forward(x, edge_attr, graph)
            Model->>Inference: predictions
            Inference->>Inference: update_node_features_with_predictions()
            Inference->>Inference: denormalize_predictions()
        end
    end
    
    Inference->>Inference: save_case_results_hdf5()
    Inference->>Inference: run_post()
    Inference->>Inference: generate GRDECL files
    Inference->>User: inference results & visualizations

21 files reviewed, 39 comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The current diff contains documentation-only fixes to the well.py file within the reservoir simulation utilities. Three docstring errors were corrected: (1) removed a reference to a non-existent id attribute in the Completion class, (2) fixed a method name typo from set_out_flow_rate to the correct set_flow_rate, and (3) added the missing stat parameter to the Well.__init__ docstring. These changes align the API documentation with the actual implementation and address the three syntax issues flagged in the previous review (lines 33-42, 106-110).

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/sim_utils/well.py 5/5 Fixed three docstring inconsistencies: removed non-existent id reference, corrected method name typo, and added missing stat parameter documentation

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects documentation-only changes with no functional code modifications—these fixes correct previously flagged syntax issues in docstrings without touching any logic or introducing new code paths
  • No files require special attention

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update addresses two documentation and defensive programming issues identified in previous reviews. The first change corrects a path inconsistency in the README where the configuration file path was referenced as both config/config.yaml and conf/config.yaml in different locations—the fix standardizes all references to conf/config.yaml to match the actual project directory structure. The second change adds validation for the cfg.dataset.sim_dir configuration attribute in path_utils.py, mirroring the existing pattern for cfg.runspec.job_name validation. This prevents potential AttributeErrors when the configuration is incomplete, ensuring the utility function fails fast with a clear error message rather than propagating an obscure exception during path construction.
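A minimal sketch of the fail-fast validation pattern described here, assuming a Hydra/OmegaConf config with runspec.job_name and dataset.sim_dir fields (the helper name and path layout are illustrative):

    from omegaconf import OmegaConf

    def resolve_output_dir(cfg):
        """Fail fast with a clear message when required config fields are missing."""
        job_name = OmegaConf.select(cfg, "runspec.job_name")
        sim_dir = OmegaConf.select(cfg, "dataset.sim_dir")
        if job_name is None:
            raise ValueError("cfg.runspec.job_name is required")
        if sim_dir is None:
            raise ValueError("cfg.dataset.sim_dir is required")
        return f"{sim_dir}/{job_name}"

    cfg = OmegaConf.create({"runspec": {"job_name": "xmgn_demo"},
                            "dataset": {"sim_dir": "/data/sims"}})
    print(resolve_output_dir(cfg))  # /data/sims/xmgn_demo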

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/README.md 5/5 Documentation fix correcting configuration file path from config/config.yaml to conf/config.yaml for internal consistency
examples/reservoir_simulation/xmgn/src/utils/path_utils.py 5/5 Added defensive validation check for cfg.dataset.sim_dir before access, preventing AttributeError on incomplete configuration

Confidence score: 5/5

  • This PR is safe to merge with minimal risk—both changes are minor defensive improvements with no functional side effects
  • Score reflects that these are documentation and validation fixes addressing specific review feedback, with no new logic, API changes, or complex interactions that could introduce bugs
  • No files require special attention—both changes are straightforward corrections following established patterns in the codebase

2 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review. The current change is a trivial formatting fix in examples/reservoir_simulation/xmgn/conf/config.yaml, removing trailing whitespace after the silu activation function specification on line 88. This addresses a linting violation previously flagged and ensures compliance with the repository's pre-commit hooks (which include ruff format and markdownlint). The change has zero functional impact—it simply removes invisible whitespace characters that violated the project's formatting standards.

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/conf/config.yaml 5/5 Removed trailing whitespace after 'silu' on line 88 (pure formatting fix)

Confidence score: 5/5

  • This PR is safe to merge with minimal risk—it contains only a trivial whitespace removal
  • Score reflects the non-functional nature of the change: removing invisible trailing spaces cannot introduce bugs or break existing functionality
  • No files require special attention; this is a single-line formatting correction with no code logic changes

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest change updates the requirements.txt file to adjust PyTorch and PyTorch Geometric (PyG) extension versions and versioning strategy.

What changed: The torch version constraint was modified from an implicit dependency on torch 2.9.0+cu128 to an explicit minimum version torch>=2.4.0. The PyG extension URL comment was updated from torch 2.9.0+cu128 to torch 2.4.0+cu121 wheels, and minimum version constraints (>=) were added to torch-scatter, torch-sparse, torch-cluster, and torch-spline-conv packages. The targeted CUDA toolkit accordingly moves from CUDA 12.8 to CUDA 12.1.

Why this matters: The developer is standardizing on torch 2.4.0+cu121 as the tested baseline, likely because torch 2.9.0 is not yet widely available or because PyG wheel compatibility is more reliable for 2.4.0. The change from exact pinning to minimum version constraints (>=) allows patch and minor updates while ensuring a tested baseline. However, this introduces a version coordination risk between the torch requirement and the PyG wheel URL.

How it fits: This requirements file defines the environment for the new X-MeshGraphNet reservoir simulation example (examples/reservoir_simulation/xmgn/). The dependency versions must align with the PhysicsNeMo core library's torch requirements (defined in the root pyproject.toml) and with available pre-built PyG wheels to ensure smooth installation.
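As a sketch only (the 2.4.0+cu121 baseline comes from the discussion above), one way to surface the coordination risk at runtime rather than deep inside a failed extension import is a simple version guard before the PyG extensions are loaded:

    import torch

    # The pre-built PyG extension wheels referenced in requirements.txt target
    # torch 2.4.x + cu121, so complain early if the installed torch diverges.
    if not torch.__version__.startswith("2.4."):
        raise RuntimeError(
            f"torch {torch.__version__} is installed, but the PyG extension "
            "wheels referenced in requirements.txt are built for torch "
            "2.4.x+cu121; install matching wheels or build the extensions "
            "from source."
        )
    # ...then import torch_scatter / torch_sparse / torch_cluster as usual.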

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/requirements.txt 3/5 Updated torch from 2.9.0+cu128 to >=2.4.0, changed PyG wheel URL to torch-2.4.0+cu121, and added minimum version constraints to PyG extensions

Confidence score: 3/5

  • This change reduces installation brittleness by moving from torch 2.9.0 (potentially unavailable) to the more stable 2.4.0+cu121 baseline, but introduces version coordination risks between the torch constraint and the PyG wheel URL.
  • Score reflects a critical version mismatch issue: the torch requirement torch>=2.4.0 (line 5) permits any version ≥2.4.0, but the PyG install URL comment (line 17) points to wheels built specifically for torch 2.4.0+cu121. If users install torch 2.5+ or 2.9+, pip will fail to find matching PyG wheels and attempt error-prone source builds. Additionally, the >= constraints on PyG extensions lack upper bounds, risking future incompatibilities.
  • Pay close attention to examples/reservoir_simulation/xmgn/requirements.txt lines 5 and 17-21: Consider pinning torch to ==2.4.0 or explicitly documenting that users must install torch 2.4.0 before running the PyG install command. The version mismatch between the flexible torch constraint and the fixed PyG wheel URL will cause installation failures in many environments.

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Two files were modified: a minor exception handling improvement and a dependency version update.

In examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py, a bare except: clause was changed to except Exception: on line 80, following Python best practices to avoid catching system-level signals like KeyboardInterrupt. This utility flattens Hydra configurations for MLflow logging in the X-MeshGraphNet reservoir simulation example.

In examples/reservoir_simulation/xmgn/requirements.txt, minimum version constraints were added to four PyG extension packages (torch-scatter>=2.1.2, torch-sparse>=0.6.18, torch-cluster>=1.6.3, torch-spline-conv>=1.2.2) that previously had no version pins. These packages depend on PyG 2.6.0 and torch 2.4.0+cu121, and the version constraints improve reproducibility by preventing installation of older incompatible versions. Both changes are small incremental improvements to code quality and dependency management.

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py 5/5 Changed bare except to except Exception: on line 80 to allow system signals to propagate
examples/reservoir_simulation/xmgn/requirements.txt 4/5 Added minimum version constraints to four PyG extension packages (torch-scatter, torch-sparse, torch-cluster, torch-spline-conv)

Confidence score: 4/5

  • This PR contains low-risk quality improvements with one potential installation issue that requires attention
  • Score reflects that the exception handling fix is safe and correct, but the requirements.txt change creates a potential version mismatch between the hardcoded PyG wheel URL (torch-2.4.0+cu121.html) and the newly specified minimum versions which may not exist at that URL, potentially causing pip to fall back to source builds
  • Pay close attention to examples/reservoir_simulation/xmgn/requirements.txt - verify that the minimum versions specified (torch-scatter>=2.1.2, torch-sparse>=0.6.18, torch-cluster>=1.6.3, torch-spline-conv>=1.2.2) are actually available at the PyG wheel repository URL mentioned in the installation comment, or update the comment to reflect the correct installation procedure

2 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent update improves error handling in mlflow_utils.py by addressing variable scoping issues and deprecation warnings that were identified in previous reviews.

The changes pre-initialize job_name and dataset_name with fallback values before the try block, preventing UnboundLocalError exceptions when the exception handler needs to access these variables. Additionally, nested attribute access patterns are replaced with safer getattr() calls that provide defaults when config sections are missing. The deprecated datetime.utcnow() call is updated to datetime.now(timezone.utc) for Python 3.12+ compatibility. These defensive programming patterns align with PhysicsNeMo's approach of graceful degradation—the MLflow logging should capture as much metadata as possible without failing the entire training run if configuration fields are malformed or absent.
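A minimal sketch of the defensive pattern described above; the config fields, tag names, and fallback values are illustrative rather than the example's actual code:

    from datetime import datetime, timezone
    from types import SimpleNamespace

    def collect_run_tags(cfg):
        """Build MLflow-style tags without failing on missing config sections."""
        # Pre-initialize fallbacks so no later code path hits an unbound variable.
        job_name = "unknown_job"
        dataset_name = "unknown_dataset"

        runspec = getattr(cfg, "runspec", None)
        dataset = getattr(cfg, "dataset", None)
        if runspec is not None:
            job_name = getattr(runspec, "job_name", job_name)
        if dataset is not None:
            dataset_name = getattr(dataset, "name", dataset_name)

        return {
            "job_name": job_name,
            "dataset_name": dataset_name,
            # timezone-aware replacement for the deprecated datetime.utcnow()
            "start_time": datetime.now(timezone.utc).isoformat(),
        }

    # A config missing the dataset section degrades gracefully:
    print(collect_run_tags(SimpleNamespace(runspec=SimpleNamespace(job_name="demo"))))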

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py 5/5 Pre-initializes fallback variables and replaces direct attribute access with safer getattr patterns to prevent UnboundLocalError and AttributeError exceptions; updates deprecated datetime API

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • All previous critical issues in this file have been addressed: variable scoping prevents UnboundLocalError, defensive attribute access prevents AttributeError, and the deprecated datetime API is updated for forward compatibility
  • No files require special attention

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Two minor fixes were applied based on previous code review feedback: one addresses a Python 3.12 deprecation in MLflow timestamp logging, and the other replaces a bare except: clause with a more specific exception handler in the inference script.

Changes:

  • examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py: Replaced deprecated datetime.utcnow() with datetime.now(timezone.utc) for timezone-aware timestamp generation in MLflow logging
  • examples/reservoir_simulation/xmgn/src/inference.py: Changed bare except: on line 145 to except Exception: to prevent catching system-level exceptions like KeyboardInterrupt and SystemExit

Both changes follow Python best practices and align with patterns established elsewhere in the physicsnemo codebase. The MLflow fix future-proofs the code against Python 3.12+ where utcnow() will be removed, while the inference fix ensures proper exception handling without blocking graceful program termination.
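For reference, a short sketch of the distinction (the post-processing helper is hypothetical):

    import logging

    def run_post_step():
        raise RuntimeError("hypothetical post-processing failure")

    try:
        run_post_step()
    except Exception as exc:   # was a bare `except:` before the fix
        # Exception does not cover KeyboardInterrupt or SystemExit, which derive
        # directly from BaseException, so Ctrl+C and sys.exit() still propagate.
        logging.warning("post-processing step failed: %s", exc)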

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/src/utils/mlflow_utils.py 5/5 Replaced deprecated datetime.utcnow() with timezone-aware datetime.now(timezone.utc) for MLflow timestamp logging
examples/reservoir_simulation/xmgn/src/inference.py 5/5 Fixed bare except: clause to except Exception: to allow system interrupts to propagate correctly

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects simple, well-understood deprecation and best-practice fixes with no functional impact or complex logic changes
  • No files require special attention

2 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent commit removes two unused return values from helper methods in examples/reservoir_simulation/xmgn/src/inference.py: the target_vars variable that was being discarded by all callers, and a redundant case_name extraction that was immediately re-computed at the call site. This is a straightforward code cleanup that simplifies tuple unpacking without changing behavior.

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/xmgn/src/inference.py 5/5 Removed unused return values (target_vars and case_name) from two private helper methods to eliminate dead code

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are pure refactoring—removing return values that were systematically ignored by all callers (evidenced by _ unpacking at lines 271 and 791), with no logic changes or new functionality
  • No files require special attention; this is straightforward dead-code elimination

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

A single targeted fix in the ECLIPSE binary file reader replaces a bare except: clause with explicit exception handling. The change on line 584 of ecl_reader.py now catches only struct.error and IndexError, the two specific exceptions that can occur when unpacking binary record headers during UNRST file parsing. This follows up on a previous review comment about exception-handling hygiene.

The _read_unrst method reads ECLIPSE reservoir simulation restart files by parsing binary records. Each record header is validated using struct.unpack to read a 4-byte integer. The new exception specification ensures that if malformed data or buffer boundary issues occur, the code silently continues pattern matching (as intended), but critical exceptions like KeyboardInterrupt and SystemExit will now propagate correctly. This is a standard practice for production code that processes potentially corrupted binary data streams.
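A small sketch of the narrowed handler, assuming big-endian 4-byte record headers as used by ECLIPSE-style Fortran unformatted files (the function name is illustrative):

    import struct
    from typing import Optional

    def try_read_record_length(buf: bytes, offset: int) -> Optional[int]:
        """Return the 4-byte big-endian record length at offset, or None if malformed."""
        try:
            (length,) = struct.unpack(">i", buf[offset:offset + 4])
            return length
        except (struct.error, IndexError):
            # Truncated or corrupted header: skip this candidate and keep scanning.
            # KeyboardInterrupt/SystemExit are no longer swallowed because the
            # handler is not a bare `except:` anymore.
            return None

    print(try_read_record_length(b"\x00\x00\x00\x10rest-of-record", 0))  # 16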

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/sim_utils/ecl_reader.py 5/5 Replaced bare except with specific struct.error and IndexError handling in binary file parser

Confidence score: 5/5

  • This is a safe, localized fix that improves exception handling without changing functionality
  • The change is minimal (single line), well-scoped, and follows Python best practices for exception handling
  • No files require special attention - this is a straightforward code quality improvement

1 file reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest change applies explicit log-level guards to all logging calls in the _read_unrst method of examples/reservoir_simulation/sim_utils/ecl_reader.py. Each logging.info() and logging.debug() statement is now wrapped in an if logging.getEffectiveLevel() <= logging.INFO/DEBUG: check to prevent string formatting overhead when the logger is set to WARNING or ERROR levels. This addresses a performance concern identified in earlier reviews for a data-loading hot path that processes thousands of timesteps in ECLIPSE reservoir simulation restart files. While Python's logging module performs internal level checks, the explicit guards prevent f-string evaluation before the check, which can be meaningful when processing large datasets in production environments where verbose logging is disabled.
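A compact sketch of the guard pattern described here, shown with a module-level logger (which later review rounds also recommend over the root logger), alongside the lazy %-formatting alternative that avoids explicit guards; the variable names are illustrative:

    import logging

    logger = logging.getLogger(__name__)
    n_records, path = 1842, "CASE.UNRST"

    # Explicit guard: the f-string is never built when DEBUG is disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"parsed {n_records} records from {path}")

    # Equivalent without a guard: logging defers formatting until it knows the
    # record will actually be emitted.
    logger.debug("parsed %d records from %s", n_records, path)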

Important Files Changed

Filename Score Overview
examples/reservoir_simulation/sim_utils/ecl_reader.py 4/5 Added explicit log-level checks around every logging statement in _read_unrst to avoid string formatting overhead when verbose logging is disabled

Confidence score: 4/5

  • This PR is safe to merge with minimal risk; the changes are purely performance optimizations that do not alter functionality
  • Score reflects that while the pattern introduces code duplication and reduces readability, it does not introduce bugs or breaking changes; the optimization is valid though the benefit may be marginal since Python's logging module already performs level checks internally
  • Pay close attention to examples/reservoir_simulation/sim_utils/ecl_reader.py to verify the readability trade-off is acceptable for your team's coding standards, as the repeated guards make the method significantly more verbose

1 file reviewed, 2 comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR adds a comprehensive X-MeshGraphNet example for reservoir simulation surrogate modeling. The implementation includes Eclipse/IX binary file readers (sim_utils), parallel graph construction from simulation data, METIS-based graph partitioning for distributed training, and complete training/inference pipelines with MLflow tracking.

Key additions:

  • sim_utils package for reading Eclipse binary formats (INIT, EGRID, UNRST) with proper endianness handling
  • Parallel graph builder processing multiple simulation cases using multiprocessing with progress tracking
  • Preprocessor orchestrating graph generation, partitioning, data splits, and global statistics
  • Training pipeline with distributed support, gradient checkpointing, and early stopping
  • Inference with autoregressive rollout and GRDECL export for visualization
  • Comprehensive documentation with examples for 2D waterflood and Norne field datasets

The code demonstrates good software engineering practices with proper error handling, configuration validation, and memory-efficient processing. Previous review comments have identified several style and documentation issues that should be addressed, but no critical blocking issues were found in this review round.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • The implementation is well-structured with comprehensive error handling, proper validation, and good documentation. Previous review rounds identified numerous style issues (bare excepts, logging inconsistencies, unused imports) and some logic concerns (import paths, potential race conditions, monkey-patching) that should be addressed but are not critical blockers. The core functionality appears sound with appropriate use of multiprocessing, memory management, and distributed training patterns. No new critical issues were identified in this review.
  • Previous comments highlight style improvements needed in train.py, inference.py, utils.py, grid.py, and well.py - primarily bare except clauses, logging consistency, and documentation fixes

Important Files Changed

File Analysis

Filename Score Overview
examples/reservoir_simulation/xmgn/src/data/graph_builder.py 4/5 New file implementing parallel graph construction from reservoir simulation data with multiprocessing, proper error handling, and efficient memory usage through immediate file saves
examples/reservoir_simulation/xmgn/src/preprocessor.py 4/5 New preprocessor orchestrating graph generation, partitioning, train/val/test splits, and global statistics computation with proper validation and user interaction
examples/reservoir_simulation/sim_utils/ecl_reader.py 4/5 New Eclipse binary file reader supporting INIT, EGRID, UNRST formats with endianness detection and comprehensive error handling

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant GraphBuilder
    participant EclReader
    participant Train
    participant Inference
    
    User->>Preprocessor: Run preprocessor.py
    Preprocessor->>Preprocessor: Validate config
    Preprocessor->>GraphBuilder: Create raw graphs
    
    loop For each simulation case
        GraphBuilder->>EclReader: Read INIT/EGRID/UNRST files
        EclReader-->>GraphBuilder: Grid data + dynamic vars
        GraphBuilder->>GraphBuilder: Build PyG graph per timestep
        GraphBuilder->>GraphBuilder: Save graph to disk
    end
    
    GraphBuilder-->>Preprocessor: List of graph files
    Preprocessor->>Preprocessor: Partition graphs (METIS)
    Preprocessor->>Preprocessor: Split train/val/test
    Preprocessor->>Preprocessor: Compute global statistics
    Preprocessor-->>User: Preprocessing complete
    
    User->>Train: Run train.py (distributed)
    Train->>Train: Load partitioned graphs
    Train->>Train: Train X-MGN model
    Train->>Train: Validate & checkpoint
    Train-->>User: Best model saved
    
    User->>Inference: Run inference.py
    Inference->>Inference: Load best checkpoint
    
    loop Autoregressive rollout
        Inference->>Inference: Predict next timestep
        Inference->>Inference: Use prediction as input
    end
    
    Inference->>Inference: Export HDF5 + GRDECL
    Inference-->>User: Results for visualization

6 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR adds a comprehensive new example implementing X-MeshGraphNet (X-MGN) for reservoir simulation surrogate modeling, targeting the petroleum engineering domain. The implementation includes ~5000 lines of new code across data preprocessing, training, inference, and visualization components.

Key Components Added:

  • sim_utils module: Binary ECLIPSE/IX file readers (ecl_reader.py, grid.py) for industry-standard reservoir simulation data formats
  • Data pipeline: Graph construction from reservoir grids (graph_builder.py), partitioning for distributed training, and PyTorch Geometric dataloaders
  • Training infrastructure: Multi-GPU training with MLflow tracking, gradient scaling, and early stopping
  • Inference pipeline: Autoregressive rollout with HDF5 and GRDECL output formats for visualization in industry tools (Petrel, ResInsight)
  • Documentation: Comprehensive README with two example datasets (2D five-spot, Norne Field)

Implementation Quality:
The code demonstrates solid engineering practices with proper licensing headers, extensive documentation, and integration with PhysicsNeMo's distributed training infrastructure. The autoregressive inference implementation correctly handles temporal dependencies and partition boundaries (halo nodes).
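A minimal sketch of an autoregressive rollout of this kind, assuming the predicted variables occupy known columns of the node-feature matrix; the column indices and model signature are placeholders, not the example's actual layout:

    import torch

    @torch.no_grad()
    def rollout(model, graph, num_steps, dyn_cols=(0, 1)):
        """Feed each prediction back into the dynamic feature columns."""
        x = graph.x.clone()
        preds = []
        for _ in range(num_steps):
            y = model(x, graph.edge_attr, graph)   # predict the next timestep
            preds.append(y)
            x = x.clone()
            x[:, list(dyn_cols)] = y               # overwrite dynamic inputs
        return torch.stack(preds)                  # (num_steps, num_nodes, num_vars)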

Previous Review Coverage:
This PR has already undergone extensive review with 70+ comments addressing style issues (unused imports, logging inconsistencies), potential logical issues (gradient accumulation timing, edge index remapping), and security concerns (weights_only=False in torch.load, bare except clauses). Most issues are non-critical style/best-practice suggestions.

Critical Issues Previously Identified:

  • Grid edge remapping logic complexity in sim_utils/grid.py (lines 239-247)
  • Gradient scaler placement inside partition loop in train.py (line 674)
  • Import path inconsistency (sim_utils absolute vs relative imports)
  • torch.load with weights_only=False poses security risk if loading untrusted partition files (inference.py:713)
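
A hedged sketch of the safer loading pattern for the checkpoint case; whether it applies to the pickled partition files depends on what objects they contain, since arbitrary pickled objects still require either a trusted source or an explicit allow-list:

    import torch

    def load_checkpoint(path: str):
        """Load a tensor-only checkpoint without unpickling arbitrary objects."""
        # weights_only=True restricts unpickling to tensors and plain containers,
        # closing the arbitrary-code-execution risk of the default pickle path.
        return torch.load(path, map_location="cpu", weights_only=True)

    # Files that pickle richer objects (e.g. PyG Data partitions) are refused by
    # weights_only=True; on recent PyTorch versions,
    # torch.serialization.add_safe_globals([...]) can allow-list specific
    # classes, otherwise the file must come from a trusted source.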

Confidence Score: 3/5

  • This PR introduces substantial new functionality with some logical concerns that should be addressed, but no critical blocking issues for an example/research codebase
  • Score of 3 reflects a balance between the PR's strengths and concerns. Positive factors: comprehensive documentation, proper integration with PhysicsNeMo infrastructure, CHANGELOG updated, and thorough examples. Concerns that lower confidence: (1) complex grid remapping logic with potential indexing bugs that could cause incorrect results, (2) gradient accumulation timing issue that may affect training convergence, (3) security risk from weights_only=False in torch.load, (4) numerous style issues suggesting the code hasn't been fully production-hardened. Since this is an example/research code rather than core library functionality, these issues are manageable but should be addressed before wider adoption
  • examples/reservoir_simulation/sim_utils/grid.py (edge remapping logic), examples/reservoir_simulation/xmgn/src/train.py (gradient scaler placement), and examples/reservoir_simulation/xmgn/src/inference.py (torch.load security)

Important Files Changed

File Analysis

Filename Score Overview
examples/reservoir_simulation/xmgn/src/inference.py 4/5 Autoregressive inference implementation with checkpoint loading, HDF5 output, and GRDECL generation - uses weights_only=False in torch.load on line 713
examples/reservoir_simulation/sim_utils/ecl_reader.py 4/5 ECLIPSE binary file reader supporting INIT, EGRID, UNRST formats - extensive functionality with some logging inconsistencies
examples/reservoir_simulation/sim_utils/grid.py 3/5 Grid processing with corner-point geometry and edge mapping - contains complex remapping logic with potential indexing issues
examples/reservoir_simulation/xmgn/src/train.py 3/5 Training script with distributed support, MLflow tracking, and early stopping - has gradient accumulation and scaler timing issues
examples/reservoir_simulation/xmgn/src/preprocessor.py 4/5 Data preprocessing pipeline for graph generation, partitioning, and train/val/test splitting
examples/reservoir_simulation/xmgn/conf/config_norne.yaml 4/5 Configuration for Norne Field dataset - contains relative path that may break with Hydra's chdir setting

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant EclReader
    participant Grid
    participant GraphBuilder
    participant Partitioner
    participant Trainer
    participant Model as MeshGraphNet
    participant Inference
    participant PostProcessor

    Note over User,PostProcessor: Phase 1: Data Preprocessing
    User->>Preprocessor: python preprocessor.py --config-name=config
    Preprocessor->>EclReader: read ECLIPSE binary files (.INIT, .EGRID, .UNRST)
    EclReader-->>Preprocessor: grid data, properties, timesteps
    Preprocessor->>Grid: create grid structure
    Grid-->>Preprocessor: corner-point geometry, connections, active cells
    Preprocessor->>GraphBuilder: build_graphs(cases, timesteps)
    loop For each case and timestep
        GraphBuilder->>GraphBuilder: create nodes (active cells) and edges (connections)
        GraphBuilder->>GraphBuilder: extract features (PERMX, PORO, PRESSURE, SWAT)
        GraphBuilder-->>Preprocessor: graph with node/edge features
    end
    Preprocessor->>Partitioner: partition graphs using METIS
    Partitioner-->>Preprocessor: partitioned graphs with halo nodes
    Preprocessor->>Preprocessor: split into train/val/test sets
    Preprocessor-->>User: processed graphs saved to dataset_dir

    Note over User,PostProcessor: Phase 2: Training
    User->>Trainer: torchrun train.py --config-name=config
    Trainer->>Trainer: initialize distributed training
    Trainer->>Trainer: load partitioned graphs
    Trainer->>Model: initialize MeshGraphNet
    loop For each epoch
        loop For each batch
            loop For each partition
                Trainer->>Model: forward(x, edge_attr, partition)
                Model-->>Trainer: predictions
                Trainer->>Trainer: calculate MSE loss on inner nodes
                Trainer->>Trainer: backward and accumulate gradients
            end
            Trainer->>Trainer: optimizer step
        end
        Trainer->>Trainer: validation on val set
        Trainer->>Trainer: check early stopping
    end
    Trainer-->>User: best checkpoint saved

    Note over User,PostProcessor: Phase 3: Inference
    User->>Inference: python inference.py --config-name=config
    Inference->>Inference: load best checkpoint
    Inference->>Inference: load test dataset
    loop For each case (autoregressive)
        loop For each timestep
            alt timestep < prev_timesteps + 1
                Inference->>Model: forward with true features
            else
                Inference->>Inference: update features with previous predictions
                Inference->>Model: forward with predicted features
            end
            Model-->>Inference: predictions
            Inference->>Inference: denormalize and store results
        end
    end
    Inference->>PostProcessor: save HDF5 files (case_name.hdf5)
    Inference->>PostProcessor: generate GRDECL files
    PostProcessor->>Grid: get coordinates and actnum
    PostProcessor->>PostProcessor: map active cells to full grid
    PostProcessor-->>User: HDF5 and GRDECL files for visualization

4 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
@tonishi-nv tonishi-nv requested a review from mnabian November 5, 2025 18:04
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Contributor

greptile-apps bot commented Nov 16, 2025

Greptile Overview

Greptile Summary

This PR adds a comprehensive X-MeshGraphNet implementation for reservoir simulation surrogate modeling, introducing ~7600 lines of new code across data preprocessing, training, and inference pipelines.

Key additions:

  • Complete ECLIPSE binary file parser (sim_utils/ecl_reader.py) for reading .INIT, .EGRID, .UNRST formats
  • Graph construction from reservoir grids with support for Non-Neighbor Connections (NNCs), faults, and complex geometries
  • METIS-based graph partitioning with halo regions for scalable training on large grids
  • Distributed training pipeline using PhysicsNeMo's MeshGraphNet with multi-GPU support
  • Autoregressive inference with HDF5 and GRDECL output formats
  • Example configs for synthetic and Norne Field datasets with comprehensive documentation

Architecture strengths:

  • Well-integrated with PhysicsNeMo's distributed infrastructure
  • Robust multiprocessing in graph builder with proper error handling
  • Comprehensive documentation and visualization examples
  • CHANGELOG properly updated

Areas requiring attention:

  • sim_utils/grid.py:239-247 contains dead code that gets overwritten
  • Multiple files use root logger instead of module-level logger (inconsistent logging)
  • torch.load with weights_only=False poses security risks in inference
  • Training scaler operations are inside partition loop (should be outside)
  • Config path issues with Hydra's chdir: true setting in Norne config
  • Early stopping state management has permanence issues (can't be reused)
  • Missing validation checks for edge_attr existence and empty tensors in dataloader
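
A small sketch of the kind of defensive checks suggested for the dataloader; the attribute names follow PyTorch Geometric's Data conventions, while the messages and helper name are illustrative:

    from torch_geometric.data import Data

    def validate_partition(data: Data, path: str) -> None:
        """Reject empty or malformed partitions before they reach the model."""
        if data.num_nodes is None or data.num_nodes == 0:
            raise ValueError(f"{path}: partition has no nodes")
        if data.edge_index is None or data.edge_index.numel() == 0:
            raise ValueError(f"{path}: partition has no edges")
        if getattr(data, "edge_attr", None) is None:
            raise ValueError(f"{path}: edge_attr is missing")
        if data.edge_attr.shape[0] != data.edge_index.shape[1]:
            raise ValueError(f"{path}: edge_attr rows do not match edge count")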

Confidence Score: 3/5

  • This PR is safe to merge with moderate risk - the example is functional but has code quality issues that should be addressed
  • Score reflects that while the implementation is comprehensive and well-documented, there are multiple logic issues (dead code in grid.py, scaler handling in training loop), inconsistent patterns (root vs module logger), and potential runtime errors (missing edge_attr checks, empty tensor handling). These are concentrated in example code rather than core library changes, reducing overall risk. The CHANGELOG is properly updated and the feature is isolated to a new example directory.
  • Pay close attention to sim_utils/grid.py (dead code at lines 239-247), xmgn/src/train.py (scaler operations inside partition loop), and xmgn/src/data/dataloader.py (missing validation checks). The config path issue in config_norne.yaml should be verified before use.

Important Files Changed

File Analysis

Filename Score Overview
examples/reservoir_simulation/xmgn/src/train.py 3/5 Training loop with distributed support. Has scaler handling issues in partition loop and early stopping timing concerns, but core training logic is sound
examples/reservoir_simulation/xmgn/src/inference.py 3/5 Autoregressive inference with HDF5 and GRDECL output. Uses weights_only=False (security concern) and has halo node tracking that overwrites on collision
examples/reservoir_simulation/xmgn/src/preprocessor.py 3/5 Builds graphs from simulation data and partitions them. Validation logic for existing data is incomplete, partitions could be inconsistent if preprocessing interrupted
examples/reservoir_simulation/xmgn/src/data/graph_builder.py 4/5 Core graph construction from ECLIPSE format files. Robust multiprocessing with fallback handling, though filename collision risk in fallback logic
examples/reservoir_simulation/xmgn/src/data/dataloader.py 3/5 Custom dataloader with partitioning support. Missing edge_attr and empty tensor checks could cause runtime errors
examples/reservoir_simulation/sim_utils/grid.py 2/5 Grid and connectivity handling. Has dead code (lines 239-247) and uses root logger instead of module logger throughout
examples/reservoir_simulation/sim_utils/ecl_reader.py 3/5 Binary ECLIPSE file parser. Has bare excepts and uses root logger instead of module logger, but handles complex binary formats reasonably
examples/reservoir_simulation/xmgn/conf/config_norne.yaml 3/5 Norne-specific config. Has relative path issues with Hydra's chdir: true setting that could break dataset loading

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessor
    participant EclReader
    participant Grid
    participant GraphBuilder
    participant Train
    participant MGN as MeshGraphNet
    participant Inference
    
    User->>Preprocessor: python preprocessor.py --config-name=config
    Preprocessor->>EclReader: Read INIT/EGRID/UNRST files
    EclReader-->>Preprocessor: Simulation data
    Preprocessor->>Grid: Build grid structure
    Grid-->>Preprocessor: Graph connectivity & transmissibilities
    Preprocessor->>GraphBuilder: Build PyG graphs
    GraphBuilder->>GraphBuilder: Extract static/dynamic features
    GraphBuilder->>GraphBuilder: Create autoregressive sequences
    GraphBuilder-->>Preprocessor: PyG Data objects
    Preprocessor->>Preprocessor: Partition graphs with METIS
    Preprocessor->>Preprocessor: Organize train/val/test splits
    Preprocessor-->>User: Processed graphs saved
    
    User->>Train: torchrun src/train.py --config-name=config
    Train->>Train: Load partition files & normalization stats
    Train->>MGN: Initialize model
    loop Each Epoch
        Train->>Train: Process each batch of partitions
        Train->>MGN: Forward pass on each partition
        MGN-->>Train: Predictions
        Train->>Train: Compute loss & backward
        Train->>Train: Validate & check early stopping
    end
    Train-->>User: Checkpoint saved
    
    User->>Inference: python inference.py --config-name=config
    Inference->>Inference: Load checkpoint & test data
    loop Autoregressive Prediction
        Inference->>MGN: Predict next timestep
        MGN-->>Inference: Predictions
        Inference->>Inference: Update state with predictions
    end
    Inference->>Inference: Denormalize & assemble full grid
    Inference-->>User: HDF5 & GRDECL output files

Contributor

@greptile-apps greptile-apps bot left a comment

17 files reviewed, no comments


Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
Signed-off-by: Tsubasa Onishi <tonishi@nvidia.com>
@tonishi-nv tonishi-nv requested a review from mnabian November 16, 2025 15:23
Contributor

@greptile-apps greptile-apps bot left a comment

17 files reviewed, no comments

