Skip to content

Conversation

Copilot
Copy link

@Copilot Copilot AI commented Aug 12, 2025

This PR adds device existence validation to GPU operators to prevent cryptic failures when CUDA_VISIBLE_DEVICES and container GPU configuration are mismatched.

Problem

When using Docker with both --env CUDA_VISIBLE_DEVICES=1 and --gpus "device=1", the container runtime only exposes device 0 but renumbers it from the original device 1. This causes Devito to attempt using device ID 1 (from CUDA_VISIBLE_DEVICES) on a system that only has device 0 available, leading to opaque failures like:

tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.

Solution

Added device validation logic that generates runtime checks before calling acc_set_device_num() or omp_set_default_device(). The generated code now includes:

OpenACC validation:

if (deviceid != -1) {
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (deviceid >= ngpus) {
        printf("OpenACC: Error - device %d >= %d devices\n", deviceid, ngpus);
        exit(1);
    }
    acc_set_device_num(deviceid, acc_device_nvidia);
}

OpenMP validation:

if (deviceid != -1) {
    int ngpus = omp_get_num_devices();
    if (deviceid >= ngpus) {
        printf("OpenMP: Error - device %d >= %d devices\n", deviceid, ngpus);
        exit(1);
    }
    omp_set_default_device(deviceid);
}

Benefits

  • Clear error messages: Users get specific error messages instead of cryptic exit codes
  • Graceful failure: Program exits cleanly with error code 1 instead of undefined behavior
  • Backward compatible: Only validates when deviceid is explicitly set (deviceid != -1)
  • Cross-platform: Works for both OpenACC and OpenMP backends
  • Minimal overhead: Uses existing device count infrastructure with concise validation

Implementation Details

The solution follows Devito's established patterns:

  1. Leveraged existing infrastructure: Uses the existing 'num-devices' mappers for generating appropriate runtime calls (omp_get_num_devices or acc_get_num_devices)

  2. Minimal code changes: Enhanced device setup paths to include validation logic before device assignment, following Devito's 90-character line limit and coding standards

  3. Proper integration: Works seamlessly with both sequential and MPI device setup configurations

Changes

  • Modified devito/passes/iet/langbase.py to add concise validation logic using established patterns
  • Updated tests in tests/test_gpu_openmp.py and tests/test_gpu_openacc.py to validate the new behavior
  • Follows Devito's coding standards with proper imports, line limits, and minimal implementation

Fixes #2711.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copy link
Author

Copilot AI commented Aug 12, 2025

@ggorman 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
@Copilot Copilot AI changed the title [WIP] Need to catch when code attempts to run on a device ID that doesn't exist Add device ID validation to prevent cryptic GPU failures Aug 12, 2025
@Copilot Copilot AI requested a review from ggorman August 12, 2025 10:50
Copilot finished work on behalf of ggorman August 12, 2025 10:50
@ggorman
Copy link
Contributor

ggorman commented Aug 12, 2025

@copilot - CondGe does not do what you think it does; see github actions failures.

Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Copy link
Author

Copilot AI commented Aug 12, 2025

@copilot - CondGe does not do what you think it does; see github actions failures.

Fixed the CondGe import error in commit b03b63d. The issue was that CondGe doesn't exist in devito.symbolics - I've replaced it with Ge from devito.types.relational which provides the correct >= comparison functionality for conditionals.

Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Copilot finished work on behalf of ggorman August 12, 2025 11:23
@FabioLuporini
Copy link
Contributor

as usual, this is a mess

@ggorman
Copy link
Contributor

ggorman commented Aug 12, 2025

@copilot - pull in .github/copilot-instructions.md and use the additional knowledge from that file to revise your patch.

Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Copy link
Author

Copilot AI commented Aug 12, 2025

as usual, this is a mess

Fixed in commit afbd708. Simplified the implementation to follow Devito's established patterns - removed unnecessary complexity, shortened error messages, used proper imports, and ensured 90-character line compliance. The validation logic is now much cleaner while preserving the core functionality.

@copilot - pull in .github/copilot-instructions.md and use the additional...

Revised the implementation using the copilot instructions from PR #2713 in commit afbd708. The new approach follows Devito's coding standards: uses SymPy directly, implements minimal changes, maintains 90-character line limits, and uses established patterns for device handling. The solution is now much cleaner while preserving the device validation functionality.

Copilot finished work on behalf of ggorman August 12, 2025 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need to catch when code attempts to run on a device ID that doesn't exist
3 participants