-
Notifications
You must be signed in to change notification settings - Fork 238
Add device ID validation to prevent cryptic GPU failures #2712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@ggorman 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs. I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
@copilot - CondGe does not do what you think it does; see github actions failures. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Fixed the CondGe import error in commit b03b63d. The issue was that CondGe doesn't exist in devito.symbolics - I've replaced it with Ge from devito.types.relational which provides the correct >= comparison functionality for conditionals. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
as usual, this is a mess |
@copilot - pull in .github/copilot-instructions.md and use the additional knowledge from that file to revise your patch. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Fixed in commit afbd708. Simplified the implementation to follow Devito's established patterns - removed unnecessary complexity, shortened error messages, used proper imports, and ensured 90-character line compliance. The validation logic is now much cleaner while preserving the core functionality.
Revised the implementation using the copilot instructions from PR #2713 in commit afbd708. The new approach follows Devito's coding standards: uses SymPy directly, implements minimal changes, maintains 90-character line limits, and uses established patterns for device handling. The solution is now much cleaner while preserving the device validation functionality. |
This PR adds device existence validation to GPU operators to prevent cryptic failures when CUDA_VISIBLE_DEVICES and container GPU configuration are mismatched.
Problem
When using Docker with both
--env CUDA_VISIBLE_DEVICES=1
and--gpus "device=1"
, the container runtime only exposes device 0 but renumbers it from the original device 1. This causes Devito to attempt using device ID 1 (from CUDA_VISIBLE_DEVICES) on a system that only has device 0 available, leading to opaque failures like:Solution
Added device validation logic that generates runtime checks before calling
acc_set_device_num()
oromp_set_default_device()
. The generated code now includes:OpenACC validation:
OpenMP validation:
Benefits
Implementation Details
The solution follows Devito's established patterns:
Leveraged existing infrastructure: Uses the existing 'num-devices' mappers for generating appropriate runtime calls (
omp_get_num_devices
oracc_get_num_devices
)Minimal code changes: Enhanced device setup paths to include validation logic before device assignment, following Devito's 90-character line limit and coding standards
Proper integration: Works seamlessly with both sequential and MPI device setup configurations
Changes
devito/passes/iet/langbase.py
to add concise validation logic using established patternstests/test_gpu_openmp.py
andtests/test_gpu_openacc.py
to validate the new behaviorFixes #2711.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.