While running some tests on Frontier I noticed the following issue:
$ srun -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/pgrete/src/athenapk/build-bump-parth/bin/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=4
Memory access fault by GPU node-8 (Agent handle: 0x61253f0) on address 0x7ff7f2522000. Reason: Unknown.
srun: error: frontier08577: task 0: Aborted
srun: Terminating StepId=2345368.15
slurmstepd: error: *** STEP 2345368.15 ON frontier08577 CANCELLED AT 2024-09-06T06:38:27 ***
^[[A^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=2345368.15 tasks 1-7: running
srun: StepId=2345368.15 task 0: exited abnormally
It should be confirmed whether this is Frontier-specific or a more general AthenaPK or Parthenon issue.
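For anyone trying to reproduce this at a different scale, the decomposition implied by the command line above can be worked out with a short sketch. It assumes a uniform mesh without refinement and an even distribution of meshblocks across ranks, and it treats `pack_size` as the number of meshblocks per pack; ghost zones and the number of variables are not included, so this is not a memory estimate. The function name `decomposition` and the dictionary keys are illustrative only and are not part of AthenaPK or Parthenon.

```python
import math

def decomposition(mesh, meshblock, n_ranks, pack_size):
    """Derive block counts from the mesh and meshblock sizes (hypothetical helper)."""
    # Number of meshblocks along each dimension (assumes exact divisibility).
    blocks_per_dim = tuple(m // b for m, b in zip(mesh, meshblock))
    total_blocks = math.prod(blocks_per_dim)
    # Assumption: blocks are spread evenly over the MPI ranks.
    blocks_per_rank = total_blocks // n_ranks
    # Interior cells only; ghost zones are deliberately left out.
    cells_per_block = math.prod(meshblock)
    # Assumption: pack_size is the number of meshblocks per pack.
    packs_per_rank = math.ceil(blocks_per_rank / pack_size)
    return {
        "blocks_per_dim": blocks_per_dim,
        "total_blocks": total_blocks,
        "blocks_per_rank": blocks_per_rank,
        "cells_per_block": cells_per_block,
        "cells_per_rank": cells_per_block * blocks_per_rank,
        "packs_per_rank": packs_per_rank,
    }

# Values copied from the srun command line above:
# mesh nx1..nx3 = 1024, meshblock nx1..nx3 = 256, -n 8, pack_size=4.
result = decomposition((1024, 1024, 1024), (256, 256, 256), n_ranks=8, pack_size=4)
for key, value in result.items():
    print(f"{key}: {value}")
```

Under these assumptions the run uses 64 meshblocks in total, i.e. 8 blocks of 256^3 interior cells per rank (one rank per GPU), which may be a useful reference point when testing whether smaller meshblocks or a different `pack_size` still trigger the fault.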