
Cannot train an epoch #43


Description

@Licolas

Training always stops at this point:

(subdivnet) lm@lm:~/0-majorRevision/SubdivNet-master$ sh scripts/manifold40/train.sh
[i 0711 09:55:59.289227 96 compiler.py:956] Jittor(1.3.8.5) src: /home/lm/anaconda3/envs/subdivnet/lib/python3.7/site-packages/jittor
[i 0711 09:55:59.298060 96 compiler.py:957] g++ at /usr/bin/g++(11.4.0)
[i 0711 09:55:59.298134 96 compiler.py:958] cache_path: /home/lm/.cache/jittor/jt1.3.8/g++11.4.0/py3.7.16/Linux-6.5.0-41xc8/IntelRXeonRSilxdc/default
[i 0711 09:55:59.307678 96 init.py:411] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 0711 09:55:59.383568 96 init.py:411] Found gdb(22.04.2) at /usr/bin/gdb.
[i 0711 09:55:59.397309 96 init.py:411] Found addr2line(2.38) at /usr/bin/addr2line.
[i 0711 09:55:59.510948 96 compiler.py:1011] cuda key:cu11.7.99_sm_89
[i 0711 09:56:00.005350 96 init.py:227] Total mem: 62.44GB, using 16 procs for compiling.
Compiling jittor_core(151/151) used: 2.437s eta: 0.000s
[i 0711 09:56:02.815749 96 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0711 09:56:02.888116 96 init.cc:62] Found cuda archs: [89,]
[w 0711 09:56:02.903832 96 compiler.py:1384] CUDA arch(89)>86 will be backward-compatible
[w 0711 09:56:02.935237 96 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH(['', '/usr/local/cuda-11.7/lib64', '/home/lm/anaconda3/envs/subdivnet/bin', '/home/lm/anaconda3/condabin', '/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin', '/usr/games', '/usr/local/games', '/snap/bin', '/snap/bin', '/usr/local/cuda-11.7/bin']), This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[i 0711 09:56:12.951927 96 cuda_flags.cc:49] CUDA enabled.
name: manifold40
Train 0: 0%|▍ | 12/3278 [00:06<20:55, 2.60it/s][w 0711 09:56:20.710701 96 cudnn_conv__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFORMAT_abcd_____hash_4d5b3e2d24c769d3_op.cc:419] forward_ algorithm cache is full
Train 0: 0%|▍ | 13/3278 [00:06<21:05, 2.58it/s][w 0711 09:56:20.865463 96 cudnn_conv_backward_w__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFO___hash_8e480e8564e59906_op.cc:418] backward w algorithm cache is full
Train 0: 0%|▍ | 15/3278 [00:07<19:45, 2.75it/s][w 0711 09:56:21.510013 96 cudnn_conv_backward_x__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFO___hash_af8994a8aef53c1c_op.cc:410] backward x algorithm cache is full
Train 0: 67%|████████████████████████████████████████████████████████████████████▌ | 2184/3278 [10:19<05:21, 3.40it/s]
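
Possibly unrelated, but the startup warning above suggests removing the CUDA entries from LD_LIBRARY_PATH and PATH, or letting Jittor install CUDA itself. A minimal sketch of the two options the warning names (python3.7 assumed from the py3.7.16 cache path above):

unset LD_LIBRARY_PATH
# either also remove the /usr/local/cuda-11.7 entries from PATH and rerun,
# or let Jittor manage its own CUDA toolkit:
python3.7 -m jittor_utils.install_cuda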

The log is as follows:


Async error was detected. To locate the async backtrace and get better error report, please rerun your code with two enviroment variables set:

export JT_SYNC=1
export trace_py_var=3
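
A sketch of the rerun the error message asks for, assuming the same entry point as above (as I understand it, JT_SYNC=1 forces synchronous execution so the failing op is reported where it actually occurs, and trace_py_var=3 adds the Python-level backtrace):

export JT_SYNC=1
export trace_py_var=3
sh scripts/manifold40/train.sh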
