Pending Timeout Error #275
As the number of participant submissions has grown, testing time has increased significantly. Code that used to return test results in about 1 hour is now timing out.
(1) Is it possible to provide more CPU resources in phase 3, or to show the current load on the server? My guess is that most of the test time is spent cycling through the test data, i.e. the bottleneck is the CPU. The reason is that the runtime when fusing five models is only slightly longer than with a single model, not exponentially longer. Of course, it is also possible that the long queues come from waiting for idle GPUs in the background.
(2) Is the maximum runtime 2 hours, 10 hours, or 20 hours? The error message I get is "[PendingTimeout] pending timeout(>7200), evict by system", i.e. 2 h. However, the screenshot in the reply to discussion #258 shows a time limit of 72000, i.e. 20 h.
Thank you.
Replies: 1 comment
Pending Timeout is 2 hours.
Running Timeout is 10 hours.
Over the last few days, most submissions that hit the Pending Timeout error were stuck in the queue waiting to be scheduled by the system, mainly because GPU resources are limited. Once a submission has been scheduled and selected for evaluation, its computation resources have already been allocated, so neither the GPU nor the CPU is a bottleneck while the submitted model is initialized and evaluated.
In other words, the limited computational resources only affect the global scheduling.
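To make the difference between the two limits concrete, here is a minimal Python sketch. It is not the platform's actual scheduler code: the function name, its arguments, and the status labels are hypothetical, and only the two limits (7200 s pending, 10 hours running) come from this thread.

```python
from typing import Optional

# Limits stated in this thread, expressed in seconds.
PENDING_TIMEOUT_S = 2 * 60 * 60    # 7200 s, matches "pending timeout(>7200)"
RUNNING_TIMEOUT_S = 10 * 60 * 60   # 36000 s, the 10-hour running limit


def submission_status(queued_s: float, running_s: Optional[float] = None) -> str:
    """Classify a submission (hypothetical helper, for illustration only).

    queued_s  -- seconds spent waiting in the queue
    running_s -- seconds spent in evaluation, or None if never scheduled
    """
    if running_s is None:
        # Still waiting for resources: only the pending limit applies.
        return "pending_timeout" if queued_s > PENDING_TIMEOUT_S else "pending"
    # Already scheduled: resources are allocated, only the running limit applies.
    return "running_timeout" if running_s > RUNNING_TIMEOUT_S else "running"


print(submission_status(7201))         # waited too long in the queue
print(submission_status(600, 3600))    # scheduled, one hour into evaluation
```

Under these assumptions, a submission evicted with the PendingTimeout message never started evaluating, so making the code faster does not help with that error; it only matters for the 10-hour running limit.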