Description
Build the hello world test with pthreads and ASan enabled:
emcc test/hello_world.c -o a.js -pthread -sPROXY_TO_PTHREAD=1 -fsanitize=address
Then stress test the resulting a.js using the following Python script:
stress.py
import subprocess
import multiprocessing
import sys

COMMAND = ['node', 'a.js']

def worker(stop_flag):
    # Run the command in a loop until any worker observes a failure.
    while not stop_flag.is_set():
        try:
            result = subprocess.run(COMMAND, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60)
            output = result.stdout.decode(errors="ignore")
            if result.stderr:
                output += result.stderr.decode(errors="ignore")
            if result.returncode != 0 or "hello, world" not in output:
                stop_flag.set()
                print(f"Command failed with exit code {result.returncode}. Output:\n{output}")
        except Exception as e:
            stop_flag.set()
            print(f"Error running command: {e}")
            # Only some exceptions (e.g. TimeoutExpired) carry captured output,
            # and it may be None, so guard both attributes.
            output = (getattr(e, "stdout", None) or b"").decode(errors="ignore")
            output += (getattr(e, "stderr", None) or b"").decode(errors="ignore")
            print(f"Command failed in exception. Output:\n{output}")

def main():
    stop_flag = multiprocessing.Event()
    procs = []
    # One stress worker per CPU core.
    for _ in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=worker, args=(stop_flag,), daemon=True)
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # more portable across platforms
    main()
After a random amount of time, anywhere from a few minutes to half an hour, the above fails with:
Error running command: Command '['node', 'a.js']' timed out after 60 seconds
Command failed in exception. Output:
hello, world!
What has happened is that the node process has hung after the end of the program: the program executed correctly and printed its output, but the process fails to shut down.
What makes this difficult to debug is that the hang does not occur very often (although it does happen repeatedly on my CI). Adding enough console.log()s to the application JS code causes the test case to stop failing, as the timing is disturbed.
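Since instrumenting the JS side perturbs the timing, one option is to classify failures from the outside instead. The following is a minimal sketch, not part of the original script: `classify_run` and its return labels are hypothetical names. It relies on `subprocess.TimeoutExpired` exposing the output captured before the timeout, which lets a hang after the program has completed be told apart from a hang before it printed anything.

```python
import subprocess

def classify_run(command, expected, timeout):
    """Run command once and classify the outcome (hypothetical helper)."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, timeout=timeout)
    except subprocess.TimeoutExpired as e:
        # e.stdout holds whatever was captured before the timeout; it may be None.
        partial = (e.stdout or b"").decode(errors="ignore")
        if expected in partial:
            # The program produced its output but the process never exited.
            return "hang_after_output"
        return "hang_before_output"
    output = result.stdout.decode(errors="ignore")
    if result.returncode != 0 or expected not in output:
        return "failed"
    return "ok"
```

For the case in this report, a call such as `classify_run(['node', 'a.js'], 'hello, world', 60)` would be expected to return `"hang_after_output"` on the runs that time out, assuming the output reaches the pipe before the hang.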
As a random test, if I do this:
diff --git a/src/lib/libpthread.js b/src/lib/libpthread.js
index 44d75892c..70a7939b2 100644
--- a/src/lib/libpthread.js
+++ b/src/lib/libpthread.js
@@ -511,7 +511,9 @@ var LibraryPThread = {
#if PTHREADS_DEBUG
dbg(`terminateWorker: ${worker.workerID}`);
#endif
- worker.terminate();
+ setTimeout(() => {
+ worker.terminate();
+ }, 5000);
// terminate() can be asynchronous, so in theory the worker can continue
// to run for some amount of time after termination. However from our POV
// the worker now dead and we don't want to hear from it again, so we stub
then the hang no longer occurs, and the test survives a three-hour stress run.
To reproduce the hang, it is necessary to build with asan enabled. Simply building with
emcc test/hello_world.c -o a.js -pthread -sPROXY_TO_PTHREAD=1
does not reproduce the hang in the stress test. It is unclear whether something fundamentally related to ASan causes the race condition and the hang, or whether ASan merely changes timings so that the hang becomes more apparent.
This hang occurs, for example, in the test asani.test_pthread_dylink_basics on my CI, every 1-3 days. See e.g. http://clbri.com:8010/api/v2/logs/53959/raw_inline . However, nothing fundamental to dynamic linking causes the hang: the hello world test case above does not use dynamic linking.