-sPROXY_TO_PTHREAD hang in Node.js (test asani.test_pthread_dylink_basics is flaky) #25211

@juj

Build the hello world test case with:

emcc test/hello_world.c -o a.js -pthread -sPROXY_TO_PTHREAD=1 -fsanitize=address

And then stress test the resulting a.js using the following Python script:

stress.py

import subprocess
import multiprocessing
import sys

COMMAND = ['node', 'a.js']

def worker(stop_flag):
    while not stop_flag.is_set():
        try:
            # stderr is merged into stdout, so result.stderr is always None.
            result = subprocess.run(COMMAND, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60)
            output = result.stdout.decode(errors="ignore")
            if result.returncode != 0 or "hello, world" not in output:
                stop_flag.set()
                print(f"Command failed with exit code {result.returncode}. Output:\n{output}")
        except Exception as e:
            stop_flag.set()
            print(f"Error running command: {e}")
            # Only some exceptions (e.g. subprocess.TimeoutExpired) carry captured
            # output, and it is None if nothing was captured before the timeout.
            output = (getattr(e, "stdout", None) or b"").decode(errors="ignore")
            print(f"Command failed in exception. Output:\n{output}")

def main():
    stop_flag = multiprocessing.Event()
    procs = []
    for _ in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=worker, args=(stop_flag,), daemon=True)
        p.start()
        procs.append(p)

    for p in procs:
        p.join()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # more portable across platforms
    main()

After a random amount of time, anywhere from a few minutes to half an hour, the above fails with:

Error running command: Command '['node', 'a.js']' timed out after 60 seconds
Command failed in exception. Output:
hello, world!

What has happened is that the node process hangs after the program has finished: the program executed correctly and printed its output, but the process fails to shut down.
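To make it easier to tell this "ran fine but never exited" case apart from other failures, a single run could be classified roughly as below. This is only a sketch: the function name `classify_run` and the four result labels are made up for illustration, and it relies on `subprocess.TimeoutExpired` carrying whatever output was captured before the timeout.

```python
import subprocess

def classify_run(command, expected="hello, world", timeout=60):
    """Run command once and report how it ended.

    Returns "ok", "bad-exit", "hang-after-output" or "hang-before-output".
    (Hypothetical labels, chosen for this sketch.)
    """
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, timeout=timeout)
    except subprocess.TimeoutExpired as e:
        # e.stdout holds the bytes captured before the child was killed,
        # or None if nothing was captured.
        captured = (e.stdout or b"").decode(errors="ignore")
        if expected in captured:
            return "hang-after-output"   # program ran, process did not exit
        return "hang-before-output"      # program never got as far as printing
    output = result.stdout.decode(errors="ignore")
    if result.returncode != 0 or expected not in output:
        return "bad-exit"
    return "ok"
```

With the build above, `classify_run(['node', 'a.js'])` would be expected to return "hang-after-output" for the failure described here, since "hello, world!" is present in the captured output when the timeout fires.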

Debugging this is difficult because the hang occurs rarely (although it does happen repeatedly on my CI). Adding enough console.log()s to the application JS code makes the test case stop failing, presumably because the timing is disturbed.

As a random test, if I do this:

diff --git a/src/lib/libpthread.js b/src/lib/libpthread.js
index 44d75892c..70a7939b2 100644
--- a/src/lib/libpthread.js
+++ b/src/lib/libpthread.js
@@ -511,7 +511,9 @@ var LibraryPThread = {
 #if PTHREADS_DEBUG
     dbg(`terminateWorker: ${worker.workerID}`);
 #endif
-    worker.terminate();
+    setTimeout(() => {
+        worker.terminate();
+    }, 5000);
     // terminate() can be asynchronous, so in theory the worker can continue
     // to run for some amount of time after termination.  However from our POV
     // the worker now dead and we don't want to hear from it again, so we stub

then the hang no longer occurs, and the program survives a three-hour stress test.

To reproduce the hang, it is necessary to build with asan enabled. Simply building with

emcc test/hello_world.c -o a.js -pthread -sPROXY_TO_PTHREAD=1

does not reproduce the hang in the stress test. It is unclear whether something fundamentally related to asan causes the race condition and the hang, or whether asan merely changes timings so that the hang becomes more apparent.
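For comparing the two builds under the same load, something like the following could count timeouts over a fixed number of runs instead of stopping at the first failure. It is a rough sketch: `timed_out` and `count_hangs` are hypothetical helper names, and the output file names `a_asan.js` and `a_plain.js` mentioned afterwards are assumptions (the commands above both write `a.js`).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def timed_out(command, timeout):
    """Return True if one run of command exceeds timeout seconds."""
    try:
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True

def count_hangs(command, runs, timeout=60, jobs=4):
    """Run command `runs` times, `jobs` at a time, and count the timeouts."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Each task returns a bool; summing them counts the True results.
        return sum(pool.map(lambda _: timed_out(command, timeout), range(runs)))
```

Assuming the two builds were written to `a_asan.js` and `a_plain.js`, comparing `count_hangs(['node', 'a_asan.js'], runs=N)` against `count_hangs(['node', 'a_plain.js'], runs=N)` for the same N gives a like-for-like comparison. Since the hang is rare, a zero count for the non-asan build would still not prove that it cannot hang.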

This hang occurs, for example, in the test asani.test_pthread_dylink_basics on my CI every 1-3 days, e.g. http://clbri.com:8010/api/v2/logs/53959/raw_inline . However, nothing specific to dynamic linking causes the hang: the hello world test case above does not use dynamic linking.
