: stdio redirection #900

Open
wants to merge 2 commits into
base: main

shayne-fletcher
Contributor

Differential Revision: D80366985

@meta-cla meta-cla bot added the CLA Signed label Aug 15, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80366985

shayne-fletcher added a commit to shayne-fletcher/monarch-1 that referenced this pull request Aug 15, 2025
Summary:

rust startup code (code that runs before `main`) sets the disposition of `SIGPIPE` so that the signal is silently ignored (that is, it runs `signal(Signal::SIGPIPE, SigHandler::SigIgn)` or equivalent). this behavior, introduced in 2014, is poorly documented, but see rust-lang/rust#62569.
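
as a rough illustration of the `SIGPIPE` point above, the following std-only sketch writes to a pipe whose reader has already exited. the helper name `write_to_closed_pipe` and the use of the `true` command are assumptions made for the example (unix only); it shows only the plain `std::io` write path, not the `tracing` path discussed in this summary.

```rust
use std::io::{ErrorKind, Write};
use std::process::{Command, Stdio};

/// Hypothetical helper: write to a pipe after its only reader has exited.
fn write_to_closed_pipe() -> std::io::Result<()> {
    // `true` exits immediately without reading its stdin.
    let mut child = Command::new("true").stdin(Stdio::piped()).spawn()?;
    let mut stdin = child.stdin.take().expect("stdin was piped");
    // Once the child has been reaped, the read end of the pipe is closed.
    child.wait()?;
    // With SIGPIPE ignored, this write returns an error (EPIPE)
    // instead of terminating the process.
    stdin.write_all(b"hello\n")
}

fn main() {
    match write_to_closed_pipe() {
        Err(e) if e.kind() == ErrorKind::BrokenPipe => {
            println!("write failed with EPIPE: {e}")
        }
        other => println!("unexpected result: {other:?}"),
    }
}
```

if the default `SIGPIPE` disposition were in effect, the same write would terminate the process rather than return an error.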

a task spawned in `hyperactor::signal_handler::GlobalSignalManager::new` creates an async signal listener using the `signal-hook-tokio` crate. it watches for `SIGINT` and `SIGTERM` and, on receiving one, executes cleanup code before removing the hooks and re-raising the signal to restore and trigger the default behavior (process termination). that signal handling code includes logging calls via `tracing::info!()` and `tracing::error!()`.

the problem is that if `SIGTERM` (say) is being handled by an orphan, the earlier death of the parent can mean the orphan's stdout/stderr pipes are closed. normally, writing to a closed pipe would raise `SIGPIPE` and terminate the process, but here a logging call results in an infinite uninterruptible sleep, which hangs the process and prevents it from shutting down.

this diff adds a call to a newly developed function, `stdio_redirect::handle_broken_pipes()`, which detects this condition and, as needed, redirects stdio to a file whose name is derived from the process ID (e.g. `monarch-process-exit-3529266.log`). this allows the process to keep logging normally as it terminates.
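
one way such detect-and-redirect logic could look is sketched below. this is not the code in this diff: the helpers `is_broken_pipe`, `exit_log_path` and `redirect_if_broken` are hypothetical, the `poll`/`dup2` bindings are hand-written for linux (real code would more likely use the `libc` or `nix` crate), and the sketch assumes rust 1.82 or later for `unsafe extern`.

```rust
use std::ffi::{c_int, c_short, c_ulong};
use std::fs::OpenOptions;
use std::io;
use std::os::fd::{AsRawFd, RawFd};
use std::path::{Path, PathBuf};

// Hand-written Linux bindings, for illustration only.
#[repr(C)]
struct PollFd {
    fd: c_int,
    events: c_short,
    revents: c_short,
}

const POLLERR: c_short = 0x008; // Linux value

unsafe extern "C" {
    fn poll(fds: *mut PollFd, nfds: c_ulong, timeout: c_int) -> c_int;
    fn dup2(oldfd: c_int, newfd: c_int) -> c_int;
}

/// Hypothetical: true if `fd` is the write end of a pipe with no reader.
fn is_broken_pipe(fd: RawFd) -> bool {
    let mut pfd = PollFd { fd, events: 0, revents: 0 };
    // Zero timeout: query state only. POLLERR is reported even if not requested.
    let n = unsafe { poll(&mut pfd, 1, 0) };
    n > 0 && (pfd.revents & POLLERR) != 0
}

/// Hypothetical: file name pattern taken from the example in the summary.
fn exit_log_path(dir: &Path, pid: u32) -> PathBuf {
    dir.join(format!("monarch-process-exit-{pid}.log"))
}

/// Hypothetical: if `fd` is a broken pipe, point it at `log` instead.
fn redirect_if_broken(fd: RawFd, log: &Path) -> io::Result<bool> {
    if !is_broken_pipe(fd) {
        return Ok(false);
    }
    let file = OpenOptions::new().create(true).append(true).open(log)?;
    if unsafe { dup2(file.as_raw_fd(), fd) } < 0 {
        return Err(io::Error::last_os_error());
    }
    // `file` is dropped here; `fd` keeps its own reference to the open file.
    Ok(true)
}

fn main() -> io::Result<()> {
    let log = exit_log_path(&std::env::temp_dir(), std::process::id());
    // fds 1 and 2 are stdout and stderr.
    for fd in [1, 2] {
        redirect_if_broken(fd, &log)?;
    }
    println!("shutting down");
    Ok(())
}
```

the sketch leaves healthy descriptors untouched, so a process whose parent is still alive keeps writing to its original stdout/stderr.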

Differential Revision: D80366985
Summary:

`fbcode//monarch/hyperactor_mesh:hyperactor_mesh_proxy_test` is a standalone, self-bootstrapping program that uses `ProcessAllocator` to do the following:
- the driver creates a proc/process to host a `ProxyActor`
- initialization of the `ProxyActor` on the new proc/process creates a proc/process to host a `TestActor`

so, executing this program creates a three-level process hierarchy `driver -> parent -> grandchild`, where the `parent` process hosts a single proc/process (rank = 0) with one `ProxyActor` and the `grandchild` hosts a single proc/process (rank = 0) with one `TestActor`.

using this program, i observe that, as things stand, logs from the parent and the grandchild are merged into a single file, `/tmp/$USER/monarch_log_0.stdout`, because the two procs share the same rank.

this diff disambiguates proc logs by incorporating the process ID of the mesh owner into the proc's log file name.

so, for example, there will now be separate logs: `monarch_log_3529266_0.stdout` (capturing the logs of the parent proc) and `monarch_log_3530444_0.stdout` (capturing the logs of the grandchild proc).
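
the naming scheme can be sketched as follows. the function `proc_log_path` and its parameters are hypothetical and not taken from this diff; only the `monarch_log_<pid>_<rank>.stdout` pattern comes from the examples above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: build a per-proc log path from the mesh owner's
/// process ID and the proc's rank.
fn proc_log_path(dir: &Path, owner_pid: u32, rank: usize, stream: &str) -> PathBuf {
    dir.join(format!("monarch_log_{owner_pid}_{rank}.{stream}"))
}

fn main() {
    // Placeholder directory standing in for /tmp/$USER.
    let dir = Path::new("/tmp/user");
    let parent = proc_log_path(dir, 3529266, 0, "stdout");
    let grandchild = proc_log_path(dir, 3530444, 0, "stdout");
    // Same rank, different owner PIDs: the two paths no longer collide.
    assert_ne!(parent, grandchild);
    println!("{}", parent.display());
    println!("{}", grandchild.display());
}
```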

Differential Revision: D80349615
Labels
CLA Signed, fb-exported