Skip to content

Conversation

jgmoore-or
Copy link
Contributor

No description provided.

Signed-off-by: Joseph Moore <joseph.moore@hpe.com>
@jgmoore-or jgmoore-or requested a review from a team as a code owner September 5, 2025 13:36
Signed-off-by: Joseph Moore <joseph.moore@hpe.com>
+ if (na_ucx_addr != NULL) {
+ NA_LOG_SUBSYS_WARNING(addr,
+ "An entry is already present for this address");
+ na_ucx_addr_release(na_ucx_addr);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Joe, can we just release the address without checking refcount? what if anther place is still taking refcount on the address?

Copy link
Contributor Author

@jgmoore-or jgmoore-or Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was put in to ensure we free the resource. It was needed to work around an limitation with UCX. The UCX layer is supposed to give us a disconnect event on the the server-end for client-side ep closes, but these events don't occur on rare occasions. The clients are able to reuse addresses (ipaddr:port) for new clients, and in practice this often occurs. In the case where the disconnect for the prior usage of this address wasn't sent, we execute this line of code to enable the accept of the new connection request. The internal data structures in the Mercury UCX layer are not able to support multiple address structures sharing the same addr_key (sockaddr). Hence we have to release. Should always be the case in these instances that refcount is 1, so release would be called from decrement function if we used this call instead. I will look into your concern further.

Copy link
Contributor Author

@jgmoore-or jgmoore-or Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR should have been marked as draft, since it was created with initial purpose of producing RPMs for testing Jerome's changes (at LRZ). I will create an updated commit of this pr, with a separate patch file that adds Jerome's code changes as an incremental patch. This will make it clear that the "already present" handling was added to the DAOS-modified version of Mercury 2.4 in a separate, previously-landed fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that, please try to keep patches separate and organized by ticket numbers so that we can reference the initial issue/ticket that they were fixing otherwise we have no way of finding it.

@jgmoore-or jgmoore-or marked this pull request as draft September 16, 2025 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants