Skip to content

GRPC Client - TraceId mixup #2316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ananthvanchip opened this issue May 14, 2025 · 8 comments
Open

GRPC Client - TraceId mixup #2316

ananthvanchip opened this issue May 14, 2025 · 8 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@ananthvanchip
Copy link

Describe the bug

Corruption in the traceId that is being sent in the headers while making grpc calls. The issue is with the grpc-client-instrumentation.

Environment

  • OS: Linux
  • Go Version: 1.17+

To Reproduce

  1. Create a dummy grpc endpoint to invoke.
  2. Use a single connection and send multiple requests on the connection via different go-routines.
  3. Alternatively, create multiple connections and send multiple requests on those connections via different go-routines.
  4. You will see that the traceId will sometimes get mismatched (associated with the wrong go-routine.)

Expected behavior

TraceId should not be corrupted/exchanged b/w different requests.

Additional context

There are two issues in the way grpc-client is being impemented.

Issue 1

bpf_probe_read(&nextid, sizeof(nextid), (void *)(httpclient_ptr + (httpclient_nextid_pos)));

The nextId is used as the bridge b/w the go-routine of the request and the go-routine of the loopyWriter. The nextId is a merely an integer and it increases sequentially for every request. (Its odd numbers starting from 1). However, when multiple http2client are created (to different endpoints) they will all generate nextId from 1. So in a concurrent scenario, we will have different clients(or connections) which can have the same nextId and the association could get jumbled when the requests are being made. This will cause the wrong traceId to be sent to the wrong endpoint randomly and will create confusions.

Issue 2
We are adding a uprobe at the entry of
func (t *http2Client) NewStream(ctx context.Context, callHdr *CallHdr) (*ClientStream, error)

In that uprobe we are reading the nextId. The nextId is a global variable within the http2Client object. When two go-routines using the same connection/client make a request, they will create a new stream. For both of them, in the uprobe we will assign the same nextId. However, internally they will be assigned different nextId as the nextId assignment within the function is covered by a mutex lock.
https://github.com/grpc/grpc-go/blob/b89909b7bd0d9bd333aab291e90fec1fa8d45ce9/internal/transport/http2_client.go#L832

This again will cause different traceIds to be sent to different requests which will lead to eventual corruption.

Solution.
To solve both the issues, we can stay away from using nextId as the bridge b/w the requesting go-routine and the go-routine within loopyWriter. Instead, we can add a uprobe on
https://github.com/grpc/grpc-go/blob/09166b665e8b6442bc70e88cdef580da9e7b7b06/internal/transport/controlbuf.go#L352

The second parameter is a headerFrame which is available in the loopyWriter.handleHeaders method. The pointer to headerFrame can be the bridge. This will help us get rid of the entire nextId/streamId and the issues arising from that.

I am not a go/ebpf expert. Let me know what you think.. I can provide a patch on this if necessary.

@ananthvanchip ananthvanchip added the bug Something isn't working label May 14, 2025
@tfpereira
Copy link

If I understand correctly, the symptom of this is having multiple separate requests end up with the same trace ID?

If that is the case, I'm seeing the same behaviour on my tests with auto instrumentation

@MrAlias
Copy link
Contributor

MrAlias commented May 14, 2025

If I understand correctly, the symptom of this is having multiple separate requests end up with the same trace ID?

If that is the case, I'm seeing the same behaviour on my tests with auto instrumentation

@ananthvanchip can you verify this is the correct?

@MrAlias
Copy link
Contributor

MrAlias commented May 14, 2025

cc @RonFed @grcevski

@RonFed
Copy link
Contributor

RonFed commented May 14, 2025

Thanks for opening this issue.
I agree with the problems you described, and we definitely have a lot of places to improve the gRPC instrumentation.
Regarding the suggestions you provided:
The executeAndPut function looks to be called in multiple flows in the code. Ideally we should uprobe functions that are not triggered frequently to minimize the performance impact. I'm not sure how often this function is called in the lifetime of a stream.
I think it's a good idea to use the headerFrame pointer to correlate between the goroutines.
The second argument to the function you linked is an interface - I assume the headerFrame is one of its implementations - but I'm not sure if it's the only one.

The nextId can be read by adding a return uprobe to the NewStream function - I think this will solve the second issue you mentioned.
Another idea might be to use the loopywriter pointer itself together with the stream id - it may provide a unique key.

@grcevski
Copy link
Contributor

Yeah it's definitely a problem and also same issue exists in how ebpf-autoinstrumentation does it, although it's slightly different implementation. I always thought streamid is unique.

@MrAlias MrAlias added this to the Subsequent v0.22.0 milestone May 14, 2025
@ananthvanchip
Copy link
Author

If I understand correctly, the symptom of this is having multiple separate requests end up with the same trace ID?
If that is the case, I'm seeing the same behaviour on my tests with auto instrumentation

@ananthvanchip can you verify this is the correct?

yes.. this is correct.

Thanks for opening this issue. I agree with the problems you described, and we definitely have a lot of places to improve the gRPC instrumentation. Regarding the suggestions you provided: The executeAndPut function looks to be called in multiple flows in the code. Ideally we should uprobe functions that are not triggered frequently to minimize the performance impact. I'm not sure how often this function is called in the lifetime of a stream. I think it's a good idea to use the headerFrame pointer to correlate between the goroutines. The second argument to the function you linked is an interface - I assume the headerFrame is one of its implementations - but I'm not sure if it's the only one.

The nextId can be read by adding a return uprobe to the NewStream function - I think this will solve the second issue you mentioned. Another idea might be to use the loopywriter pointer itself together with the stream id - it may provide a unique key.

I can send you a PR on the code changes. While the executeAndPut is indeed called multiple times, it is not called from the same goroutine. So it can be skipped quite easily.

Retrieving the nextId in the return might not be ideal as we cannot guarantee that the loopyWriter will be called always after the return of the call as its in another go-routine.

@grcevski
Copy link
Contributor

I can send you a PR on the code changes. While the executeAndPut is indeed called multiple times, it is not called from the same goroutine. So it can be skipped quite easily.

That would be great! Thank you!

@damemi
Copy link
Contributor

damemi commented Jun 10, 2025

@ananthvanchip hi, just checking if you were still working on this or had any questions? Thanks!

@MrAlias MrAlias modified the milestones: v0.23.0, Subsequent v0.23.0 Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants