Recently, we've seen a lot of similar issues related to how we're handling read/write behavior:
elastic/beats#26031
elastic/beats#42933
#160
#125
I suspect many of these issues are related and stem from how we read from and write to the netlink socket. In particular:
- Writes should be non-blocking. If a write returns EAGAIN or a similar error, the underlying logic can handle that naturally in a read/write loop.
- We need to handle ENOBUFS correctly. Instead of hard-returning an error, we need to recover cleanly: ENOBUFS means the netlink buffer has filled up, so we need to resend our request.
- Instead of a sleep/read loop when we're reading from the socket, we need to poll and wait for new data. This may also cut down on ENOBUFS issues.
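To make the three points above concrete, here is a rough sketch in Go of the loop I have in mind. It is not the current library code. `transport`, `classify`, `roundTrip`, and `simulate` are hypothetical names made up for illustration, and the socket is replaced by plain function values so the control flow can be exercised without a real netlink socket. In a real implementation, `wait` would presumably wrap poll(2) or similar on the socket's file descriptor.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// action is what the loop should do after a send or receive attempt.
type action int

const (
	actionDone   action = iota // the call succeeded
	actionWait                 // socket not ready: wait for readiness, then retry
	actionResend               // kernel buffer overflowed: send the request again
	actionFail                 // anything else: give up and return the error
)

// classify maps an error from a non-blocking socket call to an action.
// (On Linux, EWOULDBLOCK has the same value as EAGAIN.)
func classify(err error) action {
	switch {
	case err == nil:
		return actionDone
	case errors.Is(err, syscall.EAGAIN):
		return actionWait
	case errors.Is(err, syscall.ENOBUFS):
		return actionResend
	default:
		return actionFail
	}
}

// transport is a hypothetical stand-in for a non-blocking netlink socket.
type transport struct {
	send func() error           // non-blocking write of the request
	recv func() ([]byte, error) // non-blocking read of one reply
	wait func() error           // blocks until the socket is ready, replacing sleep
}

// roundTrip sends a request and reads one reply. It waits on EAGAIN and
// resends on ENOBUFS, up to maxResends times.
func roundTrip(t transport, maxResends int) ([]byte, error) {
	resends := 0
	pendingSend := true
	for {
		var data []byte
		var err error
		if pendingSend {
			err = t.send()
		} else {
			data, err = t.recv()
		}
		switch classify(err) {
		case actionDone:
			if !pendingSend {
				return data, nil
			}
			pendingSend = false // request is out, move on to reading
		case actionWait:
			if werr := t.wait(); werr != nil {
				return nil, werr
			}
		case actionResend:
			if resends >= maxResends {
				return nil, fmt.Errorf("giving up after %d resends: %w", resends, err)
			}
			resends++
			pendingSend = true
		default:
			return nil, err
		}
	}
}

// simulate drives roundTrip against a scripted fake socket: the first send
// hits EAGAIN, and the first reply is lost to ENOBUFS, forcing one resend.
func simulate() (string, int, error) {
	sendErrs := []error{syscall.EAGAIN, nil, nil}
	recvErrs := []error{syscall.EAGAIN, syscall.ENOBUFS, nil}
	sends := 0
	t := transport{
		send: func() error {
			err := sendErrs[0]
			sendErrs = sendErrs[1:]
			if err == nil {
				sends++
			}
			return err
		},
		recv: func() ([]byte, error) {
			err := recvErrs[0]
			recvErrs = recvErrs[1:]
			if err != nil {
				return nil, err
			}
			return []byte("reply"), nil
		},
		wait: func() error { return nil },
	}
	data, err := roundTrip(t, 3)
	return string(data), sends, err
}

func main() {
	reply, sends, err := simulate()
	fmt.Println(reply, sends, err)
}
```

The cap on resends is my own assumption, added so that a persistently full buffer cannot spin forever. A real implementation would probably also need to discard any partial reply it had already read before resending.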