-
Notifications
You must be signed in to change notification settings - Fork 75
Use non-blocking send on pid unset, retry on close #176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use non-blocking send on pid unset, retry on close #176
Conversation
|
Our assumption is that send blocks because netlink is congested, so it stops reading from the send buffer until we drain the receive buffer, first I wanted to confirm this: The path is: See comments inline int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
long *timeo, struct sock *ssk)
{
struct netlink_sock *nlk;
nlk = nlk_sk(sk);
/* haesbaert: if netlink is having a hard time */
if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
DECLARE_WAITQUEUE(wait, current);
/* haesbaert: if there is no timeout (case for nonblocking) return EAGAIN */
if (!*timeo) {
if (!ssk || netlink_is_kernel(ssk))
netlink_overrun(sk);
sock_put(sk);
kfree_skb(skb);
return -EAGAIN;
}
/* haesbaert: otherwise block for *timeo */
__set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&nlk->wait, &wait);
/* haesbaert: interlocked in queue, recover from the sleep race */
if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
test_bit(NETLINK_S_CONGESTED, &nlk->state)) &&
!sock_flag(sk, SOCK_DEAD))
*timeo = schedule_timeout(*timeo); /* haesbaert: actual sleep is here */
__set_current_state(TASK_RUNNING);
remove_wait_queue(&nlk->wait, &wait);
sock_put(sk);
if (signal_pending(current)) {
kfree_skb(skb);
return sock_intr_errno(*timeo);
}
return 1;
}
netlink_skb_set_owner_r(skb, sk);
return 0;
}So on our case the socket is blocking, and default timeout is We can verify this by doing a produces: So the sender blocks when netlink is congested, so if we drain the receive buffer it should recover. |
haesbaert
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.
The loop is missing some error checking and it's unbound, it's also requesting for ACKs when it shouldn't.
I took a stab, it compiles but I didn't test it, I don't see a season to keep trying to send multiple times, try once, if it got EAGAIN, drain, try again, done. I wrote this, feel free to use it or not:
sendit := func(msg syscall.NetlinkMessage) error {
again:
_, err := c.Netlink.SendNoWait(msg)
if errors.Is(err, syscall.EINTR) {
goto again
}
return err
}
// If our request to unset the PID would block, then try to
// drain events from the netlink socket, resend, try again.
// In netlink, EAGAIN on the send side means the kernel
// stopped reading the send buffer since the receive buffer is
// full.
err := sendit(msg)
if !errors.Is(err, syscall.EAGAIN) {
return
}
// Drain until drainMax or we get EAGAIN
for drainMax := 10000; drainMax > 0; drainMax-- {
_, err = c.Netlink.Receive(true, noParse)
if err == nil || errors.Is(err, syscall.EINTR) || errors.Is(err, syscall.ENOBUFS) {
continue
} else if errors.Is(err, syscall.EAGAIN) {
break
}
return
}
_ = sendit(msg)
haesbaert
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a return for other errors of send, otherwise this looks fine.
|
LGTM |
audit.go
Outdated
| // if we get an interrupt, retry the send | ||
| if err == nil { | ||
| return nil | ||
| } else if errors.Is(err, syscall.EINTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(*NetlinkClient).SendNoWait is a thin wrapper around syscall.SendTo. No functions in syscall ever wrap errors and given go1compat, this will not change (go1compat says that syscall is not covered, but this relates to changes forced by changes in the underlying OS and any change to types returned by syscalls would neither be likely from the Go team, nor quietly accepted by external reviewer).
tl;dr; No errors.Is is required here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, even if syscall.* isn't going to be wrapping anything, I feel like it's safer to assume that c.Netlink.* might wrap something in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it shouldn't, but it feels a little safer to be defensive. Also the linter complains if you don't use errors.Is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preferred approach would be to ensure that the behaviour is covered by tests, and then leave it without the extra weight. This would find regressions and ensure that bloat is not added to SendNoWait. I think the linter is wrong here. However, I'm not an owner, so it doesn't really matter.
💚 Build Succeeded
History
|
Tinkering with an alternative to elastic/beats#42933, by trying to drain the socket on close if we get an error while trying to unset the pid.
This is meant as an alternative fix for elastic/beats#26031, where we do a non-blocking write and try to drain the socket instead of starting a second goroutine.
In draft, as I'm still testing and figuring out how netlink handles errors.