Use non-blocking send on pid unset, retry on close #176

fearful-symmetry · 2025-03-10T20:48:23Z

Tinkering with an alternative to elastic/beats#42933, by trying to drain the socket on close if we get an error while trying to unset the pid.

This is meant as an alternative fix for elastic/beats#26031, where we do a non-blocking write and try to drain the socket instead of starting a second goroutine.

In draft, as I'm still testing and figuring out how netlink handles errors.

haesbaert · 2025-03-11T10:30:53Z

Our assumption is that send blocks because netlink is congested, so it stops reading from the send buffer until we drain the receive buffer, first I wanted to confirm this:

The path is:
userland does sendmsg(2):
in kernel:
netlink_sendmsg -> netlink_unicast -> netlink_attachskb
https://elixir.bootlin.com/linux/v4.18.20/source/net/netlink/af_netlink.c#L1217-L1255

See comments inline

int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
		      long *timeo, struct sock *ssk)
{
	struct netlink_sock *nlk;

	nlk = nlk_sk(sk);

        /* haesbaert: if netlink is having a hard time */
	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
	     test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
		DECLARE_WAITQUEUE(wait, current);
                /* haesbaert: if there is no timeout (case for nonblocking) return EAGAIN */
		if (!*timeo) {
			if (!ssk || netlink_is_kernel(ssk))
				netlink_overrun(sk);
			sock_put(sk);
			kfree_skb(skb);
			return -EAGAIN;
		}

                /* haesbaert: otherwise block for *timeo */
		__set_current_state(TASK_INTERRUPTIBLE);
		add_wait_queue(&nlk->wait, &wait);

                /* haesbaert: interlocked in queue, recover from the sleep race */
		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
		     test_bit(NETLINK_S_CONGESTED, &nlk->state)) &&
		    !sock_flag(sk, SOCK_DEAD))
			*timeo = schedule_timeout(*timeo); /* haesbaert: actual sleep is here */

		__set_current_state(TASK_RUNNING);
		remove_wait_queue(&nlk->wait, &wait);
		sock_put(sk);

		if (signal_pending(current)) {
			kfree_skb(skb);
			return sock_intr_errno(*timeo);
		}
		return 1;
	}
	netlink_skb_set_owner_r(skb, sk);
	return 0;
}

So on our case the socket is blocking, and default timeout is MAX_SCHEDULE_TIMEOUT:
https://elixir.bootlin.com/linux/v4.18.20/source/net/core/sock.c#L2807
which is infinity.

We can verify this by doing a getopt SO_SNDTIME@SOL_SOCKET, if timeout is zero, it's actually infinity (sockgetopt in the kernel translates this for us):

	fd = socket(PF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_AUDIT);
	if (fd == -1)
		err(1, "socket");
	optlen = sizeof(tv);
	bzero(&tv, sizeof(tv));
	if (getsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, &optlen) == -1)
		err(1, "getsockopt");
	printf("get %ld:%ld (optlen=%d/%zd)\n",
	    tv.tv_sec, tv.tv_usec, optlen, sizeof(tv));

produces: get 0:0 (optlen=16/16)

So the sender blocks when netlink is congested, so if we drain the receive buffer it should recover.

haesbaert

See comments.

The loop is missing some error checking and it's unbound, it's also requesting for ACKs when it shouldn't.

I took a stab, it compiles but I didn't test it, I don't see a season to keep trying to send multiple times, try once, if it got EAGAIN, drain, try again, done. I wrote this, feel free to use it or not:

	sendit := func(msg syscall.NetlinkMessage) error {
	again:
		_, err := c.Netlink.SendNoWait(msg)
		if errors.Is(err, syscall.EINTR) {
			goto again
		}
		return err
	}

	// If our request to unset the PID would block, then try to
	// drain events from the netlink socket, resend, try again.
	// In netlink, EAGAIN on the send side means the kernel
	// stopped reading the send buffer since the receive buffer is
	// full.
	err := sendit(msg)
	if !errors.Is(err, syscall.EAGAIN) {
		return
	}
	// Drain until drainMax or we get EAGAIN
	for drainMax := 10000; drainMax > 0; drainMax-- {
		_, err = c.Netlink.Receive(true, noParse)
		if err == nil || errors.Is(err, syscall.EINTR) || errors.Is(err, syscall.ENOBUFS) {
			continue
		} else if errors.Is(err, syscall.EAGAIN) {
			break
		}

		return
	}
	_ = sendit(msg)

audit.go

haesbaert

Missing a return for other errors of send, otherwise this looks fine.

haesbaert · 2025-03-12T16:47:13Z

LGTM

audit_test.go

efd6 · 2025-03-13T21:07:55Z

audit.go

+		// if we get an interrupt, retry the send
+		if err == nil {
+			return nil
+		} else if errors.Is(err, syscall.EINTR) {


(*NetlinkClient).SendNoWait is a thin wrapper around syscall.SendTo. No functions in syscall ever wrap errors and given go1compat, this will not change (go1compat says that syscall is not covered, but this relates to changes forced by changes in the underlying OS and any change to types returned by syscalls would neither be likely from the Go team, nor quietly accepted by external reviewer).

tl;dr; No errors.Is is required here.

so, even if syscall.* isn't going to be wrapping anything, I feel like it's safer to assume that c.Netlink.* might wrap something in the future.

It shouldn't.

I agree, it shouldn't, but it feels a little safer to be defensive. Also the linter complains if you don't use errors.Is.

My preferred approach would be to ensure that the behaviour is covered by tests, and then leave it without the extra weight. This would find regressions and ensure that bloat is not added to SendNoWait. I think the linter is wrong here. However, I'm not an owner, so it doesn't really matter.

audit.go

elasticmachine · 2025-03-14T20:34:05Z

💚 Build Succeeded

Buildkite Build
Commit: a15b923

History

💚 Build #103 succeeded bdca1ac
💚 Build #102 succeeded 377e832
💚 Build #101 succeeded 2d13888
💚 Build #100 succeeded c5852ce
💚 Build #98 succeeded 9ba5dbb
💚 Build #96 succeeded d84de7c

cc @fearful-symmetry

fearful-symmetry added 3 commits March 10, 2025 20:18

tinker with async close

0d2e031

docs

9bcd269

docs..

2b7c7b4

fearful-symmetry self-assigned this Mar 10, 2025

fearful-symmetry added 3 commits March 10, 2025 20:55

add badfd condition

1bf8771

lint

cd4b38d

format

f2c42d4

haesbaert requested changes Mar 11, 2025

View reviewed changes

audit.go Outdated Show resolved Hide resolved

audit.go Outdated Show resolved Hide resolved

audit.go Show resolved Hide resolved

audit.go Outdated Show resolved Hide resolved

audit.go Outdated Show resolved Hide resolved

rewrite

8e5d074

fearful-symmetry requested a review from haesbaert March 11, 2025 17:41

fearful-symmetry added 3 commits March 11, 2025 17:44

error checking

185bc5e

fmt

e1259eb

rewrite

7867c2f

haesbaert reviewed Mar 12, 2025

View reviewed changes

audit.go Show resolved Hide resolved

haesbaert requested changes Mar 12, 2025

View reviewed changes

add else block

fdc60e4

haesbaert approved these changes Mar 12, 2025

View reviewed changes

fearful-symmetry added 3 commits March 12, 2025 21:17

add tests, dumb linter

d84de7c

linter...

afe7f92

LINTER

9ba5dbb

fearful-symmetry marked this pull request as ready for review March 12, 2025 21:28

fearful-symmetry requested review from efd6 and haesbaert March 12, 2025 21:37

efd6 reviewed Mar 13, 2025

View reviewed changes

fearful-symmetry added 4 commits March 14, 2025 17:29

refactor to use switch

c5852ce

still tinkering...

2d13888

tinkering...

377e832

improve docs

bdca1ac

refactor, again

a15b923

haesbaert approved these changes Mar 17, 2025

View reviewed changes

andrewkroh added the Team:Security-Linux Platform Linux Platform Team in Security Solution label Mar 17, 2025

fearful-symmetry merged commit 1916e11 into elastic:main Mar 18, 2025
3 checks passed

fearful-symmetry mentioned this pull request Mar 18, 2025

Update go-libaudit and remove close/draining code elastic/beats#43352

Merged

4 tasks

mergify bot mentioned this pull request Oct 15, 2025

[8.19](backport #43352) Update go-libaudit and remove close/draining code elastic/beats#47128

Merged

4 tasks

Use non-blocking send on pid unset, retry on close #176

Use non-blocking send on pid unset, retry on close #176

Uh oh!

Conversation

fearful-symmetry commented Mar 10, 2025

Uh oh!

haesbaert commented Mar 11, 2025

Uh oh!

haesbaert left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

haesbaert left a comment

Choose a reason for hiding this comment

Uh oh!

haesbaert commented Mar 12, 2025

Uh oh!

Uh oh!

efd6 Mar 13, 2025

Choose a reason for hiding this comment

Uh oh!

fearful-symmetry Mar 14, 2025

Choose a reason for hiding this comment

Uh oh!

efd6 Mar 16, 2025

Choose a reason for hiding this comment

Uh oh!

fearful-symmetry Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

efd6 Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

elasticmachine commented Mar 14, 2025

💚 Build Succeeded

History

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants