VIP 17: Enable Unix domain sockets for listen and backend addresses

Use cases:

Eliminate the overhead of TCP/loopback for connections with peers that are colocated on a host with Varnish.
Restrict who can send requests to Varnish by setting permissions on the UDS path of the listen address.
- (But see the discussion below about getting this right portably.)
Make it possible for a backend peer to require restricted credentials for the Varnish process by setting permissions on the UDS path on which it listens.
Obtain peer credentials from the UDS, such as uid and gid, in order to:
- Make information about the peer available in VCL and the log.
- Extend ACLs to make it possible to place further restrictions on peers connecting to the listen address.

I would like to make this contribution for the September 2017 release. With the VIP I'd like to clarify:

Are there any changes planned for VTCP and VSA in the September release that would make adding UDS to those interfaces less trivial than it is now?
Every platform has a way to get peer credentials from a UDS, but there's no standard and it's highly platform-dependent. So how do we want to handle that?
Additions/changes to VCL and other changes in naming, such as the -a option and backend definitions.
If someone knows a reason why we shouldn't do this at all, this is the place to say so.

An obvious application is the use of SSL offloaders connecting to the listen address, and SSL "onloaders" as backends. UDS would eliminate the TCP overhead, and the ability to restrict the credentials of peers mitigates the risks of man-in-the-middle. Both haproxy and nginx/ProxyPass, among others, support UDS addresses in "both directions", so they are candidates for this purpose. A notable exception is hitch, which currently only supports TCP connections. I would be happy to help the hitch project support UDS (shouldn't be hard at all).

Address notation

I suggest that we require a prefix such as unix: to identify UDS addresses (nginx uses unix:, haproxy uses unix@):

varnishd -a unix:/path/to/uds
backend uds { .host = "unix:/path/to/uds"; }

That makes the interpretation unambiguous. We could simply interpret paths as UDS addresses when they appear in those places, but then we would need logic like: if the argument cannot be resolved as a host or parsed as an IP address, then assume it's a path for UDS, but if the path does not exist or cannot be accessed, then fail. So better to just make it unambiguous.

Parsing UDS addresses would be an extension of VSS_Resolver.
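
As a minimal sketch of the prefix check, assuming the unix: notation suggested above: the helper name uds_path and the requirement that the path be absolute are my assumptions for illustration, and this is not existing VSS code.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper, not part of Varnish: return the path portion of
 * a "unix:"-prefixed address, or NULL if the string is not a UDS
 * address or the path would not fit in sockaddr_un.sun_path.
 */
static const char *
uds_path(const char *addr)
{
    static const char prefix[] = "unix:";
    const size_t plen = sizeof prefix - 1;
    const char *path;

    if (addr == NULL || strncmp(addr, prefix, plen) != 0)
        return (NULL);
    path = addr + plen;
    /* Assumption: only absolute paths are accepted */
    if (path[0] != '/')
        return (NULL);
    /* Leave room for the terminating NUL in sun_path */
    if (strlen(path) >= sizeof(((struct sockaddr_un *)0)->sun_path))
        return (NULL);
    return (path);
}
```

Anything that does not carry the prefix returns NULL, so the caller can fall through to the existing host/IP resolution unchanged.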

The name .host in a backend definition becomes a bit peculiar if its value can also be a UDS (we will see a number of examples like this). We could:

stay with the name .host, and document the fact that it might not identify a host in some cases
replace .host with a name like .peer, sacrificing backward compatibility
introduce .peer, retain .host as a deprecated alias, and remove .host in a future release

I suggest the last option, comments welcome.

.port in a backend definition is already optional, and is unnecessary for a UDS. Should it be an error to specify a port when a UDS is specified, or should it be ignored? Comments welcome.

Access permissions on the listen address

For the -a address, I suggest an optional means of specifying who can access the UDS:

varnishd -a unix:/path/to/uds:uid=foo,gid=bar

There's an issue here in that the separator (: in the example) could not appear in any UDS path. We might just have to forbid a certain character in UDS paths. Fortunately we don't have such a problem with backend addresses (which are generated by another server, so we have less freedom to impose restrictions on the path names).
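
To make the separator problem concrete, here is a rough sketch of parsing the suggested -a notation, assuming that : is simply forbidden in the path. The struct, the function names and the buffer sizes are all hypothetical, and only the uid= and gid= keys from the example are recognized.

```c
#include <string.h>

#define UDS_PATH_MAX 108    /* assumption: typical sizeof sun_path on Linux */
#define UDS_ID_MAX   64     /* assumption: arbitrary limit for this sketch */

struct uds_listen_spec {
    char path[UDS_PATH_MAX];
    char uid[UDS_ID_MAX];   /* empty string if not given */
    char gid[UDS_ID_MAX];   /* empty string if not given */
};

/* Copy value into dst (capacity UDS_ID_MAX); fail if empty or too long */
static int
copy_id(char *dst, const char *value)
{
    size_t l = strlen(value);

    if (l == 0 || l >= UDS_ID_MAX)
        return (-1);
    memcpy(dst, value, l + 1);
    return (0);
}

/* Parse "unix:<path>[:uid=<id>][,gid=<id>]"; return 0 on success, -1 on error */
static int
parse_uds_listen(const char *arg, struct uds_listen_spec *spec)
{
    char opts[2 * UDS_ID_MAX + 16];
    char *p, *next;
    const char *rest, *sep;
    size_t plen;

    memset(spec, 0, sizeof *spec);
    if (strncmp(arg, "unix:", 5) != 0)
        return (-1);
    rest = arg + 5;
    sep = strchr(rest, ':');    /* the first ':' ends the path */
    plen = (sep == NULL) ? strlen(rest) : (size_t)(sep - rest);
    if (plen == 0 || plen >= sizeof spec->path)
        return (-1);
    memcpy(spec->path, rest, plen); /* spec was zeroed: NUL-terminated */
    if (sep == NULL)
        return (0);
    if (strlen(sep + 1) >= sizeof opts)
        return (-1);
    strcpy(opts, sep + 1);
    for (p = opts; p != NULL; p = next) {
        next = strchr(p, ',');
        if (next != NULL)
            *next++ = '\0';
        if (strncmp(p, "uid=", 4) == 0) {
            if (copy_id(spec->uid, p + 4) != 0)
                return (-1);
        } else if (strncmp(p, "gid=", 4) == 0) {
            if (copy_id(spec->gid, p + 4) != 0)
                return (-1);
        } else
            return (-1);
    }
    return (0);
}
```

The uid and gid are kept as strings here, since they may be names or numbers and would be resolved later.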

uid and gid can be specified as numeric or with names. Either, both or none of uid and gid would be permitted. Enforcing access permissions would be tricky to get right portably and reliably (and might just not work). From what I surmise at the moment (and I might be quite wrong):

Ownership would have to be set on the directory containing the UDS -- /path/to/ in the example.
- BSD-derived systems do not restrict connects to the UDS itself due to its permissions (or so I've read). But you can make a UDS inaccessible to a process that can't read its directory.
Then chmod the directory to 0700 or 0770, depending on whether access is set for user and/or group.
- This should be done before bind, creating the directory if necessary.
On Linux, peers connecting to the UDS must have read/write permission, so we would also set uid/gid ownership on the UDS and set permissions to 0600 or 0660, as the case may be. Might as well do that on every platform.
- Must be done after bind and before listen.
mgt_acceptor.c would do all of this. Typically the management process runs as root and is able to change permissions and ownership; if the management process owner can't do these things, then varnishd fails to start.

So the sequence for the management process would be (again, unless I'm getting this all wrong):

create the directory if necessary
if access restrictions were requested then set uid/gid and permissions on the directory accordingly
bind (note that VTCP_bind will have to unlink the path before bind for a UDS, if the path already exists)
set permissions on the UDS, at least read/write in all cases, and set ownership if requested

Then the socket can be handed off to the child process for listen.

If no access restrictions were requested, then don't manipulate ownership, let bind create the UDS, and set its permissions to 0666.

Comments and corrections on this section are very much welcome.

VSA and VTCP

Extending these interfaces, in their current form, to accommodate UDS is a piece of cake.

VSA can just as easily encapsulate sockaddr_un as it currently does for the ip4 and ip6 types.
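
To illustrate why this is a small change, here is a sketch of the kind of per-family dispatch involved. It is not the actual VSA code, and the function name sockaddr_size is made up for this example. Supporting UDS amounts to one more case next to the two IP families.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: size of the concrete sockaddr for the address
 * families we would support, or 0 for anything else.
 */
static socklen_t
sockaddr_size(const struct sockaddr *sa)
{
    switch (sa->sa_family) {
    case AF_INET:
        return (sizeof(struct sockaddr_in));
    case AF_INET6:
        return (sizeof(struct sockaddr_in6));
    case AF_UNIX:
        return (sizeof(struct sockaddr_un));    /* the new case */
    default:
        return (0);
    }
}
```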

For the most part, VTCP just works with sockets, so it doesn't matter whether they are TCP or UDS sockets. There would have to be some changes about naming (VTCP_name, _myname and _hisname), but I'd like to set that aside for a moment, and get to the subject of naming further down. Some other changes would involve:

Unlink the UDS path before bind in VTCP_bind
Some new kinds of errors may result from VTCP_connect, such as EPERM or ENOENT, but we may not have to change anything for that -- VTCP_connect currently just fails on error and lets the caller decide what to do with the errno.
We'll have to investigate which of the socket options are compatible with UDS. From a quick look I suspect that these are at least irrelevant to UDS and may be errors:
- httpready
- TCP_DEFER_ACCEPT
- TCP_FASTOPEN
- disabling Nagle (TCP_NODELAY)
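
One possible way to handle the TCP-only options, sketched under the assumption that we simply skip them for non-IP sockets: check the socket's address family first. The function name is hypothetical, and only TCP_NODELAY is shown.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: disable Nagle only on IP sockets.
 * Returns 1 if the option was set, 0 if it was skipped because the
 * socket is not an IP socket (e.g. a UDS), and -1 on error.
 */
static int
set_nodelay_if_tcp(int fd)
{
    struct sockaddr_storage ss;
    socklen_t len = sizeof ss;
    int on = 1;

    if (getsockname(fd, (struct sockaddr *)&ss, &len) != 0)
        return (-1);
    if (ss.ss_family != AF_INET && ss.ss_family != AF_INET6)
        return (0);     /* not TCP/IP: nothing to do */
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on) != 0)
        return (-1);
    return (1);
}
```

The alternative is to remember the address family when the socket is created, which would save the getsockname call.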

My main question about all this is: are there plans to significantly revise VSA and VTCP for the September release? Or can I expect that they will remain fairly easy to extend for UDS?

A minor issue is that the name VTCP (all of the VTCP_* functions, the source name vtcp.c, etc.) becomes a misnomer if it also covers UDS. We could just live with that. OTOH a single git commit could change it all at once, although we might have to bikeshed over a new name (VSOCK?).

Peer credentials

The good news is that all of the platforms listed as level A and B in "Picking Platforms" (the phk rant) have the means to obtain credentials of the peer on a connected UDS.

The bad news is that there's no standard, they're all different, and they encompass different information.

FreeBSD
- getpeereid returns the EUID and EGID. OpenBSD appears to have getpeereid as well.
- getsockopt(LOCAL_PEERCRED) returns credentials in the xucred struct defined in <sys/ucred.h>, which includes EUID and all of the groups to which the peer belongs.
Linux
- getsockopt(SO_PEERCRED) returns the ucred struct defined in <sys/socket.h> which includes pid, uid and gid. It's not clear to me from the manuals whether it's EUID/EGID or RUID/RGID. (Googled-up examples seem to assume EUID/EGID.)
- For getpeereid we'd have to link to libbsd.
Solaris
- Appears to have nothing like any of the other platforms, but it does have getpeerucred, which fills in a ucred_t defined in <ucred.h>. This is an opaque structure with a family of accessor functions ucred_get*, which tell you almost anything you can think of.
MacOS/Darwin
- Appears to be just like FreeBSD: getpeereid and getsockopt(LOCAL_PEERCRED)

All of these obtain the credentials that were true when the peer called connect or listen, and according to the docs they can't be faked (unless there's a kernel bug).

Most or all of these platforms have ways to receive peer credentials in ancillary messages, which may contain more information, but that may require that the peer co-operates, and we can't rely on that.

So it appears that the least common denominator is EUID and EGID (assuming that's what you get in Linux). I suggest that we just go with that, to be used as described below.

Because of all of the platform dependencies, there will have to be something like cred_compat.h full of #ifdefs, and probably some configure.ac logic to figure it all out. We'll also have to decide what to do when Varnish is built on a platform where we find none of the above.
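
A rough sketch of what such a compatibility layer might look like, using only the calls listed above. The function name peer_cred is hypothetical, and a real implementation would presumably key on configure checks rather than on compiler-predefined platform macros as done here. On Linux, struct ucred is only visible with _GNU_SOURCE defined before the first include.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for struct ucred on Linux; must precede includes */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#if defined(__sun)
#include <ucred.h>
#endif

/*
 * Hypothetical wrapper: fetch the peer's uid and gid for a connected
 * UDS. Returns 0 on success, -1 with errno set otherwise.
 */
static int
peer_cred(int fd, uid_t *uid, gid_t *gid)
{
#if defined(__linux__)
    struct ucred uc;
    socklen_t len = sizeof uc;

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &uc, &len) != 0)
        return (-1);
    *uid = uc.uid;
    *gid = uc.gid;
    return (0);
#elif defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__)
    return (getpeereid(fd, uid, gid));
#elif defined(__sun)
    ucred_t *uc = NULL;

    if (getpeerucred(fd, &uc) != 0)
        return (-1);
    *uid = ucred_geteuid(uc);
    *gid = ucred_getegid(uc);
    ucred_free(uc);
    return (0);
#else
    /* No known mechanism on this platform */
    (void)fd;
    (void)uid;
    (void)gid;
    errno = ENOSYS;
    return (-1);
#endif
}
```

The final branch is the "none of the above" case: the caller sees an error and has to decide what to report instead.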

Address naming

Getting back to VTCP_name, _hisname and _myname: these are currently hard-wired in their signatures for an address and a port, and they're spread out all over the place in Varnish.

IMO the least obtrusive way to adapt this for UDS would be to generate the UDS path in the address position, and generate a string "<uid>:<gid>" where the port is currently generated. Or we could bite the bullet by changing these three functions to something less hard-wired, then go find all of the places where they are called and figure out what to do. I suggest the less obtrusive option, at least in an initial implementation, although admittedly the more difficult option may be the right thing in the long run. Comments are welcome.

Assuming we go for "<uid>:<gid>" in the "port" position -- we could generate that string always using the numeric IDs. Or should we call getpwuid/getgrgid, and generate the names if we can get them? Comments welcome.

We'd have to decide what to do on a platform where we don't have a way (or haven't figured out how) to get the peer credentials. Generate ":" or "?:?"? Comments welcome again.
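
For illustration, here is a sketch of one of the options: always numeric IDs, with "?:?" when no credentials are available. The function name and the have_cred flag are hypothetical. This just shows how little code the less obtrusive approach needs; it is not a decision on the open questions.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical formatter for the "port" position of a UDS name:
 * "<uid>:<gid>" with numeric IDs, or "?:?" if credentials are unknown.
 */
static void
cred_port_string(char *buf, size_t len, int have_cred, uid_t uid, gid_t gid)
{
    if (have_cred)
        (void)snprintf(buf, len, "%lu:%lu",
            (unsigned long)uid, (unsigned long)gid);
    else
        (void)snprintf(buf, len, "?:?");
}
```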

VCL/VRT

Additions and changes to VCL and VRT involve:

VCL variables *.ip: client.ip, local.ip, server.ip, remote.ip and beresp.backend.ip
VCL data type IP
introducing VMOD std functions to return the uid and gid for the *.ip objects, as numbers or names
extending ACLs to specify UDSen and optionally peer credentials
VRT: types VCL_IP and struct vrt_backend, and the VRT functions related to VCL_IP and suckaddr

The *.ip variables essentially encapsulate suckaddrs, which we don't have to change. For the string conversion, if the suckaddr wraps a sockaddr_un, then return the UDS path.

Here again we have the problem that the names *.ip are inappropriate, since the value could be a UDS. Again I suggest the strategy of introducing a new name, in this case *.addr, and deprecating the old names, but leaving the old names around until a future release.

VCL_IP is just a suckaddr, so we don't have to change anything, but we have another inappropriate name for UDSen. The same goes for data type IP. Again I suggest the strategy of introducing new names, ADDR and VCL_ADDR (VCL_ADDR defined as exactly the same typedef as VCL_IP), and deprecating the old names.

I suggest adding functions like these to VMOD std, with the obvious implementations:

INT uid_number(ADDR addr, INT fallback)
STRING uid_name(ADDR addr, STRING fallback)
INT gid_number(ADDR addr, INT fallback)
STRING gid_name(ADDR addr, STRING fallback)

Of course these would always return the fallbacks for non-UDS addresses.
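
As a sketch of the "obvious implementation" for the name lookup, here is the core of something like uid_name in plain C. It takes a uid directly, because the ADDR type and the way credentials would be attached to it do not exist yet. The signature and buffer sizes are my assumptions.

```c
#include <pwd.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical core of std.uid_name(): copy the user name for uid into
 * buf and return buf, or return fallback if the uid cannot be resolved
 * or the name does not fit.
 */
static const char *
uid_name(uid_t uid, char *buf, size_t len, const char *fallback)
{
    struct passwd pw, *res = NULL;
    char scratch[4096];     /* assumption: large enough for most entries */

    if (getpwuid_r(uid, &pw, scratch, sizeof scratch, &res) != 0 ||
        res == NULL)
        return (fallback);
    if (strlen(pw.pw_name) >= len)
        return (fallback);
    strcpy(buf, pw.pw_name);
    return (buf);
}
```

The gid variant would be the same with getgrgid_r and struct group.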

ACLs can be extended to include paths for a UDS and restrictions on the uid/gid:

acl foo {
    "/path/to/uds";
    "/path/with/a/*/wildcard";
    "/path/with/a/uid/restriction",uid=4711;
    "/path/with/more/r?strictions",uid=foo,gid=bar;
}

So we can: name UDS paths in an ACL, allow filename globbing, include restrictions on the uid and gid, and allow both numbers and names for uid/gid.
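
A minimal sketch of how matching against such an ACL might work, assuming the entries have already been compiled into a table with names resolved to numeric IDs. The struct and function names are hypothetical. Globbing uses POSIX fnmatch with FNM_PATHNAME, so a wildcard does not match across a / (whether we want that is open to discussion).

```c
#include <fnmatch.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical compiled form of one UDS ACL entry */
struct uds_acl_entry {
    const char *pattern;    /* path, possibly with glob characters */
    int check_uid;          /* nonzero if uid= was given */
    uid_t uid;
    int check_gid;          /* nonzero if gid= was given */
    gid_t gid;
};

/* Return 1 if any entry matches the path and the peer credentials */
static int
uds_acl_match(const struct uds_acl_entry *acl, size_t n,
    const char *path, uid_t uid, gid_t gid)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (fnmatch(acl[i].pattern, path, FNM_PATHNAME) != 0)
            continue;
        if (acl[i].check_uid && acl[i].uid != uid)
            continue;
        if (acl[i].check_gid && acl[i].gid != gid)
            continue;
        return (1);
    }
    return (0);
}
```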

I'm not sure what to do about struct vrt_backend, which currently has fields for IPv4 and IPv6 addresses, both as strings and suckaddrs. I doubt that it makes sense just to add the same fields for UDS addresses, since the point is that a backend may have both kinds of IP addresses, but it won't also have a UDS address at the same time.

We might have to introduce something like this:

union addr {
    struct {
        char *ipv4_addr;
        char *ipv6_addr;
        struct suckaddr *ipv4_suckaddr;
        struct suckaddr *ipv6_suckaddr;
    } ip;
    struct {
        char *path;
        struct suckaddr *uds_suckaddr;
    } uds;
};

... and then use the union type for the "address" field of the backend definition -- it's either an IP address, which can be one or both of IPv4 and IPv6, or a UDS. Comments welcome.
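
One detail the union needs in practice is a discriminator, so that code reading the backend definition knows which member is valid. A sketch of what that could look like, with hypothetical names throughout (this is not a proposal for the actual layout of struct vrt_backend):

```c
#include <stddef.h>

struct suckaddr;    /* opaque in Varnish; only pointers are used here */

enum addr_kind { ADDR_IP, ADDR_UDS };

/* Hypothetical tagged wrapper around the union shown above */
struct backend_addr {
    enum addr_kind kind;    /* says which union member is valid */
    union {
        struct {
            const char *ipv4_addr;
            const char *ipv6_addr;
            struct suckaddr *ipv4_suckaddr;
            struct suckaddr *ipv6_suckaddr;
        } ip;
        struct {
            const char *path;
            struct suckaddr *uds_suckaddr;
        } uds;
    } u;
};

/* Return a printable form: the UDS path, or the first IP string present */
static const char *
backend_addr_display(const struct backend_addr *a)
{
    switch (a->kind) {
    case ADDR_UDS:
        return (a->u.uds.path);
    case ADDR_IP:
        return (a->u.ip.ipv4_addr != NULL ?
            a->u.ip.ipv4_addr : a->u.ip.ipv6_addr);
    }
    return (NULL);
}
```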

I think that the VRT functions that currently use VCL_IP and suckaddrs can be adapted either without changes or very straightforwardly, but again we'll want to introduce "addr" where "ip" currently appears in the names, and deprecate the old names:

VRT_acl_match: use the VCL_ADDR type in the signature
VRT_ipcmp: no change
VRT_IP_string: introduce char *VRT_ADDR_string(VRT_CTX, VCL_ADDR) with the same function, and deprecate the old one
