Using socketpair and passfd

I've always been aware of the passfd feature available on Linux, but I never had the chance to use it
for real in my projects. While testing a system design, I was able to experiment with it, and I found that
it opens up a lot of possibilities. Let's take a look.
Socket pair
There’s nothing too complicated about socketpair: it gives you two interconnected sockets with a single system call,
so that when you send bytes on one socket, you can read them on the other:
    #include <sys/socket.h>

    int socketpair(int domain, int type, int protocol, int sv[2]);
It’s more convenient than the pipe() or pipe2() system calls, which provide two unidirectional file descriptors, one for reading and another for writing, whereas each socket of a socketpair can be used in both directions.
Even on Windows, I’ve coded equivalents of socketpair in many projects because it’s so
convenient. Basically, you create a listening socket, call bind(), then listen(), followed by accept(), and at the same time
you connect to that same listening socket, which gives you two interconnected sockets
(if you don’t want to use threads, you can even set the sockets to non-blocking mode and run accept and connect in parallel; it works).
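As a rough illustration of that emulation, here is a minimal Python sketch that uses a temporary TCP listener on the loopback interface. The function name emulated_socketpair is hypothetical, and the sketch assumes that nobody else connects to the temporary port between listen() and accept(); a production version would want to verify the peer address.

```python
import socket

def emulated_socketpair():
    """Return two connected TCP sockets built through a temporary listener.

    Hypothetical helper for illustration; it does not check that the
    accepted connection really comes from our own client socket.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Non-blocking connect, so that no thread is needed:
            # the connection completes while we call accept() below.
            client.setblocking(False)
            try:
                client.connect(listener.getsockname())
            except (BlockingIOError, InterruptedError):
                pass  # connection in progress
            client.setblocking(True)
            server, _peer = listener.accept()
        except BaseException:
            client.close()
            raise
    finally:
        listener.close()  # the listener is only needed to build the pair
    return server, client
```

Both returned sockets are bidirectional, so either end can be handed to the code that would otherwise have received one end of a real socketpair.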
In practice, we often use this to establish a communication channel with a child process. So, for instance:
- we call socketpair(AF_UNIX, SOCK_STREAM, 0, socks), where socks is declared as int socks[2]. We will keep socks[0] for the parent process and use socks[1] for the child process;
- we fork(), and so all file descriptors are inherited by the child process. On the parent side, we can close socks[1] (close(socks[1])), as it is no longer needed, and on the child side, we can close socks[0];
- now both processes can communicate using their sockets; we can add these sockets to our polling loops to read messages coming from the other process.
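The steps above can be sketched in Python as follows. This is a minimal illustration, assuming a Unix system where os.fork() is available; the function name run_parent_child_demo and the message contents are made up for the example.

```python
import os
import socket

def run_parent_child_demo():
    """Fork a child and exchange one message each way over a socketpair."""
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    pid = os.fork()
    if pid == 0:
        # Child: keep only its own end of the pair.
        parent_sock.close()
        request = child_sock.recv(4096)
        child_sock.sendall(b"child got: " + request)
        child_sock.close()
        os._exit(0)  # leave immediately, without running the parent's code

    # Parent: close the child's end, it is no longer needed here.
    child_sock.close()
    parent_sock.sendall(b"stop now")
    reply = parent_sock.recv(4096)
    parent_sock.close()
    os.waitpid(pid, 0)  # reap the child
    return reply
```

In a real program the parent would register its end of the pair in its polling loop instead of blocking on recv().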
This is a fairly standard way to have a “main” process communicate with its workers. For example, we have a server listening on RDP that will launch a child process to handle an incoming connection. But we still want to maintain a connection between the main process and the “session process” to exchange information (ACLs, for example) or to pass commands ("stop now!").
Passfd
Passfd is not a system call; it is a technique for passing file descriptors through a Unix domain socket as ancillary data, that is, out-of-band with respect to the regular byte stream. Basically, you can send the usual bytes and also attach supplementary data to the message, including file descriptors.
The first time I saw it in action was in Wayland: the compositor and the Wayland clients communicate through a local AF_UNIX socket,
and at some point the question of the keyboard layout arises, so that the client knows which layout to use. To achieve that, the compositor
opens the XKB file and sends this file descriptor to the Wayland client via passfd, and the client can read all the details of
the keyboard layout from that file. There’s something a bit magical about the operation: you open a file, send the handle to the process on the
other end, and that process can use it directly (calling read() or write() on it, for example).
    struct msghdr {
        void         *msg_name;       /* Optional address */
        socklen_t     msg_namelen;    /* Size of address */
        struct iovec *msg_iov;        /* Scatter/gather array */
        size_t        msg_iovlen;     /* # elements in msg_iov */
        void         *msg_control;    /* Ancillary data, see below */
        size_t        msg_controllen; /* Ancillary data buffer len */
        int           msg_flags;      /* Flags on received message */
    };
When you want to send or receive data, you’ll use a struct msghdr with recvmsg() or sendmsg(). This msghdr
is particularly useful because, in addition to holding ancillary data (msg_control and msg_controllen), it also
allows you to send data split into chunks. So in theory, there’s no need for the application to “bother” with creating
a nice linear buffer to perform a large send() when sending packets; it can simply provide the list of packet chunks to be sent,
and the system will do its best to assemble the chunks and send the whole thing.
In certain situations, this can avoid unnecessary copies, since the kernel can pick from the list of chunks whatever fits in
the packets it can send immediately, and copy only the rest. Potentially, this allows us to skip part of the
buffer linearization step (this is what is done in libevent for instance).
For example, in RDP, we often build a payload incrementally by calling the internal protocol layers one after another. Once we have
the payload, we need to add a header, but since the header size depends on the size of the payload, we cannot pre-reserve space for it.
We could reserve space for the maximum header size, but doing so creates the need for a copy in the next protocol layer.
With msg_iov and msg_iovlen, we can create a list of chunks holding the payloads and the headers, and
the kernel will handle the transmission of all of them. But I digress.
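To make the scatter/gather idea concrete, here is a small Python sketch. Python exposes msg_iov as the list of buffers given to socket.sendmsg(). The function name send_framed and the 4-byte big-endian length header are assumptions made for the illustration; they are not the RDP framing.

```python
import socket
import struct

def send_framed(sock, payload_chunks):
    """Send a length header followed by the payload chunks in one sendmsg().

    The header is built last, once the payload size is known, and is simply
    placed first in the chunk list: no linear buffer is assembled in user space.
    """
    total = sum(len(chunk) for chunk in payload_chunks)
    header = struct.pack("!I", total)  # hypothetical header: 4-byte length
    # sendmsg() may send fewer bytes than requested on a busy stream socket;
    # a complete implementation would loop on the returned count.
    return sock.sendmsg([header, *payload_chunks])
```

A quick usage example on a socketpair: after `a, b = socket.socketpair()`, calling `send_framed(a, [b"abc", b"defg"])` lets the other end read the header and both chunks as one contiguous byte sequence with `b.recv(64)`.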
The file descriptors are passed as ancillary data, in a control message with cmsg_level set to SOL_SOCKET and cmsg_type set to SCM_RIGHTS; its payload is simply
an array of integers containing the file descriptor numbers.
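Here is a minimal Python sketch of such a control message, closely following the pattern shown in the Python socket documentation. The helper names send_fd and recv_fd are hypothetical, and the sketch assumes that a single descriptor is passed at a time.

```python
import array
import os
import socket

def send_fd(sock, fd, payload=b"\x01"):
    """Send one file descriptor as SCM_RIGHTS ancillary data, with a payload."""
    fds = array.array("i", [fd])
    return sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

def recv_fd(sock, bufsize=4096):
    """Receive a payload and the file descriptors attached to it."""
    fds = array.array("i")
    # Room for one descriptor in the ancillary buffer (assumption of the sketch).
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated integer at the end of the control data.
            fds.frombytes(cdata[:len(cdata) - (len(cdata) % fds.itemsize)])
    return data, list(fds)
```

The number received on the other side is generally not the same as the one that was sent: the kernel installs a new descriptor in the receiving process that refers to the same open file.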
Combining the two
Combining the two gives a pretty interesting setup: a parent process acts as an orchestrator and launches two child processes, maintaining a communication channel with each of them via a socketpair. It can then create another socketpair and pass one end to each child using passfd.

And so we end up with two child processes that securely establish a communication channel through the parent process. In this Python example, the communication protocol between the parent and the children is simple: if the first byte is 0, a string is being transmitted; if it is 1, an endpoint of the socketpair is being transmitted:
    import socket
    import os
    import sys
    import time
    import array

    def doProcess(no, s):
        print("{}: child running".format(no))
        while True:
            (data, ancdata, msg_flags, address) = s.recvmsg(4096, 4096)
            if data[0] == 0:
                print('{}: echo {}'.format(no, data[1:]))
            elif data[0] == 1:
                fds = array.array("i")
                for cmsg_level, cmsg_type, cmsg_data in ancdata:
                    if cmsg_level == socket.SOL_SOCKET and cmsg_type == socket.SCM_RIGHTS:
                        # Append data, ignoring any truncated integers at the end.
                        fds.frombytes(cmsg_data[:len(cmsg_data) - (len(cmsg_data) % fds.itemsize)])
                print("{}: got {} fds".format(no, len(fds)))
                sock = socket.fromfd(fds[0], socket.AF_UNIX, socket.SOCK_STREAM)
                if no == 1:
                    sock.send(b'coucou')
                    print("{}: coucou sent".format(no))
                elif no == 2:
                    print("{}: received={}".format(no, sock.recv(4096)))
                break
        sys.exit(0)

    if __name__ == '__main__':
        c1, c1sub = socket.socketpair(socket.AF_UNIX)
        c1sub.set_inheritable(True)
        if os.fork() == 0:
            doProcess(1, c1sub)

        c2, c2sub = socket.socketpair(socket.AF_UNIX)
        c2sub.set_inheritable(True)
        if os.fork() == 0:
            doProcess(2, c2sub)

        # Test that father/son communication works
        c1.send(b'\0hello 1')
        c2.send(b'\0hello 2')
        time.sleep(0.500)

        # then send an end of the socket pair to each child process
        s1, s2 = socket.socketpair(socket.AF_UNIX)
        socket.send_fds(c1, buffers=[b'\1tutu'], fds=[s1.fileno()])
        socket.send_fds(c2, buffers=[b'\1tutu'], fds=[s2.fileno()])
        time.sleep(10)
When run, we get this output, and you can see that the two child processes are communicating properly with each other:
    1: child running
    1: echo b'hello 1'
    2: child running
    2: echo b'hello 2'
    1: got 1 fds
    2: got 1 fds
    1: coucou sent
    2: received=b'coucou'
Conclusion
To conclude, socketpair and passfd open up tons of very interesting possibilities for system architecture.