[Xen-devel] RFC: XenSock brainstorming
Hi all,

a couple of months ago I started working on a new PV protocol for
virtualizing syscalls. I named it XenSock, as its main purpose is to
allow the implementation of the POSIX socket API in a domain other than
the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc.
to be implemented directly in Dom0. In a way this is conceptually
similar to virtio-9pfs, but for sockets rather than filesystem APIs.
See this diagram as reference:

https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing

The frontends and backends could live either in userspace or kernel
space, with different trade-offs. My current prototype is based on
Linux kernel drivers, but it would be nice to have userspace drivers
too. Discussing where the drivers could be implemented is beyond the
scope of this email.


# Goals

The goal of the protocol is to provide networking capabilities to any
guest, with the following added benefits:

* guest networking should work out of the box with VPNs, wireless
  networks and any other complex network configurations in Dom0

* guest services should listen on ports bound directly to Dom0 IP
  addresses, fitting naturally in a Docker based workflow, where guests
  are Docker containers

* Dom0 should have full visibility into the guest's behavior and should
  be able to perform inexpensive filtering and manipulation of guest
  calls

* XenSock should provide excellent performance: unoptimized early code
  reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with
  3 streams


# Status

I would like to get feedback on the high level architecture, the data
path and the ring formats. Beware that protocol and drivers are in
their very early days: I don't have all the information to write a
design document yet, and the ABI is neither complete nor stable.

The code is not ready for xen-devel yet, but I would be happy to push a
git branch if somebody is interested in contributing to the project.


# Design and limitations

The frontend connects to the backend following the traditional
xenstore-based exchange of information. Frontend and backend set up an
event channel and a shared ring. The ring is used by the frontend to
forward socket API calls to the backend. I am referring to this ring as
the command ring. This is an example of the ring format:

#define XENSOCK_CONNECT  0
#define XENSOCK_RELEASE  3
#define XENSOCK_BIND     4
#define XENSOCK_LISTEN   5
#define XENSOCK_ACCEPT   6
#define XENSOCK_POLL     7

struct xen_xensock_request {
    uint32_t id;     /* private to guest, echoed in response */
    uint32_t cmd;    /* command to execute */
    uint64_t sockid; /* id of the socket */
    union {
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28]; /* ipv6 ready */
            uint32_t len;
        } bind;
        struct xen_xensock_accept {
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
            uint64_t sockid;
        } accept;
    } u;
};

struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
                  struct xen_xensock_response);

Connect and accept lead to the creation of new active sockets. Today
each active socket has its own event channel and ring for sending and
receiving data.
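To illustrate how the command ring would be driven, here is a minimal,
hypothetical frontend sketch in the style of the existing Linux PV
drivers. It only assumes the request structure and DEFINE_RING_TYPES
invocation above; struct xensock_frontend, its fields and the omitted
grant/event-channel setup are placeholders, not part of the proposed
ABI.

/*
 * Hypothetical sketch (Linux kernel style): queue a XENSOCK_CONNECT
 * request on the command ring.  struct xensock_frontend and its fields
 * are placeholders; only the ring manipulation reflects the proposal.
 */
#include <linux/string.h>
#include <xen/events.h>                 /* notify_remote_via_evtchn() */
#include <xen/interface/io/ring.h>      /* RING_* macros              */

struct xensock_frontend {
    struct xen_xensock_front_ring cmd_ring; /* from DEFINE_RING_TYPES     */
    uint32_t cmd_evtchn;                    /* command ring event channel */
    uint32_t next_id;                       /* request id generator       */
};

static void xensock_queue_connect(struct xensock_frontend *fe,
                                  uint64_t sockid,
                                  const uint8_t *addr, uint32_t addrlen)
{
    struct xen_xensock_request *req;
    int notify;

    /* caller guarantees addrlen <= sizeof(req->u.connect.addr), 28 bytes */
    req = RING_GET_REQUEST(&fe->cmd_ring, fe->cmd_ring.req_prod_pvt);
    fe->cmd_ring.req_prod_pvt++;

    req->id     = fe->next_id++;    /* echoed back in the response */
    req->cmd    = XENSOCK_CONNECT;
    req->sockid = sockid;
    memcpy(req->u.connect.addr, addr, addrlen);
    req->u.connect.len = addrlen;
    /*
     * req->u.connect.ref[] and req->u.connect.evtchn would be filled
     * here with the grant references and event channel of the new
     * per-socket data ring (setup omitted in this sketch).
     */

    RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&fe->cmd_ring, notify);
    if (notify)
        notify_remote_via_evtchn(fe->cmd_evtchn);
}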
Data rings have the following format:

#define XENSOCK_DATARING_ORDER 2
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE  (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
    char in[XENSOCK_DATARING_SIZE/4];
    char out[XENSOCK_DATARING_SIZE/2];
    XENSOCK_RING_IDX in_cons, in_prod;
    XENSOCK_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;
};

The ring works like the Xen console ring (see
xen/include/public/io/console.h): data is copied to/from the ring by
both frontend and backend, and in_error/out_error are used to report
errors.

This simple design works well, but it requires at least 1 page per
active socket. To get good performance (~20 Gbit/sec single stream), we
need buffers of at least 64K, so we are actually looking at about 64
pages per ring (order 6). I am currently investigating the usage of
AVX2 to perform the data copy.


# Brainstorming

Are 64 pages per active socket a reasonable amount in the context of
modern OS level networking? I believe that regular Linux TCP sockets
allocate something in that order of magnitude. If that's too much, I
spent some time thinking about ways to reduce it. Some ideas follow.

We could split up send and receive into two different data structures.
I am thinking of introducing a single ring for all active sockets with
variable size messages for sending data. Something like the following:

struct xensock_ring_entry {
    uint64_t sockid; /* identifies a socket */
    uint32_t len;    /* length of data to follow */
    uint8_t data[];  /* variable length data */
};

One ring would be dedicated to holding xensock_ring_entry structures,
one after another in a classic circular fashion. Two indexes, out_cons
and out_prod, would still be used the same way they are used in the
console ring, but I would place them on a separate page for clarity:

struct xensock_ring_intf {
    XENSOCK_RING_IDX out_cons, out_prod;
};

The frontend, which is the producer, writes a new struct
xensock_ring_entry to the ring, being careful not to exceed the
remaining free bytes available, then increments out_prod by the amount
written. The backend, which is the consumer, reads the new struct
xensock_ring_entry, reading as much data as specified by "len", then
increments out_cons by the size of the struct xensock_ring_entry read.
I think this could work.

Theoretically we could do the same thing for receive: a separate single
ring shared by all active sockets. We could even reuse struct
xensock_ring_entry. However I have doubts that this model could work
well for receive.

When sending data, all sockets on the frontend side copy buffers onto
this single ring. If there is no room, the frontend returns ENOBUFS.
The backend picks up the data from the ring and calls sendmsg, which
can also return ENOBUFS. In that case we don't increment out_cons,
leaving the data on the ring, and the backend will try again in the
near future. Error messages would have to go on a separate data
structure which I haven't finalized yet.
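As a rough sketch of that send path on the frontend side (not driver
code: the ring size being a power of two, the ring_data mapping,
XENSOCK_TX_RING_SIZE and the function name are all assumptions of this
illustration), the producer logic could look like this:

/*
 * Hypothetical sketch: frontend appends one variable-length entry to
 * the shared send ring.  Assumes the data area (ring_data,
 * XENSOCK_TX_RING_SIZE bytes, power of two) is mapped separately from
 * the index-only xensock_ring_intf page above, and that out_cons and
 * out_prod are free-running indexes as in the console ring.
 */
#include <linux/errno.h>
#include <asm/barrier.h>

#define XENSOCK_TX_RING_SIZE 4096       /* assumption for the sketch */

static int xensock_tx_write(struct xensock_ring_intf *intf,
                            uint8_t *ring_data, uint64_t sockid,
                            const uint8_t *buf, uint32_t len)
{
    XENSOCK_RING_IDX prod = intf->out_prod;
    XENSOCK_RING_IDX cons = intf->out_cons;
    uint32_t total = sizeof(struct xensock_ring_entry) + len;
    struct xensock_ring_entry ent = { .sockid = sockid, .len = len };
    uint32_t i;

    /* not enough free bytes for header + payload: caller retries later */
    if (total > XENSOCK_TX_RING_SIZE - (prod - cons))
        return -ENOBUFS;

    /* copy header then payload, byte by byte to handle wrap-around */
    for (i = 0; i < sizeof(ent); i++)
        ring_data[(prod + i) & (XENSOCK_TX_RING_SIZE - 1)] =
            ((uint8_t *)&ent)[i];
    for (i = 0; i < len; i++)
        ring_data[(prod + sizeof(ent) + i) & (XENSOCK_TX_RING_SIZE - 1)] =
            buf[i];

    /* publish the data before the new producer index becomes visible */
    wmb();
    intf->out_prod = prod + total;
    return 0;
}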
When receiving from a socket, the backend copies data to the ring as
soon as data is available, perhaps before the frontend requests the
data. Buffers are copied to the ring not necessarily in the order that
the frontend might want to read them. Thus the frontend would have to
copy them out of the common ring into private per-socket dynamic
buffers, just to free the ring as soon as possible and consume the next
xensock_ring_entry. That doesn't look very advantageous in terms of
memory consumption or performance.

Alternatively, the frontend could leave the data on the ring if the
application hasn't asked for it yet. In that case the frontend would
look ahead without incrementing the in_cons pointer, keeping track of
which entries have been consumed and which have not. Only when the ring
is full would the frontend have no choice but to copy the data out of
the ring into temporary buffers. I am not sure how well this could work
in practice.

As a compromise, we could use a single shared ring for sending data,
and one ring per active socket to receive data. This would cut the
per-socket memory consumption in half (maybe to a quarter, moving the
indexes out of the shared data ring onto a separate page) and might be
an acceptable trade-off.
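To make the compromise a little more concrete, the two interfaces might
end up looking roughly like this. This is an illustration only, not a
proposed ABI: the names, the receive buffer size and the field layout
are guesses.

/*
 * Illustration only, not a proposed ABI: possible layout under the
 * compromise.  One send ring shared by all active sockets, with its
 * free-running indexes on a dedicated page and the granted data pages
 * holding back-to-back struct xensock_ring_entry items, plus one small
 * receive-only ring per active socket.  Error reporting would live in
 * a separate, yet to be defined structure.
 */

/* shared by all active sockets, frontend -> backend */
struct xensock_tx_ring_intf {
    XENSOCK_RING_IDX out_cons, out_prod;  /* on their own page */
};
/* ... followed by the shared data pages with xensock_ring_entry items */

/* one per active socket, backend -> frontend */
struct xensock_rx_ring_intf {
    char in[XENSOCK_DATARING_SIZE];       /* RX buffer only, size TBD */
    XENSOCK_RING_IDX in_cons, in_prod;
};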
Any feedback or ideas?

Many thanks,

Stefano