
Re: [Xen-devel] [DRAFT 1] XenSock protocol design document



On 07/08/2016 12:23 PM, Stefano Stabellini wrote:
> Hi all,
> 
Hey!

[...]

> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. There can only be one XenSock frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.

Would it make sense to export the minimum, default and maximum socket buffer
sizes as xenstore entries? These normally follow a convention that depends on
the socket type (and OS), or are set through socket options.


> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     
>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
>     
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };
> 
> The first three fields are common for every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7
> - **sockid** is generated by the frontend and identifies the socket to connect,
>   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
>   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>   socket.
>   
Interesting. Have you considered making setsockopt and getsockopt part of
this? Some options are common (POSIX-defined), and others are more exotic,
Linux- or FreeBSD-specific flavors: for example SO_REUSEPORT, which nginx uses
to load-balance across a set of workers, or Linux's SO_BUSY_POLL for
low-latency sockets. I'm not sure how sensible it is to expose all of these
socket options, as opposed to limiting them to a specific subset. Or maybe it
doesn't make sense for your case; see my further suggestion on the data ring
below.
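
To illustrate the "specific subset" idea, here is a minimal sketch in C. None
of it is in the draft: the struct name, the 16-byte optval bound and the
xensock_sockopt_allowed() helper are hypothetical, and the allowed set is only
an example built from POSIX-defined SOL_SOCKET options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical payload for a set/getsockopt request; not part of the draft. */
struct xensock_sockopt_args {
    int32_t level;      /* e.g. SOL_SOCKET */
    int32_t optname;    /* e.g. SO_RCVBUF */
    uint32_t optlen;    /* number of valid bytes in optval */
    uint8_t optval[16]; /* assumed upper bound on the option value size */
};

/* Three 4-byte fields plus 16 bytes of value, no padding expected. */
static_assert(sizeof(struct xensock_sockopt_args) == 28,
              "unexpected padding in xensock_sockopt_args");

/*
 * Hypothetical backend-side filter: only a small, portable subset of
 * socket-level options is forwarded; everything else is rejected.
 */
static bool xensock_sockopt_allowed(int32_t level, int32_t optname)
{
    if (level != SOL_SOCKET)
        return false;

    switch (optname) {
    case SO_REUSEADDR:
    case SO_KEEPALIVE:
    case SO_RCVBUF:
    case SO_SNDBUF:
        return true;
    default:
        return false;
    }
}
```

With a filter like this, OS-specific options such as SO_BUSY_POLL would simply
be refused by the backend unless they were added to the list on purpose.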

> All three fields are echoed back by the backend.
> 
> As for the other Xen ring based protocols, after writing a request to the ring,
> the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> The format is the following:
> 
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>    
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |  ret  |
>     +-------+-------+-------+-------+-------+
> 
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success or failure
> 
Are these field sizes taken from a specific OS (I assumed Linux)? The id, cmd
and ret fields could probably be smaller overall, or maybe not, in which case
it would be useful for the spec to state that it follows a specific OS.
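
To make the size question concrete, here is a small C sketch that checks the
response layout quoted above. The struct is copied from the draft; the
xensock_response_tail_padding() helper is hypothetical and only there for
illustration. One thing it shows is that sizeof() of the struct is not
necessarily 20: an ABI that aligns uint64_t to 8 bytes (such as x86-64) pads
it to 24, while i386 keeps it at 20, so the spec may want to spell out the
size explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Response layout copied from the draft. */
struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

/* Field offsets match the diagram in the draft: 0, 4, 8 and 16. */
static_assert(offsetof(struct xen_xensock_response, id) == 0, "id at 0");
static_assert(offsetof(struct xen_xensock_response, cmd) == 4, "cmd at 4");
static_assert(offsetof(struct xen_xensock_response, sockid) == 8,
              "sockid at 8");
static_assert(offsetof(struct xen_xensock_response, ret) == 16, "ret at 16");

/*
 * Hypothetical helper: number of padding bytes the compiler appends after
 * ret. This depends on the alignment the ABI requires for uint64_t.
 */
static size_t xensock_response_tail_padding(void)
{
    size_t end_of_ret = offsetof(struct xen_xensock_response, ret)
                        + sizeof(int32_t);

    return sizeof(struct xen_xensock_response) - end_of_ret;
}
```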

[...]

> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they provide
> excellent performance.
> 
> - **in** is an array of 65536 bytes, used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array of 131072 bytes, used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.

Could this size be a tunable that intercepts SO_RCVBUF and SO_SNDBUF sockopt
adjustments (both are POSIX-defined)? This is of course under the assumption
that in this proposal you want to replicate the local and remote sockets. In
other words, dynamically allocate how much the socket will use for sending and
receiving, which would translate into the number of grants in use. Even
configuring this through xenstore entries in the backend would be better than
a compile-time constant, although users may still want to adjust the
send/receive buffers to whatever the application needs. Ideally this would be
dynamic per socket instead of compile-time defined, which would allow more
sockets on the same VM without overshooting the grant table limits.
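
As a rough illustration of how a per-socket buffer size would translate into
grants, here is a minimal C sketch. The helper names are hypothetical and a
4 KiB page size is assumed; only the order 6 ring (64 grant references per
socket) and the 65536/131072 byte buffers come from the draft.

```c
#include <assert.h>
#include <stdint.h>

#define XENSOCK_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define XENSOCK_PAGE_SIZE  (UINT64_C(1) << XENSOCK_PAGE_SHIFT)

static_assert(XENSOCK_PAGE_SIZE == 4096, "sketch assumes 4 KiB pages");

/* Hypothetical: pages (one grant reference each) needed to back `bytes`. */
static uint64_t xensock_pages_for(uint64_t bytes)
{
    return (bytes + XENSOCK_PAGE_SIZE - 1) >> XENSOCK_PAGE_SHIFT;
}

/* Hypothetical: smallest ring order whose page count covers `bytes`. */
static unsigned int xensock_order_for(uint64_t bytes)
{
    uint64_t pages = xensock_pages_for(bytes);
    unsigned int order = 0;

    while ((UINT64_C(1) << order) < pages)
        order++;
    return order;
}

/* Hypothetical: grant references consumed by `nr_socks` rings of `order`. */
static uint64_t xensock_grants_for(uint64_t nr_socks, unsigned int order)
{
    return nr_socks << order;
}
```

For example, xensock_grants_for(100, 6) gives 6400 grant references for 100
sockets with the compile-time order 6 ring, whereas sockets that only asked
for a 16 KiB buffer would need order 2, i.e. 400 references in total.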

Joao

