
Re: [Xen-devel] [DRAFT 1] XenSock protocol design document



> -----Original Message-----
[snip]
> 
> # XenSocks Protocol v1
> 
> ## Rationale
> 
> XenSocks is a paravirtualized protocol for the POSIX socket API.
> 
> The purpose of XenSocks is to allow the implementation of a specific set
> of POSIX calls to be done in a domain other than your own. It allows
> connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> implemented in another domain.

Does the other domain have privilege over the domain issuing the POSIX calls?

[snip]
> #### State Machine
> 
>     **Front**                             **Back**
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.                          |
>     - Allocate and initialize the                        |
>       request ring.                                      V
>     - Publish transport parameters                XenbusStateInitWait
>       that will be in effect during
>       this connection.
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport
>                                             parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page, which
> is used to exchange messages over a ring, and an event channel,
> which is used to send notifications.
> 

What about XenbusStateClosing and XenbusStateClosed? We're missing half the 
state model here. Specifically how do individual connections get terminated if 
either end moves to closing? Does either end have to wait for the other?
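
For reference, here is a minimal sketch of the handshake exactly as drawn above. The enum values are the ones in `xen/include/public/io/xenbus.h`; the helper names (`front_step`, `back_step`, `handshake_rounds`) are made up for illustration. Closing and Closed are deliberately not modelled, since the draft does not say what should happen there.

```c
#include <assert.h>

/* Numeric values as in xen/include/public/io/xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
    XenbusStateClosing      = 5,
    XenbusStateClosed       = 6
};

/* Hypothetical helper: next frontend state, given the backend's state. */
static enum xenbus_state front_step(enum xenbus_state front,
                                    enum xenbus_state back)
{
    switch (front) {
    case XenbusStateInitialising:
        /* Ring allocated and transport parameters published. */
        return XenbusStateInitialised;
    case XenbusStateInitialised:
        /* Only finalize once the backend has connected to the ring. */
        return back == XenbusStateConnected ? XenbusStateConnected : front;
    default:
        return front;
    }
}

/* Hypothetical helper: next backend state, given the frontend's state. */
static enum xenbus_state back_step(enum xenbus_state back,
                                   enum xenbus_state front)
{
    switch (back) {
    case XenbusStateInitialising:
        return XenbusStateInitWait;
    case XenbusStateInitWait:
        /* Needs the frontend's published parameters before connecting. */
        return front == XenbusStateInitialised ? XenbusStateConnected : back;
    default:
        return back;
    }
}

/* Steps both ends in lock step; returns the rounds needed to connect. */
int handshake_rounds(void)
{
    enum xenbus_state front = XenbusStateInitialising;
    enum xenbus_state back = XenbusStateInitialising;
    int rounds = 0;

    while (front != XenbusStateConnected || back != XenbusStateConnected) {
        enum xenbus_state next_front = front_step(front, back);
        enum xenbus_state next_back = back_step(back, front);

        front = next_front;
        back = next_back;
        rounds++;
        assert(rounds < 16); /* the handshake as drawn must terminate */
    }
    return rounds;
}
```

Nothing in this model ever leaves Connected, which is the gap I'm asking about.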

> 
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as the **commands ring** to distinguish it
> from other rings which will be created later in the lifecycle of the protocol
> (data rings). The ring format is defined using the familiar
> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend requests
> are allocated on the ring using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> 

Why a fixed size? Also, I assume DATARING should be CMDRING or somesuch here. 
Plus a fixed size of *six* pages seems like a lot.

>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
> 
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };
> 

Perhaps some layout diagrams for the above to avoid ABI assumptions?

> The first three fields are common to every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 

That's a start at least :-)

> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7
> - **sockid** is generated by the frontend and identifies the socket to
>   connect, bind, etc. A new sockid is required on `XENSOCK_CONNECT` and
>   `XENSOCK_BIND` commands. A new sockid is also required on
>   `XENSOCK_ACCEPT`, for the new socket.
> 

[snip]
> #### Connect
> 
> The **connect** operation corresponds to the connect system call. It
> connects a socket to the specified address. **sockid** is freely chosen by
> the frontend and references this specific socket from this point forward.
> 
> The connect operation creates a new shared ring, which we'll call the **data
> ring**. The new ring is used to send and receive data over the connected
> socket. The information necessary to set up the new ring, such as grant table
> references and event channel ports, is passed from the frontend to the
> backend as part of this request. A **data ring** is unmapped and freed upon
> issuing a **release** command on the active socket identified by **sockid**.
> 
> When the frontend issues a **connect** command, the backend:
> - creates a new socket and connects it to **addr**
> - creates an internal mapping from **sockid** to its own socket
> - maps all the grant references and uses them as shared memory for the new
>   data ring
> - binds the **evtchn**
> - replies to the frontend
> 
> The data ring format will be described in the following section.
> 
> Fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **addr**: address to connect to, in struct sockaddr format
>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[63]|evtchn |
>         +-------+-------+
> 

So you really do want to bake a 64 page ring into the protocol then?
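
The diagrams so far can at least be checked against the C definition at compile time. Below is a minimal sketch, assuming `grant_ref_t` is the usual `uint32_t` typedef from `xen/include/public/grant_table.h` and an ABI without unusual padding rules; the offsets are the ones read off the diagrams above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* as in xen/include/public/grant_table.h */

#define XENSOCK_DATARING_ORDER 6
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)

/* Request structure copied from the draft. */
struct xen_xensock_request {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    union {
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28];
            uint32_t len;
        } bind;
        struct xen_xensock_accept {
            uint64_t sockid;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } accept;
    } u;
};

/* Common header: id, cmd, sockid occupy bytes 0..16. */
static_assert(offsetof(struct xen_xensock_request, id) == 0, "id");
static_assert(offsetof(struct xen_xensock_request, cmd) == 4, "cmd");
static_assert(offsetof(struct xen_xensock_request, sockid) == 8, "sockid");
static_assert(offsetof(struct xen_xensock_request, u) == 16, "union");

/* Connect payload, as in the connect diagram. */
static_assert(offsetof(struct xen_xensock_request, u.connect.addr) == 16, "addr");
static_assert(offsetof(struct xen_xensock_request, u.connect.len) == 44, "len");
static_assert(offsetof(struct xen_xensock_request, u.connect.flags) == 48, "flags");
static_assert(offsetof(struct xen_xensock_request, u.connect.ref) == 52, "ref[0]");
static_assert(offsetof(struct xen_xensock_request, u.connect.evtchn) == 308, "evtchn");

/* Bytes of every connect request taken up by grant references alone. */
size_t xensock_connect_ref_bytes(void)
{
    return sizeof(((struct xen_xensock_request *)0)->u.connect.ref);
}
```

With order 6 baked in, 256 bytes of each connect request are grant references, and because the request is a union every other command pays for that space too.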

> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the socket system call
> 

The socket system call on which OS?

> #### Bind
> 
> The **bind** operation assigns the address passed as a parameter to the
> socket. It corresponds to the bind system call.

Is a domain allowed to bind to a privileged port in the backend domain?

> **sockid** is freely chosen by the frontend and references this specific
> socket from this point forward. **Bind**, **listen** and **accept** are the
> three operations required to have fully working passive sockets and should
> be issued in this order.
> 
> Fields:
> 
> - **cmd** value: 4
> - additional fields:
>   - **addr**: address to bind to, in struct sockaddr format
>   - **len**: address length
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the bind system call
> 
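
To make the "struct sockaddr format" concrete, here is a rough sketch of how a frontend might fill the bind arguments for an IPv6 wildcard address. `struct xensock_bind_args` only mirrors `struct xen_xensock_bind` from the draft, and the helper name is hypothetical. Note that the bytes copied are whatever the frontend's C library uses for `struct sockaddr_in6`, including its value of `AF_INET6`, which is not the same on every OS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>    /* htons, ntohs */
#include <netinet/in.h>   /* struct sockaddr_in6, in6addr_any */

/* Mirrors struct xen_xensock_bind from the draft. */
struct xensock_bind_args {
    uint8_t addr[28];
    uint32_t len;
};

/* "ipv6 ready": a sockaddr_in6 has to fit in the 28-byte addr field. */
static_assert(sizeof(struct sockaddr_in6) <= 28, "sockaddr_in6 must fit");

/* Hypothetical helper: fill bind arguments for [::]:port. */
void xensock_fill_bind_any6(struct xensock_bind_args *args, uint16_t port)
{
    struct sockaddr_in6 sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin6_family = AF_INET6;
    sa.sin6_port = htons(port);   /* network byte order, as for bind(2) */
    sa.sin6_addr = in6addr_any;

    memset(args, 0, sizeof(*args));
    memcpy(args->addr, &sa, sizeof(sa));
    args->len = sizeof(sa);
}
```

Whether the backend should accept a port below 1024 here is exactly the privilege question above.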
> 
> #### Listen
> 
> The **listen** operation marks the socket as a passive socket. It
> corresponds to the listen system call.

...which also takes a 'backlog' parameter, which doesn't seem to be specified 
here.

> 
> Fields:
> 
> - **cmd** value: 5
> - additional fields: none
> 
> Return value:
>   - 0 on success
>   - less than 0 on failure, see the error codes of the listen system call
> 
> 

[snip]
> ### Data ring
> 
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command. The
> ring works in a similar way to the existing Xen console ring.
> 
> #### Format
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     typedef uint32_t XENSOCK_RING_IDX;
> 
>     struct xensock_ring_intf {
>       char in[XENSOCK_DATARING_SIZE/4];
>       char out[XENSOCK_DATARING_SIZE/2];

Why have differing sizes for the rings?

>       XENSOCK_RING_IDX in_cons, in_prod;
>       XENSOCK_RING_IDX out_cons, out_prod;
>       int32_t in_error, out_error;
>     };
> 
> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they
> provide excellent performance.
> 
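
To put numbers on the order 6 choice, here is a small sketch that works out the sizes, assuming 4 KiB pages. The producer function is hypothetical: it assumes free-running indices masked by the array size, in the style of the console ring, and leaves out the memory barriers and event channel notification a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define XENSOCK_DATARING_ORDER 6
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
typedef uint32_t XENSOCK_RING_IDX;

/* Interface structure copied from the draft. */
struct xensock_ring_intf {
    char in[XENSOCK_DATARING_SIZE / 4];
    char out[XENSOCK_DATARING_SIZE / 2];
    XENSOCK_RING_IDX in_cons, in_prod;
    XENSOCK_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;
};

#define XENSOCK_OUT_BYTES sizeof(((struct xensock_ring_intf *)0)->out)

static_assert(XENSOCK_DATARING_SIZE == 256 * 1024, "64 pages of 4 KiB");
static_assert(sizeof(((struct xensock_ring_intf *)0)->in) == 64 * 1024, "in");
static_assert(XENSOCK_OUT_BYTES == 128 * 1024, "out");
static_assert(sizeof(struct xensock_ring_intf) <= XENSOCK_DATARING_SIZE,
              "interface must fit in the granted pages");

/* Assumed console-style masking of a free-running index. */
#define XENSOCK_MASK_OUT(idx) ((idx) & (XENSOCK_OUT_BYTES - 1))

/* Hypothetical producer: copies up to len bytes, returns bytes written. */
size_t xensock_out_write(struct xensock_ring_intf *intf,
                         const char *buf, size_t len)
{
    XENSOCK_RING_IDX cons = intf->out_cons;
    XENSOCK_RING_IDX prod = intf->out_prod;
    size_t space = XENSOCK_OUT_BYTES - (XENSOCK_RING_IDX)(prod - cons);
    size_t n = len < space ? len : space;

    for (size_t i = 0; i < n; i++) {
        intf->out[XENSOCK_MASK_OUT(prod)] = buf[i];
        prod++;   /* wraps naturally at 2^32 */
    }
    intf->out_prod = prod;
    return n;
}
```

Under these assumptions `in` is 64 KiB and `out` is 128 KiB, so the two arrays plus the indices use about 192 KiB of the 256 KiB that gets granted.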

What about datagram sockets? Raw sockets? Setting socket options? Etc.

  Paul


 

