Re: [Xen-devel] [DOC v7] PV Calls protocol design
ping? On Thu, 13 Oct 2016, Stefano Stabellini wrote: > Hi all, > > This is the design document of the PV Calls protocol. You can find > prototypes of the Linux frontend and backend drivers here: > > git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-7 > > To use them, make sure to enable CONFIG_XEN_PVCALLS in your kernel > config and add "pvcalls=1" to the command line of your DomU Linux > kernel. You also need the toolstack to create the initial xenstore nodes > for the protocol. To do that, please apply the attached patch to libxl > (the patch is based on Xen 4.7.0-rc3) and add "pvcalls=1" to your DomU > config file. > > Please review! > > Cheers, > > Stefano > > > Changes in v7: > - add a glossary of Xen terms > - add a paragraph on why Xen was chosen > - wording improvements > - add links to xenstore documents and headers > - specify that the current version is "1" > - rename max-dataring-page-order to max-page-order > - rename networking-calls to function-calls > - add links to [Data ring] throughout the document > - explain the difference between req_id and id > - mention that future commands larger than 56 bytes will require a > version increase > - mention that the list of commands is in calling order > - clarify that reuse data rings are found by ref > - rename ENOTSUPP to ENOTSUP > - add padding in struct pvcalls_data_intf for cachelining > - rename pvcalls_ring_queued to pvcalls_ring_unconsumed > > > Changes in v6: > - add reuse field in release command > - add "networking-calls" backend node on xenstore > - fixed tab/whitespace indentation > > Changes in v5: > - clarify text > - rename id to req_id > - rename sockid to id > - move id to request and response specific fields > - add version node to xenstore > > Changes in v4: > - rename xensock to pvcalls > > Changes in v3: > - add a dummy element to struct xen_xensock_request to make sure the > size of the struct is the same on both x86_32 and x86_64 > > Changes in v2: > - add 
max-dataring-page-order > - add "Publish backend features and transport parameters" to backend > xenbus workflow > - update new cmd values > - update xen_xensock_request > - add backlog parameter to listen and binary layout > - add description of new data ring format (interface+data) > - modify connect and accept to reflect new data ring format > - add link to POSIX docs > - add error numbers > - add address format section and relevant numeric definitions > - add explicit mention of unimplemented commands > - add protocol node name > - add xenbus shutdown diagram > - add socket operation > > --- > > > # PV Calls Protocol version 1 > > ## Glossary > > The following is a list of terms and definitions used in the Xen > community. If you are a Xen contributor you can skip this section. > > * PV > > Short for paravirtualized. > > * Dom0 > > First virtual machine that boots. In most configurations Dom0 is > privileged and has control over hardware devices, such as network > cards, graphic cards, etc. > > * DomU > > Regular unprivileged Xen virtual machine. > > * Domain > > A Xen virtual machine. Dom0 and all DomUs are separate Xen > domains. > > * Guest > > Same as domain: a Xen virtual machine. > > * Frontend > > Each DomU has one or more paravirtualized frontend drivers to access > disks, network, console, graphics, etc. The presence of PV devices is > advertised on XenStore, a cross domain key-value database. Frontends > are similar in intent to the virtio drivers in Linux. > > * Backend > > A Xen paravirtualized backend typically runs in Dom0 and is used to > export disks, network, console, graphics, etc., to DomUs. Backends can > live both in kernel space and in userspace. For example xen-blkback > lives under drivers/block in the Linux kernel and xen_disk lives under > hw/block in QEMU. Paravirtualized backends are similar in intent to > virtio device emulators. 
> > * VMX and SVM > > On Intel processors, VMX is the CPU flag for VT-x, hardware > virtualization support. It corresponds to SVM on AMD processors. > > > > ## Rationale > > PV Calls is a paravirtualized protocol that allows the implementation of > a set of POSIX functions in a different domain. The PV Calls frontend > sends POSIX function calls to the backend, which implements them and > returns a value to the frontend. > > This version of the document covers networking function calls, such as > connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but > the protocol is meant to be easily extended to cover different sets of > calls. Unimplemented commands return ENOTSUP. > > PV Calls provide the following benefits: > * full visibility of the guest behavior on the backend domain, allowing > for inexpensive filtering and manipulation of any guest calls > * excellent performance > > Specifically, PV Calls for networking offer these advantages: > * guest networking works out of the box with VPNs, wireless networks and > any other complex configurations on the host > * guest services listen on ports bound directly to the backend domain IP > addresses > * localhost becomes a secure namespace for inter-VM communication > > > ## Design > > ### Why Xen? > > PV Calls are part of an effort to create a secure runtime environment > for containers (OCI images to be precise). PV Calls are based on Xen, > although porting them to other hypervisors is possible. Xen was chosen > because of its security and isolation properties and because it supports > PV guests, a type of virtual machine that does not require hardware > virtualization extensions (VMX on Intel processors and SVM on AMD > processors). This is important because PV Calls is meant for containers > and containers are often run on top of public cloud instances, which do > not support nested VMX (or SVM) as of today (late 2016). 
Xen PV guests > are lightweight, minimalist, and do not require machine emulation: all > properties that make them a good fit for the project. > > ### Xenstore > > The frontend and the backend connect via [xenstore] to > exchange information. The toolstack creates front and back nodes with > state [XenbusStateInitialising]. The protocol node name > is **pvcalls**. There can only be one PV Calls frontend per domain. > > #### Frontend XenBus Nodes > > port > Values: <uint32_t> > > The identifier of the Xen event channel used to signal activity > in the ring buffer. > > ring-ref > Values: <uint32_t> > > The Xen grant reference granting permission for the backend to map > the sole page of a single-page-sized ring buffer. Later on this > ring will be referred to as the commands ring. > > #### Backend XenBus Nodes > > version > Values: <uint32_t> > > Protocol version supported by the backend. Currently the value is > 1. > > max-page-order > Values: <uint32_t> > > The maximum supported size of a memory allocation in units of > lb(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages, > etc. > > function-calls > Values: <uint32_t> > > Value "0" means that no calls are supported. > Value "1" means that socket, connect, release, bind, listen, accept > and poll are supported. > > #### State Machine > > Initialization: > > *Front* *Back* > XenbusStateInitialising XenbusStateInitialising > - Query virtual device - Query backend device > properties. identification data. > - Setup OS device instance. - Publish backend features > - Allocate and initialize the and transport parameters > request ring. | > - Publish transport parameters | > that will be in effect during V > this connection. XenbusStateInitWait > | > | > V > XenbusStateInitialised > > - Query frontend transport > parameters. > - Connect to the request ring and > event channel. > | > | > V > XenbusStateConnected > > - Query backend device properties. > - Finalize OS virtual device > instance. 
> | > | > V > XenbusStateConnected > > Once frontend and backend are connected, they have a shared page, which > is used to exchange messages over a ring, and an event channel, > which is used to send notifications. > > Shutdown: > > *Front* *Back* > XenbusStateConnected XenbusStateConnected > | > | > V > XenbusStateClosing > > - Unmap grants > - Unbind evtchns > | > | > V > XenbusStateClosing > > - Unbind evtchns > - Free rings > - Free data structures > | > | > V > XenbusStateClosed > > - Free remaining data structures > | > | > V > XenbusStateClosed > > > ### Commands Ring > > The shared ring is used by the frontend to forward POSIX function calls > to the backend. We shall refer to this ring as **commands ring** to > distinguish it from other rings which can be created later in the > lifecycle of the protocol (see [Data ring]). The grant reference of the > shared page for this ring is published on xenstore (see [Frontend XenBus > Nodes]). The ring format is defined using the familiar > `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend > requests are allocated on the ring using the `RING_GET_REQUEST` macro. > The list of commands below is in calling order. 
> > The format is defined as follows: > > #define PVCALLS_SOCKET 0 > #define PVCALLS_CONNECT 1 > #define PVCALLS_RELEASE 2 > #define PVCALLS_BIND 3 > #define PVCALLS_LISTEN 4 > #define PVCALLS_ACCEPT 5 > #define PVCALLS_POLL 6 > > struct xen_pvcalls_request { > uint32_t req_id; /* private to guest, echoed in response */ > uint32_t cmd; /* command to execute */ > union { > struct xen_pvcalls_socket { > uint64_t id; > uint32_t domain; > uint32_t type; > uint32_t protocol; > } socket; > struct xen_pvcalls_connect { > uint64_t id; > uint8_t addr[28]; > uint32_t len; > uint32_t flags; > grant_ref_t ref; > uint32_t evtchn; > } connect; > struct xen_pvcalls_release { > uint64_t id; > uint8_t reuse; > } release; > struct xen_pvcalls_bind { > uint64_t id; > uint8_t addr[28]; > uint32_t len; > } bind; > struct xen_pvcalls_listen { > uint64_t id; > uint32_t backlog; > } listen; > struct xen_pvcalls_accept { > uint64_t id; > uint64_t id_new; > grant_ref_t ref; > uint32_t evtchn; > } accept; > struct xen_pvcalls_poll { > uint64_t id; > } poll; > /* dummy member to force sizeof(struct xen_pvcalls_request) to > match across archs */ > struct xen_pvcalls_dummy { > uint8_t dummy[56]; > } dummy; > } u; > }; > > The first two fields are common for every command. Their binary layout > is: > > 0 4 8 > +-------+-------+ > |req_id | cmd | > +-------+-------+ > > - **req_id** is generated by the frontend and is a cookie used to > identify one specific request/response pair of commands. Not to be > confused with the **id** field, which is used to identify a socket > across multiple commands; see [Socket]. > - **cmd** is the command requested by the frontend: > > - `PVCALLS_SOCKET`: 0 > - `PVCALLS_CONNECT`: 1 > - `PVCALLS_RELEASE`: 2 > - `PVCALLS_BIND`: 3 > - `PVCALLS_LISTEN`: 4 > - `PVCALLS_ACCEPT`: 5 > - `PVCALLS_POLL`: 6 > > Both fields are echoed back by the backend. See [Socket families and > address format] for the format of the **addr** field of connect and > bind. 
The maximum size of command specific arguments is 56 bytes. Any > future command that requires more than that will require a bump of the > **version** of the protocol. > > Similarly to other Xen ring based protocols, after writing a request to > the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and > issues an event channel notification when a notification is required. > > Backend responses are allocated on the ring using the `RING_GET_RESPONSE` > macro. > The format is the following: > > struct xen_pvcalls_response { > uint32_t req_id; > uint32_t cmd; > int32_t ret; > uint32_t pad; > union { > struct _xen_pvcalls_socket { > uint64_t id; > } socket; > struct _xen_pvcalls_connect { > uint64_t id; > } connect; > struct _xen_pvcalls_release { > uint64_t id; > } release; > struct _xen_pvcalls_bind { > uint64_t id; > } bind; > struct _xen_pvcalls_listen { > uint64_t id; > } listen; > struct _xen_pvcalls_accept { > uint64_t id; > } accept; > struct _xen_pvcalls_poll { > uint64_t id; > } poll; > struct _xen_pvcalls_dummy { > uint8_t dummy[8]; > } dummy; > } u; > }; > > The first four fields are common for every response. Their binary layout > is: > > 0 4 8 12 16 > +-------+-------+-------+-------+ > |req_id | cmd | ret | pad | > +-------+-------+-------+-------+ > > - **req_id**: echoed back from request > - **cmd**: echoed back from request > - **ret**: return value, identifies success (0) or failure (see error numbers > below). If the **cmd** is not supported by the backend, ret is ENOTSUP. > - **pad**: padding > > After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks > whether > it needs to notify the frontend and does so via event channel. > > A description of each command and its additional request and response > fields follows. > > > #### Socket > > The **socket** operation corresponds to the POSIX [socket][socket] > function. It creates a new socket of the specified family, type and > protocol. 
**id** is freely chosen by the frontend and references this > specific socket from this point forward. See "Socket families and > address format" below. > > Request fields: > > - **cmd** value: 0 > - additional fields: > - **id**: generated by the frontend, it identifies the new socket > - **domain**: the communication domain > - **type**: the socket type > - **protocol**: the particular protocol to be used with the socket, usually > 0 > > Request binary layout: > > 8 12 16 20 24 28 > +-------+-------+-------+-------+-------+ > | id |domain | type |protoco| > +-------+-------+-------+-------+-------+ > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +-------+--------+ > | id | > +-------+--------+ > > Return value: > > - 0 on success > - See the [POSIX socket function][socket] for error names; the > corresponding > error numbers are specified later in this document. > > #### Connect > > The **connect** operation corresponds to the POSIX [connect][connect] > function. It connects a previously created socket (identified by **id**) > to the specified address. > > The connect operation creates a new shared ring, which we'll call **data > ring**. The [Data ring] is used to send and receive data from the > socket. The connect operation passes two additional parameters: > **evtchn** and **ref**. **evtchn** is the port number of a new event > channel which will be used for notifications of activity on the data > ring. **ref** is the grant reference of a page which contains shared > indices that point to the write and read locations in the ring buffers. > **ref** also contains the full array of grant references for the ring > buffers. 
When the frontend issues a **connect** command, the backend: > > - finds its own internal socket corresponding to **id** > - connects the socket to **addr** > - maps the grant reference **ref**, the shared page contains the data > ring interface (`struct pvcalls_data_intf`) > - maps all the grant references listed in `struct pvcalls_data_intf` and > uses them as shared memory for the ring buffers > - binds the **evtchn** > - replies to the frontend > > The [Data ring] format will be described in the following section. The > data ring is unmapped and freed upon issuing a **release** command on > the active socket identified by **id**. > > Request fields: > > - **cmd** value: 1 > - additional fields: > - **id**: identifies the socket > - **addr**: address to connect to, see the address format section for more > information > - **len**: address length > - **flags**: flags for the connection, reserved for future usage > - **ref**: grant reference of the page containing `struct > pvcalls_data_intf` > - **evtchn**: port number of the evtchn to signal activity on the data ring > > Request binary layout: > > 8 12 16 20 24 28 32 36 40 44 > +-------+-------+-------+-------+-------+-------+-------+-------+-------+ > | id | addr | > +-------+-------+-------+-------+-------+-------+-------+-------+-------+ > | len | flags | ref |evtchn | > +-------+-------+-------+-------+ > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +-------+-------+ > | id | > +-------+-------+ > > Return value: > > - 0 on success > - See the [POSIX connect function][connect] for error names; the > corresponding > error numbers are specified later in this document. > > #### Release > > The **release** operation closes an existing active or passive socket. > > When a release command is issued on a passive socket, the backend > releases it and frees its internal mappings. 
When a release command is > issued for an active socket, the data ring is also unmapped and freed: > > - frontend sends release command for an active socket > - backend releases the socket > - backend unmaps the data ring buffers > - backend unmaps the data ring interface > - backend unbinds the evtchn > - backend replies to frontend > - frontend frees ring and unbinds evtchn > > Request fields: > > - **cmd** value: 2 > - additional fields: > - **id**: identifies the socket > - **reuse**: an optimization hint for the backend. The field is > ignored for passive sockets. When set to 1, the frontend lets the > backend know that it will reuse exactly the same set of grant pages > (interface and data ring) and evtchn when creating one of the next > active sockets. The backend can take advantage of it by delaying > unmapping grants and unbinding the evtchn. The backend is free to > ignore the hint. Reused data rings are found by **ref**, the grant > reference of the page containing the indices. > > Request binary layout: > > 8 12 16 17 > +-------+-------+-----+ > | id |reuse| > +-------+-------+-----+ > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +-------+-------+ > | id | > +-------+-------+ > > Return value: > > - 0 on success > - See the [POSIX shutdown function][shutdown] for error names; the > corresponding error numbers are specified later in this document. > > #### Bind > > The **bind** operation corresponds to the POSIX [bind][bind] function. > It assigns the address passed as parameter to a previously created > socket, identified by **id**. **Bind**, **listen** and **accept** are > the three operations required to have fully working passive sockets and > should be issued in this order. 
> > Request fields: > > - **cmd** value: 3 > - additional fields: > - **id**: identifies the socket > - **addr**: address to bind to, see the address format section for more > information > - **len**: address length > > Request binary layout: > > 8 12 16 20 24 28 32 36 40 44 > +-------+-------+-------+-------+-------+-------+-------+-------+-------+ > | id | addr | > +-------+-------+-------+-------+-------+-------+-------+-------+-------+ > | len | > +-------+ > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +-------+-------+ > | id | > +-------+-------+ > > Return value: > > - 0 on success > - See the [POSIX bind function][bind] for error names; the corresponding > error > numbers are specified later in this document. > > > #### Listen > > The **listen** operation marks the socket as a passive socket. It corresponds > to > the [POSIX listen function][listen]. > > Request fields: > > - **cmd** value: 4 > - additional fields: > - **id**: identifies the socket > - **backlog**: the maximum length to which the queue of pending > connections may grow > > Request binary layout: > > 8 12 16 20 > +-------+-------+-------+ > | id |backlog| > +-------+-------+-------+ > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +-------+-------+ > | id | > +-------+-------+ > > Return value: > - 0 on success > - See the [POSIX listen function][listen] for error names; the corresponding > error numbers are specified later in this document. > > > #### Accept > > The **accept** operation extracts the first connection request on the > queue of pending connections for the listening socket identified by > **id** and creates a new connected socket. The id of the new socket is > also chosen by the frontend and passed as an additional field of the > accept request struct (**id_new**). See the [POSIX accept function][accept] > as reference. 
> > Similarly to the **connect** operation, **accept** creates a new [Data > ring]. The data ring is used to send and receive data from the socket. > The **accept** operation passes two additional parameters: **evtchn** > and **ref**. **evtchn** is the port number of a new event channel which > will be used for notifications of activity on the data ring. **ref** is > the grant reference of a page which contains shared indices that point > to the write and read locations in the ring buffers. **ref** also > contains the full array of grant references for the ring buffers. > > The backend will reply to the request only when a new connection is > successfully accepted, i.e. the backend does not return EAGAIN or > EWOULDBLOCK. > > Example workflow: > > - frontend issues an **accept** request > - backend waits for a connection to be available on the socket > - a new connection becomes available > - backend accepts the new connection > - backend creates an internal mapping from **id_new** to the new socket > - backend maps the grant reference **ref**, the shared page contains the > data ring interface (`struct pvcalls_data_intf`) > - backend maps all the grant references listed in `struct > pvcalls_data_intf` and uses them as shared memory for the new data > ring > - backend binds the **evtchn** > - backend replies to the frontend > > Request fields: > > - **cmd** value: 5 > - additional fields: > - **id**: id of listening socket > - **id_new**: id of the new socket > - **ref**: grant reference of the data ring interface (`struct > pvcalls_data_intf`) > - **evtchn**: port number of the evtchn to signal activity on the data ring > > Request binary layout: > > 8 12 16 20 24 28 32 > +-------+-------+-------+-------+-------+-------+ > | id | id_new | ref |evtchn | > +-------+-------+-------+-------+-------+-------+ > > Response additional fields: > > - **id**: id of the listening socket, echoed back from request > > Response binary layout: > > 16 20 24 > +-------+-------+ > | 
id | > +-------+-------+ > > Return value: > > - 0 on success > - See the [POSIX accept function][accept] for error names; the corresponding > error numbers are specified later in this document. > > > #### Poll > > In this version of the protocol, the **poll** operation is only valid > for passive sockets. For active sockets, the frontend should look at the > state of the data ring. When a new connection is available in the queue > of the passive socket, the backend generates a response and notifies the > frontend. > > Request fields: > > - **cmd** value: 6 > - additional fields: > - **id**: identifies the listening socket > > Request binary layout: > > 8 12 16 > +-------+-------+ > | id | > +-------+-------+ > > > Response additional fields: > > - **id**: echoed back from request > > Response binary layout: > > 16 20 24 > +--------+--------+ > | id | > +--------+--------+ > > Return value: > > - 0 on success > - See the [POSIX poll function][poll] for error names; the corresponding > error > numbers are specified later in this document. 
> > #### Error numbers > > The numbers corresponding to the error names specified by POSIX are: > > [EPERM] -1 > [ENOENT] -2 > [ESRCH] -3 > [EINTR] -4 > [EIO] -5 > [ENXIO] -6 > [E2BIG] -7 > [ENOEXEC] -8 > [EBADF] -9 > [ECHILD] -10 > [EAGAIN] -11 > [EWOULDBLOCK] -11 > [ENOMEM] -12 > [EACCES] -13 > [EFAULT] -14 > [EBUSY] -16 > [EEXIST] -17 > [EXDEV] -18 > [ENODEV] -19 > [EISDIR] -21 > [EINVAL] -22 > [ENFILE] -23 > [EMFILE] -24 > [ENOSPC] -28 > [EROFS] -30 > [EMLINK] -31 > [EDOM] -33 > [ERANGE] -34 > [EDEADLK] -35 > [EDEADLOCK] -35 > [ENAMETOOLONG] -36 > [ENOLCK] -37 > [ENOSYS] -38 > [ENOTEMPTY] -39 > [ENODATA] -61 > [ETIME] -62 > [EBADMSG] -74 > [EOVERFLOW] -75 > [EILSEQ] -84 > [ERESTART] -85 > [ENOTSOCK] -88 > [EOPNOTSUPP] -95 > [EAFNOSUPPORT] -97 > [EADDRINUSE] -98 > [EADDRNOTAVAIL] -99 > [ENOBUFS] -105 > [EISCONN] -106 > [ENOTCONN] -107 > [ETIMEDOUT] -110 > [ENOTSUP] -524 > > #### Socket families and address format > > The following definitions and explicit sizes, together with POSIX > [sys/socket.h][address] and [netinet/in.h][in] define socket families and > address format. Please be aware that only the **domain** `AF_INET`, **type** > `SOCK_STREAM` and **protocol** `0` are supported by this version of the > spec, others return ENOTSUP. > > #define AF_UNSPEC 0 > #define AF_UNIX 1 /* Unix domain sockets */ > #define AF_LOCAL 1 /* POSIX name for AF_UNIX */ > #define AF_INET 2 /* Internet IP Protocol */ > #define AF_INET6 10 /* IP version 6 */ > > #define SOCK_STREAM 1 > #define SOCK_DGRAM 2 > #define SOCK_RAW 3 > > /* generic address format */ > struct sockaddr { > uint16_t sa_family; > char sa_data[26]; > }; > > struct in_addr { > uint32_t s_addr; > }; > > /* AF_INET address format */ > struct sockaddr_in { > uint16_t sin_family; > uint16_t sin_port; > struct in_addr sin_addr; > char sin_zero[20]; > }; > > > ### Data ring > > Data rings are used for sending and receiving data over a connected socket. 
> They > are created upon a successful **accept** or **connect** command. > The **sendmsg** and **recvmsg** calls are implemented by sending data and > receiving data from data rings. > > A data ring is composed of two pieces: the interface and the **in** and > **out** > buffers. The interface, represented by `struct pvcalls_data_intf` is shared > first and resides on the page whose grant reference is passed by **accept** > and > **connect** as parameter. `struct pvcalls_data_intf` contains the list of > grant > references which constitute the **in** and **out** data buffers. > > #### Data ring interface > > struct pvcalls_data_intf { > PVCALLS_RING_IDX in_cons, in_prod; > int32_t in_error; > > uint8_t pad[52]; > > PVCALLS_RING_IDX out_cons, out_prod; > int32_t out_error; > > uint32_t ring_order; > grant_ref_t ref[]; > }; > > /* not actually C compliant (ring_order changes from socket to socket) */ > struct pvcalls_data { > char in[((1<<ring_order)<<PAGE_SHIFT)/2]; > char out[((1<<ring_order)<<PAGE_SHIFT)/2]; > }; > > - **ring_order** > It represents the order of the data ring. The following list of grant > references is of `(1 << ring_order)` elements. It cannot be greater than > **max-page-order**, as specified by the backend on XenBus. > - **ref[]** > The list of grant references which will contain the actual data. They are > mapped contiguously in virtual memory. The first half of the pages is the > **in** array, the second half is the **out** array. The array must > have a power of two number of elements. > - **in** is an array used as circular buffer > It contains data read from the socket. The producer is the backend, the > consumer is the frontend. > - **out** is an array used as circular buffer > It contains data to be written to the socket. The producer is the frontend, > the consumer is the backend. > - **in_cons** and **in_prod** > Consumer and producer indices for data read from the socket. 
They keep track > of how much data has already been consumed by the frontend from the **in** > array. **in_prod** is increased by the backend, after writing data to > **in**. > **in_cons** is increased by the frontend, after reading data from **in**. > - **out_cons**, **out_prod** > Consumer and producer indices for the data to be written to the socket. They > keep track of how much data has been written by the frontend to **out** and > how much data has already been consumed by the backend. **out_prod** is > increased by the frontend, after writing data to **out**. **out_cons** is > increased by the backend, after reading data from **out**. > - **in_error** and **out_error** They signal errors when reading from the > socket > (**in_error**) or when writing to the socket (**out_error**). 0 means no > errors. When an error occurs, no further read or write operations are > performed on the socket. In the case of an orderly socket shutdown (i.e. > read > returns 0) **in_error** is set to ENOTCONN. **in_error** and **out_error** > are never set to EAGAIN or EWOULDBLOCK. > > The binary layout of `struct pvcalls_data_intf` follows: > > 0 4 8 12 64 68 72 > 76 > +---------+---------+---------+-----//-----+---------+---------+---------+ > | in_cons | in_prod |in_error | padding |out_cons |out_prod |out_error| > +---------+---------+---------+-----//-----+---------+---------+---------+ > > 76 80 84 88 4092 4096 > +---------+---------+---------+----//---+---------+ > |ring_orde| ref[0] | ref[1] | | ref[N] | > +---------+---------+---------+----//---+---------+ > > **N.B.** For one page, N is maximum 1004 ((4096-80)/4), but given that N needs > to be a power of two, actually max N is 512. 
> The binary layout of the ring buffers follows: > > 0 ((1<<ring_order)<<PAGE_SHIFT)/2 > ((1<<ring_order)<<PAGE_SHIFT) > +------------//-------------+------------//-------------+ > | in | out | > +------------//-------------+------------//-------------+ > > #### Workflow > > The **in** and **out** arrays are used as circular buffers: > > 0 sizeof(array) == > ((1<<ring_order)<<PAGE_SHIFT)/2 > +-----------------------------------+ > |to consume| free |to consume | > +-----------------------------------+ > ^ ^ > prod cons > > 0 sizeof(array) > +-----------------------------------+ > | free | to consume | free | > +-----------------------------------+ > ^ ^ > cons prod > > The following function is provided to calculate how many bytes are currently > left unconsumed in an array: > > #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & (ring_size-1)) > > static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX > prod, > PVCALLS_RING_IDX cons, > PVCALLS_RING_IDX ring_size) > { > PVCALLS_RING_IDX size; > > if (prod == cons) > return 0; > > prod = _MASK_PVCALLS_IDX(prod, ring_size); > cons = _MASK_PVCALLS_IDX(cons, ring_size); > > if (prod == cons) > return ring_size; > > if (prod > cons) > size = prod - cons; > else { > size = ring_size - cons; > size += prod; > } > return size; > } > > The producer (the backend for **in**, the frontend for **out**) writes to the > array in the following way: > > - read *cons*, *prod*, *error* from shared memory > - memory barrier > - return on *error* > - write to array at position *prod* up to *cons*, wrapping around the circular > buffer when necessary > - memory barrier > - increase *prod* > - notify the other end via evtchn > > The consumer (the backend for **out**, the frontend for **in**) reads from the > array in the following way: > > - read *prod*, *cons*, *error* from shared memory > - memory barrier > - return on *error* > - read from array at position *cons* up to *prod*, wrapping around the > circular > buffer 
when necessary > - memory barrier > - increase *cons* > - notify the other end via evtchn > > The producer takes care of writing only as many bytes as available in the > buffer > up to *cons*. The consumer takes care of reading only as many bytes as > available > in the buffer up to *prod*. *error* is set by the backend when an error occurs > writing or reading from the socket. > > > [xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt > [XenbusStateInitialising]: > http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html > [address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html > [in]: > http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html > [socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html > [connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html > [shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html > [bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html > [listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html > [accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html > [poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel