Re: [Xen-devel] [DOC v5] Xen transport for 9pfs
On Tue, 14 Feb 2017, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 13, 2017 at 11:47:26AM -0800, Stefano Stabellini wrote:
>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Thank you! For your convenience:

---
docs: add Xen transport for 9pfs

Signed-off-by: Stefano Stabellini <stefano@xxxxxxxxxxx>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

diff --git a/docs/misc/9pfs.markdown b/docs/misc/9pfs.markdown
new file mode 100644
index 0000000..7f13831
--- /dev/null
+++ b/docs/misc/9pfs.markdown
@@ -0,0 +1,419 @@

# Xen transport for 9pfs version 1

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent from the communication channels; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports TCP and Unix
sockets, fds, virtio and RDMA.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has the
following header:

    struct header {
        uint32_t size;
        uint8_t id;
        uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
The size of the request or response.

- *id*
The 9pfs request or response operation.

- *tag*
Unique id that identifies a specific request/response pair. It is used
to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.


## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings, up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design is to exploit
parallelism at the vCPU level and support multiple outstanding requests
simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    versions
         Values:         <string>

         List of comma-separated protocol versions supported by the backend.
         For example "1,2,3". Currently the value is just "1", as there is
         only one version. N.B.: this is the version of the Xen transport
         protocol, not the version of 9pfs supported by the server.

    max-rings
         Values:         <uint32_t>

         The maximum supported number of rings per frontend.

    max-ring-page-order
         Values:         <uint32_t>

         The maximum supported size of a single ring memory allocation, in
         units of log2(machine pages); e.g. 1 == 2 pages, 2 == 4 pages,
         etc. It must be at least 1.
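The negotiation implied by these nodes is small but easy to get wrong, so
here is a minimal sketch of how a frontend might select a protocol version
from **versions** and clamp its desired ring order against
**max-ring-page-order**. It is illustrative only: it assumes the values have
already been read from xenstore, and none of the helper names below are part
of the protocol.

    #include <stdint.h>
    #include <string.h>

    /* Return 0 if "1" appears in the backend's comma-separated "versions"
     * node (e.g. "1,2,3"), -1 otherwise.  Illustrative helper only. */
    static int backend_supports_version_1(const char *versions)
    {
        const char *p = versions;

        while (p && *p) {
            if (p[0] == '1' && (p[1] == '\0' || p[1] == ','))
                return 0;
            p = strchr(p, ',');
            if (p)
                p++;
        }
        return -1;
    }

    /* Clamp the ring order the frontend would like to use against the
     * backend's max-ring-page-order and the 2MB protocol limit (order 9
     * with 4KB pages).  Illustrative helper only. */
    static uint32_t choose_ring_order(uint32_t wanted,
                                      uint32_t max_ring_page_order)
    {
        uint32_t order = wanted;

        if (order > max_ring_page_order)
            order = max_ring_page_order;
        if (order > 9)
            order = 9;
        if (order < 1)
            order = 1;
        return order;
    }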
Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>

         Host filesystem path to share.

    tag
         Values:         <string>

         Alphanumeric tag that identifies the 9pfs share. The client needs
         to know the tag to be able to mount it.

    security-model
         Values:         "none"

         *none*: files are stored using the same credentials as they are
         created on the guest (no user ownership squash or remap).
         Only "none" is supported in this version of the protocol.

### Frontend XenBus Nodes

    version
         Values:         <string>

         Protocol version, chosen among the ones supported by the backend
         (see **versions** under [Backend XenBus Nodes]). Currently the
         value must be "1".

    num-rings
         Values:         <uint32_t>

         Number of rings. It must be less than or equal to **max-rings**.

    event-channel-<num> (event-channel-0, event-channel-1, etc)
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal activity
         in the ring buffer. One for each ring.

    ring-ref<num> (ring-ref0, ring-ref1, etc)
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to
         map the page with the information to set up a shared ring. One
         for each ring.

### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                     |
    - Publish transport parameters                      |
      that will be in effect during                     V
      this connection.                          XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page per
ring, used to set up the ring, and an event channel per ring, used to
send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind evtchns
                                                  |
                                                  |
                                                  V
                                         XenbusStateClosing

    - Unbind evtchns
    - Free rings
    - Free data structures
               |
               |
               V
      XenbusStateClosed

                                       - Free remaining data structures
                                                   |
                                                   |
                                                   V
                                          XenbusStateClosed


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
        XEN_9PFS_RING_IDX in_cons, in_prod;
        uint8_t pad[56];
        XEN_9PFS_RING_IDX out_cons, out_prod;
        uint8_t pad[56];

        uint32_t ring_order;
        /* this is an array of (1 << ring_order) elements */
        grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The number of
  elements must be a power of two.
- **out** is an array used as circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.
- **in_cons**, **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has been written by the backend to **in** and how much has
  already been consumed by the frontend. **in_prod** is increased by the
  backend, after writing data to **in**. **in_cons** is increased by the
  frontend, after reading data from **in**.
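To make the split described above concrete, the following sketch shows how
the contiguously mapped data pages are divided into the two halves. A
`PAGE_SHIFT` of 12 (4KB machine pages) is assumed, and the names are
illustrative, not part of the protocol.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12   /* assuming 4KB machine pages */

    struct data_area {
        char  *in;          /* first half:  backend -> frontend responses */
        char  *out;         /* second half: frontend -> backend requests  */
        size_t ring_size;   /* bytes in each half                         */
        size_t nr_grants;   /* number of entries expected in ref[]        */
    };

    /* 'mapped' is the contiguous virtual mapping of the (1 << ring_order)
     * data pages referenced by ref[].  Illustrative helper only. */
    static struct data_area layout_data_area(char *mapped, uint32_t ring_order)
    {
        struct data_area d;

        d.nr_grants = (size_t)1 << ring_order;
        d.ring_size = (d.nr_grants << PAGE_SHIFT) / 2;
        d.in  = mapped;                  /* first half of the pages  */
        d.out = mapped + d.ring_size;    /* second half of the pages */
        return d;
    }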
The binary layout of `struct xen_9pfs_intf` follows (note that, because of
the two 56-byte padding fields, **ring_order** lives at offset 128 and the
grant reference array starts at offset 132):

    0         4         8            64        68        72          128      132
    +---------+---------+-----//-----+---------+---------+----//-----+--------+
    | in_cons | in_prod |  padding   |out_cons |out_prod |  padding  |ring_ord|
    +---------+---------+-----//-----+---------+---------+----//-----+--------+

    132       136       140      4092      4096
    +---------+---------+----//---+---------+
    |  ref[0] |  ref[1] |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B.** For one page, N is maximum 991 ((4096-132)/4), but given that N
needs to be a power of two, actually max N is 512. As 512 == (1 << 9),
the maximum possible max-ring-page-order value is 9.

The binary layout of the ring buffers follows:

    0                  ((1<<ring_order)<<PAGE_SHIFT)/2    ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |            out             |
    +------------//-------------+------------//-------------+

## Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. This protocol does not, because it
actually comes with two rings: the **in** ring and the **out** ring.
Each of them is mono-directional, and there is no static request size:
the producer writes opaque data to the ring. In [ring.h], by contrast,
the two directions are combined in one ring, and the request size is
static and well-known. In this protocol:

    in  -> backend to frontend only
    out -> frontend to backend only

In the case of the **in** ring, the frontend is the consumer and the
backend is the producer. Everything is the same, but mirrored, for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

The consumer, on the other hand, always requires notifications, unless
it is already actively reading from the ring. The producer can figure
that out, without any additional fields in the protocol, by comparing
the indexes at the beginning and the end of its write function. This is
similar to what [ring.h] does.
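The mono-directional design keeps the bookkeeping trivial. The helpers below
are only a sketch, written under the assumption that the shared indices are
free-running counters that are masked only when used to address the array
(which is what the `masked_*` naming in the next section suggests); they are
not part of the protocol.

    #include <stdint.h>

    typedef uint32_t XEN_9PFS_RING_IDX;

    /* Bytes the consumer of one mono-directional ring may still read. */
    static inline XEN_9PFS_RING_IDX ring_queued_bytes(XEN_9PFS_RING_IDX prod,
                                                      XEN_9PFS_RING_IDX cons)
    {
        return prod - cons;
    }

    /* Bytes the producer of one mono-directional ring may still write.
     * ring_size is ((1 << ring_order) << PAGE_SHIFT) / 2 for that ring. */
    static inline XEN_9PFS_RING_IDX ring_free_bytes(XEN_9PFS_RING_IDX prod,
                                                    XEN_9PFS_RING_IDX cons,
                                                    XEN_9PFS_RING_IDX ring_size)
    {
        return ring_size - (prod - cons);
    }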
## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                                 sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                                               sizeof(array)
    +-----------------------------------+
    |  free   |   to consume   |  free  |
    +-----------------------------------+
              ^                ^
              cons             prod

The following functions are provided to read from and write to an array,
where `XEN_9PFS_RING_SIZE` is the size in bytes of the array, i.e.
`((1<<ring_order)<<PAGE_SHIFT)/2`:

    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    static inline void xen_9pfs_read(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *h, size_t len) {
        if (*masked_cons < *masked_prod) {
            memcpy(h, buf + *masked_cons, len);
        } else {
            if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
                /* the data wraps around the end of the array */
                memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
                memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf,
                       len - (XEN_9PFS_RING_SIZE - *masked_cons));
            } else {
                memcpy(h, buf + *masked_cons, len);
            }
        }
        *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    static inline void xen_9pfs_write(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *opaque, size_t len) {
        if (*masked_prod < *masked_cons) {
            memcpy(buf + *masked_prod, opaque, len);
        } else {
            if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
                /* the data wraps around the end of the array */
                memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
                memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod),
                       len - (XEN_9PFS_RING_SIZE - *masked_prod));
            } else {
                memcpy(buf + *masked_prod, opaque, len);
            }
        }
        *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel

The producer takes care of writing only as many bytes as available in the
buffer up to *cons*. The consumer takes care of reading only as many bytes as
available in the buffer up to *prod*.


## Request/Response Workflow

The client chooses one of the available rings, then it sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.
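Putting the workflow together, here is a hedged sketch of the client-side
send path for one request. `xen_9pfs_write()` and `MASK_XEN_9PFS_IDX()` are
the helpers from [Ring Usage] and are assumed to be visible here;
`struct ring_handle`, `ring_notify()` and the fixed `XEN_9PFS_RING_SIZE`
value are illustrative glue, not part of the protocol, and the shared
indices are treated as free-running counters.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t XEN_9PFS_RING_IDX;

    /* Example value only: a ring of order 6, i.e. ((1 << 6) << 12) / 2. */
    #define XEN_9PFS_RING_SIZE      (1 << 17)
    #define MASK_XEN_9PFS_IDX(idx)  ((idx) & (XEN_9PFS_RING_SIZE - 1))

    /* The helper shown in [Ring Usage], assumed visible here. */
    extern void xen_9pfs_write(char *buf, XEN_9PFS_RING_IDX *masked_prod,
                               XEN_9PFS_RING_IDX *masked_cons,
                               uint8_t *opaque, size_t len);

    /* Illustrative glue, not part of the protocol. */
    struct ring_handle {
        XEN_9PFS_RING_IDX *out_cons, *out_prod; /* live in the shared intf page */
        char *out;                              /* second half of the data area */
    };
    extern void ring_notify(struct ring_handle *r); /* kick this ring's evtchn */

    /* Send one fully marshalled 9pfs request: req points to the 9pfs header
     * plus payload, and len equals the header's *size* field. */
    static int send_request(struct ring_handle *r, uint8_t *req, uint32_t len)
    {
        XEN_9PFS_RING_IDX cons = *r->out_cons;
        XEN_9PFS_RING_IDX prod = *r->out_prod;
        XEN_9PFS_RING_IDX masked_prod, masked_cons;

        __sync_synchronize();   /* general barrier after reading both indices */

        /* A real frontend would also verify prod against its local copy. */
        if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
            return -1;          /* no room up to cons: retry after an event */

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);
        xen_9pfs_write(r->out, &masked_prod, &masked_cons, req, len);

        __sync_synchronize();   /* write barrier before publishing prod */
        *r->out_prod = prod + len;

        ring_notify(r);         /* notify the backend via the event channel */
        return 0;
    }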
[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD