[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v3 0/1] netif: staging grants for I/O requests
> -----Original Message----- > From: Joao Martins [mailto:joao.m.martins@xxxxxxxxxx] > Sent: 13 September 2017 19:11 > To: Xen-devel <xen-devel@xxxxxxxxxxxxx> > Cc: Wei Liu <wei.liu2@xxxxxxxxxx>; Paul Durrant <Paul.Durrant@xxxxxxxxxx>; > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Joao Martins > <joao.m.martins@xxxxxxxxxx> > Subject: [PATCH v3 0/1] netif: staging grants for I/O requests > > Hey, > > This is v3 taking into consideration all comments received from v2 (changelog > in the first patch). The specification is right after the diffstat. > > Reference implementation also here (on top of net-next): > > https://github.com/jpemartins/linux.git xen-net-stg-gnts-v3 > > Although I am satisfied with how things are being done above, I wanted > to request some advise/input on whether there could be a simpler way of > achieving the same. Specifically because these control messages > adds up significant code on the frontend to pregrant, and in other cases the > control message might be limitative if frontend tries to keep a dinamically > changed buffer pool in different queues. *Maybe* it could be simpler to > adjust > the TX/RX ring ABI in a compatible matter (Disclaimer: I haven't implemented > this just yet): But the whole point of pre-granting is to separate the grant/ungrant operations from the rx/tx operations, right? So, why would the extra control messages really be an overhead? Paul > > 1) Add a flag `NETTXF_persist` to `netif_tx_request` > > 2) Replace RX `netif_rx_request` padding with `flags` and adda > `NETRXF_persist` with the same purpose as 1). > > 3) This remains backwards compatible as backends not supporting this > wouldn't > act on this new flag, and given we replace padding with flags means > unsupported > backends won't simply be aware of RX *request* `flags`. This is under the > assumption that there's no requirement that padding must be zero > throughout > the netif.h specification. > > 4) Keeping `GET_GREF_MAPPING_SIZE` ctrl msg for frontend to do better > decisions? > > 5) Semantics are simple: slots with flags marked as NET{RX,TX}F_persist > represent a permanent mapped ref and therefore mapped if non-existent. > *future* omissions of the flag signals the mapping should be removed. > > This would allow guests which reuse buffers (apparently Windows :)) to scale > better as unmaps would be done on the individual queue context plus > allowing > frontend to remain a more simple in the management of "permanent" > buffers. The > drawback seems to be the added complexity (and somewhat racy behaviour) > on the > datapath, to map or unmap accordingly. Because now we would have to > differentiate between long vs short lived map/unmap ops in addition to > looking > up on our mappings table. Thoughts, or perhaps people may prefer the one > already described in the series? > > Cheers, > > Joao Martins (1): > public/io/netif.h: add gref mapping control messages > > xen/include/public/io/netif.h | 115 > ++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 115 insertions(+) > --- > % Staging grants for network I/O requests > % Joao Martins <<joao.m.martins@xxxxxxxxxx>> > % Revision 3 > > \clearpage > > -------------------------------------------------------------------- > Architecture(s): Any > -------------------------------------------------------------------- > > # Background and Motivation > > At the Xen hackaton '16 networking session, we spoke about having a > permanently > mapped region to describe header/linear region of packet buffers. This > document > outlines the proposal covering motivation of this and applicability for other > use-cases alongside the necessary changes. This proposal is an RFC and also > includes alternative solutions. > > The motivation of this work is to eliminate grant ops for packet I/O intensive > workloads such as those observed with smaller requests size (i.e. <= 256 > bytes > or <= MTU). Currently on Xen, only bulk transfer (e.g. 32K..64K packets) are > the > only ones performing really good (up to 80 Gbit/s in few CPUs), usually > backing end-hosts and server appliances. Anything that involves higher > packet > rates (<= 1500 MTU) or without sg, performs badly almost like a 1 Gbit/s > throughput. > > # Proposal > > The proposal is to leverage the already implicit copy from and to packet > linear > data on netfront and netback, to be done instead from a permanently > mapped > region. In some (physical) NICs this is known as header/data split. > > Specifically some workloads (e.g. NFV) it would provide a big increase in > throughput when we switch to (zero)copying in the backend/frontend, > instead of > the grant hypercalls. Thus this extension aims at futureproofing the netif > protocol by adding the possibility of guests setting up a list of grants that > are set up at device creation and revoked at device freeing - without taking > too much grant entries in account for the general case (i.e. to cover only the > header region <= 256 bytes, 16 grants per ring) while configurable by kernel > when one wants to resort to a copy-based as opposed to grant copy/map. > > \clearpage > > # General Operation > > Here we describe how netback and netfront general operate, and where the > proposed > solution will fit. The security mechanism currently involves grants references > which in essence are round-robin recycled 'tickets' stamped with the GPFNs, > permission attributes, and the authorized domain: > > (This is an in-memory view of struct grant_entry_v1): > > 0 1 2 3 4 5 6 7 octet > +------------+-----------+------------------------+ > | flags | domain id | frame | > +------------+-----------+------------------------+ > > Where there are N grant entries in a grant table, for example: > > @0: > +------------+-----------+------------------------+ > | rw | 0 | 0xABCDEF | > +------------+-----------+------------------------+ > | rw | 0 | 0xFA124 | > +------------+-----------+------------------------+ > | ro | 1 | 0xBEEF | > +------------+-----------+------------------------+ > > ..... > @N: > +------------+-----------+------------------------+ > | rw | 0 | 0x9923A | > +------------+-----------+------------------------+ > > Each entry consumes 8 bytes, therefore 512 entries can fit on one page. > The `gnttab_max_frames` which is a default of 32 pages. Hence 16,384 > grants. The ParaVirtualized (PV) drivers will use the grant reference (index > in the grant table - 0 .. N) in their command ring. > > \clearpage > > ## Guest Transmit > > The view of the shared transmit ring is the following: > > 0 1 2 3 4 5 6 7 octet > +------------------------+------------------------+ > | req_prod | req_event | > +------------------------+------------------------+ > | rsp_prod | rsp_event | > +------------------------+------------------------+ > | pvt | pad[44] | > +------------------------+ | > | .... | [64bytes] > +------------------------+------------------------+-\ > | gref | offset | flags | | > +------------+-----------+------------------------+ +-'struct > | id | size | id | status | | > netif_tx_sring_entry' > +-------------------------------------------------+-/ > |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N > +-------------------------------------------------+ > > Each entry consumes 16 octets therefore 256 entries can fit on one > page.`struct > netif_tx_sring_entry` includes both `struct netif_tx_request` (first 12 > octets) > and `struct netif_tx_response` (last 4 octets). Additionally a `struct > netif_extra_info` may overlay the request in which case the format is: > > +------------------------+------------------------+-\ > | type |flags| type specific data (gso, hash, etc)| | > +------------+-----------+------------------------+ +-'struct > | padding for tx | unused | | netif_extra_info' > +-------------------------------------------------+-/ > > In essence the transmission of a packet in a from frontend to the backend > network stack goes as following: > > **Frontend** > > 1) Calculate how many slots are needed for transmitting the packet. > Fail if there are aren't enough slots. > > [ Calculation needs to estimate slots taking into account 4k page boundary ] > > 2) Make first request for the packet. > The first request contains the whole packet size, checksum info, > flag whether it contains extra metadata, and if following slots contain > more data. > > 3) Put grant in the `gref` field of the tx slot. > > 4) Set extra info if packet requires special metadata (e.g. GSO size) > > 5) If there's still data to be granted set flag `NETTXF_more_data` in > request `flags`. > > 6) Grant remaining packet pages one per slot. (grant boundary is 4k) > > 7) Fill resultant grefs in the slots setting `NETTXF_more_data` for the N-1. > > 8) Fill the total packet size in the first request. > > 9) Set checksum info of the packet (if the chksum offload if supported) > > 10) Update the request producer index (`req_prod`) > > 11) Check whether backend needs a notification > > 11.1) Perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__ > depending on the guest type. > > **Backend** > > 12) Backend gets an interrupt and runs its interrupt service routine. > > 13) Backend checks if there are unconsumed requests > > 14) Backend consume a request from the ring > > 15) Process extra info (e.g. if GSO info was set) > > 16) Counts all requests for this packet to be processed (while > `NETTXF_more_data` is set) and performs a few validation tests: > > 16.1) Fail transmission if total packet size is smaller than Ethernet > minimum allowed; > > Failing transmission means filling `id` of the request and > `status` of `NETIF_RSP_ERR` of `struct netif_tx_response`; > update rsp_prod and finally notify frontend (through `EVTCHNOP_send`). > > 16.2) Fail transmission if one of the slots (size + offset) crosses the page > boundary > > 16.3) Fail transmission if number of slots are bigger than spec defined > (18 slots max in netif.h) > > 17) Allocate packet metadata > > [ *Linux specific*: This structure emcompasses a linear data region which > generally accomodates the protocol header and such. Netback allocates up > to 128 > bytes for that. ] > > 18) *Linux specific*: Setup up a `GNTTABOP_copy` to copy up to 128 bytes to > this small > region (linear part of the skb) *only* from the first slot. > > 19) Setup GNTTABOP operations to copy/map the packet > > 20) Perform the `GNTTABOP_copy` (grant copy) and/or > `GNTTABOP_map_grant_ref` > hypercalls. > > [ *Linux-specific*: does a copy for the linear region (<=128 bytes) and maps > the > remaining slots as frags for the rest of the data ] > > 21) Check if the grant operations were successful and fail transmission if > any of the resultant operation `status` were different than `GNTST_okay`. > > 21.1) If it's a grant copying backend, therefore produce responses for all the > the copied grants like in 16.1). Only difference is that status is > `NETIF_RSP_OKAY`. > > 21.2) Update the response producer index (`rsp_prod`) > > 22) Set up gso info requested by frontend [optional] > > 23) Set frontend provided checksum info > > 24) *Linux-specific*: Register destructor callback when packet pages are > freed. > > 25) Call into to the network stack. > > 26) Update `req_event` to `request consumer index + 1` to receive a > notification > on the first produced request from frontend. > [optional, if backend is polling the ring and never sleeps] > > 27) *Linux-specific*: Packet destructor callback is called. > > 27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet > pages. > > 27.2) Once done, perform `GNTTABOP_unmap_grant_ref` hypercall. > Underlying > this hypercall a TLB flush of all backend vCPUS is done. > > 27.3) Produce Tx response like step 21.1) and 21.2) > > [*Linux-specific*: It contains a thread that is woken for this purpose. And > it batch these unmap operations. The callback just queues another unmap.] > > 27.4) Check whether frontend requested a notification > > 27.4.1) If so, Perform hypercall `EVTCHNOP_send` which might mean a > __VMEXIT__ > depending on the guest type. > > **Frontend** > > 28) Transmit interrupt is raised which signals the packet transmission > completion. > > 29) Transmit completion routine checks for unconsumed responses > > 30) Processes the responses and revokes the grants provided. > > 31) Updates `rsp_cons` (request consumer index) > > This proposal aims at removing steps 19) 20) 21) by using grefs previously > mapped at guest request. Guest decides how to distribute or use these > premapped > grefs with either linear or full packet. This allows us to replace step 27) > (the unmap) preventing the TLB flush. > > Note that a grant copy does the following (in pseudo code): > > rcu_lock(src_domain); > rcu_lock(dst_domain); > > for (op = gntcopy[0]; op < nr_ops; op++) { > src_frame = __acquire_grant_for_copy(src_domain, > <op.src.gref>); > ^ here implies a holding a potential contended per CPU lock > on the > remote grant table. > src_vaddr = map_domain_page(src_frame); > > dst_frame = __get_paged_frame(dst_domain, > <op.dst.mfn>) > dst_vaddr = map_domain_page(dst_frame); > > memcpy(dst_vaddr + <op.dst.offset>, > src_frame + <op.src.offset>, > <op.size>); > > unmap_domain_page(src_frame); > unmap_domain_page(dst_frame); > > rcu_unlock(src_domain); > rcu_unlock(dst_domain); > > Linux netback implementation copies the first 128 bytes into its network > buffer > linear region. Hence on the case of the first region it is replaced by a > memcpy > on backend, as opposed to a grant copy. > > \clearpage > > ## Guest Receive > > The view of the shared receive ring is the following: > > 0 1 2 3 4 5 6 7 octet > +------------------------+------------------------+ > | req_prod | req_event | > +------------------------+------------------------+ > | rsp_prod | rsp_event | > +------------------------+------------------------+ > | pvt | pad[44] | > +------------------------+ | > | .... | [64bytes] > +------------------------+------------------------+ > | id | pad | gref | ->'struct > netif_rx_request' > +------------+-----------+------------------------+ > | id | offset | flags | status | ->'struct > netif_rx_response' > +-------------------------------------------------+ > |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N > +-------------------------------------------------+ > > > Each entry in the ring occupies 16 octets which means a page fits 256 entries. > Additionally a `struct netif_extra_info` may overlay the rx request in which > case the format is: > > +------------------------+------------------------+ > | type |flags| type specific data (gso, hash, etc)| ->'struct > netif_extra_info' > +------------+-----------+------------------------+ > > Notice the lack of padding, and that is because it's not used on Rx, as Rx > request boundary is 8 octets. > > In essence the steps for receiving of a packet in a Linux frontend is as > from backend to frontend network stack: > > **Backend** > > 1) Backend transmit function starts > > [*Linux-specific*: It means we take a packet and add to an internal queue > (protected by a lock) whereas a separate thread takes it from that queue > and > process the actual like the steps below. This thread has the purpose of > aggregating as much copies as possible.] > > 2) Checks if there are enough rx ring slots that can accomodate the packet. > > 3) Gets a request from the ring for the first data slot and fetches the `gref` > from it. > > 4) Create grant copy op from packet page to `gref`. > > [ It's up to the backend to choose how it fills this data. E.g. backend may > choose to merge as much as data from different pages into this single gref, > similar to mergeable rx buffers in vhost. ] > > 5) Sets up flags/checksum info on first request. > > 6) Gets a response from the ring for this data slot. > > 7) Prefill expected response ring with the request `id` and slot size. > > 8) Update the request consumer index (`req_cons`) > > 9) Gets a request from the ring for the first extra info [optional] > > 10) Sets up extra info (e.g. GSO descriptor) [optional] repeat step 8). > > 11) Repeat steps 3 through 8 for all packet pages and set > `NETRXF_more_data` > in the N-1 slot. > > 12) Perform the `GNTTABOP_copy` hypercall. > > 13) Check if the grant operations status was incorrect and if so set `status` > of the `struct netif_rx_response` field to NETIF_RSP_ERR. > > 14) Update the response producer index (`rsp_prod`) > > **Frontend** > > 15) Frontend gets an interrupt and runs its interrupt service routine > > 16) Checks if there's unconsumed responses > > 17) Consumes a response from the ring (first response for a packet) > > 18) Revoke the `gref` in the response > > 19) Consumes extra info response [optional] > > 20) While N-1 requests has `NETRXF_more_data`, then fetch each of > responses > and revoke the designated `gref`. > > 21) Update the response consumer index (`rsp_cons`) > > 22) *Linux-specific*: Copy (from first slot gref) up to 256 bytes to the > linear > region of the packet metadata structure (skb). The rest of the pages > processed in the responses are then added as frags. > > 23) Set checksum info based on first response flags. > > 24) Call packet into the network stack. > > 25) Allocate new pages and any necessary packet metadata strutures to new > requests. These requests will then be used in step 1) and so forth. > > 26) Update the request producer index (`req_prod`) > > 27) Check whether backend needs notification: > > 27.1) If so, Perform hypercall `EVTCHNOP_send` which might mean a > __VMEXIT__ > depending on the guest type. > > 28) Update `rsp_event` to `response consumer index + 1` such that frontend > receive a notification on the first newly produced response. > [optional, if frontend is polling the ring and never sleeps] > > This proposal aims at replacing step 4), 12) and 22) with memcpy if the > grefs on the Rx ring were requested to be mapped by the guest. Frontend > may use > strategies to allow fast recycling of grants for replinishing the ring, > hence letting Domain-0 replace the grant copies with memcpy instead, > which is > faster. > > Depending on the implementation, it would mean that we no longer > would need to aggregate as much as grant ops as possible (step 1) and could > transmit the packet on the transmit function (e.g. Linux ```ndo_start_xmit```) > as previously proposed > here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015- > 05/msg01504.html)\]. > This would heavily improve efficiency specifially for smaller packets. Which > in > return would decrease RTT, having data being acknoledged much quicker. > > \clearpage > > # Proposed Extension > > The idea is to allow guest more controllability on how its grants are mapped > or > not. Currently there's no control over it for frontends or backends, and > latter > cannot make assumptions on the mapping transmit or receive grants, hence > we > need frontend to take initiative into managing its own mapping of grants. > Guests may then opportunistically recycle these grants (e.g. Linux) and avoid > resorting to copies which come when using a fixed amount of buffers. Other > frameworks (e.g. XDP, netmap, DPDK) use a fixed set of buffers which also > makes the case for this extension. > > ## Terminology > > `staging grants` is a term used in this document to refer to the whole concept > of having a set of grants permanently mapped with backend, containing data > staging until completion. Therefore the term should not be confused with a > new > kind of grants on the hypervisor. > > ## Control Ring Messages > > ### `XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE` > > This message is sent by the frontend to fetch the number of grefs that can > be kept mapped in the backend. It only receives the queue as argument, and > data representing amount of free entries in the mapping table. > > ### `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING` > > This is sent by the frontend to map a list of grant references in the backend. > It receives the queue index, the grant containing the list (offset is > implicitly zero) and how many entries in the list. Each entry in this list > has the following format: > > 0 1 2 3 4 5 6 7 octet > +-----+-----+-----+-----+-----+-----+-----+-----+ > | grant ref | flags | status | > +-----+-----+-----+-----+-----+-----+-----+-----+ > > grant ref: grant reference > flags: flags describing the control operation > status: XEN_NETIF_CTRL_STATUS_* > > The list can have a maximum of 512 entries to be mapped at once. > > ### `XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING` > > This is sent by the frontend for backend to unmap a list of grant > references. The arguments are the same as > `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`, > including the format of the list. However entries to be specified on the list > can only refer to the ones previously added with > `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING` and additionally these can > not be > inflight grant references in ring at the time the user has requested to unmap > them. > > ## Datapath Changes > > Control ring is only available after backend state is `XenbusConnected` > therefore only on this state change can the frontend query the total amount > of > maps it can keep. It then grants N entries per queue on both TX and RX ring > which will create the underying backend gref -> page association (e.g. stored > in hash table). Frontend may wish to recycle these pregranted buffers or > choose > a copy approach to replace granting. > > On steps 19) of Guest Transmit and 3) of Guest Receive, data gref is first > looked up in this table and uses the underlying page if it already exists a > mapping. On the successfull cases, steps 20) 21) and 27) of Guest Transmit > are > skipped, with 19) being replaced with a memcpy of up to 128 bytes. On Guest > Receive, 4) 12) and 22) are replaced with memcpy instead of a grant copy. > > Failing to obtain the total number of mappings > (`XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`) means the guest falls > back to the > normal usage without pre granting buffers. > > \clearpage > > # Wire Performance > > This section is a glossary meant to keep in mind numbers on the wire. > > The minimum size that can fit in a single packet with size N is calculated as: > > Packet = Ethernet Header (14) + Protocol Data Unit (46 - 1500) = 60 bytes > > In the wire it's a bit more: > > Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap > (12) = 84 bytes > > For given Link-speed in Bits/sec and Packet size, real packet rate is > calculated as: > > Rate = Link-speed / ((Preamble + Packet + CRC + Interframe gap) * 8) > > Numbers to keep in mind (packet size excludes PHY layer, though packet > rates > disclosed by vendors take those into account, since it's what goes on the > wire): > > | Packet + CRC (bytes) | 10 Gbit/s | 40 Gbit/s | 100 Gbit/s | > |------------------------|:----------:|:----------:|:------------:| > | 64 | 14.88 Mpps| 59.52 Mpps| 148.80 Mpps | > | 128 | 8.44 Mpps| 33.78 Mpps| 84.46 Mpps | > | 256 | 4.52 Mpps| 18.11 Mpps| 45.29 Mpps | > | 1500 | 822 Kpps| 3.28 Mpps| 8.22 Mpps | > | 65535 | ~19 Kpps| 76.27 Kpps| 190.68 Kpps | > > Caption: Mpps (Million packets per second) ; Kpps (Kilo packets per second) > > \clearpage > > # Performance > > Numbers between a Linux v4.11 guest and another host connected by a 100 > Gbit/s > NIC on a E5-2630 v4 2.2 GHz host to give an idea on the performance benefits > of > this extension. Please refer to this presentation[7] for a better overview of > the results. > > ( Numbers include protocol overhead ) > > **bulk transfer (Guest TX/RX)** > > Queues Before (Gbit/s) After (Gbit/s) > ------ ------------- ------------ > 1queue 17244/6000 38189/28108 > 2queue 24023/9416 54783/40624 > 3queue 29148/17196 85777/54118 > 4queue 39782/18502 99530/46859 > > ( Guest -> Dom0 ) > > **Packet I/O (Guest TX/RX) in UDP 64b** > > Queues Before (Mpps) After (Mpps) > ------ ------------- ------------ > 1queue 0.684/0.439 2.49/2.96 > 2queue 0.953/0.755 4.74/5.07 > 4queue 1.890/1.390 8.80/9.92 > > \clearpage > > # References > > [0] http://lists.xenproject.org/archives/html/xen-devel/2015- > 05/msg01504.html > > [1] > https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap > _mem2.c#L362 > > [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1 > > [3] https://github.com/iovisor/bpf- > docs/blob/master/Express_Data_Path.pdf > > [4] http://prototype- > kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.htm > l#write-access-to-packet-data > > [5] http://lxr.free- > electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L207 > 3 > > [6] http://lxr.free- > electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52 > > [7] > https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/e6/ToGr > antOrNotToGrant-XDDS2017_v3.pdf > > # History > > A table of changes to the document, in chronological order. > > ------------------------------------------------------------------------ > Date Revision Version Notes > ---------- -------- -------- ------------------------------------------- > 2016-12-14 1 Xen 4.9 Initial version for RFC > > 2017-09-01 2 Xen 4.10 Rework to use control ring > > Trim down the specification > > Added some performance numbers from the > presentation > > 2017-09-13 3 Xen 4.10 Addressed changes from Paul Durrant > > ------------------------------------------------------------------------ _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |