Xen project Mailing List

Re: [Xen-devel] [RFC] netif: staging grants for requests

On Fri, 6 Jan 2017, Joao Martins wrote:
> >>> I don't think it makes sense: the rest of the page would still be
> >>> accessible from the backend, which could compromise its content. I would
> >>> share a list of pages that can be mapped contiguously to create a data
> >>> ring.
> >> Note that the permissions of the TX grefs are readonly, whereas the rx ring is
> >> read/write, so I inherit this same property here for the data grefs too. I
> >> wouldn't mind having a grant per packet;
> >
> > That's not what I am suggesting.
> >
> > Let's suppose that the frontend chooses a slot size of 128 and that we
> > have 256 slots in total. The frontend shares 8 pages to the backend:
> >
> >     128*256 = 32768 = 8*4096
> >
> > these 8 pages are going to be used for all data slots. So slot number 34
> > (if we count from 1) is the second slot of the second page (32 data
> > slots per page).
> Exactly and that's what I am proposing :) The text appears to be a bit ambiguous
> and causing a little confusion, my apologies.

All right, I really misunderstood the document. Please clarify it in the
next version :-)

> >> But then nothing currently stops a backend to write to these pages
> >> irrespective of the approach (with or without staging grants) until
> >> they are copied/revoked; the frontend would always copy from the
> >> staging region into a guest-only visible region.
> >>
> >>>> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
> >>>> respectively, and it is set up at device creation and removed at device
> >>>> teardown, same as the command ring entries. This way we keep up with ring size
> >>>> changes as it it expected to be in the command ring. A hypothetical multi-page
> >>>                ^ repetition
> >> OK, will remove it.
> >>
> >>>> command ring would increase number of slots and thus this data list would grow
> >>>> accordingly.
> >>>> List is terminated by an entry which ```gref``` field is 0, having
> >>>> ignored the other fields of this specific entry.
> >>>
> >>> Could you please explain a bit more how the command ring increases the
> >>> number of slots and why we need to allocate twice the number of required
> >>> slots at the beginning?
> >> "This list has twice as max slots as" is meant to say that the list has twice as
> >> capacity as the command ring. The list entry is 8 octets, so one page fits 512
> >> entries. The data ref is only filled with up to the command ring size which
> >> means only half of it would be filled currently. I will adjust this sentence
> >> here to make it more clear.
> >>
> >> multi-page rings aren't supported on netback/netfront (yet) but as an example we
> >> would have one `tx-data-ref0` to represent the command ring slots of
> >> `tx-ring-ref0` and `tx-ring-ref1`. I noticed that to not waste grants, I need to
> >> assume the ```data-ref``` is full with the lack of a terminating entry; otherwise
> >> I would end up needing `tx-data-ref1` to have a terminating entry. I will add a
> >> sentence mentioning this.
> >>
> >>>> ## Datapath Changes
> >>>>
> >>>> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
> >>>> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
> >>>> the `struct netif_[tx|rx]_request`) has the same data up to `size`. In other
> >>>> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
> >>>> netback should ignore up to `size` of the `gref` when processing as the
> >>>> `data-ring` has the contents of it.
> >>>
> >>> My lack of netfront/netback knowledge might show, but I am not
> >>> following. What do gref, size and idx represent in this context? Please
> >>> explain the new workflow in more details.
> >> [ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
> >> there is no ring. It should instead be `data list` buffer identified by `struct
> >> netif_[tx|rx]_request` `id` field. ]
> >>
> >> The new workflow is not too different from the old one: (on TX) we *first*
> >> memcpy the packet to staging grants region (or `data gref` like how I mention in
> >> this doc) of up to the negotiated `{tx,rx}-data-len` (or `size` specified in the
> >> command ring if smaller). The `data gref` to be used is identified by the `id`
> >> field in netif_[tx|rx]_request struct. The rest will proceed as before, that is
> >> granting the packet (within XEN_PAGE_SIZE boundaries) and setting `gref` and
> >> `offset` accordingly in the rest of the command ring slots. Let me improve this
> >> paragraph to make this more clear.
> >>
> >> The other difference (see Zerocopy section) is that if frontend sets the flag
> >> NETTXF_staging_buffer, then the `gref` field in netif_[tx|rx]_request struct
> >> will have a value of the `data gref` id (the id field saying "frame identifier"
> >> that you asked earlier in the doc). This is to allow a frontend to specify an RX
> >> `data gref` to be used in the TX ring without involving any additional copy. I
> >> haven't PoC-ed this zerocopy part yet, as it only covers two specific scenarios
> >> (that is guest XDP_TX or on a DPDK frontend).
> >>
> >>>> Values bigger than the 4096 page/grant boundary only have special meaning for
> >>>> backend being how much it is required to be copied/pulled across the whole
> >>>> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
> >>>> vs 4096 will have the same data list size and the latter value would lead to
> >>>> only copy/pull one gref in the whole packet, whereas the former will be a
> >>>> copy-only interface for all slots.
> >>>>
> >>>> ## Buffer Identification and Flags
> >>>>
> >>>> The data list ids must start from 0 and are global and contiguous (across both
> >>>> lists). Data slot is identified by ring slot ```id``` field. Resultant data
> >>>> gref id to be used in RX data list is computed by subtract of `struct
> >>>> netif_[tx|rx]_request` ```id``` from total amount of tx data grefs. Example of
> >>>> the lists layout below:
> >>>
> >>> What is the "resultant data gref id"? What is the "RX data list"?
> >>> Please explain in more details.
> >> "Resultant data gref id" corresponds to the case where we set
> >> NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
> >> the `id` in the data list entry (see below diagram). "RX data list" is the list
> >> described with `rx-data-ref`. I should have put that as separate paragraph as
> >> this is making things more confusing.
> >>
> >>>> ```
> >>>> [tx-data-ref-0, tx-data-len=256]
> >>>> { .id = 0, gref = 512, .offset = 0x0 }
> >>>> { .id = 1, gref = 512, .offset = 0x100 }
> >>>> { .id = 2, gref = 512, .offset = 0x200 }
> >>>> ...
> >>>> { .id = 256, gref = 0, .offset = 0x0 }
> >>>>
> >>>> [rx-data-ref-0, rx-data-len=4096]
> >>>> { .id = 256, gref = 529, .offset = 0x0 }
> >>>> { .id = 257, gref = 530, .offset = 0x0 }
> >>>> { .id = 258, gref = 531, .offset = 0x0 }
> >>>> ...
> >>>> { .id = 512, gref = 0, .offset = 0x0 }
> >>>> ```
> >>>>
> >>>> Permissions of RX data grefs are read-write whereas TX data grefs is
> >>>> read-only.
> >>>>
> >>>> ## Zerocopy
> >>>>
> >>>> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
> >>>> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
> >>>> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
> >>>> field such that `gref` field instead designates the `id` of a data grefs.
> >>>>
> >>>> This is only valid when packets are solely described by the staging grants for
> >>>> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
> >>>> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
> >>>> describing the packet payload.
> >>>>
> >>>> \clearpage
> >>>>
> >>>> ## Performance
> >>>>
> >>>> Numbers that give a rough idea on the performance benefits of this extension.
> >>>> These are Guest <-> Dom0 which test the communication between backend and
> >>>> frontend, excluding other bottlenecks in the datapath (the software switch).
> >>>>
> >>>> ```
> >>>> # grant copy
> >>>> Guest TX (1vcpu, 64b, UDP in pps): 1 506 170 pps
> >>>> Guest TX (4vcpu, 64b, UDP in pps): 4 988 563 pps
> >>>> Guest TX (1vcpu, 256b, UDP in pps): 1 295 001 pps
> >>>> Guest TX (4vcpu, 256b, UDP in pps): 4 249 211 pps
> >>>>
> >>>> # grant copy + grant map (see next subsection)
> >>>> Guest TX (1vcpu, 260b, UDP in pps): 577 782 pps
> >>>> Guest TX (4vcpu, 260b, UDP in pps): 1 218 273 pps
> >>>>
> >>>> # drop at the guest network stack
> >>>> Guest RX (1vcpu, 64b, UDP in pps): 1 549 630 pps
> >>>> Guest RX (4vcpu, 64b, UDP in pps): 2 870 947 pps
> >>>> ```
> >>>>
> >>>> With this extension:
> >>>> ```
> >>>> # memcpy
> >>>> data-len=256 TX (1vcpu, 64b, UDP in pps): 3 759 012 pps
> >>>> data-len=256 TX (4vcpu, 64b, UDP in pps): 12 416 436 pps
> >>>> data-len=256 TX (1vcpu, 256b, UDP in pps): 3 248 392 pps
> >>>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> >>>>
> >>>> # memcpy + grant map (see next subsection)
> >>>> data-len=256 TX (1vcpu, 260b, UDP in pps): 588 428 pps
> >>>> data-len=256 TX (4vcpu, 260b, UDP in pps): 1 668 044 pps
> >>>>
> >>>> # (drop at the guest network stack)
> >>>> data-len=256 RX (1vcpu, 64b, UDP in pps): 3 285 362 pps
> >>>> data-len=256 RX (4vcpu, 64b, UDP in pps): 11 761 847 pps
> >>>>
> >>>> # (drop with guest XDP_DROP prog)
> >>>> data-len=256 RX (1vcpu, 64b, UDP in pps): 9 466 591 pps
> >>>> data-len=256 RX (4vcpu, 64b, UDP in pps): 33 006 157 pps
> >>>> ```
> >>>
> >>> Very nice!
> >> :D
> >>
> >>>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> >>>> ```
> >>>> 24 KTps vs 28 KTps
> >>>> 39 KTps vs 50 KTps (with kernel busy poll)
> >>>> ```
> >>>>
> >>>> TCP Bulk transfer measurements aren't showing a representative increase on
> >>>> maximum throughput (sometimes ~10%), but rather less retransmissions and
> >>>> more stable. This is probably because of being having a slight decrease in rtt
> >>>> time (i.e. receiver acknowledging data quicker). Currently trying exploring
> >>>> other data list sizes and probably will have a better idea on the effects of
> >>>> this.
> >>>
> >>> This is strange. By TCP Bulk transfers, do you mean iperf?
> >> Yeap.
> >>
> >>> From my pvcalls experience, I would expect a great improvement there too.
> >> Notice that here we are only memcpying a small portion of the packet (256 bytes
> >> max, not all of it).
> >
> > Are we? Unless I am mistaken, the protocol doesn't have any limitations
> > on the amount of bytes we are memcpying. It is up to the backend, which
> > could theoretically support large amounts such as 64k and larger.
> The case I gave above was when the backend was only memcpying maximum 256 bytes
> (that is for the linear part of the skbs). But correct, the protocol doesn't
> have limitations if the backend allow a frontend to copy the whole packet, but
> only if the frontend selects it too tx-data-len|rx-data-len as 65536.

Sure. My point is that by changing the backend settings to increase the
amount of data copied per packet, you should be able to get very good TCP
bulk transfer performance with your current code. In fact, you should be
able to get close to PVCalls. If that doesn't happen, it's a problem.

> >> Past experience (in a old proposal I made) it also showed
> >> me an improvement like yours. I am a bit hesitant with memcpy all of way when
> >> other things are involved in the workload; but then I currently don't see many
> >> other alternatives to lessen the grants overhead. In the meantime I'll have more
> >> data points and have a clearer idea on the ratio of the improvement vs the
> >> compromise.
> >
> > I don't want to confuse you, but another alternative to consider would
> > be to use a circular data ring like in PVCalls instead of slots. But I
> > don't know how difficult it would be to implement something like that in
> > netfront/netback. The command ring could still be used to send packets,
> > but the data could be transferred over the data ring instead. In this
> > model we would have to handle the case where a slot is available on the
> > command ring, but we don't have enough room on the data ring to copy the
> > data. For example, we could fall back to grant ops.
> I thought of that option but I was a bit afraid for starters to introduce two
> data rings hence I kept things simple (despite text being a bit messy). The only
> reservation I had having data rings is that the backend wouldn't be able to hold
> on to the pages in a zerocopy fashion like it does now.

The advantage of that approach is that the amount of data memcpy'ed could
be dynamic per packet and potentially much larger than 256 bytes, while the
total amount of shared memory could be lower (because we don't have to
share all the slots all the time). If you are unable to get good TCP Bulk
transfer performance with the current approach, I would consider
revisiting this proposal and introducing a data ring buffer instead.
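
For comparison with the data ring idea, the fixed-slot layout from the
example near the top of this message (128-byte slots, 256 slots, 8 shared
pages) can be sketched in C roughly as below. This is only an illustration
of the arithmetic: `data_pages`, `slot_locate` and `struct slot_loc` are
hypothetical names, not part of netif or of the proposal, and the sketch
assumes the slot size divides the 4096-byte page evenly.

```c
#include <assert.h>

#define XEN_PAGE_SIZE 4096u

/* Hypothetical helper type: where a data slot lives in the shared pages. */
struct slot_loc {
    unsigned int page;    /* 0-based index into the list of shared pages */
    unsigned int offset;  /* byte offset of the slot inside that page */
};

/* Number of pages needed to back nr_slots slots of slot_size bytes each. */
static unsigned int data_pages(unsigned int slot_size, unsigned int nr_slots)
{
    return (slot_size * nr_slots + XEN_PAGE_SIZE - 1) / XEN_PAGE_SIZE;
}

/* Locate a slot by its 0-based id (assumption: slots never straddle pages). */
static struct slot_loc slot_locate(unsigned int slot_size, unsigned int id)
{
    unsigned int per_page;
    struct slot_loc loc;

    /* The slot size must be non-zero and divide the page size evenly. */
    assert(slot_size != 0 && XEN_PAGE_SIZE % slot_size == 0);

    per_page = XEN_PAGE_SIZE / slot_size;
    loc.page = id / per_page;
    loc.offset = (id % per_page) * slot_size;
    return loc;
}
```

With these helpers, `data_pages(128, 256)` gives 8, and slot number 34
counting from 1 (id 33 counting from 0) resolves via `slot_locate(128, 33)`
to page 1 at offset 0x80, that is the second slot of the second page.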
