
Re: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement (with pictures!)



On Fri, Jan 18, 2013 at 05:00:10PM +0100, Roger Pau Monné wrote:
> On 18/12/12 15:31, Konrad Rzeszutek Wilk wrote:
> > Hey,
> > 
> > I am including some folks that are not always on Xen-devel to see if they
> > have some extra ideas or can correct my misunderstandings.
> > 
> > This is very much RFC - so there is bound to be some bugs.
> > The original is here
> > https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
> > in case one wishes to modify/provide comment on that one.
> 
> Hello Konrad,
> 
> Thanks for the problem description and possible solutions, I'm 
> going to take this on (provided that you are OK with it).

The whole thing!? You are quite ambitious here!

> 
> > 
> > There are outstanding issues we have now with the block protocol:
> > Note: I am assuming 64-bit guest/host - as the sizes of the structures
> > change on 32-bit. I am also attaching the header for the blkif ring
> > as of today.
> > 
> > A) Segment size is limited to 11 pages. It means we can at most squeeze
> > in 44kB per request. The ring can hold 32 (next power of two below 36)
> > requests, meaning we can have at most ~1.4MB of outstanding I/O.
> 
> I've been looking at the different options here and I think the most
> suitable one would be using a mechanism similar to VirtIO "indirect
> descriptors", it is the one that allows for more in-flight requests
> while being quite simple to implement.

<nods> I think you are right - plus it looks to be similar to how the more
modern hardware does requests (here I am thinking of NVMe and its command
descriptor - where it can point to a page full of DMA addresses + LBAs and say
"Go READ").

> 
> > B). Producer and consumer indexes are on the same cache line. In present
> > hardware that means the reader and writer will compete for the same
> > cacheline causing a ping-pong between sockets.
> > 
> > C). The requests and responses are on the same ring. This again causes the
> > ping-pong between sockets as the ownership of the cache line will shift
> > between sockets.
> > 
> > D). Cache alignment. Currently the protocol is 16-bit aligned. This is
> > awkward as the requests and responses sometimes fit within a cacheline or
> > sometimes straddle them.
> > 
> > E). Interrupt mitigation. We are currently doing a kick whenever we are
> > done 'processing' the ring. There are better ways to do this - and
> > we could use existing network interrupt mitigation techniques to make the
> > code poll when there is a lot of data.
> > 
> > F). Latency. The processing of the request limits us to only do 44kB -
> > which means that a 1MB chunk of data - which on contemporary devices would
> > only use one I/O request - would be split up into multiple 'requests',
> > inadvertently delaying the processing of said block.
> > 
> > G) Future extensions. DIF/DIX for integrity. There might
> > be others in the future and it would be good to leave space for extra
> > flags TBD.
> > 
> > H). Separate the response and request rings. The current
> > implementation has one thread for one block ring. There is no reason why
> > there could not be two threads - one for responses and one for requests -
> > and especially if they are scheduled on different CPUs. Furthermore this
> > could also be split in multi-queues - so two queues (response and request)
> > on each vCPU.
> > 
> > I). We waste a lot of space on the ring - as we use the
> > ring for both requests and responses. The response structure needs to
> > occupy the same amount of space as the request structure (112 bytes). If
> > the request structure is expanded to be able to fit more segments (say
> > the 'struct blkif_sring_entry' is expanded to ~1500 bytes) that still
> > requires us to have a matching size response structure. We do not need
> > to use that much space for one response. Having a separate response ring
> > would simplify the structures.
> > 
> > J). 32 bit vs 64 bit. Right now the size
> > of the request structure is 112 bytes under 64-bit guest and 102 bytes
> > under 32-bit guest. It is confusing and furthermore requires the host
> > to do extra accounting and processing.
> > 
> > The crude drawing displays the memory that the ring occupies in offsets of
> > 64 bytes (cache line). Of course future CPUs could have different cache
> > lines (say 32 bytes?) - which would skew this drawing. A 32-bit ring is
> > a bit different as the 'struct blkif_sring_entry' is of 102 bytes.
> > 
> > 
> > Problem A) has two solutions (look at
> > http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> > details). One, proposed by Justin from Spectralogic, is to negotiate
> > the segment size. This means that the 'struct blkif_sring_entry'
> > is now of variable size. It can expand from 112 bytes (covering 11 pages of
> > data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple
> > extension - just make the array in the request expand from 11 entries to a
> > negotiated variable size.
> > 
> > 
> > The math is as follows.
> > 
> > 
> >         struct blkif_request_segment {
> >                 uint32  grant;                  // 4 bytes
> >                 uint8_t first_sect, last_sect;  // 1 + 1 = 6 bytes total
> >         }
> > (6 bytes for each segment) - the above structure is in an array of size
> > 11 in the request. The 'struct blkif_sring_entry' is 112 bytes. The
> > change is to expand the array - in this example we would tack on 245 extra
> > 'struct blkif_request_segment' - 245*6 + 112 = 1582. If we were to
> > use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
> > 
> > 
> > The other solution (from Intel - Ronghui) was to create one extra
> > ring that only has the 'struct blkif_request_segment' entries in it. The
> > 'struct blkif_request' would be changed to have an index into said
> > 'segment ring'. There is only one segment ring. This means that the
> > size of the initial ring is still the same. The requests would point
> > into the segment ring and enumerate how many of the indexes they want
> > to use. The limit is of course the size of the segment ring. If one
> > assumes a one-page segment ring this means we can in one request cover
> > ~4MB. The math is as follows:
> > 
> > 
> > First request uses half of the segment ring - so index 0 up
> > to 341 (out of 682). Each entry in the segment ring is a 'struct
> > blkif_request_segment' so it occupies 6 bytes. The other requests on
> > the ring (so there are 35 left) can use either the remaining 341 indexes
> > of the segment ring or use the old style request. The old style request
> > can address up to 44kB. For example:
> > 
> > 
> >  sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832
> >  sring[1]->[uses the old style request] = 11*4096 = 45056
> >  sring[2]->[uses 342->682 indexes in the segment ring] = 1392640
> >  sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624
> > 
> > 
> > Total: 4145152 bytes. Naturally this could be extended to have a bigger
> > segment ring to cover more.
> > 
> > 
> > 
> > 
> > 
> > The problem with this extension is that we use 6 bytes and end up
> > straddling a cache line. Using 8 bytes will fix the cache straddling. This
> > would mean fitting only 512 segments per page.
> > 
> > 
> > There is yet another mechanism that could be employed - and it borrows
> > from the VirtIO protocol: the 'indirect descriptors'. This is
> > very similar to what Intel suggests, but with a twist.
> > 
> > 
> > We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT) in the 'struct
> > blkif_sring' (each entry can be up to 112 bytes if needed - so the
> > old style request would still fit). It would look like:
> > 
> > 
> > /* so 64 bytes under 64-bit. If necessary, the array (seg) can be
> >  * expanded to fit 11 segments as the old style request did. */
> > struct blkif_request_indirect {
> >         uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE   */ // 1
> >         blkif_vdev_t   handle;       /* only for read/write requests        */ // 2
> > #ifdef CONFIG_X86_64
> >         uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
> > #endif
> >         uint64_t       id;           /* private guest value, echoed in resp  */
> >         grant_ref_t    gref;         /* reference to indirect buffer frame, if used */
> >         struct blkif_request_segment_aligned seg[4]; // each is 8 bytes
> > } __attribute__((__packed__));
> > 
> > 
> > struct blkif_request {
> >         uint8_t        operation;    /* BLKIF_OP_???  */
> >         union {
> >                 struct blkif_request_rw rw;
> >                 struct blkif_request_indirect indirect;
> >                 /* ... other ... */
> >         } u;
> > } __attribute__((__packed__));
> > 
> 
> Do you think it would be suitable to change the size of blkif_request? I 
> was thinking that if we introduce indirect requests we could reduce the 
> maximum number of segments per request to 2, so blkif_request is 
> reduced to 32 bytes and we can fit 126 requests (64 due to rounding) 
> in a single ring. 

Altering it to be 32-bytes means we are still crossing two cachelines.

This is where I thought of putting 'struct blkif_request_rw' on a diet
and just making it have 4 of the grants - in case the request is
just 16K. And then the size of the structure (if I got my math right)
would be 64-bytes.

Naturally this means we need to negotiate a 'feature-request-size'
where v1 says that the request is of 64-bytes length. v2 might
be 128bytes or whatever else we would need to fit within a cacheline
in future CPUs. We could pad the 'struct request_v1' with 64-bytes
in case the cacheline is 128bytes and our request is 64-bytes.
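
Something along these lines is what I am picturing - completely untested,
and the field names plus the seg[4] split are just for illustration, not
a firm proposal:

struct blkif_request_segment_aligned {
        grant_ref_t gref;                   /* 4 bytes                        */
        uint8_t     first_sect, last_sect;  /* 2 bytes                        */
        uint16_t    _pad;                   /* pad the segment out to 8 bytes */
};

struct blkif_request_v1 {
        uint8_t        operation;       /*  1: BLKIF_OP_*                     */
        uint8_t        nr_segments;     /*  1: up to 4 inline segments (16KB) */
        blkif_vdev_t   handle;          /*  2                                 */
        uint32_t       _pad1;           /*  4: keep id at offset 8            */
        uint64_t       id;              /*  8: echoed back in the response    */
        blkif_sector_t sector_number;   /*  8                                 */
        struct blkif_request_segment_aligned seg[4];   /* 32                  */
        uint64_t       _pad2;           /*  8: spare - DIF/DIX or flags later */
};                                      /* 1+1+2+4+8+8+32+8 = 64 bytes        */

A v2 could then simply grow the trailing pad (or the seg array) to match
whatever cacheline size gets negotiated.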


> 
> Everything that needs more than two segments will switch to using the 
> new indirect operation. Also, to make use of the extra space in 
> blkif_request_indirect I would remove the extra segments and instead 
> pass more grant references to frames containing 
> blkif_request_indirect_entry entries.

Right. The complications arise when we need to support this new format
and the old format and have to switch over when we migrate to old
backends.
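
In practice I guess that means the frontend re-reading the feature key on
every (re)connect and silently falling back to the old 11-segment requests.
Roughly (the key name 'feature-indirect-descriptors' and the use_indirect
field are just placeholders):

        unsigned int indirect = 0;

        /* Re-check on every blkback_changed()/connect: after a migration
         * the new backend may be older and not advertise the feature. */
        if (xenbus_scanf(XBT_NIL, info->xbdev->otherend,
                         "feature-indirect-descriptors", "%u", &indirect) != 1)
                indirect = 0;

        info->use_indirect = !!indirect;  /* otherwise keep the old format */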

> 
> #define BLKIF_MAX_SEGMENTS_PER_REQUEST 2
> #define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 4
> 
> struct blkif_request_segment {
>     grant_ref_t gref;        /* reference to I/O buffer frame        */
>     /* @first_sect: first sector in frame to transfer (inclusive).   */
>     /* @last_sect: last sector in frame to transfer (inclusive).     */
>     uint8_t     first_sect, last_sect;
> } __attribute__((__packed__));
> 
> struct blkif_request_indirect_entry {
>     blkif_sector_t  sector_number;
>     struct          blkif_request_segment seg;
>     uint8_t         _pad[2];
> } __attribute__((__packed__));
> 
> struct blkif_request_rw {
>     uint8_t        nr_segments;  /* number of segments                   */
>     blkif_vdev_t   handle;       /* only for read/write requests         */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
>     struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> } __attribute__((__packed__));
> 
> struct blkif_request_indirect {
>     uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE    */
>     uint16_t       nr_indirect;  /* number of indirect entries in gref frames */
>     blkif_vdev_t   handle;       /* only for read/write requests         */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     /* reference to indirect buffer frame */
>     grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> } __attribute__((__packed__));
> 
> struct blkif_request_discard {
>     uint8_t        flag;         /* BLKIF_DISCARD_SECURE or zero.        */
> #define BLKIF_DISCARD_SECURE (1<<0)  /* ignored if discard-secure=0 */
>     uint64_t       id;           /* private guest value, echoed in resp  */
>     blkif_sector_t sector_number;
>     uint64_t       nr_sectors;
> } __attribute__((__packed__));
> 
> struct blkif_request {
>     uint8_t        operation;    /* BLKIF_OP_???                         */
>     union {
>         struct blkif_request_rw rw;
>         struct blkif_request_indirect indirect;
>         struct blkif_request_discard discard;
>     } u;
> } __attribute__((__packed__));
> 
> I've removed all the extra padding, because this protocol is no longer 
> compatible with the previous one (we modify 
> BLKIF_MAX_SEGMENTS_PER_REQUEST), and packed the structures. I'm 
> wondering if using __attribute__ in a public header is allowed, since 
> this is a GNU extension (AFAIK supported by gcc and clang at least).

I don't know. But I think we can bypass the need for it if we calculate
the size of each entry right and insert padding whenever necessary.
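
For example Roger's 32-byte indirect request could have its padding spelled
out explicitly - a sketch only, but the offsets should then come out the
same on 32-bit and 64-bit without any GNU extension in the public header:

struct blkif_request_indirect {
        uint8_t        op;            /* offset  0                           */
        uint8_t        _pad0;         /* offset  1, explicit padding         */
        uint16_t       nr_indirect;   /* offset  2                           */
        blkif_vdev_t   handle;        /* offset  4                           */
        uint16_t       _pad1;         /* offset  6, keep id 8-byte aligned   */
        uint64_t       id;            /* offset  8                           */
        grant_ref_t    gref[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
                                      /* offset 16, 4 * 4 = 16 bytes         */
};                                    /* 32 bytes total                      */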

> 
> The main motivation behind the modification of the size of the 
> request is that I've realized that in your proposal, you were 
> adding 4 extra segments to the indirect request, and I was wondering 
> if I could pack more grant frames containing indirect entries in a 
> single request. By using the old size of blkf_request we will be able to
> send up to 9 grant frames filled with indirect entries, which gives us
> 4096 * 256 * 9 = 9MB of outstanding IO in each request. I think that
> this is too high, and will introduce a lot of latency to the protocol.
> That's the main reason I've decided to reduce the size of blkif_request
> in order to be able to fit more of them in a ring, instead of making
> each of the requests bigger.

OK. That is OK as long as we make sure to _not_ have two requests
straddling the cache-line. Or maybe that is not a problem.

But the other nice benefit of your idea - where it is only 32 bytes
- is that we now have an extra 32 bytes where we can insert such
things as DIF/DIX or "other" new features. And that way have the
request be of size 64 bytes.

> 
> Using this implementation each request has a size of 32bytes (to 
> prevent cache straddle), so we could fit 126 requests in a ring, but 
> due to the rounding, we end up with 64. Also each request can contain 
> 4 grant pages, each one filled with 256 blkif_request_indirect_entry. 
> The math for the maximum data in a request is: 4096 * 256 * 4 = 4MB, 
> and in the whole ring 4 * 64 = 256MB worth of outstanding IO.
>  

Which has a nice power of two ring to it. (Ah the puns!)

I like the idea of putting the request on a diet - but too much
could cause us to miss the opportunity to insert other flags on it.
If I recall correctly, the DIF/DIX only need 8 bytes of data.
If we make the assumption that:
        I/O request = one ring entry

and the  "one ring entry" can use the '4' grants if we just have a
16KB I/O request, but if it is more than that - we use the indirect page
and can stuff 1MB of data in there. 
The extra 32 bytes of space could then be used for such things as
DIF/DIX. This also means we could unify the 'struct request' with the
'discard' operation and it could utilize the 32 bytes of extra unused
payload data.
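
In other words the per-I/O decision in the frontend would be roughly
(pseudo-code, the fill_* helper names are made up):

        if (nr_segments <= 4) {
                /* Up to 16KB: the four inline segments are enough. */
                ring_req->operation = rq_data_dir(req) ? BLKIF_OP_WRITE
                                                       : BLKIF_OP_READ;
                fill_inline_segments(ring_req, sg, nr_segments);
        } else {
                /* Bigger I/O: stash up to 256 indirect entries in a
                 * granted page and point the ring slot at it. */
                ring_req->operation = BLKIF_OP_INDIRECT;
                ring_req->u.indirect.op = rq_data_dir(req) ? BLKIF_OP_WRITE
                                                           : BLKIF_OP_READ;
                fill_indirect_page(ring_req, sg, nr_segments);
        }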

> > 
> > 
> > The 'operation' would be BLKIF_OP_INDIRECT. The read/write/discard,
> > etc. operation would now be in indirect.op. The indirect.gref points to
> > a page that is filled with:
> > 
> > 
> > struct blkif_request_indirect_entry {
> >         blkif_sector_t sector_number;
> >         struct blkif_request_segment seg;
> > } __attribute__((__packed__));
> > //16 bytes, so we can fit in a page 256 of these structures.
> > 
> > 
> > This means that with the existing 36 slots in the ring (single page)
> > we can cover: 32 slots * (256 * 4096 bytes per blkif_request_indirect)
> > ~= 32MB. If we don't want to use indirect descriptors we can still use
> > up to 4 pages per request (as it has enough space to contain four
> > segments and the structure will still be cache-aligned).
> > 
> > 
> > 
> > 
> > B). Both the producer (req_* and rsp_*) values are in the same
> > cache-line. This means that we end up with the same cacheline being
> > modified by two different guests. Depending on the architecture and
> > placement of the guest this could be bad - as each logical CPU would
> > try to write and read from the same cache-line. A mechanism where
> > the req_* and rsp_* values are separated and on a different cache line
> > could be used. The value of the correct cache-line and alignment could
> > be negotiated via XenBus - in case future technologies start using 128
> > bytes for cache or such. Or the producer and consumer indexes are in
> > separate rings. Meaning we have a 'request ring' and a 'response
> > ring' - and only the 'req_prod', 'req_event' are modified in
> > the 'request ring'. The opposite (resp_*) are only modified in the
> > 'response ring'.
> 
> Using different rings for requests and responses sounds like a good idea,
> this will reduce the number of variables on the ring, leaving only
> "produced" and "event" (we no longer need rsp_ and req_).

<nods>
> 
> Since we will no longer have two different domains writing to the same
> cache line I guess this will get rid of the cache ping-pong problem.

<nods> That is my thinking.
> 
> Also while there we could try to use the whole page, instead of rounding
> down to the closest power of two; this will allow us to go from 64
> requests to 126 (if using the indirect descriptors proposal explained
> above).

Right.
Let's call this 'feature-separate-req-and-rsp'?
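
Layout-wise I am thinking of something like this - a sketch only, all the
names are made up:

/* Frontend only ever writes the request ring, backend only ever writes
 * the response ring, so the producer/event indexes never share a
 * cacheline between the two domains. */
struct blkif_req_ring {
        uint32_t produced;          /* written by the frontend              */
        uint32_t event;             /* "kick me when you pass this" index   */
        uint8_t  _pad[56];          /* keep the header in its own cacheline */
        struct blkif_request req[];
};

struct blkif_rsp_ring {
        uint32_t produced;          /* written by the backend               */
        uint32_t event;
        uint8_t  _pad[56];
        struct blkif_response rsp[];
};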

>  
> > C). Similar to B) problem but with a bigger payload. Each
> > 'blkif_sring_entry' occupies 112 bytes which does not lend itself
> > to a nice cache line size. If the indirect descriptors are to be used
> > for everything we could 'slim down' the blkif_request/response to
> > be up to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
> > to 5 as that would slim the largest of the structures to 64-bytes.
> > Naturally this means negotiating a new size of the structure via XenBus.

Here I am using 5, and I think you were using 4 - and you mentioned you
got it down to 32-bytes. I should double-check the 'sizeof' then to
see what it would be like.
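
A quick way to double-check (and then enforce) the sizes would be build-time
asserts in the Linux drivers' init paths - the numbers below are only what we
are aiming for, not something I have verified:

        BUILD_BUG_ON(sizeof(struct blkif_request)  > 64);
        BUILD_BUG_ON(sizeof(struct blkif_response) > 64);
        BUILD_BUG_ON(offsetof(struct blkif_request, u.rw.id) != 8);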

> > 
> > 
> > D). The first picture shows the problem. We are now aligning everything
> > on the wrong cachelines. Worse, in about half of the cases we straddle
> > three cache-lines. We could negotiate a proper alignment for each
> > request/response structure.
> > 
> > 
> > E). The network stack has shown that going into a polling mode does improve
> > performance. The current mechanism of kicking the guest and/or block
> > backend is not always clear.  [TODO: Konrad to explain it in detail]

Oh, I never did explain this - but I think the patches that Daniel came
up with actually fix a part of it. They make the kick-the-other-guest
only happen when the backend has processed all of the requests and
cannot find anything else to do. Previously it was more of 'done one
request, let's kick the backend'.

But going forward, this 'kick-the-other-guest' could be further modulated.
If we have a full ring and the frontend keeps on adding entries we (backend)
should never get an interrupt. As a matter of fact the frontend should be just
polling all the time and process them as fast as possible.

Or we could add tweaks in the backend to adopt an 'adaptive interrupt
coalescing' mechanism. This way we only kick the frontend if a certain
threshold of I/Os have been processed or if we are slowing down on the
threshold. Again, the idea is that - if we have the rings full we should
not send 'kicks' at all and just keep on processing.

Note: I *believe* that the backend does this - keeps on processing and
won't kick if the ring is full. But I am not 100% sure on the frontend side.
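
Just to sketch the coalescing idea (the threshold value and the counter
field are made up, this is not what blkback does today):

#define KICK_THRESHOLD 32   /* made up - would want this to be tunable */

static void maybe_notify_frontend(struct xen_blkif *blkif)
{
        blkif->rsp_since_kick++;

        /* While the frontend keeps feeding us work, hold the event back -
         * it can pick up responses the next time it looks at the ring. */
        if (RING_HAS_UNCONSUMED_REQUESTS(&blkif->blk_rings.common) &&
            blkif->rsp_since_kick < KICK_THRESHOLD)
                return;

        blkif->rsp_since_kick = 0;
        notify_remote_via_irq(blkif->irq);
}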

> > 
> > 
> > F). The current block protocol for big I/Os that the backend devices can
> > handle ends up doing extra work by splitting the I/O in smaller chunks,
> > and then reassembling them. With the solutions outlined in A) this can
> > be fixed. This is easily seen with 1MB I/Os. Since each request can
> > only handle 44kB that means we have to split a 1MB I/O into 24 requests
> > (24 * 4096 * 11 = 1081344). Then the backend ends up sending them in
> > sector-sizes - which with contemporary devices (such as SSD) ends up with
> > more processing. The SSDs are comfortable handling 128kB or higher I/Os
> > in one go.
> > 
> > 
> > G). DIF/DIX. This is a protocol to carry extra 'checksum' information
> > for each I/O. The I/O can be a sector size, page-size or an I/O size
> > (most popular are 1MB). The DIF/DIX needs 8 bytes of information for
> > each I/O. It would be worth considering putting/reserving that amount of
> > space in each request/response. Also putting in extra flags for future
> > extensions would be worth it - however the author is not aware of any
> > right now.
> > 
> > 
> > H). Separate response/request. Potentially even multi-queue per-VCPU
> > queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
> > flawed. There is no similar concept in the storage world where the
> > operating system can put its foot down and say: 'everything before this
> > has to be on the disk.' There are lighter versions of this - called
> > 'FUA' and 'FLUSH'. Depending on the internal implementation
> > of the storage they are either ignored or do the right thing. The
> > filesystems determine the viability of these flags and change writing
> > tactics depending on this. From a protocol level, this means that the
> > WRITE/READ/SYNC requests can be intermixed - the storage by itself
> > determines the order of the operation. The filesystem is the one that
> > determines whether the WRITE should be with a FLUSH to preserve some form
> > of atomicity. This means we do not have to preserve an order of operations
> > - so we can have multiple queues for requests and responses. This has
> > been shown in the network world to improve performance considerably.
> > 
> > 
> > I). Wastage of response/request on the same ring. Currently each response
> > MUST occupy the same amount of space that the request occupies - as the
> > ring can have both responses and requests. Separating the request and
> > response ring would remove the wastage.
> > 
> > 
> > J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
> > entries is different if the guest is in 32-bit or 64-bit mode. Cleaning
> > this up to be the same size would save considerable accounting that the
> > host has to do (extra memcpy for each response/request).
> 
> So far I think most of the problems could be solved by using indirect
> descriptors, aligning requests to a cache-sensible value (to prevent
> straddling) and using separate rings for requests and responses. This 
> would be my plan for solving these issues:

That would take care of the bulk of the big issues.

> 
> 1. Implement an LRU mechanism in blkback for the persistent grants list. If
> we are going to increase the number of in-flight IOs we need to be able to
> keep the list of persistently mapped grants to something sensible, since
> we are no longer able to persistently map all the grants that could be 
> used. In my proposal that would be 64 * 4 for each grant frame that 

Can you explain the foundations behind the '64' and '4'? I think I know
the 4 (BLKIF_..SEGMENTS) but I am not sure about '64'.

> contains indirect entries, plus 64 * 4 * 256, for all the possible
> entries in the indirect frame 64 * 4 + 64 * 4 * 256 = 65792.

That would cover ~256MB of data per guest. Wow. 
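
And just so we are talking about the same kind of LRU - a rough sketch of
what I would expect in blkback (the persistent_lru/persistent_cnt fields
and the unmap helper are made up):

struct persistent_gnt {
        grant_ref_t      gnt;
        grant_handle_t   handle;
        struct page     *page;
        struct list_head lru;       /* most recently used at the head */
};

static void persistent_gnt_used(struct xen_blkif *blkif,
                                struct persistent_gnt *pgnt)
{
        list_move(&pgnt->lru, &blkif->persistent_lru);
}

static void persistent_gnt_shrink(struct xen_blkif *blkif, unsigned int max)
{
        /* Evict from the tail (least recently used) until we are back
         * under the negotiated maximum - e.g. the 65792 figure above. */
        while (blkif->persistent_cnt > max) {
                struct persistent_gnt *victim =
                        list_entry(blkif->persistent_lru.prev,
                                   struct persistent_gnt, lru);
                list_del(&victim->lru);
                unmap_persistent_gnt(blkif, victim);   /* made-up helper */
                blkif->persistent_cnt--;
        }
}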
> 
> 2. Implement indirect descriptors.
> 
> 3. Implement new ring protocol, using two separate rings for requests
> and replies.

Don't we want to negotiate each of these proposals as a separate feature? So:
        separate-req-and-rsp
        slim-version-of-req
        indirect-descriptor
        multi-page-ring
        alignment-64-or-other-value
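
Presumably each of those then becomes its own xenstore key that blkback
advertises and blkfront looks for - roughly like this, from blkback's
connect path inside its xenstore transaction (the key names are just
placeholders at this point):

        err = xenbus_printf(xbt, dev->nodename,
                            "feature-separate-req-and-rsp", "%d", 1);
        if (!err)
                err = xenbus_printf(xbt, dev->nodename,
                                    "feature-indirect-descriptors", "%d", 1);
        if (!err)
                err = xenbus_printf(xbt, dev->nodename,
                                    "feature-request-size", "%u", 64);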

