
Re: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement (with pictures!)



> > > Which has a nice power of two ring to it. (Ah the puns!)
> > > 
> > > I like the idea of putting the request on a diet - but too much
> > > could cause us to miss the opportunity to insert other flags on it.
> > > If I recall correctly, the DIF/DIX only need 8 bytes of data.
> > > If we make the assumption that:
> > >         I/O request = one ring entry
> > 
> > So we only need to reserve 8 bytes for each DIF/DIX, even if the request
> > contains a variable amount of data? (I mean, block requests can at a
> > minimum contain 4096 bytes, or much more)
> 
> I need to double check with Martin (CC-ed here). But my recollection
> is that it is just attached to the 'bio'. So if the BIO is 4K or 1MB -
> it would only have one DIF/DIX data type.

And that is semi-correct. If the user did a horrible job (say using
dd) the pages are chained together - and we end up with a linked list
of bio's. The last bio would point to a page of checksums - each
sector's worth of data has a checksum, and each checksum occupies 8
bytes. So if the total 'bio' length is say 1MB, this last page is
filled with 256 checksums - so 2048 bytes of data.
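
To put numbers on that, a throwaway sketch (the constants and the helper
name are mine, just mirroring the 1MB example above - one 8-byte checksum
per 4096 bytes of data):

#include <stdint.h>
#include <stdio.h>

#define PROT_INTERVAL   4096    /* bytes of data covered by one checksum */
#define CSUM_SIZE       8       /* each DIF/DIX checksum is 8 bytes      */

/* Hypothetical helper: bytes of checksum data a bio of 'bio_len'
 * bytes would carry along. */
static uint64_t dif_bytes_for_bio(uint64_t bio_len)
{
        uint64_t nr_csums = bio_len / PROT_INTERVAL;    /* 1MB -> 256 checksums */
        return nr_csums * CSUM_SIZE;                    /* 256 * 8 = 2048 bytes */
}

int main(void)
{
        printf("%llu\n", (unsigned long long)dif_bytes_for_bio(1024 * 1024)); /* 2048 */
        return 0;
}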

> 
> Hmm, but then we operate on the 'struct request', so that might not
> be the case...
> > 
> > > and the "one ring entry" can use the '4' grants if we just have a
> > > 16KB I/O request, but if it is more than that - we use the indirect page
> > 
> > Well, in my proposal I've limited the number of segments of a "rw"
> > request to 2, so it's only 8K; anything bigger has to use indirect
> > descriptors, which can fit 4M of data (because I'm passing 4 grant
> > frames full of "blkif_request_indirect_entry" entries).
> 
> <nods>
> > 
> > > and can stuff 1MB of data in there.
> > > The extra 32 bytes of space could be used for such things as 'DIF/DIX'.
> > > This also means we could unify the 'struct request' with the 'discard'
> > > operation and it could utilize the 32 bytes of extra unused payload data.
> > > 
> > >>>
> > >>>
> > >>> The 'operation' would be BLKIF_OP_INDIRECT. The read/write/discard,
> > >>> etc. operation would now be in indirect.op. The indirect.gref points to
> > >>> a page that is filled with:
> > >>>
> > >>>
> > >>> struct blkif_request_indirect_entry {
> > >>>         blkif_sector_t sector_number;
> > >>>         struct blkif_request_segment seg;
> > >>> } __attribute__((__packed__));
> > >>> //16 bytes, so we can fit in a page 256 of these structures.
> > >>>
> > >>>
> > >>> This means that with the existing 36 slots in the ring (single page)
> > >>> we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096
> > >>> ~= 32M. If we don't want to use indirect descriptors we can still use
> > >>> up to 4 pages of the request (as it has enough space to contain four
> > >>> segments and the structure will still be cache-aligned).
> > >>>
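
Just to sanity-check the arithmetic quoted above, a quick throwaway
sketch (the macro values are assumptions taken from the discussion,
not from blkif.h):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE        4096
#define INDIRECT_ENTRY   16      /* sizeof(struct blkif_request_indirect_entry) */
#define ENTRIES_PER_PAGE (PAGE_SIZE / INDIRECT_ENTRY)   /* 256 per indirect page */
#define RING_SLOTS       32      /* ring slots used in the estimate above        */

int main(void)
{
        uint64_t max_io = (uint64_t)RING_SLOTS * ENTRIES_PER_PAGE * PAGE_SIZE;
        printf("%llu MB\n", (unsigned long long)(max_io >> 20));    /* prints 32 MB */
        return 0;
}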


Martin asked me why we even do this via these entries - meaning, why
have this tuple of information for each page: <lba, first_sect,
last_sect, gref>? The lba on the next indirect entry is just going to
be incremented by one, and the first_sect and last_sect too... So why
not just do:

struct blkif_request_indirect {
        uint8_t        operation;
        blkif_vdev_t   handle;       /* only for read/write requests         */
#ifdef CONFIG_X86_64
        uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
#endif
        uint64_t       id;           /* private guest value, echoed in resp  */
        blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */

        grant_ref_t     indirect_desc;  /* grant ref of the indirect descriptor page */
        uint16_t        nr_elems;       /* number of indirect entries in use          */
};

And the 'indirect_desc' would point to a page that looks quite close to
what the scatterlist looks like:

        struct indirect_chain {
                uint16_t        op_flag;        // Can be D_NEXT, D_START, D_END ?
                uint16_t        next;
                uint16_t        offset;
                uint16_t        length;
                uint32_t        gref;
                uint32_t        _pad;           // Need this in case we ever want to
                                                // make gref + _pad be a physical addr.
        };

And the page itself would be:
        struct indirect_chain[256];

The 'next' would just contain the index inside the indirect_chain page - so
from 0 to 255. The offset and length would describe where in the page the
data is contained.

This way the 'lba' information is part of the 'blkif_request_indirect' and the
payload info is all in the indirect descriptors.
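
To illustrate the intent, here is a rough sketch of how a backend could
walk such a page (the flag values and the walk order are my guesses at
the design, not something settled):

#include <stdint.h>

/* Placeholder flag values - only the names D_NEXT/D_START/D_END appear above. */
#define D_START 0x1
#define D_NEXT  0x2
#define D_END   0x4

/* Hypothetical walk, reusing struct indirect_chain as defined above:
 * start at entry 0 and follow 'next' until an entry flagged D_END,
 * summing up the payload length.  The lba lives in the enclosing
 * blkif_request_indirect, so each entry only needs offset/length/gref. */
static uint64_t walk_chain(const struct indirect_chain *page)
{
        uint64_t total = 0;
        uint16_t idx = 0;

        for (;;) {
                const struct indirect_chain *e = &page[idx];

                /* here one would map e->gref and use e->offset/e->length */
                total += e->length;
                if (e->op_flag & D_END)
                        break;
                idx = e->next;          /* index within the same page, 0 to 255 */
        }
        return total;
}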
