
Re: [Xen-devel] blkback global resources



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Tuesday, March 27, 2012 8:27 AM
> To: Daniel Stodden
> Cc: Andrei Lifchits; xen-devel
> Subject: Re: [Xen-devel] blkback global resources
> 
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@xxxxxxxxxxxxxx>
> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >>  K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> >
> >> > if (blkif_reqs < BLK_RING_SIZE(order))
> >> >       printk(KERN_WARNING "WARNING: "
> >> >              "I/O request space (%d reqs) < ring order %ld, "
> >> >              "consider increasing %s.reqs to >= %ld.",
> >> >              blkif_reqs, order, KBUILD_MODNAME,
> >> >              roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to
> >> > convert this to per-instance allocations, as the amount of memory
> >> > taken away from Dom0 for this may be not insignificant when there
> >> > are many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the observation
> >> > apparently made by Daniel Stodden (the author of said patch, for whom
> >> > I don't have a current email address to ask directly) that it can
> >> > become a bottleneck, and also in the context of multi-page rings,
> >> > whose purpose is to allow for larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory ballooned out. Otherwise this stuff would never have looked the
> > way it does now.
> 
> This of course could be accounted for by having an initially non-empty (large
> enough) balloon (not sure how easy it is these days to do this for pv-ops, but
> it has always been trivial with the legacy code). That wouldn't help a 32-bit
> kernel much (where generally the initial balloon is all in highmem, yet the
> vacated pages need to be in lowmem), but for 64-bit kernels it should be
> fine.
> 
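For a rough sense of the numbers being traded off here, a back-of-the-envelope
sketch (plain userland C, just to do the arithmetic; the two constants mirror
the classic blkback defaults of 64 in-flight requests and 11 segments per
request, but treat them as assumptions rather than values quoted from any
particular tree):

#include <stdio.h>

#define PAGE_SIZE                      4096
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11   /* classic single-page ring */
#define XEN_BLKIF_REQS                 64   /* default 'reqs' module parameter */

int main(void)
{
	/* pages a fully independent per-instance pool would have to pin */
	unsigned long pages = XEN_BLKIF_REQS * BLKIF_MAX_SEGMENTS_PER_REQUEST;
	unsigned long bytes = pages * PAGE_SIZE;

	/* ~704 pages, i.e. ~2.75 MiB of (lowmem) space per backend instance */
	printf("%lu pages (%lu KiB) per backend instance\n", pages, bytes >> 10);
	return 0;
}

Multiply that by a few hundred backend instances and the "not insignificant"
amount of memory taken away from Dom0 becomes clear.
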
> > Regarding the right balance, note that on the other extreme end, if
> > PFN space were infinite, there's not much expected performance gain
> > from rendering virtual backends fully independent. Beyond controller
> > queue depth, these requests are all just going to pile up, waiting.
> 
> Is there a way to look through the queue stack to find out how many distinct
> queues the backend is running on top of, as well as - for a particular I/O
> path - which one has the smallest depth? Or can one assume that the topmost
> one (generally loop's or blktap2's) won't advertise a queue deeper than what
> is going to be accepted downstream (probably not, I'd guess)?

Hm, off the top of my head I don't remember seeing anything relating to that in 
the blkback code, so I don't think so. (I'm not sure the benefit would be that 
great, anyway.)

> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
> 
> > XenServer has some support for decoupling in blktap.ko [1] which
> > worked relatively well: Use frame 'pool' kobjects. A bunch of pages,
> > mapped to sysfs object. Name was arbitrary. Size configurable, even at
> > runtime.

I have added similar functionality to blkback (pools configurable through 
xenstore, with userland tools creating one pool per SR), which is now out in 
the form of a limited-availability hotfix and will be included in the next 
XenServer release. Felipe (CC'd) measured the effects on performance and found 
that it helps.
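
For readers without access to the blktap tree referenced below, a minimal
sketch of the pool idea - a pool object whose size is exposed through sysfs and
adjustable at runtime. This is an illustration only, not the blktap or hotfix
code: all names and the sysfs location are made up, and the actual page
reservation logic is omitted.

#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

/* hypothetical pool descriptor - real code would also hold the pages */
static struct {
	unsigned int nr_pages;
} default_pool = { .nr_pages = 704 };

static struct kobject *pool_kobj;

static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
			 char *buf)
{
	return sprintf(buf, "%u\n", default_pool.nr_pages);
}

static ssize_t size_store(struct kobject *kobj, struct kobj_attribute *attr,
			  const char *buf, size_t count)
{
	unsigned int new_size;
	int err;

	err = kstrtouint(buf, 0, &new_size);
	if (err)
		return err;

	/* a real pool would grow/shrink its page reservation here */
	default_pool.nr_pages = new_size;
	return count;
}

static struct kobj_attribute size_attr =
	__ATTR(size, 0644, size_show, size_store);

static int __init pool_sysfs_init(void)
{
	/* appears as /sys/kernel/blkback-pools/size (made-up location) */
	pool_kobj = kobject_create_and_add("blkback-pools", kernel_kobj);
	if (!pool_kobj)
		return -ENOMEM;
	return sysfs_create_file(pool_kobj, &size_attr.attr);
}

static void __exit pool_sysfs_exit(void)
{
	kobject_put(pool_kobj);
}

module_init(pool_sysfs_init);
module_exit(pool_sysfs_exit);
MODULE_LICENSE("GPL");

With something along these lines in place, the shell/python setup Daniel
mentions next reduces to an echo into the size file.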

> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can
> > be relinked arbitrarily before entering Connected state.
> >
> > Then let the userland toolstack set things up according to physical
> > I/O topology and properties probed. Basically every physical backend
> > (say, a volume group, or a HBA) would start out by allocating and
> > dimensioning a dedicated pool (named after the backend), and every
> > backend instance fired up gets bound to the pool it belongs to.
> 
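For concreteness, on the kernel side such a binding could be picked up from
xenstore roughly as below. This is a sketch only: it assumes the toolstack
writes a "pool" node under the backend's xenstore directory, which is a made-up
key, not necessarily what the hotfix mentioned above uses.

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <xen/xenbus.h>

/*
 * Resolve which pool a backend instance should be bound to, falling back
 * to the global "default" pool when the (hypothetical) "pool" node is
 * absent - mirroring the tool-compat behaviour described above.
 */
static char *blkback_pool_name(struct xenbus_device *dev)
{
	char *name;

	name = xenbus_read(XBT_NIL, dev->nodename, "pool", NULL);
	if (IS_ERR(name))
		return kstrdup("default", GFP_KERNEL);

	return name;
}

The caller would then look the named pool up and link the backend to it before
the device enters the Connected state.
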
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.

You're probably right.

> Thanks in any case for responding so quickly, Jan
> 
> > There are a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> >  mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> >  Note that these are based on mempools, not the frame pools blkback
> > would take.
> 
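On the autogrowing idea above: a tiny sketch of what a log-scaled pool target
might look like (the base size and the scaling are invented knobs here, not
anything taken from blktap or the hotfix):

#include <linux/log2.h>

#define POOL_BASE_PAGES 704	/* e.g. one classic ring's worth: 64 * 11 */

/* grow a shared pool roughly with log2 of the backends bound to it */
static unsigned int pool_target_pages(unsigned int nr_backends)
{
	if (nr_backends <= 1)
		return POOL_BASE_PAGES;

	return POOL_BASE_PAGES * (1 + ilog2(nr_backends));
}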

Cheers,
Andrei

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

