
Re: [Xen-devel] [PATCH] Revert "domctl: improve locking during domain destruction"



On Wed, 2020-03-25 at 08:11 +0100, Jan Beulich wrote:
> On 24.03.2020 19:39, Julien Grall wrote:
> > On 24/03/2020 16:13, Jan Beulich wrote:
> > > On 24.03.2020 16:21, Hongyan Xia wrote:
> > > > From: Hongyan Xia <hongyxia@xxxxxxxxxx>
> > > > In contrast, after dropping that commit, parallel domain
> > > > destructions will just fail to take the domctl lock, creating a
> > > > hypercall continuation and backing off immediately, allowing the
> > > > thread that holds the lock to destroy a domain much more quickly
> > > > and allowing backed-off threads to process events and irqs.
> > > > 
> > > > On a 144-core server with 4TiB of memory, destroying 32 guests
> > > > (each with 4 vcpus and 122GiB memory) simultaneously takes:
> > > > 
> > > > before the revert: 29 minutes
> > > > after the revert: 6 minutes
> > > 
> > > This wants comparing against numbers demonstrating the bad effects
> > > of the global domctl lock. Iirc they were quite a bit higher than
> > > 6 min, perhaps depending on guest properties.
> > 
> > Your original commit message doesn't contain any clue about the
> > cases in which the domctl lock was an issue. So please provide
> > information on the setups where you think this will make things
> > worse.
> 
> I never observed the issue myself - let's see whether one of the SUSE
> people possibly involved in this back then recalls (or has further
> pointers; Jim, Charles?), or whether any of the (partly former) Citrix
> folks do. My vague recollection is that the issue was the tool stack
> as a whole stalling for far too long, in particular when destroying
> very large guests. One important aspect not discussed in the commit
> message at all is that holding the domctl lock blocks basically _all_
> tool stack operations (including e.g. creation of new guests), whereas
> the new issue being addressed here is limited to just domain cleanup.

The best solution is to make the heap itself scalable instead of
relying on a global lock, but that is not going to be trivial.

Of course, another option is to keep the domctl lock dropped in
domain_kill() but introduce a dedicated domain_kill lock, so that
competing domain_kill()s try to take that lock and back off with a
hypercall continuation when they fail (see the sketch below). But this
is kind of hacky (we introduce a lock just to reduce spinlock
contention elsewhere), so it is probably not a solution but a
workaround.
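
To make the shape of that workaround concrete, here is a minimal
user-space sketch (pthreads, built with -pthread; none of the
identifiers below are real Xen ones) of trylock-plus-back-off, with a
retry return value standing in for the hypercall continuation:

/*
 * Minimal user-space sketch (not Xen code) of the "dedicated
 * domain_kill lock" idea: competing destroyers trylock and, on
 * contention, return a retry code instead of spinning, so only one
 * thread at a time walks the expensive teardown. All identifiers
 * below are hypothetical.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERESTART_SENTINEL 85  /* stand-in for a "please retry" return */

static pthread_mutex_t domain_kill_lock = PTHREAD_MUTEX_INITIALIZER;

/* One chunk of teardown work; real code would free a batch of pages. */
static void relinquish_some_resources(int domid)
{
    (void)domid;
    usleep(1000);
}

/*
 * Returns 0 once the domain is torn down, or ERESTART_SENTINEL when the
 * caller should back off and retry (a hypercall continuation in Xen).
 */
static int domain_kill_once(int domid)
{
    if (pthread_mutex_trylock(&domain_kill_lock) != 0)
        return ERESTART_SENTINEL;   /* contended: back off, don't spin */

    /* Simulated teardown; the real code would also check for preemption
     * periodically and could itself return early to be continued. */
    for (int chunk = 0; chunk < 10; chunk++)
        relinquish_some_resources(domid);

    pthread_mutex_unlock(&domain_kill_lock);
    return 0;
}

static void *destroyer(void *arg)
{
    int domid = (int)(long)arg;

    /* The retry loop models the toolstack re-issuing the continued
     * hypercall; between retries it could service events and IRQs. */
    while (domain_kill_once(domid) == ERESTART_SENTINEL)
        sched_yield();

    printf("domain %d destroyed\n", domid);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, destroyer, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

The point is only the back-off pattern: contenders return immediately
instead of spinning, so the lock holder finishes the expensive teardown
sooner and the backed-off callers can process events and irqs between
retries.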

Still, seeing concurrent guest destruction go from 6 to 29 minutes, I
wonder whether the benefit of that commit can outweigh such a dramatic
regression.

Hongyan




 

