
Re: [Xen-devel] [PATCH] Revert "domctl: improve locking during domain destruction"



On Wed, 2020-03-25 at 08:11 +0100, Jan Beulich wrote:
> On 24.03.2020 19:39, Julien Grall wrote:
> > On 24/03/2020 16:13, Jan Beulich wrote:
> > > On 24.03.2020 16:21, Hongyan Xia wrote:
> > > > From: Hongyan Xia <hongyxia@xxxxxxxxxx>
> > > > In contrast, after dropping that commit, parallel domain
> > > > destructions will just fail to take the domctl lock, creating a
> > > > hypercall continuation and backing off immediately, allowing the
> > > > thread that holds the lock to destroy a domain much more quickly
> > > > and allowing backed-off threads to process events and irqs.
> > > > 
> > > > On a 144-core server with 4TiB of memory, destroying 32 guests
> > > > (each with 4 vcpus and 122GiB memory) simultaneously takes:
> > > > 
> > > > before the revert: 29 minutes
> > > > after the revert: 6 minutes
> > > 
> > > This wants comparing against numbers demonstrating the bad effects
> > > of the global domctl lock. Iirc they were quite a bit higher than
> > > 6 min, perhaps depending on guest properties.
> > 
> > Your original commit message doesn't contain any clue about the
> > cases in which the domctl lock was an issue. So please provide
> > information on the setups where you think this will make things
> > worse.
> 
> I never observed the issue myself - let's see whether one of the SUSE
> people possibly involved in this back then recalls (or has further
> pointers; Jim, Charles?), or whether any of the (partly former) Citrix
> folks do. My vague recollection is that the issue was the tool stack
> as a whole stalling for far too long, in particular when destroying
> very large guests. One important aspect not discussed in the commit
> message at all is that holding the domctl lock blocks basically _all_
> tool stack operations (including e.g. creation of new guests), whereas
> the new issue being addressed here is limited to just domain cleanup.

The best solution is to make the heap itself scalable instead of
relying on a global lock, but that is not going to be trivial.

Of course, another option is to keep the domctl lock dropped in
domain_kill() but introduce a dedicated domain_kill lock, so that
competing domain_kill()s try to take that lock and back off with a
hypercall continuation when they fail (see the sketch below). But this
is kind of hacky (we introduce a lock just to reduce spinlock
contention elsewhere), so it is probably not a solution but a
workaround.
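
To make the shape of that workaround concrete, here is a minimal
user-space sketch (pthreads, built with -pthread; none of the
identifiers below are real Xen ones) of trylock-plus-back-off, with a
retry return value standing in for the hypercall continuation:

/*
 * Minimal user-space sketch (not Xen code) of the "dedicated
 * domain_kill lock" idea: competing destroyers trylock and, on
 * contention, return a retry code instead of spinning, so only one
 * thread at a time walks the expensive teardown. All identifiers
 * below are hypothetical.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERESTART_SENTINEL 85  /* stand-in for a "please retry" return */

static pthread_mutex_t domain_kill_lock = PTHREAD_MUTEX_INITIALIZER;

/* One chunk of teardown work; real code would free a batch of pages. */
static void relinquish_some_resources(int domid)
{
    (void)domid;
    usleep(1000);
}

/*
 * Returns 0 once the domain is torn down, or ERESTART_SENTINEL when the
 * caller should back off and retry (a hypercall continuation in Xen).
 */
static int domain_kill_once(int domid)
{
    if (pthread_mutex_trylock(&domain_kill_lock) != 0)
        return ERESTART_SENTINEL;   /* contended: back off, don't spin */

    /* Simulated teardown; the real code would also check for preemption
     * periodically and could itself return early to be continued. */
    for (int chunk = 0; chunk < 10; chunk++)
        relinquish_some_resources(domid);

    pthread_mutex_unlock(&domain_kill_lock);
    return 0;
}

static void *destroyer(void *arg)
{
    int domid = (int)(long)arg;

    /* The retry loop models the toolstack re-issuing the continued
     * hypercall; between retries it could service events and IRQs. */
    while (domain_kill_once(domid) == ERESTART_SENTINEL)
        sched_yield();

    printf("domain %d destroyed\n", domid);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, destroyer, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

The point is only the back-off pattern: contenders return immediately
instead of spinning, so the lock holder finishes the expensive teardown
sooner and the backed-off callers can process events and irqs between
retries.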

Still, seeing concurrent guest destruction go from 6 to 29 minutes, I
wonder whether the benefit of that commit can outweigh such a dramatic
regression.

Hongyan




 

