Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached
On Tue, Oct 15, 2019 at 06:59:37PM +0200, Sander Eikelenboom wrote:
>On 14/10/2019 17:03, Chao Gao wrote:
>> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>>> On 01/10/2019 12:35, Anthony PERARD wrote:
>>>> Rewrite of the commit message:
>>>>
>>>> Before the problematic commit, libxl used to ignore errors when
>>>> destroying (force == true) a passthrough device, especially errors that
>>>> happen when dealing with the DM.
>>>>
>>>> Since fae4880c45fe, if the DM fails to detach the PCI device within
>>>> the allowed time, the timeout error not only skips part of
>>>> pci_remove_*, but is also raised up to the caller of
>>>> libxl__device_pci_destroy_all (libxl__destroy_domid), and thus the
>>>> destruction of the domain fails.
>>>>
>>>> In this patch, if the DM didn't confirm that the device is removed,
>>>> we print a warning and keep going if force=true. The patch
>>>> reorders the functions so that pci_remove_timeout() calls
>>>> pci_remove_detached(), like it's done when the DM calls are successful.
>>>>
>>>> We also clean up the QMP state and associated timeouts earlier, as soon
>>>> as they are no longer needed.
>>>>
>>>> Reported-by: Sander Eikelenboom <linux@xxxxxxxxxxxxxx>
>>>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>>>> Signed-off-by: Anthony PERARD <anthony.perard@xxxxxxxxxx>
>>>>
>>>
>>> Hi Anthony / Chao,
>>>
>>> I have to come back to this, because perhaps there is an underlying
>>> issue. It had earlier occurred to me that the VM to which I pass through
>>> most PCI devices (8, to be exact) had become very slow to shut down, but
>>> I didn't investigate it further.
>>>
>>> But after your commit message for this patch it kept nagging, so today I
>>> did some testing and bisecting.
>>>
>>> The difference in tear-down time, at least from what the IOMMU code
>>> logs, is quite large:
>>>
>>> xen-4.12.0
>>> Setup: 7.452 s
>>> Tear-down: 7.626 s
>>>
>>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>> Setup: 7.468 s
>>> Tear-down: 50.239 s
>>>
>>> Bisection turned up:
>>>
>>> commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>> Author: Chao Gao <chao.gao@xxxxxxxxx>
>>> Date:   Fri Jul 19 10:24:08 2019 +0100
>>>
>>>     libxl_qmp: wait for completion of device removal
>>>
>>> Which makes me wonder if something is going wrong in QEMU?
>>
>> Hi Sander,
>
>Hi Chao,
>
>>
>> Thanks for your testing and the bisection.
>>
>> I tried on my machine; the destruction time of a guest with 8 pass-through
>> devices increased from 4s to 12s after applying the commit above.
>
>To which patch are you referring, Anthony's or
>c4b1ef0f89aa6a74faa4618ce3efed1de246ec40?

The latter.

>
>> In my understanding, I guess you might get the error message "timed out
>> waiting for DM to remove...". There might be some issues in your assigned
>> devices' drivers. You can first unbind the devices from their drivers in
>> the VM and then tear down the VM, and check whether the VM teardown gets
>> much faster.
>
>I get that error message when I test with Anthony's patch applied; the
>destruction time with that patch is low.
>
>However, my point was whether that patch is correct, in the sense that
>there seems to be an underlying issue which causes it to take so long.
>That issue was uncovered by c4b1ef0f89aa6a74faa4618ce3efed1de246ec40, so
>I'm not saying that commit is wrong in any sense; it just uncovered another
>issue that was already present, but hard to detect as we just didn't wait
>at destruction time (and thus the same effect as a timeout).
Actually, it is introduced by c4b1ef0f89, though it did fix another issue.

>
>One way or the other, that was just a minor issue until
>fae4880c45fe015e567afa223f78bf17a6d98e1b, where the long destruction time
>now caused the domain destruction to stall. That was then fixed by
>Anthony's patch, but it uses a timeout, which kind of circumvents the
>issue instead of finding out where it comes from and solving it there
>(if that is possible, of course).
>
>And I wonder whether Anthony's patch doesn't interfere with the case you
>made c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 for; if you get the timeout
>error message as well, then that is kind of not waiting for the
>destruction to finish, isn't it?
>
>Chao,
>could you perhaps test for me Xen with
>ee7170822f1fc209f33feb47b268bab35541351d as the latest commit?
>That is before Anthony's patch series, but after your
>c4b1ef0f89aa6a74faa4618ce3efed1de246ec40.

That's actually what I did. VM teardown with 8 pass-through devices on my
side takes 12s; it took only 4s without my patch.

Thanks
Chao
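
The force-path behaviour discussed in the commit message above works out to
roughly the following (a minimal sketch in libxl's C style, not the actual
patch: pci_remove_timeout() and pci_remove_detached() are named in the
commit message, but the state struct, its fields, and the
pci_remove_qmp_cleanup() helper here are illustrative assumptions):

    /* Sketch only; assumes libxl_internal.h helpers (libxl__ev_time
     * callbacks, CONTAINER_OF, LOGD) and a hypothetical pci_remove_state. */
    static void pci_remove_timeout(libxl__egc *egc, libxl__ev_time *ev,
                                   const struct timeval *abs, int rc)
    {
        pci_remove_state *prs = CONTAINER_OF(ev, *prs, timeout);

        /* The DM did not confirm removal of the device in time. */
        LOGD(WARN, prs->domid,
             "timed out waiting for DM to remove the PCI device");

        /* Dispose of the QMP connection and timer as soon as they are no
         * longer needed (illustrative helper) ... */
        pci_remove_qmp_cleanup(egc, prs);

        /* ... then take the same exit path as a successful DM call.
         * With force == true the timeout is demoted to a warning instead
         * of failing the whole domain destruction. */
        pci_remove_detached(egc, prs, prs->force ? 0 : rc);
    }

The key design point of the thread is visible in the last line: with
force=true the error is logged but not propagated, so a slow or stuck DM
can no longer make libxl__destroy_domid fail, at the cost of papering over
whatever makes the detach slow in the first place.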