Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached
On 15/10/2019 18:59, Sander Eikelenboom wrote:
> On 14/10/2019 17:03, Chao Gao wrote:
>> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>>> On 01/10/2019 12:35, Anthony PERARD wrote:
>>>> Rewrite of the commit message:
>>>>
>>>> Before the problematic commit, libxl used to ignore errors when
>>>> destroying (force == true) a passthrough device, especially errors that
>>>> happen when dealing with the DM.
>>>>
>>>> Since fae4880c45fe, if the DM fails to detach the PCI device within
>>>> the allowed time, the timeout error raised skips part of
>>>> pci_remove_*, but is also raised up to the caller of
>>>> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
>>>> destruction of the domain fails.
>>>>
>>>> In this patch, if the DM didn't confirm that the device is removed,
>>>> we print a warning and keep going if force=true. The patch
>>>> reorders the functions so that pci_remove_timeout() calls
>>>> pci_remove_detached() like it's done when DM calls are successful.
>>>>
>>>> We also clean up the QMP states and associated timeouts earlier, as soon
>>>> as they are not needed anymore.
>>>>
>>>> Reported-by: Sander Eikelenboom <linux@xxxxxxxxxxxxxx>
>>>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>>>> Signed-off-by: Anthony PERARD <anthony.perard@xxxxxxxxxx>
>>>>
>>>
>>> Hi Anthony / Chao,
>>>
>>> I have to come back to this, in part because perhaps there is an
>>> underlying issue. It had occurred to me earlier that the VM to which I
>>> pass through most PCI devices (8 to be exact) became very slow to shut
>>> down, but I didn't investigate it further.
>>>
>>> After your commit message for this patch it kept nagging, so today I did
>>> some testing and bisecting.
>>>
>>> The difference in tear-down time, at least from what the IOMMU code logs,
>>> is quite large:
>>>
>>> xen-4.12.0
>>> Setup:     7.452 s
>>> Tear-down: 7.626 s
>>>
>>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>> Setup:     7.468 s
>>> Tear-down: 50.239 s
>>>
>>> Bisection turned up:
>>>
>>> commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>> Author: Chao Gao <chao.gao@xxxxxxxxx>
>>> Date:   Fri Jul 19 10:24:08 2019 +0100
>>>
>>>     libxl_qmp: wait for completion of device removal
>>>
>>> Which makes me wonder if there is something going wrong in Qemu?
>
>> Hi Sander,
>
> Hi Chao,
>
>> Thanks for your testing and the bisection.
>>
>> I tried on my machine; the destruction time of a guest with 8 pass-thru
>> devices increased from 4s to 12s after applying the commit above.
>
> To which patch are you referring, Anthony's or
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40?
>
>> In my understanding, I guess you might get the error message "timed out
>> waiting for DM to remove...". There might be some issues with your assigned
>> devices' drivers. You could first unbind the devices from their drivers in
>> the VM and then tear down the VM, and check whether the VM teardown gets
>> much faster.

Sorry, I forgot to answer your question: I tried unbinding the drivers in the
guest prior to shutting it down, but it didn't make any difference.
--
Sander

> I get that error message when I test with Anthony's patch applied; the
> destruction time with that patch is low.
>
> However, my point was whether that patch is correct, in the sense that there
> seems to be an underlying issue which causes it to take so long.
> That issue was uncovered by c4b1ef0f89aa6a74faa4618ce3efed1de246ec40, so I'm
> not saying that commit is wrong in any sense; it just uncovered another issue
> that was already present, but hard to detect because we simply didn't wait at
> destruction time (and thus got the same effect as a timeout).
>
> One way or the other, that was only a minor issue until
> fae4880c45fe015e567afa223f78bf17a6d98e1b, where the long destruction time
> caused the domain destruction to stall. That was then fixed by Anthony's
> patch, but it uses a timeout, which kind of circumvents the issue instead of
> finding out where it comes from and solving it there (if that is possible,
> of course).
>
> And I wonder if Anthony's patch doesn't interfere with the case you made
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 for; if you get the timeout error
> message as well, then that is kind of not waiting for the destruction to
> finish, isn't it?
>
> Chao,
> could you perhaps test Xen for me with
> ee7170822f1fc209f33feb47b268bab35541351d as the latest commit?
> That is before Anthony's patch series, but after your
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40.
>
> I would expect to see longer destruction times in the case of 8 pass-through
> devices there as well.
>
> Unfortunately Qemu doesn't seem to do much verbose logging even when I enable
> the debug defines in hw/xen, especially for the destruction side of things
> (it mostly logs setting things up).
>
> --
> Sander
>
>> Anthony & Wei,
>>
>> The commit above basically serializes and synchronizes detaching
>> assigned devices and thus increases VM teardown time significantly if
>> there are multiple assigned devices. The commit aimed to avoid qemu's
>> access to PCI configuration space coinciding with the device reset
>> initiated by xl (which is not desired and is exactly the case which
>> triggers the assertion in Xen [1]). I personally insist that xl should
>> wait for the DM's completion of device detaching. Otherwise, besides a
>> Xen panic (which can be fixed in another way), in theory such a sudden,
>> unexpected device reset might cause a disaster (e.g. data loss for a
>> storage device).
>>
>> [1]:
>> https://lists.xenproject.org/archives/html/xen-devel/2019-09/msg03287.html
>>
>> But considering that fast creation and teardown is an important benefit of
>> virtualization, I am not sure how to deal with the situation. Anyway,
>> you can make the decision. To fix the regression on VM teardown, we can
>> revert the commit by removing the timeout logic.
>>
>> What's your opinion?
>>
>> Thanks
>> Chao
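[For readers skimming the archive, below is a minimal, self-contained sketch in
plain C of the removal flow being debated: wait for the device model to confirm
the removal, and on timeout either report the error or, when force is set, warn
and continue with the host-side detach, as Anthony's patch is described as
doing. The names (dm_confirm_removal, host_detach_device, pci_remove) are
hypothetical stand-ins for illustration, not the actual libxl code, and the
real libxl code is asynchronous (event callbacks), not a polling loop.]

    /* Standalone sketch of the teardown flow discussed above -- NOT libxl code. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical stand-in: ask the device model whether the device is gone. */
    static bool dm_confirm_removal(void)
    {
        return false;               /* simulate a DM that never answers in time */
    }

    /* Hypothetical stand-in: host-side detach (reset, deassign, ...). */
    static void host_detach_device(const char *bdf)
    {
        printf("detaching %s on the host side\n", bdf);
    }

    /*
     * Wait up to timeout_s for the DM to confirm removal.  On timeout:
     *   - force == true:  warn and continue with the host-side detach;
     *   - force == false: give up and report the error to the caller,
     *                     which would make the overall teardown fail.
     */
    static int pci_remove(const char *bdf, bool force, int timeout_s)
    {
        time_t deadline = time(NULL) + timeout_s;

        while (time(NULL) < deadline) {
            if (dm_confirm_removal()) {
                host_detach_device(bdf);
                return 0;
            }
            sleep(1);               /* poll; real code would use event callbacks */
        }

        if (!force) {
            fprintf(stderr, "error: DM did not remove %s in time\n", bdf);
            return -1;
        }

        fprintf(stderr, "warning: DM did not remove %s in time, continuing\n", bdf);
        host_detach_device(bdf);
        return 0;
    }

    int main(void)
    {
        /* force == true mirrors the destroy path. */
        return pci_remove("0000:03:00.0", true, 2) ? 1 : 0;
    }

[A timeout-and-continue approach like this keeps domain destruction from
failing outright, but, as discussed in the thread, it also means the device may
still be reset before the DM has actually released it.]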