Xen project Mailing List

Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled

From: Ben Guthro <ben@xxxxxxxxxx>

Date: Tue, 4 Jun 2013 17:09:59 -0400

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Tue, 04 Jun 2013 21:10:34 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Jun 4, 2013 at 3:49 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
> On Tue, Jun 4, 2013 at 3:20 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
>> On Tue, Jun 4, 2013 at 10:01 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>> On 04.06.13 at 14:25, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>> On Tue, Jun 4, 2013 at 4:54 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>> On 03.06.13 at 21:22, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>> On 03/06/13 19:29, Ben Guthro wrote:
>>>>>>> (XEN) Xen call trace:
>>>>>>> (XEN) [<ffff82c480149091>] invalidate_sync+0x258/0x291
>>>>>>> (XEN) [<ffff82c48014919d>] flush_iotlb_qi+0xd3/0xef
>>>>>>> (XEN) [<ffff82c480145a60>] iommu_flush_all+0xb5/0xde
>>>>>>> (XEN) [<ffff82c480145b08>] vtd_suspend+0x23/0xf1
>>>>>>> (XEN) [<ffff82c480141e12>] iommu_suspend+0x3c/0x3e
>>>>>>> (XEN) [<ffff82c48019f315>] enter_state_helper+0x1a0/0x3cb
>>>>>>> (XEN) [<ffff82c480105ed4>] continue_hypercall_tasklet_handler+0x51/0xbf
>>>>>>> (XEN) [<ffff82c480127a1e>] do_tasklet_work+0x8d/0xc7
>>>>>>> (XEN) [<ffff82c480127d89>] do_tasklet+0x6b/0x9b
>>>>>>> (XEN) [<ffff82c48015a42f>] idle_loop+0x67/0x6f
>>>>>>
>>>>>> This was likely broken by XSA-36
>>>>>>
>>>>>> My fix for the crash path is:
>>>>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=53fd1d8458de01169dfb56feb315f02c2b521a86
>>>>>>
>>>>>> You want to inspect the use of iommu_enabled and iommu_intremap.
>>>>>
>>>>> According to the comment in vtd_suspend(),
>>>>> iommu_disable_x2apic_IR() is supposed to run after
>>>>> iommu_suspend() (and indeed lapic_suspend() gets called
>>>>> immediately after iommu_suspend() by device_power_down()),
>>>>> and hence that shouldn't be the reason here. But, Ben, to be
>>>>> sure, dumping the state of the various IOMMU related enabling
>>>>> variables would be a good idea.
>>>>
>>>> I assume you are referring to the variables below, defined at the top of
>>>> iommu.c
>>>> At the time of the crash, they look like this:
>>>>
>>>> (XEN) iommu_enabled = 1
>>>> (XEN) force_iommu; = 0
>>>> (XEN) iommu_verbose; = 0
>>>> (XEN) iommu_workaround_bios_bug; = 0
>>>> (XEN) iommu_passthrough; = 0
>>>> (XEN) iommu_snoop = 0
>>>> (XEN) iommu_qinval = 1
>>>> (XEN) iommu_intremap = 1
>>>> (XEN) iommu_hap_pt_share = 0
>>>> (XEN) iommu_debug; = 0
>>>> (XEN) amd_iommu_perdev_intremap = 1
>>>>
>>>> If that gives any additional insight, please let me know.
>>>> I'm not sure I gleaned anything particularly significant from it though.
>>>>
>>>> Or - perhaps you are referring to other enabling variables?
>>>
>>> These were exactly the ones (or really you picked a superset of
>>> what I wanted to know the state of). To me this pretty clearly
>>> means that Andrew's original thought here is not applicable, as
>>> at this point we can't possibly have shut down qinval yet.
>>>
>>>>> Is this perhaps having some similarity with
>>>>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg00343.html?
>>>>> We're clearly running single-CPU only here and there...
>>>>
>>>> We certainly should be, as we have gone through the
>>>> disable_nonboot_cpus() by this point - and I can verify that from the
>>>> logs.
>>>
>>> I'm much more tending towards the connection here, noting that
>>> Andrew's original thread didn't really lead anywhere (i.e. we still
>>> don't know what the panic he saw was actually caused by).
>>>
>>
>> I'm starting to think you're on to something here.
>
> hmm - maybe not.
> I get the same crash with "maxcpus=1"
>
>
>
>> I've put a bunch of trace throughout the functions in qinval.c
>>
>> It seems that everything is functioning properly, up until we go
>> through the disable_nonboot_cpus() path.
>> Prior to this, I see the qinval.c functions being executed on all
>> cpus, and both drhd units
>> Afterward, it gets stuck in queue_invalidate_wait on the first drhd
>> unit.. and eventually panics.
>>
>> I'm not exactly sure what to make of this yet.

querying status of the hardware all seems to be working correctly...
it just doesn't work with querying the QINVAL_STAT_DONE state, as far
as I can tell.

Other register state is:

(XEN) VER = 10
(XEN) CAP = c0000020e60262
(XEN) n_fault_reg = 1
(XEN) fault_recording_offset = 200
(XEN) fault_recording_reg_l = 0
(XEN) fault_recording_reg_h = 0
(XEN) ECAP = f0101a
(XEN) GCMD = 0
(XEN) GSTS = c7000000
(XEN) RTADDR = 137a31000
(XEN) CCMD = 800000000000000
(XEN) FSTS = 0
(XEN) FECTL = 0
(XEN) FEDATA = 4128
(XEN) FEADDR = fee0000c
(XEN) FEUADDR = 0

(with code lifted from print_iommu_regs() )

None of this looks suspicious to my untrained eye - but I'm including
it here in case someone else sees something I don't.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
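
For anyone reading the register dump by hand, the GSTS value can be broken into its named status bits. The following is a minimal sketch, assuming the Global Status Register layout from the VT-d specification (status bits 31 down to 23). The `gsts_bits` table and the `gsts_decode()` helper are illustrative names and are not taken from the Xen tree.

```c
#include <stdint.h>
#include <stdio.h>

/* One named status bit of the VT-d Global Status Register (GSTS). */
struct gsts_bit {
    unsigned int shift;
    const char *name;
};

/* Assumed layout: bit positions as given in the VT-d specification. */
static const struct gsts_bit gsts_bits[] = {
    { 31, "TES"   },  /* translation enable status */
    { 30, "RTPS"  },  /* root table pointer status */
    { 29, "FLS"   },  /* fault log status */
    { 28, "AFLS"  },  /* advanced fault logging status */
    { 27, "WBFS"  },  /* write buffer flush status */
    { 26, "QIES"  },  /* queued invalidation enable status */
    { 25, "IRES"  },  /* interrupt remapping enable status */
    { 24, "IRTPS" },  /* interrupt remapping table pointer status */
    { 23, "CFIS"  },  /* compatibility format interrupt status */
};

/*
 * Write the space-separated names of all bits set in 'gsts' into 'buf'
 * (highest bit first) and return 'buf'. Output is cut short if 'buf'
 * is too small, but is always NUL-terminated when len > 0.
 */
static char *gsts_decode(uint32_t gsts, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(gsts_bits) / sizeof(gsts_bits[0]); i++) {
        if (!(gsts & (UINT32_C(1) << gsts_bits[i].shift)))
            continue;

        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", gsts_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* error or truncated: stop appending */
        used += (size_t)n;
    }
    return buf;
}
```

Calling `gsts_decode(0xc7000000, buf, sizeof(buf))` with the GSTS value from the dump gives "TES RTPS QIES IRES IRTPS": translation, queued invalidation and interrupt remapping all still report as enabled at the time of the crash, which matches iommu_qinval = 1 and iommu_intremap = 1 in the variable dump quoted earlier.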
