
Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled



On Tue, Jun 4, 2013 at 3:49 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
> On Tue, Jun 4, 2013 at 3:20 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
>> On Tue, Jun 4, 2013 at 10:01 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>> On 04.06.13 at 14:25, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>> On Tue, Jun 4, 2013 at 4:54 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>> On 03.06.13 at 21:22, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>> On 03/06/13 19:29, Ben Guthro wrote:
>>>>>>> (XEN) Xen call trace:
>>>>>>> (XEN)    [<ffff82c480149091>] invalidate_sync+0x258/0x291
>>>>>>> (XEN)    [<ffff82c48014919d>] flush_iotlb_qi+0xd3/0xef
>>>>>>> (XEN)    [<ffff82c480145a60>] iommu_flush_all+0xb5/0xde
>>>>>>> (XEN)    [<ffff82c480145b08>] vtd_suspend+0x23/0xf1
>>>>>>> (XEN)    [<ffff82c480141e12>] iommu_suspend+0x3c/0x3e
>>>>>>> (XEN)    [<ffff82c48019f315>] enter_state_helper+0x1a0/0x3cb
>>>>>>> (XEN)    [<ffff82c480105ed4>] continue_hypercall_tasklet_handler+0x51/0xbf
>>>>>>> (XEN)    [<ffff82c480127a1e>] do_tasklet_work+0x8d/0xc7
>>>>>>> (XEN)    [<ffff82c480127d89>] do_tasklet+0x6b/0x9b
>>>>>>> (XEN)    [<ffff82c48015a42f>] idle_loop+0x67/0x6f
>>>>>>
>>>>>> This was likely broken by XSA-36
>>>>>>
>>>>>> My fix for the crash path is:
>>>>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=53fd1d8458de01169dfb56feb315f02c2b521a86
>>>>>>
>>>>>> You want to inspect the use of iommu_enabled and iommu_intremap.
>>>>>
>>>>> According to the comment in vtd_suspend(),
>>>>> iommu_disable_x2apic_IR() is supposed to run after
>>>>> iommu_suspend() (and indeed lapic_suspend() gets called
>>>>> immediately after iommu_suspend() by device_power_down()),
>>>>> and hence that shouldn't be the reason here. But, Ben, to be
>>>>> sure, dumping the state of the various IOMMU related enabling
>>>>> variables would be a good idea.
>>>>
>>>> I assume you are referring to the variables below, defined at the top of
>>>> iommu.c
>>>> At the time of the crash, they look like this:
>>>>
>>>> (XEN) iommu_enabled = 1
>>>> (XEN) force_iommu; = 0
>>>> (XEN) iommu_verbose; = 0
>>>> (XEN) iommu_workaround_bios_bug; = 0
>>>> (XEN) iommu_passthrough; = 0
>>>> (XEN) iommu_snoop = 0
>>>> (XEN) iommu_qinval = 1
>>>> (XEN) iommu_intremap = 1
>>>> (XEN) iommu_hap_pt_share = 0
>>>> (XEN) iommu_debug; = 0
>>>> (XEN) amd_iommu_perdev_intremap = 1
>>>>
>>>> If that gives any additional insight, please let me know.
>>>> I'm not sure I gleaned anything particularly significant from it though.
>>>>
>>>> Or - perhaps you are referring to other enabling variables?
>>>
>>> These were exactly the ones (or really you picked a superset of
>>> what I wanted to know the state of). To me this pretty clearly
>>> means that Andrew's original thought here is not applicable, as
>>> at this point we can't possibly have shut down qinval yet.
>>>
>>>>> Is this perhaps having some similarity with
>>>>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg00343.html?
>>>>> We're clearly running single-CPU only here and there...
>>>>
>>>> We certainly should be, as we have gone through the
>>>> disable_nonboot_cpus() by this point - and I can verify that from the
>>>> logs.
>>>
>>> I'm much more tending towards the connection here, noting that
>>> Andrew's original thread didn't really lead anywhere (i.e. we still
>>> don't know what the panic he saw was actually caused by).
>>>
>>
>> I'm starting to think you're on to something here.
>
> hmm - maybe not.
> I get the same crash with "maxcpus=1"
>
>
>
>> I've put a bunch of trace throughout the functions in qinval.c
>>
>> It seems that everything is functioning properly, up until we go
>> through the disable_nonboot_cpus() path.
>> Prior to this, I see the qinval.c functions being executed on all
>> cpus, and on both drhd units.
>> Afterward, it gets stuck in queue_invalidate_wait on the first drhd
>> unit, and eventually panics.
>>
>> I'm not exactly sure what to make of this yet.

Querying the hardware status registers all seems to work correctly.
It is only the poll for the QINVAL_STAT_DONE state that never succeeds,
as far as I can tell.

Other register state is:

(XEN)  VER = 10
(XEN)  CAP = c0000020e60262
(XEN)  n_fault_reg = 1
(XEN)  fault_recording_offset = 200
(XEN)  fault_recording_reg_l = 0
(XEN)  fault_recording_reg_h = 0
(XEN)  ECAP = f0101a
(XEN)  GCMD = 0
(XEN)  GSTS = c7000000
(XEN)  RTADDR = 137a31000
(XEN)  CCMD = 800000000000000
(XEN)  FSTS = 0
(XEN)  FECTL = 0
(XEN)  FEDATA = 4128
(XEN)  FEADDR = fee0000c
(XEN)  FEUADDR = 0

(with code lifted from print_iommu_regs() )


None of this looks suspicious to my untrained eye - but I'm including
it here in case someone else sees something I don't.
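
To make the dump above easier to read, here is a small sketch that decodes
the GSTS value and a few ECAP bits. It assumes the values are printed in
hex, and it uses the bit positions as I understand them from the VT-d
specification (GSTS bits 31..23, ECAP bits 1, 3 and 4). The macro and
function names are made up for this example and are not the ones in the
Xen headers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Global Status Register bits (positions assumed from the VT-d spec). */
#define GSTS_TES   (1u << 31)  /* translation enabled */
#define GSTS_RTPS  (1u << 30)  /* root table pointer set */
#define GSTS_FLS   (1u << 29)  /* fault log set */
#define GSTS_AFLS  (1u << 28)  /* advanced fault logging enabled */
#define GSTS_WBFS  (1u << 27)  /* write buffer flush in progress */
#define GSTS_QIES  (1u << 26)  /* queued invalidation enabled */
#define GSTS_IRES  (1u << 25)  /* interrupt remapping enabled */
#define GSTS_IRTPS (1u << 24)  /* interrupt remap table pointer set */
#define GSTS_CFIS  (1u << 23)  /* compatibility format interrupts */

/* Extended Capability Register bits used in this thread. */
#define ECAP_QI    (1u << 1)   /* queued invalidation supported */
#define ECAP_IR    (1u << 3)   /* interrupt remapping supported */
#define ECAP_EIM   (1u << 4)   /* extended interrupt mode supported */

/* True if any bit of mask is set in reg. */
static bool bit_set(uint64_t reg, uint64_t mask)
{
    return (reg & mask) != 0;
}

/* Print one line per GSTS flag with its current state. */
static void print_gsts(uint32_t gsts)
{
    printf("TES=%d RTPS=%d FLS=%d AFLS=%d WBFS=%d\n",
           bit_set(gsts, GSTS_TES), bit_set(gsts, GSTS_RTPS),
           bit_set(gsts, GSTS_FLS), bit_set(gsts, GSTS_AFLS),
           bit_set(gsts, GSTS_WBFS));
    printf("QIES=%d IRES=%d IRTPS=%d CFIS=%d\n",
           bit_set(gsts, GSTS_QIES), bit_set(gsts, GSTS_IRES),
           bit_set(gsts, GSTS_IRTPS), bit_set(gsts, GSTS_CFIS));
}
```

Under those assumptions, GSTS = c7000000 has TES, RTPS, QIES, IRES and
IRTPS set, so translation, queued invalidation and interrupt remapping all
still read as enabled in hardware at this point. ECAP = f0101a has the QI,
IR and EIM bits set.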
