Re: [Xen-devel] Multi-bridged PCIe devices (Was: Re: iommuu/vt-d issues with LSI MegaSAS (PERC5i))



On Tue, Jan 07, 2014 at 12:42:17PM +0000, Gordan Bobic wrote:
> On 2014-01-07 12:15, Jan Beulich wrote:
> >>>>On 07.01.14 at 12:35, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> >>On 2014-01-07 11:26, Wu, Feng wrote:
> >>>>-----Original Message-----
> >>>>From: xen-devel-bounces@xxxxxxxxxxxxx
> >>>>[mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Gordan Bobic
> >>>>Sent: Tuesday, January 07, 2014 6:44 PM
> >>>>To: Andrew Cooper
> >>>>Cc: xen-devel@xxxxxxxxxxxxx
> >>>>Subject: Re: [Xen-devel] Multi-bridged PCIe devices (Was: Re:
> >>>>iommuu/vt-d
> >>>>issues with LSI MegaSAS (PERC5i))
> >>>>
> >>>>On 2014-01-07 10:38, Andrew Cooper wrote:
> >>>>> On 07/01/14 10:35, Gordan Bobic wrote:
> >>>>>> On 2014-01-07 03:17, Zhang, Yang Z wrote:
> >>>>>>> Konrad Rzeszutek Wilk wrote on 2014-01-07:
> >>>>>>>>> Which would look like this:
> >>>>>>>>>
> >>>>>>>>> C220 ---> Tundra Bridge -----> (HB6 PCI bridge -> Brooktree BDFs)
> >>>>>>>>> on the card
> >>>>>>>>>           \--------------> IEEE-1394a
> >>>>>>>>>
> >>>>>>>>> I am actually wondering if this 07:00.0 device is the one that
> >>>>>>>>> reports itself as 08:00.0 (which I think is what you alluding to
> >>>>>>>>> Jan)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> And to double check that theory I decided to pass in the IEEE-1394a
> >>>>>>>> to a guest:
> >>>>>>>>
> >>>>>>>>            +-1c.5-[07-08]----00.0-[08]----03.0  Texas Instruments
> >>>>>>>> TSB43AB22A IEEE-1394a-2000 Controller (PHY/Link) [iOHCI-Lynx]
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (XEN) [VT-D]iommu.c:885: iommu_fault_status: Fault Overflow (XEN)
> >>>>>>>> [VT-D]iommu.c:887: iommu_fault_status: Primary Pending Fault (XEN)
> >>>>>>>> [VT-D]iommu.c:865: DMAR:[DMA Read] Request device [0000:08:00.0]
> >>>>>>>> fault
> >>>>>>>> addr 370f1000, iommu reg = ffff82c3ffd53000 (XEN) DMAR:[fault reason
> >>>>>>>> 02h] Present bit in context entry is clear (XEN) print_vtd_entries:
> >>>>>>>> iommu ffff83083d4939b0 dev 0000:08:00.0 gmfn 370f1 (XEN)
> >>>>>>>> root_entry
> >>>>>>>> = ffff83083d47f000 (XEN)     root_entry[8] = 72569b001 (XEN)
> >>>>>>>> context
> >>>>>>>> = ffff83072569b000 (XEN)     context[0] = 0_0 (XEN)
> >>>>>>>> ctxt_entry[0]
> >>>>>>>> not present
> >>>>>>>>
> >>>>>>>> So, capture card OK - Likely the Tundra bridge has an issue:
> >>>>>>>>
> >>>>>>>> 07:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01)
> >>>>>>>> (prog-if 01 [Subtractive decode])
> >>>>>>>>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV-
> >>>>VGASnoop-
> >>>>>>>>         ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+
> >>>>>>>> 66MHz-
> >>>>>>>>         UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> >>>><MAbort+
> >>>>>>>>         >SERR- <PERR- INTx- Latency: 0 Bus: primary=07,
> >>>>>>>> secondary=08,
> >>>>>>>>         subordinate=08, sec-latency=32 Memory behind bridge:
> >>>>>>>>         f0600000-f06fffff Secondary status: 66MHz+ FastB2B+ ParErr-
> >>>>>>>>         DEVSEL=medium TAbort- <TAbort- <MAbort+ <SERR- <PERR-
> >>>>>>>> BridgeCtl:
> >>>>>>>>         Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> >>>>>>>>                 PriDiscTmr- SecDiscTmr- DiscTmrStat-
> >>>>DiscTmrSERREn-
> >>>>>>>>         Capabilities: [60] Subsystem: Super Micro Computer Inc
> >>>>>>>> Device 0805
> >>>>>>>>         Capabilities: [a0] Power Management version 3
> >>>>>>>>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> >>>>>>>>                 PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0
> >>>>>>>> NoSoftRst+
> >>>>>>>>                 PME-Enable- DSel=0 DScale=0 PME-
> >>>>>>>>
> >>>>>>>> or there is some unknown bridge in the motherboard.
> >>>>>>>
> >>>>>>> According your description above, the upstream Linux should also have
> >>>>>>> the same problem. Did you see it with upstream Linux?

I did not even think to test that. Sadly, I won't be able to do many
reboots/shutdowns, as this is a production machine.

> >>>>>>
> >>>>>> The problem I was seeing with LSI cards (phantom device doing DMA)
> >>>>>> does, indeed, also occur in upstream Linux. If I enable intel-iommu on
> >>>>>> bare metal Linux, the same problem occurs as with Xen.
> >>>>>>
> >>>>>>> There may be some buggy device that generate DMA request with
> >>>>>>> internal
> >>>>>>> BDF but it didn't expose it(not like Phantom device). For those
> >>>>>>> devices, I think we need to setup the VT-d page table manually.
> >>>>>>
> >>>>>> I think what is needed is a pci-phantom style override that tells the
> >>>>>> hypervisor to tell the IOMMU to allow DMA traffic from a specific
> >>>>>> invisible device ID.
> >>>>>>
> >>>>>> Gordan
> >>>>>
> >>>>> There is.  See "pci-phantom" in
> >>>>> http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
> >>>>
> >>>>I thought this was only applicable to phantom _functions_ (number
> >>>>after
> >>>>the
> >>>>dot) rather than whole phantom _devices_. Is that not the case?
> >>>
> >>>I think that's right. I go through the related code for the pci
> >>>phantom device just now, I find that
> >>>the information of command line 'pci-phantom' is stored in variable '
> >>>phantom_devs[8] '
> >>>with type of struct phantom_dev{}. This variable is used in function
> >>>alloc_pdev() as follow:
> >>>
> >>>
> >>>                for ( i = 0; i < nr_phantom_devs; ++i )
> >>>                    if ( phantom_devs[i].seg == pseg->nr &&
> >>>                         phantom_devs[i].bus == bus &&
> >>>                         phantom_devs[i].slot == PCI_SLOT(devfn) &&
> >>>                         phantom_devs[i].stride > PCI_FUNC(devfn) )
> >>>                    {
> >>>                        pdev->phantom_stride =
> >>>phantom_devs[i].stride;
> >>>                        break;
> >>>                    }
> >>>
> >>>So from the code, we can see this command line only works for phantom
> >>>_function_, not for whole phantom _devices_.
> >>
> >>What would it take to make it work for a whole phantom device?
> >
> >First and foremost a definition of what a phantom device is and
> >how one would behave. Once again - phantom functions are part
> >of the PCIe specification, so those don't require a definition.
> 
> Konrad's patch from a while back seemed to do the required thing to
> allow an otherwise invisible/undetected device to do DMA transfers
> without freaking out the IOMMU that doesn't know about it.

Except it didn't work :-) That was the first thing I tried with this
motherboard. And it looks like there are extra things I would need
to modify in the hypervisor for it to work (like making the
hypervisor create a fake PCI device with BARs and such).

Which is actually what I was going to try out - see if I can make it
(the hypervisor) add a PCI device for a non-existent PCI device (one
that does not show up in the PCI configuration scan).

That requires knowing the MMIO BARs the 'fake' device has, and
.. well, whatever else the Intel VT-d code requires.

For reference, here is the code that Gordan was mentioning:


#include <linux/module.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/stat.h>
#include <linux/err.h>
#include <linux/ctype.h>
#include <linux/slab.h>
#include <linux/limits.h>
#include <linux/device.h>
#include <linux/pci.h>

#include <xen/interface/xen.h>
#include <xen/interface/physdev.h>

#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>

#define LSI_HACK  "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>");
MODULE_DESCRIPTION("lsi hack");
MODULE_LICENSE("GPL");
MODULE_VERSION(LSI_HACK);

static int __init lsi_hack_init(void)
{
        /* Ask Xen to register 08:00.0, the requester ID in the VT-d fault. */
        struct physdev_manage_pci manage_pci = {
                .bus    = 0x8,
                .devfn  = PCI_DEVFN(0, 0),
        };

        return HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add,
                        &manage_pci);
}

static void __exit lsi_hack_exit(void)
{
        /* Undo the registration done in lsi_hack_init(). */
        struct physdev_manage_pci manage_pci = {
                .bus    = 0x8,
                .devfn  = PCI_DEVFN(0, 0),
        };
        int r;

        r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove,
                        &manage_pci);
        if (r)
                printk(KERN_ERR "%s: %d\n", __func__, r);
}

module_init(lsi_hack_init);
module_exit(lsi_hack_exit);
