
Re: [Xen-devel] [PATCH v4 RFC 0/6] x86/MSI: XSA-120, 126, 128-131 follow-up



>>> On 22.06.15 at 16:38, <JBeulich@xxxxxxxx> wrote:
> Only patches 1, 2, and 6 are really RFC (with some debugging code still
> left in), the others (hence v4) have been submitted before. 
> 
> 1: PCI: add config space write abstract intercept logic
> 2: MSI-X: track host and guest mask-all requests separately
> 3: MSI-X: be more careful during teardown
> 4: MSI-X: access MSI-X table only after having enabled MSI-X
> 5: MSI-X: reduce fiddling with control register during restore
> 6: MSI: properly track guest masking requests
> 
> A fundamental question is whether to go without the so far missing
> (and more involved) MMCFG intercepts. This largely depends on
> whether there are any halfway recent Dom0 OSes using MMCFG
> accesses for the base 256 bytes of PCI config space.

Below (and attached) is a 7th patch dealing with MMCFG accesses for
devices we know about _before_ Dom0 sets up its mapping of
MMCFG space. As noted at its top, this means it still won't
cover a number of cases (not mentioned there is the case of
Dom0 re-organizing bus numbering, e.g. in order to fit in SR-IOV
virtual functions).
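
For reference, here is a minimal standalone sketch (illustration
only, not Xen code) of the MMCFG address layout everything below
relies on, with one 4KiB config page per function:

#include <stdint.h>
#include <stdio.h>

/* MMCFG places the config space of (bus, dev, fn) at
 * base + (bus << 20 | dev << 15 | fn << 12). */
static void mmcfg_decode(uint64_t base, uint64_t addr)
{
    uint64_t off = addr - base;

    printf("%02x:%02x.%x reg %03x\n",
           (unsigned int)(off >> 20) & 0xff,  /* bus */
           (unsigned int)(off >> 15) & 0x1f,  /* device */
           (unsigned int)(off >> 12) & 0x7,   /* function */
           (unsigned int)(off & 0xfff));      /* register */
}

int main(void)
{
    /* Decodes as 03:02.1 reg 010 (assumed base 0xe0000000). */
    mmcfg_decode(0xe0000000, 0xe0311010);
    return 0;
}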

There are two approaches I can see to deal with these cases,
neither of which I particularly like:

1) Globally intercept all MMCFG writes (i.e. independent of
whether there's anything we actually need to see, like the MSI/
MSI-X mask bits that are of interest here). What makes me dislike
this model is the increased involvement of the x86 instruction
emulator, plus the increased chance of having to deal with
accesses other than plain stores (see the hypothetical example
below).
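
For instance (purely hypothetical, not part of the patch): Dom0
could set an MSI mask bit with a locked read-modify-write rather
than a plain 32-bit store, in which case the emulator would have
to carry out the read side of the access as well:

#include <stdint.h>

/*
 * Hypothetical illustration only: a locked RMW of a mask bit in a
 * mapped config page. Unlike a plain store, emulating this requires
 * performing the MMIO read too.
 */
static void mask_msi(volatile uint32_t *mask_reg)
{
    __sync_fetch_and_or(mask_reg, 1); /* emitted as "lock or" on x86 */
}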

2) Track Dom0 mappings of MMCFG space: We'd need to set up
(if not covered by the frame table) struct page_info for all
involved pages, and either overload it to store two or three
pointers back to the L1 page table slots where the respective page
is being mapped, or (if two or three mappings per page don't
suffice) attach a linked list of such pointers. I think it goes
without saying that the complexity of getting this overload
right (namely, not conflicting with any existing uses) as well
as the need to not lose track of any mappings are the reasons
I'm not really keen on this one. A rough sketch of the
bookkeeping follows below.
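
Purely illustrative, with made-up names (and glossing over the
hard part, namely fitting this into struct page_info without
clashing with its existing uses):

#include <stdint.h>

typedef uint64_t l1_pgentry_t; /* stand-in for Xen's L1 PTE type */

/*
 * Per tracked MMCFG page: back pointers to the L1 slots currently
 * mapping it, so write permission can be adjusted in place once a
 * device turns out to need interception.
 */
struct mmcfg_backptrs {
    l1_pgentry_t *slot[3];           /* common case: few mappings/page */
    struct mmcfg_backptrs *overflow; /* chained when three don't suffice */
};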

Models I considered and found unsuitable (at least those I can
recall right now):

3) Hunt down Dom0 mappings at the time we learn we need to
intercept writes for a particular device. The overhead of having
to scan all Dom0 page tables seems prohibitive for such a solution.

4) Initially disallow writes to MMCFG pages globally, granting
write permission upon first write access when we can determine
that the device doesn't need any "snooping". This wouldn't cope
with hotplug (and likewise SR-IOV VFs), as we might end up
needing to revoke write access.

5) Mandating Dom0 to use hypervisor-managed mappings of
MMCFG space. For one, this would require all Dom0 kernels to be
changed. It also wouldn't be usable for 32-bit Dom0, as
there's no VA space left to place such a mapping, and even if we
recycled some of the compat M2P space the range wouldn't be
large enough to cover even a single segment's space, let alone
multiple segments. And introducing a windowing model
would make Dom0 changes even more complex (for little benefit,
considering that we'd generally expect most people to use 64-bit
kernels these days).

6) Mandating at least config space writes to be done via hypercall.
This again would require all Dom0 kernels to be changed.

Hence, if anyone has a better idea, please speak up soon. And of
course also speak up if you have a particular preference between
the two viable options presented.

Jan

TODO: currently dependent upon IOMMU being enabled (due to bus scan only 
happening in that case)
TODO: SR-IOV (and the like?) devices not visible at time of MMCFG registration
TODO: remove //temp-s

--- unstable.orig/xen/arch/x86/mm.c
+++ unstable/xen/arch/x86/mm.c
@@ -5208,6 +5208,7 @@ int ptwr_do_page_fault(struct vcpu *v, u
 
     /* We are looking only for read-only mappings of p.t. pages. */
     if ( ((l1e_get_flags(pte) & (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT) ||
+         rangeset_contains_singleton(mmio_ro_ranges, l1e_get_pfn(pte)) ||
          !get_page_from_pagenr(l1e_get_pfn(pte), d) )
         goto bail;
 
@@ -5255,6 +5256,7 @@ int ptwr_do_page_fault(struct vcpu *v, u
 struct mmio_ro_emulate_ctxt {
     struct x86_emulate_ctxt ctxt;
     unsigned long cr2;
+    unsigned int seg, bdf;
 };
 
 static int mmio_ro_emulated_read(
@@ -5294,6 +5296,50 @@ static const struct x86_emulate_ops mmio
     .write      = mmio_ro_emulated_write,
 };
 
+static int mmio_intercept_write(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    struct mmio_ro_emulate_ctxt *mmio_ctxt =
+        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
+
+    /*
+     * Only allow naturally-aligned stores no wider than 4 bytes to the
+     * original %cr2 address.
+     */
+    if ( ((bytes | offset) & (bytes - 1)) || bytes > 4 ||
+         offset != mmio_ctxt->cr2 )
+    {
+        MEM_LOG("mmio_intercept: bad write (cr2=%lx, addr=%lx, bytes=%u)",
+                mmio_ctxt->cr2, offset, bytes);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    offset &= 0xfff;
+printk("%04x:%02x:%02x.%u write [%lx] %0*x\n",//temp
+       mmio_ctxt->seg, PCI_BUS(mmio_ctxt->bdf), PCI_SLOT(mmio_ctxt->bdf), 
PCI_FUNC(mmio_ctxt->bdf),//temp
+       offset, bytes * 2, *(uint32_t*)p_data);//temp
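+    /* Let the MSI/MSI-X logic adjust the value being written, then
+     * carry out the actual (possibly adjusted) config space write. */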
+    pci_conf_write_intercept(mmio_ctxt->seg, mmio_ctxt->bdf, offset, bytes,
+                             p_data);
+printk("%04x:%02x:%02x.%u write [%lx]=%0*x\n",//temp
+       mmio_ctxt->seg, PCI_BUS(mmio_ctxt->bdf), PCI_SLOT(mmio_ctxt->bdf), 
PCI_FUNC(mmio_ctxt->bdf),//temp
+       offset, bytes * 2, *(uint32_t*)p_data);//temp
+    pci_mmcfg_write(mmio_ctxt->seg, PCI_BUS(mmio_ctxt->bdf),
+                    PCI_DEVFN2(mmio_ctxt->bdf), offset, bytes,
+                    *(uint32_t *)p_data);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct x86_emulate_ops mmio_intercept_ops = {
+    .read       = mmio_ro_emulated_read,
+    .insn_fetch = ptwr_emulated_read,
+    .write      = mmio_intercept_write,
+};
+
 /* Check if guest is trying to modify a r/o MMIO page. */
 int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
                           struct cpu_user_regs *regs)
@@ -5308,6 +5354,7 @@ int mmio_ro_do_page_fault(struct vcpu *v
         .ctxt.swint_emulate = x86_swint_emulate_none,
         .cr2 = addr
     };
+    const unsigned long *map;
     int rc;
 
     /* Attempt to read the PTE that maps the VA being accessed. */
@@ -5332,7 +5379,14 @@ int mmio_ro_do_page_fault(struct vcpu *v
     if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
         return 0;
 
-    rc = x86_emulate(&mmio_ro_ctxt.ctxt, &mmio_ro_emulate_ops);
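+    /*
+     * Writes to devices marked for interception (and not fully r/o)
+     * get snooped and propagated; everything else keeps the plain
+     * r/o handling.
+     */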
+    if ( pci_mmcfg_decode(mfn, &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) &&
+         (map = pci_get_intercept_map(mmio_ro_ctxt.seg)) != NULL &&
+         test_bit(mmio_ro_ctxt.bdf, map) &&
+         ((map = pci_get_ro_map(mmio_ro_ctxt.seg)) == NULL ||
+          !test_bit(mmio_ro_ctxt.bdf, map)) )
+        rc = x86_emulate(&mmio_ro_ctxt.ctxt, &mmio_intercept_ops);
+    else
+        rc = x86_emulate(&mmio_ro_ctxt.ctxt, &mmio_ro_emulate_ops);
 
     return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
 }
--- unstable.orig/xen/arch/x86/pci.c
+++ unstable/xen/arch/x86/pci.c
@@ -75,6 +75,15 @@ int pci_conf_write_intercept(unsigned in
     struct pci_dev *pdev;
     int rc = 0;
 
+    if ( !data ) /* probe */
+    {
+         ASSERT(!~reg && !size);
+         return pci_find_cap_offset(seg, PCI_BUS(bdf), PCI_SLOT(bdf),
+                                    PCI_FUNC(bdf), PCI_CAP_ID_MSI) ||
+                pci_find_cap_offset(seg, PCI_BUS(bdf), PCI_SLOT(bdf),
+                                    PCI_FUNC(bdf), PCI_CAP_ID_MSIX);
+    }
+
     /*
      * Avoid expensive operations when no hook is going to do anything
      * for the access anyway.
--- unstable.orig/xen/arch/x86/x86_64/mmconfig_64.c
+++ unstable/xen/arch/x86/x86_64/mmconfig_64.c
@@ -154,10 +154,25 @@ void arch_pci_ro_device(int seg, int bdf
     }
 }
 
+static void set_ro(const struct acpi_mcfg_allocation *cfg,
+                   const unsigned long *map)
+{
+    unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
+    unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
+
+    if (!map)
+        return;
+
+    while ((bdf = find_next_bit(map, end + 1, bdf)) <= end) {
+        arch_pci_ro_device(cfg->pci_segment, bdf);
+        if (bdf++ == end)
+            break;
+    }
+}
+
 int pci_mmcfg_arch_enable(unsigned int idx)
 {
     const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
-    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
 
     if (pci_mmcfg_virt[idx].virt)
         return 0;
@@ -169,16 +184,10 @@ int pci_mmcfg_arch_enable(unsigned int i
     }
     printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
            cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
-    if (ro_map) {
-        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
-        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
-
-        while ((bdf = find_next_bit(ro_map, end + 1, bdf)) <= end) {
-            arch_pci_ro_device(cfg->pci_segment, bdf);
-            if (bdf++ == end)
-                break;
-        }
-    }
+
+    set_ro(cfg, pci_get_ro_map(cfg->pci_segment));
+    set_ro(cfg, pci_get_intercept_map(cfg->pci_segment));
+
     return 0;
 }
 
@@ -197,6 +206,32 @@ void pci_mmcfg_arch_disable(unsigned int
            cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
 }
 
+bool_t pci_mmcfg_decode(unsigned long mfn, unsigned int *seg,
+                        unsigned int *bdf)
+{
+    unsigned int idx;
+
+    for (idx = 0; idx < pci_mmcfg_config_num; ++idx) {
+        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
+
+        if (pci_mmcfg_virt[idx].virt &&
+            mfn >= PFN_DOWN(cfg->address + (cfg->start_bus_number << 20)) &&
+            mfn < PFN_DOWN(cfg->address + ((cfg->end_bus_number + 1) << 20))) {
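+            /* One 4k config page per function: the page index into
+             * this region is exactly the device's BDF. */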
+static unsigned long cnt, thr;//temp
+            *seg = cfg->pci_segment;
+            *bdf = mfn - PFN_DOWN(cfg->address);
+if(++cnt > thr) {//temp
+ thr |= cnt;
+ printk("%lx -> %04x:%02x:%02x.%u\n", mfn, *seg, PCI_BUS(*bdf), PCI_SLOT(*bdf), PCI_FUNC(*bdf));
+}
+            return 1;
+        }
+    }
+
+printk("%lx -> ???\n", mfn);//temp
+    return 0;
+}
+
 int __init pci_mmcfg_arch_init(void)
 {
     int i;
--- unstable.orig/xen/drivers/passthrough/pci.c
+++ unstable/xen/drivers/passthrough/pci.c
@@ -39,6 +39,7 @@ struct pci_seg {
     struct list_head alldevs_list;
     u16 nr;
     unsigned long *ro_map;
+    unsigned long *intercept_map;
     /* bus2bridge_lock protects bus2bridge array */
     spinlock_t bus2bridge_lock;
 #define MAX_BUSES 256
@@ -123,6 +124,13 @@ const unsigned long *pci_get_ro_map(u16 
     return pseg ? pseg->ro_map : NULL;
 }
 
+const unsigned long *pci_get_intercept_map(u16 seg)
+{
+    struct pci_seg *pseg = get_pseg(seg);
+
+    return pseg ? pseg->intercept_map : NULL;
+}
+
 static struct phantom_dev {
     u16 seg;
     u8 bus, slot, stride;
@@ -284,6 +292,20 @@ static struct pci_dev *alloc_pdev(struct
     pdev->domain = NULL;
     INIT_LIST_HEAD(&pdev->msi_list);
 
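+    /*
+     * reg == ~0 together with a NULL data pointer is the probe form
+     * of pci_conf_write_intercept(): "will writes to this device ever
+     * need snooping?" (cf. the xen/arch/x86/pci.c hunk above).
+     */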
+    if ( !pseg->intercept_map &&
+         pci_conf_write_intercept(pseg->nr, PCI_BDF2(bus, devfn), ~0, 0,
+                                  NULL) > 0 )
+    {
+        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
+
+        pseg->intercept_map = vzalloc(sz);
+        if ( !pseg->intercept_map )
+        {
+            xfree(pdev);
+            return NULL;
+        }
+    }
+
     if ( pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                              PCI_CAP_ID_MSIX) )
     {
@@ -366,6 +388,13 @@ static struct pci_dev *alloc_pdev(struct
 
     check_pdev(pdev);
 
+    if ( pci_conf_write_intercept(pseg->nr, PCI_BDF2(bus, devfn), ~0, 0,
+                                  NULL) > 0 )
+    {
+        set_bit(PCI_BDF2(bus, devfn), pseg->intercept_map);
+        arch_pci_ro_device(pseg->nr, PCI_BDF2(bus, devfn));
+    }
+
     return pdev;
 }
 
--- unstable.orig/xen/include/asm-x86/pci.h
+++ unstable/xen/include/asm-x86/pci.h
@@ -20,5 +20,7 @@ int pci_conf_write_intercept(unsigned in
                              uint32_t *data);
 int pci_msi_conf_write_intercept(struct pci_dev *, unsigned int reg,
                                  unsigned int size, uint32_t *data);
+bool_t pci_mmcfg_decode(unsigned long mfn, unsigned int *seg,
+                        unsigned int *bdf);
 
 #endif /* __X86_PCI_H__ */
--- unstable.orig/xen/include/xen/pci.h
+++ unstable/xen/include/xen/pci.h
@@ -106,6 +106,7 @@ void setup_hwdom_pci_devices(struct doma
 int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
+const unsigned long *pci_get_intercept_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *, nodeid_t node);
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);

Attachment: x86-PCI-MMCFG-intercept.patch
Description: Text document
