Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
On 11/13/20 8:32 AM, Oleksandr Andrushchenko wrote: > > On 11/12/20 4:46 PM, Roger Pau Monné wrote: >> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote: >>> On 11/12/20 11:40 AM, Roger Pau Monné wrote: >>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote: >>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx> >>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >>>>> index f74f728884c0..7dc7c70e24f2 100644 >>>>> --- a/xen/drivers/vpci/header.c >>>>> +++ b/xen/drivers/vpci/header.c >>>>> @@ -31,14 +31,87 @@ >>>>> struct map_data { >>>>> struct domain *d; >>>>> bool map; >>>>> + struct pci_dev *pdev; >>>> If the field is required please place it after the domain one. >>> I will, but may I ask why? >> So that if we add further boolean fields we can do at the end of the >> struct for layout reasons. If we do: >> >> struct map_data { >> struct domain *d; >> bool map; >> struct pci_dev *pdev; >> bool foo; >> } >> >> We will end up with a bunch of padding that could be avoided by doing: >> >> struct map_data { >> struct domain *d; >> struct pci_dev *pdev; >> bool map; >> bool foo; >> } > Ah, so this is about padding. Got it >> >>>>> + s = PFN_DOWN(s); >>>>> + e = PFN_DOWN(e); >>>> Changing the rangeset to store memory addresses instead of frames >>>> could for example be split into a separate patch. >>> Ok >>>> I think you are doing the calculation of the end pfn wrong here, you >>>> should use PFN_UP instead in case the address is not aligned. >>> PFN_DOWN for the start seems to be ok if the address is not aligned >>> >>> which is the case if I pass bar_index in the lower bits: PCI memory has >>> >>> PAGE_SIZE granularity, so besides the fact that I use bar_index the address >> No, BARs don't need to be aligned to page boundaries, you can even >> have different BARs inside the same physical page. >> >> The spec recommends that the minimum size of a BAR should be 4KB, but >> that's not a strict requirement in which case a BAR can be as small as >> 16bytes, and then you can have multiple ones inside the same page. > Ok, will account on that >> >>> must be page aligned. >>> >>> The end address is expressed in (size - 1) form, again page aligned, >>> >>> so to get the last page to be mapped PFN_DOWN also seems to be appropriate. >>> >>> Do I miss something here? >> I'm not aware of any of those addresses or sizes being guaranteed to >> be page aligned, so I think you need to account for that. >> >> Some of the code here uses PFN_DOWN to calculate the end address >> because the rangesets are used in an inclusive fashion, so the end >> frame also gets mapped. > Ok >> >>>>> + mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr)); >>>>> for ( ; ; ) >>>>> { >>>>> unsigned long size = e - s + 1; >>>>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long >>>>> e, void *data, >>>>> * - {un}map_mmio_regions doesn't support preemption. >>>>> */ >>>>> - rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, >>>>> _mfn(s)) >>>>> - : unmap_mmio_regions(map->d, _gfn(s), size, >>>>> _mfn(s)); >>>>> + rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn) >>>>> + : unmap_mmio_regions(map->d, _gfn(s), size, mfn); >>>>> if ( rc == 0 ) >>>>> { >>>>> - *c += size; >>>>> + /* >>>>> + * Range set is not expressed in frame numbers and the size >>>>> + * is the number of frames, so update accordingly. 
>>>>> + */ >>>>> + *c += size << PAGE_SHIFT; >>>>> break; >>>>> } >>>>> if ( rc < 0 ) >>>>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, >>>>> void *data, >>>>> break; >>>>> } >>>>> ASSERT(rc < size); >>>>> - *c += rc; >>>>> + *c += rc << PAGE_SHIFT; >>>>> s += rc; >>>>> + mfn += rc; >>>>> if ( general_preempt_check() ) >>>>> return -ERESTART; >>>>> } >>>>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, >>>>> void *data, >>>>> static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd, >>>>> bool rom_only) >>>>> { >>>>> - struct vpci_header *header = &pdev->vpci->header; >>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>>> bool map = cmd & PCI_COMMAND_MEMORY; >>>>> unsigned int i; >>>>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v) >>>>> struct map_data data = { >>>>> .d = v->domain, >>>>> .map = v->vpci.cmd & PCI_COMMAND_MEMORY, >>>>> + .pdev = v->vpci.pdev, >>>>> }; >>>>> int rc = rangeset_consume_ranges(v->vpci.mem, map_range, >>>>> &data); >>>>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v) >>>>> static int __init apply_map(struct domain *d, const struct pci_dev >>>>> *pdev, >>>>> struct rangeset *mem, uint16_t cmd) >>>>> { >>>>> - struct map_data data = { .d = d, .map = true }; >>>>> + struct map_data data = { .d = d, .map = true, >>>>> + .pdev = (struct pci_dev *)pdev }; >>>> Dropping the const here is not fine. IT either needs to be dropped >>>> from apply_map and further up, or this needs to also be made const. >>> Ok, I'll try to keep it const >>>>> int rc; >>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) >>>>> == -ERESTART ) >>>>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct >>>>> pci_dev *pdev, >>>>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, >>>>> bool rom_only) >>>>> { >>>>> - struct vpci_header *header = &pdev->vpci->header; >>>>> + struct vpci_header *header; >>>>> struct rangeset *mem = rangeset_new(NULL, NULL, 0); >>>>> struct pci_dev *tmp, *dev = NULL; >>>>> #ifdef CONFIG_X86 >>>>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, >>>>> uint16_t cmd, bool rom_only) >>>>> if ( !mem ) >>>>> return -ENOMEM; >>>>> + if ( is_hardware_domain(current->domain) ) >>>>> + header = get_hwdom_vpci_header(pdev); >>>>> + else >>>>> + header = get_vpci_header(current->domain, pdev); >>>>> + >>>>> /* >>>>> * Create a rangeset that represents the current device BARs >>>>> memory region >>>>> * and compare it against all the currently active BAR memory >>>>> regions. If >>>>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, >>>>> uint16_t cmd, bool rom_only) >>>>> * First fill the rangeset with all the BARs of this device or >>>>> with the ROM >>>>> * BAR only, depending on whether the guest is toggling the memory >>>>> decode >>>>> * bit of the command register, or the enable bit of the ROM BAR >>>>> register. >>>>> + * >>>>> + * Use the PCI reserved bits of the BAR to pass BAR's index. >>>>> */ >>>>> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ ) >>>>> { >>>>> const struct vpci_bar *bar = &header->bars[i]; >>>>> - unsigned long start = PFN_DOWN(bar->addr); >>>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1); >>>>> + unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | >>>>> i; >>>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) + >>>>> + bar->size - 1; >>>> Will this work fine on Arm 32bits with LPAE? 
It's my understanding >>>> that in that case unsigned long is 32bits, but the physical address >>>> space is 44bits, in which case this won't work. >>> Hm, good question >>>> I think you need to keep the usage of frame numbers here. >>> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here >>> and elsewhere >>>>> if ( !MAPPABLE_BAR(bar) || >>>>> (rom_only ? bar->type != VPCI_BAR_ROM >>>>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, >>>>> uint16_t cmd, bool rom_only) >>>>> /* Remove any MSIX regions if present. */ >>>>> for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ ) >>>>> { >>>>> - unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i)); >>>>> - unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) + >>>>> - vmsix_table_size(pdev->vpci, i) - 1); >>>>> + unsigned long start = (vmsix_table_addr(pdev->vpci, i) & >>>>> + PCI_BASE_ADDRESS_MEM_MASK) | i; >>>>> + unsigned long end = (vmsix_table_addr(pdev->vpci, i) & >>>>> + PCI_BASE_ADDRESS_MEM_MASK ) + >>>>> + vmsix_table_size(pdev->vpci, i) - 1; >>>>> rc = rangeset_remove_range(mem, start, end); >>>>> if ( rc ) >>>>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, >>>>> uint16_t cmd, bool rom_only) >>>>> */ >>>>> for_each_pdev ( pdev->domain, tmp ) >>>>> { >>>>> + struct vpci_header *header; >>>>> + >>>>> if ( tmp == pdev ) >>>>> { >>>>> /* >>>>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, >>>>> uint16_t cmd, bool rom_only) >>>>> continue; >>>>> } >>>>> - for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ ) >>>>> + header = get_vpci_header(tmp->domain, pdev); >>>>> + >>>>> + for ( i = 0; i < ARRAY_SIZE(header->bars); i++ ) >>>>> { >>>>> - const struct vpci_bar *bar = &tmp->vpci->header.bars[i]; >>>>> - unsigned long start = PFN_DOWN(bar->addr); >>>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1); >>>>> + const struct vpci_bar *bar = &header->bars[i]; >>>>> + unsigned long start = (bar->addr & >>>>> PCI_BASE_ADDRESS_MEM_MASK) | i; >>>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) >>>>> + + bar->size - 1; >>>>> if ( !bar->enabled || !rangeset_overlaps_range(mem, >>>>> start, end) || >>>>> /* >>>>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, >>>>> unsigned int reg, >>>>> pci_conf_write16(pdev->sbdf, reg, cmd); >>>>> } >>>>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg, >>>>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg, >>>>> uint32_t val, void *data) >>>>> { >>>>> struct vpci_bar *bar = data; >>>>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, >>>>> unsigned int reg, >>>>> { >>>>> /* If the value written is the current one avoid printing a >>>>> warning. */ >>>>> if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) ) >>>>> + { >>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>>> + >>>>> gprintk(XENLOG_WARNING, >>>>> "%04x:%02x:%02x.%u: ignored BAR %lu write with >>>>> memory decoding enabled\n", >>>>> pdev->seg, pdev->bus, slot, func, >>>>> - bar - pdev->vpci->header.bars + hi); >>>>> + bar - header->bars + hi); >>>>> + } >>>>> return; >>>>> } >>>>> - >>>>> /* >>>>> * Update the cached address, so that when memory decoding is >>>>> enabled >>>>> * Xen can map the BAR into the guest p2m. 
>>>>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, >>>>> unsigned int reg, >>>>> pci_conf_write32(pdev->sbdf, reg, val); >>>>> } >>>>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned >>>>> int reg, >>>>> + void *data) >>>>> +{ >>>>> + return vpci_hw_read32(pdev, reg, data); >>>>> +} >>>>> + >>>>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg, >>>>> + uint32_t val, void *data) >>>>> +{ >>>>> + struct vpci_bar *vbar = data; >>>>> + bool hi = false; >>>>> + >>>>> + if ( vbar->type == VPCI_BAR_MEM64_HI ) >>>>> + { >>>>> + ASSERT(reg > PCI_BASE_ADDRESS_0); >>>>> + vbar--; >>>>> + hi = true; >>>>> + } >>>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0)); >>>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0); >>>>> +} >>>>> + >>>>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int >>>>> reg, >>>>> + void *data) >>>>> +{ >>>>> + struct vpci_bar *vbar = data; >>>>> + uint32_t val; >>>>> + bool hi = false; >>>>> + >>>>> + if ( vbar->type == VPCI_BAR_MEM64_HI ) >>>>> + { >>>>> + ASSERT(reg > PCI_BASE_ADDRESS_0); >>>>> + vbar--; >>>>> + hi = true; >>>>> + } >>>>> + >>>>> + if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == >>>>> VPCI_BAR_MEM64_HI ) >>>> I think this would be clearer using a switch statement. >>> I'll think about >>>>> + { >>>>> + if ( hi ) >>>>> + val = vbar->addr >> 32; >>>>> + else >>>>> + val = vbar->addr & 0xffffffff; >>>>> + if ( val == ~0 ) >>>> Strictly speaking I think you are not forced to write 1s to the >>>> reserved 4 bits in the low register (and in the 32bit case). >>> Ah, so Linux kernel, for instance, could have written 0xffffff0 while >>> >>> I expect 0xffffffff? >> I think real hardware would return the size when written 1s to all >> bits except the reserved ones. >> >>>>> + { >>>>> + /* Guests detects BAR's properties and sizes. */ >>>>> + if ( !hi ) >>>>> + { >>>>> + val = 0xffffffff & ~(vbar->size - 1); >>>>> + val |= vbar->type == VPCI_BAR_MEM32 ? >>>>> PCI_BASE_ADDRESS_MEM_TYPE_32 >>>>> + : >>>>> PCI_BASE_ADDRESS_MEM_TYPE_64; >>>>> + val |= vbar->prefetchable ? >>>>> PCI_BASE_ADDRESS_MEM_PREFETCH : 0; >>>>> + } >>>>> + else >>>>> + val = vbar->size >> 32; >>>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0)); >>>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0); >>>>> + } >>>>> + } >>>>> + else if ( vbar->type == VPCI_BAR_MEM32 ) >>>>> + { >>>>> + val = vbar->addr; >>>>> + if ( val == ~0 ) >>>>> + { >>>>> + if ( !hi ) >>>> There's no way hi can be true at this point AFAICT. >>> Sure, thank you >>>>> + { >>>>> + val = 0xffffffff & ~(vbar->size - 1); >>>>> + val |= vbar->type == VPCI_BAR_MEM32 ? >>>>> PCI_BASE_ADDRESS_MEM_TYPE_32 >>>>> + : >>>>> PCI_BASE_ADDRESS_MEM_TYPE_64; >>>>> + val |= vbar->prefetchable ? 
>>>>> PCI_BASE_ADDRESS_MEM_PREFETCH : 0; >>>>> + } >>>>> + } >>>>> + } >>>>> + else >>>>> + { >>>>> + val = vbar->addr; >>>>> + } >>>>> + return val; >>>>> +} >>>>> + >>>>> static void rom_write(const struct pci_dev *pdev, unsigned int reg, >>>>> uint32_t val, void *data) >>>>> { >>>>> - struct vpci_header *header = &pdev->vpci->header; >>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>>> struct vpci_bar *rom = data; >>>>> uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn); >>>>> uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND); >>>>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, >>>>> unsigned int reg, >>>>> rom->addr = val & PCI_ROM_ADDRESS_MASK; >>>>> } >>>> Don't you need to also protect a domU from writing to the ROM BAR >>>> register? >>> ROM was not a target of this RFC as I have no HW to test that, but final >>> code must >>> >>> also handle ROM as well, you are right >>> >>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, >>>>> unsigned int reg, >>>>> + void *data) >>>>> +{ >>>>> + struct vpci_bar *vbar, *bar = data; >>>>> + >>>>> + if ( is_hardware_domain(current->domain) ) >>>>> + return bar_read_hwdom(pdev, reg, data); >>>>> + >>>>> + vbar = get_vpci_bar(current->domain, pdev, bar->index); >>>>> + if ( !vbar ) >>>>> + return ~0; >>>>> + >>>>> + return bar_read_guest(pdev, reg, vbar); >>>>> +} >>>>> + >>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int >>>>> reg, >>>>> + uint32_t val, void *data) >>>>> +{ >>>>> + struct vpci_bar *bar = data; >>>>> + >>>>> + if ( is_hardware_domain(current->domain) ) >>>>> + bar_write_hwdom(pdev, reg, val, data); >>>>> + else >>>>> + { >>>>> + struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, >>>>> bar->index); >>>>> + >>>>> + if ( !vbar ) >>>>> + return; >>>>> + bar_write_guest(pdev, reg, val, vbar); >>>>> + } >>>>> +} >>>> You should assign different handlers based on whether the domain that >>>> has the device assigned is a domU or the hardware domain, rather than >>>> doing the selection here. >>> Hm, handlers are assigned once in init_bars and this function is only called >>> >>> for hwdom, so there is no way I can do that for the guests. Hence, the >>> dispatcher >> I think we might want to reset the vPCI handlers when a devices gets >> assigned and deassigned. > > In ARM case init_bars is called too early: PCI device assignment is currently > > initiated by Domain-0' kernel and is done *before* PCI devices are given > memory > > ranges and BARs assigned: > > [ 0.429514] pci_bus 0000:00: root bus resource [bus 00-ff] > [ 0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff] > [ 0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff] > [ 0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff] > [ 0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff > pref] > [ 0.429670] pci 0000:00:00.0: enabling Extended Tags > [ 0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE > > < init_bars > > > [ 0.453793] pci 0000:00:00.0: -- IRQ 0 > [ 0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X > might fail! > [ 0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE > > < init_bars > > > [ 0.471821] pci 0000:01:00.0: -- IRQ 255 > [ 0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X > might fail! 
>
> < BAR assignments below >
>
> [ 0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
> [ 0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>
> In case of x86 this is pretty much ok as BARs are already in place, but for
> ARM we need to take care and re-setup vPCI BARs for hwdom. Things are getting
> even more complicated if the host PCI bridge is not ECAM like, so you cannot
> set mmio_handlers and trap hwdom's access to the config space to update BARs
> etc. This is why I have that ugly hack for rcar_gen3 to read actual BARs for
> hwdom.
>
> If we go further and take a look at SR-IOV then when the kernel assigns the
> device (BUS_NOTIFY_ADD_DEVICE) then it already has BARs assigned for virtual
> functions (need to double-check that).

Hm, indeed. We just need to move init_bars from being called on PCI *device
add* to *device assign*. This way it won't (?) break x86 and will allow ARM to
properly initialize vPCI's BARs...

>> In order to do passthrough to domUs safely we will have to add more
>> handlers than what's required for dom0,
>
> Can you please tell what you are thinking about? What are the missing handlers?
>
>> and having is_hardware_domain sprinkled in all of them is not a suitable
>> solution.
>
> I'll try to replace is_hardware_domain with something like:
>
> +/*
> + * Detect whether physical PCI devices in this segment belong
> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
> + * but in case of ARM this might not be the case: those may also
> + * live in driver domains or even Xen itself.
> + */
> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return is_hardware_domain(d);
> +#elif CONFIG_ARM
> +    return pci_is_owner_domain(d, seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
> +
> +/*
> + * Get domain which owns this segment: for x86 this is always the hardware
> + * domain and for ARM this can be different.
> + */
> +struct domain *pci_get_hardware_domain(u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return hardware_domain;
> +#elif CONFIG_ARM
> +    return pci_get_owner_domain(seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
>
> This is what I use to properly detect the domain that really owns the
> physical host bridge.
>
>> Roger.
>
> Thank you,
>
> Oleksandr
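
To illustrate the struct layout point discussed above for struct map_data, here
is a small standalone sketch; the struct and field names are stand-ins chosen
for the example (including the hypothetical foo flag), and the sizes quoted
assume a typical LP64 ABI:

#include <stdbool.h>
#include <stdio.h>

/* Field order as originally posted: pointer, bool, pointer, (future) bool. */
struct map_data_padded {
    void *d;        /* stands in for struct domain * */
    bool map;       /* 7 bytes of padding follow so pdev stays 8-byte aligned */
    void *pdev;     /* stands in for struct pci_dev * */
    bool foo;       /* the hypothetical extra flag from the example above */
};

/* Field order suggested in the review: pointers first, flags last. */
struct map_data_packed {
    void *d;
    void *pdev;
    bool map;
    bool foo;
};

int main(void)
{
    /* Typically prints 32 vs 24 on an LP64 ABI. */
    printf("padded: %zu, packed: %zu\n",
           sizeof(struct map_data_padded), sizeof(struct map_data_packed));
    return 0;
}

Keeping the pointers together lets any number of additional flags pack into
what would otherwise be padding.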
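
For the PFN_DOWN/PFN_UP discussion, a standalone sketch of how an inclusive
rangeset end frame is computed for BARs that are not page aligned or are
smaller than a page; PAGE_SHIFT and the macros are redefined locally just for
the example, and 64-bit address types are used so the arithmetic also holds
where unsigned long is only 32 bits (the Arm LPAE concern raised above):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (UINT64_C(1) << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/*
 * Frame range touched by a BAR, for a rangeset that is used inclusively
 * (the end frame gets mapped too), as described in the thread above.
 */
static void bar_frames(uint64_t addr, uint64_t size)
{
    uint64_t start = PFN_DOWN(addr);              /* first touched frame */
    uint64_t end   = PFN_DOWN(addr + size - 1);   /* last touched frame, inclusive */
    uint64_t nr    = PFN_UP(addr + size) - start; /* same count, exclusive form */

    printf("BAR %#" PRIx64 "+%#" PRIx64 " -> frames %#" PRIx64 "..%#" PRIx64
           " (%" PRIu64 " frame(s))\n", addr, size, start, end, nr);
}

int main(void)
{
    bar_frames(UINT64_C(0xfe200000), UINT64_C(0x100000)); /* page-aligned, 1 MiB */
    bar_frames(UINT64_C(0xfe200010), UINT64_C(0x10));     /* 16-byte BAR sharing a page */
    bar_frames(UINT64_C(0x1000000ff0), UINT64_C(0x20));   /* unaligned, crosses a page */
    return 0;
}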
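
For the guest BAR read/write path, a rough standalone sketch of the
switch-based structure suggested in the review, using simplified stand-ins
rather than the real vPCI types; the write side drops the read-only (size - 1)
bits, so a guest that writes 0xfffffff0 instead of 0xffffffff still reads back
the size mask:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins, not the real Xen vPCI structures. */
#define PCI_BASE_ADDRESS_MEM_TYPE_32   0x00
#define PCI_BASE_ADDRESS_MEM_TYPE_64   0x04
#define PCI_BASE_ADDRESS_MEM_PREFETCH  0x08

enum bar_type { BAR_MEM32, BAR_MEM64 };

struct vbar {
    enum bar_type type;
    uint64_t addr;      /* guest view of the BAR base */
    uint64_t size;      /* power of two, at least 16 */
    bool prefetchable;
};

/*
 * Write handler: latch only the address bits the BAR implements.  Dropping
 * the low (size - 1) bits here means a sizing write of 0xffffffff or
 * 0xfffffff0 needs no special casing, as the read below returns the size
 * mask either way.
 */
static void vbar_write(struct vbar *bar, bool hi, uint32_t val)
{
    unsigned int shift = hi ? 32 : 0;

    bar->addr &= ~(UINT64_C(0xffffffff) << shift);
    bar->addr |= (uint64_t)val << shift;
    bar->addr &= ~(bar->size - 1);
}

/* Read handler, using the switch layout suggested in the review. */
static uint32_t vbar_read(const struct vbar *bar, bool hi)
{
    uint32_t val = 0;

    switch ( bar->type )
    {
    case BAR_MEM64:
        if ( hi )
            return bar->addr >> 32;
        val = (bar->addr & 0xfffffff0) | PCI_BASE_ADDRESS_MEM_TYPE_64;
        break;

    case BAR_MEM32:
        val = (bar->addr & 0xfffffff0) | PCI_BASE_ADDRESS_MEM_TYPE_32;
        break;
    }

    return val | (bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0);
}

int main(void)
{
    struct vbar bar = { .type = BAR_MEM32, .size = 0x1000, .prefetchable = true };

    vbar_write(&bar, false, 0xfffffff0);              /* sizing write, low bits clear */
    printf("%#" PRIx32 "\n", vbar_read(&bar, false)); /* prints 0xfffff008 */

    vbar_write(&bar, false, 0xfe200000);              /* normal base address write */
    printf("%#" PRIx32 "\n", vbar_read(&bar, false)); /* prints 0xfe200008 */
    return 0;
}

Storing only the writable address bits is one way to avoid special-casing the
all-ones sizing write; this is a sketch of that idea, not the patch's exact
logic.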
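
Finally, on assigning different handlers for hwdom and domU when the device is
assigned, rather than branching on is_hardware_domain() inside each handler, a
shape-only sketch; every name here is an illustrative stand-in, not the Xen
vPCI API, and the selection would naturally live in the device-assign path per
the conclusion above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t (*bar_read_t)(unsigned int reg);
typedef void (*bar_write_t)(unsigned int reg, uint32_t val);

/* Illustrative stand-in for the per-device state holding the chosen handlers. */
struct fake_pdev {
    bar_read_t  bar_read;
    bar_write_t bar_write;
};

static uint32_t bar_read_hwdom(unsigned int reg)
{
    printf("hwdom read, reg %#x\n", reg);
    return 0;
}

static void bar_write_hwdom(unsigned int reg, uint32_t val)
{
    printf("hwdom write %#x, reg %#x\n", (unsigned int)val, reg);
}

static uint32_t bar_read_guest(unsigned int reg)
{
    printf("guest read, reg %#x\n", reg);
    return 0;
}

static void bar_write_guest(unsigned int reg, uint32_t val)
{
    printf("guest write %#x, reg %#x\n", (unsigned int)val, reg);
}

/* Would run from the assign (not add) path, so the decision is made once. */
static void assign_bar_handlers(struct fake_pdev *pdev, bool to_hwdom)
{
    pdev->bar_read  = to_hwdom ? bar_read_hwdom  : bar_read_guest;
    pdev->bar_write = to_hwdom ? bar_write_hwdom : bar_write_guest;
}

int main(void)
{
    struct fake_pdev dom0_dev, domU_dev;

    assign_bar_handlers(&dom0_dev, true);
    assign_bar_handlers(&domU_dev, false);

    dom0_dev.bar_write(0x10, 0xfe200000);
    domU_dev.bar_read(0x10);
    return 0;
}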