[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
On 11/12/20 4:46 PM, Roger Pau Monné wrote: > On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote: >> On 11/12/20 11:40 AM, Roger Pau Monné wrote: >>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote: >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx> >>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >>>> index f74f728884c0..7dc7c70e24f2 100644 >>>> --- a/xen/drivers/vpci/header.c >>>> +++ b/xen/drivers/vpci/header.c >>>> @@ -31,14 +31,87 @@ >>>> struct map_data { >>>> struct domain *d; >>>> bool map; >>>> + struct pci_dev *pdev; >>> If the field is required please place it after the domain one. >> I will, but may I ask why? > So that if we add further boolean fields we can do at the end of the > struct for layout reasons. If we do: > > struct map_data { > struct domain *d; > bool map; > struct pci_dev *pdev; > bool foo; > } > > We will end up with a bunch of padding that could be avoided by doing: > > struct map_data { > struct domain *d; > struct pci_dev *pdev; > bool map; > bool foo; > } Ah, so this is about padding. Got it > >>>> + s = PFN_DOWN(s); >>>> + e = PFN_DOWN(e); >>> Changing the rangeset to store memory addresses instead of frames >>> could for example be split into a separate patch. >> Ok >>> I think you are doing the calculation of the end pfn wrong here, you >>> should use PFN_UP instead in case the address is not aligned. >> PFN_DOWN for the start seems to be ok if the address is not aligned >> >> which is the case if I pass bar_index in the lower bits: PCI memory has >> >> PAGE_SIZE granularity, so besides the fact that I use bar_index the address > No, BARs don't need to be aligned to page boundaries, you can even > have different BARs inside the same physical page. > > The spec recommends that the minimum size of a BAR should be 4KB, but > that's not a strict requirement in which case a BAR can be as small as > 16bytes, and then you can have multiple ones inside the same page. Ok, will account on that > >> must be page aligned. >> >> The end address is expressed in (size - 1) form, again page aligned, >> >> so to get the last page to be mapped PFN_DOWN also seems to be appropriate. >> >> Do I miss something here? > I'm not aware of any of those addresses or sizes being guaranteed to > be page aligned, so I think you need to account for that. > > Some of the code here uses PFN_DOWN to calculate the end address > because the rangesets are used in an inclusive fashion, so the end > frame also gets mapped. Ok > >>>> + mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr)); >>>> for ( ; ; ) >>>> { >>>> unsigned long size = e - s + 1; >>>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long >>>> e, void *data, >>>> * - {un}map_mmio_regions doesn't support preemption. >>>> */ >>>> >>>> - rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s)) >>>> - : unmap_mmio_regions(map->d, _gfn(s), size, >>>> _mfn(s)); >>>> + rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn) >>>> + : unmap_mmio_regions(map->d, _gfn(s), size, mfn); >>>> if ( rc == 0 ) >>>> { >>>> - *c += size; >>>> + /* >>>> + * Range set is not expressed in frame numbers and the size >>>> + * is the number of frames, so update accordingly. >>>> + */ >>>> + *c += size << PAGE_SHIFT; >>>> break; >>>> } >>>> if ( rc < 0 ) >>>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, >>>> void *data, >>>> break; >>>> } >>>> ASSERT(rc < size); >>>> - *c += rc; >>>> + *c += rc << PAGE_SHIFT; >>>> s += rc; >>>> + mfn += rc; >>>> if ( general_preempt_check() ) >>>> return -ERESTART; >>>> } >>>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, >>>> void *data, >>>> static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd, >>>> bool rom_only) >>>> { >>>> - struct vpci_header *header = &pdev->vpci->header; >>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>> bool map = cmd & PCI_COMMAND_MEMORY; >>>> unsigned int i; >>>> >>>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v) >>>> struct map_data data = { >>>> .d = v->domain, >>>> .map = v->vpci.cmd & PCI_COMMAND_MEMORY, >>>> + .pdev = v->vpci.pdev, >>>> }; >>>> int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data); >>>> >>>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v) >>>> static int __init apply_map(struct domain *d, const struct pci_dev >>>> *pdev, >>>> struct rangeset *mem, uint16_t cmd) >>>> { >>>> - struct map_data data = { .d = d, .map = true }; >>>> + struct map_data data = { .d = d, .map = true, >>>> + .pdev = (struct pci_dev *)pdev }; >>> Dropping the const here is not fine. IT either needs to be dropped >>> from apply_map and further up, or this needs to also be made const. >> Ok, I'll try to keep it const >>>> int rc; >>>> >>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == >>>> -ERESTART ) >>>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev >>>> *pdev, >>>> >>>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool >>>> rom_only) >>>> { >>>> - struct vpci_header *header = &pdev->vpci->header; >>>> + struct vpci_header *header; >>>> struct rangeset *mem = rangeset_new(NULL, NULL, 0); >>>> struct pci_dev *tmp, *dev = NULL; >>>> #ifdef CONFIG_X86 >>>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, >>>> uint16_t cmd, bool rom_only) >>>> if ( !mem ) >>>> return -ENOMEM; >>>> >>>> + if ( is_hardware_domain(current->domain) ) >>>> + header = get_hwdom_vpci_header(pdev); >>>> + else >>>> + header = get_vpci_header(current->domain, pdev); >>>> + >>>> /* >>>> * Create a rangeset that represents the current device BARs memory >>>> region >>>> * and compare it against all the currently active BAR memory >>>> regions. If >>>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, >>>> uint16_t cmd, bool rom_only) >>>> * First fill the rangeset with all the BARs of this device or with >>>> the ROM >>>> * BAR only, depending on whether the guest is toggling the memory >>>> decode >>>> * bit of the command register, or the enable bit of the ROM BAR >>>> register. >>>> + * >>>> + * Use the PCI reserved bits of the BAR to pass BAR's index. >>>> */ >>>> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ ) >>>> { >>>> const struct vpci_bar *bar = &header->bars[i]; >>>> - unsigned long start = PFN_DOWN(bar->addr); >>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1); >>>> + unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i; >>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) + >>>> + bar->size - 1; >>> Will this work fine on Arm 32bits with LPAE? It's my understanding >>> that in that case unsigned long is 32bits, but the physical address >>> space is 44bits, in which case this won't work. >> Hm, good question >>> I think you need to keep the usage of frame numbers here. >> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here >> and elsewhere >>>> >>>> if ( !MAPPABLE_BAR(bar) || >>>> (rom_only ? bar->type != VPCI_BAR_ROM >>>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, >>>> uint16_t cmd, bool rom_only) >>>> /* Remove any MSIX regions if present. */ >>>> for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ ) >>>> { >>>> - unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i)); >>>> - unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) + >>>> - vmsix_table_size(pdev->vpci, i) - 1); >>>> + unsigned long start = (vmsix_table_addr(pdev->vpci, i) & >>>> + PCI_BASE_ADDRESS_MEM_MASK) | i; >>>> + unsigned long end = (vmsix_table_addr(pdev->vpci, i) & >>>> + PCI_BASE_ADDRESS_MEM_MASK ) + >>>> + vmsix_table_size(pdev->vpci, i) - 1; >>>> >>>> rc = rangeset_remove_range(mem, start, end); >>>> if ( rc ) >>>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, >>>> uint16_t cmd, bool rom_only) >>>> */ >>>> for_each_pdev ( pdev->domain, tmp ) >>>> { >>>> + struct vpci_header *header; >>>> + >>>> if ( tmp == pdev ) >>>> { >>>> /* >>>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, >>>> uint16_t cmd, bool rom_only) >>>> continue; >>>> } >>>> >>>> - for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ ) >>>> + header = get_vpci_header(tmp->domain, pdev); >>>> + >>>> + for ( i = 0; i < ARRAY_SIZE(header->bars); i++ ) >>>> { >>>> - const struct vpci_bar *bar = &tmp->vpci->header.bars[i]; >>>> - unsigned long start = PFN_DOWN(bar->addr); >>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1); >>>> + const struct vpci_bar *bar = &header->bars[i]; >>>> + unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) >>>> | i; >>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) >>>> + + bar->size - 1; >>>> >>>> if ( !bar->enabled || !rangeset_overlaps_range(mem, start, >>>> end) || >>>> /* >>>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, >>>> unsigned int reg, >>>> pci_conf_write16(pdev->sbdf, reg, cmd); >>>> } >>>> >>>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg, >>>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg, >>>> uint32_t val, void *data) >>>> { >>>> struct vpci_bar *bar = data; >>>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, >>>> unsigned int reg, >>>> { >>>> /* If the value written is the current one avoid printing a >>>> warning. */ >>>> if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) ) >>>> + { >>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>> + >>>> gprintk(XENLOG_WARNING, >>>> "%04x:%02x:%02x.%u: ignored BAR %lu write with >>>> memory decoding enabled\n", >>>> pdev->seg, pdev->bus, slot, func, >>>> - bar - pdev->vpci->header.bars + hi); >>>> + bar - header->bars + hi); >>>> + } >>>> return; >>>> } >>>> >>>> - >>>> /* >>>> * Update the cached address, so that when memory decoding is >>>> enabled >>>> * Xen can map the BAR into the guest p2m. >>>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, >>>> unsigned int reg, >>>> pci_conf_write32(pdev->sbdf, reg, val); >>>> } >>>> >>>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int >>>> reg, >>>> + void *data) >>>> +{ >>>> + return vpci_hw_read32(pdev, reg, data); >>>> +} >>>> + >>>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg, >>>> + uint32_t val, void *data) >>>> +{ >>>> + struct vpci_bar *vbar = data; >>>> + bool hi = false; >>>> + >>>> + if ( vbar->type == VPCI_BAR_MEM64_HI ) >>>> + { >>>> + ASSERT(reg > PCI_BASE_ADDRESS_0); >>>> + vbar--; >>>> + hi = true; >>>> + } >>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0)); >>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0); >>>> +} >>>> + >>>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int >>>> reg, >>>> + void *data) >>>> +{ >>>> + struct vpci_bar *vbar = data; >>>> + uint32_t val; >>>> + bool hi = false; >>>> + >>>> + if ( vbar->type == VPCI_BAR_MEM64_HI ) >>>> + { >>>> + ASSERT(reg > PCI_BASE_ADDRESS_0); >>>> + vbar--; >>>> + hi = true; >>>> + } >>>> + >>>> + if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == >>>> VPCI_BAR_MEM64_HI ) >>> I think this would be clearer using a switch statement. >> I'll think about >>>> + { >>>> + if ( hi ) >>>> + val = vbar->addr >> 32; >>>> + else >>>> + val = vbar->addr & 0xffffffff; >>>> + if ( val == ~0 ) >>> Strictly speaking I think you are not forced to write 1s to the >>> reserved 4 bits in the low register (and in the 32bit case). >> Ah, so Linux kernel, for instance, could have written 0xffffff0 while >> >> I expect 0xffffffff? > I think real hardware would return the size when written 1s to all > bits except the reserved ones. > >>>> + { >>>> + /* Guests detects BAR's properties and sizes. */ >>>> + if ( !hi ) >>>> + { >>>> + val = 0xffffffff & ~(vbar->size - 1); >>>> + val |= vbar->type == VPCI_BAR_MEM32 ? >>>> PCI_BASE_ADDRESS_MEM_TYPE_32 >>>> + : >>>> PCI_BASE_ADDRESS_MEM_TYPE_64; >>>> + val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH >>>> : 0; >>>> + } >>>> + else >>>> + val = vbar->size >> 32; >>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0)); >>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0); >>>> + } >>>> + } >>>> + else if ( vbar->type == VPCI_BAR_MEM32 ) >>>> + { >>>> + val = vbar->addr; >>>> + if ( val == ~0 ) >>>> + { >>>> + if ( !hi ) >>> There's no way hi can be true at this point AFAICT. >> Sure, thank you >>>> + { >>>> + val = 0xffffffff & ~(vbar->size - 1); >>>> + val |= vbar->type == VPCI_BAR_MEM32 ? >>>> PCI_BASE_ADDRESS_MEM_TYPE_32 >>>> + : >>>> PCI_BASE_ADDRESS_MEM_TYPE_64; >>>> + val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH >>>> : 0; >>>> + } >>>> + } >>>> + } >>>> + else >>>> + { >>>> + val = vbar->addr; >>>> + } >>>> + return val; >>>> +} >>>> + >>>> static void rom_write(const struct pci_dev *pdev, unsigned int reg, >>>> uint32_t val, void *data) >>>> { >>>> - struct vpci_header *header = &pdev->vpci->header; >>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev); >>>> struct vpci_bar *rom = data; >>>> uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn); >>>> uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND); >>>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, >>>> unsigned int reg, >>>> rom->addr = val & PCI_ROM_ADDRESS_MASK; >>>> } >>> Don't you need to also protect a domU from writing to the ROM BAR >>> register? >> ROM was not a target of this RFC as I have no HW to test that, but final >> code must >> >> also handle ROM as well, you are right >> >>>> >>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned >>>> int reg, >>>> + void *data) >>>> +{ >>>> + struct vpci_bar *vbar, *bar = data; >>>> + >>>> + if ( is_hardware_domain(current->domain) ) >>>> + return bar_read_hwdom(pdev, reg, data); >>>> + >>>> + vbar = get_vpci_bar(current->domain, pdev, bar->index); >>>> + if ( !vbar ) >>>> + return ~0; >>>> + >>>> + return bar_read_guest(pdev, reg, vbar); >>>> +} >>>> + >>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int >>>> reg, >>>> + uint32_t val, void *data) >>>> +{ >>>> + struct vpci_bar *bar = data; >>>> + >>>> + if ( is_hardware_domain(current->domain) ) >>>> + bar_write_hwdom(pdev, reg, val, data); >>>> + else >>>> + { >>>> + struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, >>>> bar->index); >>>> + >>>> + if ( !vbar ) >>>> + return; >>>> + bar_write_guest(pdev, reg, val, vbar); >>>> + } >>>> +} >>> You should assign different handlers based on whether the domain that >>> has the device assigned is a domU or the hardware domain, rather than >>> doing the selection here. >> Hm, handlers are assigned once in init_bars and this function is only called >> >> for hwdom, so there is no way I can do that for the guests. Hence, the >> dispatcher > I think we might want to reset the vPCI handlers when a devices gets > assigned and deassigned. In ARM case init_bars is called too early: PCI device assignment is currently initiated by Domain-0' kernel and is done *before* PCI devices are given memory ranges and BARs assigned: [ 0.429514] pci_bus 0000:00: root bus resource [bus 00-ff] [ 0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff] [ 0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff] [ 0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff] [ 0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref] [ 0.429670] pci 0000:00:00.0: enabling Extended Tags [ 0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE < init_bars > [ 0.453793] pci 0000:00:00.0: -- IRQ 0 [ 0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail! [ 0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE < init_bars > [ 0.471821] pci 0000:01:00.0: -- IRQ 255 [ 0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail! < BAR assignments below > [ 0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff] [ 0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref] In case of x86 this is pretty much ok as BARs are already in place, but for ARM we need to take care and re-setup vPCI BARs for hwdom. Things are getting even more complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers and trap hwdom's access to the config space to update BARs etc. This is why I have that ugly hack for rcar_gen3 to read actual BARs for hwdom. If we go further and take a look at SR-IOV then when the kernel assigns the device (BUS_NOTIFY_ADD_DEVICE) then it already has BARs assigned for virtual functions (need to double-check that). > In order to do passthrough to domUs safely > we will have to add more handlers than what's required for dom0, Can you please tell what are thinking about? What are the missing handlers? > and > having is_hardware_domain sprinkled in all of them is not a suitable > solution. I'll try to replace is_hardware_domain with something like: +/* + * Detect whether physical PCI devices in this segment belong + * to the domain given, e.g. on x86 all PCI devices live in hwdom, + * but in case of ARM this might not be the case: those may also + * live in driver domains or even Xen itself. + */ +bool pci_is_hardware_domain(struct domain *d, u16 seg) +{ +#ifdef CONFIG_X86 + return is_hardware_domain(d); +#elif CONFIG_ARM + return pci_is_owner_domain(d, seg); +#else +#error "Unsupported architecture" +#endif +} + +/* + * Get domain which owns this segment: for x86 this is always hardware + * domain and for ARM this can be different. + */ +struct domain *pci_get_hardware_domain(u16 seg) +{ +#ifdef CONFIG_X86 + return hardware_domain; +#elif CONFIG_ARM + return pci_get_owner_domain(seg); +#else +#error "Unsupported architecture" +#endif +} This is what I use to properly detect the domain that really owns physical host bridge > > Roger. Thank you, Oleksandr
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |