|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
On 11/12/20 4:46 PM, Roger Pau Monné wrote:
> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>> index f74f728884c0..7dc7c70e24f2 100644
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -31,14 +31,87 @@
>>>> struct map_data {
>>>> struct domain *d;
>>>> bool map;
>>>> + struct pci_dev *pdev;
>>> If the field is required please place it after the domain one.
>> I will, but may I ask why?
> So that if we add further boolean fields we can do at the end of the
> struct for layout reasons. If we do:
>
> struct map_data {
> struct domain *d;
> bool map;
> struct pci_dev *pdev;
> bool foo;
> }
>
> We will end up with a bunch of padding that could be avoided by doing:
>
> struct map_data {
> struct domain *d;
> struct pci_dev *pdev;
> bool map;
> bool foo;
> }
Ah, so this is about padding. Got it
>
>>>> + s = PFN_DOWN(s);
>>>> + e = PFN_DOWN(e);
>>> Changing the rangeset to store memory addresses instead of frames
>>> could for example be split into a separate patch.
>> Ok
>>> I think you are doing the calculation of the end pfn wrong here, you
>>> should use PFN_UP instead in case the address is not aligned.
>> PFN_DOWN for the start seems to be ok if the address is not aligned
>>
>> which is the case if I pass bar_index in the lower bits: PCI memory has
>>
>> PAGE_SIZE granularity, so besides the fact that I use bar_index the address
> No, BARs don't need to be aligned to page boundaries, you can even
> have different BARs inside the same physical page.
>
> The spec recommends that the minimum size of a BAR should be 4KB, but
> that's not a strict requirement in which case a BAR can be as small as
> 16bytes, and then you can have multiple ones inside the same page.
Ok, will account on that
>
>> must be page aligned.
>>
>> The end address is expressed in (size - 1) form, again page aligned,
>>
>> so to get the last page to be mapped PFN_DOWN also seems to be appropriate.
>>
>> Do I miss something here?
> I'm not aware of any of those addresses or sizes being guaranteed to
> be page aligned, so I think you need to account for that.
>
> Some of the code here uses PFN_DOWN to calculate the end address
> because the rangesets are used in an inclusive fashion, so the end
> frame also gets mapped.
Ok
>
>>>> + mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>>>> for ( ; ; )
>>>> {
>>>> unsigned long size = e - s + 1;
>>>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long
>>>> e, void *data,
>>>> * - {un}map_mmio_regions doesn't support preemption.
>>>> */
>>>>
>>>> - rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>>>> - : unmap_mmio_regions(map->d, _gfn(s), size,
>>>> _mfn(s));
>>>> + rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
>>>> + : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>>>> if ( rc == 0 )
>>>> {
>>>> - *c += size;
>>>> + /*
>>>> + * Range set is not expressed in frame numbers and the size
>>>> + * is the number of frames, so update accordingly.
>>>> + */
>>>> + *c += size << PAGE_SHIFT;
>>>> break;
>>>> }
>>>> if ( rc < 0 )
>>>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e,
>>>> void *data,
>>>> break;
>>>> }
>>>> ASSERT(rc < size);
>>>> - *c += rc;
>>>> + *c += rc << PAGE_SHIFT;
>>>> s += rc;
>>>> + mfn += rc;
>>>> if ( general_preempt_check() )
>>>> return -ERESTART;
>>>> }
>>>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e,
>>>> void *data,
>>>> static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>> bool rom_only)
>>>> {
>>>> - struct vpci_header *header = &pdev->vpci->header;
>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>> bool map = cmd & PCI_COMMAND_MEMORY;
>>>> unsigned int i;
>>>>
>>>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>>>> struct map_data data = {
>>>> .d = v->domain,
>>>> .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>> + .pdev = v->vpci.pdev,
>>>> };
>>>> int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>>>>
>>>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>>>> static int __init apply_map(struct domain *d, const struct pci_dev
>>>> *pdev,
>>>> struct rangeset *mem, uint16_t cmd)
>>>> {
>>>> - struct map_data data = { .d = d, .map = true };
>>>> + struct map_data data = { .d = d, .map = true,
>>>> + .pdev = (struct pci_dev *)pdev };
>>> Dropping the const here is not fine. IT either needs to be dropped
>>> from apply_map and further up, or this needs to also be made const.
>> Ok, I'll try to keep it const
>>>> int rc;
>>>>
>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
>>>> -ERESTART )
>>>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev
>>>> *pdev,
>>>>
>>>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool
>>>> rom_only)
>>>> {
>>>> - struct vpci_header *header = &pdev->vpci->header;
>>>> + struct vpci_header *header;
>>>> struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>> struct pci_dev *tmp, *dev = NULL;
>>>> #ifdef CONFIG_X86
>>>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev,
>>>> uint16_t cmd, bool rom_only)
>>>> if ( !mem )
>>>> return -ENOMEM;
>>>>
>>>> + if ( is_hardware_domain(current->domain) )
>>>> + header = get_hwdom_vpci_header(pdev);
>>>> + else
>>>> + header = get_vpci_header(current->domain, pdev);
>>>> +
>>>> /*
>>>> * Create a rangeset that represents the current device BARs memory
>>>> region
>>>> * and compare it against all the currently active BAR memory
>>>> regions. If
>>>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev,
>>>> uint16_t cmd, bool rom_only)
>>>> * First fill the rangeset with all the BARs of this device or with
>>>> the ROM
>>>> * BAR only, depending on whether the guest is toggling the memory
>>>> decode
>>>> * bit of the command register, or the enable bit of the ROM BAR
>>>> register.
>>>> + *
>>>> + * Use the PCI reserved bits of the BAR to pass BAR's index.
>>>> */
>>>> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> {
>>>> const struct vpci_bar *bar = &header->bars[i];
>>>> - unsigned long start = PFN_DOWN(bar->addr);
>>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>> + unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
>>>> + bar->size - 1;
>>> Will this work fine on Arm 32bits with LPAE? It's my understanding
>>> that in that case unsigned long is 32bits, but the physical address
>>> space is 44bits, in which case this won't work.
>> Hm, good question
>>> I think you need to keep the usage of frame numbers here.
>> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here
>> and elsewhere
>>>>
>>>> if ( !MAPPABLE_BAR(bar) ||
>>>> (rom_only ? bar->type != VPCI_BAR_ROM
>>>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev,
>>>> uint16_t cmd, bool rom_only)
>>>> /* Remove any MSIX regions if present. */
>>>> for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>>>> {
>>>> - unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
>>>> - unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>>>> - vmsix_table_size(pdev->vpci, i) - 1);
>>>> + unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
>>>> + PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>> + unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
>>>> + PCI_BASE_ADDRESS_MEM_MASK ) +
>>>> + vmsix_table_size(pdev->vpci, i) - 1;
>>>>
>>>> rc = rangeset_remove_range(mem, start, end);
>>>> if ( rc )
>>>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev,
>>>> uint16_t cmd, bool rom_only)
>>>> */
>>>> for_each_pdev ( pdev->domain, tmp )
>>>> {
>>>> + struct vpci_header *header;
>>>> +
>>>> if ( tmp == pdev )
>>>> {
>>>> /*
>>>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev,
>>>> uint16_t cmd, bool rom_only)
>>>> continue;
>>>> }
>>>>
>>>> - for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>> + header = get_vpci_header(tmp->domain, pdev);
>>>> +
>>>> + for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> {
>>>> - const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>> - unsigned long start = PFN_DOWN(bar->addr);
>>>> - unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>> + const struct vpci_bar *bar = &header->bars[i];
>>>> + unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>>>> | i;
>>>> + unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>>>> + + bar->size - 1;
>>>>
>>>> if ( !bar->enabled || !rangeset_overlaps_range(mem, start,
>>>> end) ||
>>>> /*
>>>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev,
>>>> unsigned int reg,
>>>> pci_conf_write16(pdev->sbdf, reg, cmd);
>>>> }
>>>>
>>>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>>> uint32_t val, void *data)
>>>> {
>>>> struct vpci_bar *bar = data;
>>>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev,
>>>> unsigned int reg,
>>>> {
>>>> /* If the value written is the current one avoid printing a
>>>> warning. */
>>>> if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>>> + {
>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>> +
>>>> gprintk(XENLOG_WARNING,
>>>> "%04x:%02x:%02x.%u: ignored BAR %lu write with
>>>> memory decoding enabled\n",
>>>> pdev->seg, pdev->bus, slot, func,
>>>> - bar - pdev->vpci->header.bars + hi);
>>>> + bar - header->bars + hi);
>>>> + }
>>>> return;
>>>> }
>>>>
>>>> -
>>>> /*
>>>> * Update the cached address, so that when memory decoding is
>>>> enabled
>>>> * Xen can map the BAR into the guest p2m.
>>>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev,
>>>> unsigned int reg,
>>>> pci_conf_write32(pdev->sbdf, reg, val);
>>>> }
>>>>
>>>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int
>>>> reg,
>>>> + void *data)
>>>> +{
>>>> + return vpci_hw_read32(pdev, reg, data);
>>>> +}
>>>> +
>>>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
>>>> + uint32_t val, void *data)
>>>> +{
>>>> + struct vpci_bar *vbar = data;
>>>> + bool hi = false;
>>>> +
>>>> + if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>> + {
>>>> + ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> + vbar--;
>>>> + hi = true;
>>>> + }
>>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> +}
>>>> +
>>>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int
>>>> reg,
>>>> + void *data)
>>>> +{
>>>> + struct vpci_bar *vbar = data;
>>>> + uint32_t val;
>>>> + bool hi = false;
>>>> +
>>>> + if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>> + {
>>>> + ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> + vbar--;
>>>> + hi = true;
>>>> + }
>>>> +
>>>> + if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type ==
>>>> VPCI_BAR_MEM64_HI )
>>> I think this would be clearer using a switch statement.
>> I'll think about
>>>> + {
>>>> + if ( hi )
>>>> + val = vbar->addr >> 32;
>>>> + else
>>>> + val = vbar->addr & 0xffffffff;
>>>> + if ( val == ~0 )
>>> Strictly speaking I think you are not forced to write 1s to the
>>> reserved 4 bits in the low register (and in the 32bit case).
>> Ah, so Linux kernel, for instance, could have written 0xffffff0 while
>>
>> I expect 0xffffffff?
> I think real hardware would return the size when written 1s to all
> bits except the reserved ones.
>
>>>> + {
>>>> + /* Guests detects BAR's properties and sizes. */
>>>> + if ( !hi )
>>>> + {
>>>> + val = 0xffffffff & ~(vbar->size - 1);
>>>> + val |= vbar->type == VPCI_BAR_MEM32 ?
>>>> PCI_BASE_ADDRESS_MEM_TYPE_32
>>>> + :
>>>> PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>> + val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH
>>>> : 0;
>>>> + }
>>>> + else
>>>> + val = vbar->size >> 32;
>>>> + vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> + vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> + }
>>>> + }
>>>> + else if ( vbar->type == VPCI_BAR_MEM32 )
>>>> + {
>>>> + val = vbar->addr;
>>>> + if ( val == ~0 )
>>>> + {
>>>> + if ( !hi )
>>> There's no way hi can be true at this point AFAICT.
>> Sure, thank you
>>>> + {
>>>> + val = 0xffffffff & ~(vbar->size - 1);
>>>> + val |= vbar->type == VPCI_BAR_MEM32 ?
>>>> PCI_BASE_ADDRESS_MEM_TYPE_32
>>>> + :
>>>> PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>> + val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH
>>>> : 0;
>>>> + }
>>>> + }
>>>> + }
>>>> + else
>>>> + {
>>>> + val = vbar->addr;
>>>> + }
>>>> + return val;
>>>> +}
>>>> +
>>>> static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>> uint32_t val, void *data)
>>>> {
>>>> - struct vpci_header *header = &pdev->vpci->header;
>>>> + struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>> struct vpci_bar *rom = data;
>>>> uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>>>> uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>>>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev,
>>>> unsigned int reg,
>>>> rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>>> }
>>> Don't you need to also protect a domU from writing to the ROM BAR
>>> register?
>> ROM was not a target of this RFC as I have no HW to test that, but final
>> code must
>>
>> also handle ROM as well, you are right
>>
>>>>
>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned
>>>> int reg,
>>>> + void *data)
>>>> +{
>>>> + struct vpci_bar *vbar, *bar = data;
>>>> +
>>>> + if ( is_hardware_domain(current->domain) )
>>>> + return bar_read_hwdom(pdev, reg, data);
>>>> +
>>>> + vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>> + if ( !vbar )
>>>> + return ~0;
>>>> +
>>>> + return bar_read_guest(pdev, reg, vbar);
>>>> +}
>>>> +
>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int
>>>> reg,
>>>> + uint32_t val, void *data)
>>>> +{
>>>> + struct vpci_bar *bar = data;
>>>> +
>>>> + if ( is_hardware_domain(current->domain) )
>>>> + bar_write_hwdom(pdev, reg, val, data);
>>>> + else
>>>> + {
>>>> + struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev,
>>>> bar->index);
>>>> +
>>>> + if ( !vbar )
>>>> + return;
>>>> + bar_write_guest(pdev, reg, val, vbar);
>>>> + }
>>>> +}
>>> You should assign different handlers based on whether the domain that
>>> has the device assigned is a domU or the hardware domain, rather than
>>> doing the selection here.
>> Hm, handlers are assigned once in init_bars and this function is only called
>>
>> for hwdom, so there is no way I can do that for the guests. Hence, the
>> dispatcher
> I think we might want to reset the vPCI handlers when a devices gets
> assigned and deassigned.
In ARM case init_bars is called too early: PCI device assignment is currently
initiated by Domain-0' kernel and is done *before* PCI devices are given memory
ranges and BARs assigned:
[ 0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
[ 0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
[ 0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
[ 0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff
pref]
[ 0.429670] pci 0000:00:00.0: enabling Extended Tags
[ 0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
< init_bars >
[ 0.453793] pci 0000:00:00.0: -- IRQ 0
[ 0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might
fail!
[ 0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
< init_bars >
[ 0.471821] pci 0000:01:00.0: -- IRQ 255
[ 0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might
fail!
< BAR assignments below >
[ 0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
[ 0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff
pref]
In case of x86 this is pretty much ok as BARs are already in place, but for ARM
we
need to take care and re-setup vPCI BARs for hwdom. Things are getting even more
complicated if the host PCI bridge is not ECAM like, so you cannot set
mmio_handlers
and trap hwdom's access to the config space to update BARs etc. This is why I
have that
ugly hack for rcar_gen3 to read actual BARs for hwdom.
If we go further and take a look at SR-IOV then when the kernel assigns the
device
(BUS_NOTIFY_ADD_DEVICE) then it already has BARs assigned for virtual functions
(need to double-check that).
> In order to do passthrough to domUs safely
> we will have to add more handlers than what's required for dom0,
Can you please tell what are thinking about? What are the missing handlers?
> and
> having is_hardware_domain sprinkled in all of them is not a suitable
> solution.
I'll try to replace is_hardware_domain with something like:
+/*
+ * Detect whether physical PCI devices in this segment belong
+ * to the domain given, e.g. on x86 all PCI devices live in hwdom,
+ * but in case of ARM this might not be the case: those may also
+ * live in driver domains or even Xen itself.
+ */
+bool pci_is_hardware_domain(struct domain *d, u16 seg)
+{
+#ifdef CONFIG_X86
+ return is_hardware_domain(d);
+#elif CONFIG_ARM
+ return pci_is_owner_domain(d, seg);
+#else
+#error "Unsupported architecture"
+#endif
+}
+
+/*
+ * Get domain which owns this segment: for x86 this is always hardware
+ * domain and for ARM this can be different.
+ */
+struct domain *pci_get_hardware_domain(u16 seg)
+{
+#ifdef CONFIG_X86
+ return hardware_domain;
+#elif CONFIG_ARM
+ return pci_get_owner_domain(seg);
+#else
+#error "Unsupported architecture"
+#endif
+}
This is what I use to properly detect the domain that really owns physical host
bridge
>
> Roger.
Thank you,
Oleksandr
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |