[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0



On 2023/12/5 18:32, Jan Beulich wrote:
> On 05.12.2023 10:19, Roger Pau Monné wrote:
>> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>>
>>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>>> when thay are enabled.
>>>>>>>>>>
>>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>>> mapped into a domU.
>>>>>>>>>>
>>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>>> unmasked.
>>>>>>>>>>
>>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>>>>      }
>>>>>>>>>  
>>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>>>      {
>>>>>>>>>          /*
>>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding 
>>>>>>>>> hvm.irq_lock
>>>>>>>>
>>>>>>>> There are some issues with this approach.
>>>>>>>>
>>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>>> to assert that the configuration is the intended one.  A guest is
>>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>>
>>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>>
>>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>>
>>>>>>> That could work.
>>>>>>>
>>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>>
>>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>>> this syntax regardless of the domain type it's called from.
>>>>>
>>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>>> we don't need to wait until polarity and trigger are set appropriately
>>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>>
>>>> The logic is certainly not straightforward, and it could benefit from
>>>> some comments.
>>>>
>>>> The irq permissions are a bit special, in that they get setup when the
>>>> IRQ is mapped.
>>>>
>>>> The problem however is not so much with IRQ permissions, that we can
>>>> indeed sort out internally in Xen.  Such check in dom0 has the side
>>>> effect of preventing the IRQ from being assigned to a domU without the
>>>> hardware source being properly configured AFAICT.
>>>
>>> Now I understand why you made a comment previously about Xen having to
>>> configure trigger and polarity for these interrupts on its own.
>>>
>>>
>>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>>> polarity to be configured to work. But the suggestion of doing it a
>>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>>
>>>>> But maybe we can find another location, maybe within
>>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>>
>>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>>> already been done.
>>>>
>>>> But then we would end up in a situation where the
>>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>>> configured, which I think it's not what we want.
>>>>
>>>> One option would be to allow mp_register_gsi() to be called multiple
>>>> times, and update the IO-APIC pin configuration as long as the pin is
>>>> not unmasked.  That would propagate each dom0 RTE update to the
>>>> underlying IO-APIC.  However such approach relies on dom0 configuring
>>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>>> think it's not a very reliable option.
>>>>
>>>> Another option would be to modify the toolstack to setup the GSI
>>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
>>>> email, since we only care about PCI device passthrough the legacy INTx
>>>> should always be level triggered and low polarity.
>>>>
>>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>>> thinking:
>>>>>
>>>>>
>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>> index 4e40d3609a..16d56fe851 100644
>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, 
>>>>> unsigned int trig,
>>>>>          return ret;
>>>>>      }
>>>>>  
>>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> -    if ( ret )
>>>>> -    {
>>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>>> -                 gsi, ret);
>>>>> -        return ret;
>>>>> -    }
>>>>> -
>>>>>      pcidevs_lock();
>>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>>      if ( ret )
>>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>      }
>>>>>  
>>>>> +    if ( is_hardware_domain(d) ) 
>>>>> +    {
>>>>> +        int pirq = gsi, ret;
>>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> +        if ( ret )
>>>>> +        {
>>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: 
>>>>> %d\n",
>>>>> +                    gsi, ret);
>>>>> +            return ret;
>>>>> +        }
>>>>> +    }
>>>>>      if ( is_hardware_domain(d) && unmasked )
>>>>>      {
>>>>>          /*
>>>>
>>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>>> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
>>>> are two different issues here that need to be fixed for PVH dom0:
>>>>
>>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>>
>>> Yes makes sense
>>>
>>>
>>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>>    the IO-APIC pin itself.
>>>>
>>>> First one needs to be fixed internally in Xen, second one will require
>>>> the toolstack to issue an extra hypercall in order to ensure the
>>>> IO-APIC pin is properly configured.
>>>  
>>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>>> correctly. Xen configures them all on its own at boot based on Device
>>> Tree information. I guess it is not possible to do the same on x86?
>>
>> No, not exactly.  There's some interrupt information in the ACPI MADT,
>> but that's just for very specific sources (Interrupt Source Override
>> Structures)
>>
>> Then on AML devices can have resource descriptors that contain
>> information about how interrupts are setup.  However Xen is not able
>> to read any of this information on AML.
>>
>> Legacy PCI interrupts are (always?) level triggered and low polarity,
>> because it's assumed that an interrupt source can be shared between
>> multiple devices.
> 
> Except that as per what you said just in the earlier paragraph ACPI can
> tell us otherwise.
> 
>> I'm however not able to find any reference to this in the PCI spec,
>> hence I'm reluctant to take this for granted in Xen, and default all
>> GSIs >= 16 to such mode.
>>
>> OTOH legacy PCI interrupts are not that used anymore, as almost all
>> devices will support MSI(-X) (because PCIe mandates it) and OSes
>> should prefer the latter.  SR-IOV VF don't even support legacy PCI
>> interrupts anymore.
>>
>>> If
>>> not, then I can see why we would need 1 extra toolstack hypercall for
>>> that (or to bundle the operation of configuring IO-APIC pins together
>>> with an existing toolstack hypercall).
>>
>> One suitable compromise would be to default unconfigured GSIs >= 16 to
>> level-triggered and low-polarity, as I would expect that to work in
>> almost all cases.  We can always introduce the usage of
>> PHYSDEVOP_setup_gsi later if required.
>>
>> Maybe Jan has more input here, would you agree to defaulting non-ISA
>> GSIs to level-triggered, low-polarity in the absence of a specific
>> setup provided by dom0?
> 
> Well, such defaulting is an option, but in case it's wrong we might
> end up with hard to diagnose issues. Personally I'd prefer if we
> didn't take shortcuts here, i.e. if we followed what Dom0 is able
> to read from ACPI.

When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see 
acpi_pci_irq_enable)
I have a version of patch which tried that way, see below:

diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index ada3868c02c2..43e1bda9f946 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/acpi.h>
 #include <linux/export.h>
+#include <linux/pci.h>

 #include <xen/hvc-console.h>

@@ -25,6 +26,127 @@
 bool __ro_after_init xen_pvh;
 EXPORT_SYMBOL_GPL(xen_pvh);

+typedef struct gsi_info {
+       int gsi;
+       int trigger;
+       int polarity;
+       int pirq;
+} gsi_info_t;
+
+struct acpi_prt_entry {
+       struct acpi_pci_id      id;
+       u8                      pin;
+       acpi_handle             link;
+       u32                     index;          /* GSI, or link _CRS index */
+};
+
+static int xen_pvh_get_gsi_info(struct pci_dev *dev,
+                                                               gsi_info_t 
*gsi_info)
+{
+       int gsi;
+       u8 pin = 0;
+       struct acpi_prt_entry *entry;
+       int trigger = ACPI_LEVEL_SENSITIVE;
+       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
+                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
+
+       if (dev)
+               pin = dev->pin;
+       if (!pin) {
+               xen_raw_printk("No interrupt pin configured\n");
+               return -EINVAL;
+       }
+
+       entry = acpi_pci_irq_lookup(dev, pin);
+       if (entry) {
+               if (entry->link)
+                       gsi = acpi_pci_link_allocate_irq(entry->link,
+                                                        entry->index,
+                                                        &trigger, &polarity,
+                                                        NULL);
+               else
+                       gsi = entry->index;
+       } else
+               return -EINVAL;
+
+       gsi_info->gsi = gsi;
+       gsi_info->trigger = trigger;
+       gsi_info->polarity = polarity;
+
+       return 0;
+}
+
+static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
+{
+       struct physdev_map_pirq map_irq;
+       int ret;
+
+       map_irq.domid = DOMID_SELF;
+       map_irq.type = MAP_PIRQ_TYPE_GSI;
+       map_irq.index = gsi_info->gsi;
+       map_irq.pirq = gsi_info->gsi;
+
+       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
+       gsi_info->pirq = map_irq.pirq;
+
+       return ret;
+}
+
+static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
+{
+       struct physdev_unmap_pirq unmap_irq;
+
+       unmap_irq.domid = DOMID_SELF;
+       unmap_irq.pirq = gsi_info->pirq;
+
+       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+}
+
+static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
+{
+       struct physdev_setup_gsi setup_gsi;
+
+       setup_gsi.gsi = gsi_info->gsi;
+       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 
1);
+       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+}
+
+int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+       int ret;
+       gsi_info_t gsi_info;
+
+       if (!dev) {
+               return -EINVAL;
+       }
+
+       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
+       if (ret) {
+               xen_raw_printk("Fail to get gsi info!\n");
+               return ret;
+       }
+
+       ret = xen_pvh_map_pirq(&gsi_info);
+       if (ret) {
+               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", 
gsi_info.gsi);
+               return ret;
+       }
+
+       ret = xen_pvh_setup_gsi(&gsi_info);
+       if (ret == -EEXIST) {
+               ret = 0;
+               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
+       } else if (ret) {
+               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
+               xen_pvh_unmap_pirq(&gsi_info);
+       }
+
+       return ret;
+}
+EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
+
 void __init xen_pvh_init(struct boot_params *boot_params)
 {
        u32 msr;
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index ff30ceca2203..630fe0a34bc6 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
 }
 #endif /* CONFIG_X86_IO_APIC */

-static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
 {
        struct acpi_prt_entry *entry = NULL;
        struct pci_dev *bridge;
diff --git a/drivers/xen/xen-pciback/pci_stub.c 
b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..1abd4dad6f40 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <xen/events.h>
 #include <xen/pci.h>
+#include <xen/acpi.h>
 #include <xen/xen.h>
 #include <asm/xen/hypervisor.h>
 #include <xen/interface/physdev.h>
@@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
        if (err)
                goto config_release;

+       if (xen_initial_domain() && xen_pvh_domain()) {
+               err = xen_pvh_passthrough_gsi(dev);
+               if (err)
+                       goto config_release;
+       }
+
        if (dev->msix_cap) {
                struct physdev_pci_device ppdev = {
                        .seg = pci_domain_nr(dev->bus),
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 641dc4843987..368d56ba2c5e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);

 struct pci_dev;

+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
 int acpi_pci_irq_enable (struct pci_dev *dev);
 void acpi_penalize_isa_irq(int irq, int active);
 bool acpi_isa_irq_available(int irq);
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
index b1e11863144d..ce7f5554f88e 100644
--- a/include/xen/acpi.h
+++ b/include/xen/acpi.h
@@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
                acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
        }
 }
+int xen_pvh_passthrough_gsi(struct pci_dev *dev);
 #else
 static inline void xen_acpi_sleep_register(void)
 {

> 
> Jan

-- 
Best regards,
Jiqian Chen.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.