[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 01/17] VT-d Posted-intterrupt (PI) design




> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Wednesday, August 12, 2015 11:35 PM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich; Zhang, Yang Z
> Subject: Re: [Xen-devel] [PATCH v5 01/17] VT-d Posted-intterrupt (PI) design
> 
> On Wed, Aug 12, 2015 at 10:35:22AM +0800, Feng Wu wrote:
> 
> The title has an extra 'i'.
> 
> > Add the design doc for VT-d PI.
> >
> > CC: Kevin Tian <kevin.tian@xxxxxxxxx>
> > CC: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> > CC: Jan Beulich <jbeulich@xxxxxxxx>
> > CC: Keir Fraser <keir@xxxxxxx>
> > CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> > Signed-off-by: Feng Wu <feng.wu@xxxxxxxxx>
> > Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>
> > ---
> >  docs/misc/vtd-pi.txt | 333
> +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 333 insertions(+)
> >  create mode 100644 docs/misc/vtd-pi.txt
> >
> > diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
> > new file mode 100644
> > index 0000000..98a77ba
> > --- /dev/null
> > +++ b/docs/misc/vtd-pi.txt
> > @@ -0,0 +1,333 @@
> > +Authors: Feng Wu <feng.wu@xxxxxxxxx>
> > +
> > +VT-d Posted-interrupt (PI) design for XEN
> > +
> > +Background
> > +==========
> > +With the development of virtualization, there are more and more device
> > +assignment requirements. However, today when a VM is running with
> > +assigned devices (such as, NIC), external interrupt handling for the 
> > assigned
> > +devices always needs VMM intervention.
> > +
> > +VT-d Posted-interrupt is a more enhanced method to handle interrupts
> > +in the virtualization environment. Interrupt posting is the process by
> > +which an interrupt request is recorded in a memory-resident
> > +posted-interrupt-descriptor structure by the root-complex, followed by
> > +an optional notification event issued to the CPU complex.
> > +
> > +With VT-d Posted-interrupt we can get the following advantages:
> > +- Direct delivery of external interrupts to running vCPUs without VMM
> > +intervention
> > +- Decrease the interrupt migration complexity. On vCPU migration, software
> > +can atomically co-migrate all interrupts targeting the migrating vCPU. For
> > +virtual machines with assigned devices, migrating a vCPU across pCPUs
> > +either incur the overhead of forwarding interrupts in software (e.g. via 
> > VMM
> 
> s/incur/incurs/
> > +generated IPIS), or complexity to independently migrate each interrupt
> targeting
> 
> s/IPIS/IPIs
> > +the vCPU to the new pCPU. However, after enabling VT-d PI, the destination
> vCPU
> > +of an external interrupt from assigned devices is stored in the IRTE (i.e.
> > +Posted-interrupt Descriptor Address), when vCPU is migrated to another
> pCPU,
> > +we will set this new pCPU in the 'NDST' filed of Posted-interrupt 
> > descriptor,
> this
> > +make the interrupt migration automatic.
> > +
> > +Here is what Xen currently does for external interrupts from assigned
> devices:
> > +
> > +When a VM is running and an external interrupt from an assigned device
> occurs
> > +for it. VM-EXIT happens, then:
> > +
> > +vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci()
> -->
> > +raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
> > +
> > +softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
> > +
> > +dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> 
> > vmsi_deliver()
> -->
> > +vmsi_inj_irq() --> vlapic_set_irq()
> > +
> > +vlapic_set_irq() does the following things:
> > +1. If CPU-side posted-interrupt is supported, call 
> > vmx_deliver_posted_intr()
> to deliver
> > +the virtual interrupt via posted-interrupt infrastructure.
> > +2. Else if CPU-side posted-interrupt is not supported, set the related 
> > vIRR in
> vLAPIC
> > +page and call vcpu_kick() to kick the related vCPU. Before VM-Entry,
> vmx_intr_assist()
> > +will help to inject the interrupt to guests.
> > +
> > +However, after VT-d PI is supported, when a guest is running in non-root 
> > and
> an
> > +external interrupt from an assigned device occurs for it. No VM-Exit is
> needed,
> > +the guest can handle this totally in non-root mode, thus avoiding all the
> above
> > +code flow.
> > +
> > +Posted-interrupt Introduction
> > +========================
> > +There are two components to the Posted-interrupt architecture:
> > +Processor Support and Root-Complex Support
> > +
> > +- Processor Support
> > +Posted-interrupt processing is a feature by which a processor processes
> > +the virtual interrupts by recording them as pending on the virtual-APIC
> > +page.
> > +
> > +Posted-interrupt processing is enabled by setting the process posted
> > +interrupts VM-execution control. The processing is performed in response
> > +to the arrival of an interrupt with the posted-interrupt notification 
> > vector.
> > +In response to such an interrupt, the processor processes virtual 
> > interrupts
> > +recorded in a data structure called a posted-interrupt descriptor.
> > +
> > +More information about APICv and CPU-side Posted-interrupt, please refer
> > +to Chapter 29, and Section 29.6 in the Intel SDM:
> >
> +http://www.intel.com/content/dam/www/public/us/en/documents/manuals/
> 64-ia-32-architectures-software-developer-manual-325462.pdf
> > +
> > +- Root-Complex Support
> > +Interrupt posting is the process by which an interrupt request (from IOAPIC
> > +or MSI/MSIx capable sources) is recorded in a memory-resident
> > +posted-interrupt-descriptor structure by the root-complex, followed by
> > +an optional notification event issued to the CPU complex. The interrupt
> > +request arriving at the root-complex carry the identity of the interrupt
> > +request source and a 'remapping-index'. The remapping-index is used to
> > +look-up an entry from the memory-resident interrupt-remap-table. Unlike
> > +with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> 
> s/with//
> > +interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> > +descriptor. The virtual-vector specifies the vector of the interrupt to be
> > +recorded in the posted-interrupt descriptor. The posted-interrupt 
> > descriptor
> > +hosts storage for the virtual-vectors and contains the attributes of the
> > +notification event (interrupt) to be issued to the CPU complex to inform
> > +CPU/software about pending interrupts recorded in the posted-interrupt
> > +descriptor.
> > +
> > +More information about VT-d PI, please refer to
> >
> +http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolo
> gy/vt-directed-io-spec.html
> > +
> > +Important Definitions
> > +==================
> > +There are some changes to IRTE and posted-interrupt descriptor after
> > +VT-d PI is introduced:
> > +IRTE: Interrupt Remapping Table Entry
> > +Posted-interrupt Descriptor Address: the address of the posted-interrupt
> descriptor
> > +Virtual Vector: the guest vector of the interrupt
> > +URG: indicates if the interrupt is urgent
> > +
> > +Posted-interrupt descriptor:
> > +The Posted Interrupt Descriptor hosts the following fields:
> > +Posted Interrupt Request (PIR): Provide storage for posting (recording)
> interrupts (one bit
> > +per vector, for up to 256 vectors).
> > +
> > +Outstanding Notification (ON): Indicate if there is a notification event
> outstanding (not
> > +processed by processor or software) for this Posted Interrupt Descriptor.
> When this field is 0,
> > +hardware modifies it from 0 to 1 when generating a notification event, and
> the entity receiving
> > +the notification event (processor or software) resets it as part of posted
> interrupt processing.
> > +
> > +Suppress Notification (SN): Indicate if a notification event is to be
> suppressed (not
> > +generated) for non-urgent interrupt requests (interrupts processed through
> an IRTE with
> > +URG=0).
> > +
> > +Notification Vector (NV): Specify the vector for notification event 
> > (interrupt).
> > +
> > +Notification Destination (NDST): Specify the physical APIC-ID of the
> destination logical
> > +processor for the notification event.
> > +
> > +Design Overview
> > +==============
> > +In this design, we will cover the following items:
> > +1. Add a variable to control whether enable VT-d posted-interrupt or not.
> > +2. VT-d PI feature detection.
> > +3. Extend posted-interrupt descriptor structure to cover VT-d PI specific
> stuff.
> 
> s/stuff/items/
> 
> But that really is up to you :-)
> 
> > +4. Extend IRTE structure to support VT-d PI.
> > +5. Introduce a new global vector which is used for waking up the blocked
> vCPU.
> > +6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> configuration).
> > +7. Update posted-interrupt descriptor during vCPU scheduling (when the
> state
> > +of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
> > +RUNSTATE_runnable / RUNSTATE_offline).
> > +8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> notification handler).
> > +9. New boot command line for Xen, which controls VT-d PI feature by user.
> > +10. Multicast/broadcast and lowest priority interrupts consideration.
> > +
> > +
> > +Implementation details
> > +===================
> > +- New variable to control VT-d PI
> > +
> > +Like variable 'iommu_intremap' for interrupt remapping, it is very
> straightforward
> > +to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is
> set
> > +only when interrupt remapping and VT-d posted-interrupt are both enabled.
> > +
> > +- VT-d PI feature detection.
> > +Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt
> support.
> > +
> > +- Extend posted-interrupt descriptor structure to cover VT-d PI specific 
> > stuff.
> > +Here is the new structure for posted-interrupt descriptor:
> > +
> > +struct pi_desc {
> > +    DECLARE_BITMAP(pir, NR_VECTORS);
> > +    union {
> > +        struct
> > +        {
> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
> > +        };
> > +        u64 control;
> > +    };
> > +    u32 rsvd[6];
> > +} __attribute__ ((aligned (64)));
> > +
> > +- Extend IRTE structure to support VT-d PI.
> > +
> > +Here is the new structure for IRTE:
> > +/* interrupt remap entry */
> > +struct iremap_entry {
> > +  union {
> > +    struct { u64 lo, hi; };
> > +    struct {
> > +        u16 p       : 1,
> > +            fpd     : 1,
> > +            dm      : 1,
> > +            rh      : 1,
> > +            tm      : 1,
> > +            dlm     : 3,
> > +            avail   : 4,
> > +            res_1   : 4;
> > +        u8  vector;
> > +        u8  res_2;
> > +        u32 dst;
> > +        u16 sid;
> > +        u16 sq      : 2,
> > +            svt     : 2,
> > +            res_3   : 12;
> > +        u32 res_4   : 32;
> > +    } remap;
> > +    struct {
> > +        u16 p       : 1,
> > +            fpd     : 1,
> > +            res_1   : 6,
> > +            avail   : 4,
> > +            res_2   : 2,
> > +            urg     : 1,
> > +            im      : 1;
> > +        u8  vector;
> > +        u8  res_3;
> > +        u32 res_4   : 6,
> > +            pda_l   : 26;
> > +        u16 sid;
> > +        u16 sq      : 2,
> > +            svt     : 2,
> > +            res_5   : 12;
> > +        u32 pda_h;
> > +    } post;
> > +  };
> > +};
> > +
> > +- Introduce a new global vector which is used to wake up the blocked vCPU.
> > +
> > +Currently, there is a global vector 'posted_intr_vector', which is used as 
> > the
> > +global notification vector for all vCPUs in the system. This vector is 
> > stored in
> > +VMCS and CPU considers it as a _special_ vector, uses it to notify the
> related
> > +pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> > +
> > +This existing global vector is a _special_ vector to CPU, CPU handle it in 
> > a
> > +_special_ way compared to normal vectors, please refer to 29.6 in Intel SDM
> >
> +http://www.intel.com/content/dam/www/public/us/en/documents/manuals/
> 64-ia-32-architectures-software-developer-manual-325462.pdf
> > +for more information about how CPU handles it.
> > +
> > +After having VT-d PI, VT-d engine can issue notification event when the
> > +assigned devices issue interrupts. We need add a new global vector to
> > +wakeup the blocked vCPU, please refer to later section in this design for
> > +how to use this new global vector.
> > +
> > +- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> configuration).
> > +After VT-d PI is introduced, the format of IRTE is changed as follows:
> > +   Descriptor Address: the address of the posted-interrupt descriptor
> > +   Virtual Vector: the guest vector of the interrupt
> > +   URG: indicates if the interrupt is urgent
> > +   Other fields continue to have the same meaning
> > +
> > +'Descriptor Address' tells the destination vCPU of this interrupt, since
> > +each vCPU has a dedicated posted-interrupt descriptor.
> > +
> > +'Virtual Vector' tells the guest vector of the interrupt.
> > +
> > +When guest changes the configuration of the interrupts, such as, the
> > +cpu affinity, or the vector, we need to update the associated IRTE
> accordingly.
> > +
> > +- Update posted-interrupt descriptor during vCPU scheduling
> > +
> > +The basic idea here is:
> > +1. When vCPU's state is RUNSTATE_running,
> > +        - Set 'NV' to 'posted_intr_vector'.
> > +        - Clear 'SN' to accept posted-interrupts.
> > +        - Set 'NDST' to the pCPU on which the vCPU will be running.
> > +2. When vCPU's state is RUNSTATE_blocked,
> > +        - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
> > +          related vCPU when posted-interrupt happens for it.
> > +          Please refer to the above section about the new global vector.
> > +        - Clear 'SN' to accept posted-interrupts
> > +3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > +        - Set 'SN' to suppress non-urgent interrupts
> > +          (Current, we only support non-urgent interrupts)
> 
> s/Current/Currently/
> 
> or s/Current/Right now/
> 
> 
> > +         When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> 
> s/,//
> > +         It is not needed to accept posted-interrupt notification event,
> 
> s/,//
> 
> > +         since we don't change the behavior of scheduler when the
> interrupt
> > +         occurs, we still need wait the next scheduling of the vCPU.
> 
> s/wait the next/wait for the next/
> > +         When external interrupts from assigned devices occur, the
> interrupts
> > +         are recorded in PIR, and will be synced to IRR before VM-Entry.
> > +        - Set 'NV' to 'posted_intr_vector'.
> > +
> > +- How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> notification handler).
> > +
> > +Here is the scenario for the usage of the new global vector:
> > +
> > +1. vCPU0 is running on pCPU0
> > +2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> > +3. An external interrupt from an assigned device occurs for vCPU0, if we
> > +still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > +notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > +by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
> > +case is that vCPU0 will never be woken up again since the wakeup event
> > +for it is always consumed by other vCPUs incorrectly. So we need introduce
> > +another global vector, naming 'pi_wakeup_vector' to wake up the blocked
> vCPU.
> > +
> > +After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue 
> > notification
> > +event using this new vector. Since this new vector is not a SPECIAL one to
> CPU,
> > +it is just a normal vector. To cpu, it just receives an normal external
> interrupt,
> 
> s/cpu/CPU/
> > +then we can get control in the handler of this new vector. In this case,
> hypervisor
> > +can do something in it, such as wakeup the blocked vCPU.
> > +
> > +Here are what we do for the blocked vCPU:
> > +1. Define a per-cpu list 'pi_blocked_vcpu', which stored the blocked
> > +vCPU on the pCPU.
> > +2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> > +to the per-cpu list belonging to the pCPU it was running.
> > +3. When the vCPU is unblocked, remove the vCPU from the related pCPU
> list.
> > +
> > +In the handler of 'pi_wakeup_vector', we do:
> > +1. Get the physical CPU.
> > +2. Iterate the list 'pi_blocked_vcpu' of the current pCPU, if 'ON' is set,
> > +we unblock the associated vCPU.
> > +
> > +- New boot command line for Xen, which controls VT-d PI feature by user.
> > +
> > +Like 'intremap' for interrupt remapping, we add a new boot command line
> > +'intpost' for posted-interrupts.
> > +
> > +- Multicast/broadcast and lowest priority interrupts consideration.
> > +
> > +With VT-d PI, the destination vCPU information of an external interrupt
> > +from assigned devices is stored in IRTE, this makes the following
> > +consideration of the design:
> > +1. Multicast/broadcast interrupts cannot be posted.
> > +2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > +(starting from Nehalem) ignore TPR value, and instead supported two other
> > +ways (configurable by BIOS) on how the handle lowest priority interrupts:
> > +   A) Round robin: In this method, the chipset simply delivers lowest 
> > priority
> > +interrupts in a round-robin manner across all the available logical CPUs.
> While
> > +this provides good load balancing, this was not the best thing to do always
> as
> > +interrupts from the same device (like NIC) will start running on all the 
> > CPUs
> > +thrashing caches and taking locks. This led to the next scheme.
> > +   B) Vector hashing: In this method, hardware would apply a hash function
> > +on the vector value in the interrupt request, and use that hash to pick a
> logical
> > +CPU to route the lowest priority interrupt. This way, a given vector always
> goes
> > +to the same logical CPU, avoiding the thrashing problem above.
> > +
> > +So, gist of above is that, lowest priority interrupts has never been 
> > delivered
> as
> > +"lowest priority" in physical hardware.
> > +
> > +Vector hashing is used in this design.
> 
> And with those tiny little changes:
> 
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Thanks a lot for your review on this whole series, Konrad!

Thanks,
Feng

> > --
> > 2.1.0
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxx
> > http://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.