Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling
> From: Wu, Feng
> Sent: Thursday, December 03, 2015 4:36 PM
>
> This is the core logic handling for VT-d posted-interrupts. Basically it
> deals with how and when to update posted-interrupts during the following
> scenarios:
> - vCPU is preempted
> - vCPU is slept
> - vCPU is blocked
>
> When vCPU is preempted/slept, we update the posted-interrupts during
> scheduling by introducing two new architecutral scheduler hooks:
> vmx_pi_switch_from() and vmx_pi_switch_to(). When vCPU is blocked, we
> introduce a new architectural hooks: arch_vcpu_block() to update

hooks -> hook

> posted-interrupts descriptor.
>
> Besides that, before VM-entry, we will make sure the 'NV' filed is set
> to 'posted_intr_vector' and the vCPU is not in any blocking lists, which
> is needed when vCPU is running in non-root mode. The reason we do this
> check is because we change the posted-interrupts descriptor in
> vcpu_block(), however, we don't change it back in vcpu_unblock() or when
> vcpu_block() directly returns due to event delivery (in fact, we don't
> need to do it in the two places, that is why we do it before VM-Entry).
>
> When we handle the lazy context switch for the following two scenarios:
> - Preempted by a tasklet, which uses in an idle context.
> - the prev vcpu is in offline and no new available vcpus in run queue.
> We don't change the 'SN' bit in posted-interrupt descriptor, this
> may incur spurious PI notification events, but since PI notification
> event is only sent when 'ON' is clear, and once the PI notification
> is sent, ON is set by hardware, hence no more notification events
> before 'ON' is clear. Besides that, spurious PI notification events are
> going to happen from time to time in Xen hypervisor, such as, when
> guests trap to Xen and PI notification event happens, there is
> nothing Xen actually needs to do about it, the interrupts will be
> delivered to guest at the next time we do a VMENTRY.
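(As a reference point for the comments below: the structure all of this
manipulates is the posted-interrupt descriptor. A rough sketch of its layout,
where the field and bit names are my own reading of the VT-d spec rather than
a copy from vmcs.h:

    /* Posted-interrupt descriptor: one per vCPU, 64-byte aligned. */
    struct pi_desc {
        u32 pir[8];              /* PIR: posted requests, one bit per vector */
        union {
            struct {
                u16 on : 1,      /* ON: notification outstanding */
                    sn : 1,      /* SN: suppress notification */
                    rsvd_1 : 14;
                u8  nv;          /* NV: notification vector */
                u8  rsvd_2;
                u32 ndst;        /* NDST: notification destination APIC ID */
            };
            u64 control;
        };
        u32 rsvd[6];
    } __attribute__ ((aligned (64)));

Hardware owns ON/PIR updates; this patch juggles NV, NDST and SN from
software.)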
> CC: Keir Fraser <keir@xxxxxxx>
> CC: Jan Beulich <jbeulich@xxxxxxxx>
> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> CC: Kevin Tian <kevin.tian@xxxxxxxxx>
> CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> CC: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Suggested-by: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> Suggested-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Suggested-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Feng Wu <feng.wu@xxxxxxxxx>
> ---
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 6c2b512..3368cf2 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -7019,6 +7019,12 @@ void hvm_domain_soft_reset(struct domain *d)
>      hvm_destroy_all_ioreq_servers(d);
>  }
>
> +void arch_vcpu_block(struct vcpu *v)
> +{
> +    if ( v->arch.vcpu_block )
> +        v->arch.vcpu_block(v);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index 000d06e..0f23fce 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -676,6 +676,8 @@ int vmx_cpu_up(void)
>      if ( cpu_has_vmx_vpid )
>          vpid_sync_all();
>
> +    vmx_pi_per_cpu_init(cpu);
> +
>      return 0;
>  }
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 39dc500..0d9462e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
>  static void vmx_invlpg_intercept(unsigned long vaddr);
>  static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>
> +/*
> + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we

vCPU -> vCPUs

> + * can find which vCPU should be woken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> +static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> +
>  uint8_t __read_mostly posted_intr_vector;
> +uint8_t __read_mostly pi_wakeup_vector;
> +
> +void vmx_pi_per_cpu_init(unsigned int cpu)
> +{
> +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +}
> +
> +void vmx_vcpu_block(struct vcpu *v)
> +{
> +    unsigned long flags;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> +
> +    /*
> +     * The vCPU is blocking, we need to add it to one of the per pCPU lists.
> +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
> +     * the per-CPU list, we also save it to posted-interrupt descriptor and
> +     * make it as the destination of the wake-up notification event.

The second part - "we also save it to posted-interrupt descriptor" - is not
reflected within this function. Do you mean "we have saved it to..." or
"we will save it later to..." in other places?

> +     */
> +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> +                  &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> +    ASSERT(!pi_test_sn(pi_desc));
> +
> +    /*
> +     * We don't need to set the 'NDST' field, since it should point to
> +     * the same pCPU as v->processor.
> +     */
> +
> +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
> +}
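(For anyone tracing the call path: arch_vcpu_block() above is what hooks this
into the scheduler. My rough recollection of the vcpu_block() side in
common/schedule.c - a sketch from memory, the exact code in the tree may
differ - is:

    void vcpu_block(void)
    {
        struct vcpu *v = current;

        set_bit(_VPF_blocked, &v->pause_flags);

        /* Lets VMX redirect 'NV' to pi_wakeup_vector before we commit. */
        arch_vcpu_block(v);

        /* Check for events /after/ blocking: avoids wakeup waiting race. */
        if ( local_events_need_delivery() )
            clear_bit(_VPF_blocked, &v->pause_flags);
        else
            raise_softirq(SCHEDULE_SOFTIRQ);
    }

This is the "vcpu_block() directly returns due to event delivery" case the
commit message mentions: the descriptor has already been switched to the
wakeup vector by that point, and nothing here switches it back, hence the
fix-up before VM-entry.)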
> +static void vmx_pi_switch_from(struct vcpu *v)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) ||
> +         test_bit(_VPF_blocked, &v->pause_flags) )
> +        return;
> +
> +    /*
> +     * The vCPU has been preempted or went to sleep. We don't need to send
> +     * notification event to a non-running vcpu, the interrupt information

I might be wrong on this, but my gut feeling is that 'non-running' includes
both runnable and blocked vcpus...

> +     * will be delivered to it before VM-ENTRY when the vcpu is scheduled
> +     * to run next time.
> +     */
> +    pi_set_sn(pi_desc);
> +}
> +
> +static void vmx_pi_switch_to(struct vcpu *v)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    if ( x2apic_enabled )
> +        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> +    else
> +        write_atomic(&pi_desc->ndst,
> +                     MASK_INSR(cpu_physical_id(v->processor),
> +                     PI_xAPIC_NDST_MASK));
> +
> +    pi_clear_sn(pi_desc);
> +}
> +
> +static void vmx_pi_state_to_normal(struct vcpu *v)
> +{
> +    unsigned long flags;
> +    unsigned int pi_block_cpu;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> +
> +    /*
> +     * Set 'NV' field back to posted_intr_vector, so the
> +     * Posted-Interrupts can be delivered to the vCPU when
> +     * it is running in non-root mode.
> +     */
> +    if ( pi_desc->nv != posted_intr_vector )
> +        write_atomic(&pi_desc->nv, posted_intr_vector);
> +
> +    /* the vCPU is not on any blocking list. */
> +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> +    if ( pi_block_cpu == NR_CPUS )
> +        return;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);
> +
> +    /*
> +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU was
> +     * removed from the blocking list while we are acquiring the lock.
> +     */
> +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> +    {
> +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> +                               flags);
> +        return;
> +    }
> +
> +    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> +                           flags);
> +}
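(For reference while reading pi_set_sn()/pi_clear_sn()/pi_test_on() above:
my understanding is these are thin wrappers over atomic bit operations on the
descriptor's control word, roughly as below. The bit positions are my
assumptions from the spec, not quoted from the headers:

    /* Assumed bit positions within pi_desc->control. */
    #define POSTED_INTR_ON 0
    #define POSTED_INTR_SN 1

    static inline void pi_set_sn(struct pi_desc *pi_desc)
    {
        set_bit(POSTED_INTR_SN, &pi_desc->control);
    }

    static inline void pi_clear_sn(struct pi_desc *pi_desc)
    {
        clear_bit(POSTED_INTR_SN, &pi_desc->control);
    }

    static inline int pi_test_sn(struct pi_desc *pi_desc)
    {
        return pi_desc->sn;
    }

    static inline int pi_test_on(struct pi_desc *pi_desc)
    {
        return pi_desc->on;
    }

The atomicity matters because hardware may set ON and PIR bits concurrently
with these software updates to SN/NV/NDST.)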
>
>  static int vmx_domain_initialise(struct domain *d)
>  {
> @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>
>      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
>
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +
>      v->arch.schedule_tail    = vmx_do_resume;
>      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
>      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
>
> +    if ( iommu_intpost )
> +        v->arch.vcpu_block = vmx_vcpu_block;
> +
>      if ( (rc = vmx_create_vmcs(v)) != 0 )
>      {
>          dprintk(XENLOG_WARNING,
> @@ -734,6 +865,7 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>      vmx_save_guest_msrs(v);
>      vmx_restore_host_msrs();
>      vmx_save_dr(v);
> +    vmx_pi_switch_from(v);
>  }
>
>  static void vmx_ctxt_switch_to(struct vcpu *v)
> @@ -758,6 +890,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
>
>      vmx_restore_guest_msrs(v);
>      vmx_restore_dr(v);
> +    vmx_pi_switch_to(v);
>  }
>
> @@ -2014,6 +2147,40 @@ static struct hvm_function_table __initdata vmx_function_table = {
>      .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
>  };
>
> +/* Handle VT-d posted-interrupt when VCPU is blocked. */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx, *tmp;
> +    struct vcpu *v;
> +    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, smp_processor_id());
> +    struct list_head *blocked_vcpus =
> +        &per_cpu(pi_blocked_vcpu, smp_processor_id());
> +    LIST_HEAD(list);

What's this? It doesn't seem to be used anywhere in this function.

> +
> +    ack_APIC_irq();
> +    this_cpu(irq_count)++;
> +
> +    spin_lock(lock);
> +
> +    /*
> +     * XXX: The length of the list depends on how many vCPU is current
> +     * blocked on this specific pCPU. This may hurt the interrupt latency
> +     * if the list grows to too many entries.
> +     */
> +    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
> +    {
> +        if ( pi_test_on(&vmx->pi_desc) )
> +        {
> +            list_del(&vmx->pi_blocked_vcpu_list);
> +            vmx->pi_block_cpu = NR_CPUS;
> +            v = container_of(vmx, struct vcpu, arch.hvm_vmx);
> +            vcpu_unblock(v);
> +        }
> +    }
> +
> +    spin_unlock(lock);
> +}
> +

I may be missing something here. Who will clear the ON bit? A test read?
It's unclear from the design doc:

+Outstanding Notification (ON): Indicate if there is a notification event
+outstanding (not processed by processor or software) for this Posted
+Interrupt Descriptor. When this field is 0, hardware modifies it from 0
+to 1 when generating a notification event, and the entity receiving the
+notification event (processor or software) resets it as part of posted
+interrupt processing.
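(My tentative understanding, which it would be good to spell out in the
description: hardware sets ON when it posts a notification, and the consumer
clears it. On the software side I would expect the clearing to happen when
PIR is synced into the vIRR, along these lines - a sketch from memory, where
pi_get_pir() is an assumed helper that atomically reads and clears one word
of the PIR:

    static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
    {
        return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
    }

    static void vmx_sync_pir_to_irr(struct vcpu *v)
    {
        struct vlapic *vlapic = vcpu_vlapic(v);
        unsigned int group, i;
        DECLARE_BITMAP(pending_intr, NR_VECTORS);

        /* Clear ON first, so a new notification can be posted ... */
        if ( !pi_test_and_clear_on(&v->arch.hvm_vmx.pi_desc) )
            return;

        /* ... then drain PIR into the virtual APIC's IRR. */
        for ( group = 0; group < ARRAY_SIZE(pending_intr); group++ )
            pending_intr[group] = pi_get_pir(&v->arch.hvm_vmx.pi_desc, group);

        for_each_set_bit(i, pending_intr, NR_VECTORS)
            vlapic_set_irq(vlapic, i, 0);
    }

If that is the intent, a pointer to it in the comment above
pi_wakeup_interrupt() would help.)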
Thanks,
Kevin