Xen project Mailing List

Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Wed, 20 Jan 2016 01:35:24 -0700

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Wed, 20 Jan 2016 08:35:47 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

>>> On 20.01.16 at 08:49, <feng.wu@xxxxxxxxx> wrote: >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] >> Sent: Monday, January 18, 2016 11:14 PM >> >>> On 03.12.15 at 09:35, <feng.wu@xxxxxxxxx> wrote: >> > --- a/xen/arch/x86/hvm/vmx/vmx.c >> > +++ b/xen/arch/x86/hvm/vmx/vmx.c >> > @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int >> msr, uint64_t msr_content); >> > + ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS); >> > + >> > + /* >> > + * The vCPU is blocking, we need to add it to one of the per pCPU >> > lists. >> > + * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it >> for >> > + * the per-CPU list, we also save it to posted-interrupt descriptor >> > and >> > + * make it as the destination of the wake-up notification event. >> > + */ >> > + v->arch.hvm_vmx.pi_block_cpu = v->processor; >> > + >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, >> > + v->arch.hvm_vmx.pi_block_cpu), flags); >> > + list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list, >> > + &per_cpu(pi_blocked_vcpu, >> > v->arch.hvm_vmx.pi_block_cpu)); >> > + spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, >> > + v->arch.hvm_vmx.pi_block_cpu), flags); >> > + >> > + ASSERT(!pi_test_sn(pi_desc)); >> > + >> > + /* >> > + * We don't need to set the 'NDST' field, since it should point to >> > + * the same pCPU as v->processor. >> > + */ >> > + >> > + write_atomic(&pi_desc->nv, pi_wakeup_vector); >> >> Stray blank line between comment and corresponding code. >> >> Also the ASSERT() is rather more disconnected from the write >> than seems desirable: Wouldn't it be better to cmpxchg() the >> whole 32-bit value, validating that SN is clear in the result? > > Not quite understand this. The control field is 64 bits, do you > mean cmpxchg() the whole 64-bit value like follows: > > + do { > + old.control = new.control = pi_desc->control; > + new.nv = pi_wakeup_vector; > + } while ( cmpxchg(&pi_desc->control, old.control, new.control) > + != old.control ); > > This a somewhat like the implementation in the earlier versions, > why do you want to change it back? Did you read what I've said? I'm worried about the disconnect of assertion and update: You really care about SN being clear _after_ the update aiui. And since NDST doesn't change, a 32-bit CMPXCHG would seem to suffice. >> > + pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu; >> > + if ( pi_block_cpu == NR_CPUS ) >> > + return; >> >> Are this condition and the one in the immediately preceding if() >> connected in any way? > > I am not sure I understand this correctly. Which one is > " the one in the immediately preceding if()"? If you hadn't ripped out too much of the context, it would be a matter of pointing you back up. Now I have to re-quote the full code: + if ( pi_desc->nv != posted_intr_vector ) + write_atomic(&pi_desc->nv, posted_intr_vector); + + /* the vCPU is not on any blocking list. */ + pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu; + if ( pi_block_cpu == NR_CPUS ) + return; These are the two if()-s the question is about. >> I.e. could the latter perhaps become an >> ASSERT() by connecting the two blocks suitably? I'm simply >> trying to limit the number of different cases to consider mentally >> when looking at this code ... > > If we get a true from above check, it means the vcpu was removed by > pi_wakeup_interrupt(), then we just need to return. If we get a false, > we will acquire the spinlock as the following code does, there are two > scenarios: > - pi_wakeup_interrupt() acquires the spinlock before us, then when > we get the lock, 'v->arch.hvm_vmx.pi_block_cpu' has already been > set to NR_CPUS by pi_wakeup_interrupt(), we need do nothing. > - We get the spinlock before pi_wakeup_interrupt(), this time we need > to remove the vCPU from the list and set 'v->arch.hvm_vmx.pi_block_cpu' > to NR_CPUS. This is all only about the second condition, but the question really is whether one being true or false may imply the result of other. In which case this would better be ASSERT()ed than handled by two conditionals. >> > @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v) >> > >> > spin_lock_init(&v->arch.hvm_vmx.vmcs_lock); >> > >> > + INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list); >> > + >> > + v->arch.hvm_vmx.pi_block_cpu = NR_CPUS; >> > + >> > v->arch.schedule_tail = vmx_do_resume; >> > v->arch.ctxt_switch_from = vmx_ctxt_switch_from; >> > v->arch.ctxt_switch_to = vmx_ctxt_switch_to; >> > >> > + if ( iommu_intpost ) >> > + v->arch.vcpu_block = vmx_vcpu_block; >> >> Considering the locking question above - couldn't this be done only >> when a device gets added, and the pointer zapped upon removal >> of the last device, at once saving the conditional inside the hook? > > That is another solution, but I think it make simple thing difficult, > I feel the current logic is simple and clear. Btw, this hook is not > a critical path, another conditional in it should not a big deal. Then you didn't understand: The question isn't this path, but the path where the hook gets called if non-NULL (and hence the possibility to avoid such needless calls). >> > --- a/xen/common/schedule.c >> > +++ b/xen/common/schedule.c >> > @@ -802,6 +802,8 @@ void vcpu_block(void) >> > >> > set_bit(_VPF_blocked, &v->pause_flags); >> > >> > + arch_vcpu_block(v); >> > + >> > /* Check for events /after/ blocking: avoids wakeup waiting race. */ >> > if ( local_events_need_delivery() ) >> > { >> >> ISTR some previous discussion on this placement - wasn't it >> determined that putting it in the else part here would be >> sufficient for your needs and better in terms of overhead? >> Or is there some race possible if moved past the invocation >> of local_events_need_delivery()? > > The discussion happened several month ago, it is really hard to recall > every details, I have to look for it from my index. It is really bad to > last for so long for the review. I understand the frustration on your end, but I can only justify it by the frustration at my end resulting from everyone dumping out their patch sets with only very few people helping to deal with that flood. Plus, after previous iterations, I expected the review to be more involved than it turned out in the end (i.e. you've done a good job in simplifying things after the feedback you had received - thanks for that). > We need to call arch_vcpu_block() before local_events_need_delivery(), > since VT-d hardware can issue notification event when we are in > arch_vcpu_block(), it that happens, 'ON' bit is set, then we will check > it in local_events_need_delivery(). If we do arch_vcpu_block() in the > else part, we cannot handle this. This is one reason I can recall, I am > not sure whether there are other concerns since it has been really > a long time. The current implementation is fully discussed with scheduler > maintainers. Okay, this all makes sense. It's just that I don't see how the ON bit would get checked by local_events_need_delivery(). But that may in turn be because, with the long turnaround, I've lost some of the details of the rest of this series. >> > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h >> > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h >> >> > + unsigned int pi_block_cpu; >> >> Looking at all the use sites I wonder whether it wouldn't make for >> easier to read code if you instead stored the lock to acquire. This >> would then perhaps also prepare a road for balancing lists growing >> too large (i.e. rather than having one list per CPU, you could >> then have a variable number of lists, depending on current needs, >> and [kind of] freely select one of them). > > Since this patch is still experimental, can we put the "length of the list" > issue in the next stage when we turn it to default on? I in no way meant to say that the list length issue needs to be dealt with right away. All I was saying was that storing a lock pointer here instead of a CPU number may make that easier (and might therefore be a good idea to do right away). Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.