Xen project Mailing List

Re: [Xen-devel] [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu

On Fri, Jun 16, 2017 at 09:09:13AM -0600, Jan Beulich wrote: >>>> On 24.05.17 at 08:56, <chao.gao@xxxxxxxxx> wrote: >> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If >> too many vCPUs are blocked on a given pCPU, it will incur that the list >> grows too long. After a simple analysis, there are 32k domains and >> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's >> PI blocking list. When a wakeup interrupt arrives, the list is >> traversed to find some specific vCPUs to wake them up. This traversal in >> that case would consume much time. >> >> To mitigate this issue, this patch limits the vcpu number on a given >> pCPU, > >This would be a bug, but I think it's the description which is wrong >(or at least imprecise): You don't limit the number of vCPU-s _run_ >on any pCPU, but those tracked on any pCPU-s blocking list. Please >say so here to avoid confusion. Agree. > >> taking factors such as perfomance of common case, current hvm vcpu >> count and current pcpu count into consideration. With this method, for >> the common case, it works fast and for some extreme cases, the list >> length is under control. >> >> The change in vmx_pi_unblock_vcpu() is for the following case: >> vcpu is running -> try to block (this patch may change NSDT to >> another pCPU) but notification comes in time, thus the vcpu > >What does "but notification comes in time" mean? > I mean when local_events_need_delivery() in vcpu_block() return true. >> goes back to running station -> VM-entry (we should set NSDT again, > >s/station/state/ ? > >> reverting the change we make to NSDT in vmx_vcpu_block()) > >Overall I'm not sure I really understand what you try to explain >here. Will put it above the related change. I wanted to explain why we need this change if a vcpu can be added to a remote pcpu (means the vcpu isn't running on this pcpu). a vcpu may go through the two different paths from calling vcpu_block() to VM-entry: Path1: vcpu_block()->vmx_vcpu_block()->local_events_need_delivery(return true) -> vmx_pi_unblock_vcpu (during VM-entry) Path2: vcpu_block()->vmx_vcpu_block()->local_events_need_delivery(return false) -> vmx_pi_switch_from() -> vmx_pi_switch_to() ->vmx_pi_unblock_vcpu (during VM-entry) For migration a vcpu to another pcpu would lead to a incorrect pi_desc->ndst, vmx_pi_switch_to() re-assigns pi_desc->ndst. It was enough for Path1 (no one changed the pi_desc->ndst field and changed the binding between pcpu and vcpu) and Path2. But, now vmx_vcpu_block() would change pi_desc->ndst to another pcpu to receive wakeup interrupt. If local_events_need_delivery() returns true, we should correct pi_desc->ndst to current pcpu in vmx_pi_unblock_vcpu(). > >> --- a/xen/arch/x86/hvm/vmx/vmx.c >> +++ b/xen/arch/x86/hvm/vmx/vmx.c >> @@ -100,16 +100,62 @@ void vmx_pi_per_cpu_init(unsigned int cpu) >> spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock); >> } >> >> +/* >> + * By default, the local pcpu (means the one the vcpu is currently running >> on) >> + * is chosen as the destination of wakeup interrupt. But if the vcpu number >> of >> + * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable >> + * one. >> + * >> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where >> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total >> + * number of pcpus in the system, and K is a fixed number. Experments shows >> + * the maximum time to wakeup a vcpu from a 128-entry blocking list is about >> + * 22us, which is tolerable. So choose 128 as the fixed number K. > >Giving and kind of absolute time value requires also stating on what >hardware this was measured. > >> + * This policy makes sure: >> + * 1) for common cases, the limit won't be reached and the local pcpu is >> used >> + * which is beneficial to performance (at least, avoid an IPI when >> unblocking >> + * vcpu). >> + * 2) for the worst case, the blocking list length scales with the vcpu >> count >> + * divided by the pcpu count. >> + */ >> +#define PI_LIST_FIXED_NUM 128 >> +#define PI_LIST_LIMIT (atomic_read(&num_hvm_vcpus) / num_online_cpus() >> + \ >> + PI_LIST_FIXED_NUM) >> + >> +static bool pi_over_limit(int count) > >Can a caller validly pass a negative argument? Otherwise unsigned int >please. > >> +{ >> + /* Compare w/ constant first to save an atomic read in the common case >> */ > >As an atomic read is just a normal read on x86, does this really matter? agree. > >> + return ((count > PI_LIST_FIXED_NUM) && >> + (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) + >> + PI_LIST_FIXED_NUM)); > >Right above you've #define-d PI_LIST_LIMIT - why do you open code >it here? Also note that the outer pair of parentheses is pointless (and >hampering readability). > >> static void vmx_vcpu_block(struct vcpu *v) >> { >> unsigned long flags; >> - unsigned int dest; >> + unsigned int dest, pi_cpu; >> spinlock_t *old_lock; >> - spinlock_t *pi_blocking_list_lock = >> - &per_cpu(vmx_pi_blocking, v->processor).lock; >> struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc; >> + spinlock_t *pi_blocking_list_lock; >> + >> + pi_cpu = v->processor; >> + retry: >> + pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock; >> >> spin_lock_irqsave(pi_blocking_list_lock, flags); >> + /* >> + * Since pi_cpu may now be one other than the one v is currently >> + * running on, check to make sure that it's still up. >> + */ >> + if ( unlikely((!cpu_online(pi_cpu)) || > >But this check comest to late then: You've already used per-CPU >data of an offline CPU by the time you make it here. I'm also not >you really need the lock here. A read_atomic() or the counter >would suffice afaics (of course the writers then would need to >use write_atomic() or add_sized()). > >> + pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) ) > >Indentation. > >> + { >> + pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map); > >With this, how could the CPU be offline by the time you make it >back to the check above. Thanks to point it out. It would incur a bug. I think we should do things like this: IF pi_blocking_list of current pcpu doesn't over the limit: add the vcpu to current pcpu. ELSE add the vcpu to another pcpu. To add the vcpu to another pcpu, we should avoid concurrency with vmx_pi_desc_fixup(). Thus, a lock (e.g. remote_pi_list_lock) can solve this potential concurrency. Using this lock like below: in vmx_vcpu_block(): IF pi_blocking_list of current pcpu doesn't over the limit: add the vcpu to current pcpu ELSE acquire remote_pi_list_lock choose another online pcpu (don't worry this pcpu would goes offline for we hold the remote_pi_list_lock, which blocks calling vmx_pi_desc_fixup(), thus at least we can add this vcpu to the pi_blocking_list before cleanup) add the vcpu to the chosen pcpu release remote_pi_list_lock in vmx_pi_desc_fixup(): acquire remote_pi_list_lock ... release remote_pi_list_lock Thanks Chao _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.