
Re: [Xen-devel] [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu



On Fri, Jul 7, 2017 at 7:48 AM, Chao Gao <chao.gao@xxxxxxxxx> wrote:
> Currently, a blocked vCPU is put on its pCPU's PI blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow too
> long. In the worst case of 32k domains with 128 vCPUs per domain,
> about 4M vCPUs may be blocked on one pCPU's PI blocking list. When a
> wakeup interrupt arrives, the list is traversed to find the specific
> vCPUs to wake up, and in that case the traversal would consume much
> time.
>
> To mitigate this issue, this patch limits the number of vCPUs tracked on a
> given pCPU's blocking list, taking into consideration factors such as
> common-case performance, the current hvm vCPU count, and the current pCPU
> count. With this method, the common case stays fast, and in the extreme
> cases the list length is kept under control.
>
> With this patch, when a vcpu is about to block, we check whether the
> length of the PI blocking list of the pcpu the vcpu is running on exceeds
> the limit, which is the average number of vcpus per pcpu plus a constant.
> If not, the vcpu is added to this pcpu's PI blocking list. Otherwise,
> another online pcpu is chosen to accept the vcpu.
>
> Signed-off-by: Chao Gao <chao.gao@xxxxxxxxx>
> ---
> v4:
>  - use a new lock to avoid adding a blocked vcpu to an offline pcpu's
>  blocking list.
>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 114 insertions(+), 22 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index ecd6485..04e9aa6 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);
> +
> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})
> +
>  void vmx_pi_per_cpu_init(unsigned int cpu)
>  {
>      INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * By default, the local pcpu (i.e. the one the vcpu is currently running on)
> + * is chosen as the destination of the wakeup interrupt. But if the vcpu count
> + * on that pcpu exceeds a limit, another pcpu is chosen until we find a
> + * suitable one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit on the vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experiment on a
> + * skylake server with 112 cpus and 64G memory shows that waking up a vcpu
> + * from a 128-entry blocking list takes at most about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is
> + * used, which is beneficial to performance (at least, it avoids an IPI when
> + * unblocking a vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)
> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;

Is there any reason to hide this calculation behind a #define, when
it's only used once anyway?

Also -- the vast majority of the time, .counter will be <
PI_LIST_FIXED_NUM; there's no reason to do an atomic read and an
integer division in that case.  I would do this:

if ( likely(per_cpu(vmx_pi_blocking, cpu).counter <= PI_LIST_FIXED_NUM) )
    return 0;

return per_cpu(vmx_pi_blocking, cpu).counter >
       PI_LIST_FIXED_NUM + (atomic_read(&num_hvm_vcpus) / num_online_cpus());

Also, I personally think it would make the code more readable to say,
"pi_under_limit()" instead; that way...

> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {

...here you put the most common thing first, and the exceptional case
second (which I think makes the code easier to understand).
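
I.e., something like:

    if ( likely(pi_under_limit(pi_cpu)) )
    {
        /* Common case: add to the local pcpu's list */
    }
    else
    {
        /* Exceptional case: go find another pcpu */
    }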

In fact, you might consider putting this whole thing in a function;
something like:

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    if ( pi_under_limit(pi_cpu) )
    {
        spin_lock_irqsave([pi lock], *flags);
        return pi_cpu;
    }

    /* Loop looking for pi's in other places */
}
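
Then the call site in vmx_vcpu_block() would reduce to something like
(untested, with a scalar 'flags'):

    pi_cpu = pi_get_blocking_cpu(v->processor, &flags);
    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;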

Probably also worth mentioning briefly in a comment why this loop is
guaranteed to terminate (presumably: the limit is at least v_tot/p_tot,
so the per-cpu counters can't all be over the limit at the same time).
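
E.g., a rough and untested sketch (assuming cpumask_cycle() over
cpu_online_map, and glossing over the remote_pbl_operation locking):

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    /*
     * Guaranteed to terminate: the limit is at least
     * num_hvm_vcpus / num_online_cpus() + PI_LIST_FIXED_NUM, and the
     * per-cpu counters sum to at most num_hvm_vcpus, so not every
     * online pcpu can be over the limit at once.
     */
    while ( !pi_under_limit(pi_cpu) )
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);

    spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock, *flags);
    return pi_cpu;
}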

 -George
