
Re: [Xen-devel] [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu



On Fri, Jul 7, 2017 at 7:48 AM, Chao Gao <chao.gao@xxxxxxxxx> wrote:
> Currently, a blocked vCPU is put on its pCPU's PI blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow too
> long. In the worst case of 32k domains with 128 vCPUs per domain,
> about 4M vCPUs may be blocked on one pCPU's PI blocking list. When a
> wakeup interrupt arrives, the list is traversed to find the specific
> vCPUs to wake up, and in that case the traversal would consume much
> time.
>
> To mitigate this issue, this patch limits the number of vCPUs tracked on a
> given pCPU's blocking list, taking into consideration factors such as
> common-case performance, the current hvm vCPU count, and the current pCPU
> count. With this method, the common case stays fast, and in the extreme
> cases the list length is kept under control.
>
> With this patch, when a vcpu is about to block, we check whether the
> length of the PI blocking list of the pcpu the vcpu is running on exceeds
> the limit, which is the average number of vcpus per pcpu plus a constant.
> If not, the vcpu is added to this pcpu's PI blocking list. Otherwise,
> another online pcpu is chosen to accept the vcpu.
>
> Signed-off-by: Chao Gao <chao.gao@xxxxxxxxx>
> ---
> v4:
>  - use a new lock to avoid adding a blocked vcpu to an offline pcpu's
>  blocking list.
>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 114 insertions(+), 22 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index ecd6485..04e9aa6 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);
> +
> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})
> +
>  void vmx_pi_per_cpu_init(unsigned int cpu)
>  {
>      INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * By default, the local pcpu (i.e. the one the vcpu is currently running on)
> + * is chosen as the destination of the wakeup interrupt. But if the vcpu count
> + * on that pcpu exceeds a limit, another pcpu is chosen until we find a
> + * suitable one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit on the vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experiment on a
> + * skylake server with 112 cpus and 64G memory shows that waking up a vcpu
> + * from a 128-entry blocking list takes at most about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is
> + * used, which is beneficial to performance (at least, it avoids an IPI when
> + * unblocking a vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)
> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;

Is there any reason to hide this calculation behind a #define, when
it's only used once anyway?

Also -- the vast majority of the time, .counter will be <
PI_LIST_FIXED_NUM; there's no reason to do an atomic read and an
integer division in that case.  I would do this:

if ( likely(per_cpu(vmx_pi_blocking, cpu).counter <= PI_LIST_FIXED_NUM) )
    return 0;

return per_cpu(vmx_pi_blocking, cpu).counter >
       PI_LIST_FIXED_NUM + (atomic_read(&num_hvm_vcpus) / num_online_cpus());

Also, I personally think it would make the code more readable to say,
"pi_under_limit()" instead; that way...

> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {

...here you put the most common thing first, and the exceptional case
second (which I think makes the code easier to understand).
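
I.e., something like:

    if ( likely(pi_under_limit(pi_cpu)) )
    {
        /* Common case: add to the local pcpu's list */
    }
    else
    {
        /* Exceptional case: go find another pcpu */
    }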

In fact, you might consider putting this whole thing in a function;
something like:

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    if ( pi_under_limit(pi_cpu) )
    {
        spin_lock_irqsave([pi lock], *flags);
        return pi_cpu;
    }

    /* Loop looking for pi's in other places */
}
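
Then the call site in vmx_vcpu_block() would reduce to something like
(untested, with a scalar 'flags'):

    pi_cpu = pi_get_blocking_cpu(v->processor, &flags);
    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;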

Probably also worth mentioning briefly in a comment why this loop is
guaranteed to terminate (presumably: the limit is at least v_tot/p_tot,
so the per-cpu counters can't all be over the limit at the same time).
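
E.g., a rough and untested sketch (assuming cpumask_cycle() over
cpu_online_map, and glossing over the remote_pbl_operation locking):

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    /*
     * Guaranteed to terminate: the limit is at least
     * num_hvm_vcpus / num_online_cpus() + PI_LIST_FIXED_NUM, and the
     * per-cpu counters sum to at most num_hvm_vcpus, so not every
     * online pcpu can be over the limit at once.
     */
    while ( !pi_under_limit(pi_cpu) )
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);

    spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock, *flags);
    return pi_cpu;
}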

 -George
