|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups
On Tue, Sep 09, 2014 at 09:33:37AM +0100, Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed. Considering that Windows
> broadcasts IPIs from its timer interrupt, which at least at certain
> times can run at 1kHz, it is clear that this can result in good
> guest behavior.
I guess that should say "it is clear that this *can't* result in good guest
behaviour" ..
-- Pasi
> In fact, on said hardware guests with significantly
> beyond 40 vCPU-s simply hung when e.g. ServerManager gets started.
> With the patch here, average broadcast times for a 64-vCPU guest went
> to a measured maximum of 310us (which is still quite a lot).
>
> This isn't just helping to reduce the number of ICR writes when the
> host APICs run in clustered mode, but also to reduce them by
> suppressing the sends altogether when - by the time
> cpu_raise_softirq_batch_finish() is reached - the remote CPU already
> managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
> update of softirq_pending(cpu), being on the monitored cache line -
> should make the remote CPU wake up ahead of the ICR being sent,
> allowing the wait-for-ICR-idle latencies to be reduced (to perhaps a
> large part due to overlapping the wakeups of multiple CPUs).
>
> Of course this necessarily increases the latencies for the remote
> CPU wakeup at least slightly. To weigh between the effects, the
> condition to enable batching in vlapic_ipi() may need further tuning.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> RFC for two reasons:
> 1) I started to consider elimination of the event-check IPIs in a more
> general way when MONITOR/MWAIT is in use: As long as the remote CPU
> is known to be MWAITing (or in the process of resuming after MWAIT),
> the write of softirq_pending(cpu) ought to be sufficient to wake it.
> This would yield the patch here pointless on MONITOR/MWAIT capable
> hardware.
> 2) The condition when to enable batching in vlapic_ipi() is already
> rather complex, but it is nevertheless unclear whether it would
> benefit from further tuning (as mentioned above).
>
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -447,12 +447,30 @@ void vlapic_ipi(
>
> default: {
> struct vcpu *v;
> - for_each_vcpu ( vlapic_domain(vlapic), v )
> + const struct domain *d = vlapic_domain(vlapic);
> + bool_t batch = d->max_vcpus > 2 &&
> + (short_hand
> + ? short_hand != APIC_DEST_SELF
> + : vlapic_x2apic_mode(vlapic)
> + ? dest_mode ? hweight16(dest) > 1
> + : dest == 0xffffffff
> + : dest_mode
> + ? hweight8(dest &
> + GET_xAPIC_DEST_FIELD(
> + vlapic_get_reg(vlapic,
> + APIC_DFR))) > 1
> + : dest == 0xff);
> +
> + if ( batch )
> + cpu_raise_softirq_batch_begin();
> + for_each_vcpu ( d, v )
> {
> if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
> short_hand, dest, dest_mode) )
> vlapic_accept_irq(v, icr_low);
> }
> + if ( batch )
> + cpu_raise_softirq_batch_finish();
> break;
> }
> }
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
>
> static softirq_handler softirq_handlers[NR_SOFTIRQS];
>
> +static DEFINE_PER_CPU(cpumask_t, batch_mask);
> +static DEFINE_PER_CPU(unsigned int, batching);
> +
> static void __do_softirq(unsigned long ignore_mask)
> {
> unsigned int i, cpu;
> @@ -70,22 +73,56 @@ void open_softirq(int nr, softirq_handle
>
> void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
> {
> - int cpu;
> - cpumask_t send_mask;
> + unsigned int cpu = smp_processor_id();
> + cpumask_t send_mask, *raise_mask;
> +
> + if ( !per_cpu(batching, cpu) || in_irq() )
> + {
> + cpumask_clear(&send_mask);
> + raise_mask = &send_mask;
> + }
> + else
> + raise_mask = &per_cpu(batch_mask, cpu);
>
> - cpumask_clear(&send_mask);
> for_each_cpu(cpu, mask)
> if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
> - cpumask_set_cpu(cpu, &send_mask);
> + cpumask_set_cpu(cpu, raise_mask);
>
> - smp_send_event_check_mask(&send_mask);
> + if ( raise_mask == &send_mask )
> + smp_send_event_check_mask(raise_mask);
> }
>
> void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
> {
> - if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> - && (cpu != smp_processor_id()) )
> + unsigned int this_cpu = smp_processor_id();
> +
> + if ( test_and_set_bit(nr, &softirq_pending(cpu))
> + || (cpu == this_cpu) )
> + return;
> +
> + if ( !per_cpu(batching, this_cpu) || in_irq() )
> smp_send_event_check_cpu(cpu);
> + else
> + set_bit(nr, &per_cpu(batch_mask, this_cpu));
> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> + ++per_cpu(batching, smp_processor_id());
> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> + unsigned int cpu, this_cpu = smp_processor_id();
> + cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> +
> + ASSERT(per_cpu(batching, this_cpu));
> + for_each_cpu ( cpu, mask )
> + if ( !softirq_pending(cpu) )
> + cpumask_clear_cpu(cpu, mask);
> + smp_send_event_check_mask(mask);
> + cpumask_clear(mask);
> + --per_cpu(batching, this_cpu);
> }
>
> void raise_softirq(unsigned int nr)
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
> void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
> void raise_softirq(unsigned int nr);
>
> +void cpu_raise_softirq_batch_begin(void);
> +void cpu_raise_softirq_batch_finish(void);
> +
> /*
> * Process pending softirqs on this CPU. This should be called periodically
> * when performing work that prevents softirqs from running in a timely
> manner.
>
>
> x86/HVM: batch vCPU wakeups
>
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed. Considering that Windows
> broadcasts IPIs from its timer interrupt, which at least at certain
> times can run at 1kHz, it is clear that this can result in good
> guest behavior. In fact, on said hardware guests with significantly
> beyond 40 vCPU-s simply hung when e.g. ServerManager gets started.
> With the patch here, average broadcast times for a 64-vCPU guest went
> to a measured maximum of 310us (which is still quite a lot).
>
> This isn't just helping to reduce the number of ICR writes when the
> host APICs run in clustered mode, but also to reduce them by
> suppressing the sends altogether when - by the time
> cpu_raise_softirq_batch_finish() is reached - the remote CPU already
> managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
> update of softirq_pending(cpu), being on the monitored cache line -
> should make the remote CPU wake up ahead of the ICR being sent,
> allowing the wait-for-ICR-idle latencies to be reduced (to perhaps a
> large part due to overlapping the wakeups of multiple CPUs).
>
> Of course this necessarily increases the latencies for the remote
> CPU wakeup at least slightly. To weigh between the effects, the
> condition to enable batching in vlapic_ipi() may need further tuning.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> RFC for two reasons:
> 1) I started to consider elimination of the event-check IPIs in a more
> general way when MONITOR/MWAIT is in use: As long as the remote CPU
> is known to be MWAITing (or in the process of resuming after MWAIT),
> the write of softirq_pending(cpu) ought to be sufficient to wake it.
> This would yield the patch here pointless on MONITOR/MWAIT capable
> hardware.
> 2) The condition when to enable batching in vlapic_ipi() is already
> rather complex, but it is nevertheless unclear whether it would
> benefit from further tuning (as mentioned above).
>
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -447,12 +447,30 @@ void vlapic_ipi(
>
> default: {
> struct vcpu *v;
> - for_each_vcpu ( vlapic_domain(vlapic), v )
> + const struct domain *d = vlapic_domain(vlapic);
> + bool_t batch = d->max_vcpus > 2 &&
> + (short_hand
> + ? short_hand != APIC_DEST_SELF
> + : vlapic_x2apic_mode(vlapic)
> + ? dest_mode ? hweight16(dest) > 1
> + : dest == 0xffffffff
> + : dest_mode
> + ? hweight8(dest &
> + GET_xAPIC_DEST_FIELD(
> + vlapic_get_reg(vlapic,
> + APIC_DFR))) > 1
> + : dest == 0xff);
> +
> + if ( batch )
> + cpu_raise_softirq_batch_begin();
> + for_each_vcpu ( d, v )
> {
> if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
> short_hand, dest, dest_mode) )
> vlapic_accept_irq(v, icr_low);
> }
> + if ( batch )
> + cpu_raise_softirq_batch_finish();
> break;
> }
> }
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
>
> static softirq_handler softirq_handlers[NR_SOFTIRQS];
>
> +static DEFINE_PER_CPU(cpumask_t, batch_mask);
> +static DEFINE_PER_CPU(unsigned int, batching);
> +
> static void __do_softirq(unsigned long ignore_mask)
> {
> unsigned int i, cpu;
> @@ -70,22 +73,56 @@ void open_softirq(int nr, softirq_handle
>
> void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
> {
> - int cpu;
> - cpumask_t send_mask;
> + unsigned int cpu = smp_processor_id();
> + cpumask_t send_mask, *raise_mask;
> +
> + if ( !per_cpu(batching, cpu) || in_irq() )
> + {
> + cpumask_clear(&send_mask);
> + raise_mask = &send_mask;
> + }
> + else
> + raise_mask = &per_cpu(batch_mask, cpu);
>
> - cpumask_clear(&send_mask);
> for_each_cpu(cpu, mask)
> if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
> - cpumask_set_cpu(cpu, &send_mask);
> + cpumask_set_cpu(cpu, raise_mask);
>
> - smp_send_event_check_mask(&send_mask);
> + if ( raise_mask == &send_mask )
> + smp_send_event_check_mask(raise_mask);
> }
>
> void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
> {
> - if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> - && (cpu != smp_processor_id()) )
> + unsigned int this_cpu = smp_processor_id();
> +
> + if ( test_and_set_bit(nr, &softirq_pending(cpu))
> + || (cpu == this_cpu) )
> + return;
> +
> + if ( !per_cpu(batching, this_cpu) || in_irq() )
> smp_send_event_check_cpu(cpu);
> + else
> + set_bit(nr, &per_cpu(batch_mask, this_cpu));
> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> + ++per_cpu(batching, smp_processor_id());
> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> + unsigned int cpu, this_cpu = smp_processor_id();
> + cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> +
> + ASSERT(per_cpu(batching, this_cpu));
> + for_each_cpu ( cpu, mask )
> + if ( !softirq_pending(cpu) )
> + cpumask_clear_cpu(cpu, mask);
> + smp_send_event_check_mask(mask);
> + cpumask_clear(mask);
> + --per_cpu(batching, this_cpu);
> }
>
> void raise_softirq(unsigned int nr)
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
> void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
> void raise_softirq(unsigned int nr);
>
> +void cpu_raise_softirq_batch_begin(void);
> +void cpu_raise_softirq_batch_finish(void);
> +
> /*
> * Process pending softirqs on this CPU. This should be called periodically
> * when performing work that prevents softirqs from running in a timely
> manner.
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |