Xen project Mailing List

Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups

From: Pasi Kärkkäinen <pasik@xxxxxx>

Date: Tue, 9 Sep 2014 12:33:46 +0300

Cc: Ian Campbell <Ian.Campbell@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>

Delivery-date: Tue, 09 Sep 2014 09:33:54 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Sep 09, 2014 at 09:33:37AM +0100, Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed. Considering that Windows
> broadcasts IPIs from its timer interrupt, which at least at certain
> times can run at 1kHz, it is clear that this can result in good
> guest behavior.

I guess that should say "it is clear that this *can't* result in good guest behaviour" ..

-- Pasi

> In fact, on said hardware guests with significantly
> beyond 40 vCPU-s simply hung when e.g. ServerManager gets started.
> With the patch here, average broadcast times for a 64-vCPU guest went
> to a measured maximum of 310us (which is still quite a lot).
>
> This isn't just helping to reduce the number of ICR writes when the
> host APICs run in clustered mode, but also to reduce them by
> suppressing the sends altogether when - by the time
> cpu_raise_softirq_batch_finish() is reached - the remote CPU already
> managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
> update of softirq_pending(cpu), being on the monitored cache line -
> should make the remote CPU wake up ahead of the ICR being sent,
> allowing the wait-for-ICR-idle latencies to be reduced (to perhaps a
> large part due to overlapping the wakeups of multiple CPUs).
>
> Of course this necessarily increases the latencies for the remote
> CPU wakeup at least slightly. To weigh between the effects, the
> condition to enable batching in vlapic_ipi() may need further tuning.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> RFC for two reasons:
> 1) I started to consider elimination of the event-check IPIs in a more
>    general way when MONITOR/MWAIT is in use: As long as the remote CPU
>    is known to be MWAITing (or in the process of resuming after MWAIT),
>    the write of softirq_pending(cpu) ought to be sufficient to wake it.
>    This would yield the patch here pointless on MONITOR/MWAIT capable
>    hardware.
> 2) The condition when to enable batching in vlapic_ipi() is already
>    rather complex, but it is nevertheless unclear whether it would
>    benefit from further tuning (as mentioned above).
>
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -447,12 +447,30 @@ void vlapic_ipi(
>
>      default: {
>          struct vcpu *v;
> -        for_each_vcpu ( vlapic_domain(vlapic), v )
> +        const struct domain *d = vlapic_domain(vlapic);
> +        bool_t batch = d->max_vcpus > 2 &&
> +                       (short_hand
> +                        ? short_hand != APIC_DEST_SELF
> +                        : vlapic_x2apic_mode(vlapic)
> +                          ? dest_mode ? hweight16(dest) > 1
> +                                      : dest == 0xffffffff
> +                          : dest_mode
> +                            ? hweight8(dest &
> +                                       GET_xAPIC_DEST_FIELD(
> +                                           vlapic_get_reg(vlapic,
> +                                                          APIC_DFR))) > 1
> +                            : dest == 0xff);
> +
> +        if ( batch )
> +            cpu_raise_softirq_batch_begin();
> +        for_each_vcpu ( d, v )
>          {
>              if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
>                                     short_hand, dest, dest_mode) )
>                  vlapic_accept_irq(v, icr_low);
>          }
> +        if ( batch )
> +            cpu_raise_softirq_batch_finish();
>          break;
>      }
>      }
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
>
>  static softirq_handler softirq_handlers[NR_SOFTIRQS];
>
> +static DEFINE_PER_CPU(cpumask_t, batch_mask);
> +static DEFINE_PER_CPU(unsigned int, batching);
> +
>  static void __do_softirq(unsigned long ignore_mask)
>  {
>      unsigned int i, cpu;
> @@ -70,22 +73,56 @@ void open_softirq(int nr, softirq_handle
>
>  void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
>  {
> -    int cpu;
> -    cpumask_t send_mask;
> +    unsigned int cpu = smp_processor_id();
> +    cpumask_t send_mask, *raise_mask;
> +
> +    if ( !per_cpu(batching, cpu) || in_irq() )
> +    {
> +        cpumask_clear(&send_mask);
> +        raise_mask = &send_mask;
> +    }
> +    else
> +        raise_mask = &per_cpu(batch_mask, cpu);
>
> -    cpumask_clear(&send_mask);
>      for_each_cpu(cpu, mask)
>          if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
> -            cpumask_set_cpu(cpu, &send_mask);
> +            cpumask_set_cpu(cpu, raise_mask);
>
> -    smp_send_event_check_mask(&send_mask);
> +    if ( raise_mask == &send_mask )
> +        smp_send_event_check_mask(raise_mask);
>  }
>
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>  {
> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> -         && (cpu != smp_processor_id()) )
> +    unsigned int this_cpu = smp_processor_id();
> +
> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
> +         || (cpu == this_cpu) )
> +        return;
> +
> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
>          smp_send_event_check_cpu(cpu);
> +    else
> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> +    ++per_cpu(batching, smp_processor_id());
> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> +    unsigned int cpu, this_cpu = smp_processor_id();
> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> +
> +    ASSERT(per_cpu(batching, this_cpu));
> +    for_each_cpu ( cpu, mask )
> +        if ( !softirq_pending(cpu) )
> +            cpumask_clear_cpu(cpu, mask);
> +    smp_send_event_check_mask(mask);
> +    cpumask_clear(mask);
> +    --per_cpu(batching, this_cpu);
>  }
>
>  void raise_softirq(unsigned int nr)
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
>  void raise_softirq(unsigned int nr);
>
> +void cpu_raise_softirq_batch_begin(void);
> +void cpu_raise_softirq_batch_finish(void);
> +
>  /*
>   * Process pending softirqs on this CPU. This should be called periodically
>   * when performing work that prevents softirqs from running in a timely manner.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
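
The batching idea described in the quoted patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not Xen code: every name in it (`pending`, `batch_mask`, `raise_softirq_on`, `remote_handle`, `demo_broadcast`, and so on) is hypothetical, the per-CPU state is reduced to plain arrays for a single sending CPU, and an "IPI" is just a counter increment. The sketch records the target CPU number in the deferred mask and, when the batch is closed, drops every target whose pending word has already been cleared, which is the suppression effect the patch description mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 8u

/* Hypothetical stand-ins for hypervisor state (not Xen's real data). */
static uint32_t pending[NCPUS];          /* per-CPU pending softirq bits   */
static uint32_t batch_mask;              /* CPUs with a deferred wakeup    */
static unsigned int batching;            /* nesting depth of open batches  */
static unsigned int ipis_sent;           /* simulated event-check IPIs     */
static const unsigned int this_cpu = 0;  /* the CPU doing the sending      */

/* Simulated IPI send: one "IPI" per CPU set in the mask. */
static void send_event_check(uint32_t mask)
{
    for (unsigned int cpu = 0; cpu < NCPUS; cpu++)
        if (mask & (1u << cpu))
            ipis_sent++;
}

static void batch_begin(void)
{
    batching++;
}

/* Mark softirq nr pending on cpu; send the wakeup now or defer it. */
static void raise_softirq_on(unsigned int cpu, unsigned int nr)
{
    bool was_pending = (pending[cpu] & (1u << nr)) != 0;

    pending[cpu] |= 1u << nr;
    if (was_pending || cpu == this_cpu)
        return;                          /* no wakeup needed */

    if (!batching)
        send_event_check(1u << cpu);     /* immediate send */
    else
        batch_mask |= 1u << cpu;         /* defer: remember the target CPU */
}

/* Close the batch: skip CPUs that already handled their softirqs. */
static void batch_finish(void)
{
    assert(batching);
    for (unsigned int cpu = 0; cpu < NCPUS; cpu++)
        if (!pending[cpu])
            batch_mask &= ~(1u << cpu);
    send_event_check(batch_mask);
    batch_mask = 0;
    batching--;
}

/* Simulate a remote CPU noticing and handling its pending work. */
static void remote_handle(unsigned int cpu)
{
    pending[cpu] = 0;
}

/* Broadcast softirq 3 to CPUs 1..7; CPUs 2 and 5 react before the end. */
static unsigned int demo_broadcast(bool batch)
{
    for (unsigned int cpu = 0; cpu < NCPUS; cpu++)
        pending[cpu] = 0;
    batch_mask = 0;
    batching = 0;
    ipis_sent = 0;

    if (batch)
        batch_begin();
    for (unsigned int cpu = 1; cpu < NCPUS; cpu++)
        raise_softirq_on(cpu, 3);
    remote_handle(2);
    remote_handle(5);
    if (batch)
        batch_finish();

    return ipis_sent;
}
```

Calling `demo_broadcast(false)` from a small `main()` models the unbatched path, where each of the seven targets gets its own send as soon as it is marked; `demo_broadcast(true)` defers the sends and skips the two CPUs that have already cleared their pending word by the time the batch is finished. The model ignores `in_irq()`, real cpumasks, and MONITOR/MWAIT wakeups, so it says nothing about actual latencies.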
