[Xen-changelog] [xen master] x86/HVM: batch vCPU wakeups



commit c47c316d99b3b570d0bb968b99331e6714ef1df7
Author:     Jan Beulich <jbeulich@xxxxxxxx>
AuthorDate: Thu Sep 18 14:44:58 2014 +0200
Commit:     Jan Beulich <jbeulich@xxxxxxxx>
CommitDate: Thu Sep 18 14:44:58 2014 +0200

    x86/HVM: batch vCPU wakeups
    
    Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
    especially when many of the remote pCPUs are in deep C-states. For
    64-vCPU Windows Server 2012 R2 guests on Ivy Bridge hardware,
    accumulated times of over 2ms were observed (average 1.1ms).
    Considering that Windows broadcasts IPIs from its timer interrupt,
    which at least at certain times can run at 1kHz, it is clear that
    this can't result in good guest behavior. In fact, on said hardware,
    guests with significantly more than 40 vCPUs simply hung when e.g.
    ServerManager was started.
    
    This doesn't just help to reduce the number of ICR writes when the
    host APICs run in clustered mode; it also reduces them by suppressing
    the sends altogether when, by the time
    cpu_raise_softirq_batch_finish() is reached, the remote CPU has
    already managed to handle the softirq. Plus, when using
    MONITOR/MWAIT, the update of softirq_pending(cpu), being on the
    monitored cache line, should make the remote CPU wake up ahead of the
    ICR being sent, allowing the wait-for-ICR-idle latencies to be
    reduced (perhaps to a large part due to overlapping the wakeups of
    multiple CPUs).
    
    With this alone (i.e. without the IPI avoidance patch in place),
    average broadcast times for a 64-vCPU guest went down to a measured
    maximum of 310us. With that other patch in place, improvements aren't
    as clear anymore (short-term averages only went down from 255us to
    250us, which clearly is within the error range of the measurements),
    but over the longer term an improvement of the averages is still
    visible. Depending on hardware, long-term maxima were observed to go
    down quite a bit (on the aforementioned hardware), while they were
    seen to go up again on a (single-core) Nehalem (where instead the
    improvement in the average values was more visible).
    
    Of course this necessarily increases the latency of the remote CPU
    wakeup at least slightly. To balance these effects, the condition
    for enabling batching in vlapic_ipi() may need further tuning.
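    
    For instance (purely illustrative, not part of this patch), the
    heuristic could additionally be gated on a minimum guest size; the
    helper name and the BATCH_VCPU_THRESHOLD value below are assumptions
    made only for this sketch:
    
        /* Hypothetical tunable: only batch for sufficiently large guests. */
        #define BATCH_VCPU_THRESHOLD 16
    
        static bool_t want_ipi_batching(struct vlapic *vlapic,
                                        unsigned int short_hand,
                                        uint32_t dest, bool_t dest_mode)
        {
            if ( vlapic_domain(vlapic)->max_vcpus < BATCH_VCPU_THRESHOLD )
                return 0;
    
            /* Fall back to the multicast check introduced by this patch. */
            return is_multicast_dest(vlapic, short_hand, dest, dest_mode);
        }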
    
    Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
    Reviewed-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
    Reviewed-by: Tim Deegan <tim@xxxxxxx>
---
 xen/arch/x86/hvm/vlapic.c |   26 +++++++++++++++++++++++
 xen/common/softirq.c      |   51 ++++++++++++++++++++++++++++++++++++++------
 xen/include/xen/softirq.h |    3 ++
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index cd7e872..47c4eaa 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -409,6 +409,26 @@ void vlapic_handle_EOI_induced_exit(struct vlapic *vlapic, int vector)
     hvm_dpci_msi_eoi(current->domain, vector);
 }
 
+static bool_t is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
+                                uint32_t dest, bool_t dest_mode)
+{
+    if ( vlapic_domain(vlapic)->max_vcpus <= 2 )
+        return 0;
+
+    if ( short_hand )
+        return short_hand != APIC_DEST_SELF;
+
+    if ( vlapic_x2apic_mode(vlapic) )
+        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffff;
+
+    if ( dest_mode )
+        return hweight8(dest &
+                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
+                                                            APIC_DFR))) > 1;
+
+    return dest == 0xff;
+}
+
 void vlapic_ipi(
     struct vlapic *vlapic, uint32_t icr_low, uint32_t icr_high)
 {
@@ -447,12 +467,18 @@ void vlapic_ipi(
 
     default: {
         struct vcpu *v;
+        bool_t batch = is_multicast_dest(vlapic, short_hand, dest, dest_mode);
+
+        if ( batch )
+            cpu_raise_softirq_batch_begin();
         for_each_vcpu ( vlapic_domain(vlapic), v )
         {
             if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
                                    short_hand, dest, dest_mode) )
                 vlapic_accept_irq(v, icr_low);
         }
+        if ( batch )
+            cpu_raise_softirq_batch_finish();
         break;
     }
     }
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index ea86671..22e417a 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
 
 static softirq_handler softirq_handlers[NR_SOFTIRQS];
 
+static DEFINE_PER_CPU(cpumask_t, batch_mask);
+static DEFINE_PER_CPU(unsigned int, batching);
+
 static void __do_softirq(unsigned long ignore_mask)
 {
     unsigned int i, cpu;
@@ -71,24 +74,58 @@ void open_softirq(int nr, softirq_handler handler)
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
     unsigned int cpu, this_cpu = smp_processor_id();
-    cpumask_t send_mask;
+    cpumask_t send_mask, *raise_mask;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
+    {
+        cpumask_clear(&send_mask);
+        raise_mask = &send_mask;
+    }
+    else
+        raise_mask = &per_cpu(batch_mask, this_cpu);
 
-    cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
         if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
              cpu != this_cpu &&
              !arch_skip_send_event_check(cpu) )
-            cpumask_set_cpu(cpu, &send_mask);
+            cpumask_set_cpu(cpu, raise_mask);
 
-    smp_send_event_check_mask(&send_mask);
+    if ( raise_mask == &send_mask )
+        smp_send_event_check_mask(raise_mask);
 }
 
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
-    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id())
-         && !arch_skip_send_event_check(cpu) )
+    unsigned int this_cpu = smp_processor_id();
+
+    if ( test_and_set_bit(nr, &softirq_pending(cpu))
+         || (cpu == this_cpu)
+         || arch_skip_send_event_check(cpu) )
+        return;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
         smp_send_event_check_cpu(cpu);
+    else
+        cpumask_set_cpu(cpu, &per_cpu(batch_mask, this_cpu));
+}
+
+void cpu_raise_softirq_batch_begin(void)
+{
+    ++this_cpu(batching);
+}
+
+void cpu_raise_softirq_batch_finish(void)
+{
+    unsigned int cpu, this_cpu = smp_processor_id();
+    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
+
+    ASSERT(per_cpu(batching, this_cpu));
+    for_each_cpu ( cpu, mask )
+        if ( !softirq_pending(cpu) )
+            cpumask_clear_cpu(cpu, mask);
+    smp_send_event_check_mask(mask);
+    cpumask_clear(mask);
+    --per_cpu(batching, this_cpu);
 }
 
 void raise_softirq(unsigned int nr)
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..0895a16 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask_t *, unsigned int nr);
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 
+void cpu_raise_softirq_batch_begin(void);
+void cpu_raise_softirq_batch_finish(void);
+
 /*
  * Process pending softirqs on this CPU. This should be called periodically
  * when performing work that prevents softirqs from running in a timely manner.
--
generated by git-patchbot for /home/xen/git/xen.git#master
