[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC PATCH v1] Replace tasklets with per-cpu implementation.



On Tue, Sep 09, 2014 at 10:01:09AM +0100, Jan Beulich wrote:
> >>> On 08.09.14 at 21:01, <konrad.wilk@xxxxxxxxxx> wrote:
> > On Wed, Sep 03, 2014 at 09:03:45AM +0100, Jan Beulich wrote:
> >> >>> On 02.09.14 at 22:28, <konrad.wilk@xxxxxxxxxx> wrote:
> >> > I typed this prototype up and asked folks with the right hardware to
> >> > test it. It _ought_ to, thought I think that the tasklet code
> >> > still could use an overhaul.
> >> 
> >> Apart from needing general cleaning up, the main question I have
> >> is whether the dpci_kill() part couldn't be replaced by obtaining a
> >> proper reference on the domain when scheduling a softirq, and
> >> dropping that reference after having handled it.
> > 
> > 
> > The one fatal corner is when we call 'hvm_dirq_assist' with
> > d->arch.hvm_domain.irq.dpci set to NULL. There is even an ASSERT in 
> > 'hvm_dirq_assist' for that.
> > 
> > The one edge condition I am troubled by is when we receive an
> > interrupt and process it - while on anothre CPU we get an hypercall
> > for 'domain_kill'.
> > 
> > That is:
> >  a) 'do_IRQ'-> .. schedule_dpci_for(d)' had been called (and so
> >     d->arch.hvm_domain.irq.dpci is still valid).
> > 
> >  b) on another CPU 'domain_kill' calls domain_relinquish_resources
> >    >pci_release_devices->pci_clean_dpci_irqs which then makes 
> >    d->arch.hvm_domain.irq.dpci NULL.
> > 
> >  c). the 'schedule_tail' (vmx_do_resume) right after a) is called 
> >    which means we call the do_softirq. The 'dpci_softirq' is called
> >    which calls 'hvm_dirq_assist' and blows up.
> > 
> >  d). the 'domain_relinquish_resources' on another CPU continues on trying
> >    to tear down the domain.
> 
> I have to admit that I don't immediately see why this would be a
> problem with the new softirq approach, but not with the previous
> tasklet one. In any event, couldn't domain_relinquish_resources()


Before this patch it used the 'tasklet_kill' in the pci_clean_dpci_irqs which
would wait for the tasklet (if it had been scheduled) to finish up.

In the the patch you had just reviewed (prototype) you asked whether
we could ditch the 'dpci_kill' altogether and depend on refcnts.
The 'dpci_kill' replaced the 'tasklet_kill' in pci_clean_dpci_irqs.

That would work, but we need need some mechanism to stall the
'domain_kill' sequence to make sure that the 'hvm_dirq_assist' runs
before we nuke d->arch.hvm_domain.irq.dpci. Or we fix the softirq
to skip running it if the guest is in process of dying.

> wait for the list of scheduled entities to become empty, while not
> inserting new entries onto the list when ->is_dying?

Of couse. That would require pci_clean_dpci_irqs to call
'dpci_kill' and 'dpci_kill' would wait for the list to become empty.
(That was the first idea I had in the prototype and the code
you just reviewed).

Going back to your original question:
You were wondering if that 'dpci_kill' could be outright removed and
just use ref-counts. The answer is yes, but we MUST check
for d->is_dying == DOMDYING_dying in the softirq otherwise we can blow up.

Or we can go back to what the prototype which had an wait mechanism.

See attached patch (which has been tested).
> 
> > +static void schedule_dpci_for(struct domain *d)
> > +{
> > +    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
> > +    {
> > +        unsigned long flags;
> > +        struct list_head *list;
> > +
> > +        local_irq_save(flags);
> > +        INIT_LIST_HEAD(&d->list);
> > +        get_domain(d); /* To be put by 'dpci_softirq' */
> 
> get_knownalive_domain()?

<nods>
> 
> > +static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head 
> > *list)
> 
> tasklets?

<blushes>
> 
> > +{
> > +    struct domain *d;
> > +
> > +    while ( !list_empty(list) )
> > +    {
> > +        d = list_entry(list->next, struct domain, list);
> > +        list_del(&d->list);
> > +        schedule_dpci_for(d);
> > +    }
> > +}
> > +
> > +static int cpu_callback(
> > +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> > +{
> > +    unsigned int cpu = (unsigned long)hcpu;
> > +
> > +    switch ( action )
> > +    {
> > +    case CPU_UP_PREPARE:
> > +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> > +        break;
> > +    case CPU_UP_CANCELED:
> > +    case CPU_DEAD:
> > +        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
> 
> Can CPUs go down while softirqs are pending on them?

No. By the time we get here, the CPU is no longer "hearing" them.
> 
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +
> > +    return NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block cpu_nfb = {
> > +    .notifier_call = cpu_callback,
> > +    .priority = 99
> 
> If the whole notification is needed, this seemingly arbitrary number
> would need a comment.

I copied it from the tasklet code, which didn't have an comment
either. Will remove the priority and use default.
> 
> > --- a/xen/include/xen/softirq.h
> > +++ b/xen/include/xen/softirq.h
> > @@ -8,6 +8,7 @@ enum {
> >      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
> >      RCU_SOFTIRQ,
> >      TASKLET_SOFTIRQ,
> > +    HVM_DPCI_SOFTIRQ,
> >      NR_COMMON_SOFTIRQS
> >  };
> 
> This should of course also be conditional upon HAS_PASSTHROUGH.

Aye!

Pls see attached patch that I hope follows what you had asked for.
Compile and tested with PCI passthrough:


From f8705ed031d500ce190916969ae63b23bb487cf2 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v2)

The existing tasklet mechanism has a single global
spinlock that is taken every-time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work which is called via an softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to be in sub 1ms intervals (causing said tasklet
to be scheduled for such small intervalls).

The problem appears when PCI passthrough devices are used
over many sockets and we have an mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of its RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

While on other CPUs they might be sitting in a idle loop
and invoked to deliver an VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guests' BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on a 8 socket machine with
PCI passthrough, where four sockets are busy with interrupts,
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of its timeslice (3ms out of 30 ms) being tied up
taking the lock. The latency of the PCI interrupts delieved
to guest is also hindered.

With this patch the problem disappears completly.
That is removing the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling an tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct domain' structure the 'hvm_dirq_assist'
needs to run on. That is simple solved by threading the
'struct domain' through a linked list. The rule of only running
one 'hvm_dirq_assist' for only one 'struct domain' is also preserved
by having 'schedule_dpci_for' ignore any subsequent calls for
an domain which has already been scheduled.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
---
 xen/drivers/passthrough/io.c  | 168 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   2 +-
 xen/include/xen/hvm/irq.h     |   2 +-
 xen/include/xen/sched.h       |   6 ++
 xen/include/xen/softirq.h     |   3 +
 5 files changed, 174 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..d52db0b 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,15 +20,125 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in dpci life-time.
+ *   a) 0x0: Completely empty (not scheduled nor running).
+ *   b) 0x1: Scheduled but not running. Used to guard in 'schedule_dpci_for'
+ *      such that we will only schedule one. If it is scheduled and had never
+ *      run (hence never clearing STATE_SCHED bit), dpci_kill will spin
+ *      forever on said domain. However 'schedule_dpci_for' raises the
+ *      softirq  so it will run, albeit there might be a race (dpci_kill
+ *      spinning until the softirq handler runs).
+ *   c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ *      CPU. The bit 0 - scheduled is cleared at this stage allowing
+ *      'schedule_dpci_for' to succesfully schedule.
+ *   d) 0x3: scheduled and running - only possible if the running 
hvm_dirq_assist
+ *      calls 'schedule_dpci_for'. It does not do that, but it could in the
+ *      future.
+ *
+ *  The two bits play a vital role in assuring that the 'hvm_dirq_assist' is
+ *  scheduled once and runs only once for a domain. The steps are:
+ *
+ *   1) schedule_dpci_for: STATE_SCHED bit set (0x1), added on the per cpu 
list.
+ *   2) dpci_softirq picks one domain from the list. Schedules
+ *   itself later if there are more domain's on it. Tries to set STATE_RUN bit
+ *   (0x3) - if it fails adds domain back to the per-cpu list. If it succeeds
+ *   clears the STATE_SCHED bit (0x2). Once hvm_dirq_assist for an domain
+ *   complets, it unsets STATE_RUN (0x0 or 0x1 if 'schedule_dpci_for' is called
+ *   from 'hvm_dirq_assist').
+ */
+enum {
+    STATE_SCHED, /* Bit 0 */
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        get_knownalive_domain(d); /* To be put by 'dpci_softirq' */
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+void dpci_kill(struct domain *d)
+{
+    while ( test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        do {
+            process_pending_softirqs();
+        } while ( test_bit(STATE_SCHED, &d->state) );
+    }
+
+    while ( test_bit(STATE_RUN, &d->state) )
+    {
+        cpu_relax();
+    }
+    clear_bit(STATE_SCHED, &d->state);
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            hvm_dirq_assist((unsigned long)d);
+            clear_bit(STATE_RUN, &(d)->state);
+            put_domain(d);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -114,9 +224,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +567,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -631,3 +739,53 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void migrate_domains_from_cpu(unsigned int cpu, struct list_head *list)
+{
+    struct domain *d;
+
+    while ( !list_empty(list) )
+    {
+        d = list_entry(list->next, struct domain, list);
+        list_del(&d->list);
+        schedule_dpci_for(d);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        migrate_domains_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..71d0fbf 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -784,7 +784,7 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        dpci_kill(d);
 
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..5dff010 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -109,6 +108,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+void dpci_kill(struct domain *d);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..64ae25f 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+    HVM_DPCI_SOFTIRQ,
+#endif
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.