
Re: [Xen-devel] [RFC PATCH v1] Replace tasklets with per-cpu implementation.



On Wed, Sep 03, 2014 at 09:03:45AM +0100, Jan Beulich wrote:
> >>> On 02.09.14 at 22:28, <konrad.wilk@xxxxxxxxxx> wrote:
> > I typed this prototype up and asked folks with the right hardware to
> > test it. It _ought_ to, though I think that the tasklet code
> > still could use an overhaul.
> 
> Apart from needing general cleaning up, the main question I have
> is whether the dpci_kill() part couldn't be replaced by obtaining a
> proper reference on the domain when scheduling a softirq, and
> dropping that reference after having handled it.


The one fatal corner case is when we call 'hvm_dirq_assist' with
d->arch.hvm_domain.irq.dpci set to NULL. There is even an ASSERT in
'hvm_dirq_assist' for that.
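
To make the failure mode concrete, the handler looks roughly like this
(paraphrased, not verbatim; the signature matches the forward declaration
in the patch below):

    static void hvm_dirq_assist(unsigned long _d)
    {
        struct domain *d = (struct domain *)_d;

        ASSERT(d->arch.hvm_domain.irq.dpci);
        /* ... walk the dpci state and inject the guest interrupt ... */
    }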

The one edge condition I am troubled by is when we receive an
interrupt and process it - while on another CPU we get a hypercall
for 'domain_kill'.

That is:
 a) 'do_IRQ' -> .. -> 'schedule_dpci_for(d)' has been called (and so
    d->arch.hvm_domain.irq.dpci is still valid).

 b) on another CPU 'domain_kill' calls domain_relinquish_resources
    ->pci_release_devices->pci_clean_dpci_irqs, which then sets
    d->arch.hvm_domain.irq.dpci to NULL.

 c) the 'schedule_tail' (vmx_do_resume) right after a) is called,
    which means we call do_softirq. The 'dpci_softirq' is invoked,
    which calls 'hvm_dirq_assist' and blows up.

 d) the 'domain_relinquish_resources' on the other CPU continues on,
    trying to tear down the domain.


What we need is some form of barrier in 'pci_clean_dpci_irqs' that
ensures all outstanding 'dpci_softirq' work has been executed.

The refcnt does not enforce that - unless we:

 1). Add a new type of refcnt (and spin in pci_clean_dpci_irqs to make sure
     that the refcnt is at zero). But that would still require a 'dpci_kill'
     function - albeit with a different implementation (see the rough sketch
     after this list).
 2). Add a check in 'dpci_softirq' to see if the domain is going through
     destruction, ie

        if ( d->is_dying )
                /* Skip it. */

 3). Leave it as the prototype had it.
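
For illustration, here is a rough, untested sketch of what 1) could look
like, assuming a hypothetical per-domain 'dpci_refcnt' that
'schedule_dpci_for' would take and 'dpci_softirq' would drop once
'hvm_dirq_assist' has run:

    /*
     * Hypothetical dpci_kill - would be called from pci_clean_dpci_irqs
     * before d->arch.hvm_domain.irq.dpci is set to NULL.
     */
    static void dpci_kill(struct domain *d)
    {
        /* Spin until all scheduled-but-not-yet-run dpci work has drained. */
        while ( atomic_read(&d->dpci_refcnt) != 0 )
            process_pending_softirqs();
    }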

Here is a draft patch based on the 2) idea. Compile-tested, and I am going
to test it overnight.

From 8c5d3c6b9ec25eb0f921917f36a29d758816a44f Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v1)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

We inject PCI interrupts via hvm_do_IRQ_dpci, which
schedules the hvm_dirq_assist tasklet every time an
interrupt is received. The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and get invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to guests also suffers.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct domain' 'hvm_dirq_assist'
needs to run on. That is simply solved by threading the
'struct domain' through a linked list. The rule of running only
one 'hvm_dirq_assist' at a time for any one 'struct domain' is also
preserved, by having 'schedule_dpci_for' ignore any subsequent calls
for a domain which has already been scheduled.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
 xen/drivers/passthrough/io.c  | 137 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   2 -
 xen/include/xen/hvm/irq.h     |   1 -
 xen/include/xen/sched.h       |   6 ++
 xen/include/xen/softirq.h     |   1 +
 5 files changed, 139 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..d87f7dc 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,15 +20,93 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        get_domain(d); /* To be put by 'dpci_softirq' */
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            /*
+             * Quite nasty corner case: do_IRQ is called which schedules the
+             * dpci. Right as it is ready to call 'do_softirq', on another CPU
+             * we call 'domain_kill' which calls 'pci_clean_dpci_irqs'. The
+             * d->arch.hvm_domain.irq.dpci is NULL, and hvm_dirq_assist
+             * hits the ASSERT. If we detect that we are in the shutdown
+             * situation we won't process anymore interrupts.
+             */
+            if ( d->is_dying )
+                goto skip;
+            hvm_dirq_assist((unsigned long)d);
+ skip:
+            clear_bit(STATE_RUN, &(d)->state);
+            put_domain(d);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -114,9 +192,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +535,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -631,3 +707,54 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
+{
+    struct domain *d;
+
+    while ( !list_empty(list) )
+    {
+        d = list_entry(list->next, struct domain, list);
+        list_del(&d->list);
+        schedule_dpci_for(d);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+    .priority = 99
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..59c4386 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -784,8 +784,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..a08dbeb 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..bdc607c 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,7 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+    HVM_DPCI_SOFTIRQ,
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3


