[Xen-devel] [RFC PATCH v1] Replace tasklets with per-cpu implementation.


With the Xen 4.5 feature freeze right on the doorstep, I am not
expecting this to go in, as:
 1) It touches core code,
 2) It has never been tested on ARM,
 3) It is RFC for right now.

With those expectations out of the way, I am submitting for review
an overhaul of the tasklet code. We found on large machines
with a small number of guests (12) that the steal time for idle
guests was excessively high. Further debugging revealed that the
global tasklet lock was taken across all sockets at an excessively
high rate, to the point that 1/10th of a guest's idle time was
taken (and actually accounted for as being in the RUNNING state!).

The ideal situation to reproduce this behavior is:
 1) Allocate twelve guests with one to four SR-IOV VFs.
 2) Have half of them (six) heavily use the SR-IOV VFs.
 3) Monitor the rest (which are idle) and despair.

As I discovered under the hood, we have two tasklets that are
scheduled and executed quite often - the VIRQ_TIMER one:
assert_evtchn_irq_tasklet, and the one in charge of injecting
a PCI interrupt in the guest: hvm_do_IRQ_dpci.

The 'hvm_do_IRQ_dpci' is the one that is most often scheduled
and run. The performance bottleneck comes from the fact that
we take the same spinlock three times: in tasklet_schedule,
when we are about to execute the tasklet, and when we are
done executing the tasklet.
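
To illustrate the three acquisitions, here is a rough user-space
model. It is not the hypervisor code: the lock is a hypothetical
counter (counted_lock) rather than a real spinlock, and every name
prefixed with model_ is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the global tasklet spinlock: it only counts
 * acquisitions and does not provide real mutual exclusion. */
struct counted_lock {
    unsigned long acquisitions;
    int held;
};

static void model_lock(struct counted_lock *l)
{
    assert(!l->held);
    l->held = 1;
    l->acquisitions++;
}

static void model_unlock(struct counted_lock *l)
{
    assert(l->held);
    l->held = 0;
}

struct model_tasklet {
    struct model_tasklet *next;
    void (*func)(unsigned long);
    unsigned long data;
    int is_running;
};

static struct counted_lock global_lock;
static struct model_tasklet *global_head;  /* one list shared by all CPUs */

/* Acquisition 1: queue the tasklet at the end of the global list. */
static void model_tasklet_schedule(struct model_tasklet *t)
{
    struct model_tasklet **pp;

    model_lock(&global_lock);
    t->next = NULL;
    for (pp = &global_head; *pp; pp = &(*pp)->next)
        ;
    *pp = t;
    model_unlock(&global_lock);
}

/* Returns 1 if a tasklet was run, 0 if the list was empty. */
static int model_do_tasklet_work(void)
{
    struct model_tasklet *t;

    /* Acquisition 2: dequeue the tasklet and mark it as running. */
    model_lock(&global_lock);
    t = global_head;
    if (t) {
        global_head = t->next;
        t->is_running = 1;
    }
    model_unlock(&global_lock);
    if (!t)
        return 0;

    t->func(t->data);  /* the handler runs with the lock dropped */

    /* Acquisition 3: clear the running state once the handler is done. */
    model_lock(&global_lock);
    t->is_running = 0;
    model_unlock(&global_lock);
    return 1;
}

/* Example handler, present only so the model can be exercised. */
static unsigned long handler_calls;
static void count_call(unsigned long data) { handler_calls += data; }
```

Scheduling one tasklet and running it once leaves
global_lock.acquisitions at 3 in this model. With every CPU going
through the same lock, that count is multiplied by the interrupt rate
of the busy guests.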

This patchset throws away the global list and lock for all
tasklets. Instead there are two per-cpu lists: one for
softirq, and one run when the scheduler decides to. There is also
a global list and lock for cross-CPU tasklet scheduling
- which thankfully rarely happens (microcode update and
hypercall continuation).
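
The list layout described above can be sketched roughly as follows.
This is a user-space model under stated assumptions, not the code in
the patches: NR_CPUS is fixed at 4 for illustration, the names
(softirq_list, sched_list, cross_cpu_list, model_tasklet_schedule)
are hypothetical, and the cross-CPU lock is represented only by a
counter.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4  /* illustrative value only */

struct model_tasklet {
    struct model_tasklet *next;
    unsigned int target_cpu;  /* CPU the tasklet should run on */
    int is_softirq;           /* 1: softirq tasklet, 0: scheduler tasklet */
};

/* Simple FIFO with a tail pointer so appends are O(1). */
struct model_queue {
    struct model_tasklet *head;
    struct model_tasklet **tail;
};

static void queue_init(struct model_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

static void queue_add(struct model_queue *q, struct model_tasklet *t)
{
    t->next = NULL;
    *q->tail = t;
    q->tail = &t->next;
}

static struct model_queue softirq_list[NR_CPUS];  /* per-cpu, no lock */
static struct model_queue sched_list[NR_CPUS];    /* per-cpu, no lock */
static struct model_queue cross_cpu_list;         /* global, needs a lock */
static unsigned long cross_cpu_lock_taken;        /* counts lock use */

static void model_init(void)
{
    unsigned int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        queue_init(&softirq_list[cpu]);
        queue_init(&sched_list[cpu]);
    }
    queue_init(&cross_cpu_list);
    cross_cpu_lock_taken = 0;
}

enum model_dest { DEST_SOFTIRQ, DEST_SCHED, DEST_CROSS_CPU };

/* Route a tasklet scheduled from this_cpu to one of the three lists. */
static enum model_dest model_tasklet_schedule(struct model_tasklet *t,
                                              unsigned int this_cpu)
{
    if (t->target_cpu != this_cpu) {
        /* Rare path: a real implementation would take a spinlock here. */
        cross_cpu_lock_taken++;
        queue_add(&cross_cpu_list, t);
        return DEST_CROSS_CPU;
    }
    if (t->is_softirq) {
        queue_add(&softirq_list[this_cpu], t);
        return DEST_SOFTIRQ;
    }
    queue_add(&sched_list[this_cpu], t);
    return DEST_SCHED;
}
```

In this model the common case (a tasklet scheduled on the CPU it
will run on) never touches the shared lock. Only the cross-CPU
path does.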

Insertion into and removal from the lists are done with
interrupts disabled - but only for short bursts of time. The behavior
of the code to only execute one tasklet per iteration is
also preserved (the Linux code would run through all
of its tasklets).
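
A minimal user-space sketch of those two properties follows. It is
a model only: the interrupt flag is a plain variable
(irqs_enabled), a single list stands in for one CPU's per-cpu list,
and all names are hypothetical rather than taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the local CPU's interrupt-enable flag. */
static int irqs_enabled = 1;
static void model_irq_disable(void) { irqs_enabled = 0; }
static void model_irq_enable(void)  { irqs_enabled = 1; }

struct model_tasklet {
    struct model_tasklet *next;
    void (*func)(unsigned long);
    unsigned long data;
};

/* Stand-in for one CPU's list; the tail pointer keeps appends O(1). */
static struct model_tasklet *percpu_head;
static struct model_tasklet **percpu_tail = &percpu_head;

/* Insertion: the only protection is a short interrupts-off window. */
static void model_tasklet_schedule(struct model_tasklet *t)
{
    model_irq_disable();
    t->next = NULL;
    *percpu_tail = t;
    percpu_tail = &t->next;
    model_irq_enable();
}

/*
 * Run at most ONE tasklet per call. Returns 1 if tasklets were still
 * queued at dequeue time (the caller would re-raise the softirq),
 * 0 otherwise.
 */
static int model_tasklet_action(void)
{
    struct model_tasklet *t;
    int more;

    /* Removal: again only a short interrupts-off window. */
    model_irq_disable();
    t = percpu_head;
    if (t) {
        percpu_head = t->next;
        if (!percpu_head)
            percpu_tail = &percpu_head;
    }
    more = (percpu_head != NULL);
    model_irq_enable();

    if (t)
        t->func(t->data);  /* handler runs outside the irq-off window */
    return more;
}

/* Example handler: records which tasklets have run as a bitmask. */
static unsigned long ran_mask;
static void record_run(unsigned long bit) { ran_mask |= 1UL << bit; }
```

With two tasklets queued, the first call to model_tasklet_action()
runs only the first one and reports that more work is pending. A
Linux-style loop would instead drain the whole list in one pass.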

The performance benefits of this patch were astounding and
removed the issues we saw. It also decreased the IRQ
latency of delivering an interrupt to a guest.

In terms of the patchset I chose a path in which:
 0) The first patch fixes the performance bug we saw and is
    easy to backport.
 1) It is bisectable.
 2) If something breaks it should be fairly easy to figure
    out which patch broke it.
 3) It is split up in a slightly odd fashion: scaffolding code
    is added to keep it ticking (as at some point we have
    the old and the new implementation both existing and in use)
    and is then removed later on. This is how Greg KH added
    kref and kobjects a long time ago in the kernel and it
    worked - so I figured I would borrow from this workflow.

I would appreciate feedback from the maintainers if they
would like this to be organized better.

 xen/common/tasklet.c      | 305 +++++++++++++++++++++++++++++++++-------------
 xen/include/xen/tasklet.h |  52 +++++++-
 2 files changed, 271 insertions(+), 86 deletions(-)

Konrad Rzeszutek Wilk (5):
      tasklet: Introduce per-cpu tasklet for softirq.
      tasklet: Add cross CPU feeding of per-cpu tasklets.
      tasklet: Remove the old-softirq implementation.
      tasklet: Introduce per-cpu tasklet for schedule tasklet.
      tasklet: Remove the scaffolding.

