
[patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup



Hi!

This is a complete rework of the parallel bringup patch series (V17)

    
https://lore.kernel.org/lkml/20230328195758.1049469-1-usama.arif@xxxxxxxxxxxxx

to address the issues which were discovered in review:

 1) The X86 microcode loader serialization requirement

    https://lore.kernel.org/lkml/87v8iirxun.ffs@tglx

    Microcode loading on HT enabled X86 CPUs requires that the microcode is
    loaded on the primary thread. The sibling thread(s) must be in
    quiescent state; either looping in a place which is aware of potential
    changes by the microcode update (see late loading) or in fully quiescent
    state, i.e. waiting for INIT/SIPI.

    This is required by hardware/firmware on Intel. Aside from that it's a
    vendor independent software correctness issue. Assume the following
    sequence:

    CPU1.0                    CPU1.1
                              CPUID($A)
    Load microcode.
    Changes CPUID($A, $B)
                              CPUID($B)

    CPU1.1 makes a decision on $A and $B which might be inconsistent due
    to the microcode update.

    The solution for this is to bring up the primary threads first and after
    that the siblings. Loading microcode on the siblings is a NOOP on Intel
    and on AMD it is guaranteed to only modify thread local state.

    This ensures that the APs can load microcode before reaching the alive
    synchronization point w/o doing any further x86 specific
    synchronization between the core siblings.

 2) The general design issues discussed in V16

    https://lore.kernel.org/lkml/87pm8y6yme.ffs@tglx

    The previous parallel bringup patches just glued this mechanism into
    the existing code without a deeper analysis of the synchronization
    mechanisms and without generalizing it so that the control logic is
    mostly in the core code and not made an architecture specific tinker
    space.

    Much of that had been pointed out 2 years ago in the discussions about
    the early versions of parallel bringup already.
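
The CPUID inconsistency described in issue 1) can be replayed with a toy
model. This is only an illustrative sketch: sibling_view(), the event
tuples and the "old"/"new" values are made up for the example and do not
correspond to any kernel interface.

```python
def sibling_view(events):
    """Replay an interleaving and return what the sibling thread observed.

    Hypothetical model: both CPUID leaves change atomically when the
    primary thread loads microcode.
    """
    cpuid = {"A": "old", "B": "old"}
    seen = {}
    for actor, op in events:
        if actor == "primary" and op == "load_microcode":
            cpuid = {"A": "new", "B": "new"}
        elif actor == "sibling":
            seen[op] = cpuid[op]  # sibling executes CPUID(op)
    return seen


# Sibling runs concurrently with the microcode load on the primary thread.
racy = [("sibling", "A"), ("primary", "load_microcode"), ("sibling", "B")]

# Primary thread is brought up (and updated) before the sibling starts.
primary_first = [("primary", "load_microcode"), ("sibling", "A"), ("sibling", "B")]

print(sibling_view(racy))
print(sibling_view(primary_first))
```

In the racy interleaving the sibling combines a pre-update value with a
post-update value, which is the inconsistency the primary-first ordering
avoids.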


The series is based on:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip x86/apic

and also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hotplug


Background
----------

The reason why people are interested in parallel bringup is to shorten
the (kexec) reboot time of cloud servers to reduce the downtime of the
VM tenants. There are obviously other interesting use cases for this
like VM startup time, embedded devices...

The current fully serialized bringup does the following per AP:

    1) Prepare callbacks (allocate, initialize, create threads)
    2) Kick the AP alive (e.g. INIT/SIPI on x86)
    3) Wait for the AP to report alive state
    4) Let the AP continue through the atomic bringup
    5) Let the AP run the threaded bringup to full online state

There are two significant delays:

    #3 The time for an AP to report alive state in start_secondary() on x86
       has been measured in the range between 350us and 3.5ms depending on
       vendor and CPU type, BIOS microcode size etc.

    #4 The atomic bringup does the microcode update. This has been measured
       to take up to ~8ms on the primary threads depending on the microcode
       patch size to apply.

On a two socket SKL server with 56 cores (112 threads) the boot CPU spends
on current mainline about 800ms busy waiting for the APs to come up and
apply microcode. That's more than 80% of the actual onlining procedure.

By splitting the actual bringup mechanism into two parts this can be
reduced to waiting for the first AP to report alive. If the system is
large enough, the first AP is already waiting when the boot CPU has
finished the wake-up of the last AP.
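
The effect of the split can be illustrated with a toy model of the boot
CPU's busy-wait time. This is a minimal sketch under stated assumptions:
the 3500us alive latency is the upper end of the range quoted above, while
the kick and release costs are made-up placeholders, not measurements, and
both function names are hypothetical.

```python
def serialized_wait(n_aps, alive_us):
    """Boot CPU busy-waits for the alive report of every AP in turn."""
    return n_aps * alive_us


def split_wait(n_aps, kick_us, alive_us, release_us):
    """Kick all APs first, then collect the alive reports one by one."""
    now = n_aps * kick_us  # point in time when the last AP has been kicked
    waited = 0
    for i in range(n_aps):
        ready_at = (i + 1) * kick_us + alive_us  # AP i reports alive
        if ready_at > now:
            waited += ready_at - now  # boot CPU has to busy-wait
            now = ready_at
        now += release_us  # release AP i for the rest of the bringup
    return waited


# Placeholder costs: 10us per kick, 50us to release one AP.
print(serialized_wait(111, 3500))
print(split_wait(111, 10, 3500, 50))
print(split_wait(1000, 10, 3500, 50))
```

With these placeholder numbers the boot CPU only waits for the first AP,
and with enough APs it does not wait at all because the first AP is alive
before the last one has been kicked.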


The actual solution comes in several parts
------------------------------------------

 1) [P 1-2] General cleanups (init annotations, kernel doc...)

 2) [P 3] The obvious

    Avoid pointless delay calibration when TSC is synchronized across
    sockets. That removes a whopping 100ms delay for the first CPU of a
    socket. This is an improvement independent of parallel bringup and had
    been discussed two years ago already.

 2) [P 3-6] Removal of the CPU0 hotplug hack.

    This was added 11 years ago with the promise to make this a real
    hardware mechanism, but that never materialized. As physical CPU
    hotplug is not really supported and the physical unplugging of CPU0
    never materialized there is no reason to keep this cruft around. It's
    just maintenance ballast for no value and the removal makes
    implementing the parallel bringup feature way simpler.

 3) [P 7-16] Cleanup of the existing bringup mechanism:

     a) Code reorganisation so that the general hotplug specific code is
        in smpboot.c and not sprinkled all over the place

     b) Decouple MTRR/PAT initialization from smp_callout_mask to prepare
        for replacing that mask with a hotplug core code synchronization
        mechanism.

     c) Make TSC synchronization function call based so that the control CPU
        does not have to busy wait for nothing if synchronization is not
        required.

     d) Remove the smp_callin_mask synchronization point as it's no longer
        required due to #3c.

     e) Rework the sparse_irq_lock held region in the core code so that the
        next polling synchronization point in the x86 code can be removed too.

     f) Due to #3e it's no longer required to spin wait for the AP to set
        its online bit.  Remove wait_cpu_online() and the XENPV
        counterpart. So the control CPU can directly wait for the online
        idle completion by the AP and free the control CPU up for other
        work.

     This reduces the synchronization points in the x86 code to one, which
     is the AP alive one. This synchronization will be moved to core
     infrastructure in the next section.

 4) [P 17-27] Replace the disconnected CPU state tracking

    The extra CPU state tracking which is used by a few architectures is
    completely separate from the CPU hotplug core code.

    Replacing it by a variant integrated in the core hotplug machinery
    reduces architecture specific code and provides a generic
    synchronization mechanism for (parallel) CPU bringup/teardown.

    - Convert x86 over and replace the AP alive synchronization on x86 with
      the core variant which removes the remaining x86 hotplug
      synchronization masks.

    - Convert the other architectures usage and remove the old interface
      and code.

 5) [P 28-30] Split the bringup into two steps

    First step invokes the wakeup function on the BP, e.g. SIPI/STARTUP on
    x86. The second one waits on the BP for the AP to report alive and
    releases it for the complete onlining.

    As the hotplug state machine allows partial bringup this allows later
    to kick all APs alive in a first iteration and then bring them up
    completely one by one afterwards.

 6) [P 31] Switch the primary thread detection to a cpumask

    This makes the parallel bringup a simple cpumask based mechanism
    without tons of conditionals and checks for primary threads.

 7) [P 32] Implement the parallel bringup core code

    The parallel bringup looks like this:
    
      1) Bring up the primary SMT threads to the CPUHP_KICK_AP_ALIVE step
         one by one

      2) Bring up the primary SMT threads to the CPUHP_ONLINE step one by
         one

      3) Bring up the secondary SMT threads to the CPUHP_KICK_AP_ALIVE
         step one by one

      4) Bring up the secondary SMT threads to the CPUHP_ONLINE
         step one by one

    In case that SMT is not supported this is obviously reduced to step #1
    and #2.

 8) [P 33-37] Prepare X86 for parallel bringup and enable it
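
The four-phase order described in item 7) can be sketched as follows. This
is a simplified model, not the core code: the state names mirror
CPUHP_KICK_AP_ALIVE and CPUHP_ONLINE from the series, but the enum values,
the set-based "mask" and parallel_bringup_order() itself are assumptions
made for illustration.

```python
from enum import IntEnum


class State(IntEnum):
    # Numeric values are illustrative only.
    KICK_AP_ALIVE = 1
    ONLINE = 2


def parallel_bringup_order(present_aps, primary_mask):
    """Return the (cpu, target state) steps in the order they are issued."""
    primaries = [c for c in present_aps if c in primary_mask]
    secondaries = [c for c in present_aps if c not in primary_mask]
    steps = []
    for group in (primaries, secondaries):
        for target in (State.KICK_AP_ALIVE, State.ONLINE):
            # Each phase walks its CPUs one by one.
            steps.extend((cpu, target) for cpu in group)
    return steps


# Hypothetical topology: CPUs 2 and 4 are primary threads, 1/3/5 siblings.
for cpu, target in parallel_bringup_order([1, 2, 3, 4, 5], {2, 4}):
    print(f"CPU{cpu} -> {target.name}")
```

If every AP is in the primary mask, the secondary group is empty and only
the first two phases remain, matching the non-SMT case.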


Caveats
-------

The non X86 changes have all been compile tested. Boot and runtime
testing has only been done on a few real hardware platforms and qemu as
available. That definitely needs some help from the people who have
these systems at their fingertips.


Results and analysis
--------------------

Here are numbers for a dual socket SKL 56 cores / 112 threads machine.  All
numbers in milliseconds. The time measured is the time which the cpu_up()
call takes for each CPU and phase. It's not exact as the system is already
scheduling, handling interrupts and soft interrupts, which is obviously
skewing the picture slightly.

Baseline tip tree x86/apic branch.

                total      avg/CPU          min          max
total  :      912.081        8.217        3.720      113.271

The max of 100ms is due to the silly delay calibration for the second
socket which takes 100ms and was eliminated first. Also the other initial
cleanups and improvements take some time away.

So the real baseline becomes:

                total      avg/CPU          min          max
total  :      785.960        7.081        3.752       36.098

The max here is on the first CPU of the second socket. 20ms of that is due
to TSC synchronization and an extra 2ms to react on the SIPI.

With parallel bootup enabled this becomes:

                total      avg/CPU          min          max
prepare:       39.108        0.352        0.238        0.883
online :       45.166        0.407        0.170       20.357
total  :       84.274        0.759        0.408       21.240

That's a factor ~9.3 reduction on average.

Looking at the 27 primary threads of socket 0 then this becomes even more
interesting:

                total      avg/CPU          min          max
total  :      325.764       12.065       11.981       14.125

versus:
                total      avg/CPU          min          max
prepare:        8.945        0.331        0.238        0.834
online :        4.830        0.179        0.170        0.212
total  :       13.775        0.510        0.408        1.046

So the reduction factor is ~23.6 here. That's mostly because the 20ms TSC
sync is not skewing the picture.

For all 55 primaries, i.e. with the 20ms TSC sync extra for socket 1 this
becomes:

                total      avg/CPU          min          max
total  :      685.489       12.463       11.975       36.098

versus:

                total      avg/CPU          min          max
prepare:       19.080        0.353        0.238        0.883
online :       30.283        0.561        0.170       20.357
total  :       49.363        0.914        0.408       21.240

The TSC sync reduces the win to a factor of ~13.9.

With 'tsc=reliable' on the command line the socket sync is disabled which
brings it back to the socket 0 numbers:

                total      avg/CPU          min          max
prepare:       18.970        0.351        0.231        0.874
online :       10.328        0.191        0.169        0.358
total  :       29.298        0.543        0.400        1.232

Now looking at the secondary threads only:

                total      avg/CPU          min          max
total  :      100.471        1.794        0.375        4.745

versus:
                total      avg/CPU          min          max
prepare:       19.753        0.353        0.257        0.512
online :       14.671        0.262        0.179        3.461
total  :       34.424        0.615        0.436        3.973

Still a factor of ~3.
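
The reduction factors are the ratio of the serialized total to the
parallel total from the tables above. A small sketch to recompute them;
the dictionary labels and the helper name are only for illustration:

```python
def reduction_factor(serialized_ms, parallel_ms):
    """Ratio of serialized bringup time to parallel bringup time."""
    return serialized_ms / parallel_ms


# (serialized total, parallel total) in milliseconds, taken from the tables.
totals_ms = {
    "all APs": (785.960, 84.274),
    "primaries, socket 0": (325.764, 13.775),
    "primaries, both sockets": (685.489, 49.363),
    "secondary threads": (100.471, 34.424),
}

for label, (serialized, parallel) in totals_ms.items():
    print(f"{label:24s} {reduction_factor(serialized, parallel):5.1f}x")
```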

The average on the secondaries for the serialized bringup is significantly
lower than for the primaries because the SIPI response time is shorter and
the microcode update takes no time.

This varies wildly with the system, whether microcode in BIOS is already up
to date, how big the microcode patch is and how long the INIT/SIPI response
time is. On an AMD Zen3 machine INIT/SIPI response time is amazingly fast
(350us), but then it lacks TSC_ADJUST and does a two millisecond TSC sync
test for _every_ AP. All of this sucks...


Possible further enhancements
-----------------------------

It's definitely worthwhile to look into reducing the cross socket TSC sync
test time. It's probably safe enough to use 5ms or even 2ms instead of 20ms
on systems with TSC_ADJUST and a few other 'TSC is sane' indicators. Moving
it out of the hotplug path is eventually possible, but that needs some deep
thoughts.

Let's take the TSC sync out of the picture by adding 'tsc=reliable' to the
kernel command line. So the bringup of 111 APs takes:

                total      avg/CPU          min          max
prepare:       38.936        0.351        0.231        0.874
online :       25.231        0.227        0.169        3.465
total  :       64.167        0.578        0.400        4.339

Some of the outliers are not necessarily in the state callbacks as the
system is already scheduling and handles interrupts and soft
interrupts. Haven't analyzed that yet in detail.

In the prepare stage which runs on the control CPU the larger steps are:

  smpcfd:prepare           16us  avg/CPU
  threads:prepare          98us  avg/CPU
  workqueue:prepare        43us  avg/CPU
  trace/RB:prepare        135us  avg/CPU

The trace ringbuffer initialization allocates 354 pages and 354 control
structures one by one. That probably should allocate a large page and an
array of control structures and work from there. I'm sure that would reduce
this significantly. Steven?

smpcfd does just a percpu allocation. No idea why that takes that long.

Vs. threads and workqueues. David thought about spreading out the
preparation work and doing it really in parallel. That's a nice idea, but the
threads and workqueue prepare steps are self serializing. The workqueue one
has a global mutex and aside from that both steps create kernel threads which
implicitly serialize on kthreadd. alloc_percpu(), which is used by
smpcfd:prepare, is also globally serialized.

The rest of the prepare steps is pretty much in the single digit
microseconds range.

On the AP side it should be possible to move some of the initialization
steps before the alive synchronization point, but that really needs a lot
of analysis whether the functions are safe to invoke that early and outside
of the cpu_hotplug_lock held region for the case of two stage parallel
bringup; see below.

The largest part is:

    identify_secondary_cpu()    99us avg/CPU
   
    Inside of identify_secondary_cpu() the largest offender:

      mcheck_init()             73us avg/CPU

    This part is definitely worth looking at to see whether it can be at
    least partially moved to the early startup code before the alive
    synchronization point. There's a lot of deep analysis required and
    ideally we just rewrite the whole CPUID evaluation trainwreck
    completely.

The rest of the AP side is low single digit microseconds except for:

    perf/x86:starting           14us avg/CPU

    smpboot/threads:online      13us avg/CPU
    workqueue:online            17us avg/CPU
    mm/vmstat:online            17us avg/CPU
    sched:active                30us avg/CPU

sched:active is special. Onlining the first secondary HT thread on the
second socket creates a 3.2ms outlier which skews the whole picture. That's
caused by enabling the static key sched_smt_present which patches the world
and some more. For all other APs this is really in the 1us range. This
definitely could be postponed during bootup like the scheduler domain
rebuild is done after the bringup. But that's still fully serialized and
single threaded and obviously could be done later in the context of async
parallel init. It's unclear why this is different with the fully serialized
bringup where it takes significantly less time, but that's something which
needs to be investigated.


Is truly parallel bringup feasible?
-----------------------------------

In theory yes, realistically no. Why?

   1) The preparation phase

      Allocating memory, creating threads for the to be brought up CPU must
      obviously happen on an already online CPU.

      While it would be possible to bring up a subset of CPUs first and let
      them do the preparation steps for groups of still offline CPUs
      concurrently, the actual benefit of doing so is dubious.

      The prime example is kernel thread creation, which is implicitly
      serialized on kthreadd.

      A simple experiment shows that 4 concurrent workers on 4 different
      CPUs where each is creating 14 * 5 = 70 kernel threads are 5% slower
      than a single worker creating 4 * 14 * 5 = 280 threads.

      So we'd need to have multiple kthreadd instances to handle that,
      which would then serialize on tasklist lock and other things.

      That aside the preparation phase is also affected by the problem
      below.

   2) Assumptions about hotplug serialization

      a) There are quite some assumptions about CPU bringup being fully
         serialized across state transitions.  A lot of state callbacks rely
         on that and would require local locking.

         Adding that local locking is surely possible, but that has several
         downsides:

          - It adds complexity and makes it harder for developers to get
            this correct. The subtle bugs resulting out of that are going
            to be interesting

          - Fine grained locking has a charm, but only if the time spent
            for the actual work is larger than the time required for
            serialization and synchronization.

            Serializing a callback which takes less than a microsecond and
            then having a large number of CPUs contending on the lock will
            not make it any faster at all. That's a well known issue of
            parallelizing and neither made up nor kernel specific.

      b) Some operations definitely require to be protected by the
         cpu_hotplug_lock, especially those which affect cpumasks as the
         masks are guaranteed to be stable in a cpus_read_lock()'ed region.

         As this lock cannot be taken in atomic contexts, it's required
         that the control CPU holds the lock write locked across these
         state transitions. And no, we are not making this a spinlock just
         for that and we even can't.

         Just slapping a lock into the x86 specific part of the cpumask
         update function does not solve anything. The relevant patch in V17
         is completely useless as it only serializes the actual cpumask/map
         modifications, but all read side users are hosed if the update
         would be moved before the alive synchronization point, i.e. into a
         non hotplug lock protected region.

         Even if the hotplug lock would be held across the whole parallel
         bringup operation then this would still expose all usage of these
         masks and maps in the actual hotplug state callbacks to concurrent
         modifications.

         And no, we are not going to expose an architecture specific raw
         spinlock to the hotplug state callbacks, especially not to those
         in generic code.

      c) Some cpus_read_lock()'ed regions also expect that there is no CPU
         state transition happening which would modify their local
         state. This would again require local serialization.

   3) The amount of work and churn:

      - Analyze the per architecture low level startup functions plus their
        descendant functions and make them ready for concurrency if
        necessary.

      - Analyze ~300 hotplug state callbacks and their descendant functions
        and make them ready for concurrency if necessary.

      - Analyze all cpus_read_lock()'ed regions and address their
        requirements.

      - Rewrite the core code to handle the cpu_hotplug_lock requirements
        only in distinct phases of the state machine.

      - Rewrite the core code to handle state callback failure and the
        related rollback in the context of the new rules.

      - ...

   Even if some people are dedicated enough to do that, it's very
   questionable whether the resulting complexity is justified.

   We've spent a serious amount of time to sanitize hotplug and bring it
   into a state where it is correct. This also made it reasonably simple
   for developers to implement hotplug state callbacks without having to
   become hotplug experts.

   Breaking this completely up will result in a flood of hard to diagnose
   subtle issues for sure. Who is going to deal with them?

   The experience with this series so far does not make me comfortable
   about that thought in any way.


Summary
-------

The obvious and low hanging fruits have to be solved first:

  - The CPUID evaluation and related setup mechanisms

  - The trace/ringbuffer oddity

  - The sched:active oddity for the first sibling on the second socket
  
  - Some other expensive things which I'm not seeing in my test setup due
    to lack of hardware or configuration.

Anything else is pretty much wishful thinking in my opinion.

  To be clear: I'm not standing in the way if there is a proper solution,
  but that requires respecting the basic engineering rules:

    1) Correctness first
    2) Keep it maintainable
    3) Keep it simple

  So far this stuff failed already at #1.

I completely understand why this is important for cloud people, but
the real question to ask here is what are the actual requirements.

  As far as I understand the main goal is to make a (kexec) reboot
  almost invisible to VM tenants.

  Now let's look at how this works:

     A) Freeze VMs and persist state
     B) kexec into the new kernel
     C) Restore VMs from persistent memory
     D) Thaw VMs

  So the key problem is how long it takes to get from #B to #C and finally
  to #D.

  As far as I understand #C takes a serious amount of time and cannot be
  parallelized for whatever reasons.

  At the same time the number of online CPUs required to restore the VMs
  state is less than the number of online CPUs required to actually
  operate them in #D.

  That means it would be good enough to return to userspace with a
  limited number of online CPUs as fast as possible. A certain number of
  CPUs are going to be busy with restoring the VMs state, i.e. one CPU
  per VM. Some remaining non-busy CPU can bring up the rest of the system
  and the APs in order to be functional for #D, i.e. the restore of VM
  operation.

  Trying to optimize this purely in kernel space by adding complexity of
  dubious value is simply bogus in my opinion.

  It's already possible today to limit the number of CPUs which are
  initially onlined and online the rest later from user space.

  There are two issues there:

    a) The death by MCE broadcast problem

       Quite some (contemporary) x86 CPU generations are affected by
       this:

         - MCE can be broadcast to all CPUs and not only issued locally
           to the CPU which triggered it.

         - Any CPU which has CR4.MCE == 0, even if it sits in a wait
           for INIT/SIPI state, will cause an immediate shutdown of the
           machine if a broadcast MCE is delivered.

    b) Do the parallel bringup via sysfs control knob

       The per CPU target state interface allows doing that today one
       by one, but it's awkward and has quite some overhead.

       A knob to online the rest of the not yet onlined present CPUs
       with the benefit of the parallel bringup mechanism is
       missing.

    #a) That's a risk to take by the operator.

        Even the regular serialized bringup does not protect against this
        issue up to the point where all present CPUs have at least
        initialized CR4.

        Limiting the number of APs to online early via the kernel command
        line widens that window and increases the risk further by
        executing user space before all APs have CR4 initialized.

        But the same applies to a deferred online mechanism implemented in
        the kernel where some worker brings up the not yet online APs while
        the early online CPUs are already executing user space code.

    #b) Is a no brainer to implement on top of this.
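
Onlining the remaining CPUs from user space, one by one, can be sketched
with the existing per CPU sysfs 'online' files. This is a minimal sketch
and not the missing parallel knob from #b: online_remaining_cpus() is a
hypothetical helper, it needs root on a real system, and the sysfs root is
a parameter only so that the logic can be exercised against a scratch
directory.

```python
from pathlib import Path


def online_remaining_cpus(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Online every CPU which is still offline, serialized, one by one."""
    root = Path(sysfs_cpu_root)
    cpu_dirs = sorted(root.glob("cpu[0-9]*"), key=lambda p: int(p.name[3:]))
    onlined = []
    for cpu_dir in cpu_dirs:
        knob = cpu_dir / "online"
        if not knob.exists():
            continue  # CPU without an online control file, e.g. the boot CPU
        if knob.read_text().strip() == "0":
            knob.write_text("1")  # on real sysfs this triggers the bringup
            onlined.append(cpu_dir.name)
    return onlined
```

Such a helper would typically run after booting with a reduced number of
CPUs, e.g. via the maxcpus= kernel command line parameter, once the early
user space work has been started.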


Conclusion
----------

Adding the basic parallel bringup mechanism as provided by this series
makes a lot of sense. Improving particular issues as pointed out in the
analysis makes sense too.

But trying to solve an application specific problem fully in the kernel
with tons of complexity, without exploring straightforward and simple
approaches first, does not make any sense at all.

Thanks,

        tglx

---
 Documentation/admin-guide/kernel-parameters.txt |   20 
 Documentation/core-api/cpu_hotplug.rst          |   13 
 arch/Kconfig                                    |   23 +
 arch/arm/Kconfig                                |    1 
 arch/arm/include/asm/smp.h                      |    2 
 arch/arm/kernel/smp.c                           |   18 
 arch/arm64/Kconfig                              |    1 
 arch/arm64/include/asm/smp.h                    |    2 
 arch/arm64/kernel/smp.c                         |   14 
 arch/csky/Kconfig                               |    1 
 arch/csky/include/asm/smp.h                     |    2 
 arch/csky/kernel/smp.c                          |    8 
 arch/mips/Kconfig                               |    1 
 arch/mips/cavium-octeon/smp.c                   |    1 
 arch/mips/include/asm/smp-ops.h                 |    1 
 arch/mips/kernel/smp-bmips.c                    |    1 
 arch/mips/kernel/smp-cps.c                      |   14 
 arch/mips/kernel/smp.c                          |    8 
 arch/mips/loongson64/smp.c                      |    1 
 arch/parisc/Kconfig                             |    1 
 arch/parisc/kernel/process.c                    |    4 
 arch/parisc/kernel/smp.c                        |    7 
 arch/riscv/Kconfig                              |    1 
 arch/riscv/include/asm/smp.h                    |    2 
 arch/riscv/kernel/cpu-hotplug.c                 |   14 
 arch/x86/Kconfig                                |   45 --
 arch/x86/include/asm/apic.h                     |    5 
 arch/x86/include/asm/cpu.h                      |    5 
 arch/x86/include/asm/cpumask.h                  |    5 
 arch/x86/include/asm/processor.h                |    1 
 arch/x86/include/asm/realmode.h                 |    3 
 arch/x86/include/asm/sev-common.h               |    3 
 arch/x86/include/asm/smp.h                      |   26 -
 arch/x86/include/asm/topology.h                 |   23 -
 arch/x86/include/asm/tsc.h                      |    2 
 arch/x86/kernel/acpi/sleep.c                    |    9 
 arch/x86/kernel/apic/apic.c                     |   22 -
 arch/x86/kernel/callthunks.c                    |    4 
 arch/x86/kernel/cpu/amd.c                       |    2 
 arch/x86/kernel/cpu/cacheinfo.c                 |   21 
 arch/x86/kernel/cpu/common.c                    |   50 --
 arch/x86/kernel/cpu/topology.c                  |    3 
 arch/x86/kernel/head_32.S                       |   14 
 arch/x86/kernel/head_64.S                       |  121 +++++
 arch/x86/kernel/sev.c                           |    2 
 arch/x86/kernel/smp.c                           |    3 
 arch/x86/kernel/smpboot.c                       |  508 ++++++++----------------
 arch/x86/kernel/topology.c                      |   98 ----
 arch/x86/kernel/tsc.c                           |   20 
 arch/x86/kernel/tsc_sync.c                      |   36 -
 arch/x86/power/cpu.c                            |   37 -
 arch/x86/realmode/init.c                        |    3 
 arch/x86/realmode/rm/trampoline_64.S            |   27 +
 arch/x86/xen/enlighten_hvm.c                    |   11 
 arch/x86/xen/smp_hvm.c                          |   16 
 arch/x86/xen/smp_pv.c                           |   56 +-
 drivers/acpi/processor_idle.c                   |    4 
 include/linux/cpu.h                             |    4 
 include/linux/cpuhotplug.h                      |   17 
 kernel/cpu.c                                    |  397 +++++++++++++++++-
 kernel/smp.c                                    |    2 
 kernel/smpboot.c                                |  163 -------
 62 files changed, 953 insertions(+), 976 deletions(-)





 

