[patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

Hi!

This is a complete rework of the parallel bringup patch series (V17)

    https://lore.kernel.org/lkml/20230328195758.1049469-1-usama.arif@xxxxxxxxxxxxx

to address the issues which were discovered in review:

 1) The X86 microcode loader serialization requirement

    https://lore.kernel.org/lkml/87v8iirxun.ffs@tglx

    Microcode loading on HT enabled X86 CPUs requires that the microcode is
    loaded on the primary thread. The sibling thread(s) must be in
    quiescent state; either looping in a place which is aware of potential
    changes by the microcode update (see late loading) or in fully
    quiescent state, i.e. waiting for INIT/SIPI.

    This is required by hardware/firmware on Intel. Aside of that it's a
    vendor independent software correctness issue. Assume the following
    sequence:

    CPU1.0                        CPU1.1
                                  CPUID($A)
    Load microcode.
    Changes CPUID($A, $B)
                                  CPUID($B)

    CPU1.1 makes a decision on $A and $B which might be inconsistent due
    to the microcode update.

    The solution for this is to bringup the primary threads first and after
    that the siblings. Loading microcode on the siblings is a NOOP on Intel
    and on AMD it is guaranteed to only modify thread local state.

    This ensures that the APs can load microcode before reaching the alive
    synchronization point w/o doing any further x86 specific
    synchronization between the core siblings.

 2) The general design issues discussed in V16

    https://lore.kernel.org/lkml/87pm8y6yme.ffs@tglx

    The previous parallel bringup patches just glued this mechanism into
    the existing code without a deeper analysis of the synchronization
    mechanisms and without generalizing it so that the control logic is
    mostly in the core code and not made an architecture specific tinker
    space.

    Much of that had been pointed out 2 years ago in the discussions about
    the early versions of parallel bringup already.

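The ordering requirement from 1) can be modelled in a few lines of
userspace C. This is only a sketch under assumptions and not code from the
series: CPUs are modelled as bits in a 64-bit mask, and bringup_order() and
its parameters are hypothetical names. It emits all present primary threads
first and the siblings afterwards, so no sibling is started while a primary
thread is still pending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code.
 *
 * present: one bit per CPU which should be brought up
 * primary: one bit per CPU which is the primary SMT thread of its core
 * order:   output array which receives the CPU numbers in bringup order
 * max:     capacity of the output array
 *
 * Returns the number of entries written to order[].
 */
static size_t bringup_order(uint64_t present, uint64_t primary,
			    unsigned int *order, size_t max)
{
	/* Pass 0 handles the primary threads, pass 1 the siblings */
	const uint64_t pass[2] = { present & primary, present & ~primary };
	size_t n = 0;

	for (int p = 0; p < 2; p++) {
		for (unsigned int cpu = 0; cpu < 64; cpu++) {
			if (!(pass[p] & (UINT64_C(1) << cpu)))
				continue;
			assert(n < max);
			order[n++] = cpu;
		}
	}
	return n;
}
```

With CPUs 1-7 present and the even numbered CPUs marked as primary threads,
the resulting order is 2, 4, 6 followed by 1, 3, 5, 7.
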
The series is based on:

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip x86/apic

and also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hotplug


Background
----------

The reason why people are interested in parallel bringup is to shorten
the (kexec) reboot time of cloud servers to reduce the downtime of the
VM tenants. There are obviously other interesting use cases for this
like VM startup time, embedded devices...

The current fully serialized bringup does the following per AP:

    1) Prepare callbacks (allocate, initialize, create threads)
    2) Kick the AP alive (e.g. INIT/SIPI on x86)
    3) Wait for the AP to report alive state
    4) Let the AP continue through the atomic bringup
    5) Let the AP run the threaded bringup to full online state

There are two significant delays:

    #3 The time for an AP to report alive state in start_secondary() on x86
       has been measured in the range between 350us and 3.5ms depending on
       vendor and CPU type, BIOS microcode size etc.

    #4 The atomic bringup does the microcode update. This has been measured
       to take up to ~8ms on the primary threads depending on the microcode
       patch size to apply.

On a two socket SKL server with 56 cores (112 threads) the boot CPU spends
on current mainline about 800ms busy waiting for the APs to come up and
apply microcode. That's more than 80% of the actual onlining procedure.

By splitting the actual bringup mechanism into two parts this can be
reduced to waiting for the first AP to report alive or if the system is
large enough the first AP is already waiting when the boot CPU finished the
wake-up of the last AP.


The actual solution comes in several parts
------------------------------------------

 1) [P 1-2] General cleanups (init annotations, kernel doc...)

 2) [P 3] The obvious

    Avoid pointless delay calibration when TSC is synchronized across
    sockets. That removes a whopping 100ms delay for the first CPU of a
    socket.
    This is an improvement independent of parallel bringup and had been
    discussed two years ago already.

 2) [P 3-6] Removal of the CPU0 hotplug hack.

    This was added 11 years ago with the promise to make this a real
    hardware mechanism, but that never materialized. As physical CPU
    hotplug is not really supported and the physical unplugging of CPU0
    never materialized there is no reason to keep this cruft around. It's
    just maintenance ballast for no value and the removal makes
    implementing the parallel bringup feature way simpler.

 3) [P 7-16] Cleanup of the existing bringup mechanism:

     a) Code reorganisation so that the general hotplug specific code is
        in smpboot.c and not sprinkled all over the place

     b) Decouple MTRR/PAT initialization from smp_callout_mask to prepare
        for replacing that mask with a hotplug core code synchronization
        mechanism.

     c) Make TSC synchronization function call based so that the control
        CPU does not have to busy wait for nothing if synchronization is
        not required.

     d) Remove the smp_callin_mask synchronization point as it's no longer
        required due to #3c.

     e) Rework the sparse_irq_lock held region in the core code so that the
        next polling synchronization point in the x86 code can be removed
        too.

     f) Due to #3e it's no longer required to spin wait for the AP to set
        its online bit.

        Remove wait_cpu_online() and the XENPV counterpart. So the control
        CPU can directly wait for the online idle completion by the AP and
        free the control CPU up for other work.

     This reduces the synchronization points in the x86 code to one, which
     is the AP alive one. This synchronization will be moved to core
     infrastructure in the next section.

 4) [P 17-27] Replace the disconnected CPU state tracking

    The extra CPU state tracking which is used by a few architectures is
    completely separate from the CPU hotplug core code.
    Replacing it by a variant integrated in the core hotplug machinery
    allows to reduce architecture specific code and provides a generic
    synchronization mechanism for (parallel) CPU bringup/teardown.

    - Convert x86 over and replace the AP alive synchronization on x86 with
      the core variant which removes the remaining x86 hotplug
      synchronization masks.

    - Convert the other architectures usage and remove the old interface
      and code.

 5) [P 28-30] Split the bringup into two steps

    First step invokes the wakeup function on the BP, e.g. SIPI/STARTUP on
    x86. The second one waits on the BP for the AP to report alive and
    releases it for the complete onlining.

    As the hotplug state machine allows partial bringup this allows later
    to kick all APs alive in a first iteration and then bring them up
    completely one by one afterwards.

 6) [P 31] Switch the primary thread detection to a cpumask

    This makes the parallel bringup a simple cpumask based mechanism
    without tons of conditionals and checks for primary threads.

 7) [P 32] Implement the parallel bringup core code

    The parallel bringup looks like this:

      1) Bring up the primary SMT threads to the CPUHP_KICK_AP_ALIVE step
         one by one

      2) Bring up the primary SMT threads to the CPUHP_ONLINE step one by
         one

      3) Bring up the secondary SMT threads to the CPUHP_KICK_AP_ALIVE
         step one by one

      4) Bring up the secondary SMT threads to the CPUHP_ONLINE step one
         by one

    In case that SMT is not supported this is obviously reduced to step #1
    and #2.

 8) [P 33-37] Prepare X86 for parallel bringup and enable it


Caveats
-------

The non X86 changes have been all compile tested. Boot and runtime
testing has only been done on a few real hardware platforms and qemu as
available. That definitely needs some help from the people who have
these systems at their fingertips.


Results and analysis
--------------------

Here are numbers for a dual socket SKL 56 cores/ 112 threads machine. All
numbers in milliseconds.
The time measured is the time which the cpu_up() call takes for each CPU
and phase. It's not exact as the system is already scheduling, handling
interrupts and soft interrupts, which is obviously skewing the picture
slightly.

Baseline tip tree x86/apic branch.

                total     avg/CPU         min         max
total  :      912.081       8.217       3.720     113.271

The max of 100ms is due to the silly delay calibration for the second
socket which takes 100ms and was eliminated first. Also the other initial
cleanups and improvements take some time away.

So the real baseline becomes:

                total     avg/CPU         min         max
total  :      785.960       7.081       3.752      36.098

The max here is on the first CPU of the second socket. 20ms of that is due
to TSC synchronization and an extra 2ms to react on the SIPI.

With parallel bootup enabled this becomes:

                total     avg/CPU         min         max
prepare:       39.108       0.352       0.238       0.883
online :       45.166       0.407       0.170      20.357
total  :       84.274       0.759       0.408      21.240

That's a factor ~9.3 reduction on average.

Looking at the 27 primary threads of socket 0 then this becomes even more
interesting:

                total     avg/CPU         min         max
total  :      325.764      12.065      11.981      14.125

versus:

                total     avg/CPU         min         max
prepare:        8.945       0.331       0.238       0.834
online :        4.830       0.179       0.170       0.212
total  :       13.775       0.510       0.408       1.046

So the reduction factor is ~23.5 here. That's mostly because the 20ms TSC
sync is not skewing the picture.

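The factors quoted above are plain ratios of the measured totals, and the
avg/CPU column is the total divided by the number of APs. A small sketch to
recompute them; reduction_factor() and avg_per_cpu() are hypothetical
helper names and the inputs are meant to be the totals from the tables
above:

```c
#include <assert.h>

/* Ratio of the serialized total to the parallel total, both in ms */
static double reduction_factor(double serialized_ms, double parallel_ms)
{
	assert(parallel_ms > 0.0);
	return serialized_ms / parallel_ms;
}

/* Average time per brought up CPU in ms */
static double avg_per_cpu(double total_ms, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return total_ms / nr_cpus;
}
```

For the whole machine reduction_factor(785.960, 84.274) comes out at about
9.3, and avg_per_cpu(785.960, 111) at about 7.08ms for the 111 APs.
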
For all 55 primaries, i.e. with the 20ms TSC sync extra for socket 1 this
becomes:

                total     avg/CPU         min         max
total  :      685.489      12.463      11.975      36.098

versus:

                total     avg/CPU         min         max
prepare:       19.080       0.353       0.238       0.883
online :       30.283       0.561       0.170      20.357
total  :       49.363       0.914       0.408      21.240

The TSC sync reduces the win to a factor of ~13.8

With 'tsc=reliable' on the command line the socket sync is disabled which
brings it back to the socket 0 numbers:

                total     avg/CPU         min         max
prepare:       18.970       0.351       0.231       0.874
online :       10.328       0.191       0.169       0.358
total  :       29.298       0.543       0.400       1.232

Now looking at the secondary threads only:

                total     avg/CPU         min         max
total  :      100.471       1.794       0.375       4.745

versus:

                total     avg/CPU         min         max
prepare:       19.753       0.353       0.257       0.512
online :       14.671       0.262       0.179       3.461
total  :       34.424       0.615       0.436       3.973

Still a factor of ~3.

The average on the secondaries for the serialized bringup is significantly
lower than for the primaries because the SIPI response time is shorter and
the microcode update takes no time.

This varies wildly with the system, whether microcode in BIOS is already up
to date, how big the microcode patch is and how long the INIT/SIPI response
time is. On an AMD Zen3 machine INIT/SIPI response time is amazingly fast
(350us), but then it lacks TSC_ADJUST and does a two millisecond TSC sync
test for _every_ AP. All of this sucks...


Possible further enhancements
-----------------------------

It's definitely worthwhile to look into reducing the cross socket TSC sync
test time. It's probably safe enough to use 5ms or even 2ms instead of 20ms
on systems with TSC_ADJUST and a few other 'TSC is sane' indicators. Moving
it out of the hotplug path is eventually possible, but that needs some deep
thoughts.

Let's take the TSC sync out of the picture by adding 'tsc=reliable' to the
kernel command line.
So the bringup of 111 APs takes:

                total     avg/CPU         min         max
prepare:       38.936       0.351       0.231       0.874
online :       25.231       0.227       0.169       3.465
total  :       64.167       0.578       0.400       4.339

Some of the outliers are not necessarily in the state callbacks as the
system is already scheduling and handles interrupts and soft
interrupts. Haven't analyzed that yet in detail.

In the prepare stage which runs on the control CPU the larger steps are:

    smpcfd:prepare            16us  avg/CPU
    threads:prepare           98us  avg/CPU
    workqueue:prepare         43us  avg/CPU
    trace/RB:prepare         135us  avg/CPU

The trace ringbuffer initialization allocates 354 pages and 354 control
structures one by one. That probably should allocate a large page and an
array of control structures and work from there. I'm sure that would reduce
this significantly. Steven?

smpcfd does just a percpu allocation. No idea why that takes that long.

Vs. threads and workqueues. David thought about spreading out the
preparation work and do it really in parallel. That's a nice idea, but the
threads and workqueue prepare steps are self serializing. The workqueue one
has a global mutex and aside of that both steps create kernel threads which
implicitly serialize on kthreadd. alloc_percpu(), which is used by
smpcfd:prepare is also globally serialized.

The rest of the prepare steps is pretty much in the single digit
microseconds range.

On the AP side it should be possible to move some of the initialization
steps before the alive synchronization point, but that really needs a lot
of analysis whether the functions are safe to invoke that early and outside
of the cpu_hotplug_lock held region for the case of two stage parallel
bringup; see below.

The largest part is:

    identify_secondary_cpu()  99us avg/CPU

Inside of identify_secondary_cpu() the largest offender:

    mcheck_init()             73us avg/CPU

This part is definitely worth to be looked at whether it can be at least
partially moved to the early startup code before the alive synchronization
point.
There's a lot of deep analysis required and ideally we just rewrite the
whole CPUID evaluation trainwreck completely.

The rest of the AP side is low single digit microseconds except of:

    perf/x86:starting         14us avg/CPU
    smpboot/threads:online    13us avg/CPU
    workqueue:online          17us avg/CPU
    mm/vmstat:online          17us avg/CPU
    sched:active              30us avg/CPU

sched:active is special. Onlining the first secondary HT thread on the
second socket creates a 3.2ms outlier which skews the whole picture. That's
caused by enabling the static key sched_smt_present which patches the world
and some more. For all other APs this is really in the 1us range. This
definitely could be postponed during bootup like the scheduler domain
rebuild is done after the bringup. But that's still fully serialized and
single threaded and obviously could be done later in the context of async
parallel init. It's unclear why this is different with the fully serialized
bringup where it takes significantly less time, but that's something which
needs to be investigated.


Is truly parallel bringup feasible?
-----------------------------------

In theory yes, realistically no. Why?

 1) The preparation phase

    Allocating memory, creating threads for the to be brought up CPU must
    obviously happen on an already online CPU.

    While it would be possible to bring up a subset of CPUs first and let
    them do the preparation steps for groups of still offline CPUs
    concurrently, the actual benefit of doing so is dubious.

    The prime example is kernel thread creation, which is implicitly
    serialized on kthreadd.

    A simple experiment shows that 4 concurrent workers on 4 different CPUs
    where each is creating 14 * 5 = 70 kernel threads are 5% slower than a
    single worker creating 4 * 14 * 5 = 280 threads.

    So we'd need to have multiple kthreadd instances to handle that, which
    would then serialize on tasklist lock and other things.

    That aside the preparation phase is also affected by the problem below.

 2) Assumptions about hotplug serialization

    a) There are quite some assumptions about CPU bringup being fully
       serialized across state transitions. A lot of state callbacks rely
       on that and would require local locking.

       Adding that local locking is surely possible, but that has several
       downsides:

        - It adds complexity and makes it harder for developers to get
          this correct. The subtle bugs resulting out of that are going to
          be interesting

        - Fine grained locking has a charm, but only if the time spent for
          the actual work is larger than the time required for
          serialization and synchronization.

          Serializing a callback which takes less than a microsecond and
          then having a large number of CPUs contending on the lock will
          not make it any faster at all. That's a well known issue of
          parallelizing and neither made up nor kernel specific.

    b) Some operations definitely require to be protected by the
       cpu_hotplug_lock, especially those which affect cpumasks as the
       masks are guaranteed to be stable in a cpus_read_lock()'ed region.

       As this lock cannot be taken in atomic contexts, it's required that
       the control CPU holds the lock write locked across these state
       transitions. And no, we are not making this a spinlock just for
       that and we even can't.

       Just slapping a lock into the x86 specific part of the cpumask
       update function does not solve anything. The relevant patch in V17
       is completely useless as it only serializes the actual cpumask/map
       modifications, but all read side users are hosed if the update
       would be moved before the alive synchronization point, i.e. into a
       non hotplug lock protected region.

       Even if the hotplug lock would be held across the whole parallel
       bringup operation then this would still expose all usage of these
       masks and maps in the actual hotplug state callbacks to concurrent
       modifications.

       And no, we are not going to expose an architecture specific raw
       spinlock to the hotplug state callbacks, especially not to those in
       generic code.

    c) Some cpus_read_lock()'ed regions also expect that there is no CPU
       state transition happening which would modify their local
       state. This would again require local serialization.

 3) The amount of work and churn:

    - Analyze the per architecture low level startup functions plus their
      descendant functions and make them ready for concurrency if
      necessary.

    - Analyze ~300 hotplug state callbacks and their descendant functions
      and make them ready for concurrency if necessary.

    - Analyze all cpus_read_lock()'ed regions and address their
      requirements.

    - Rewrite the core code to handle the cpu_hotplug_lock requirements
      only in distinct phases of the state machine.

    - Rewrite the core code to handle state callback failure and the
      related rollback in the context of the new rules.

    - ...

Even if some people are dedicated enough to do that, it's very
questionable whether the resulting complexity is justified.

We've spent a serious amount of time to sanitize hotplug and bring it into
a state where it is correct. This also made it reasonably simple for
developers to implement hotplug state callbacks without having to become
hotplug experts.

Breaking this completely up will result in a flood of hard to diagnose
subtle issues for sure. Who is going to deal with them?

The experience with this series so far does not make me comfortable about
that thought in any way.


Summary
-------

The obvious and low hanging fruits have to be solved first:

  - The CPUID evaluation and related setup mechanisms

  - The trace/ringbuffer oddity

  - The sched:active oddity for the first sibling on the second socket

  - Some other expensive things which I'm not seeing in my test setup due
    to lack of hardware or configuration.

Anything else is pretty much wishful thinking in my opinion.

To be clear. I'm not standing in the way if there is a proper solution,
but that requires to respect the basic engineering rules:

  1) Correctness first
  2) Keep it maintainable
  3) Keep it simple

So far this stuff failed already at #1.

I completely understand why this is important for cloud people, but the
real question to ask here is what are the actual requirements.

As far as I understand the main goal is to make a (kexec) reboot almost
invisible to VM tenants.

Now let's look at how this works:

   A) Freeze VMs and persist state
   B) kexec into the new kernel
   C) Restore VMs from persistent memory
   D) Thaw VMs

So the key problem is how long it takes to get from #B to #C and finally
to #D.

As far as I understand #C takes a serious amount of time and cannot be
parallelized for whatever reasons.

At the same time the number of online CPUs required to restore the VMs
state is less than the number of online CPUs required to actually operate
them in #D.

That means it would be good enough to return to userspace with a limited
number of online CPUs as fast as possible. A certain amount of CPUs are
going to be busy with restoring the VMs state, i.e. one CPU per VM. Some
remaining non-busy CPU can bringup the rest of the system and the APs in
order to be functional for #D, i.e. the restore of VM operation.

Trying to optimize this purely in kernel space by adding complexity of
dubious value is simply bogus in my opinion.

It's already possible today to limit the number of CPUs which are
initially onlined and online the rest later from user space.

There are two issues there:

  a) The death by MCE broadcast problem

     Quite some (contemporary) x86 CPU generations are affected by this:

       - MCE can be broadcasted to all CPUs and not only issued locally
         to the CPU which triggered it.

       - Any CPU which has CR4.MCE == 0, even if it sits in a wait for
         INIT/SIPI state, will cause an immediate shutdown of the machine
         if a broadcasted MCE is delivered.

  b) Do the parallel bringup via sysfs control knob

     The per CPU target state interface allows to do that today one by
     one, but it's awkward and has quite some overhead.
     A knob to online the rest of the not yet onlined present CPUs with
     the benefit of the parallel bringup mechanism is missing.

  #a) That's a risk to take by the operator.

      Even the regular serialized bringup does not protect against this
      issue up to the point where all present CPUs have at least
      initialized CR4.

      Limiting the number of APs to online early via the kernel command
      line widens that window and increases the risk further by executing
      user space before all APs have CR4 initialized.

      But the same applies to a deferred online mechanism implemented in
      the kernel where some worker brings up the not yet online APs while
      the early online CPUs are already executing user space code.

  #b) Is a no brainer to implement on top of this.


Conclusion
----------

Adding the basic parallel bringup mechanism as provided by this series
makes a lot of sense. Improving particular issues as pointed out in the
analysis makes sense too.

But trying to solve an application specific problem fully in the kernel
with tons of complexity, without exploring straight forward and simple
approaches first, does not make any sense at all.

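For reference, a sketch of what onlining the remaining CPUs from user space
looks like today, i.e. one by one. It assumes the per CPU sysfs file
/sys/devices/system/cpu/cpuN/online; cpu_online_path() and online_range()
are hypothetical helpers, and the sysfs root is a parameter only so that
the sketch can be exercised without touching a real system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<root>/cpu<N>/online" in buf.
 * Returns 0 on success, -1 if the path does not fit into buf.
 */
static int cpu_online_path(char *buf, size_t len, const char *root,
			   unsigned int cpu)
{
	int n = snprintf(buf, len, "%s/cpu%u/online", root, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" to the online file of every CPU in [first, last], strictly
 * one after the other. Returns the number of successful writes.
 */
static unsigned int online_range(const char *root, unsigned int first,
				 unsigned int last)
{
	unsigned int cpu, done = 0;
	char path[256];

	assert(first <= last);

	for (cpu = first; cpu <= last; cpu++) {
		FILE *f;
		int ok;

		if (cpu_online_path(path, sizeof(path), root, cpu))
			continue;

		f = fopen(path, "w");
		if (!f)
			continue;

		/* A rejected write may only show up when the stream is flushed */
		ok = fputs("1\n", f) != EOF;
		if (fclose(f) != 0)
			ok = 0;
		if (ok)
			done++;
	}
	return done;
}
```

After booting with a limited number of CPUs, calling
online_range("/sys/devices/system/cpu", first, last) with sufficient
privileges would bring the rest up serially. That is exactly the overhead
which a single knob on top of the parallel bringup mechanism would avoid.
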
Thanks,

	tglx
---
 Documentation/admin-guide/kernel-parameters.txt |  20
 Documentation/core-api/cpu_hotplug.rst          |  13
 arch/Kconfig                                    |  23 +
 arch/arm/Kconfig                                |   1
 arch/arm/include/asm/smp.h                      |   2
 arch/arm/kernel/smp.c                           |  18
 arch/arm64/Kconfig                              |   1
 arch/arm64/include/asm/smp.h                    |   2
 arch/arm64/kernel/smp.c                         |  14
 arch/csky/Kconfig                               |   1
 arch/csky/include/asm/smp.h                     |   2
 arch/csky/kernel/smp.c                          |   8
 arch/mips/Kconfig                               |   1
 arch/mips/cavium-octeon/smp.c                   |   1
 arch/mips/include/asm/smp-ops.h                 |   1
 arch/mips/kernel/smp-bmips.c                    |   1
 arch/mips/kernel/smp-cps.c                      |  14
 arch/mips/kernel/smp.c                          |   8
 arch/mips/loongson64/smp.c                      |   1
 arch/parisc/Kconfig                             |   1
 arch/parisc/kernel/process.c                    |   4
 arch/parisc/kernel/smp.c                        |   7
 arch/riscv/Kconfig                              |   1
 arch/riscv/include/asm/smp.h                    |   2
 arch/riscv/kernel/cpu-hotplug.c                 |  14
 arch/x86/Kconfig                                |  45 --
 arch/x86/include/asm/apic.h                     |   5
 arch/x86/include/asm/cpu.h                      |   5
 arch/x86/include/asm/cpumask.h                  |   5
 arch/x86/include/asm/processor.h                |   1
 arch/x86/include/asm/realmode.h                 |   3
 arch/x86/include/asm/sev-common.h               |   3
 arch/x86/include/asm/smp.h                      |  26 -
 arch/x86/include/asm/topology.h                 |  23 -
 arch/x86/include/asm/tsc.h                      |   2
 arch/x86/kernel/acpi/sleep.c                    |   9
 arch/x86/kernel/apic/apic.c                     |  22 -
 arch/x86/kernel/callthunks.c                    |   4
 arch/x86/kernel/cpu/amd.c                       |   2
 arch/x86/kernel/cpu/cacheinfo.c                 |  21
 arch/x86/kernel/cpu/common.c                    |  50 --
 arch/x86/kernel/cpu/topology.c                  |   3
 arch/x86/kernel/head_32.S                       |  14
 arch/x86/kernel/head_64.S                       | 121 +++++
 arch/x86/kernel/sev.c                           |   2
 arch/x86/kernel/smp.c                           |   3
 arch/x86/kernel/smpboot.c                       | 508 ++++++++----------------
 arch/x86/kernel/topology.c                      |  98 ----
 arch/x86/kernel/tsc.c                           |  20
 arch/x86/kernel/tsc_sync.c                      |  36 -
 arch/x86/power/cpu.c                            |  37 -
 arch/x86/realmode/init.c                        |   3
 arch/x86/realmode/rm/trampoline_64.S            |  27 +
 arch/x86/xen/enlighten_hvm.c                    |  11
 arch/x86/xen/smp_hvm.c                          |  16
 arch/x86/xen/smp_pv.c                           |  56 +-
 drivers/acpi/processor_idle.c                   |   4
 include/linux/cpu.h                             |   4
 include/linux/cpuhotplug.h                      |  17
 kernel/cpu.c                                    | 397 +++++++++++++++++-
 kernel/smp.c                                    |   2
 kernel/smpboot.c                                | 163 -------
 62 files changed, 953 insertions(+), 976 deletions(-)