Xen project Mailing List

Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Mon, 11 Feb 2019 06:23:08 -0700

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Ashok Raj <ashok.raj@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, tglx@xxxxxxxxxxxxx, Borislav Petkov <bp@xxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>

Delivery-date: Mon, 11 Feb 2019 13:23:26 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 11.02.19 at 06:40, <chao.gao@xxxxxxxxx> wrote: > On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote: >>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote: >>> + /* >>> + * Initiate an update on all processors which don't have an online >>> sibling >>> + * thread with a lower thread id. Other sibling threads just await the >>> + * completion of microcode update. >>> + */ >>> + if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) ) >>> + ret = microcode_update_cpu(); >>> + /* >>> + * Increase the wait timeout to a safe value here since we're >>> serializing >>> + * the microcode update and that could take a while on a large number >>> of >>> + * CPUs. And that is fine as the *actual* timeout will be determined by >>> + * the last CPU finished updating and thus cut short >>> + */ >>> + if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) ) >>> + panic("Timeout when finishing updating microcode"); >> >>While I expect this to go away again in the next patch, I'd still like to >>see this improved, in particular in case the patch here goes in >>independently of the next one. After all on a system with 100 cores >>the timeout totals to a whopping 3 seconds. > > To be clear, the timeout remains the same in the next patch due to > the serial print clause in apply_microcode(). > >> >>Generally the time needed to wait scales by the number of CPUs still >>in need of doing the update. And if a timeout is really to occur, it's >>perhaps because of one bad core or socket, not because nothing >>works at all. Hence it would seem both nice and possible to scale the >>"remaining time to wait" by the (known) number of remaining >>processors to respond. > > Basically, I think the benefit is we can recognize the failure earlier > if no core called in in a given interval (i.e. 30ms), and trigger a > panic. Considering for such case, even with this optimization, the > system needs reboot, which generally takes several minutes, what's the > value of this optimization? Hmm, on one hand this is a fair point you make. Otoh, why do you add any timeout at all, if we say we're hosed anyway if the timeout expires? You could then as well log a message (say once a second) about how many (or which) CPUs still didn't respond. The admin can then still reboot the system if desired. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.