[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading



>>> On 11.02.19 at 06:40, <chao.gao@xxxxxxxxx> wrote:
> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
>>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote:
>>> +    /*
>>> +     * Initiate an update on all processors which don't have an online 
>>> sibling
>>> +     * thread with a lower thread id. Other sibling threads just await the
>>> +     * completion of microcode update.
>>> +     */
>>> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>>> +        ret = microcode_update_cpu();
>>> +    /*
>>> +     * Increase the wait timeout to a safe value here since we're 
>>> serializing
>>> +     * the microcode update and that could take a while on a large number 
>>> of
>>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>>> +     * the last CPU finished updating and thus cut short
>>> +     */
>>> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
>>> +        panic("Timeout when finishing updating microcode");
>>
>>While I expect this to go away again in the next patch, I'd still like to
>>see this improved, in particular in case the patch here goes in
>>independently of the next one. After all on a system with 100 cores
>>the timeout totals to a whopping 3 seconds.
> 
> To be clear, the timeout remains the same in the next patch due to
> the serial print clause in apply_microcode().
> 
>>
>>Generally the time needed to wait scales by the number of CPUs still
>>in need of doing the update. And if a timeout is really to occur, it's
>>perhaps because of one bad core or socket, not because nothing
>>works at all. Hence it would seem both nice and possible to scale the
>>"remaining time to wait" by the (known) number of remaining
>>processors to respond.
> 
> Basically, I think the benefit is we can recognize the failure earlier
> if no core called in in a given interval (i.e. 30ms), and trigger a
> panic. Considering for such case, even with this optimization, the
> system needs reboot, which generally takes several minutes, what's the
> value of this optimization?

Hmm, on one hand this is a fair point you make. Otoh, why do
you add any timeout at all, if we say we're hosed anyway if the
timeout expires? You could then as well log a message (say
once a second) about how many (or which) CPUs still didn't
respond. The admin can then still reboot the system if desired.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.