[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading



>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote:
> @@ -30,18 +31,25 @@
>  #include <xen/smp.h>
>  #include <xen/softirq.h>
>  #include <xen/spinlock.h>
> +#include <xen/stop_machine.h>
>  #include <xen/tasklet.h>
>  #include <xen/guest_access.h>
>  #include <xen/earlycpio.h>
> +#include <xen/watchdog.h>
>  
> +#include <asm/delay.h>
>  #include <asm/msr.h>
>  #include <asm/processor.h>
>  #include <asm/setup.h>
>  #include <asm/microcode.h>
>  
> +/* By default, wait for 30000us */
> +#define MICROCODE_DEFAULT_TIMEOUT_US 30000

This value deserves some explanation as to how it was chosen.

>  static module_t __initdata ucode_mod;
>  static signed int __initdata ucode_mod_idx;
>  static bool_t __initdata ucode_mod_forced;
> +static unsigned int nr_cores;

Could you see about avoiding such a static? You have ...

> +static int do_microcode_update(void *unused)

... an "unused" here after all, and it's the (indirect) caller of the
function which calculates the value.

> +{
> +    unsigned int cpu = smp_processor_id();
> +    int ret;
>  
> -    error = microcode_update_cpu();
> -    if ( error )
> -        info->error = error;
> +    ret = wait_for_cpus(&cpu_in, MICROCODE_DEFAULT_TIMEOUT_US);
> +    if ( ret )
> +        return ret;
>  
> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
> -    if ( info->cpu < nr_cpu_ids )
> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, 
> info);
> +    /*
> +     * Initiate an update on all processors which don't have an online 
> sibling
> +     * thread with a lower thread id. Other sibling threads just await the
> +     * completion of microcode update.
> +     */
> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +        ret = microcode_update_cpu();
> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing
> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short
> +     */
> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
> +        panic("Timeout when finishing updating microcode");

While I expect this to go away again in the next patch, I'd still like to
see this improved, in particular in case the patch here goes in
independently of the next one. After all on a system with 100 cores
the timeout totals to a whopping 3 seconds.

Generally the time needed to wait scales by the number of CPUs still
in need of doing the update. And if a timeout is really to occur, it's
perhaps because of one bad core or socket, not because nothing
works at all. Hence it would seem both nice and possible to scale the
"remaining time to wait" by the (known) number of remaining
processors to respond.

Additionally it would be confusing to see dozens of panics in case
this actually triggers, because all CPUs would hit this at about the
same time.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.