Re: [PATCH v2] amd: disable C6 after 1000 days on Zen2
On Fri, Jun 30, 2023 at 03:18:20PM +0200, Roger Pau Monne wrote:
> As specified on Errata 1474:
>
> "A core will fail to exit CC6 after about 1044 days after the last
> system reset. The time of failure may vary depending on the spread
> spectrum and REFCLK frequency."
>
> Detect when running on AMD Zen2 (family 17h models 30-3fh, 60-6fh or
> 70-7fh) and setup a timer to prevent entering C6 after 1000 days of
> uptime. Take into account the TSC value at boot in order to account
> for any time elapsed before Xen has been booted. Worst case we end
> up disabling C6 before strictly necessary, but that would still be
> safe, and it's better than not taking the TSC value into account and
> hanging.
>
> Disable C6 by updating the MSR listed in the revision guide, this
> avoids applying workarounds in the CPU idle drivers, as the processor
> won't be allowed to enter C6 by the hardware itself.
>
> Print a message once C6 is disabled in order to let the user know.
>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> The current Revision Guide for Fam17h model 60-6Fh (Lucienne and
> Renoir) hasn't been updated to reflect the MSR workaround, but the PPR
> for those models lists the MSR and the bits as having the expected
> meaning, so I assume it's safe to apply the same workaround there.
>
> For all accounts this seems to affect all Zen2 models, and hence the
> workaround should be the same. Might also affect Hygon, albeit I
> think Hygon is strictly limited to Zen1.
> ---
> Changes since v1:
> - Apply the workaround listed by AMD: toggle some MSR bits.
> - Do not apply the workaround if virtualized.
> - Check for STIBP feature instead of listing specific models.
> - Implement the DAYS macro based on SECONDS.
> ---
> xen/arch/x86/cpu/amd.c | 70 ++++++++++++++++++++++++++++
> xen/arch/x86/include/asm/msr-index.h | 5 ++
> xen/include/xen/time.h | 1 +
> 3 files changed, 76 insertions(+)
>
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 0eaef82e5145..bdf45f8387e8 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -51,6 +51,8 @@ bool __read_mostly amd_acpi_c1e_quirk;
> bool __ro_after_init amd_legacy_ssbd;
> bool __initdata amd_virt_spec_ctrl;
>
> +static bool __read_mostly c6_disabled;
> +
> static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
> unsigned int *hi)
> {
> @@ -905,6 +907,31 @@ void __init detect_zen2_null_seg_behaviour(void)
>
> }
>
> +static void cf_check disable_c6(void *arg)
> +{
> + uint64_t val;
> +
> + if (!c6_disabled) {
> + printk(XENLOG_WARNING
> + "Disabling C6 after 1000 days apparent uptime due to AMD errata 1474\n");
> + c6_disabled = true;
> + smp_call_function(disable_c6, NULL, 0);

I've realized this is racy with CPU hotplug: a CPU being brought up
could miss c6_disabled being set and also not yet be present in
cpu_online_map when smp_call_function() runs, so it would be left with
C6 enabled. I will need to inhibit CPU hotplug around the call to
smp_call_function() in order to prevent that.

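To illustrate the ordering, here is a minimal user-space sketch of the
idea, not the actual Xen change. Apart from c6_disabled, every name in
it is a hypothetical stand-in: cpu_online[] models cpu_online_map,
cpu_c6_allowed[] models the per-CPU MSR state, and hotplug_trylock() /
hotplug_unlock() model whatever mechanism ends up inhibiting hotplug
(in Xen that would presumably be the existing CPU maps lock). The
point is that setting the flag and broadcasting happen under the same
lock that CPU bring-up takes, so a new CPU either sees the flag or is
covered by the broadcast.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool c6_disabled;              /* same meaning as in the patch */
static bool cpu_online[NR_CPUS];      /* stand-in for cpu_online_map */
static bool cpu_c6_allowed[NR_CPUS];  /* stand-in for the per-CPU MSR bits */
static bool hotplug_locked;           /* stand-in for the hotplug lock */

/* Hypothetical try-lock: fails if hotplug is already inhibited. */
static bool hotplug_trylock(void)
{
    if (hotplug_locked)
        return false;
    hotplug_locked = true;
    return true;
}

static void hotplug_unlock(void)
{
    hotplug_locked = false;
}

/* Models the MSR write done on one CPU. */
static void disable_c6_local(unsigned int cpu)
{
    cpu_c6_allowed[cpu] = false;
}

/* CPU bring-up: takes the hotplug lock and checks the flag before going online. */
static bool cpu_up(unsigned int cpu)
{
    assert(cpu < NR_CPUS);

    if (!hotplug_trylock())
        return false;               /* hotplug currently inhibited */

    cpu_c6_allowed[cpu] = true;     /* state after reset */
    if (c6_disabled)
        disable_c6_local(cpu);
    cpu_online[cpu] = true;

    hotplug_unlock();
    return true;
}

/* Timer handler model: flag update and broadcast under the hotplug lock. */
static bool disable_c6_all(void)
{
    unsigned int cpu;

    if (!hotplug_trylock())
        return false;               /* caller would retry */

    c6_disabled = true;
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_online[cpu])
            disable_c6_local(cpu);  /* models smp_call_function() */

    hotplug_unlock();
    return true;
}

/* Number of online CPUs that are still allowed to enter C6. */
static unsigned int count_online_with_c6(void)
{
    unsigned int cpu, n = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_online[cpu] && cpu_c6_allowed[cpu])
            n++;
    return n;
}
```

In this model a CPU brought up after disable_c6_all() picks up
c6_disabled on its own, and a CPU cannot come online while the
broadcast is in progress because cpu_up() fails to take the lock.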
Thanks, Roger.