[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] x86: retrieve and log CPU frequency information



On 15.05.2020 10:32, Roger Pau Monné wrote:
> On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
>> While from just a single Skylake system it is already clear that we
>> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
>> documented to be used for display purposes only anyway), logging this
>> information may still give us some reference in case of problems as well
>> as for future work. Additionally on the AMD side it is unclear whether
>> the deviation between reported and measured frequencies is because of us
>> not doing well, or because of nominal and actual frequencies being quite
>> far apart.
>>
>> The chosen variable naming in amd_log_freq() has pointed out a naming
>> problem in rdmsr_safe(), which is being taken care of at the same time.
>> Symmetrically wrmsr_safe(), being an inline function, also gets an
>> unnecessary underscore dropped from one of its local variables.
>>
>> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
>>     reported frequency nevertheless is 2600MHz, rather than the to be
>>     expected (and calibrated by both us and Linux) 2592MHz.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Thanks, but please clarify whether this is with or without the
two suggested changes (breaking out intel_log_freq() and introducing
local variables for (uint8_t)(msrval >> NN)), or whether you
mean to leave it to me whether to make them. And if I'm to make the
change, whether you'd trust me to not screw up things, i.e. whether
I can keep you R-b in that case.

> I have one question below about P-state limits.
> 
>> ---
>> TBD: The node ID retrieval using extended leaf 1E implies it won't work
>>      on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
>>      which doesn't get advertised on my Fam10 box (and it's zero on all
>>      processors despite there being two nodes as per the PCI device
>>      map), and which isn't even documented for Fam11, Fam12, and Fam14,
>>      I didn't find any other means to retrieve the node ID a CPU is
>>      associated with - the NodeId register in PCI config space depends
>>      on one already knowing the node ID for doing the access, as the
>>      device to be used is a function of the node ID.
> 
> I there a real chance of the boost states being different between
> nodes?

Probably not, but doing things properly would still have been
nice.

> Won't Xen explode elsewhere due to possibly diverging features
> between nodes?

For many features - yes, but for boost states being different I
don't think it would.

>> --- a/xen/arch/x86/cpu/amd.c
>> +++ b/xen/arch/x86/cpu/amd.c
>> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui
>>                                                            : c->cpu_core_id);
>>  }
>>  
>> +void amd_log_freq(const struct cpuinfo_x86 *c)
>> +{
>> +    unsigned int idx = 0, h;
>> +    uint64_t hi, lo, val;
>> +
>> +    if (c->x86 < 0x10 || c->x86 > 0x19 ||
>> +        (c != &boot_cpu_data &&
>> +         (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
>> +            return;
>> +
>> +    if (c->x86 < 0x17) {
>> +            unsigned int node = 0;
>> +            uint64_t nbcfg;
>> +
>> +            /*
>> +             * Make an attempt at determining the node ID, but assume
>> +             * symmetric setup (using node 0) if this fails.
>> +             */
>> +            if (c->extended_cpuid_level >= 0x8000001e &&
>> +                cpu_has(c, X86_FEATURE_TOPOEXT)) {
>> +                    node = cpuid_ecx(0x8000001e) & 0xff;
>> +                    if (node > 7)
>> +                            node = 0;
>> +            } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
>> +                    rdmsrl(0xC001100C, val);
>> +                    node = val & 7;
>> +            }
>> +
>> +            /*
>> +             * Enable (and use) Extended Config Space accesses, as we
>> +             * can't be certain that MCFG is available here during boot.
>> +             */
>> +            rdmsrl(MSR_AMD64_NB_CFG, nbcfg);
>> +            wrmsrl(MSR_AMD64_NB_CFG,
>> +                   nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT));
>> +#define PCI_ECS_ADDRESS(sbdf, reg) \
>> +    (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << 
>> 16))
>> +
>> +            for ( ; ; ) {
>> +                    pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4);
>> +
>> +                    switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) {
>> +                    case 0x00000000:
>> +                    case 0xffffffff:
>> +                            /* No device at this SBDF. */
>> +                            if (!node)
>> +                                    break;
>> +                            node = 0;
>> +                            continue;
>> +
>> +                    default:
>> +                            /*
>> +                             * Core Performance Boost Control, family
>> +                             * dependent up to 3 bits starting at bit 2.
> 
> 
> I would add:
> 
> "Note that boot states operate at a frequency above the base one, and
> thus need to be accounted for in order to correctly fetch the nominal
> frequency of the processor."

Done.

>> +                             */
>> +                            switch (c->x86) {
>> +                            case 0x10: idx = 1; break;
>> +                            case 0x12: idx = 7; break;
>> +                            case 0x14: idx = 7; break;
>> +                            case 0x15: idx = 7; break;
>> +                            case 0x16: idx = 7; break;
>> +                            }
>> +                            idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf,
>> +                                                                 0x15c),
>> +                                                 0, 4) >> 2;
>> +                            break;
>> +                    }
>> +                    break;
>> +            }
>> +
>> +#undef PCI_ECS_ADDRESS
>> +            wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
>> +    }
>> +
>> +    lo = 0; /* gcc may not recognize the loop having at least 5 iterations 
>> */
>> +    for (h = c->x86 == 0x10 ? 5 : 8; h--; )
>> +            if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
>> +                    break;
>> +    if (!(lo >> 63))
>> +            return;
> 
> Should you also take the P-state limit here into account (from MSR
> 0xC0010061)?
> 
> I assume the firmware could set a minimum P-state higher than the
> possible ones present in the list of P-states, effectively preventing
> switching to lowest-performance P-states?

We're not after permitted P-states here - these would matter only if
we were meaning to alter the current P-state by direct MSR accesses.
Here we're only after logging capabilities (and the P-state limits
can, aiui, in principle also change at runtime).

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.