|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86: retrieve and log CPU frequency information
On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> While from just a single Skylake system it is already clear that we
> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> documented to be used for display purposes only anyway), logging this
> information may still give us some reference in case of problems as well
> as for future work. Additionally on the AMD side it is unclear whether
> the deviation between reported and measured frequencies is because of us
> not doing well, or because of nominal and actual frequencies being quite
> far apart.
>
> The chosen variable naming in amd_log_freq() has pointed out a naming
> problem in rdmsr_safe(), which is being taken care of at the same time.
> Symmetrically wrmsr_safe(), being an inline function, also gets an
> unnecessary underscore dropped from one of its local variables.
>
> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
> reported frequency nevertheless is 2600MHz, rather than the to be
> expected (and calibrated by both us and Linux) 2592MHz.
>
> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
I have one question below about P-state limits.
> ---
> TBD: The node ID retrieval using extended leaf 1E implies it won't work
> on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
> which doesn't get advertised on my Fam10 box (and it's zero on all
> processors despite there being two nodes as per the PCI device
> map), and which isn't even documented for Fam11, Fam12, and Fam14,
> I didn't find any other means to retrieve the node ID a CPU is
> associated with - the NodeId register in PCI config space depends
> on one already knowing the node ID for doing the access, as the
> device to be used is a function of the node ID.
I there a real chance of the boost states being different between
nodes? Won't Xen explode elsewhere due to possibly diverging features
between nodes?
>
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui
> : c->cpu_core_id);
> }
>
> +void amd_log_freq(const struct cpuinfo_x86 *c)
> +{
> + unsigned int idx = 0, h;
> + uint64_t hi, lo, val;
> +
> + if (c->x86 < 0x10 || c->x86 > 0x19 ||
> + (c != &boot_cpu_data &&
> + (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
> + return;
> +
> + if (c->x86 < 0x17) {
> + unsigned int node = 0;
> + uint64_t nbcfg;
> +
> + /*
> + * Make an attempt at determining the node ID, but assume
> + * symmetric setup (using node 0) if this fails.
> + */
> + if (c->extended_cpuid_level >= 0x8000001e &&
> + cpu_has(c, X86_FEATURE_TOPOEXT)) {
> + node = cpuid_ecx(0x8000001e) & 0xff;
> + if (node > 7)
> + node = 0;
> + } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
> + rdmsrl(0xC001100C, val);
> + node = val & 7;
> + }
> +
> + /*
> + * Enable (and use) Extended Config Space accesses, as we
> + * can't be certain that MCFG is available here during boot.
> + */
> + rdmsrl(MSR_AMD64_NB_CFG, nbcfg);
> + wrmsrl(MSR_AMD64_NB_CFG,
> + nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT));
> +#define PCI_ECS_ADDRESS(sbdf, reg) \
> + (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) <<
> 16))
> +
> + for ( ; ; ) {
> + pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4);
> +
> + switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) {
> + case 0x00000000:
> + case 0xffffffff:
> + /* No device at this SBDF. */
> + if (!node)
> + break;
> + node = 0;
> + continue;
> +
> + default:
> + /*
> + * Core Performance Boost Control, family
> + * dependent up to 3 bits starting at bit 2.
I would add:
"Note that boot states operate at a frequency above the base one, and
thus need to be accounted for in order to correctly fetch the nominal
frequency of the processor."
> + */
> + switch (c->x86) {
> + case 0x10: idx = 1; break;
> + case 0x12: idx = 7; break;
> + case 0x14: idx = 7; break;
> + case 0x15: idx = 7; break;
> + case 0x16: idx = 7; break;
> + }
> + idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf,
> + 0x15c),
> + 0, 4) >> 2;
> + break;
> + }
> + break;
> + }
> +
> +#undef PCI_ECS_ADDRESS
> + wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
> + }
> +
> + lo = 0; /* gcc may not recognize the loop having at least 5 iterations
> */
> + for (h = c->x86 == 0x10 ? 5 : 8; h--; )
> + if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
> + break;
> + if (!(lo >> 63))
> + return;
Should you also take the P-state limit here into account (from MSR
0xC0010061)?
I assume the firmware could set a minimum P-state higher than the
possible ones present in the list of P-states, effectively preventing
switching to lowest-performance P-states?
The rest LGTM, Thanks.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |