
Re: [Xen-devel] [PATCH 0/9] Porting the intel_pstate driver to Xen



On 24/04/2015 17:11, Jan Beulich wrote:
> >>> On 24.04.15 at 10:32, <wei.w.wang@xxxxxxxxx> wrote:
> > On 23/04/2015 15:27, Jan Beulich wrote:
> >> >>> On 24.04.15 at 07:12, <wei.w.wang@xxxxxxxxx> wrote:
> >> > On 23/04/2015 22:09, Jan Beulich wrote:
> >> >> >>> On 23.04.16 at 15:31, <wei.w.wang@xxxxxxxxx> wrote:
> >> >> > The intel_pstate.c file under xen/arch/x86/acpi/cpufreq/
> >> >> > contains all the logic for selecting the current P-state. It
> >> >> > follows its implementation in the kernel. Instead of using the
> >> >> > traditional cpufreq governors, intel_pstate implements its
> >> >> > internal governor in the "setpolicy()".
> >> >>
> >> >> And this internal governor behaves how? Like ondemand, powersave,
> >> >> performance, or yet something else? And how would its behavior be
> >> >> changed?
> >> >
> >> > In the kernel intel_pstate implementation, they have two internal
> >> > governors: Powersave and Performance.
> >> > Powersave is similar to the old (cpufreq) ondemand governor. A
> >> > timer function is periodically invoked to sample the CPU busy info (e.g.
> >> > will get increased due to the running of a CPU-intensive workload).
> >> > However, the final calculated target value is clamped into the
> >> > [min_pct, max_pct] limit interval.
> >> > The Performance governor is actually a special case of Powersave,
> >> > when min_pct = max_pct = 100%. This is the same as the old
> >> > performance governor.
> >>
> >> So a true powersave one would then be accomplished by setting min_pct
> >> = max_pct = <some value smaller than 100>%. Is there a limit on the
> >> valid percentages to be specified here?
> >
> >
> > In the old driver, a powersave governor just sets the CPU to run with
> > the lowest possible performance state. This one does not exist in the
> > intel_pstate driver.
> > The intel_pstate driver changes the terminology by using "powersave"
> > to refer to the previous "ondemand" case. This does confuse people.
> > But we may think of it this way: it only has two modes, the max
> > performance mode and the ondemand mode. "ondemand" is the one that
> > saves power (actually in a more reasonable way compared to the
> > previous "powersave", which simply sets the CPU to run in the lowest
> > performance state). Anyway, we can surely change the name if it
> > sounds uncomfortable.
> 
> I think at the very least from a user interface perspective (e.g. the xenpm
> tool) the meaning of the old governor names should be retained as much as
> possible.

Ok. I am simply using the name "internal" for user tools. Please see the 
example below:

scaling_driver       : intel_pstate
scaling_avail_gov    : internal
current_governor     : internal

> 
> > The valid pct value range is 0 to 100.
> 
> So what does 0% mean then? I.e. (wrt "powersave") what does min_pct =
> max_pct = 0 result in?

I probably misunderstood that question.
The CPU actually has its own valid frequency range. On my machine, for example,
the minimum is 1.2 GHz (read from an MSR), which corresponds to 32%. The
valid range is then 32 to 100, and any input pct value below 32 is set to 32.
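
To make the clamping concrete, here is a rough sketch. It is not code from the
patch series; clamp_pct() and hw_min_pct are names made up for illustration, and
hw_min_pct stands for the lowest P-state expressed as a percentage of the
highest one (32 on the machine mentioned above).

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch series): clamp a requested
 * percentage into the range the hardware can honour.
 * hw_min_pct is an assumed parameter: the lowest P-state as a
 * percentage of the highest one, e.g. 32.
 */
static unsigned int clamp_pct(unsigned int requested_pct,
                              unsigned int hw_min_pct)
{
    assert(hw_min_pct <= 100);

    if (requested_pct < hw_min_pct)
        return hw_min_pct;      /* e.g. 0 or 10 becomes 32 */
    if (requested_pct > 100)
        return 100;             /* never above the maximum */
    return requested_pct;
}
```

With hw_min_pct = 32, a request of min_pct = max_pct = 0 would end up as
min_pct = max_pct = 32 under this sketch, i.e. the CPU pinned at its lowest
P-state.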
 

> 
> >> Also, you calling "powersave" what supposedly is "ondemand"
> >> makes me nervous about it not immediately raising the CPU freq when
> >> load increases, yet imo that's a fundamental requirement for server
> >> kind loads where you don't want to run in "performance" mode. Can you
> >> clarify the behavior here?
> >
> > The timer fires every 10ms to update the CPU P-state according to the
> > sampled workload info.
> 
> But that doesn't tell what the action is that the timer initiates. I.e.
> under what conditions it would effect a frequency change.

Each time the timer function is invoked, it gets the CPU utilization
statistics from the APERF and MPERF MSRs. The current CPU utilization is
compared to the previous one to generate a scale factor, which scales the
frequency up or down.
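
A simplified sketch of that calculation follows. It is only an illustration
under my own assumptions, not the code in the patches: struct perf_sample,
busy_pct() and scale_pct() are made-up names, and the "current over previous"
ratio in scale_pct() is just one plausible reading of the comparison described
above. The part that is standard is the ratio delta(APERF) / delta(MPERF),
which gives the average delivered frequency relative to the reference
frequency over the sampling interval.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two counters taken by the timer. */
struct perf_sample {
    uint64_t aperf;   /* counts at the actual (delivered) frequency */
    uint64_t mperf;   /* counts at the fixed reference frequency */
};

/*
 * delta(APERF) / delta(MPERF) between two samples, as a percentage.
 * Returns 0 if MPERF did not advance (nothing to measure).
 * Deltas over a ~10ms interval are small, so the multiply by 100
 * is assumed not to overflow.
 */
static uint64_t busy_pct(const struct perf_sample *prev,
                         const struct perf_sample *cur)
{
    uint64_t d_aperf = cur->aperf - prev->aperf;
    uint64_t d_mperf = cur->mperf - prev->mperf;

    if (d_mperf == 0)
        return 0;
    return d_aperf * 100 / d_mperf;
}

/*
 * Assumed form of the scale factor: current busy value relative to the
 * previous one, in percent. 100 means "leave the frequency unchanged".
 */
static uint64_t scale_pct(uint64_t prev_busy, uint64_t cur_busy)
{
    if (prev_busy == 0)
        return 100;
    return cur_busy * 100 / prev_busy;
}
```

For instance, if the busy value goes from 50 to 75 between two samples, this
sketch yields a scale factor of 150, i.e. a request to raise the frequency;
the result would still be clamped to the [min_pct, max_pct] interval.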

Best,
Wei
> 
> Jan

