
Re: [Xen-devel] [PATCH 2 of 2] linux-xencommons: Load processor-passthru



>>> Furthermore, given the purpose of the driver, I'm thinking that this
>>> is too late in the boot process anyway, and the driver should rather
>>> be loaded at the point where other CPUFreq drivers would normally
>>> be by a particular distro (which would still be later than when the

In Ubuntu and Fedora the relevant CPUFreq drivers are set to:
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_ACPI_CPUFREQ=y

(though prior to Fedora 16, POWERNOW_K8 was =m, per
https://bugzilla.redhat.com/show_bug.cgi?id=739159, "powernow-k8
transitions fail in xen dom0")

Looking at how the build is done with =y, the cpufreq drivers are
initialized in the late_initcall stage and in a strict order
(drivers/cpufreq/Makefile):
  # x86 drivers.
  # Link order matters. K8 is preferred to ACPI because of firmware bugs
  # in early K8 systems. ACPI is preferred to all other hardware-specific
  # drivers.
  # speedstep-* is preferred over p4-clockmod.

Both of them (acpi-cpufreq.c and powernow-k8.c) have a symbol
dependency on drivers/acpi/processor.c
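
For context, only one scaling driver can be registered at a time:
cpufreq_register_driver() guards a single global cpufreq_driver
pointer, which is why link order decides the winner when several
drivers are built in. A heavily trimmed paraphrase of the guard (from
memory, not verbatim):

int cpufreq_register_driver(struct cpufreq_driver *driver_data)
{
	unsigned long flags;

	spin_lock_irqsave(&cpufreq_driver_lock, flags);
	if (cpufreq_driver) {
		/* powernow-k8 got here first; acpi-cpufreq backs off. */
		spin_unlock_irqrestore(&cpufreq_driver_lock, flags);
		return -EBUSY;
	}
	cpufreq_driver = driver_data;
	spin_unlock_irqrestore(&cpufreq_driver_lock, flags);

	/* ... subsys registration and per-CPU policy setup follow ... */
	return 0;
}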

>>> respective code gets run on traditional Xenified Linux).
>>
>> My thinking is to keep the changes within the Xen ecosystem.

.. snip..
>> The fix to the tool-stack would be something along:
.. snip..
>> diff -r 917f845f3e83 tools/hotplug/Linux/init.d/xencommons
>> --- a/tools/hotplug/Linux/init.d/xencommons   Thu Feb 23 13:29:00 2012 -0500
>> +++ b/tools/hotplug/Linux/init.d/xencommons   Fri Feb 24 21:42:20 2012 -0500
>> @@ -58,7 +58,7 @@ do_start () {
>>       modprobe xen-gntdev 2>/dev/null
>>       modprobe evtchn 2>/dev/null
>>       modprobe gntdev 2>/dev/null
>> -
>> +     for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo xen > $file; done 2>/dev/null
>
> While not an unreasonable idea (yes, I sort of contradict my other mail
> from a few minutes ago, given your explanation above), this won't work
> with subsequently onlined vCPU-s (and hence, consistent with my other

This udev rule could do it? [completely untested]

ACTION=="add", SUBSYSTEM=="cpu", RUN+="/bin/sh -c '[ ! -e /sys$devpath/online ] || echo xen > /sys$devpath/cpufreq/scaling_governor'"

> mail, having the thing be a governor is likely the wrong choice - it ought
> to be a driver).
>

By driver, I think you mean either a) a new cpufreq scaling driver, or
b) the processor-core driver.

Option b) is what the old XenoLinux tree had, and a simplified version
of it was posted on LKML (https://lkml.org/lkml/2011/11/30/245).
There was no response from Len Brown about it, and from off-list email
conversations I got the hint that Len wasn't particularly thrilled
about it.

For a), this would mean some form of unregistering the existing
cpufreq scaling driver. The reason is that we want to keep using the
generic drivers (acpi-cpufreq and powernow-k8), because they already
do all the filtering and parsing of the ACPI data, instead of
re-implementing that in our own cpufreq-xen-scaling driver. One other
option is to export the powernow-k8 and acpi-cpufreq functions that do
this and call them from the cpufreq-xen-scaling driver, but that
sounds icky.

I think the "unregistering" would be akin to
cpufreq_unregister_driver(), but accepting a NULL argument to mean
"the currently registered driver". However, we would be the only user
of this mechanism and I am not sure what Dave Jones's feelings are
about that (CC-ed here). Here is what I was thinking (this is of
course completely untested/uncompiled):

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 622013f..049fca9 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1881,9 +1881,12 @@ int cpufreq_unregister_driver(struct cpufreq_driver *driver)
 {
        unsigned long flags;

-       if (!cpufreq_driver || (driver != cpufreq_driver))
+       if (!cpufreq_driver)
                return -EINVAL;

+       if (!driver)
+               driver = cpufreq_driver;
+
        pr_debug("unregistering driver %s\n", driver->name);

        subsys_interface_unregister(&cpufreq_interface);
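
As a usage illustration (hypothetical and uncompiled - the driver name
and the omitted callbacks are placeholders): a cpufreq-xen-scaling
driver could then evict whichever generic driver won at late_initcall
time and register itself:

#include <linux/init.h>
#include <linux/cpufreq.h>

/* Callbacks omitted for brevity; cpufreq requires at least .init,
 * .verify, and .target (or .setpolicy) to be filled in. */
static struct cpufreq_driver xen_cpufreq_driver = {
	.name = "xen",
	/* ... */
};

static int __init xen_cpufreq_scaling_init(void)
{
	int err;

	/* With the patch above, NULL now means "whichever scaling
	 * driver is currently registered" (acpi-cpufreq or
	 * powernow-k8 on the distros mentioned earlier). */
	err = cpufreq_unregister_driver(NULL);
	if (err)
		return err;

	return cpufreq_register_driver(&xen_cpufreq_driver);
}
/* late_initcall_sync runs after the plain late_initcalls, so the
 * generic driver has already registered by the time we evict it -
 * one possible answer to the ordering concern below. */
late_initcall_sync(xen_cpufreq_scaling_init);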

There would also be a need to tweak the load order so that the
cpufreq-xen-scaling driver gets loaded _after_ all of the other ones.
And I am afraid of the scenario where a cpufreq-gov-* tries to set a
specific frequency just as we yank the driver out from under it,
ending up with a NULL pointer dereference - though I think
"cpufreq_driver_lock" inhibits that. The other thing that strikes me
as dangerous is that we never call the module_exit of the cpufreq
scaling driver we unregistered.

While I was writing this up I thought about the overall picture.
Right now, the APIs work in this way:

[acpi processors] -> [cpufreq-scaling] <- [cpufreq-gov-*]

And we want to leverage them to do two things:
1) Inhibit cpufreq-gov-* from telling the cpufreq-scaling drivers what
to do [which is a bug right now with Linux dom0 pvops, as
cpufreq-gov-* has no clue about the other domains and might set the
P-states to something incorrect].
2) Upload the power management information to the hypervisor.

Option 1) can be done by ignoring the WRMSR or outb in the hypervisor
(but for that we need 2), to know _which_ ones to inhibit). Or it can
be done, per your suggestion, by introducing a cpufreq scaling driver
that nops all calls and unregisters the other drivers. Or lastly by a
cpufreq-gov-xen governor, which would leave it to the hypervisor to do
it.
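
A minimal sketch of those no-op callbacks (again hypothetical and
untested; a real driver would also need an .init that fills in
policy->cpuinfo from the ACPI data):

/* Accept whatever policy the governors propose ... */
static int xen_nop_verify(struct cpufreq_policy *policy)
{
	return 0;
}

/* ... and swallow frequency requests without touching MSRs or IO
 * ports - actual P-state control stays with the hypervisor. */
static int xen_nop_target(struct cpufreq_policy *policy,
			  unsigned int target_freq,
			  unsigned int relation)
{
	return 0;
}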

Option 2) can be done either by the proposed cpufreq-gov-xen, or by
the earlier patch set I sent called processor-passthru
[https://lwn.net/Articles/483668/], which is a simple driver that
checks when a cpufreq-scaling driver is loaded and then uploads the
power management data. It does nothing with the cpufreq-[scaling|gov]
APIs. Granted, it does need a 'modprobe processor-passthru' in
xencommons to take advantage of the data - and yes, it does deal with
hot-CPU onlining.
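
Very roughly, and with a made-up name for the upload step (the real
code is in the patch set linked above), its shape is:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/cpu.h>

/* Hypothetical placeholder: package the ACPI Px/Cx data that the
 * generic scaling driver already parsed for this CPU and hand it to
 * the hypervisor. */
extern int xen_upload_pm_data(unsigned int cpu);

static int __init processor_passthru_init(void)
{
	unsigned int cpu;

	/* By the time xencommons modprobes this, acpi-cpufreq or
	 * powernow-k8 has done the ACPI parsing - we only forward. */
	for_each_online_cpu(cpu)
		xen_upload_pm_data(cpu);

	/* A CPU hotplug notifier (not shown) covers vCPUs that are
	 * onlined later. */
	return 0;
}
module_init(processor_passthru_init);
MODULE_LICENSE("GPL");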

Thoughts?
