Re: [Xen-devel] [Patch 0/6] xen: cpupool support
Hi George,

thanks for the overall positive feedback :-)

George Dunlap wrote:
> Juergen,
>
> Thanks for doing this work. Overall things look like they're going in
> the right direction.
>
> However, it's a pretty big change, and I'd like to hear some more
> opinions. So to facilitate discussion on the list, would you please
> send out another e-mail with:
> 1. A description of the motivation for this work (the basic problem
> you're trying to solve)

The basic problem we ran into was a weakness of the current credit
scheduler. We wanted to be able to pin multiple domains to a subset of
the physical processors and hoped the scheduler would still schedule
the domains according to their weights. Unfortunately this was not the
case.

Our motivation for doing this is the license model we are using for our
BS2000 OS. The customer buys a specific amount of processing power for
which he pays a monthly fee. He might use multiple BS2000 domains, but
the overall consumable BS2000 power is restricted according to his
license by allowing BS2000 to run only on a subset of the physical
processors. Other domains are allowed to run on the other processors.

As pinning the BS2000 domains to the processor subset was not working,
we thought of two possible solutions:
- fix the credit scheduler to support our request
- introduce cpupools, giving BS2000 its own scheduler without explicit
  pinning of cpus

Fixing the scheduler to support weights correctly in the case of cpu
pinning seemed to be a complex task with little benefit for others. The
cpupool approach seemed to be the better solution, as it has more
general use cases:
- a solution for our problem
- better scalability of the scheduler for large cpu numbers
- a potential base for NUMA systems
- support of "software partitions" with more flexibility than hardware
  partitions
- easy grouping of domains

> 2. A description overall of what cpu pools does

The idea is to have multiple pools of cpus, each pool having its own
scheduler. Each physical cpu is a member of at most one pool; a pool
can have multiple cpus assigned to it. A domain is assigned to a pool
at creation time and can then run only on the physical cpus assigned to
that pool. Domains can be moved from one pool to another, and cpus can
be removed from or added to a pool. The scheduler of each pool is
selected at pool creation. Changing the scheduling parameters of a pool
affects only the domains of that pool. Each scheduler "sees" only the
cpus of its own pool (e.g. each pool using the credit scheduler has its
own master cpu, its own load balancing, ...). (A rough sketch of the
data relationships involved follows the list of design decisions
below.)

On system boot Pool-0 is created as the default pool. By default all
physical processors are assigned to Pool-0; the number of cpus in
Pool-0 can be reduced via a boot parameter. Domain 0 is always assigned
to Pool-0 and can't be moved to another pool. Cpus not assigned to any
pool can run only the idle domain.

There were several design decisions to make:
- Idle domain handling: either keep the current solution (one idle
  domain with a pinned vcpu for each physical processor) or have one
  idle domain per pool. I've chosen the first variant as this solution
  seemed to require fewer changes (see the discussion of this topic
  below).
- Use an existing hypercall for cpupool control or introduce a new one.
  Again I wanted to avoid changing too much code, so I used the domctl
  hypercall (other scheduler-related stuff is handled via this
  hypercall, too).
- Handling of the special case "continue_hypercall_on_cpu": this
  function is used to execute a domain 0 hypercall (or parts of it) on
  a specific physical processor, e.g. for microcode updates of Intel
  cpus. With domain 0 residing in Pool-0 and thus not running on all
  physical processors, this is a problem. I either had to find a
  general solution keeping the semantics of continue_hypercall_on_cpu,
  or to eliminate the need for the function by changing each place
  where it is used. I preferred the general solution (again, see the
  discussion below).
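To make the answer to question 2 a bit more concrete, here is a minimal
sketch of the data relationships described above (plain C; all names
and types are illustrative stand-ins, not the structures from the patch
series): one scheduler instance per pool, at most one pool per physical
cpu, one pool per domain.

#define NR_CPUS 64

typedef unsigned long cpumask_t;        /* one bit per physical cpu     */

struct scheduler;                       /* per-pool scheduler instance  */

struct cpupool {
    int              pool_id;           /* Pool-0 is created at boot    */
    cpumask_t        cpu_valid;         /* cpus assigned to this pool   */
    struct scheduler *sched;            /* selected at pool creation    */
    unsigned int     n_domains;         /* domains living in this pool  */
};

struct domain_stub {
    struct cpupool *pool;               /* fixed at domain creation,
                                           changed only by an explicit
                                           move to another pool         */
};

/* NULL: cpu not assigned to any pool, may only run the idle domain. */
struct cpupool *per_cpu_pool[NR_CPUS];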
The main functional support is in the hypervisor, of course. Here are
the main changes I've made:
- I added a cpupool.c source file to handle all cpupool operations.
- To be able to support multiple scheduler incarnations, some static
  global variables had to be allocated from the heap for each
  scheduler. Each physical processor now has a percpu cpupool pointer,
  and the cpupool structure contains the scheduler reference. Most
  changes in the schedulers are related to the elimination of the
  global variables.
- At domain creation a cpupool id has to be specified. It may be NONE
  for special domains like the idle domain.
- References to cpu_online_mask had to be replaced by the cpu-mask of
  the cpupool in some places.
- continue_hypercall_on_cpu had to be modified to be cpupool aware. See
  below for more details.

> 3. A description of any quirky corner cases you ran into, how you
> solved them, and why you chose the way you did

George, you've read my patch quite well! The corner cases are exactly
the topics you are mentioning below. :-)

> Here are some examples for #3 I got after spending a couple of hours
> looking at your patch:
> * The whole "cpu borrowing" thing

As mentioned above, the semantics of continue_hypercall_on_cpu are
problematic with cpupools. Without cpupools this function temporarily
pins the vcpu performing the hypercall to the specified physical
processor and removes that pinning after the sub-function specified as
a parameter has completed. With cpupools it is no longer possible to
just pin a vcpu to an arbitrary physical processor, as this processor
might be out of reach for the scheduler. First I thought it might be
possible to use on_selected_cpus instead, but the sub-functions used
with continue_hypercall_on_cpu sometimes access guest memory. It would
be possible to allocate a buffer and copy the guest memory to this
buffer, of course, but this would have required a change to all users
of continue_hypercall_on_cpu, which I wanted to avoid.

The solution I've chosen expands the idea of temporarily pinning the
vcpu to a processor by temporarily adding this processor to a cpupool,
if necessary. It is a little bit more complicated than the vcpu
pinning, because after completion of the sub-function on the borrowed
processor this processor has to be returned to its original cpupool,
and this is possible only if the vcpu executing the hypercall is no
longer running on the processor to be returned.

Things are rather easy if the borrowed processor was not assigned to
any cpupool: it can be assigned to the current pool and unassigned
afterwards quite easily. If the processor to be borrowed is assigned to
an active cpupool, however, it must first be unassigned from this pool.
This could leave the pool without any processor, resulting in strange
behaviour. As the need for continuing a hypercall on processors outside
the current cpupool seems to be a rare event, I've chosen to suspend
all domains in the cpupool from which the processor is borrowed until
the processor is returned.
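To illustrate the borrowing scheme, here is a rough sketch of the flow
just described. All functions below are hypothetical stubs standing in
for hypervisor services; they are not the interfaces used in the patch.

#include <stddef.h>

struct cpupool;

extern struct cpupool *pool_of_cpu(int cpu);      /* NULL: unassigned    */
extern struct cpupool *current_pool(void);        /* pool of current vcpu */
extern void suspend_domains(struct cpupool *p);   /* park all its domains */
extern void resume_domains(struct cpupool *p);
extern void move_cpu(int cpu, struct cpupool *from, struct cpupool *to);
extern void run_continuation_on(int cpu);         /* run the sub-function
                                                     pinned to 'cpu'      */

void continue_hypercall_on_cpu_sketch(int cpu)
{
    struct cpupool *mine  = current_pool();
    struct cpupool *other = pool_of_cpu(cpu);

    if (other == mine) {
        /* Easy case: temporary vcpu pinning is enough. */
        run_continuation_on(cpu);
        return;
    }

    if (other != NULL)
        /* Borrowing would leave the other pool short of a cpu, so its
           domains are suspended until the cpu is returned (borrowing is
           expected to be a rare event). */
        suspend_domains(other);

    move_cpu(cpu, other, mine);           /* borrow the cpu              */
    run_continuation_on(cpu);
    move_cpu(cpu, mine, other);           /* return it once the borrowing
                                             vcpu has left the cpu       */
    if (other != NULL)
        resume_domains(other);
}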
> * Dealing with the idle domain

I've chosen to stay with one global idle domain instead of per-cpupool
idle domains for two main reasons:
- I felt uneasy about changing a central concept of the hypervisor.
- Assigning a processor to or unassigning it from a cpupool seemed to
  be more complex with multiple idle domains. Switching the scheduler
  on a processor is a bad idea as long as any non-idle vcpu is running
  on that processor. If the idle vcpus are cpupool specific as well,
  things become really ugly: either you have a vcpu running outside its
  related scheduler, or the current vcpu referenced by the per-cpu
  pointer "current" is invalid for a short period of time, which is
  even worse. This led to the solution of one idle domain whose idle
  vcpus change between schedulers.

Generally the idle domain plays a critical role whenever a processor is
assigned to or unassigned from a cpupool. The critical operation is
changing from one scheduler to another; at this time only an idle vcpu
is allowed to be active on the processor.

> * Why you expose allocating and freeing of vcpu and pcpu data in
> the sched_ops structure

This is related to the supported operations on cpupools. Switching a
processor between cpupools requires changing the scheduler responsible
for this processor, and this requires a change of the pcpu scheduler
data. Without an interface for allocating/freeing the scheduler
specific pcpu data it would be impossible to switch schedulers. The
same applies to the vcpu scheduler data, but there it is related to
moving a domain from one cpupool to another: again the scheduler has to
be changed, but this time for all the vcpus of the moved domain.
Without the capability to move a domain to another cpupool, allocating
and freeing vcpu data would still be necessary for switching processors
(the idle vcpu of the switched processor changes its scheduler as
well), but it would not have to be exposed via sched_ops. (A rough
sketch of such an interface is appended below my signature.)

> Some of these I'd like people to be able to discuss who don't have
> the time / inclination to spend looking at the patch (which could use
> a lot more comments).
>
> As for me: I'm happy with the general idea of the patch (putting cpu
> pools in underneath the scheduler, and allowing pools to have
> different schedulers). I think this is a good orthogonal to the new
> scheduler. I'm not too keen on the whole "cpu borrowing" thing; it
> seems like there should be a cleaner solution to the problem. Overall
> the patches need more comments. I have some coding specific comments,
> but I'll save those until the high-level things have been discussed.

Thanks again for the feedback!

Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6              Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions  e-mail: juergen.gross@xxxxxxxxxxxxxx
Otto-Hahn-Ring 6              Internet: ts.fujitsu.com
D-81739 Muenchen              Company details: ts.fujitsu.com/imprint.html
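Appendix: as mentioned above, a minimal sketch of what per-scheduler
allocation hooks could look like. Hook and type names here are
illustrative, not necessarily those used in the patch; the point is
only that switching a cpu (or a domain) to a pool with a different
scheduler needs a way to tear down and rebuild the scheduler-private
per-cpu and per-vcpu data.

struct vcpu_stub;

struct scheduler_ops {
    const char *name;
    void *(*alloc_pdata)(int cpu);              /* per-physical-cpu data */
    void  (*free_pdata)(void *pdata);
    void *(*alloc_vdata)(struct vcpu_stub *v);  /* per-vcpu data         */
    void  (*free_vdata)(void *vdata);
};

/* Switching a cpu from scheduler 'old_ops' to 'new_ops': only an idle
   vcpu may be active on the cpu while its private data is exchanged.  */
void *switch_cpu_scheduler(int cpu, void *old_pdata,
                           const struct scheduler_ops *old_ops,
                           const struct scheduler_ops *new_ops)
{
    void *new_pdata = new_ops->alloc_pdata(cpu);
    old_ops->free_pdata(old_pdata);
    return new_pdata;
}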