[Xen-devel] Sched planning: Ideal scheduler, and credit1
Sorry that this mail has been so long in coming. I'm still trying to work out kinks in a new algorithm, but I think it's important to keep people up-to-date on where I am now.

This is actually rather long. I'd most like feedback on the last part; but I think a lot of people will wonder why I'm trying something totally new instead of trying to fix what's already there. So if you only have a little time to give, please skip to the new description. However, if you have more time, and are interested in a more thorough description of the problem, and/or are prone to wonder about ways of solving the problem that don't require a complete re-write, read the whole thing.

The sections are:
* Describing what an ideal scheduler would do
* Describing at a high level what credit1 did, and what effects that had on latency-sensitive workloads
* [future email] Describing a new proposal for how to account for credits, and asking for input / suggestions / criticism

* WWISD? (What Would an Ideal Scheduler Do?)

First, let's review our overall goal: when VMs are competing for cpu time, they should get time in proportion to their weight. This isn't too hard in the case of cpu-bound processes, which will consume as much cpu as is given to them. But many workloads, even if given an entire core to themselves, would only use some percentage of that. (This is probably why the workload was virtualized in the first place.) So we want to give VMs either their "fair share" or whatever they're happy with, whichever is less. And we want to divide the remaining portion, by weight, among the VMs that do want more.

For a concrete example, suppose that we have a box with 2 cores, and 4 single-vcpu VMs with the following weights:
  a: 35
  b: 35
  c: 65
  d: 65
And suppose that each VM, if run by itself, would consume the following percentages of the cpu:
  a: 50%
  b: 40%
  c: 90%
  d: 40%
Then an ideal scheduler would give the following percentages when they are all run together on a 2-core system:
  a: 42%
  b: 40%
  c: 78%
  d: 40%
I leave it as an exercise to the reader to work out how I got these numbers (or see the sketch at the end of this section). But note that b and d both got the amount of cpu time they wanted (40%), and the remainder is split, according to weight, between a and c (i.e., 35/65 == 42/78). We don't need our scheduler to match this "ideal" exactly, but it should be within a few percentage points.

Our ideal scheduler would also allow latency-sensitive VMs, which will only run for a short time and then halt again, to run before longer-running cpu-intensive VMs. So if b is a network-bound workload, and d is a video-playing workload, we want them to be scheduled as soon as they become runnable (or very shortly afterwards). VM 'a' may also be a network-bound workload if run on an idle system. In this case, we don't want to run it whenever it wants to run (since we only want it to get 42% of the cpu), but we want to schedule it in such a way that it can still use its full 42%, which probably means letting it run soon after it becomes runnable a decent amount of the time.

(Comments on my portrayal of the ideal scheduler are welcome.)
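For anyone who doesn't want to do the exercise, here is a small standalone C sketch of the computation. It's my own illustration of the "give each VM min(its demand, its fair share) and re-divide the leftover by weight" rule described above, not anything from the Xen tree; the weights and demands are the ones from the example:

#include <stdio.h>

struct vm {
    const char *name;
    double weight;
    double demand;      /* % of cpu the VM would use if run alone */
    double alloc;
    int satisfied;
};

int main(void)
{
    struct vm vms[] = {
        { "a", 35, 50, 0, 0 },
        { "b", 35, 40, 0, 0 },
        { "c", 65, 90, 0, 0 },
        { "d", 65, 40, 0, 0 },
    };
    int n = sizeof(vms) / sizeof(vms[0]);
    double capacity = 200.0;            /* 2 cores x 100% */
    int progress = 1;

    /* Repeatedly satisfy any VM whose demand fits within its
     * weight-proportional share of the capacity still unclaimed,
     * then re-divide what is left among the rest. */
    while (progress) {
        double wsum = 0.0, cap = capacity;

        progress = 0;
        for (int i = 0; i < n; i++)
            if (!vms[i].satisfied)
                wsum += vms[i].weight;
        for (int i = 0; i < n; i++) {
            if (vms[i].satisfied)
                continue;
            double share = cap * vms[i].weight / wsum;
            if (vms[i].demand <= share) {
                vms[i].alloc = vms[i].demand;  /* happy with less than its share */
                vms[i].satisfied = 1;
                capacity -= vms[i].alloc;
                progress = 1;
            }
        }
    }

    /* Whoever is left wants more than its share: split the remaining
     * capacity between them strictly by weight. */
    double wsum = 0.0;
    for (int i = 0; i < n; i++)
        if (!vms[i].satisfied)
            wsum += vms[i].weight;
    for (int i = 0; i < n; i++)
        if (!vms[i].satisfied)
            vms[i].alloc = capacity * vms[i].weight / wsum;

    for (int i = 0; i < n; i++)
        printf("%s: %.0f%%\n", vms[i].name, vms[i].alloc);
    /* Prints a: 42%, b: 40%, c: 78%, d: 40% -- the numbers above. */
    return 0;
}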
* Credit1, and why it doesn't work well for video

At a basic level, credit1 has two main priorities: UNDER and OVER. Credits are handed out every 30ms according to weight. VMs that have positive credits are classed as UNDER, and those with negative credits are classed as OVER.

Credit1 has a strong probabilistic element. Every 10ms a "tick" timer fires. When the tick happens, a full 10ms of credit is subtracted from the currently running VM. It relies on an element of randomness to spread this charge appropriately over all active VMs.

However, if we divide credits over all VMs, we have the following problem: if some VMs are idle, the "active" VMs will spend most of their time in the OVER state, while the mostly-idle VMs will spend all of their time in the UNDER state, accumulating more and more credits. So credit1 attempts to distinguish between two kinds of workloads:
* "active" workloads, which are competing for cpu, and need to have credit accounting done.
* "inactive" workloads, which use minimal cpu, and need no credit accounting.

The rules for determining whether a VM is "active" or "inactive" are as follows:
* If a VM accumulates more than 30ms worth of credit, it is marked "inactive". Its credits are taken away and it is set to UNDER priority.
* If a VM is interrupted by a tick, it is marked "active". It is not given any credits immediately, but will be given a share at the next accounting period (which happens every 30ms).
Again, this relies on an element of probability: the more often a VM runs, the higher the probability that it will be interrupted by a tick and be marked "active".

In addition to the "active/inactive" distinction, credit1 has a BOOST priority meant for "interactive" processes that wake up and block on a regular basis. BOOST is the highest priority, so any VM running in BOOST will run before other VMs, without having to wait on the runqueue behind VMs with UNDER priority. The conditions for entering and leaving BOOST are as follows:
* If a VM wakes up from being blocked and is in the UNDER state (either active or inactive), it is marked BOOST priority.
* If a VM is interrupted by a tick, it loses BOOST.

(I was not involved in the design of credit1; all of this is inferred from the code itself, combined with my recent experience trying to approximate the ideal scheduler.)
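To pull those rules together in one place, here is a rough pseudo-C model of the behaviour as I've described it. This is my own simplification for illustration; the names are mine, and the real code in xen/common/sched_credit.c differs in many details:

/* Illustration only: a simplified model of the credit1 rules above. */

enum prio {
    PRIO_BOOST,    /* runs first              */
    PRIO_UNDER,    /* non-negative credits    */
    PRIO_OVER,     /* negative credits        */
};

struct model_vcpu {
    int credit;          /* in ms of cpu time                */
    int active;          /* subject to credit accounting?    */
    enum prio prio;
};

#define TICK_MS        10
#define CREDIT_CAP_MS  30   /* accumulating more than this => "inactive" */

/* Every 10ms: a full tick's worth of credit is charged to whichever
 * vcpu happens to be running when the tick fires. */
static void tick(struct model_vcpu *running)
{
    running->credit -= TICK_MS;
    running->active = 1;                 /* caught running => "active" */
    if (running->prio == PRIO_BOOST)     /* a tick also cancels BOOST  */
        running->prio = PRIO_UNDER;
    if (running->credit < 0)
        running->prio = PRIO_OVER;
}

/* Every 30ms: credit is handed out, by weight, to "active" vcpus only. */
static void accounting(struct model_vcpu *v, int weighted_share_ms)
{
    if (!v->active)
        return;
    v->credit += weighted_share_ms;
    if (v->credit > CREDIT_CAP_MS) {
        v->credit = 0;                   /* credits taken away...      */
        v->active = 0;                   /* ...and back to "inactive"  */
    }
    v->prio = (v->credit >= 0) ? PRIO_UNDER : PRIO_OVER;
}

/* On wakeup: a vcpu in UNDER (active or inactive) jumps to BOOST. */
static void wake(struct model_vcpu *v)
{
    if (v->prio == PRIO_UNDER)
        v->prio = PRIO_BOOST;
}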
A couple of observations about this algorithm. It really seems to be designed for a relatively small number of VMs that are cpu "burners". If a VM is cpu-intensive, uses up all of its credits, and its workload isn't real-time or latency-sensitive, then this algorithm works great: each cpu-intensive VM will get, overall, an amount of time proportional to its weight.

However, if a VM uses some percentage of the cpu between 0 and its "fair share", this algorithm doesn't work that well. Every VM that doesn't use up its credits on a regular basis will eventually get hit by a tick, which means it will be marked "active" with no credits and charged a full 10ms, putting it 10ms into OVER. It will then dig its way out of OVER, and gradually accumulate credits until it has enough to be considered "inactive" again, starting the cycle over. The amount of time spent "active" vs "inactive" is probabilistic and depends on how much the VM runs. So it's worst for workloads like network or video, which consume non-negligible cpu time, but not as much as they could.

The BOOST mechanism does allow latency-sensitive VMs to run immediately during the "inactive" phase of this cycle. But during the "active" phase, such a VM starts out digging its way out of OVER, so it runs after all UNDER VMs instead of before them. This introduces long latencies and interferes with the workload significantly during this phase.

Furthermore, even cpu-hog VMs which are HVM block on qemu on a regular basis, and so (at least some percentage of the time) get marked BOOST and run for up to 10ms (until the next tick), blocking latency-sensitive VMs which are also in BOOST.

Finally, when latency-sensitive workloads are using more than their "fair share" and need to be accounted for, the rough granularity of accounting (being debited 10ms of credit at random intervals) works against them: such VMs will spend long periods in UNDER (and thus also in BOOST), but will then spend long periods in OVER, receiving little cpu time. This is even more so if there are several competing latency-sensitive workloads that wake up often, because at any given time some random subset of them will be in BOOST. This makes performance highly unpredictable and erratic.

This is a summary of the observed problems with the current credit system. The next e-mail will cover some ways we might try to modify the current system, which fail for various reasons, and a proposal for a new way of doing things.

I'd like to comment briefly on the Boost Credit Scheduler, posted earlier this year by Naoki Nishiguchi. The Boost scheduler was based on the current credit scheduler, but added a "boost reservation" feature. It allowed an administrator to assign a VM a guaranteed minimum "reservation" of cpu time. While a VM was running underneath that minimum, it ran at BOOST priority. Multiple competing VMs running at BOOST priority were interleaved at a 1ms granularity to ensure low latency.

The main downside of this approach is that it requires the user / administrator to specify ahead of time how much a given VM will need to do its job properly. If you give the VM a 20% reservation, and the video decoder needs 25% for some reason, you're likely to drop frames again. Furthermore, if your VM has a rogue process that starts burning cpu, you may eat through your 20% "boost" period doing useless spinning, and the rest of the time you'll be back where you started (going into the normal over/under cycle).

I think the reservation is a useful feature for highly over-committed boxes, where even VMs with high weight may be cpu-starved. But I think we should be able to make most things "just work" by default on boxes where there's plenty of cpu time to go around, without resorting to this kind of user intervention.

(As always, comments welcome.)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel