Re: Sketch of an idea for handling the "mixed workload" problem
On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
>
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight
>
> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
>
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
>
> 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> reset: everyone gets another 10ms, and can carry over at most 2ms of
> credit over the reset.
>
> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
>
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few MS, as well as handling
> audio events? How can we make sure that:
>
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
>
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler. How do we
> manage that?
>
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
>
> So the idea would be this. Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
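To make the four credit2 steps quoted above concrete, here is a toy model in Python. It is only a sketch and not Xen code: the names (`Vcpu`, `burn`, `reset`, `pick_next`), the reference weight of 256, and the choice to carry a negative balance across a reset as debt are all assumptions made for illustration. Only the 10ms and 2ms figures come from the description.

```python
from dataclasses import dataclass

CREDIT_INIT_US = 10_000    # "about 10ms worth" of credit per reset
CARRYOVER_MAX_US = 2_000   # at most 2ms of credit survives a reset
REFERENCE_WEIGHT = 256     # assumed weight at which 1us of cpu costs 1 credit


@dataclass
class Vcpu:
    name: str
    weight: int = REFERENCE_WEIGHT
    credit: int = CREDIT_INIT_US   # step 1: everyone starts equal
    runnable: bool = True


def burn(v, ran_us):
    """Step 2: charge for cpu time; higher weights burn slower."""
    v.credit -= ran_us * REFERENCE_WEIGHT // v.weight


def reset(vcpus):
    """Step 4: hand out another 10ms, keeping at most 2ms of old credit.

    Assumption: a negative balance is carried over as debt.
    """
    for v in vcpus:
        v.credit = CREDIT_INIT_US + min(v.credit, CARRYOVER_MAX_US)


def pick_next(vcpus):
    """Step 3: run the runnable vcpu with the most credit.

    If even the best candidate is negative, reset first (step 4).
    """
    runnable = [v for v in vcpus if v.runnable]
    if not runnable:
        return None
    best = max(runnable, key=lambda v: v.credit)
    if best.credit < 0:
        reset(vcpus)
        best = max(runnable, key=lambda v: v.credit)
    return best
```

In this model a vcpu that wakes up rarely keeps a high balance and so wins `pick_next` as soon as it becomes runnable, which is the behaviour described in the paragraph about interrupt-heavy vcpus.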
>
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
>
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval, or until
> some other event (such as a de-boost yield).
>
> The queue would be sorted thus:
>
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
>
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
>
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)
> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area

I think both of these would be good. Another one would be when Xen is
about to deliver an interrupt to a guest, provided that there is no
storm of interrupts. I’ve seen a USB webcam cause a system-wide latency
spike through what I presume is an interrupt storm, and I suspect that
others have observed similar behavior with USB external drives.

> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; setting this up when the VM is in the boost state

That’s a good idea, but should be conditional on “dont_desched” _not_
being set. This handles the case where the guest is running a realtime
thread.

Generally, I’d like to see something like this:

- A vCPU with sufficient boost credit is boosted by Xen under the
  following conditions:

  1. Xen interrupts the guest.
  2. Xen is about to preempt, but detects that “dont_desched” is set.
  3. Xen is about to preempt, but detects that interrupts are disabled.

- A vCPU is deboosted if:

  1. It runs out of boost credit, even if “dont_desched” is set.
  2. An interrupt handler returns, but only if “dont_desched” is not
     set.
  3. Interrupts are re-enabled, but only if “dont_desched” is not set.

  The first case is an abnormal condition and typically means that
  either the system is overloaded or a vCPU is running boosted for too
  long. To help debug this situation, Xen will log a warning and
  increment both a system-wide and a per-domain counter. dom0 can
  retrieve counters for any domain, and a domain can read its own
  counter.

- When to set “dont_desched” is entirely up to the guest kernel, but
  there are some general rules guests should follow:

  - Only set “dont_desched” if there is a good reason, and unset it as
    soon as possible. Xen gives vCPUs with “dont_desched” set priority
    over all other vCPUs on the system, but the amount of time a vCPU is
    allowed to run with an elevated priority is limited. Xen will log a
    warning if a guest tries to run with elevated priority for too long.

  - Xen boosts vCPUs before delivering an interrupt, but there should be
    a way for a vCPU to deboost itself even before returning from the
    interrupt handler.

  - Guests should always set “dont_desched” when running hard-realtime
    threads (used for e.g. audio processing), even when the thread is in
    userspace. This ensures that Xen gives the underlying vCPU priority
    over vCPUs

  - Guests should always set “dont_desched” when holding a spin lock,
    but it is even better to use paravirtualized spin locks (which make
    a hypercall into Xen and therefore allow other vCPUs to run).

  - Xen does not implement priority inheritance, so guests need to do
    that.

- Max boost credits can be set by dom0 via a hypercall.

The advantage of this approach is that it keeps almost all policy out of
Xen.
The only exception is the boosting when an interrupt is received, but a
well-behaved guest will deboost itself very quickly (by enabling
interrupts) if the boost was not actually needed, so this should have
very limited impact.

I think this should be enough for realtime audio, and it is somewhat
related to (but hopefully simpler than) the KVM RFC from Google [1].
Any thoughts on this?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: https://lore.kernel.org/kvm/20231214024727.3503870-1-vineeth@xxxxxxxxxxxxxxx/
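For concreteness, the boost and deboost rules proposed above, together with the queue ordering from the quoted text, could be modelled roughly as follows. This is a toy Python sketch and not Xen code: every name in it (`BoostVcpu`, `Event`, `on_event`, `burn_boost`, `queue_key`, `overrun_count`) is hypothetical, "sufficient boost credit" is assumed to mean "more than zero", and the warning plus the system-wide and per-domain counters are reduced to a single per-vCPU counter.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    INTERRUPT_DELIVERY = auto()   # Xen is about to interrupt the guest
    PREEMPT_ATTEMPT = auto()      # Xen is about to preempt the vCPU
    HANDLER_RETURN = auto()       # an interrupt handler returned
    INTERRUPTS_ENABLED = auto()   # the guest re-enabled interrupts


@dataclass
class BoostVcpu:
    boost_credit: int
    credit: int = 0                    # non-boost bucket
    dont_desched: bool = False         # guest-controlled shared-memory bit
    interrupts_disabled: bool = False
    boosted: bool = False
    overrun_count: int = 0             # stand-in for the debug counters


def on_event(v, ev):
    """Apply the boost and deboost conditions for one event."""
    if not v.boosted:
        if v.boost_credit <= 0:
            return  # nothing saved up, so no boost is possible
        if ev is Event.INTERRUPT_DELIVERY:
            v.boosted = True
        elif ev is Event.PREEMPT_ATTEMPT and (
            v.dont_desched or v.interrupts_disabled
        ):
            v.boosted = True
        return
    # Already boosted: the guest-visible deboost triggers only apply
    # while "dont_desched" is clear.
    if ev in (Event.HANDLER_RETURN, Event.INTERRUPTS_ENABLED):
        if not v.dont_desched:
            v.boosted = False


def burn_boost(v, ran_us):
    """Charge boosted run time; exhaustion deboosts unconditionally."""
    if not v.boosted:
        return
    v.boost_credit -= ran_us
    if v.boost_credit <= 0:
        v.boosted = False
        v.overrun_count += 1  # the abnormal case worth logging


def queue_key(v):
    """Sort key: boosted vCPUs first, each group by its own bucket."""
    if v.boosted:
        return (0, -v.boost_credit)
    return (1, -v.credit)
```

A runqueue in this model is just `sorted(vcpus, key=queue_key)`. Note that `dont_desched` only suppresses the two guest-visible deboost triggers; it never protects a vCPU that has run out of boost credit.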