
Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling



On 09/03/16 13:39, Jan Beulich wrote:
>>>> On 08.03.16 at 19:38, <george.dunlap@xxxxxxxxxx> wrote:
>> Still -- I have a hard time constructing in my mind a scenario where
>> huge numbers of idle vcpus for some reason decide to congregate on a
>> single pcpu.
>>
>> Suppose we had 1024 pcpus, and 1023 VMs each with 5 vcpus, of which 1
>> was spinning at 100% and the other 4 were idle.  I'm not seeing a
>> situation where any of the schedulers put all (1023*4) idle vcpus on a
>> single pcpu.
> 
> As per my understanding idle vCPU-s don't get migrated at all.
> And even if they do, their PI association with a pCPU doesn't
> change (because that gets established once and for all at the
> time the vCPU blocks).
> 
>> For the credit1 scheduler, I'm basically positive that it can't happen
>> even once, even by chance.  You'd never be able to accrete more than a
>> dozen vcpus on that one pcpu before they were stolen away.
> 
> Isn't stealing here happening only for runnable vCPU-s?
> 
>> And in any case, are you really going to have 1023 devices so that you
>> can hand one to each of those 1023 guests?  Because it's only vcpus of
>> VMs *which have a device assigned* that end up on the block list.
> 
> Who knows what people put in their (huge) systems, or by what
> factor the VF/PF ratio will grow in the next few years?
> 
>> If I may go "meta" for a moment here -- this is exactly what I'm talking
>> about with "Something bad may happen" being difficult to work with.
>> Rather than you spelling out exactly the situation which you think may
>> happen, (which I could then either accept or refute on its merits) *I*
>> am now spending a lot of time and effort trying to imagine what
>> situations you may be talking about and then refuting them myself.
> 
> I thought I was precise enough (without going into too much detail),
> but looks like I wasn't.
> 
> 1) vCPU1 blocks on pCPU1 (indefinitely for the purpose here)
> 2) vCPU2 gets migrated to pCPU1 and blocks (indefinitely ...)
> ...
> n) vCPUn gets migrated to pCPU1 and blocks (indefinitely ...)
> n+1) a PI wakeup interrupt arrives on pCPU1
> 
> In this consideration it doesn't matter whether the vCPU-s are all
> from the same or different VMs. The sole requirement is that they
> must satisfy the condition(s) to be put on the blocking list.
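
(To make the structure under discussion concrete: the concern is roughly the
shape sketched below.  This is Xen-style C but only a sketch -- the list,
per-CPU and lock primitives are the real ones, while the struct layout and
the pi_* names are illustrative rather than the actual identifiers in the
patch.)

    /* Sketch only: the pi_* names and the struct layout are illustrative. */
    struct pi_blocking_list {
        spinlock_t       lock;
        struct list_head vcpus;    /* vCPUs blocked here with PI armed */
    };
    static DEFINE_PER_CPU(struct pi_blocking_list, pi_blocking);

    /*
     * Runs in interrupt context when the PI wakeup vector fires on this
     * pCPU.  Cost is linear in the list length, and nothing above bounds
     * how long the list can get -- that unboundedness is the concern.
     */
    static void pi_wakeup_handler(void)
    {
        struct pi_blocking_list *pbl = &this_cpu(pi_blocking);
        struct vcpu *v, *tmp;

        spin_lock(&pbl->lock);
        list_for_each_entry_safe ( v, tmp, &pbl->vcpus, pi_blocking_entry )
        {
            if ( pi_notification_pending(v) )   /* illustrative predicate */
            {
                list_del(&v->pi_blocking_entry);
                vcpu_unblock(v);
            }
        }
        spin_unlock(&pbl->lock);
    }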

Right -- so here's one of our differing assumptions.  In my experience
there is no such thing as a truly idle vcpu: they always wake up at
least occasionally (usually a few times a second) for some reason or
other.  (Which is why I talked about the load of each idle vcpu being
less than 0.02%.)  So I was assuming that the vcpu would be stolen
during the 0.02% of the time it was running.

But let's suppose that's not the case.  Even then, the chances of something
like what you're talking about happening are astronomically small.

So for this to work, you have to have a set of "perversely idle" vcpus,
call it set PI, which do a normal amount of work -- enough to get the
attention of the load balancer -- and then mysteriously block, taking
almost no interrupts at all, for an incredibly long amount of time (on
the order of minutes at least), and then wake up.  The chance of having
a large number of these in itself is pretty minimal.

Then we have the problem of how we are going to get the perversely idle
vcpus onto the same pcpu (call it p).  Well, somehow all the other cpus
have to be busy, which means we have to have almost exactly the right
number of normally working vcpus (call this set W) to keep all the
*other* pcpus busy.

For a member of PI to be moved to p, p itself has to be idle.  Suppose
we start with a random distribution, and it happens that p only has one
member of PI on it.  So the load balancer moves some more work there.
If it happens to grab a vcpu from the set W, then the whole thing stops
until that vcpu is migrated away, because now p isn't idle -- it's got a
fairly busy vcpu on it.  But of course, if p is busy, then some other pcpu
is now idle, and so *it* will start attracting members of PI, until it
accidentally grabs a member of W, &c &c.

And of course, only one vcpu will be moved to p at a time.  If all the
vcpus in PI block at the same time, most of them will stay just where
they are. So we have further constraints on PI: Not only do they have to
have this unnatural "Run then block completely" pattern; they must block
in a staged fashion, so that they can be moved one-by-one onto p.

And not only that, the load balancer has to *migrate* them in the right
order.  If it grabs a vcpu that won't block until all the other ones
have already blocked, then p will be busy and the other vcpus in PI will
end up on other pcpus.

So for your nightmare scenario to happen, we must have hundreds of vcpus
which exhibit this strange blocking pattern; they must block in a staged
fashion; the load balancer, when choosing work to move onto p, has to
somehow, 100 times in a row (500 times? 1000 times?), select a running
vcpu that is in PI, instead of a running vcpu which is in W; and not
only that, it has to grab the vcpus in PI *in the order in which they
are going to block*, at least 100 (500 / 1000) times.
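
(To put a very rough number on that: suppose, purely for the sake of
argument, that each steal picked uniformly at random among the runnable
vcpus and that half of them were in PI.  Then picking a PI vcpu 100 times
in a row is a 2^-100 event -- around 10^-30 -- and picking them *in the one
order in which they are going to block* throws in a further factor of about
1/100!, which is on the order of 10^-158.  The load balancer obviously
isn't a uniform random pick, but that gives a feel for the scale involved.)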

This is just incredibly far-fetched.  By the time this happens to
someone they will already have been struck by lightning 50 times and won
the billion dollar Powerball jackpot twice; at that point they won't care.

>>> Hence I
>>> think the better approach, instead of improving lookup, is to
>>> distribute vCPU-s evenly across lists. Which in turn would likely
>>> require those lists to no longer be tied to pCPU-s, an aspect I
>>> had already suggested during review. As soon as distribution
>>> would be reasonably even, the security concern would vanish:
>>> Someone placing more vCPU-s on a host than that host can
>>> handle is responsible for the consequences. Quite contrary to
>>> someone placing more vCPU-s on a host than a single pCPU can
>>> reasonably handle in an interrupt handler.
>>
>> I don't really understand your suggestion.  The PI interrupt is
>> necessarily tied to a specific pcpu; unless we start having multiple PI
>> interrupts, we only have as many interrupts as we have pcpus, right?
>> Are you saying that rather than put vcpus on the list of the pcpu it's
>> running on, we should set the interrupt to that of an arbitrary pcpu
>> that happens to have room on its list?
> 
> Ah, right, I think that limitation was named before, yet I've
> forgotten about it again. But that only slightly alters the
> suggestion: To distribute vCPU-s evenly would then require changing
> their placement on the pCPU in the course of entering the
> blocked state.

Right -- well having a mechanism to limit the total number of pi-capable
vcpus assigned to a single pcpu would be something we could consider too
-- once we have an idea what kind of number that might be.
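
(If we did go that way, the shape might be roughly the sketch below,
building on the per-pCPU list structure above with an added nr_entries
counter.  The threshold and every identifier here are hypothetical --
picking the actual number is exactly the open question.)

    /* Illustrative only: the limit and all names are made up. */
    #define PI_BLOCKED_MAX 128   /* placeholder value */

    static bool pi_try_add_to_list(struct vcpu *v, unsigned int cpu)
    {
        struct pi_blocking_list *pbl = &per_cpu(pi_blocking, cpu);
        bool ok = false;

        spin_lock(&pbl->lock);
        if ( pbl->nr_entries < PI_BLOCKED_MAX )
        {
            list_add_tail(&v->pi_blocking_entry, &pbl->vcpus);
            pbl->nr_entries++;
            ok = true;
        }
        spin_unlock(&pbl->lock);

        /*
         * On failure the caller would have to pick some other pCPU's list
         * (and retarget the vCPU's notification destination to match),
         * which is where "change their placement ... in the course of
         * entering the blocked state" comes in.
         */
        return ok;
    }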

 -George



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

