
Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling




> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxx]
> Sent: Wednesday, March 9, 2016 2:39 AM
> To: Jan Beulich <JBeulich@xxxxxxxx>; George Dunlap
> <George.Dunlap@xxxxxxxxxxxxx>; Wu, Feng <feng.wu@xxxxxxxxx>
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>; Dario Faggioli
> <dario.faggioli@xxxxxxxxxx>; Tian, Kevin <kevin.tian@xxxxxxxxx>; xen-
> devel@xxxxxxxxxxxxx; Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Keir
> Fraser <keir@xxxxxxx>
> Subject: Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt
> core logic handling
> 
> On 08/03/16 17:26, Jan Beulich wrote:
> >>>> On 08.03.16 at 18:05, <george.dunlap@xxxxxxxxxx> wrote:
> >> On 08/03/16 15:42, Jan Beulich wrote:
> >>>>>> On 08.03.16 at 15:42, <George.Dunlap@xxxxxxxxxxxxx> wrote:
> >>>> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng.wu@xxxxxxxxx> wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxx]
> >>>>>>
> >>>>>> 2. Try to test engineered situations where we expect this to be a
> >>>>>> problem, to see how big of a problem it is (proving the theory to be
> >>>>>> accurate or inaccurate in this case)
> >>>>>
> >>>>> Maybe we can run an SMP guest with all the vcpus pinned to a dedicated
> >>>>> pCPU, run some benchmark in the guest with VT-d PI and without
> >>>>> VT-d PI, and then see the performance difference between these two
> >>>>> scenarios.
> >>>>
> >>>> This would give us an idea what the worst-case scenario would be.
> >>>
> >>> How would a single VM ever give us an idea about the worst
> >>> case? Something getting close to worst case is a ton of single
> >>> vCPU guests all temporarily pinned to one and the same pCPU
> >>> (could be multi-vCPU ones, but the more vCPU-s the more
> >>> artificial this pinning would become) right before they go into
> >>> blocked state (i.e. through one of the two callers of
> >>> arch_vcpu_block()), the pinning removed while blocked, and
> >>> then all getting woken at once.
> >>
> >> Why would removing the pinning be important?
> >
> > It's not important by itself, other than to avoid all vCPU-s then
> > waking up on the one pCPU.
> >
> >> And I guess it's actually the case that it doesn't need all VMs to
> >> actually be *receiving* interrupts; it just requires them to be
> >> *capable* of receiving interrupts, for there to be a long chain all
> >> blocked on the same physical cpu.
> >
> > Yes.
> >
> >>>>  But
> >>>> pinning all vcpus to a single pcpu isn't really a sensible use case we
> >>>> want to support -- if you have to do something stupid to get a
> >>>> performance regression, then as far as I'm concerned it's not a
> >>>> problem.
> >>>>
> >>>> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
> >>>> then pound them all with posted interrupts, and there is *no*
> >>>> significant performance regression, then that will conclusively prove
> >>>> that the theoretical performance regression is of no concern, and we
> >>>> can enable PI by default.
> >>>
> >>> The point isn't the pinning. The point is what pCPU they're on when
> >>> going to sleep. And that could involve quite a few more than just
> >>> 10 vCPU-s, provided they all sleep long enough.
> >>>
> >>> And the "theoretical performance regression is of no concern" is
> >>> also not a proper way of looking at it, I would say: Even if such
> >>> a situation would happen extremely rarely, if it can happen at all,
> >>> it would still be a security issue.
> >>
> >> What I'm trying to get at is -- exactly what situation?  What actually
> >> constitutes a problematic interrupt latency / interrupt processing
> >> workload, how many vcpus must be sleeping on the same pcpu to actually
> >> risk triggering that latency / workload, and how feasible is it that
> >> such a situation would arise in a reasonable scenario?
> >>
> >> If 200us is too long, and it only takes 3 sleeping vcpus to get there,
> >> then yes, there is a genuine problem we need to try to address before we
> >> turn it on by default.  If we say that up to 500us is tolerable, and it
> >> takes 100 sleeping vcpus to reach that latency, then this is something I
> >> don't really think we need to worry about.
> >>
> >> "I think something bad may happen" is a really difficult to work with.
> >
> > I understand that, but coming up with proper numbers here isn't
> > easy. Fact is, it cannot be excluded that on a system with
> > hundreds of pCPU-s and thousands of vCPU-s, all vCPU-s
> > would at some point pile up on one pCPU's list.
> 
> So it's already the case that when a vcpu is woken, it is inserted into
> the runqueue by priority order, both for credit1 and credit2; and this
> is an insertion sort, so the amount of time it takes to do the insert is
> expected to be the time it takes to traverse half of the list.  This
> isn't an exact analog, because in that case it's the number of
> *runnable* vcpus, not the number of *blocked* vcpus; but it demonstrates
> the point that 1) we already have code that assumes that walking a list
> of vcpus per pcpu is a reasonably bounded thing 2) we have years of no
> major performance problems reported to back that assumption up.
> 
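[To make that cost comparison concrete -- here is a rough sketch of the kind of
priority-ordered insert being described above. The names are made up for
illustration; this is not the actual credit1/credit2 code, just the same shape
of linear scan:]

    /*
     * Illustrative only (made-up names, not the real credit code):
     * insert a vcpu into a per-pCPU queue kept sorted by priority.
     * The cost is a linear scan -- on average half the queue length --
     * the same shape of cost as walking a per-pCPU blocking list in
     * the PI wakeup handler.
     */
    #include <xen/list.h>

    struct sketch_vcpu {
        struct list_head runq_elem;
        int prio;                     /* higher value == runs sooner */
    };

    static void sketch_runq_insert(struct list_head *runq,
                                   struct sketch_vcpu *svc)
    {
        struct list_head *iter;

        list_for_each ( iter, runq )
        {
            const struct sketch_vcpu *cur =
                list_entry(iter, struct sketch_vcpu, runq_elem);

            if ( svc->prio > cur->prio )
                break;                /* found our slot */
        }

        /* list_add_tail(new, pos) inserts 'new' just before 'pos'. */
        list_add_tail(&svc->runq_elem, iter);
    }

[The per-entry work is small in both cases; the open question is just how long
the list can get and how often it gets walked.]
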
> I guess the slight difference there is that it's already well-understood
> that too many *active* vcpus will overload your system and slow things
> down; in the case of the pi wake-ups, the problem is that too many
> *inactive* vcpus will overload your system and slow things down.
> 
> Still -- I have a hard time constructing in my mind a scenario where
> huge numbers of idle vcpus for some reason decide to congregate on a
> single pcpu.
> 
> Suppose we had 1024 pcpus, and 1023 VMs each with 5 vcpus, of which 1
> was spinning at 100% and the other 4 were idle.  I'm not seeing a
> situation where any of the schedulers put all (1023*4) idle vcpus on a
> single pcpu.
> 
> For the credit1 scheduler, I'm basically positive that it can't happen
> even once, even by chance.  You'd never be able to accrete more than a
> dozen vcpus on that one pcpu before they were stolen away.
> 
> For the credit2 scheduler, it *might* be possible that if the busy vcpu
> on each VM never changes (which itself is pretty unlikely), *and* the
> sum of the "load" for all (1023*4) idle vcpus was less than 1 (i.e.,
> idle vcpus took less than 0.02% of the cpu time), then you *might*
> possibly after a long time end up in a situation where you had all vcpus
> on a single pcpu.  But that "accretion" process would take a very long
> time; and as soon as any vcpu had a brief "spike" above the "0.02%", a
> whole bunch of them would get moved somewhere else.
> 
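[Minor aside: the 0.02% figure above is just the stated numbers worked
through -- 1023 * 4 = 4092 idle vcpus, and a summed load below 1 means each
one averages under 1/4092, i.e. roughly 0.024% of a cpu.]
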
> And in any case, are you really going to have 1023 devices so that you
> can hand one to each of those 1023 guests?  Because it's only vcpus of
> VMs *which have a device assigned* that end up on the block list.
> 
> If I may go "meta" for a moment here -- this is exactly what I'm talking
> about with "Something bad may happen" being difficult to work with.
> Rather than you spelling out exactly the situation which you think may
> happen, (which I could then either accept or refute on its merits) *I*
> am now spending a lot of time and effort trying to imagine what
> situations you may be talking about and then refuting them myself.
> 
> If you have concerns, you need to make those concerns concrete, or at
> least set clear criteria for how someone could go about addressing your
> concerns.  And yes, it is *your* job, as the person doing the objecting
> (and even moreso as the x86 maintainer), to make your concerns explicit
> and/or set those criteria, and not Feng's job (or even my job) to try to
> guess what it is might make you happy.
> 
> > How many would be tolerable on a single list depends upon host
> > characteristics, so a fixed number won't do anyway.
> 
> Sure, but if we can run through a list of 100 vcpus in 25us on a typical
> server, then we can be pretty certain 100 vcpus will never exceed 500us
> on basically any server.
> 
> On the other hand, if 50 vcpus take 500us on whatever server Feng uses
> for his tests, then yes, we don't really have enough "slack" to be sure
> that we won't run into problems at some point.
> 
> But at this point we're just pulling numbers out of the air -- when we
> have actual data we can make a better judgement about what might or
> might not be acceptable.
> 
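[Since we are talking about getting actual data: below is a crude userspace
sketch of how one could put a first ballpark number on "walking a list of N
entries". Cache and interrupt-context behaviour will obviously differ from
what the real handler sees, so treat it only as a rough starting point before
doing a proper in-Xen measurement:]

    /* walk_cost.c -- time a walk over N dummy list entries (userspace). */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct entry {
        struct entry *next;
        unsigned int flag;            /* stand-in for per-vcpu state */
    };

    int main(int argc, char **argv)
    {
        unsigned long n = (argc > 1) ? strtoul(argv[1], NULL, 0) : 100;
        struct entry *head = NULL;
        struct timespec t0, t1;
        unsigned long hits = 0;

        /* Build a list of n entries. */
        for ( unsigned long i = 0; i < n; i++ )
        {
            struct entry *e = malloc(sizeof(*e));
            e->flag = i & 1;
            e->next = head;
            head = e;
        }

        /* Walk it once, testing a flag per entry, and time the walk. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for ( struct entry *e = head; e; e = e->next )
            hits += e->flag;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%lu entries, %lu hits, %ld ns\n", n, hits,
               (long)((t1.tv_sec - t0.tv_sec) * 1000000000L
                      + (t1.tv_nsec - t0.tv_nsec)));
        return 0;
    }

[Running it with, say, 100 and then 5000 entries shows how the cost scales;
the in-Xen number, with cold caches and the real per-vcpu structures, will be
worse.]
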
> > Hence I
> > think the better approach, instead of improving lookup, is to
> > distribute vCPU-s evenly across lists. Which in turn would likely
> > require those lists to no longer be tied to pCPU-s, an aspect I
> > had already suggested during review. As soon as distribution
> > would be reasonably even, the security concern would vanish:
> > Someone placing more vCPU-s on a host than that host can
> > handle is responsible for the consequences. Quite contrary to
> > someone placing more vCPU-s on a host than a single pCPU can
> > reasonably handle in an interrupt handler.
> 
> I don't really understand your suggestion.  The PI interrupt is
> necessarily tied to a specific pcpu; unless we start having multiple PI
> interrupts, we only have as many interrupts as we have pcpus, right?
> Are you saying that rather than put vcpus on the list of the pcpu it's
> running on, we should set the interrupt to that of an arbitrary pcpu
> that happens to have room on its list?

I don't think that is a good idea. As George mentioned above, the PI
wakeup notification events (PI interrupts) are bound to a specific
pCPU: the 'NDST' field in the vCPU's PI descriptor names that specific
pCPU, so when a PI interrupt happens we can find the right blocking
list. If we put the vCPU on another pCPU's blocking list (other than
the one indicated by the 'NDST' field), how would we find the vCPU to
wake up when the PI interrupt comes in?
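Roughly, and with simplified names and no locking rather than the real
structures, the coupling I am describing looks like this:

    /* Sketch only: simplified names, not the actual implementation. */

    #define SKETCH_NR_CPUS 8

    struct pi_desc_sketch {
        unsigned int nv;    /* notification vector; set to the wakeup
                             * vector while the vCPU is blocked        */
        unsigned int ndst;  /* destination: the pCPU (APIC ID) that the
                             * wakeup notification will be sent to     */
        unsigned int on;    /* a notification is pending               */
    };

    struct vcpu_sketch {
        struct pi_desc_sketch pi;
        struct vcpu_sketch *blocked_next;   /* per-pCPU blocking list  */
    };

    /* One blocking list per pCPU (per-pCPU lock omitted here). */
    static struct vcpu_sketch *blocked_list[SKETCH_NR_CPUS];

    /*
     * Block path: the vCPU goes onto the blocking list of the pCPU
     * named by NDST, because that is the pCPU the wakeup interrupt
     * will be delivered to.
     */
    static void sketch_vcpu_block(struct vcpu_sketch *v, unsigned int cpu,
                                  unsigned int apic_id, unsigned int wake_vec)
    {
        v->pi.ndst = apic_id;
        v->pi.nv = wake_vec;
        v->blocked_next = blocked_list[cpu];
        blocked_list[cpu] = v;
    }

    /*
     * Wakeup handler: it only knows which pCPU it is running on, so it
     * can only walk that pCPU's list.  A vCPU parked on a list that
     * does not match its NDST would never be found here.
     */
    static void sketch_pi_wakeup(unsigned int this_cpu,
                                 void (*wake)(struct vcpu_sketch *))
    {
        struct vcpu_sketch **pv = &blocked_list[this_cpu];

        while ( *pv )
        {
            struct vcpu_sketch *v = *pv;

            if ( v->pi.on )
            {
                *pv = v->blocked_next;      /* unlink from the list */
                v->pi.on = 0;
                wake(v);
            }
            else
                pv = &v->blocked_next;
        }
    }

In other words, whichever pCPU's list we choose, 'NDST' has to point at
that same pCPU, or the handler walking its own list will never find the
vCPU.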

Thanks,
Feng

> 
>  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel