
Re: [Xen-devel] [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities



On Tue, 2015-04-07 at 11:27 +0100, Andrew Cooper wrote:
> On 04/04/2015 03:14, Dario Faggioli wrote:
>
> > I'm putting here in the cover letter a markdown document I wrote to better
> > describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> > fetch it at the following links:
> >
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
> >
> > See the document itself and the changelog of the various patches for 
> > details.

> 
> There seem to be several areas of confusion indicated in your document. 
>
I see. Sorry for that then.

> I am unsure whether this is a side effect of the way you have written
> it, but here are (hopefully) some words of clarification.
>
And thanks for this. :-)

> PSR CMT works by tagging cache lines with the currently-active RMID. 
> The cache utilisation is a count of the number of lines which are tagged
> with a specific RMID.  MBM on the other hand counts the number of cache
> line fills and cache line evictions tagged with a specific RMID.
> 
Ok.

> By this nature, the information will never reveal the exact state of
> play.  e.g. a core with RMID A which gets a cache line hit against a
> line currently tagged with RMID B will not alter any accounting. 
>
So, you're saying that the information we get is an approximation of
reality, not a 100% accurate representation of it. That is no news, IMO.
When, inside Credit2, we try to track the average load on each runqueue,
that is an approximation. When, in Credit1, we consider a vcpu "cache
hot" if it ran recently, that is an approximation. Etc. These
approximations happen entirely in software, because in those cases that
is possible.
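
Just to make concrete the kind of software-only heuristic I mean, here is
a minimal sketch of a Credit1-style "cache hot" check (all names are
illustrative, this is not the actual Xen code):

#include <stdbool.h>
#include <stdint.h>

struct vcpu_stats {
    uint64_t last_run_time_ns;   /* when the vCPU last stopped running */
};

/* "Hot" == the vCPU ran within the last migration_delay_us microseconds. */
static bool vcpu_is_cache_hot(const struct vcpu_stats *v, uint64_t now_ns,
                              uint64_t migration_delay_us)
{
    return (now_ns - v->last_run_time_ns) < migration_delay_us * 1000ULL;
}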

PSR provides data and insights on something that, without hardware
support, we couldn't possibly hope to know anything about. Whether we
should think about using such data or not depends on whether it
represents (a basis for) a reasonable enough approximation, or whether it
is just a bunch of pseudo-random numbers.

It seems to me that you are suggesting the latter to be more likely than
the former, i.e., that PSR does not provide a good enough approximation
to be used from inside Xen and the toolstack. Is my understanding correct?

> Furthermore, as alterations of the RMID only occur in
> __context_switch(), Xen actions such as handling an interrupt will be
> accounted against the currently active domain (or other future
> granularity of RMID).
> 
Yes, I thought about this. However, while this is certainly important for
per-domain, or for a (unlikely) future per-vcpu, monitoring, if you
attach an RMID to a pCPU (or a group of pCPUs) then it is not really a
problem.

Actually, it's the correct behavior: running Xen and serving interrupts
on a certain core, in that case, *do* need to be accounted for! So,
considering that both the document and the RFC series are mostly focused
on introducing per-pcpu/core/socket monitoring, rather than on
per-domain monitoring, and given that the document was becoming quite
long, I decided not to add a section about this.
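
FWIW, here's a minimal sketch of what I have in mind for the per-pCPU
case, assuming the MSR layout documented in the SDM (IA32_PQR_ASSOC =
0xc8f, RMID in the low bits); wrmsrl() is the usual "write 64-bit MSR"
helper, and CLOS/CAT handling (bits [63:32] of the same MSR) is ignored:

#include <stdint.h>

#define MSR_IA32_PQR_ASSOC   0x00000c8f
#define PQR_RMID_MASK        0x3ffULL     /* RMID lives in the low bits */

static void pcpu_set_rmid(unsigned int rmid)
{
    /*
     * Programmed once per pCPU and then left alone: everything that runs
     * on this core from now on (guest vCPUs, Xen itself, interrupt
     * handlers) is tagged with, and accounted to, this RMID.
     */
    wrmsrl(MSR_IA32_PQR_ASSOC, (uint64_t)rmid & PQR_RMID_MASK);
}

The point being exactly that the RMID is never touched at context switch,
so Xen's own execution on that core gets accounted as intended.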

> "max_rmid" is a per-socket property.  There is no requirement for it to
> be the same for each socket in a system, although it is likely, given a
> homogeneous system.
>
I know. Again, this was not mentioned for document length reasons, but I
planned to ask about it (as I've already done this morning, as you can
see. :-D).

In this case, though, it probably was worth mentioning, so I will if
there is ever a v2 of the document. :-)

Mostly, I was curious to learn why that is not reflected in the current
implementation, i.e., whether there are any reasons why we should not
take advantage of the per-socket nature of RMIDs, as reported by the SDM,
since that can greatly help mitigate RMID shortage in the
per-CPU/core/socket configuration (in general, actually, but it's the
per-cpu case I'm interested in).
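
To be clear about what I mean by "taking advantage of per-socketness",
here's a rough sketch (all names and sizes are illustrative, nothing of
this exists in the series yet): one independent RMID pool per socket, so
the same RMID values get reused on every socket instead of coming out of
a single system-wide namespace.

#include <stdint.h>

#define MAX_SOCKETS  8     /* placeholder */
#define MAX_RMIDS    64    /* placeholder: CPUID leaf 0xf reports the real value */

/* One pool per socket: the same RMID value can be in use on every socket
 * at once, for different purposes. */
static uint8_t rmid_in_use[MAX_SOCKETS][MAX_RMIDS];

/* Returns an RMID valid on @socket only, or -1 if that socket's pool is full. */
static int alloc_rmid(unsigned int socket)
{
    unsigned int r;

    /* RMID 0 is left alone, as the default/unmonitored RMID. */
    for ( r = 1; r < MAX_RMIDS; r++ )
        if ( !rmid_in_use[socket][r] )
        {
            rmid_in_use[socket][r] = 1;
            return r;
        }

    return -1;
}

static void free_rmid(unsigned int socket, unsigned int rmid)
{
    rmid_in_use[socket][rmid] = 0;
}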

> The limit on RMID is based on the size of the
> accounting table.
> 
I did not know the details, but it makes sense. Getting feedback on the
number of available RMIDs we should expect in current and future
hardware, from Intel people and from everyone who knows (like you :-D),
was the main purpose of sending this out, so thanks.

> As far as MSRs themselves go, an extra MSR write in the context switch
> path is likely to pale into the noise.  However, querying the data is an
> indirect MSR read (write to the event select MSR, read from  the data
> MSR).  Furthermore there is no way to atomically read all data at once
> which means that activity on other cores can interleave with
> back-to-back reads in the scheduler.
> 
All true. And in fact, how and how frequently data should be gathered
remains to be decided (as said in the document). I was thinking more of
some periodic sampling, rather than throwing handfuls of rdmsr/wrmsr at
the code that makes scheduling decisions! :-D
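
For instance, something along these lines (a sketch only, with MSR
numbers and bit layout as per the SDM, and rdmsrl()/wrmsrl() assumed to
be the usual MSR access helpers) could be called from a periodic,
per-socket sampling routine rather than from the scheduler's hot path:

#include <stdint.h>

#define MSR_IA32_QM_EVTSEL   0x00000c8d
#define MSR_IA32_QM_CTR      0x00000c8e

#define QM_EVT_L3_OCCUPANCY  0x1            /* event ID, bits [7:0] */
#define QM_CTR_ERROR         (1ULL << 63)   /* invalid RMID/event   */
#define QM_CTR_UNAVAIL       (1ULL << 62)   /* no data available    */

/* Returns L3 occupancy for @rmid in hardware units, or -1 on error. */
static int64_t read_l3_occupancy(unsigned int rmid)
{
    uint64_t val;

    /* The indirect read: select (event, RMID), then read the counter. */
    wrmsrl(MSR_IA32_QM_EVTSEL,
           ((uint64_t)rmid << 32) | QM_EVT_L3_OCCUPANCY);
    rdmsrl(MSR_IA32_QM_CTR, val);

    if ( val & (QM_CTR_ERROR | QM_CTR_UNAVAIL) )
        return -1;

    /* Bits [61:0] hold the count; scale by the factor from
     * CPUID.0xf.1:EBX to turn it into bytes. */
    return (int64_t)(val & ((1ULL << 62) - 1));
}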

> As far as the plans here go, I have some concerns.  PSR is only
> available on server platforms, which will be 2/4 socket systems with
> large numbers of cores.  As you have discovered, there are insufficient
> RMIDs for redbrick pcpus, and on a system that size, XenServer typically
> gets 7x vcpus to pcpus.
> 
> I think it is unrealistic to expect to use any scheduler scheme which is
> per-pcpu or per-vcpu while the RMID limit is as small as it is. 
>
On the per-vcpu schemes, I fully agree. However, it was necessary to
mention them, IMO, and explain why that is the case... Being able to
monitor single vCPUs would be pretty cool, and it is likely one of the
first things that someone looking at this technology for the first time
would want to know whether it is possible. It's not, and I thought that
not stating so, and not explaining why, would have been quite a
deficiency of such a document.

On per-pcpu schemes, I mostly agree. Although exploiting the per-socket
nature of RMIDs, if possible, seems to offer a viable solution.

What I'm not sure I got is your opinion on per-pcpu or per-socket
schemes.

> Depending on workload, even a per-domain scheme might be problematic. 
> One of our tests involves running 500xWin7 VMs on that particular box.
> 
Yep. And in fact, I didn't even mention using any per-domain scheme for
scheduling, as it has the same disadvantages as per-vcpu schemes in
terms of RMID usage (a few multi-vcpu domains == many single-vcpu
domains), and it's useless for the scheduler, which barely knows what a
domain is.

Regards, and Thanks a lot for your feedback. :-)
Dario



 

