
Re: [Xen-devel] [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities



On Tue, 2015-04-07 at 11:27 +0100, Andrew Cooper wrote:
> On 04/04/2015 03:14, Dario Faggioli wrote:
>
> > I'm putting here in the cover letter a markdown document I wrote to better
> > describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> > fetch it at the following links:
> >
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
> >
> > See the document itself and the changelog of the various patches for 
> > details.

> 
> There seem to be several areas of confusion indicated in your document. 
>
I see. Sorry for that then.

> I am unsure whether this is a side effect of the way you have written
> it, but here are (hopefully) some words of clarification.
>
And thanks for this. :-)

> PSR CMT works by tagging cache lines with the currently-active RMID. 
> The cache utilisation is a count of the number of lines which are tagged
> with a specific RMID.  MBM on the other hand counts the number of cache
> line fills and cache line evictions tagged with a specific RMID.
> 
Ok.

> By this nature, the information will never reveal the exact state of
> play.  e.g. a core with RMID A which gets a cache line hit against a
> line currently tagged with RMID B will not alter any accounting. 
>
So, you're saying that the information we get is an approximation of
reality, not a 100% accurate representation of it. That is no news, IMO.
When, inside Credit2, we try to track the average load on each runqueue,
that is an approximation. When, in Credit1, we consider a vcpu "cache
hot" if it ran recently, that is an approximation. Etc. These
approximations happen entirely in software, because in those cases that
is possible.
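
Just to make concrete the kind of software-only heuristic I mean, here is
a minimal sketch of a Credit1-style "cache hot" check (all names are
illustrative, this is not the actual Xen code):

#include <stdbool.h>
#include <stdint.h>

struct vcpu_stats {
    uint64_t last_run_time_ns;   /* when the vCPU last stopped running */
};

/* "Hot" == the vCPU ran within the last migration_delay_us microseconds. */
static bool vcpu_is_cache_hot(const struct vcpu_stats *v, uint64_t now_ns,
                              uint64_t migration_delay_us)
{
    return (now_ns - v->last_run_time_ns) < migration_delay_us * 1000ULL;
}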

PSR provides data and insights on something that, without hardware
support, we couldn't possibly hope to know anything about. Whether we
should think about using such data or not depends on whether it
represents (a basis for) a reasonable enough approximation, or whether it
is just a bunch of pseudo-random numbers.

It seems to me that you are suggesting the latter to be more likely than
the former, i.e., that PSR does not provide a good enough approximation
to be used from inside Xen and the toolstack. Is my understanding correct?

> Furthermore, as alterations of the RMID only occur in
> __context_switch(), Xen actions such as handling an interrupt will be
> accounted against the currently active domain (or other future
> granularity of RMID).
> 
Yes, I thought about this. However, while this is certainly important for
per-domain, or for a (unlikely) future per-vcpu, monitoring, if you
attach an RMID to a pCPU (or a group of pCPUs) then it is not really a
problem.

Actually, it's the correct behavior: running Xen and serving interrupts
on a certain core, in that case, *do* need to be accounted for! So,
considering that both the document and the RFC series are mostly focused
on introducing per-pcpu/core/socket monitoring, rather than on
per-domain monitoring, and given that the document was becoming quite
long, I decided not to add a section about this.
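
FWIW, here's a minimal sketch of what I have in mind for the per-pCPU
case, assuming the MSR layout documented in the SDM (IA32_PQR_ASSOC =
0xc8f, RMID in the low bits); wrmsrl() is the usual "write 64-bit MSR"
helper, and CLOS/CAT handling (bits [63:32] of the same MSR) is ignored:

#include <stdint.h>

#define MSR_IA32_PQR_ASSOC   0x00000c8f
#define PQR_RMID_MASK        0x3ffULL     /* RMID lives in the low bits */

static void pcpu_set_rmid(unsigned int rmid)
{
    /*
     * Programmed once per pCPU and then left alone: everything that runs
     * on this core from now on (guest vCPUs, Xen itself, interrupt
     * handlers) is tagged with, and accounted to, this RMID.
     */
    wrmsrl(MSR_IA32_PQR_ASSOC, (uint64_t)rmid & PQR_RMID_MASK);
}

The point being exactly that the RMID is never touched at context switch,
so Xen's own execution on that core gets accounted as intended.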

> "max_rmid" is a per-socket property.  There is no requirement for it to
> be the same for each socket in a system, although it is likely, given a
> homogeneous system.
>
I know. Again, this was not mentioned for document length reasons, but I
planned to ask about it (as I've already done this morning, as you can
see. :-D).

In this case, though, it probably was worth mentioning, so I will if
there is ever a v2 of the document. :-)

Mostly, I was curious to learn why that is not reflected in the current
implementation, i.e., whether there are any reasons why we should not
take advantage of the per-socket nature of RMIDs, as reported by the SDM,
since that can greatly help mitigate RMID shortage in the
per-CPU/core/socket configuration (in general, actually, but it's the
per-cpu case I'm interested in).
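
To be clear about what I mean by "taking advantage of per-socketness",
here's a rough sketch (all names and sizes are illustrative, nothing of
this exists in the series yet): one independent RMID pool per socket, so
the same RMID values get reused on every socket instead of coming out of
a single system-wide namespace.

#include <stdint.h>

#define MAX_SOCKETS  8     /* placeholder */
#define MAX_RMIDS    64    /* placeholder: CPUID leaf 0xf reports the real value */

/* One pool per socket: the same RMID value can be in use on every socket
 * at once, for different purposes. */
static uint8_t rmid_in_use[MAX_SOCKETS][MAX_RMIDS];

/* Returns an RMID valid on @socket only, or -1 if that socket's pool is full. */
static int alloc_rmid(unsigned int socket)
{
    unsigned int r;

    /* RMID 0 is left alone, as the default/unmonitored RMID. */
    for ( r = 1; r < MAX_RMIDS; r++ )
        if ( !rmid_in_use[socket][r] )
        {
            rmid_in_use[socket][r] = 1;
            return r;
        }

    return -1;
}

static void free_rmid(unsigned int socket, unsigned int rmid)
{
    rmid_in_use[socket][rmid] = 0;
}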

> The limit on RMID is based on the size of the
> accounting table.
> 
I did not know the details, but it makes sense. Getting feedback on the
number of available RMIDs we should expect in current and future
hardware, from Intel people and from everyone who knows (like you :-D),
was the main purpose of sending this out, so thanks.

> As far as MSRs themselves go, an extra MSR write in the context switch
> path is likely to pale into the noise.  However, querying the data is an
> indirect MSR read (write to the event select MSR, read from  the data
> MSR).  Furthermore there is no way to atomically read all data at once
> which means that activity on other cores can interleave with
> back-to-back reads in the scheduler.
> 
All true. And in fact, how and how frequently data should be gathered
remains to be decided (as said in the document). I was thinking more of
some periodic sampling, rather than throwing handfuls of rdmsr/wrmsr at
the code that makes scheduling decisions! :-D
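
For instance, something along these lines (a sketch only, with MSR
numbers and bit layout as per the SDM, and rdmsrl()/wrmsrl() assumed to
be the usual MSR access helpers) could be called from a periodic,
per-socket sampling routine rather than from the scheduler's hot path:

#include <stdint.h>

#define MSR_IA32_QM_EVTSEL   0x00000c8d
#define MSR_IA32_QM_CTR      0x00000c8e

#define QM_EVT_L3_OCCUPANCY  0x1            /* event ID, bits [7:0] */
#define QM_CTR_ERROR         (1ULL << 63)   /* invalid RMID/event   */
#define QM_CTR_UNAVAIL       (1ULL << 62)   /* no data available    */

/* Returns L3 occupancy for @rmid in hardware units, or -1 on error. */
static int64_t read_l3_occupancy(unsigned int rmid)
{
    uint64_t val;

    /* The indirect read: select (event, RMID), then read the counter. */
    wrmsrl(MSR_IA32_QM_EVTSEL,
           ((uint64_t)rmid << 32) | QM_EVT_L3_OCCUPANCY);
    rdmsrl(MSR_IA32_QM_CTR, val);

    if ( val & (QM_CTR_ERROR | QM_CTR_UNAVAIL) )
        return -1;

    /* Bits [61:0] hold the count; scale by the factor from
     * CPUID.0xf.1:EBX to turn it into bytes. */
    return (int64_t)(val & ((1ULL << 62) - 1));
}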

> As far as the plans here go, I have some concerns.  PSR is only
> available on server platforms, which will be 2/4 socket systems with
> large numbers of cores.  As you have discovered, there are insufficient
> RMIDs for redbrick pcpus, and on a system that size, XenServer typically
> gets 7x vcpus to pcpus.
> 
> I think it is unrealistic to expect to use any scheduler scheme which is
> per-pcpu or per-vcpu while the RMID limit is as small as it is. 
>
On the per-vcpu schemes, I fully agree. However, it was necessary to
mention them, IMO, and explain why that is the case... Being able to
monitor single vCPUs would be pretty cool, and it is likely one of the
first things that someone looking at this technology for the first time
would want to know whether it is possible. It's not, and I thought that
not stating so, and not explaining why, would have been quite a
deficiency of such a document.

On per-pcpu schemes, I mostly agree. Although exploiting the per-socket
nature of RMIDs, if possible, seems to offer a viable solution.

What I'm not sure I got is your opinion on per-pcpu or per-socket
schemes.

> Depending on workload, even a per-domain scheme might be problematic. 
> One of our tests involves running 500xWin7 VMs on that particular box.
> 
Yep. And in fact, I didn't even mention using any per-domain scheme for
scheduling, as it has the same disadvantages as per-vcpu schemes in
terms of RMID usage (a few multi-vcpu domains == many single-vcpu
domains), and it's useless for the scheduler, which barely knows what a
domain is.

Regards, and Thanks a lot for your feedback. :-)
Dario



 

