
Re: [Xen-devel] [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities



2015-04-07 9:10 GMT-04:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:
> On Tue, 2015-04-07 at 11:27 +0100, Andrew Cooper wrote:
>> On 04/04/2015 03:14, Dario Faggioli wrote:
>>
>> > I'm putting here in the cover letter a markdown document I wrote to better
>> > describe my findings and ideas (sorry if it's a bit long! :-D). You can 
>> > also
>> > fetch it at the following links:
>> >
>> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
>> >
>> > See the document itself and the changelog of the various patches for 
>> > details.
>
>>
>> There seem to be several areas of confusion indicated in your document.
>>
> I see. Sorry for that then.
>
>> I am unsure whether this is a side effect of the way you have written
>> it, but here are (hopefully) some words of clarification.
>>
> And thanks for this. :-)
>
>> PSR CMT works by tagging cache lines with the currently-active RMID.
>> The cache utilisation is a count of the number of lines which are tagged
>> with a specific RMID.  MBM on the other hand counts the number of cache
>> line fills and cache line evictions tagged with a specific RMID.
>>
> Ok.
>
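(As an aside, and just to check my own understanding of the SDM: the occupancy counter is reported in units of the upscaling factor from CPUID leaf 0xF, so converting a raw reading into bytes would look roughly like the sketch below. This is only an illustration, not code from the series, and the helper name is made up.)

/* Sketch: convert a raw CMT occupancy reading into bytes.
 * CPUID.(EAX=0xf, ECX=1):EBX reports the upscaling factor to
 * multiply IA32_QM_CTR readings by; ECX reports the highest RMID. */
static uint64_t cmt_counter_to_bytes(uint64_t raw_count)
{
    unsigned int eax, ebx, ecx, edx;

    cpuid_count(0xf, 1, &eax, &ebx, &ecx, &edx);

    return raw_count * ebx;
}
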
>> By this nature, the information will never reveal the exact state of
>> play.  e.g. a core with RMID A which gets a cache line hit against a
>> line currently tagged with RMID B will not alter any accounting.
>>
> So, you're saying that the information we get is an approximation of
> reality, not a 100% accurate representation of it. That is no news, IMO.
> When, inside Credit2, we try to track the average load on each runqueue,
> that is an approximation. When, in Credit1, we consider a vcpu "cache
> hot" if it ran recently, that is an approximation. Etc. These
> approximations happen fully in software, because in those cases that is
> possible.
>
> PSR provides data and insights on something that, without hardware
> support, we couldn't possibly hope to know anything about. Whether or
> not we should think about using such data depends on whether it
> represents (the basis for) a reasonable enough approximation, or is
> just a bunch of pseudo-random numbers.
>
> It seems to me that you are suggesting the latter is more likely than
> the former, i.e., that PSR does not provide a good enough approximation
> to be used from inside Xen and the toolstack; is my understanding
> correct?
>
>> Furthermore, as alterations of the RMID only occur in
>> __context_switch(), Xen actions such as handling an interrupt will be
>> accounted against the currently active domain (or other future
>> granularity of RMID).
>>
> Yes, I thought about this. This is certainly important for per-domain,
> or for a (unlikely) future per-vcpu, monitoring; but if you attach an
> RMID to a pCPU (or group of pCPUs), then it is not really a problem.
>
> Actually, it's the correct behavior: running Xen and serving interrupts
> on a certain core, in that case, *does* need to be accounted for! So,
> considering that both the document and the RFC series are mostly focused
> on introducing per-pCPU/core/socket monitoring, rather than on
> per-domain monitoring, and given that the document was becoming quite
> long, I decided not to add a section about this.
>
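(To make the per-pCPU/core idea above a bit more concrete, this is roughly what I picture; just a sketch, where MSR_IA32_PQR_ASSOC and the RMID field in bits 9:0 come from the SDM, while the per-CPU variable and the helper are invented for illustration. Note that with an RMID attached to the pCPU, the write could even happen once at CPU bring-up, rather than in __context_switch().)

#define MSR_IA32_PQR_ASSOC 0x00000c8f

DEFINE_PER_CPU(unsigned int, pcpu_rmid); /* hypothetical */

/* Load the RMID attached to this pCPU (not to a domain) into
 * IA32_PQR_ASSOC, preserving the CLOS field in the upper bits.
 * Time spent in Xen, e.g. serving interrupts, is then accounted
 * to the pCPU, which is what we want here. */
static void psr_assoc_pcpu_rmid(void)
{
    uint64_t val;

    rdmsrl(MSR_IA32_PQR_ASSOC, val);
    val = (val & ~0x3ffULL) | (this_cpu(pcpu_rmid) & 0x3ff);
    wrmsrl(MSR_IA32_PQR_ASSOC, val);
}
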
>> "max_rmid" is a per-socket property.  There is no requirement for it to
>> be the same for each socket in a system, although it is likely, given a
>> homogeneous system.
>>
> I know. Again, this was not mentioned for document-length reasons, but I
> planned to ask about it (as I have already done this morning, as you can
> see :-D).
>
> In this case, though, it probably was worth mentioning, so I will do
> that if there is ever a v2 of the document. :-)
>
> Mostly, I was curious to learn why that is not reflected in the current
> implementation, i.e., whether there are any reasons why we should not
> take advantage of the per-socket nature of RMIDs, as reported by the
> SDM, since that can greatly help mitigate RMID shortage in the
> per-CPU/core/socket configuration (in general, actually, but it's
> per-CPU that I'm interested in).
>
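(For illustration, and with the helper below being purely hypothetical: since RMIDs are a per-socket resource, the same numeric RMID could be attached to one pCPU on each socket, so the per-pCPU configuration would need roughly "number of CPUs per socket" RMIDs rather than one per pCPU in the whole system.)

/* Hypothetical per-pCPU RMID assignment reusing RMID values across
 * sockets; cpu_index_in_socket() (0, 1, 2, ... within each socket)
 * is an invented helper. RMID 0 is left as the default/unmonitored one. */
static unsigned int pcpu_rmid_for(unsigned int cpu)
{
    return cpu_index_in_socket(cpu) + 1;
}
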
>> The limit on RMID is based on the size of the
>> accounting table.
>>
> I did not know the details, but it makes sense. Getting feedback on how
> many RMIDs we should expect on current and future hardware, from Intel
> people and from everyone who knows (like you :-D), was the main purpose
> of sending this out, so thanks.
>
>> As far as MSRs themselves go, an extra MSR write in the context switch
>> path is likely to pale into the noise.  However, querying the data is an
>> indirect MSR read (write to the event select MSR, read from  the data
>> MSR).  Furthermore there is no way to atomically read all data at once
>> which means that activity on other cores can interleave with
>> back-to-back reads in the scheduler.
>>
> All true. And in fact, how, and how frequently, data should be gathered
> remains to be decided (as said in the document). I was thinking more of
> some periodic sampling, rather than throwing handfuls of rdmsr/wrmsr at
> the code that makes scheduling decisions! :-D
>
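(For reference, my understanding of the indirect read Andrew describes is roughly the following; just a sketch using the architectural MSR numbers and bit layout from the SDM, not code from the series:)

#define MSR_IA32_QM_EVTSEL  0x00000c8d
#define MSR_IA32_QM_CTR     0x00000c8e
#define QM_EVT_L3_OCCUPANCY 0x1      /* event ID 1: L3 cache occupancy */

/* Select <RMID, event> in IA32_QM_EVTSEL, then read IA32_QM_CTR: two
 * MSR accesses, with nothing making a scan of several RMIDs atomic.
 * Returns 0 if the hardware flags the reading as Error/Unavailable. */
static bool_t read_l3_occupancy(unsigned int rmid, uint64_t *count)
{
    uint64_t ctr;

    wrmsrl(MSR_IA32_QM_EVTSEL, ((uint64_t)rmid << 32) | QM_EVT_L3_OCCUPANCY);
    rdmsrl(MSR_IA32_QM_CTR, ctr);

    if ( ctr & ((1ULL << 63) | (1ULL << 62)) )
        return 0;

    *count = ctr & ((1ULL << 62) - 1);
    return 1;
}
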


Actually, I'm wondering whether periodic sampling is really a better
idea than event-based/situation-based sampling. For example, as you and
George mentioned, the cache affinity information may only be useful in
the short term, which means you may not need to issue the MSR accesses
to get the cache information once a vcpu has run long enough. IMHO,
there should be some heuristic to indicate when the "near-accurate"
cache usage information would be most useful for guiding scheduling
decisions.

For example, another situation that comes to mind which does not need
such frequent sampling is this: if a domain has shown very little cache
usage over the last several "event-based" cache-usage samples, we (or
the scheduler) can speculate that this domain is not cache intensive,
and make decisions based on that speculation. Then we sample the cache
usage of this domain only at a very low frequency, and once the domain
changes from not-cache-intensive to cache-intensive mode, we switch
back to event-based sampling.

So I think a hybrid approach may be better. :-)
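
(Very roughly, and with all the names and thresholds below being invented just for illustration, the hybrid scheme I have in mind would look something like this:)

#define LOW_OCCUPANCY_BYTES   (512 << 10)  /* made-up threshold */
#define LOW_SAMPLES_THRESHOLD 4            /* made-up threshold */

struct cmt_sampling_state {
    unsigned int low_samples; /* consecutive "almost no cache" samples */
    bool_t periodic_only;     /* 1: slow periodic sampling only */
};

/* Feed each new occupancy sample in: after several consecutive low
 * samples the domain is speculated not to be cache intensive and we
 * fall back to slow periodic sampling; a high sample switches it back
 * to event-based sampling. */
static void cmt_account_sample(struct cmt_sampling_state *s, uint64_t bytes)
{
    if ( bytes < LOW_OCCUPANCY_BYTES )
    {
        if ( ++s->low_samples >= LOW_SAMPLES_THRESHOLD )
            s->periodic_only = 1;
    }
    else
    {
        s->low_samples = 0;
        s->periodic_only = 0;
    }
}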

Best,

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

