
[Xen-ia64-devel] RE: Xen/ia64 - global or per VP VHPT



Hi Dan,

First of all, a clarification: in this message, the words VM, domain and
guest mean the same thing.

Magenheimer, Dan (HP Labs Fort Collins) <mailto:dan.magenheimer@xxxxxx>
wrote on Saturday, April 30, 2005 8:09 PM:

>>> In my opinion, performance when emulating physical mode is
>>> a moot point.
>> 
>> Linux IPF TLB miss handlers turn off PSR.dt. This is very performance
>> sensitive.
> 
> Um, true, but that's largely irrelevant to discussion of
> VHPT capability/performance isn't it?

I was just responding to your comment that "performance when emulating
physical mode is a moot point." In other words, I am presenting a case in
which you need to emulate physical mode and in which performance is
important.

The reason why we got to talking about this is because you stated:

> No, multiple page sizes are supported, though there does have
> to be a system-wide minimum page size (e.g. if this were defined
> as 16KB, a 4KB-page mapping request from a guestOS would be rejected).
> Larger page sizes are represented by multiple entries in the
> global VHPT.

And I commented:

> In my opinion this is a moot point because in order to provide the
> appropriate semantics for physical mode emulation (PSR.dt, or PSR.it, or
> PSR.rt == 0) it is necessary to support a 4K page size as the minimum
> (unless you special case translations for physical mode emulation). Also in
> terms of machine memory utilization, it is better to have smaller pages (I
> know this functionality is not yet available in Xen, but I believe it will
> become important once people are done working on the basics).

I was making the point that I think a VMM needs to support 4K pages.
Admittedly, this is unrelated to the VHPT discussion, so we can ignore it.

> 
>> The way I see you applying this argument here is a bit different,
>> though: there are things that Linux does today that will cause trouble
>> with this particular design choice, but all I have to do is to make sure
>> these troublesome things get designed out of the paravirtualized OS.
> 
> Yes, that's basically what I am saying.  I understand why a
> VTi implementation needs to handle every possible situation
> because silicon rolls are very expensive.  It's not nearly
> as important for paravirtualized.  For example, Vmware didn't
> support Linux 2.6 until their "next release" (I don't remember
> what the release number was).

I am not sure how the VMware example is relevant here (it certainly has
nothing to do with VHPTs). Please, let's talk about specifics and how they
relate to the issues of:

- Scalability (additional contention in a Global VHPT)

- The need to minimize guest interference (one guest/domain having the
ability to interfere with another through a shared resource)

> 
>> In any case, I think it is critical to define exactly what an IPF
>> paravirtualized guest is (maybe this has already been done and I missed
>> it) before making assumptions as to what the guest will and will not do
>> (especially when those things are done by native guests today). I don't
>> think it is quite the same as an X-86 XenoLinux, as a number of the
>> hypercalls are very specific to addressing X-86 virtualization holes,
>> which do not have equivalents in IPF.
> 
> There is a paravirtualized design and Xenlinux implementation
> available. (See an earlier posting.) It's still a work in
> progress but it's proceeding nicely.

I have not seen this. Would you mind sending me a pointer to it? I tend to
follow these discussions sporadically, so I missed that email.

> 
>> I know that there have been attempts at paravirtualizing (actually more
>> like dynamically patching) IPF Linux before (e.g., vBlades, you may be
>> familiar with it :-), but I am not sure if the Xen project for IPF has
>> decided exactly what an IPF paravirtualized XenoLinux will look like. I
>> am also not sure if it has also been decided that no native IPF guests
>> (no binary patching) will be supported.
> 
> An entirely paravirtualized guest (no patching) is certainly
> feasible.  I could have it running in a couple weeks time,
> but haven't considered it high on the priority list.
> 
> Another interesting case (I think suggested by Arun) is a
> "semi-paravirtualized" system where some paravirtualization
> is done but priv-sensitive instructions are handled by
> hardware (VT) rather than binary patching.  Or perhaps that's
> what you meant?
> 
> In any case, there's a lot of interesting possibilities here
> and, although there are many opinions about which is best,
> I think we should preserve the option of trying/implementing
> as many as possible.  I'm not "black and white"... I'm more
> of an RGB kinda guy :-)

I don't doubt that everything you mention above is possible. All I am saying
is that it would be very useful to specify exactly what paravirtualization
is doing before making claims that certain issues will not be relevant in a
paravirtualized environment.

>> Let's define "big" in an environment where there are multiple
>> cores per die...
> 
> Not sure what your point is.  Yes, SMP systems are becoming
> more common, but that doesn't mean that every system is
> going to be running Oracle or data mining.  In other words,
> it may be better to model a "big" system as an independent
> collection of small systems (e.g. utility data center).

We are definitely not talking the same language here...

My point is that by using a global VHPT, you are creating lock contention
that is proportional to the number of processors/cores in the system and the
number of VMs/domains, regardless of whether or not those VMs are SMP. Let's
take as an example a system in which you are running 10 UP VMs. If you are
using a per VM/domain VHPT, then you do not have to be concerned with locking
in the individual VHPTs. If you use a global VHPT, you now have to lock
between 10 VMs. If you have 10 processors/cores, then there are potentially
going to be 10 guys contending for those locks. The fact that the VMs are UP
does not matter: the global VHPT makes this an SMP problem!
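
To make the 10 UP VMs example concrete, here is a minimal sketch in Python
(not Xen code). Everything in it is an assumption for illustration: the class
names, the single coarse lock on the global table, and the use of threads to
stand in for processors. A real implementation would likely use finer-grained
locking, but the serialization point is the same in kind.

```python
import threading

class GlobalVHPT:
    """One table shared by every domain: every insert must take the lock."""
    def __init__(self):
        self.entries = {}
        self.lock = threading.Lock()      # hypothetical coarse lock
        self.lock_acquisitions = 0

    def insert(self, domain_id, vpn, ppn):
        with self.lock:
            self.lock_acquisitions += 1
            self.entries[(domain_id, vpn)] = ppn

class PerDomainVHPT:
    """One table per UP domain: only one processor ever touches it."""
    def __init__(self):
        self.entries = {}

    def insert(self, vpn, ppn):
        self.entries[vpn] = ppn           # no lock needed for a UP domain

def run_domains(num_domains, inserts_each, use_global):
    """Run one thread per UP domain; return how often the shared lock was taken."""
    shared = GlobalVHPT()
    private = [PerDomainVHPT() for _ in range(num_domains)]

    def domain_work(d):
        for vpn in range(inserts_each):
            if use_global:
                shared.insert(d, vpn, vpn)
            else:
                private[d].insert(vpn, vpn)

    threads = [threading.Thread(target=domain_work, args=(d,))
               for d in range(num_domains)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.lock_acquisitions
```

Calling run_domains(10, 100, use_global=True) sends every one of the 1000
inserts through the same lock, while use_global=False takes it zero times.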

> 
>>> E.g., assume an administrator automatically configures all domains
>>> with a nominal 4GB but ability to dynamically grow up to 64GB.  The
>>> per-guest VHPT would need to pre-allocate a shadow VHPT for the
>>> largest of these (say 1% of 64GB) even if each of the domains never
>>> grew beyond the 4GB, right?  (Either that or some kind of VHPT
>>> resizing might be required whenever memory is "hot-plugged"?)
>> 
>> I am not sure I understand your example. As I said in my previous
>> posting, experience has shown that the optimal size of the VHPT (for
>> performance) is dependent on the number of physical pages it supports
>> (not how many domains, but how many total pages those domains will be
>> using). In other words, the problem of having a VHPT support more memory
>> is independent of whether it represents one domain or multiple domains.
>> It depends on how many total memory pages are being supported.
> 
> OK, let me try again.  Let's assume a system has 64GB and (by whatever
> means) we determine that a 1GB VHPT is the ideal size for a 64GB
> system.  Now let's assume an environment where the "load" (as measured
> by number of active guests competing for a processor) is widely
> variable... say maybe a national lab where one or two hardy allnighters
> run their domains during the night but 16 or more run during the day.
> Assume also that all those domains are running apps that are heavily
> memory intensive, that is they will use whatever memory is made
> available but can operate as necessary with less memory.  So
> when the one domain is running, it balloons up to a full 64GB
> but when many are running, they chug along with 4GB or less each.
> 
> The global VHPT allocates 1GB at Xen boot time.
> 
> How big a VHPT do you allocate for each of the 16 domains?  Surely
> not
> Or are you ballooning the VHPT size along with memory size?

Let's take a careful look at this example.

What I am about to describe is based on the following assumption: the VHPT
should be proportional to the number of physical pages it covers, not to the
number of domains. If you are going to cover 64 GB of memory, then your VHPT
should be of size X (regardless of how many domains use that memory).

In the case of a single global VHPT, you will need to allocate size X if you
are running one domain that uses 64GB or if you are running 16 that each use
4 GB.

In the case of per domain VHPTs (let's assume UP VMs for simplicity) you
will need one VHPT per running domain, sized according to the number of
physical pages that domain is using. If you are running a single domain that
is using 64GB, then you need one VHPT of size X. If you are running 16
domains each one using 4GB, then you will need 16 VHPTs each of size X/16.

If your domains can grow/shrink (the ballooning case you mention above) to
use anywhere from 4GB to 64GB of memory, then in the case of a single VHPT it
is OK to just allocate X. This is wasteful if you are not using all 64GB
(e.g., you are running two domains each using 4GB of memory), but you do not
have a choice (other than dynamically growing/shrinking the VHPT). In the
case of multiple VHPTs you have a similar problem, except that the total size
could be 16X if all 16 domains can grow to 64GB and there is no
growing/shrinking of the VHPT. In other words, growing/shrinking the VHPT
would be more critical to your example if using per domain VHPTs, but it
would also be important in the case of a single VHPT (to avoid allocating
size X when the VMs running are not using all 64GB of memory).
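
The arithmetic above can be written out as a small Python sketch. The 1GB of
VHPT per 64GB of memory ratio is only an assumption borrowed from your
example (it is not a measured optimum), and the function names are made up
for illustration.

```python
GB = 1 << 30

# Assumption: 1GB of VHPT per 64GB of covered memory, taken from the
# example under discussion, so X == 1GB for a 64GB machine.
VHPT_RATIO = 64

def vhpt_size(covered_bytes):
    """VHPT size proportional to the memory it covers, not to domain count."""
    return covered_bytes // VHPT_RATIO

def global_vhpt_size(machine_bytes):
    """A single global VHPT is sized for the whole machine."""
    return vhpt_size(machine_bytes)

def per_domain_vhpt_total(domain_bytes):
    """Sum of per-domain VHPTs, each sized for what its domain covers."""
    return sum(vhpt_size(b) for b in domain_bytes)

X = global_vhpt_size(64 * GB)                              # global case
one_big = per_domain_vhpt_total([64 * GB])                 # 1 domain, 64GB
sixteen_small = per_domain_vhpt_total([4 * GB] * 16)       # 16 x X/16
preallocated_max = per_domain_vhpt_total([64 * GB] * 16)   # no resizing: 16X
```

Under these assumptions one_big and sixteen_small both equal X, and only the
case of preallocating every domain for its 64GB maximum reaches 16X.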

> 
> On a related note, I think you said something about needing to
> guarantee forward progress.  In your implementation, is the VHPT
> required for this?  If so, what happens when a domain migrates
> at the exact point where the VHPT is needed to guarantee forward
> progress?  Or do you plan on moving the VHPT as part of migration?

What I think I said is that having collision chains from the VHPT is
critical to avoiding forward progress issues. The problem is that IPF may
need up to 3 different translations for a single instruction. If you do not
have collision chains and the translations required for a single instruction
(I-side, D-side and RSE) happen to hash to the same VHPT entry, you may get
into a situation in which the entries keep evicting each other and the
guest makes no forward progress (it enters a state in which it alternates
among I-side, D-side and RSE faults). By the way, this is not just
theoretical: I have seen it happen in two different implementations of IPF
virtual MMUs.
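
A toy model of the livelock, in Python. It is a deliberate simplification and
not any real virtual MMU: it assumes the only place a translation can live is
its VHPT slot (the hardware TLB is ignored), it uses a made-up modulo hash,
and all names are hypothetical.

```python
NUM_SLOTS = 8

def slot_of(vpn):
    """Toy hash (assumption): virtual page number modulo table size."""
    return vpn % NUM_SLOTS

def faults_until_progress(needed_vpns, chained, max_faults=50):
    """Count faults until all translations for one instruction coexist.

    Returns None if the instruction still cannot complete after
    max_faults faults, i.e. the guest makes no forward progress.
    """
    table = [[] for _ in range(NUM_SLOTS)]
    faults = 0
    while faults <= max_faults:
        # The instruction restarts and faults on the first missing translation.
        missing = next(
            (v for v in needed_vpns if v not in table[slot_of(v)]), None)
        if missing is None:
            return faults            # I-side, D-side and RSE all present
        faults += 1
        bucket = table[slot_of(missing)]
        if not chained:
            bucket.clear()           # no collision chain: evict the occupant
        bucket.append(missing)
    return None

# Three translations (I-side, D-side, RSE) that hash to the same slot.
colliding = [3, 11, 19]
```

With chained=True the colliding case completes after three faults; with
chained=False it never completes, because each insert evicts a translation
the same instruction still needs.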

First, a clarification: there is no relationship (that I know of) between
migration and forward progress issues, but I will comment on the migration
example anyway. Moving the VHPT at migration time is not necessary. In
general it is not even possible, as that would imply that the same machine
pages are allocated on the target machine for the VM being moved as on the
source machine. You just rebuild it on the new machine (I am assuming that
the contents of the VHPT are demand built, right?). In a migration case you
can start with an empty VHPT (all entries invalid) and let the guest generate
page faults and the VMM build the VHPT.
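
As a rough sketch of what "demand built" means here, in Python, with all
names and page numbers invented for illustration: the guest's own page table
moves with the domain, the guest-physical to machine mapping is different on
the target, and the VHPT simply starts out empty.

```python
class ShadowMMU:
    """Toy VMM-side state for one domain (hypothetical, not Xen code)."""
    def __init__(self, guest_to_machine):
        self.guest_to_machine = dict(guest_to_machine)  # gpn -> machine page
        self.vhpt = {}      # guest virtual page -> machine page, starts empty
        self.faults = 0

    def access(self, vpn, guest_page_table):
        """Translate vpn, filling the VHPT on a miss (the demand-build step)."""
        if vpn not in self.vhpt:
            self.faults += 1
            gpn = guest_page_table[vpn]          # guest's own translation
            self.vhpt[vpn] = self.guest_to_machine[gpn]
        return self.vhpt[vpn]

# Guest state that travels with the domain: virtual page -> guest-physical page.
guest_page_table = {0x10: 1, 0x11: 2}

# On the source machine the VHPT has been populated by earlier faults.
source = ShadowMMU({1: 700, 2: 701})
source.access(0x10, guest_page_table)
source.access(0x11, guest_page_table)

# On the target the machine pages differ, so the old VHPT contents would be
# wrong anyway; start empty and let faults rebuild it.
target = ShadowMMU({1: 42, 2: 99})
```

The first access on the target faults once and installs the new machine
page; later accesses to the same page hit the rebuilt entry.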

> 
>> I see it a bit more black and white than you do.
> 
> Black and white invariably implies a certain set of assumptions.
> I'm not questioning your position given your set of assumptions,
> I'm questioning your assumptions -- as those assumption may
> make good sense in some situations (e.g. VT has to implement
> all possible cases) but less so in others (e.g. paravirtualized).

You keep making this distinction between full virtualization and
paravirtualization, but I don't think it is very relevant to what I am
saying. Please explain how, in a paravirtualized guest, the example I
presented above of 10 UP VMs having to synchronize updates to the VHPT is
not an issue.

> 
> Dan

Bert
