
Re: [Xen-devel] [BUG 1747] Guest couldn't find bootable device with memory more than 3600M



On Thu, 13 Jun 2013, Ian Campbell wrote:
> On Thu, 2013-06-13 at 15:50 +0100, Stefano Stabellini wrote:
> > On Thu, 13 Jun 2013, George Dunlap wrote:
> > > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > > > <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> > > > > > > I went through the code that maps the PCI MMIO regions in
> > > > > > > hvmloader (tools/firmware/hvmloader/pci.c:pci_setup) and it
> > > > > > > looks like it already maps the PCI region to high memory if
> > > > > > > the PCI BAR is 64-bit and the MMIO region is larger than
> > > > > > > 512MB.
> > > > > > > 
> > > > > > > Maybe we could just relax this condition and map the device
> > > > > > > memory to high memory no matter the size of the MMIO region
> > > > > > > if the PCI BAR is 64-bit?
> > > > > > I can only recommend not to: For one, guests not using PAE or
> > > > > > PSE-36 can't map such space at all (and older OSes may not
> > > > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > > > expect this allocation to be done top down (to minimize risk of
> > > > > > running into RAM), and doing so is going to present further risks of
> > > > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > > > PFN to pfn_pte(), the respective parameter of which is
> > > > > > "unsigned long").
> > > > > > 
> > > > > > I think this ought to be done in an iterative process - if all MMIO
> > > > > > regions together don't fit below 4G, the biggest one should be
> > > > > > moved up beyond 4G first, followed by the next to biggest one
> > > > > > etc.
> > > > > First of all, the proposal to move the PCI BAR up to the 64-bit
> > > > > range is a temporary work-around.  It should only be done if a
> > > > > device doesn't fit in the current MMIO range.
> > > > > 
> > > > > We have four options here:
> > > > > 1. Don't do anything
> > > > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if
> > > > > they don't fit
> > > > > 3. Convince qemu to allow MMIO regions to mask memory (or what it
> > > > > thinks is memory).
> > > > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > > > 
> > > > > Number 4 is definitely the right answer long-term, but we just
> > > > > don't have time to do that before the 4.3 release.  We're not sure
> > > > > yet if #3 is possible; even if it is, it may have unpredictable
> > > > > knock-on effects.
> > > > > 
> > > > > Doing #2, it is true that many guests will be unable to access the
> > > > > device because of 32-bit limitations.  However, in #1, *no* guests
> > > > > will be able to access the device.  At least in #2, *many* guests
> > > > > will be able to do so.  In any case, apparently #2 is what KVM
> > > > > does, so having the limitation on guests is not without precedent.
> > > > > It's also likely to be a somewhat tested configuration (unlike #3,
> > > > > for example).
> > > > I would avoid #3, because I don't think it's a good idea to rely on
> > > > that behaviour.
> > > > I would also avoid #4, because having seen QEMU's code, it wouldn't
> > > > be easy and certainly not doable in time for 4.3.
> > > > 
> > > > So we are left to play with the PCI MMIO region size and location in
> > > > hvmloader.
> > > > 
> > > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > > devices to the region above 4G. I meant to say that we should relocate
> > > > only the ones that don't fit. And we shouldn't try to dynamically
> > > > increase the PCI hole below 4G because clearly that doesn't work.
> > > > However we could still increase the default size of the PCI hole
> > > > below 4G, by moving its start from 0xf0000000 to 0xe0000000.
> > > > Why do we know that is safe? Because in the current configuration
> > > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > > address every time a device doesn't fit.
> > > > So it's already common for hvmloader to set pci_mem_start to
> > > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > > enough.
> > > > 
> > > > 
> > > > My proposed solution is:
> > > > 
> > > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > > qemu-xen-traditional
> > > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > > bars
> > > 
> > > Also, as I understand it, at the moment:
> > > 1. Some operating systems (32-bit XP) won't be able to use relocated
> > > devices
> > > 2. Some devices (without 64-bit BARs) can't be relocated
> > > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > > 
> > > So if we have #1 or #2, at the moment an option for a work-around is
> > > to use qemu-traditional.
> > > 
> > > However, if we add your "print an error if the device doesn't fit",
> > > then this option will go away -- this will be a regression in
> > > functionality from 4.2.
> > 
> > Keep in mind that if we start the pci hole at 0xe0000000, the number of
> > cases for which any workarounds are needed is going to be dramatically
> > decreased to the point that I don't think we need a workaround anymore.
> 
> Starting at 0xe0000000 leaves, as you say, a 448MB hole. With graphics
> cards regularly having 512M+ of RAM on them, that suggests the
> workaround will be required in many cases.

http://www.nvidia.co.uk/object/graphics_cards_buy_now_uk.html

Actually more than half of the graphics cards sold today have >= 2G of
video RAM, so they wouldn't fit below 4G even in the old scheme, which
gives at most 2G of PCI hole below 4G.

So the resulting configurations would be the same: the devices would be
located above 4G.
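For concreteness, the hole sizes being compared work out as below. This is just a sanity check of the arithmetic, using the 0xfc000000 MMIO ceiling from the 0xfc000000-0xe0000000 calculation quoted in this thread, not hvmloader code:

```python
# Proposed PCI hole: starts at 0xe0000000, ends at the 0xfc000000 ceiling.
new_hole = 0xFC000000 - 0xE0000000

print(new_hole >> 20)  # 448 (MiB)

# A 512 MiB BAR (which must be naturally aligned to its size) cannot fit
# in a 448 MiB hole, so a typical modern graphics card goes above 4G
# under either scheme.
print((512 << 20) > new_hole)  # True
```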


> > The algorithm is going to work like this in details:
> > 
> > - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> > - we calculate the total mmio size, if it's bigger than the pci hole we
> > raise a 64 bit relocation flag
> > - if the 64 bit relocation is enabled, we relocate above 4G the first
> > device that is 64-bit capable and has an MMIO size greater or equal to
> > 512MB
> 
> Don't you mean the device with the largest MMIO size? Otherwise 2 256MB
> devices would still break things.

You are right, that would be much better.
It's worth mentioning that the problem you have just identified exists
even in the current scheme. In fact you could reach a non-configurable
state by passing through 4 graphics cards with 512MB of video RAM each.
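The amended algorithm (relocate the largest 64-bit-capable BARs above 4G until what remains fits in the hole) can be sketched as follows. This is only a model of the sizing logic under discussion: the (size, is_64bit) representation is hypothetical, and alignment handling (Paolo's point about large alignments first) is omitted:

```python
HOLE_SIZE = 0xFC000000 - 0xE0000000  # 448 MiB with a 0xe0000000 start

def split_bars(bars):
    """bars: list of (size_in_bytes, is_64bit) tuples.
    Returns (below_4g, above_4g): largest 64-bit-capable BARs are moved
    above 4G until the rest fit in the hole.  If the leftovers still
    exceed the hole, some device won't fit (non-critical error)."""
    below = sorted(bars, key=lambda b: b[0], reverse=True)
    above = []
    total = sum(size for size, _ in below)
    for bar in list(below):          # iterate over a copy while removing
        if total <= HOLE_SIZE:
            break                    # everything left now fits below 4G
        size, is64 = bar
        if is64:
            below.remove(bar)
            above.append(bar)
            total -= size
    if total > HOLE_SIZE:
        # e.g. four 512 MiB 32-bit-only framebuffers: nothing can move up
        print("error: some devices do not fit below 4G")
    return below, above

# Two 256 MiB 64-bit devices: together they exceed 448 MiB, so one is
# relocated above 4G and the other stays below.
below, above = split_bars([(256 << 20, True), (256 << 20, True)])
```

Relocating largest-first is what makes the two-256MiB-device case work: picking an arbitrary 64-bit device of >= 512MiB, as in the original wording, would relocate neither of them.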


> Paulo's comment about large alignments first is also worth considering.
>
> > - if the pci hole size is now big enough for the remaining devices we
> > stop the above 4G relocation, otherwise keep relocating devices that are
> > 64 bit capable and have an MMIO size greater or equal to 512MB
> > - if one or more devices don't fit we print an error and continue (it's
> > not a critical failure, one device won't be used)
> 
> This can result in a different device being broken to the one which
> would previously have been broken, including on qemu-trad I think?

Previously the guest would fail to boot with qemu-xen. On
qemu-xen-traditional the configuration would be completely different:
fewer devices would be relocated above 4G and the pci hole would be
bigger. So you are right the devices being broken would be different.


> > We could have a xenstore flag somewhere that enables the old behaviour
> > so that people can revert back to qemu-xen-traditional and make the pci
> > hole below 4G even bigger than 448MB, but I think that keeping the old
> > behaviour around is going to make the code more difficult to maintain.
> 
> The downside of that is that things which worked with the old scheme may
> not work with the new one though. Early in a release cycle when we have
> time to discover what has broken then that might be OK, but is post rc4
> really the time to be risking it?

Yes, you are right: there are some scenarios that would have worked
before that wouldn't work anymore with the new scheme.
Are they important enough to justify a workaround that is pretty
difficult for a user to identify?


> > Also it's difficult for people to realize that they need the workaround
> > because hvmloader logs aren't enabled by default and only go to the Xen
> > serial console. The value of this workaround is pretty low in my view.
> > Finally it's worth noting that Windows XP is going EOL in less than an
> > year.
> 
> That's been true for something like 5 years...
> 
> Also, apart from XP, doesn't Windows still pick a HAL at install time,
> so even a modern guest installed under the old scheme may not get a PAE
> capable HAL. If you increase the amount of RAM I think Windows will
> "upgrade" the HAL, but is changing the MMIO layout enough to trigger
> this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?
> 
> There are also performance implications of enabling PAE over 2 level
> paging. Not sure how significant they are with HAP though. Made a big
> difference with shadow IIRC.
> 
> Maybe I'm worrying about nothing but while all of these unknowns might
> be OK towards the start of a release cycle rc4 seems awfully late in the
> day to be risking it.

Keep in mind that all these configurations are perfectly valid even with
the code that we have out there today. We aren't doing anything new,
just modifying the default.
One just needs to assign a PCI device with more than 190MB to trigger it.
I am trusting the fact that, given that we have had this behaviour for
many years now, and that it's pretty common to assign a device only some
of the times you boot your guest, any problems would already have come
up.
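For reference, the ~190MB trigger threshold matches the old default hole: with the hole starting at 0xf0000000 and the same 0xfc000000 ceiling quoted earlier, there are 192 MiB available, and the assumption here is that the remaining couple of MiB are consumed by the other emulated devices' BARs:

```python
# Old default PCI hole: starts at 0xf0000000, ends at the 0xfc000000
# ceiling, so assigning a device with more MMIO than this (minus the
# other devices' BARs, roughly 190MB in practice) forces a resize.
old_hole = 0xFC000000 - 0xF0000000
print(old_hole >> 20)  # 192 (MiB)
```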

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

