
Re: [Xen-devel] RFC: QEMU bumping memory limit and domain restore



On 06/03/2015 02:32 PM, Andrew Cooper wrote:
> On 03/06/15 14:22, George Dunlap wrote:
>> On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@xxxxxxxxxx> wrote:
>>> Previous discussion at [0].
>>>
>>> For the benefit of discussion, we refer to max_memkb inside the
>>> hypervisor as hv_max_memkb (name subject to improvement). That's the
>>> maximum amount of memory a domain can use.
>> Why don't we try to use "memory" for virtual RAM that we report to the
>> guest, and "pages" for what exists inside the hypervisor?  "Pages" is
>> the term the hypervisor itself uses internally (i.e., set_max_mem()
>> actually changes a domain's max_pages value).
>>
>> So in this case both guest memory and option ROMs are created using
>> hypervisor pages.
>>
>>> Libxl doesn't know what hv_max_memkb a domain needs prior to QEMU
>>> start-up, because of option ROMs etc.
>> So a translation of this using "memory/pages" terminology would be:
>>
>> QEMU may need extra pages from Xen to implement option ROMs, and so at
>> the moment it calls set_max_mem() to increase max_pages so that it can
>> allocate more pages to the guest.  libxl doesn't know what max_pages a
>> domain needs prior to qemu start-up.
>>
>>> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
>>> there is no mechanism to communicate between QEMU and libxl. This is an
>>> area that needs improvement; we've encountered problems in this area
>>> before.
>> [translating]
>> Libxl doesn't know max_pages even after qemu start-up, because there
>> is no mechanism to communicate between qemu and libxl.
>>
>>> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages.
>>> Those pages are only accounted for in the hypervisor. During migration,
>>> libxl (currently) doesn't extract that value from the hypervisor.
>> [translating]
>> qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
>> Those pages are only accounted for in the hypervisor.  libxl
>> (currently) does not extract that value from the hypervisor.
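
(For concreteness, the bump amounts to a read-modify-write against libxc.
A sketch only, not the literal qemu code, with error handling and unit
conversion simplified:)

/* Sketch only (not the literal qemu code): raise the domain's
 * max_pages so that nr_extra_pages of option ROM can be populated. */
#include <xenctrl.h>

static int bump_max_pages(xc_interface *xch, uint32_t domid,
                          unsigned long nr_extra_pages)
{
    xc_dominfo_t info;

    /* Read the current limit; libxc reports it in KiB as max_memkb. */
    if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
        info.domid != domid)
        return -1;

    /* Write back the old limit plus N pages (4 KiB each). */
    return xc_domain_setmaxmem(xch, domid,
                               info.max_memkb +
                               (nr_extra_pages << (XC_PAGE_SHIFT - 10)));
}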
>>
>>> So now the problem is on the remote end:
>>>
>>> 1. Libxl indicates the domain needs X pages.
>>> 2. The domain actually needs X + N pages.
>>> 3. The remote end tries to write N more pages and fails.
>>>
>>> This behaviour currently doesn't affect normal migration (where you
>>> transfer the libxl JSON to the remote end, construct a domain, then
>>> start QEMU) because QEMU won't bump hv_max_memkb again. This is by
>>> design and reflected in the QEMU code.
>> I don't understand this paragraph -- does the remote domain actually
>> need X+N pages or not?  If it does, in what way does this behavior
>> "not affect normal migration"?
>>
>>> This behaviour affects COLO and becomes a bug in that case, because the
>>> secondary VM's QEMU doesn't go through the same start-of-day
>>> initialisation (Hongyang, correct me if I'm wrong), i.e. it doesn't bump
>>> hv_max_memkb inside QEMU.
>>>
>>> Andrew plans to embed JSON inside migration v2, and COLO is based on
>>> migration v2. The bug is fixed if the JSON is correct in the first place.
>>>
>>> As COLO is not yet upstream, this bug is not a blocker for 4.6, but it
>>> should be fixed for the benefit of COLO.
>>>
>>> So here is a proof-of-concept patch to record and honour that value
>>> during migration.  A new field is added to the IDL. Note that we don't
>>> provide an xl-level config option for it, and we mandate that it keeps
>>> its default value during domain creation. This is to prevent libxl users
>>> from setting it, to avoid unforeseen repercussions.
>>>
>>> This patch is compile-tested only. If we agree this is the way to go I
>>> will test it and submit a proper patch.
>> Reading max_pages from Xen and setting it on the far side seems like a
>> reasonable option.
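
(A sketch of what that could look like at the libxc level, leaving aside
exactly where in libxl / migration v2 the value gets carried; the helper
names here are made up for illustration:)

/* Sketch only; helper names are made up.  The sender records the
 * hypervisor-side limit, the receiver re-applies it before any pages
 * are inserted into the new domain. */
#include <xenctrl.h>

/* Sending side: capture max_pages (as max_memkb) for the stream/JSON. */
static int record_max_memkb(xc_interface *xch, uint32_t domid,
                            uint64_t *max_memkb_out)
{
    xc_dominfo_t info;

    if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
        info.domid != domid)
        return -1;
    *max_memkb_out = info.max_memkb;
    return 0;
}

/* Receiving side: honour the recorded value before restoring pages. */
static int apply_max_memkb(xc_interface *xch, uint32_t domid,
                           uint64_t max_memkb)
{
    return xc_domain_setmaxmem(xch, domid, max_memkb);
}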
> 
> It is the wrong level to fix the bug.  Yes - it will (and does) fix one
> manifestation of the bug, but does not solve the problem.
> 
>>   Is there a reason we can't add a magic XC_SAVE_ID
>> for v1, like we do for other parameters?
> 
> Amongst other things, playing with xc_domain_setmaxmem() is liable to
> cause a PoD domain to be shot by Xen because the PoD cache was not
> adjusted at the same time that maxmem was.

As far as I can tell that's completely unrelated.

The PoD target needs to be updated when you change the *balloon driver*
target in xenstore.  libxl_domain_setmaxmem() for instance won't update
the PoD target either.
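
(For reference, a simplified sketch of roughly what the balloon-target
path does for a PoD guest, which a bare xc_domain_setmaxmem() never does;
this is not the exact libxl code:)

/* Simplified sketch of the balloon-target path for a PoD guest; this
 * is not the exact libxl code. */
#include <xenctrl.h>

static int set_balloon_target(xc_interface *xch, uint32_t domid,
                              uint64_t new_target_memkb)
{
    uint64_t tot_pages, pod_cache_pages, pod_entries;

    /* Keep the PoD target/cache in sync with the new memory target... */
    if (xc_domain_set_pod_target(xch, domid,
                                 new_target_memkb >> (XC_PAGE_SHIFT - 10),
                                 &tot_pages, &pod_cache_pages,
                                 &pod_entries))
        return -1;

    /* ...and then tell the balloon driver by writing new_target_memkb
     * to the domain's memory/target node in xenstore (elided here). */
    return 0;
}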

> Only libxl is in a position to safely adjust domain memory.

I was initially of the same mind as you in this matter.  But if you stop
thinking about this as "the domain's memory", and start instead thinking
about it as qemu modifying the state of the machine (by, say, adding
option ROMs), then I think it can be less strict.  The *memory* (i.e.,
what is seen by the guest as virtual RAM) is controlled by libxl, but the
*pages* may be manipulated by others (as they are by the balloon driver,
for instance).

Although actually -- I guess that exposes another issue with this: what
if someone calls setmaxmem in libxl?  libxl doesn't know about the extra
N pages that qemu wanted, so it won't factor that into its setmaxmem
calculation.  In particular, qemu's read-modify-write of setmaxmem isn't
atomic -- there's almost certainly a race if someone were to call
libxl_domain_setmaxmem at the same time qemu was trying to add a few pages.
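
(To spell the race out, here is a sketch with both adjustments squashed
into one function so that the window is visible; in reality they are
separate processes:)

/* Sketch of the lost update; in reality the two callers are separate
 * processes.  libxl_memkb stands for whatever absolute value libxl
 * computes from its own configuration, which knows nothing about
 * qemu's extra N pages. */
#include <xenctrl.h>

static int demo_lost_update(xc_interface *xch, uint32_t domid,
                            uint64_t qemu_extra_kb, uint64_t libxl_memkb)
{
    xc_dominfo_t info;

    /* qemu reads the current limit... */
    if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
        info.domid != domid)
        return -1;

    /* ...libxl's setmaxmem lands in the window... */
    xc_domain_setmaxmem(xch, domid, libxl_memkb);

    /* ...and qemu's write, based on the stale read, clobbers it.
     * Reverse the order and qemu's extra pages are the ones lost. */
    return xc_domain_setmaxmem(xch, domid, info.max_memkb + qemu_extra_kb);
}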

In any case, as Wei said, you have the problem that by the time qemu
figures out that it wants some extra pages to implement an option ROM,
libxl has already determined that its job was complete and walked off.
Changing that requires a more fundamental architectural change.

Perhaps joining this with the security deprivileging thing would make
some sense?  I.e., have qemu do no hypercalls at all, but have them done
instead by a "babysitter" process from libxl, which could then update
the domain config state as well.

 -George
