
Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list



On 14/11/14 15:32, Juergen Gross wrote:
> On 11/14/2014 03:59 PM, Andrew Cooper wrote:
>> On 14/11/14 14:14, Jürgen Groß wrote:
>>> On 11/14/2014 02:56 PM, Andrew Cooper wrote:
>>>> On 14/11/14 12:53, Juergen Gross wrote:
>>>>> On 11/14/2014 12:41 PM, Andrew Cooper wrote:
>>>>>> On 14/11/14 09:37, Juergen Gross wrote:
>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
>>>>>>> currently contains the mfn of the top-level page frame of the
>>>>>>> 3-level p2m tree, which is used by the Xen tools during saving and
>>>>>>> restoring (and live migration) of pv domains and for crash dump
>>>>>>> analysis. With three levels of the p2m tree it is possible to
>>>>>>> support up to 512 GB of RAM for a 64-bit pv domain.
>>>>>>>
>>>>>>> A 32-bit pv domain can support more, as each memory page can hold
>>>>>>> 1024 entries instead of 512, leading to a limit of 4 TB.
>>>>>>>
>>>>>>> To be able to support more RAM on x86-64, switch to a virtually
>>>>>>> mapped p2m list.
>>>>>>>
>>>>>>> This patch expands struct arch_shared_info with the virtual
>>>>>>> address of the new p2m list and the mfn of the page table root.
>>>>>>> The domain indicates that this new information is valid by storing
>>>>>>> ~0UL in pfn_to_mfn_frame_list_list. The hypervisor advertises the
>>>>>>> feature via a new flag, XENFEAT_virtual_p2m.
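As a rough illustration of the proposed interface, a minimal sketch of the
extended structure (the new field names p2m_vaddr and p2m_cr3 below are
assumptions drawn from the description, not necessarily those used in the
patch):

    /* Sketch only: the new field names are illustrative. */
    struct arch_shared_info {
        unsigned long max_pfn;                    /* max pfn that appears in table */
        unsigned long pfn_to_mfn_frame_list_list; /* ~0UL => use the new fields */
        unsigned long nmi_reason;
        /* Hypothetical new fields, only meaningful when the domain has
         * stored ~0UL above and XENFEAT_virtual_p2m is advertised. */
        unsigned long p2m_vaddr;   /* guest-virtual address of the linear p2m list */
        unsigned long p2m_cr3;     /* mfn of the page table root mapping that list */
    };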
>>>>>>
>>>>>> How do you envisage this being used?  Are you expecting the tools
>>>>>> to do manual pagetable walks using xc_map_foreign_xxx()?
>>>>>
>>>>> Yes. Not very different from today's mapping via the 3-level p2m
>>>>> tree: just another entry format, 4 levels instead of 3, and starting
>>>>> at an offset.
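For reference, such a walk might look like the following minimal sketch
using libxc's xc_map_foreign_range(), assuming 4K p2m pages, no large-page
entries, and the hypothetical p2m page table root mfn from the new
interface; error handling is omitted:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <xenctrl.h>

    #define PAGE_SHIFT   12
    #define PTE_MFN_MASK 0x000ffffffffff000ULL

    /* Map one pagetable page of the domain, read the entry at 'idx' and
     * return the mfn of the next level. */
    static uint64_t walk_level(xc_interface *xch, uint32_t domid,
                               uint64_t table_mfn, unsigned int idx)
    {
        uint64_t *tab = xc_map_foreign_range(xch, domid, 1 << PAGE_SHIFT,
                                             PROT_READ, table_mfn);
        uint64_t pte = tab[idx];

        munmap(tab, 1 << PAGE_SHIFT);
        return (pte & PTE_MFN_MASK) >> PAGE_SHIFT;
    }

    /* Return the mfn of the p2m page backing guest-virtual address 'va',
     * starting from the (hypothetical) p2m page table root mfn. */
    static uint64_t p2m_frame_mfn(xc_interface *xch, uint32_t domid,
                                  uint64_t root_mfn, uint64_t va)
    {
        uint64_t mfn = root_mfn;

        mfn = walk_level(xch, domid, mfn, (va >> 39) & 0x1ff);  /* level 4 */
        mfn = walk_level(xch, domid, mfn, (va >> 30) & 0x1ff);  /* level 3 */
        mfn = walk_level(xch, domid, mfn, (va >> 21) & 0x1ff);  /* level 2 */
        mfn = walk_level(xch, domid, mfn, (va >> 12) & 0x1ff);  /* level 1 */
        return mfn;
    }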
>>>>
>>>> Yes - David and I were discussing this over lunch, and it is not
>>>> actually very different.
>>>>
>>>> In reality, how likely is it that the pages backing this virtual
>>>> linear array change?
>>>
>>> Very unlikely, I think. But not impossible.
>>>
>>>> One issue currently is that, during the live part of migration, the
>>>> toolstack has no way of working out whether the structure of the p2m
>>>> has changed (intermediate leaves rearranged, or the length
>>>> increasing).
>>>>
>>>> In the case that the VM does change the structure of the p2m under
>>>> the feet of the toolstack, migration will either blow up in a
>>>> non-subtle way with a p2m/m2p mismatch, or in a subtle way with the
>>>> receiving side copying the new p2m over the wrong part of the new
>>>> domain.
>>>>
>>>> I am wondering whether, with this new p2m method, we can take
>>>> sufficient steps to guarantee that mishaps like this can't occur.
>>>
>>> This should be easy: I could add a counter in arch_shared_info which
>>> is incremented whenever a p2m mapping is changed. The toolstack could
>>> compare the counter values at the start and at the end of migration
>>> and redo the migration (or fail) if they differ. In order to avoid
>>> races I would have to increment the counter before and after changing
>>> the mapping.
>>>
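A minimal sketch of that counter scheme (seqlock-style; the name
p2m_generation and the helpers below are illustrative, not the actual
proposal):

    /* Guest side: bump the counter before and after every structural p2m
     * change, so an in-progress update is visible as an odd value. */
    void remap_p2m_frame(volatile unsigned long *p2m_generation,
                         void (*do_remap)(void))
    {
        (*p2m_generation)++;   /* odd: update in progress */
        /* write barrier assumed here */
        do_remap();            /* change the mapping of the p2m frame */
        /* write barrier assumed here */
        (*p2m_generation)++;   /* even again: update complete */
    }

    /* Toolstack side: sample the counter before mapping the p2m and
     * re-check it later; an odd or changed value means the structure
     * moved underneath us. */
    int p2m_structure_unchanged(unsigned long before, unsigned long after)
    {
        return !(before & 1) && (before == after);
    }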
>>
>> That is insufficient, I believe.
>>
>> Consider:
>>
>> * Toolstack walks pagetables and maps the frames containing the linear
>>   p2m
>> * Live migration starts
>> * VM remaps a frame in the middle of the linear p2m
>> * Live migration continues, but the toolstack has a stale frame in the
>>   middle of its view of the p2m.
>
> This would be covered by my suggestion. At the end of the memory
> transfer (with some bogus contents) the toolstack would discover the
> change of the p2m structure and either fail the migration or start it
> from the beginning, thus overwriting the bogus frames.

Checking after pause is too late.  The content of the p2m is used to
verify each frame being sent on the wire, so it is in active use for the
entire duration of live migration.

If the toolstack starts verifying frames being sent using information
from a stale p2m, the best that can be hoped for is that the toolstack
declares the p2m and m2p inconsistent and aborts the migration.
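For context, that per-frame verification amounts to something like the
following sketch (illustrative names, not the actual migration code); a
stale view of the p2m makes it fail, or worse, silently pass for the
wrong pfn:

    #include <stdbool.h>

    #define INVALID_MFN (~0UL)

    /* A pfn is safe to send only if the p2m and m2p agree on it. */
    static bool pfn_consistent(const unsigned long *p2m,
                               const unsigned long *m2p,
                               unsigned long pfn)
    {
        unsigned long mfn = p2m[pfn];

        if (mfn == INVALID_MFN)
            return true;            /* unpopulated: nothing to send */

        return m2p[mfn] == pfn;     /* fails if the p2m view is stale */
    }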

>
>> As the p2m is almost never expected to change, I think it might be
>> better to have a flag the toolstack can set to say "The toolstack is
>> peeking at your p2m behind your back - you must not change its
>> structure."
>
> Be careful here: changes of the structure can be due to two scenarios:
> - ballooning (invalid entries being populated): this is no problem, as
>   we can stop the ballooning during live migration.
> - mapping of grant pages, e.g. in a stub domain (first map in an area
>   formerly marked as invalid): you can't stop this, as the stub domain
>   has to do some work. Here a restart of the migration should work, as
>   the p2m structure change can only happen once for each affected p2m
>   page.

Migration is not at all possible with a domain referencing foreign frames.

The live part can cope with foreign frames referenced in the ptes.  As
part of the pause handling in the VM, the frontends must unmap any
grants they have.  After pause, any remaining foreign frames cause a
migration failure.

>
>> Having just thought this through, I think there is also a race
>> condition between a VM changing an entry in the p2m and the toolstack
>> verifying the frames being sent.
>
> Okay, so the flag you mentioned should just prohibit changes in the
> p2m list related to memory frames of the affected domain: ballooning
> up or down, or rearranging the memory layout (does this happen today?).
> Mapping and unmapping of grant pages should still be allowed.

HVM guests don't have any of their p2m updates represented in the
logdirty bitmap, so ballooning an HVM guest during migration leads to
unexpected holes (or missing holes) on the resuming side, and thus to a
very confused balloon driver.

At the time I had not found a problem with PV guests, but it is now
clear that there is a window, while a guest is altering its p2m, during
which the p2m and m2p are out of sync; this will cause a migration
failure if the toolstack observes that inconsistency.

~Andrew

