
Re: [Xen-devel] [PATCH v3 1/2] docs/designs: Add a design document for non-cooperative live migration



> -----Original Message-----
> From: Wei Liu <wl@xxxxxxx>
> Sent: 28 January 2020 13:41
> To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx; Andrew Cooper
> <andrew.cooper3@xxxxxxxxxx>; George Dunlap <George.Dunlap@xxxxxxxxxxxxx>;
> Ian Jackson <ian.jackson@xxxxxxxxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>;
> Julien Grall <julien@xxxxxxx>; Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx>; Stefano Stabellini <sstabellini@xxxxxxxxxx>; Wei
> Liu <wl@xxxxxxx>
> Subject: Re: [PATCH v3 1/2] docs/designs: Add a design document for non-
> cooperative live migration
> 
> Thanks for writing this up. I skimmed through it. It looks sensible.
> 
> On Tue, Jan 28, 2020 at 12:28:22PM +0000, Paul Durrant wrote:
> > It has become apparent to some large cloud providers that the current
> > model of cooperative migration of guests under Xen is not usable as it
> > relies on software running inside the guest, which is likely beyond the
> > provider's control.
> > This patch introduces a proposal for non-cooperative live migration,
> > designed not to rely on any guest-side software.
> >
> > Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> > Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> > Cc: Jan Beulich <jbeulich@xxxxxxxx>
> > Cc: Julien Grall <julien@xxxxxxx>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > Cc: Wei Liu <wl@xxxxxxx>
> >
> > v2:
> >  - Use the term 'non-cooperative' instead of 'transparent'
> >  - Replace 'trust in' with 'reliance on' when referring to guest-side
> >    software
> > ---
> >  docs/designs/non-cooperative-migration.md | 259 ++++++++++++++++++++++
> >  1 file changed, 259 insertions(+)
> >  create mode 100644 docs/designs/non-cooperative-migration.md
> >
> > diff --git a/docs/designs/non-cooperative-migration.md
> b/docs/designs/non-cooperative-migration.md
> > new file mode 100644
> > index 0000000000..f38d664c34
> > --- /dev/null
> > +++ b/docs/designs/non-cooperative-migration.md
> > @@ -0,0 +1,259 @@
> > +# Non-Cooperative Migration of Guests on Xen
> > +
> > +## Background
> > +
> > +The normal model of migration in Xen is driven by the guest because it
> was
> > +originally implemented for PV guests, where the guest must be aware it
> is
> > +running under Xen and is hence expected to co-operate. This model dates
> from
> > +an era when it was assumed that the host administrator had control of
> at least
> > +the privileged software running in the guest (i.e. the guest kernel)
> which may
> > +still be true in an enterprise deployment but is not generally true in
> a cloud
> > +environment. The aim of this design is to provide a model which is
> purely host
> > +driven, requiring no co-operation from the software running in the
> > +guest, and is thus suitable for cloud scenarios.
> > +
> > +PV guests are out of scope for this project because, as is outlined
> above, they
> > +have a symbiotic relationship with the hypervisor and therefore a
> certain level
> > +of co-operation can be assumed.
> 
> Missing newline here?
> 

Yep.

> > +HVM guests can already be migrated on Xen without guest co-operation
> but only
> > +if they don’t have PV drivers installed[1] or are in power state S3.
> The
> > +reason for not expecting co-operation if the guest is in S3 is obvious,
> but the
> > +reason co-operation is expected if PV drivers are installed is due to
> the
> > +nature of PV protocols.
> > +
> > +## Xenstore Nodes and Domain ID
> > +
> > +The PV driver model consists of a *frontend* and a *backend*. The
> frontend runs
> > +inside the guest domain and the backend runs inside a *service domain*
> which
> > +may or may not domain 0. The frontend and backend typically pass data
> via
> 
> "may or may not _be_ domain 0"
> 

Ack.

> > +memory pages which are shared between the two domains, but this channel
> of
> > +communication is generally established using xenstore (the store
> protocol
> > +itself being an exception to this for obvious chicken-and-egg reasons).
> > +
> > +Typical protocol establishment is based on use of two separate xenstore
> > +*areas*. If we consider PV drivers for the *netif* protocol (i.e. class
> vif)
> > +and assume the guest has domid X, the service domain has domid Y, and
> the vif
> > +has index Z then the frontend area will reside under the parent node:
> 
> The term "parent" shows up first time in this document. What does it
> mean in Xen's context?
> 

I'd hope it's well known that xenstore is hierarchical. I can add a short 
explanation if you think it’s needed.

> > +
> > +`/local/domain/X/device/vif/Z`
> > +
> > +All backends, by convention, typically reside under parent node:
> > +
> > +`/local/domain/Y/backend`
> > +
> > +and the normal backend area for vif Z would be:
> > +
> > +`/local/domain/Y/backend/vif/X/Z`
> > +
> > +but this should not be assumed.
> > +
> > +The toolstack will place two nodes in the frontend area to explicitly
> locate
> > +the backend:
> > +
> > +    * `backend`: the fully qualified xenstore path of the backend area
> > +    * `backend-id`: the domid of the service domain
> > +
> > +and similarly two nodes in the backend area to locate the frontend
> area:
> > +
> > +    * `frontend`: the fully qualified xenstore path of the frontend
> area
> > +    * `frontend-id`: the domid of the guest domain
> > +
> > +
> > +The guest domain only has write permission to the frontend area and
> similarly
> > +the service domain only has write permission to the backend area, but
> both ends
> > +have read permission to both areas.
> > +
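[Purely for illustration, not part of the patch: with made-up values of X=5
(guest), Y=0 (service domain) and Z=0, the locator nodes described above would
look something like this:]

    /local/domain/5/device/vif/0/backend = "/local/domain/0/backend/vif/5/0"
    /local/domain/5/device/vif/0/backend-id = "0"
    /local/domain/0/backend/vif/5/0/frontend = "/local/domain/5/device/vif/0"
    /local/domain/0/backend/vif/5/0/frontend-id = "5"
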
> > +Under both frontend and backend areas is a node called *state*. This is
> key to
> > +protocol establishment. Upon PV device creation the toolstack will set
> the
> > +value of both state nodes to 1 (XenbusStateInitialising[2]). This
> should cause
> > +enumeration of appropriate devices in both the guest and service
> domains. The
> > +backend device, once it has written any necessary protocol specific
> information
> > +into the xenstore backend area (to be read by the frontend driver) will
> update
> > +the backend state node to 2 (XenbusStateInitWait). From this point on
> PV
> > +protocols differ slightly; the following illustration is true of the
> netif
> > +protocol.
> 
> Missing newline?
> 

Yes. I'll fix all the paragraph breaks.

> > +Upon seeing a backend state value of 2, the frontend driver will then
> read the
> > +protocol specific information, write details of grant references (for
> shared
> > +pages) and event channel ports (for signalling) that it has created,
> and set
> > +the state node in the frontend area to 4 (XenbusStateConnected). Upon
> seeing this
> > +frontend state, the backend driver will then read the grant references
> (mapping
> > +the shared pages) and event channel ports (opening its end of them) and
> set the
> > +state node in the backend area to 4. Protocol establishment is now
> complete and
> > +the frontend and backend start to pass data.
> > +
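[For reference, the state values mentioned here come from xenbus.h [2];
abridged, with comments reflecting the netif handshake described above:]

    enum xenbus_state {
        XenbusStateInitialising = 1, /* set by the toolstack on device creation */
        XenbusStateInitWait     = 2, /* backend has written its protocol info   */
        XenbusStateInitialised  = 3,
        XenbusStateConnected    = 4, /* rings and event channels are in place   */
        XenbusStateClosing      = 5,
        XenbusStateClosed       = 6,
    };
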
> > +Because the domid of both ends of a PV protocol forms a key part of
> negotiating
> > +the data plane for that protocol (because it is encoded into both
> xenstore
> > +nodes and node paths), and because the guest’s own domid and the domid of
> the
> > +service domain are visible to the guest in xenstore (and hence may
> be cached
> > +internally), and neither are necessarily preserved during migration, it
> is
> > +hence necessary to have the co-operation of the frontend in re-
> negotiating the
> > +protocol using the new domid after migration.
> 
> Add newline here?
> 

Yeah... looks like the tool I exported from is a bit buggy.

> > +Moreover the backend-id value will be used by the frontend driver in
> setting up
> > +grant table entries and event channels to communicate with the service
> domain,
> > +so the co-operation of the guest is required to re-establish these in
> the new
> > +host environment after migration.
> > +
> > +Thus if we are to change the model and support migration of a guest
> with PV
> > +drivers, without the co-operation of the frontend driver code, the
> paths and
> > +values in both the frontend and backend xenstore areas must remain
> unchanged
> > +and valid in the new host environment, and the grant table entries and
> event
> > +channels must be preserved (and remain operational once guest execution
> is
> > +resumed).
> 
> Add newline here?
> 
> > +Because the service domain’s domid is used directly by the guest in
> setting
> > +up grant entries and event channels, the backend drivers in the new
> host
> > +environment must be provided by a service domain with the same domid.
> Also,
> > +because the guest can sample its own domid from the frontend area and
> use it in
> > +hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest
> domid must
> > +also be preserved to maintain the ABI.
> > +
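[A rough sketch of the ABI issue, assuming a Linux-style guest; the domid value
5 and the helper name are made up, while HVMOP_set_param and struct
xen_hvm_param are from the public headers:]

    #include <xen/interface/hvm/hvm_op.h>  /* HVMOP_set_param, xen_hvm_param */
    #include <xen/interface/hvm/params.h>  /* HVM_PARAM_CALLBACK_IRQ */

    static void set_callback_via(uint64_t via)
    {
        struct xen_hvm_param p = {
            .domid = 5,  /* domid cached from xenstore, rather than DOMID_SELF */
            .index = HVM_PARAM_CALLBACK_IRQ,
            .value = via,
        };

        HYPERVISOR_hvm_op(HVMOP_set_param, &p); /* Linux hypercall wrapper */
    }

[If the numeric domid changed across migration, the cached value would silently
become wrong, which is why the domid effectively forms part of the ABI.]
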
> > +Furthermore, it will be necessary to modify backend drivers to re-
> establish
> > +communication with frontend drivers without perturbing the content of
> the
> > +backend area or requiring any changes to the values of the xenstore
> state nodes.
> > +
> > +## Other Para-Virtual State
> > +
> > +### Shared Rings
> > +
> > +Because the console and store protocol shared pages are actually part
> of the
> > +guest memory image (in an E820 reserved region just below 4G) then the
> content
> > +will get migrated as part of the guest memory image. Hence no
> additional code
> > +is required to prevent any guest-visible change in the content.
> > +
> > +### Shared Info
> > +
> > +There is already a record defined in *LibXenCtrl Domain Image Format*
> [3]
> 
> LibXenCtrl -> libxenctrl
>

Now that there's a patch to change it, yes :-)
 
> > +called `SHARED_INFO` which simply contains a complete copy of the
> domain’s
> > +shared info page. It is not currently included in an HVM (type `0x0002`)
> > +migration stream. It may be feasible to include it as an optional
> record
> > +but it is not clear that the content of the shared info page ever needs
> > +to be preserved for an HVM guest.
> 
> Add newline?
> 
> > +For a PV guest the `arch_shared_info` sub-structure contains important
> > +information about the guest’s P2M, but this information is not relevant
> for
> > +an HVM guest where the P2M is not directly manipulated by the guest.
> The other
> > +state contained in the `shared_info` structure relates to the domain wall-
> clock
> > +(the state of which should already be transferred by the `RTC` HVM
> context
> > +information which is contained in the `HVM_CONTEXT` save record) and some
> event
> > +channel state (particularly if using the *2l* protocol). Event channel
> state
> > +will need to be fully transferred if we are not going to require the
> guest’s
> > +co-operation to re-open the channels and so it should be possible to
> re-build a
> > +shared info page for an HVM guest from such other state.
> 
> Add newline here?
> 
> > +Note that the shared info page also contains an array of
> `XEN_LEGACY_MAX_VCPUS`
> > +(32) `vcpu_info` structures. A domain may nominate a different guest
> physical
> > +address to use for the vcpu info. This is mandatory if a domain
> wants to
> > +use more than 32 vCPUs and optional for legacy vCPUs. This mapping is
> not
> > +currently transferred in the migration state so this will either need
> to be
> > +added into an existing save record, or an additional type of save
> record will
> > +be needed.
> > +
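[The nomination mentioned here is done via VCPUOP_register_vcpu_info; roughly,
per the public vcpu.h interface. The gfn, offset and vcpu_id values below are
made-up placeholders:]

    struct vcpu_register_vcpu_info {
        uint64_t mfn;    /* frame number of the page holding the vcpu_info */
        uint32_t offset; /* offset of the vcpu_info within that page       */
        uint32_t rsvd;
    };

    struct vcpu_register_vcpu_info info = {
        .mfn    = gfn,
        .offset = offset,
    };
    HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, vcpu_id, &info);

[It is this per-vCPU guest physical address that would need to find its way
into a save record, as noted above.]
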
> > +### Xenstore Watches
> > +
> > +As mentioned above, no domain Xenstore state is currently transferred
> in the
> > +migration stream. There is a record defined in *LibXenLight Domain
> Image
> 
> LibXenLight -> libxenlight
> 

Same here.

> > +Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore
> nodes
> > +relating to emulators but no record type is defined for nodes relating
> to the
> > +domain itself, nor for registered *watches*. A XenStore watch is a
> mechanism
> > +used by PV frontend and backend drivers to request a notification if
> the value
> > +of a particular node (e.g. the other end’s state node) changes, so it
> is
> > +important that watches continue to function after a migration. One or
> more new
> > +save records will therefore be required to transfer Xenstore state. It
> will
> > +also be necessary to extend the *store* protocol[5] with mechanisms to
> allow
> > +the toolstack to acquire the list of watches that the guest has
> registered and
> > +for the toolstack to register a watch on behalf of a domain.
> > +
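[To illustrate what a watch looks like from outside a kernel: a userspace
client using libxenstore would register and consume one roughly as below; the
path and token are made up, and frontend/backend drivers do the equivalent via
the in-kernel xenbus client rather than this library.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);

        /* Ask xenstored for a notification whenever this node changes */
        xs_watch(xsh, "/local/domain/0/backend/vif/5/0/state", "backend-state");

        for (;;) {
            unsigned int num;
            char **event = xs_read_watch(xsh, &num); /* blocks until it fires */

            printf("%s changed (token %s)\n",
                   event[XS_WATCH_PATH], event[XS_WATCH_TOKEN]);
            free(event);
        }
    }

[The point for migration is that the registration lives purely in xenstored's
internal state; nothing in the stream currently records it.]
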
> > +### Event channels
> > +
> > +Event channels are essentially the para-virtual equivalent of
> interrupts. They
> > +are an important part of most PV protocols. Normally a frontend driver
> creates
> > +an *inter-domain* event channel between its own domain and the domain
> running
> > +the backend, which it discovers using the `backend-id` node in Xenstore
> (see
> > +above), by making an `EVTCHNOP_alloc_unbound` hypercall. This hypercall
> > +allocates an event channel object in the hypervisor and assigns a
> *local port*
> > +number which is then written into the frontend area in Xenstore. The
> backend
> > +driver then reads this port number and *binds* to the event channel by
> > +specifying it, and the value of `frontend-id`, as *remote domain* and
> *remote
> > +port* (respectively) to an `EVTCHNOP_bind_interdomain` hypercall. Once
> > +connection is established in this fashion frontend and backend drivers
> can use
> > +the event channel as a *mailbox* to notify each other when a shared
> ring has
> > +been updated with new requests or response structures.
> 
> Missing newline here.
> 
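[For reference, the two hypercall arguments involved look like this, abridged
from xen/include/public/event_channel.h:]

    struct evtchn_alloc_unbound {
        domid_t dom;          /* IN: DOMID_SELF, i.e. the frontend's domain    */
        domid_t remote_dom;   /* IN: the value of backend-id                   */
        evtchn_port_t port;   /* OUT: local port, written to the frontend area */
    };

    struct evtchn_bind_interdomain {
        domid_t remote_dom;        /* IN: the value of frontend-id             */
        evtchn_port_t remote_port; /* IN: port read from the frontend area     */
        evtchn_port_t local_port;  /* OUT: the backend's own local port        */
    };
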
> > +Currently no event channel state is preserved on migration, requiring
> frontend
> > +and backend drivers to create and bind a complete new set of event
> channels in
> > +order to re-establish a protocol connection. Hence, one or more new
> save
> > +records will be required to transfer event channel state in order to
> avoid the
> > +need for explicit action by frontend drivers running in the guest. Note
> that
> > +the local port numbers need to be preserved in this state as they are the
> only
> > +context the guest has to refer to the hypervisor event channel objects.
> > + Note also that the PV *store* (Xenstore access) and *console*
> protocols also
> > +rely on event channels which are set up by the toolstack. Normally,
> early in
> > +migration, the toolstack running on the remote host would set up a new
> pair of
> > +event channels for these protocols in the destination domain. These may
> not be
> > +assigned the same local port numbers as the protocols running in the
> source
> > +domain. For non-cooperative migration these channels must either be
> created with
> > +fixed port numbers, or their creation must be avoided and instead be
> included
> > +in the general event channel state record(s).
> > +
> > +### Grant table
> > +
> > +The grant table is essentially the para-virtual equivalent of an IOMMU.
> For
> > +example, the shared rings of a PV protocol are *granted* by a frontend
> driver
> > +to the backend driver by allocating *grant entries* in the guest’s
> table,
> > +filling in details of the memory pages and then writing the *grant
> references*
> > +(the index values of the grant entries) into Xenstore. The grant
> references of
> > +the protocol buffers themselves are typically written directly into the
> request
> > +structures passed via a shared ring.
> 
> Missing newline.
> 
> > +The guest is responsible for managing its own grant table. No hypercall
> is
> > +required to grant a memory page to another domain. It is sufficient to
> find an
> > +unused grant entry and set bits in the entry to give read and/or write
> access
> > +to a remote domain also specified in the entry along with the page
> frame
> > +number. Thus the layout and content of the grant table logically forms
> part of
> > +the guest state.
> 
> Missing newline.
> 
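[Again for reference: the v1 grant entry layout from
xen/include/public/grant_table.h, plus a sketch of how a frontend typically
populates an unused entry. gnttab, ref, backend_id and frame are made-up names.]

    struct grant_entry_v1 {
        uint16_t flags;  /* GTF_* flags; GTF_permit_access makes the entry live */
        domid_t  domid;  /* domain being granted access, i.e. backend-id        */
        uint32_t frame;  /* frame number of the page being shared               */
    };

    /* No hypercall: the guest simply writes its own table. */
    gnttab[ref].domid = backend_id;
    gnttab[ref].frame = frame;
    wmb();                                  /* body must be visible before flags */
    gnttab[ref].flags = GTF_permit_access;  /* add GTF_readonly for read-only    */
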
> > +Currently no grant table state is migrated, requiring a guest to
> separately
> > +maintain any state that it wishes to persist elsewhere in its memory
> image and
> > +then restore it after migration. Thus to avoid the need for such
> explicit
> > +action by the guest, one or more new save records will be required to
> migrate
> > +the contents of the grant table.
> > +
> > +# Outline Proposal
> > +
> > +* PV backend drivers will be modified to unilaterally re-establish
> connection
> > +to a frontend if the backend state node is restored with value 4
> > +(XenbusStateConnected)[6].
> 
> Missing newline.
> 
> > +* The toolstack should be modified to allow domid to be randomized on
> initial
> > +creation or default migration, but make it identical to the source
> domain on
> > +non-cooperative migration. Non-Cooperative migration will have to be
> denied if the
> > +domid is unavailable on the target host, but randomization of domid on
> creation
> > +should hopefully minimize the likelihood of this. Non-Cooperative
> migration to
> > +localhost will clearly not be possible. Patches have already been sent
> to
> > +`xen-devel` to make this change[7].
> > +* `xenstored` should be modified to implement the new mechanisms
> needed. See
> > +*Other Para-Virtual State* above. A further design document will
> propose
> > +additional protocol messages.
> > +* Within the migration stream extra save records will be defined as
> required.
> > +See *Other Para-Virtual State* above. A further design document will
> propose
> > +modifications to the LibXenLight and LibXenCtrl Domain Image Formats.
> 
> LibXenLight and LibXenCtrl should be fixed.
> 

Yes.

> > +* An option should be added to the toolstack to initiate a non-
> cooperative
> > +migration, instead of the (default) potentially co-operative migration.
> > +Essentially this should skip the check to see if PV drivers are present
> > +and migrate as if there are none, but also enable the extra save records. Note
> that at
> > +least some of the extra records should only form part of a non-
> cooperative
> > +migration stream. For example, migrating event channel state would be
> counter
> > +productive in a normal migration as this will essentially leak event
> channel
> > +objects at the receiving end. Others, such as grant table state, could
> > +potentially harmlessly form part of a normal migration stream.
> > +
> > +* * *
> > +[1] PV drivers are deemed to be installed if the HVM parameter
> > +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
> 
> I think this is just an approximation, but it should be good enough in
> practice.
> 

This is just a description of the test as it stands. Personally I don't like it 
because I think the callback via should be killed with fire, but alas it is 
ABI. However, other mechanisms for HVM guests to get event notifications have 
existed for a while, so I wouldn't actually view it as a reliable test. E.g. I 
can happily avoid registering the callback via in the Windows PV drivers 
without loss of functionality.

  Paul

> > +
> > +[2] See
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/x
> enbus.h
> > +
> > +[3] See
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-
> migration-stream.pandoc
> > +
> > +[4] See
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-
> migration-stream.pandoc
> > +
> > +[5] See
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> > +
> > +[6] `xen-blkback` and `xen-netback` have already been modified in Linux
> to do
> > +this.
> > +
> > +[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-
> 01/msg00632.html
> > +
> > --
> > 2.20.1
> >
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

