Re: [Xen-devel] [PATCH v3 1/2] docs/designs: Add a design document for non-cooperative live migration



Thanks for writing this up. I skimmed through it. It looks sensible.

On Tue, Jan 28, 2020 at 12:28:22PM +0000, Paul Durrant wrote:
> It has become apparent to some large cloud providers that the current
> model of cooperative migration of guests under Xen is not usable as it
> relies on software running inside the guest, which is likely beyond the
> provider's control.
> This patch introduces a proposal for non-cooperative live migration,
> designed not to rely on any guest-side software.
> 
> Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> ---
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> Cc: Jan Beulich <jbeulich@xxxxxxxx>
> Cc: Julien Grall <julien@xxxxxxx>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> Cc: Wei Liu <wl@xxxxxxx>
> 
> v2:
>  - Use the term 'non-cooperative' instead of 'transparent'
>  - Replace 'trust in' with 'reliance on' when referring to guest-side
>    software
> ---
>  docs/designs/non-cooperative-migration.md | 259 ++++++++++++++++++++++
>  1 file changed, 259 insertions(+)
>  create mode 100644 docs/designs/non-cooperative-migration.md
> 
> diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
> new file mode 100644
> index 0000000000..f38d664c34
> --- /dev/null
> +++ b/docs/designs/non-cooperative-migration.md
> @@ -0,0 +1,259 @@
> +# Non-Cooperative Migration of Guests on Xen
> +
> +## Background
> +
> +The normal model of migration in Xen is driven by the guest because it was
> +originally implemented for PV guests, where the guest must be aware it is
> +running under Xen and is hence expected to co-operate. This model dates from
> +an era when it was assumed that the host administrator had control of at
> +least the privileged software running in the guest (i.e. the guest kernel),
> +which may still be true in an enterprise deployment but is not generally true
> +in a cloud environment. The aim of this design is to provide a model which is
> +purely host driven, requiring no co-operation from the software running in the
> +guest, and is thus suitable for cloud scenarios.
> +
> +PV guests are out of scope for this project because, as is outlined above,
> +they have a symbiotic relationship with the hypervisor and therefore a certain
> +level of co-operation can be assumed.

Missing newline here?

> +HVM guests can already be migrated on Xen without guest co-operation but only
> +if they don’t have PV drivers installed[1] or are in power state S3. The
> +reason for not expecting co-operation if the guest is in S3 is obvious, but
> +the reason co-operation is expected if PV drivers are installed is due to the
> +nature of PV protocols.
> +
> +## Xenstore Nodes and Domain ID
> +
> +The PV driver model consists of a *frontend* and a *backend*. The frontend
> +runs inside the guest domain and the backend runs inside a *service domain*
> +which may or may not domain 0. The frontend and backend typically pass data via

"may or may not _be_ domain 0"

> +memory pages which are shared between the two domains, but this channel of
> +communication is generally established using xenstore (the store protocol
> +itself being an exception to this for obvious chicken-and-egg reasons).
> +
> +Typical protocol establishment is based on the use of two separate xenstore
> +*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
> +and assume the guest has domid X, the service domain has domid Y, and the vif
> +has index Z, then the frontend area will reside under the parent node:

The term "parent" shows up first time in this document. What does it
mean in Xen's context?

> +
> +`/local/domain/X/device/vif/Z`
> +
> +All backends, by convention, typically reside under the parent node:
> +
> +`/local/domain/Y/backend`
> +
> +and the normal backend area for vif Z would be:
> +
> +`/local/domain/Y/backend/vif/X/Z`
> +
> +but this should not be assumed.
> +
> +The toolstack will place two nodes in the frontend area to explicitly locate
> +the backend:
> +
> +    * `backend`: the fully qualified xenstore path of the backend area
> +    * `backend-id`: the domid of the service domain
> +
> +and similarly two nodes in the backend area to locate the frontend area:
> +
> +    * `frontend`: the fully qualified xenstore path of the frontend area
> +    * `frontend-id`: the domid of the guest domain
> +
> +
> +The guest domain only has write permission to the frontend area and similarly
> +the service domain only has write permission to the backend area, but both
> +ends have read permission to both areas.
> +
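
As a concrete illustration of the layout described above, this is roughly how a
client could resolve the backend location from a frontend area using
libxenstore (a minimal sketch; the domid and vif index are made-up example
values and error handling is omitted):

    /* Sketch: locate the backend of vif 0 belonging to the guest with
     * domid 5, by reading the `backend' and `backend-id' nodes placed in
     * the frontend area by the toolstack. Values are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        const char *fe = "/local/domain/5/device/vif/0";
        char path[128];
        unsigned int len;
        char *backend, *backend_id;

        if (!xsh)
            return 1;

        snprintf(path, sizeof(path), "%s/backend", fe);
        backend = xs_read(xsh, XBT_NULL, path, &len);

        snprintf(path, sizeof(path), "%s/backend-id", fe);
        backend_id = xs_read(xsh, XBT_NULL, path, &len);

        printf("backend area %s in domid %s\n", backend, backend_id);

        free(backend);
        free(backend_id);
        xs_close(xsh);
        return 0;
    }
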
> +Under both frontend and backend areas is a node called *state*. This is key
> +to protocol establishment. Upon PV device creation the toolstack will set the
> +value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
> +enumeration of appropriate devices in both the guest and service domains. The
> +backend device, once it has written any necessary protocol specific
> +information into the xenstore backend area (to be read by the frontend
> +driver), will update the backend state node to 2 (XenbusStateInitWait). From
> +this point on PV protocols differ slightly; the following illustration is true
> +of the netif protocol.

Missing newline?

> +Upon seeing a backend state value of 2, the frontend driver will then read the
> +protocol specific information, write details of grant references (for shared
> +pages) and event channel ports (for signalling) that it has created, and set
> +the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing
> +this frontend state, the backend driver will then read the grant references
> +(mapping the shared pages) and event channel ports (opening its end of them)
> +and set the state node in the backend area to 4. Protocol establishment is now
> +complete and the frontend and backend start to pass data.
> +
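
For reference, the state values used here come from xenbus.h (see [2]); a
condensed view, with the handshake steps above noted against the values:

    /* Values from xen/include/public/io/xenbus.h (abbreviated to those used
     * in the handshake described above). */
    enum xenbus_state {
        XenbusStateUnknown      = 0,
        XenbusStateInitialising = 1, /* toolstack: set on both state nodes    */
        XenbusStateInitWait     = 2, /* backend: protocol info now in store   */
        XenbusStateInitialised  = 3,
        XenbusStateConnected    = 4, /* both ends: rings granted, ports bound */
        XenbusStateClosing      = 5,
        XenbusStateClosed       = 6,
    };
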
> +Because the domid of both ends of a PV protocol forms a key part of
> +negotiating the data plane for that protocol (because it is encoded into both
> +xenstore nodes and node paths), and because the guest’s own domid and the
> +domid of the service domain are visible to the guest in xenstore (and hence
> +may be cached internally), and neither is necessarily preserved during
> +migration, it is necessary to have the co-operation of the frontend in
> +re-negotiating the protocol using the new domid after migration.

Add newline here?

> +Moreover the backend-id value will be used by the frontend driver in setting
> +up grant table entries and event channels to communicate with the service
> +domain, so the co-operation of the guest is required to re-establish these in
> +the new host environment after migration.
> +
> +Thus if we are to change the model and support migration of a guest with PV
> +drivers, without the co-operation of the frontend driver code, the paths and
> +values in both the frontend and backend xenstore areas must remain unchanged
> +and valid in the new host environment, and the grant table entries and event
> +channels must be preserved (and remain operational once guest execution is
> +resumed).

Add newline here?

> +Because the service domain’s domid is used directly by the guest in setting
> +up grant entries and event channels, the backend drivers in the new host
> +environment must be provided by a service domain with the same domid. Also,
> +because the guest can sample its own domid from the frontend area and use it
> +in hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid
> +must also be preserved to maintain the ABI.
> +
> +Furthermore, it will be necessary to modify backend drivers to re-establish
> +communication with frontend drivers without perturbing the content of the
> +backend area or requiring any changes to the values of the xenstore state
> +nodes.
> +
> +## Other Para-Virtual State
> +
> +### Shared Rings
> +
> +Because the console and store protocol shared pages are actually part of the
> +guest memory image (in an E820 reserved region just below 4G), their content
> +will get migrated as part of the guest memory image. Hence no additional code
> +is required to prevent any guest visible change in the content.
> +
> +### Shared Info
> +
> +There is already a record defined in *LibXenCtrl Domain Image Format* [3]

LibXenCtrl -> libxenctrl

> +called `SHARED_INFO` which simply contains a complete copy of the domain’s
> +shared info page. It is not currently included in an HVM (type `0x0002`)
> +migration stream. It may be feasible to include it as an optional record
> +but it is not clear that the content of the shared info page ever needs
> +to be preserved for an HVM guest.

Add newline?

> +For a PV guest the `arch_shared_info` sub-structure contains important
> +information about the guest’s P2M, but this information is not relevant for
> +an HVM guest where the P2M is not directly manipulated via the guest. The
> +other state contained in the `shared_info` structure relates to the domain
> +wall-clock (the state of which should already be transferred by the `RTC` HVM
> +context information which is contained in the `HVM_CONTEXT` save record) and
> +some event channel state (particularly if using the *2l* protocol). Event
> +channel state will need to be fully transferred if we are not going to require
> +guest co-operation to re-open the channels and so it should be possible to
> +re-build a shared info page for an HVM guest from such other state.

Add newline here?

> +Note that the shared info page also contains an array of
> +`XEN_LEGACY_MAX_VCPUS` (32) `vcpu_info` structures. A domain may nominate a
> +different guest physical address to use for the vcpu info. This is mandatory
> +if a domain wants to use more than 32 vCPUs and optional for legacy vCPUs.
> +This mapping is not currently transferred in the migration state so it will
> +either need to be added into an existing save record, or an additional type of
> +save record will be needed.
> +
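
The nomination mentioned here is done with VCPUOP_register_vcpu_info. For
reference, the argument structure from xen/include/public/vcpu.h is shown
below; it gives an idea of the per-vCPU placement information that a new or
extended save record would have to capture (the usage lines are a sketch of the
guest side, not toolstack code):

    /* From xen/include/public/vcpu.h: a domain nominates where a vCPU's
     * vcpu_info lives, instead of the legacy slot in the shared info page. */
    struct vcpu_register_vcpu_info {
        uint64_t mfn;    /* guest frame holding the vcpu_info */
        uint32_t offset; /* offset within that frame          */
        uint32_t rsvd;   /* unused, must be zero              */
    };

    /* Guest-side usage (sketch):
     *   struct vcpu_register_vcpu_info info = { .mfn = gfn, .offset = off };
     *   HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, vcpu_id, &info);
     */
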
> +### Xenstore Watches
> +
> +As mentioned above, no domain Xenstore state is currently transferred in the
> +migration stream. There is a record defined in *LibXenLight Domain Image

LibXenLight -> libxenlight

> +Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
> +relating to emulators but no record type is defined for nodes relating to the
> +domain itself, nor for registered *watches*. A XenStore watch is a mechanism
> +used by PV frontend and backend drivers to request a notification if the value
> +of a particular node (e.g. the other end’s state node) changes, so it is
> +important that watches continue to function after a migration. One or more new
> +save records will therefore be required to transfer Xenstore state. It will
> +also be necessary to extend the *store* protocol[5] with mechanisms to allow
> +the toolstack to acquire the list of watches that the guest has registered and
> +for the toolstack to register a watch on behalf of a domain.
> +
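
To show what such a watch looks like from a client's point of view (and why it
lives in xenstored rather than in the guest memory image), here is a minimal
libxenstore sketch; the watched path is a made-up example and error handling is
omitted:

    /* Sketch: register a watch on a backend state node and wait for one
     * firing, in the way a frontend-like client would. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        char **vec;
        unsigned int num;

        if (!xsh)
            return 1;

        /* Ask xenstored to fire an event whenever this node changes. */
        xs_watch(xsh, "/local/domain/0/backend/vif/5/0/state", "be-state");

        /* Block until an event arrives; vec[XS_WATCH_PATH] is the changed
         * path and vec[XS_WATCH_TOKEN] is the token supplied above. */
        vec = xs_read_watch(xsh, &num);
        printf("changed: %s (token %s)\n",
               vec[XS_WATCH_PATH], vec[XS_WATCH_TOKEN]);

        free(vec);
        xs_close(xsh);
        return 0;
    }

It is exactly this registration state, held by xenstored on behalf of the
domain, that the proposed protocol extensions would need to enumerate on the
source host and replay on the destination.
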
> +### Event channels
> +
> +Event channels are essentially the para-virtual equivalent of interrupts.
> +They are an important part of most PV protocols. Normally a frontend driver
> +creates an *inter-domain* event channel between its own domain and the domain
> +running the backend, which it discovers using the `backend-id` node in
> +Xenstore (see above), by making an `EVTCHNOP_alloc_unbound` hypercall. This
> +hypercall allocates an event channel object in the hypervisor and assigns a
> +*local port* number which is then written into the frontend area in Xenstore.
> +The backend driver then reads this port number and *binds* to the event
> +channel by specifying it, and the value of `frontend-id`, as *remote domain*
> +and *remote port* (respectively) to an `EVTCHNOP_bind_interdomain` hypercall.
> +Once connection is established in this fashion, frontend and backend drivers
> +can use the event channel as a *mailbox* to notify each other when a shared
> +ring has been updated with new requests or response structures.

Missing newline here.
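
For reference, the argument structures of the two hypercalls mentioned (from
xen/include/public/event_channel.h) show why both the peer domid and the local
port number end up baked into guest-visible state:

    /* EVTCHNOP_alloc_unbound: the frontend allocates a channel that
     * <remote_dom> may later bind; the returned <port> is what gets
     * written into the frontend area in Xenstore. */
    struct evtchn_alloc_unbound {
        domid_t dom, remote_dom;   /* IN: DOMID_SELF and backend-id        */
        evtchn_port_t port;        /* OUT: the frontend's local port       */
    };

    /* EVTCHNOP_bind_interdomain: the backend binds to the port advertised
     * by the frontend, receiving a local port of its own in return. */
    struct evtchn_bind_interdomain {
        domid_t remote_dom;        /* IN: frontend-id                      */
        evtchn_port_t remote_port; /* IN: port read from the frontend area */
        evtchn_port_t local_port;  /* OUT: the backend's local port        */
    };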

> +Currently no event channel state is preserved on migration, requiring
> +frontend and backend drivers to create and bind a completely new set of event
> +channels in order to re-establish a protocol connection. Hence, one or more
> +new save records will be required to transfer event channel state in order to
> +avoid the need for explicit action by frontend drivers running in the guest.
> +Note that the local port numbers need to be preserved in this state as they
> +are the only context the guest has to refer to the hypervisor event channel
> +objects. Note also that the PV *store* (Xenstore access) and *console*
> +protocols also rely on event channels which are set up by the toolstack.
> +Normally, early in migration, the toolstack running on the remote host would
> +set up a new pair of event channels for these protocols in the destination
> +domain. These may not be assigned the same local port numbers as the protocols
> +running in the source domain. For non-cooperative migration these channels
> +must either be created with fixed port numbers, or their creation must be
> +avoided and instead be included in the general event channel state record(s).
> +
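
Purely to illustrate the kind of information such a record would need to carry,
a hypothetical per-channel layout might look like the following. This is not a
defined stream format; the further design document mentioned in the proposal
below would specify the real one:

    /* Hypothetical sketch only: roughly what would need to be captured per
     * event channel so that the guest-visible local port survives migration.
     * Field names and layout are illustrative, not a defined record. */
    struct evtchn_record {
        uint32_t local_port;   /* guest-visible port; must be preserved */
        uint32_t remote_domid; /* peer domain (e.g. the service domain) */
        uint32_t remote_port;  /* peer port, for interdomain channels   */
        uint32_t state;        /* unbound / interdomain / virq / ...    */
        uint32_t pending;      /* pending bit as seen by the guest      */
        uint32_t masked;       /* mask bit as seen by the guest         */
    };
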
> +### Grant table
> +
> +The grant table is essentially the para-virtual equivalent of an IOMMU. For
> +example, the shared rings of a PV protocol are *granted* by a frontend driver
> +to the backend driver by allocating *grant entries* in the guest’s table,
> +filling in details of the memory pages and then writing the *grant references*
> +(the index values of the grant entries) into Xenstore. The grant references of
> +the protocol buffers themselves are typically written directly into the
> +request structures passed via a shared ring.

Missing newline.

> +The guest is responsible for managing its own grant table. No hypercall is
> +required to grant a memory page to another domain. It is sufficient to find an
> +unused grant entry and set bits in the entry to give read and/or write access
> +to a remote domain also specified in the entry along with the page frame
> +number. Thus the layout and content of the grant table logically forms part of
> +the guest state.

Missing newline.
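
To make the "no hypercall" point concrete: the v1 entry format below is
paraphrased from xen/include/public/grant_table.h, and the helper is a sketch
of the guest-side sequence (the barrier matters because writing the flags word
is what makes the entry live):

    #include <stdint.h>

    /* Paraphrased from xen/include/public/grant_table.h (v1 format). */
    struct grant_entry_v1 {
        uint16_t flags;  /* GTF_* flags; setting these activates the entry */
        uint16_t domid;  /* domain being granted access (domid_t)          */
        uint32_t frame;  /* frame number being granted                     */
    };

    /* Flag values as defined in the same header. */
    #define GTF_permit_access (1)       /* grant type: allow foreign access */
    #define GTF_readonly      (1U << 2) /* restrict the grant to read-only  */

    /* Sketch: granting a frame is just plain memory writes into the guest's
     * own grant table pages, with no hypercall involved. */
    static void grant_frame(struct grant_entry_v1 *gnttab, unsigned int ref,
                            uint16_t backend_domid, uint32_t frame,
                            int readonly)
    {
        gnttab[ref].domid = backend_domid;
        gnttab[ref].frame = frame;
        /* Make domid/frame visible before the entry becomes valid. */
        __sync_synchronize();
        gnttab[ref].flags = GTF_permit_access |
                            (readonly ? GTF_readonly : 0);
    }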

> +Currently no grant table state is migrated, requiring a guest to separately
> +maintain any state that it wishes to persist elsewhere in its memory image and
> +then restore it after migration. Thus to avoid the need for such explicit
> +action by the guest, one or more new save records will be required to migrate
> +the contents of the grant table.
> +
> +# Outline Proposal
> +
> +* PV backend drivers will be modified to unilaterally re-establish connection
> +to a frontend if the backend state node is restored with value 4
> +(XenbusStateConnected)[6].

Missing newline.

> +* The toolstack should be modified to allow domid to be randomized on initial
> +creation or default migration, but make it identical to the source domain on
> +non-cooperative migration. Non-cooperative migration will have to be denied
> +if the domid is unavailable on the target host, but randomization of domid on
> +creation should hopefully minimize the likelihood of this. Non-cooperative
> +migration to localhost will clearly not be possible. Patches have already
> +been sent to `xen-devel` to make this change[7].
> +* `xenstored` should be modified to implement the new mechanisms needed. See
> +*Other Para-Virtual State* above. A further design document will propose
> +additional protocol messages.
> +* Within the migration stream extra save records will be defined as required.
> +See *Other Para-Virtual State* above. A further design document will propose
> +modifications to the LibXenLight and LibXenCtrl Domain Image Formats.

LibXenLight and LibXenCtrl should be fixed.

> +* An option should be added to the toolstack to initiate a non-cooperative
> +migration, instead of the (default) potentially co-operative migration.
> +Essentially this should skip the check to see if PV drivers are present and
> +migrate as if there are none, while also enabling the extra save records.
> +Note that at least some of the extra records should only form part of a
> +non-cooperative migration stream. For example, migrating event channel state
> +would be counter-productive in a normal migration as this will essentially
> +leak event channel objects at the receiving end. Others, such as grant table
> +state, could potentially harmlessly form part of a normal migration stream.
> +
> +* * *
> +[1] PV drivers are deemed to be installed if the HVM parameter
> +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.

I think this is just an approximation, but it should be good enough in
practice.
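
For completeness, the check presumably amounts to something along these lines
on the toolstack side (a sketch using libxenctrl; error handling omitted):

    /* Sketch: apply the heuristic from [1]: treat PV drivers as installed
     * if HVM_PARAM_CALLBACK_IRQ is non-zero. */
    #include <xenctrl.h>
    #include <xen/hvm/params.h>

    static int pv_drivers_present(xc_interface *xch, uint32_t domid)
    {
        uint64_t val = 0;

        xc_hvm_param_get(xch, domid, HVM_PARAM_CALLBACK_IRQ, &val);
        return val != 0;
    }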

> +
> +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
> +
> +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
> +
> +[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> +
> +[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> +
> +[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
> +this.
> +
> +[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
> +
> -- 
> 2.20.1
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

