Re: [Xen-devel] [PATCH v3 1/2] docs/designs: Add a design document for non-cooperative live migration
Thanks for writing this up. I skimmed through it. It looks sensible.

On Tue, Jan 28, 2020 at 12:28:22PM +0000, Paul Durrant wrote:
> It has become apparent to some large cloud providers that the current
> model of cooperative migration of guests under Xen is not usable as it
> relies on software running inside the guest, which is likely beyond the
> provider's control.
>
> This patch introduces a proposal for non-cooperative live migration,
> designed not to rely on any guest-side software.
>
> Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> ---
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> Cc: Jan Beulich <jbeulich@xxxxxxxx>
> Cc: Julien Grall <julien@xxxxxxx>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> Cc: Wei Liu <wl@xxxxxxx>
>
> v2:
>  - Use the term 'non-cooperative' instead of 'transparent'
>  - Replace 'trust in' with 'reliance on' when referring to guest-side
>    software
> ---
>  docs/designs/non-cooperative-migration.md | 259 ++++++++++++++++++++++
>  1 file changed, 259 insertions(+)
>  create mode 100644 docs/designs/non-cooperative-migration.md
>
> diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
> new file mode 100644
> index 0000000000..f38d664c34
> --- /dev/null
> +++ b/docs/designs/non-cooperative-migration.md
> @@ -0,0 +1,259 @@
> +# Non-Cooperative Migration of Guests on Xen
> +
> +## Background
> +
> +The normal model of migration in Xen is driven by the guest because it
> +was originally implemented for PV guests, where the guest must be aware
> +it is running under Xen and is hence expected to co-operate. This model
> +dates from an era when it was assumed that the host administrator had
> +control of at least the privileged software running in the guest (i.e.
> +the guest kernel), which may still be true in an enterprise deployment
> +but is not generally true in a cloud environment. The aim of this
> +design is to provide a model which is purely host driven, requiring no
> +co-operation from the software running in the guest, and is thus
> +suitable for cloud scenarios.
> +
> +PV guests are out of scope for this project because, as is outlined
> +above, they have a symbiotic relationship with the hypervisor and
> +therefore a certain level of co-operation can be assumed.

Missing newline here?

> +HVM guests can already be migrated on Xen without guest co-operation
> +but only if they don’t have PV drivers installed[1] or are in power
> +state S3. The reason for not expecting co-operation if the guest is in
> +S3 is obvious, but the reason co-operation is expected if PV drivers
> +are installed is due to the nature of PV protocols.
> +
> +## Xenstore Nodes and Domain ID
> +
> +The PV driver model consists of a *frontend* and a *backend*. The
> +frontend runs inside the guest domain and the backend runs inside a
> +*service domain* which may or may not domain 0. The frontend and
> +backend typically pass data via memory pages which are shared between
> +the two domains, but this channel of communication is generally
> +established using xenstore (the store protocol itself being an
> +exception to this for obvious chicken-and-egg reasons).

"may or may not _be_ domain 0"

> +
> +Typical protocol establishment is based on the use of two separate
> +xenstore *areas*. If we consider PV drivers for the *netif* protocol
> +(i.e. class vif) and assume the guest has domid X, the service domain
> +has domid Y, and the vif has index Z then the frontend area will
> +reside under the parent node:

The term "parent" shows up first time in this document. What does it
mean in Xen's context?

> +
> +`/local/domain/X/device/vif/Z`
> +
> +All backends, by convention, typically reside under parent node:
> +
> +`/local/domain/Y/backend`
> +
> +and the normal backend area for vif Z would be:
> +
> +`/local/domain/Y/backend/vif/X/Z`
> +
> +but this should not be assumed.
> +
> +The toolstack will place two nodes in the frontend area to explicitly
> +locate the backend:
> +
> + * `backend`: the fully qualified xenstore path of the backend area
> + * `backend-id`: the domid of the service domain
> +
> +and similarly two nodes in the backend area to locate the frontend
> +area:
> +
> + * `frontend`: the fully qualified xenstore path of the frontend area
> + * `frontend-id`: the domid of the guest domain
> +
> +The guest domain only has write permission to the frontend area and
> +similarly the service domain only has write permission to the backend
> +area, but both ends have read permission to both areas.
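As an aside for readers following along: locating a backend from the
frontend area then boils down to two xenstore reads. A minimal sketch
using the libxenstore API; the guest domid 5 and vif index 0 in the
paths are made-up values and error handling is elided:

    /* Minimal sketch: a frontend locating its backend via the two
     * locator nodes. Guest domid 5 and vif index 0 are illustrative;
     * error handling elided for brevity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        unsigned int len;

        char *backend = xs_read(xsh, XBT_NULL,
                                "/local/domain/5/device/vif/0/backend", &len);
        char *backend_id = xs_read(xsh, XBT_NULL,
                                   "/local/domain/5/device/vif/0/backend-id", &len);

        printf("backend area %s in service domain %s\n", backend, backend_id);

        free(backend);
        free(backend_id);
        xs_close(xsh);
        return 0;
    }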
> +
> +Under both frontend and backend areas is a node called *state*. This
> +is key to protocol establishment. Upon PV device creation the
> +toolstack will set the value of both state nodes to 1
> +(XenbusStateInitialising[2]). This should cause enumeration of
> +appropriate devices in both the guest and service domains. The
> +backend device, once it has written any necessary protocol specific
> +information into the xenstore backend area (to be read by the
> +frontend driver), will update the backend state node to 2
> +(XenbusStateInitWait). From this point on PV protocols differ
> +slightly; the following illustration is true of the netif protocol.

Missing newline?

> +Upon seeing a backend state value of 2, the frontend driver will then
> +read the protocol specific information, write details of grant
> +references (for shared pages) and event channel ports (for
> +signalling) that it has created, and set the state node in the
> +frontend area to 4 (XenbusStateConnected). Upon seeing this frontend
> +state, the backend driver will then read the grant references
> +(mapping the shared pages) and event channel ports (opening its end
> +of them) and set the state node in the backend area to 4. Protocol
> +establishment is now complete and the frontend and backend start to
> +pass data.
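For reference, the state values used in this walk-through come from
xen/include/public/io/xenbus.h [2], abridged here:

    /* Abridged from xen/include/public/io/xenbus.h [2]. */
    enum xenbus_state {
        XenbusStateUnknown       = 0,
        XenbusStateInitialising  = 1, /* set by the toolstack on device creation */
        XenbusStateInitWait      = 2, /* backend ready, waiting on the frontend */
        XenbusStateInitialised   = 3,
        XenbusStateConnected     = 4, /* data path established */
        XenbusStateClosing       = 5,
        XenbusStateClosed        = 6,
        XenbusStateReconfiguring = 7,
        XenbusStateReconfigured  = 8,
    };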
> +
> +Because the domid of both ends of a PV protocol forms a key part of
> +negotiating the data plane for that protocol (because it is encoded
> +into both xenstore nodes and node paths), and because the guest’s own
> +domid and the domid of the service domain are visible to the guest in
> +xenstore (and hence may be cached internally), and neither is
> +necessarily preserved during migration, it is hence necessary to have
> +the co-operation of the frontend in re-negotiating the protocol using
> +the new domid after migration.

Add newline here?

> +Moreover the backend-id value will be used by the frontend driver in
> +setting up grant table entries and event channels to communicate with
> +the service domain, so the co-operation of the guest is required to
> +re-establish these in the new host environment after migration.
> +
> +Thus if we are to change the model and support migration of a guest
> +with PV drivers, without the co-operation of the frontend driver
> +code, the paths and values in both the frontend and backend xenstore
> +areas must remain unchanged and valid in the new host environment,
> +and the grant table entries and event channels must be preserved (and
> +remain operational once guest execution is resumed).

Add newline here?

> +Because the service domain’s domid is used directly by the guest in
> +setting up grant entries and event channels, the backend drivers in
> +the new host environment must be provided by a service domain with
> +the same domid. Also, because the guest can sample its own domid from
> +the frontend area and use it in hypercalls (e.g. HVMOP_set_param)
> +rather than DOMID_SELF, the guest domid must also be preserved to
> +maintain the ABI.
> +
> +Furthermore, it will be necessary to modify backend drivers to
> +re-establish communication with frontend drivers without perturbing
> +the content of the backend area or requiring any changes to the
> +values of the xenstore state nodes.
> +
> +## Other Para-Virtual State
> +
> +### Shared Rings
> +
> +Because the console and store protocol shared pages are actually part
> +of the guest memory image (in an E820 reserved region just below 4G),
> +the content will get migrated as part of the guest memory image.
> +Hence no additional code is required to prevent any guest visible
> +change in the content.
> +
> +### Shared Info
> +
> +There is already a record defined in *LibXenCtrl Domain Image Format*
> +[3] called `SHARED_INFO` which simply contains a complete copy of the
> +domain’s shared info page.

LibXenCtrl -> libxenctrl

> +It is not currently included in an HVM (type `0x0002`) migration
> +stream. It may be feasible to include it as an optional record but it
> +is not clear that the content of the shared info page ever needs to
> +be preserved for an HVM guest.

Add newline?

> +For a PV guest the `arch_shared_info` sub-structure contains
> +important information about the guest’s P2M, but this information is
> +not relevant for an HVM guest where the P2M is not directly
> +manipulated via the guest. The other state contained in the
> +`shared_info` structure relates to the domain wall-clock (the state
> +of which should already be transferred by the `RTC` HVM context
> +information which is contained in the `HVM_CONTEXT` save record) and
> +some event channel state (particularly if using the *2l* protocol).
> +Event channel state will need to be fully transferred if we are not
> +going to require the guest’s co-operation to re-open the channels and
> +so it should be possible to re-build a shared info page for an HVM
> +guest from such other state.

Add newline here?

> +Note that the shared info page also contains an array of
> +`XEN_LEGACY_MAX_VCPUS` (32) `vcpu_info` structures. A domain may
> +nominate a different guest physical address to use for the vcpu info.
> +This is mandatory if a domain wants to use more than 32 vCPUs and
> +optional for legacy vCPUs. This mapping is not currently transferred
> +in the migration stream so this will either need to be added into an
> +existing save record, or an additional type of save record will be
> +needed.
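For context, the nomination mentioned above is done with the
VCPUOP_register_vcpu_info hypercall, and it is exactly this
(mfn, offset) mapping that a save record would need to capture. The
argument structure, from xen/include/public/vcpu.h (the public header
notes the registration cannot be undone):

    /* From xen/include/public/vcpu.h: the argument to
     * VCPUOP_register_vcpu_info. The (mfn, offset) pair is the
     * mapping a new save record would have to carry. */
    struct vcpu_register_vcpu_info {
        uint64_t mfn;    /* mfn of page to place vcpu_info */
        uint32_t offset; /* offset within page */
        uint32_t rsvd;   /* unused */
    };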
> +
> +### Xenstore Watches
> +
> +As mentioned above, no domain Xenstore state is currently transferred
> +in the migration stream. There is a record defined in *LibXenLight
> +Domain Image Format* [4] called `EMULATOR_XENSTORE_DATA` for
> +transferring Xenstore nodes relating to emulators but no record type
> +is defined for nodes relating to the domain itself, nor for
> +registered *watches*.

LibXenLight -> libxenlight

> +A Xenstore watch is a mechanism used by PV frontend and backend
> +drivers to request a notification if the value of a particular node
> +(e.g. the other end’s state node) changes, so it is important that
> +watches continue to function after a migration. One or more new save
> +records will therefore be required to transfer Xenstore state. It
> +will also be necessary to extend the *store* protocol[5] with
> +mechanisms to allow the toolstack to acquire the list of watches that
> +the guest has registered and for the toolstack to register a watch on
> +behalf of a domain.
> +
> +### Event channels
> +
> +Event channels are essentially the para-virtual equivalent of
> +interrupts. They are an important part of most PV protocols. Normally
> +a frontend driver creates an *inter-domain* event channel between its
> +own domain and the domain running the backend, which it discovers
> +using the `backend-id` node in Xenstore (see above), by making an
> +`EVTCHNOP_alloc_unbound` hypercall. This hypercall allocates an event
> +channel object in the hypervisor and assigns a *local port* number
> +which is then written into the frontend area in Xenstore. The backend
> +driver then reads this port number and *binds* to the event channel
> +by specifying it, and the value of `frontend-id`, as *remote port*
> +and *remote domain* (respectively) to an `EVTCHNOP_bind_interdomain`
> +hypercall. Once the connection is established in this fashion,
> +frontend and backend drivers can use the event channel as a *mailbox*
> +to notify each other when a shared ring has been updated with new
> +request or response structures.

Missing newline here.

> +Currently no event channel state is preserved on migration, requiring
> +frontend and backend drivers to create and bind a complete new set of
> +event channels in order to re-establish a protocol connection. Hence,
> +one or more new save records will be required to transfer event
> +channel state in order to avoid the need for explicit action by
> +frontend drivers running in the guest. Note that the local port
> +numbers need to be preserved in this state as they are the only
> +context the guest has to refer to the hypervisor event channel
> +objects.
> +
> +Note that the PV *store* (Xenstore access) and *console* protocols
> +also rely on event channels which are set up by the toolstack.
> +Normally, early in migration, the toolstack running on the remote
> +host would set up a new pair of event channels for these protocols in
> +the destination domain. These may not be assigned the same local port
> +numbers as the protocols running in the source domain. For
> +non-cooperative migration these channels must either be created with
> +fixed port numbers, or their creation must be avoided and their state
> +instead be included in the general event channel state record(s).
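To make the port-number dependency concrete, here is a sketch of the
two hypercall arguments involved, using the structures from
xen/include/public/event_channel.h. event_channel_op() stands in for
whatever EVTCHNOP hypercall wrapper the guest OS provides, and X, Y
and the xenstore nodes are from the vif example earlier:

    /* Frontend (domid X): allocate an unbound channel for the service
     * domain, whose domid Y was read from the backend-id node. */
    struct evtchn_alloc_unbound alloc = {
        .dom        = DOMID_SELF,
        .remote_dom = Y,
    };
    event_channel_op(EVTCHNOP_alloc_unbound, &alloc);
    /* alloc.port (the *local port*) is written into the frontend
     * area; it is this number that non-cooperative migration must
     * preserve. */

    /* Backend (domid Y): bind to the port advertised by the frontend. */
    struct evtchn_bind_interdomain bind = {
        .remote_dom  = X,           /* from the frontend-id node */
        .remote_port = alloc.port,  /* from the frontend area */
    };
    event_channel_op(EVTCHNOP_bind_interdomain, &bind);
    /* bind.local_port is the backend’s end of the channel. */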
> +
> +### Grant table
> +
> +The grant table is essentially the para-virtual equivalent of an
> +IOMMU. For example, the shared rings of a PV protocol are *granted*
> +by a frontend driver to the backend driver by allocating *grant
> +entries* in the guest’s table, filling in details of the memory pages
> +and then writing the *grant references* (the index values of the
> +grant entries) into Xenstore. The grant references of the protocol
> +buffers themselves are typically written directly into the request
> +structures passed via a shared ring.

Missing newline.

> +The guest is responsible for managing its own grant table. No
> +hypercall is required to grant a memory page to another domain. It is
> +sufficient to find an unused grant entry and set bits in the entry to
> +give read and/or write access to a remote domain also specified in
> +the entry along with the page frame number. Thus the layout and
> +content of the grant table logically form part of the guest state.

Missing newline.

> +Currently no grant table state is migrated, requiring a guest to
> +separately maintain any state that it wishes to persist elsewhere in
> +its memory image and then restore it after migration. Thus to avoid
> +the need for such explicit action by the guest, one or more new save
> +records will be required to migrate the contents of the grant table.
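A sketch of that guest-side operation, assuming the v1 grant entry
format from xen/include/public/grant_table.h; gnttab (the guest’s
mapping of its own table), find_unused_ref(), wmb() and the values
gfn and Y are illustrative stand-ins:

    /* Grant the page at guest frame 'gfn' to service domain Y,
     * without any hypercall: just fill in a free v1 grant entry. */
    grant_ref_t ref = find_unused_ref();

    gnttab[ref].domid = Y;    /* domain being granted access */
    gnttab[ref].frame = gfn;  /* page being shared */
    wmb();                    /* entry must be populated before... */
    gnttab[ref].flags = GTF_permit_access; /* ...access is enabled */

    /* 'ref' is then advertised via Xenstore (for a shared ring) or
     * via a request on the ring (for a data buffer). */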
> +
> +# Outline Proposal
> +
> +* PV backend drivers will be modified to unilaterally re-establish
> +connection to a frontend if the backend state node is restored with
> +value 4 (XenbusStateConnected)[6].

Missing newline.

> +* The toolstack should be modified to allow domid to be randomized on
> +initial creation or default migration, but make it identical to the
> +source domain on non-cooperative migration. Non-cooperative migration
> +will have to be denied if the domid is unavailable on the target
> +host, but randomization of domid on creation should hopefully
> +minimize the likelihood of this. Non-cooperative migration to
> +localhost will clearly not be possible. Patches have already been
> +sent to `xen-devel` to make this change[7].
> +* `xenstored` should be modified to implement the new mechanisms
> +needed. See *Other Para-Virtual State* above. A further design
> +document will propose additional protocol messages.
> +* Within the migration stream extra save records will be defined as
> +required. See *Other Para-Virtual State* above. A further design
> +document will propose modifications to the LibXenLight and LibXenCtrl
> +Domain Image Formats.

LibXenLight and LibXenCtrl should be fixed.

> +* An option should be added to the toolstack to initiate a
> +non-cooperative migration, instead of the (default) potentially
> +co-operative migration. Essentially this should skip the check for PV
> +drivers and migrate as if none are present, but also enable the extra
> +save records. Note that at least some of the extra records should
> +only form part of a non-cooperative migration stream. For example,
> +migrating event channel state would be counterproductive in a normal
> +migration as this would essentially leak event channel objects at the
> +receiving end. Others, such as grant table state, could potentially
> +harmlessly form part of a normal migration stream.
> +
> +* * *
> +
> +[1] PV drivers are deemed to be installed if the HVM parameter
> +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.

I think this is just an approximation, but it should be good enough in
practice.

> +
> +[2] See
> +https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
> +
> +[3] See
> +https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
> +
> +[4] See
> +https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> +
> +[5] See
> +https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> +
> +[6] `xen-blkback` and `xen-netback` have already been modified in
> +Linux to do this.
> +
> +[7] See
> +https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
> --
> 2.20.1