Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@xxxxxxxxxxxxxxxxxxxx> On Behalf Of Julien Grall
> Sent: 04 March 2020 15:11
> To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; Wei Liu <wl@xxxxxxx>; Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; George Dunlap <George.Dunlap@xxxxxxxxxxxxx>; Andrew Cooper <andrew.cooper3@xxxxxxxxxx>; Ian Jackson <ian.jackson@xxxxxxxxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>
> Subject: Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
>
> Hi Paul,
>
> The proposal looks sensible to me. Some nitpicking below.
>
> On 13/02/2020 10:53, Paul Durrant wrote:
> > It has become apparent to some large cloud providers that the current
> > model of cooperative migration of guests under Xen is not usable as it
> > relies on software running inside the guest, which is likely beyond the
> > provider's control.
> > This patch introduces a proposal for non-cooperative live migration,
> > designed not to rely on any guest-side software.
> >
> > Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> > Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> > Cc: Jan Beulich <jbeulich@xxxxxxxx>
> > Cc: Julien Grall <julien@xxxxxxx>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > Cc: Wei Liu <wl@xxxxxxx>
> >
> > v5:
> >  - Note that PV domains are not just expected to co-operate, they are
> >    required to
> >
> > v4:
> >  - Fix issues raised by Wei
> >
> > v2:
> >  - Use the term 'non-cooperative' instead of 'transparent'
> >  - Replace 'trust in' with 'reliance on' when referring to guest-side
> >    software
> > ---
> >  docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
> >  1 file changed, 272 insertions(+)
> >  create mode 100644 docs/designs/non-cooperative-migration.md
> >
> > diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
> > new file mode 100644
> > index 0000000000..09f74c8c0d
> > --- /dev/null
> > +++ b/docs/designs/non-cooperative-migration.md
> > @@ -0,0 +1,272 @@
> > +# Non-Cooperative Migration of Guests on Xen
> > +
> > +## Background
> > +
> > +The normal model of migration in Xen is driven by the guest because it was
> > +originally implemented for PV guests, where the guest must be aware it is
> > +running under Xen and is hence expected to co-operate. This model dates from
> > +an era when it was assumed that the host administrator had control of at least
> > +the privileged software running in the guest (i.e. the guest kernel), which may
> > +still be true in an enterprise deployment but is not generally true in a cloud
> > +environment. The aim of this design is to provide a model which is purely host
> > +driven, requiring no co-operation from the software running in the
> > +guest, and is thus suitable for cloud scenarios.
> > +
> > +PV guests are out of scope for this project because, as is outlined above, they
> > +have a symbiotic relationship with the hypervisor and therefore a certain level
> > +of co-operation is required.
> > +
> > +HVM guests can already be migrated on Xen without guest co-operation but only
> > +if they don’t have PV drivers installed[1] or are in power state S3. The
>
> S3 is very ACPI centric, so I would prefer if we avoid the term. I think
> the non-ACPI description is "suspend to RAM". I would be OK if you
> mention S3 in parentheses.

I'm actually pulling this from the way the code is currently written, which is
clearly quite x86 specific:

    xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_ACPI_S_STATE, &hvm_s_state)
    .
    .
    .
    if (dsps->type == LIBXL_DOMAIN_TYPE_HVM && (!hvm_pvdrv || hvm_s_state)) {
        LOGD(DEBUG, domid, "Calling xc_domain_shutdown on HVM domain");
        ret = xc_domain_shutdown(CTX->xch, domid, SHUTDOWN_suspend);
    .
    .
    }

So actually I should say 'not in power state S0'.

>
> > +reason for not expecting co-operation if the guest is in S3 is obvious, but the
> > +reason co-operation is expected if PV drivers are installed is due to the
> > +nature of PV protocols.
> > +
> > +## Xenstore Nodes and Domain ID
> > +
> > +The PV driver model consists of a *frontend* and a *backend*. The frontend runs
> > +inside the guest domain and the backend runs inside a *service domain* which
> > +may or may not be domain 0. The frontend and backend typically pass data via
> > +memory pages which are shared between the two domains, but this channel of
> > +communication is generally established using xenstore (the store protocol
> > +itself being an exception to this for obvious chicken-and-egg reasons).
> > +
> > +Typical protocol establishment is based on use of two separate xenstore
> > +*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
> > +and assume the guest has domid X, the service domain has domid Y, and the vif
> > +has index Z then the frontend area will reside under the parent node:
> > +
> > +`/local/domain/X/device/vif/Z`
> > +
> > +All backends, by convention, typically reside under parent node:
> > +
> > +`/local/domain/Y/backend`
> > +
> > +and the normal backend area for vif Z would be:
> > +
> > +`/local/domain/Y/backend/vif/X/Z`
> > +
> > +but this should not be assumed.
> > +
> > +The toolstack will place two nodes in the frontend area to explicitly locate
> > +the backend:
> > +
> > + * `backend`: the fully qualified xenstore path of the backend area
> > + * `backend-id`: the domid of the service domain
> > +
> > +and similarly two nodes in the backend area to locate the frontend area:
> > +
> > + * `frontend`: the fully qualified xenstore path of the frontend area
> > + * `frontend-id`: the domid of the guest domain
> > +
> > +The guest domain only has write permission to the frontend area and similarly
> > +the service domain only has write permission to the backend area, but both ends
> > +have read permission to both areas.
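[Editorial illustration, not part of the patch: a minimal user-space sketch of what the layout above means in practice, using the public libxenstore C API. A real frontend does the equivalent through its kernel's xenbus interface; the guest domid (5) and vif index (0) are hypothetical, and error handling is minimal.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        unsigned int len;

        if (!xsh)
            return 1;

        /* Frontend area for a hypothetical guest (domid 5) and vif index 0. */
        const char *fe = "/local/domain/5/device/vif/0";
        char path[128];

        /* The two locator nodes written by the toolstack... */
        snprintf(path, sizeof(path), "%s/backend-id", fe);
        char *backend_id = xs_read(xsh, XBT_NULL, path, &len);

        snprintf(path, sizeof(path), "%s/backend", fe);
        char *backend = xs_read(xsh, XBT_NULL, path, &len);

        if (backend_id && backend) {
            printf("service domain %s, backend area %s\n", backend_id, backend);

            /* ...and a watch on the backend 'state' node, which drives
             * protocol establishment as described in the following paragraphs. */
            snprintf(path, sizeof(path), "%s/state", backend);
            xs_watch(xsh, path, "backend-state");
        }

        free(backend_id);
        free(backend);
        xs_close(xsh);
        return 0;
    }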
> > +
> > +Under both frontend and backend areas is a node called *state*. This is key to
> > +protocol establishment. Upon PV device creation the toolstack will set the
> > +value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
> > +enumeration of appropriate devices in both the guest and service domains. The
> > +backend device, once it has written any necessary protocol specific information
> > +into the xenstore backend area (to be read by the frontend driver), will update
> > +the backend state node to 2 (XenbusStateInitWait). From this point on PV
> > +protocols differ slightly; the following illustration is true of the netif
> > +protocol.
> > +
> > +Upon seeing a backend state value of 2, the frontend driver will then read the
> > +protocol specific information, write details of grant references (for shared
> > +pages) and event channel ports (for signalling) that it has created, and set
> > +the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing
> > +this frontend state, the backend driver will then read the grant references
> > +(mapping the shared pages) and event channel ports (opening its end of them)
> > +and set the state node in the backend area to 4. Protocol establishment is now
> > +complete and the frontend and backend start to pass data.
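[Editorial note, not part of the patch: the numeric state values used above are from the `xenbus_state` enumeration in Xen's public `io/xenbus.h` (see [2]); the values are reproduced below, with comments added here.]

    enum xenbus_state {
        XenbusStateUnknown       = 0,
        XenbusStateInitialising  = 1,  /* set by the toolstack at device creation */
        XenbusStateInitWait      = 2,  /* backend has published its protocol info */
        XenbusStateInitialised   = 3,
        XenbusStateConnected     = 4,  /* rings and event channels established */
        XenbusStateClosing       = 5,
        XenbusStateClosed        = 6,
        XenbusStateReconfiguring = 7,
        XenbusStateReconfigured  = 8
    };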
> > +
> > +Because the domid of both ends of a PV protocol forms a key part of negotiating
> > +the data plane for that protocol (because it is encoded into both xenstore
> > +nodes and node paths), and because the guest’s own domid and the domid of the
> > +service domain are visible to the guest in xenstore (and hence may be cached
> > +internally), and neither is necessarily preserved during migration, it is
> > +hence necessary to have the co-operation of the frontend in re-negotiating the
> > +protocol using the new domid after migration.
> > +
> > +Moreover the backend-id value will be used by the frontend driver in setting up
> > +grant table entries and event channels to communicate with the service domain,
> > +so the co-operation of the guest is required to re-establish these in the new
> > +host environment after migration.
> > +
> > +Thus if we are to change the model and support migration of a guest with PV
> > +drivers, without the co-operation of the frontend driver code, the paths and
> > +values in both the frontend and backend xenstore areas must remain unchanged
> > +and valid in the new host environment, and the grant table entries and event
> > +channels must be preserved (and remain operational once guest execution is
> > +resumed).
> > +
> > +Because the service domain’s domid is used directly by the guest in setting
> > +up grant entries and event channels, the backend drivers in the new host
> > +environment must be provided by a service domain with the same domid. Also,
> > +because the guest can sample its own domid from the frontend area and use it in
> > +hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
> > +also be preserved to maintain the ABI.
> > +
> > +Furthermore, it will be necessary to modify backend drivers to re-establish
> > +communication with frontend drivers without perturbing the content of the
> > +backend area or requiring any changes to the values of the xenstore state
> > +nodes.
> > +
> > +## Other Para-Virtual State
> > +
> > +### Shared Rings
> > +
> > +Because the console and store protocol shared pages are actually part of the
> > +guest memory image (in an E820 reserved region just below 4G) then the content
>
> While Arm does not yet support migration, the concept of non-cooperative
> live migration is not x86 specific. I am OK with giving an arch-specific
> example, but it should be clear on which architecture this is valid.

Ok.

> > +will get migrated as part of the guest memory image. Hence no additional code
> > +is required to prevent any guest-visible change in the content.
> > +
> > +### Shared Info
> > +
> > +There is already a record defined in *libxenctrl Domain Image Format* [3]
> > +called `SHARED_INFO` which simply contains a complete copy of the domain’s
> > +shared info page. It is not currently included in an HVM (type `0x0002`)
> > +migration stream. It may be feasible to include it as an optional record
> > +but it is not clear that the content of the shared info page ever needs
> > +to be preserved for an HVM guest.
> > +
> > +For a PV guest the `arch_shared_info` sub-structure contains important
> > +information about the guest’s P2M, but this information is not relevant for
> > +an HVM guest where the P2M is not directly manipulated via the guest. The other
> > +state contained in the `shared_info` structure relates to the domain wall-clock
> > +(the state of which should already be transferred by the `RTC` HVM context
> > +information which is contained in the `HVM_CONTEXT` save record) and some event
> > +channel state (particularly if using the *2l* protocol). Event channel state
> > +will need to be fully transferred if we are not going to require the guest's
> > +co-operation to re-open the channels and so it should be possible to re-build a
> > +shared info page for an HVM guest from such other state.
> > +
> > +Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
> > +(32) `vcpu_info` structures. A domain may nominate a different guest physical
> > +address to use for the vcpu info. This is mandatory if a domain wants to
> > +use more than 32 vCPUs and optional for legacy vCPUs. This mapping is not
>
> Similar to above, those values are x86 specific. On Arm, only CPU0 is
> described in shared_info.

Ok.

> > +currently transferred in the migration state so this will either need to be
> > +added into an existing save record, or an additional type of save record will
> > +be needed.
> > +
> > +### Xenstore Watches
> > +
> > +As mentioned above, no domain Xenstore state is currently transferred in the
> > +migration stream. There is a record defined in *libxenlight Domain Image
> > +Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
> > +relating to emulators, but no record type is defined for nodes relating to the
> > +domain itself, nor for registered *watches*. A Xenstore watch is a mechanism
> > +used by PV frontend and backend drivers to request a notification if the value
> > +of a particular node (e.g. the other end’s state node) changes, so it is
> > +important that watches continue to function after a migration. One or more new
> > +save records will therefore be required to transfer Xenstore state. It will
> > +also be necessary to extend the *store* protocol[5] with mechanisms to allow
> > +the toolstack to acquire the list of watches that the guest has registered and
> > +for the toolstack to register a watch on behalf of a domain.
> > +
> > +### Event channels
> > +
> > +Event channels are essentially the para-virtual equivalent of interrupts. They
> > +are an important part of most PV protocols. Normally a frontend driver creates
> > +an *inter-domain* event channel between its own domain and the domain running
> > +the backend, which it discovers using the `backend-id` node in Xenstore (see
> > +above), by making an `EVTCHNOP_alloc_unbound` hypercall. This hypercall
> > +allocates an event channel object in the hypervisor and assigns a *local port*
> > +number which is then written into the frontend area in Xenstore. The backend
> > +driver then reads this port number and *binds* to the event channel by
> > +specifying it, and the value of `frontend-id`, as *remote port* and *remote
> > +domain* (respectively) to an `EVTCHNOP_bind_interdomain` hypercall. Once
> > +connection is established in this fashion frontend and backend drivers can use
> > +the event channel as a *mailbox* to notify each other when a shared ring has
> > +been updated with new requests or response structures.
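[Editorial illustration, not part of the patch: a sketch of the two hypercalls just described, using the argument structures from Xen's public `event_channel.h`. The `xen_event_channel_op()` wrapper is hypothetical (a Linux guest would use `HYPERVISOR_event_channel_op()`, for example) and error handling is elided.]

    #include <stdint.h>
    #include <xen/xen.h>            /* domid_t, DOMID_SELF */
    #include <xen/event_channel.h>  /* evtchn_port_t, EVTCHNOP_*, struct evtchn_* */

    /* Hypothetical wrapper around the event_channel_op hypercall. */
    extern int xen_event_channel_op(int cmd, void *arg);

    /* Frontend side (guest domid X): allocate an unbound channel that the
     * service domain (read from the 'backend-id' node) is allowed to bind. */
    static evtchn_port_t frontend_alloc_port(domid_t backend_id)
    {
        struct evtchn_alloc_unbound alloc = {
            .dom        = DOMID_SELF,
            .remote_dom = backend_id,
        };

        if (xen_event_channel_op(EVTCHNOP_alloc_unbound, &alloc))
            return 0;

        /* alloc.port is the frontend's *local port*; this is the value the
         * frontend writes into its Xenstore area for the backend to read. */
        return alloc.port;
    }

    /* Backend side (service domain, domid Y): bind to the advertised port. */
    static evtchn_port_t backend_bind_port(domid_t frontend_id,
                                           evtchn_port_t remote_port)
    {
        struct evtchn_bind_interdomain bind = {
            .remote_dom  = frontend_id,   /* value of the 'frontend-id' node */
            .remote_port = remote_port,   /* port read from the frontend area */
        };

        if (xen_event_channel_op(EVTCHNOP_bind_interdomain, &bind))
            return 0;

        return bind.local_port;           /* backend's own port for the channel */
    }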
> > +
> > +Currently no event channel state is preserved on migration, requiring frontend
> > +and backend drivers to create and bind a complete new set of event channels in
> > +order to re-establish a protocol connection. Hence, one or more new save
> > +records will be required to transfer event channel state in order to avoid the
> > +need for explicit action by frontend drivers running in the guest. Note that
> > +the local port numbers need to be preserved in this state as they are the only
> > +context the guest has to refer to the hypervisor event channel objects.
> > +
> > +Note also that the PV *store* (Xenstore access) and *console* protocols also
> > +rely on event channels which are set up by the toolstack. Normally, early in
> > +migration, the toolstack running on the remote host would set up a new pair of
> > +event channels for these protocols in the destination domain. These may not be
> > +assigned the same local port numbers as the protocols running in the source
> > +domain. For non-cooperative migration these channels must either be created
> > +with fixed port numbers, or their creation must be avoided and instead be
> > +included in the general event channel state record(s).
> > +
> > +### Grant table
> > +
> > +The grant table is essentially the para-virtual equivalent of an IOMMU. For
> > +example, the shared rings of a PV protocol are *granted* by a frontend driver
> > +to the backend driver by allocating *grant entries* in the guest’s table,
> > +filling in details of the memory pages and then writing the *grant references*
> > +(the index values of the grant entries) into Xenstore. The grant references of
> > +the protocol buffers themselves are typically written directly into the request
> > +structures passed via a shared ring.
> > +
> > +The guest is responsible for managing its own grant table. No hypercall is
> > +required to grant a memory page to another domain. It is sufficient to find an
> > +unused grant entry and set bits in the entry to give read and/or write access
> > +to a remote domain also specified in the entry along with the page frame
> > +number. Thus the layout and content of the grant table logically forms part of
> > +the guest state.
> > +
> > +Currently no grant table state is migrated, requiring a guest to separately
> > +maintain any state that it wishes to persist elsewhere in its memory image and
> > +then restore it after migration. Thus to avoid the need for such explicit
> > +action by the guest, one or more new save records will be required to migrate
> > +the contents of the grant table.
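[Editorial illustration, not part of the patch: a sketch of what "no hypercall is required" looks like for a v1 grant entry, using the layout from Xen's public `grant_table.h`. The `gnttab` pointer stands for the guest's own mapping of its grant table frames, which is assumed to exist and is out of scope here.]

    #include <stdint.h>
    #include <xen/xen.h>          /* domid_t */
    #include <xen/grant_table.h>  /* grant_entry_v1_t, grant_ref_t, GTF_* */

    /* Assumed: the guest's mapping of its own grant table pages. */
    extern grant_entry_v1_t *gnttab;

    /* Grant the backend domain access to one of our frames. */
    static grant_ref_t grant_frame(grant_ref_t ref, domid_t backend_id,
                                   uint32_t gfn, int readonly)
    {
        /* Fill in the target domain and frame number first... */
        gnttab[ref].domid = backend_id;
        gnttab[ref].frame = gfn;

        /* ...then set the flags that make the entry valid. A guest kernel
         * would place a write barrier before this store so that the other
         * fields are visible to the hypervisor and remote domain first. */
        __sync_synchronize();
        gnttab[ref].flags = GTF_permit_access | (readonly ? GTF_readonly : 0);

        /* 'ref' is the grant reference that gets written into Xenstore (for
         * shared rings) or into request structures (for protocol buffers). */
        return ref;
    }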
> > +
> > +# Outline Proposal
> > +
> > +* PV backend drivers will be modified to unilaterally re-establish connection
> > +to a frontend if the backend state node is restored with value 4
> > +(XenbusStateConnected)[6].
> > +
> > +* The toolstack should be modified to allow domid to be randomized on initial
> > +creation or default migration, but make it identical to the source domain on
> > +non-cooperative migration. Non-cooperative migration will have to be denied if
> > +the domid is unavailable on the target host, but randomization of domid on
> > +creation should hopefully minimize the likelihood of this. Non-cooperative
> > +migration to localhost will clearly not be possible. Patches have already been
> > +sent to `xen-devel` to make this change[7].
>
> IIRC, the patch is merged now. You may want to update the last sentence.

It is, since this has been outstanding for such a long time :-/ I'll fix it up.

  Paul

> > +
> > +* `xenstored` should be modified to implement the new mechanisms needed. See
> > +*Other Para-Virtual State* above. A further design document will propose
> > +additional protocol messages.
> > +
> > +* Within the migration stream extra save records will be defined as required.
> > +See *Other Para-Virtual State* above. A further design document will propose
> > +modifications to the libxenlight and libxenctrl Domain Image Formats.
> > +
> > +* An option should be added to the toolstack to initiate a non-cooperative
> > +migration, instead of the (default) potentially co-operative migration.
> > +Essentially this should skip the check for PV drivers and migrate as if none
> > +are present, but also enable the extra save records. Note that at least some
> > +of the extra records should only form part of a non-cooperative migration
> > +stream. For example, migrating event channel state would be counter-productive
> > +in a normal migration as this will essentially leak event channel objects at
> > +the receiving end. Others, such as grant table state, could potentially
> > +harmlessly form part of a normal migration stream.
> > +
> > +* * *
> > +[1] PV drivers are deemed to be installed if the HVM parameter
> > +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
> > +
> > +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
> > +
> > +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
> > +
> > +[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> > +
> > +[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> > +
> > +[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
> > +this.
> > +
> > +[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
> > +
>
> Cheers,
>
> --
> Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel