Re: [PATCH RFC 1/2] docs/design: Add a design document for Live Update
On 06/05/2021 11:42, Julien Grall wrote:

From: Julien Grall <jgrall@xxxxxxxxxx>

Looks good in general... just a few comments below...

Feels like there's a sentence or two missing here. The subject has jumped from
a framework that is not fit for purpose to 'the operation'.

+
+The operation can be divided in roughly 4 parts:
+
+ 1. Trigger: The operation will by triggered from outside the hypervisor
+    (e.g. dom0 userspace).
+ 2. Save: The state will be stabilized by pausing the domains and
+    serialized by xen#1.
+ 3. Hand-over: xen#1 will pass the serialized state and transfer control to
+    xen#2.
+ 4. Restore: The state will be deserialized by xen#2.
+
+All the domains will be paused before xen#1 is starting to save the states,

s/is starting/starts

+and any domain that was running before Live Update will be unpaused after
+xen#2 has finished to restore the states. This is to prevent a domain to try

s/finished to restore/finished restoring

and

s/domain to try/domain trying

+to modify the state of another domain while it is being saved/restored.
+
+The current approach could be seen as non-cooperative migration with a twist:
+all the domains (including dom0) are not expected be involved in the Live
+Update process.
+
+The major differences compare to live migration are:

s/compare/compared

+
+ * The state is not transferred to another host, but instead locally to
+   xen#2.
+ * The memory content or device state (for passthrough) does not need to
+   be part of the stream. Instead we need to preserve it.
+ * PV backends, device emulators, xenstored are not recreated but preserved
+   (as these are part of dom0).
+
+Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*) will need
+to be preserved because another entity may have mappings (e.g foreign, grant)
+on them.
+
+## Trigger
+
+Live update is built on top of the kexec interface to prepare the command line,
+load xen#2 and trigger the operation. A new kexec type has been introduced
+(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
+
+The Live Update will be triggered from outside the hypervisor (e.g. dom0
+userspace). Support for the operation has been added in kexec-tools 2.0.21.
+
+All the domains will be paused before xen#1 is starting to save the states.

You already said this in the previous section.

+In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
+scheduled. In other words, a pause request will not wait for asynchronous
+requests (e.g. I/O) to finish. For Live Update, this is not an ideal time to
+pause because it will require more xen#1 internal state to be transferred.
+Therefore, all the domains will be paused at an architectural restartable
+boundary.
+
+Live update will not happen synchronously to the request but when all the
+domains are quiescent. As domains running device emulators (e.g Dom0) will
+be part of the process to quiesce HVM domains, we will need to let them run
+until xen#1 is actually starting to save the state. HVM vCPUs will be paused
+as soon as any pending asynchronous request has finished.
+
+In the current implementation, all PV domains will continue to run while the
+rest will be paused as soon as possible. Note this approach is assuming that
+device emulators are only running in PV domains.
+
+It should be easy to extend to PVH domains not requiring device emulations.
+It will require more thought if we need to run device models in HVM domains as
+there might be inter-dependency.
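To make the quiescing order described in the *Trigger* section concrete, here is a minimal sketch of what the hypervisor-side sequence could look like. Only *domain\_pause()*, *for\_each\_domain()* and *is\_pv\_domain()* are existing Xen interfaces; *live\_update\_quiesce()* and *domain\_pause\_quiesce()* are invented names for illustration, not part of the proposed patches.

```c
/*
 * Illustrative sketch only, not the actual implementation.
 * domain_pause(), for_each_domain() and is_pv_domain() exist in Xen;
 * domain_pause_quiesce() is a hypothetical helper that waits for any
 * pending asynchronous requests (e.g. I/O) to drain before pausing.
 */
static void live_update_quiesce(void)
{
    struct domain *d;

    /*
     * Pause non-PV domains first: their device emulators run in PV
     * domains (e.g. dom0), which must keep running until all pending
     * asynchronous requests have completed.
     */
    for_each_domain ( d )
        if ( !is_pv_domain(d) )
            domain_pause_quiesce(d);

    /* Once everything else is quiescent, pause the PV domains too. */
    for_each_domain ( d )
        if ( is_pv_domain(d) )
            domain_pause(d);
}
```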
+
+## Save
+
+xen#1 will be responsible to preserve and serialize the state of each existing
+domain and any system-wide state (e.g M2P).

s/to preserve and serialize/for preserving and serializing

+
+Each domain will be serialized independently using a modified migration stream,
+if there is any dependency between domains (such as for IOREQ server) they will
+be recorded using a domid. All the complexity of resolving the dependencies are
+left to the restore path in xen#2 (more in the *Restore* section).
+
+At the moment, the domains are saved one by one in a single thread, but it
+would be possible to consider multi-threading if it takes too long. Although
+this may require some adjustment in the stream format.
+
+As we want to be able to Live Update between major versions of Xen (e.g Xen
+4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
+structure but instead the minimal information that allow us to recreate the
+domains.
+
+For instance, we don't want to preserve the frametable (and therefore
+*struct page\_info*) as-is because the refcounting may be different across
+between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
+*struct page\_info* based on minimal information that are considered stable
+(such as the page type).
+
+Note that upgrading between version of Xen will also require all the hypercalls
+to be stable. This will not be covered by this document.
+
+## Hand over
+
+### Memory usage restrictions
+
+xen#2 must take care not to use any memory pages which already belong to
+guests. To facilitate this, a number of contiguous region of memory are
+reserved for the boot allocator, known as *live update bootmem*.
+
+xen#1 will always reserve a region just below Xen (the size is controlled by
+the Xen command line parameter liveupdate) to allow Xen growing and provide
+information about LiveUpdate (see the section *Breadcrumb*). The region will be
+passed to xen#2 using the same command line option but with the base address
+specified.
+
+For simplicity, additional regions will be provided in the stream. They will
+consist of region that could be re-used by xen#2 during boot (such as the

s/region/a region

Paul

+xen#1's frametable memory).
+
+xen#2 must not use any pages outside those regions until it has consumed the
+Live Update data stream and determined which pages are already in use by
+running domains or need to be re-used as-is by Xen (e.g M2P).
+
+At run time, Xen may use memory from the reserved region for any purpose that
+does not require preservation over a Live Update; in particular it __must__ not be
+mapped to a domain or used by any Xen state requiring to be preserved (e.g
+M2P). In other word, the xenheap pages could be allocated from the reserved
+regions if we remove the concept of shared xenheap pages.
+
+The xen#2's binary may be bigger (or smaller) compare to xen#1's binary. So
+for the purpose of loading xen#2 binary, kexec should treat the reserved memory
+right below xen#1 and its region as a single contiguous space. xen#2 will be
+loaded right at the top of the contiguous space and the rest of the memory will
+be the new reserved memory (this may shrink or grow). For that reason, freed
+init memory from xen#1 image is also treated as reserved liveupdate update
+bootmem.
+
+### Live Update data stream
+
+During handover, xen#1 creates a Live Update data stream containing all the
+information required by the new Xen#2 to restore all the domains.
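The *Save* section only says that a "modified migration stream" is used and that cross-domain dependencies (such as IOREQ servers) are recorded by domid. As a purely hypothetical illustration of that idea, a per-record header might look something like the sketch below; the field names, record types and layout are all made up, only *domid\_t* is an existing Xen type.

```c
/*
 * Hypothetical record header for the Live Update stream. The actual
 * format is not specified in this document; record type names and the
 * layout below are illustrative only.
 */
struct lu_record_header {
    uint32_t type;    /* e.g. LU_REC_PAGE_OWNERSHIP, LU_REC_IOREQ_SERVER */
    uint32_t length;  /* payload length in bytes */
    domid_t  domid;   /* domain the record belongs to, or the domain a
                       * dependency points at (resolved on restore) */
    uint16_t pad;     /* keep the header 4-byte aligned */
};
```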
+
+Data pages for this stream may be allocated anywhere in physical memory outside
+the *live update bootmem* regions.
+
+As calling __vmap()__/__vunmap()__ has a cost on the downtime. We want to reduce the
+number of call to __vmap()__ when restoring the stream. Therefore the stream
+will be contiguously virtually mapped in xen#2. xen#1 will create an array of
+MFNs of the allocated data pages, suitable for passing to __vmap()__. The
+array will be physically contiguous but the MFNs don't need to be physically
+contiguous.
+
+### Breadcrumb
+
+Since the Live Update data stream is created during the final **kexec\_exec**
+hypercall, its address cannot be passed on the command line to the new Xen
+since the command line needs to have been set up by **kexec(8)** in userspace
+long beforehand.
+
+Thus, to allow the new Xen to find the data stream, xen#1 places a breadcrumb
+in the first words of the Live Update bootmem, containing the number of data
+pages, and the physical address of the contiguous MFN array.
+
+### IOMMU
+
+Where devices are passed through to domains, it may not be possible to quiesce
+those devices for the purpose of performing the update.
+
+If performing Live Update with assigned devices, xen#1 will leave the IOMMU
+mappings active during the handover (thus implying that IOMMU page tables may
+not be allocated in the *live update bootmem* region either).
+
+xen#2 must take control of the IOMMU without causing those mappings to become
+invalid even for a short period of time. In other words, xen#2 should not
+re-setup the IOMMUs. On hardware which does not support Posted Interrupts,
+interrupts may need to be generated on resume.
+
+## Restore
+
+After xen#2 initialized itself and map the stream, it will be responsible to
+restore the state of the system and each domain.
+
+Unlike the save part, it is not possible to restore a domain in a single pass.
+There are dependencies between:
+
+ 1. different states of a domain. For instance, the event channels ABI
+    used (2l vs fifo) requires to be restored before restoring the event
+    channels.
+ 2. the same "state" within a domain. For instance, in case of PV domain,
+    the pages' ownership requires to be restored before restoring the type
+    of the page (e.g is it an L4, L1... table?).
+ 3. domains. For instance when restoring the grant mapping, it will be
+    necessary to have the page's owner in hand to do proper refcounting.
+    Therefore the pages' ownership have to be restored first.
+
+Dependencies will be resolved using either multiple passes (for dependency
+type 2 and 3) or using a specific ordering between records (for dependency
+type 1).
+
+Each domain will be restored in 3 passes:
+
+ * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
+   down in 3 parts:
+     * Allocate a domain via _domain\_create()_ but skip part that requires
+       extra records (e.g HAP, P2M).
+     * Restore any parts which needs to be done before create the vCPUs. This
+       including restoring the P2M and whether HAP is used.
+     * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
+ * Pass 1: It will restore the pages' ownership and the grant-table frames
+ * Pass 2: This steps will restore any domain states (e.g vCPU state, event
+   channels) that wasn't
+
+A domain should not have a dependency on another domain within the same pass.
+Therefore it would be possible to take advantage of all the CPUs to restore
+domains in parallel and reduce the overall downtime.
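Going back to the breadcrumb handover described in the *Breadcrumb* and *Live Update data stream* sections, the sketch below shows one possible way xen#2 could consume it. The design only says the breadcrumb holds the number of data pages and the physical address of the contiguous MFN array, so the struct layout, the *magic* field and the function are invented for illustration; it also assumes the MFN array is reachable through the direct map at that point. *vmap()* and *maddr\_to\_virt()* are existing Xen interfaces.

```c
/*
 * Hypothetical breadcrumb layout at the start of the live update
 * bootmem region; only the nr_pages and mfn_array contents are taken
 * from the design, the rest is illustrative.
 */
struct lu_breadcrumb {
    uint64_t magic;      /* assumed sanity check, not mandated by the doc */
    uint64_t nr_pages;   /* number of data pages in the stream */
    uint64_t mfn_array;  /* physical address of the contiguous MFN array */
};

/*
 * xen#2 side: map the whole stream with a single vmap() call to keep
 * the downtime cost low, assuming the MFN array is already accessible
 * via the direct map.
 */
static void *lu_map_stream(const struct lu_breadcrumb *bc)
{
    const mfn_t *mfns = maddr_to_virt(bc->mfn_array);

    return vmap(mfns, bc->nr_pages);
}
```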
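The three-pass restore above could be structured roughly as in the following sketch. All names here are hypothetical; only the pass structure and the observation that domains have no intra-pass dependencies (making the inner loop a candidate for later parallelisation) come from the design.

```c
/*
 * Illustrative three-pass restore loop; lu_stream, lu_domain_record and
 * lu_restore_domain_pass() are made-up names for this example.
 */
static int lu_restore_domains(struct lu_stream *stream)
{
    struct lu_domain_record *rec;
    unsigned int pass;
    int rc;

    for ( pass = 0; pass < 3; pass++ )
    {
        /*
         * Pass 0: create the domain and vCPUs, pass 1: page ownership
         * and grant-table frames, pass 2: remaining domain state.
         */
        list_for_each_entry ( rec, &stream->domains, list )
        {
            rc = lu_restore_domain_pass(rec, pass);
            if ( rc )
                return rc;
        }
    }

    return 0;
}
```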
+
+Once all the domains have been restored, they will be unpaused if they were
+running before Live Update.
+
+* * *
+
+[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md;h=4b876d809fb5b8aac02d29fd7760a5c0d5b86d87;hb=HEAD