Xen project Mailing List

Re: [Xen-devel] reliable live migration of large and busy guests

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Tue, 6 Nov 2012 23:18:39 +0000

Delivery-date: Tue, 06 Nov 2012 23:19:53 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 06/11/12 22:18, Olaf Hering wrote:
> On Tue, Nov 06, Keir Fraser wrote:
>
>> It's known that if you have a workload that is dirtying lots of pages
>> quickly, the final stop-and-copy phase will necessarily be large. A VM that
>> is busy dirtying lots of pages can dirty pages much quicker than they can be
>> transferred over the LAN.
> In my opinion such migration should be done at application level.
>
>>> Should 'xm migrate --live' and 'xl migrate' get something like a
>>> --no-suspend option?
>> Well, it is not really possible to avoid the suspend altogether, there is
>> always going to be some minimal 'dirty working set'. But could provide
>> parameters to require the dirty working set to be smaller than X pages
>> within Y rounds of dirty page copying.
> Should such knobs be exposed to the tools like x[lm] migrate --knob1 val
> --knob2 val?
>
> Olaf

We (Citrix) are currently looking at some fairly serious performance
issues with migration with both classic and pvops dom0 kernels (patches
to follow in due course).

While that will make the situation better, it won't solve the problem
you have described. As far as I understand (so please correct me if I
am wrong), migration works by transmitting pages until the rate of
dirty pages per round approaches constant, at which point the domain
gets paused and all remaining dirty pages are transmitted. (With the
proviso that there is currently a maximum number of rounds before the
domain is paused automatically; this becomes increasingly problematic
as guest sizes grow.) Having these knobs tweakable by the
admin/toolstack seems like a very sensible idea.

The application problem you described could possibly be something
crashing because of a sufficiently large jump in time?

As potential food for thought:

Is there wisdom in having a new kind of live migrate which, when
pausing the VM on the source host, resumes the VM on the destination
host? Xen would have to track not-yet-sent pages, pause the guest on
pagefault, and request the required page as a matter of priority.

The advantages of this approach would be that timing-sensitive
workloads would be paused for far less time. Even if the guest was
frequently being paused for pagefaults, the time to get a single page
over the LAN would be far shorter than the time to transfer the entire
dirty set, at which point, on resume, the interrupt paths would fire
again; the timing paths would quickly become fully populated. Further
to that, a busy workload in the guest dirtying a page which has already
been sent will not result in any further network traffic.

The disadvantages would be that Xen would need 2-way communication with
the toolstack to prioritise which page is needed to resolve a
pagefault, and presumably the toolstack->toolstack protocol would be
more complicated. In addition, it would be much harder to "roll back"
the migrate; once you resume the guest on the destination host, you are
committed to completing the migrate.

I presume there are other issues I have overlooked, but this idea has
literally just occurred to me upon reading this thread so far.

Comments?

~Andrew

>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
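
As an illustration of the pre-copy behaviour discussed in the message above (copy rounds, then a final pause, with the proposed "X pages within Y rounds" knobs), here is a rough toy model. It is only a sketch: the function name, its parameters, and the assumption that the guest dirties a fixed percentage of the pages sent in each round are all hypothetical and are not taken from the Xen toolstack.

```python
def simulate_precopy(total_pages, dirty_percent, working_set,
                     max_rounds, max_final_dirty):
    """Toy model of iterative pre-copy migration (not Xen code).

    dirty_percent   -- assumed pages dirtied per 100 pages sent in a round
    working_set     -- assumed minimal dirty working set per round
    max_rounds      -- knob "Y": give up after this many copy rounds
    max_final_dirty -- knob "X": pause only once the dirty set is this small
    """
    to_send = total_pages   # the first round sends every page
    sent = 0
    for round_no in range(1, max_rounds + 1):
        sent += to_send
        # Pages the guest dirtied while this round was on the wire.
        dirtied = min(total_pages,
                      max(working_set, to_send * dirty_percent // 100))
        if dirtied <= max_final_dirty:
            # Small enough: pause the guest and send the remainder.
            return {"converged": True, "rounds": round_no,
                    "final_dirty": dirtied, "pages_sent": sent + dirtied}
        to_send = dirtied
    # Knob Y exhausted: the dirty set never became small enough.
    return {"converged": False, "rounds": max_rounds,
            "final_dirty": to_send, "pages_sent": sent}


if __name__ == "__main__":
    # A fairly idle guest converges after a few rounds.
    print(simulate_precopy(1_000_000, 10, 500, 10, 1000))
    # A guest dirtying pages faster than they are sent never converges.
    print(simulate_precopy(1_000_000, 120, 500, 10, 1000))
```

In this model a toolstack could either abort or force the pause when `converged` is false; which of the two is preferable is exactly the policy question raised in the thread.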
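
The "resume on the destination first" idea floated in the message above can also be sketched as a toy model. Everything here is an assumption made for illustration: the class name, the dictionary standing in for the network link to the source host, and the one-page-per-step background stream are hypothetical and say nothing about how Xen would actually implement it.

```python
from collections import deque


class PostCopyDestination:
    """Toy model of a guest resumed before all of its pages have arrived."""

    def __init__(self, source_pages):
        self.source = source_pages      # stand-in for the source host
        self.present = {}               # pages already on the destination
        self.pending = deque(sorted(source_pages))  # background send order
        self.faults = 0
        self.transfers = 0

    def _fetch(self, pfn):
        # One page crosses the (simulated) network.
        self.present[pfn] = self.source[pfn]
        self.transfers += 1

    def access(self, pfn, write_value=None):
        """Guest touches a page; a missing page is fetched with priority."""
        if pfn not in self.present:
            self.faults += 1            # the guest would be paused here
            self._fetch(pfn)
        if write_value is not None:
            # Dirtying a page that has arrived causes no further traffic.
            self.present[pfn] = write_value
        return self.present[pfn]

    def background_step(self):
        """Send the next not-yet-present page; False once nothing is left."""
        while self.pending:
            pfn = self.pending.popleft()
            if pfn not in self.present:
                self._fetch(pfn)
                return True
        return False


if __name__ == "__main__":
    dest = PostCopyDestination({i: bytes([i]) for i in range(8)})
    dest.access(5)              # faults, page 5 is fetched on demand
    dest.access(5, b"x")        # page is dirtied locally, no transfer
    while dest.background_step():
        pass
    print(dest.faults, dest.transfers)
```

Each page is transferred exactly once in this model, however often the guest writes to it, which is the network-traffic advantage described above. The model has no way to roll back, which mirrors the disadvantage also noted there.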
