Re: [Xen-API] cross-pool migrate, with any kind of storage (shared or local)
On Wed, 2011-07-13 at 10:21 -0400, Dave Scott wrote:
> Hi,
>
> I've created a page on the wiki describing a new migration protocol for
> xapi. The plan is to make migrate work both within a pool and across
> pools, and to work with and without storage i.e. transparently migrate
> storage if necessary.
>
> The page is here:
>
> http://wiki.xensource.com/xenwiki/CrossPoolMigration
>
> The rough idea is to:
> 1. use an iSCSI target to export disks from the receiver to the
>    transmitter
> 2. use tapdisk's log dirty mode to build a continuous disk copy program
> -- perhaps we should go the full way and use the tapdisk block mirror
> code to establish a full storage mirror?

Just 2 [ ps: oh, maybe 10 :o) ] cents:

I believe motion and live mirroring, while certainly related, are two
different kinds of beast in terms of complexity (implementation, *and*
space, *and* I/O load).

Migration can converge on the working set, then stop-and-copy, and once
that final pass finishes, it has succeeded. More importantly, this way it
may fail at any time and you're still in business: the source copy stays
where it is, so you just clean up and haven't broken much. (First sketch
below.)

Continuous replication, done consistently, is harder. You can't just
write a dirty log in the VM-migration dirty-bitmap sense and do linear
passes over it. You need synchronization points, or you're going to screw
up file system consistency. You can still do it asynchronously, but it
takes something converging to snapshots in space complexity. Not
necessarily full snapshots, but at least a CoW mapping to separate a
sequence of partially ordered updates (the part where ordering applies
and block ranges overlap), so you can replay the right state in the right
order on the destination side. (Second sketch below.)

If you want global state consistency ('true' HA), that implies a fair
amount of synchronization. It's certainly an interesting subject for
virtual environments, because you can get a grasp of the partial write
order from the block layer: it's pretty much exactly the sequence of
cache flushes you observe if you declare your disk as write-caching --
provided you trust the FS and journaling code to be bug-free, and the
apps involved to be written correctly. (If you're mistaken, at least
you've got someone else to blame, but without huge amounts of testing and
trust gained, I guess it's tedious to triage those little unhappy
accidents.)

If only FS integrity matters, you can run a coarser series of updates for
asynchronous mirroring. I suspect DRBD does at least something like that
(I'm not a DRBD expert either). I'm not sure whether the asynchronous
mode I see on the feature list allows for conclusions about DRBD's idea
of HA in any way; it may just limit HA to synchronous mode. Does anyone
know?

Asynchronously, one could use VHD snapshots. Algebraically, I believe
they are somewhat of a superset, just with significant overhead (the
architectural difference, in a nutshell, being that replication doesn't
need an option to roll back). You'd take snapshots on both the source and
destination nodes, in logical synchrony: create a new one on the
destination, copy the frozen leaf over, then delete the source one, from
source bottom to top. But that's a lot of duplicated I/O, coalescing the
chain back down after each network transfer, which gets awkward -- it's
similar to the current leaf coalesce. (Third sketch below.) Anyway, it's
not exactly a rainy-weekend project, so if you want consistent mirroring,
there doesn't seem to be anything better than DRBD around the corner.
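To make the contrast concrete, here is a toy sketch of the pre-copy loop
in Python. Nothing here is xapi or tapdisk API -- DirtyDisk, the block
counts and the thresholds are all made up for illustration. It only shows
why a flat dirty bitmap is enough for migration: each pass copies the
blocks dirtied during the previous one, a round cap forces stop-and-copy
if a hot guest never converges, and a failure at any point simply leaves
the untouched source copy in charge.

    import random

    BLOCKS = 1024        # disk size, in copy-granularity blocks
    MAX_ROUNDS = 10      # cap on pre-copy passes; a hot guest may never converge
    STOP_THRESHOLD = 16  # residual blocks we accept copying while paused

    class DirtyDisk:
        """Source disk plus a flat dirty bitmap (log-dirty style)."""
        def __init__(self):
            self.data = [0] * BLOCKS
            self.dirty = set(range(BLOCKS))   # initially everything is dirty

        def guest_writes(self, n):
            """Simulate the running guest touching n random blocks."""
            for _ in range(n):
                b = random.randrange(BLOCKS)
                self.data[b] += 1
                self.dirty.add(b)

        def drain_dirty(self):
            """Atomically take and reset the dirty set."""
            taken, self.dirty = self.dirty, set()
            return taken

    def migrate(src, dst):
        for _ in range(MAX_ROUNDS):
            for b in src.drain_dirty():
                dst[b] = src.data[b]          # copy while the guest runs...
            src.guest_writes(64)              # ...and keeps dirtying blocks
            if len(src.dirty) <= STOP_THRESHOLD:
                break
        for b in src.drain_dirty():           # stop-and-copy: guest paused,
            dst[b] = src.data[b]              # so nothing races with this pass

    src, dst = DirtyDisk(), [0] * BLOCKS
    migrate(src, dst)
    assert dst == src.data                    # destination is coherent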
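The replication side, same disclaimer: this is my reading of the problem,
not DRBD's actual protocol, and EpochReplicator is invented for the
sketch. Writes are grouped into epochs delimited by the guest's cache
flushes, and the destination applies each epoch atomically and in order.
The open epoch buffer is exactly the CoW-ish space cost I mentioned.

    class EpochReplicator:
        """Group writes into flush-delimited epochs; apply each epoch
        atomically on the destination (a plain dict here)."""
        def __init__(self, destination):
            self.destination = destination
            self.epoch = {}                   # open epoch: block -> data

        def write(self, block, data):
            # Within one epoch the partial order doesn't constrain us:
            # overlapping writes collapse, the last one wins.
            self.epoch[block] = data

        def flush(self):
            # A guest flush is the synchronization point: everything
            # before it must be remotely visible before anything after it.
            shipped, self.epoch = self.epoch, {}
            self._apply(shipped)

        def _apply(self, epoch):
            # A real implementation would stage the epoch remotely and
            # commit only once it's complete, so a torn transfer never
            # leaves the destination between two consistent states.
            self.destination.update(epoch)

Between flushes the destination lags, but it always holds some state the
guest's file system could legitimately have crashed in -- the "FS
integrity" level of guarantee, as opposed to synchronous HA.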
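And the VHD variant, again with plain dicts standing in for VHD block
maps (the real thing would go through vhd-util and the storage manager):
each round freezes the source leaf, ships exactly that delta, and then
coalesces it back into the base. The duplicated I/O shows up as the extra
local pass over every block we just sent.

    class VhdChain:
        """Toy VHD chain: nodes[0] is the base image, nodes[-1] the
        writable leaf, anything in between is a frozen snapshot."""
        def __init__(self):
            self.nodes = [{}]

        def write(self, block, data):
            self.nodes[-1][block] = data      # guest writes hit the leaf

        def snapshot(self):
            """Freeze the current leaf and open a fresh one; the frozen
            node is exactly the delta since the last snapshot."""
            frozen = self.nodes[-1]
            self.nodes.append({})
            return frozen

        def coalesce(self):
            """Fold the oldest frozen delta back into the base: every
            block we just shipped gets rewritten once more locally."""
            if len(self.nodes) > 2:
                self.nodes[0].update(self.nodes.pop(1))

    def replicate_round(src, dst):
        """One asynchronous mirroring round. A real implementation would
        snapshot on the destination first, so a transfer that dies
        half-way can simply be thrown away."""
        delta = src.snapshot()            # guest keeps writing to the new leaf
        for block, data in delta.items():
            dst[block] = data             # network copy of the frozen delta
        src.coalesce()                    # fold the shipped delta away again

So each round's data is written once by the guest, once over the network
and once more by the coalesce -- the same pattern that makes the current
leaf coalesce expensive.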
In summary, my point is that it's probably better to focus on migration
only -- it's one flat dirty-log index and works in situ at the block
level. Beyond that, I think it's perfectly legitimate to implement
mirroring independently: the math is very similar, but the differences
have a huge impact on performance, I/O overhead, space to be set aside,
and robustness.

Cheers,
Daniel

[PS: comments/corrections welcome, indeed.]

> 3. use the VM metadata export/import to move the VM metadata between
>    pools
>
> I'd also like to
> * make the migration code unit-testable (so I can test the failure
>   paths easily)
> * make the code more robust to host failures by host heartbeating
> * make migrate properly cancellable
>
> I've started making a prototype -- so far I've written a simple python
> wrapper around the iscsi target daemon:
>
> https://github.com/djs55/iscsi-target-manager

_______________________________________________
xen-api mailing list
xen-api@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/mailman/listinfo/xen-api