Re: [Xen-API] cross-pool migrate, with any kind of storage (shared or local)
On Wed, 2011-07-13 at 10:21 -0400, Dave Scott wrote:
> I've created a page on the wiki describing a new migration protocol for xapi.
> The plan is to make migrate work both within a pool and across pools, and to
> work with and without storage i.e. transparently migrate storage if necessary.
> The page is here:
> The rough idea is to:
> 1. use an iSCSI target to export disks from the receiver to the transmitter
> 2. use tapdisk's log dirty mode to build a continuous disk copy program
> -- perhaps we should go the full way and use the tapdisk block mirror code to
> establish a full storage mirror?
Just 2 [ ps: oh, maybe 10 :o) ] cents:
I believe motion and live mirroring, while certainly related, are two
different kinds of beasts in terms of complexity (implementation, *and*
space, *and* I/O load).
Migration can converge to the working set, then stop-and-copy, and once
that finishes up, it may succeed. But more importantly, this way it may
fail at any time, and you're still in business. The source copy stays
where it is, so you just clean up and haven't broken much.
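
To make the converge-then-stop-and-copy shape concrete, here is a minimal
sketch in Python. It is only a toy model, not tapdisk or xapi code: the
disk is a plain list of blocks, and `run_guest` is a hypothetical hook that
lets the guest write during one copy pass and reports which blocks it
dirtied. The names `precopy_migrate`, `max_passes` and `stop_threshold` are
made up for illustration.

```python
def precopy_migrate(source, run_guest, max_passes=8, stop_threshold=2):
    """Copy `source` (a list of blocks) to a new destination list.

    `run_guest(source)` is a hypothetical hook: it mutates `source` the way
    a running guest would during one copy pass and returns the set of block
    indices it dirtied. Returns the destination, or None if we gave up.
    """
    dest = [None] * len(source)
    dirty = set(range(len(source)))       # first pass: every block is dirty
    for _ in range(max_passes):
        if len(dirty) <= stop_threshold:
            break                         # converged to the working set
        to_copy, dirty = dirty, set()     # one flat dirty index, reset per pass
        for i in to_copy:
            dest[i] = source[i]
        dirty |= run_guest(source)        # writes that raced with this pass
    else:
        # Never converged: abort. Nothing on the source side was changed by
        # the migration itself, so cleanup is just discarding `dest`.
        return None
    # Stop-and-copy: the guest is paused here, so `dirty` is final.
    for i in dirty:
        dest[i] = source[i]
    return dest
```

The failure path is the point: giving up only throws away the destination
copy, and the source keeps running as before.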
Continuous replication, consistently, is harder. You can't just write a
dirty log in the VM migration dirty-bitmap sense, and do linear passes
on that. You need synchronization points, or you're going to screw up
file system consistency.
You can do that asynchronously, still, but it takes something approaching
snapshots in space complexity. Not necessarily full snapshots, but at
least you need a CoW mapping to separate a sequence of partially ordered
updates (the part where order applies and block ranges overlap), so you
can replay the right state in the right order on the destination side.
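
A rough sketch of what such a mapping could look like, again as a toy
model in Python rather than anything tapdisk or DRBD actually does. The
assumption here is that a flush from the guest acts as the synchronization
point: writes between two flushes form one epoch, and the destination only
ever applies whole epochs, in order. `EpochLog` and `replay` are
hypothetical names.

```python
class EpochLog:
    """Per-epoch block maps on the source side; a flush seals an epoch."""

    def __init__(self):
        self.sealed = []     # completed epochs, oldest first
        self.current = {}    # block index -> data for the open epoch

    def write(self, block, data):
        # Within one epoch only the latest data per block is kept.
        self.current[block] = data

    def flush(self):
        # Synchronization point: everything written so far becomes one
        # unit that must be applied before any later epoch.
        if self.current:
            self.sealed.append(self.current)
            self.current = {}


def replay(dest, epochs):
    """Apply sealed epochs to `dest` (a dict of block -> data), in order."""
    for epoch in epochs:
        dest.update(epoch)
    return dest
```

Unlike a flat dirty bitmap, the log holds a copy of the data for every
sealed epoch that has not been shipped yet, which is where the
snapshot-like space cost comes from.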
If you want global state consistency ('true' HA), that implies a fair
amount of synchronization. It's certainly an interesting subject for
virtual environments, because you can get a grasp of the partial write
order from the block layer. It's pretty much exactly the sequence of
cache flushes you observe if you declare your disk as caching, provided
you trust the FS and journaling code to be bug free, and the apps involved
to be written correctly (if you're mistaken, at least you've got someone
else to blame, but without huge amounts of testing and trust gained, I
guess it's tedious to triage those little unhappy accidents.)
If only FS integrity matters, you can run a coarser series of updates,
for asynchronous mirroring. I suspect DRBD does at least something like
that (I'm not a DRBD expert either). I'm not sure if the asynchronous mode I
see on the feature list allows for conclusions on DRBD's idea of HA in
any way. It may just limit HA to being synchronous mode. Does anyone
Asynchronously, one could use VHD snapshots; algebraically, I believe
they are somewhat of a superset, just with significant overhead (I think
the architectural difference, in a nutshell, would be that replication
doesn't need an option to roll back). You'd take snapshots on both the
source and destination nodes, in logical synchrony. Create a new one on
the destination, copy it over, delete the source one, from source bottom
to top. But it's a lot of duplicated I/O to coalesce that down again on a
regular basis after each network transfer, which gets awkward. It's
similar to the current leaf coalesce.
Anyway, it's not exactly a rainy weekend project, so if you want
consistent mirroring, there doesn't seem to be anything better than DRBD
around the corner.
In summary, my point is that it's probably better to focus on migration
only - it's one flat dirty log index and works in-situ at the block
level. Beyond, I think it's perfectly legal to implement mirroring
independently -- the math is very similar, but the differences make for a
huge impact on performance, I/O overhead, space to be set aside, and
[PS: comments/corrections welcome, indeed].
> 3. use the VM metadata export/import to move the VM metadata between pools
> I'd also like to
> * make the migration code unit-testable (so I can test the failure paths
> * make the code more robust to host failures by host heartbeating
> * make migrate properly cancellable
> I've started making a prototype-- so far I've written a simple python wrapper
> around the iscsi target daemon:
> xen-api mailing list