Re: [Xen-API] How snapshot work on LVMoISCS SR

To: Anthony Xu <anthony@xxxxxxxxx>
From: Julian Chesterfield <julian.chesterfield@xxxxxxxxxxxxx>
Date: Wed, 27 Jan 2010 10:56:47 +0000
Cc: "\"Dave.Scott\"@eu.citrix.com" <Dave.Scott@xxxxxxxxxxxxx>, "xen-api@xxxxxxxxxxxxxxxxxxx" <xen-api@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 27 Jan 2010 03:16:06 -0800
List-id: Discussion of API issues surrounding Xen <xen-api.lists.xensource.com>

Hi Anthony,

Ian already covered most of the queries, but just to add some more detail:

Anthony Xu wrote:

Hi Julian/Dave,

Thanks for your detailed explanation,

I still have below questions.

1. if a non-leaf node is coalesce-able, it will be coalesced later on
regardless how big the physical size of this node?

Correct. The background coalesce task will always coalesce nodes if there is work that can be done. Note that there are out-of-space conditions where this may be tricky, so the algorithm always tries to coalesce the smallest differencing disk in a chain first.
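
As a rough illustration of the "smallest first" ordering, here is a minimal Python sketch. It is not the actual GC code; the Candidate record, its field names and the sizes are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A coalescable differencing disk (hypothetical record, not an XCP type)."""
    name: str
    physical_size: int  # bytes currently allocated to this VHD


def coalesce_order(candidates):
    """Return the candidates ordered smallest first.

    Coalescing the smallest disk first needs the least scratch space,
    which matters when the SR is close to full.
    """
    return sorted(candidates, key=lambda c: c.physical_size)


# Illustrative sizes only.
chain = [
    Candidate("vhd-a", 8 * 1024**3),
    Candidate("vhd-b", 64 * 1024**2),
    Candidate("vhd-c", 2 * 1024**3),
]
order = [c.name for c in coalesce_order(chain)]
print(order)  # ['vhd-b', 'vhd-c', 'vhd-a']
```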

2. there is one leaf node for a snapshot, actually it may be empty, does
it exist only because it can prevent coalesce.

It is true that the presence of a leaf node for a snapshot does prevent coalesce from occurring; however, the original motivation for providing a leaf node was to enable "writeable snapshots". Some VMs, Windows in particular, require that any disk attached to the OS be writeable. When a disk is presented to the OS it scribbles a disk signature at the head of the disk, and for some versions (XP and 2k3 I believe), you will actually see a blue screen if the disk is read-only, i.e. writes are blocked at the block device driver level.

3. a clone will introduce a writable snapshot, it will prevent coalesce

Correct.

- Julian


- Anthony



On Tue, 2010-01-26 at 02:34 -0800, Julian Chesterfield wrote:

Hi Anthony,

Anthony Xu wrote:

> Hi all,
>
> Basically snapshot on LVMoISCSI SR work well, it provides thin
> provisioning, so it is fast and disk space efficient.
>
> But I still have below concern.
>
> There is one more vhd chain when creating snapshot, if I creates 16
> snapshots, there are 16 vhd chains, that means when one VM accesses a
> disk block, it may need to access 16 vhd lvm one by one, then get the
> right block, it makes VM access disk slow. However, it is
> understandable, it is part of snapshot IMO.

The depth and speed of access will depend on the write pattern to the disk. In XCP we add an optimisation called a BATmap which stores one bit per BAT entry. This is a fast lookup table that is cached in memory while the VHD is open, and tells the block device handler whether a block has been fully allocated. Once the block is fully allocated (all logical 2MB written) the block handler knows that it doesn't need to read or write the Bitmap that corresponds to the data block, it can go directly to the disk offset. Scanning through the VHD chain can therefore be very quick, i.e. the block handler reads down the chain of BAT tables for each node until it detects a node that is allocated, with hopefully the BATmap value set. The worst case is a random disk write workload which causes the disk to be fragmented and partially allocated. Every read or write will therefore potentially incur a bitmap check at every level of the chain.

> But after I delete all these 16 snapshots, there is still 16 vhd chains,
> the disk access is still slow, which is not understandable and
> reasonable, even though there may be only several KB difference between
> each snapshot,

There is a mechanism in XCP called the GC coalesce thread which gets kicked asynchronously following a VDI deletion event. It queries the VHD tree, and determines whether there is any coalescable work to do. Coalescable work is defined as:

'a hidden child node that has no siblings'

Hidden nodes are non-leaf nodes that reside within a chain. When the snapshot leaf node is deleted therefore, it will leave redundant links in the chain that can be safely coalesced. You can kick off a coalesce by issuing an SR scan, although it should kick off automatically within 30 seconds of deleting the snapshot node, handled by XAPI. If you look in the /var/log/SMlog file you'll see a lot of debug information including tree dependencies which will tell you a) whether the GC thread is running, and b) whether there is coalescable work to do. Note that deleting snapshot nodes does not always mean that there is coalescable work to do since there may be other siblings, e.g. VDI clones.
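
The rule "a hidden child node that has no siblings" can be sketched in a few lines of Python. This is only a toy model of the VHD tree and not the GC thread's implementation; the Node class, its attributes and the node names are assumptions made for the example.

```python
class Node:
    """Toy VHD tree node (hypothetical; not the storage manager's own class)."""

    def __init__(self, name, hidden=False):
        self.name = name
        self.hidden = hidden   # True for non-leaf nodes inside a chain
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def is_coalescable(node):
    """A hidden child node that has no siblings."""
    if not node.hidden or node.parent is None:
        return False
    return len(node.parent.children) == 1


def find_coalescable(root):
    """Walk the tree and collect the names of coalescable nodes."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if is_coalescable(node):
            found.append(node.name)
        stack.extend(node.children)
    return found


# Two snapshots taken of one disk: each snapshot turns the previously
# active node into a hidden parent with a snapshot leaf and a new active leaf.
base = Node("base", hidden=True)
snap1 = base.add_child(Node("snap1"))
mid = base.add_child(Node("mid", hidden=True))
snap2 = mid.add_child(Node("snap2"))
active = mid.add_child(Node("active"))

before = find_coalescable(base)  # "mid" still has the sibling "snap1"
base.children.remove(snap1)      # delete the first snapshot leaf
after = find_coalescable(base)   # "mid" is now an only child
print(before, after)             # [] ['mid']
```

If "snap1" were a clone that is still in use rather than a deleted snapshot, "mid" would keep its sibling and nothing would be coalescable, which matches the note above about VDI clones.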

is there any way we can reduce depth of vhd chain after deleting
snapshots? get VM back to normal disk performance.

The coalesce thread handles this, see above.
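
To show why chain depth matters until the coalesce has run, here is a toy Python model of the BAT/BATmap lookup described earlier in this mail. It is a sketch under stated assumptions and not the block handler's code: the Vhd class and locate function are hypothetical, a single dict stands in for both the BAT and the per-block bitmap, and 512-byte sectors are assumed so that a 2MB block holds 4096 sectors.

```python
SECTORS_PER_BLOCK = 4096  # 2MB block with 512-byte sectors (assumed)


class Vhd:
    """Toy VHD node: tracks which sectors of which blocks were written."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.written = {}  # block index -> set of written sector offsets

    def write(self, block, sectors):
        self.written.setdefault(block, set()).update(sectors)

    def fully_allocated(self, block):
        # This is the fact that the in-memory BATmap bit records.
        return len(self.written.get(block, ())) == SECTORS_PER_BLOCK


def locate(leaf, sector):
    """Return (name of the node holding the sector, bitmap reads needed)."""
    block, offset = divmod(sector, SECTORS_PER_BLOCK)
    bitmap_reads = 0
    node = leaf
    while node is not None:
        if block in node.written:            # BAT entry present
            if node.fully_allocated(block):  # BATmap bit set: skip the bitmap
                return node.name, bitmap_reads
            bitmap_reads += 1                # partially written: check bitmap
            if offset in node.written[block]:
                return node.name, bitmap_reads
        node = node.parent                   # fall through to the parent
    return None, bitmap_reads                # never written anywhere


base = Vhd("base")
mid = Vhd("mid", parent=base)
leaf = Vhd("leaf", parent=mid)

base.write(0, range(SECTORS_PER_BLOCK))  # block 0 fully written in base
mid.write(0, [5])                        # scattered writes higher up
leaf.write(0, [9])

print(locate(leaf, 7))     # ('base', 2): two bitmap checks on the way down
print(locate(leaf, 9))     # ('leaf', 1)
print(locate(leaf, 4099))  # (None, 0): block 1 has no BAT entry anywhere
```

In this model every partially written node above the data costs one bitmap check, so a shorter chain after coalescing means fewer checks per request.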

And, I notice there are useless vhd volume exist after deleting snap
shots, can we delete them automatically?

No. I do not recommend deleting VHDs manually since they are almost certainly referenced by something else in the chain. If you delete them manually you will break the chain, it will become unreadable, and you potentially lose critical data. VHD chains must be correctly coalesced in order to maintain data integrity.


Thanks,
Julian

- Anthony




_______________________________________________
xen-api mailing list
xen-api@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/mailman/listinfo/xen-api



