Xen project Mailing List

On Mon, Feb 10, 2014 at 7:35 PM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:

On 10/02/2014 20:00, Shriram Rajagopalan wrote:

On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@xxxxxxxxxx> wrote:

Here is a draft of a proposal for a new domain save image format. It
does not currently cover all use cases (e.g., images for HVM guest are
not considered).

http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version Date        Changes
------- ----------- ----------------------------------------------
Draft A 6 Feb 2014  Initial draft.

Draft B 10 Feb 2014 Corrected image header field widths.

                    Minor updates and clarifications.
--------------------------------------------------------------------

Purpose
-------

The _domain save image_ is the context of a running domain used for
snapshots of a domain or for transferring domains between hosts during
migration.

There are a number of problems with the format of the domain save
image used in Xen 4.4 and earlier (the _legacy format_).

* Dependent on toolstack word size. A number of fields within the
image are native types such as `unsigned long` which have different
sizes between 32-bit and 64-bit hosts. This prevents domains from
being migrated between 32-bit and 64-bit hosts.

* There is no header identifying the image.

* The image has no version information.

A new format that addresses the above is required.
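
To illustrate the word-size problem, here is a minimal sketch using Python's `struct` module. The helper name `encode_pfn` and the choice of a 64-bit little-endian field are assumptions made for illustration only; they are not taken from the draft.

```python
import struct

# Native "unsigned long": its size follows the platform's C ABI
# (commonly 4 octets on 32-bit builds, 8 octets on 64-bit Linux builds),
# so an image written with it is tied to the writer's word size.
native_ulong = struct.calcsize("@L")

# With an explicit byte-order prefix ("<" or ">"), struct uses standard
# sizes that do not depend on the host.
fixed_u32 = struct.calcsize("<I")  # always 4 octets
fixed_u64 = struct.calcsize("<Q")  # always 8 octets


def encode_pfn(pfn: int) -> bytes:
    """Encode a (hypothetical) page frame number as a fixed 64-bit field."""
    return struct.pack("<Q", pfn)


print(native_ulong, fixed_u32, fixed_u64)
```

A field written with `encode_pfn` occupies 8 octets regardless of whether the toolstack is a 32-bit or 64-bit build.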

ARM does not yet have a domain save image format specified and
the format described in this specification should be suitable.

I suggest keeping the processing overhead in mind when designing the new
image format. Some key things have been addressed, such as making sure data
is always padded to maintain alignment. But there are also some aspects of
this proposal that seem awfully unnecessary. More details below.

Overview
========

The image format consists of two main sections:

* _Headers_
* _Records_

Headers
-------

There are two headers: the _image header_, and the _domain header_.
The image header describes the format of the image (version etc.).
The _domain header_ contains general information about the domain
(architecture, type etc.).

Records
-------

The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context. At a
minimum there is an END record marking the end of the records section.

Fields
------

All the fields within the headers and records have a fixed width.

Fields are always aligned to their size.

Padding and reserved fields are set to zero on save and must be
ignored during restore.

So far so good.

Integer (numeric) fields in the image header are always in big-endian
byte order.

Integer fields in the domain header and in the records are in the
endianness described in the image header (which will typically be the
native ordering).

It's tempting to adopt all the TCP-style madness for transferring a set of
structured data. Why this endian-ness mess? Am I missing something here?
I am assuming that the lion's share of Xen's deployment is on x86
(not including Amazon). So that leaves ARM. Why not let these
processors take the hit of endian-ness conversion?

The large majority is indeed x86, but don't discount ARM because it is currently in the minority. With the current requirements, the vast majority of the data will still be little endian on x86.

Headers
=======

Image Header
------------

The image header identifies an image as a Xen domain save image. It
includes the version of this specification that the image complies
with.

Tools supporting version _V_ of the specification shall always save
images using version _V_. Tools shall support restoring from version
_V_ and version _V_ - 1. Tools may additionally support restoring
from earlier versions.

The marker field can be used to distinguish between legacy images and
those corresponding to this specification. Legacy images will have
one or more zero bits within the first 8 octets of the image.

Fields within the image header are always in _big-endian_ byte order,
regardless of the setting of the endianness bit.

and more endian-ness mess.

Network order is perfectly valid. It is how all your network packets arrive...

True. But why should we explicitly convert the application level data to
network byte order and then convert it back to host byte order, when it's
already going to be done by the underlying stack, as you put it?

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| marker                                          |
+-----------------------+-------------------------+
| id                    | version                 |
+-----------+-----------+-------------------------+
| options   |                                     |
+-----------+-------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
marker      0xFFFFFFFFFFFFFFFF.

id          0x58454E46 ("XENF" in ASCII).

version     0x00000001. The version of this specification.

options     bit 0: Endianness. 0 = little-endian, 1 = big-endian.

            bit 1-15: Reserved.
--------------------------------------------------------------------
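
As a rough illustration of the image header layout, the following Python sketch packs and parses the 24-octet header with the `struct` module. The function names, the returned dictionary, and the decision to accept only version 1 are assumptions of this sketch, not part of the draft.

```python
import struct

# All image header fields are big-endian: marker (8 octets), id (4),
# version (4), options (2), followed by 6 reserved octets.
IMAGE_HEADER = struct.Struct(">QIIH6x")

MARKER = 0xFFFFFFFFFFFFFFFF
XENF_ID = 0x58454E46       # "XENF" in ASCII
VERSION = 0x00000001
OPT_BIG_ENDIAN = 1 << 0    # options bit 0


def pack_image_header(big_endian: bool = False) -> bytes:
    """Build an image header; reserved octets are written as zero."""
    options = OPT_BIG_ENDIAN if big_endian else 0
    return IMAGE_HEADER.pack(MARKER, XENF_ID, VERSION, options)


def parse_image_header(data: bytes) -> dict:
    """Parse an image header (hypothetical helper, version 1 only)."""
    marker, ident, version, options = IMAGE_HEADER.unpack_from(data)
    if marker != MARKER:
        # A zero bit in the first 8 octets indicates a legacy image.
        raise ValueError("legacy or unrecognised image")
    if ident != XENF_ID:
        raise ValueError("bad id field")
    if version != VERSION:
        raise ValueError("unsupported version %d" % version)
    # Byte-order prefix to use for the domain header and records.
    byte_order = ">" if options & OPT_BIG_ENDIAN else "<"
    return {"version": version, "byte_order": byte_order}
```

The `byte_order` value returned here is a `struct` prefix character, so a reader could pass it straight into the format strings used for the domain header and the records.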

Domain Header
-------------

The domain header includes general properties of the domain.

 0     1     2     3     4     5     6     7 octet
+-----------+-----------+-----------+-------------+
| arch      | type      | page_shift| (reserved)  |
+-----------+-----------+-----------+-------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
arch        0x0000: Reserved.

            0x0001: x86.

            0x0002: ARM.

type        0x0000: Reserved.

            0x0001: x86 PV.

            0x0002 - 0xFFFF: Reserved.

page_shift  Size of a guest page as a power of two.

            i.e., page size = 2^page_shift^.
--------------------------------------------------------------------
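
A minimal sketch of the domain header, assuming the byte order has already been read from the image header options and is passed in as a `struct` prefix (`"<"` or `">"`). The helper names and the returned dictionary are illustrative only.

```python
import struct

ARCH_X86 = 0x0001
ARCH_ARM = 0x0002
TYPE_X86_PV = 0x0001


def domain_header_struct(byte_order: str) -> struct.Struct:
    # arch, type, page_shift (2 octets each) plus 2 reserved octets.
    return struct.Struct(byte_order + "HHH2x")


def pack_domain_header(arch: int, dom_type: int, page_shift: int,
                       byte_order: str = "<") -> bytes:
    """Build the 8-octet domain header; reserved octets are zero."""
    return domain_header_struct(byte_order).pack(arch, dom_type, page_shift)


def parse_domain_header(data: bytes, byte_order: str = "<") -> dict:
    """Parse the domain header and derive the guest page size."""
    arch, dom_type, page_shift = domain_header_struct(byte_order).unpack_from(data)
    return {
        "arch": arch,
        "type": dom_type,
        "page_shift": page_shift,
        "page_size": 1 << page_shift,  # page size = 2^page_shift
    }
```

For an x86 PV guest with 4 KiB pages, `pack_domain_header(ARCH_X86, TYPE_X86_PV, 12)` yields 8 octets, and parsing them back gives a `page_size` of 4096.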

Records
=======

A record has a record header, type specific data and a trailing
footer. If body_length is not a multiple of 8, the body is padded
with zeroes to align the checksum field on an 8 octet boundary.

 0     1     2     3     4     5     6     7 octet
+-----------------------+-------------------------+
| type                  | body_length             |
+-----------+-----------+-------------------------+
| options   | (reserved)                          |
+-----------+-------------------------------------+
...
Record body of length body_length octets followed by
0 to 7 octets of padding.
...
+-----------------------+-------------------------+
| checksum              | (reserved)              |
+-----------------------+-------------------------+
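
The record framing and padding rule can be sketched as follows. Several details are assumptions, since this excerpt does not define them: the checksum is taken to be CRC-32 (`zlib.crc32`) over the padded body, options bit 0 is taken to mean "checksum valid", and the record type value used in any example is hypothetical.

```python
import struct
import zlib

OPT_CHECKSUM_VALID = 1 << 0  # assumed meaning of options bit 0


def pad8(length: int) -> int:
    """Number of zero octets needed to reach an 8-octet boundary."""
    return (-length) % 8


def pack_record(rec_type: int, body: bytes, byte_order: str = "<",
                with_checksum: bool = True) -> bytes:
    options = OPT_CHECKSUM_VALID if with_checksum else 0
    # Header: type (4), body_length (4), options (2), 6 reserved octets.
    header = struct.pack(byte_order + "IIH6x", rec_type, len(body), options)
    padded = body + b"\x00" * pad8(len(body))
    # Assumption: CRC-32 over the padded body; zero when not in use.
    checksum = zlib.crc32(padded) if with_checksum else 0
    # Footer: checksum (4) plus 4 reserved octets.
    footer = struct.pack(byte_order + "I4x", checksum)
    return header + padded + footer


def unpack_record(data: bytes, offset: int = 0, byte_order: str = "<"):
    """Return (type, body, offset of the next record)."""
    rec_type, body_length, options = struct.unpack_from(
        byte_order + "IIH6x", data, offset)
    body_start = offset + 16
    padded_end = body_start + body_length + pad8(body_length)
    (checksum,) = struct.unpack_from(byte_order + "I4x", data, padded_end)
    if options & OPT_CHECKSUM_VALID:
        if zlib.crc32(data[body_start:padded_end]) != checksum:
            raise ValueError("record checksum mismatch")
    return rec_type, data[body_start:body_start + body_length], padded_end + 8
```

With this layout a 5-octet body takes 3 octets of padding, so the whole record occupies 16 + 8 + 8 = 32 octets. Passing `with_checksum=False` writes a zero checksum and skips verification on restore, which is one way a sender could avoid the compute cost.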

I am assuming that the checksum field is present only
for debugging purposes? Otherwise, I see no reason for the
computational overhead, given that we are already sending data
over a reliable channel + IIRC we already have an image-wide checksum
when saving the image to disk.

If debugging is the only use case, then I guess the type field
can be prefixed with a 1/0 bit, eliminating the need for the
1-bit checksum options field + 7-byte padding. Similarly, if debugging
mode is not set, why waste another 8 bytes at the end for the checksum
field, unless you think there may be more types in need of special options?

Feel free to correct me if I am missing something elementary here..

What image-wide checksum?

Maybe I got it wrong. I vaguely recall some sort of a CRC checksum being stored along with
the saved memory snapshots. But that could have been someone else's research code. Sorry..

Are you certain that all your data is moving over reliable channels?

Let's see.. Am I certain that all migration is happening over TCP? Yes. Worst case, reliable UDP.
By reliable, I just mean no bit errors or such stuff. I am not talking about security.

Are you certain that your hard drives are bit perfect?

Absolutely not. Which is why I was under the impression that the image-wide checksum
would detect a corrupt image.

Are you certain that your network connection is bit perfect?

Nope. But I am fairly certain that good old TCP and IP checksums + the Ethernet checksum have
been put in place to detect these errors and recover transparently to the application. Are you implying
that there is some remote corner case that allows corrupt data to escape all three of these checks in
the network stack and percolate to the application layer? I don't think so.

If you are implying that the DRAMs cause memory bit errors that flip bits here and there, wreaking havoc,
then probably yes, checksums make sense. However, with ECC memory modules being the norm (please
correct me if I am wrong about this), why start bothering now, if we didn't over the last 3 years? What has changed?

My point here being, checksums seem like unnecessary compute overhead when doing live migration
or Remus. One can simply set this field to 0 when doing live migration/Remus.
And, as you said later in this mail, data transmission overhead is not that much.

However, as far as storing snapshots on disk is concerned, I totally agree that there needs to be some
form of a checksum to ensure that the data has not been corrupted. But why have record-level checksums?
It is not as if we can recover the corrupted records. The majority of the use cases are, IMO, do or die. If the checksum
is correct, then start the restore process. Else abort. So why not have an image-wide checksum?

Given the amount of data sent as part of a migration, 8 bytes per record is not a substantial overhead.

thanks

shriram

Re: [Xen-devel] Domain Save Image Format proposal (draft B)