[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [EXTERNAL][PATCH v6 2/2] docs/designs: Add a design document for migration of xenstore data



> -----Original Message-----
> From: Julien Grall <julien@xxxxxxx>
> Sent: 06 March 2020 12:03
> To: pdurrant@xxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Durrant, Paul <pdurrant@xxxxxxxxxxxx>; Andrew Cooper 
> <andrew.cooper3@xxxxxxxxxx>; George Dunlap
> <George.Dunlap@xxxxxxxxxxxxx>; Ian Jackson <ian.jackson@xxxxxxxxxxxxx>; Jan 
> Beulich
> <jbeulich@xxxxxxxx>; Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Stefano 
> Stabellini
> <sstabellini@xxxxxxxxxx>; Wei Liu <wl@xxxxxxx>
> Subject: RE: [EXTERNAL][PATCH v6 2/2] docs/designs: Add a design document for 
> migration of xenstore
> data
> 
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Hi Paul,
> 
> On 05/03/2020 17:30, pdurrant@xxxxxxxx wrote:
> > From: Paul Durrant <pdurrant@xxxxxxxxxx>
> >
> > This patch details proposes extra migration data and xenstore protocol
> > extensions to support non-cooperative live migration of guests.
> >
> > NOTE: doc/misc/xenstore.txt is also amened to replace the <mfn> term
> >        for the INTRODUCE operation with the <gfn>, since this is what
> >        it actually is.
> >
> > Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> > Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> > Cc: Jan Beulich <jbeulich@xxxxxxxx>
> > Cc: Julien Grall <julien@xxxxxxx>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > Cc: Wei Liu <wl@xxxxxxx>
> >
> > v6:
> >   - Addressed comments from Julien
> >
> > v5:
> >   - Add QUIESCE
> >   - Make semantics of <index> in GET_DOMAIN_WATCHES more clear
> >
> > v4:
> >   - Drop the restrictions on special paths
> >
> > v3:
> >   - New in v3
> > ---
> >   docs/designs/xenstore-migration.md | 171 +++++++++++++++++++++++++++++
> >   docs/misc/xenstore.txt             |   6 +-
> >   2 files changed, 174 insertions(+), 3 deletions(-)
> >   create mode 100644 docs/designs/xenstore-migration.md
> >
> > diff --git a/docs/designs/xenstore-migration.md 
> > b/docs/designs/xenstore-migration.md
> > new file mode 100644
> > index 0000000000..7e61f462f0
> > --- /dev/null
> > +++ b/docs/designs/xenstore-migration.md
> > @@ -0,0 +1,171 @@
> > +# Xenstore Migration
> > +
> > +## Background
> > +
> > +The design for *Non-Cooperative Migration of Guests*[1] explains that extra
> > +save records are required in the migrations stream to allow a guest running
> > +PV drivers to be migrated without its co-operation. Moreover the save
> > +records must include details of registered xenstore watches as well as
> > +content; information that cannot currently be recovered from `xenstored`,
> > +and hence some extension to the xenstore protocol[2] will also be required.
> > +
> > +The *libxenlight Domain Image Format* specification[3] already defines a
> > +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
> > +transferring xenstore data pertaining to the domain directly as it is
> > +specified such that keys are relative to the path
> > +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
> > +define at least one new save record type.
> > +
> > +## Proposal
> > +
> > +### New Save Record
> > +
> > +A new mandatory record type should be defined within the libxenlight Domain
> > +Image Format:
> > +
> > +`0x00000007: DOMAIN_XENSTORE_DATA`
> > +
> > +The format of each of these new records should be as follows:
> > +
> > +
> > +```
> > +0     1     2     3     4     5     6     7 octet
> > ++------------------------+------------------------+
> > +| type                   | record specific data   |
> > ++------------------------+                        |
> > +...
> > ++-------------------------------------------------+
> > +```
> > +
> > +NB: The record data does not contain a length because the libxenlight 
> > record
> > +header specifies the length.
> > +
> > +
> > +| Field  | Description                                      |
> > +|--------|--------------------------------------------------|
> > +| `type` | 0x00000000: invalid                              |
> > +|        | 0x00000001: node data                            |
> > +|        | 0x00000002: watch data                           |
> > +|        | 0x00000003: transaction data                     |
> > +|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
> > +
> > +
> > +where data is always in the form of a tuple as follows
> 
> NIT: missing full stop.
> 

Ok.

> > +
> > +
> > +**node data**
> > +
> > +
> > +`<path>|<perm-count>|<perm-as-string>|+<value|>`
> > +
> In the xenstore spec, | means a 'nul (zero) byte'. Are you using the
> same terminology here? If so, it may be worth mentioning it.
> 

Ok, I'll cut'n'paste the format definitions in for ease of reference I think.

> > +
> > +`<path>` and `<value|>` should be suitable to formulate a `WRITE` operation
> > +to the receiving xenstored and the `<perm-as-string>|+` list should be
> > +similarly suitable to formulate a subsequent `SET_PERMS` operation.
> > +`<perm-count>` specifies the number of entries in the `<perm-as-string>|+`
> > +list and `<value|>` must be placed at the end because it may contain NUL
> > +octets.
> 
> What is the size of <perm-count>? Also, we may want perm-count to be
> aligned to its size so we don't have to worry of unaligned access.
> 
> How about moving <perm-count> at the beginning of the data blob?
>

Not sure we really need to care about alignment... I'll leave as-is for the 
moment.
 
> > +
> > +
> > +**watch data**
> > +
> > +
> > +`<path>|<token>|`
> > +
> > +`<path>` again is absolute and, together with `<token>`, should
> > +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
> > +
> > +
> > +**transaction data**
> > +
> > +
> > +`<transid-count>|<transid>|+`
> > +
> > +Each `<transid>` should be a uint32_t value represented as unsigned decimal
> > +suitable for passing as a *tx_id* to the re-defined `TRANSACTION_START`
> > +operation (see below). `<transid-count>` is the number of entries in the
> > +`<transid>|+` list.
> 
> What is the size of <transid-count>?. Also the | suggests we would have
> a NUL byte between the two numbers. This would mean the <transid> will
> not be aligned to its size.
> 

I was considering transid as text encoded (to match the TRANSACTION_START 
operation) but I agree the mix is odd. Perhaps I'll just use structures for the 
migration records instead.

> > +
> > +
> > +### Protocol Extension
> > +
> > +Before xenstore state is migrated it is necessary to wait for any pending
> > +reads, writes, watch registrations etc. to complete, and also to make sure
> > +that xenstored does not start processing any new requests (so that new
> > +requests remain pending on the shared ring for subsequent processing on the
> > +new host). Hence the following operation is needed:
> > +
> > +```
> > +QUIESCE                 <domid>|
> > +
> > +Complete processing of any request issued by the specified domain, and
> > +do not process any further requests from the shared ring.
> > +``` > +
> > +The `WATCH` operation does not allow specification of a `<domid>`; it is
> > +assumed that the watch pertains to the domain that owns the shared ring
> > +over which the operation is passed. Hence, for the tool-stack to be able
> > +to register a watch on behalf of a domain a new operation is needed:
> > +
> > +```
> > +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> > +
> > +Adds watches on behalf of the specified domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
> > +operation are identical to the domain issuing WATCH <path>|<token>| for
> > +each <watch>.
> > +```
> > +
> > +The watch information for a domain also needs to be extracted from the
> > +sending xenstored so the following operation is also needed:
> > +
> > +```
> > +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
> > +
> > +Gets the list of watches that are currently registered for the domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
> > +will start at <index> items into the the overall list of watches and may
> > +be truncated (at a <watch> boundary) such that the returned data fits
> > +within XENSTORE_PAYLOAD_MAX.
> > +
> > +If <index> is beyond the end of the overall list then the returned sub-
> > +list will be empty. If the value of <gencnt> changes then it indicates
> > +that the overall watch list has changed and thus it may be necessary
> > +to re-issue the operation for previous values of <index>.
> > +```
> > +
> > +To deal with transactions that were pending when the domain is migrated
> > +it is necessary to start transactions with the same `<trans-id>` in the
> > +receiving xenstored but for them to result in an `EAGAIN` when the
> > +`TRANSACTION_END` operation is peformed. Thus the `TRANSACTION_START`
> > +operation needs to be re-defined as follows:
> > +
> > +```
> > +TRANSACTION_START    |                       <transid>|
> > +     <transid> is an opaque uint32_t represented as unsigned decimal.
> > +    If tx_id is 0 for this operation then a new transaction will be started
> > +    with a tx_id allocated by xenstored. If a non-0 tx_id is specified then
> > +    a transaction with that tx_id will be started and automatically marked
> > +    `conflicting'. The tx_id will always be passed back in <transid>.
> > +    After this, the tx_id may be used in the request header field for
> > +    other operations.
> > +    When a transaction is started whole db is copied; reads and writes
> > +    happen on the copy.
> > +```
> 
> The transaction ID are per connection. As the toolstack will issue the
> command on restore, you would reserve the ID for the wrong domain. So I
> think you want to introduce a new command maybe
> RESERVE_DOMAIN_TRANSACTIONS?

Yes, of course you are correct. So much for trying to re-use things :-( I'll 
spec a new command.

  Paul

> 
> Cheers,
> 
> --
> Julien Grall


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.