[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v11 20/27] Support colo mode for qemu disk



On 03/05/2016 01:44 AM, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>> From: Wen Congyang <wency@xxxxxxxxxxxxxx>
>>
>> Usage: disk = 
>> ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> For QEMU block replication details:
>> http://wiki.qemu.org/Features/BlockReplication
> 
> So now I am slightly confused by the design, I think.
> 
> When you replicate a VM with COLO using xl, its memory state is
> transferred over ssh.  But its disk replication is done unencrypted
> and unauthenticated ?

Yes, it is a problem. I will think how to improve it.

> 
> And the disk replication is, out of band, and needs to be configured
> separately ?  This is rather awkward, although maybe not a
> showstopper.  (Maybe we can have a plan to fix it in the future...)

colo-host,colo-port should be the global configuration. And colo-export,
active-disk,hidden-disk must be configured separately, because each
disk should have a different configuration.

> 
> And, how does the disk replication, which doesn't depend on there
> being xl running, relate to the vm state replication, which does ?  I
> think at the very least I'd like to see some information about the
> principles of operation - either explained, or referred to, in the
> user manual.

OK. The disk replication doesn't depend on xl. We only can operate it
via qemu monitor command:
1. stop the vm
2. do the checkpoint
3. start the vm
1/3 is suspend/resume the guest. We only need to do 2 when both vm are
in the consistent state.

> 
> Is it possible to use COLO with an existing full-service disk
> replication service such as DRBD ?

DRBD doesn's support the case like COLO. Because both primary guest
and secondary guest need to write to the disk.

> 
>> +(a) An example for COLO replication's configuration: disk 
>> =['...,colo,colo-host
>> +=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> +
>> +=item B<colo-host>      :Secondary host's ip address.
>> +
>> +=item B<colo-port>      :Secondary host's port, we will run a nbd server on
>> +secondary host, and the nbd server will listen this port.
>> +
>> +=item B<colo-export>    :Nbd server's disk export name of secondary host.
>> +
>> +=item B<active-disk>    :Secondary's guest write will be buffered in this 
>> disk,
>> +and it's used by secondary.
>> +
>> +=item B<hidden-disk>    :Primary's modified contents will be buffered in 
>> this
>> +disk, and it's used by secondary.
> 
> What would a typical configuration look like ?  I don't understand the
> relationship between active-disk and hidden-disk, etc.

QEMU has a feature: backing file
For example: A's backing file is B
1. If we read from A, but the sector is not allocated in A. We wil return a zero
   sector to the guest. If A has a backing file, we will read the sector from B
   instead of returning a zero sector.
2. The backing file doesn't affect the write operation.

QEMU has another feature: backup block job
Backup job has two file: one is source and another is the target. It has some 
running
mode. For block replication, we use the mode "sync=none". In this mode, we will 
read
the data from the source disk before we modify it, and write it to the target 
disk.
We keep a bitmap to remeber which sector is backuped from the source disk to the
target disk. If the target disk is an empty disk, and empty disk's backing file 
is
the source disk, we can read from the target disk to get the source disk's 
originnal data.


How does block replication work:
A. primary qemu:
1. use the block driver quorum: it will read from all children and write to all 
children.
   child 0: real disk
   child 1: nbd client
   reading from child 1 will fail, but we use the fifo mode. In this mode, we 
read from
   child 0 will success and we don't read from child 0
   write to child 1: because child 1 is nbd client, it will forward the write 
request to
   nbd server

B. secondary qemu:
We have 3 disks: active disk(called it A), hidden disk(called it H), and 
secondary disk
(real disk, called it S).
A's backing file is H, and H's backing file is S.
We also start a backup job: the source disk is S, and the target disk is H.
we run nbd server in secondary qemu. And the nbd server will write to S.

Before resuming both primary vm and secondary vm: the state is:
1. primary disk and secondary disk are in the consistent state(contain the same 
data)
2. active disk and hidden disk are the empty disk
When the guest is running:
1. NBD server receives the primary write operation and writes the data to S
2. Before we write data to S, the backup job will read the original data and 
backup it
   to H
3. The secondary vm will write data to A.
4. If secondary vm will read data from A:
   I. If the sector is allocated in A, read it from A.
  II. Otherwise, the secondary vm doesn't modify this sector after the latest 
is resumed.
 III. In this case, we read it from H. We can read S's original data from H(See 
the explanation
      In backup job).

If we have more than 1 real disk, we can use exportname to tag each disk. Each 
pair of primary disk and
secondary disk should have the same export name.

> 
>> +colo-host
>> +---------
>> +
>> +Description:           Secondary host's address
>> +Mandatory:             Yes when COLO enabled
> 
> Is it permitted to specify a host DNS name ?

IIRC, I think it is OK.

> 
>> +    if (libxl_defbool_val(disk->colo_enable)) {
>> +        tmp = xs_read(ctx->xsh, XBT_NULL,
>> +                      GCSPRINTF("%s/colo-port", be_path), &len);
>> +        if (!tmp) {
>> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
>> +            goto cleanup;
>> +        }
>> +        disk->colo_port = tmp;
> 
> This is quite repetitive code.

Yes. Will introduce a new function to avoid it in the next version.

> 
> 
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 3610a39..bff08b0 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
> ...
>> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>>      return 0;
>> +
>> + bad_colo:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> This is correct here, I think.
> 
>> + bad_colo_host:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> I think these options should be checked later.  disk_try_backend isn't
> the place for general parameter checking; it is searching for which
> backend to try.

Hmm, do you mean we check it when we need to use COLO?

> 
>>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
>> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
>> index 4aca38e..ba17251 100644
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const 
>> char *username)
> ...
>> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char 
>> *pdev_path,
>> +                                         int unit, const char *format,
>> +                                         const libxl_device_disk *disk,
>> +                                         int colo_mode)
>> +{
>> +    char *drive = NULL;
>> +    const char *exportname = disk->colo_export;
>> +    const char *active_disk = disk->active_disk;
>> +    const char *hidden_disk = disk->hidden_disk;
>> +
>> +    switch (colo_mode) {
>> +    case LIBXL__COLO_NONE:
>> +        drive = libxl__sprintf
>> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
>> +             pdev_path, unit, format);
> 
> I think this would be a lot clearer if the refactoring was split into
> a seperate patch.

OK.

> 
>>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
>> -                    drive = libxl__sprintf
>> -                        (gc, 
>> "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
>> -                         pdev_path, disk, format, disks[i].readwrite ? 
>> "off" : "on");
>> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
>> +                        /*
>> +                         * -drive if=none,driver=format,file=pdev_path,\
>> +                         * id=exportname
>> +                         */
> 
> I think this comment adds nothing to the code and could be profitably
> omitted.

OK.

> 
>> +                        drive = libxl__sprintf
>> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
>> +                             format, pdev_path, disks[i].colo_export);
> 
> I don't understand how this works here.  COLO_SECONDARY seems to
> suppress the rest of the disk specification.

COLO_SECONDAY will use two disks: one is for S, and one is for A.
H is sepecified in A. This line is for S, and the codes for A is
in the function qemu_disk_scsi_drive_string().

> 
> Also, this same logic seems to appear many times.  Maybe it could be
> centralised.  Perhaps I would be able to advise more clearly if I
> understood how this was put together.
> 
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 9b0a537..a2078d1 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>>      ("is_cdrom", integer),
>>      ("direct_io_safe", bool),
>>      ("discard_enable", libxl_defbool),
>> +    ("colo_enable", libxl_defbool),
>> +    ("colo_restore_enable", libxl_defbool),
>> +    ("colo_host", string),
>> +    ("colo_port", string),
>> +    ("colo_export", string),
>> +    ("active_disk", string),
>> +    ("hidden_disk", string)
> 
> In general, many these should probably not be strings.  Certainly the
> port should be an integer.  I don't quite understand the semantics of
> the others.

Yes, the port should be an integer. Will fix it in the next version.

Thanks
Wen Congyang

> 
> Ian.
> 
> 
> .
> 




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.