[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] question about migration

Andrew Cooper writes ("Re: [Xen-devel] question about migration"):
> On 25/12/2015 03:06, Wen Congyang wrote:
> > Another problem:
> > If migration fails after the guest is suspended, we will resume it in the 
> > source.
> > In this case, we cannot shutdown it. because no process hanlds the shutdown 
> > event.
> > The log in /var/log/xen/xl-hvm_nopv.log:
> > Waiting for domain hvm_nopv (domid 1) to die [pid 5508]
> > Domain 1 has shut down, reason code 2 0x2
> > Domain has suspended.
> > Done. Exiting now
> >
> > The xl has exited...
> Hmm yes.  This is a libxl bug in libxl_evenable_domain_death(). CC'ing 
> the toolstack maintainers.

AIUI this is a response to Wen's comments above.

> It waits for the @releasedomain watch, but doesn't interpret the results 
> correctly.  In particular, if it can still make successful hypercalls 
> with the provided domid, that domain was not the subject of 
> @releasedomain.  (I also observe that domain_death_xswatch_callback() is 
> very inefficient.  It only needs to make a single hypercall, not query 
> the entire state of all domains.)

I don't understand precisely what you allege this bug to be, but:

* libxl_evenable_domain_death may generate two events, a
  This is documented in libxl.h (although it refers to DESTROY rather
  than DEATH - see patch below to fix the doc).

* @releaseDomain usually triggers twice for each domain: once when it
  goes to SHUTDOWN and once when it is actually destroyed.  (This is
  obviously necessary to implement the above.)

* @releaseDomain does not have a specific domain which is the "subject
  of @releaseDomain".  Arguably this is unhelpful, but it is not
  libxl's fault.  It arises from the VIRQ generated by Xen.  Note that
  xenstored needs to search its own list of active domains to see what
  has happened; it generates the @releaseDomain event and throws away
  the domid.

* domain_death_xswatch_callback handles the reporting for all
  domains that this instance of libxl is interested in.

* That libxl searches its own smaller list of domains it is interested
  in is I think not a huge performance problem in a context where
  xenstored has to search all the domains on the system.  In the worst
  case where each libxl process is interested in one domain it makes
  O(number of domains) hypercalls requesting info for 200 domains
  each.  If this `200' is a problem a better heuristic would be quite

* Wen is using xl.  The domain death events in xl are handled in
  xl_cmdimpl.c in the big while loop near the end of create_domain.
  The message in the log comes from this (in `create_domain'):

            LOG("Domain %d has shut down, reason code %d 0x%x", domid,

  and then this (in `handle_domain_death')

            LOG("Domain has suspended.");
            return 0;

  This is anomalous coding style because handle_domain_death should
  return one of the DOMAIN_RESTART values.  This anomaly was
  introduced in 1a0e1789 "xl: support on_{poweroff,reboot,crash}
  domain configuration options", which introduced ACTION_* but left
  the suspend case to `return 0'.  0 now means in this context
  DOMAIN_RESTART_NONE which means that the daemonic xl should simply

  But this stylistic anomaly is of no functional import: even before
  that (eg in 0e89a4ce~), shutdown with the reason code suspend causes
  xl to exit.

  The daemonic xl exits in this situation because it expects the
  migration machinery to tidy up the domain.

* It is not possible to resume the domain in the source after it has
  suspended.  The domain is, in the hypervisor, in the lifecycle state
  SHUTDOWN.  For this reason in general, it is a bad idea to suspend
  the guest until the migration is known to be likely to succeed.

  After a late migration failure the domain should probably be treated
  as crashed.  This is slightly awkward to implement because the
  daemonic xl would have to hang around to see what happened next and
  perhaps be told what to do.  Or maybe the migrating xl would have to
  know what the on_crash configuration was.

  In the meantime the failure is rather less graceful: debris is
  simply left for the user.  This is not ideal but tolerable I think.


From 00c4fc6495078026d68f3fdd88bfd29a8ad483d7 Mon Sep 17 00:00:00 2001
From: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
Date: Mon, 4 Jan 2016 15:13:14 +0000
Subject: [PATCH] libxl: Fix doc comment ref to DOMAIN_DEATH

The doc comment for libxl_evdisable_domain_death mistakenly referred
to DOMAIN_DESTROY but the event type name is actually DOMAIN_DEATH.

Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
 tools/libxl/libxl_event.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl_event.h b/tools/libxl/libxl_event.h
index fad4c14..1ea789e 100644
--- a/tools/libxl/libxl_event.h
+++ b/tools/libxl/libxl_event.h
@@ -179,9 +179,9 @@ typedef struct libxl__evgen_domain_death 
 int libxl_evenable_domain_death(libxl_ctx *ctx, uint32_t domid,
                          libxl_ev_user, libxl_evgen_domain_death **evgen_out);
 void libxl_evdisable_domain_death(libxl_ctx *ctx, libxl_evgen_domain_death*);
-  /* Arranges for the generation of DOMAIN_SHUTDOWN and DOMAIN_DESTROY
+  /* Arranges for the generation of DOMAIN_SHUTDOWN and DOMAIN_DEATH
    * events.  A domain which is destroyed before it shuts down
-   * may generate only a DESTROY event.
+   * may generate only a DEATH event.
 typedef struct libxl__evgen_disk_eject libxl_evgen_disk_eject;

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.