[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] RFC: Proposal to improve error reporting in libxl



Hi,

Several of us at Citrix are currently working on rewriting xenopsd -
the component of the Xapi toolstack which handles domain creation,
destruction, device attachment and so on - to use libxl instead of libxc.
One of the problems we have run into is that error reporting in libxl
is not detailed and expressive enough to let us take corrective action
or present meaningful high level error messages to users.   In particular:

  1 Many API calls return error codes, but these codes don't tell us
    very much about what went wrong.   In most cases the error code is
    ERROR_FAIL, ERROR_INVAL or ERROR_NOMEM.   The code may also have
    been returned by a utility function and passed back unaltered by
    the API call.

    Examples: libxl_domain_resume, libxl_domain_remus_start (mostly
      ERROR_FAIL).   libxl_domain_rename uses ERROR_NOMEM, ERROR_INVAL
      and ERROR_FAIL.

  2 A few API calls return pointers.   In some cases a null pointer
    indicates that an error occured, but in others null could possibly be
    a normal return value.   If null does indicate an error, it still does
    not give us any information about what happened.

    Examples: libxl_list_domain, libxl_list_cpupool, libxl_list_vm

  3 A few API calls have no return value but may still log errors. 

    Examples: libxl_disable_domain_death,

  4 Most API calls log detailed error messages through XTL, but there
    does not seem to be a way for the caller to make use of them.  Some
    error messages may also have been logged by utility functions, not
    by the API call itself.

We would like to make libxl's error reporting more regular and informative
for callers.    We think we need to:

  * Make a list of the error conditions which can be encountered by
    all the current API calls and define a set of error codes to
    cover them.   Some codes will be generally applicable, but give
    more information than a bare 'fail'; others may be specific to a
    particular API call.   We may keep the existing FAIL, INVAL and 
    NOMEM codes for times when they are appropriate.

    The error messages logged by each API call are a good starting point
    for this list.   I have included a partial list below.

  * Change the API calls in point 1 above to use the new, more detailed
    error codes.   These changes will break client code which checks for
    particular error codes, rather than just checking whether the return
    code is non-zero.

  * For the API calls in points 2 and 3 which can encounter errors
    which callers might care about, change their interfaces to return
    error codes.   For functions which previously returned pointers,
    add pointer-to-pointer parameters.

  * For error codes returned by utility functions, described in point 4, 
    decide whether the code can be reported directly or an API-level
    error should be reported instead.

These changes will certainly require a new version of the libxl API.   We
have considered other options which would be less likely to break existing
code - in particular the option of keeping the existing return codes but
adding an errno-like field which could contain a more specific error number.
This approach would not break existing client code and would allow API calls
which return void or pointers to report errors.   However there is no obvious
place to put the errno field - the ctx struct is not suitable because it might
be used by several callers simultaneously.    The API-breaking change outlined
above seems to be the cleanest option, although it requires work on both libxl
and its clients.

All comments on this proposal are very welcome.   Examples of API calls
which would not fit well into the new design would be particularly useful.
We are also keen to hear whether or not these changes would be useful
for libvirt and other toolstacks using libxl, and whether those toolstacks
have encountered any other error reporting problems which we have missed.

Thanks,
Euan

----

Notes on some API functions defined in libxl.c.

'ERROR_FOO, <string>' means that when ERROR_FOO is returned, <string>
is also logged.   In most non-trivial functions, error reporting is
handled as follows:
   * store the error code in 'rc' or a similar local variable 
   * log the error string
   * goto the 'out' or 'out_err' label at the end of the function
   

libxl_ctx_alloc:
  success: 0
  failure:
    ERROR_VERSION
    ERROR_FAIL, Failed to initialize mutex
    ERROR_FAIL, cannot open libxc handle
    ERROR_FAIL, cannot connect to xenstore
    ERROR_FAIL, cannot open libxc evtchn handle (from libxl__ctx_evtchn_init) 

libxl_ctx_free:
  success: 0
  no errors
  
libxl_string_list_dispose:
  no return value
  no interesting failure modes apart from segfault

libxl_string_list_copy:
libxl_key_value_list_copy
  no return value
  if calloc fails, logs "libxl: FATAL ERROR: memory allocation failure" and 
exits
  if strdup fails, logs "libxl: FATAL ERROR: memory allocation failure" and 
exits

libxl_domain_rename:
  success: 0 
  failure:
    ERROR_NOMEM
    ERROR_INVAL | ERROR_NOMEM, unexpected error checking for existing domain
    ERROR_INVAL, domain with name \"%s\" already exists.
       Looks like ERROR_NOMEM from libxl_name_to_domid is not handled - falls 
through to 'domain already exists' 
  
    ERROR_FAIL, errno + check old name for domain %d allegedly named %s
    ERROR_FAIL, failed to write new name `%s' for domain %"PRIu32" previously 
named `%s'
    ERROR_FAIL, unable to rename stub-domain
    ERROR_FAIL, failed to commit new name %s for domain %d previously named %s
  

libxl_domain_resume:
  success: 0
  failure:
    ERROR_FAIL, xc_domain_resume failed for domain %u
    ERROR_FAIL (no log message if libxl__domain_type returns 
LIBXL_DOMAIN_TYPE_INVALID)
    ERROR_FAIL | ERROR_INVAL, failed to resume device model for domain %u:%d
    ERROR_FAIL, xs_resume_domain failed for domain %u
  
libxl_domain_preserve:
  success: 0
  failure:
    ERROR_NOMEM
    ERROR_FAIL, failed to get dompath for %d (from libxl__xs_get_dompath)
    propagates all failure codes from libxl_domain_rename

libxl_list_domain:
  success: pointer to domain info list
  failure:
    NULL, error allocating domain info
    NULL, error getting_domain_info_list
    return codes do not distinguish between a failure and an empty domain info 
list, 
       but an empty list should not happen as dom0 is always present
    in all failure cases, nb_domain_out is uninitialized and may contain garbage

libxl_domain_info:
  success: 0
  failure:
    ERROR_FAIL, geting domain info list
    ERROR_INVAL, (no error message if domid does not match)

libxl_cpupool_info:
  success: 0
  failure:
    ERROR_FAIL, failed to get info for cpupool
    ERROR_FAIL, got info for cpupool%d, wanted cpupool%d
    ERROR_FAIL, (no error message if libxl_cpupoolid_to_name fails)
    ERROR_FAIL, invalid number of cpus provided (from libxl_cpu_bitmap_alloc)
    ERROR_FAIL, failed to retrieve the maximum number of cpus (from 
libxl_cpu_bitmap_alloc)
  
libxl_list_cpupool:
  success: pointer to list of cpupoolinfo structs, *nb_pool_out = length
  failure:
    NULL, nb_pool_out = 0
    NULL, allocating cpupool info, *nb_pool_out unchanged

  can't distinguish between no cpupools and failure (if *nb_pool_out is 0 
initially)


libxl_list_vm:
  success: 
    pointer to array of libxl_list_vm structs 
  failure:
    NULL, getting domain info list

  libxl_vminfo should never be empty. If no domains exist a 1 element array is
  allocated anyway to satisfy libxl__calloc.

libxl_domain_remus_start:
  success: 0
  failure: 
    ERROR_FAIL, (invalid domain type, no message, should this be ERROR_INVAL?)
    ERROR_FAIL, Unsafe mode must be enabled to replicate to /dev/null..."
    ERROR_FAIL, Remus: No support for network buffering

libxl_domain_suspend:
  success: 0
  failure:
    ERROR_FAIL , (invalid domain type, no message)
  
libxl_domain_pause:
  success: 0
  failure:
    ERROR_FAIL, pausing domain %d

libxl_domain_core_dump:
  success: 0
  failure:
    ERROR_FAIL, core dumping domain %d to %s

libxl_domain_unpause:
  success: 0
  failure:
    ERROR_FAIL, (invalid domain type, no message, should this be ERROR_INVAL?)
    ERROR_FAIL, unpausing domain %d

libxl_domain_shutdown:
libxl_domain_reboot:
  success: 0
  failure:
     ERROR_FAIL, (invalid domain type, no message)
     -1, getting HVM callback IRQ (from xc call in 
libxl__domain_pvcontrol_available)
     ERROR_NOPARAVIRT
     ERROR_FAIL, failed to get dompath for %d (logged by libxl__xs_get_dompath)
     -1 (vasprintf fails in libxl__xs_write)
  

libxl_evenable_domain_death:
  success: 0
  failure: 
    ERROR_FAIL, create watch path for %s (from libxl__ev_xswatch_register)

libxl_evdisable_domain_death:
  no return value
  if unwatch fails in libxl__ev_xswatch_deregister, remove watch for path %s is 
logged

libxl_evenable_disk_eject:
  success: 0
  failure:
    ERROR_FAIL, create watch path for %s (from libxl__ev_xswatch_register)
    ERROR_NOMEM

libxl_evdisable_disk_eject:
  no return value
  if unwatch fails in libxl__ev_xswatch_deregister, remove watch for path %s is 
logged


Utility functions:

libxl_key_value_list_length:
libxl_key_value_list_dispose:
  no interesting failure modes beyond segfault

libxl_fd_set_cloexec:
libxl_fd_set_nonblock:
  success: 0
  failure: ERROR_FAIL, log errno set by fcntl

libxl_hwcap_copy:
libxl_mac_copy:
  no failure modes beyond segfault

libxl_retrieve_domain_configuration:
  success: 0
  failure: 
    ERROR_LOCK_FAIL 
    ERROR_FAIL, fail to get domain configuration for domain
    ERROR_FAIL, fail to get domain name for domain
    ERROR_FAIL, fail to get domain info for domain
    ERROR_FAIL, fail to get memory target for domain
    

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.