[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] RFC: Proposal to improve error reporting in libxl
Hi, Several of us at Citrix are currently working on rewriting xenopsd - the component of the Xapi toolstack which handles domain creation, destruction, device attachment and so on - to use libxl instead of libxc. One of the problems we have run into is that error reporting in libxl is not detailed and expressive enough to let us take corrective action or present meaningful high level error messages to users. In particular: 1 Many API calls return error codes, but these codes don't tell us very much about what went wrong. In most cases the error code is ERROR_FAIL, ERROR_INVAL or ERROR_NOMEM. The code may also have been returned by a utility function and passed back unaltered by the API call. Examples: libxl_domain_resume, libxl_domain_remus_start (mostly ERROR_FAIL). libxl_domain_rename uses ERROR_NOMEM, ERROR_INVAL and ERROR_FAIL. 2 A few API calls return pointers. In some cases a null pointer indicates that an error occured, but in others null could possibly be a normal return value. If null does indicate an error, it still does not give us any information about what happened. Examples: libxl_list_domain, libxl_list_cpupool, libxl_list_vm 3 A few API calls have no return value but may still log errors. Examples: libxl_disable_domain_death, 4 Most API calls log detailed error messages through XTL, but there does not seem to be a way for the caller to make use of them. Some error messages may also have been logged by utility functions, not by the API call itself. We would like to make libxl's error reporting more regular and informative for callers. We think we need to: * Make a list of the error conditions which can be encountered by all the current API calls and define a set of error codes to cover them. Some codes will be generally applicable, but give more information than a bare 'fail'; others may be specific to a particular API call. We may keep the existing FAIL, INVAL and NOMEM codes for times when they are appropriate. The error messages logged by each API call are a good starting point for this list. I have included a partial list below. * Change the API calls in point 1 above to use the new, more detailed error codes. These changes will break client code which checks for particular error codes, rather than just checking whether the return code is non-zero. * For the API calls in points 2 and 3 which can encounter errors which callers might care about, change their interfaces to return error codes. For functions which previously returned pointers, add pointer-to-pointer parameters. * For error codes returned by utility functions, described in point 4, decide whether the code can be reported directly or an API-level error should be reported instead. These changes will certainly require a new version of the libxl API. We have considered other options which would be less likely to break existing code - in particular the option of keeping the existing return codes but adding an errno-like field which could contain a more specific error number. This approach would not break existing client code and would allow API calls which return void or pointers to report errors. However there is no obvious place to put the errno field - the ctx struct is not suitable because it might be used by several callers simultaneously. The API-breaking change outlined above seems to be the cleanest option, although it requires work on both libxl and its clients. All comments on this proposal are very welcome. Examples of API calls which would not fit well into the new design would be particularly useful. We are also keen to hear whether or not these changes would be useful for libvirt and other toolstacks using libxl, and whether those toolstacks have encountered any other error reporting problems which we have missed. Thanks, Euan ---- Notes on some API functions defined in libxl.c. 'ERROR_FOO, <string>' means that when ERROR_FOO is returned, <string> is also logged. In most non-trivial functions, error reporting is handled as follows: * store the error code in 'rc' or a similar local variable * log the error string * goto the 'out' or 'out_err' label at the end of the function libxl_ctx_alloc: success: 0 failure: ERROR_VERSION ERROR_FAIL, Failed to initialize mutex ERROR_FAIL, cannot open libxc handle ERROR_FAIL, cannot connect to xenstore ERROR_FAIL, cannot open libxc evtchn handle (from libxl__ctx_evtchn_init) libxl_ctx_free: success: 0 no errors libxl_string_list_dispose: no return value no interesting failure modes apart from segfault libxl_string_list_copy: libxl_key_value_list_copy no return value if calloc fails, logs "libxl: FATAL ERROR: memory allocation failure" and exits if strdup fails, logs "libxl: FATAL ERROR: memory allocation failure" and exits libxl_domain_rename: success: 0 failure: ERROR_NOMEM ERROR_INVAL | ERROR_NOMEM, unexpected error checking for existing domain ERROR_INVAL, domain with name \"%s\" already exists. Looks like ERROR_NOMEM from libxl_name_to_domid is not handled - falls through to 'domain already exists' ERROR_FAIL, errno + check old name for domain %d allegedly named %s ERROR_FAIL, failed to write new name `%s' for domain %"PRIu32" previously named `%s' ERROR_FAIL, unable to rename stub-domain ERROR_FAIL, failed to commit new name %s for domain %d previously named %s libxl_domain_resume: success: 0 failure: ERROR_FAIL, xc_domain_resume failed for domain %u ERROR_FAIL (no log message if libxl__domain_type returns LIBXL_DOMAIN_TYPE_INVALID) ERROR_FAIL | ERROR_INVAL, failed to resume device model for domain %u:%d ERROR_FAIL, xs_resume_domain failed for domain %u libxl_domain_preserve: success: 0 failure: ERROR_NOMEM ERROR_FAIL, failed to get dompath for %d (from libxl__xs_get_dompath) propagates all failure codes from libxl_domain_rename libxl_list_domain: success: pointer to domain info list failure: NULL, error allocating domain info NULL, error getting_domain_info_list return codes do not distinguish between a failure and an empty domain info list, but an empty list should not happen as dom0 is always present in all failure cases, nb_domain_out is uninitialized and may contain garbage libxl_domain_info: success: 0 failure: ERROR_FAIL, geting domain info list ERROR_INVAL, (no error message if domid does not match) libxl_cpupool_info: success: 0 failure: ERROR_FAIL, failed to get info for cpupool ERROR_FAIL, got info for cpupool%d, wanted cpupool%d ERROR_FAIL, (no error message if libxl_cpupoolid_to_name fails) ERROR_FAIL, invalid number of cpus provided (from libxl_cpu_bitmap_alloc) ERROR_FAIL, failed to retrieve the maximum number of cpus (from libxl_cpu_bitmap_alloc) libxl_list_cpupool: success: pointer to list of cpupoolinfo structs, *nb_pool_out = length failure: NULL, nb_pool_out = 0 NULL, allocating cpupool info, *nb_pool_out unchanged can't distinguish between no cpupools and failure (if *nb_pool_out is 0 initially) libxl_list_vm: success: pointer to array of libxl_list_vm structs failure: NULL, getting domain info list libxl_vminfo should never be empty. If no domains exist a 1 element array is allocated anyway to satisfy libxl__calloc. libxl_domain_remus_start: success: 0 failure: ERROR_FAIL, (invalid domain type, no message, should this be ERROR_INVAL?) ERROR_FAIL, Unsafe mode must be enabled to replicate to /dev/null..." ERROR_FAIL, Remus: No support for network buffering libxl_domain_suspend: success: 0 failure: ERROR_FAIL , (invalid domain type, no message) libxl_domain_pause: success: 0 failure: ERROR_FAIL, pausing domain %d libxl_domain_core_dump: success: 0 failure: ERROR_FAIL, core dumping domain %d to %s libxl_domain_unpause: success: 0 failure: ERROR_FAIL, (invalid domain type, no message, should this be ERROR_INVAL?) ERROR_FAIL, unpausing domain %d libxl_domain_shutdown: libxl_domain_reboot: success: 0 failure: ERROR_FAIL, (invalid domain type, no message) -1, getting HVM callback IRQ (from xc call in libxl__domain_pvcontrol_available) ERROR_NOPARAVIRT ERROR_FAIL, failed to get dompath for %d (logged by libxl__xs_get_dompath) -1 (vasprintf fails in libxl__xs_write) libxl_evenable_domain_death: success: 0 failure: ERROR_FAIL, create watch path for %s (from libxl__ev_xswatch_register) libxl_evdisable_domain_death: no return value if unwatch fails in libxl__ev_xswatch_deregister, remove watch for path %s is logged libxl_evenable_disk_eject: success: 0 failure: ERROR_FAIL, create watch path for %s (from libxl__ev_xswatch_register) ERROR_NOMEM libxl_evdisable_disk_eject: no return value if unwatch fails in libxl__ev_xswatch_deregister, remove watch for path %s is logged Utility functions: libxl_key_value_list_length: libxl_key_value_list_dispose: no interesting failure modes beyond segfault libxl_fd_set_cloexec: libxl_fd_set_nonblock: success: 0 failure: ERROR_FAIL, log errno set by fcntl libxl_hwcap_copy: libxl_mac_copy: no failure modes beyond segfault libxl_retrieve_domain_configuration: success: 0 failure: ERROR_LOCK_FAIL ERROR_FAIL, fail to get domain configuration for domain ERROR_FAIL, fail to get domain name for domain ERROR_FAIL, fail to get domain info for domain ERROR_FAIL, fail to get memory target for domain _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |