[Xen-devel] Ballooning dom0: insufficient memory (libxl) or CPU soft lockups (libvirt)
Hi,

I've recently been testing large memory (64GB - 1TB) domains, and encountering CPU soft lockups while dom0 is ballooning down to free memory for the domain. The root of the issue also exposes a difference between libxl and libvirt.

When creating a domain using xl, if ballooning is enabled (and required), there is a 33-second window for the memory request to be satisfied. If it is not satisfied in that time, ERROR_NOMEM is returned and the domain create fails. (See tools/libxl/xl_cmdimpl.c:freemem)

The libvirt code for the same operation (src/libxl/libxl_domain.c:libxlDomainFreeMem) is nearly identical, except that the function returns the value of 'ret'. The intent seems to be the same as in libxl, but ret is set to 0 by libxl_wait_for_memory_target if memory ballooning is still ongoing at the end of the 33-second loop. The end result is that when using libvirt, the process believes the free memory call succeeded and continues to create the domain even though dom0 has not finished ballooning.

In either case, dom0 continues to balloon in the background. In the case of libxl, a second attempt to create the domain will succeed once this ballooning has finished. With libvirt, the original create request encounters contention between dom0 ballooning down and that same memory being allocated to the starting domain. This contention can cause CPU soft lockups and a major performance degradation.

This issue is more easily seen when using large domains (64-128GB+) and slower memory models (such as large NUMA configurations).

It is trivial to correct the bug in libvirt and cause it to return ERROR_NOMEM if ballooning is not finished by the end of the libxlDomainFreeMem loop. (I've tested this, and it does cause libvirt to behave like libxl.) However, it seems that a more correct fix would be to continue to wait for free memory as long as the ballooning process is making progress. In some tests I've performed, ballooning down 100GB has taken as long as 2.5 minutes.
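To make the difference between the two waiting policies concrete, here is a minimal sketch in C. It does not call libxl at all: it models dom0 as a simulated host that releases a fixed amount of memory per polling interval. Every name in it (sim_host, sim_poll, wait_fixed, wait_while_progressing, SKETCH_NOMEM) is hypothetical and chosen for illustration only, and the error value is a stand-in, not libxl's actual ERROR_NOMEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an "insufficient memory" error code. */
#define SKETCH_NOMEM (-1)

/* Hypothetical model of a host whose dom0 is ballooning down. */
struct sim_host {
    uint64_t free_kb;       /* memory currently free for a new domain */
    uint64_t kb_per_poll;   /* memory dom0 releases per polling interval */
    uint64_t releasable_kb; /* memory dom0 is still able to give up */
};

/* Advance the simulation by one polling interval; return free memory. */
static uint64_t sim_poll(struct sim_host *h)
{
    assert(h != NULL);
    uint64_t step = h->kb_per_poll < h->releasable_kb
                        ? h->kb_per_poll : h->releasable_kb;
    h->free_kb += step;
    h->releasable_kb -= step;
    return h->free_kb;
}

/*
 * Fixed time budget: give up after max_polls intervals, even if dom0 is
 * still releasing memory. Returning SKETCH_NOMEM at the end corresponds to
 * the xl behaviour described above; returning 0 here instead would
 * correspond to the libvirt behaviour, where the caller proceeds while
 * ballooning is still in progress.
 */
static int wait_fixed(struct sim_host *h, uint64_t need_kb, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (sim_poll(h) >= need_kb)
            return 0;
    }
    return SKETCH_NOMEM;
}

/*
 * Progress-aware wait: keep waiting while free memory is still growing,
 * and fail only after max_stalled_polls consecutive intervals in which
 * free memory did not increase.
 */
static int wait_while_progressing(struct sim_host *h, uint64_t need_kb,
                                  int max_stalled_polls)
{
    assert(h != NULL);
    uint64_t last_kb = h->free_kb;
    int stalled = 0;

    while (stalled < max_stalled_polls) {
        uint64_t now_kb = sim_poll(h);
        if (now_kb >= need_kb)
            return 0;
        stalled = (now_kb > last_kb) ? 0 : stalled + 1;
        last_kb = now_kb;
    }
    return SKETCH_NOMEM;
}
```

With one poll per second and a simulated release rate of 700000 KB per poll, a 100GB request (104857600 KB) needs 150 polls, roughly the 2.5 minutes mentioned above. wait_fixed with a budget of 33 polls fails on such a host, while wait_while_progressing succeeds, and it still fails promptly if dom0 runs out of memory it can release.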
If users are attempting to create very large domains, the 33 second delay to balloon the memory seems rather low. I realize that best practices include using a set dom0 size with ballooning disabled, but I'd rather not see insufficient memory errors or produce CPU soft lockups if users choose not to follow this advice.

To summarize:

- If using xl, dom0 ballooning has to complete in 33 seconds, or ERROR_NOMEM will be encountered.
- If using virsh, the domain can be created while dom0 is still ballooning down. This results in CPU soft lockups/performance degradation across the entire host. (When creating a very large domain, the soft lockups can be severe enough to kill the machine.)

Any thoughts on handling this?

Thanks,
Mike

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel