[Xen-devel] Xen hangs with NFS root under high loads
Hi All!

We now have a small and growing group of customers running on Xen-hosted machines -- Chris Clarke (in the Cc:) was the first, a few months ago, under Xen 1.0 (would that make him the first commercial Xenoserver customer?). We switched to 1.2 in mid-February. Other than the following, the only recent issues have been working out the bugs and features in my own controller code, which I owe you another copy of.

But we have seen a recurring problem where a few domains hang for no readily apparent reason: they don't respond to 'xc_dom_control.py shutdown', but do respond to 'xc_dom_control.py destroy'. I usually see alternating "NFS server not found" and "NFS server OK" messages on the domain 0 console around the time that a guest on that node hangs. When this happens, it usually seems to be associated with someone running something I/O-intensive like 'rsync' or 'apt-get' in the guest domain.

Right now I'm running all swap partitions in VBDs, and the root partitions are all on a central NFS server, so that:

- I can mirror them and back them up.

- We can migrate guests between nodes by assigning a guest to a different node -- right now that's implemented via shutdown/reboot.

- We can recover from hardware failure in a couple of minutes, just by assigning a guest to a different node.

While researching this problem I noted a message from Ian (18 Mar 2004) saying:

    We've seen some weird hangs under extreme conditions with NFS root,
    but we can reproduce these on stock Linux :-(

Ian, do these symptoms sound like this is what we're hitting? Until I can reliably reproduce the problem myself, I'm going to assume it is.

What are other people doing to meet these requirements of backups, migration, and failover? How is the live migration code coming along? The copy-on-write NFSd, or COW VBDs? Any other backup or mirroring code added to VBDs lately? Any other alternatives (ENBD etc.) that anyone knows from experience to be production-quality?

Here's what I'm going to have to do unless I hear otherwise:

- Try moving the NFS server to the Xen server node itself. This will provide better bandwidth and latency versus the 100Mb switch we're going through now, though I don't know if that will help with the hangs. I will then need to back up each individual node's disks, and each node's disks will need to be mirrored (who else is using md RAID 1 for dom0's root partition? -- see the sketch in the P.S. below). And we won't be able to cleanly migrate guests between nodes. No hardware failover either. Grrr.

- If that doesn't work, then I'll need to migrate each root into a Xen virtual block device on the node (right now only swap is there). Then I won't be able to ensure backups get done myself -- any backups will have to be done from within each guest's O/S. The roots can't be mirrored centrally. And migrating between nodes becomes doubly hard, and can take hours depending on partition size. No hardware failover.

Thoughts/suggestions?

Steve

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx
http://www.stevegt.com -- http://Infrastructures.Org
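P.S. On the md RAID 1 question -- here's roughly the recipe I have in mind for mirroring a dom0 root, in case it helps anyone. This is an untested sketch; the device names (/dev/sda1, /dev/sdb1, /dev/md0) are examples only, and you'd want the system as quiet as possible while copying:

    # build a two-disk RAID 1 mirror from two equal-size partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

    # put a filesystem on the mirror and copy the existing root onto it
    mkfs.ext3 /dev/md0
    mount /dev/md0 /mnt
    rsync -aHx / /mnt/

    # watch the initial resync complete
    cat /proc/mdstat

Booting dom0 from the mirror is the other half of the job -- the kernel needs RAID autodetect (partition type 0xfd on both halves) or an initrd that assembles the array before mounting root.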