Re: [Xen-devel] Fatal crash on xen4.2 HVM + qemu-xen dm + NFS
If QEMU is completing writes before they've actually been done, haven't we got a wider set of problems to worry about?

Reading the thread you linked in a previous email, it seems that it can actually happen that a userspace application is told that the write is completed before all the outstanding network requests are dealt with.

What is 'userspace application' in this context? QEMU running in dom0? That would seem to me to be a kernel bug unless the page is marked CoW, wouldn't it? Else write() then alter the page might write the altered data. But perhaps I've misunderstood (see below)

Could the problem be "cache=writeback" on the QEMU command line (evident from a 'ps'). If caching is writeback perhaps QEMU needs to copy the data. Is there some setting to turn this off in xl for test purposes?

The command line cache options are ignored by xen_disk, so, assuming that the guest is using the PV disk interface, that can't be the issue.

This appears not to be the case (at least in our environment). We use PV on HVM and:

  disk = [ 'tap:qcow2:/my/nfs/directory/testdisk.qcow2,xvda,w' ]

(remainder of config file in the original message)

We tried modifying the cache= setting using the patch below (yes, the mail client will probably have eaten it, but in essence change the word 'writeback' to 'none'), and that stops it booting VMs at all with

  hd0 write error
  error: couldn't read file

so it would appear not to be entirely correct that the cache= settings are being ignored. I've not had time to find out why (possibly it's trying and failing to use O_DIRECT on NFS) but I'll try writethrough.

One thing the guest is doing is writing to the partition table (UEC cloud images do this on boot). This isn't special cased in any way is it?

> Isn't there a way to prevent tcp_retransmit from running when the
> request is already completed? Or stop it if you find out that the pages
> are already gone?

But what would you do?
If you don't run the tcp_retransmit the write would be lost (to say nothing of the NFS connection to the server).

Well, that is not true: if the write was really lost, the kernel wouldn't have completed the AIO write and notified QEMU.

Isn't that exactly what you said did happen? The kernel completed the AIO write and notified QEMU prior to the write actually completing as the data to write is still sitting in some as-yet-unacked TCP buffer. The kernel then doesn't get the ACK in respect of that sequence number and decides to resend the entire TCP segment. That then blows up because the TCP segment it points to contains data pointing to a hole in memory.

Perhaps I'm misunderstanding the problem.

If TCP does not retransmit, that segment will never get ACKed, and the TCP stream will lock up (this assumes that the cause of the original need to retransmit was packet loss - if it's simply buffering at a busy filer, then I agree).

> You could try persistent grants, that wouldn't solve the bug but they
> should be able to "hide" it pretty well. Not ideal, I know.
> The QEMU side commit is 9e496d7458bb01b717afe22db10a724db57d53fd.
> Konrad issued a pull request recently with the corresponding Linux
> blkfront changes:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> stable/for-jens-3.8

That's presumably the first 8 commits at:

http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/stable/for-jens-3.8

So I'd need a new dom0 kernel and to backport the QEMU patch.

Yep.

What puzzles me about this is (a) why we never see the same problems on KVM, and (b) why this doesn't affect NFS clients even when no virtualisation is involved.
-- 
Alex Bligh

dcrisan@dcrisan-lnx:/home/dcrisan/code/git/xen-4.2-live-migrate$ git diff
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 7662b3d..7b74e24 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -549,10 +549,10 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
             if (disks[i].is_cdrom) {
                 if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY)
                     drive = libxl__sprintf
-                        (gc, "if=ide,index=%d,media=cdrom,cache=writeback", disk);
+                        (gc, "if=ide,index=%d,media=cdrom,cache=none", disk);
                 else
                     drive = libxl__sprintf
-                        (gc, "file=%s,if=ide,index=%d,media=cdrom,format=%s,cache=writeback",
+                        (gc, "file=%s,if=ide,index=%d,media=cdrom,format=%s,cache=none",
                          disks[i].pdev_path, disk, format);
             } else {
                 if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY) {
@@ -575,11 +575,11 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                  */
                 if (strncmp(disks[i].vdev, "sd", 2) == 0)
                     drive = libxl__sprintf
-                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
+                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=none",
                          disks[i].pdev_path, disk, format);
                 else if (disk < 4)
                     drive = libxl__sprintf
-                        (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
+                        (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=none",
                          disks[i].pdev_path, disk, format);
                 else
                     continue; /* Do not emulate this disk */

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
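The guess in the message that cache=none fails because O_DIRECT does not work on the NFS mount could be checked outside QEMU. QEMU's cache=none mode opens the image with O_DIRECT, so a rough probe is to open the image the same way and try one aligned read. The sketch below is only an illustration under stated assumptions: probe_o_direct is a hypothetical helper (not part of Xen, libxl or QEMU), the 4096-byte alignment is assumed to be sufficient for the filesystem in question, and a successful read does not prove that O_DIRECT writes behave correctly. It opens the file read-only so that it cannot damage a disk image.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Xen or QEMU.
 * Returns 0 if the file can be opened with O_DIRECT and one aligned
 * read at offset 0 succeeds; otherwise returns the error number of
 * the step that failed (for example EINVAL if the filesystem rejects
 * O_DIRECT, or ENOENT if the path does not exist).
 */
int probe_o_direct(const char *path)
{
    enum { BLOCK = 4096 };   /* assumed buffer/offset alignment */
    void *buf = NULL;
    int fd, err;

    assert(path != NULL);

    /* Read-only on purpose: the probe must never modify the image. */
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return errno;

    /* posix_memalign returns the error number instead of setting errno. */
    err = posix_memalign(&buf, BLOCK, BLOCK);
    if (err == 0) {
        /* A short read (file smaller than BLOCK) still counts as success. */
        if (pread(fd, buf, BLOCK, 0) < 0)
            err = errno;
        free(buf);
    }

    close(fd);
    return err;
}
```

Calling probe_o_direct() from a small main() with the path of the image on the NFS mount (the testdisk.qcow2 path from the disk line above, for instance) and printing strerror() of a non-zero result would show whether the open or the read is what fails in dom0.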