
Re: [Xen-devel] Fatal crash on xen4.2 HVM + qemu-xen dm + NFS




If QEMU is completing writes before they've actually been done, haven't
we got a wider set of problems to worry about?

Reading the thread you linked in a previous email, it seems that a
userspace application can actually be told that the write has completed
before all the outstanding network requests have been dealt with.

What is the 'userspace application' in this context? QEMU running in dom0?
That would seem to me to be a kernel bug unless the page is marked
CoW, wouldn't it? Otherwise a write() followed by altering the page might
write out the altered data. But perhaps I've misunderstood (see below).

Could the problem be "cache=writeback" on the QEMU command
line (evident from a 'ps')? If caching is writeback, perhaps QEMU
needs to copy the data. Is there some setting to turn this off in
xl for test purposes?

The command line cache options are ignored by xen_disk, so, assuming
that the guest is using the PV disk interface, that can't be the issue.

This appears not to be the case (at least in our environment).

We use PV on HVM and:
disk = [ 'tap:qcow2:/my/nfs/directory/testdisk.qcow2,xvda,w' ]
(remainder of config file in the original message)

We tried modifying the cache= setting using the patch below (yes,
the mail client will probably have eaten it, but in essence it changes
the word 'writeback' to 'none'), and that stops VMs booting
at all, with:
    hd0 write error
    error: couldn't read file
So it would appear not to be entirely correct that the cache=
settings are being ignored. I've not had time to find out why
(possibly it's trying and failing to use O_DIRECT on NFS) but
I'll try writethrough.
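
For what it's worth, the O_DIRECT theory is easy to test outside QEMU
with a trivial standalone program along these lines (purely illustrative,
not from this thread; the path is a placeholder for a file on the NFS
mount). If the open() or even the aligned write fails there, that would
be consistent with cache=none being what stops the VMs booting:

/* Quick check: does O_DIRECT I/O work on this NFS mount at all?
 * Build with: gcc -O2 -o odirect-test odirect-test.c
 * (the path below is a placeholder; pass the real one as argv[1]) */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/my/nfs/directory/odirect-test.dat";

    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        fprintf(stderr, "open(O_DIRECT) failed: %s\n", strerror(errno));
        return 1;
    }

    /* O_DIRECT normally wants buffer/offset/length aligned to the logical
     * block size (local filesystems reject unaligned I/O with EINVAL;
     * NFS has its own rules), so use a page-aligned buffer. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0xAB, 4096);

    ssize_t n = pwrite(fd, buf, 4096, 0);
    printf("aligned 4k pwrite:  %zd %s\n", n, n < 0 ? strerror(errno) : "(ok)");

    n = pwrite(fd, (char *)buf + 1, 512, 0);   /* deliberately misaligned */
    printf("misaligned pwrite:  %zd %s\n", n, n < 0 ? strerror(errno) : "(ok)");

    free(buf);
    close(fd);
    return 0;
}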

One thing the guest is doing is writing to the partition table
(UEC cloud images do this on boot). This isn't special-cased in
any way, is it?

> Isn't there a way to prevent tcp_retransmit from running when the
> request is already completed? Or stop it if you find out that the pages
> are already gone?

But what would you do? If you don't run the tcp_retransmit, the write
would be lost (to say nothing of the NFS connection to the server).

Well, that is not true: if the write was really lost, the kernel wouldn't
have completed the AIO write and notified QEMU.

Isn't that exactly what you said did happen? The kernel completed the AIO
write and notified QEMU prior to the write actually completing, as the
data to write was still sitting in some as-yet-unacked TCP buffer. The
kernel then doesn't get the ACK for that sequence number and
decides to resend the entire TCP segment. That then blows up because
the segment being resent points at data that is now a hole in memory.
Perhaps I'm misunderstanding the problem.

If TCP does not retransmit, that segment will never get ACKed, and the
TCP stream will lock up (this assumes that the cause of the original
need to retransmit was packet loss - if it's simply buffering at
a busy filer, then I agree).
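
(For anyone following along, the submit/complete pattern being discussed
is roughly the one below. This is an illustrative sketch rather than
QEMU's actual code, the path is a placeholder, and it needs linking with
-laio. The point is simply that the io_getevents() completion for a write
to an NFS-backed file is the moment at which the backend unmaps the grant
pages, even though the NFS client's TCP socket may still hold references
to those pages for a later retransmit.)

/* Illustrative AIO write over NFS (not QEMU's code; link with -laio).
 * The question in this thread is what the pages handed to io_submit()
 * may still be used for once io_getevents() says the write is done. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/my/nfs/directory/aio-test.dat";
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) { perror("posix_memalign"); return 1; }
    memset(buf, 0xAB, 4096);

    io_context_t ctx = 0;
    int rc = io_setup(1, &ctx);
    if (rc < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-rc)); return 1; }

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);

    rc = io_submit(ctx, 1, cbs);
    if (rc != 1) { fprintf(stderr, "io_submit: %s\n", strerror(-rc)); return 1; }

    struct io_event ev;
    rc = io_getevents(ctx, 1, 1, &ev, NULL);
    if (rc != 1) { fprintf(stderr, "io_getevents: %s\n", strerror(-rc)); return 1; }
    printf("AIO write reported complete, res=%ld\n", (long)ev.res);

    /* In the real backend the grant pages get unmapped at this point;
     * the crash suggests the NFS client's TCP socket can still reference
     * them afterwards for a retransmit. */

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}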

> You could try persistent grants, that wouldn't solve the bug but they
> should be able to "hide" it pretty well. Not ideal, I know.
> The QEMU side commit is 9e496d7458bb01b717afe22db10a724db57d53fd.
> Konrad issued a pull request recently with the corresponding Linux
> blkfront changes:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> stable/for-jens-3.8

That's presumably the first 8 commits at:
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/stable/for-jens-3.8

So I'd need a new dom0 kernel and to backport the QEMU patch.

Yep.
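
To spell out my understanding of why persistent grants would hide it
(my reading of that commit, so treat with caution): the granted pages
stay mapped in the backend for the life of the ring and request data is
copied into them, so nothing the NFS/TCP layer might still reference ever
gets unmapped mid-flight. A purely illustrative userspace analogy:

/* Userspace analogy only (not the blkfront/QEMU code): with persistent
 * grants the data handed to the storage/network path lives in buffers
 * that are never unmapped, so a late TCP retransmit cannot hit a hole. */
#include <stdio.h>
#include <string.h>

#define RING_SIZE 32
#define SEG_SIZE  4096

/* Long-lived buffers standing in for persistently granted pages. */
static unsigned char persistent_seg[RING_SIZE][SEG_SIZE];

/* "Submit" a write: copy the guest data into the persistent segment and
 * hand *that* to the lower layers, never the guest page itself. */
static const unsigned char *submit_write(int ring_idx, const void *guest_page)
{
    memcpy(persistent_seg[ring_idx], guest_page, SEG_SIZE);
    return persistent_seg[ring_idx];
}

int main(void)
{
    unsigned char guest_page[SEG_SIZE];
    memset(guest_page, 0x5a, sizeof(guest_page));

    const unsigned char *stable = submit_write(0, guest_page);

    /* The guest page can now be reused or unmapped freely; anything the
     * transport still needs (e.g. a retransmit) reads the stable copy. */
    memset(guest_page, 0, sizeof(guest_page));
    printf("stable copy still intact: 0x%02x\n", stable[0]);
    return 0;
}

The cost is an extra copy per segment, which I assume is why it only
hides the underlying problem rather than fixing it.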

What puzzles me about this is (a) why we never see the same problems
on KVM, and (b) why this doesn't affect NFS clients even when no
virtualisation is involved.

--
Alex Bligh


dcrisan@dcrisan-lnx:/home/dcrisan/code/git/xen-4.2-live-migrate$ git diff
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 7662b3d..7b74e24 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -549,10 +549,10 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
            if (disks[i].is_cdrom) {
                if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY)
                    drive = libxl__sprintf
- (gc, "if=ide,index=%d,media=cdrom,cache=writeback", disk); + (gc, "if=ide,index=%d,media=cdrom,cache=none", disk);
                else
                    drive = libxl__sprintf
- (gc, "file=%s,if=ide,index=%d,media=cdrom,format=%s,cache=writeback", + (gc, "file=%s,if=ide,index=%d,media=cdrom,format=%s,cache=none",
                         disks[i].pdev_path, disk, format);
            } else {
                if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY) {
@@ -575,11 +575,11 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                 */
                if (strncmp(disks[i].vdev, "sd", 2) == 0)
                    drive = libxl__sprintf
- (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback", + (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=none",
                         disks[i].pdev_path, disk, format);
                else if (disk < 4)
                    drive = libxl__sprintf
- (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback", + (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=none",
                         disks[i].pdev_path, disk, format);
                else
                    continue; /* Do not emulate this disk */


