
Re: [Xen-devel] Fatal crash on xen4.2 HVM + qemu-xen dm + NFS


If so, I think it's the case that *ALL* NFS dom0 access by Xen domU
VMs is unsafe in the event of tcp retransmit (both in the sense that
the grant can be freed up causing a crash, or the domU's data can be
rewritten post write causing corruption).

Yes. Prior to your report this (assuming it is the same issue) had been
a very difficult to trigger issue -- I was only able to do so with
userspace firewall rules which deliberately delayed TCP acks.


The fact that you can reproduce so easily makes me wonder if this is
really the same issue. To trigger the issue you need this sequence of
events:
      * Send an RPC
      * RPC is encapsulated into a TCP/IP frame (or several) and sent.
      * Wait for an ACK response to the TCP/IP frame
      * Timeout.
      * Queue a retransmit of the TCP/IP frame(s)
      * Receive the ACK to the original.
      * Receive the reply to the RPC as well
      * Report success up the stack
      * Userspace gets success and unmaps the page
      * Retransmit hits the front of the queue
      * BOOM

To do this you need to be pretty unlucky or retransmitting a lot (which
would usually imply something up with either the network or the filer).
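
As a rough illustration of the sequence above, here is a toy model in
Python. It is not kernel code: GrantMappedPage and run_sequence are names
made up for this sketch, and "unmapped" simply means any later access
raises an error. It only shows why the ordering matters when the queued
frame references the page rather than a copy of it.

```python
class GrantMappedPage:
    """Toy stand-in for a domU page grant-mapped into dom0 (hypothetical)."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.mapped = True

    def read(self):
        if not self.mapped:
            raise RuntimeError("access to unmapped grant page")
        return bytes(self.data)

    def unmap(self):
        self.mapped = False


def run_sequence(zero_copy):
    """Replay the event sequence and return a log of what happened."""
    page = GrantMappedPage(b"block-0")
    log = []

    # Send an RPC: the frame either references the page or holds a copy.
    payload = page if zero_copy else page.read()
    log.append("sent")

    # Timeout waiting for the ACK: queue a retransmit of the same payload.
    retransmit_queue = [payload]
    log.append("retransmit queued")

    # ACK and RPC reply arrive, success goes up the stack, page is unmapped.
    page.unmap()
    log.append("unmapped")

    # The retransmit hits the front of the queue.
    queued = retransmit_queue.pop(0)
    try:
        if isinstance(queued, GrantMappedPage):
            data = queued.read()
        else:
            data = queued
        log.append("retransmitted %d bytes" % len(data))
    except RuntimeError:
        log.append("BOOM")
    return log


print(run_sequence(zero_copy=True))
print(run_sequence(zero_copy=False))
```

In this model only the zero copy case ends in "BOOM"; the copying case
retransmits its private copy without touching the unmapped page.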

Well, the two things we are doing different that potentially make this
easier to replicate are:

* We are using a QCOW2 backing file, and running a VM image which
 expands the partition, and then the filing system. This is a particularly
 write heavy load. We're also using upstream qemu DM which I think
 wasn't there when you last tested.

* The filer we run this on is a dev filer which performs poorly,
 and has lots of LUNs (though I think we replicated it on another
 filer too). Though the filer and network certainly aren't great,
 they can run VMs just fine.

BTW, there is also a similar situation with RPC level retransmits, which
I think might be where the NFSv3 vs v4 comes from (i.e. only v3 is
susceptible to that specific case). This one is very hard to reproduce
as well (although slightly easier than the TCP retransmit one, IIRC)

I /think/ that won't be the issue we have as RPC retransmits on v4 over
TCP happen only very rarely - i.e. when the TCP connection has died,
and I don't believe we are seeing that (though it's difficult to
tell given the box dies totally what happened first).

However, I agree it will suffer from the same problem.

 I think that would also
apply to iSCSI over tcp, which would presumably suffer similarly.

Correct, iSCSI over TCP can also have this issue.

Is that analysis correct?

The important thing is whether zero copy is used or not. IOW it is
only a problem if the actual userspace page, which is a mapped domU
page, is what gets queued up. Whether zero copy is done or not depends
on things like O_DIRECT and write(2) vs. sendpage(2) etc and what the
underlying fs implements etc. I thought NFS only did it for O_DIRECT. I
may be mistaken. aio is probably a factor too.
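
To make the zero copy distinction concrete, here is a minimal Python
sketch of the corruption half of the problem (data rewritten after the
write has been reported complete). It is an analogy only: a memoryview
stands in for a queued reference to the mapped domU page, bytes() stands
in for a copy, and queue_then_rewrite is a hypothetical name.

```python
def queue_then_rewrite(zero_copy):
    """Queue a payload, let the guest reuse the page, return what is sent."""
    guest_page = bytearray(b"AAAA")

    if zero_copy:
        # The queued payload aliases the guest page.
        queued = memoryview(guest_page)
    else:
        # The queued payload is a private snapshot of the page.
        queued = bytes(guest_page)

    # The write was reported complete, so the guest reuses the page.
    for i in range(len(guest_page)):
        guest_page[i] = ord("B")

    # A late retransmit now sends whatever the queued payload contains.
    return bytes(queued)


print(queue_then_rewrite(zero_copy=True))
print(queue_then_rewrite(zero_copy=False))
```

With the aliasing payload the retransmit carries the guest's new data;
with the copy it carries the data as it was when the write was issued.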

Right, and I'm pretty sure we're not using O_DIRECT as we're using
cache=writeback (which is the default). Is there some way to make it
copy pages?

I'm wondering whether what's happening is that when the disk grows
(or there's a backing file in place) some sort of different I/O is
done by qemu. Perhaps irrespective of write cache setting, it does some
form of zero copy I/O when there's a backing file in place.

FWIW blktap2 always copies for pretty much this reason, I seem to recall
the maintainer saying the perf hit wasn't noticeable.

I'm afraid I find the various blk* combinations a bit of an impenetrable
maze. Is it possible (if only for testing purposes) to use blktap2
with HVM domU and qcow2 disks with backing files? I had thought the
alternatives were qdisk and tap?

And a late comment on your previous email:

Surely before Xen removes the grant on the page, unmapping it from dom0's
memory, it should check to see if there are any existing references
to the page and if there are, give the kernel its own COW copy, rather
than unmap it totally which is going to lead to problems.

Unfortunately each page only has one reference count, so you cannot
distinguish between references from this particular NFS write from other
references (other writes, the ref held by the process itself, etc).

Sure, I understand that. But I wasn't suggesting the tcp layer triggered
this (in which case it would need to get back to the NFS write). I
think Trond said you were arranging for sendpage() to provide a callback.
I'm not suggesting that.

What I was (possibly naively) suggesting, is that the single reference
count to the page should be zero by the time the xen grant stuff is
about to remove the mapping, else it's in use somewhere in the domain
into which it's mapped. The xen grant stuff can't know whether that's
for NFS, or iSCSI or whatever. But it does know some other bit of the
kernel is going to use that page, and when it's finished with it will
decrement the reference count and presumably free the page up. So if
it finds a page like this, surely the right thing to do is to leave
a copy of it in dom0, which is no longer associated with the domU
page; it will then get freed when the tcp stack (or whatever is using
it) decrements the reference count later. I don't know if that makes
any sense.
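
In case it helps, this is roughly what I mean, as a toy Python sketch and
not a proposal for actual code. Page, unmap_grant and put_page are
hypothetical names, and the sketch assumes the reference count seen at
unmap time counts only other users in dom0 (which, per the point above
about there being a single count, may well not hold in practice).

```python
class Page:
    """Toy page: 'backing' records which domain's memory holds the data."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 0
        self.backing = "domU"


def unmap_grant(page):
    """Hypothetical hook run when the grant mapping is torn down."""
    if page.refcount == 0:
        # Nobody else is using the page: drop the mapping entirely.
        page.data = None
        page.backing = None
        return "unmapped"
    # Still referenced elsewhere in dom0: leave a dom0-owned copy behind
    # that is no longer associated with the domU page.
    page.data = bytearray(page.data)
    page.backing = "dom0"
    return "copied"


def put_page(page):
    """Drop one reference; free a detached dom0 copy on the last one."""
    page.refcount -= 1
    if page.refcount == 0 and page.backing == "dom0":
        page.data = None
        page.backing = None
        return "freed"
    return "in use"


p = Page(b"data")
p.refcount = 1          # e.g. held by a queued TCP retransmit
print(unmap_grant(p))   # the copy path is taken
print(put_page(p))      # the later decrement frees the copy
```

The only point of the sketch is the branch in unmap_grant: a nonzero
count at unmap time leads to a detached copy rather than a dangling
mapping.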

Alex Bligh

Xen-devel mailing list


