
Re: [win-pv-devel] Problems with xenvbd



On Fri, 4 Sep 2015, Paul Durrant wrote:
> > -----Original Message-----
> > From: Stefano Stabellini [mailto:stefano.stabellini@xxxxxxxxxxxxx]
> > Sent: 04 September 2015 17:25
> > To: Paul Durrant
> > Cc: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx; Stefano Stabellini
> > Subject: RE: [win-pv-devel] Problems with xenvbd
> >
> > On Fri, 4 Sep 2015, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx [mailto:win-pv-devel-
> > > > bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Paul Durrant
> > > > Sent: 02 September 2015 10:00
> > > > To: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > Cc: Stefano Stabellini
> > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > >
> > > > > -----Original Message-----
> > > > > From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > Sent: 02 September 2015 09:54
> > > > > To: Paul Durrant; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > Cc: Stefano Stabellini
> > > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > >
> > > > > > On 01/09/2015 16:41, Paul Durrant wrote:
> > > > > >> -----Original Message-----
> > > > > >> From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > >> Sent: 21 August 2015 14:14
> > > > > >> To: Rafał Wojdyła; Paul Durrant; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >> Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > > >>
> > > > > >>> On 21/08/2015 10:12, Fabio Fantoni wrote:
> > > > > >>>> On 21/08/2015 00:03, Rafał Wojdyła wrote:
> > > > > >>>> On 2015-08-19 23:25, Paul Durrant wrote:
> > > > > >>>>>> -----Original Message-----
> > > > > >>>>>> From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx
> > > > > >>>>>> [mailto:win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Rafal Wojdyla
> > > > > >>>>>> Sent: 18 August 2015 14:33
> > > > > >>>>>> To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >>>>>> Subject: [win-pv-devel] Problems with xenvbd
> > > > > >>>>>>
> > > > > >>>>>> Hi,
> > > > > >>>>>>
> > > > > >>>>>> I've been testing the current pvdrivers code in preparation for
> > > > > >>>>>> creating upstream patches for my xeniface additions, and I noticed
> > > > > >>>>>> that xenvbd seems to be very unstable for me. I'm not sure if it's
> > > > > >>>>>> a problem with xenvbd itself or my code, because it seemed to only
> > > > > >>>>>> manifest when the full suite of our guest tools was installed along
> > > > > >>>>>> with xenvbd. In short, most of the time the system crashed with
> > > > > >>>>>> kernel memory corruption in seemingly random processes shortly
> > > > > >>>>>> after start. Driver Verifier didn't seem to catch anything. You can
> > > > > >>>>>> see a log from one such crash in the attachment crash1.txt.
> > > > > >>>>>>
> > > > > >>>>>> Today I tried to perform some more tests, but this time without our
> > > > > >>>>>> guest tools (only pvdrivers and our shared libraries were
> > > > > >>>>>> installed). To my surprise, now Driver Verifier was crashing the
> > > > > >>>>>> system every time in xenvbd (see crash2.txt). I don't know why it
> > > > > >>>>>> didn't catch that previously... If adding some timeout to the
> > > > > >>>>>> offending wait doesn't break anything, I'll try that to see if I can
> > > > > >>>>>> reproduce the previous memory corruptions.
> > > > > >>>>>>
> > > > > >>>>> Those crashes do look odd. I'm on PTO for the next week, but I'll
> > > > > >>>>> have a look when I get back to the office. I did run verifier on all
> > > > > >>>>> the drivers a week or so back (while running vbd plug/unplug tests),
> > > > > >>>>> but there have been a couple of changes since then.
> > > > > >>>>>
> > > > > >>>>> Paul
> > > > > >>>>>
> > > > > >>>> No problem. I attached some more logs. The last one was during system
> > > > > >>>> shutdown; after that the OS failed to boot (probably a corrupted
> > > > > >>>> filesystem, since the BSOD itself seemed to indicate that). I think
> > > > > >>>> every time there is a BLKIF_RSP_ERROR somewhere, but I'm not yet
> > > > > >>>> familiar with the Xen PV device interfaces, so I'm not sure what that
> > > > > >>>> means.
> > > > > >>>>
> > > > > >>>> In the meantime I've run more tests on my modified xeniface driver
> > > > > >>>> to make sure it's not contributing to these issues, but everything
> > > > > >>>> seemed to be fine there.
> > > > > >>>>
> > > > > >>>>
> > > > > >>> I also had a disk corruption on Windows 10 Pro 64-bit with the PV
> > > > > >>> drivers build of 11 August, but I'm not sure it is related to the
> > > > > >>> winpv drivers: on the same domU I had also started testing snapshots
> > > > > >>> with a qcow2 disk overlay. For this case I don't have useful
> > > > > >>> information, because Windows didn't try to boot at all, but if it
> > > > > >>> happens again I'll try to gather more useful information.
> > > > > >> It happened another time, but again I was unable to understand what
> > > > > >> exactly the cause was. On Windows reboot everything seemed OK and it
> > > > > >> did a clean shutdown, but on the next boot SeaBIOS didn't find a
> > > > > >> bootable disk, and the qemu log doesn't show any useful information.
> > > > > >> qemu-img check shows errors:
> > > > > >>> /usr/lib/xen/bin/qemu-img check W10.disk1.cow-sn1
> > > > > >>> ERROR cluster 143 refcount=1 reference=2
> > > > > >>> Leaked cluster 1077 refcount=1 reference=0
> > > > > >>> ERROR cluster 1221 refcount=1 reference=2
> > > > > >>> Leaked cluster 2703 refcount=1 reference=0
> > > > > >>> Leaked cluster 5212 refcount=1 reference=0
> > > > > >>> Leaked cluster 13375 refcount=1 reference=0
> > > > > >>>
> > > > > >>> 2 errors were found on the image.
> > > > > >>> Data may be corrupted, or further writes to the image may corrupt it.
> > > > > >>>
> > > > > >>> 4 leaked clusters were found on the image.
> > > > > >>> This means waste of disk space, but no harm to data.
> > > > > >>> 27853/819200 = 3.40% allocated, 22.65% fragmented, 0.00% compressed clusters
> > > > > >>> Image end offset: 1850736640
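
As an aside on the damaged image itself: qemu-img can attempt an in-place
repair of refcount errors and leaked clusters via its -r option. A minimal
invocation against the same image, assuming the qemu 2.4 build of qemu-img
mentioned below, would be something like:

    # attempt to repair leaks and refcount errors in place (take a copy first)
    /usr/lib/xen/bin/qemu-img check -r all W10.disk1.cow-sn1

Repairing an image that already shows refcount errors can still lose data, so
this is only a recovery aid, not a fix for whatever corrupted the image.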
> > > > > >> I created it with:
> > > > > >> /usr/lib/xen/bin/qemu-img create -o backing_file=W10.disk1.xm,backing_fmt=raw -f qcow2 W10.disk1.cow-sn1
> > > > > >> and changed the xl domU configuration:
> > > > > >> disk=['/mnt/vm2/W10.disk1.cow-sn1,qcow2,hda,rw',...
> > > > > >> Dom0 is running xen 4.6-rc1 and qemu 2.4.0.
> > > > > >> DomU is Windows 10 Pro 64-bit with the PV drivers build of 11 August.
> > > > > >>
> > > > > >> How can I know for sure whether this is a winpv problem, a qemu problem,
> > > > > >> or something else, and gather useful information to report?
> > > > > >>
> > > > > >> Thanks for any reply and sorry for my bad english.
> > > > > > This sounds very much like a lack of synchronization somewhere. I recall
> > > > > > seeing other problems of this ilk when someone was messing around with
> > > > > > O_DIRECT for opening images. I wonder if we are missing a flush operation
> > > > > > on shutdown.
> > > > > >
> > > > > >    Paul
> > > > > >
> > > > > Thanks for the reply.
> > > > > I did a quick search but couldn't find O_DIRECT by grepping in libxl; I
> > > > > found it only in the qemu code.
> > > > > Then I found the patch that seems to have added this setting for xen:
> > > > > http://git.qemu.org/?p=qemu.git;a=commitdiff;h=454ae734f1d9f591345fa78376435a8e74bb4edd
> > > > > In libxl it seems disabled by default, and from some old xen posts it
> > > > > seems that O_DIRECT creates problems.
> > > > > Should I try enabling direct-io-safe on the domU's qcow2 disks?
> > > > > I have also added Stefano Stabellini as cc.
> > > > > @Stefano Stabellini: What is the currently known status of direct-io-safe?
> >
> > O_DIRECT should be entirely safe to use, at least on ide and qdisk. I
> > haven't done the analysis on ahci emulation in qemu to know whether that
> > would be true for ahci disks, but that doesn't matter because unplug is
> > not implemented for ahci disks.
> >
> >
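
For concreteness, direct-io-safe is a per-disk boolean keyword in the xl disk
configuration syntax (documented in xen's xl-disk-configuration). A sketch of
enabling it for the overlay discussed above, reusing the paths from Fabio's
configuration, might look like this:

    # hypothetical xl disk line with direct-io-safe enabled for the qcow2 overlay
    disk=['format=qcow2, vdev=hda, access=rw, direct-io-safe, target=/mnt/vm2/W10.disk1.cow-sn1']

The keyword only declares the backing storage safe for O_DIRECT; whether the
backend actually opens the image that way is still up to the backend.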
> > > > > Sorry if the question is stupid or my English is too bad, but many posts
> > > > > from recent years are confusing and in some cases seem contradictory
> > > > > about the stability/integrity/performance of using it or not.
> > > > > In particular it seems to crash with some kernels, but I don't understand
> > > > > exactly which versions and/or which patches.
> > > > >
> > > > > @Paul Durrant: have you seen my other mail where I wrote that, based on my
> > > > > latest tests with xen 4.6 without the udev file, Windows domUs with the new
> > > > > PV drivers don't boot, and to make them boot correctly I must re-add the
> > > > > udev file? Can this cause something unexpected related to this problem, or
> > > > > is it a different issue?
> > > > > http://lists.xen.org/archives/html/win-pv-devel/2015-08/msg00033.html
> > > > >
> > > >
> > > > I'm not sure why udev would be an issue here. The problem you have
> > > > appears to be QEMU ignoring the request to unplug emulated disks. I've not
> > > > seen this behaviour on my test box, so I'll need to dig some more.
> > > >
> > >
> > > I notice you have 6 IDE channels? Are you using AHCI by any chance? If you
> > > are, then it looks like QEMU is not honouring the unplug request... that
> > > would be where the bug is. I'll try to repro myself.
> >
> > Unplug on ahci is actually unimplemented, see hw/i386/xen/xen_platform.c:
> >
> > static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
> > {
> >     /* We have to ignore passthrough devices */
> >     if (pci_get_word(d->config + PCI_CLASS_DEVICE) ==
> >             PCI_CLASS_STORAGE_IDE
> >             && strcmp(d->name, "xen-pci-passthrough") != 0) {
> >         pci_piix3_xen_ide_unplug(DEVICE(d));
> >     }
> > }
> >
> > the function specifically only unplugs IDE disks.
> > I am not sure what to do about ahci unplug, given that we don't
> > implement scsi disk unplug either. After all, if the goal is to unplug
> > the disk, why choose a faster emulated protocol?
>
> I think we should unplug the disk regardless of type, if we support 
> configuring disks of that type through libxl. The reason, in this case AFAIU, 
> for wanting ahci is to speed up Windows boot where initial driver load is 
> still done through int13 and hence emulated disk.

I would be happy to take a patch which makes QEMU unplug all kinds of
disks, as long as it is able to skip passed-through devices (see the
comment in the code).
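
A rough sketch of what such a patch might look like, extending the
unplug_disks() callback quoted above inside xen_platform.c (the SATA/SCSI
class cases and the use of object_unparent() for non-IDE controllers are
assumptions here, not existing QEMU behaviour):

static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
{
    uint16_t class = pci_get_word(d->config + PCI_CLASS_DEVICE);

    /* We have to ignore passthrough devices */
    if (strcmp(d->name, "xen-pci-passthrough") == 0) {
        return;
    }

    switch (class) {
    case PCI_CLASS_STORAGE_IDE:
        /* Existing behaviour: detach the IDE disks, keep the controller */
        pci_piix3_xen_ide_unplug(DEVICE(d));
        break;
    case PCI_CLASS_STORAGE_SATA:  /* AHCI */
    case PCI_CLASS_STORAGE_SCSI:
        /* Assumption: removing the whole controller is an acceptable way
         * to hide the emulated disks from the guest. */
        object_unparent(OBJECT(d));
        break;
    default:
        break;
    }
}

Note that the IDE path deliberately keeps the controller and only detaches
its disks, so an AHCI or SCSI equivalent may need the same care rather than
the blunt object_unparent() shown here.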
_______________________________________________
win-pv-devel mailing list
win-pv-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/win-pv-devel

 

