
Re: [Xen-devel] Radeon DRM dom0 issues



On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> >> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
> >> 09:49:38 AM:
> >>
> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx, xen-devel-
> >> > bounces@xxxxxxxxxxxxx
> >> > Date: 01/24/2014 09:50 AM
> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >
> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> >> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
> >> > >
> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> >> > > > Date: 01/21/2014 04:59 PM
> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
> >> > > >
> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/20/2014 10:38:27 AM:
> >> > > > >
> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> >> > > > > > Date: 01/20/2014 10:38 AM
> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > >
> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> >> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on 01/20/2014 10:14:36 AM:
> >> > > > > > >
> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
> >> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
> >> > > > > > > > Date: 01/20/2014 10:14 AM
> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > > > >
> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent
> >> > > > > > > > > crashes with multiple older R600 series (HD 6470 and HD 6570) and
> >> > > > > > > > > unusably slow graphics with a newer HD7000 (can see each line
> >> > > > > > > > > refresh individually on radeonfb tty).  All 3 systems seem to
> >> > > > > > > > > work fine bare metal.
> >> > > > > > > >
> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> >> > > > > > >
> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The
> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device for
> >> > > > > > > a plain text console login.
> >> > > > > >
> >> > > > > > So the sluggishness is probably due to PAT not being enabled. This
> >> > > > > > patch should be applied:
> >> > > > > >
> >> > > > > > lkml.org/lkml/2011/11/8/406
> >> > > > > >
> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> >> > > > > >
> >> > > > > > and these two reverted:
> >> > > > > >
> >> > > > > >  "xen/pat: Disable PAT support for now."
> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> >> > > > > >
> >> > > > > > Which is to say do:
> >> > > > > >
> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >> > > > >
> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree, reverted
> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the HD 7000
> >> > > > > sluggishness and appears to have fixed the R600 crashes (although it's
> >> > > > > only been running a few hours).
> >> > > > >
> >> > > > > How come that patch didn't get into mainline?  It looks pretty
> >> > > > > innocuous to me...
> >> > > >
> >> > > > <Sigh> The x86 maintainers wanted a different route, and I haven't had
> >> > > > the chance or time to implement it.
> >> > >
> >> > > I see.  Well, I've got a handful of boxes in my lab that need that patch
> >> > > to be usable.  If you do come up with a more mainline-able solution, I'd
> >> > > gladly test it for you.  ;-)
> >> >
> >> > Thank you!
> >>
> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
> >> workstation, X has crashed a half dozen or so times so far this week.  I've
> >> been in Xen with 2 paravirtualized Linux guests running almost constantly
> >> for this whole period.  I don't understand what's changed, but my system is
> >> now entirely unstable.  I did recompile my kernel... but all I did was
> >> merge the v3.13.1 stable commit into my working tree and turn a few things
> >> on (netfilter, wifi, a couple drivers here and there).  I just went and
> >> verified that those patches are still applied in my tree (i.e., I didn't
> >> accidentally undo them).  I'm scratching my head (and staring at a TTY
> >> login).
> >>
> >> When X crashes, my kernel log prints a couple dozen iterations of the
> >> messages below, and 3D acceleration no longer functions unless I reboot.
> >> If memory serves, the unpatched behavior upon an X crash was that the
> >> kernel continued to spew these errors until the whole box locked up.  At
> >> least that's not happening any more... ;-)
> >>
> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
> >>
> >> and here's a slightly different variant that happened while I was typing
> >> this email (on a different machine, luckily):
> >>
> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64348.297561] [TTM] Buffer eviction failed
> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
> >>
> >> Any ideas?
> >
> > Yes. I believe you have a memory leak. As in, some driver (or X) is
> > eating up the memory and not giving enough of it back. That means the TTM
> > layer is hitting its ceiling of how much memory it can allocate.
> >
> > Now finding the culprit is going to be a bit hard.
> >
> > You could use:
> >
> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> >          pool      refills   pages freed    inuse available     name
> >            wc          259           224      808        4 nouveau 0000:05:00.0
> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> >        cached           25             0       96        4 nouveau 0000:05:00.0
> >
> > to figure out if my thinking is really true. You should have a huge
> > 'inuse' count and almost no 'available'.
> 
> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
> always have the same contents.  Is that normal?

Yes.
> 
> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist on bare
> metal... only in Xen.  Is that normal?

It would show up on bare metal if you boot with 'iommu=soft'.
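
To make that concrete - this is only a sketch, and the kernel image, root
device and the rest of the options below are made up, so adjust them for your
own bootloader entry. As far as I recall the TTM DMA page pool is only used
when swiotlb is involved, which is why the file shows up under Xen
(swiotlb-xen) but not on a plain bare metal boot:

   # hypothetical GRUB kernel line for a bare metal boot, with iommu=soft appended
   linux /boot/vmlinuz-3.13.1 root=/dev/sda2 ro iommu=soft
   # after rebooting, the pool statistics should appear
   cat /sys/kernel/debug/dri/0/ttm_dma_page_pool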

> 
>          pool      refills   pages freed    inuse available     name
>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> 
> If I watch that file while creating xterms, moving them around, etc., I can
> see the number available fluctuate between 3 and 6.  This is true even on
> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
> message (yet?).

OK, so let's see what happens when the error shows. Incidentally, how much
memory does your initial domain have? And is it different than when you
boot it on bare metal?
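
Something along these lines would show it (a rough sketch - I am assuming you
are using the xl toolstack; use the xm equivalents otherwise):

   xl list 0                                     # 'Mem' column = initial domain's current allocation (MB)
   xl info | grep -E 'total_memory|free_memory'  # host-wide view from Xen
   free -m                                       # inside dom0; compare with the same command bare metal

If you limit dom0 with something like 'dom0_mem=' on the Xen command line,
that is the number to compare against the bare metal total.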

Thank you.

> 
> 
> >
> > But that will just get us to confirm that yes - you have a big usage of
> > memory and it is hitting the ceiling.
> >
> > Now to actually figure out which application is hanging on to these - that
> > I am not sure about. I think there is some DRM info tool to investigate
> > how many pages each application is using. You can leave it running and
> > see which app is gulping up the memory. But I am not sure which tool
> > that is (if there is one).
> >
> > Well, let's do one step at a time - see if my theory is correct first.
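
One rough idea in the meantime: the generic DRM debugfs 'clients' file lists
which processes currently have the device open. It does not show per-client
page counts, so it only helps narrow down the suspects while you watch the
pool grow, e.g.:

   cat /sys/kernel/debug/dri/0/clients
   watch -n 5 cat /sys/kernel/debug/dri/0/ttm_dma_page_pool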
> 
> 
> 
> -- 
> Michael D Labriola
> 21 Rip Van Winkle Cir
> Warwick, RI 02886
> 401-316-9844 (cell)
> 401-848-8871 (work)
> 401-234-1306 (home)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel