
Re: [Xen-devel] Radeon DRM dom0 issues



On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> >> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
> >> 09:49:38 AM:
> >>
> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx, xen-devel-
> >> > bounces@xxxxxxxxxxxxx
> >> > Date: 01/24/2014 09:50 AM
> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >
> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> >> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
> >> > >
> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> >> > > > Date: 01/21/2014 04:59 PM
> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
> >> > > >
> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/20/2014 10:38:27 AM:
> >> > > > >
> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> >> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> >> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> >> > > > > > Date: 01/20/2014 10:38 AM
> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > >
> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> >> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on 01/20/2014 10:14:36 AM:
> >> > > > > > >
> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
> >> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> >> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
> >> > > > > > > > Date: 01/20/2014 10:14 AM
> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > > > >
> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent
> >> > > > > > > > > crashes with multiple older R600 series (HD 6470 and HD 6570) and
> >> > > > > > > > > unusably slow graphics with a newer HD7000 (can see each line
> >> > > > > > > > > refresh individually on radeonfb tty).  All 3 systems seem to
> >> > > > > > > > > work fine bare metal.
> >> > > > > > > >
> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> >> > > > > > >
> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The
> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device for
> >> > > > > > > a plain text console login.
> >> > > > > >
> >> > > > > > So the sluggishness is probably due to PAT not being enabled. This
> >> > > > > > patch should be applied:
> >> > > > > >
> >> > > > > > lkml.org/lkml/2011/11/8/406
> >> > > > > >
> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> >> > > > > >
> >> > > > > > and these two reverted:
> >> > > > > >
> >> > > > > >  "xen/pat: Disable PAT support for now."
> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> >> > > > > >
> >> > > > > > Which is to say do:
> >> > > > > >
> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >> > > > >
> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree, reverted
> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the HD 7000
> >> > > > > sluggishness and appears to have fixed the R600 crashes (although it's
> >> > > > > only been running a few hours).
> >> > > > >
> >> > > > > How come that patch didn't get into mainline?  It looks pretty
> >> > > > > innocuous to me...
> >> > > >
> >> > > > <Sigh> The x86 maintainers wanted a different route, and I haven't had
> >> > > > the chance or time to implement it.
> >> > >
> >> > > I see.  Well, I've got a handful of boxes in my lab that need that patch
> >> > > to be usable.  If you do come up with a more mainline-able solution, I'd
> >> > > gladly test it for you.  ;-)
> >> >
> >> > Thank you!
> >>
> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
> >> workstation, X has crashed a half dozen or so times so far this week.  I've
> >> been in Xen with 2 paravirtualized Linux guests running almost constantly
> >> for this whole period.  I don't understand what's changed, but my system is
> >> now entirely unstable.  I did recompile my kernel... but all I did was
> >> merge the v3.13.1 stable commit into my working tree and turn a few things
> >> on (netfilter, wifi, a couple drivers here and there).  I just went and
> >> verified that those patches are still applied in my tree (i.e., I didn't
> >> accidentally undo them).  I'm scratching my head (and staring at a TTY
> >> login).
> >>
> >> When X crashes, my kernel log prints a couple dozen iterations of the
> >> messages below, and 3D acceleration no longer functions unless I reboot.
> >> If memory serves, the unpatched behavior upon an X crash was that the
> >> kernel continued to spew these errors until the whole box locked up.  At
> >> least that's not happening any more... ;-)
> >>
> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
> >>
> >> and here's a slightly different variant that happened while I was typing
> >> this email (on a different machine, luckily):
> >>
> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64348.297561] [TTM] Buffer eviction failed
> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
> >>
> >> Any ideas?
> >
> > Yes. I believe you have a memory leak. As in, some driver (or X) is
> > eating up the memory and not giving enough of it back. That means the TTM
> > layer is hitting its ceiling of how much memory it can allocate.
> >
> > Now finding the culprit is going to be a bit hard.
> >
> > You could use:
> >
> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> >          pool      refills   pages freed    inuse available     name
> >            wc          259           224      808        4 nouveau 0000:05:00.0
> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> >        cached           25             0       96        4 nouveau 0000:05:00.0
> >
> > to figure out if my thinking is really true. You should have a huge
> > 'inuse' count and almost no 'available'.
> 
> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
> always have the same contents.  Is that normal?

Yes.
> 
> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist on bare
> metal... only in Xen.  Is that normal?

It would show up on bare metal if you boot with 'iommu=soft'.
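
To make that concrete - this is only a sketch, and the kernel image, root
device and the rest of the options below are made up, so adjust them for your
own bootloader entry. As far as I recall the TTM DMA page pool is only used
when swiotlb is involved, which is why the file shows up under Xen
(swiotlb-xen) but not on a plain bare metal boot:

   # hypothetical GRUB kernel line for a bare metal boot, with iommu=soft appended
   linux /boot/vmlinuz-3.13.1 root=/dev/sda2 ro iommu=soft
   # after rebooting, the pool statistics should appear
   cat /sys/kernel/debug/dri/0/ttm_dma_page_pool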

> 
>          pool      refills   pages freed    inuse available     name
>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> 
> If I watch that file while creating xterms, moving them around, etc., I can
> see the number available fluctuate between 3 and 6.  This is true even on
> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
> message (yet?).

OK, so let's see what happens when the error shows. Incidentally, how much
memory does your initial domain have? And is it different than when you
boot it on bare metal?
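
Something along these lines would show it (a rough sketch - I am assuming you
are using the xl toolstack; use the xm equivalents otherwise):

   xl list 0                                     # 'Mem' column = initial domain's current allocation (MB)
   xl info | grep -E 'total_memory|free_memory'  # host-wide view from Xen
   free -m                                       # inside dom0; compare with the same command bare metal

If you limit dom0 with something like 'dom0_mem=' on the Xen command line,
that is the number to compare against the bare metal total.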

Thank you.

> 
> 
> >
> > But that will just get us to confirm that yes - you have a big usage of
> > memory and it is hitting the ceiling.
> >
> > Now to actually figure out which application is hanging on to these - that
> > I am not sure about. I think there is some DRM info tool to investigate
> > how many pages each application is using. You can leave it running and
> > see which app is gulping up the memory. But I am not sure which tool
> > that is (if there is one).
> >
> > Well, let's do one step at a time - see if my theory is correct first.
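
One rough idea in the meantime: the generic DRM debugfs 'clients' file lists
which processes currently have the device open. It does not show per-client
page counts, so it only helps narrow down the suspects while you watch the
pool grow, e.g.:

   cat /sys/kernel/debug/dri/0/clients
   watch -n 5 cat /sys/kernel/debug/dri/0/ttm_dma_page_pool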
> 
> 
> 
> -- 
> Michael D Labriola
> 21 Rip Van Winkle Cir
> Warwick, RI 02886
> 401-316-9844 (cell)
> 401-848-8871 (work)
> 401-234-1306 (home)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel