Xen project Mailing List

Re: [Xen-devel] Radeon DRM dom0 issues

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Michael Labriola <michael.d.labriola@xxxxxxxxx>

Date: Wed, 19 Feb 2014 15:08:08 -0500

Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, xen-devel-bounces@xxxxxxxxxxxxx, Michael D Labriola <mlabriol@xxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Thu, 20 Feb 2014 08:06:13 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
>> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@xxxxxxxxxx> wrote:
>> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
>> >> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
>> >> 09:49:38 AM:
>> >>
>> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx,
>> >> > xen-devel-bounces@xxxxxxxxxxxxx
>> >> > Date: 01/24/2014 09:50 AM
>> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> >
>> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
>> >> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
>> >> > >
>> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> >> > > > Date: 01/21/2014 04:59 PM
>> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
>> >> > > >
>> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
>> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on
>> >> > > > > 01/20/2014 10:38:27 AM:
>> >> > > > >
>> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> >> > > > > > Date: 01/20/2014 10:38 AM
>> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > >
>> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
>> >> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on
>> >> > > > > > > 01/20/2014 10:14:36 AM:
>> >> > > > > > >
>> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
>> >> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
>> >> > > > > > > > Date: 01/20/2014 10:14 AM
>> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > > > >
>> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
>> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
>> >> > > > > > > > > consistent crashes with multiple older R600 series
>> >> > > > > > > > > (HD 6470 and HD 6570) and unusably slow graphics with
>> >> > > > > > > > > a newer HD7000 (can see each line refresh individually
>> >> > > > > > > > > on radeonfb tty).  All 3 systems seem to work fine
>> >> > > > > > > > > bare metal.
>> >> > > > > > > >
>> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you
>> >> > > > > > > > mean?
>> >> > > > > > >
>> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my
>> >> > > > > > > dom0.  The RadeonSI sluggishness is when using the KMS
>> >> > > > > > > framebuffer device for a plain text console login.
>> >> > > > > >
>> >> > > > > > So sluggish is probably due to the PAT not being enabled.
>> >> > > > > > This patch should be applied:
>> >> > > > > >
>> >> > > > > > lkml.org/lkml/2011/11/8/406
>> >> > > > > >
>> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
>> >> > > > > >
>> >> > > > > > and these two reverted:
>> >> > > > > >
>> >> > > > > > "xen/pat: Disable PAT support for now."
>> >> > > > > > "xen/pat: Disable PAT using pat_enabled value."
>> >> > > > > >
>> >> > > > > > Which is to say do:
>> >> > > > > >
>> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
>> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>> >> > > > >
>> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
>> >> > > > > reverted those 2 commits, recompiled and installed.
>> >> > > > > Definitely fixed the HD 7000 sluggishness and appears to have
>> >> > > > > fixed the R600 crashes (although it's only been running a few
>> >> > > > > hours).
>> >> > > > >
>> >> > > > > How come that patch didn't get into mainline?  It looks pretty
>> >> > > > > innocuous to me...
>> >> > > >
>> >> > > > <Sigh> the x86 maintainers wanted a different route. And I
>> >> > > > hadn't had the chance nor time to implement it.
>> >> > >
>> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
>> >> > > patch to be usable.  If you do come up with a more mainline-able
>> >> > > solution, I'd gladly test it for you.  ;-)
>> >> >
>> >> > Thank you!
>> >>
>> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
>> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
>> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
>> >> workstation, X has crashed a half dozen or so times so far this week.
>> >> I've been in Xen with 2 paravirtual linux guests running almost
>> >> constantly for this whole period.  I don't understand what's changed,
>> >> but my system has been entirely unstable now.  I did recompile my
>> >> kernel... but all I did was merge the v3.13.1 stable commit into my
>> >> working tree and turn a few things on (netfilter, wifi, a couple
>> >> drivers turned on here and there).  I just went and verified that those
>> >> patches are still applied in my tree (i.e., I didn't accidentally undo
>> >> them).  I'm scratching my head (and staring at a TTY login).
>> >>
>> >> When X crashes, my kernel log prints a couple dozen iterations of this.
>> >> 3d acceleration no longer functions unless I reboot.  If memory serves,
>> >> the unpatched behavior upon X crash was that the kernel continued to
>> >> spew these errors until the whole box locked up.  At least that's not
>> >> happening any more...  ;-)
>> >>
>> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
>> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
>> >>
>> >> and here's a slightly different variant that happened while I was
>> >> typing this email (on a different machine, luckily):
>> >>
>> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
>> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
>> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
>> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64348.297561] [TTM] Buffer eviction failed
>> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
>> >>
>> >> Any ideas?
>> >
>> > yes. I believe you have a memory leak. As in, some driver (or X) is
>> > eating up the memory and not giving up enough. That means the TTM
>> > layer is hitting its ceiling of how much memory it can allocate.
>> >
>> > Now finding the culprit is going to be a bit hard.
>> >
>> > You could use:
>> >
>> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
>> >   pool   refills  pages freed  inuse  available  name
>> >     wc       259          224    808          4  nouveau 0000:05:00.0
>> > cached   3403058     13561071  51158          3  radeon 0000:01:00.0
>> > cached        25            0     96          4  nouveau 0000:05:00.0
>> >
>> > to figure out if my thinking is really true. You should have a huge
>> > 'inuse' count and almost no 'available'.
>>
>> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
>> always have the same contents.  Is that normal?
>
> Yes.
>>
>> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
>> metal... only in Xen.  Is that normal?
>
> It would show up on baremetal if you boot with 'iommu=soft'
>
>>
>>   pool   refills  pages freed  inuse  available  name
>> cached     15190        59551   1205          4  radeon 0000:01:00.0
>>
>> If I watch that file while creating xterms, moving them around, etc, I can
>> see the number available fluctuate between 3 and 6.  This is true, even on
>> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
>> message (yet?).
>
> OK, so let's see what happens when the error shows. Incidentally - what
> amount of memory does your initial domain have? And is it different than
> when you boot it as a baremetal?

I've got the problem very reproducible on 3 boxes.  All three are booting
the dom0 with as much RAM as Xen will give them, then giving up some of
their RAM as needed when I create domUs.  The 3 boxes have 4G, 8G, and 16G
if memory serves.

Does the amount of RAM on the actual video cards matter?  All the older
cards (that crash all the time) have 2G, whereas the R7 that hasn't crashed
yet only has 1G.

I've been reproducing the crash by just logging in and out of fluxbox via
XDM over and over again right after booting my dom0 in Xen w/ no guests
running.  That makes it happen within a few minutes.  Otherwise it randomly
crashes while I'm in the middle of trying to work...  ;-)

>
> Thank you.
>
>>
>> >
>> > But that will get us just to confirm that yes - you have a big usage
>> > of memory and it is hitting the ceiling.
>> >
>> > Now to actually figure out which application is hanging on these - that
>> > I am not sure about. I think there is some drm info tool to investigate
>> > how many pages each application is using. You can leave it running and
>> > see which app is gulping up the memory. But I am not sure which
>> > tool that is (if there was one).
>> >
>> > Well, let's do one step at a time - see if my theory is correct first.

--
Michael D Labriola
21 Rip Van Winkle Cir
Warwick, RI 02886
401-316-9844 (cell)
401-848-8871 (work)
401-234-1306 (home)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
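
Konrad's check ("a huge 'inuse' count and almost no 'available'") can be
automated while reproducing the crash. What follows is only a rough sketch
in Python: it parses the ttm_dma_page_pool table quoted in the thread and
flags suspicious pools. The names parse_ttm_pool and looks_exhausted, and
the thresholds min_inuse and max_available, are assumptions made up for
illustration; the thread gives no numeric cut-off. It also assumes that
each pool is printed on a single line, with the device address following
the driver name.

```python
from typing import List, NamedTuple

# Sample text copied from the thread (device address joined onto its row).
SAMPLE = """\
pool refills pages freed inuse available name
wc 259 224 808 4 nouveau 0000:05:00.0
cached 3403058 13561071 51158 3 radeon 0000:01:00.0
cached 25 0 96 4 nouveau 0000:05:00.0
"""


class PoolRow(NamedTuple):
    pool: str
    refills: int
    pages_freed: int
    inuse: int
    available: int
    name: str


def parse_ttm_pool(text: str) -> List[PoolRow]:
    """Turn the ttm_dma_page_pool table into a list of PoolRow records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, and anything that is not a data row.
        if len(fields) < 6 or not all(f.isdigit() for f in fields[1:5]):
            continue
        refills, freed, inuse, available = (int(f) for f in fields[1:5])
        name = " ".join(fields[5:])  # driver name plus PCI address
        rows.append(PoolRow(fields[0], refills, freed, inuse, available, name))
    return rows


def looks_exhausted(row: PoolRow, min_inuse: int = 10000,
                    max_available: int = 8) -> bool:
    """Hypothetical heuristic: many pages in use, almost none available.

    Both thresholds are illustrative guesses, not values from the thread.
    """
    return row.inuse >= min_inuse and row.available <= max_available


def read_pool_file(path: str = "/sys/kernel/debug/dri/0/ttm_dma_page_pool") -> str:
    """Read the live table; needs root and a mounted debugfs."""
    with open(path) as handle:
        return handle.read()


for row in parse_ttm_pool(SAMPLE):
    status = "SUSPECT" if looks_exhausted(row) else "ok"
    print(f"{row.name:24} {row.pool:7} inuse={row.inuse:<7} "
          f"available={row.available:<3} {status}")
```

On the sample above, only the radeon pool (51158 in use, 3 available) is
marked SUSPECT with these guessed thresholds. To watch a live system, pass
the result of read_pool_file() to parse_ttm_pool() in a loop while logging
in and out of X.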
