
RE: Crashes under Xen with Radeon graphics card


  • To: Juergen Gross <jgross@xxxxxxxx>, lkml <linux-kernel@xxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "amd-gfx@xxxxxxxxxxxxxxxxxxxxx" <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
  • From: "Deucher, Alexander" <Alexander.Deucher@xxxxxxx>
  • Date: Fri, 15 Dec 2023 16:04:40 +0000
  • Accept-language: en-US
  • Cc: "Koenig, Christian" <Christian.Koenig@xxxxxxx>, "Pan, Xinhui" <Xinhui.Pan@xxxxxxx>
  • Delivery-date: Fri, 15 Dec 2023 16:13:00 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: Crashes under Xen with Radeon graphics card

[Public]

> -----Original Message-----
> From: Juergen Gross <jgross@xxxxxxxx>
> Sent: Friday, December 15, 2023 6:57 AM
> To: lkml <linux-kernel@xxxxxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>
> Subject: Crashes under Xen with Radeon graphics card
>
> Hi,
>
> I recently stumbled over a test system which showed crashes probably
> resulting from memory being overwritten randomly.
>
> The problem is occurring only in Dom0 when running under Xen. It seems to
> be present since at least kernel 6.3 (I didn't go back further yet), and it
> seems NOT to be present in kernel 5.14.
>
> I tracked the problem down to the initialization of the graphics card (the
> problem might surface only later, but at least an early initialization
> failure made the problem go away).
>
> # lspci
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Caicos XTX [Radeon HD 8490 / R5 235X OEM]
> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI
> Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]
>
> I had a working .config and one which did produce the crashes, so I narrowed
> the problem down to detect that the important difference was in the area of
> firmware loading (the working .config didn't have
> CONFIG_FW_LOADER_COMPRESS_XZ set, causing firmware loading for the
> card to fail). This was of course not the real problem, but it caused the card
> initialization to fail.
>
> I manually decompressed the firmware files one by one to see whether the
> problem would be in the decompressor or probably in the driver of the card.
>
> The last step without crash was:
>
> # dmesg | grep radeon
> [   10.106405] [drm] radeon kernel modesetting enabled.
> [   10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [   10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> 0x000000003FFFFFFF (1024M used)
> [   10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> 0x000000007FFFFFFF
> [   10.278255] [drm] radeon: 1024M of VRAM memory ready
> [   10.295828] [drm] radeon: 1024M of GTT memory ready.
> [   10.295867] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_pfp.bin succeeded
> [   10.330846] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_me.bin succeeded
> [   10.330858] radeon 0000:01:00.0: Direct firmware load for
> radeon/BTC_rlc.bin succeeded
> [   10.330870] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_mc.bin failed with error -2
> [   10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
> [   10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load
> firmware!
> [   10.405765] radeon 0000:01:00.0: Fatal error during GPU init
> [   10.432107] [drm] radeon: finishing device.
> [   10.439179] [drm] radeon: ttm finalized
> [   10.463203] radeon: probe of 0000:01:00.0 failed with error -2
>
> And with decompressing radeon/CAICOS_mc.bin I got:
>
> # dmesg | grep radeon
> [   10.266491] [drm] radeon kernel modesetting enabled.
> [   10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [   10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> 0x000000003FFFFFFF (1024M used)
> [   10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> 0x000000007FFFFFFF
> [   10.566946] [drm] radeon: 1024M of VRAM memory ready
> [   10.576891] [drm] radeon: 1024M of GTT memory ready.
> [   10.586971] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_pfp.bin succeeded
> [   10.611886] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_me.bin succeeded
> [   10.611909] radeon 0000:01:00.0: Direct firmware load for
> radeon/BTC_rlc.bin succeeded
> [   10.611938] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_mc.bin succeeded
> [   10.660599] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_smc.bin failed with error -2
> [   10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"

You also need to make sure CAICOS_smc.bin is available.

> [   10.661676] [drm] radeon: power management initialized
> [   10.713666] radeon 0000:01:00.0: Direct firmware load for
> radeon/SUMO_uvd.bin failed with error -2
> [   10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware
> "radeon/SUMO_uvd.bin"
> [   10.713669] radeon 0000:01:00.0: failed UVD (-2) init.

And SUMO_uvd.bin.

> [   10.714787] [drm] enabling PCIE gen 2 link speeds, disable with
> radeon.pcie_gen2=0
> [   10.809213] radeon 0000:01:00.0: WB enabled
> [   10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> 0x0000000040000c00
> [   10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> 0x0000000040000c0c
> [   10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> [   10.862154] radeon 0000:01:00.0: radeon: using MSI.
> [   10.871930] [drm] radeon: irq initialized.
> [   11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on
> minor 0
> [   11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a
> monitor but no|invalid EDID
> [   11.411370] fbcon: radeondrmfb (fb0) is primary device
> [   11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer
> device
> [   11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a
> monitor but no|invalid EDID
> [   11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a
> monitor but no|invalid EDID
> [   28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops
> radeon_audio_component_bind_ops [radeon])
> [   44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a
> monitor but no|invalid EDID
> [   44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a
> monitor but no|invalid EDID
>
> followed by a crash some seconds after the system was up.
>
> The crashes vary, but often the kernel accesses non-canonical addresses or
> tries to map illegal physical addresses. Sometimes the system is just hanging,
> either with softlockups or without any further signs of being alive.
>
> I can easily reproduce the problem, so any debug patches to narrow down the
> problem are welcome.

Some of the firmware required for proper operation is still missing.  Please
make sure those files are available as well.
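
To double-check which firmware files the driver could not load, something
like the following rough sketch could be run over the dmesg output.  It only
assumes the "Direct firmware load for ... succeeded / failed with error N"
message format visible in the log above; the names firmware_status and
missing_firmware are made up for this illustration and are not part of any
kernel tooling.  Error -2 corresponds to ENOENT, i.e. the file was not found.

```python
import re

# Matches firmware loader messages as they appear in the quoted log, e.g.
#   radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
# \s+ is used between tokens so that lines wrapped by a mail client still match.
FW_RE = re.compile(
    r"Direct firmware load for\s+(\S+)\s+(succeeded|failed with error -?\d+)"
)


def firmware_status(log_text):
    """Return {firmware_path: True/False}; a later attempt overrides an earlier one."""
    status = {}
    for match in FW_RE.finditer(log_text):
        status[match.group(1)] = match.group(2) == "succeeded"
    return status


def missing_firmware(log_text):
    """Return a sorted list of firmware paths whose load did not succeed."""
    return sorted(name for name, ok in firmware_status(log_text).items() if not ok)


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    for name in missing_firmware(sys.stdin.read()):
        print(name)
```

Fed the second log above, this should list radeon/CAICOS_smc.bin and
radeon/SUMO_uvd.bin, which are the two files pointed out inline.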

Alex


 

