
PCI-passthrough hvm guest not working after domu restart



I am passing an NVIDIA A10 GPU through to a domu running in hvm mode. Initially, nvidia-smi detects the GPU fine.

After I restart the VM, nvidia-smi no longer detects the GPU.

lspci still lists the device even when nvidia-smi does not detect it.

Any ideas on where I should start debugging this? What output should I include here to help? Are there options I should try on the host/dom0?

I have tried resetting a few things on the host, which made no difference:
echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/remove
echo 1 > /sys/bus/pci/rescan
echo "0" > "/sys/bus/pci/devices/0000:06:00.0/power"
echo "1" > /sys/bus/pci/devices/0000\:06\:00.0/reset
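
As a sanity check on the reset step above, here is a minimal sketch of how one might confirm that dom0 exposes a reset interface for the device at all. The function name pci_reset_supported and the SYSFS_ROOT override are hypothetical, introduced only for illustration. It rests on the assumption that the kernel only creates the sysfs reset file when it knows of a reset method for the device.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xen or the kernel tooling).
# Prints whether the sysfs "reset" file exists for a given PCI address.
# SYSFS_ROOT is an illustrative override so the check can be tried
# against a scratch directory; it defaults to the real /sys.
pci_reset_supported() {
    bdf="$1"
    root="${SYSFS_ROOT:-/sys}"
    if [ -e "$root/bus/pci/devices/$bdf/reset" ]; then
        echo "reset-available"
    else
        echo "reset-unavailable"
    fi
}
```

On dom0 this would be called as pci_reset_supported 0000:06:00.0. A result of reset-unavailable would suggest the echo to the reset file above never had a chance of taking effect.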

I have tried Xen 4.17 and also 4.18.3, with the same behavior on each.

root@gpu:~# lspci  | grep NVIDIA
00:05.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)

domu dmesg:
[    4.798048] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[    4.801411] cirrus 0000:00:03.0: [drm] drm_plane_enable_fb_damage_clips() not called
[    4.976379] Console: switching to colour frame buffer device 128x48
[    5.122610] cirrus 0000:00:03.0: [drm] fb0: cirrusdrmfb frame buffer device
[    5.247082] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[    5.656447] [drm] [nvidia-drm] [GPU ID 0x00000005] Loading driver
[    5.656451] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:05.0 on minor 1
[    5.656570] [drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
[    5.656572] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 2
[   94.161919] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[   94.328842] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[   94.329981] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
[   94.348750] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[   94.749557] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[   94.750732] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
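
To make it easier to compare a working boot with a failing one, the failure lines can be pulled out of the log with a small filter. This is only a sketch: the function name nvrm_init_failures is hypothetical, and the pattern is based solely on the two message forms visible in the excerpt above (which only shows failures for 0000:00:05.0), so other driver versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints only
# the NVIDIA driver (NVRM) adapter-initialisation failure lines.
# The pattern covers the two message forms seen in the log above.
nvrm_init_failures() {
    grep -E 'NVRM: .*(RmInitAdapter|rm_init_adapter) failed'
}
```

Inside the guest it would be used as dmesg | nvrm_init_failures, once after a clean start and once after a restart.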

Thanks in advance for any guidance.
- Peter

 

