Re: PCI-passthrough hvm guest not working after domu restart
On 26 Jun 2023 03:05, Peter Gimmemail wrote:

> I am passing an NVIDIA A10 GPU through to a domu running in hvm mode.
> nvidia-smi detects the GPU OK. I restart the VM, and the GPU is no
> longer detected by nvidia-smi. lspci output includes the device even
> when nvidia-smi does not detect it.
>
> Any ideas where I should start to debug this? What output should I
> include here to assist? Options I should try on the host/dom0?
>
> I tried resetting a few things on the host (no difference):
>
>   echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/remove
>   echo 1 > /sys/bus/pci/rescan
>   echo "0" > "/sys/bus/pci/devices/0000:06:00.0/power"
>   echo "1" > /sys/bus/pci/devices/0000\:06\:00.0/reset
>
> I tried with Xen 4.17, and also with 4.18.3. Same behavior on each.
>
> root@gpu:~# lspci | grep NVIDIA
> 00:05.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
> 00:06.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
>
> domu dmesg:
>
> [    4.798048] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
> [    4.801411] cirrus 0000:00:03.0: [drm] drm_plane_enable_fb_damage_clips() not called
> [    4.976379] Console: switching to colour frame buffer device 128x48
> [    5.122610] cirrus 0000:00:03.0: [drm] fb0: cirrusdrmfb frame buffer device
> [    5.247082] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
> [    5.656447] [drm] [nvidia-drm] [GPU ID 0x00000005] Loading driver
> [    5.656451] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:05.0 on minor 1
> [    5.656570] [drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
> [    5.656572] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 2
> [   94.161919] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
> [   94.328842] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
> [   94.329981] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
> [   94.348750] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
> [   94.749557] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
> [   94.750732] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
>
> Thanks in advance for any guidance.
>
> - Peter

I'll first ask a few generic questions, then give my experience. Note that
I'm only an average user, nothing more ;)

Did you try "poweroff" then "xl create" instead of rebooting the domU? I
remember Xen 4.11 had a bug like that when using PT adapters in the config
file (use "on_reboot=destroy" then; see the config fragment at the end of
this mail).

How do you PT the GPU? What options?

Is the vgaarb stuff kicking in (dom0/domU dmesg)?

Does your card have FLReset?

  root@dom0 # lspci -vvv -s 06:00.0 | grep -i flr

(run it as root, a user account doesn't get that info)

Do you get error messages when you "xl create" the domU (like
libxl__device_pci_reset)? Do "xl dmesg" and the logs in /var/log/xen tell
you something useful?

Now my experience, maybe it will help? With my AMD GPU (no FLReset), I
have the same problem: *sometimes* a domU (re)boot does not pick up the
GPU, usually following a domU crash. I solve this by running the same
commands as you do, but sometimes I have to run them *several times*
before it works again. So:

  echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/remove
  echo 1 > /sys/bus/pci/devices/0000\:06\:00.1/remove
  echo 1 > /sys/bus/pci/rescan

(the .1 is the sound card, but I think you don't have one)

I've never used "power" and "reset". BTW, shouldn't you run "rescan"
*after* trying "power"/"reset"?
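If I wanted to stop retyping these, I'd wrap them in something like the
sketch below (untested as a whole, just plain sysfs shell; the BDF and the
retry count are placeholders for your setup). It tries the reset before
the remove/rescan, in the order I suggested above:

  #!/bin/sh
  # Sketch only: try a function-level reset (if the device exposes one),
  # then detach the device and rescan the bus, until it shows up again.
  DEV=0000:06:00.0                # placeholder BDF, adjust to your GPU
  for i in 1 2 3; do              # arbitrary retry count
      if [ -f /sys/bus/pci/devices/$DEV/reset ]; then
          echo 1 > /sys/bus/pci/devices/$DEV/reset
      fi
      echo 1 > /sys/bus/pci/devices/$DEV/remove
      sleep 1
      echo 1 > /sys/bus/pci/rescan
      sleep 1
      # the device node reappears in sysfs once the rescan has found it
      if [ -e /sys/bus/pci/devices/$DEV ]; then
          break
      fi
  done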
Sometimes the above commands are not enough, and I have to remove then
re-add the devices from/to the PCI assignable pool:

  xl pci-assignable-remove 06:00.0
  xl pci-assignable-remove 06:00.1
  (run the sysfs commands again)
  xl pci-assignable-add 06:00.0
  xl pci-assignable-add 06:00.1
  (run the sysfs commands again)

Sometimes I have to run -remove a few times before they get out of the
pool. Sometimes I cannot even remove them from the pool (xl does not
agree), but even then, the GPU eventually gets PT again. Sometimes a few
rounds of "start/stop domU", "run sysfs commands" and "remove/add to the
PCI pool" are needed.

What's strange in my case is that there's no "exact" solution to what
looks like the same problem; as you've read, I used "sometimes" a lot. So,
nothing scientific really, just trial and error! It's like I have no clue
and I'm "pushing all the buttons till it works", but it has always worked
like that ^^ I've never had to reboot dom0 to re-assign the buggy GPU
(yeah, I should read the logs and compare ...). But as usual, YMMV!
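P.S. About the "on_reboot=destroy" workaround I mentioned above: the
relevant bits of a domU config would look roughly like this (a sketch, not
your actual file; option names are from xl.cfg, and "permissive=1" is only
an example knob, not a recommendation):

  # fragment of the domU .cfg
  type      = "hvm"
  pci       = [ "06:00.0" ]     # or "06:00.0,permissive=1" if needed
  on_reboot = "destroy"         # an in-guest reboot tears the domU down
                                # instead, so you "xl create" it again
                                # and the PT device gets re-assigned
                                # from scratch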