
Re: PCI-passthrough hvm guest not working after domu restart



On 26 Jun 2023 03:05, Peter Gimmemail wrote:
I am passing an NVIDIA A10 GPU through to a domu running in hvm mode.
nvidia-smi detects the GPU OK.

When I restart the VM, the GPU is no longer detected by nvidia-smi.

lspci output still includes the device even when nvidia-smi does not detect it.

Any ideas where to start debugging this?  What output should I include here
to assist?  Options I should try on the host/dom0?

I tried resetting a few things on the host (no difference):
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/rescan
echo 0 > /sys/bus/pci/devices/0000:06:00.0/power
echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset

I tried with Xen 4.17 and also with 4.18.3.  Same behavior on each.

root@gpu:~# lspci  | grep NVIDIA
00:05.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)

domu dmesg:
[    4.798048] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[    4.801411] cirrus 0000:00:03.0: [drm] drm_plane_enable_fb_damage_clips() not called
[    4.976379] Console: switching to colour frame buffer device 128x48
[    5.122610] cirrus 0000:00:03.0: [drm] fb0: cirrusdrmfb frame buffer device
[    5.247082] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[    5.656447] [drm] [nvidia-drm] [GPU ID 0x00000005] Loading driver
[    5.656451] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:05.0 on minor 1
[    5.656570] [drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
[    5.656572] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 2
[   94.161919] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[   94.328842] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[   94.329981] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
[   94.348750] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[   94.749557] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[   94.750732] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0

Thanks in advance for any guidance.
- Peter


I'll first ask a few generic questions, then give my experience.
Note that I'm only an average user, nothing more ;)

Did you try "poweroff" then "xl create" instead of rebooting the domU?
I remember Xen 4.11 had a bug like that when using passed-through adapters in the config file (the workaround was "on_reboot=destroy").
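If that old workaround is still relevant, the config-file side would look something like this (a sketch only; the domU name and BDF are made up, not taken from your setup):

```
# Illustrative hvm domU config fragment (xl.cfg).
name = "gpu"
type = "hvm"
pci = [ "06:00.0" ]
# Force a full destroy/create cycle instead of an in-place reboot,
# so the device goes through a clean deassign/reassign:
on_reboot = "destroy"
```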

How do you pass the GPU through? With what options?
Is the vgaarb stuff kicking in (dom0/domU dmesg)?

Does your card support FLReset (Function Level Reset)?
  root@dom0 # lspci -vvv -s 06:00.0 | grep -i flr
  (run as root; an unprivileged user doesn't get that info)
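A tiny sketch of that check, assuming the usual lspci capability-dump format ("FLReset+" means supported, "FLReset-" means not); check_flr is just an illustrative helper name:

```shell
# Sketch: read "lspci -vvv" output on stdin and report the FLR bit.
check_flr() {
    if grep -q 'FLReset+'; then
        echo "FLReset supported"
    else
        echo "FLReset not supported (or capability hidden)"
    fi
}

# On a real dom0 you would run, as root:
#   lspci -vvv -s 06:00.0 | check_flr
```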

Do you get error messages when you "xl create" the domU (like libxl__device_pci_reset)?
Do "xl dmesg" and the logs in /var/log/xen tell you anything useful?
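A minimal sketch for gathering those diagnostics in one go, if that helps when posting back; the log path is the usual default, adjust for your setup:

```shell
# Sketch: dump the hypervisor log and the tail of each Xen log file.
collect_xen_logs() {
    echo "=== xl dmesg ==="
    if command -v xl >/dev/null 2>&1; then
        xl dmesg | tail -n 50
    else
        echo "(xl not found)"
    fi
    for f in /var/log/xen/*.log; do
        [ -e "$f" ] || continue        # glob may not match anything
        echo "=== $f (last 20 lines) ==="
        tail -n 20 "$f"
    done
}
```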

Now my experience; maybe it will help.

With my AMD GPU (no FLReset), I have the same problem: *sometimes* a domU (re)boot does not pick up the GPU, usually following a domU crash. I work around it by running the same commands as you do, but sometimes I have to run them *several times* before it works again.

So:
  echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
  echo 1 > /sys/bus/pci/devices/0000:06:00.1/remove
  echo 1 > /sys/bus/pci/rescan
(the .1 is the audio function; I think your card doesn't have one)
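Since it sometimes takes several attempts, the remove/rescan dance could be wrapped in a small retry loop; a sketch, where DEV, the retry count, and the DRY_RUN guard are all illustrative (DRY_RUN=1, the default here, only prints the commands instead of touching sysfs):

```shell
# Sketch: retry remove/rescan until the device node reappears.
DEV="0000:06:00.0"
DRY_RUN="${DRY_RUN:-1}"

do_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        eval "$*"
    fi
}

pci_remove_rescan() {
    tries=0
    while [ "$tries" -lt 3 ]; do
        do_cmd "echo 1 > /sys/bus/pci/devices/$DEV/remove"
        do_cmd "echo 1 > /sys/bus/pci/rescan"
        # nothing real to verify in dry-run mode
        [ "$DRY_RUN" = "1" ] && return 0
        # success when the device node is back after the rescan
        [ -e "/sys/bus/pci/devices/$DEV" ] && return 0
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}
```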

I've never used "power" and "reset".
BTW, shouldn't you run "rescan" *after* trying "power"/"reset"?

Sometimes the above commands are not enough, and I have to remove the devices from the PCI assignable pool and then re-add them.

  xl pci-assignable-remove 06:00.0
  xl pci-assignable-remove 06:00.1
  (run sysfs commands again)
  xl pci-assignable-add 06:00.0
  xl pci-assignable-add 06:00.1
  (run sysfs commands again)
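That whole sequence could also be scripted; a sketch under the same assumptions (BDF is illustrative, xl is assumed to be in PATH, and DRY_RUN=1, the default here, just prints what would be executed):

```shell
# Sketch: cycle a device through the assignable pool, re-running the
# sysfs remove/rescan between steps.
BDF="06:00.0"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        eval "$*"
    fi
}

sysfs_cycle() {
    run "echo 1 > /sys/bus/pci/devices/0000:$BDF/remove"
    run "echo 1 > /sys/bus/pci/rescan"
}

pool_cycle() {
    run "xl pci-assignable-remove $BDF"
    sysfs_cycle
    run "xl pci-assignable-add $BDF"
    sysfs_cycle
}
```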

Sometimes I have to run pci-assignable-remove a few times before the devices actually leave the pool. Sometimes I cannot remove them from the pool at all (xl refuses), but even then the GPU eventually gets passed through again.

Sometimes a few rounds of "start/stop domU", "run sysfs commands" and "remove/add to the PCI pool" are needed.

What's strange in my case is that there's no "exact" solution to what looks like the same problem; as you've read, I used "sometimes" a lot.
So, nothing scientific really, just trial and error!
It feels like I have no clue and I'm "pushing all the buttons till it works", but it has always worked like that ^^ I've never had to reboot dom0 to re-assign the buggy GPU (though I should really read the logs and compare...).
But as usual, YMMV!


