[Xen-users] CUDA Nvidia GPU computing on Xen DomU
Hi All,

I'm in a crunch trying to deploy two GeForce RTX 2080 SUPER cards on one of my Xen DomU computing nodes. I was under the impression that GPU passthrough for CUDA computing is supported and well documented, until I tried to complete this exercise. I went up and down the official documentation

https://xenbits.xenproject.org/docs/4.13-testing/

as well as

https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough

My Xen Dom0 runs on Alpine Linux:

xen1:/etc# more alpine-release
3.11.3
xen1:/etc# uname -a
Linux xen1.int.autonsys.com 5.4.12-1-lts #2-Alpine SMP Thu, 16 Jan 2020 12:53:54 UTC x86_64 Linux

xen1:/boot# more /boot/extlinux.conf
# Generated by update-extlinux 6.04_pre1-r6
DEFAULT menu.c32
PROMPT 0
MENU TITLE Alpine/Linux Boot Menu
MENU HIDDEN
MENU AUTOBOOT Alpine will be booted automatically in # seconds.
TIMEOUT 30
LABEL xen-lts
MENU LABEL Xen + Linux lts
COM32 mboot.c32
APPEND xen.gz dom0_mem=16384M --- vmlinuz-lts root=UUID=f1d049ca-b639-4f14-8f31-162c471373b7 modules=sd-mod,usb-storage,ext4 nomodeset quiet rootfstype=ext4 --- initramfs-lts
LABEL lts
MENU LABEL Linux lts
LINUX vmlinuz-lts
INITRD initramfs-lts
APPEND root=UUID=f1d049ca-b639-4f14-8f31-162c471373b7 modules=sd-mod,usb-storage,ext4 nomodeset quiet rootfstype=ext4
MENU SEPARATOR

xen1:/boot# lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
03:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
03:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9 (rev a1)

My Xen DomU runs Springdale Linux:

root@springdale1$ more /etc/redhat-release
Springdale Linux release 7.7 (Verona)
root@springdale1$ uname -a
Linux springdale1.int.autonsys.com 3.10.0-1062.12.1.el7.x86_64 #1 SMP Wed Feb 5 07:15:42 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

I tried to set up GPU passthrough following https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough:

modprobe xen-pciback
xl pci-assignable-add 02:00.0
xl pci-assignable-add 02:00.1
xl pci-assignable-add 02:00.2
xl pci-assignable-add 03:00.0
xl pci-assignable-add 03:00.1
xl pci-assignable-add 03:00.2

xen1:~# xl pci-assignable-list
0000:02:00.2
0000:03:00.3
0000:03:00.1
0000:02:00.3
0000:02:00.1
0000:03:00.2

# Add this to config file. Nothing else
pci=['02:00.0','03:00.0']

xen1:/boot# more /etc/xen/my-guests/auto/springdale1.cfg
type = "hvm"
name="springdale-1"
vcpus=16
# memory=65536
memory=262144
# gfx_passthru=1
# pci=['02:00.0','02:00.1','02:00.2','02:00.3','03:00.0','03:00.1','03:00.2','03:00.3']
pci=['02:00.0','03:00.0']
# disk = [ '/dev/disk/by-uuid/63f65160-e3ee-4458-b5f1-8b5b9d934563,raw,xvda,rw'
disk=['/dev/sda,raw,xvda,rw']
vif=['mac=00:16:3e:10:5f:95, bridge=br0']
# on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"

root@springdale1$ lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation Device 1e81 (rev a1)
00:06.0 VGA compatible controller: NVIDIA Corporation Device 1e81 (rev a1)

This is already problematic.
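The assignment steps above add functions .0 to .2 of each card and put only the .0 functions in pci=, while the Dom0 lspci output shows four functions per card (VGA, audio, USB, UCSI serial bus), which is what the commented-out pci= line lists. As a minimal sketch, and assuming every function of a card is meant to go to the same guest, the xl pci-assignable-add commands and the matching pci= line can be generated from the Dom0 lspci output instead of being typed by hand. The helper names (functions_by_slot, assignable_commands, pci_config_line) are hypothetical, and the script only prints the commands for review; it does not run xl.

```python
"""Sketch: build xl passthrough commands for every PCI function of each GPU.

Assumption (not from the Xen docs): all functions of one card go to one guest.
The script only prints commands; it never calls xl itself.
"""
import re

# Dom0 `lspci | grep -i nvidia` output, copied from the post.
LSPCI = """\
02:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
03:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
03:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9 (rev a1)
"""

# Short lspci form: "bus:device.function" at the start of the line (no 0000: domain).
BDF = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2})\.([0-7])\s", re.IGNORECASE)


def functions_by_slot(lspci_text):
    """Map 'bus:device' to a sorted list of its 'bus:device.function' strings."""
    slots = {}
    for line in lspci_text.splitlines():
        match = BDF.match(line)
        if match:
            slot, func = match.groups()
            slots.setdefault(slot, []).append(f"{slot}.{func}")
    return {slot: sorted(funcs) for slot, funcs in slots.items()}


def assignable_commands(slots):
    """One `xl pci-assignable-add` command per function, in slot order."""
    return [f"xl pci-assignable-add {bdf}" for funcs in slots.values() for bdf in funcs]


def pci_config_line(slots):
    """The pci=[...] line for the guest config, listing every function."""
    quoted = ",".join(f"'{bdf}'" for funcs in slots.values() for bdf in funcs)
    return f"pci=[{quoted}]"


if __name__ == "__main__":
    slots = functions_by_slot(LSPCI)
    print("\n".join(assignable_commands(slots)))
    print(pci_config_line(slots))
```

With the lspci text above it prints eight xl pci-assignable-add lines followed by a pci= line identical to the commented-out one in springdale1.cfg. It expects lspci's default short address format; the 0000:-prefixed form printed by xl pci-assignable-list would need a different pattern.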
A normal, non-virtualized host reports the GPU cards differently, like this:

root@gpu19$ lspci | grep -i nvidia
18:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
18:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
18:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev a1)
18:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller (rev a1)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
3b:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
3b:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev a1)
3b:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
86:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
86:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev a1)
86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller (rev a1)
af:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
af:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
af:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev a1)
af:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller (rev a1)

Driver compilation and CUDA installation on the virtual host go through, but when I try to probe the card I get the following error:

root@springdale1$ nvidia-smi
Unable to determine the device handle for GPU 0000:00:05.0: Unknown Error

I do see these messages in the Xen DomU log files:

messages:Mar 2 00:31:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:39:32 springdale1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.44 Sun Dec 8 03:38:56 UTC 2019
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 1
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 1
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0

It seems that I need to figure out whether it is possible to pass a parameter to Xen that will hide the host ID from the guest. This is definitely possible on ESXi with a flag like hypervisor.cpuid.v0 = "FALSE":

https://devtalk.nvidia.com/default/topic/982322/linux/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/

Ideally, Xen Dom0 should pass the cards through completely to the DomU. Does anyone on this mailing list use CUDA on a Xen DomU? Could you please give me some hints? I am finding a few bits here and there on the Internet, but nothing coherent enough for an enterprise deployment.
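The repeated NVRM messages are easier to compare when collapsed into one line per GPU and error code. A minimal sketch, using a hypothetical summarize helper and only the Python standard library: it pulls the driver version and counts the "RmInitAdapter failed!" lines per guest PCI address from text like the excerpt above. In practice the input would be something like the output of grep NVRM over /var/log/messages. The script does not try to interpret the 0x23:0x56:515 code.

```python
"""Sketch: summarize NVIDIA RmInitAdapter failures per guest GPU from syslog text."""
import re
from collections import Counter

# The DomU log lines quoted in the post.
LOG = """\
messages:Mar 2 00:31:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:39:32 springdale1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.44 Sun Dec 8 03:38:56 UTC 2019
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 1
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 1
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x23:0x56:515)
messages:Mar 2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
"""

# "NVRM: GPU <domain:bus:dev.fn>: RmInitAdapter failed! (<code>)"
RM_FAIL = re.compile(
    r"NVRM: GPU ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): RmInitAdapter failed! \(([^)]+)\)"
)
# "NVRM: loading NVIDIA UNIX <arch> Kernel Module <version> ..."
DRIVER = re.compile(r"NVRM: loading NVIDIA UNIX \S+ Kernel Module (\S+)")


def summarize(log_text):
    """Return (driver version or None, Counter keyed by (gpu_address, error_code))."""
    version = None
    failures = Counter()
    for line in log_text.splitlines():
        match = DRIVER.search(line)
        if match:
            version = match.group(1)
            continue
        match = RM_FAIL.search(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
    return version, failures


if __name__ == "__main__":
    version, failures = summarize(LOG)
    print(f"driver: {version}")
    for (gpu, code), count in sorted(failures.items()):
        print(f"{gpu}  RmInitAdapter {code}  x{count}")
```

For the excerpt shown, it reports driver 440.44, four RmInitAdapter failures for 0000:00:05.0 and two for 0000:00:06.0, all with the same 0x23:0x56:515 code. The lowercase rm_init_adapter lines are deliberately not counted, since they follow each RmInitAdapter failure.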
Most Kind Regards,
Predrag Punosevac

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users