
Re: [Xen-devel] Some trouble to use NVIDIA CUDA with Xen



Hello.

Partial SUCCESS!

On Tue, 13 Aug 2013, Konrad Rzeszutek Wilk wrote:
> > > > NVRM: PAT configuration unsupported.
> > > Right, so there are couple of patches that can enable that back.
> > >
> > > You need to revert these two:
> > > 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > > c79c49826270b8b0061b2fca840fc3f013c8a78a
> > >
> > > And apply this patch:
> > >
> > > https://lkml.org/lkml/2012/2/10/229
> > >
> > > That should re-enable PAT.  Try that and please report back.
> >
> > I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
> > working due to incompatibilities with nvidia driver source code).
>
> Did you revert the other two git commits?

Yes, I double-checked (the combined patch for rpmbuild/SOURCE/ is attached, and the rpmbuild patch too).

# rdmsr 0x277
50100070406
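
For reference, the value above can be decoded entry by entry. This is a minimal sketch in C, assuming the standard IA32_PAT layout (eight entries, one per byte, memory type in the low three bits of each byte). The function names are illustrative and are not taken from the nvidia driver. Returning 0xf when no write-combining entry exists is an assumption borrowed from the sentinel tested in the nv-pat.c snippet.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Memory-type encodings used in IA32_PAT entries. */
enum {
    PAT_UC       = 0x00,  /* uncacheable */
    PAT_WC       = 0x01,  /* write-combining */
    PAT_WT       = 0x04,  /* write-through */
    PAT_WP       = 0x05,  /* write-protected */
    PAT_WB       = 0x06,  /* write-back */
    PAT_UC_MINUS = 0x07   /* uncacheable, overridable by MTRRs */
};

/* Return the memory type stored in PAT entry `index` (0..7). */
unsigned pat_entry(uint64_t msr, unsigned index)
{
    assert(index < 8);
    return (unsigned)((msr >> (index * 8)) & 0x07);
}

/* Return the first entry that holds write-combining, or 0xf if none
 * (0xf as "not found" is an assumption, see the text above). */
unsigned pat_wc_index(uint64_t msr)
{
    for (unsigned i = 0; i < 8; i++)
        if (pat_entry(msr, i) == PAT_WC)
            return i;
    return 0xf;
}

/* Human-readable name for a memory-type encoding. */
const char *pat_type_name(unsigned type)
{
    switch (type) {
    case PAT_UC:       return "UC";
    case PAT_WC:       return "WC";
    case PAT_WT:       return "WT";
    case PAT_WP:       return "WP";
    case PAT_WB:       return "WB";
    case PAT_UC_MINUS: return "UC-";
    default:           return "reserved";
    }
}

/* Print all eight entries, lowest entry first. */
void pat_dump(uint64_t msr)
{
    for (unsigned i = 0; i < 8; i++)
        printf("PA%u = %s\n", i, pat_type_name(pat_entry(msr, i)));
}
```

Calling pat_wc_index(0x50100070406ULL) gives 4: write-combining sits in entry 4, which is the index the Xen-specific branch of the driver checks for.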

I looked at the nvidia source code.

The error is on the nvidia side:

Snippet from /usr/src/nvidia-319.37/nv-pat.c
=================
....
#if defined(HAVE_NV_XEN) && defined(CONFIG_XEN) && defined(CONFIG_PARAVIRT)
    if (PAT_WC_index == 4)
        return NV_PAT_MODE_KERNEL;
#endif

    if (PAT_WC_index == 1)
        return NV_PAT_MODE_KERNEL;
    else if (PAT_WC_index != 0xf)
    {
        nv_printf(NV_DBG_ERRORS,
            "NVRM: PAT configuration unsupported.\n");
        return NV_PAT_MODE_DISABLED;
    }
....
===================

HAVE_NV_XEN is NOT defined.

HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in /usr/src/nvidia-319.37/conftest.h), and that header seems to have been removed from the distributed source (around the 19x.x.x nvidia driver versions).
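
The effect of the missing define can be modelled in a few lines. This is only a sketch of the branches visible in the snippet above, not the driver's code: the enum, the function name and the xen_branch_compiled flag are hypothetical, the flag stands in for the whole #if condition, and the 0xf path is elided ("....") in the snippet, so it is returned as "not shown" rather than guessed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's NV_PAT_MODE_* values. */
enum sketch_pat_mode {
    SKETCH_PAT_MODE_DISABLED,
    SKETCH_PAT_MODE_KERNEL,
    SKETCH_PAT_MODE_NOT_SHOWN  /* path elided in the quoted snippet */
};

/* Mirror the visible branches of the nv-pat.c snippet.
 * wc_index:            PAT entry holding write-combining, or 0xf.
 * xen_branch_compiled: true when the #if around the "== 4" test holds,
 *                      i.e. HAVE_NV_XEN, CONFIG_XEN and CONFIG_PARAVIRT
 *                      are all defined. */
enum sketch_pat_mode sketch_pat_mode(unsigned wc_index,
                                     bool xen_branch_compiled)
{
    assert(wc_index < 8 || wc_index == 0xf);

    if (xen_branch_compiled && wc_index == 4)
        return SKETCH_PAT_MODE_KERNEL;

    if (wc_index == 1)
        return SKETCH_PAT_MODE_KERNEL;
    else if (wc_index != 0xf)
        /* This is where "NVRM: PAT configuration unsupported." is logged. */
        return SKETCH_PAT_MODE_DISABLED;

    return SKETCH_PAT_MODE_NOT_SHOWN;
}
```

With write-combining in entry 4, sketch_pat_mode(4, false) ends in the DISABLED branch, while sketch_pat_mode(4, true) returns the KERNEL mode, which is the difference that restoring "nv-xen.h" makes.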

OK, I downloaded an older version of "nv-xen.h" from the net to /usr/src/nvidia-319.37/nv-xen.h and recompiled the driver
("cd /usr/src/nvidia-319.37; make clean module; rmmod nvidia;
cp nvidia.ko /lib/modules/3.9.11-200.PAT.fc18.x86_64/extra;
modprobe nvidia").

The error "NVRM: PAT configuration unsupported." is no longer shown (as expected).

Most CUDA demo programs WORK without error!!!

But some programs hang the PCIe bus and the kernel:

[55799.433278] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:21
[55800.139090] abrt-handle-eve[10175]: segfault at 18 ip 0000003f20ebb6d3 sp 00007fffa7e6ef00 error 4 in libc-2.16.so[3f20e00000+1ad000]
[55800.375196] BUG: Bad rss-counter state mm:ffff8800723e2680 idx:1 val:5
[55845.124636] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:8
[55962.186275] BUG: Bad rss-counter state mm:ffff880074a27800 idx:0 val:5
[55962.192811] BUG: Bad rss-counter state mm:ffff880074a27800 idx:1 val:795
[55962.262019] traps: abrt-handle-eve[10287] general protection ip:3f20ebb7a6 sp:7fffbd613410 error:0 in libc-2.16.so[3f20e00000+1ad000]
[55962.394789] BUG: Bad rss-counter state mm:ffff8800723e0380 idx:1 val:13
[55981.779246] NVRM: GPU at 0000:02:00: GPU-fe328712-3546-53fe-149d-3d78e7aa64d5
[55981.786391] NVRM: Xid (0000:02:00): 38, 0001 00000000 00000000 00000000 00000000 00000000
[55982.407300] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
[55982.425810] NVRM: os_pci_init_handle: invalid context!
....
[57200.089052] BUG: soft lockup - CPU#0 stuck for 22s! [dct8x8:10290]
....
[56008.089053] RIP: e030:[<ffffffffa15061a8>]  [<ffffffffa15061a8>] _nv012574rm+0x4/0x51 [nvidia]
....
[56008.089053] Call Trace:
[56008.089053]  [<ffffffffa15056f3>] ? _nv012271rm+0xbe/0x1c6 [nvidia]
[56008.089053]  [<ffffffffa17cddf3>] ? _nv008298rm+0x26/0xb2 [nvidia]
[56008.089053]  [<ffffffffa17ec512>] ? _nv003411rm+0x47dd/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17ec5b2>] ? _nv003411rm+0x487d/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17f49cd>] ? _nv014043rm+0xfcd/0x1b30 [nvidia]
[56008.089053]  [<ffffffffa17ec820>] ? _nv003411rm+0x4aeb/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17ec94d>] ? _nv003411rm+0x4c18/0xb184 [nvidia]
[56008.089053]  [<ffffffffa18a203f>] ? _nv010926rm+0x28/0xeb [nvidia]
[56008.089053]  [<ffffffffa18a1df4>] ? _nv011116rm+0x162/0x385 [nvidia]
[56008.089053]  [<ffffffffa14da4bd>] ? _nv008434rm+0xed/0x176 [nvidia]
[56008.089053]  [<ffffffffa1929132>] ? _nv013320rm+0x5e/0xb4 [nvidia]
[56008.089053]  [<ffffffffa192f5ea>] ? _nv013321rm+0xc76/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa192f975>] ? _nv013321rm+0x1001/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa192f400>] ? _nv013321rm+0xa8c/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa17f1657>] ? _nv003411rm+0x9922/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17efbe6>] ? _nv003411rm+0x7eb1/0xb184 [nvidia]
[56008.089053]  [<ffffffffa19c35ef>] ? _nv000747rm+0x2a3/0x2f2 [nvidia]
[56008.089053]  [<ffffffffa19bcee2>] ? rm_disable_adapter+0x74/0x107 [nvidia]
[56008.089053]  [<ffffffffa19da600>] ? nv_check_pci_config_space+0x1d0/0x2e0 [nvidia]
[56008.089053]  [<ffffffff8108808e>] ? down+0x2e/0x50
[56008.089053]  [<ffffffffa19dc987>] ? nv_kern_close+0x147/0x440 [nvidia]
[56008.089053]  [<ffffffff8119ed3c>] ? __fput+0xec/0x240
[56008.089053]  [<ffffffff8119ee9e>] ? ____fput+0xe/0x10
[56008.089053]  [<ffffffff8107f6d5>] ? task_work_run+0xc5/0xe0
[56008.089053]  [<ffffffff81064a5e>] ? do_exit+0x2ae/0xa30
[56008.089053]  [<ffffffff8108ffcb>] ? finish_task_switch+0x4b/0xe0
[56008.089053]  [<ffffffff8106526f>] ? do_group_exit+0x3f/0xa0
[56008.089053]  [<ffffffff810652e7>] ? sys_exit_group+0x17/0x20
[56008.089053]  [<ffffffff81667f99>] ? system_call_fastpath+0x16/0x1b

M.C>

Attachment: linux-3.9.fc18-PAT_rpmbuild.patch
Description: Text document

Attachment: linux-3.9.fc18-PAT.patch
Description: Text document
