
Re: [Xen-devel] Some trouble to use NVIDIA CUDA with Xen


On Thu, 15 Aug 2013, Konrad Rzeszutek Wilk wrote:
to double check that it is working correctly).

I will try @weekend.

I tried. I do NOT have a solution, but I have questions at the END.


- enabled verbose debugging in the nvidia module ("make clean module DEFINES='-DDEBUG -DNV_MEM_LOGGER -DNV_DBG_MEM'" + "os-interface.c:cur_debuglevel = 0x0")
- added some more debug strings (additional tag "MX")
- attached the debug output ("dmesg | grep NVRM > out.txt")
- tested the CUDA 5.5 program "bandwidthTest" with nvdriver 319.37, linux 3.9.11-200.PAT1.fc18.x86_64, xen 4.2.2, GTX770 on pci2:0.0
- loaded module wb_to_wc.ko, but it does not help much


1) nv-xen.h - the functions are never called (they are probably meant for DomU)
2) memory debugging shows that the UC mark and WB unmark pairs work OK
   - look at out.txt ("egrep 'nv_alloc_pages:2481.*flags = 0x000[12]|nv_free_pages:2510.*flags = 0x000[12]' out.txt")
   - search for "nv_alloc_pages" and "cache_type" 1 (NV_MEMORY_UNCACHED) or 2
     - calls set_memory_array_uc() (==MX_AR_UC tag) or set_memory_uc() (==MX_UC tag)
     - the corresponding flags are set in the page structure - see "flags" (struct page) in the page_table dump
   - search for "nv_free_pages" and "flags = 0x0001xxxx" / "flags = 0x0002xxxx" (nvidia page)
     - calls set_memory_array_wb() (==MX_AR_WB tag) or set_memory_wb() (==MX_WB tag)
     - the corresponding flags are cleared in the page structure (see page_table before and after the *_WB tag)



1) Why, when "NV_MEMORY_WRITECOMBINED" is requested, is the memory allocated with set_memory*_uc() and NOT set_memory*_wc() (both NV_MEMORY_WRITECOMBINED and NV_MEMORY_UNCACHED are allocated as UC)?
(for example timestamp [ 4659.741768] in out.txt)

code (nv-vm.c:nv_alloc_system_pages()):
    if (!NV_ALLOC_MAPPING_CACHED(at->flags))
        nv_set_memory_type(at, NV_MEMORY_UNCACHED);

2) Why is the block allocated in "1)" (i.e. NV_MEMORY_WRITECOMBINED requested, but marked with set_memory*_uc()) then flagged as WC in nv-mmap.c:nv_kern_mmap()?
"vm_page_prot" is encoded MANUALLY in nv-mmap.c:nv_encode_caching()!
(for example timestamp [ 4659.902599] in out.txt)

code nv-mmap.c:nv_encode_caching():
    switch (cache_type)
            if ((nv_pat_mode != NV_PAT_MODE_DISABLED) &&
                    (memory_type != NV_MEMORY_TYPE_REGISTERS))
                pgprot_val(*prot) &= ~(_PAGE_PSE | _PAGE_PCD | _PAGE_PWT);
                *prot = __pgprot(pgprot_val(*prot) | _PAGE_PWT);

code nv-mmap.c:nv_kern_mmap():
        for (j = i; j < (i + pages); j++)
        {
            if (NV_REMAP_PAGE_RANGE(start, at->page_table[j]->phys_addr,
                    PAGE_SIZE, vma->vm_page_prot))
            {
                status = -EAGAIN;
                goto done;
            }
            start += PAGE_SIZE;
        }
(NV_REMAP_PAGE_RANGE() == remap_pfn_range())



1) Is it a problem when the same pages are flagged as UC in the kernel and mmapped to userspace as WC?

2) Is it OK to manually encode WC for "remap_pfn_range()" (is it remapped to a real Xen-aware PTE later? xen_pte_val?)?

WC is manually encoded as "_PAGE_PWT", i.e. it selects entry PAT1. In a non-Xen kernel PAT1 is mapped to memory type "01H" == "Write Combining (WC)", BUT in a Xen kernel PAT1 is mapped to "04H" == "Write Through (WT)".
A Xen kernel should use "_PAGE_PAT", i.e. select entry PAT4, which is mapped to memory type "01H" (xen rdmsr 0x277 == 50100070406).

(Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3A: System Programming Guide, Part 1, chapter 11.12)


Problem still persists:

When I use CUDA, the system becomes unstable and sometimes crashes.

[17037.717699] systemd-udevd[9160]: segfault at 18 ip 00007ff415c126d3 sp 00007fff742bfa50 error 4 in libc-2.16.so[7ff415b57000+1ad000]
[17037.863424] BUG: Bad rss-counter state mm:ffff880071b15180 idx:1 val:10
[17040.876791] systemd-udevd[9161]: segfault at 3f21200ed0 ip 0000003f21200ed0 sp 00007fff742bf968 error 14 in libnss_files-2.16.so[7ff4144d0000+c000]
[17040.898748] BUG: Bad rss-counter state mm:ffff880071b17100 idx:1 val:6
[17047.662793] bash[9191]: segfault at 10 ip 0000003f20e7d0dd sp 00007fff1ebd95d0 error 4 in libc-2.16.so[3f20e00000+1ad000]
[17047.821840] BUG: Bad rss-counter state mm:ffff880053cbb800 idx:1 val:487


Thanks for answers,

Martin Cerveny

Attachment: out.txt
Description: Text document

Xen-devel mailing list


