
Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest



On Thu, Jul 16, 2015 at 12:32:42PM +0200, Dario Faggioli wrote:
> Hey,
> 
> This started on IRC, but it's actually appropriate to have the
> conversation here.
> 
> I just discovered an issue with vNUMA when PV guests are used. In fact,
> creating a 4-vCPU PV guest, and arranging things so that all 4 vCPUs
> should be busy, I see this:
> 
> root@Zhaman:~# xl vcpu-list test
> Name                                ID  VCPU   CPU State   Time(s) Affinity 
> (Hard / Soft)
> test                                 4     0    5   r--    1481.9  all / 0-7
> test                                 4     1    2   r--    1479.4  all / 0-7
> test                                 4     2   15   -b-       7.5  all / 8-15
> test                                 4     3   10   -b-    1324.8  all / 8-15
> 
> Checking inside the guest confirms that *everything* runs on vCPUs 0
> and 1. However, using schedtool or taskset, I can force tasks to
> execute on vCPUs 2 and 3.
> 
> Inspecting the guest's dmesg, I've seen this:
> 
> [    0.128416] ------------[ cut here ]------------
> [    0.128416] WARNING: CPU: 2 PID: 0 at ../arch/x86/kernel/smpboot.c:317 
> topology_sane.isra.2+0x74/0x88()
> [    0.128416] sched: CPU #2's smt-sibling CPU #0 is not on the same node! 
> [node: 1 != 0]. Ignoring dependency.
> [    0.128416] Modules linked in:
> [    0.128416] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.19.0+ #1
> [    0.128416]  0000000000000009 ffff88001ee3bdd0 ffffffff81657c7b 
> ffffffff810bbd2c
> [    0.128416]  ffff88001ee3be20 ffff88001ee3be10 ffffffff81081510 
> ffff88001ee3bea0
> [    0.128416]  ffffffff8103aa02 ffff88003ea0a001 0000000000000000 
> ffff88001f20a040
> [    0.128416] Call Trace:
> [    0.128416]  [<ffffffff81657c7b>] dump_stack+0x4f/0x7b
> [    0.128416]  [<ffffffff810bbd2c>] ? up+0x39/0x3e
> [    0.128416]  [<ffffffff81081510>] warn_slowpath_common+0xa1/0xbb
> [    0.128416]  [<ffffffff8103aa02>] ? topology_sane.isra.2+0x74/0x88
> [    0.128416]  [<ffffffff81081570>] warn_slowpath_fmt+0x46/0x48
> [    0.128416]  [<ffffffff8101eeb1>] ? __cpuid.constprop.0+0x15/0x19
> [    0.128416]  [<ffffffff8103aa02>] topology_sane.isra.2+0x74/0x88
> [    0.128416]  [<ffffffff8103ac70>] set_cpu_sibling_map+0x21a/0x444
> [    0.128416]  [<ffffffff81056ac3>] ? numa_add_cpu+0x98/0x9f
> [    0.128416]  [<ffffffff8100b8f2>] cpu_bringup+0x63/0xa8
> [    0.128416]  [<ffffffff8100b945>] cpu_bringup_and_idle+0xe/0x1a
> [    0.128416] ---[ end trace 95bff1aef57ee1b1 ]---
> 
> So, basically, Linux is complaining that we're trying to put two
> vCPUs that look to be SMT siblings on different NUMA nodes. And, yes,
> I think this is quite disruptive for the Linux scheduler's internal
> logic.
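> 
> For reference, the check that fires is topology_sane() in
> arch/x86/kernel/smpboot.c (line 317 in this tree, per the trace
> above). In a 3.19-era kernel it looks roughly like this (quoted from
> memory, so the exact shape may differ slightly):
> 
>     static bool topology_sane(struct cpuinfo_x86 *c,
>                               struct cpuinfo_x86 *o, const char *name)
>     {
>             int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
> 
>             /* Warn (once) and return false if two CPUs that are
>              * presented as siblings live on different NUMA nodes. */
>             return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
>                     "sched: CPU #%d's %s-sibling CPU #%d is not on the "
>                     "same node! [node: %d != %d]. Ignoring dependency.\n",
>                     cpu1, name, cpu2,
>                     cpu_to_node(cpu1), cpu_to_node(cpu2));
>     }
> 
> set_cpu_sibling_map() only links two CPUs as siblings when this check
> passes, hence the "Ignoring dependency" in the log above.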
> 
> The vnuma bits of the guest config are these:
> 
>  vnuma = [ [ "pnode=0","size=512","vcpus=0-1","vdistances=10,20"  ],
>            [ "pnode=1","size=512","vcpus=2-3","vdistances=20,10"  ] ]
> 
> From inside the guest, the topology looks to be like this:
> 
> root@test:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1
> node 0 size: 475 MB
> node 0 free: 382 MB
> node 1 cpus: 2 3
> node 1 size: 495 MB
> node 1 free: 475 MB
> node distances:
> node   0   1 
>   0:  10  10 
>   1:  20  10
> 
> root@test:~# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list 
> 0-1
> root@test:~# cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list 
> 0-3
> root@test:~# cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list 
> 2-3
> root@test:~# cat /sys/devices/system/cpu/cpu2/topology/core_siblings_list 
> 0-3
> 
> So the complaint during boot seems to be about 'core_siblings' (which
> is not what I expected, but perhaps I misremember the meaning of
> "core_siblings" vs. "thread_siblings" vs. smt-siblings in Linux; I'll
> double check).
> 
> Anyway, is there anything we can do to fix or work around this?
> 


IIRC Linux already consumes some bits returned by CPUID anyway; would
it be possible to generate a "dummy" topology layout in the Linux
kernel according to the vNUMA information? I had this idea long ago
but wasn't quite sure whether it was a dumb one.
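
Something like the following is what I have in mind; this is purely a
sketch of the idea (the function is made up and the right hook point
would still need finding; the cpuinfo_x86 fields and smp_num_siblings
are real though). Make every vCPU a single-threaded core and derive
the package id from the vNUMA node, so no two "siblings" can ever
land on different nodes:

    /* Hypothetical sketch, not a patch: flatten the PV guest's CPU
     * topology so it is always consistent with the vNUMA layout. */
    static void __init xen_pv_fake_topology(void)
    {
            int cpu;

            for_each_possible_cpu(cpu) {
                    struct cpuinfo_x86 *c = &cpu_data(cpu);

                    /* One "package" per vNUMA node... */
                    c->phys_proc_id = cpu_to_node(cpu);
                    /* ...each vCPU being its own core... */
                    c->cpu_core_id = cpu;
            }
            /* ...and no SMT threads at all. */
            smp_num_siblings = 1;
    }

With that, set_cpu_sibling_map() should never see siblings crossing a
node boundary, whatever the underlying CPUID leaves claim.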

Wei.

> Regards,
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel