[Xen-devel] [PATCH v5 0/8] vnuma introduction
v5 of the patch set is mostly intended to bring the conversation about vNUMA back to life and to state the intention of having it included in Xen 4.5 (along with dom0 support). The libxl part will be modified to align with Wei Liu's work. The vNUMA placement mechanism is still subject to discussion. Your comments are welcome.

vNUMA introduction
------------------

This series of patches introduces vNUMA topology awareness and provides the interfaces and data structures needed to enable vNUMA for PV guests. There is a plan to extend this support to dom0 and HVM domains.

vNUMA topology support must also be present in the PV guest kernel; the corresponding Linux patches should be applied.

Introduction
------------

A vNUMA topology is exposed to the PV guest to improve performance when running workloads on NUMA machines. The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on NUMA/UMA hosts and to map the vNUMA topology onto the physical NUMA topology in an optimal way.

Xen vNUMA support
-----------------

The current set of patches introduces a subop hypercall that is available to enlightened PV guests with the vNUMA patches applied. The domain structure was modified to hold the per-domain vNUMA topology, for use by other vNUMA-aware subsystems (e.g. ballooning).

libxc
-----

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA machines, performs the initial memory allocation on the physical NUMA nodes. This is implemented by utilizing the nodemap formed by automatic NUMA placement. Details are in patch #3.

libxl
-----

libxl provides a way to predefine the vNUMA topology in the VM config: the number of vnodes, the memory arrangement, the vcpu-to-vnode assignment, and the distance map.

PV guest
--------

As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA Linux patches should be applied, and NUMA support should be compiled into the kernel.
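To make the pieces above more concrete, here is a rough sketch of the kind of per-domain topology data the series deals with. This is a minimal illustration only; the struct name, field names, and types here are assumptions made for this sketch, not the exact definitions from the patches:

/* Illustrative per-domain vNUMA description (names and layout are
 * assumptions for this sketch, not the actual Xen structures). */
struct vnuma_topology_sketch {
    unsigned int nr_vnodes;        /* number of virtual NUMA nodes */
    unsigned int *vdistance;       /* nr_vnodes x nr_vnodes distance table */
    unsigned int *vcpu_to_vnode;   /* one vnode entry per vcpu */
    unsigned int *vnode_to_pnode;  /* placement of each vnode on a pnode */
    unsigned long *vmemrange;      /* per-vnode memory range boundaries */
};

A PV guest kernel with the vNUMA patches applied would retrieve such a topology through the new subop hypercall at boot and feed it into its NUMA initialization, instead of discovering the topology from firmware tables as on bare metal.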
Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on a h/w NUMA machine.

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN)     Node 0: 1416166
(XEN)     Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 2097152
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 0 (4096) MB
(XEN)         vnode 1 - pnode 0 (4096) MB
(XEN)         vnode 2 - pnode 1 (4096) MB
(XEN)         vnode 3 - pnode 1 (4096) MB
(XEN)     Domain vcpu to vnode:
(XEN)     0 1 2 3

dmesg on pv guest:

[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xffffffff]
[    0.000000]   node   1: [mem 0x100000000-0x1ffffffff]
[    0.000000]   node   2: [mem 0x200000000-0x2ffffffff]
[    0.000000]   node   3: [mem 0x300000000-0x3ffffffff]
[    0.000000] On node 0 totalpages: 1048479
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 1044480 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3

pv guest, numactl --hardware:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Comments: none of the configuration options above are valid, so the default values were used instead. Since the machine is a NUMA machine and no vcpu pinning is defined, the automatic NUMA node selection mechanism is used, and you can see above how the vnodes were split across the physical nodes.
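As a side note on the distances above, the table the guest ends up with is the built-in default: 10 for a node's distance to itself and 20 to any other node. A minimal sketch of constructing such a default table (an illustration only, not the actual libxl/Xen code):

#include <stdio.h>

#define NR_VNODES 4

int main(void)
{
    unsigned int dist[NR_VNODES][NR_VNODES];

    /* Default distances: 10 local, 20 remote, matching the
     * "node distances" table printed by numactl above. */
    for (int i = 0; i < NR_VNODES; i++)
        for (int j = 0; j < NR_VNODES; j++)
            dist[i][j] = (i == j) ? 10 : 20;

    for (int i = 0; i < NR_VNODES; i++) {
        for (int j = 0; j < NR_VNODES; j++)
            printf("%4u", dist[i][j]);
        printf("\n");
    }
    return 0;
}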
name = "null" vnodes = 4 #vnumamem = [3000, 1000] vdistance = [10, 40] #vnuma_vcpumap = [1, 0, 3, 2] vnuma_vnodemap = [1, 0, 1, 0] #vnuma_autoplacement = 1 e820_host = 1 guest boot: [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.12.0+ (assert@superpipe) (gcc version 4.7.2 (Debi an 4.7.2-5) ) #111 SMP Tue Dec 3 14:54:36 EST 2013 [ 0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk= xen sched_debug [ 0.000000] ACPI in unprivileged domain disabled [ 0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed [ 0.000000] 1-1 mapping on ac228->100000 [ 0.000000] Released 318936 pages of unused memory [ 0.000000] Set 343512 page(s) to 1-1 mapping [ 0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable [ 0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved [ 0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable [ 0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved [ 0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable [ 0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved [ 0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable [ 0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved [ 0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable [ 0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved [ 0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable [ 0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS [ 0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable [ 0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data [ 0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable [ 0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data [ 0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable [ 0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved [ 0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved [ 0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved [ 0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved [ 0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable [ 0.000000] bootconsole [xenboot0] enabled [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. 
guest boot:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.12.0+ (assert@superpipe) (gcc version 4.7.2 (Debian 4.7.2-5) ) #111 SMP Tue Dec 3 14:54:36 EST 2013
[    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
[    0.000000] 1-1 mapping on ac228->100000
[    0.000000] Released 318936 pages of unused memory
[    0.000000] Set 343512 page(s) to 1-1 mapping
[    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
[    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
[    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
[    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
[    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
[    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
[    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable
[    0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
[    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
[    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
[    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
[    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
[    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
[    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
[    0.000000] BRK [0x019bd000, 0x019bdfff] PGTABLE
[    0.000000] BRK [0x019be000, 0x019befff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
[    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
[    0.000000] BRK [0x019bf000, 0x019bffff] PGTABLE
[    0.000000] BRK [0x019c0000, 0x019c0fff] PGTABLE
[    0.000000] BRK [0x019c1000, 0x019c1fff] PGTABLE
[    0.000000] BRK [0x019c2000, 0x019c2fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
[    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
[    0.000000]  [mem 0x00100000-0xac227fff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=4
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x3e7fffff]
[    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
[    0.000000] Initmem setup node 1 [mem 0x3e800000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 2 [mem 0x7d000000-0x10f5dffff]
[    0.000000]   NODE_DATA [mem 0x10f5b9000-0x10f5dffff]
[    0.000000] Initmem setup node 3 [mem 0x10f800000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x3e7fffff]
[    0.000000]   node   1: [mem 0x3e800000-0x7cffffff]
[    0.000000]   node   2: [mem 0x7d000000-0xac227fff]
[    0.000000]   node   2: [mem 0x100000000-0x10f5dffff]
[    0.000000]   node   3: [mem 0x10f5e0000-0x14ddd7fff]
[    0.000000] On node 0 totalpages: 255903
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 3444 pages used for memmap
[    0.000000]   DMA32 zone: 251904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 256000
[    0.000000]   DMA32 zone: 3500 pages used for memmap
[    0.000000]   DMA32 zone: 256000 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 256008
[    0.000000]   DMA32 zone: 2640 pages used for memmap
[    0.000000]   DMA32 zone: 193064 pages, LIFO batch:31
[    0.000000]   Normal zone: 861 pages used for memmap
[    0.000000]   Normal zone: 62944 pages, LIFO batch:15
[    0.000000] On node 3 totalpages: 255992
[    0.000000]   Normal zone: 3500 pages used for memmap
[    0.000000]   Normal zone: 255992 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs

root@heatpipe:~# numactl --ha
available: 4 nodes (0-3)
node 0 cpus: 0 4
node 0 size: 977 MB
node 0 free: 947 MB
node 1 cpus: 1 5
node 1 size: 985 MB
node 1 free: 974 MB
node 2 cpus: 2 6
node 2 size: 985 MB
node 2 free: 973 MB
node 3 cpus: 3 7
node 3 size: 969 MB
node 3 free: 958 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                 Node 0          Node 1          Node 2          Node 3           Total
                --------------- --------------- --------------- --------------- ---------------
MemTotal                 977.14          985.50          985.44          969.91         3917.99

hypervisor: xl debug-keys u

(XEN) 'u' pressed -> dumping numa info (now-0x2A3:F7B8CB0F)
(XEN) Domain 2 (total: 1024000):
(XEN)     Node 0: 415468
(XEN)     Node 1: 608532
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 1 1000 MB, vcpus: 0 4
(XEN)         vnode 1 - pnode 0 1000 MB, vcpus: 1 5
(XEN)         vnode 2 - pnode 1 2341 MB, vcpus: 2 6
(XEN)         vnode 3 - pnode 0 999 MB, vcpus: 3 7

This size discrepancy is caused by the way the size is calculated from the guest pfns, as end - start: the hole in the guest memory map, roughly 1.3 GB in this case, is therefore included in the reported size of vnode 2.
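To make the arithmetic behind that discrepancy concrete, here is a small worked example using the addresses from the boot log above (a sketch working in addresses rather than pfns for simplicity):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* vnode 2 covers two usable ranges from the boot log, separated by
     * the hole below 4 GB that e820_host = 1 carries over from the host. */
    uint64_t start = 0x7d000000ULL;                  /* start of node 2's first range */
    uint64_t end   = 0x10f5e0000ULL;                 /* end of node 2's second range */
    uint64_t hole  = 0x100000000ULL - 0xac228000ULL; /* gap between the two ranges */

    printf("reported (end - start): %llu MB\n",
           (unsigned long long)((end - start) >> 20));          /* ~2341 MB */
    printf("actual memory:          %llu MB\n",
           (unsigned long long)((end - start - hole) >> 20));   /* ~1000 MB */
    return 0;
}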
3. Zero vNUMA configuration: every pv domain will have at least one vnuma node if no vnuma topology was specified.

pv config:

memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
#vnodes = 4
vnumamem = [3000, 1000]
vdistance = [10, 40]
vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
vnuma_autoplacement = 1
e820_host = 1

boot:

[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=1
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xac227fff]
[    0.000000]   node   0: [mem 0x100000000-0x14ddd7fff]

root@heatpipe:~# numactl --ha
maxn: 0
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 3918 MB
node 0 free: 3853 MB
node distances:
node   0
  0:  10

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                 Node 0           Total
                --------------- ---------------
MemTotal                3918.74         3918.74

hypervisor: xl debug-keys u

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 6787432):
(XEN)     Node 0: 3485706
(XEN)     Node 1: 3301726
(XEN) Domain 3 (total: 1024000):
(XEN)     Node 0: 512000
(XEN)     Node 1: 512000
(XEN)     Domain has 1 vnodes
(XEN)         vnode 0 - pnode any 5341 MB, vcpus: 0 1 2 3 4 5 6 7

Patchsets for Xen and Linux:

Linux patchset is available at:
git://gitorious.org/xenvnuma_v5/linuxvnuma_v5.git
https://git.gitorious.org/xenvnuma_v5/linuxvnuma_v5.git

Xen patchset is available at:
git://gitorious.org/xenvnuma_v5/xenvnuma_v5.git
https://git.gitorious.org/xenvnuma_v5/xenvnuma_v5.git

Issues:

An issue with automatic NUMA placement was found and resolved. A new issue has arisen with a recursive spinlock when changing NUMA protection; this is currently being investigated.

Elena Ufimtseva (1):
  add vnuma info for debug-key

 xen/arch/x86/numa.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

--
1.7.10.4

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel