Xen project Mailing List

Re: AMD EPYC VM to VM performance investigation

To: Stefano Stabellini <sstabellini@xxxxxxxxxx>

From: David Morel <david.morel@xxxxxxxxxx>

Date: Wed, 10 Jan 2024 11:21:06 +0100

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, xenia.ragiadakou@xxxxxxx, andrew.cooper3@xxxxxxxxxx, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>

Delivery-date: Wed, 10 Jan 2024 10:21:23 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, Jan 04, 2024 at 16:39:46PM, Stefano Stabellini wrote:
> On Thu, 4 Jan 2024, David Morel wrote:
> > Hello,
> >
> > We have a customer and multiple users on our forum seeing performance
> > that seems quite low relative to the general performance of the
> > machines on AMD EPYC Zen hosts when doing VM to VM networking.
>
> By "VM to VM networking" I take you mean VM-to-VM on the same host using
> PV network?
>
> > Below you'll find a write up about what we had a look at and what's in
> > the TODO on our side, but in the meantime we would like to ask here for
> > some feedback, suggestions and possible leads.
> >
> > To sum up, the VM to VM performance on Zen generation server CPUs seems
> > quite low, and only minimally scaling when adding threads. They are
> > outperformed by a 10 year old AMD desktop cpu and a pretty low frequency
> > XEON bronze from 2014.
> > CPU usage does not seem to be the limiting factor, as neither the VM
> > threads nor the kthreads on the host seem to go to 100% cpu usage.
> >
> > As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0
> > kernel 4.19. I did try a Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc,
> > but as it was borrowed from a colleague I was unsure of the setup, so
> > although it was actually worse than on my other test setups, I would
> > not consider that a complete validation that the issue is also present
> > on recent Xen versions.
>
> I think it might be difficult to triage this if you are working on a
> Xen/Linux version that is so different from upstream

I ran some tests on a Xen 4.13.5 with a dom0 in 6.6.10, and on an XCP-ng on
the same machine. The performances are similar, a few percent better on the
recent Xen, but still pretty low for such a machine and similar to the other
EPYC we looked at.

> > 1. Has anybody else noticed a similar behavior?
> > 2. Has anybody done any kind of investigation about it beside us?
> > 3. Any insight and suggestions of other points to look at would be
> >    welcome :)
> >
> > And now the lengthy part about what we tested, I tried to make it
> > shorter and more legible than a full report…
> >
> > Investigated
> > ------------
> >
> > - Bench various cpu with iperf2 (iperf3 is not actually multithreaded):
> >   - amd fx8320e, xeon 3106: not impacted.
> >   - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one
> >     scales a bit more than zen1, 2 and 3.
> >   - ryzen 5950x, ryzen 7600: performances should likely be better than
> >     observed results, but still way better than epycs, and scaling
> >     nicely with more threads.
> > - Bench with tinymembench[1]: performances were as expected and didn't
> >   show issues with rep movsb as discussed in this article[2] and
> >   issue[3]. Which makes sense as it looks like this issue is related to
> >   ERMS support which is not present on Zen1 and 2 where the issue has
> >   been raised.
> > - Bench skb allocation with a small kernel module measuring cycles:
> >   actually same or lower on epyc than on the xeon with higher frequency,
> >   so it can be considered faster and likely not related to our issue.
> > - mitigations: we tried disabling what can be disabled through boot
> >   parameters, for xen, dom0 and guests, but this made no difference.
> > - disabling AVX: Zen cpus before zen4 are known to limit boost and cpu
> >   scaling when doing heavy AVX load on one core. There was no reason to
> >   think this was related, but it was a quick test and as expected had no
> >   effect.
> > - localhost iperf bench on dom0 and guests: we noticed that on other
> >   machines host/guest with 1 thread are almost 1:1, and with 4 threads
> >   guests are generally not scaling as well. On epyc machines, host
> >   tests were significantly slower than guests both with 1 and 4
> >   threads, and a first investigation of profiling didn't help finding a
> >   cause yet. More in the profiling and TODO.
>
> Wait, are you saying that the localhost iperf benchmark is faster in a
> VM compared to host ("host" I take means baremetal Linux without a
> hypervisor) ? Maybe you meant the other way around?
>
> > - cpu load: top/htop/xentop all seem to indicate that machines are not
> >   under full load. Queue allocations on dom0 for VIF are the default (1
> >   per vcpu) and seem to be all used when traffic is running, but at a
> >   percentage below 100% per core/thread.
> > - pinning: manually pinning dom0 and guests to the same node and
> >   avoiding sharing cpu "threads" between host and guests gives a minimal
> >   increase of a few percents, but nothing drastic. Note, we do not know
> >   about the ccd/ccx/node mapping on these cpus, so we are not sure all
> >   memory accesses are "local".
> > - sched weight: playing with sched weight to prioritize dom0 did not
> >   make a difference either, which makes sense as the systems are not
> >   under full load.
> > - cpu scaling: it is unlikely the core of the issue, but indeed the cpu
> >   scaling does not take advantage of the boost, never going above the
> >   base clock of these cpus. It also seems that fewer cores than the
> >   number of working kthreads/vcpus are going to base clock, which may be
> >   normal in regard to the system not being fully loaded, to be defined.
> >   - QUESTION: is the powernow support in xen cpufreq implementation
> >     sufficient for zen cpus? Recent kernels/distributions use
> >     acpi_cpufreq and can use amd_pstate or even amd_pstate_epp. More
> >     concerning than the turbo boost could be the handling of package
> >     power limitation used in Zen CPUs, which could prevent even all
> >     cores from reaching base clock, to be checked…
> >
> > Profiling
> > ---------
> >
> > We profiled iperf on dom0 and guests on epyc, older amd desktop, and
> > xeon machines and gathered profiling traces, but analysis is still
> > ongoing.
> >
> > - localhost:
> >   Client and server were profiled both on dom0 and guest runs for a
> >   xeon, an old FX and a zen platform, to analyze the discrepancy shown
> >   by the localhost tests earlier. It shows we spend a larger chunk of
> >   time in the copyout() or copyin() functions on epyc and fx. This is
> >   likely related to the use of copy_user_generic_string() on epyc (zen1)
> >   and old FX, whereas xeon uses copy_user_enhanced_fast_string(), as it
> >   has ERMS support. But on the same machine, guests are going way
> >   faster, and the implementation of copy_user_generic_string() is the
> >   same between the dom0 and guests, so this is likely related to other
> >   changes in kernel and userland, and not only to these functions.
> >   Therefore it likely isn't directly linked to the issue.
> >
> > - vm to vm: server, client & dom0 -> profiling traces to be analysed.
> >
> > TODO
> > ----
> >
> > - More analysis of profiling traces in VM to VM case
> > - X2APIC (not enabled on the machines and setup we are using)
> > - Profiling at xen level / hypercalls
> > - Tests on a clean install of a newer Xen version
> > - Dig some more on cpu scaling, likely not the root of the problem but
> >   there could be some gain to make.
> >
> > [1] https://github.com/ssvb/tinymembench
> > [2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
> > [3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
> >
> > --
> > David Morel
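
A note on the pinning item in the message above, where the ccd/ccx/node mapping was not known: one possible way to approximate it is to group logical CPUs by the L3 cache they share, since on Zen parts each CCX has its own L3. The sketch below is only an illustration and was not part of the investigation described in the thread. It assumes the Linux sysfs cache layout (/sys/devices/system/cpu/cpuN/cache/indexM/ with "level" and "shared_cpu_list" files) and assumes the machine is booted into bare-metal Linux, because dom0 under Xen may not see the physical cache topology. The names parse_cpu_list and l3_groups are hypothetical helpers.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sets of logical CPUs that share an L3 cache.

    Assumption: every cpuN/cache/indexM directory exposes 'level' and
    'shared_cpu_list'. Directories missing either file are skipped.
    """
    groups = set()
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        for index_dir in (cpu_dir / "cache").glob("index[0-9]*"):
            level_file = index_dir / "level"
            shared_file = index_dir / "shared_cpu_list"
            if not (level_file.is_file() and shared_file.is_file()):
                continue
            if level_file.read_text().strip() == "3":
                # Identical L3 domains collapse into one entry in the set.
                groups.add(tuple(parse_cpu_list(shared_file.read_text())))
    return sorted(list(group) for group in groups)


if __name__ == "__main__":
    for number, cpus in enumerate(l3_groups()):
        print(f"L3 group {number}: CPUs {cpus}")
```

Each printed group is a candidate set of CPUs to keep together when pinning. Whether those CPU numbers match the ones Xen uses for the same host is a further assumption that would need to be checked against the hypervisor's own topology report.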
