Re: AMD EPYC VM to VM performance investigation
On Thu, Jan 04, 2024 at 16:39:46PM, Stefano Stabellini wrote:
> On Thu, 4 Jan 2024, David Morel wrote:
> > Hello,
> >
> > We have a customer and multiple users on our forum reporting performance
> > that seems quite low, relative to the general capability of the machines,
> > on AMD EPYC Zen hosts when doing VM to VM networking.
>
> By "VM to VM networking" I take you mean VM-to-VM on the same host using
> PV network?
>
>
> > Below you'll find a write up about what we had a look at and what's in the
> > TODO on our side, but in the meantime we would like to ask here for some
> > feedback, suggestions and possible leads.
> >
> > To sum up, the VM to VM performance on Zen-generation server CPUs seems
> > quite low, and it scales only minimally when adding threads. These machines
> > are outperformed by a 10-year-old AMD desktop CPU and a pretty low-frequency
> > Xeon Bronze from 2014. CPU usage does not seem to be the limiting factor,
> > as neither the VM threads nor the kthreads on the host seem to reach 100%
> > CPU usage.
> >
> > As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0
> > kernel 4.19. I did try Xen 4.18-rc2 and kernel 6.1.56 on a Zen 4 EPYC, but
> > as it was borrowed from a colleague I was unsure of the setup, so although
> > it was actually worse than on my other test setups, I would not consider
> > that a complete validation that the issue is also present on recent Xen
> > versions.
>
> I think it might be difficult to triage this if you are working on a
> Xen/Linux version that is so different from upstream
I ran some tests on Xen 4.13.5 with a 6.6.10 dom0, and on an XCP-ng install
on the same machine; the performance is similar, a few percent better on the
newer setup, but still pretty low for such a machine and in line with the
other EPYCs we looked at.
>
> > 1. Has anybody else noticed a similar behavior?
> > 2. Has anybody done any kind of investigation about it besides us?
> > 3. Any insights and suggestions of other points to look at would be
> >    welcome :)
> >
> > And now the lengthy part about what we tested; I tried to make it shorter
> > and more legible than a full report…
> >
> > Investigated
> > ------------
> >
> > - Bench various CPUs with iperf2 (iperf3 is not actually multithreaded);
> >   a sketch of how we drove the runs follows after this list:
> >   - AMD FX-8320E, Xeon 3106: not impacted.
> >   - EPYC 7451, 7443, 7302P, 7313P, 9124: impacted, but the Zen 4 one scales
> >     a bit more than Zen 1, 2 and 3.
> >   - Ryzen 5950X, Ryzen 7600: performance is likely lower than it should be,
> >     but still way better than the EPYCs, and it scales nicely with more
> >     threads.
> > - Bench with tinymembench[1]: performance was as expected and didn't show
> >   issues with rep movsb as discussed in this article[2] and issue[3]. That
> >   makes sense, as that problem looks related to ERMS support, which is not
> >   present on Zen 1 and 2 where the issue was raised.
> > - Bench skb allocation with a small kernel module measuring cycles: cycle
> >   counts are actually the same or lower on EPYC than on the higher-frequency
> >   Xeon, so allocation can be considered faster there and is likely not
> >   related to our issue.
> > - mitigations: we tried disabling what can be disabled through boot
> >   parameters, for Xen, dom0 and guests alike, but this made no difference.
> > - disabling AVX: Zen CPUs before Zen 4 are known to limit boost and CPU
> >   scaling when doing heavy AVX load on one core; there was no reason to
> >   think this was related, but it was a quick test and, as expected, it had
> >   no effect.
> > - localhost iperf bench on dom0 and guests: we noticed that on other
> >   machines, host and guest results with 1 thread are almost 1:1, while with
> >   4 threads guests generally do not scale as well as the host. On EPYC
> >   machines, host tests were significantly slower than guest tests with both
> >   1 and 4 threads; a first pass at profiling didn't find a cause yet. More
> >   in the profiling section and the TODO.
>
> Wait, are you saying that the localhost iperf benchmark is faster in a
> VM compared to the host ("host" I take to mean baremetal Linux without a
> hypervisor)? Maybe you meant the other way around?
>
>
> > - cpu load: top/htop/xentop all seem to indicate that the machines are not
> >   under full load; VIF queue allocations on dom0 are at the default (1 per
> >   vCPU) and all seem to be in use when traffic is running, but each stays
> >   below 100% usage per core/thread.
> > - pinning: manually pinning dom0 and guests to the same node and avoiding
> >   sharing CPU "threads" between host and guests gives a minimal increase of
> >   a few percent, but nothing drastic (a sketch of the pinning we did follows
> >   after this list). Note that we do not know the CCD/CCX/node mapping on
> >   these CPUs, so we are not sure all memory accesses are "local".
> > - sched weight: playing with the scheduler weight to prioritize dom0 did
> >   not make a difference either, which makes sense as the systems are not
> >   under full load.
> > - cpu scaling: it is unlikely to be the core of the issue, but the CPU
> >   scaling indeed does not take advantage of boost, never going above the
> >   base clock of these CPUs (a sketch of one way to watch this from dom0
> >   follows after this list). It also seems that fewer cores than the number
> >   of working kthreads/vCPUs reach base clock, which may be normal given the
> >   system is not fully loaded; to be defined.
> >   - QUESTION: is the powernow support in Xen's cpufreq implementation
> >     sufficient for Zen CPUs? Recent kernels/distributions use acpi_cpufreq
> >     and can use amd_pstate or even amd_pstate_epp. More concerning than
> >     turbo boost could be the handling of the package power limit used in
> >     Zen CPUs, which could prevent even all cores from reaching base clock;
> >     to be checked…
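> >
> > For reference, a minimal sketch of the kind of wrapper we used to drive the
> > iperf2 runs; the server address and thread counts here are just
> > illustrative, not our exact setup:
> >
> >     #!/usr/bin/env python3
> >     # Drive iperf2 (not iperf3) with an increasing number of parallel
> >     # streams and print the aggregate throughput line of each run.
> >     import subprocess
> >
> >     SERVER = "192.0.2.10"   # placeholder IP of the iperf2 server VM
> >
> >     for threads in (1, 2, 4, 8):
> >         cmd = ["iperf", "-c", SERVER, "-P", str(threads),
> >                "-t", "30", "-f", "g"]
> >         result = subprocess.run(cmd, capture_output=True,
> >                                 text=True, check=True)
> >         lines = result.stdout.splitlines()
> >         # With -P > 1 iperf2 prints an aggregate "[SUM]" line at the end.
> >         summary = [l for l in lines if "[SUM]" in l] or lines[-1:]
> >         print(f"{threads} thread(s): {summary[-1].strip()}")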
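> >
> > And a rough sketch of the pinning and the clock check mentioned above; the
> > domain name, the pCPU range and the xenpm output parsing are assumptions
> > for illustration, and the exact xenpm labels vary between Xen versions:
> >
> >     #!/usr/bin/env python3
> >     # Pin a guest's vCPUs 1:1 onto pCPUs of one node, then dump what Xen's
> >     # cpufreq driver reports for those pCPUs.
> >     import subprocess
> >
> >     DOMAIN = "iperf-guest-1"   # placeholder guest name
> >     PCPUS = range(0, 8)        # placeholder: pCPUs of the target node
> >
> >     # xl vcpu-pin <domain> <vcpu> <pcpu>
> >     for vcpu, pcpu in enumerate(PCPUS):
> >         subprocess.run(["xl", "vcpu-pin", DOMAIN, str(vcpu), str(pcpu)],
> >                        check=True)
> >
> >     # xenpm get-cpufreq-para <cpu> reports the frequencies Xen sees; in
> >     # our runs no busy core ever reported more than base clock.
> >     for pcpu in PCPUS:
> >         out = subprocess.run(["xenpm", "get-cpufreq-para", str(pcpu)],
> >                              capture_output=True, text=True).stdout
> >         freq_lines = [l for l in out.splitlines() if "frequency" in l]
> >         print(f"pCPU {pcpu}:")
> >         for line in freq_lines:
> >             print("   ", line.strip())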
> >
> > Profiling
> > ---------
> >
> > We profiled iperf on dom0 and guests on EPYC, older AMD desktop, and Xeon
> > machines and gathered profiling traces, but analysis is still ongoing.
> >
> > - localhost:
> >   Client and server were profiled on both dom0 and guest runs for a Xeon,
> >   an old FX and a Zen platform, to analyze the discrepancy shown by the
> >   localhost tests earlier. It shows we spend a larger chunk of time in the
> >   copyout() or copyin() functions on EPYC and FX. This is likely related to
> >   the use of copy_user_generic_string() on EPYC (Zen 1) and the old FX,
> >   whereas the Xeon uses copy_user_enhanced_fast_string(), as it has ERMS
> >   support (a quick way to confirm ERMS support is sketched after this
> >   list). But on the same machine, guests are going way faster, and the
> >   implementation of copy_user_generic_string() is the same between dom0 and
> >   guests, so this is likely related to other changes in kernel and userland
> >   and not only to these functions. Therefore it likely isn't directly
> >   linked to the issue.
> >
> > - vm to vm: server, client & dom0 -> profiling traces to be analysed.
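> >
> > For anyone wanting to check the ERMS situation on their own hardware, a
> > minimal check (plain /proc/cpuinfo, nothing Xen-specific, works the same in
> > dom0 or a guest):
> >
> >     #!/usr/bin/env python3
> >     # Report whether the CPU advertises ERMS ("enhanced rep movsb/stosb"),
> >     # the feature that lets the kernel pick the fast-string copy routine.
> >     with open("/proc/cpuinfo") as f:
> >         flags_line = next((l for l in f if l.startswith("flags")), "")
> >     if "erms" in flags_line.split():
> >         print("ERMS supported (fast-string user copies available)")
> >     else:
> >         print("no ERMS (generic string copy path)")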
> >
> > TODO
> > ----
> >
> > - More analysis of the profiling traces in the VM to VM case
> > - X2APIC (not enabled on the machines and setup we are using; a quick
> >   check is sketched after this list)
> > - Profiling at the Xen level / hypercalls
> > - Tests on a clean install of a newer Xen version
> > - Dig some more into cpu scaling; likely not the root of the problem, but
> >   there could be some gains to be made.
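> >
> > For the X2APIC item, one quick way to check it; the exact wording of the
> > boot message differs between Xen versions, so this just greps the
> > hypervisor console ring for the keyword:
> >
> >     #!/usr/bin/env python3
> >     # Look for x2APIC-related lines in the Xen console ring via `xl dmesg`.
> >     import subprocess
> >
> >     dmesg = subprocess.run(["xl", "dmesg"], capture_output=True,
> >                            text=True).stdout
> >     hits = [l for l in dmesg.splitlines() if "x2apic" in l.lower()]
> >     print("\n".join(hits) if hits
> >           else "no x2APIC mention in the console ring")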
> >
> > [1] https://github.com/ssvb/tinymembench
> > [2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
> > [3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
> >
> > --
> > David Morel