
Re: [PATCH 00/16] make system memory API available for common code

To: BALATON Zoltan <balaton@xxxxxxxxxx>
From: Pierrick Bouvier <pierrick.bouvier@xxxxxxxxxx>
Date: Mon, 10 Mar 2025 09:56:36 -0700
Cc: qemu-devel@xxxxxxxxxx, qemu-ppc@xxxxxxxxxx, Alistair Francis <alistair.francis@xxxxxxx>, Richard Henderson <richard.henderson@xxxxxxxxxx>, Harsh Prateek Bora <harshpb@xxxxxxxxxxxxx>, alex.bennee@xxxxxxxxxx, Palmer Dabbelt <palmer@xxxxxxxxxxx>, Daniel Henrique Barboza <danielhb413@xxxxxxxxx>, kvm@xxxxxxxxxxxxxxx, Peter Xu <peterx@xxxxxxxxxx>, Nicholas Piggin <npiggin@xxxxxxxxx>, Liu Zhiwei <zhiwei_liu@xxxxxxxxxxxxxxxxx>, David Hildenbrand <david@xxxxxxxxxx>, Weiwei Li <liwei1518@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, "Edgar E. Iglesias" <edgar.iglesias@xxxxxxxxx>, Philippe Mathieu-Daudé <philmd@xxxxxxxxxx>, Anthony PERARD <anthony@xxxxxxxxxxxxxx>, Yoshinori Sato <ysato@xxxxxxxxxxxxxxxxxxxx>, manos.pitsidianakis@xxxxxxxxxx, qemu-riscv@xxxxxxxxxx, Paolo Bonzini <pbonzini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>
Delivery-date: Mon, 10 Mar 2025 16:57:06 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 3/10/25 09:28, Pierrick Bouvier wrote:

Hi Zoltan,

On 3/10/25 06:23, BALATON Zoltan wrote:

On Sun, 9 Mar 2025, Pierrick Bouvier wrote:

The main goal of this series is to be able to call any memory ld/st function
from code that is *not* target dependent.


Why is that needed?


This series belongs to the "single binary" topic, where we are trying to
build a single QEMU binary with all architectures embedded.

To achieve that, we need to have every single compilation unit compiled
only once, to be able to link a binary without any symbol conflict.

A consequence of that is target specific code (in terms of code relying
on target specific macros) needs to be converted to common code,
checking at runtime properties of the target we run. We are tackling
various places in QEMU codebase at the same time, which can be confusing
for the community members.
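
The conversion described above can be sketched as follows. This is only an
illustration under stated assumptions, not code from the series: TargetInfo,
EXAMPLE_TARGET_BIG_ENDIAN and the ld32_* helpers are hypothetical names made
up for this example and are not QEMU APIs. Endianness stands in for any
property that a target specific macro would otherwise fix at build time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-target description; not a real QEMU type. */
typedef struct {
    bool big_endian;
} TargetInfo;

/* Byte-wise loads, independent of both host and target. */
static uint32_t ld32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t ld32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * "Before": target specific code. The (hypothetical) macro is fixed at
 * build time, so this unit has to be compiled once per target.
 */
#ifdef EXAMPLE_TARGET_BIG_ENDIAN
static inline uint32_t ld32_target(const uint8_t *p) { return ld32_be(p); }
#else
static inline uint32_t ld32_target(const uint8_t *p) { return ld32_le(p); }
#endif

/*
 * "After": common code. The property is checked at runtime, so the unit is
 * compiled once, at the cost of one branch per call.
 */
uint32_t ld32_common(const TargetInfo *t, const uint8_t *p)
{
    return t->big_endian ? ld32_be(p) : ld32_le(p);
}

/* Usage: the same compiled function serves both kinds of target. */
void example_usage(void)
{
    const uint8_t buf[4] = { 0x12, 0x34, 0x56, 0x78 };
    const TargetInfo le_target = { .big_endian = false };
    const TargetInfo be_target = { .big_endian = true };

    assert(ld32_common(&le_target, buf) == 0x78563412u);
    assert(ld32_common(&be_target, buf) == 0x12345678u);
}
```

The extra branch in ld32_common() is the kind of runtime check whose cost is
discussed further down in this thread.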

This series takes care of system memory related functions and associated
compilation units in system/.

As a positive side effect, we can
turn related system compilation units into common code.


Are there any negative side effects? In particular have you done any
performance benchmarking to see if this causes a measurable slow down?
Such as with the STREAM benchmark:
https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure

Maybe it would be good to have some performance tests similar to
functional tests that could be run like the CI tests to detect such
performance changes. People report that QEMU is getting slower and slower
with each release. Maybe it could be a GSoC project to make such tests but
maybe we're too late for that.


I agree with you, and it's something we have mentioned during our
"internal" conversations. Testing performance with existing functional
tests would already be a first good step. However, given the poor
reliability we have on our CI runners, I think it's a bit doomed.

Ideally, every QEMU release cycle should have a performance measurement
window to detect potential sources of regressions.

To answer your specific question, I am trying first to get a review
on the approach taken. We can always optimize in next series version, in
case we identify it's a big deal to introduce a branch for every memory
related function call.

In all cases, transforming code relying on compile time
optimization/dead code elimination through defines to runtime checks
will *always* have an impact, even though it should be minimal in most
cases. But the maintenance and compilation time benefits, as well as
the perspectives it opens (single binary, heterogeneous emulation, use
QEMU as a library) are worth it IMHO.

Regards,
BALATON Zoltan


Regards,
Pierrick

As a side note, we recently did some work around performance analysis (for aarch64), as you can see here [1]. In the end, QEMU performance depends (roughly in this order) on:

1. quality of code generated by TCG
2. helper code to implement instructions
3. mmu emulation

Other state-of-the-art translators that exist are faster (fex, box64), mainly by enhancing 1 and relying on various tricks to avoid translating some library calls. But those translators are host/target specific, and the ratio of instructions generated (vs target ones read) is much lower than QEMU. In the experimentation listed in the blog, I observed that for qemu-system-aarch64, we have an average expansion factor of around 18 (1 guest insn translates to 18 host ones).

For users seeing performance decreases, beyond the QEMU code changes, adding new target instructions may add new helpers, which may be called by the stack people use, and they can sometimes observe a slower behaviour.

There are probably some other low hanging fruits for other target architectures.


[1] https://www.linaro.org/blog/qemu-a-tale-of-performance-analysis/

