
Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen



On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote:
> On 11/06/2024 7:23 pm, Oleksii K. wrote:
> > On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
> > > On 30/05/2024 7:22 pm, Oleksii K. wrote:
> > > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> > > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > > > > > Signed-off-by: Oleksii Kurochko
> > > > > > <oleksii.kurochko@xxxxxxxxx>
> > > > > > Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
> > > > > This patch looks like it can go in independently?  Or does it
> > > > > depend
> > > > > on
> > > > > having bitops.h working in practice?
> > > > > 
> > > > > However, one very strong suggestion...
> > > > > 
> > > > > 
> > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h
> > > > > > b/xen/arch/riscv/include/asm/mm.h
> > > > > > index 07c7a0abba..cc4a07a71c 100644
> > > > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > > > @@ -3,11 +3,246 @@
> > > > > > <snip>
> > > > > > +/* PDX of the first page in the frame table. */
> > > > > > +extern unsigned long frametable_base_pdx;
> > > > > > +
> > > > > > +/* Convert between machine frame numbers and page-info structures. */
> > > > > > +#define mfn_to_page(mfn)                                        \
> > > > > > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > > > > > +#define page_to_mfn(pg)                                         \
> > > > > > +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> > > > > Do yourself a favour and not introduce frametable_base_pdx to
> > > > > begin
> > > > > with.
> > > > > 
> > > > > Every RISC-V board I can find has things starting from 0 in
> > > > > physical
> > > > > address space, with RAM starting immediately after.
> > > > I checked the Linux kernel and grepped there:
> > > >    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
> > > >    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33:     memory@40000000 {
> > > >    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:     memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:      ddrc_cache: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:   ddrc_cache_hi: memory@1040000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:      ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:      ddrc_cache_hi: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:   ddrc_cache_hi: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:   ddrc_cache_hi: memory@1040000000 {
> > > >    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0 {
> > > >    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:      memory@0 {
> > > >    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:    memory@80000000 {
> > > >    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000 {
> > > >    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
> > > >    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
> > > >    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram: memory@80000000 {
> > > > Based on that, the majority of the boards supported by the
> > > > Linux kernel have RAM starting at a non-zero physical address.
> > > > Am I confusing something?
> > > > 
> > > > > Taking the microchip board as an example, RAM actually starts
> > > > > at 0x80000000,
> > > > Today we had a conversation with a person from SiFive in the
> > > > xen-devel channel, and he mentioned that they are using "starfive
> > > > visionfive2 and sifive unleashed platforms", which, based on the
> > > > grep above, have RAM not at address 0.
> > > > 
> > > > Also, QEMU uses 0x80000000.
> > > > 
> > > > >  which means that having frametable_base_pdx, and assuming it
> > > > > does get set to 0x80000 (which isn't even a certainty, given
> > > > > that I think you'll need struct pages covering the PLICs), then
> > > > > what you are trading off is:
> > > > > 
> > > > > * Saving 32k of virtual address space only (no need to even
> > > > > allocate memory for this range of the frametable), by
> > > > > * Having an extra memory load and add/sub in every page <-> mfn
> > > > > conversion, which is a screaming hotpath all over Xen.
> > > > > 
> > > > > It's a terribly short-sighted tradeoff.
> > > > > 
> > > > > 32k of VA space might be worth saving in a 32bit build (I
> > > > > personally wouldn't - especially as there's no need to share
> > > > > Xen's VA space with guests, given no PV guests on ARM/RISC-V),
> > > > > but it's absolutely not worth it in a 64bit build with TB of VA
> > > > > space available.
> > > > > 
> > > > > Even if we do find a board with the first interesting thing in
> > > > > the frametable starting sufficiently far away from 0 that it
> > > > > might be worth considering this slide, then it should still be
> > > > > Kconfig-able in a similar way to PDX_COMPRESSION.
> > > > I find your tradeoffs reasonable, but I don't understand how
> > > > this will work if RAM does not start from 0, as the frametable
> > > > address and the RAM address are linked.
> > > > I tried to look at the PDX_COMPRESSION config and couldn't find
> > > > any "slide" there. Could you please clarify this for me?
> > > > If we use this "slide", would it help avoid the tradeoffs
> > > > mentioned above?
> > > > 
> > > > One more question: if we decide to go without
> > > > frametable_base_pdx, would it be sufficient to simply remove
> > > > mentions of it from the code (at least, for now)?
> > > There is a relationship between system/host physical addresses
> > > (what Xen calls maddr/mfn) and the frametable.  The frametable has
> > > one entry per mfn.
> > > 
> > > In the most simple case, there's a 1:1 relationship.  i.e.
> > > frametable[0]
> > > = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple,
> > > and
> > > very easy to calculate (page_to_mfn()/mfn_to_page()).
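
[Editorial illustration of the simple 1:1 case described above; this is a
self-contained sketch with made-up names and a toy frametable, not the real
Xen macros (which are typesafe and use mfn_t):]

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Toy stand-in for Xen's struct page_info; 64 bytes per entry. */
struct page_info { uint64_t pad[8]; };

static struct page_info frame_table_storage[0x1000];
static struct page_info *frame_table = frame_table_storage;

/* 1:1 case: frame_table[N] describes the page at maddr N << PAGE_SHIFT. */
static struct page_info *mfn_to_page(unsigned long mfn)
{
    return frame_table + mfn;
}

static unsigned long page_to_mfn(const struct page_info *pg)
{
    return (unsigned long)(pg - frame_table);
}
```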
> > > 
> > > The frametable is one big array.  It starts at a compile-time
> > > fixed
> > > address, and needs to be long enough to cover everything
> > > interesting
> > > in
> > > memory.  Therefore it potentially takes a large amount of virtual
> > > address space.
> > > 
> > > However, only interesting maddrs need to have data in the
> > > frametable,
> > > so
> > > it's fine for the backing RAM to be sparsely allocated/mapped in
> > > the
> > > frametable virtual addresses.
> > > 
> > > For 64bit, that's really all you need, because there's always far
> > > more
> > > virtual address space than physical RAM in the system, even when
> > > you're
> > > looking at TB-scale giant servers.
> > > 
> > > 
> > > For 32bit, virtual address space is a limited resource.  (Also to
> > > an
> > > extent, 64bit x86 with PV guests because we give 98% of the
> > > virtual
> > > address space to the guest kernel to use.)
> > > 
> > > There are two tricks to reduce the virtual address space used,
> > > but
> > > they
> > > both cost performance in fastpaths.
> > > 
> > > 1) PDX Compression.
> > > 
> > > PDX compression makes a non-linear mfn <-> maddr mapping.  This is
> > > for a usecase where you've got multiple RAM banks which are
> > > separated by a large distance (and evenly spaced); then you can
> > > "compress" a single range of 0's out of the middle of the
> > > system/host physical address space.
> > > 
> > > The cost is that all page <-> mfn conversions need to read two
> > > masks
> > > and
> > > a shift-count from variables in memory, to split/shift/recombine
> > > the
> > > address bits.
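
[An editorial sketch of the split/shift/recombine step described above. The
variable names loosely mirror those in Xen's xen/include/xen/pdx.h, but the
hole position and all values here are invented for illustration: a run of 4
always-zero bits at bit 20 of the page frame number is squeezed out.]

```c
/* Hypothetical layout: PFN bits [20, 24) are always zero on this "board". */
static unsigned long pfn_pdx_bottom_mask = (1UL << 20) - 1;   /* bits kept low  */
static unsigned long pfn_top_mask        = ~((1UL << 24) - 1);/* bits kept high */
static unsigned int  pfn_pdx_hole_shift  = 4;                 /* hole width     */

/* Two mask reads and a shift on every conversion - the cost Andrew notes. */
static unsigned long pfn_to_pdx(unsigned long pfn)
{
    return (pfn & pfn_pdx_bottom_mask) |
           ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
}

static unsigned long pdx_to_pfn(unsigned long pdx)
{
    return (pdx & pfn_pdx_bottom_mask) |
           ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
}
```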
> > > 
> > > 2) A slide, which is frametable_base_pdx in this context.
> > > 
> > > When there's a big gap between 0 and the start of something
> > > interesting,
> > > you could chop out that range by just subtracting base_pdx.  What
> > > qualifies as "big" is subjective, but Qemu starting at 128M
> > > certainly
> > > does not qualify as big enough to warrant frametable_base_pdx.
> > > 
> > > This is less expensive than PDX compression.  It only adds one
> > > memory
> > > read to the fastpath, but it also doesn't save as much virtual
> > > address
> > > space as PDX compression.
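
[A minimal editorial sketch of the slide, with a hypothetical base value;
the point is the single extra memory read of frametable_base_pdx in each
conversion:]

```c
/* Hypothetical: RAM starts at 0x80000000, so the first interesting PDX is
 * 0x80000000 >> 12 == 0x80000.  The [0, base) range of the frametable is
 * simply chopped off rather than left unbacked. */
static unsigned long frametable_base_pdx = 0x80000;

static unsigned long pdx_to_frametable_idx(unsigned long pdx)
{
    return pdx - frametable_base_pdx;   /* extra load + sub in the fastpath */
}

static unsigned long frametable_idx_to_pdx(unsigned long idx)
{
    return idx + frametable_base_pdx;
}
```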
> > > 
> > > 
> > > When virtual address space is a major constraint (32 bit builds),
> > > both
> > > of these techniques are worth doing.  But when there's no
> > > constraint
> > > on
> > > virtual address space (64 bit builds), there's no reason to use
> > > either;
> > > and the performance will definitely improve as a result.
> > Thanks for such a good explanation.
> > 
> > For RISC-V, the PDX config is disabled, as I haven't seen multiple
> > RAM banks on boards which have the hypervisor extension. Thereby
> > mfn_to_pdx() and pdx_to_mfn() do nothing. The same goes for
> > frametable_base_pdx: with PDX disabled, it is just an offset (or a
> > slide).
> > 
> > IIUC, you meant that it makes sense to map the frametable starting
> > at physical address 0 rather than at the start of RAM, so that we
> > don't need the +/- frametable_base_pdx. The price for that is losing
> > VA space for the range from 0 to the RAM start address.
> > 
> > Right now, we are trying to support three boards with the following
> > RAM start addresses:
> > 1. 0x8000_0000 - QEMU and the SiFive board
> > 2. 0x40_0000_0000 - the Microchip board
> > 
> > So if we map the frametable to 0 (not the RAM start), we will lose:
> > 1. 0x8_0000 (number of page entries to cover the range
> > [0, 0x8000_0000)) * 64 (size of struct page_info) = 32 MB
> > 2. 0x400_0000 (number of page entries to cover the range
> > [0, 0x40_0000_0000)) * 64 (size of struct page_info) = 4 GB
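
[A quick editorial check of the arithmetic above, assuming 4K pages and a
64-byte struct page_info as stated in the thread:]

```c
#define PAGE_SHIFT 12

/* Frametable bytes "wasted" below the RAM start if frame_table[0] maps
 * maddr 0; 64 == sizeof(struct page_info) as assumed in the discussion. */
static unsigned long wasted_frametable_bytes(unsigned long ram_start)
{
    return (ram_start >> PAGE_SHIFT) * 64;
}
```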
> > 
> > In terms of available virtual address space for RV64, we can
> > consider both options acceptable.
> 
> For Qemu and SiFive, 32M is definitely not worth worrying about.
> 
> I personally wouldn't worry about Microchip either.  That's 4G of 1T
> VA
> space (assuming Sv39).
> 
> > [OPTION 1] If we accept losing 4 GB of VA, then we could
> > implement mfn_to_page() and page_to_mfn() in the following way:
> > ```
> >    diff --git a/xen/arch/riscv/include/asm/mm.h
> >    b/xen/arch/riscv/include/asm/mm.h
> >    index cc4a07a71c..fdac7e0646 100644
> >    --- a/xen/arch/riscv/include/asm/mm.h
> >    +++ b/xen/arch/riscv/include/asm/mm.h
> >    @@ -107,14 +107,11 @@ struct page_info
> >    
> >     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
> >    
> >    -/* PDX of the first page in the frame table. */
> >    -extern unsigned long frametable_base_pdx;
> >    -
> >     /* Convert between machine frame numbers and page-info structures. */
> >     #define mfn_to_page(mfn)                                        \
> >    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> >    +    (frame_table + mfn_to_pdx(mfn))
> >     #define page_to_mfn(pg)                                         \
> >    -    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> >    +    pdx_to_mfn((unsigned long)((pg) - frame_table))
> >    
> >     static inline void *page_to_virt(const struct page_info *pg)
> >     {
> >    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> >    index 9c0fd80588..8f6dbdc699 100644
> >    --- a/xen/arch/riscv/mm.c
> >    +++ b/xen/arch/riscv/mm.c
> >    @@ -15,7 +15,6 @@
> >     #include <asm/page.h>
> >     #include <asm/processor.h>
> >    
> >    -unsigned long __ro_after_init frametable_base_pdx;
> >     unsigned long __ro_after_init frametable_virt_end;
> >    
> >     struct mmu_desc {
> > ```
> 
> I firmly recommend option 1, especially at this point.
Jan, as you gave your Acked-by before, would you mind if mfn_to_page()
and page_to_mfn() are defined as mentioned above (Option 1)?

~ Oleksii

> 
> If specific boards really have a problem with losing 4G of VA space,
> then option 2 can be added easily at a later point.
> 
> That said, I'd think carefully about doing option 2.  Even
> subtracting a constant - which is far better than subtracting a
> variable - is still extra overhead in fastpaths.  Option 2 needs
> careful consideration on a board-by-board basis.
> 
> ~Andrew




 

