Xen project Mailing List

Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

From: "Oleksii K." <oleksii.kurochko@xxxxxxxxx>

Date: Tue, 11 Jun 2024 20:23:43 +0200

Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>

Delivery-date: Tue, 11 Jun 2024 18:24:07 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote: > On 30/05/2024 7:22 pm, Oleksii K. wrote: > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx> > > > > Acked-by: Jan Beulich <jbeulich@xxxxxxxx> > > > This patch looks like it can go in independently? Or does it > > > depend > > > on > > > having bitops.h working in practice? > > > > > > However, one very strong suggestion... > > > > > > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h > > > > b/xen/arch/riscv/include/asm/mm.h > > > > index 07c7a0abba..cc4a07a71c 100644 > > > > --- a/xen/arch/riscv/include/asm/mm.h > > > > +++ b/xen/arch/riscv/include/asm/mm.h > > > > @@ -3,11 +3,246 @@ > > > > <snip> > > > > +/* PDX of the first page in the frame table. */ > > > > +extern unsigned long frametable_base_pdx; > > > > + > > > > +/* Convert between machine frame numbers and page-info > > > > structures. > > > > */ > > > > +#define > > > > mfn_to_page(mfn) \ > > > > + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > > > > +#define > > > > page_to_mfn(pg) \ > > > > + pdx_to_mfn((unsigned long)((pg) - frame_table) + > > > > frametable_base_pdx) > > > Do yourself a favour and not introduce frametable_base_pdx to > > > begin > > > with. > > > > > > Every RISC-V board I can find has things starting from 0 in > > > physical > > > address space, with RAM starting immediately after. > > I checked Linux kernel and grep there: > > [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ > > -- > > exclude "*.tmp" -I > > arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive- > > 2.dtsi:33: > > memory@40000000 { > > arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: > > memory@80000000 > > { > > arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: > > ddrc_cache: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: > > ddrc_cache_hi: > > memory@1040000000 { > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: > > ddrc_cache_hi: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: > > ddrc_cache_hi: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: > > ddrc_cache_hi: > > memory@1040000000 { > > arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 > > { > > arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: > > memory@0 { > > arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: > > memory@80000000 > > { > > arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 > > { > > arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: > > memory@80000000 > > { > > arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: > > memory@80000000 > > { > > arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: > > memory@80000000 { > > > > And based on that majority of supported by Linux kernel boards has > > RAM > > starting not from 0 in physical address space. Am I confusing > > something? > > > > > Taking the microchip board as an example, RAM actually starts at > > > 0x8000000, > > Today we had conversation with the guy from SiFive in xen-devel > > channel > > and he mentioned that they are using "starfive visionfive2 and > > sifive > > unleashed platforms." which based on the grep above has RAM not at > > 0 > > address. > > > > Also, QEMU uses 0x8000000. > > > > > which means that having frametable_base_pdx and assuming it > > > does get set to 0x8000 (which isn't even a certainty, given that > > > I > > > think > > > you'll need struct pages covering the PLICs), then what you are > > > trading > > > off is: > > > > > > * Saving 32k of virtual address space only (no need to even > > > allocate > > > memory for this range of the framtable), by > > > * Having an extra memory load and add/sub in every page <-> mfn > > > conversion, which is a screaming hotpath all over Xen. > > > > > > It's a terribly short-sighted tradeoff. > > > > > > 32k of VA space might be worth saving in a 32bit build (I > > > personally > > > wouldn't - especially as there's no need to share Xen's VA space > > > with > > > guests, given no PV guests on ARM/RISC-V), but it's absolutely > > > not at > > > all in an a 64bit build with TB of VA space available. > > > > > > Even if we do find a board with the first interesting thing in > > > the > > > frametable starting sufficiently away from 0 that it might be > > > worth > > > considering this slide, then it should still be Kconfig-able in a > > > similar way to PDX_COMPRESSION. > > I found your tradeoffs reasonable, but I don't understand how it > > will > > work if RAM does not start from 0, as the frametable address and > > RAM > > address are linked. > > I tried to look at the PDX_COMPRESSION config and couldn't find any > > "slide" there. Could you please clarify this for me? > > If to use this "slide" would it help to avoid the mentioned above > > tradeoffs? > > > > One more question: if we decide to go without frametable_base_pdx, > > would it be sufficient to simply remove mentions of it from the > > code ( > > at least, for now )? > > There is a relationship between system/host physical addresses (what > Xen > called maddr/mfn), and the frametable. The frametable has one entry > per > mfn. > > In the most simple case, there's a 1:1 relationship. i.e. > frametable[0] > = maddr(0), frametable[1] = maddr(4k) etc. This is very simple, and > very easy to calculate (page_to_mfn()/mfn_to_page()). > > The frametable is one big array. It starts at a compile-time fixed > address, and needs to be long enough to cover everything interesting > in > memory. Therefore it potentially takes a large amount of virtual > address space. > > However, only interesting maddrs need to have data in the frametable, > so > it's fine for the backing RAM to be sparsely allocated/mapped in the > frametable virtual addresses. > > For 64bit, that's really all you need, because there's always far > more > virtual address space than physical RAM in the system, even when > you're > looking at TB-scale giant servers. > > > For 32bit, virtual address space is a limited resource. (Also to an > extent, 64bit x86 with PV guests because we give 98% of the virtual > address space to the guest kernel to use.) > > There are two tricks to reduce the virtual address space used, but > they > both cost performance in fastpaths. > > 1) PDX Compression. > > PDX compression makes a non-linear mfn <-> maddr mapping. This is > for a > usecase where you've got multiple RAM banks which are separated by a > large distance (and evenly spaced), then you can "compress" a single > range of 0's out of the middle of the system/host physical address. > > The cost is that all page <-> mfn conversions need to read two masks > and > a shift-count from variables in memory, to split/shift/recombine the > address bits. > > 2) A slide, which is frametable_base_pdx in this context. > > When there's a big gap between 0 and the start of something > interesting, > you could chop out that range by just subtracting base_pdx. What > qualifies as "big" is subjective, but Qemu starting at 128M certainly > does not qualify as big enough to warrant frametable_base_pdx. > > This is less expensive that PDX compression. It only adds one memory > read to the fastpath, but it also doesn't save as much virtual > address > space as PDX compression. > > > When virtual address space is a major constraint (32 bit builds), > both > of these techniques are worth doing. But when there's no constraint > on > virtual address space (64 bit builds), there's no reason to use > either; > and the performance will definitely improve as a result. Thanks for such good explanation. For RISC-V we have PDX config disabled as I haven't seen multiple RAM banks at boards which has hypervisor extension. Thereby mfn_to_pdx() and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case of PDX disabled, it just an offset ( or a slide ). IIUUC, you meant that it make sense to map frametable not to the address of where RAM is started, but to 0, so then we don't need this +-frametable_base_pdx. The price for that is loosing of VA space for the range from 0 to RAM start address. Right now, we are trying to support 3 boards with the following RAM address: 1. 0x8000_0000 - QEMU and SiFive board 2. 0x40_0000_0000 - Microchip board So if we mapping frametable to 0 ( not RAM start ) we will loose: 1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] ) * 64 ( size of struct page_info ) = 32 MB 2. 0x40_0000_0 ( amount of page entries to cover range [0, 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb In terms of available virtual address space for RV-64 we can consider both options as acceptable. [OPTION 1] If we accepting of loosing 4 Gb of VA then we could implement mfn_to_page() and page_to_mfn() in the following way: ``` diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index cc4a07a71c..fdac7e0646 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -107,14 +107,11 @@ struct page_info #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) -/* PDX of the first page in the frame table. */ -extern unsigned long frametable_base_pdx; - /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) + (frame_table + mfn)) #define page_to_mfn(pg) \ - pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + ((unsigned long)((pg) - frame_table)) static inline void *page_to_virt(const struct page_info *pg) { diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 9c0fd80588..8f6dbdc699 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -15,7 +15,7 @@ #include <asm/page.h> #include <asm/processor.h> -unsigned long __ro_after_init frametable_base_pdx; unsigned long __ro_after_init frametable_virt_end; struct mmu_desc { ``` [OPTION 2] If we are not accepting of loosing 4 Gb of VA I think that there is no sense to rework that in the following way: ``` diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index cc4a07a71c..fdac7e0646 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -107,14 +107,11 @@ struct page_info #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) -/* PDX of the first page in the frame table. */ -extern unsigned long frametable_base_pdx; - /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) + (frame_table + mfn - FRAMETABLE_BASE_OFFSET)) #define page_to_mfn(pg) \ - pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET) static inline void *page_to_virt(const struct page_info *pg) { diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 9c0fd80588..8f6dbdc699 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -15,7 +15,7 @@ #include <asm/page.h> #include <asm/processor.h> -unsigned long __ro_after_init frametable_base_pdx; unsigned long __ro_after_init frametable_virt_end; struct mmu_desc { ``` And I am not sure that there is any sense in option 2 as basically it the same to have the following macros definition with PDX disabled: ``` /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) #define page_to_mfn(pg) \ pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) ``` The only sense of option 2 is the case when FRAMETABLE_BASE_OFFSET is equal to 0, so the compiler will generate the code without additional sub/add of FRAMETABLE_BASE_OFFSET. Could you please clarify if my understanding is correct? Should we still change the definition of mfn_to_page() and page_to_mfn() with the fact that PDX is disabled? If yes, OPTION_1 is okay or I am missing something? Thanks in advance. ~ Oleksii

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.