[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen



On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
> On 30/05/2024 7:22 pm, Oleksii K. wrote:
> > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
> > > > Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
> > > This patch looks like it can go in independently?  Or does it
> > > depend
> > > on
> > > having bitops.h working in practice?
> > > 
> > > However, one very strong suggestion...
> > > 
> > > 
> > > > diff --git a/xen/arch/riscv/include/asm/mm.h
> > > > b/xen/arch/riscv/include/asm/mm.h
> > > > index 07c7a0abba..cc4a07a71c 100644
> > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > @@ -3,11 +3,246 @@
> > > > <snip>
> > > > +/* PDX of the first page in the frame table. */
> > > > +extern unsigned long frametable_base_pdx;
> > > > +
> > > > +/* Convert between machine frame numbers and page-info
> > > > structures.
> > > > */
> > > > +#define
> > > > mfn_to_page(mfn)                                            \
> > > > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > > > +#define
> > > > page_to_mfn(pg)                                             \
> > > > +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> > > > frametable_base_pdx)
> > > Do yourself a favour and not introduce frametable_base_pdx to
> > > begin
> > > with.
> > > 
> > > Every RISC-V board I can find has things starting from 0 in
> > > physical
> > > address space, with RAM starting immediately after.
> > I checked Linux kernel and grep there:
> >    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/
> > --
> >    exclude "*.tmp" -I
> >    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-
> > 2.dtsi:33:    
> >    memory@40000000 {
> >    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:    
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:     
> > ddrc_cache:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:  
> > ddrc_cache_hi:
> >    memory@1040000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:     
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:     
> > ddrc_cache_hi:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:  
> > ddrc_cache_hi:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:  
> > ddrc_cache_hi:
> >    memory@1040000000 {
> >    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0
> > {
> >    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:     
> >    memory@0 {
> >    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:   
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000
> > {
> >    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26:
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25:
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram:
> > memory@80000000 {
> >    
> > And based on  that majority of supported by Linux kernel boards has
> > RAM
> > starting not from 0 in physical address space. Am I confusing
> > something?
> > 
> > > Taking the microchip board as an example, RAM actually starts at
> > > 0x8000000,
> > Today we had conversation with the guy from SiFive in xen-devel
> > channel
> > and he mentioned that they are using "starfive visionfive2 and
> > sifive
> > unleashed platforms." which based on the grep above has RAM not at
> > 0
> > address.
> > 
> > Also, QEMU uses 0x8000000.
> > 
> > >  which means that having frametable_base_pdx and assuming it
> > > does get set to 0x8000 (which isn't even a certainty, given that
> > > I
> > > think
> > > you'll need struct pages covering the PLICs), then what you are
> > > trading
> > > off is:
> > > 
> > > * Saving 32k of virtual address space only (no need to even
> > > allocate
> > > memory for this range of the framtable), by
> > > * Having an extra memory load and add/sub in every page <-> mfn
> > > conversion, which is a screaming hotpath all over Xen.
> > > 
> > > It's a terribly short-sighted tradeoff.
> > > 
> > > 32k of VA space might be worth saving in a 32bit build (I
> > > personally
> > > wouldn't - especially as there's no need to share Xen's VA space
> > > with
> > > guests, given no PV guests on ARM/RISC-V), but it's absolutely
> > > not at
> > > all in an a 64bit build with TB of VA space available.
> > > 
> > > Even if we do find a board with the first interesting thing in
> > > the
> > > frametable starting sufficiently away from 0 that it might be
> > > worth
> > > considering this slide, then it should still be Kconfig-able in a
> > > similar way to PDX_COMPRESSION.
> > I found your tradeoffs reasonable, but I don't understand how it
> > will
> > work if RAM does not start from 0, as the frametable address and
> > RAM
> > address are linked.
> > I tried to look at the PDX_COMPRESSION config and couldn't find any
> > "slide" there. Could you please clarify this for me?
> > If to use this "slide" would it help to avoid the mentioned above
> > tradeoffs?
> > 
> > One more question: if we decide to go without frametable_base_pdx,
> > would it be sufficient to simply remove mentions of it from the
> > code (
> > at least, for now )?
> 
> There is a relationship between system/host physical addresses (what
> Xen
> called maddr/mfn), and the frametable.  The frametable has one entry
> per
> mfn.
> 
> In the most simple case, there's a 1:1 relationship.  i.e.
> frametable[0]
> = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple, and
> very easy to calculate (page_to_mfn()/mfn_to_page()).
> 
> The frametable is one big array.  It starts at a compile-time fixed
> address, and needs to be long enough to cover everything interesting
> in
> memory.  Therefore it potentially takes a large amount of virtual
> address space.
> 
> However, only interesting maddrs need to have data in the frametable,
> so
> it's fine for the backing RAM to be sparsely allocated/mapped in the
> frametable virtual addresses.
> 
> For 64bit, that's really all you need, because there's always far
> more
> virtual address space than physical RAM in the system, even when
> you're
> looking at TB-scale giant servers.
> 
> 
> For 32bit, virtual address space is a limited resource.  (Also to an
> extent, 64bit x86 with PV guests because we give 98% of the virtual
> address space to the guest kernel to use.)
> 
> There are two tricks to reduce the virtual address space used, but
> they
> both cost performance in fastpaths.
> 
> 1) PDX Compression.
> 
> PDX compression makes a non-linear mfn <-> maddr mapping.  This is
> for a
> usecase where you've got multiple RAM banks which are separated by a
> large distance (and evenly spaced), then you can "compress" a single
> range of 0's out of the middle of the system/host physical address.
> 
> The cost is that all page <-> mfn conversions need to read two masks
> and
> a shift-count from variables in memory, to split/shift/recombine the
> address bits.
> 
> 2) A slide, which is frametable_base_pdx in this context.
> 
> When there's a big gap between 0 and the start of something
> interesting,
> you could chop out that range by just subtracting base_pdx.  What
> qualifies as "big" is subjective, but Qemu starting at 128M certainly
> does not qualify as big enough to warrant frametable_base_pdx.
> 
> This is less expensive that PDX compression.  It only adds one memory
> read to the fastpath, but it also doesn't save as much virtual
> address
> space as PDX compression.
> 
> 
> When virtual address space is a major constraint (32 bit builds),
> both
> of these techniques are worth doing.  But when there's no constraint
> on
> virtual address space (64 bit builds), there's no reason to use
> either;
> and the performance will definitely improve as a result.

Thanks for such good explanation.

For RISC-V we have PDX config disabled as I haven't seen multiple RAM
banks at boards which has hypervisor extension. Thereby mfn_to_pdx()
and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case
of PDX disabled, it just an offset ( or a slide ).

IIUUC, you meant that it make sense to map frametable not to the
address of where RAM is started, but to 0, so then we don't need this
+-frametable_base_pdx. The price for that is loosing of VA space for
the range from 0 to RAM start address.

Right now, we are trying to support 3 boards with the following RAM
address:
1. 0x8000_0000 - QEMU and SiFive board
2. 0x40_0000_0000 - Microchip board

So if we mapping frametable to 0 ( not RAM start ) we will loose:
1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] )
* 64 ( size of struct page_info ) = 32 MB
2. 0x40_0000_0 ( amount of page entries to cover range [0,
0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb

In terms of available virtual address space for RV-64 we can consider
both options as acceptable.

[OPTION 1] If we accepting of loosing 4 Gb of VA then we could
implement mfn_to_page() and page_to_mfn() in the following way:
```
   diff --git a/xen/arch/riscv/include/asm/mm.h
   b/xen/arch/riscv/include/asm/mm.h
   index cc4a07a71c..fdac7e0646 100644
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -107,14 +107,11 @@ struct page_info
   
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
   
   -/* PDX of the first page in the frame table. */
   -extern unsigned long frametable_base_pdx;
   -
    /* Convert between machine frame numbers and page-info structures.
*/
    #define mfn_to_page(mfn)                                          
\
   -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   +    (frame_table + mfn))
    #define page_to_mfn(pg)                                           
\
   -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
   +    ((unsigned long)((pg) - frame_table))
   
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index 9c0fd80588..8f6dbdc699 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -15,7 +15,7 @@
    #include <asm/page.h>
    #include <asm/processor.h>
   
   -unsigned long __ro_after_init frametable_base_pdx;
    unsigned long __ro_after_init frametable_virt_end;
   
    struct mmu_desc {
```

[OPTION 2] If we are not accepting of loosing 4 Gb of VA I think that
there is no sense to rework that in the following way:
```
   diff --git a/xen/arch/riscv/include/asm/mm.h
   b/xen/arch/riscv/include/asm/mm.h
   index cc4a07a71c..fdac7e0646 100644
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -107,14 +107,11 @@ struct page_info
   
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
   
   -/* PDX of the first page in the frame table. */
   -extern unsigned long frametable_base_pdx;
   -
    /* Convert between machine frame numbers and page-info structures.
*/
    #define mfn_to_page(mfn)                                          
\
   -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   +    (frame_table + mfn - FRAMETABLE_BASE_OFFSET))
    #define page_to_mfn(pg)                                           
\
   -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
   +    ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET)
   
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index 9c0fd80588..8f6dbdc699 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -15,7 +15,7 @@
    #include <asm/page.h>
    #include <asm/processor.h>
   
   -unsigned long __ro_after_init frametable_base_pdx;
    unsigned long __ro_after_init frametable_virt_end;
   
    struct mmu_desc {
```

And I am not sure that there is any sense in option 2 as basically it
the same to have the following macros definition with PDX disabled:
```
   /* Convert between machine frame numbers and page-info structures.
   */
   #define mfn_to_page(mfn)                                           
   \
       (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   #define page_to_mfn(pg)                                            
   \
       pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
```
The only sense of option 2 is the case when FRAMETABLE_BASE_OFFSET is
equal to 0, so the compiler will generate the code without additional
sub/add of FRAMETABLE_BASE_OFFSET.

Could you please clarify if my understanding is correct?

Should we still change the definition of mfn_to_page() and
page_to_mfn() with the fact that PDX is disabled? If yes, OPTION_1 is
okay or I am missing something?

Thanks in advance.

~ Oleksii



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.