
Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen



On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote:
> On 11/06/2024 7:23 pm, Oleksii K. wrote:
> > On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
> > > On 30/05/2024 7:22 pm, Oleksii K. wrote:
> > > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> > > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > > > > > Signed-off-by: Oleksii Kurochko
> > > > > > <oleksii.kurochko@xxxxxxxxx>
> > > > > > Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
> > > > > This patch looks like it can go in independently?  Or does it
> > > > > depend
> > > > > on
> > > > > having bitops.h working in practice?
> > > > > 
> > > > > However, one very strong suggestion...
> > > > > 
> > > > > 
> > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h
> > > > > > b/xen/arch/riscv/include/asm/mm.h
> > > > > > index 07c7a0abba..cc4a07a71c 100644
> > > > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > > > @@ -3,11 +3,246 @@
> > > > > > <snip>
> > > > > > +/* PDX of the first page in the frame table. */
> > > > > > +extern unsigned long frametable_base_pdx;
> > > > > > +
> > > > > > +/* Convert between machine frame numbers and page-info structures. */
> > > > > > +#define mfn_to_page(mfn)                                        \
> > > > > > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > > > > > +#define page_to_mfn(pg)                                         \
> > > > > > +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> > > > > Do yourself a favour and not introduce frametable_base_pdx to
> > > > > begin
> > > > > with.
> > > > > 
> > > > > Every RISC-V board I can find has things starting from 0 in
> > > > > physical
> > > > > address space, with RAM starting immediately after.
> > > > I checked the Linux kernel and grepped there:
> > > >    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
> > > >    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33:     memory@40000000 {
> > > >    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:     memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:      ddrc_cache: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:   ddrc_cache_hi: memory@1040000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:      ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:      ddrc_cache_hi: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:   ddrc_cache_hi: memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:   ddrc_cache_lo: memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:   ddrc_cache_hi: memory@1040000000 {
> > > >    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0 {
> > > >    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:      memory@0 {
> > > >    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:    memory@80000000 {
> > > >    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000 {
> > > >    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
> > > >    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
> > > >    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram: memory@80000000 {
> > > > Based on that, the majority of the boards supported by the
> > > > Linux kernel have RAM starting at a non-zero physical address.
> > > > Am I confusing something?
> > > > 
> > > > > Taking the microchip board as an example, RAM actually starts
> > > > > at 0x80000000,
> > > > Today we had a conversation with a person from SiFive in the
> > > > xen-devel channel, and he mentioned that they are using "starfive
> > > > visionfive2 and sifive unleashed platforms", which, based on the
> > > > grep above, have RAM not at address 0.
> > > > 
> > > > Also, QEMU uses 0x80000000.
> > > > 
> > > > >  which means that having frametable_base_pdx, and assuming it
> > > > > does get set to 0x80000 (which isn't even a certainty, given
> > > > > that I think you'll need struct pages covering the PLICs), then
> > > > > what you are trading off is:
> > > > > 
> > > > > * Saving 32k of virtual address space only (no need to even
> > > > > allocate memory for this range of the frametable), by
> > > > > * Having an extra memory load and add/sub in every page <-> mfn
> > > > > conversion, which is a screaming hotpath all over Xen.
> > > > > 
> > > > > It's a terribly short-sighted tradeoff.
> > > > > 
> > > > > 32k of VA space might be worth saving in a 32bit build (I
> > > > > personally wouldn't - especially as there's no need to share
> > > > > Xen's VA space with guests, given no PV guests on ARM/RISC-V),
> > > > > but it's absolutely not worth it in a 64bit build with TB of VA
> > > > > space available.
> > > > > 
> > > > > Even if we do find a board with the first interesting thing in
> > > > > the frametable starting sufficiently far away from 0 that it
> > > > > might be worth considering this slide, then it should still be
> > > > > Kconfig-able in a similar way to PDX_COMPRESSION.
> > > > I find your tradeoffs reasonable, but I don't understand how
> > > > this will work if RAM does not start from 0, as the frametable
> > > > address and the RAM address are linked.
> > > > I tried to look at the PDX_COMPRESSION config and couldn't find
> > > > any "slide" there. Could you please clarify this for me?
> > > > If we use this "slide", would it help avoid the tradeoffs
> > > > mentioned above?
> > > > 
> > > > One more question: if we decide to go without
> > > > frametable_base_pdx, would it be sufficient to simply remove
> > > > mentions of it from the code (at least, for now)?
> > > There is a relationship between system/host physical addresses
> > > (what Xen calls maddr/mfn) and the frametable.  The frametable has
> > > one entry per mfn.
> > > 
> > > In the most simple case, there's a 1:1 relationship.  i.e.
> > > frametable[0]
> > > = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple,
> > > and
> > > very easy to calculate (page_to_mfn()/mfn_to_page()).
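
[Editorial illustration of the simple 1:1 case described above; this is a
self-contained sketch with made-up names and a toy frametable, not the real
Xen macros (which are typesafe and use mfn_t):]

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Toy stand-in for Xen's struct page_info; 64 bytes per entry. */
struct page_info { uint64_t pad[8]; };

static struct page_info frame_table_storage[0x1000];
static struct page_info *frame_table = frame_table_storage;

/* 1:1 case: frame_table[N] describes the page at maddr N << PAGE_SHIFT. */
static struct page_info *mfn_to_page(unsigned long mfn)
{
    return frame_table + mfn;
}

static unsigned long page_to_mfn(const struct page_info *pg)
{
    return (unsigned long)(pg - frame_table);
}
```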
> > > 
> > > The frametable is one big array.  It starts at a compile-time
> > > fixed
> > > address, and needs to be long enough to cover everything
> > > interesting
> > > in
> > > memory.  Therefore it potentially takes a large amount of virtual
> > > address space.
> > > 
> > > However, only interesting maddrs need to have data in the
> > > frametable,
> > > so
> > > it's fine for the backing RAM to be sparsely allocated/mapped in
> > > the
> > > frametable virtual addresses.
> > > 
> > > For 64bit, that's really all you need, because there's always far
> > > more
> > > virtual address space than physical RAM in the system, even when
> > > you're
> > > looking at TB-scale giant servers.
> > > 
> > > 
> > > For 32bit, virtual address space is a limited resource.  (Also to
> > > an
> > > extent, 64bit x86 with PV guests because we give 98% of the
> > > virtual
> > > address space to the guest kernel to use.)
> > > 
> > > There are two tricks to reduce the virtual address space used,
> > > but
> > > they
> > > both cost performance in fastpaths.
> > > 
> > > 1) PDX Compression.
> > > 
> > > PDX compression makes a non-linear mfn <-> maddr mapping.  This is
> > > for a usecase where you've got multiple RAM banks which are
> > > separated by a large distance (and evenly spaced); then you can
> > > "compress" a single range of 0's out of the middle of the
> > > system/host physical address space.
> > > 
> > > The cost is that all page <-> mfn conversions need to read two
> > > masks
> > > and
> > > a shift-count from variables in memory, to split/shift/recombine
> > > the
> > > address bits.
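
[An editorial sketch of the split/shift/recombine step described above. The
variable names loosely mirror those in Xen's xen/include/xen/pdx.h, but the
hole position and all values here are invented for illustration: a run of 4
always-zero bits at bit 20 of the page frame number is squeezed out.]

```c
/* Hypothetical layout: PFN bits [20, 24) are always zero on this "board". */
static unsigned long pfn_pdx_bottom_mask = (1UL << 20) - 1;   /* bits kept low  */
static unsigned long pfn_top_mask        = ~((1UL << 24) - 1);/* bits kept high */
static unsigned int  pfn_pdx_hole_shift  = 4;                 /* hole width     */

/* Two mask reads and a shift on every conversion - the cost Andrew notes. */
static unsigned long pfn_to_pdx(unsigned long pfn)
{
    return (pfn & pfn_pdx_bottom_mask) |
           ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
}

static unsigned long pdx_to_pfn(unsigned long pdx)
{
    return (pdx & pfn_pdx_bottom_mask) |
           ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
}
```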
> > > 
> > > 2) A slide, which is frametable_base_pdx in this context.
> > > 
> > > When there's a big gap between 0 and the start of something
> > > interesting,
> > > you could chop out that range by just subtracting base_pdx.  What
> > > qualifies as "big" is subjective, but Qemu starting at 128M
> > > certainly
> > > does not qualify as big enough to warrant frametable_base_pdx.
> > > 
> > > This is less expensive than PDX compression.  It only adds one
> > > memory
> > > read to the fastpath, but it also doesn't save as much virtual
> > > address
> > > space as PDX compression.
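
[A minimal editorial sketch of the slide, with a hypothetical base value;
the point is the single extra memory read of frametable_base_pdx in each
conversion:]

```c
/* Hypothetical: RAM starts at 0x80000000, so the first interesting PDX is
 * 0x80000000 >> 12 == 0x80000.  The [0, base) range of the frametable is
 * simply chopped off rather than left unbacked. */
static unsigned long frametable_base_pdx = 0x80000;

static unsigned long pdx_to_frametable_idx(unsigned long pdx)
{
    return pdx - frametable_base_pdx;   /* extra load + sub in the fastpath */
}

static unsigned long frametable_idx_to_pdx(unsigned long idx)
{
    return idx + frametable_base_pdx;
}
```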
> > > 
> > > 
> > > When virtual address space is a major constraint (32 bit builds),
> > > both
> > > of these techniques are worth doing.  But when there's no
> > > constraint
> > > on
> > > virtual address space (64 bit builds), there's no reason to use
> > > either;
> > > and the performance will definitely improve as a result.
> > Thanks for such a good explanation.
> > 
> > For RISC-V, the PDX config is disabled, as I haven't seen multiple
> > RAM banks on boards which have the hypervisor extension. Thereby
> > mfn_to_pdx() and pdx_to_mfn() do nothing. The same goes for
> > frametable_base_pdx: with PDX disabled, it is just an offset (or a
> > slide).
> > 
> > IIUC, you meant that it makes sense to map the frametable starting
> > at physical address 0 rather than at the start of RAM, so that we
> > don't need the +/- frametable_base_pdx. The price for that is losing
> > VA space for the range from 0 to the RAM start address.
> > 
> > Right now, we are trying to support three boards with the following
> > RAM start addresses:
> > 1. 0x8000_0000 - QEMU and the SiFive board
> > 2. 0x40_0000_0000 - the Microchip board
> > 
> > So if we map the frametable to 0 (not the RAM start), we will lose:
> > 1. 0x8_0000 (number of page entries to cover the range
> > [0, 0x8000_0000)) * 64 (size of struct page_info) = 32 MB
> > 2. 0x400_0000 (number of page entries to cover the range
> > [0, 0x40_0000_0000)) * 64 (size of struct page_info) = 4 GB
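
[A quick editorial check of the arithmetic above, assuming 4K pages and a
64-byte struct page_info as stated in the thread:]

```c
#define PAGE_SHIFT 12

/* Frametable bytes "wasted" below the RAM start if frame_table[0] maps
 * maddr 0; 64 == sizeof(struct page_info) as assumed in the discussion. */
static unsigned long wasted_frametable_bytes(unsigned long ram_start)
{
    return (ram_start >> PAGE_SHIFT) * 64;
}
```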
> > 
> > In terms of available virtual address space for RV64, we can
> > consider both options acceptable.
> 
> For Qemu and SiFive, 32M is definitely not worth worrying about.
> 
> I personally wouldn't worry about Microchip either.  That's 4G of 1T
> VA
> space (assuming Sv39).
> 
> > [OPTION 1] If we accept losing 4 GB of VA, then we could
> > implement mfn_to_page() and page_to_mfn() in the following way:
> > ```
> >    diff --git a/xen/arch/riscv/include/asm/mm.h
> >    b/xen/arch/riscv/include/asm/mm.h
> >    index cc4a07a71c..fdac7e0646 100644
> >    --- a/xen/arch/riscv/include/asm/mm.h
> >    +++ b/xen/arch/riscv/include/asm/mm.h
> >    @@ -107,14 +107,11 @@ struct page_info
> >    
> >     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
> >    
> >    -/* PDX of the first page in the frame table. */
> >    -extern unsigned long frametable_base_pdx;
> >    -
> >     /* Convert between machine frame numbers and page-info structures. */
> >     #define mfn_to_page(mfn)                                        \
> >    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> >    +    (frame_table + mfn_to_pdx(mfn))
> >     #define page_to_mfn(pg)                                         \
> >    -    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> >    +    pdx_to_mfn((unsigned long)((pg) - frame_table))
> >    
> >     static inline void *page_to_virt(const struct page_info *pg)
> >     {
> >    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> >    index 9c0fd80588..8f6dbdc699 100644
> >    --- a/xen/arch/riscv/mm.c
> >    +++ b/xen/arch/riscv/mm.c
> >    @@ -15,7 +15,6 @@
> >     #include <asm/page.h>
> >     #include <asm/processor.h>
> >    
> >    -unsigned long __ro_after_init frametable_base_pdx;
> >     unsigned long __ro_after_init frametable_virt_end;
> >    
> >     struct mmu_desc {
> > ```
> 
> I firmly recommend option 1, especially at this point.
Jan, as you gave your Acked-by before, would you mind if mfn_to_page()
and page_to_mfn() are defined as mentioned above (Option 1)?

~ Oleksii

> 
> If specific boards really have a problem with losing 4G of VA space,
> then option 2 can be added easily at a later point.
> 
> That said, I'd think carefully about doing option 2.  Even
> subtracting a constant - which is far better than subtracting a
> variable - is still extra overhead in fastpaths.  Option 2 needs
> careful consideration on a board-by-board basis.
> 
> ~Andrew




 

