
Re: [Xen-devel] [PATCH] create multiple banks for dom0 in 1:1 mapping if necessary



On Tue, Jan 14, 2014 at 10:51 AM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
> On Sat, 2014-01-11 at 01:58 +0000, karim.allah.ahmed@xxxxxxxxx wrote:
>> On Fri, Jan 10, 2014 at 3:48 PM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> 
>> wrote:
>> > On Fri, 2014-01-10 at 04:12 +0000, Karim Raslan wrote:
>> >> Create multiple banks to hold dom0 memory in the case of 1:1 mapping,
>> >> if we fail to find one large contiguous region to hold the whole
>> >> thing.
>> >
>> > Thanks. While with my ARM maintainer hat on I'd love for this to go in
>> > for 4.4, with my acting release manager hat on I think, if I have to be
>> > honest, this is too big a change for 4.4 at this stage, which is a
>> > pity :-(
>> >
>> >
>> >>
>> >> Signed-off-by: Karim Raslan <karim.allah.ahmed@xxxxxxxxx>
>> >> ---
>> >>  xen/arch/arm/domain_build.c |   74 ++++++++++++++++++++++++++-----------
>> >>  xen/arch/arm/kernel.c       |   86 ++++++++++++++++++++++++++++++++++---------
>> >>  2 files changed, 121 insertions(+), 39 deletions(-)
>> >>
>> >> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>> >> index faff88e..bb44cdd 100644
>> >> --- a/xen/arch/arm/domain_build.c
>> >> +++ b/xen/arch/arm/domain_build.c
>> >> @@ -69,39 +69,71 @@ static void allocate_memory_11(struct domain *d, struct kernel_info *kinfo)
>> >>  {
>> >>      paddr_t start;
>> >>      paddr_t size;
>> >> +    unsigned int cur_order, cur_bank, nr_banks = 1, index = 0;
>> >>      struct page_info *pg = NULL;
>> >> -    unsigned int order = get_order_from_bytes(dom0_mem);
>> >> +    unsigned int order = get_order_from_bytes(kinfo->unassigned_mem);
>> >>      int res;
>> >>      paddr_t spfn;
>> >>      unsigned int bits;
>> >>
>> >> -    for ( bits = PAGE_SHIFT + 1; bits < PADDR_BITS; bits++ )
>> >> +#define MIN_BANK_ORDER 10
>> >
>> > 2^10 is < PAGE_SIZE (PAGE_SHIFT is 12). 12 is the lowest allocation size
>> > which can be made, but I think in practice the lowest useful bank size
>> > is going to be somewhat larger than that.
>>
>> MIN_BANK_ORDER is in pages, so it's 2^10 pages, not 2^10 bytes, which
>> makes the minimum bank size 4 MByte.
>
> My mistake. Please add a comment though.
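Will do. Something along these lines should make the units explicit (just a
sketch of the comment, exact wording to be decided):

    /*
     * MIN_BANK_ORDER is an allocation order in pages, not bytes:
     * 2^MIN_BANK_ORDER pages * 4K per page = 4MB minimum bank size.
     */
    #define MIN_BANK_ORDER 10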
>
>> >> +
>> >> +    kinfo->mem.nr_banks = 0;
>> >> +
>> >> +    /*
>> >> +     * We always first try to allocate all dom0 memory in 1 bank.
>> >> +     * However, if we failed to allocate physically contiguous memory
>> >> +     * from the allocator, then we try to create more than one bank.
>> >> +     */
>> >> +    for ( cur_order = order; cur_order > MIN_BANK_ORDER; cur_order--)
>> >
>> > I think this can be just
>> >         for( order = get_order_from_bytes(...) ; order > MIN_BANK_ORDER ; order-- )
>> > (maybe order >= ?) or
>> >         while (order > MIN_BANK_ORDER )
>> >         {
>> >                 ...
>> >                 order--;
>> >         }
>> > I think the first works better.
>> >
>> > This does away with the need for cur_order vs order. I think order is
>> > mostly unused after this patch, also not renaming cur_order would
>> > hopefully reduce the diff and therefore the "scariness" of the patch wrt
>> > 4.4 (although that may not be sufficient).
>>
>> Yes, that's correct; however, I'm intentionally using a different
>> variable because I thought it would make things more obvious.
>> If you think it's better to use the same variable, then I'll
>> update it.
>
> Personally I found it confusing to read as it is. Although reading
> further down I'm still confused and I'm not sure that my suggestion
> would actually help.
>
>> >
>> >> +                             pg = alloc_domheap_pages(d, cur_order, MEMF_bits(bits));
>> >> +                             if ( pg != NULL )
>> >> +                                     break;
>> >> +                     }
>> >> +
>> >> +                     if ( !pg )
>> >> +                             break;
>> >> +
>> >> +                     spfn = page_to_mfn(pg);
>> >> +                     start = pfn_to_paddr(spfn);
>> >> +                     size = pfn_to_paddr((1 << cur_order));
>> >> +
>> >> +                 kinfo->mem.bank[index].start = start;
>> >> +                 kinfo->mem.bank[index].size = size;
>> >> +                 index++;
>> >> +                 kinfo->mem.nr_banks++;
>> >> +     }
>> >> +
>> >> +     if( pg )
>> >> +             break;
>> >> +
>> >> +     nr_banks = (nr_banks - cur_bank + 1) << 1;
>> >
>> > <<1 ?
>>
>> * 2 :)
>
> I know that ;-) But why are you doubling the number of banks at all?
>
>> > Is this not just kinfo->mem.nr_banks?
>>
>> No, in this equation I'm calculating how much more memory will be
>> needed to satisfy the memory size of dom0.
>> So at the end of the iteration, I check how much memory has been
>> allocated (= cur_bank - 1 banks of order cur_order) and how much memory
>> was needed (= nr_banks banks of order cur_order), so
>> nr_unallocated_banks = nr_banks - cur_bank + 1.
>> Then cur_order is decremented and nr_unallocated_banks is doubled ( <<1 )
>> and we do another iteration to satisfy the rest of the unallocated
>> memory.
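(To put numbers on that: if a pass needed nr_banks = 4 banks of order
cur_order and the allocation failed at cur_bank = 3, then
(4 - 3 + 1) = 2 banks' worth is still missing, which becomes (2 << 1) = 4
banks of order (cur_order - 1) for the next pass.)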
>
> Hrm, I'm still a bit confused. I think perhaps choosing better names for
> the variables (e.g. it seems like you are saying nr_banks is actually
> nr_unallocated_banks?) and adding some comments might help.
>
> But is this algorithm not more complex than it needs to be? Why not just
> allocate memory, subtracting it from kinfo->unassigned_mem as you go and
> adding a new bank each time? The result would be the same, wouldn't it?
> e.g.:
>
>         while ( kinfo->unassigned_mem )
>         {
>                 order = get_order_from_bytes(kinfo->unassigned_mem)
>                 do {
>                         pg = alloc_domheap_pages(... order ...)
>                 } while ( !pg && --order > MIN_BANK_ORDER )
>
>                 if (!pg)
>                         urk!
>
>                 kinfo->mem.bank[kinfo->mem.nr_banks].{start,size} = (stuff based on pg, order etc)
>                 kinfo->mem.nr_banks++
>                 kinfo->unassigned_mem -= /*whatever it is*/
>
>                 /* maybe do guest_physmap_add_page here too */
>         }
>
> I think that will produce the same result, won't it?
>
> In either algorithm there needs to be a check for NR_MEM_BANKS somewhere
> or else it will run off the end of kinfo->mem.bank[].
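For my own notes, I read your suggestion as roughly the following (an
untested sketch, reusing the names from the current patch, plus the
NR_MEM_BANKS check you mention; for brevity it ignores a final chunk
smaller than MIN_BANK_ORDER):

    while ( kinfo->unassigned_mem && kinfo->mem.nr_banks < NR_MEM_BANKS )
    {
        unsigned int order = get_order_from_bytes(kinfo->unassigned_mem);
        unsigned int bits;
        struct page_info *pg = NULL;
        paddr_t size;

        /* Walk the order down until the allocator can satisfy it. */
        while ( order >= MIN_BANK_ORDER && !pg )
        {
            /* Prefer the lowest addresses the allocator can give us. */
            for ( bits = PAGE_SHIFT + 1; bits < PADDR_BITS; bits++ )
            {
                pg = alloc_domheap_pages(d, order, MEMF_bits(bits));
                if ( pg )
                    break;
            }
            if ( !pg )
                order--;
        }

        if ( !pg )
            panic("Failed to allocate dom0 memory");

        size = pfn_to_paddr(1 << order);

        kinfo->mem.bank[kinfo->mem.nr_banks].start = pfn_to_paddr(page_to_mfn(pg));
        kinfo->mem.bank[kinfo->mem.nr_banks].size  = size;
        kinfo->mem.nr_banks++;

        /* The order was rounded up, so clamp the accounting to what is left. */
        kinfo->unassigned_mem -= MIN(size, kinfo->unassigned_mem);

        /* guest_physmap_add_page() for the new range could also go here. */
    }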
>
>>
>> >
>> > The basic structure here is:
>> >
>> > for ( cur_order = order; cur_order > MIN_BANK_ORDER; cur_order--)
>> >         for( cur_bank = 1; cur_bank <= nr_banks; cur_bank++ )
>> >                 for ( bits = PAGE_SHIFT + 1; bits < PADDR_BITS; bits++ )
>> >
>> > Shouldn't the iteration over bank be the outer one?
>> >
>> > The banks might be different sizes, right?
>> >
>>
>> The outer loop will define the order of the allocated bank[s] while
>> the inner one will define how many banks of that order will be needed.
>>
>> So you try to allocate one big bank, if it succeeds you're done. If
>> not, you double the number of required banks and retry with smaller
>> order (order - 1).
>>
>> The code can indeed allocate banks of different sizes. So, if we
>> failed to allocate 1 big bank, we will try to allocate two banks of
>> (order = order - 1); in that case we might only allocate the first bank
>> and fail to allocate the second one. So, we then try to allocate the
>> memory required for the second one in two smaller banks.
>>
>> So the logic is always: at the end of the outer loop, calculate how
>> many banks we need to allocate for the next iteration as well as the
>> required order. All allocations that occur in the same iteration are of
>> equal size, while each iteration changes the number of banks and the
>> order depending on how much memory we still need to allocate.
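(A concrete run with made-up sizes: to place 64MB we first try one bank of
order 14. If that fails we try two banks of order 13 (32MB each); if the
first succeeds and the second doesn't, the next pass tries two banks of
order 12 (16MB each) for the remaining 32MB, and so on down to
MIN_BANK_ORDER, i.e. 4MB banks.)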
>
> I think I get it now, but it still seems complicated. Am I missing
> something with the simpler loop I suggested?
>
>> > Also, with either approach, depending on where memory is available,
>> > this may result in the memory not being allocated contiguously and/or
>> > the banks not being in increasing order (in fact, because Xen prefers
>> > to allocate higher memory first it seems likely that they will be in
>> > reverse order).
>>
>> Yes, there's no restriction whatsoever on the order of the
>> addresses. However, each allocation tries to place the bank at as low
>> an address as possible.
>>
>> for ( bits = PAGE_SHIFT + 1; bits < PADDR_BITS; bits++ )
>> {
>>     pg = alloc_domheap_pages(d, cur_order, MEMF_bits(bits));
>>     if ( pg != NULL )
>>         break;
>> }
>>
>> >
>> > I don't know if either of those things matter. What does ePAPR have to
>> > say on the matter?
>>
>> There is no mention of address range ordering ( at least in section
>> 3.4 "Memory Node" )
>
> OK. It'll be interesting to see whether the implementations match up to
> that!
>
> Since max banks is necessarily small I do wonder if it might be easier
> to just do a quick insertion sort as we go rather than risking it. I
> suppose there will be plenty of time in the 4.5 cycle (which this work
> is now almost certainly targeting) to hit any problem cases.
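Keeping the array sorted as banks are allocated would be cheap enough; a
rough sketch (hypothetical helper, not in the patch yet):

    /* Insert a bank keeping kinfo->mem.bank[] sorted by start address. */
    static void insert_bank_sorted(struct kernel_info *kinfo,
                                   paddr_t start, paddr_t size)
    {
        unsigned int i = kinfo->mem.nr_banks;

        while ( i > 0 && kinfo->mem.bank[i - 1].start > start )
        {
            kinfo->mem.bank[i] = kinfo->mem.bank[i - 1];
            i--;
        }
        kinfo->mem.bank[i].start = start;
        kinfo->mem.bank[i].size  = size;
        kinfo->mem.nr_banks++;
    }

(The caller would still need to check nr_banks < NR_MEM_BANKS first.)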
>
>> > I'd expect that the ordering might matter from the point of view of
>> > putting the kernel in the first bank -- since that may no longer be the
>> > lowest address.
>> >
>> In the patch, when I set the loadaddr of the image I search through
>> the banks for the lowest-address bank, not simply the first one.
>
> OK, I think that makes sense since the requirement is wrt position
> relative to the start of RAM, not the first bank.
>
>> > You don't seem to reference kinfo->unassigned_mem anywhere after the
>> > initial order calculation -- I think you need to subtract memory from it
>> > on each iteration, or else I'm not sure you will actually get the right
>> > amount allocated in all cases.
>>
>> It's being properly calculated in the external mapping loop.
>
> Hrm yes with all that doubling etc.
>
>> >>
>> >> -    kinfo->mem.bank[0].start = start;
>> >> -    kinfo->mem.bank[0].size = size;
>> >> -    kinfo->mem.nr_banks = 1;
>> >> +         if ( res )
>> >> +             panic("Unable to add pages in DOM0: %d", res);
>> >>
>> >> -    kinfo->unassigned_mem -= size;
>> >> +         kinfo->unassigned_mem -= size;
>> >> +     }
>> >>  }
>> >>
>> >>  static void allocate_memory(struct domain *d, struct kernel_info *kinfo)
>> >> diff --git a/xen/arch/arm/kernel.c b/xen/arch/arm/kernel.c
>> >> index 6a5772b..658c3de 100644
>> >> --- a/xen/arch/arm/kernel.c
>> >> +++ b/xen/arch/arm/kernel.c
>> >> @@ -79,15 +79,43 @@ static void place_modules(struct kernel_info *info,
>> >>      const paddr_t total = initrd_len + dtb_len;
>> >>
>> >>      /* Convenient */
>> >
>> > If you are going to remove all of the following convenience variables
>> > then this comment is no longer correct.
>> >
>> > (Convenient here means a shorter local name for something complex)
>>
>> It still applies to these variables except probably "unsigned int i,
>> min_i = -1;", so I'll push the comment one line down.
>
> No it doesn't AFAICT. "Convenient" here means "these are shorthand for
> something longer and inconvenient to type". If the variables aren't
> const then they almost certainly are not a convenience in the intended
> sense. same_bank, mem_* and kernel_size are all just regular variables I
> think.
>
>> >> -    const paddr_t mem_start = info->mem.bank[0].start;
>> >> -    const paddr_t mem_size = info->mem.bank[0].size;
>> >> -    const paddr_t mem_end = mem_start + mem_size;
>> >> -    const paddr_t kernel_size = kernel_end - kernel_start;
>> >> +    unsigned int i, min_i = -1;
>> >> +    bool_t same_bank = false;
>> >> +
>> >> +    paddr_t mem_start, mem_end, mem_size;
>> >> +    paddr_t kernel_size;
>> >>
>> >>      paddr_t addr;
>> >>
>> >> -    if ( total + kernel_size > mem_size )
>> >> -        panic("Not enough memory in the first bank for the dtb+initrd");
>> >> +    kernel_size = kernel_end - kernel_start;
>> >> +
>> >> +    for ( i = 0; i < info->mem.nr_banks; i++ )
>> >> +    {
>> >> +        mem_start = info->mem.bank[i].start;
>> >> +        mem_size = info->mem.bank[i].size;
>> >> +        mem_end = mem_start + mem_size - 1;
>> >> +
>> >> +        if ( (kernel_end > mem_start) && (kernel_end <= mem_end) )
>> >> +            same_bank = true;
>> >> +        else
>> >> +            same_bank = false;
>> >> +
>> >> +        if ( same_bank && ((total + kernel_size) < mem_size) )
>> >> +            min_i = i;
>> >> +        else if ( (!same_bank) && (total < mem_size) )
>> >> +            min_i = i;
>> >> +        else
>> >> +            continue;
>> >
>> > What is all this doing?
>>
>> It searches through the banks for one that fits the initrd and dtb.
>> The calculation is slightly different depending on whether I ended up
>> using the same bank as the one used by the kernel or not. (As
>> mentioned previously, I don't just blindly choose the first bank for
>> the kernel; I search through all of the banks for the lowest one,
>> address-wise.)
>
> Where is the address comparison before assigning min_i?
>
>>  In the same_bank case, I just use the old
>> calculation that was already there in the code; however, in the
>> !same_bank case I just start at the end of the bank.
>
> Would it be clearer to do
>         if ( dtb+initrd fits in kernel's bank )
>                 ok
>         else
>                 search
> ?
>
> There is also a requirement that the dtb and initrd are within certain
> ranges of the start of RAM (see the booting.txt and Booting in
> Documentation/arm*/); does this patch implement that? It doesn't look to be
> considered during the search or when placing modules in the !same_bank case.
>
> This would all be simpler if we sorted the banks too. Which is another
> argument for doing so IMHO.
>
>> >> -        load_end = info->mem.bank[0].start + info->mem.bank[0].size;
>> >> -        load_end = MIN(info->mem.bank[0].start + MB(128), load_end);
>> >> +        /*
>> >> +         * Load kernel at the lowest possible bank
>> >> +         * ( check commit 6c21cb36e263de2db8716b477157a5b6cd531e1e for reason behind that )
>> >
>> > That commit says nothing about loading in the lowest possible bank,
>> > though. If there is some relevant factor which is worth commenting on
>> > please do so directly.
>>
>> Loading at the lowest bank is safer because of this:
>> "The current solution in Linux assuming that the delta physical
>> address - virtual address is always negative".
>> This delta is calculated based on where the kernel was loaded
>> ( using "adr" to find the physical address ).
>> So loading it as low as possible is a good idea to avoid this problem.
>
> I think this problem is now fixed upstream; the intention was to
> eventually revert the workaround (Julien was going to tell me when it
> had gone into stable etc, but this is now a 4.5 era revert candidate).
>
>> However, I'm not quite sure why we don't just load it at the beginning of
>> the memory then! I think I'll look at booting.txt for that; maybe it's
>> a decompressor limitation or something!
>
> Yes, it's all a bit complicated.
>
>> > Actually now that the kernel is fixed upstream we don't need the
>> > behaviour of that commit at all. Although there are still restrictions
>> > based on load address vs start of RAM (See booting.txt in the kernel
>> > docs)
>> Ah, I see. I'm using an allwinner branch of the kernel (
>> https://github.com/bjzhang/linux-allwinner.git ). I'll take a look at
>> the latest thing.
>
> git://github.com/linux-sunxi/linux-sunxi.git either sunxi-next or
> sunxi-devel branches are pretty good baselines nowadays. I've got an
> outstanding TODO to update the wiki...
Sorry for the delay!

Okay, I'm going to update my system with this tree and refresh the
patch based on your comments, and resend it this weekend.

>
> Ian.
>



-- 
Karim Allah Ahmed.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

