
Re: [Xen-devel] [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's



On Apr 11, 2014 2:38 PM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 11/04/14 19:08, Konrad Rzeszutek Wilk wrote: 
> > From: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx> 
> > 
> > The page scrubbing is done in 128MB chunks in lockstep across all the 
> > CPU's. 
>
> No apostrophe in plural CPUs (also further through) 
>
> > This allows for the boot CPU to hold the heap_lock whilst each chunk is being
> > scrubbed and then release the heap_lock when all CPU's are finished scrubbing
> > their individual chunk. This allows for the heap_lock to not be held
> > continuously and for pending softirqs to be serviced periodically across
> > all CPU's.
> > 
> > The page scrub memory chunks are allocated to the CPU's in a NUMA aware 
> > fashion to reduce socket interconnect overhead and improve performance. 
> > Specifically in the first phase we scrub at the same time on all the 
> > NUMA nodes that have CPUs - we also weed out the SMT threads so that 
> > we only use cores (that gives a 50% boost). The second phase is to use 
> > all of the CPUs for the NUMA nodes that have no CPUs. 
> > 
> > This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron 
> > 6386 machine from 49 seconds to 3 seconds. 
> > On an IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
> > to 117 seconds. 
>
> The numbers from our 1TB box are also along a similar line, but I don't 
> have them to hand. 
>
> > 
> > v2 
> > - Reduced default chunk size to 128MB
> > - Added code to scrub NUMA nodes with no active CPU linked to them
> > - Be robust to boot CPU not being linked to a NUMA node
> > 
> > v3: 
> > - Don't use SMT threads
> > - Take care of remainder if the number of CPUs (or memory) is odd
> > - Restructure the worker thread
> > - s/u64/unsigned long/
> > 
> > Signed-off-by: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> 
> > --- 
> >  docs/misc/xen-command-line.markdown |  10 ++
> >  xen/common/page_alloc.c             | 177 +++++++++++++++++++++++++++++++----
> >  2 files changed, 167 insertions(+), 20 deletions(-)
> > 
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> > index 87de2dc..a7da227 100644 
> > --- a/docs/misc/xen-command-line.markdown 
> > +++ b/docs/misc/xen-command-line.markdown 
> > @@ -197,6 +197,16 @@ Scrub free RAM during boot. This is a safety feature to prevent
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes
> >  and reboots.
>
> > +### bootscrub_chunk_ 
>
> Erroneous trailing underscore, and the first one needs escaping for the 
> markdown to end up properly formatted. 
>
> > +> `= <size>` 
> > + 
> > +> Default: `128MiB` 
>
> `128MB` 
>
> 128MiB will be rejected by the size parsing code

Yes. Will correct.
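
For the record, the command-line form would then be something like (hypothetical
value, assuming the usual single-letter suffixes the size parser accepts):

    bootscrub_chunk=256M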
>
> > + 
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> > +and not running softirqs. Reduce this if softirqs are not being run frequently
> > +enough. Setting this to a high value may cause boot failure, particularly
> > +if the NMI watchdog is also enabled.
> > + 
> >  ### cachesize
> >  > `= <size>`
>
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c 
> > index 601319c..3ad6e1d 100644 
> > --- a/xen/common/page_alloc.c 
> > +++ b/xen/common/page_alloc.c 
> > @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1; 
> >  boolean_param("bootscrub", opt_bootscrub);
>
> >  /*
> > + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held
>
> Stale comment?

Aye
>
> > + */ 
> > +static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024; 
> > +size_param("bootscrub_chunk", opt_bootscrub_chunk); 
> > + 
> > +/* 
> >   * Bit width of the DMA heap -- used to override NUMA-node-first.
> >   * allocation strategy, which can otherwise exhaust low memory.
> >   */
> > @@ -90,6 +96,16 @@ static struct bootmem_region { 
> >  } *__initdata bootmem_region_list;
> >  static unsigned int __initdata nr_bootmem_regions;
>
> > +struct scrub_region { 
> > +    unsigned long offset;
> > +    unsigned long start;
> > +    unsigned long per_cpu_sz;
> > +    unsigned long rem;
> > +    cpumask_t cpu;
>
> cpus surely? 
>
> > +}; 
> > +static struct scrub_region __initdata region[MAX_NUMNODES]; 
> > +static unsigned long __initdata chunk_size; 
> > + 
> >  static void __init boot_bug(int line)
> >  {
> >      panic("Boot BUG at %s:%d", __FILE__, line);
> > @@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void)
> >      printk("\n");
> >  }
>
> > +void __init smp_scrub_heap_pages(void *data)
> > +{
> > +    unsigned long mfn, start, end;
> > +    struct page_info *pg;
> > +    struct scrub_region *r;
> > +    unsigned int temp_cpu, node, cpu_idx = 0;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    if ( data )
> > +        r = data;
> > +    else {
> > +        node = cpu_to_node(cpu);
> > +        if ( node == NUMA_NO_NODE )
> > +            return;
> > +        r = &region[node];
> > +    }
> > +    ASSERT(r != NULL);
>
> Under what conditions would NULL be passed? Can't the caller do 
> something more sane? 
>

Left over code from v2.
> > + 
> > +    /* Determine the current CPU's index into CPU's linked to this node*/
>
> Space at the end of the comment 
>
> > +    for_each_cpu ( temp_cpu, &r->cpu )
> > +    {
> > +        if ( cpu == temp_cpu )
> > +            break;
> > +        cpu_idx++;
> > +    }
>
> Is this really the best way to do this? Perhaps, but it absolutely 
> should be calculated once on the first chunk scrubbed rather than for 
> each chunk. Or better yet, probably calculated by the BSP when it is 
> splitting up RAM.

Where would you store this? The region is per-node. We would need to allocate
a per-CPU array to store the cpu-idx values. I am not sure it is worth that,
considering this code only runs at boot.
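
If we did precompute it, the natural spot would be Round #0, along these lines
(a hypothetical sketch only; 'scrub_cpu_idx' is a made-up name and none of this
is in the patch):

    /* Filled once by the BSP while it builds region[], read by the workers. */
    static unsigned int __initdata scrub_cpu_idx[NR_CPUS];

    /* In scrub_heap_pages(), after cpumask_copy(&region[i].cpu, &node_cpus): */
    {
        unsigned int c, idx = 0;

        for_each_cpu ( c, &region[i].cpu )
            scrub_cpu_idx[c] = idx++;
    }

    /* smp_scrub_heap_pages() then replaces its search loop with: */
    cpu_idx = scrub_cpu_idx[smp_processor_id()];

That is an extra NR_CPUS-sized array kept around just for boot, which is why I
am not convinced it is worth it.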
>
> > + 
> > +    /* Calculate the starting mfn for this CPU's memory block */
> > +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> > +
> > +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> > +    if ( r->offset + chunk_size > r->per_cpu_sz ) {
> > +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
> > +        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) )
> > +            end += r->rem;
> > +    }
> > +    else
> > +        end = start + chunk_size;
> > +
> > +    for ( mfn = start; mfn < end; mfn++ )
> > +    {
> > +        pg = mfn_to_page(mfn);
> > +
> > +        /* Check the mfn is valid and page is free. */
> > +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> > +            continue;
> > +
> > +        scrub_one_page(pg);
> > +    }
> > +    wmb();
>
> Why this barrier?

Tim asked for it in his review - see 
http://lists.xen.org/archives/html/xen-devel/2013-10/msg00131.html
" + scrub_one_page(pg); > + } There should be a wmb() here, to make sure the 
main scrub dispatcher can't exit while the last worker is still issuing writes."

But with the @wait = 1 this fix is not needed anymore. Will remove it.

>
> > +} 
> > + 
> >  /*
> > - * Scrub all unallocated pages in all heap zones. This function is more
> > - * convoluted than appears necessary because we do not want to continuously
> > - * hold the lock while scrubbing very large memory areas.
> > + * Scrub all unallocated pages in all heap zones. This function uses all
> > + * online cpu's to scrub the memory in parallel.
> >   */
> >  void __init scrub_heap_pages(void)
> >  {
> > -    unsigned long mfn;
> > -    struct page_info *pg;
> > +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};
> > +    unsigned int i, j, cpu, sibling;
> > +    unsigned long offset, max_per_cpu_sz = 0;
> > +    unsigned long start, end;
> > +    unsigned long rem = 0;
>
> >      if ( !opt_bootscrub )
> >          return;
>
> > -    printk("Scrubbing Free RAM: ");
> > +    /* Scrub block size */
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> > +    if ( chunk_size == 0 )
> > +        chunk_size = 1;
>
> > -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> > +    /* Round #0 - figure out amounts and which CPUs to use */
> > +    for_each_online_node ( i )
> >      {
> > +        /* Calculate Node memory start and end address */
> > +        start = max(node_start_pfn(i), first_valid_mfn);
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> > +        /* CPUs that are online and on this node (if none, that it is OK */
> > +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
> > +        cpumask_copy(&temp_cpus, &node_cpus);
> > +        /* Rip out threads. */
> > +        for_each_cpu ( j, &temp_cpus )
> > +        {
> > +            cpu = 0;
> > +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {
> > +                if (cpu++ == 0) /* Skip 1st CPU - the core */
> > +                    continue;
> > +                cpumask_clear_cpu(sibling, &node_cpus);
> > +            }
> > +        }
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> > +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */
> > +            rem = 0;
> > +            region[i].per_cpu_sz = (end - start);
> > +        } else {
> > +            rem = (end - start) % cpumask_weight(&node_cpus);
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> > +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> > +                max_per_cpu_sz = region[i].per_cpu_sz;
> > +        }
> > +        region[i].start = start;
> > +        region[i].rem = rem;
> > +        cpumask_copy(&region[i].cpu, &node_cpus);
> > +
> > +    }
> > +    cpu = smp_processor_id();
>
> This is surely always 0? And I don't see it being used any further down... 
>

Leftover from v2.

Thanks for looking!
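
FWIW, to make the Round #0 / remainder arithmetic above concrete, here is a
tiny standalone sketch with made-up numbers (not Xen code, it just mirrors the
per_cpu_sz/rem split the patch does per node):

    #include <stdio.h>

    int main(void)
    {
        unsigned long start = 0x100000;        /* first pfn of the node      */
        unsigned long end = start + 1000;      /* 1000 pages spanned         */
        unsigned int ncpus = 3;                /* worker cores on the node   */
        unsigned long per_cpu_sz = (end - start) / ncpus;  /* 333            */
        unsigned long rem = (end - start) % ncpus;         /* 1              */
        unsigned int idx;

        for ( idx = 0; idx < ncpus; idx++ )
        {
            unsigned long s = start + per_cpu_sz * idx;
            unsigned long e = s + per_cpu_sz;

            if ( rem && idx == ncpus - 1 )     /* last CPU takes the rest    */
                e += rem;
            printf("cpu_idx %u scrubs pfns [%#lx, %#lx)\n", idx, s, e);
        }
        return 0;
    }

Each CPU then walks its own range in chunk_size pieces, one piece per
on_selected_cpus() call.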
> ~Andrew 
>
> > +    /* Round default chunk size down if required */
> > +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> > +        chunk_size = max_per_cpu_sz;
> > +
> > +    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(),
> > +           cpumask_weight(&all_worker_cpus));
> > +
> > +    /* Round: #1 - do NUMA nodes with CPUs */
> > +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> > +    {
> > +        for_each_online_node ( i )
> > +            region[i].offset = offset;
> > +
> >          process_pending_softirqs();
> >
> > -        pg = mfn_to_page(mfn);
> > +        spin_lock(&heap_lock);
> > +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
> > +        spin_unlock(&heap_lock);
> >
> > -        /* Quick lock-free check. */
> > -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> > +        printk(".");
> > +    }
> > +
> > +    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */
> > +    for_each_online_node ( i )
> > +    {
> > +        node_cpus = node_to_cpumask(i);
> > +
> > +        if ( !cpumask_empty(&node_cpus) )
> >              continue;
> >
> > -        /* Every 100MB, print a progress dot. */
> > -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
> > -            printk(".");
> > +        /* We already have the node information from round #0 */
> > +        end = region[i].start + region[i].per_cpu_sz;
> > +        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus);
> >
> > -        spin_lock(&heap_lock);
> > +        region[i].rem = rem;
> > +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus);
> > +        max_per_cpu_sz = region[i].per_cpu_sz;
> > +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> > +            chunk_size = max_per_cpu_sz;
> > +        cpumask_copy(&region[i].cpu, &all_worker_cpus);
> >
> > -        /* Re-check page status with lock held. */
> > -        if ( page_state_is(pg, free) )
> > -            scrub_one_page(pg);
> > +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> > +        {
> > +            region[i].offset = offset;
> >
> > -        spin_unlock(&heap_lock);
> > -    }
> > +            process_pending_softirqs();
> > +
> > +            spin_lock(&heap_lock);
> > +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1);
> > +            spin_unlock(&heap_lock);
> >
> > -    printk("done.\n");
> > +            printk(".");
> > +        }
> > +    }
> >
> >      /* Now that the heap is initialized, run checks and set bounds
> >       * for the low mem virq algorithm. */
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

