
Re: [Xen-devel] [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v6)



On Fri, Jun 13, 2014 at 07:32:46PM +0100, Andrew Cooper wrote:
> On 13/06/14 19:05, Konrad Rzeszutek Wilk wrote:
> > From: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
> >
> > The page scrubbing is done in 128MB chunks in lockstep across all the
> > non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
> > chunk is being scrubbed, and then release the heap_lock once the CPUs have
> > finished scrubbing their individual chunks. This means the heap_lock is
> > not held continuously and pending softirqs are serviced periodically
> > across the CPUs.
> >
> > The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
> > fashion to reduce socket interconnect overhead and improve performance.
> > Specifically, in the first phase we scrub at the same time on all the
> > NUMA nodes that have CPUs - we also weed out the SMT threads so that
> > we only use cores (that gives a 50% boost). The second phase is for NUMA
> > nodes that have no CPUs - for those we use the closest NUMA node's CPUs
> > (non-SMT again) to do the job.
> >
> > This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
> > 6386 machine from 49 seconds to 3 seconds.
> > On an IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
> > to 63 seconds.
> >
> > Signed-off-by: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > Reviewed-by: Tim Deegan <tim@xxxxxxx>
> 
> Functionally, Reviewed-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> 
> A few minor nits...
> 
> >
> > ---
> >
> > v2
> >  - Reduced default chunk size to 128MB
> >  - Added code to scrub NUMA nodes with no active CPU linked to them
> >  - Be robust to boot CPU not being linked to a NUMA node
> >
> > v3:
> >  - Don't use SMT threads
> >  - Take care of remainder if the number of CPUs (or memory) is odd
> >  - Restructure the worker thread
> >  - s/u64/unsigned long/
> >
> > v4:
> >  - Don't use all CPUs on non-CPU NUMA nodes, just closest one
> >  - Syntax, and docs updates
> >  - Compile on ARM
> >  - Fix bug when NUMA node has 0 pages
> >
> > v5:
> >  - Properly figure out best NUMA node.
> >  - Fix comments to be proper style.
> >
> > v6:
> >  - Missing page shift on default values, optimize cpumask usage.
> >  - Fix case of NODE having one page and below first_valid_mfn
> >  - Add Ack
> > ---
> >  docs/misc/xen-command-line.markdown |   10 ++
> >  xen/common/page_alloc.c             |  211 ++++++++++++++++++++++++++++++++---
> >  xen/include/asm-arm/numa.h          |    1 +
> >  3 files changed, 204 insertions(+), 18 deletions(-)
> >
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> > index 25829fe..509f462 100644
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes
> >  and reboots.
> >  
> > +### bootscrub_chunk
> 
> _ needs escaping with a backslash
> 
> > +> `= <size>`
> > +
> > +> Default: `134217728`
> 
> Due to the implicit 'k' suffix on sizes, this is not a useful hint as to
> how to change the default.
> 
> Furthermore, `128M` is substantially clearer.

<smiles>
> 
> > +
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> > +and not running softirqs. Reduce this if softirqs are not being run frequently
> > +enough. Setting this to a high value may cause boot failure, particularly if
> > +the NMI watchdog is also enabled.
> > +
> >  ### cachesize
> >  > `= <size>`
> >  
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> > ...
> > - * Scrub all unallocated pages in all heap zones. This function is more
> > - * convoluted than appears necessary because we do not want to continuously
> > - * hold the lock while scrubbing very large memory areas.
> > + * Scrub all unallocated pages in all heap zones. This function uses all
> > + * online CPUs to scrub the memory in parallel.
> >   */
> >  void __init scrub_heap_pages(void)
> >  {
> > -    unsigned long mfn;
> > -    struct page_info *pg;
> > +    cpumask_t node_cpus, all_worker_cpus;
> > +    unsigned int i, j;
> > +    unsigned long offset, max_per_cpu_sz = 0;
> > +    unsigned long start, end;
> > +    unsigned long rem = 0;
> > +    int last_distance, best_node;
> > +    int cpus;
> >  
> >      if ( !opt_bootscrub )
> >          return;
> >  
> > -    printk("Scrubbing Free RAM: ");
> > +    cpumask_clear(&all_worker_cpus);
> > +    /* Scrub block size. */
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> > +    if ( chunk_size == 0 )
> > +        chunk_size = MB(128) >> PAGE_SHIFT;
> > +
> > +    /* Round #0 - figure out amounts and which CPUs to use. */
> > +    for_each_online_node ( i )
> > +    {
> > +        if ( !node_spanned_pages(i) )
> > +            continue;
> > +        /* Calculate Node memory start and end address. */
> > +        start = max(node_start_pfn(i), first_valid_mfn);
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> > +        /* Just in case NODE has 1 page and starts below first_valid_mfn. */
> > +        end = max(end, start);
> > +        /* CPUs that are online and on this node (if none, that is OK). */
> > +        cpus = find_non_smt(i, &node_cpus);
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> > +        if ( cpus <= 0 )
> > +        {
> > +            /* No CPUs on this node. Round #2 will take care of it. */
> > +            rem = 0;
> > +            region[i].per_cpu_sz = (end - start);
> > +        }
> > +        else
> > +        {
> > +            rem = (end - start) % cpumask_weight(&node_cpus);
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> 
> The 'cpus' variable still holds cpumask_weight(&node_cpus), and is
> rather more efficient to use.

Grrr. Right!
> 
> > +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> > +                max_per_cpu_sz = region[i].per_cpu_sz;
> > +        }
> > +        region[i].start = start;
> > +        region[i].rem = rem;
> > +        cpumask_copy(&region[i].cpus, &node_cpus);
> > +    }
> > +
> > +    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n", num_online_nodes(),
> > +           cpumask_weight(&all_worker_cpus));
> 
> Both of these are signed quantities, rather than unsigned.

Done!
> 
> ~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

