Re: IRQ latency measurements in hypervisor
Hi Julien,

Julien Grall writes:

> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
>> Hi Julien,
>>
>> Julien Grall writes:
>>
>>> Hi Volodymyr, Stefano,
>>>
>>> On 14/01/2021 23:33, Stefano Stabellini wrote:
>>>> + Bertrand, Andrew (see comment on alloc_heap_pages())
>>>
>>> Long running hypercalls are usually considered security issues.
>>>
>>> In this case, only the control domain can issue large memory allocations (2GB at a time). A guest would only be able to allocate 2MB at a time, so from the numbers below, it would only take 1ms max.
>>>
>>> So I think we are fine here. Next time you find a large loop, please provide an explanation of why it is not a security issue (e.g. cannot be used by guests) or send an email to the Security Team if in doubt.
>>
>> Sure. In this case I took into account that only the control domain can issue this call, I just didn't state this explicitly. Next time I will.
>
> I am afraid that's not correct. The guest can request to populate a region. This is used for instance in the ballooning case.
>
> The main difference is that a non-privileged guest will not be able to do an allocation larger than 2MB.

I did some measurements. According to them, an allocation of order 9 takes about 265us on my HW. I cover this in detail at the end of this email.

>>>> This is very interesting too. Did you get any spikes with the period set to 100us? It would be fantastic if there were none.
>>>>
>>>>> 3. Huge latency spike during domain creation. I conducted some additional tests, including use of PV drivers, but this didn't affect the latency in my "real time" domain. But an attempt to create another domain with a relatively large memory size of 2GB led to a huge spike in latency. Debugging led to this call path:
>>>>>
>>>>> XENMEM_populate_physmap -> populate_physmap() -> alloc_domheap_pages() -> alloc_heap_pages() -> huge "for ( i = 0; i < (1 << order); i++ )" loop.
>>>
>>> There are two for loops in alloc_heap_pages() using this syntax. Which one are you referring to?
>>
>> I did some tracing with Lauterbach. It pointed to the first loop, and especially to the flush_page_to_ram() call, if I remember correctly.
>
> Thanks, I am not entirely surprised, because we are cleaning and invalidating the region line by line and across all the CPUs.
>
> If we are assuming 128-byte cache lines, we will need to issue 32 cache instructions per page. This is going to involve quite a bit of traffic on the system.
>
> One possibility would be to defer the cache flush until the domain is created and use the hypercall XEN_DOMCTL_cacheflush to issue the flush.

Can we flush caches on first access to a page? What I mean is: do not populate the stage 2 tables with the allocated memory, flush the caches in the abort handler and then populate them.

> Note that XEN_DOMCTL_cacheflush would need some modification to be preemptible. But at least it will work on a GFN, which is easier to track.
>
>>>>> I managed to overcome issue #3 by commenting out all calls to populate_one_size() except the populate_one_size(PFN_4K_SHIFT) one in xg_dom_arm.c. This lengthened domain construction, but my "RT" domain didn't experience such big latency issues. Apparently all other hypercalls which are used during domain creation are either fast or preemptible. No doubt my hack leads to page table inflation and an overall performance drop.
>>>>
>>>> I think we need to follow this up and fix this. Maybe just by adding a hypercall continuation to the loop.
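To put the line-by-line flushing that Julien describes above into absolute numbers, here is a quick back-of-envelope sketch in Python (it only assumes 4 KiB pages and the 128-byte cache line size mentioned above):

# Back-of-envelope: cache maintenance operations implied by flushing an
# order-N allocation page by page, assuming 4 KiB pages and a 128-byte
# cache line (both are assumptions matching the figures quoted above).
PAGE_SIZE = 4096
CACHE_LINE = 128
OPS_PER_PAGE = PAGE_SIZE // CACHE_LINE        # 32 cache ops per page

for order in (0, 3, 9, 18):
    pages = 1 << order
    print(f"order {order:2}: {pages:6} pages -> {pages * OPS_PER_PAGE:7} cache ops")

So a single order-18 allocation implies roughly 8.4 million cache maintenance operations, all issued while the heap lock is held.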
>>> When I read "hypercall continuation", I read we will return to the guest context so it can process interrupts and potentially switch to another task.
>>>
>>> This means that the guest could issue a second populate_physmap() from the vCPU. Therefore any restart information should be part of the hypercall parameters. So far, I don't see how this would be possible.
>>>
>>> Even if we overcome that part, this can be easily abused by a guest, as the memory is not yet accounted to the domain. Imagine a guest that never requests the continuation of the populate_physmap(). So we would need to block the vCPU until the allocation is finished.
>>
>> Moreover, most of alloc_heap_pages() sits under a spinlock, so the first step would be to split this function into smaller atomic parts.
>
> Do you have any suggestions on how to split it?

Well, it is quite a complex function and I can't tell right away. At this time I don't quite understand why spin_unlock() is called only after the first (1 << order) loop, for instance. Also, this function does many different things despite its simple name.

>>> I think the first step is we need to figure out which part of the allocation is slow (see my question above). From there, we can figure out if there is a way to reduce the impact.
>>
>> I'll do more tracing and will return with more accurate numbers. But as far as I can see, any loop over 262144 pages will take some time...
>
> It really depends on the content of the loop. On any modern processor, you are very likely not going to notice a loop that updates just a flag.
>
> However, you are likely going to see an impact if your loop is going to clean & invalidate the cache for each page.

Totally agree. I used the Xen tracing subsystem to do the measurements, and I can confirm that the call to flush_page_to_ram() causes most of the impact. Here are the details.

I added a number of trace points to the function:

static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int order, unsigned int memflags,
    struct domain *d)
{
    nodeid_t node;
    unsigned int i, buddy_order, zone, first_dirty;
    unsigned long request = 1UL << order;
    struct page_info *pg;
    bool need_tlbflush = false;
    uint32_t tlbflush_timestamp = 0;
    unsigned int dirty_cnt = 0;

    /* Make sure there are enough bits in memflags for nodeID. */
    BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));

    ASSERT(zone_lo <= zone_hi);
    ASSERT(zone_hi < NR_ZONES);

    if ( unlikely(order > MAX_ORDER) )
        return NULL;

    spin_lock(&heap_lock);

    TRACE_1D(TRC_PGALLOC_PT1, order); // <=================================

    /*
     * Claimed memory is considered unavailable unless the request
     * is made by a domain with sufficient unclaimed pages.
     */
    if ( (outstanding_claims + request > total_avail_pages) &&
          ((memflags & MEMF_no_refcount) ||
            !d || d->outstanding_pages < request) )
    {
        spin_unlock(&heap_lock);
        return NULL;
    }

    pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
    /* Try getting a dirty buddy if we couldn't get a clean one. */
    if ( !pg && !(memflags & MEMF_no_scrub) )
        pg = get_free_buddy(zone_lo, zone_hi, order,
                            memflags | MEMF_no_scrub, d);
    if ( !pg )
    {
        /* No suitable memory blocks. Fail the request. */
        spin_unlock(&heap_lock);
        return NULL;
    }

    TRACE_0D(TRC_PGALLOC_PT2); // <=================================

    node = phys_to_nid(page_to_maddr(pg));
    zone = page_to_zone(pg);
    buddy_order = PFN_ORDER(pg);

    first_dirty = pg->u.free.first_dirty;
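    /*
     * [Annotation added for this mail, not upstream code:] the two loops
     * below are the interesting part for the latency discussion above.
     * The buddy-splitting loop between trace points 2 and 3 is cheap;
     * the (1 << order) loop between trace points 3 and 4 calls
     * flush_page_to_ram() for every page and, as the measurements below
     * show, dominates the cost.
     */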
    /* We may have to halve the chunk a number of times. */
    while ( buddy_order != order )
    {
        buddy_order--;
        page_list_add_scrub(pg, node, zone, buddy_order,
                            (1U << buddy_order) > first_dirty ?
                            first_dirty : INVALID_DIRTY_IDX);
        pg += 1U << buddy_order;

        if ( first_dirty != INVALID_DIRTY_IDX )
        {
            /* Adjust first_dirty */
            if ( first_dirty >= 1U << buddy_order )
                first_dirty -= 1U << buddy_order;
            else
                first_dirty = 0; /* We've moved past original first_dirty */
        }
    }

    TRACE_0D(TRC_PGALLOC_PT3); // <=================================

    ASSERT(avail[node][zone] >= request);
    avail[node][zone] -= request;
    total_avail_pages -= request;
    ASSERT(total_avail_pages >= 0);

    check_low_mem_virq();

    if ( d != NULL )
        d->last_alloc_node = node;

    for ( i = 0; i < (1 << order); i++ )
    {
        /* Reference count must continuously be zero for free pages. */
        if ( (pg[i].count_info & ~PGC_need_scrub) != PGC_state_free )
        {
            printk(XENLOG_ERR
                   "pg[%u] MFN %"PRI_mfn" c=%#lx o=%u v=%#lx t=%#x\n",
                   i, mfn_x(page_to_mfn(pg + i)),
                   pg[i].count_info, pg[i].v.free.order,
                   pg[i].u.free.val, pg[i].tlbflush_timestamp);
            BUG();
        }

        /* PGC_need_scrub can only be set if first_dirty is valid */
        ASSERT(first_dirty != INVALID_DIRTY_IDX ||
               !(pg[i].count_info & PGC_need_scrub));

        /* Preserve PGC_need_scrub so we can check it after lock is dropped. */
        pg[i].count_info = PGC_state_inuse |
                           (pg[i].count_info & PGC_need_scrub);

        if ( !(memflags & MEMF_no_tlbflush) )
            accumulate_tlbflush(&need_tlbflush, &pg[i],
                                &tlbflush_timestamp);

        /* Initialise fields which have other uses for free pages. */
        pg[i].u.inuse.type_info = 0;
        page_set_owner(&pg[i], NULL);

        /* Ensure cache and RAM are consistent for platforms where the
         * guest can control its own visibility of/through the cache. */
        flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
                          !(memflags & MEMF_no_icache_flush));
    }

    TRACE_0D(TRC_PGALLOC_PT4); // <=================================

    spin_unlock(&heap_lock);

    if ( first_dirty != INVALID_DIRTY_IDX ||
         (scrub_debug && !(memflags & MEMF_no_scrub)) )
    {
        for ( i = 0; i < (1U << order); i++ )
        {
            if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
            {
                if ( !(memflags & MEMF_no_scrub) )
                    scrub_one_page(&pg[i]);

                dirty_cnt++;

                spin_lock(&heap_lock);
                pg[i].count_info &= ~PGC_need_scrub;
                spin_unlock(&heap_lock);
            }
            else if ( !(memflags & MEMF_no_scrub) )
                check_one_page(&pg[i]);
        }

        if ( dirty_cnt )
        {
            spin_lock(&heap_lock);
            node_need_scrub[node] -= dirty_cnt;
            spin_unlock(&heap_lock);
        }
    }

    TRACE_0D(TRC_PGALLOC_PT5); // <=================================

    if ( need_tlbflush )
        filtered_flush_tlb_mask(tlbflush_timestamp);

    TRACE_0D(TRC_PGALLOC_PT6); // <=================================

    return pg;
}

I also wrote a simple Python script that parses the output of xentrace and computes the deltas between the trace points.
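The script itself is not included here; the post-processing it does boils down to something like the sketch below. This assumes the binary trace has already been decoded into (timestamp_in_seconds, trace_point_number, order) tuples, e.g. with xentrace_format; that decoding step is omitted and the tuple layout is an assumption:

# Sketch of the delta computation behind the listings below.  Input records
# are assumed to be (timestamp_in_seconds, trace_point_number, order) tuples;
# order is only meaningful for trace point 1.
def print_deltas(records):
    start = prev = None
    for ts, point, order in records:
        if point == 1:                      # PT1 opens a new allocation
            start = prev = ts
            print(f"{ts:.6f}: page_alloc trace point 1. Order: {order}")
            continue
        line = f"{ts:.6f}: page_alloc trace point {point} (+ {ts - prev:.6f})"
        if point == 6:                      # PT6 closes it: report the total
            line += f": total: {ts - start:.6f}"
        print(line)
        prev = ts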
Here are the results for different orders:

46.842032: page_alloc trace point 1. Order: 18
46.842035: page_alloc trace point 2 (+ 0.000003)
46.842035: page_alloc trace point 3 (+ 0.000000)
46.975105: page_alloc trace point 4 (+ 0.133069)
46.975106: page_alloc trace point 5 (+ 0.000001)
46.975106: page_alloc trace point 6 (+ 0.000000): total: 0.133074

46.998536: page_alloc trace point 1. Order: 9
46.998538: page_alloc trace point 2 (+ 0.000002)
46.998540: page_alloc trace point 3 (+ 0.000001)
46.998799: page_alloc trace point 4 (+ 0.000259)
46.998800: page_alloc trace point 5 (+ 0.000000)
46.998800: page_alloc trace point 6 (+ 0.000000): total: 0.000264

46.835802: page_alloc trace point 1. Order: 3
46.835803: page_alloc trace point 2 (+ 0.000000)
46.835803: page_alloc trace point 3 (+ 0.000000)
46.835812: page_alloc trace point 4 (+ 0.000009)
46.835813: page_alloc trace point 5 (+ 0.000000)
46.835813: page_alloc trace point 6 (+ 0.000001): total: 0.000011

46.998815: page_alloc trace point 1. Order: 0
46.998816: page_alloc trace point 2 (+ 0.000002)
46.998817: page_alloc trace point 3 (+ 0.000000)
46.998818: page_alloc trace point 4 (+ 0.000002)
46.998819: page_alloc trace point 5 (+ 0.000001)
46.998819: page_alloc trace point 6 (+ 0.000000): total: 0.000005

Then I commented out the call to flush_page_to_ram() and got the following results:

149.561902: page_alloc trace point 1. Order: 18
149.561905: page_alloc trace point 2 (+ 0.000003)
149.561905: page_alloc trace point 3 (+ 0.000000)
149.569450: page_alloc trace point 4 (+ 0.007545)
149.569451: page_alloc trace point 5 (+ 0.000001)
149.569452: page_alloc trace point 6 (+ 0.000000): total: 0.007550

149.592624: page_alloc trace point 1. Order: 9
149.592626: page_alloc trace point 2 (+ 0.000003)
149.592627: page_alloc trace point 3 (+ 0.000001)
149.592639: page_alloc trace point 4 (+ 0.000012)
149.592639: page_alloc trace point 5 (+ 0.000000)
149.592640: page_alloc trace point 6 (+ 0.000000): total: 0.000016

All time units are seconds, by the way.
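As a quick sanity check on these numbers, the per-page cost of flush_page_to_ram() can be derived from the two order-18 runs (a derived figure, not a direct measurement):

# Derive the approximate cost of flush_page_to_ram() from the order-18 runs:
# loop time with the flush minus loop time without it, per page.
pages = 1 << 18                  # 262144 pages in an order-18 allocation
with_flush = 0.133069            # trace point 3 -> 4 delta, flush enabled (s)
without_flush = 0.007545         # trace point 3 -> 4 delta, flush removed (s)

flush_total = with_flush - without_flush
print(f"flush cost: {flush_total:.6f} s total, "
      f"{flush_total / pages * 1e6:.2f} us per page")

That is roughly 0.5us per page spent in flush_page_to_ram() alone, which is also consistent with the ~265us measured for an order-9 (512-page) allocation.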
--
Volodymyr Babchuk at EPAM