
Re: [Xen-devel] [PATCH v2] x86/mm: also flush TLB when putting writable foreign page reference



On 03/05/17 10:56, Andrew Cooper wrote:
> On 03/05/17 08:21, Jan Beulich wrote:
> >>>> On 02.05.17 at 19:37, <andrew.cooper3@xxxxxxxxxx> wrote:
> >> On 02/05/17 10:43, Tim Deegan wrote:
> >>> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote:
> >>>>>>> On 02.05.17 at 10:32, <tim@xxxxxxx> wrote:
> >>>>> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote:
> >>>>>>>>> On 27.04.17 at 11:51, <tim@xxxxxxx> wrote:
> >>>>>>> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote:
> >>>>>>>> ... it wouldn't be better the other way around: We use the
> >>>>>>>> patch in its current (or even v1) form, and try to do something
> >>>>>>>> about performance only if we really find a case where it
> >>>>>>>> matters. To be honest, I'm not even sure how I could
> >>>>>>>> meaningfully measure the impact here: Simply counting how many
> >>>>>>>> extra flushes there would end up being wouldn't seem all that
> >>>>>>>> useful, and whether there would be any measurable difference in
> >>>>>>>> the overall execution time of e.g. domain creation I would
> >>>>>>>> highly doubt (but if that's what you're after, I could certainly
> >>>>>>>> collect a few numbers).
> >>>>>>> I think that would be a good idea, just as a sanity-check.
> >>>>>> As it turns out there is a measurable effect: xc_dom_boot_image()
> >>>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself
> >>>>>> responsible for less than 10% of the overall time
> >>>>>> libxl__build_dom() takes, and that in turn is only a pretty small
> >>>>>> portion of the overall "xl create".
> >>>>> Do you think that slowdown is OK?  I'm not sure -- I'd be inclined
> >>>>> to avoid it, but could be persuaded, and it's not me doing the
> >>>>> work. :)
> >>>> Well, if there was a way to avoid it in a clean way without too
> >>>> much code churn, I'd be all for avoiding it. The avenues we've
> >>>> explored so far either didn't work (using pg_owner's dirty mask) or
> >>>> didn't promise to actually reduce the flush overhead in a
> >>>> meaningful way (adding a separate mask to be merged into the mask
> >>>> used for the flush in __get_page_type()), unless - as has been the
> >>>> case before - I didn't fully understand your thoughts there.
> >>> Quoting your earlier response:
> >>>
> >>>> Wouldn't it suffice to set bits in this mask in put_page_from_l1e()
> >>>> and consume/clear them in __get_page_type()? Right now I can't see
> >>>> it being necessary for correctness to fiddle with any of the other
> >>>> flushes using the domain dirty mask.
> >>>>
> >>>> But then again this may not be much of a win, unless the put
> >>>> operations come through in meaningful batches, not interleaved by
> >>>> any type changes (the latter ought to be guaranteed during domain
> >>>> construction and teardown at least, as the guest itself can't do
> >>>> anything at that time to effect type changes).
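[Aside, for anyone skimming the thread: below is a rough, self-contained toy
model of the deferred-flush idea quoted above -- accumulate the unmapper's
CPUs in a mask on the put side, and only flush them, filtered by a TLB
timestamp, when a later type change actually needs it. Every name and type
here is invented for illustration; this is neither current Xen code nor the
proposed patch, just a sketch of the scheme under discussion.

/*
 * Toy model (not Xen code) of the deferred-flush idea: instead of
 * flushing remote TLBs on every dropped writable foreign mapping,
 * remember the CPUs that may hold stale entries and only flush them
 * when a later type change needs it, filtered by a per-CPU "last
 * flushed" timestamp.  All names are invented for the illustration.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 8

static uint32_t tlb_clock = 1;            /* global flush timestamp     */
static uint32_t cpu_flush_time[NR_CPUS];  /* last flush seen per CPU    */
static unsigned long deferred_mask;       /* CPUs still needing a flush */
static unsigned int flush_count;          /* per-CPU flushes performed  */

/* Pretend to flush the TLB of every CPU set in @mask. */
static void flush_tlb_mask(unsigned long mask)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask & (1UL << cpu)) {
            cpu_flush_time[cpu] = ++tlb_clock;
            flush_count++;
        }
}

/* put_page_from_l1e() analogue: just accumulate the mapper's CPUs. */
static void put_foreign_writable(unsigned long mapper_dirty_cpus)
{
    deferred_mask |= mapper_dirty_cpus;
}

/* __get_page_type() analogue: flush only the deferred CPUs that have
 * not been flushed since @page_timestamp, then forget them. */
static void type_change(uint32_t page_timestamp)
{
    unsigned long need = 0;

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if ((deferred_mask & (1UL << cpu)) &&
            cpu_flush_time[cpu] <= page_timestamp)
            need |= 1UL << cpu;

    flush_tlb_mask(need);
    deferred_mask &= ~need;
}

int main(void)
{
    uint32_t stamp = tlb_clock;

    /* 1000 unmaps from CPUs 0-3 followed by one type change cost 4
     * per-CPU flushes in total here, instead of 4 per unmap. */
    for (int i = 0; i < 1000; i++)
        put_foreign_writable(0x0fUL);
    type_change(stamp);
    printf("per-CPU flushes performed: %u\n", flush_count);
    return 0;
}

Compiled and run, the toy prints 4 flushes for 1000 batched unmaps; without
the deferral the same sequence would flush the four CPUs on every unmap.
End of aside.]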
> >>> I'm not sure how much batching there needs to be.  I agree that the
> >>> domain creation case should work well though.  Let me think about
> >>> the scenarios when dom B is live:
> >>>
> >>> 1. Dom A drops its foreign map of page X; dom B immediately changes
> >>> the type of page X.  This case isn't helped at all, but I don't see
> >>> any way to improve it -- dom A's TLBs need to be flushed right away.
> >>>
> >>> 2. Dom A drops its foreign map of page X; dom B immediately changes
> >>> the type of page Y.  Now dom A's dirty CPUs are in the new map, but
> >>> B may not need to flush them right away.  B can filter by page Y's
> >>> timestamp, and flush (and clear) only some of the cpus in the map.
> >>>
> >>> So that seems good, but then there's a risk that cpus never get
> >>> cleared from the map, and __get_page_type() ends up doing a lot of
> >>> unnecessary work filtering timestamps.  When is it safe to remove a
> >>> CPU from that map?
> >>>  - obvs safe if we IPI it to flush the TLB (though may need memory
> >>>    barriers -- need to think about a race with CPU C putting A _into_
> >>>    the map at the same time...)
> >>>  - we could track the timestamp of the most recent addition to the
> >>>    map, and drop any CPU whose TLB has been flushed since that,
> >>>    but that still lets unrelated unmaps keep CPUs alive in the map...
> >>>  - we could double-buffer the map: always add CPUs to the active map;
> >>>    from time to time, swap maps and flush everything in the non-active
> >>>    map (filtered by the TLB timestamp when we last swapped over).
> >>>
> >>> Bah, this is turning into a tar pit.  Let's stick to the v2 patch as
> >>> being (relatively) simple and correct, and revisit this if it causes
> >>> trouble. :)
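[A second aside, same caveats as above: a toy fragment for the
double-buffering idea in the last bullet quoted above. Unmaps always add CPUs
to an "active" mask; a periodic maintenance step swaps the two masks and
flushes whatever had accumulated in the now-inactive one, using the timestamp
taken at the previous swap as the filter point. Again, every name is
invented; this is a sketch of the scheme being discussed, not Xen code.

/*
 * Toy fragment (not Xen code) of the double-buffered unmap mask.
 * A real version would additionally filter per CPU by its own flush
 * timestamp, as in the first sketch.  All names are invented.
 */
#include <stdio.h>
#include <stdint.h>

static unsigned long unmap_mask[2];  /* one collects, the other drains */
static unsigned int active;          /* index of the collecting mask   */
static uint32_t tlb_clock = 1;
static uint32_t last_swap_stamp;

/* Unmap path: only ever touches the active mask, so it stays cheap. */
static void note_foreign_unmap(unsigned long mapper_dirty_cpus)
{
    unmap_mask[active] |= mapper_dirty_cpus;
}

/* Periodic maintenance: flush whatever accumulated before the previous
 * swap, then make the drained mask the new active one. */
static void swap_and_drain(void)
{
    unsigned int draining = active ^ 1;

    /* Everything in the draining mask was added before last_swap_stamp,
     * so a flush filtered against that stamp suffices; here we just
     * pretend to flush and report it. */
    if (unmap_mask[draining])
        printf("flush CPUs 0x%lx (stale since %u)\n",
               unmap_mask[draining], (unsigned)last_swap_stamp);
    unmap_mask[draining] = 0;

    /* Swap roles and remember when we did it. */
    active = draining;
    last_swap_stamp = tlb_clock++;
}

int main(void)
{
    note_foreign_unmap(0x3);   /* unmaps from CPUs 0-1 land in mask A     */
    swap_and_drain();          /* first swap: nothing to drain yet        */
    note_foreign_unmap(0xc);   /* later unmaps from CPUs 2-3 go to mask B */
    swap_and_drain();          /* drains mask A: CPUs 0-1 get flushed     */
    swap_and_drain();          /* drains mask B: CPUs 2-3 get flushed     */
    return 0;
}

The attraction is that the unmap path stays a single OR into the active mask,
while any CPU is guaranteed to be dropped from the masks within two swap
periods. End of aside.]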
> >> :(
> >>
> >> A 70% performance hit for guest creation is certainly going to cause
> >> problems, but we obviously need to prioritise correctness in this case.
> > Hmm, you did understand that the 70% hit is on a specific sub-part of
> > the overall process, not guest creation as a whole? Anyway, your reply
> > is neither an ack nor a nak nor an indication of what needs to change
> > ...
> 
> Yes - I realise it isn't all of domain creation, but this performance hit
> will also hit migration, qemu DMA mappings, etc.
> 
> XenServer has started a side-by-side performance work-up of this change, as
> presented at the root of this thread.  We should hopefully have some
> numbers in the next day or two.
> 

I did some measurements on two builds of a recent version of XenServer using 
Xen upstream 4.9.0-3.0. The only difference between the builds was the patch 
x86-put-l1e-foreign-flush.patch from 
https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02945.html.

I observed no measurable difference between these builds with guest RAM values 
of 4GB, 8GB and 14GB for the following operations:
- time xe vm-start
- time xe vm-shutdown
- VM downtime during "xe vm-migrate" (measured by pinging the VM during 
migration and checking how long pings failed while both domains were paused)
- time xe vm-migrate # for HVM guests (e.g. win7 and win10)

But I did observe a difference in the duration of "time xe vm-migrate" for PV 
guests (e.g. centos68, debian70, ubuntu1204). For centos68, for instance, I 
obtained the following values on a machine with an Intel E3-1281v3 3.7GHz CPU, 
averaged over 10 runs for each data point:
| Guest RAM | no patch | with patch | difference | diff/RAM |
|   14GB    |  10.44s  |   13.46s   |   3.02s    | 0.22s/GB |
|    8GB    |   6.46s  |    8.28s   |   1.82s    | 0.23s/GB |
|    4GB    |   3.85s  |    4.74s   |   0.89s    | 0.22s/GB |

From these numbers, with the patch applied, migration of a PV guest takes 
roughly an extra 0.22s per GB of guest RAM, i.e. about one extra second for 
every 4-5GB. The VMs are mostly idle during migration. At this point, it's not 
clear to me why this difference is only visible on VM migration (as opposed to 
VM start, for example), and only for PV guests (as opposed to HVM guests).

Marcus

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

