[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader



On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> 
>> >> wrote:
>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@xxxxxxxxxx> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>> >> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then 
>> >> >> > > > > update
>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, 
>> >> >> > > > > we have
>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and 
>> >> >> > > > > KVM
>> >> >> > > > > doesn't increment the version pre-update, and we can return 
>> >> >> > > > > completely
>> >> >> > > > > bogus results.
>> >> >> > > > Yes.
>> >> >> > > But then the getcpu test would fail (1->0).  Even if you have an 
>> >> >> > > ABA
>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is 
>> >> >> > > the
>> >> >> > > one returned by the first getcpu.
>> >> >> >
>> >> >> > ... this case of partial update of pvti, which is caught by the 
>> >> >> > version
>> >> >> > field, if of course different from the other (extremely unlikely) 
>> >> >> > that
>> >> >> > Andy pointed out.  That is when the getcpus are done on the same 
>> >> >> > vCPU,
>> >> >> > but the rdtsc is another.
>> >> >> >
>> >> >> > That one can be fixed by rdtscp, like
>> >> >> >
>> >> >> > do {
>> >> >> >     // get a consistent (pvti, v, tsc) tuple
>> >> >> >     do {
>> >> >> >         cpu = get_cpu();
>> >> >> >         pvti = get_pvti(cpu);
>> >> >> >         v = pvti->version & ~1;
>> >> >> >         // also acts as rmb();
>> >> >> >         rdtsc_barrier();
>> >> >> >         tsc = rdtscp(&cpu1);
>> >> >>
>> >> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
>> >> >> specified it that way and both AMD and Intel implement it correctly.
>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>> >> >>
>> >> >> >         // control dependency, no need for rdtsc_barrier?
>> >> >> >     } while(cpu != cpu1);
>> >> >> >
>> >> >> >     // ... compute nanoseconds from pvti and tsc ...
>> >> >> >     rmb();
>> >> >> > }   while(v != pvti->version);
>> >> >>
>> >> >> Still no good.  We can migrate a bunch of times so we see the same CPU
>> >> >> all three times and *still* don't get a consistent read, unless we
>> >> >> play nasty games with lots of version checks (I have a patch for that,
>> >> >> but I don't like it very much).  The patch is here:
>> >> >>
>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>> >> >>
>> >> >> but I don't like it.
>> >> >>
>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>> >> >> while it's being written, and I think you're now telling me that this
>> >> >> isn't true and that a guest *can* observe pvti while it's being
>> >> >> written while the low bit of the version field is not set.  If so,
>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>> >> >>
>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>> >> >> semantics that KVM provides are and could write it down in a way that
>> >> >> people who haven't spent a long time staring at the request code
>> >> >> understand?  And maybe you could even fix the implementation while
>> >> >> you're at it if the implementation is, indeed, broken.  I have ugly
>> >> >> patches to fix it here:
>> >> >>
>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>> >> >>
>> >> >> but I'm not thrilled with them.
>> >> >>
>> >> >> --Andy
>> >> >
>> >> > I suppose that separating the version write from the rest of the pvclock
>> >> > structure is sufficient, as that would guarantee the writes are not
>> >> > reordered even with fast string REP MOVS.
>> >> >
>> >> > Thanks for catching this Andy!
>> >> >
>> >>
>> >> Don't you stil need:
>> >>
>> >> version++;
>> >> write the rest;
>> >> version++;
>> >>
>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>> >
>> > Correct. Could just as well follow the protocol and use odd/even, which
>> > is what your patch does.
>> >
>> > What is the point with the new flags bit though?
>>
>> To try to work around the problem on old hosts.  I'm not at all
>> convinced that this is worthwhile or that it helps, though.
>
> Andy,
>
> Are you going to submit the fix or should i?
>

I'd prefer if you did it.  I'm not familiar enough with the KVM memory
management stuff to do it confidently.  Feel free to mooch from my
patch if it's helpful.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.