[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Mon, 7 Jul 2014 16:33:57 +0200
Cc: Waiman.Long@xxxxxx, linux-arch@xxxxxxxxxxxxxxx, gleb@xxxxxxxxxx, kvm@xxxxxxxxxxxxxxx, boris.ostrovsky@xxxxxxxxxx, scott.norton@xxxxxx, raghavendra.kt@xxxxxxxxxxxxxxxxxx, paolo.bonzini@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx, chegu_vinod@xxxxxx, david.vrabel@xxxxxxxxxx, oleg@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, tglx@xxxxxxxxxxxxx, paulmck@xxxxxxxxxxxxxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, mingo@xxxxxxxxxx
Delivery-date: Mon, 07 Jul 2014 14:34:30 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, Jun 18, 2014 at 11:57:30AM -0400, Konrad Rzeszutek Wilk wrote:
> On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote:
> > From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > 
> > When we allow for a max NR_CPUS < 2^14 we can optimize the pending
> > wait-acquire and the xchg_tail() operations.
> > 
> > By growing the pending bit to a byte, we reduce the tail to 16bit.
> > This means we can use xchg16 for the tail part and do away with all
> > the repeated compxchg() operations.
> > 
> > This in turn allows us to unconditionally acquire; the locked state
> > as observed by the wait loops cannot change. And because both locked
> > and pending are now a full byte we can use simple stores for the
> > state transition, obviating one atomic operation entirely.
> 
> I have to ask - how much more performance do you get from this?
> 
> Is this extra atomic operation hurting that much?

Its not extra, its a cmpxchg loop vs an unconditional xchg.

And yes, its somewhat tedious to show, but on 4 socket systems you can
really see it make a difference. I'll try and run some numbers, I need
to reinstall the box.

(there were numbers in the previous threads, but you're right, I
should've put some in the Changelog).

> >  /**
> >   * queue_spin_lock_slowpath - acquire the queue spinlock
> > @@ -173,8 +259,13 @@ void queue_spin_lock_slowpath(struct qsp
> >      * we're pending, wait for the owner to go away.
> >      *
> >      * *,1,1 -> *,1,0
> > +    *
> > +    * this wait loop must be a load-acquire such that we match the
> > +    * store-release that clears the locked bit and create lock
> > +    * sequentiality; this because not all clear_pending_set_locked()
> > +    * implementations imply full barriers.
> >      */
> > -   while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
> > +   while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
> 
> lock->val.counter? Ugh, all to deal with the 'int' -> 'u32' (or 'u64')

No, to do atomic_t -> int.

> Could you introduce a macro in atomic.h called 'atomic_read_raw' which
> would do the this? Like this:

That would be worse I think. It looks like a function returning an
rvalue whereas we really want an lvalue.

Attachment: pgp5U5l9Vl5Pf.pgp
Description: PGP signature

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

Prev by Date: Re: [Xen-devel] [PATCH] qemu build: disable C++ optional code
Next by Date: Re: [Xen-devel] [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
Previous by thread: [Xen-devel] [linux-linus test] 28937: regressions - trouble: broken/fail/pass
Next by thread: Re: [Xen-devel] [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.