[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h



On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote:
> On 26.02.2024 12:18, Oleksii wrote:
> > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote:
> > > On 23.02.2024 13:23, Oleksii wrote:
> > > > > 
> > > > > > > > As 1- and 2-byte cases are emulated I decided that is
> > > > > > > > not
> > > > > > > > to
> > > > > > > > provide
> > > > > > > > sfx argument for emulation macros as it will not have
> > > > > > > > to
> > > > > > > > much
> > > > > > > > affect on
> > > > > > > > emulated types and just consume more performance on
> > > > > > > > acquire
> > > > > > > > and
> > > > > > > > release
> > > > > > > > version of sc/ld instructions.
> > > > > > > 
> > > > > > > Question is whether the common case (4- and 8-byte
> > > > > > > accesses)
> > > > > > > shouldn't
> > > > > > > be valued higher, with 1- and 2-byte emulation being
> > > > > > > there
> > > > > > > just
> > > > > > > to
> > > > > > > allow things to not break altogether.
> > > > > > If I understand you correctly, it would make sense to add
> > > > > > the
> > > > > > 'sfx'
> > > > > > argument for the 1/2-byte access case, ensuring that all
> > > > > > options
> > > > > > are
> > > > > > available for 1/2-byte access case as well.
> > > > > 
> > > > > That's one of the possibilities. As said, I'm not overly
> > > > > worried
> > > > > about
> > > > > the emulated cases. For the initial implementation I'd
> > > > > recommend
> > > > > going
> > > > > with what is easiest there, yielding the best possible result
> > > > > for
> > > > > the
> > > > > 4- and 8-byte cases. If later it turns out repeated
> > > > > acquire/release
> > > > > accesses are a problem in the emulation loop, things can be
> > > > > changed
> > > > > to explicit barriers, without touching the 4- and 8-byte
> > > > > cases.
> > > > I am confused then a little bit if emulated case is not an
> > > > issue.
> > > > 
> > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and
> > > > aqcuire
> > > > version of xchg barries are used.
> > > > 
> > > > The similar is done for cmpxchg.
> > > > 
> > > > If something will be needed to change in emulation loop it
> > > > won't
> > > > require to change 4- and 8-byte cases.
> > > 
> > > I'm afraid I don't understand your reply.
> > IIUC, emulated cases it is implemented correctly in terms of usage
> > barriers. And it also OK not to use sfx for lr/sc instructions and
> > use
> > only barriers.
> > 
> > For 4- and 8-byte cases are used sfx + barrier depending on the
> > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ).
> > What also looks to me correct. But you suggested to provide the
> > best
> > possible result for 4- and 8-byte cases. 
> > 
> > So I don't understand what the best possible result is as the
> > current
> > one usage of __{cmp}xchg_generic for each specific case  ( relaxed,
> > acquire, release, generic xchg/cmpxchg ) looks correct to me:
> > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without
> > barriers.
> > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only
> > release
> > barrier
> > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire
> > barrier
> > xchg_relaxed ->  (..., "", "", "") - no barries, no sfx
> 
> So first: While explicit barriers are technically okay, I don't
> follow why
> you insist on using them when you can achieve the same by suitably
> tagging
> the actual insn doing the exchange. Then second: It's somewhat hard
> for me
> to see the final effect on the emulation paths without you actually
> having
> done the switch. Maybe no special handling is necessary there anymore
> then. And as said, it may actually be acceptable for the emulation
> paths
> to "only" be correct, but not be ideal in terms of performance. After
> all,
> if you use the normal 4-byte primitive in there, more (non-explicit)
> barriers than needed would occur if the involved loop has to take
> more
> than one iteration. Which could (but imo doesn't need to be) avoided
> by
> using a more relaxed 4-byte primitive there and an explicit barrier
> outside of the loop.

According to the spec:
Table A.5 ( part of the table only I copied here )

Linux Construct          RVWMO Mapping
atomic <op> relaxed           amo<op>.{w|d}
atomic <op> acquire           amo<op>.{w|d}.aq
atomic <op> release           amo<op>.{w|d}.rl
atomic <op>                   amo<op>.{w|d}.aqrl

Linux Construct          RVWMO LR/SC Mapping
atomic <op> relaxed       loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
atomic <op> acquire       loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
atomic <op> release       loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez 
loop OR
                          fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ;
bnez loop
atomic <op>               loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
loop

The Linux mappings for release operations may seem stronger than
necessary, but these mappings
are needed to cover some cases in which Linux requires stronger
orderings than the more intuitive
mappings would provide. In particular, as of the time this text is
being written, Linux is actively
debating whether to require load-load, load-store, and store-store
orderings between accesses in one
critical section and accesses in a subsequent critical section in the
same hart and protected by the
same synchronization object. Not all combinations of FENCE RW,W/FENCE
R,RW mappings
with aq/rl mappings combine to provide such orderings. There are a few
ways around this problem,
including:
1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices
but is undesir-
able, as it defeats the purpose of the aq/rl modifiers.
2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not
currently work
due to the lack of load and store opcodes with aq and rl modifiers.
3. Strengthen the mappings of release operations such that they would
enforce sufficient order-
ings in the presence of either type of acquire mapping. This is the
currently-recommended
solution, and the one shown in Table A.5.


Based on this it is enough in our case use only suffixed istructions
(amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }.


But as far as I understand in Linux atomics were strengthen with
fences:
    Atomics present the same issue with locking: release and acquire
    variants need to be strengthened to meet the constraints defined
    by the Linux-kernel memory consistency model [1].
    
    Atomics present a further issue: implementations of atomics such
    as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
    which do not give full-ordering with .aqrl; for example, current
    implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
    below to end up with the state indicated in the "exists" clause.
    
    In order to "synchronize" LKMM and RISC-V's implementation, this
    commit strengthens the implementations of the atomics operations
    by replacing .rl and .aq with the use of ("lightweigth") fences,
    and by replacing .aqrl LR/SC pairs in sequences such as:
    
      0:      lr.w.aqrl  %0, %addr
              bne        %0, %old, 1f
              ...
              sc.w.aqrl  %1, %new, %addr
              bnez       %1, 0b
      1:
    
    with sequences of the form:
    
      0:      lr.w       %0, %addr
              bne        %0, %old, 1f
              ...
              sc.w.rl    %1, %new, %addr   /* SC-release   */
              bnez       %1, 0b
              fence      rw, rw            /* "full" fence */
      1:
    
    following Daniel's suggestion.
    
    These modifications were validated with simulation of the RISC-V
    with sequences of the form:
    
      0:      lr.w       %0, %addr
              bne        %0, %old, 1f
              ...
              sc.w.rl    %1, %new, %addr   /* SC-release   */
              bnez       %1, 0b
              fence      rw, rw            /* "full" fence */
      1:
    
    following Daniel's suggestion.
    
    These modifications were validated with simulation of the RISC-V
    memory consistency model.
    
    C lr-sc-aqrl-pair-vs-full-barrier
    
    {}
    
    P0(int *x, int *y, atomic_t *u)
    {
            int r0;
            int r1;
    
            WRITE_ONCE(*x, 1);
            r0 = atomic_cmpxchg(u, 0, 1);
            r1 = READ_ONCE(*y);
    }
    
    P1(int *x, int *y, atomic_t *v)
    {
            int r0;
            int r1;
    
            WRITE_ONCE(*y, 1);
            r0 = atomic_cmpxchg(v, 0, 1);
            r1 = READ_ONCE(*x);
    }
    
    exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
    
    [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2
     
https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
        https://marc.info/?l=linux-kernel&m=151633436614259&w=2


Thereby Linux kernel implementation seems to me more safe and it is a
reason why I want/wanted to be aligned with it.

~ Oleksii





 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.