|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h
On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote:
> On 26.02.2024 12:18, Oleksii wrote:
> > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote:
> > > On 23.02.2024 13:23, Oleksii wrote:
> > > > >
> > > > > > > > As 1- and 2-byte cases are emulated I decided that is
> > > > > > > > not
> > > > > > > > to
> > > > > > > > provide
> > > > > > > > sfx argument for emulation macros as it will not have
> > > > > > > > to
> > > > > > > > much
> > > > > > > > affect on
> > > > > > > > emulated types and just consume more performance on
> > > > > > > > acquire
> > > > > > > > and
> > > > > > > > release
> > > > > > > > version of sc/ld instructions.
> > > > > > >
> > > > > > > Question is whether the common case (4- and 8-byte
> > > > > > > accesses)
> > > > > > > shouldn't
> > > > > > > be valued higher, with 1- and 2-byte emulation being
> > > > > > > there
> > > > > > > just
> > > > > > > to
> > > > > > > allow things to not break altogether.
> > > > > > If I understand you correctly, it would make sense to add
> > > > > > the
> > > > > > 'sfx'
> > > > > > argument for the 1/2-byte access case, ensuring that all
> > > > > > options
> > > > > > are
> > > > > > available for 1/2-byte access case as well.
> > > > >
> > > > > That's one of the possibilities. As said, I'm not overly
> > > > > worried
> > > > > about
> > > > > the emulated cases. For the initial implementation I'd
> > > > > recommend
> > > > > going
> > > > > with what is easiest there, yielding the best possible result
> > > > > for
> > > > > the
> > > > > 4- and 8-byte cases. If later it turns out repeated
> > > > > acquire/release
> > > > > accesses are a problem in the emulation loop, things can be
> > > > > changed
> > > > > to explicit barriers, without touching the 4- and 8-byte
> > > > > cases.
> > > > I am confused then a little bit if emulated case is not an
> > > > issue.
> > > >
> > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and
> > > > aqcuire
> > > > version of xchg barries are used.
> > > >
> > > > The similar is done for cmpxchg.
> > > >
> > > > If something will be needed to change in emulation loop it
> > > > won't
> > > > require to change 4- and 8-byte cases.
> > >
> > > I'm afraid I don't understand your reply.
> > IIUC, emulated cases it is implemented correctly in terms of usage
> > barriers. And it also OK not to use sfx for lr/sc instructions and
> > use
> > only barriers.
> >
> > For 4- and 8-byte cases are used sfx + barrier depending on the
> > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ).
> > What also looks to me correct. But you suggested to provide the
> > best
> > possible result for 4- and 8-byte cases.
> >
> > So I don't understand what the best possible result is as the
> > current
> > one usage of __{cmp}xchg_generic for each specific case ( relaxed,
> > acquire, release, generic xchg/cmpxchg ) looks correct to me:
> > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without
> > barriers.
> > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only
> > release
> > barrier
> > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire
> > barrier
> > xchg_relaxed -> (..., "", "", "") - no barries, no sfx
>
> So first: While explicit barriers are technically okay, I don't
> follow why
> you insist on using them when you can achieve the same by suitably
> tagging
> the actual insn doing the exchange. Then second: It's somewhat hard
> for me
> to see the final effect on the emulation paths without you actually
> having
> done the switch. Maybe no special handling is necessary there anymore
> then. And as said, it may actually be acceptable for the emulation
> paths
> to "only" be correct, but not be ideal in terms of performance. After
> all,
> if you use the normal 4-byte primitive in there, more (non-explicit)
> barriers than needed would occur if the involved loop has to take
> more
> than one iteration. Which could (but imo doesn't need to be) avoided
> by
> using a more relaxed 4-byte primitive there and an explicit barrier
> outside of the loop.
According to the spec:
Table A.5 ( part of the table only I copied here )
Linux Construct RVWMO Mapping
atomic <op> relaxed amo<op>.{w|d}
atomic <op> acquire amo<op>.{w|d}.aq
atomic <op> release amo<op>.{w|d}.rl
atomic <op> amo<op>.{w|d}.aqrl
Linux Construct RVWMO LR/SC Mapping
atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
loop OR
fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ;
bnez loop
atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
loop
The Linux mappings for release operations may seem stronger than
necessary, but these mappings
are needed to cover some cases in which Linux requires stronger
orderings than the more intuitive
mappings would provide. In particular, as of the time this text is
being written, Linux is actively
debating whether to require load-load, load-store, and store-store
orderings between accesses in one
critical section and accesses in a subsequent critical section in the
same hart and protected by the
same synchronization object. Not all combinations of FENCE RW,W/FENCE
R,RW mappings
with aq/rl mappings combine to provide such orderings. There are a few
ways around this problem,
including:
1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices
but is undesir-
able, as it defeats the purpose of the aq/rl modifiers.
2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not
currently work
due to the lack of load and store opcodes with aq and rl modifiers.
3. Strengthen the mappings of release operations such that they would
enforce sufficient order-
ings in the presence of either type of acquire mapping. This is the
currently-recommended
solution, and the one shown in Table A.5.
Based on this it is enough in our case use only suffixed istructions
(amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }.
But as far as I understand in Linux atomics were strengthen with
fences:
Atomics present the same issue with locking: release and acquire
variants need to be strengthened to meet the constraints defined
by the Linux-kernel memory consistency model [1].
Atomics present a further issue: implementations of atomics such
as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
which do not give full-ordering with .aqrl; for example, current
implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
below to end up with the state indicated in the "exists" clause.
In order to "synchronize" LKMM and RISC-V's implementation, this
commit strengthens the implementations of the atomics operations
by replacing .rl and .aq with the use of ("lightweigth") fences,
and by replacing .aqrl LR/SC pairs in sequences such as:
0: lr.w.aqrl %0, %addr
bne %0, %old, 1f
...
sc.w.aqrl %1, %new, %addr
bnez %1, 0b
1:
with sequences of the form:
0: lr.w %0, %addr
bne %0, %old, 1f
...
sc.w.rl %1, %new, %addr /* SC-release */
bnez %1, 0b
fence rw, rw /* "full" fence */
1:
following Daniel's suggestion.
These modifications were validated with simulation of the RISC-V
with sequences of the form:
0: lr.w %0, %addr
bne %0, %old, 1f
...
sc.w.rl %1, %new, %addr /* SC-release */
bnez %1, 0b
fence rw, rw /* "full" fence */
1:
following Daniel's suggestion.
These modifications were validated with simulation of the RISC-V
memory consistency model.
C lr-sc-aqrl-pair-vs-full-barrier
{}
P0(int *x, int *y, atomic_t *u)
{
int r0;
int r1;
WRITE_ONCE(*x, 1);
r0 = atomic_cmpxchg(u, 0, 1);
r1 = READ_ONCE(*y);
}
P1(int *x, int *y, atomic_t *v)
{
int r0;
int r1;
WRITE_ONCE(*y, 1);
r0 = atomic_cmpxchg(v, 0, 1);
r1 = READ_ONCE(*x);
}
exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
[1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2
https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
https://marc.info/?l=linux-kernel&m=151633436614259&w=2
Thereby Linux kernel implementation seems to me more safe and it is a
reason why I want/wanted to be aligned with it.
~ Oleksii
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |