[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h
On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > On 26.02.2024 12:18, Oleksii wrote: > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that is > > > > > > > > not > > > > > > > > to > > > > > > > > provide > > > > > > > > sfx argument for emulation macros as it will not have > > > > > > > > to > > > > > > > > much > > > > > > > > affect on > > > > > > > > emulated types and just consume more performance on > > > > > > > > acquire > > > > > > > > and > > > > > > > > release > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > accesses) > > > > > > > shouldn't > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > there > > > > > > > just > > > > > > > to > > > > > > > allow things to not break altogether. > > > > > > If I understand you correctly, it would make sense to add > > > > > > the > > > > > > 'sfx' > > > > > > argument for the 1/2-byte access case, ensuring that all > > > > > > options > > > > > > are > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > worried > > > > > about > > > > > the emulated cases. For the initial implementation I'd > > > > > recommend > > > > > going > > > > > with what is easiest there, yielding the best possible result > > > > > for > > > > > the > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > acquire/release > > > > > accesses are a problem in the emulation loop, things can be > > > > > changed > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > cases. > > > > I am confused then a little bit if emulated case is not an > > > > issue. > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and > > > > aqcuire > > > > version of xchg barries are used. > > > > > > > > The similar is done for cmpxchg. > > > > > > > > If something will be needed to change in emulation loop it > > > > won't > > > > require to change 4- and 8-byte cases. > > > > > > I'm afraid I don't understand your reply. > > IIUC, emulated cases it is implemented correctly in terms of usage > > barriers. And it also OK not to use sfx for lr/sc instructions and > > use > > only barriers. > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). > > What also looks to me correct. But you suggested to provide the > > best > > possible result for 4- and 8-byte cases. > > > > So I don't understand what the best possible result is as the > > current > > one usage of __{cmp}xchg_generic for each specific case ( relaxed, > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > barriers. > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > release > > barrier > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire > > barrier > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > So first: While explicit barriers are technically okay, I don't > follow why > you insist on using them when you can achieve the same by suitably > tagging > the actual insn doing the exchange. Then second: It's somewhat hard > for me > to see the final effect on the emulation paths without you actually > having > done the switch. Maybe no special handling is necessary there anymore > then. And as said, it may actually be acceptable for the emulation > paths > to "only" be correct, but not be ideal in terms of performance. After > all, > if you use the normal 4-byte primitive in there, more (non-explicit) > barriers than needed would occur if the involved loop has to take > more > than one iteration. Which could (but imo doesn't need to be) avoided > by > using a more relaxed 4-byte primitive there and an explicit barrier > outside of the loop. According to the spec: Table A.5 ( part of the table only I copied here ) Linux Construct RVWMO Mapping atomic <op> relaxed amo<op>.{w|d} atomic <op> acquire amo<op>.{w|d}.aq atomic <op> release amo<op>.{w|d}.rl atomic <op> amo<op>.{w|d}.aqrl Linux Construct RVWMO LR/SC Mapping atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop The Linux mappings for release operations may seem stronger than necessary, but these mappings are needed to cover some cases in which Linux requires stronger orderings than the more intuitive mappings would provide. In particular, as of the time this text is being written, Linux is actively debating whether to require load-load, load-store, and store-store orderings between accesses in one critical section and accesses in a subsequent critical section in the same hart and protected by the same synchronization object. Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings combine to provide such orderings. There are a few ways around this problem, including: 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices but is undesir- able, as it defeats the purpose of the aq/rl modifiers. 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not currently work due to the lack of load and store opcodes with aq and rl modifiers. 3. Strengthen the mappings of release operations such that they would enforce sufficient order- ings in the presence of either type of acquire mapping. This is the currently-recommended solution, and the one shown in Table A.5. Based on this it is enough in our case use only suffixed istructions (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. But as far as I understand in Linux atomics were strengthen with fences: Atomics present the same issue with locking: release and acquire variants need to be strengthened to meet the constraints defined by the Linux-kernel memory consistency model [1]. Atomics present a further issue: implementations of atomics such as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, which do not give full-ordering with .aqrl; for example, current implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test below to end up with the state indicated in the "exists" clause. In order to "synchronize" LKMM and RISC-V's implementation, this commit strengthens the implementations of the atomics operations by replacing .rl and .aq with the use of ("lightweigth") fences, and by replacing .aqrl LR/SC pairs in sequences such as: 0: lr.w.aqrl %0, %addr bne %0, %old, 1f ... sc.w.aqrl %1, %new, %addr bnez %1, 0b 1: with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V memory consistency model. C lr-sc-aqrl-pair-vs-full-barrier {} P0(int *x, int *y, atomic_t *u) { int r0; int r1; WRITE_ONCE(*x, 1); r0 = atomic_cmpxchg(u, 0, 1); r1 = READ_ONCE(*y); } P1(int *x, int *y, atomic_t *v) { int r0; int r1; WRITE_ONCE(*y, 1); r0 = atomic_cmpxchg(v, 0, 1); r1 = READ_ONCE(*x); } exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM https://marc.info/?l=linux-kernel&m=151633436614259&w=2 Thereby Linux kernel implementation seems to me more safe and it is a reason why I want/wanted to be aligned with it. ~ Oleksii
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |