Xen project Mailing List

Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h

On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote: > On 15.02.2024 14:41, Oleksii wrote: > > > > + : "=r" (ret), "+A" (*ptr) \ > > > > + : "r" (new) \ > > > > + : "memory" ); \ > > > > +}) > > > > + > > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + uint32_t *ptr_32b_aligned = (uint32_t > > > > *)ALIGN_DOWN((unsigned > > > > long)ptr, 4); \ > > > > > > You now appear to assume that this macro is only used with inputs > > > not > > > crossing word boundaries. That's okay as long as suitably > > > guaranteed > > > at the use sites, but imo wants saying in a comment. > > > > > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - > > > > sizeof(*ptr))) > > > > * BITS_PER_BYTE; \ > > > > > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use > > > above)? > > The idea to read 8 bytes was to deal with crossing word boundary. > > So if > > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 > > byte boundary. Instead we align add 0x3, so it will become 0x0 and > > then > > just always work with 8 bytes. > > Then what if my 2-byte access crosses a dword boundary? A cache line > one? A page one? Everything looks okay to me, except in the case of a page boundary. In the scenario of a dword boundary: 0 1 2 3 4 5 6 7 8 9 ... X X X X X X X 1 1 X Assuming a variable starts at address 7, 4-byte alignment will be enforced, and 8 bytes will be processed starting from address 4. Concerning a cache line, it should still work, with potential performance issues arising only if a part of the variable is cached while another part is not. Regarding page crossing, I acknowledge that it could be problematic if the variable is entirely located at the end of a page, as there is no guarantee that the next page exists. In this case, it would be preferable to consistently read 4 bytes with 4-byte alignment: X 4094 4095 4096? X 1 1 ? However, if the variable spans two pages, proper page mapping should be ensured. It appears sensible to reconsider the macros and implement 4-byte alignment and 4-byte access, but then this is not clear how better to deal with first case ( dword boundary ). Panic ? or use the macros twice for address 4, and address 8? > > > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > > > + unsigned long ret_; \ > > > > + unsigned long rc; \ > > > > > > Similarly, why unsigned long here? > > sizeof(unsigned long) is 8 bytes and it was chosen as we are > > working > > with lc/sc.d which are working with 8 bytes. > > > > > > > > I also wonder about the mix of underscore suffixed (or not) > > > variable > > > names here. > > If the question about ret_, then the same as before size of ret > > argument of the macros will be 1 or 2, but {lc/sc}.d expected to > > work > > with 8 bytes. > > Then what's the uint32_t * about? Agree, then it should be also unsigned long. > > > > > > + __typeof__(*(ptr)) new__ = (new); \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + switch (size) \ > > > > + { \ > > > > + case 1: \ > > > > + case 2: \ > > > > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, > > > > acquire_barrier); \ > > > > > > ... this would become > > > > > > ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, > > > acquire_barrier); \ > > > > > > But, unlike assumed above, there's no enforcement here that a 2- > > > byte > > > quantity won't cross a word, double-word, cache line, or even > > > page > > > boundary. That might be okay if then the code would simply crash > > > (like > > > the AMO insns emitted further down would), but aiui silent > > > misbehavior > > > would result. > > As I mentioned above with 4-byte alignment and then reading and > > working > > with 8-byte then crossing a word or double-word boundary shouldn't > > be > > an issue. > > > > I am not sure that I know how to check that we are crossing cache > > line > > boundary. > > > > Regarding page boundary, if the next page is mapped then all should > > work fine, otherwise it will be an exception. > > Are you sure lr.d / sc.d are happy to access across such a boundary, > when both pages are mapped? If they are mapped, my expectation that lr.d and sc.d should be happy. > > To me it seems pretty clear that for atomic accesses you want to > demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. > This way you can be sure no potentially problematic boundaries will > be crossed. It makes sense, but I am not sure that I can guarantee that a user of macros will always have 2-byte alignment (except during a panic) in the future. Even now, I am uncertain that everyone will be willing to add __alignment(...) to struct vcpu->is_urgent (xen/include/xen/sched.h:218) and other possible cases to accommodate RISC-V requirements. > > > > > + break; \ > > > > + case 4: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".w" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + case 8: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".d" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + default: \ > > > > + STATIC_ASSERT_UNREACHABLE(); \ > > > > + } \ > > > > + ret__; \ > > > > +}) > > > > + > > > > +#define xchg_relaxed(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), > > > > "", "", ""); \ > > > > +}) > > > > + > > > > +#define xchg_acquire(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), \ > > > > + "", "", > > > > RISCV_ACQUIRE_BARRIER); \ > > > > +}) > > > > + > > > > +#define xchg_release(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)),\ > > > > + "", > > > > RISCV_RELEASE_BARRIER, > > > > ""); \ > > > > +}) > > > > + > > > > +#define xchg(ptr,x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + ret__ = (__typeof__(*(ptr))) \ > > > > + __xchg_generic(ptr, (unsigned long)(x), > > > > sizeof(*(ptr)), \ > > > > + ".aqrl", "", ""); \ > > > > > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte > > > cases. > > > > > > Further, amoswap also exists in release-only and acquire-only > > > forms. > > > Why do you prefer explicit barrier insns over those? (Looks to > > > similarly apply to the emulation path as well as to the cmpxchg > > > machinery then, as both lr and sc also come in all four possible > > > acquire/release forms. Perhaps for the emulation path using > > > explicit barriers is better, in case the acquire/release forms of > > > lr/sc - being used inside the loop - might perform worse.) > > As 1- and 2-byte cases are emulated I decided that is not to > > provide > > sfx argument for emulation macros as it will not have to much > > affect on > > emulated types and just consume more performance on acquire and > > release > > version of sc/ld instructions. > > Question is whether the common case (4- and 8-byte accesses) > shouldn't > be valued higher, with 1- and 2-byte emulation being there just to > allow things to not break altogether. If I understand you correctly, it would make sense to add the 'sfx' argument for the 1/2-byte access case, ensuring that all options are available for 1/2-byte access case as well. ~ Oleksii

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.