|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h
On 15.02.2024 14:41, Oleksii wrote:
>>> + : "=r" (ret), "+A" (*ptr) \
>>> + : "r" (new) \
>>> + : "memory" ); \
>>> +})
>>> +
>>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier,
>>> acquire_barrier) \
>>> +({ \
>>> + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned
>>> long)ptr, 4); \
>>
>> You now appear to assume that this macro is only used with inputs not
>> crossing word boundaries. That's okay as long as suitably guaranteed
>> at the use sites, but imo wants saying in a comment.
>>
>>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr)))
>>> * BITS_PER_BYTE; \
>>
>> Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use
>> above)?
> The idea to read 8 bytes was to deal with crossing word boundary. So if
> our address is 0x3 and we have to xchg() 2 bytes, what will cross 4
> byte boundary. Instead we align add 0x3, so it will become 0x0 and then
> just always work with 8 bytes.
Then what if my 2-byte access crosses a dword boundary? A cache line
one? A page one?
>>> + unsigned long new_ = (unsigned long)(new) << mask_l; \
>>> + unsigned long ret_; \
>>> + unsigned long rc; \
>>
>> Similarly, why unsigned long here?
> sizeof(unsigned long) is 8 bytes and it was chosen as we are working
> with lc/sc.d which are working with 8 bytes.
>
>>
>> I also wonder about the mix of underscore suffixed (or not) variable
>> names here.
> If the question about ret_, then the same as before size of ret
> argument of the macros will be 1 or 2, but {lc/sc}.d expected to work
> with 8 bytes.
Then what's the uint32_t * about?
>>> + release_barrier \
>>> + "0: lr.d %0, %2\n" \
>>
>> Even here it's an 8-byte access. Even if - didn't check - the insn
>> was
>> okay to use with just a 4-byte aligned pointer, wouldn't it make
>> sense
>> then to 8-byte align it, and be consistent throughout this macro wrt
>> the base unit acted upon? Alternatively, why not use lr.w here, thus
>> reducing possible collisions between multiple CPUs accessing the same
>> cache line?
> According to the docs:
> LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit
> words in memory. Misaligned
> addresses will generate misaligned address exceptions.
>
> My intention was to deal with 4-byte crossing boundary. so if ptr is 4-
> byte aligned then by reading 8-bytes we shouldn't care about boundary
> crossing, if I am not missing something.
If a ptr is 4-byte aligned, there's no point reading more than 4 bytes.
>>> + " and %1, %0, %z4\n" \
>>> + " or %1, %1, %z3\n" \
>>> + " sc.d %1, %1, %2\n" \
>>> + " bnez %1, 0b\n" \
>>> + acquire_barrier \
>>> + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \
>>> + : "rJ" (new_), "rJ" (~mask) \
>>
>> I think that as soon as there are more than 2 or maybe 3 operands,
>> legibility is vastly improved by using named asm() operands.
> Just to clarify you mean that it would be better to use instead of %0
> use names?
Yes. Just like you have it in one of the other patches that I looked at
later.
>>> + : "memory"); \
>>
>> Nit: Missing blank before closing parenthesis.
>>
>>> + \
>>> + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \
>>> +})
>>
>> Why does "ret" need to be a macro argument? If you had only the
>> expression here, not the the assigment, ...
>>
>>> +#define __xchg_generic(ptr, new, size, sfx, release_barrier,
>>> acquire_barrier) \
>>> +({ \
>>> + __typeof__(ptr) ptr__ = (ptr); \
>>
>> Is this local variable really needed? Can't you use "ptr" directly
>> in the three macro invocations?
>>
>>> + __typeof__(*(ptr)) new__ = (new); \
>>> + __typeof__(*(ptr)) ret__; \
>>> + switch (size) \
>>> + { \
>>> + case 1: \
>>> + case 2: \
>>> + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier,
>>> acquire_barrier); \
>>
>> ... this would become
>>
>> ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier,
>> acquire_barrier); \
>>
>> But, unlike assumed above, there's no enforcement here that a 2-byte
>> quantity won't cross a word, double-word, cache line, or even page
>> boundary. That might be okay if then the code would simply crash
>> (like
>> the AMO insns emitted further down would), but aiui silent
>> misbehavior
>> would result.
> As I mentioned above with 4-byte alignment and then reading and working
> with 8-byte then crossing a word or double-word boundary shouldn't be
> an issue.
>
> I am not sure that I know how to check that we are crossing cache line
> boundary.
>
> Regarding page boundary, if the next page is mapped then all should
> work fine, otherwise it will be an exception.
Are you sure lr.d / sc.d are happy to access across such a boundary,
when both pages are mapped?
To me it seems pretty clear that for atomic accesses you want to
demand natural alignment, i.e. 2-byte alignment for 2-byte accesses.
This way you can be sure no potentially problematic boundaries will
be crossed.
>>> + break; \
>>> + case 4: \
>>> + __amoswap_generic(ptr__, new__, ret__,\
>>> + ".w" sfx, release_barrier,
>>> acquire_barrier); \
>>> + break; \
>>> + case 8: \
>>> + __amoswap_generic(ptr__, new__, ret__,\
>>> + ".d" sfx, release_barrier,
>>> acquire_barrier); \
>>> + break; \
>>> + default: \
>>> + STATIC_ASSERT_UNREACHABLE(); \
>>> + } \
>>> + ret__; \
>>> +})
>>> +
>>> +#define xchg_relaxed(ptr, x) \
>>> +({ \
>>> + __typeof__(*(ptr)) x_ = (x); \
>>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),
>>> "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x) \
>>> +({ \
>>> + __typeof__(*(ptr)) x_ = (x); \
>>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> + "", "",
>>> RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x) \
>>> +({ \
>>> + __typeof__(*(ptr)) x_ = (x); \
>>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
>>> + "", RISCV_RELEASE_BARRIER,
>>> ""); \
>>> +})
>>> +
>>> +#define xchg(ptr,x) \
>>> +({ \
>>> + __typeof__(*(ptr)) ret__; \
>>> + ret__ = (__typeof__(*(ptr))) \
>>> + __xchg_generic(ptr, (unsigned long)(x),
>>> sizeof(*(ptr)), \
>>> + ".aqrl", "", ""); \
>>
>> The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases.
>>
>> Further, amoswap also exists in release-only and acquire-only forms.
>> Why do you prefer explicit barrier insns over those? (Looks to
>> similarly apply to the emulation path as well as to the cmpxchg
>> machinery then, as both lr and sc also come in all four possible
>> acquire/release forms. Perhaps for the emulation path using
>> explicit barriers is better, in case the acquire/release forms of
>> lr/sc - being used inside the loop - might perform worse.)
> As 1- and 2-byte cases are emulated I decided that is not to provide
> sfx argument for emulation macros as it will not have to much affect on
> emulated types and just consume more performance on acquire and release
> version of sc/ld instructions.
Question is whether the common case (4- and 8-byte accesses) shouldn't
be valued higher, with 1- and 2-byte emulation being there just to
allow things to not break altogether.
>> No RISCV_..._BARRIER for use here and ...
>>
>>> + ret__; \
>>> +})
>>> +
>>> +#define __cmpxchg(ptr, o, n, s) \
>>> +({ \
>>> + __typeof__(*(ptr)) ret__; \
>>> + ret__ = (__typeof__(*(ptr))) \
>>> + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned
>>> long)(n), \
>>> + s, ".rl", "", " fence rw, rw\n"); \
>>
>> ... here? And anyway, wouldn't it make sense to have
>>
>> #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr))
>>
>> to limit redundancy?
>>
>> Plus wouldn't
>>
>> #define __cmpxchg(ptr, o, n, s) \
>> ((__typeof__(*(ptr))) \
>> __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
>> s, ".rl", "", " fence rw, rw\n"))
>>
>> be shorter and thus easier to follow as well? As I notice only now,
>> this would apparently apply further up as well.
> I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(",
> but I can't undestand how the definition of __cmxchng should be done
> shorter. Could you please clarify that?
You did notice that in my form there's no local variable, and hence
also no macro-wide scope ( ({ ... }) )?
Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |