[Xen-devel] [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
The only really interesting changes here are the updates to mem*, which
switch to actually optimised versions and introduce an optimised memcmp.

bitops: No change to the bits we import. Record new baseline.

cmpxchg: Import:

  60010e5 arm64: cmpxchg: update macros to prevent warnings
  Author: Mark Hambleton <mahamble@xxxxxxxxxxxx>
  Signed-off-by: Mark Hambleton <mahamble@xxxxxxxxxxxx>
  Signed-off-by: Mark Brown <broonie@xxxxxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

  e1dfda9 arm64: xchg: prevent warning if return value is unused
  Author: Will Deacon <will.deacon@xxxxxxx>
  Signed-off-by: Will Deacon <will.deacon@xxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

e1dfda9 resolves the warning which previously caused us to skip
60010e508111.

Since arm32 and arm64 now differ here (as do Linux arm and arm64), the
existing definition in asm/system.h is moved to asm/arm32/cmpxchg.h.
Previously it was shadowing the arm64 one, but the two happened to be
identical.

atomics: Import:

  8715466 arch,arm64: Convert smp_mb__*()
  Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
  Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

This just drops some smp_mb__*_atomic_* macros which are unused by us.

spinlocks: No change. Record new baseline.

mem*: Import:

  808dbac arm64: lib: Implement optimized memcpy routine
  Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

  280adc1 arm64: lib: Implement optimized memmove routine
  Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

  b29a51f arm64: lib: Implement optimized memset routine
  Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

  d875c9b arm64: lib: Implement optimized memcmp routine
  Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
  Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
  Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>

These import various routines from Linaro's Cortex Strings library.
Added an assembler.h, similar to the one on arm32, to define the
various magic symbols which these imported routines depend on
(e.g. CPU_LE() and CPU_BE()).

str*: No changes. Record new baseline. Correct the paths in the README.

*_page: No changes. Record new baseline. The README previously said
clear_page was unused while copy_page was used, which was backwards.
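For context on the xchg/cmpxchg macro changes in the hunks below: with the
old cast-based wrapper, a caller that ignores the return value gets a
"value computed is not used" warning, which is what previously forced us to
skip 60010e5; the statement-expression form imported via e1dfda9 avoids
that. The following is a minimal standalone sketch of the two forms, not
part of the patch itself; the __xchg() stub is an assumption for
illustration, built on the GCC/Clang __atomic_exchange_n builtin rather
than Xen's arm64 assembly implementation.

/*
 * Sketch only: contrasts the old cast-based xchg() wrapper with the
 * statement-expression form imported from Linux e1dfda9.
 */
#include <stdio.h>

static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
{
	switch (size) {
	case 4:
		return __atomic_exchange_n((volatile unsigned int *)ptr,
					   (unsigned int)x, __ATOMIC_SEQ_CST);
	case 8:
		return __atomic_exchange_n((volatile unsigned long *)ptr,
					   x, __ATOMIC_SEQ_CST);
	default:
		return 0; /* a real implementation would reject this size */
	}
}

/*
 * Old form: the outermost cast means a caller which discards the result
 * triggers "value computed is not used" under -Wall.
 */
#define xchg_cast(ptr, x) \
	((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))

/*
 * New form (as in the asm-arm/arm64/cmpxchg.h hunk below): the result is
 * first assigned to a local, so discarding the whole expression is silent.
 */
#define xchg_stmt(ptr, x) \
({ \
	__typeof__(*(ptr)) __ret; \
	__ret = (__typeof__(*(ptr))) \
		__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
	__ret; \
})

int main(void)
{
	unsigned long v = 1;

	xchg_stmt(&v, 2UL);                     /* result ignored: no warning */
	unsigned long old = xchg_cast(&v, 3UL); /* result used: also fine */
	printf("old = %lu, v = %lu\n", old, v); /* prints "old = 2, v = 3" */
	return 0;
}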
Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx> --- xen/arch/arm/README.LinuxPrimitives | 36 +++-- xen/arch/arm/arm64/lib/Makefile | 2 +- xen/arch/arm/arm64/lib/assembler.h | 13 ++ xen/arch/arm/arm64/lib/memchr.S | 1 + xen/arch/arm/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++++++++ xen/arch/arm/arm64/lib/memcpy.S | 193 +++++++++++++++++++++++--- xen/arch/arm/arm64/lib/memmove.S | 191 ++++++++++++++++++++++---- xen/arch/arm/arm64/lib/memset.S | 208 +++++++++++++++++++++++++--- xen/include/asm-arm/arm32/cmpxchg.h | 3 + xen/include/asm-arm/arm64/atomic.h | 5 - xen/include/asm-arm/arm64/cmpxchg.h | 35 +++-- xen/include/asm-arm/string.h | 5 + xen/include/asm-arm/system.h | 3 - 13 files changed, 844 insertions(+), 109 deletions(-) create mode 100644 xen/arch/arm/arm64/lib/assembler.h create mode 100644 xen/arch/arm/arm64/lib/memcmp.S diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives index 6cd03ca..69eeb70 100644 --- a/xen/arch/arm/README.LinuxPrimitives +++ b/xen/arch/arm/README.LinuxPrimitives @@ -6,29 +6,26 @@ were last updated. arm64: ===================================================================== -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b) +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027) linux/arch/arm64/lib/bitops.S xen/arch/arm/arm64/lib/bitops.S linux/arch/arm64/include/asm/bitops.h xen/include/asm-arm/arm64/bitops.h --------------------------------------------------------------------- -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189) +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b) linux/arch/arm64/include/asm/cmpxchg.h xen/include/asm-arm/arm64/cmpxchg.h -Skipped: - 60010e5 arm64: cmpxchg: update macros to prevent warnings - --------------------------------------------------------------------- -atomics: last sync @ v3.14-rc7 (last commit: 95c4189) +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027) linux/arch/arm64/include/asm/atomic.h xen/include/asm-arm/arm64/atomic.h --------------------------------------------------------------------- -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189) +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9) linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h @@ -38,30 +35,31 @@ Skipped: --------------------------------------------------------------------- -mem*: last sync @ v3.14-rc7 (last commit: 4a89922) +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240) -linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S -linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S -linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S -linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S +linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S +linux/arch/arm64/lib/memcmp.S xen/arch/arm/arm64/lib/memcmp.S +linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S +linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S +linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S -for i in memchr.S memcpy.S memmove.S memset.S ; do +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i done --------------------------------------------------------------------- -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8) +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5) -linux/arch/arm/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S -linux/arch/arm/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S 
+linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S +linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S --------------------------------------------------------------------- -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13) +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387) -linux/arch/arm64/lib/clear_page.S unused in Xen -linux/arch/arm64/lib/copy_page.S xen/arch/arm/arm64/lib/copy_page.S +linux/arch/arm64/lib/clear_page.S xen/arch/arm/arm64/lib/clear_page.S +linux/arch/arm64/lib/copy_page.S unused in Xen ===================================================================== arm32 diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile index b895afa..2e7fb64 100644 --- a/xen/arch/arm/arm64/lib/Makefile +++ b/xen/arch/arm/arm64/lib/Makefile @@ -1,4 +1,4 @@ -obj-y += memcpy.o memmove.o memset.o memchr.o +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o obj-y += clear_page.o obj-y += bitops.o find_next_bit.o obj-y += strchr.o strrchr.o diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h new file mode 100644 index 0000000..84669d1 --- /dev/null +++ b/xen/arch/arm/arm64/lib/assembler.h @@ -0,0 +1,13 @@ +#ifndef __ASM_ASSEMBLER_H__ +#define __ASM_ASSEMBLER_H__ + +#ifndef __ASSEMBLY__ +#error "Only include this from assembly code" +#endif + +/* Only LE support so far */ +#define CPU_BE(x...) +#define CPU_LE(x...) x + +#endif /* __ASM_ASSEMBLER_H__ */ + diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S index 3cc1b01..b04590c 100644 --- a/xen/arch/arm/arm64/lib/memchr.S +++ b/xen/arch/arm/arm64/lib/memchr.S @@ -18,6 +18,7 @@ */ #include <xen/config.h> +#include "assembler.h" /* * Find a character in an area of memory. diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S new file mode 100644 index 0000000..9aad925 --- /dev/null +++ b/xen/arch/arm/arm64/lib/memcmp.S @@ -0,0 +1,258 @@ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#include <xen/config.h> +#include "assembler.h" + +/* +* compare memory areas(when two memory areas' offset are different, +* alignment handled by the hardware) +* +* Parameters: +* x0 - const memory area 1 pointer +* x1 - const memory area 2 pointer +* x2 - the maximal compare byte length +* Returns: +* x0 - a compare result, maybe less than, equal to, or greater than ZERO +*/ + +/* Parameters and result. */ +src1 .req x0 +src2 .req x1 +limit .req x2 +result .req x0 + +/* Internal variables. 
*/ +data1 .req x3 +data1w .req w3 +data2 .req x4 +data2w .req w4 +has_nul .req x5 +diff .req x6 +endloop .req x7 +tmp1 .req x8 +tmp2 .req x9 +tmp3 .req x10 +pos .req x11 +limit_wd .req x12 +mask .req x13 + +ENTRY(memcmp) + cbz limit, .Lret0 + eor tmp1, src1, src2 + tst tmp1, #7 + b.ne .Lmisaligned8 + ands tmp1, src1, #7 + b.ne .Lmutual_align + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ + lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */ + /* + * The input source addresses are at alignment boundary. + * Directly compare eight bytes each time. + */ +.Lloop_aligned: + ldr data1, [src1], #8 + ldr data2, [src2], #8 +.Lstart_realigned: + subs limit_wd, limit_wd, #1 + eor diff, data1, data2 /* Non-zero if differences found. */ + csinv endloop, diff, xzr, cs /* Last Dword or differences. */ + cbz endloop, .Lloop_aligned + + /* Not reached the limit, must have found a diff. */ + tbz limit_wd, #63, .Lnot_limit + + /* Limit % 8 == 0 => the diff is in the last 8 bytes. */ + ands limit, limit, #7 + b.eq .Lnot_limit + /* + * The remained bytes less than 8. It is needed to extract valid data + * from last eight bytes of the intended memory range. + */ + lsl limit, limit, #3 /* bytes-> bits. */ + mov mask, #~0 +CPU_BE( lsr mask, mask, limit ) +CPU_LE( lsl mask, mask, limit ) + bic data1, data1, mask + bic data2, data2, mask + + orr diff, diff, mask + b .Lnot_limit + +.Lmutual_align: + /* + * Sources are mutually aligned, but are not currently at an + * alignment boundary. Round down the addresses and then mask off + * the bytes that precede the start point. + */ + bic src1, src1, #7 + bic src2, src2, #7 + ldr data1, [src1], #8 + ldr data2, [src2], #8 + /* + * We can not add limit with alignment offset(tmp1) here. Since the + * addition probably make the limit overflown. + */ + sub limit_wd, limit, #1/*limit != 0, so no underflow.*/ + and tmp3, limit_wd, #7 + lsr limit_wd, limit_wd, #3 + add tmp3, tmp3, tmp1 + add limit_wd, limit_wd, tmp3, lsr #3 + add limit, limit, tmp1/* Adjust the limit for the extra. */ + + lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/ + neg tmp1, tmp1/* Bits to alignment -64. */ + mov tmp2, #~0 + /*mask off the non-intended bytes before the start address.*/ +CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/ + /* Little-endian. Early bytes are at LSB. */ +CPU_LE( lsr tmp2, tmp2, tmp1 ) + + orr data1, data1, tmp2 + orr data2, data2, tmp2 + b .Lstart_realigned + + /*src1 and src2 have different alignment offset.*/ +.Lmisaligned8: + cmp limit, #8 + b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/ + + and tmp1, src1, #7 + neg tmp1, tmp1 + add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/ + and tmp2, src2, #7 + neg tmp2, tmp2 + add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/ + subs tmp3, tmp1, tmp2 + csel pos, tmp1, tmp2, hi /*Choose the maximum.*/ + + sub limit, limit, pos + /*compare the proceeding bytes in the first 8 byte segment.*/ +.Ltinycmp: + ldrb data1w, [src1], #1 + ldrb data2w, [src2], #1 + subs pos, pos, #1 + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. 
*/ + b.eq .Ltinycmp + cbnz pos, 1f /*diff occurred before the last byte.*/ + cmp data1w, data2w + b.eq .Lstart_align +1: + sub result, data1, data2 + ret + +.Lstart_align: + lsr limit_wd, limit, #3 + cbz limit_wd, .Lremain8 + + ands xzr, src1, #7 + b.eq .Lrecal_offset + /*process more leading bytes to make src1 aligned...*/ + add src1, src1, tmp3 /*backwards src1 to alignment boundary*/ + add src2, src2, tmp3 + sub limit, limit, tmp3 + lsr limit_wd, limit, #3 + cbz limit_wd, .Lremain8 + /*load 8 bytes from aligned SRC1..*/ + ldr data1, [src1], #8 + ldr data2, [src2], #8 + + subs limit_wd, limit_wd, #1 + eor diff, data1, data2 /*Non-zero if differences found.*/ + csinv endloop, diff, xzr, ne + cbnz endloop, .Lunequal_proc + /*How far is the current SRC2 from the alignment boundary...*/ + and tmp3, tmp3, #7 + +.Lrecal_offset:/*src1 is aligned now..*/ + neg pos, tmp3 +.Lloopcmp_proc: + /* + * Divide the eight bytes into two parts. First,backwards the src2 + * to an alignment boundary,load eight bytes and compare from + * the SRC2 alignment boundary. If all 8 bytes are equal,then start + * the second part's comparison. Otherwise finish the comparison. + * This special handle can garantee all the accesses are in the + * thread/task space in avoid to overrange access. + */ + ldr data1, [src1,pos] + ldr data2, [src2,pos] + eor diff, data1, data2 /* Non-zero if differences found. */ + cbnz diff, .Lnot_limit + + /*The second part process*/ + ldr data1, [src1], #8 + ldr data2, [src2], #8 + eor diff, data1, data2 /* Non-zero if differences found. */ + subs limit_wd, limit_wd, #1 + csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/ + cbz endloop, .Lloopcmp_proc +.Lunequal_proc: + cbz diff, .Lremain8 + +/*There is differnence occured in the latest comparison.*/ +.Lnot_limit: +/* +* For little endian,reverse the low significant equal bits into MSB,then +* following CLZ can find how many equal bits exist. +*/ +CPU_LE( rev diff, diff ) +CPU_LE( rev data1, data1 ) +CPU_LE( rev data2, data2 ) + + /* + * The MS-non-zero bit of DIFF marks either the first bit + * that is different, or the end of the significant data. + * Shifting left now will bring the critical information into the + * top bits. + */ + clz pos, diff + lsl data1, data1, pos + lsl data2, data2, pos + /* + * We need to zero-extend (char is unsigned) the value and then + * perform a signed subtraction. + */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret + +.Lremain8: + /* Limit % 8 == 0 =>. all data are equal.*/ + ands limit, limit, #7 + b.eq .Lret0 + +.Ltiny8proc: + ldrb data1w, [src1], #1 + ldrb data2w, [src2], #1 + subs limit, limit, #1 + + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */ + b.eq .Ltiny8proc + sub result, data1, data2 + ret +.Lret0: + mov result, #0 + ret +ENDPROC(memcmp) diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S index c8197c6..7cc885d 100644 --- a/xen/arch/arm/arm64/lib/memcpy.S +++ b/xen/arch/arm/arm64/lib/memcpy.S @@ -1,5 +1,13 @@ /* * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. 
The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -15,6 +23,8 @@ */ #include <xen/config.h> +#include <asm/cache.h> +#include "assembler.h" /* * Copy a buffer from src to dest (alignment handled by the hardware) @@ -26,27 +36,166 @@ * Returns: * x0 - dest */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +tmp3 .req x5 +tmp3w .req w5 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + ENTRY(memcpy) - mov x4, x0 - subs x2, x2, #8 - b.mi 2f -1: ldr x3, [x1], #8 - subs x2, x2, #8 - str x3, [x4], #8 - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - ldr w3, [x1], #4 - sub x2, x2, #4 - str w3, [x4], #4 -3: adds x2, x2, #2 - b.mi 4f - ldrh w3, [x1], #2 - sub x2, x2, #2 - strh w3, [x4], #2 -4: adds x2, x2, #1 - b.mi 5f - ldrb w3, [x1] - strb w3, [x4] -5: ret + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15 + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwritting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. + */ + tbz tmp2, #0, 1f + ldrb tmp1w, [src], #1 + strb tmp1w, [dst], #1 +1: + tbz tmp2, #1, 2f + ldrh tmp1w, [src], #2 + strh tmp1w, [dst], #2 +2: + tbz tmp2, #2, 3f + ldr tmp1w, [src], #4 + str tmp1w, [dst], #4 +3: + tbz tmp2, #3, .LSrcAligned + ldr tmp1, [src],#8 + str tmp1, [dst],#8 + +.LSrcAligned: + cmp count, #64 + b.ge .Lcpy_over64 + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15 + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +1: + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +2: + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +.Ltiny15: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. + */ + tbz count, #3, 1f + ldr tmp1, [src], #8 + str tmp1, [dst], #8 +1: + tbz count, #2, 2f + ldr tmp1w, [src], #4 + str tmp1w, [dst], #4 +2: + tbz count, #1, 3f + ldrh tmp1w, [src], #2 + strh tmp1w, [dst], #2 +3: + tbz count, #0, .Lexitfunc + ldrb tmp1w, [src] + strb tmp1w, [dst] + +.Lexitfunc: + ret + +.Lcpy_over64: + subs count, count, #128 + b.ge .Lcpy_body_large + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. 
+ */ + ldp A_l, A_h, [src],#16 + stp A_l, A_h, [dst],#16 + ldp B_l, B_h, [src],#16 + ldp C_l, C_h, [src],#16 + stp B_l, B_h, [dst],#16 + stp C_l, C_h, [dst],#16 + ldp D_l, D_h, [src],#16 + stp D_l, D_h, [dst],#16 + + tst count, #0x3f + b.ne .Ltail63 + ret + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large: + /* pre-get 64 bytes data. */ + ldp A_l, A_h, [src],#16 + ldp B_l, B_h, [src],#16 + ldp C_l, C_h, [src],#16 + ldp D_l, D_h, [src],#16 +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stp A_l, A_h, [dst],#16 + ldp A_l, A_h, [src],#16 + stp B_l, B_h, [dst],#16 + ldp B_l, B_h, [src],#16 + stp C_l, C_h, [dst],#16 + ldp C_l, C_h, [src],#16 + stp D_l, D_h, [dst],#16 + ldp D_l, D_h, [src],#16 + subs count, count, #64 + b.ge 1b + stp A_l, A_h, [dst],#16 + stp B_l, B_h, [dst],#16 + stp C_l, C_h, [dst],#16 + stp D_l, D_h, [dst],#16 + + tst count, #0x3f + b.ne .Ltail63 + ret ENDPROC(memcpy) diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S index 1bf0936..f4065b9 100644 --- a/xen/arch/arm/arm64/lib/memmove.S +++ b/xen/arch/arm/arm64/lib/memmove.S @@ -1,5 +1,13 @@ /* * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -15,6 +23,8 @@ */ #include <xen/config.h> +#include <asm/cache.h> +#include "assembler.h" /* * Move a buffer from src to test (alignment handled by the hardware). @@ -27,30 +37,161 @@ * Returns: * x0 - dest */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +tmp3 .req x5 +tmp3w .req w5 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + ENTRY(memmove) - cmp x0, x1 - b.ls memcpy - add x4, x0, x2 - add x1, x1, x2 - subs x2, x2, #8 - b.mi 2f -1: ldr x3, [x1, #-8]! - subs x2, x2, #8 - str x3, [x4, #-8]! - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - ldr w3, [x1, #-4]! - sub x2, x2, #4 - str w3, [x4, #-4]! -3: adds x2, x2, #2 - b.mi 4f - ldrh w3, [x1, #-2]! - sub x2, x2, #2 - strh w3, [x4, #-2]! -4: adds x2, x2, #1 - b.mi 5f - ldrb w3, [x1, #-1] - strb w3, [x4, #-1] -5: ret + cmp dstin, src + b.lo memcpy + add tmp1, src, count + cmp dstin, tmp1 + b.hs memcpy /* No overlap. */ + + add dst, dstin, count + add src, src, count + cmp count, #16 + b.lo .Ltail15 /*probably non-alignment accesses.*/ + + ands tmp2, src, #15 /* Bytes to reach alignment. */ + b.eq .LSrcAligned + sub count, count, tmp2 + /* + * process the aligned offset length to make the src aligned firstly. + * those extra instructions' cost is acceptable. It also make the + * coming accesses are based on aligned address. + */ + tbz tmp2, #0, 1f + ldrb tmp1w, [src, #-1]! + strb tmp1w, [dst, #-1]! +1: + tbz tmp2, #1, 2f + ldrh tmp1w, [src, #-2]! + strh tmp1w, [dst, #-2]! +2: + tbz tmp2, #2, 3f + ldr tmp1w, [src, #-4]! + str tmp1w, [dst, #-4]! +3: + tbz tmp2, #3, .LSrcAligned + ldr tmp1, [src, #-8]! + str tmp1, [dst, #-8]! 
+ +.LSrcAligned: + cmp count, #64 + b.ge .Lcpy_over64 + + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltail15 + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! +1: + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! +2: + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! + +.Ltail15: + tbz count, #3, 1f + ldr tmp1, [src, #-8]! + str tmp1, [dst, #-8]! +1: + tbz count, #2, 2f + ldr tmp1w, [src, #-4]! + str tmp1w, [dst, #-4]! +2: + tbz count, #1, 3f + ldrh tmp1w, [src, #-2]! + strh tmp1w, [dst, #-2]! +3: + tbz count, #0, .Lexitfunc + ldrb tmp1w, [src, #-1] + strb tmp1w, [dst, #-1] + +.Lexitfunc: + ret + +.Lcpy_over64: + subs count, count, #128 + b.ge .Lcpy_body_large + /* + * Less than 128 bytes to copy, so handle 64 bytes here and then jump + * to the tail. + */ + ldp A_l, A_h, [src, #-16] + stp A_l, A_h, [dst, #-16] + ldp B_l, B_h, [src, #-32] + ldp C_l, C_h, [src, #-48] + stp B_l, B_h, [dst, #-32] + stp C_l, C_h, [dst, #-48] + ldp D_l, D_h, [src, #-64]! + stp D_l, D_h, [dst, #-64]! + + tst count, #0x3f + b.ne .Ltail63 + ret + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large: + /* pre-load 64 bytes data. */ + ldp A_l, A_h, [src, #-16] + ldp B_l, B_h, [src, #-32] + ldp C_l, C_h, [src, #-48] + ldp D_l, D_h, [src, #-64]! +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stp A_l, A_h, [dst, #-16] + ldp A_l, A_h, [src, #-16] + stp B_l, B_h, [dst, #-32] + ldp B_l, B_h, [src, #-32] + stp C_l, C_h, [dst, #-48] + ldp C_l, C_h, [src, #-48] + stp D_l, D_h, [dst, #-64]! + ldp D_l, D_h, [src, #-64]! + subs count, count, #64 + b.ge 1b + stp A_l, A_h, [dst, #-16] + stp B_l, B_h, [dst, #-32] + stp C_l, C_h, [dst, #-48] + stp D_l, D_h, [dst, #-64]! + + tst count, #0x3f + b.ne .Ltail63 + ret ENDPROC(memmove) diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S index 25a4fb6..4ee714d 100644 --- a/xen/arch/arm/arm64/lib/memset.S +++ b/xen/arch/arm/arm64/lib/memset.S @@ -1,5 +1,13 @@ /* * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. 
The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -15,6 +23,8 @@ */ #include <xen/config.h> +#include <asm/cache.h> +#include "assembler.h" /* * Fill in the buffer with character c (alignment handled by the hardware) @@ -26,27 +36,181 @@ * Returns: * x0 - buf */ + +dstin .req x0 +val .req w1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +zva_len_x .req x5 +zva_len .req w5 +zva_bits_x .req x6 + +A_l .req x7 +A_lw .req w7 +dst .req x8 +tmp3w .req w9 +tmp3 .req x9 + ENTRY(memset) - mov x4, x0 - and w1, w1, #0xff - orr w1, w1, w1, lsl #8 - orr w1, w1, w1, lsl #16 - orr x1, x1, x1, lsl #32 - subs x2, x2, #8 - b.mi 2f -1: str x1, [x4], #8 - subs x2, x2, #8 - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - sub x2, x2, #4 - str w1, [x4], #4 -3: adds x2, x2, #2 - b.mi 4f - sub x2, x2, #2 - strh w1, [x4], #2 -4: adds x2, x2, #1 - b.mi 5f - strb w1, [x4] -5: ret + mov dst, dstin /* Preserve return value. */ + and A_lw, val, #255 + orr A_lw, A_lw, A_lw, lsl #8 + orr A_lw, A_lw, A_lw, lsl #16 + orr A_l, A_l, A_l, lsl #32 + + cmp count, #15 + b.hi .Lover16_proc + /*All store maybe are non-aligned..*/ + tbz count, #3, 1f + str A_l, [dst], #8 +1: + tbz count, #2, 2f + str A_lw, [dst], #4 +2: + tbz count, #1, 3f + strh A_lw, [dst], #2 +3: + tbz count, #0, 4f + strb A_lw, [dst] +4: + ret + +.Lover16_proc: + /*Whether the start address is aligned with 16.*/ + neg tmp2, dst + ands tmp2, tmp2, #15 + b.eq .Laligned +/* +* The count is not less than 16, we can use stp to store the start 16 bytes, +* then adjust the dst aligned with 16.This process will make the current +* memory address at alignment boundary. +*/ + stp A_l, A_l, [dst] /*non-aligned store..*/ + /*make the dst aligned..*/ + sub count, count, tmp2 + add dst, dst, tmp2 + +.Laligned: + cbz A_l, .Lzero_mem + +.Ltail_maybe_long: + cmp count, #64 + b.ge .Lnot_short +.Ltail63: + ands tmp1, count, #0x30 + b.eq 3f + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + stp A_l, A_l, [dst], #16 +1: + stp A_l, A_l, [dst], #16 +2: + stp A_l, A_l, [dst], #16 +/* +* The last store length is less than 16,use stp to write last 16 bytes. +* It will lead some bytes written twice and the access is non-aligned. +*/ +3: + ands count, count, #15 + cbz count, 4f + add dst, dst, count + stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */ +4: + ret + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line, this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lnot_short: + sub dst, dst, #16/* Pre-bias. */ + sub count, count, #64 +1: + stp A_l, A_l, [dst, #16] + stp A_l, A_l, [dst, #32] + stp A_l, A_l, [dst, #48] + stp A_l, A_l, [dst, #64]! + subs count, count, #64 + b.ge 1b + tst count, #0x3f + add dst, dst, #16 + b.ne .Ltail63 +.Lexitfunc: + ret + + /* + * For zeroing memory, check to see if we can use the ZVA feature to + * zero entire 'cache' lines. + */ +.Lzero_mem: + cmp count, #63 + b.le .Ltail63 + /* + * For zeroing small amounts of memory, it's not worth setting up + * the line-clear code. + */ + cmp count, #128 + b.lt .Lnot_short /*count is at least 128 bytes*/ + + mrs tmp1, dczid_el0 + tbnz tmp1, #4, .Lnot_short + mov tmp3w, #4 + and zva_len, tmp1w, #15 /* Safety: other bits reserved. 
*/ + lsl zva_len, tmp3w, zva_len + + ands tmp3w, zva_len, #63 + /* + * ensure the zva_len is not less than 64. + * It is not meaningful to use ZVA if the block size is less than 64. + */ + b.ne .Lnot_short +.Lzero_by_line: + /* + * Compute how far we need to go to become suitably aligned. We're + * already at quad-word alignment. + */ + cmp count, zva_len_x + b.lt .Lnot_short /* Not enough to reach alignment. */ + sub zva_bits_x, zva_len_x, #1 + neg tmp2, dst + ands tmp2, tmp2, zva_bits_x + b.eq 2f /* Already aligned. */ + /* Not aligned, check that there's enough to copy after alignment.*/ + sub tmp1, count, tmp2 + /* + * grantee the remain length to be ZVA is bigger than 64, + * avoid to make the 2f's process over mem range.*/ + cmp tmp1, #64 + ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */ + b.lt .Lnot_short + /* + * We know that there's at least 64 bytes to zero and that it's safe + * to overrun by 64 bytes. + */ + mov count, tmp1 +1: + stp A_l, A_l, [dst] + stp A_l, A_l, [dst, #16] + stp A_l, A_l, [dst, #32] + subs tmp2, tmp2, #64 + stp A_l, A_l, [dst, #48] + add dst, dst, #64 + b.ge 1b + /* We've overrun a bit, so adjust dst downwards.*/ + add dst, dst, tmp2 +2: + sub count, count, zva_len_x +3: + dc zva, dst + add dst, dst, zva_len_x + subs count, count, zva_len_x + b.ge 3b + ands count, count, zva_bits_x + b.ne .Ltail_maybe_long + ret ENDPROC(memset) diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h index 3f4e7a1..9a511f2 100644 --- a/xen/include/asm-arm/arm32/cmpxchg.h +++ b/xen/include/asm-arm/arm32/cmpxchg.h @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size return ret; } +#define xchg(ptr,x) \ + ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) + /* * Atomic compare and exchange. Compare OLD with MEM, if identical, * store NEW in MEM. Return the initial value in MEM. 
Success is diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h index b5d50f2..b49219e 100644 --- a/xen/include/asm-arm/arm64/atomic.h +++ b/xen/include/asm-arm/arm64/atomic.h @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u) #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) -#define smp_mb__before_atomic_dec() smp_mb() -#define smp_mb__after_atomic_dec() smp_mb() -#define smp_mb__before_atomic_inc() smp_mb() -#define smp_mb__after_atomic_inc() smp_mb() - #endif /* * Local variables: diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h index 4e930ce..ae42b2f 100644 --- a/xen/include/asm-arm/arm64/cmpxchg.h +++ b/xen/include/asm-arm/arm64/cmpxchg.h @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size } #define xchg(ptr,x) \ - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) +({ \ + __typeof__(*(ptr)) __ret; \ + __ret = (__typeof__(*(ptr))) \ + __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \ + __ret; \ +}) extern void __bad_cmpxchg(volatile void *ptr, int size); @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old, return ret; } -#define cmpxchg(ptr,o,n) \ - ((__typeof__(*(ptr)))__cmpxchg_mb((ptr), \ - (unsigned long)(o), \ - (unsigned long)(n), \ - sizeof(*(ptr)))) - -#define cmpxchg_local(ptr,o,n) \ - ((__typeof__(*(ptr)))__cmpxchg((ptr), \ - (unsigned long)(o), \ - (unsigned long)(n), \ - sizeof(*(ptr)))) +#define cmpxchg(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) __ret; \ + __ret = (__typeof__(*(ptr))) \ + __cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \ + sizeof(*(ptr))); \ + __ret; \ +}) + +#define cmpxchg_local(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) __ret; \ + __ret = (__typeof__(*(ptr))) \ + __cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr))); \ + __ret; \ +}) #endif /* diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h index 3242762..dfad1fe 100644 --- a/xen/include/asm-arm/string.h +++ b/xen/include/asm-arm/string.h @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c); #define __HAVE_ARCH_MEMCPY extern void * memcpy(void *, const void *, __kernel_size_t); +#if defined(CONFIG_ARM_64) +#define __HAVE_ARCH_MEMCMP +extern int memcmp(const void *, const void *, __kernel_size_t); +#endif + /* Some versions of gcc don't have this builtin. It's non-critical anyway. */ #define __HAVE_ARCH_MEMMOVE extern void *memmove(void *dest, const void *src, size_t n); diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h index 7aaaf50..ce3d38a 100644 --- a/xen/include/asm-arm/system.h +++ b/xen/include/asm-arm/system.h @@ -33,9 +33,6 @@ #define smp_wmb() dmb(ishst) -#define xchg(ptr,x) \ - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) - /* * This is used to ensure the compiler did actually allocate the register we * asked it for some inline assembly sequences. Apparently we can't trust -- 1.7.10.4 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel