
Re: [Xen-devel] [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6



Hi Ian,

On 07/25/2014 04:22 PM, Ian Campbell wrote:
> The only really interesting changes here are the updates to mem*, which switch
> to genuinely optimised implementations and introduce an optimised memcmp.

I didn't read the whole code, as I assume it's mostly a copy of the Linux
code with only a few changes.

Acked-by: Julien Grall <julien.grall@xxxxxxxxxx>

Regards,

> bitops: No change to the bits we import. Record new baseline.
> 
> cmpxchg: Import:
>   60010e5 arm64: cmpxchg: update macros to prevent warnings
>     Author: Mark Hambleton <mahamble@xxxxxxxxxxxx>
>     Signed-off-by: Mark Hambleton <mahamble@xxxxxxxxxxxx>
>     Signed-off-by: Mark Brown <broonie@xxxxxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> 
>   e1dfda9 arm64: xchg: prevent warning if return value is unused
>     Author: Will Deacon <will.deacon@xxxxxxx>
>     Signed-off-by: Will Deacon <will.deacon@xxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> 
>   e1dfda9 resolves the warning which previously caused us to skip 60010e508111.
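
(As background for other reviewers: a minimal sketch, as I understand it, of why
the statement-expression form imported here avoids the warning.  The xchg_cast /
xchg_stmt names and the stub __xchg below are illustrative only, not taken from
the patch.)

    #include <stdio.h>

    /* Stub standing in for the real assembly-backed helper. */
    static unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
    {
        unsigned long old = *(volatile unsigned long *)ptr;

        *(volatile unsigned long *)ptr = x;
        (void)size;
        return old;
    }

    /* Old style: a bare cast expression.  Used as a statement with the result
     * ignored, gcc can warn "value computed is not used" (-Wunused-value). */
    #define xchg_cast(ptr, x) \
        ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))

    /* New style: a GCC statement expression, whose value can be discarded
     * without triggering that warning. */
    #define xchg_stmt(ptr, x) ({ \
        __typeof__(*(ptr)) __ret; \
        __ret = (__typeof__(*(ptr))) \
            __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
        __ret; \
    })

    int main(void)
    {
        volatile unsigned long v = 1;

        xchg_stmt(&v, 2UL);                     /* result ignored, no warning */
        printf("%lu\n", xchg_cast(&v, 3UL));    /* prints 2 */
        return 0;
    }
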
> 
>   Since arm32 and arm64 now differ here (as do Linux arm and arm64), the
>   existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
>   Previously it shadowed the arm64 one, but the two happened to be identical.
> 
> atomics: Import:
>   8715466 arch,arm64: Convert smp_mb__*()
>     Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>     Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> 
>   This just drops some smp_mb__*_atomic_* macros that we do not use.
> 
> spinlocks: No change. Record new baseline.
> 
> mem*: Import:
>   808dbac arm64: lib: Implement optimized memcpy routine
>     Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
>   280adc1 arm64: lib: Implement optimized memmove routine
>     Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
>   b29a51f arm64: lib: Implement optimized memset routine
>     Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
>   d875c9b arm64: lib: Implement optimized memcmp routine
>     Author: zhichang.yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@xxxxxxxxxx>
>     Signed-off-by: Deepak Saxena <dsaxena@xxxxxxxxxx>
>     Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> 
>   These import various routines from Linaro's Cortex Strings library.
> 
>   Added an assembler.h, similar to the one on arm32, to define the various
>   magic symbols which these imported routines depend on (e.g. CPU_LE() and
>   CPU_BE()).
> 
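
(A tiny aside, not part of the patch: since we only support little-endian,
CPU_LE() keeps its arguments and CPU_BE() discards them, so the guarded lines in
the imported .S files either assemble as written or drop out entirely.  A quick
host-side C illustration of the same expansion:)

    #include <stdio.h>

    /* Same definitions as in the new assembler.h (LE-only build). */
    #define CPU_BE(x...)
    #define CPU_LE(x...) x

    int main(void)
    {
        int x = 0;

        CPU_LE( x = 1; )        /* kept: expands to "x = 1;" */
        CPU_BE( x = 2; )        /* dropped: expands to nothing */
        printf("%d\n", x);      /* prints 1 */
        return 0;
    }
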
> str*: No changes. Record new baseline.
> 
>   Correct the paths in the README.
> 
> *_page: No changes. Record new baseline.
> 
>   The README previously said clear_page was unused while copy_page was used,
>   which was backwards.
> 
> Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
> ---
>  xen/arch/arm/README.LinuxPrimitives |   36 +++--
>  xen/arch/arm/arm64/lib/Makefile     |    2 +-
>  xen/arch/arm/arm64/lib/assembler.h  |   13 ++
>  xen/arch/arm/arm64/lib/memchr.S     |    1 +
>  xen/arch/arm/arm64/lib/memcmp.S     |  258 +++++++++++++++++++++++++++++++++++
>  xen/arch/arm/arm64/lib/memcpy.S     |  193 +++++++++++++++++++++++---
>  xen/arch/arm/arm64/lib/memmove.S    |  191 ++++++++++++++++++++++----
>  xen/arch/arm/arm64/lib/memset.S     |  208 +++++++++++++++++++++++++---
>  xen/include/asm-arm/arm32/cmpxchg.h |    3 +
>  xen/include/asm-arm/arm64/atomic.h  |    5 -
>  xen/include/asm-arm/arm64/cmpxchg.h |   35 +++--
>  xen/include/asm-arm/string.h        |    5 +
>  xen/include/asm-arm/system.h        |    3 -
>  13 files changed, 844 insertions(+), 109 deletions(-)
>  create mode 100644 xen/arch/arm/arm64/lib/assembler.h
>  create mode 100644 xen/arch/arm/arm64/lib/memcmp.S
> 
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 6cd03ca..69eeb70 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -6,29 +6,26 @@ were last updated.
>  arm64:
>  =====================================================================
>  
> -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
> +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/lib/bitops.S           xen/arch/arm/arm64/lib/bitops.S
>  linux/arch/arm64/include/asm/bitops.h   xen/include/asm-arm/arm64/bitops.h
>  
>  ---------------------------------------------------------------------
>  
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
>  
>  linux/arch/arm64/include/asm/cmpxchg.h  xen/include/asm-arm/arm64/cmpxchg.h
>  
> -Skipped:
> -  60010e5 arm64: cmpxchg: update macros to prevent warnings
> -
>  ---------------------------------------------------------------------
>  
> -atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
> +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/include/asm/atomic.h   xen/include/asm-arm/arm64/atomic.h
>  
>  ---------------------------------------------------------------------
>  
> -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
> +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
>  
>  linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
>  
> @@ -38,30 +35,31 @@ Skipped:
>  
>  ---------------------------------------------------------------------
>  
> -mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
> +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
>  
> -linux/arch/arm64/lib/memchr.S             xen/arch/arm/arm64/lib/memchr.S
> -linux/arch/arm64/lib/memcpy.S             xen/arch/arm/arm64/lib/memcpy.S
> -linux/arch/arm64/lib/memmove.S            xen/arch/arm/arm64/lib/memmove.S
> -linux/arch/arm64/lib/memset.S             xen/arch/arm/arm64/lib/memset.S
> +linux/arch/arm64/lib/memchr.S           xen/arch/arm/arm64/lib/memchr.S
> +linux/arch/arm64/lib/memcmp.S           xen/arch/arm/arm64/lib/memcmp.S
> +linux/arch/arm64/lib/memcpy.S           xen/arch/arm/arm64/lib/memcpy.S
> +linux/arch/arm64/lib/memmove.S          xen/arch/arm/arm64/lib/memmove.S
> +linux/arch/arm64/lib/memset.S           xen/arch/arm/arm64/lib/memset.S
>  
> -for i in memchr.S memcpy.S memmove.S memset.S ; do
> +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
>      diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
>  done
>  
>  ---------------------------------------------------------------------
>  
> -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
> +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
>  
> -linux/arch/arm/lib/strchr.S             xen/arch/arm/arm64/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm64/lib/strrchr.S
> +linux/arch/arm64/lib/strchr.S           xen/arch/arm/arm64/lib/strchr.S
> +linux/arch/arm64/lib/strrchr.S          xen/arch/arm/arm64/lib/strrchr.S
>  
>  ---------------------------------------------------------------------
>  
> -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
> +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
>  
> -linux/arch/arm64/lib/clear_page.S       unused in Xen
> -linux/arch/arm64/lib/copy_page.S        xen/arch/arm/arm64/lib/copy_page.S
> +linux/arch/arm64/lib/clear_page.S       xen/arch/arm/arm64/lib/clear_page.S
> +linux/arch/arm64/lib/copy_page.S        unused in Xen
>  
>  =====================================================================
>  arm32
> diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
> index b895afa..2e7fb64 100644
> --- a/xen/arch/arm/arm64/lib/Makefile
> +++ b/xen/arch/arm/arm64/lib/Makefile
> @@ -1,4 +1,4 @@
> -obj-y += memcpy.o memmove.o memset.o memchr.o
> +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
>  obj-y += clear_page.o
>  obj-y += bitops.o find_next_bit.o
>  obj-y += strchr.o strrchr.o
> diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
> new file mode 100644
> index 0000000..84669d1
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/assembler.h
> @@ -0,0 +1,13 @@
> +#ifndef __ASM_ASSEMBLER_H__
> +#define __ASM_ASSEMBLER_H__
> +
> +#ifndef __ASSEMBLY__
> +#error "Only include this from assembly code"
> +#endif
> +
> +/* Only LE support so far */
> +#define CPU_BE(x...)
> +#define CPU_LE(x...) x
> +
> +#endif /* __ASM_ASSEMBLER_H__ */
> +
> diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
> index 3cc1b01..b04590c 100644
> --- a/xen/arch/arm/arm64/lib/memchr.S
> +++ b/xen/arch/arm/arm64/lib/memchr.S
> @@ -18,6 +18,7 @@
>   */
>  
>  #include <xen/config.h>
> +#include "assembler.h"
>  
>  /*
>   * Find a character in an area of memory.
> diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
> new file mode 100644
> index 0000000..9aad925
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/memcmp.S
> @@ -0,0 +1,258 @@
> +/*
> + * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/config.h>
> +#include "assembler.h"
> +
> +/*
> +* compare memory areas(when two memory areas' offset are different,
> +* alignment handled by the hardware)
> +*
> +* Parameters:
> +*  x0 - const memory area 1 pointer
> +*  x1 - const memory area 2 pointer
> +*  x2 - the maximal compare byte length
> +* Returns:
> +*  x0 - a compare result, maybe less than, equal to, or greater than ZERO
> +*/
> +
> +/* Parameters and result.  */
> +src1         .req    x0
> +src2         .req    x1
> +limit                .req    x2
> +result               .req    x0
> +
> +/* Internal variables.  */
> +data1                .req    x3
> +data1w               .req    w3
> +data2                .req    x4
> +data2w               .req    w4
> +has_nul              .req    x5
> +diff         .req    x6
> +endloop              .req    x7
> +tmp1         .req    x8
> +tmp2         .req    x9
> +tmp3         .req    x10
> +pos          .req    x11
> +limit_wd     .req    x12
> +mask         .req    x13
> +
> +ENTRY(memcmp)
> +     cbz     limit, .Lret0
> +     eor     tmp1, src1, src2
> +     tst     tmp1, #7
> +     b.ne    .Lmisaligned8
> +     ands    tmp1, src1, #7
> +     b.ne    .Lmutual_align
> +     sub     limit_wd, limit, #1 /* limit != 0, so no underflow.  */
> +     lsr     limit_wd, limit_wd, #3 /* Convert to Dwords.  */
> +     /*
> +     * The input source addresses are at alignment boundary.
> +     * Directly compare eight bytes each time.
> +     */
> +.Lloop_aligned:
> +     ldr     data1, [src1], #8
> +     ldr     data2, [src2], #8
> +.Lstart_realigned:
> +     subs    limit_wd, limit_wd, #1
> +     eor     diff, data1, data2      /* Non-zero if differences found.  */
> +     csinv   endloop, diff, xzr, cs  /* Last Dword or differences.  */
> +     cbz     endloop, .Lloop_aligned
> +
> +     /* Not reached the limit, must have found a diff.  */
> +     tbz     limit_wd, #63, .Lnot_limit
> +
> +     /* Limit % 8 == 0 => the diff is in the last 8 bytes. */
> +     ands    limit, limit, #7
> +     b.eq    .Lnot_limit
> +     /*
> +     * The remained bytes less than 8. It is needed to extract valid data
> +     * from last eight bytes of the intended memory range.
> +     */
> +     lsl     limit, limit, #3        /* bytes-> bits.  */
> +     mov     mask, #~0
> +CPU_BE( lsr  mask, mask, limit )
> +CPU_LE( lsl  mask, mask, limit )
> +     bic     data1, data1, mask
> +     bic     data2, data2, mask
> +
> +     orr     diff, diff, mask
> +     b       .Lnot_limit
> +
> +.Lmutual_align:
> +     /*
> +     * Sources are mutually aligned, but are not currently at an
> +     * alignment boundary. Round down the addresses and then mask off
> +     * the bytes that precede the start point.
> +     */
> +     bic     src1, src1, #7
> +     bic     src2, src2, #7
> +     ldr     data1, [src1], #8
> +     ldr     data2, [src2], #8
> +     /*
> +     * We can not add limit with alignment offset(tmp1) here. Since the
> +     * addition probably make the limit overflown.
> +     */
> +     sub     limit_wd, limit, #1/*limit != 0, so no underflow.*/
> +     and     tmp3, limit_wd, #7
> +     lsr     limit_wd, limit_wd, #3
> +     add     tmp3, tmp3, tmp1
> +     add     limit_wd, limit_wd, tmp3, lsr #3
> +     add     limit, limit, tmp1/* Adjust the limit for the extra.  */
> +
> +     lsl     tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
> +     neg     tmp1, tmp1/* Bits to alignment -64.  */
> +     mov     tmp2, #~0
> +     /*mask off the non-intended bytes before the start address.*/
> +CPU_BE( lsl  tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
> +     /* Little-endian.  Early bytes are at LSB.  */
> +CPU_LE( lsr  tmp2, tmp2, tmp1 )
> +
> +     orr     data1, data1, tmp2
> +     orr     data2, data2, tmp2
> +     b       .Lstart_realigned
> +
> +     /*src1 and src2 have different alignment offset.*/
> +.Lmisaligned8:
> +     cmp     limit, #8
> +     b.lo    .Ltiny8proc /*limit < 8: compare byte by byte*/
> +
> +     and     tmp1, src1, #7
> +     neg     tmp1, tmp1
> +     add     tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
> +     and     tmp2, src2, #7
> +     neg     tmp2, tmp2
> +     add     tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
> +     subs    tmp3, tmp1, tmp2
> +     csel    pos, tmp1, tmp2, hi /*Choose the maximum.*/
> +
> +     sub     limit, limit, pos
> +     /*compare the proceeding bytes in the first 8 byte segment.*/
> +.Ltinycmp:
> +     ldrb    data1w, [src1], #1
> +     ldrb    data2w, [src2], #1
> +     subs    pos, pos, #1
> +     ccmp    data1w, data2w, #0, ne  /* NZCV = 0b0000.  */
> +     b.eq    .Ltinycmp
> +     cbnz    pos, 1f /*diff occurred before the last byte.*/
> +     cmp     data1w, data2w
> +     b.eq    .Lstart_align
> +1:
> +     sub     result, data1, data2
> +     ret
> +
> +.Lstart_align:
> +     lsr     limit_wd, limit, #3
> +     cbz     limit_wd, .Lremain8
> +
> +     ands    xzr, src1, #7
> +     b.eq    .Lrecal_offset
> +     /*process more leading bytes to make src1 aligned...*/
> +     add     src1, src1, tmp3 /*backwards src1 to alignment boundary*/
> +     add     src2, src2, tmp3
> +     sub     limit, limit, tmp3
> +     lsr     limit_wd, limit, #3
> +     cbz     limit_wd, .Lremain8
> +     /*load 8 bytes from aligned SRC1..*/
> +     ldr     data1, [src1], #8
> +     ldr     data2, [src2], #8
> +
> +     subs    limit_wd, limit_wd, #1
> +     eor     diff, data1, data2  /*Non-zero if differences found.*/
> +     csinv   endloop, diff, xzr, ne
> +     cbnz    endloop, .Lunequal_proc
> +     /*How far is the current SRC2 from the alignment boundary...*/
> +     and     tmp3, tmp3, #7
> +
> +.Lrecal_offset:/*src1 is aligned now..*/
> +     neg     pos, tmp3
> +.Lloopcmp_proc:
> +     /*
> +     * Divide the eight bytes into two parts. First,backwards the src2
> +     * to an alignment boundary,load eight bytes and compare from
> +     * the SRC2 alignment boundary. If all 8 bytes are equal,then start
> +     * the second part's comparison. Otherwise finish the comparison.
> +     * This special handle can garantee all the accesses are in the
> +     * thread/task space in avoid to overrange access.
> +     */
> +     ldr     data1, [src1,pos]
> +     ldr     data2, [src2,pos]
> +     eor     diff, data1, data2  /* Non-zero if differences found.  */
> +     cbnz    diff, .Lnot_limit
> +
> +     /*The second part process*/
> +     ldr     data1, [src1], #8
> +     ldr     data2, [src2], #8
> +     eor     diff, data1, data2  /* Non-zero if differences found.  */
> +     subs    limit_wd, limit_wd, #1
> +     csinv   endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
> +     cbz     endloop, .Lloopcmp_proc
> +.Lunequal_proc:
> +     cbz     diff, .Lremain8
> +
> +/*There is differnence occured in the latest comparison.*/
> +.Lnot_limit:
> +/*
> +* For little endian,reverse the low significant equal bits into MSB,then
> +* following CLZ can find how many equal bits exist.
> +*/
> +CPU_LE( rev  diff, diff )
> +CPU_LE( rev  data1, data1 )
> +CPU_LE( rev  data2, data2 )
> +
> +     /*
> +     * The MS-non-zero bit of DIFF marks either the first bit
> +     * that is different, or the end of the significant data.
> +     * Shifting left now will bring the critical information into the
> +     * top bits.
> +     */
> +     clz     pos, diff
> +     lsl     data1, data1, pos
> +     lsl     data2, data2, pos
> +     /*
> +     * We need to zero-extend (char is unsigned) the value and then
> +     * perform a signed subtraction.
> +     */
> +     lsr     data1, data1, #56
> +     sub     result, data1, data2, lsr #56
> +     ret
> +
> +.Lremain8:
> +     /* Limit % 8 == 0 =>. all data are equal.*/
> +     ands    limit, limit, #7
> +     b.eq    .Lret0
> +
> +.Ltiny8proc:
> +     ldrb    data1w, [src1], #1
> +     ldrb    data2w, [src2], #1
> +     subs    limit, limit, #1
> +
> +     ccmp    data1w, data2w, #0, ne  /* NZCV = 0b0000. */
> +     b.eq    .Ltiny8proc
> +     sub     result, data1, data2
> +     ret
> +.Lret0:
> +     mov     result, #0
> +     ret
> +ENDPROC(memcmp)
> diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
> index c8197c6..7cc885d 100644
> --- a/xen/arch/arm/arm64/lib/memcpy.S
> +++ b/xen/arch/arm/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Copy a buffer from src to dest (alignment handled by the hardware)
> @@ -26,27 +36,166 @@
>   * Returns:
>   *   x0 - dest
>   */
> +dstin        .req    x0
> +src  .req    x1
> +count        .req    x2
> +tmp1 .req    x3
> +tmp1w        .req    w3
> +tmp2 .req    x4
> +tmp2w        .req    w4
> +tmp3 .req    x5
> +tmp3w        .req    w5
> +dst  .req    x6
> +
> +A_l  .req    x7
> +A_h  .req    x8
> +B_l  .req    x9
> +B_h  .req    x10
> +C_l  .req    x11
> +C_h  .req    x12
> +D_l  .req    x13
> +D_h  .req    x14
> +
>  ENTRY(memcpy)
> -     mov     x4, x0
> -     subs    x2, x2, #8
> -     b.mi    2f
> -1:   ldr     x3, [x1], #8
> -     subs    x2, x2, #8
> -     str     x3, [x4], #8
> -     b.pl    1b
> -2:   adds    x2, x2, #4
> -     b.mi    3f
> -     ldr     w3, [x1], #4
> -     sub     x2, x2, #4
> -     str     w3, [x4], #4
> -3:   adds    x2, x2, #2
> -     b.mi    4f
> -     ldrh    w3, [x1], #2
> -     sub     x2, x2, #2
> -     strh    w3, [x4], #2
> -4:   adds    x2, x2, #1
> -     b.mi    5f
> -     ldrb    w3, [x1]
> -     strb    w3, [x4]
> -5:   ret
> +     mov     dst, dstin
> +     cmp     count, #16
> +     /*When memory length is less than 16, the accessed are not aligned.*/
> +     b.lo    .Ltiny15
> +
> +     neg     tmp2, src
> +     ands    tmp2, tmp2, #15/* Bytes to reach alignment. */
> +     b.eq    .LSrcAligned
> +     sub     count, count, tmp2
> +     /*
> +     * Copy the leading memory data from src to dst in an increasing
> +     * address order.By this way,the risk of overwritting the source
> +     * memory data is eliminated when the distance between src and
> +     * dst is less than 16. The memory accesses here are alignment.
> +     */
> +     tbz     tmp2, #0, 1f
> +     ldrb    tmp1w, [src], #1
> +     strb    tmp1w, [dst], #1
> +1:
> +     tbz     tmp2, #1, 2f
> +     ldrh    tmp1w, [src], #2
> +     strh    tmp1w, [dst], #2
> +2:
> +     tbz     tmp2, #2, 3f
> +     ldr     tmp1w, [src], #4
> +     str     tmp1w, [dst], #4
> +3:
> +     tbz     tmp2, #3, .LSrcAligned
> +     ldr     tmp1, [src],#8
> +     str     tmp1, [dst],#8
> +
> +.LSrcAligned:
> +     cmp     count, #64
> +     b.ge    .Lcpy_over64
> +     /*
> +     * Deal with small copies quickly by dropping straight into the
> +     * exit block.
> +     */
> +.Ltail63:
> +     /*
> +     * Copy up to 48 bytes of data. At this point we only need the
> +     * bottom 6 bits of count to be accurate.
> +     */
> +     ands    tmp1, count, #0x30
> +     b.eq    .Ltiny15
> +     cmp     tmp1w, #0x20
> +     b.eq    1f
> +     b.lt    2f
> +     ldp     A_l, A_h, [src], #16
> +     stp     A_l, A_h, [dst], #16
> +1:
> +     ldp     A_l, A_h, [src], #16
> +     stp     A_l, A_h, [dst], #16
> +2:
> +     ldp     A_l, A_h, [src], #16
> +     stp     A_l, A_h, [dst], #16
> +.Ltiny15:
> +     /*
> +     * Prefer to break one ldp/stp into several load/store to access
> +     * memory in an increasing address order,rather than to load/store 16
> +     * bytes from (src-16) to (dst-16) and to backward the src to aligned
> +     * address,which way is used in original cortex memcpy. If keeping
> +     * the original memcpy process here, memmove need to satisfy the
> +     * precondition that src address is at least 16 bytes bigger than dst
> +     * address,otherwise some source data will be overwritten when memove
> +     * call memcpy directly. To make memmove simpler and decouple the
> +     * memcpy's dependency on memmove, withdrew the original process.
> +     */
> +     tbz     count, #3, 1f
> +     ldr     tmp1, [src], #8
> +     str     tmp1, [dst], #8
> +1:
> +     tbz     count, #2, 2f
> +     ldr     tmp1w, [src], #4
> +     str     tmp1w, [dst], #4
> +2:
> +     tbz     count, #1, 3f
> +     ldrh    tmp1w, [src], #2
> +     strh    tmp1w, [dst], #2
> +3:
> +     tbz     count, #0, .Lexitfunc
> +     ldrb    tmp1w, [src]
> +     strb    tmp1w, [dst]
> +
> +.Lexitfunc:
> +     ret
> +
> +.Lcpy_over64:
> +     subs    count, count, #128
> +     b.ge    .Lcpy_body_large
> +     /*
> +     * Less than 128 bytes to copy, so handle 64 here and then jump
> +     * to the tail.
> +     */
> +     ldp     A_l, A_h, [src],#16
> +     stp     A_l, A_h, [dst],#16
> +     ldp     B_l, B_h, [src],#16
> +     ldp     C_l, C_h, [src],#16
> +     stp     B_l, B_h, [dst],#16
> +     stp     C_l, C_h, [dst],#16
> +     ldp     D_l, D_h, [src],#16
> +     stp     D_l, D_h, [dst],#16
> +
> +     tst     count, #0x3f
> +     b.ne    .Ltail63
> +     ret
> +
> +     /*
> +     * Critical loop.  Start at a new cache line boundary.  Assuming
> +     * 64 bytes per line this ensures the entire loop is in one line.
> +     */
> +     .p2align        L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +     /* pre-get 64 bytes data. */
> +     ldp     A_l, A_h, [src],#16
> +     ldp     B_l, B_h, [src],#16
> +     ldp     C_l, C_h, [src],#16
> +     ldp     D_l, D_h, [src],#16
> +1:
> +     /*
> +     * interlace the load of next 64 bytes data block with store of the last
> +     * loaded 64 bytes data.
> +     */
> +     stp     A_l, A_h, [dst],#16
> +     ldp     A_l, A_h, [src],#16
> +     stp     B_l, B_h, [dst],#16
> +     ldp     B_l, B_h, [src],#16
> +     stp     C_l, C_h, [dst],#16
> +     ldp     C_l, C_h, [src],#16
> +     stp     D_l, D_h, [dst],#16
> +     ldp     D_l, D_h, [src],#16
> +     subs    count, count, #64
> +     b.ge    1b
> +     stp     A_l, A_h, [dst],#16
> +     stp     B_l, B_h, [dst],#16
> +     stp     C_l, C_h, [dst],#16
> +     stp     D_l, D_h, [dst],#16
> +
> +     tst     count, #0x3f
> +     b.ne    .Ltail63
> +     ret
>  ENDPROC(memcpy)
> diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
> index 1bf0936..f4065b9 100644
> --- a/xen/arch/arm/arm64/lib/memmove.S
> +++ b/xen/arch/arm/arm64/lib/memmove.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Move a buffer from src to test (alignment handled by the hardware).
> @@ -27,30 +37,161 @@
>   * Returns:
>   *   x0 - dest
>   */
> +dstin        .req    x0
> +src  .req    x1
> +count        .req    x2
> +tmp1 .req    x3
> +tmp1w        .req    w3
> +tmp2 .req    x4
> +tmp2w        .req    w4
> +tmp3 .req    x5
> +tmp3w        .req    w5
> +dst  .req    x6
> +
> +A_l  .req    x7
> +A_h  .req    x8
> +B_l  .req    x9
> +B_h  .req    x10
> +C_l  .req    x11
> +C_h  .req    x12
> +D_l  .req    x13
> +D_h  .req    x14
> +
>  ENTRY(memmove)
> -     cmp     x0, x1
> -     b.ls    memcpy
> -     add     x4, x0, x2
> -     add     x1, x1, x2
> -     subs    x2, x2, #8
> -     b.mi    2f
> -1:   ldr     x3, [x1, #-8]!
> -     subs    x2, x2, #8
> -     str     x3, [x4, #-8]!
> -     b.pl    1b
> -2:   adds    x2, x2, #4
> -     b.mi    3f
> -     ldr     w3, [x1, #-4]!
> -     sub     x2, x2, #4
> -     str     w3, [x4, #-4]!
> -3:   adds    x2, x2, #2
> -     b.mi    4f
> -     ldrh    w3, [x1, #-2]!
> -     sub     x2, x2, #2
> -     strh    w3, [x4, #-2]!
> -4:   adds    x2, x2, #1
> -     b.mi    5f
> -     ldrb    w3, [x1, #-1]
> -     strb    w3, [x4, #-1]
> -5:   ret
> +     cmp     dstin, src
> +     b.lo    memcpy
> +     add     tmp1, src, count
> +     cmp     dstin, tmp1
> +     b.hs    memcpy          /* No overlap.  */
> +
> +     add     dst, dstin, count
> +     add     src, src, count
> +     cmp     count, #16
> +     b.lo    .Ltail15  /*probably non-alignment accesses.*/
> +
> +     ands    tmp2, src, #15     /* Bytes to reach alignment.  */
> +     b.eq    .LSrcAligned
> +     sub     count, count, tmp2
> +     /*
> +     * process the aligned offset length to make the src aligned firstly.
> +     * those extra instructions' cost is acceptable. It also make the
> +     * coming accesses are based on aligned address.
> +     */
> +     tbz     tmp2, #0, 1f
> +     ldrb    tmp1w, [src, #-1]!
> +     strb    tmp1w, [dst, #-1]!
> +1:
> +     tbz     tmp2, #1, 2f
> +     ldrh    tmp1w, [src, #-2]!
> +     strh    tmp1w, [dst, #-2]!
> +2:
> +     tbz     tmp2, #2, 3f
> +     ldr     tmp1w, [src, #-4]!
> +     str     tmp1w, [dst, #-4]!
> +3:
> +     tbz     tmp2, #3, .LSrcAligned
> +     ldr     tmp1, [src, #-8]!
> +     str     tmp1, [dst, #-8]!
> +
> +.LSrcAligned:
> +     cmp     count, #64
> +     b.ge    .Lcpy_over64
> +
> +     /*
> +     * Deal with small copies quickly by dropping straight into the
> +     * exit block.
> +     */
> +.Ltail63:
> +     /*
> +     * Copy up to 48 bytes of data. At this point we only need the
> +     * bottom 6 bits of count to be accurate.
> +     */
> +     ands    tmp1, count, #0x30
> +     b.eq    .Ltail15
> +     cmp     tmp1w, #0x20
> +     b.eq    1f
> +     b.lt    2f
> +     ldp     A_l, A_h, [src, #-16]!
> +     stp     A_l, A_h, [dst, #-16]!
> +1:
> +     ldp     A_l, A_h, [src, #-16]!
> +     stp     A_l, A_h, [dst, #-16]!
> +2:
> +     ldp     A_l, A_h, [src, #-16]!
> +     stp     A_l, A_h, [dst, #-16]!
> +
> +.Ltail15:
> +     tbz     count, #3, 1f
> +     ldr     tmp1, [src, #-8]!
> +     str     tmp1, [dst, #-8]!
> +1:
> +     tbz     count, #2, 2f
> +     ldr     tmp1w, [src, #-4]!
> +     str     tmp1w, [dst, #-4]!
> +2:
> +     tbz     count, #1, 3f
> +     ldrh    tmp1w, [src, #-2]!
> +     strh    tmp1w, [dst, #-2]!
> +3:
> +     tbz     count, #0, .Lexitfunc
> +     ldrb    tmp1w, [src, #-1]
> +     strb    tmp1w, [dst, #-1]
> +
> +.Lexitfunc:
> +     ret
> +
> +.Lcpy_over64:
> +     subs    count, count, #128
> +     b.ge    .Lcpy_body_large
> +     /*
> +     * Less than 128 bytes to copy, so handle 64 bytes here and then jump
> +     * to the tail.
> +     */
> +     ldp     A_l, A_h, [src, #-16]
> +     stp     A_l, A_h, [dst, #-16]
> +     ldp     B_l, B_h, [src, #-32]
> +     ldp     C_l, C_h, [src, #-48]
> +     stp     B_l, B_h, [dst, #-32]
> +     stp     C_l, C_h, [dst, #-48]
> +     ldp     D_l, D_h, [src, #-64]!
> +     stp     D_l, D_h, [dst, #-64]!
> +
> +     tst     count, #0x3f
> +     b.ne    .Ltail63
> +     ret
> +
> +     /*
> +     * Critical loop. Start at a new cache line boundary. Assuming
> +     * 64 bytes per line this ensures the entire loop is in one line.
> +     */
> +     .p2align        L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +     /* pre-load 64 bytes data. */
> +     ldp     A_l, A_h, [src, #-16]
> +     ldp     B_l, B_h, [src, #-32]
> +     ldp     C_l, C_h, [src, #-48]
> +     ldp     D_l, D_h, [src, #-64]!
> +1:
> +     /*
> +     * interlace the load of next 64 bytes data block with store of the last
> +     * loaded 64 bytes data.
> +     */
> +     stp     A_l, A_h, [dst, #-16]
> +     ldp     A_l, A_h, [src, #-16]
> +     stp     B_l, B_h, [dst, #-32]
> +     ldp     B_l, B_h, [src, #-32]
> +     stp     C_l, C_h, [dst, #-48]
> +     ldp     C_l, C_h, [src, #-48]
> +     stp     D_l, D_h, [dst, #-64]!
> +     ldp     D_l, D_h, [src, #-64]!
> +     subs    count, count, #64
> +     b.ge    1b
> +     stp     A_l, A_h, [dst, #-16]
> +     stp     B_l, B_h, [dst, #-32]
> +     stp     C_l, C_h, [dst, #-48]
> +     stp     D_l, D_h, [dst, #-64]!
> +
> +     tst     count, #0x3f
> +     b.ne    .Ltail63
> +     ret
>  ENDPROC(memmove)
> diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
> index 25a4fb6..4ee714d 100644
> --- a/xen/arch/arm/arm64/lib/memset.S
> +++ b/xen/arch/arm/arm64/lib/memset.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Fill in the buffer with character c (alignment handled by the hardware)
> @@ -26,27 +36,181 @@
>   * Returns:
>   *   x0 - buf
>   */
> +
> +dstin                .req    x0
> +val          .req    w1
> +count                .req    x2
> +tmp1         .req    x3
> +tmp1w                .req    w3
> +tmp2         .req    x4
> +tmp2w                .req    w4
> +zva_len_x    .req    x5
> +zva_len              .req    w5
> +zva_bits_x   .req    x6
> +
> +A_l          .req    x7
> +A_lw         .req    w7
> +dst          .req    x8
> +tmp3w                .req    w9
> +tmp3         .req    x9
> +
>  ENTRY(memset)
> -     mov     x4, x0
> -     and     w1, w1, #0xff
> -     orr     w1, w1, w1, lsl #8
> -     orr     w1, w1, w1, lsl #16
> -     orr     x1, x1, x1, lsl #32
> -     subs    x2, x2, #8
> -     b.mi    2f
> -1:   str     x1, [x4], #8
> -     subs    x2, x2, #8
> -     b.pl    1b
> -2:   adds    x2, x2, #4
> -     b.mi    3f
> -     sub     x2, x2, #4
> -     str     w1, [x4], #4
> -3:   adds    x2, x2, #2
> -     b.mi    4f
> -     sub     x2, x2, #2
> -     strh    w1, [x4], #2
> -4:   adds    x2, x2, #1
> -     b.mi    5f
> -     strb    w1, [x4]
> -5:   ret
> +     mov     dst, dstin      /* Preserve return value.  */
> +     and     A_lw, val, #255
> +     orr     A_lw, A_lw, A_lw, lsl #8
> +     orr     A_lw, A_lw, A_lw, lsl #16
> +     orr     A_l, A_l, A_l, lsl #32
> +
> +     cmp     count, #15
> +     b.hi    .Lover16_proc
> +     /*All store maybe are non-aligned..*/
> +     tbz     count, #3, 1f
> +     str     A_l, [dst], #8
> +1:
> +     tbz     count, #2, 2f
> +     str     A_lw, [dst], #4
> +2:
> +     tbz     count, #1, 3f
> +     strh    A_lw, [dst], #2
> +3:
> +     tbz     count, #0, 4f
> +     strb    A_lw, [dst]
> +4:
> +     ret
> +
> +.Lover16_proc:
> +     /*Whether  the start address is aligned with 16.*/
> +     neg     tmp2, dst
> +     ands    tmp2, tmp2, #15
> +     b.eq    .Laligned
> +/*
> +* The count is not less than 16, we can use stp to store the start 16 bytes,
> +* then adjust the dst aligned with 16.This process will make the current
> +* memory address at alignment boundary.
> +*/
> +     stp     A_l, A_l, [dst] /*non-aligned store..*/
> +     /*make the dst aligned..*/
> +     sub     count, count, tmp2
> +     add     dst, dst, tmp2
> +
> +.Laligned:
> +     cbz     A_l, .Lzero_mem
> +
> +.Ltail_maybe_long:
> +     cmp     count, #64
> +     b.ge    .Lnot_short
> +.Ltail63:
> +     ands    tmp1, count, #0x30
> +     b.eq    3f
> +     cmp     tmp1w, #0x20
> +     b.eq    1f
> +     b.lt    2f
> +     stp     A_l, A_l, [dst], #16
> +1:
> +     stp     A_l, A_l, [dst], #16
> +2:
> +     stp     A_l, A_l, [dst], #16
> +/*
> +* The last store length is less than 16,use stp to write last 16 bytes.
> +* It will lead some bytes written twice and the access is non-aligned.
> +*/
> +3:
> +     ands    count, count, #15
> +     cbz     count, 4f
> +     add     dst, dst, count
> +     stp     A_l, A_l, [dst, #-16]   /* Repeat some/all of last store. */
> +4:
> +     ret
> +
> +     /*
> +     * Critical loop. Start at a new cache line boundary. Assuming
> +     * 64 bytes per line, this ensures the entire loop is in one line.
> +     */
> +     .p2align        L1_CACHE_SHIFT
> +.Lnot_short:
> +     sub     dst, dst, #16/* Pre-bias.  */
> +     sub     count, count, #64
> +1:
> +     stp     A_l, A_l, [dst, #16]
> +     stp     A_l, A_l, [dst, #32]
> +     stp     A_l, A_l, [dst, #48]
> +     stp     A_l, A_l, [dst, #64]!
> +     subs    count, count, #64
> +     b.ge    1b
> +     tst     count, #0x3f
> +     add     dst, dst, #16
> +     b.ne    .Ltail63
> +.Lexitfunc:
> +     ret
> +
> +     /*
> +     * For zeroing memory, check to see if we can use the ZVA feature to
> +     * zero entire 'cache' lines.
> +     */
> +.Lzero_mem:
> +     cmp     count, #63
> +     b.le    .Ltail63
> +     /*
> +     * For zeroing small amounts of memory, it's not worth setting up
> +     * the line-clear code.
> +     */
> +     cmp     count, #128
> +     b.lt    .Lnot_short /*count is at least  128 bytes*/
> +
> +     mrs     tmp1, dczid_el0
> +     tbnz    tmp1, #4, .Lnot_short
> +     mov     tmp3w, #4
> +     and     zva_len, tmp1w, #15     /* Safety: other bits reserved.  */
> +     lsl     zva_len, tmp3w, zva_len
> +
> +     ands    tmp3w, zva_len, #63
> +     /*
> +     * ensure the zva_len is not less than 64.
> +     * It is not meaningful to use ZVA if the block size is less than 64.
> +     */
> +     b.ne    .Lnot_short
> +.Lzero_by_line:
> +     /*
> +     * Compute how far we need to go to become suitably aligned. We're
> +     * already at quad-word alignment.
> +     */
> +     cmp     count, zva_len_x
> +     b.lt    .Lnot_short             /* Not enough to reach alignment.  */
> +     sub     zva_bits_x, zva_len_x, #1
> +     neg     tmp2, dst
> +     ands    tmp2, tmp2, zva_bits_x
> +     b.eq    2f                      /* Already aligned.  */
> +     /* Not aligned, check that there's enough to copy after alignment.*/
> +     sub     tmp1, count, tmp2
> +     /*
> +     * grantee the remain length to be ZVA is bigger than 64,
> +     * avoid to make the 2f's process over mem range.*/
> +     cmp     tmp1, #64
> +     ccmp    tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
> +     b.lt    .Lnot_short
> +     /*
> +     * We know that there's at least 64 bytes to zero and that it's safe
> +     * to overrun by 64 bytes.
> +     */
> +     mov     count, tmp1
> +1:
> +     stp     A_l, A_l, [dst]
> +     stp     A_l, A_l, [dst, #16]
> +     stp     A_l, A_l, [dst, #32]
> +     subs    tmp2, tmp2, #64
> +     stp     A_l, A_l, [dst, #48]
> +     add     dst, dst, #64
> +     b.ge    1b
> +     /* We've overrun a bit, so adjust dst downwards.*/
> +     add     dst, dst, tmp2
> +2:
> +     sub     count, count, zva_len_x
> +3:
> +     dc      zva, dst
> +     add     dst, dst, zva_len_x
> +     subs    count, count, zva_len_x
> +     b.ge    3b
> +     ands    count, count, zva_bits_x
> +     b.ne    .Ltail_maybe_long
> +     ret
>  ENDPROC(memset)
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 3f4e7a1..9a511f2 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>       return ret;
>  }
>  
> +#define xchg(ptr,x) \
> +     ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +
>  /*
>   * Atomic compare and exchange.  Compare OLD with MEM, if identical,
>   * store NEW in MEM.  Return the initial value in MEM.  Success is
> diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
> index b5d50f2..b49219e 100644
> --- a/xen/include/asm-arm/arm64/atomic.h
> +++ b/xen/include/asm-arm/arm64/atomic.h
> @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
>  
>  #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
>  
> -#define smp_mb__before_atomic_dec()  smp_mb()
> -#define smp_mb__after_atomic_dec()   smp_mb()
> -#define smp_mb__before_atomic_inc()  smp_mb()
> -#define smp_mb__after_atomic_inc()   smp_mb()
> -
>  #endif
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
> index 4e930ce..ae42b2f 100644
> --- a/xen/include/asm-arm/arm64/cmpxchg.h
> +++ b/xen/include/asm-arm/arm64/cmpxchg.h
> @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  }
>  
>  #define xchg(ptr,x) \
> -     ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +({ \
> +     __typeof__(*(ptr)) __ret; \
> +     __ret = (__typeof__(*(ptr))) \
> +             __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
> +     __ret; \
> +})
>  
>  extern void __bad_cmpxchg(volatile void *ptr, int size);
>  
> @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
>       return ret;
>  }
>  
> -#define cmpxchg(ptr,o,n)                                             \
> -     ((__typeof__(*(ptr)))__cmpxchg_mb((ptr),                        \
> -                                       (unsigned long)(o),           \
> -                                       (unsigned long)(n),           \
> -                                       sizeof(*(ptr))))
> -
> -#define cmpxchg_local(ptr,o,n)                                               \
> -     ((__typeof__(*(ptr)))__cmpxchg((ptr),                           \
> -                                    (unsigned long)(o),              \
> -                                    (unsigned long)(n),              \
> -                                    sizeof(*(ptr))))
> +#define cmpxchg(ptr, o, n) \
> +({ \
> +     __typeof__(*(ptr)) __ret; \
> +     __ret = (__typeof__(*(ptr))) \
> +             __cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
> +                          sizeof(*(ptr))); \
> +     __ret; \
> +})
> +
> +#define cmpxchg_local(ptr, o, n) \
> +({ \
> +     __typeof__(*(ptr)) __ret; \
> +     __ret = (__typeof__(*(ptr))) \
> +             __cmpxchg((ptr), (unsigned long)(o), \
> +                       (unsigned long)(n), sizeof(*(ptr))); \
> +     __ret; \
> +})
>  
>  #endif
>  /*
> diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
> index 3242762..dfad1fe 100644
> --- a/xen/include/asm-arm/string.h
> +++ b/xen/include/asm-arm/string.h
> @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
>  #define __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *, const void *, __kernel_size_t);
>  
> +#if defined(CONFIG_ARM_64)
> +#define __HAVE_ARCH_MEMCMP
> +extern int memcmp(const void *, const void *, __kernel_size_t);
> +#endif
> +
>  /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
>  #define __HAVE_ARCH_MEMMOVE
>  extern void *memmove(void *dest, const void *src, size_t n);
> diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> index 7aaaf50..ce3d38a 100644
> --- a/xen/include/asm-arm/system.h
> +++ b/xen/include/asm-arm/system.h
> @@ -33,9 +33,6 @@
>  
>  #define smp_wmb()       dmb(ishst)
>  
> -#define xchg(ptr,x) \
> -        ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> -
>  /*
>   * This is used to ensure the compiler did actually allocate the register we
>   * asked it for some inline assembly sequences.  Apparently we can't trust
> 


-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

