CN108764037B

CN108764037B - Face detection implementation method based on ARM Cortex-A series platform

Info

Publication number: CN108764037B
Application number: CN201810372936.5A
Authority: CN
Inventors: 洪朝群; 王善炮
Original assignee: Shishi Senke Intelligent Technology Co ltd
Current assignee: Shishi Senke Intelligent Technology Co ltd
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2021-12-24
Anticipated expiration: 2038-04-24
Also published as: CN108764037A

Abstract

The invention discloses a face detection implementation method based on the ARM Cortex-A series platform, comprising the following steps: S1, in the hardware environment of an ARM Cortex-A series processor, modifying the source code of the FaceDetection module of SeetaFace and changing the compiler type to a cross compiler; S2, adding the NEON compile options to the compiler settings; S3, replacing the header file required by the SSE instructions in FaceDetection with the header files required by NEON; S4, rewriting the parts of the FaceDetection source code that use SSE instructions as NEON instructions, and changing the functions that use SSE instructions into functions that use NEON; S5, recompiling the program with the NEON compile options added in step S2 to obtain the required dynamic link library file, thereby producing a NEON-enabled FaceDetection program for the ARM Cortex-A series processor platform. The invention effectively improves the efficiency of face detection with SeetaFace on hardware that uses a Cortex-A series processor.

Description

Face detection implementation method based on ARM Cortex-A series platform

Technical Field

The invention relates to the field of computer image processing, and in particular to a face detection implementation method based on the ARM Cortex-A series platform.

Background

Existing face detection places certain demands on the image processing hardware, so the face detection module of SeetaFace is at present mostly used on x86 platforms; the detection process is shown in FIG. 1. ARM processors are mostly used in mobile and embedded devices. Such hardware trades performance for lower energy consumption, so its performance is limited and face detection runs inefficiently in these hardware environments.

SIMD (Single Instruction, Multiple Data) is a technique in which one controller drives multiple processing elements that perform the same operation simultaneously on each element of a set of data (also called a "data vector"), thereby achieving spatial parallelism. In a microprocessor, SIMD means that a single controller controls multiple parallel processing elements.
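To make the idea of "the same operation on each element of a data vector" concrete, here is a minimal sketch in portable C. The names `vec4f` and `vec4f_scale` are hypothetical and used only for illustration; a real SIMD unit would process all four lanes with one instruction, whereas the loop below only models the semantics.

```c
#include <stddef.h>

/* Hypothetical 4-lane data vector, for illustration only. */
typedef struct { float lane[4]; } vec4f;

/* Apply the same operation (multiply by k) to every lane.
 * SIMD hardware would handle all four lanes with a single
 * instruction; this loop only models what that instruction computes. */
static vec4f vec4f_scale(vec4f v, float k)
{
    vec4f out;
    for (size_t i = 0; i < 4; ++i) {
        out.lane[i] = v.lane[i] * k;
    }
    return out;
}
```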

Different processors implement SIMD in different ways: for example, Intel processors use SSE (Streaming SIMD Extensions), while ARM platforms use the NEON extension. NEON is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension for ARM Cortex™-A series processors, intended to provide flexible, powerful acceleration for consumer multimedia applications and thereby significantly improve the user experience. It has 32 registers that are 64 bits wide (in the dual view, 16 registers that are 128 bits wide).

The main problem with using SeetaFace today is as follows: when SeetaFace is used for face detection, SSE instructions can accelerate it on an x86 platform, but an ARM platform has a different instruction set and the original acceleration method cannot be used, so detection efficiency is low.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a face detection implementation method based on the Cortex-A7 platform that effectively improves the efficiency of face detection with SeetaFace on hardware that uses a Cortex-A series processor.

To achieve this purpose, the invention adopts the following technical solution:

A face detection implementation method based on the ARM Cortex-A series platform comprises the following steps:

S1, in the hardware environment of an ARM Cortex-A series processor, modifying the source code of the FaceDetection module of SeetaFace and changing the compiler type to a cross compiler;

S2, adding the NEON compile options to the compiler settings;

S3, replacing the header file required by the SSE instructions in FaceDetection with the header files required by NEON;

S4, rewriting the parts of the FaceDetection source code that use SSE instructions as NEON instructions, and changing the functions that use SSE instructions into functions that use NEON;

S5, recompiling the program with the NEON compile options added in step S2 to obtain the required dynamic link library file, thereby producing a NEON-enabled FaceDetection program for the ARM Cortex-A series processor platform.

It should be noted that the specific operations in step S1 are as follows:

Modify the SET commands:

1) Set the system type, selecting Linux:

SET(CMAKE_SYSTEM_NAME Linux)

2) Set the cross-compiler path, which enables the cross compiler and adds its path:

SET(CMAKE_CXX_COMPILER "/opt/hisi-linux/x86-arm/arm-hisiv400-linux/bin/arm-hisiv400-linux-gnueabi-g++")

It should be noted that the specific operations in step S2 are as follows:

In CMakeLists.txt, modify the instruction-related settings to enable NEON by adding the NEON compile options to the compiler options set with the SET command:

-mfloat-abi=softfp -mfpu=neon
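One way to confirm that the NEON options actually reached the compiler is to test its predefined feature macros from C. The sketch below assumes a GCC-compatible cross compiler; `neon_enabled` is a hypothetical helper name and is not part of SeetaFace.

```c
/* Returns 1 when this translation unit was compiled with NEON enabled
 * and 0 otherwise. __ARM_NEON is the ACLE feature macro; __ARM_NEON__
 * is the older spelling that GCC also defines. */
static int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

If the function returns 0 in a build that is meant to target the Cortex-A platform, the compile options were probably not applied to that source file.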

It should be noted that the specific operations in step S3 are as follows:

Replace the header file immintrin.h required by the original SSE instructions with the function implementations and header files required by the NEON instructions. These comprise SseToNeon.h and the NEON instruction header file arm_neon.h, and include the NEON function implementations required by the project.
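The header replacement can also be expressed as a compile-time switch, so that the same source still builds on x86. This is a minimal sketch, not the content of SseToNeon.h: the alias `vec128i`, the macro `SIMD_BACKEND` and the function `simd_backend` are hypothetical names, and the plain C fallback is an assumption added so the fragment compiles anywhere.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>                 /* NEON intrinsics and vector types */
typedef int32x4_t vec128i;            /* hypothetical alias */
#define SIMD_BACKEND "neon"
#elif defined(__SSE2__)
#include <immintrin.h>                /* original x86 SSE header */
typedef __m128i vec128i;
#define SIMD_BACKEND "sse"
#else
typedef struct { int32_t lane[4]; } vec128i;   /* plain C fallback */
#define SIMD_BACKEND "scalar"
#endif

/* Name of the backend that was selected at compile time. */
static const char *simd_backend(void)
{
    return SIMD_BACKEND;
}
```

In every branch the vector type is 128 bits wide, which is what lets code written against the SSE types be mapped onto NEON registers.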

It should be noted that the specific process of step S4 is as follows:

Convert the original SSE instructions into NEON instructions under the ARM instruction set.

First, replace the code that originally used SSE. The functions in the code that use SSE instructions are as follows:

_mm_add_epi32(__m128i a, __m128i b) (①);

_mm_sub_epi32(__m128i a, __m128i b) (②);

_mm_mullo_epi32(__m128i a, __m128i b) (③);

_mm_mul_ps(__m128 a, __m128 b) (④);

_mm_cmpgt_ps(__m128 a, __m128 b) (⑤);

_mm_set_epi32(int i3, int i2, int i1, int i0) (⑥);

wherein:

the _mm_add_epi32() function adds four 32-bit integers at a time and returns the sums; the replacement for function ① is vaddq_s32(a, b); the function prototype of vaddq_s32() is int32x4_t vaddq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_add_epi32();

the _mm_sub_epi32() function subtracts four 32-bit integers at a time and returns the differences; the replacement for function ② is vsubq_s32(a, b); the function prototype of vsubq_s32() is int32x4_t vsubq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_sub_epi32();

the _mm_mullo_epi32() function multiplies four 32-bit integers at a time and returns the products; the replacement for function ③ is vmulq_s32(a, b); the function prototype of vmulq_s32() is int32x4_t vmulq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_mullo_epi32();

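the _mm_mul_ps() function multiplies four 32-bit single-precision floating-point numbers at a time and returns the products; for function ④, the result is returned in a register of type __m128, and the function is implemented as follows:

INLINE __m128 _mm_mul_ps(__m128 a, __m128 b)
{
    __m128 ret;
    ret[0] = a[0] * b[0];
    ret[1] = a[1] * b[1];
    ret[2] = a[2] * b[2];
    ret[3] = a[3] * b[3];
    return ret;
}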

the _mm_cmpgt_ps() function performs a greater-than comparison; the replacement for function ⑤ is (__m128) vcreq_f32(a, b); the function prototype of vcreq_f32() is float32x4_t vcreq_f32(float32x4_t __a, float32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_cmple_ps();

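the _mm_set_epi32() function sets four signed 32-bit integer values; the replacement for function ⑥ is vreinterpretq_m128i_s32(vld1q_s32(data));

the return type of vreinterpretq_m128i_s32() is defined by the following macro definitions:

#define _MM_SHUFFLE(z,y,x,w) ((z<<6)|(y<<4)|(x<<2)|w)
#define vreinterpretq_m128i_s32(x) (x)
#define vreinterpretq_m128i_u32(x) vreinterpretq_s32_u32(x)
#define vreinterpretq_s32_m128i(x) (x)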

The beneficial effects of the invention are as follows: by modifying the FaceDetection module of SeetaFace, the method uses the NEON instructions supported by the ARM platform to accelerate the vector computation in the code, which speeds up the program and improves the efficiency of face detection with SeetaFace.

Drawings

FIG. 1 is a schematic diagram of the process of performing face detection with SeetaFace;

FIG. 2 is a schematic flow chart of an embodiment of the present invention;

FIG. 3 is a flow chart of adding NEON according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, and it should be noted that the following examples are provided to illustrate the detailed embodiments and specific operations based on the technical solutions of the present invention, but the scope of the present invention is not limited to the examples.

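As shown in FIG. 2, a face detection implementation method based on the ARM Cortex-A series platform comprises the following steps:

S1, in the hardware environment of an ARM Cortex-A series processor, modifying the source code of the FaceDetection module of SeetaFace and changing the compiler type to a cross compiler;

S2, adding the NEON compile options to the compiler settings;

S3, replacing the header file required by the SSE instructions in FaceDetection with the header files required by NEON;

S4, rewriting the parts of the FaceDetection source code that use SSE instructions as NEON instructions, and changing the functions that use SSE instructions into functions that use NEON;

S5, recompiling the program with the NEON compile options added in step S2 to obtain the required dynamic link library file, thereby producing a NEON-enabled FaceDetection program for the ARM Cortex-A series processor platform.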

Example 1

Step S1, in the hardware environment of an ARM Cortex-A series processor, modify the source code of the FaceDetection module of SeetaFace and change the compiler type to a cross compiler:

The SET commands are modified as follows:

SET(CMAKE_SYSTEM_NAME Linux) (①)

SET(CMAKE_CXX_COMPILER "/opt/hisi-linux/x86-arm/arm-hisiv400-linux/bin/arm-hisiv400-linux-gnueabi-g++") (②)

①: Set the system type. Linux is selected; this setting is required when a cross compiler is used.

②: Set the cross-compiler path. This enables the cross compiler and adds its path.

Step S2, add the NEON compile options to the compiler settings:

In CMakeLists.txt, modify the instruction-related settings to enable NEON by adding the NEON compile options to the compiler options set with the SET command:

-mfloat-abi=softfp -mfpu=neon

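Step S3, replace the header file required by the original SSE instructions in FaceDetection with the header files required by NEON:

The header file immintrin.h required by the original SSE instructions is replaced with the function implementations and header files required by the NEON instructions, namely SseToNeon.h and the NEON instruction header file arm_neon.h.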

Step S4, rewrite the parts of the FaceDetection source code that use SSE instructions as NEON instructions, and change the functions that use SSE instructions into functions that use NEON:

The code that originally used SSE is replaced. The functions in the code that use SSE instructions are as follows:

_mm_add_epi32(__m128i a, __m128i b) (①)

_mm_sub_epi32(__m128i a, __m128i b) (②)

_mm_mullo_epi32(__m128i a, __m128i b) (③)

_mm_mul_ps(__m128 a, __m128 b) (④)

_mm_cmpgt_ps(__m128 a, __m128 b) (⑤)

_mm_set_epi32(int i3, int i2, int i1, int i0) (⑥)

①: The _mm_add_epi32() function adds four 32-bit integers at a time and returns the sums. The replacement function is vaddq_s32(a, b).

The function prototype of vaddq_s32() is int32x4_t vaddq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_add_epi32().

②: The _mm_sub_epi32() function subtracts four 32-bit integers at a time and returns the differences. The replacement function is vsubq_s32(a, b).

The function prototype of vsubq_s32() is int32x4_t vsubq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_sub_epi32().

③: The _mm_mullo_epi32() function multiplies four 32-bit integers at a time and returns the products.

The replacement function is vmulq_s32(a, b).

The function prototype of vmulq_s32() is int32x4_t vmulq_s32(int32x4_t __a, int32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_mullo_epi32().
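The three integer replacements (①, ② and ③) can be sketched together as follows. This is a minimal illustration under stated assumptions: `int32x4_op` is a hypothetical helper, not a function from FaceDetection or SseToNeon.h, and the scalar branch is included only so the fragment also compiles on a host without NEON and can serve as a reference for the expected lane values.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: applies op ('+', '-', anything else means '*')
 * to four 32-bit integer lanes and writes the results to out. */
static void int32x4_op(char op, const int32_t a[4], const int32_t b[4],
                       int32_t out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vr;
    if (op == '+') {
        vr = vaddq_s32(va, vb);      /* replaces _mm_add_epi32   */
    } else if (op == '-') {
        vr = vsubq_s32(va, vb);      /* replaces _mm_sub_epi32   */
    } else {
        vr = vmulq_s32(va, vb);      /* replaces _mm_mullo_epi32 */
    }
    vst1q_s32(out, vr);
#else
    /* Scalar reference; unsigned arithmetic gives the same
     * wrap-around (low 32 bits) as the vector instructions. */
    for (int i = 0; i < 4; ++i) {
        uint32_t x = (uint32_t)a[i];
        uint32_t y = (uint32_t)b[i];
        uint32_t r = (op == '+') ? x + y : (op == '-') ? x - y : x * y;
        out[i] = (int32_t)r;
    }
#endif
}
```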

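④: The _mm_mul_ps() function multiplies four 32-bit single-precision floating-point numbers at a time and returns the products. The result is returned in a register of type __m128, and the function is implemented as follows:

INLINE __m128 _mm_mul_ps(__m128 a, __m128 b)
{
    __m128 ret;
    ret[0] = a[0] * b[0];
    ret[1] = a[1] * b[1];
    ret[2] = a[2] * b[2];
    ret[3] = a[3] * b[3];
    return ret;
}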

⑤: The _mm_cmpgt_ps() function performs a greater-than comparison.

The replacement function is (__m128) vcreq_f32(a, b).

The function prototype of vcreq_f32() is float32x4_t vcreq_f32(float32x4_t __a, float32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_cmple_ps().
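The two floating-point operations (④ and ⑤) can be sketched with intrinsics that are known to exist in arm_neon.h. Two assumptions are made here and should be read as such: `vmulq_f32` is swapped in as a vector alternative to the lane-by-lane `_mm_mul_ps` implementation given in the claims, and `vcgtq_f32` is used for the greater-than comparison because the name `vcreq_f32` in the text could not be matched to an arm_neon.h intrinsic. `float32x4_mul_cmpgt` is a hypothetical helper name.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: prod[i] = a[i] * b[i], and mask[i] is all ones
 * (0xFFFFFFFF) when a[i] > b[i], otherwise 0, for four float lanes. */
static void float32x4_mul_cmpgt(const float a[4], const float b[4],
                                float prod[4], uint32_t mask[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(prod, vmulq_f32(va, vb));   /* counterpart of _mm_mul_ps   */
    vst1q_u32(mask, vcgtq_f32(va, vb));   /* counterpart of _mm_cmpgt_ps */
#else
    /* Scalar reference with the same lane-wise semantics. */
    for (int i = 0; i < 4; ++i) {
        prod[i] = a[i] * b[i];
        mask[i] = (a[i] > b[i]) ? UINT32_MAX : 0u;
    }
#endif
}
```

Note that the NEON comparison yields an unsigned integer vector (`uint32x4_t`), so code that treats the SSE result as `__m128` needs a reinterpreting cast at that point.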

⑥: The _mm_set_epi32() function sets four signed 32-bit integer values.

The replacement function is vreinterpretq_m128i_s32(vld1q_s32(data)).
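The lane order of `_mm_set_epi32` is easy to get wrong when porting, because its arguments run from the highest lane to the lowest while `vld1q_s32` loads from memory lowest lane first. The sketch below shows one way to build the `data` array; `set_epi32_lanes` is a hypothetical helper, and the array layout is an assumption about how `data` is meant to be filled, since the text does not show it.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes to out the four lanes that
 * _mm_set_epi32(i3, i2, i1, i0) would produce, lowest lane first.
 * i0 lands in lane 0 and i3 in lane 3. */
static void set_epi32_lanes(int32_t i3, int32_t i2, int32_t i1, int32_t i0,
                            int32_t out[4])
{
    /* Memory order for vld1q_s32: lowest lane first. */
    const int32_t data[4] = { i0, i1, i2, i3 };
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s32(out, vld1q_s32(data));
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = data[i];
    }
#endif
}
```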

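The return type of vreinterpretq_m128i_s32() is defined by the following macro definitions:

#define _MM_SHUFFLE(z,y,x,w) ((z<<6)|(y<<4)|(x<<2)|w)
#define vreinterpretq_m128i_s32(x) (x)
#define vreinterpretq_m128i_u32(x) vreinterpretq_s32_u32(x)
#define vreinterpretq_s32_m128i(x) (x)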

Performance testing:

The test platform is an ARM Cortex-A7 series processor, the test inputs are images of 120×120 and 1280×720 pixels, and the comparison of the run results is shown in Table 1:

TABLE 1

It can be seen that, after the processing described in the above face detection implementation method based on the Cortex-A series platform, the FaceDetection module of SeetaFace can be used on the ARM Cortex-A series platform while keeping its efficiency high.
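A comparison of this kind needs a repeatable way to time the detector before and after the NEON port. The sketch below is one possible harness and makes no claim about how the figures in Table 1 were obtained: `workload_fn`, `mean_cpu_ms` and `count_calls` are hypothetical names, and the stand-in workload would be replaced by a call into the detector for each image size.

```c
#include <time.h>

/* Hypothetical signature for the routine being measured. */
typedef void (*workload_fn)(void *ctx);

/* Run fn(ctx) `runs` times and return the mean CPU time per run in
 * milliseconds, measured with the standard clock() function.
 * Returns 0.0 when runs is not positive. */
static double mean_cpu_ms(workload_fn fn, void *ctx, int runs)
{
    if (runs <= 0) {
        return 0.0;
    }
    clock_t start = clock();
    for (int i = 0; i < runs; ++i) {
        fn(ctx);
    }
    clock_t stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / runs;
}

/* Stand-in workload for the example: counts how often it was called. */
static void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

Averaging over several runs reduces the effect of scheduling noise, which matters on small embedded boards where a single detection pass may be short relative to the clock resolution.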

Example 2

As shown in FIG. 3, when the method of the invention is used, feature point processing is performed on the input image and it is determined whether vector operations are required; if they are, the code is modified according to the method of Example 1.

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

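1. A face detection implementation method based on the ARM Cortex-A series platform, characterized by comprising the following steps:

S1, in the hardware environment of an ARM Cortex-A series processor, modifying the source code of the FaceDetection module of SeetaFace and changing the compiler type to a cross compiler;

S2, adding the NEON compile options to the compiler settings;

S3, replacing the header file required by the SSE instructions in FaceDetection with the header files required by NEON;

S4, rewriting the parts of the FaceDetection source code that use SSE instructions as NEON instructions, and changing the functions that use SSE instructions into functions that use NEON;

S5, recompiling the program with the NEON compile options added in step S2 to obtain the required dynamic link library file, thereby producing a NEON-enabled FaceDetection program for the ARM Cortex-A series processor platform;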

the specific operation of step S1 is:

modify the SET commands:

1) set the system type, selecting Linux:

SET(CMAKE_SYSTEM_NAME Linux)

2) set the cross-compiler path, which enables the cross compiler and adds its path:

SET(CMAKE_CXX_COMPILER "/opt/hisi-linux/x86-arm/arm-hisiv400-linux/bin/arm-hisiv400-linux-gnueabi-g++");

the specific operation of step S4 is:

_mm_add_epi32(__m128i a, __m128i b) (①);

_mm_sub_epi32(__m128i a, __m128i b) (②);

_mm_mullo_epi32(__m128i a, __m128i b) (③);

_mm_mul_ps(__m128 a, __m128 b) (④);

_mm_cmpgt_ps(__m128 a, __m128 b) (⑤);

_mm_set_epi32(int i3, int i2, int i1, int i0) (⑥);

wherein:

INLINE __m128 _mm_mul_ps(__m128 a, __m128 b)
{
    __m128 ret;
    ret[0] = a[0] * b[0];
    ret[1] = a[1] * b[1];
    ret[2] = a[2] * b[2];
    ret[3] = a[3] * b[3];
    return ret;
}

the _mm_cmpgt_ps() function performs a greater-than comparison; the replacement for function ⑤ is (__m128) vcreq_f32(a, b); the function prototype of vcreq_f32() is float32x4_t vcreq_f32(float32x4_t __a, float32x4_t __b); it performs vector computation under the ARM instruction set and has the same function as _mm_cmple_ps();

wherein the type of return value is defined in the macro definition as follows:

#define _MM_SHUFFLE(z,y,x,w) ((z<<6)|(y<<4)|(x<<2)|w)
#define vreinterpretq_m128i_s32(x) (x)
#define vreinterpretq_m128i_u32(x) vreinterpretq_s32_u32(x)
#define vreinterpretq_s32_m128i(x) (x)

2. The face detection implementation method based on the ARM Cortex-A series platform according to claim 1, wherein the specific operation of step S2 is as follows:

-mfloat-abi=softfp -mfpu=neon.

3. The face detection implementation method based on the ARM Cortex-A series platform according to claim 1, wherein the specific operation of step S3 is as follows: