CN110782009B - Computing kernel optimization method based on ARMv8 system - Google Patents
- Publication number
- CN110782009B (application CN201910986292.3A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- kernel
- data
- calculation
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the method employs all 32 vector registers, invoking hardware resources more fully, and saves matrix-blocking time when matrix computation is performed during convolution. This improves overall computational efficiency, yields higher engineering efficiency when applied to the convolution stage of convolutional neural networks, and effectively reduces execution time on the ARMv8 architecture.
Description
Technical Field
The invention relates to the field of computers, and in particular to a computing kernel optimization method based on the ARMv8 architecture.
Background
With the continuing development of the deep learning industry, convolutional neural networks are applied ever more widely, and the associated application requirements are increasingly diverse. Running inference tasks with deep learning frameworks on embedded devices is a new research hotspot; at the same time, technological progress has produced embedded devices built around many different chip families. Large differences in efficiency therefore appear when inference tasks run on a particular embedded device, for two reasons: 1. different deep learning frameworks are employed, and their internal implementations vary; 2. the processor cores of different embedded platforms differ, the instruction sets they use differ from processor to processor, and many optimizations target a specific instruction set.
Disclosure of Invention
To solve these problems, the invention provides a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the method employs all 32 vector registers, invoking hardware resources more fully, and saves matrix-blocking time when matrix computation is performed during convolution, thereby improving overall computational efficiency, yielding higher engineering efficiency in the convolution stage of convolutional neural networks, and effectively reducing execution time on the ARMv8 architecture.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a computing kernel optimization method based on an ARMv8 system comprises the following steps:
step one, constructing the input matrices: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits, respectively, to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, performing a block-packing operation on the quantization matrices to obtain the data, namely placing the elements of each quantization matrix into contiguous memory addresses in the order given by the computing-kernel size (8x8), proceeding left to right and top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing multiply-add computation on the register data;
step four, storing the result obtained by the multiply-add computation into an intermediate matrix, then carrying out the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back to their original matrix positions in computing-kernel-size order, thereby ensuring the correctness of the computed result.
As a further improvement, in step one:
the sizes of the matrix A and the matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices, i.e., the computation is carried out in an 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to real 0.
In step two, the sizes of the quantized matrix A and the quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions evenly; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely;
the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent computation: following the defined 8x8 kernel size, the data are placed sequentially into contiguous memory addresses, cycling until all the data have been blocked.
As a further improvement, in step three, the multiply-add computation proceeds as follows:
3.1) loading the left and right matrices into the vector registers;
3.2) judging whether this is the first computation; if not, loading the intermediate result of the previous computation into the corresponding registers;
3.3) performing the multiply-add operation on the corresponding register data;
3.4) after the multiply-accumulate of the left and right matrices finishes, storing the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
Compared with the prior art, the invention can achieve the following technical effects:
1. The 32 vector registers of the ARMv8 architecture platform are fully utilized: data are kept in every vector register throughout the computation, invoking hardware resources more completely.
2. In matrix-blocking scale, the block size of the invention is larger than that used in the prior art, so a large matrix can be decomposed more quickly and the blocking time is greatly reduced.
3. When computing matrices in the convolution process, the kernel greatly reduces training time and improves overall computational efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the im2col function principle;
FIG. 3 is a schematic diagram of a matrix pack;
FIG. 4 is a schematic diagram of a kernel assembly calculation statement;
FIG. 5 is a diagram comparing the efficiency of the original kernel and the optimized kernel under multithreading.
Detailed Description
The following detailed description and the accompanying drawings illustrate the technical aspects of the present invention, and unless otherwise indicated, the components or devices in the following examples are all common standard components or components known to those skilled in the art, and their structures and principles are all known to those skilled in the art through technical manuals or through routine experimental methods.
Example 1
First, constructing an input matrix:
1.1 The input image data and the convolution kernel are converted into matrices through the im2col function to obtain the inputs for the next stage. The principle of the function is shown in FIG. 2: the matrix size is 5x5, the convolution kernel size is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix.
1.2 The input matrices are quantized, i.e., mapped into the 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
Step two, packing and blocking the input matrix
2.1 Check the size of the input quantization matrix, ensuring that the kernel size divides the matrix-blocking dimensions; if it does not, perform a zero-padding operation so that the blocking operation can be executed completely.
2.2 The core of packing and blocking is to pack the quantization matrix obtained in the previous step according to the kernel size defined for the computation, i.e., to rearrange it in kernel-size order to simplify the subsequent computation. FIG. 3 shows a packing example for the right-hand matrix: the data are placed into contiguous memory addresses in sequence according to the defined 8x8 kernel size, and the cycle repeats until all matrix data have been blocked.
Third step, the Kernel operation
3.1 Matrix multiplication is computed with single-instruction, multiple-data (SIMD) assembly statements, as shown in FIG. 4: each 128-bit vector register holds multiple data elements that are operated on simultaneously by NEON instructions, achieving efficient computation.
3.2 core assembly computation procedure
3.2.1 Loading left and right matrices into vector registers
"ld1{v0.8b},[%[rhs_ptr]],#8\n"
"ld1{v1.8b},[%[rhs_ptr]],#8\n"
"ld1{v2.8b},[%[rhs_ptr]],#8\n"
"ld1{v3.8b},[%[rhs_ptr]],#8\n"
3.2.2 Judge whether this is the first computation; if not, load the intermediate result of the previous computation into the corresponding registers
"mov x0,x1\n"
"ld1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"ld1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
3.2.3 multiply-add operation on corresponding register data
"umlal v8.4s,v4.4h,v0.h[0]\n"
"umlal v9.4s,v4.4h,v0.h[1]\n"
"umlal v10.4s,v4.4h,v0.h[2]\n"
"umlal v11.4s,v4.4h,v0.h[3]\n"
"umlal v12.4s,v4.4h,v0.h[4]\n"
"umlal v13.4s,v4.4h,v0.h[5]\n"
"umlal v14.4s,v4.4h,v0.h[6]\n"
"umlal v15.4s,v4.4h,v0.h[7]\n"
3.2.4 After the matrix multiply-accumulate finishes, store the obtained result in the intermediate-result registers, to be loaded again in the next cycle
"mov x1,%[dst_ptr]\n"
"mov x0,x1\n"
"st1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"st1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
Fourth, unpacking and restoring the result
4.1 Fetch the data from the result registers into a contiguous address space and perform the dequantization operation, i.e., restore the data range to the pre-quantization range.
4.2 Rearrange the data back to their original positions in the kernel-size order defined above, completing the whole operation.
Fifth step, writing test cases, and testing kernel calculation results
Sixth step, analysis and discussion of Experimental results
6.1 The matrix data used in the tests are randomly generated. After computing with the original kernel, matrix multiplications of different scales are computed under a single thread and under multithreading, respectively. Efficiency is reported as the ratio of problem size to computation time, in GFLOPS; the results are shown in FIG. 5.
Comparison against the computational efficiency of the original kernel shows that, on the ARMv8 architecture platform, the method outperforms the conventional kernel computation approach for matrix computation. Compared with the conventional matrix-computing-kernel method, better results are obtained, which demonstrates that the method is effective.
The foregoing is merely a specific guiding embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the concept should be construed as infringement of the protection scope of the present invention.
Claims (3)
1. The ARMv8 system-based computing kernel optimization method is characterized by comprising the following steps of:
step one, constructing the input matrices: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits, respectively, to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, performing a block-packing operation on the quantization matrices to obtain the data, namely placing the elements of each quantization matrix into contiguous memory addresses in the order given by the computing-kernel size 8x8, proceeding left to right and top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing multiply-add computation on the register data;
step four, storing the result obtained by the multiply-add computation into an intermediate matrix, then carrying out the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back to their original matrix positions in computing-kernel-size order, thereby ensuring the correctness of the computed result;
in step two, the sizes of the quantized matrix A and the quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions evenly; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely; the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent computation: following the defined 8x8 kernel size, the data are placed sequentially into contiguous memory addresses, cycling until all the data have been blocked.
2. The method for optimizing computing kernel based on ARMv8 system of claim 1, wherein in the first step:
the sizes of the matrix A and the matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices, i.e., the computation is carried out in an 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to real 0.
3. The method for optimizing computing kernel based on ARMv8 system according to claim 2, wherein in the third step, the data multiply-add calculation mode is as follows:
3.1) loading the left and right matrices into the vector registers;
3.2) judging whether this is the first computation; if not, loading the intermediate result of the previous computation into the corresponding registers;
3.3) performing the multiply-add operation on the corresponding register data;
3.4) after the multiply-accumulate of the left and right matrices finishes, storing the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910986292.3A CN110782009B (en) | 2019-10-17 | 2019-10-17 | Computing kernel optimization method based on ARMv8 system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110782009A CN110782009A (en) | 2020-02-11 |
CN110782009B true CN110782009B (en) | 2023-09-08 |
Family
ID=69385837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910986292.3A Active CN110782009B (en) | 2019-10-17 | 2019-10-17 | Computing kernel optimization method based on ARMv8 system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782009B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113495786B (en) * | 2020-03-19 | 2023-10-13 | 杭州海康威视数字技术股份有限公司 | Image convolution processing method and electronic equipment |
CN113066508A (en) * | 2021-03-15 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Voice content processing method, device and equipment and readable storage medium |
CN117634711B (en) * | 2024-01-25 | 2024-05-14 | 北京壁仞科技开发有限公司 | Tensor dimension segmentation method, system, device and medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572234A (en) * | 2014-12-29 | 2015-04-29 | 杭州华为数字技术有限公司 | Method for generating source codes used for parallel computing architecture and source-to-source compiler |
CN105808309A (en) * | 2016-03-08 | 2016-07-27 | 中国科学院软件研究所 | High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform |
CN106970896A (en) * | 2017-03-30 | 2017-07-21 | 中国人民解放军国防科学技术大学 | The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented |
CN108009634A (en) * | 2017-12-21 | 2018-05-08 | 美的集团股份有限公司 | A kind of optimization method of convolutional neural networks, device and computer-readable storage medium |
CN108985236A (en) * | 2018-07-20 | 2018-12-11 | 南京开为网络科技有限公司 | A kind of face identification method separating convolution model based on depthization |
CN109086244A (en) * | 2018-07-11 | 2018-12-25 | 中国人民解放军国防科技大学 | Matrix convolution vectorization implementation method based on vector processor |
CN110110844A (en) * | 2019-04-24 | 2019-08-09 | 西安电子科技大学 | Convolutional neural networks method for parallel processing based on OpenCL |
CN110263923A (en) * | 2019-08-12 | 2019-09-20 | 上海燧原智能科技有限公司 | Tensor convolutional calculation method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9798378B2 (en) * | 2014-03-31 | 2017-10-24 | Google Technology Holdings LLC | Apparatus and method for awakening a primary processor out of sleep mode |
US10620682B2 (en) * | 2017-12-21 | 2020-04-14 | Intel Corporation | System, apparatus and method for processor-external override of hardware performance state control of a processor |
Non-Patent Citations (1)
Title |
---|
Jiang Hao et al., "Design and Implementation of QGEMM for ARMv8 64-bit Multi-core Processors," Chinese Journal of Computers, 2017, vol. 40, no. 9, pp. 2018-2029. *
Also Published As
Publication number | Publication date |
---|---|
CN110782009A (en) | 2020-02-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||