CN110782009B - Computing kernel optimization method based on ARMv8 system - Google Patents
- Publication number
- CN110782009B (application CN201910986292.3A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- kernel
- data
- calculation
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the method employs all 32 vector registers, invoking hardware resources more fully, and saves matrix-blocking time when matrix computation is performed during convolution. This improves overall computational efficiency, yields higher engineering efficiency when applied to the convolution stage of convolutional neural networks, and effectively reduces execution time on the ARMv8 architecture.
Description
Technical Field
The invention relates to the field of computers, and in particular to a computing kernel optimization method based on the ARMv8 architecture.
Background
With the continuing development of the deep learning industry, convolutional neural networks are applied ever more widely, and the associated application requirements are increasingly diverse. Running inference tasks with deep learning frameworks on embedded devices is a new research hotspot; at the same time, technological progress has produced embedded devices built around many different chip families. Large differences in efficiency therefore appear when inference tasks run on a particular embedded device, for two reasons: 1. different deep learning frameworks are employed, and their internal implementations vary; 2. the processor cores of different embedded platforms differ, the instruction sets they use differ from processor to processor, and many optimizations target a specific instruction set.
Disclosure of Invention
To solve these problems, the invention provides a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the method employs all 32 vector registers, invoking hardware resources more fully, and saves matrix-blocking time when matrix computation is performed during convolution, thereby improving overall computational efficiency, yielding higher engineering efficiency in the convolution stage of convolutional neural networks, and effectively reducing execution time on the ARMv8 architecture.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a computing kernel optimization method based on an ARMv8 system comprises the following steps:
step one, constructing the input matrices: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits, respectively, to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, performing a block-packing operation on the quantization matrices to obtain the data, namely placing the elements of each quantization matrix into contiguous memory addresses in the order given by the computing-kernel size (8x8), proceeding left to right and top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing multiply-add computation on the register data;
step four, storing the result obtained by the multiply-add computation into an intermediate matrix, then carrying out the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back to their original matrix positions in computing-kernel-size order, thereby ensuring the correctness of the computed result.
As a further improvement, in step one:
the sizes of the matrix A and the matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices, i.e., the computation is carried out in an 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to real 0.
In step two, the sizes of the quantized matrix A and the quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions evenly; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely;
the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent computation: following the defined 8x8 kernel size, the data are placed sequentially into contiguous memory addresses, cycling until all the data have been blocked.
As a further improvement, in step three, the multiply-add computation proceeds as follows:
3.1) loading the left and right matrices into the vector registers;
3.2) judging whether this is the first computation; if not, loading the intermediate result of the previous computation into the corresponding registers;
3.3) performing the multiply-add operation on the corresponding register data;
3.4) after the multiply-accumulate of the left and right matrices finishes, storing the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
Compared with the prior art, the invention can achieve the following technical effects:
1. The 32 vector registers of the ARMv8 architecture platform are fully utilized: data are kept in every vector register throughout the computation, invoking hardware resources more completely.
2. In matrix-blocking scale, the block size of the invention is larger than that used in the prior art, so a large matrix can be decomposed more quickly and the blocking time is greatly reduced.
3. When computing matrices in the convolution process, the kernel greatly reduces training time and improves overall computational efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the im2col function principle;
FIG. 3 is a schematic diagram of a matrix pack;
FIG. 4 is a schematic diagram of a kernel assembly calculation statement;
FIG. 5 is a diagram comparing the efficiency of the original kernel and the optimized kernel under multithreading.
Detailed Description
The following detailed description and the accompanying drawings illustrate the technical aspects of the present invention, and unless otherwise indicated, the components or devices in the following examples are all common standard components or components known to those skilled in the art, and their structures and principles are all known to those skilled in the art through technical manuals or through routine experimental methods.
Example 1
First, constructing an input matrix:
1.1 The input image data and the convolution kernel are converted into matrices through the im2col function to obtain the inputs for the next stage. The principle of the function is shown in FIG. 2: the matrix size is 5x5, the convolution kernel size is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix.
1.2 The input matrices are quantized, i.e., mapped into the 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
Step two, packing and blocking the input matrix
2.1 Check the size of the input quantization matrix, ensuring that the kernel size divides the matrix-blocking dimensions; if it does not, perform a zero-padding operation so that the blocking operation can be executed completely.
2.2 The core of packing and blocking is to pack the quantization matrix obtained in the previous step according to the kernel size defined for the computation, i.e., to rearrange it in kernel-size order to simplify the subsequent computation. FIG. 3 shows a packing example for the right-hand matrix: the data are placed into contiguous memory addresses in sequence according to the defined 8x8 kernel size, and the cycle repeats until all matrix data have been blocked.
Third step, the Kernel operation
3.1 Matrix multiplication is computed with single-instruction, multiple-data (SIMD) assembly statements, as shown in FIG. 4: each 128-bit vector register holds multiple data elements that are operated on simultaneously by NEON instructions, achieving efficient computation.
3.2 core assembly computation procedure
3.2.1 Loading left and right matrices into vector registers
"ld1{v0.8b},[%[rhs_ptr]],#8\n"
"ld1{v1.8b},[%[rhs_ptr]],#8\n"
"ld1{v2.8b},[%[rhs_ptr]],#8\n"
"ld1{v3.8b},[%[rhs_ptr]],#8\n"
3.2.2 Judge whether this is the first computation; if not, load the intermediate result of the previous computation into the corresponding registers
"mov x0,x1\n"
"ld1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"ld1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
3.2.3 multiply-add operation on corresponding register data
"umlal v8.4s,v4.4h,v0.h[0]\n"
"umlal v9.4s,v4.4h,v0.h[1]\n"
"umlal v10.4s,v4.4h,v0.h[2]\n"
"umlal v11.4s,v4.4h,v0.h[3]\n"
"umlal v12.4s,v4.4h,v0.h[4]\n"
"umlal v13.4s,v4.4h,v0.h[5]\n"
"umlal v14.4s,v4.4h,v0.h[6]\n"
"umlal v15.4s,v4.4h,v0.h[7]\n"
3.2.4 After the matrix multiply-accumulate finishes, store the obtained result in the intermediate-result registers, to be loaded again in the next cycle
"mov x1,%[dst_ptr]\n"
"mov x0,x1\n"
"st1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"st1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
Fourth, unpacking and restoring the result
4.1 Fetch the data from the result registers into a contiguous address space and perform the dequantization operation, i.e., restore the data range to the pre-quantization range.
4.2 Rearrange the data back to their original positions in the kernel-size order defined above, completing the whole operation.
Fifth step, writing test cases, and testing kernel calculation results
Sixth step, analysis and discussion of Experimental results
6.1 The matrix data used in the tests are randomly generated. After computing with the original kernel, matrix multiplications of different scales are computed under a single thread and under multithreading, respectively. Efficiency is reported as the ratio of problem size to computation time, in GFLOPS; the results are shown in FIG. 5.
Comparison against the computational efficiency of the original kernel shows that, on the ARMv8 architecture platform, the method outperforms the conventional kernel computation approach for matrix computation. Compared with the conventional matrix-computing-kernel method, better results are obtained, which demonstrates that the method is effective.
The foregoing is merely a specific guiding embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the concept should be construed as infringement of the protection scope of the present invention.
Claims (3)
1. The ARMv8 system-based computing kernel optimization method is characterized by comprising the following steps of:
step one, constructing the input matrices: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits, respectively, to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, performing a block-packing operation on the quantization matrices to obtain the data, namely placing the elements of each quantization matrix into contiguous memory addresses in the order given by the computing-kernel size 8x8, proceeding left to right and top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing multiply-add computation on the register data;
step four, storing the result obtained by the multiply-add computation into an intermediate matrix, then carrying out the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back to their original matrix positions in computing-kernel-size order, thereby ensuring the correctness of the computed result;
in step two, the sizes of the quantized matrix A and the quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions evenly; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely; the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent computation: following the defined 8x8 kernel size, the data are placed sequentially into contiguous memory addresses, cycling until all the data have been blocked.
2. The method for optimizing computing kernel based on ARMv8 system of claim 1, wherein in the first step:
the sizes of the matrix A and the matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices, i.e., the computation is carried out in an 8-bit range, where the affine relation between a real number and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to real 0.
3. The method for optimizing computing kernel based on ARMv8 system according to claim 2, wherein in the third step, the data multiply-add calculation mode is as follows:
3.1) loading the left and right matrices into the vector registers;
3.2) judging whether this is the first computation; if not, loading the intermediate result of the previous computation into the corresponding registers;
3.3) performing the multiply-add operation on the corresponding register data;
3.4) after the multiply-accumulate of the left and right matrices finishes, storing the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910986292.3A CN110782009B (en) | 2019-10-17 | 2019-10-17 | Computing kernel optimization method based on ARMv8 system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110782009A CN110782009A (en) | 2020-02-11 |
CN110782009B true CN110782009B (en) | 2023-09-08 |
Family
ID=69385837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910986292.3A Active CN110782009B (en) | 2019-10-17 | 2019-10-17 | Computing kernel optimization method based on ARMv8 system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782009B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113495786B (en) * | 2020-03-19 | 2023-10-13 | 杭州海康威视数字技术股份有限公司 | Image convolution processing method and electronic equipment |
CN113066508A (en) * | 2021-03-15 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Voice content processing method, device and equipment and readable storage medium |
CN117634711B (en) * | 2024-01-25 | 2024-05-14 | 北京壁仞科技开发有限公司 | Tensor dimension segmentation method, system, device and medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572234A (en) * | 2014-12-29 | 2015-04-29 | 杭州华为数字技术有限公司 | Method for generating source codes used for parallel computing architecture and source-to-source compiler |
CN105808309A (en) * | 2016-03-08 | 2016-07-27 | 中国科学院软件研究所 | High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform |
CN106970896A (en) * | 2017-03-30 | 2017-07-21 | 中国人民解放军国防科学技术大学 | The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented |
CN108009634A (en) * | 2017-12-21 | 2018-05-08 | 美的集团股份有限公司 | A kind of optimization method of convolutional neural networks, device and computer-readable storage medium |
CN108985236A (en) * | 2018-07-20 | 2018-12-11 | 南京开为网络科技有限公司 | A kind of face identification method separating convolution model based on depthization |
CN109086244A (en) * | 2018-07-11 | 2018-12-25 | 中国人民解放军国防科技大学 | Matrix convolution vectorization implementation method based on vector processor |
CN110110844A (en) * | 2019-04-24 | 2019-08-09 | 西安电子科技大学 | Convolutional neural networks method for parallel processing based on OpenCL |
CN110263923A (en) * | 2019-08-12 | 2019-09-20 | 上海燧原智能科技有限公司 | Tensor convolutional calculation method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9798378B2 (en) * | 2014-03-31 | 2017-10-24 | Google Technology Holdings LLC | Apparatus and method for awakening a primary processor out of sleep mode |
US10620682B2 (en) * | 2017-12-21 | 2020-04-14 | Intel Corporation | System, apparatus and method for processor-external override of hardware performance state control of a processor |
Non-Patent Citations (1)
Title |
---|
Jiang Hao et al., "Design and Implementation of QGEMM for ARMv8 64-bit Multi-core Processors," Chinese Journal of Computers, 2017, vol. 40, no. 9, pp. 2018-2029. *
Also Published As
Publication number | Publication date |
---|---|
CN110782009A (en) | 2020-02-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||