CN110782009B - Computing kernel optimization method based on ARMv8 system - Google Patents

Computing kernel optimization method based on ARMv8 system

Info

Publication number
CN110782009B
CN110782009B CN201910986292.3A CN201910986292A
Authority
CN
China
Prior art keywords
matrix
kernel
data
calculation
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910986292.3A
Other languages
Chinese (zh)
Other versions
CN110782009A (en)
Inventor
全哲
何楠
刘彦
彭阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201910986292.3A priority Critical patent/CN110782009B/en
Publication of CN110782009A publication Critical patent/CN110782009A/en
Application granted granted Critical
Publication of CN110782009B publication Critical patent/CN110782009B/en
Legal status: Active

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the invention employs all 32 vector registers, so hardware resources are invoked more completely and matrix-blocking time is saved when matrix calculation is carried out in the convolution process. This improves overall calculation efficiency, yields higher engineering efficiency when applied to the convolution stage of a convolutional neural network, and effectively shortens run time on the ARMv8 architecture.

Description

Computing kernel optimization method based on ARMv8 system
Technical Field
The invention relates to the field of computers, in particular to a computing kernel optimization method based on an ARMv8 system.
Background
At present, with the continuous development of the deep learning industry, convolutional neural networks are ever more widely applied, and the requirements of these applications are increasingly diverse. Running inference tasks with deep learning frameworks on embedded devices is a new research hotspot; at the same time, the variety of embedded devices has grown, with the development of technology, into many different chip categories. The large differences in efficiency observed when running prediction tasks on a particular embedded device therefore arise because: 1. different deep learning frameworks are employed, and the implementations inside different tools vary; 2. the processor cores of different embedded device platforms differ, the instruction sets they employ differ from processor to processor, and many optimizations target a specific instruction set.
Disclosure of Invention
In order to solve the above problems, the invention provides a computing kernel optimization method based on the ARMv8 architecture. With respect to register usage, the invention employs all 32 vector registers, so hardware resources are invoked more completely and matrix-blocking time is saved when matrix calculation is carried out in the convolution process. This improves overall calculation efficiency, yields higher engineering efficiency when applied to the convolution stage of a convolutional neural network, and effectively shortens run time on the ARMv8 architecture.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a computing kernel optimization method based on an ARMv8 system comprises the following steps:
step one, constructing an input matrix: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits respectively to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, carrying out a block packing operation on the quantization matrices to obtain the data, namely placing the elements of the quantization matrices into contiguous memory addresses in the order of the compute kernel size (8x8), traversing the matrix from left to right and from top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing the multiply-add calculation on the register data;
step four, storing the result of the multiply-add calculation into an intermediate matrix and proceeding to the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back into their original matrix positions in the order of the compute kernel size, thereby ensuring the accuracy of the calculation result.
As a further improvement, in step one:
the sizes of matrix A and matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices; that is, operations are carried out within an 8-bit range, where the affine relation between a real value and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to 0 in the real numbers.
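As a minimal illustration of this affine mapping, the following C sketch quantizes a real value to uint8 and maps it back. The helper names quantize_u8 and dequantize_u8, and the rounding and clamping choices, are assumptions for illustration rather than identifiers taken from the patent.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Quantize: invert real_value = scale * (quantized_value - zero_point),
 * rounding to nearest and clamping to the uint8 range [0, 255]. */
uint8_t quantize_u8(float real_value, float scale, int zero_point)
{
    int q = (int)lrintf(real_value / scale) + zero_point;
    if (q < 0)   q = 0;
    if (q > 255) q = 255;
    return (uint8_t)q;
}

/* Dequantize: apply the affine relation directly. */
float dequantize_u8(uint8_t quantized_value, float scale, int zero_point)
{
    return scale * ((int)quantized_value - zero_point);
}

int main(void)
{
    float scale = 0.05f;    /* quantization scale (step size) */
    int zero_point = 128;   /* uint8 value that represents real 0 */
    uint8_t q = quantize_u8(1.7f, scale, zero_point);
    printf("1.7 -> %u -> %f\n", q, dequantize_u8(q, scale, zero_point));
    return 0;
}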
In step two, the sizes of quantized matrix A and quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely;
the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent calculation: the data are placed sequentially into contiguous memory addresses according to the defined 8x8 kernel size, cycling until all the data have been blocked.
In a further improvement, in step three the data multiply-add calculation proceeds as follows (a plain C reference sketch follows this list):
3.1 load the left and right matrices into the vector registers;
3.2 judge whether this is the first pass; if not, load the intermediate result of the previous pass into the corresponding registers;
3.3 perform the multiply-add operation on the corresponding register data;
3.4 after the multiply-accumulate of the left and right matrices finishes, store the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
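For orientation, here is a plain C reference sketch of steps 3.1 to 3.4 for one 8x8 block; it is not the patent's assembly kernel. The function name micro_kernel_8x8 and the packed layout are assumptions, and the accumulator lives in memory here, whereas the real kernel keeps it in vector registers v8 to v31.

#include <stdint.h>
#include <string.h>

/* One 8x8 block of the multiply-add kernel, accumulating over `depth`. */
void micro_kernel_8x8(const uint8_t *lhs, const uint8_t *rhs,
                      uint32_t acc[8][8], int depth, int first_pass)
{
    if (first_pass)                     /* 3.2: no previous intermediate result */
        memset(acc, 0, 8 * 8 * sizeof(uint32_t));
    /* otherwise acc already holds the previous pass's intermediate result */

    for (int k = 0; k < depth; ++k)     /* 3.1 and 3.3: load and multiply-add */
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j)
                acc[i][j] += (uint32_t)lhs[k * 8 + i] * (uint32_t)rhs[k * 8 + j];

    /* 3.4: acc is the stored intermediate result, reloaded in the next cycle */
}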
Compared with the prior art, the invention can achieve the following technical effects:
1. the 32 vector registers of the ARMv8 architecture platform are fully utilized: every vector register holds data throughout the calculation, so hardware resources are invoked more completely.
2. In terms of matrix blocking scale, the block size of the invention is larger than that used in the prior art, so a large matrix can be decomposed more rapidly, greatly reducing blocking time.
3. When computing matrices in the convolution process, the kernel greatly accelerates training time and improves overall calculation efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the im2col function principle;
FIG. 3 is a schematic diagram of a matrix pack;
FIG. 4 is a schematic diagram of a kernel assembly calculation statement;
FIG. 5 is a diagram showing the comparison of the original core and the current core efficiency under multithreading.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings. Unless otherwise indicated, the components or devices in the following examples are common standard components or components known to those skilled in the art, and their structures and principles can be learned from technical manuals or through routine experimental methods.
Example 1
First, constructing an input matrix:
1.1 The input image data and the convolution kernel are converted into matrices by the im2col function to obtain the corresponding next input. The principle of the function is shown in FIG. 2: the input matrix size is 5x5 and the convolution kernel size is 3x3, and each convolution window is converted in turn into a corresponding column vector, forming a new matrix.
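A minimal C sketch of this im2col step for the stated 5x5 input and 3x3 kernel (stride 1, no padding) follows; the function name and the row-major array layout are assumptions for illustration.

/* 5x5 input, 3x3 windows, stride 1: nine windows, each flattened into
 * one column of the 9x9 output matrix. */
enum { IN = 5, K = 3, OUT = IN - K + 1 };   /* OUT = 3 */

void im2col(const float in[IN][IN], float col[K * K][OUT * OUT])
{
    for (int wy = 0; wy < OUT; ++wy)        /* window position */
        for (int wx = 0; wx < OUT; ++wx)
            for (int ky = 0; ky < K; ++ky)  /* element within the window */
                for (int kx = 0; kx < K; ++kx)
                    col[ky * K + kx][wy * OUT + wx] = in[wy + ky][wx + kx];
}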
1.2 The input matrices are quantized, i.e., mapped into the 8-bit range, where the affine relation between a real value and its quantized uint8 value is as follows:
real_value = scale * (quantized_value - zero_point)
Step two, packing and blocking the input matrix
2.1 The size of the input quantization matrix is checked to ensure that the kernel size divides the matrix-blocking dimensions; if it does not divide evenly, a zero-padding operation is performed, ensuring that the blocking operation can be executed completely.
2.2 The core of packing and blocking is to pack the quantization matrix obtained in the previous step according to the kernel size defined in the calculation process, i.e., to rearrange it in kernel-size order to facilitate the subsequent calculation. FIG. 3 is a packing example for the right matrix: the data are placed sequentially into contiguous memory addresses according to the defined 8x8 kernel size, and the cycle repeats until all matrix data have been blocked.
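The packing step can be sketched in C as follows; the function name pack_8x8 and the row-major source layout are assumptions for illustration, and edge blocks are zero-padded as described in 2.1.

#include <stdint.h>
#include <string.h>

/* Pack a rows x cols uint8 matrix into contiguous 8x8 blocks, row-major
 * inside each block, blocks ordered left to right, then top to bottom. */
void pack_8x8(const uint8_t *src, int rows, int cols, uint8_t *dst)
{
    int brows = (rows + 7) / 8, bcols = (cols + 7) / 8;
    memset(dst, 0, (size_t)brows * bcols * 64);      /* zero-padding for edges */
    for (int bi = 0; bi < brows; ++bi)
        for (int bj = 0; bj < bcols; ++bj) {
            uint8_t *blk = dst + ((size_t)bi * bcols + bj) * 64;
            for (int i = 0; i < 8 && bi * 8 + i < rows; ++i)
                for (int j = 0; j < 8 && bj * 8 + j < cols; ++j)
                    blk[i * 8 + j] =
                        src[(size_t)(bi * 8 + i) * cols + (bj * 8 + j)];
        }
}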
Third step, kernel operation
3.1 Matrix multiplication is calculated with single-instruction-multiple-data assembly statements, as shown in FIG. 4: a 128-bit vector register holds multiple data elements at a time, and these are operated on simultaneously by NEON instructions, achieving efficient calculation.
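For reference, the umlal pattern used in the assembly below can also be written with NEON intrinsics; the following snippet is an illustrative assumption (compiled for AArch64 with arm_neon.h), not code from the patent.

#include <arm_neon.h>

/* One UMLAL by lane: acc[i] += col[i] * row[0] for i = 0..3, widening
 * uint16 products into uint32 accumulators in a single instruction. */
uint32x4_t umlal_by_lane0(uint32x4_t acc, uint16x4_t col, uint16x4_t row)
{
    return vmlal_lane_u16(acc, col, row, 0);
}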
3.2 core assembly computation procedure
3.2.1 Loading left and right matrices into vector registers
"ld1{v0.8b},[%[rhs_ptr]],#8\n"
"ld1{v1.8b},[%[rhs_ptr]],#8\n"
"ld1{v2.8b},[%[rhs_ptr]],#8\n"
"ld1{v3.8b},[%[rhs_ptr]],#8\n"
3.2.2 Judge whether this is the first pass; if not, load the intermediate result of the previous pass into the corresponding registers
"mov x0,x1\n"
"ld1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"ld1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
3.2.3 multiply-add operation on corresponding register data
"umlal v8.4s,v4.4h,v0.h[0]\n"
"umlal v9.4s,v4.4h,v0.h[1]\n"
"umlal v10.4s,v4.4h,v0.h[2]\n"
"umlal v11.4s,v4.4h,v0.h[3]\n"
"umlal v12.4s,v4.4h,v0.h[4]\n"
"umlal v13.4s,v4.4h,v0.h[5]\n"
"umlal v14.4s,v4.4h,v0.h[6]\n"
"umlal v15.4s,v4.4h,v0.h[7]\n"
3.2.4 After the matrix multiply-accumulate finishes, the obtained result is stored in the intermediate-result registers and loaded again in the next cycle
"mov x1,%[dst_ptr]\n"
"mov x0,x1\n"
"st1{v8.16b},[x0],#16\n"
"subs%[run_depth],%[run_depth],#8\n"
"st1{v16.16b},[x0],#16\n"
"add x1,x1,%[dst_col_stride]\n"
Fourth, unpacking and restoring the result (see the C sketch after this step)
4.1 The data in the result registers are written out to a contiguous address space and dequantized, i.e., the data range is restored to the pre-quantization range.
4.2 The data are rearranged back to their original positions in order, according to the kernel size defined above, completing the whole operation.
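A C sketch of this unpack and dequantize step follows; the function name unpack_dequant is an assumption, and for simplicity it assumes zero_point = 0 for both inputs, so the combined rescale is just scale_a * scale_b (a full implementation must also subtract the zero-point cross terms).

#include <stdint.h>
#include <stddef.h>

/* Walk the packed 8x8 result blocks in the order the pack step wrote
 * them, dequantize each uint32 accumulator, and scatter it back to its
 * original (row, col) position. */
void unpack_dequant(const uint32_t *packed, int rows, int cols,
                    float scale_a, float scale_b, float *dst)
{
    int bcols = (cols + 7) / 8;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            const uint32_t *blk =
                packed + ((size_t)(r / 8) * bcols + c / 8) * 64;
            dst[(size_t)r * cols + c] =
                scale_a * scale_b * (float)blk[(r % 8) * 8 + (c % 8)];
        }
}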
Fifth step, write test cases and test the kernel calculation results
Sixth step, analysis and discussion of experimental results
6.1 The matrix data used in the test are randomly generated. After the original kernel is measured, matrix multiplications of different scales are calculated under a single thread and under multithreading, respectively. The comparison metric is the ratio of matrix scale to calculation time, in GFLOPS; the results are shown in FIG. 5.
Comparing against the calculation efficiency of the original kernel shows that, for matrix calculation on the ARMv8 architecture platform, the method is superior to the traditional kernel calculation method. The method obtains better results than the traditional matrix-computing-kernel method, which proves that it is effective.
The foregoing is merely a specific embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the invention using this concept shall be construed as an infringement of the protection scope of the present invention.

Claims (3)

1. A computing kernel optimization method based on an ARMv8 system, characterized by comprising the following steps:
step one, constructing an input matrix: in the convolution process, converting the convolution kernel into a matrix A and the input image into a matrix B through the im2col function; matrix A and matrix B serve as the input matrices; quantizing matrix A and matrix B to 8 bits respectively to obtain a quantized matrix A and a quantized matrix B; the quantized matrix A and the quantized matrix B serve as the quantization matrices;
step two, carrying out a block packing operation on the quantization matrices to obtain the data, namely placing the elements of the quantization matrices into contiguous memory addresses in the order of the compute kernel size 8x8, from left to right and from top to bottom;
step three, loading the data into registers in the order arranged by the kernel layout in the previous step, and performing the multiply-add calculation on the register data;
step four, storing the result of the multiply-add calculation into an intermediate matrix and proceeding to the next multiply-add, until all the data have been loaded;
step five, unpacking the obtained result, namely putting the data back into their original matrix positions in the order of the compute kernel size, thereby ensuring the accuracy of the calculation result;
in step two, the sizes of quantized matrix A and quantized matrix B are checked so that the kernel size divides the matrix-blocking dimensions; if it does not divide evenly, a zero-padding operation is performed so that the blocking operation can be executed completely; the quantization matrices are blocked and packed according to the kernel size, i.e., rearranged in kernel-size order for the subsequent calculation: the data are placed sequentially into contiguous memory addresses according to the defined 8x8 kernel size, cycling until all the data have been blocked.
2. The computing kernel optimization method based on an ARMv8 system of claim 1, wherein in step one:
the sizes of matrix A and matrix B are 5x5, the size of the convolution kernel is 3x3, and each convolution window is converted in turn into a corresponding column vector to form a new matrix;
a quantization operation is performed on the input matrices to obtain uint8 values, yielding the quantization matrices; that is, operations are carried out within an 8-bit range, where the affine relation between a real value and its quantized uint8 value is
real_value = scale * (quantized_value - zero_point)
where real_value denotes the real value, scale denotes the quantization scale, quantized_value denotes the quantized value, and zero_point denotes the quantized value corresponding to 0 in the real numbers.
3. The computing kernel optimization method based on an ARMv8 system of claim 2, wherein in step three the data multiply-add calculation proceeds as follows:
3.1 load the left and right matrices into the vector registers;
3.2 judge whether this is the first pass; if not, load the intermediate result of the previous pass into the corresponding registers;
3.3 perform the multiply-add operation on the corresponding register data;
3.4 after the multiply-accumulate of the left and right matrices finishes, store the obtained result in the intermediate-result registers, to be loaded again in the next cycle.
CN201910986292.3A 2019-10-17 2019-10-17 Computing kernel optimization method based on ARMv8 system Active CN110782009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910986292.3A CN110782009B (en) 2019-10-17 2019-10-17 Computing kernel optimization method based on ARMv8 system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910986292.3A CN110782009B (en) 2019-10-17 2019-10-17 Computing kernel optimization method based on ARMv8 system

Publications (2)

Publication Number Publication Date
CN110782009A CN110782009A (en) 2020-02-11
CN110782009B 2023-09-08

Family

ID=69385837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910986292.3A Active CN110782009B (en) 2019-10-17 2019-10-17 Computing kernel optimization method based on ARMv8 system

Country Status (1)

Country Link
CN (1) CN110782009B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495786B (en) * 2020-03-19 2023-10-13 杭州海康威视数字技术股份有限公司 Image convolution processing method and electronic equipment
CN113066508A (en) * 2021-03-15 2021-07-02 腾讯科技(深圳)有限公司 Voice content processing method, device and equipment and readable storage medium
CN117634711B (en) * 2024-01-25 2024-05-14 北京壁仞科技开发有限公司 Tensor dimension segmentation method, system, device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572234A (en) * 2014-12-29 2015-04-29 杭州华为数字技术有限公司 Method for generating source codes used for parallel computing architecture and source-to-source compiler
CN105808309A (en) * 2016-03-08 2016-07-27 中国科学院软件研究所 High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
CN108985236A (en) * 2018-07-20 2018-12-11 南京开为网络科技有限公司 A kind of face identification method separating convolution model based on depthization
CN109086244A (en) * 2018-07-11 2018-12-25 中国人民解放军国防科技大学 Matrix convolution vectorization implementation method based on vector processor
CN110110844A (en) * 2019-04-24 2019-08-09 西安电子科技大学 Convolutional neural networks method for parallel processing based on OpenCL
CN110263923A (en) * 2019-08-12 2019-09-20 上海燧原智能科技有限公司 Tensor convolutional calculation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798378B2 (en) * 2014-03-31 2017-10-24 Google Technology Holdings LLC Apparatus and method for awakening a primary processor out of sleep mode
US10620682B2 (en) * 2017-12-21 2020-04-14 Intel Corporation System, apparatus and method for processor-external override of hardware performance state control of a processor

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572234A (en) * 2014-12-29 2015-04-29 杭州华为数字技术有限公司 Method for generating source codes used for parallel computing architecture and source-to-source compiler
CN105808309A (en) * 2016-03-08 2016-07-27 中国科学院软件研究所 High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
CN109086244A (en) * 2018-07-11 2018-12-25 中国人民解放军国防科技大学 Matrix convolution vectorization implementation method based on vector processor
CN108985236A (en) * 2018-07-20 2018-12-11 南京开为网络科技有限公司 A kind of face identification method separating convolution model based on depthization
CN110110844A (en) * 2019-04-24 2019-08-09 西安电子科技大学 Convolutional neural networks method for parallel processing based on OpenCL
CN110263923A (en) * 2019-08-12 2019-09-20 上海燧原智能科技有限公司 Tensor convolutional calculation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Hao et al., "Design and Implementation of QGEMM for ARMv8 64-bit Multi-core Processors," Chinese Journal of Computers (计算机学报), 2017, Vol. 40, No. 9, pp. 2018-2029. *

Also Published As

Publication number Publication date
CN110782009A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
US10860922B2 (en) Sparse convolutional neural network accelerator
Jang et al. Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC
Zhu et al. An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs
CN110782009B (en) Computing kernel optimization method based on ARMv8 system
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111563599B (en) Quantum circuit decomposition method and device, storage medium and electronic device
CN112200300B (en) Convolutional neural network operation method and device
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
US20220058486A1 (en) System and method of accelerating execution of a neural network
CN109791628B (en) Neural network model block compression method, training method, computing device and system
US8433883B2 (en) Inclusive “OR” bit matrix compare resolution of vector update conflict masks
CN113850389B (en) Quantum circuit construction method and device
US20230068450A1 (en) Method and apparatus for processing sparse data
US20230186050A1 (en) Method and apparatus for processing computation of zero value in processing of layers in neural network
CN114418105B (en) Method and device for processing quantum application problem based on quantum circuit
CN104617959A (en) Universal processor-based LDPC (Low Density Parity Check) encoding and decoding method
CN110851779A (en) Systolic array architecture for sparse matrix operations
CN114090954A (en) Integer matrix multiplication kernel optimization method based on FT-2000+
CN111553471A (en) Data analysis processing method and device
CN101438598A (en) Instruction for producing two independent sums of absolute differences
CN113885941A (en) Singular value decomposition operation implementation method, device and related equipment
CN104572588A (en) Matrix inversion processing method and device
CN116842304A (en) Method and system for calculating irregular sparse matrix

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant