CN112199636A - Fast convolution method and device suitable for microprocessor


Info

Publication number
CN112199636A
Authority
CN
China
Prior art keywords
matrix
transformation
convolution
synchronizing
microprocessor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011103515.6A
Other languages
Chinese (zh)
Other versions
CN112199636B (en)
Inventor
丁贵广 (DING Guiguang)
温发琥 (WEN Fahu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202011103515.6A priority Critical patent/CN112199636B/en
Publication of CN112199636A publication Critical patent/CN112199636A/en
Application granted granted Critical
Publication of CN112199636B publication Critical patent/CN112199636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a fast convolution method and device suitable for a microprocessor, wherein the method comprises the following steps: partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel; synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel; and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result. The method effectively accelerates the forward inference of convolutional networks on ARM Cortex devices: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.

Description

Fast convolution method and device suitable for microprocessor
Technical Field
The present invention relates to the field of microprocessor technology, and in particular, to a fast convolution method and apparatus suitable for a microprocessor.
Background
The use of convolutional neural networks has brought broad accuracy improvements to vision applications in embedded scenarios. Unlike desktop and server devices with abundant computing resources, mobile ARM devices typically have low power budgets, relatively weak computing capability, and limited memory, so deploying computation-intensive convolutional networks on ARM devices requires multiple targeted optimizations.
Model quantization is a network lightweighting method with significant benefits for accelerating model execution and reducing power consumption; research in this field has shown that deep neural networks retain considerable accuracy and expressiveness on their tasks under low-precision representations. Combining low-bit arithmetic on quantized parameters with quantization of the network's intermediate computations yields large computational gains and higher performance. Beyond the performance advantage, a quantized neural network also improves power efficiency, because memory-access cost falls while computational efficiency rises: low-bit quantized data requires less on-chip and off-chip data movement, which reduces memory bandwidth and saves substantial energy, and lower-precision arithmetic (e.g., 8-bit integer multiplication) consumes less energy per operation. Reducing the number of bits used to represent network parameters also reduces memory footprint. In industrial applications, combining these techniques with the hardware capabilities of embedded devices to natively support quantized (8-bit integer matrix) computation and true end-to-end low-precision computation is an important route to accelerating network models and reducing their power consumption.
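As a concrete illustration of the low-precision arithmetic described above (the notation below follows the common affine-quantization convention in the literature; it is not taken from the patent text), a real value x is represented by an integer q together with a scale s and a zero point z, and a dot product, the core of convolution, can then be evaluated entirely in integer arithmetic:

    x \approx s\,(q - z)

    y = \sum_{i} w_i x_i \approx s_w s_x \sum_{i} (q_{w,i} - z_w)(q_{x,i} - z_x)

Here the integer sum is accumulated in 32 bits while only narrow integer factors are multiplied, and the single constant s_w s_x is folded into one fixed-point rescaling of the final accumulator, which is precisely why quantized inference moves less data and spends less energy per operation.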
For vision tasks, convolutional neural networks dominate. In such models the convolution operation accounts for most of the computation, so the bottleneck for both performance acceleration and power optimization is the convolution implementation. Current convolution implementations are mainly GEMM-based convolution (im2col) and fast convolution algorithms (FFT-based convolution, Winograd convolution). GEMM-based convolution is relatively simple to implement: after a simple data rearrangement, a mature matrix-multiplication routine can be called directly. Fast convolution methods are harder to implement but genuinely reduce the arithmetic complexity of convolution. Given the characteristics of lightweight networks used on mobile devices, the kernels of the dominant convolutions are 3x3, and both theory and practice show that the Winograd fast convolution method has a significant complexity advantage for 3x3 convolution. On the other hand, because of the implementation complexity of Winograd convolution and the frequent changes of numerical-representation precision in quantized computation, the Winograd algorithms that work well in desktop, floating-point settings cannot be carried over efficiently: the most common convolution algorithm on mobile remains im2col and its variants, and quantized implementations of Winograd convolution are still immature in both research and application.
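For background on the complexity advantage just mentioned (this is the standard formulation from the literature, not text from the patent), the 1D Winograd algorithm F(2,3) computes two outputs of a 3-tap filter g applied to a 4-element input d with four multiplications instead of six:

    Y = A^T \left[ (G g) \odot (B^T d) \right]

    B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
    G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}, \quad
    A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}

Nesting this in two dimensions gives F(2x2,3x3), with Y = A^T[(G g G^T) \odot (B^T d B)]A: a 2x2 output tile is produced from a 4x4 input tile with 16 multiplications instead of 36, a 2.25x reduction in multiplications for 3x3 kernels.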
Furthermore, the computational bottleneck of a convolutional network lies in the convolution itself, whose implementation is closely tied to matrix multiplication: mainstream convolution methods convert the convolution, directly or indirectly, into matrix multiplication. A central focus and difficulty of efficient inference in convolutional networks is therefore the efficient implementation of matrix multiplication (called GEMM in linear-algebra libraries). Although GEMM has been studied to maturity in the HPC field, that research targets server-side x86-architecture devices with abundant resources and sufficient computing power; in edge computing, where resources and capability are limited and ARM-architecture devices dominate, those assumptions deviate substantially, and quantized computation in this setting has received very little attention. In addition, matrix multiplication in high-performance computing is usually concerned with large matrices, whereas the matrix operations arising in neural networks deployed on mobile devices rarely reach that scale, so many optimization strategies for large-matrix computation are redundant here and even introduce extra overhead.
Combining existing research on mobile network models and model lightweighting methods, fully exploiting the computing capability of the hardware platform, and optimizing the mobile operator (kernel) library around fast convolution methods are therefore of self-evident importance and necessity for deploying efficient, low-power deep network models for edge intelligence and the intelligent internet of things (AIoT).
Disclosure of Invention
The present invention is directed to a fast convolution method and apparatus suitable for a microprocessor, intended to solve the above-mentioned problems in the prior art.
The invention provides a fast convolution method suitable for a microprocessor, which comprises the following steps:
partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel;
synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel;
and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result.
The invention provides a fast convolution device suitable for a microprocessor, which comprises:
a partition transformation module, configured to partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel;
a weight transformation module, configured to synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel;
and a matrix multiplication module, configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
The embodiment of the present invention further provides a fast convolution device suitable for a microprocessor, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the above fast convolution method for a microprocessor.
The embodiment of the present invention further provides a computer-readable storage medium, where an implementation program for information transfer is stored on the computer-readable storage medium, and when the implementation program is executed by a processor, the steps of the fast convolution method suitable for the microprocessor are implemented.
By adopting the embodiments of the invention, the forward inference of convolutional networks on ARM Cortex devices can be effectively accelerated: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a fast convolution method for a microprocessor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-channel Winograd convolution algorithm according to an embodiment of the present invention;
FIG. 3 is a diagram of a fast convolution apparatus for a microprocessor according to a first embodiment of the present invention;
FIG. 4 is a diagram of a fast convolution apparatus for a microprocessor according to a second embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a fast convolution method suitable for ARM devices, with two main points: 1. For the integer-convolution optimization problem of the Winograd convolution method, a locally-partitioned multi-channel Winograd convolution algorithm is proposed; given the limited computing resources of embedded devices, it fully exploits the spatial locality of the convolution operation and the channel-wise parallelism of the Winograd algorithm, effectively accelerating integer convolution on embedded devices. 2. To further speed up integer convolution, the efficient parallel computing capability of ARM's NEON instructions is exploited, and a cache-friendly, high-throughput implementation strategy for small- and medium-scale integer matrix multiplication is proposed for the parameter characteristics of lightweight mobile convolutional networks; this makes better use of the limited storage and parallel computing capacity, avoids the extra overhead of matrix multiplication as practiced in traditional high-performance computing, and further accelerates integer convolution on the ARM Cortex-A architecture.
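As a minimal sketch of the NEON parallelism relied on in point 2 (the function name and vector length are illustrative assumptions; the patent does not publish kernel code), the widening multiply-accumulate intrinsic vmlal_s16 produces four 16-bit products per instruction and accumulates them into 32-bit lanes, matching the 16-bit-input/32-bit-accumulator quantized data path described later:

    #include <arm_neon.h>
    #include <stdint.h>

    /* Dot product of two int16 vectors of length n (assumed divisible by 8),
     * accumulated in int32 lanes, eight input elements per iteration. */
    int32_t dot_s16_neon(const int16_t *a, const int16_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);
            int16x8_t vb = vld1q_s16(b + i);
            /* Widening multiply-accumulate: int16 * int16 -> int32. */
            acc = vmlal_s16(acc, vget_low_s16(va), vget_low_s16(vb));
            acc = vmlal_s16(acc, vget_high_s16(va), vget_high_s16(vb));
        }
        /* Horizontal reduction of the four int32 lanes. */
        return vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
             + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
    }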
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted," "connected," and "coupled" are to be construed broadly and may, for example, denote a fixed connection, a detachable connection, or an integral connection; a mechanical or electrical connection; a direct connection or an indirect connection through intervening media; or a communication between the interiors of two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
Method embodiment
According to an embodiment of the present invention, a fast convolution method suitable for a microprocessor is provided. Fig. 1 is a flowchart of a fast convolution method for a microprocessor according to an embodiment of the present invention, and as shown in fig. 1, the fast convolution method for a microprocessor according to an embodiment of the present invention specifically includes:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The specific implementation details of the quantized fast convolution method are as follows.
1. Locally-partitioned multi-channel Winograd convolution algorithm
As shown in fig. 2, the input image or 3D features are first partitioned along the spatial dimension, and the resulting small blocks then undergo the input transformation into the Winograd domain. In this transformation the channels are independent of one another, so the input transformation is carried out on the corresponding channels synchronously through SIMD. Likewise, the weight transformation is processed in parallel along the channel dimension. The transformed inputs and weights are then multiplied as matrices, and finally the matrix multiplication result undergoes the output transformation and is returned as the convolution result in the spatial domain.
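A scalar sketch of the per-tile input transform in fig. 2 (illustrative only: it assumes the F(2x2,3x3) tiling with 4x4 input tiles and 16-bit quantized values; in the implementation described above, this loop body is evaluated for many channels at once via SIMD):

    #include <stdint.h>

    /* Winograd F(2x2,3x3) input transform V = B^T * d * B for one 4x4 tile.
     * B^T contains only 0 and +-1, so the transform is pure add/subtract. */
    void winograd_input_transform_4x4(const int16_t d[4][4], int16_t V[4][4]) {
        int16_t t[4][4];
        /* Left-multiply by B^T: combine the rows of d, column by column. */
        for (int j = 0; j < 4; ++j) {
            t[0][j] = d[0][j] - d[2][j];
            t[1][j] = d[1][j] + d[2][j];
            t[2][j] = d[2][j] - d[1][j];
            t[3][j] = d[1][j] - d[3][j];
        }
        /* Right-multiply by B: the same combination applied along each row. */
        for (int i = 0; i < 4; ++i) {
            V[i][0] = t[i][0] - t[i][2];
            V[i][1] = t[i][1] + t[i][2];
            V[i][2] = t[i][2] - t[i][1];
            V[i][3] = t[i][1] - t[i][3];
        }
    }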
2. Efficient quantized matrix multiplication implementation
The matrix multiplications involved in Winograd convolution on mobile devices range in size from tens to hundreds; the input values participating in the multiplication are represented as 16-bit integers and the outputs as 32-bit integers, so the parameters place little pressure on the storage system. Efficient matrix multiplication requires, overall, matching the hierarchical structure of the computer's storage system so that data already loaded into fast storage is fully reused: the more frequently data is used, the higher it should sit in the storage hierarchy, i.e., the faster the storage device holding it. At the same time, the size of highly reused data should match the capacity of the corresponding storage hardware, so the higher the degree of reuse, the finer the granularity of data partitioning. In addition, contiguous storage locations should be accessed as much as possible during computation, and the data layout should be adapted to the computation order to optimize cache utilization. Accordingly, the following efficient quantized small/medium-scale matrix multiplication, i.e., a matrix multiplication suited to quantized Winograd convolution in lightweight networks, is proposed here:
[Formula omitted in the source text (image BDA0002726187650000081): the blocked quantized matrix multiplication scheme described below.]
specifically, for the matrixes a and B participating in the operation, firstly, the right multiplication matrix B is packed according to the Blocking Factor of the register, then, at the corresponding position in the storage space corresponding to the matrixes a and B, the preset small matrix blocks of the respective Blocking Factor unit are taken, the dot product is carried out on the two small matrix blocks, and the calculation result is the sub-matrix of the corresponding position of the integral matrix a and B multiplication. And traversing the submatrices of the A and B in each dimension according to the Blocking Factor, and finally calculating the multiplication result of the A and B matrixes.
In summary, by means of the technical scheme of the embodiment of the invention, the forward inference of convolutional networks on ARM Cortex devices can be effectively accelerated: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.
Apparatus embodiment one
According to the embodiment of the invention, the fast convolution device suitable for the microprocessor is provided, wherein the microprocessor is ARM equipment. Fig. 3 is a schematic diagram of a fast convolution apparatus suitable for a microprocessor according to a first embodiment of the present invention, and as shown in fig. 3, the fast convolution apparatus suitable for a microprocessor according to the first embodiment of the present invention specifically includes:
the partition transformation module 30 is configured to partition the acquired input features in a spatial dimension, synchronize small blocks obtained after partitioning to corresponding channels, and perform input transformation of a Winograd domain in parallel for each channel; the input features specifically include: input images or 3D features; the partition transformation module 30 is specifically configured to: synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
the weight transformation module 32 is used for synchronizing the weight to the corresponding channel, and each channel completes the weight transformation of Winograd domain in parallel; the weight transformation module 32 is specifically configured to: the weights are synchronized to the corresponding lanes through the SIMDs.
The matrix multiplication module 34 is configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
The matrix multiplication module 34 is specifically configured to:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The embodiment of the present invention is an apparatus embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.
Device embodiment II
An embodiment of the present invention provides a fast convolution device suitable for a microprocessor, as shown in fig. 4, including: a memory 40, a processor 42 and a computer program stored on the memory 40 and executable on the processor 42, which computer program, when executed by the processor 42, carries out the following method steps:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
Device embodiment III
The embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transmission is stored, and when being executed by a processor 42, the implementation program implements the following method steps:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A fast convolution method for a microprocessor, comprising:
partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel;
synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel;
and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result.
2. The method of claim 1,
the input features specifically include: input images or 3D features;
the microprocessor is an ARM device.
3. The method of claim 1,
synchronizing the small blocks obtained after partitioning to the corresponding channels specifically comprises:
synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
synchronizing the weights to the corresponding channels specifically includes:
the weights are synchronized to the corresponding channels through SIMD.
4. The method of claim 1, wherein matrix multiplying the transformed input features and the corresponding transformed weights specifically comprises:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
5. A fast convolution device adapted for use with a microprocessor, comprising:
a partition transformation module, configured to partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel;
a weight transformation module, configured to synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel;
and a matrix multiplication module, configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
6. The apparatus of claim 5,
the input features specifically include: input images or 3D features;
the microprocessor is an ARM device.
7. The apparatus of claim 5,
the partition transformation module is specifically configured to:
synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
the weight transformation module is specifically configured to:
the weights are synchronized to the corresponding channels through SIMD.
8. The apparatus of claim 5, wherein the matrix multiplication module is specifically configured to:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
9. A fast convolution device adapted for use with a microprocessor, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the fast convolution method for a microprocessor according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which an information-transfer-implementing program is stored, which, when executed by a processor, implements the steps of the fast convolution method for a microprocessor according to any one of claims 1 to 4.
CN202011103515.6A 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor Active CN112199636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011103515.6A CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011103515.6A CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Publications (2)

Publication Number Publication Date
CN112199636A true CN112199636A (en) 2021-01-08
CN112199636B CN112199636B (en) 2022-10-28

Family

ID=74010180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011103515.6A Active CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Country Status (1)

Country Link
CN (1) CN112199636B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765552A (en) * 2021-01-21 2021-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method of matrix multiplication based on array packing
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116776946A (en) * 2023-06-26 2023-09-19 中国科学院长春光学精密机械与物理研究所 Pipeline parallel convolution array design method based on Winograd convolution algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
US20190042542A1 (en) * 2018-03-28 2019-02-07 Intel Corporation Accelerator for sparse-dense matrix multiplication
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109767000A (en) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 Neural network convolution method and device based on Winograd algorithm
US20190370644A1 (en) * 2018-06-04 2019-12-05 Lightmatter, Inc. Convolutional layers for neural networks using programmable nanophotonics
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
US20200234124A1 (en) * 2019-01-23 2020-07-23 Samsung Electronics Co., Ltd. Winograd transform convolution operations for neural networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
US20190042542A1 (en) * 2018-03-28 2019-02-07 Intel Corporation Accelerator for sparse-dense matrix multiplication
US20190370644A1 (en) * 2018-06-04 2019-12-05 Lightmatter, Inc. Convolutional layers for neural networks using programmable nanophotonics
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
CN109767000A (en) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 Neural network convolution method and device based on Winograd algorithm
US20200234124A1 (en) * 2019-01-23 2020-07-23 Samsung Electronics Co., Ltd. Winograd transform convolution operations for neural networks
CN111476360A (en) * 2019-01-23 2020-07-31 三星电子株式会社 Apparatus and method for Winograd transform convolution operation of neural network

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
SACHIN BAGGA et al., "Virtualization Approach to Cluster Based Winograd's Variant of Strassen's Method Using RMI", 2016 Second International Conference on Computational Intelligence & Communication Technology (CICT), 18 August 2016 (2016-08-18) *
WEIWEN CHEN等: "Hardware Acceleration Implementation of Three-Dimensional Convolutional Neural Network on Vector Digital Signal Processors", 《2020 4TH INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION SCIENCES (ICRAS)》, 8 July 2020 (2020-07-08) *
DING Guiguang et al., "Motion-compensated 3D wavelet video coding based on lifting", Systems Engineering and Electronics, no. 09, 20 September 2004 (2004-09-20) (in Chinese) *
DING Guiguang et al., "High-dimensional image feature indexing framework for cloud environments", Computer Integrated Manufacturing Systems, no. 08, 15 August 2011 (2011-08-15) (in Chinese) *
WU Enhua et al., "General-purpose computation on graphics processing units (GPU)", Journal of Computer-Aided Design & Computer Graphics, no. 05, 20 May 2004 (2004-05-20) (in Chinese) *
XU Rui et al., "Design and research of a convolutional neural network accelerator based on the sparse Winograd algorithm", Computer Engineering & Science, vol. 41, no. 09, 15 September 2019 (2019-09-15) (in Chinese) *
DENG Xi et al., "Fast implementation of the H.264 integer transform based on the TMS320C64 series", Video Engineering, no. 07, 17 July 2008 (2008-07-17) (in Chinese) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765552A (en) * 2021-01-21 2021-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method of matrix multiplication based on array packing
CN112765552B (en) * 2021-01-21 2024-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method for matrix multiplication based on array packing
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116401502B (en) * 2023-06-09 2023-11-03 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116776946A (en) * 2023-06-26 2023-09-19 中国科学院长春光学精密机械与物理研究所 Pipeline parallel convolution array design method based on Winograd convolution algorithm

Also Published As

Publication number Publication date
CN112199636B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN112199636B (en) Fast convolution method and device suitable for microprocessor
EP3605402B1 (en) Chip device and related product
CN107229967B (en) Hardware accelerator and method for realizing sparse GRU neural network based on FPGA
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
US20200089535A1 (en) Data sharing system and data sharing method therefor
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN111325321A (en) Brain-like computing system based on multi-neural network fusion and execution method of instruction set
US11120101B2 (en) Matrix multiplication system and method
Li et al. VBSF: a new storage format for SIMD sparse matrix–vector multiplication on modern processors
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN111381968B (en) Convolution operation optimization method and system for efficiently running deep learning task
KR20210037569A (en) Power-efficient hybrid traversal apparatus and method for convolutional neural network accelerator architecture
CN111738276A (en) Image processing method, device and equipment based on multi-core convolutional neural network
Mielikainen et al. Constant coefficients linear prediction for lossless compression of ultraspectral sounder data using a graphics processing unit
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN113655986B (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CA3186227A1 (en) System and method for accelerating training of deep learning networks
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN110766136B (en) Compression method of sparse matrix and vector
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
Sakr et al. Memory-efficient CMSIS-NN with replacement strategy

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210108

Assignee: CSIC PRIDE (Nanjing) Intelligent Equipment System Co., Ltd.

Assignor: TSINGHUA University

Contract record no.: X2023320000119

Denomination of invention: A Fast Convolutional Method and Device Suitable for Microprocessors

Granted publication date: 20221028

License type: Common License

Record date: 20230323

EE01 Entry into force of recordation of patent licensing contract