CN112199636A - Fast convolution method and device suitable for microprocessor


Info

Publication number
CN112199636A
Authority
CN
China
Prior art keywords
matrix
transformation
convolution
synchronizing
microprocessor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011103515.6A
Other languages
Chinese (zh)
Other versions
CN112199636B (en)
Inventor
丁贵广 (DING Guiguang)
温发琥 (WEN Fahu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202011103515.6A priority Critical patent/CN112199636B/en
Publication of CN112199636A publication Critical patent/CN112199636A/en
Application granted granted Critical
Publication of CN112199636B publication Critical patent/CN112199636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a fast convolution method and device suitable for a microprocessor, wherein the method comprises the following steps: partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel; synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel; and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result. The method effectively accelerates the forward inference of convolutional networks on ARM Cortex devices: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.

Description

Fast convolution method and device suitable for microprocessor
Technical Field
The present invention relates to the field of microprocessor technology, and in particular, to a fast convolution method and apparatus suitable for a microprocessor.
Background
The use of convolutional neural networks has brought broad accuracy improvements to vision applications in embedded scenarios. Unlike desktop and server devices with abundant computing resources, mobile ARM devices typically have low power budgets, relatively weak computing capability, and limited memory, so deploying computation-intensive convolutional networks on ARM devices requires multiple targeted optimizations.
Model quantization is a network lightweighting method with significant benefits for accelerating model execution and reducing power consumption; research in this field has shown that deep neural networks retain considerable accuracy and expressiveness on their tasks under low-precision representations. Combining low-bit arithmetic on quantized parameters with quantization of the network's intermediate computations yields large computational gains and higher performance. Beyond the performance advantage, a quantized neural network also improves power efficiency, because memory-access cost falls while computational efficiency rises: low-bit quantized data requires less on-chip and off-chip data movement, which reduces memory bandwidth and saves substantial energy, and lower-precision arithmetic (e.g., 8-bit integer multiplication) consumes less energy per operation. Reducing the number of bits used to represent network parameters also reduces memory footprint. In industrial applications, combining these techniques with the hardware capabilities of embedded devices to natively support quantized (8-bit integer matrix) computation and true end-to-end low-precision computation is an important route to accelerating network models and reducing their power consumption.
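As a concrete illustration of the low-precision arithmetic described above (the notation below follows the common affine-quantization convention in the literature; it is not taken from the patent text), a real value x is represented by an integer q together with a scale s and a zero point z, and a dot product, the core of convolution, can then be evaluated entirely in integer arithmetic:

    x \approx s\,(q - z)

    y = \sum_{i} w_i x_i \approx s_w s_x \sum_{i} (q_{w,i} - z_w)(q_{x,i} - z_x)

Here the integer sum is accumulated in 32 bits while only narrow integer factors are multiplied, and the single constant s_w s_x is folded into one fixed-point rescaling of the final accumulator, which is precisely why quantized inference moves less data and spends less energy per operation.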
For vision tasks, convolutional neural networks dominate. In such models the convolution operation accounts for most of the computation, so the bottleneck for both performance acceleration and power optimization is the convolution implementation. Current convolution implementations are mainly GEMM-based convolution (im2col) and fast convolution algorithms (FFT-based convolution, Winograd convolution). GEMM-based convolution is relatively simple to implement: after a simple data rearrangement, a mature matrix-multiplication routine can be called directly. Fast convolution methods are harder to implement but genuinely reduce the arithmetic complexity of convolution. Given the characteristics of lightweight networks used on mobile devices, the kernels of the dominant convolutions are 3x3, and both theory and practice show that the Winograd fast convolution method has a significant complexity advantage for 3x3 convolution. On the other hand, because of the implementation complexity of Winograd convolution and the frequent changes of numerical-representation precision in quantized computation, the Winograd algorithms that work well in desktop, floating-point settings cannot be carried over efficiently: the most common convolution algorithm on mobile remains im2col and its variants, and quantized implementations of Winograd convolution are still immature in both research and application.
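For background on the complexity advantage just mentioned (this is the standard formulation from the literature, not text from the patent), the 1D Winograd algorithm F(2,3) computes two outputs of a 3-tap filter g applied to a 4-element input d with four multiplications instead of six:

    Y = A^T \left[ (G g) \odot (B^T d) \right]

    B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
    G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}, \quad
    A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}

Nesting this in two dimensions gives F(2x2,3x3), with Y = A^T[(G g G^T) \odot (B^T d B)]A: a 2x2 output tile is produced from a 4x4 input tile with 16 multiplications instead of 36, a 2.25x reduction in multiplications for 3x3 kernels.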
Furthermore, the computational bottleneck of a convolutional network lies in the convolution itself, whose implementation is closely tied to matrix multiplication: mainstream convolution methods convert the convolution, directly or indirectly, into matrix multiplication. A central focus and difficulty of efficient inference in convolutional networks is therefore the efficient implementation of matrix multiplication (called GEMM in linear-algebra libraries). Although GEMM has been studied to maturity in the HPC field, that research targets server-side x86-architecture devices with abundant resources and sufficient computing power; in edge computing, where resources and capability are limited and ARM-architecture devices dominate, those assumptions deviate substantially, and quantized computation in this setting has received very little attention. In addition, matrix multiplication in high-performance computing is usually concerned with large matrices, whereas the matrix operations arising in neural networks deployed on mobile devices rarely reach that scale, so many optimization strategies for large-matrix computation are redundant here and even introduce extra overhead.
Combining existing research on mobile network models and model lightweighting methods, fully exploiting the computing capability of the hardware platform, and optimizing the mobile operator (kernel) library around fast convolution methods are therefore of self-evident importance and necessity for deploying efficient, low-power deep network models for edge intelligence and the intelligent internet of things (AIoT).
Disclosure of Invention
The present invention is directed to a fast convolution method and apparatus suitable for a microprocessor, intended to solve the above-mentioned problems in the prior art.
The invention provides a fast convolution method suitable for a microprocessor, which comprises the following steps:
partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel;
synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel;
and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result.
The invention provides a fast convolution device suitable for a microprocessor, which comprises:
a partition transformation module, configured to partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel;
a weight transformation module, configured to synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel;
and a matrix multiplication module, configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
The embodiment of the present invention further provides a fast convolution device suitable for a microprocessor, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the above fast convolution method for a microprocessor.
The embodiment of the present invention further provides a computer-readable storage medium, where an implementation program for information transfer is stored on the computer-readable storage medium, and when the implementation program is executed by a processor, the steps of the fast convolution method suitable for the microprocessor are implemented.
By adopting the embodiments of the invention, the forward inference of convolutional networks on ARM Cortex devices can be effectively accelerated: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a fast convolution method for a microprocessor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-channel Winograd convolution algorithm according to an embodiment of the present invention;
FIG. 3 is a diagram of a fast convolution apparatus for a microprocessor according to a first embodiment of the present invention;
FIG. 4 is a diagram of a fast convolution apparatus for a microprocessor according to a second embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a fast convolution method suitable for ARM devices, with two main points: 1. For the integer-convolution optimization problem of the Winograd convolution method, a locally-partitioned multi-channel Winograd convolution algorithm is proposed; given the limited computing resources of embedded devices, it fully exploits the spatial locality of the convolution operation and the channel-wise parallelism of the Winograd algorithm, effectively accelerating integer convolution on embedded devices. 2. To further speed up integer convolution, the efficient parallel computing capability of ARM's NEON instructions is exploited, and a cache-friendly, high-throughput implementation strategy for small- and medium-scale integer matrix multiplication is proposed for the parameter characteristics of lightweight mobile convolutional networks; this makes better use of the limited storage and parallel computing capacity, avoids the extra overhead of matrix multiplication as practiced in traditional high-performance computing, and further accelerates integer convolution on the ARM Cortex-A architecture.
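As a minimal sketch of the NEON parallelism relied on in point 2 (the function name and vector length are illustrative assumptions; the patent does not publish kernel code), the widening multiply-accumulate intrinsic vmlal_s16 produces four 16-bit products per instruction and accumulates them into 32-bit lanes, matching the 16-bit-input/32-bit-accumulator quantized data path described later:

    #include <arm_neon.h>
    #include <stdint.h>

    /* Dot product of two int16 vectors of length n (assumed divisible by 8),
     * accumulated in int32 lanes, eight input elements per iteration. */
    int32_t dot_s16_neon(const int16_t *a, const int16_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);
            int16x8_t vb = vld1q_s16(b + i);
            /* Widening multiply-accumulate: int16 * int16 -> int32. */
            acc = vmlal_s16(acc, vget_low_s16(va), vget_low_s16(vb));
            acc = vmlal_s16(acc, vget_high_s16(va), vget_high_s16(vb));
        }
        /* Horizontal reduction of the four int32 lanes. */
        return vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
             + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
    }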
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted," "connected," and "coupled" are to be construed broadly and may, for example, denote a fixed connection, a detachable connection, or an integral connection; a mechanical or electrical connection; a direct connection or an indirect connection through intervening media; or a communication between the interiors of two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
Method embodiment
According to an embodiment of the present invention, a fast convolution method suitable for a microprocessor is provided. Fig. 1 is a flowchart of a fast convolution method for a microprocessor according to an embodiment of the present invention, and as shown in fig. 1, the fast convolution method for a microprocessor according to an embodiment of the present invention specifically includes:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The specific implementation details of the quantized fast convolution method are as follows.
1. Locally-partitioned multi-channel Winograd convolution algorithm
As shown in fig. 2, the input image or 3D features are first partitioned along the spatial dimension, and the resulting small blocks then undergo the input transformation into the Winograd domain. In this transformation the channels are independent of one another, so the input transformation is carried out on the corresponding channels synchronously through SIMD. Likewise, the weight transformation is processed in parallel along the channel dimension. The transformed inputs and weights are then multiplied as matrices, and finally the matrix multiplication result undergoes the output transformation and is returned as the convolution result in the spatial domain.
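A scalar sketch of the per-tile input transform in fig. 2 (illustrative only: it assumes the F(2x2,3x3) tiling with 4x4 input tiles and 16-bit quantized values; in the implementation described above, this loop body is evaluated for many channels at once via SIMD):

    #include <stdint.h>

    /* Winograd F(2x2,3x3) input transform V = B^T * d * B for one 4x4 tile.
     * B^T contains only 0 and +-1, so the transform is pure add/subtract. */
    void winograd_input_transform_4x4(const int16_t d[4][4], int16_t V[4][4]) {
        int16_t t[4][4];
        /* Left-multiply by B^T: combine the rows of d, column by column. */
        for (int j = 0; j < 4; ++j) {
            t[0][j] = d[0][j] - d[2][j];
            t[1][j] = d[1][j] + d[2][j];
            t[2][j] = d[2][j] - d[1][j];
            t[3][j] = d[1][j] - d[3][j];
        }
        /* Right-multiply by B: the same combination applied along each row. */
        for (int i = 0; i < 4; ++i) {
            V[i][0] = t[i][0] - t[i][2];
            V[i][1] = t[i][1] + t[i][2];
            V[i][2] = t[i][2] - t[i][1];
            V[i][3] = t[i][1] - t[i][3];
        }
    }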
2. Efficient quantized matrix multiplication implementation
The matrix multiplications involved in Winograd convolution on mobile devices range in size from tens to hundreds; the input values participating in the multiplication are represented as 16-bit integers and the outputs as 32-bit integers, so the parameters place little pressure on the storage system. Efficient matrix multiplication requires, overall, matching the hierarchical structure of the computer's storage system so that data already loaded into fast storage is fully reused: the more frequently data is used, the higher it should sit in the storage hierarchy, i.e., the faster the storage device holding it. At the same time, the size of highly reused data should match the capacity of the corresponding storage hardware, so the higher the degree of reuse, the finer the granularity of data partitioning. In addition, contiguous storage locations should be accessed as much as possible during computation, and the data layout should be adapted to the computation order to optimize cache utilization. Accordingly, the following efficient quantized small/medium-scale matrix multiplication, i.e., a matrix multiplication suited to quantized Winograd convolution in lightweight networks, is proposed here:
[Formula omitted in the source text (image BDA0002726187650000081): the blocked quantized matrix multiplication scheme described below.]
specifically, for the matrixes a and B participating in the operation, firstly, the right multiplication matrix B is packed according to the Blocking Factor of the register, then, at the corresponding position in the storage space corresponding to the matrixes a and B, the preset small matrix blocks of the respective Blocking Factor unit are taken, the dot product is carried out on the two small matrix blocks, and the calculation result is the sub-matrix of the corresponding position of the integral matrix a and B multiplication. And traversing the submatrices of the A and B in each dimension according to the Blocking Factor, and finally calculating the multiplication result of the A and B matrixes.
In summary, by means of the technical scheme of the embodiment of the invention, the forward inference of convolutional networks on ARM Cortex devices can be effectively accelerated: compared with TensorFlow Lite quantized convolution, a convolution operation with the same parameter configuration achieves approximately a 2x speedup on an ARM Cortex-A72 device.
Apparatus embodiment one
According to the embodiment of the invention, the fast convolution device suitable for the microprocessor is provided, wherein the microprocessor is ARM equipment. Fig. 3 is a schematic diagram of a fast convolution apparatus suitable for a microprocessor according to a first embodiment of the present invention, and as shown in fig. 3, the fast convolution apparatus suitable for a microprocessor according to the first embodiment of the present invention specifically includes:
the partition transformation module 30 is configured to partition the acquired input features in a spatial dimension, synchronize small blocks obtained after partitioning to corresponding channels, and perform input transformation of a Winograd domain in parallel for each channel; the input features specifically include: input images or 3D features; the partition transformation module 30 is specifically configured to: synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
the weight transformation module 32 is used for synchronizing the weight to the corresponding channel, and each channel completes the weight transformation of Winograd domain in parallel; the weight transformation module 32 is specifically configured to: the weights are synchronized to the corresponding lanes through the SIMDs.
The matrix multiplication module 34 is configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
The matrix multiplication module 34 is specifically configured to:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The embodiment of the present invention is an apparatus embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.
Device embodiment II
An embodiment of the present invention provides a fast convolution device suitable for a microprocessor, as shown in fig. 4, including: a memory 40, a processor 42 and a computer program stored on the memory 40 and executable on the processor 42, which computer program, when executed by the processor 42, carries out the following method steps:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
Device embodiment III
The embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transmission is stored, and when being executed by a processor 42, the implementation program implements the following method steps:
Step 101: partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel. The input features specifically comprise an input image or 3D features; in the embodiment of the invention, the small blocks obtained after partitioning are synchronized to the corresponding channels through single-instruction-multiple-data (SIMD) operations.
Step 102: synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel. In embodiments of the invention, the weights may be synchronized to the corresponding channels through SIMD.
Step 103: perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
In step 103, the matrix multiplication of the transformed input features and the corresponding transformed weights specifically includes:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A fast convolution method for a microprocessor, comprising:
partitioning the acquired input features along the spatial dimension, synchronizing the small blocks obtained after partitioning to their corresponding channels, and completing the Winograd-domain input transformation in parallel on each channel;
synchronizing the weights to their corresponding channels, and completing the Winograd-domain weight transformation in parallel on each channel;
and performing matrix multiplication between the transformed input features and the corresponding transformed weights, applying the output transformation to the matrix multiplication result, and returning it to the spatial domain as the convolution calculation result.
2. The method of claim 1,
the input features specifically include: input images or 3D features;
the microprocessor is an ARM device.
3. The method of claim 1,
synchronizing the small blocks obtained after partitioning to the corresponding channels specifically comprises:
synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
synchronizing the weights to the corresponding channels specifically includes:
the weights are synchronized to the corresponding channels through SIMD.
4. The method of claim 1, wherein matrix multiplying the transformed input features and the corresponding transformed weights specifically comprises:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
5. A fast convolution device adapted for use with a microprocessor, comprising:
a partition transformation module, configured to partition the acquired input features along the spatial dimension, synchronize the small blocks obtained after partitioning to their corresponding channels, and complete the Winograd-domain input transformation in parallel on each channel;
a weight transformation module, configured to synchronize the weights to their corresponding channels, with each channel completing the Winograd-domain weight transformation in parallel;
and a matrix multiplication module, configured to perform matrix multiplication between the transformed input features and the corresponding transformed weights, apply the output transformation to the matrix multiplication result, and return it to the spatial domain as the convolution calculation result.
6. The apparatus of claim 5,
the input features specifically include: input images or 3D features;
the microprocessor is an ARM device.
7. The apparatus of claim 5,
the partition transformation module is specifically configured to:
synchronizing the small blocks obtained after partitioning to corresponding channels through single instruction multiple data operation (SIMD);
the weight transformation module is specifically configured to:
the weights are synchronized to the corresponding channels through SIMD.
8. The apparatus of claim 5, wherein the matrix multiplication module is specifically configured to:
for the matrices A and B participating in the operation, the right-hand matrix B is first packed according to the register blocking factors; then, at the corresponding positions in the storage spaces of A and B, small matrix blocks of the preset blocking-factor size are taken and the dot product of the two blocks is computed, the result being the sub-matrix at the corresponding position of the product of the whole matrices A and B; finally, the sub-matrices of A and B are traversed along each dimension according to the blocking factors, yielding the complete matrix multiplication result of A and B.
9. A fast convolution device adapted for use with a microprocessor, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the fast convolution method for a microprocessor according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which an information-transfer-implementing program is stored, which, when executed by a processor, implements the steps of the fast convolution method for a microprocessor according to any one of claims 1 to 4.
CN202011103515.6A 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor Active CN112199636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011103515.6A CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011103515.6A CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Publications (2)

Publication Number Publication Date
CN112199636A true CN112199636A (en) 2021-01-08
CN112199636B CN112199636B (en) 2022-10-28

Family

ID=74010180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011103515.6A Active CN112199636B (en) 2020-10-15 2020-10-15 Fast convolution method and device suitable for microprocessor

Country Status (1)

Country Link
CN (1) CN112199636B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765552A (en) * 2021-01-21 2021-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method of matrix multiplication based on array packing
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116776946A (en) * 2023-06-26 2023-09-19 中国科学院长春光学精密机械与物理研究所 Pipeline parallel convolution array design method based on Winograd convolution algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
US20190042542A1 (en) * 2018-03-28 2019-02-07 Intel Corporation Accelerator for sparse-dense matrix multiplication
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109767000A (en) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 Neural network convolution method and device based on Winograd algorithm
US20190370644A1 (en) * 2018-06-04 2019-12-05 Lightmatter, Inc. Convolutional layers for neural networks using programmable nanophotonics
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
US20200234124A1 (en) * 2019-01-23 2020-07-23 Samsung Electronics Co., Ltd. Winograd transform convolution operations for neural networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN108009634A (en) * 2017-12-21 2018-05-08 美的集团股份有限公司 A kind of optimization method of convolutional neural networks, device and computer-readable storage medium
US20190042542A1 (en) * 2018-03-28 2019-02-07 Intel Corporation Accelerator for sparse-dense matrix multiplication
US20190370644A1 (en) * 2018-06-04 2019-12-05 Lightmatter, Inc. Convolutional layers for neural networks using programmable nanophotonics
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
CN109767000A (en) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 Neural network convolution method and device based on Winograd algorithm
US20200234124A1 (en) * 2019-01-23 2020-07-23 Samsung Electronics Co., Ltd. Winograd transform convolution operations for neural networks
CN111476360A (en) * 2019-01-23 2020-07-31 三星电子株式会社 Apparatus and method for Winograd transform convolution operation of neural network

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
SACHIN BAGGA et al., "Virtualization Approach to Cluster Based Winograd's Variant of Strassen's Method Using RMI", 2016 Second International Conference on Computational Intelligence & Communication Technology (CICT), 18 August 2016 (2016-08-18) *
WEIWEN CHEN等: "Hardware Acceleration Implementation of Three-Dimensional Convolutional Neural Network on Vector Digital Signal Processors", 《2020 4TH INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION SCIENCES (ICRAS)》, 8 July 2020 (2020-07-08) *
DING Guiguang et al., "Motion-compensated 3D wavelet video coding based on lifting", Systems Engineering and Electronics, no. 09, 20 September 2004 (2004-09-20) (in Chinese) *
DING Guiguang et al., "High-dimensional image feature indexing framework for cloud environments", Computer Integrated Manufacturing Systems, no. 08, 15 August 2011 (2011-08-15) (in Chinese) *
WU Enhua et al., "General-purpose computation on graphics processing units (GPU)", Journal of Computer-Aided Design & Computer Graphics, no. 05, 20 May 2004 (2004-05-20) (in Chinese) *
XU Rui et al., "Design and research of a convolutional neural network accelerator based on the sparse Winograd algorithm", Computer Engineering & Science, vol. 41, no. 09, 15 September 2019 (2019-09-15) (in Chinese) *
DENG Xi et al., "Fast implementation of the H.264 integer transform based on the TMS320C64 series", Video Engineering, no. 07, 17 July 2008 (2008-07-17) (in Chinese) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765552A (en) * 2021-01-21 2021-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method of matrix multiplication based on array packing
CN112765552B (en) * 2021-01-21 2024-05-07 中国科学院重庆绿色智能技术研究院 Block parameter space optimization method for matrix multiplication based on array packing
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116401502B (en) * 2023-06-09 2023-11-03 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116776946A (en) * 2023-06-26 2023-09-19 中国科学院长春光学精密机械与物理研究所 Pipeline parallel convolution array design method based on Winograd convolution algorithm

Also Published As

Publication number Publication date
CN112199636B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN112199636B (en) Fast convolution method and device suitable for microprocessor
EP3605402B1 (en) Chip device and related product
CN107229967B (en) Hardware accelerator and method for realizing sparse GRU neural network based on FPGA
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
US20200089535A1 (en) Data sharing system and data sharing method therefor
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN111325321A (en) Brain-like computing system based on multi-neural network fusion and execution method of instruction set
US11120101B2 (en) Matrix multiplication system and method
Li et al. VBSF: a new storage format for SIMD sparse matrix–vector multiplication on modern processors
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN111381968B (en) Convolution operation optimization method and system for efficiently running deep learning task
KR20210037569A (en) Power-efficient hybrid traversal apparatus and method for convolutional neural network accelerator architecture
CN111738276A (en) Image processing method, device and equipment based on multi-core convolutional neural network
Mielikainen et al. Constant coefficients linear prediction for lossless compression of ultraspectral sounder data using a graphics processing unit
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN113655986B (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CA3186227A1 (en) System and method for accelerating training of deep learning networks
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN110766136B (en) Compression method of sparse matrix and vector
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
Sakr et al. Memory-efficient CMSIS-NN with replacement strategy

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210108

Assignee: CSIC PRIDE (Nanjing) Intelligent Equipment System Co., Ltd.

Assignor: TSINGHUA University

Contract record no.: X2023320000119

Denomination of invention: A Fast Convolutional Method and Device Suitable for Microprocessors

Granted publication date: 20221028

License type: Common License

Record date: 20230323

EE01 Entry into force of recordation of patent licensing contract