CN113392957B - Convolution operation processing method, electronic equipment, mobile terminal and storage medium
- Publication number
- CN113392957B (application CN202110553804.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- matrix
- matrix multiplication
- configuration
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Stored Programmes (AREA)
Abstract
The application discloses a convolution operation processing method, an electronic device, a mobile terminal and a storage medium. The processing method comprises the following steps: acquiring a convolution operation to be processed and a configuration database; converting the convolution operation into a matrix multiplication, wherein the matrix multiplication corresponds to a convolution size; if it is determined that the configuration database contains no configuration parameters corresponding to the convolution size, defining a parameter search space according to the convolution size and the hardware parameters; generating a plurality of operation codes according to the configuration parameters in the parameter search space, and computing the matrix multiplication with the operation codes to obtain a plurality of operation results; and storing in the configuration database the configuration parameters of the operation code whose operation result satisfies a preset condition. In this way, the application can reconstruct and optimize the matrix multiplication, so that a better-performing matrix multiplication accelerates the convolution operation.
Description
Technical Field
The present application relates to the field of reconfigurable technologies, and in particular, to a convolution processing method, an electronic device, a mobile terminal, and a storage medium.
Background
In recent years, Deep Learning (DL) applications have gradually spread from professional scientific fields to the consumer market, with applications including real-time gaming robots, autonomous car navigation, VR social platforms and traffic monitoring using millions of cameras. In many cases, models trained on GPU or TPU clusters are deployed on edge devices to provide real-time artificial intelligence services.
Convolution is the main computational component of the Convolutional Neural Networks (CNNs) commonly used in artificial intelligence services; in many network models it accounts for over 99 percent of the computation. Convolution can be converted into matrix multiplication, so many applications use BLAS (Basic Linear Algebra Subprograms), hand-written matrix routines, or even extended matrix routines to implement convolution.
At present, most matrices generated in convolutional neural networks are strip-shaped (tall-and-skinny or short-and-wide), while the well-performing BLAS libraries are essentially optimized for square matrix operations, and their optimization strategies do not match such shapes. They therefore generally cannot deliver the best performance on strip-shaped matrices, and the performance of the matrix multiplication cannot be improved well.
Disclosure of Invention
A first aspect of the embodiments of the present application provides a processing method of convolution operation, where the processing method includes: obtaining convolution operation to be processed and a configuration database; converting the convolution operation into matrix multiplication, wherein the matrix multiplication corresponds to a convolution size; if the configuration parameters corresponding to the convolution size in the configuration database are determined not to exist, defining a parameter search space according to the convolution size and the hardware parameters; generating a plurality of operation codes according to configuration parameters in the parameter search space, and calculating matrix multiplication by using the operation codes to obtain a plurality of operation results; and storing the configuration parameters of the operation codes corresponding to one operation result which meets the preset condition in the plurality of operation results into a configuration database.
A second aspect of an embodiment of the present application provides a mobile terminal, including: the device comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the computer program to realize the processing method provided by the first aspect of the embodiment of the application.
A third aspect of embodiments of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program can be executed by a processor to implement the processing method provided by the first aspect of embodiments of the present application.
The beneficial effect of this application is: different from the prior art, the processing method of the convolution operation determines that no corresponding configuration parameters exist in the configuration database, defines a parameter search space according to the convolution size and the hardware parameters, reconstructs and optimizes the matrix multiplication by generating a plurality of operation codes from the configuration parameters in the parameter search space, and computes the matrix multiplication with these operation codes to obtain a plurality of operation results, thereby improving the performance of the matrix multiplication for the convolution operation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a convolution processing method according to the present application;
FIG. 2 is a flowchart illustrating an embodiment of step S13 of FIG. 1;
FIG. 3 is a flowchart illustrating an embodiment of step S23 in FIG. 2;
FIG. 4 is a flowchart illustrating an embodiment of step S14 of FIG. 1;
FIG. 5 is a flowchart illustrating an embodiment of step S15 of FIG. 1;
FIG. 6 is a schematic diagram of a matrix framework according to an embodiment of the convolution processing method of the present application;
FIG. 7 is a block structure diagram of the matrix of the present application;
FIG. 8 is a flowchart illustrating a method for handling convolution according to an embodiment of the present disclosure;
FIG. 9 is a diagram illustrating the result of the convolution processing method according to the present application;
FIG. 10 is a schematic block diagram of an embodiment of a mobile terminal of the present application;
FIG. 11 is a schematic block diagram of one embodiment of a computer-readable storage medium of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In 2017, 1.5 billion mobile phones were sold in the mobile phone consumer market. Tencent Beacon reported 682,956,170 active users/mobile devices online in the second quarter of 2019. Assuming an average compute performance of 50 GFlops per mobile device, the total theoretical peak performance of the active mobile devices mentioned in that report would exceed that of Fugaku, the world's fastest ARM-based supercomputer.
Convolution is the main computational component of the commonly used Convolutional Neural Network (CNN), and in many network models it accounts for more than 99 percent of the computation, as shown in Table 1 below:
TABLE 1 Time consumption ratio of convolution calculation in common deep learning CNN network models
In mobile computing, the ARM-architecture CPU is the primary hardware architecture used in mobile devices, making it a suitable and practical hardware platform on which to explore the best solutions for current neural network deployment. Every year, tens of licensed vendors manufacture dozens of different ARM SoCs by modifying the ARM architecture's cache sizes, memory types, instruction CPIs, or instruction sets. Application performance portability is therefore a challenge if deep learning applications are to take full advantage of the hardware resources in a given device. When deep learning applications use back-end compute libraries on ARM SoCs to serve deep learning models, they must solve this "application performance portability" problem.
For billions of ARM SoCs with hundreds of hardware specifications, the productivity of performance migration is another challenge for deploying deep learning models. ARM has published 10 Cortex-M and 16 Cortex-A/X family architectures, while Apple and other vendors have published 37 ARM-based architectures. It is therefore uneconomical to cover all ARM hardware architectures with manually tuned matrix operation libraries. For example, after migrating their efficient matrix operation library to 13 different ARM architectures, the authors of OpenBLAS stopped migration work for the Cortex-A73, released in 2016, and for devices released thereafter.
Under these conditions, the application therefore provides a convolution operation processing method; realizing an efficient and automatic way of porting performance to ARM devices is essential to the design of a new matrix operation library. Referring to fig. 1, fig. 1 is a schematic flowchart of a first embodiment of the convolution operation processing method of the present application, which specifically includes the following steps:
S11: acquiring the convolution operation to be processed and a configuration database;
Generally speaking, a convolution operation consists of three parts: the two operands to be convolved and the convolution calculation method applied to them. Obtaining the convolution operation to be processed completes the preparatory work for the convolution operation.
Usually, a configuration database is stored locally and contains corresponding convolution calculation methods, so that when a convolution operation is performed, obtaining the configuration database allows further computation to be carried out on the converted convolution.
The convolution operation to be processed may be obtained first and the configuration database then obtained based on the convolution. Alternatively, the convolution operation to be processed and the configuration database may be obtained at the same time; a person skilled in the art may also partially process the convolution operation to be processed before obtaining the configuration database, according to specific requirements, which is not limited here.
S12: converting the convolution operation into matrix multiplication, wherein the matrix multiplication corresponds to a convolution size;
after the convolution operation to be processed is obtained, the convolution operation can be converted into matrix multiplication, and because the convolution generally corresponds to a convolution size, the converted matrix multiplication also corresponds to a convolution size.
In addition, when the convolution operation to be processed is obtained first and the configuration database is then obtained based on the convolution, the convolution calculation method in the configuration database may specifically be looked up by the convolution size, so that the convolution operation is converted into a matrix multiplication.
Specifically, the Im2col algorithm may be performed on the convolution operation to convert the convolution operation into a matrix multiplication calculation corresponding to the convolution, and since the convolution generally corresponds to a convolution size, the converted matrix multiplication also corresponds to a convolution size.
Generally, in the single-channel case, the Im2col algorithm unrolls the input matrix, scanning left to right and top to bottom, so that each convolution window becomes one column of a new matrix; if there are multiple channels, the conversion may be performed channel by channel in the same way and the results stacked.
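As an illustration only, the following C sketch shows this single-channel unrolling; the function name and the stride-1, no-padding layout are assumptions of this sketch, not details fixed by the application:

```c
#include <stddef.h>

/* Minimal Im2col sketch: stride 1, no padding, single channel.
 * input:  H x W image, row-major.
 * output: (R*S) x (E*F) matrix, where E = H-R+1, F = W-S+1.
 * Each output column holds one RxS convolution window, scanned
 * left to right, top to bottom. */
static void im2col_single_channel(const float *input, int H, int W,
                                  int R, int S, float *output) {
    int E = H - R + 1, F = W - S + 1;    /* output spatial size */
    for (int e = 0; e < E; ++e)          /* window top-left row */
        for (int f = 0; f < F; ++f) {    /* window top-left col */
            int col = e * F + f;         /* output column index */
            for (int r = 0; r < R; ++r)
                for (int s = 0; s < S; ++s)
                    output[(size_t)(r * S + s) * (E * F) + col] =
                        input[(size_t)(e + r) * W + (f + s)];
        }
}
```

With C input channels, running the same unrolling per channel and stacking the rows yields the CRS×EF matrix discussed later in this application.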
S13: if the configuration parameters corresponding to the convolution size in the configuration database are determined to be absent, defining a parameter search space according to the convolution size and the hardware parameters;
Generally, the configuration database contains configuration parameters corresponding to convolution sizes, which can be used to directly obtain the empirically optimal parameter combination, generate code to compute the converted matrix multiplication, and thus obtain the calculation result.
As deep learning applications in data centers and on mobile devices have diversified, the shapes of the matrices appearing in matrix operations have changed dramatically. In addition, various newly developed SoCs keep coming onto the market. The growing number of SoCs with different architectures and the variety of deep learning applications mean that the configuration database often has no configuration parameters corresponding to a given convolution size, and at the same time they aggravate the difficulty for software developers of supporting and optimizing the existing matrix operation libraries.
If, across this large number of hardware configurations and matrix shapes, it is determined that the configuration database has no configuration parameters corresponding to the convolution size, a parameter search space is defined according to the convolution size and the hardware parameters; it stores the configuration parameters and provides a space in which to tune the matrix multiplication.
S14: generating a plurality of operation codes according to configuration parameters in the parameter search space, and calculating matrix multiplication by using the operation codes to obtain a plurality of operation results;
Since a plurality of matrices may correspond to a plurality of convolutions, and each convolution has its own matching properties, the corresponding matrices give rise to a variety of configuration parameters, which are stored in the parameter search space.
Because each configuration parameter has a preset value range, a specific configuration parameter combination can be determined from chosen values; an operation code is thus generated from the configuration parameters in the parameter search space, and computing the matrix multiplication with the operation code yields a corresponding operation result.
When there are several combinations of configuration parameters, several operation codes may be generated and used to compute the matrix multiplication, producing several operation results, which may or may not be the same.
S15: and storing the configuration parameters of the operation codes corresponding to one operation result which meets the preset condition in the plurality of operation results into a configuration database.
In order to select the optimal configuration parameters, conditions need to be imposed on the operation results so as to select those that satisfy a preset condition. The preset condition may be, for example, the time taken by the optimized matrix multiplication, or the performance error of the optimized matrix multiplication, and so on.
When one of the operation results satisfies the preset condition, the configuration parameters of the operation code corresponding to that result can be stored in the configuration database, allowing the configuration database to update and optimize itself.
Thus, the present processing method of convolution operations determines that no corresponding configuration parameters exist in the configuration database, defines a parameter search space according to the convolution size and the hardware parameters, reconstructs and optimizes the matrix multiplication by generating a plurality of operation codes from the configuration parameters in the parameter search space, and computes the matrix multiplication with these operation codes to obtain a plurality of operation results, thereby improving the performance of the matrix multiplication for the convolution operation.
Further, if it is determined that there is no configuration parameter corresponding to the convolution size in the configuration database, a parameter search space is defined according to the convolution size and the hardware parameter, please refer to fig. 2, where fig. 2 is a flowchart of an embodiment of step S13 in fig. 1, and specifically includes the following steps:
S21: judging whether the configuration database has configuration parameters corresponding to the convolution size;
If the convolution operation to be processed has been obtained before, or its convolution size is consistent with a previously obtained convolution operation, the configuration database contains configuration parameters corresponding to the convolution size; if the convolution size in the convolution operation to be processed has changed or is inconsistent, the configuration database has no corresponding configuration parameters. This is what the judgment establishes.
If there are configuration parameters corresponding to the convolution size in the configuration database, indicating that the optimal configuration parameters exist and an efficient matrix multiplication can be completed, the process proceeds to step S22: generating an operation code according to the configuration parameters and computing it to obtain an operation result.
If there are no configuration parameters corresponding to the convolution size in the configuration database, indicating that no optimal configuration parameters exist and the matrix multiplication can only be completed after further search and optimization, the process proceeds to step S23: defining a parameter search space corresponding to the configuration parameters according to the convolution size and the hardware parameters.
The configuration parameters corresponding to the convolution size at least comprise the number of rows M of the first matrix A, the number of columns K of the first matrix A, the number of rows mc of a cache block of the first matrix A, the number of columns kc of a cache block of the first matrix A, the number of columns N of the second matrix B, the number of columns nc of a cache block of the second matrix B, the number of rows m_reg of the register block, the number of columns n_reg of the register block, the prefetch value pre_a of the first matrix A, the prefetch value pre_b of the second matrix B, and the search space tag loopReorder.
Specifically, the number of rows of a cache block of the first matrix A lies in the range [8, max(M, 1024)], where M is the number of rows of the first matrix A; the number of columns of a cache block of the first matrix A lies in the range [8, max(K, 1024)], where K is the number of columns of the first matrix A; the number of columns of a cache block of the second matrix B lies in the range [8, max(N, 1024)], where N is the number of columns of the second matrix B; the number of rows m_reg of the register block is 4 or 8; the number of columns n_reg of the register block is 8, 12 or 16; the prefetch values pre_a of the first matrix A and pre_b of the second matrix B take at least one of the values 0, 32, 64, 128, 256 or 512; and the search space tag loopReorder takes at least one of the values 0, 1, 2 or 3, as detailed in Table 2.
TABLE 2 Runtime parameters and search spaces of the reconfigurable matrix multiplication library

| Runtime parameter | Definition | Value range |
| --- | --- | --- |
| M | No. rows of matrix A | |
| N | No. cols of matrix B | |
| K | No. cols of matrix A | |
| mc | No. rows of cache block of matrix A | [8, max(M, 1024)] |
| nc | No. cols of cache block of matrix B | [8, max(N, 1024)] |
| kc | No. cols of cache block of matrix A | [8, max(K, 1024)] |
| m_reg | No. rows of register block | 4, 8 |
| n_reg | No. cols of register block | 8, 12, 16 |
| pre_a | Prefetch size of matrix A | 0, 32, 64, 128, 256, 512 |
| pre_b | Prefetch size of matrix B | 0, 32, 64, 128, 256, 512 |
| loopReorder | Loop reordering strategy tag | 0, 1, 2, 3 |
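For illustration, these runtime parameters can be pictured as a structure plus candidate sets, as in the C sketch below; the field and array names mirror Table 2, but the struct itself and the flat candidate arrays are assumptions of this sketch, not an interface defined by the application:

```c
/* One parameter combination of Table 2, i.e. the configuration from
 * which one operation code is generated. */
typedef struct {
    int M, N, K;        /* problem size, fixed by the convolution shape */
    int mc, kc, nc;     /* cache block sizes, each in [8, max(dim, 1024)] */
    int m_reg, n_reg;   /* register block (micro-kernel) shape */
    int pre_a, pre_b;   /* prefetch sizes for matrices A and B */
    int loopReorder;    /* loop reordering strategy tag, 0..3 */
} GemmConfig;

/* Candidate values taken directly from Table 2; mc, kc and nc are
 * additionally swept over their [8, max(dim, 1024)] ranges. */
static const int M_REG_SET[]    = {4, 8};
static const int N_REG_SET[]    = {8, 12, 16};
static const int PREFETCH_SET[] = {0, 32, 64, 128, 256, 512};
static const int REORDER_SET[]  = {0, 1, 2, 3};
```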
Further, if there is no configuration parameter corresponding to the convolution size in the configuration database, a parameter search space corresponding to the configuration parameter is defined according to the convolution size and the hardware parameter, please refer to fig. 3, where fig. 3 is a schematic flowchart of an embodiment of step S23 in fig. 2, which specifically includes the following steps:
S31: configuring multiple groups of parameter combinations corresponding to the convolution size according to the hardware parameters to obtain the configuration parameters;
Different hardware parameters lead to different sets of parameter combinations corresponding to the convolution size. To obtain the configuration parameters properly and let the subsequent matrix multiplication run well, the sets of parameter combinations corresponding to the convolution size can be configured according to the hardware parameters, yielding the configuration parameters.
S32: selecting one group from a plurality of groups of parameter combinations;
To speed up the matrix multiplication, one group may first be selected from the multiple groups of parameter combinations to run the matrix multiplication. Several groups may of course also be selected, provided the conditions allow it, such as the available computation space corresponding to the hardware parameters; this can be set according to specific requirements and is not limited here.
S33: and defining a corresponding parameter search space based on the selected group of parameter combinations.
Because each group of parameter combinations can define a corresponding parameter search space, the corresponding parameter search space can be defined based on the selected group; if several groups of parameter combinations are selected, the corresponding parameter search spaces can be defined simultaneously.
Further, a plurality of operation codes are generated in the parameter search space, and the matrix multiplication is performed by using the plurality of operation codes to obtain a plurality of operation results, please refer to fig. 4, where fig. 4 is a flowchart illustrating an embodiment of step S14 in fig. 1, and specifically includes the following steps:
S41: generating a plurality of operation codes corresponding to the convolution in the parameter search space based on the selected parameter combination;
In the parameter search space, each group of parameter combinations can generate one operation code corresponding to the convolution, so multiple groups of parameter combinations can generate multiple operation codes corresponding to the convolution, all operating on the same matrix multiplication.
S42: computing the matrix multiplication with the plurality of operation codes to obtain an I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of operation codes.
Computing the matrix multiplication with one operation code yields one operation result; computing it with a plurality of operation codes yields a plurality of operation results, denoted by the I-th operation result, where I is a positive integer greater than 1 and less than or equal to the number of operation codes.
Specifically, the operation codes may be queued and used one after another to compute the matrix multiplication, yielding the operation results in turn, or the operation codes may be used simultaneously so that all operation results are obtained at once.
Further, the configuration parameters of the operation code corresponding to one operation result satisfying the preset condition among the operation results are stored in the configuration database, please refer to fig. 5, where fig. 5 is a flowchart illustrating an embodiment of step S15 in fig. 1, and the method specifically includes the following steps:
S51: judging whether the first operation result and/or the I-th operation result satisfies a preset condition, the preset condition at least comprising that the matrix multiplication computation time is the shortest among the operation results;
To screen the I-th operation result, a preset condition is set, which may be specified more concretely; at minimum it may require that the matrix multiplication computation time be the shortest among the operation results. That is, each I-th operation result is the run time obtained by executing its operation code on the matrix multiplication, and the preset condition is to find the shortest of these run times, thereby selecting the corresponding operation result.
If the first operation result and/or the ith operation result satisfy the preset condition, the method proceeds to step S52: storing the configuration parameters corresponding to the first operation result and/or the I-th operation result in a configuration database; if the first operation result and/or the ith operation result do not satisfy the preset condition, the method proceeds to step S53: and discarding the configuration parameters corresponding to the I-th operation result, and storing the configuration parameters corresponding to the first operation result into a configuration database.
Specifically, during screening, the operation results may also be compared sequentially. For example, suppose the first operation result represents a run time of 10 seconds, the second 3 seconds, and the third 5 seconds; since 3 seconds is shorter than both 10 seconds and 5 seconds, the configuration parameters corresponding to the second operation result are optimal. There may of course be several comparison schemes, selected according to specific requirements, which is not limited here. Usually, through reconstruction and self-optimization of the matrix multiplication, the performance of the reconfigurable matrix multiplication library corresponding to the configuration database on different convolution layers can be improved by 2%-17%.
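A minimal sketch of this selection step is given below; run_gemm_with() is a hypothetical helper standing for "generate the operation code for one parameter combination, run the matrix multiplication with it, and time it", and GemmConfig is the structure sketched after Table 2:

```c
#include <float.h>

/* Hypothetical helper: run the matrix multiplication with the
 * operation code generated from cfg and return the measured
 * run time in seconds. */
double run_gemm_with(const GemmConfig *cfg);

/* Return the index of the candidate whose matrix multiplication ran
 * fastest; only the winner's configuration parameters are kept for
 * the configuration database, the others are discarded. */
static int select_best(const GemmConfig *cands, int n) {
    int best = 0;
    double best_t = DBL_MAX;
    for (int i = 0; i < n; ++i) {
        double t = run_gemm_with(&cands[i]);
        if (t < best_t) {
            best_t = t;
            best = i;
        }
    }
    return best;
}
```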
For a better understanding of the solution of the present application, the convolution processing method is described below taking convolution on an ARM-architecture CPU as an example; here "convolution operation" and "convolution calculation" refer to the same thing. An object of this embodiment is to provide an accelerated convolution library that can automatically generate optimized convolution code to explore the best performance on the latest ARM-based hardware architectures from different vendors. Because convolution can be converted into matrix multiplication by the Im2col algorithm, the convolution library is optimized primarily for matrix multiplication, integrating it into a parameterized reconfigurable library used to search for the best combination of runtime parameters, including the register kernel shape, the cache block size and the scheduling policy, for any given convolution shape and hardware target.
The method comprises the following design characteristics:
(1) Since the convolution calculation can be converted into matrix multiplication through the Im2col algorithm, the present application mainly optimizes the matrix multiplication.
(2) A reconfigurable matrix multiplication library. Convolution calculations are converted to matrix multiplications by the Im2col algorithm, so the present application designs a reconfigurable library for matrix multiplication with a multi-level code cache hierarchy. It is used to search and reproduce all possible code structures, including various combinations of register kernel shape, cache block size, loop-order scheduling, reordering strategy, memory access pattern, and online/offline computation. This reconfigurable matrix multiplication library reduces the workload of manual tuning.
(3) An automatic optimization method based on a configurable algorithm library. After the generated microkernel is embedded into a reconfigurable matrix multiplication library, a convolution calculation library can be constructed by combining with an Im2col algorithm. It can use the auto-tuning strategy to search all parameter configurations to obtain the best performance given the hardware specification and convolution problem size. The optimal parameter configuration may be stored and reused. This convolution computation library can be embedded into an existing deep learning framework software stack. The convolution computation library does not need to be optimized manually for various combinations formed by different hardware specifications and convolution shapes.
Referring to fig. 6 and 7, the design idea of the present application is described below. FIG. 6 is a schematic diagram of the matrix framework of an embodiment of the convolution processing method of the present application, and FIG. 7 is a schematic diagram of the block structure of the matrix of the present application.
1. Converting convolution calculations to matrix multiplications using Im2col algorithm
Similar to matrix multiplication, convolution multiplies each convolution kernel tensor element-wise with a patch of the input image and then accumulates the results over all input channels. Thanks to its high ratio of computation to memory access, matrix multiplication is well optimized for large square matrices. The steps of the Im2col algorithm are shown in fig. 6. The convolution kernel tensor F is reshaped into a matrix of size K×CRS, and the original input image is copied patch by patch into a matrix of size CRS×EF. These two steps convert the convolution into a matrix multiplication, and an output matrix of size K×EF is then obtained with a single matrix multiplication. However, the matrices generated by the Im2col algorithm in deep learning come in various shapes, such as elongated strips; a fixed microkernel with fixed scheduling of data placement and data processing may not provide an optimal solution for matrix operations of these different shapes.
Specifically, F_m can be taken as the rearrangement of the convolution kernels, and D_m as the corresponding rearrangement of the input assembled channel by channel (red with red, green with green, blue with blue); multiplying F_m by D_m then yields the output O_m.
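In matrix form this reads as follows (a sketch using the dimensions above, with E and F here denoting the output height and width):

$$O_m = F_m\,D_m,\qquad F_m \in \mathbb{R}^{K\times CRS},\quad D_m \in \mathbb{R}^{CRS\times EF},\quad O_m \in \mathbb{R}^{K\times EF}.$$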
2. Building a reconfigurable matrix multiplication library framework and automatic optimization
First, a reconfigurable matrix multiplication framework needs to be built. Matrix multiplication has many high-performance implementations. According to the analysis in papers such as Goto's, the method named "GEPB" performs significantly better than the others in the row-major case. The "GEPB" method is therefore chosen to implement the reconfigurable matrix multiplication framework. The steps of the GEPB method are as follows:
(1) First, matrix A is partitioned by columns into a number of column matrix blocks, and matrix B is partitioned by rows into a number of row matrix blocks.
(2) Then, the memory of one column matrix block of matrix A is rearranged, and a row matrix block of matrix B is partitioned by columns into a number of column blocks.
(3) The column matrix block of matrix A is partitioned by rows into a number of row blocks, and the memory of the small partition of matrix B is then rearranged.
(4) The column matrix block of matrix A is multiplied with the small block of matrix B, and the result is written back to matrix C.
(5) The above steps are repeated until the computation is complete.
However, the conventional GEPB method is not specifically optimized for different hardware platforms, so optimization strategies that can adapt to different hardware platforms need to be added, forming a reconfigurable matrix multiplication framework. The TLB and cache information of the CPU affect the cache block sizes, register block sizes and data prefetching in the matrix multiplication, which in turn seriously affect its performance, especially since the matrices produced by the Im2col algorithm in deep learning are usually long strips. However, the TLB and cache details differ across hardware configurations, and under different convolution shapes it is difficult to predict matrix multiplication performance from TLB and cache information alone. Therefore, several cache- and TLB-related runtime parameters are extracted from the matrix multiplication implementation, as shown in Table 2. To control which blocks of the matrices are kept in which caches (L1, L2, L3, etc.), a loop reordering method is also used to control the memory access pattern of the matrix multiplication subroutine. Loop reordering over the cache blocks determines, for instance, whether a block of matrix A is scanned once and kept in the L1 cache while a block of matrix B is scanned multiple times and kept in the L2 cache, or the other way around. With these runtime parameters, a reconfigurable matrix multiplication library can be built, and the automatic optimization process is performed on this matrix multiplication framework.
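As a rough illustration of where the cache-block parameters mc, kc and nc enter, the following C sketch shows one GEPB-style blocked loop nest; it is a naive reference version under one fixed loop order, whereas the reconfigurable library would pack panels, call a generated m_reg×n_reg micro-kernel, issue pre_a/pre_b prefetches, and permute the loops according to loopReorder:

```c
#include <stddef.h>

/* Naive GEPB-style blocked GEMM sketch: C += A * B, all row-major.
 * A is M x K, B is K x N, C is M x N; mc/kc/nc are the cache block
 * sizes of Table 2. The three innermost loops stand in for the
 * generated micro-kernel of the real library. */
static void gemm_blocked(const float *A, const float *B, float *C,
                         int M, int N, int K,
                         int mc, int kc, int nc) {
    for (int jc = 0; jc < N; jc += nc)           /* nc-wide panel of B  */
        for (int pc = 0; pc < K; pc += kc)       /* kc-deep slice       */
            for (int ic = 0; ic < M; ic += mc) { /* mc-tall block of A  */
                int nb = nc < N - jc ? nc : N - jc;
                int kb = kc < K - pc ? kc : K - pc;
                int mb = mc < M - ic ? mc : M - ic;
                for (int i = 0; i < mb; ++i)
                    for (int j = 0; j < nb; ++j) {
                        float acc = C[(size_t)(ic + i) * N + (jc + j)];
                        for (int p = 0; p < kb; ++p)
                            acc += A[(size_t)(ic + i) * K + (pc + p)]
                                 * B[(size_t)(pc + p) * N + (jc + j)];
                        C[(size_t)(ic + i) * N + (jc + j)] = acc;
                    }
            }
}
```

A different loopReorder value would permute the jc/pc/ic order, which is exactly what decides whether a block of A or a block of B stays resident in the L1 cache.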
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating a processing method of convolution operation according to an embodiment of the present application; the whole flow of convolution calculation and automatic optimization is shown in fig. 8, and the steps of automatic optimization are as follows:
S801: start;
S802: executing the Im2col algorithm to convert the convolution operation into a matrix multiplication;
S803: checking whether the optimal parameters exist in the local database;
First, the local configuration database is queried to see whether optimal configuration parameters exist for the current hardware configuration and convolution size; the optimal configuration parameters would be the best result obtained the last time the matrix multiplication was run under the same hardware configuration and convolution size. If an optimal parameter configuration already exists, the process proceeds to step S804: generating code from the optimal parameter combination, and then to step S805: running the matrix multiplication with the operation code to obtain the operation result. If not, the process proceeds to step S806.
S806: selecting a parameter combination from the parameter search space;
In the parameter search space, with several types of configuration parameters there can be many groups of parameter combinations, but each group produces only one operation result; to simplify the comparison of subsequent operation results, one parameter combination is selected at a time.
S807: generating an operation code according to the parameter combination;
if the optimal parameter configuration does not exist, a parameter configuration search space is defined by the specific hardware parameters and the convolution sizes, and an operation code is generated according to parameter combination and used for executing an algorithm.
S808: using the operation code to operate matrix multiplication to obtain an operation result;
s809: whether the operation result is better than the original optimal result;
if the operation result is the first operation result, there is no original optimal result, and no comparison is needed, if there is an original optimal result, the operation result may be compared with the original optimal result, and if the operation result is better than the original optimal result, the process proceeds to step S810: and updating the configuration database, searching in the whole parameter configuration space, and if the performance is better than the current best performance, updating the optimal parameter configuration until the whole parameter configuration space is searched. Then, the process proceeds to step S811: whether the traversal search is finished or not; if the traversal search is completed, the process proceeds to step S812: after the tuning is finished, storing the optimal code;
S813: end.
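Putting the flow of FIG. 8 together, a schematic autotuning driver might look like the C sketch below; lookup_config() and store_config() are hypothetical stand-ins for the configuration-database operations of steps S803 and S810, and run_gemm_with() is the timing helper assumed earlier:

```c
#include <stdbool.h>

/* Hypothetical configuration-database operations standing for steps
 * S803 and S810; not an interface defined by the application. */
bool lookup_config(int M, int N, int K, GemmConfig *out);
void store_config(const GemmConfig *cfg);

/* Autotuning flow of FIG. 8: reuse a stored optimum when one exists,
 * otherwise traverse the parameter search space and remember the best. */
static GemmConfig autotune(int M, int N, int K,
                           const GemmConfig *space, int n_cands) {
    GemmConfig cached;
    if (lookup_config(M, N, K, &cached))      /* S803: optimum known    */
        return cached;                        /* S804/S805: reuse it    */

    GemmConfig best = space[0];               /* assumes n_cands >= 1   */
    double best_t = run_gemm_with(&best);     /* S807/S808              */
    for (int i = 1; i < n_cands; ++i) {       /* S806..S811: traverse   */
        double t = run_gemm_with(&space[i]);
        if (t < best_t) {                     /* S809: new best found   */
            best_t = t;
            best = space[i];
        }
    }
    store_config(&best);                      /* S810/S812: save optimum */
    return best;
}
```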
A reconfigurable matrix multiplication framework is built to adapt to ARM-architecture CPU platforms with different characteristics, and several key performance-related runtime parameters are extracted from a common matrix multiplication, so that the library can generate better-performing matrix multiplication code for different hardware architectures and convolution shapes and complete the convolution calculation efficiently.
The poor performance of long strip-shaped matrix multiplication can be addressed by the automatic optimization technique within the reconfigurable framework: searching all parameter combinations for the best one under the current hardware architecture and convolution shape yields a better-performing matrix multiplication, and thus sufficiently good performance for the convolution calculation.
To confirm that the scheme is effective and feasible, performance tests were carried out on Huawei's server-grade Kunpeng 920 chip (a server generally outperforms a mobile phone). The hardware configuration of the Kunpeng 920 is shown in Table 3 below:
table 3 hardware configuration of Kunpeng920
| Processor | #CPUs@Clock Speed | Memory | Compiler |
| --- | --- | --- | --- |
| Kunpeng 920 | 8@2.60GHz (64KB L1, 512KB L2, 32MB shared L3) | 16GB DDR4 | GCC 9.3.0 |
The deep learning network model selected for the test is VGG-16. Since the performance of the Im2col step is the same as in other prior art, we directly test and compare the matrix multiplication part of the convolution calculation. Our matrix multiplication implementation, FastConv-GEMM, was compared with AutoTVM, AutoTVM + LIBXSMM, AutoTVM + OpenBLAS and OpenBLAS. The implementation of the present application comes in two versions: FastConv without autotuning, meaning the reconfigurable matrix multiplication library is run directly with the default parameter configuration; and FastConv with autotuning, meaning automatic optimization is first performed in the reconfigurable matrix multiplication library to find the optimal parameter configuration under the current hardware architecture and convolution size, and the test is then run with that optimal configuration.
Referring to fig. 9, fig. 9 is a schematic diagram of the results of the convolution processing method of the present application, showing the speedup ratios on the Kunpeng 920 with AutoTVM as the baseline; AutoTVM, using its own Halide-generated kernels on the Kunpeng 920, serves as the reference for all six competitors. Compared with AutoTVM tuning on top of OpenBLAS and LIBXSMM, the OpenBLAS library itself performs better, indicating that on the Kunpeng 920 AutoTVM's machine-learning scheduling cannot improve the efficiency of an existing high-performance matrix library. However, inspired by AutoTVM, our reconfigurable matrix multiplication library with the default parameter configuration ranks second among all approaches, behind only the same library after autotuning. After automatic optimization, the performance of the reconfigurable matrix multiplication library on different convolution layers improves by about 2%-17%; the "Speedup over AutoTVM" line in FIG. 9 shows the ratio of the optimized to the unoptimized version of our scheme. In the VGG-16 network model, the matrices generated by the convolutions of the middle layers are mostly square, while those of the first and last layers are mostly strips. It can therefore be concluded from the figure that, after automatic optimization of the reconfigurable matrix multiplication library, the performance of strip-shaped matrix multiplication improves significantly, while square matrix multiplication is not adversely affected.
Further, please refer to fig. 10, wherein fig. 10 is a schematic block diagram of an embodiment of a mobile terminal of the present application. The embodiment of the present application provides a mobile terminal 2, which includes a processor 21 and a memory 22, where the memory 22 stores a computer program 221, and the processor 21 is configured to execute the computer program 221 to perform the processing method according to the first aspect of the embodiment of the present application, which is not described herein again.
Referring to fig. 11, fig. 11 is a schematic block diagram of an embodiment of a computer-readable storage medium of the present application. The methods above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in the computer-readable storage medium 30. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied wholly or partly in the form of a software product stored on a storage device, including several instructions (computer program 31) for causing a computer device (a personal computer, server, network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage device includes various media that can store program code, such as a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc, as well as electronic devices equipped with such storage media, such as computers, mobile phones, notebook computers, tablet computers and cameras.
The description of the execution process of the computer program in the computer-readable storage medium may refer to the foregoing embodiment of the processing method of the mobile terminal 2 in the present application, and will not be repeated here.
The above description is only a part of the embodiments of the present application, and not intended to limit the scope of the present application, and all equivalent devices or equivalent processes performed by the content of the present application and the attached drawings, or directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (4)
1. A method for processing convolution operations, the method comprising:
acquiring convolution operation to be processed and a configuration database;
converting the convolution operation into a matrix multiplication, the matrix multiplication corresponding to a convolution size, comprising: executing an Im2col algorithm on the convolution to convert the convolution operation into a matrix multiplication calculation corresponding to the convolution, wherein the matrix multiplication corresponds to the convolution size;
judging whether the configuration database has configuration parameters corresponding to the convolution sizes;
if the configuration database has the configuration parameters corresponding to the convolution sizes, generating operation codes according to the configuration parameters and calculating to obtain operation results;
if it is determined that there is no configuration parameter corresponding to the convolution size in the configuration database, defining a parameter search space according to the convolution size and a hardware parameter, including: configuring multiple groups of parameter combinations corresponding to the convolution size according to the hardware parameter to obtain the configuration parameters; selecting one group from the multiple groups of parameter combinations; and defining a corresponding parameter search space based on the selected group of parameter combinations; wherein the configuration parameters corresponding to the convolution size at least comprise the number of rows of a first matrix, the number of columns of the first matrix, the number of rows of a cache block of the first matrix, the number of columns of a cache block of the first matrix, the number of columns of a second matrix, the number of columns of a cache block of the second matrix, the number of rows of a register block, the number of columns of the register block, a prefetch value of the first matrix, a prefetch value of the second matrix and a search space tag;
generating a plurality of operation codes according to the configuration parameters in the parameter search space, and calculating the matrix multiplication by using the operation codes to obtain a plurality of operation results, comprising:
generating a plurality of operation codes corresponding to the convolution in the parameter search space based on the selected parameter combination;
calculating the matrix multiplication by using the plurality of operation codes to obtain a first operation result and an I operation result, wherein I is a positive integer greater than 1 and is less than or equal to the number of the operation codes;
storing, in the configuration database, the configuration parameters of the operation code corresponding to an operation result that satisfies a preset condition among the plurality of operation results, comprising:
judging whether the first operation result and/or the I-th operation result meet a preset condition, wherein the preset condition at least comprises that the time period of the matrix multiplication calculation is the shortest of a plurality of operation results;
if the first operation result and/or the I-th operation result meet a preset condition, storing the configuration parameters corresponding to the first operation result and/or the I-th operation result in the configuration database;
if the first operation result and/or the I-th operation result do not meet the preset condition, the configuration parameters corresponding to the I-th operation result are abandoned, and the configuration parameters corresponding to the first operation result are stored in the configuration database.
2. The processing method according to claim 1,
the number of rows of a cache block of the first matrix lies in the range [8, max(M, 1024)], where M is the number of rows of the first matrix; the number of columns of a cache block of the first matrix lies in the range [8, max(K, 1024)], where K is the number of columns of the first matrix; the number of columns of a cache block of the second matrix lies in the range [8, max(N, 1024)], where N is the number of columns of the second matrix; the number of rows of the register block is 4 or 8; the number of columns of the register block is 8, 12 or 16; the prefetch values of the first matrix and of the second matrix comprise at least one of 0, 32, 64, 128, 256 or 512; and the search space tag value comprises at least one of 0, 1, 2 or 3.
3. A mobile terminal, comprising: a processor and a memory, the memory having stored therein a computer program for execution by the processor to implement the processing method of claim 1 or 2.
4. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is capable of implementing the processing method of claim 1 or 2 when the computer program is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110553804.4A CN113392957B (en) | 2021-05-20 | 2021-05-20 | Convolution operation processing method, electronic equipment, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113392957A CN113392957A (en) | 2021-09-14 |
CN113392957B true CN113392957B (en) | 2023-01-17 |
Family
ID=77618397
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110553804.4A Active CN113392957B (en) | 2021-05-20 | 2021-05-20 | Convolution operation processing method, electronic equipment, mobile terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113392957B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118113972A (en) * | 2022-11-30 | 2024-05-31 | Huawei Technologies Co., Ltd. | Operation resource processing method and related equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112561029A (en) * | 2019-09-26 | 2021-03-26 | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences | Multithreading data processing method, accelerator and system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10255547B2 (en) * | 2014-12-04 | 2019-04-09 | Nvidia Corporation | Indirectly accessing sample data to perform multi-convolution operations in a parallel processing system |
US10067910B2 (en) * | 2016-07-01 | 2018-09-04 | Palo Alto Research Center Incorporated | System and method for GPU maximum register count optimization applied to general matrix-matrix multiplication |
CN106844294B (en) * | 2016-12-29 | 2019-05-03 | Huawei Machine Co., Ltd. | Convolution algorithm chip and communication equipment |
CN107392308B (en) * | 2017-06-20 | 2020-04-03 | Institute of Computing Technology, Chinese Academy of Sciences | Convolutional neural network acceleration method and system based on programmable device |
WO2020050886A1 (en) * | 2018-09-05 | 2020-03-12 | Futurewei Technologies, Inc. | Compiler-level general matrix multiplication configuration optimization |
CN111882035A (en) * | 2020-07-21 | 2020-11-03 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Super network searching method, device, equipment and medium based on convolution kernel |
- 2021-05-20 CN CN202110553804.4A patent/CN113392957B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113392957A (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10963292B2 (en) | Techniques to manage virtual classes for statistical tests | |
Guan et al. | FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates | |
US10489703B2 (en) | Memory efficiency for convolutional neural networks operating on graphics processing units | |
Sedaghati et al. | Automatic selection of sparse matrix representation on GPUs | |
Lu et al. | Optimizing depthwise separable convolution operations on gpus | |
US20150324441A1 (en) | System and method for high performance k-means clustering on gpu with smart kernels | |
Chen | Escoin: Efficient sparse convolutional neural network inference on gpus | |
Gutiérrez et al. | GPU-SME-kNN: Scalable and memory efficient kNN and lazy learning using GPUs | |
US20220303176A1 (en) | Efficient optimization for neural network deployment and execution | |
Dong et al. | Characterizing the microarchitectural implications of a convolutional neural network (cnn) execution on gpus | |
CN113392957B (en) | Convolution operation processing method, electronic equipment, mobile terminal and storage medium | |
US12033035B2 (en) | Method and apparatus for predicting kernel tuning parameters | |
Zardoshti et al. | Adaptive sparse matrix representation for efficient matrix–vector multiplication | |
Aghapour et al. | CPU-GPU layer-switched low latency CNN inference | |
Meng et al. | Automatic generation of high-performance convolution kernels on ARM CPUs for deep learning | |
US20090064120A1 (en) | Method and apparatus to achieve maximum outer level parallelism of a loop | |
US11361050B2 (en) | Assigning dependent matrix-vector multiplication operations to consecutive crossbars of a dot product engine | |
Metz et al. | ML-based power estimation of convolutional neural networks on GPGPUs | |
Shivdikar | SMASH: Sparse matrix atomic scratchpad hashing | |
Sakr et al. | Memory-efficient CMSIS-NN with replacement strategy | |
Du et al. | Handling heavy-tailed input of transformer inference on GPUS | |
Wasti et al. | LoopStack: a Lightweight Tensor Algebra Compiler Stack | |
Li et al. | Autotsmm: An auto-tuning framework for building high-performance tall-and-skinny matrix-matrix multiplication on cpus | |
WO2022241725A1 (en) | Convolution operation processing method, and electronic device, mobile terminal and storage medium | |
Qu et al. | A coordinated model pruning and mapping framework for rram-based dnn accelerators |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||