CN110751263B - High-parallelism convolution operation access method and circuit - Google Patents

High-parallelism convolution operation access method and circuit

Info

Publication number
CN110751263B
Authority
CN
China
Prior art keywords
unit
data
line
reading
cache
Prior art date
Legal status
Active
Application number
CN201910848453.2A
Other languages
Chinese (zh)
Other versions
CN110751263A (en)
Inventor
廖裕民
郑柏春
Current Assignee
Rockchip Electronics Co Ltd
Original Assignee
Rockchip Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Rockchip Electronics Co Ltd filed Critical Rockchip Electronics Co Ltd
Priority to CN201910848453.2A priority Critical patent/CN110751263B/en
Publication of CN110751263A publication Critical patent/CN110751263A/en
Application granted granted Critical
Publication of CN110751263B publication Critical patent/CN110751263B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The invention provides a high-parallelism convolution operation access method and circuit, wherein the method comprises the following steps: the first reading unit reads first data from the main storage unit and sequentially writes the read first data into a first line cache group; after the first line cache group is filled, the first reassembly control unit assembles the first data cached in each cache line of the first line cache group into a reassembled single line of first data, and writes that single line into the first output cache unit; the second reading unit and the second reassembly control unit read second data from the main storage unit and write it into the second output buffer unit in the same manner; and the multiply-add array unit acquires the first data in the current first output cache unit and the second data in the current second output cache unit, performs the multiply-add operation, and outputs the operation result. The scheme effectively improves data-reading parallelism and thus the efficiency of the convolution operation.

Description

High-parallelism convolution operation access method and circuit
Technical Field
The invention relates to the field of neural network circuits, in particular to a high-parallelism convolution operation access method and circuit.
Background
With the rapid development of the artificial intelligence industry, users place increasingly demanding requirements on the operation speed and power consumption of neural networks. The convolutional neural network is the most important type of neural network, and its core convolution operation occupies most of a neural network acceleration circuit, so the efficiency and circuit area of the convolution operation directly determine the final efficiency and circuit area of the whole acceleration circuit. However, prior-art convolution operation circuits do not handle the design problems of efficient data multiplexing and low power consumption under high parallelism well, so the area and power consumption of neural network operation circuits remain high while efficiency remains low.
Disclosure of Invention
Therefore, a technical scheme for high-parallelism convolution operation access needs to be provided, to solve the problem of low data-reading efficiency in the operation of conventional convolution operation circuits.
In order to achieve the above object, the inventor provides a convolution operation access circuit with high parallelism, which includes a reading unit, a recombination unit, an output buffer unit and a multiply-add array unit; the reading unit comprises a first reading unit and a second reading unit, the recombination unit comprises a first recombination unit and a second recombination unit, and the output buffer unit comprises a first output buffer unit and a second output buffer unit; the first reading unit is connected with the first recombination unit, and the first recombination unit is connected with the first output buffer unit; the second reading unit is connected with the second recombination unit, and the second recombination unit is connected with the second output buffer unit; the first output buffer unit and the second output buffer unit are respectively connected with the multiply-add array unit;
the first recombination unit comprises a first recombination control unit and a first line cache group, the first line cache group comprises a plurality of first line caches, and each first line cache is connected with the first recombination control unit; the second recombination unit comprises a second recombination control unit and a second line cache group, the second line cache group comprises a plurality of second line caches, and each second line cache is connected with the second recombination control unit;
the first reading unit is used for reading first data from the main storage unit and sequentially writing the read first data into a first line cache group;
the first reassembly control unit is used for assembling the first data cached by each cache line in the first line cache group after the first line cache group is filled up to obtain a reassembled single-line first data, and writing the single-line first data into the first output cache unit;
the second reading unit is used for reading second data from the main storage unit and sequentially writing the read second data into a second line cache group;
the second reassembly control unit is configured to, after the second line cache group is filled up, assemble the second data cached by each cache line in the second line cache group to obtain a reassembled single-line second data, and write the single-line second data into the second output cache unit;
and the multiply-add array unit is used for acquiring the first data in the current first output cache unit and the second data in the second output cache unit after receiving the convolution operation starting signal, and outputting an operation result after carrying out multiply-add operation.
Further, there are two first line cache groups and two second line cache groups;
when one first line cache group is filled with first data, the first reading unit is used for continuously reading the first data from the main storage unit and writing the first data into another first line cache group;
when one second line cache group is filled with the second data, the second reading unit is used for continuously reading the second data from the main storage unit and writing the second data into another second line cache group.
Further, the circuit further comprises a read control balancing unit;
the reading control balancing unit is used for controlling the first reading unit to stop reading the first data from the main storage unit when judging that two first line cache groups are both filled with the first data and two second line cache groups are not filled with the second data, and restoring the first reading unit to continue reading the data from the main storage unit when judging that at least one of the two second line cache groups is filled with the second data;
or, the read control balancing unit is configured to control the second reading unit to stop reading the second data from the main storage unit when determining that both second line buffer groups are filled with the second data and neither first line buffer group is filled with the first data, and to let the second reading unit resume reading data from the main storage unit when determining that at least one of the two first line buffer groups is filled with the first data.
Furthermore, the circuit also comprises an operation effective judgment unit which is respectively connected with the first output cache unit, the second output cache unit and the multiplication and addition array unit;
and the operation validity judging unit is used for sending a convolution operation starting signal to the multiply-add array unit after receiving a first data output valid signal sent by the first output buffer unit and a second data output valid signal sent by the second output buffer unit.
Furthermore, the circuit also comprises a mode configuration unit which is respectively connected with the first reading unit and the second reading unit;
the mode configuration unit is used for correspondingly adjusting, according to the mode configuration signal, the number of first line caches in the first line cache group and the number of second line caches in the second line cache group that participate in the operation, and correspondingly adjusting the storage length range of the first output buffer unit and the second output buffer unit that participates in the operation.
The inventor provides a high-parallelism convolution operation access method, which is applied to a high-parallelism convolution operation access circuit, wherein the circuit comprises a reading unit, a recombination unit, an output buffer unit and a multiply-add array unit; the reading unit comprises a first reading unit and a second reading unit, the recombination unit comprises a first recombination unit and a second recombination unit, and the output buffer unit comprises a first output buffer unit and a second output buffer unit; the first reading unit is connected with a first recombination unit, and the first recombination unit is connected with a first output cache unit; the second reading unit is connected with the second recombination unit, and the second recombination unit is connected with the second output buffer unit; the first output buffer unit and the second output buffer unit are respectively connected with the multiply-add array unit;
the first recombination unit comprises a first recombination control unit and a first line cache group, the first line cache group comprises a plurality of first line caches, and each first line cache is connected with the first recombination control unit; the second recombination unit comprises a second recombination control unit and a second line cache group, the second line cache group comprises a plurality of second line caches, and each second line cache is connected with the second recombination control unit;
the method comprises the following steps:
the first reading unit reads first data from the main storage unit and sequentially writes the read first data into a first line cache group;
after the first line cache group is filled, the first reassembly control unit assembles the first data cached in each cache line of the first line cache group to obtain a reassembled single line of first data, and writes the single line of first data into the first output cache unit;
the second reading unit reads second data from the main storage unit and sequentially writes the read second data into a second line cache group;
after the second line cache group is filled, the second reassembly control unit assembles the second data cached in each cache line of the second line cache group to obtain a reassembled single line of second data, and writes the single line of second data into the second output cache unit;
and the multiplication and addition array unit acquires the first data in the current first output cache unit and the second data in the second output cache unit after receiving the convolution operation starting signal, and outputs an operation result after carrying out multiplication and addition operation.
Further, there are two first line cache groups and two second line cache groups; the method comprises the following steps:
when one first line cache group is filled with first data, the first reading unit continuously reads the first data from the main storage unit and writes the first data into another first line cache group;
when one second line cache group is filled with the second data, the second reading unit continues to read the second data from the main storage unit and write the second data into another second line cache group.
Further, the circuit further comprises a read control balancing unit; the method comprises the following steps:
the reading control balancing unit controls the first reading unit to stop reading the first data from the main storage unit when judging that the two first line cache groups are both filled with the first data and the two second line cache groups are not filled with the second data, and restores the first reading unit to continue reading the data from the main storage unit when judging that at least one of the two second line cache groups is filled with the second data;
or, the reading control balancing unit controls the second reading unit to stop reading the second data from the main storage unit when judging that both second line cache groups are filled with the second data and neither first line cache group is filled with the first data, and restores the second reading unit to continue reading data from the main storage unit when judging that at least one of the two first line cache groups is filled with the first data.
Furthermore, the circuit also comprises an operation effective judgment unit which is respectively connected with the first output cache unit, the second output cache unit and the multiplication and addition array unit;
the operation effective judging unit is used for sending a convolution operation starting signal to the multiply-add array unit after receiving a first data output effective signal sent by the first output buffer unit and a second data output effective signal sent by the second data output buffer unit.
Furthermore, the circuit also comprises a mode configuration unit which is respectively connected with the first reading unit and the second reading unit;
the mode configuration unit is used for correspondingly adjusting, according to the mode configuration signal, the number of first line caches in the first line cache group and the number of second line caches in the second line cache group that participate in the operation, and correspondingly adjusting the storage length range of the first output buffer unit and the second output buffer unit that participates in the operation.
The high-parallelism convolution operation access method and circuit in the technical scheme comprise the following steps: the first reading unit reads first data from the main storage unit and sequentially writes the read first data into a first line cache group; after the first line cache group is filled, the first reassembly control unit assembles the first data cached in each cache line of the first line cache group to obtain a reassembled single line of first data, and writes the single line of first data into the first output cache unit; the second reading unit reads the second data from the main storage unit and sequentially writes the read second data into a second line cache group; after the second line cache group is filled, the second reassembly control unit assembles the second data cached in each cache line of the second line cache group to obtain a reassembled single line of second data, and writes the single line of second data into the second output cache unit; and the multiply-add array unit, after receiving the convolution operation start signal, acquires the first data in the current first output cache unit and the second data in the second output cache unit, performs the multiply-add operation, and outputs the operation result. On the basis of guaranteeing high-parallelism convolution operation, the scheme still maintains high operation efficiency and data multiplexing, and greatly reduces the data bandwidth requirement and power consumption, thereby reducing the area and power consumption of the whole neural network circuit.
Drawings
FIG. 1 is a diagram illustrating a high-parallelism access circuit for convolution operations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a reading unit reading from a main storage unit according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a reading unit reading from a main storage unit according to an embodiment of the present invention;
FIG. 4 is a diagram of a high-parallelism access circuit for convolution operations according to another embodiment of the present invention;
FIG. 5 is a flowchart of a high-parallelism access method for convolution operations according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Winograd algorithm according to an embodiment of the present invention;
FIG. 7 is a diagram of a multiply-add array unit according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a first matrix operation unit according to an embodiment of the present invention;
description of the reference numerals:
10. a convolution operation access circuit with high parallelism; 20. a main storage unit;
101. a first reading unit;
102. a second reading unit;
103. a first recombination unit; 1031. a first recombination control unit; 1032. a first line cache set;
104. a second recombination unit; 1041. a second reassembly control unit; 1042. a second line cache set;
105. a first output buffer unit;
106. a second output buffer unit;
107. a multiply-add array unit; 1071. a first matrix operation unit; 1072. a second matrix operation unit;
108. a read control balancing unit;
109. an operation validity judgment unit;
110. a first adder; 111. a second adder; 112. a first complement arithmetic unit; 113. a second complement arithmetic unit.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Fig. 1 is a schematic diagram of a high-parallelism convolution operation access circuit according to an embodiment of the present invention. The high-parallelism convolution operation access circuit 10 comprises a reading unit, a recombination unit, an output buffer unit and a multiplication and addition array unit; the reading units comprise a first reading unit 101 and a second reading unit 102, the recombination unit comprises a first recombination unit 103 and a second recombination unit 104, and the output buffer unit comprises a first output buffer unit 105 and a second output buffer unit 106; the first reading unit 101 is connected with the first reorganizing unit 103, and the first reorganizing unit 103 is connected with the first output buffer unit 105; second reading unit 102 is connected to second reconstructing unit 104, and second reconstructing unit 104 is connected to second output buffer unit 106; the first output buffer unit 105 and the second output buffer unit 106 are respectively connected with a multiplication and addition array unit 107;
the first reassembly unit 103 includes a first reassembly control unit 1031 and a first line cache set 1032, where the first line cache set 1032 includes a plurality of first line caches, and each first line cache is connected to the first reassembly control unit; the second reassembly unit 104 includes a second reassembly control unit 1041 and a second line cache set 1042, where the second line cache set 1042 includes a plurality of second line caches, and each second line cache is connected to the second reassembly control unit;
the first reading unit 101 is configured to read first data from the main storage unit 20, and sequentially write the read first data into the first row buffer group 1032;
the first reassembly control unit 1031 is configured to, after the first line cache group 1032 is filled up, assemble the first data cached by each cache line in the first line cache group 1032 to obtain a single line of reassembled first data, and write the single line of first data into the first output cache unit 105;
the second reading unit 102 is configured to read second data from the main storage unit 20, and sequentially write the read second data into the second row buffer set 1042;
the second reassembly control unit 1041 is configured to, after the second line cache set 1042 is filled up, assemble the second data cached in each cache line in the second line cache set 1042 to obtain a reassembled single-line second data, and write the single-line second data into the second output cache unit 106;
the multiply-add array unit 107 is configured to, after receiving the convolution operation start signal, obtain the first data in the first output buffer unit 105 and the second data in the second output buffer unit 106, perform multiply-add operation, and output an operation result.
In the present embodiment, the main storage unit is the memory in which the first data and the second data are stored, preferably DDR. The first data are the weight data required by the convolution operation, the second data are the feature data required by the convolution operation, and the convolution operation performs multiply-add operations on the read weight data and feature data. Of course, in other embodiments, the first data may instead be feature data and the second data weight data. The weight data and feature data may be matrix data, such as a 4x4 matrix, a 3x3 matrix, or the like.
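The patent describes hardware, but the multiply-add relationship between a weight tile and a feature tile can be illustrated with a minimal software model (function names are hypothetical, not from the patent):

```python
# Hypothetical sketch: the multiply-add array accumulates the elementwise
# products of one flattened weight row and one flattened feature row.

def multiply_add(weight_row, feature_row):
    """Sum of elementwise products over two equal-length rows."""
    assert len(weight_row) == len(feature_row)
    return sum(w * f for w, f in zip(weight_row, feature_row))

# A 3-element example: 1*4 + 2*5 + 3*6 = 32.
assert multiply_add([1, 2, 3], [4, 5, 6]) == 32
```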
Through the scheme, after the first reading unit reads the first data from the main storage unit, the read data are written into the first line cache group for caching; after the first line cache group is filled, the first reassembly control unit assembles the first data cached in each cache line of the first line cache group into a reassembled single line of first data and writes that single line into the first output cache unit. The second reading unit and the second reassembly control unit process second data in the same way, generating a single line of second data and writing it into the second output buffer unit. The multiply-add array unit then performs the multiply-add operation on the first data in the current first output buffer unit and the second data in the second output buffer unit to output the convolution operation result. Because the first data and the second data acquired by the multiply-add array unit have been reassembled into single lines, all data need only be read sequentially from the output cache units; compared with reading data from multiple line caches, this effectively improves data-reading efficiency and thus convolution operation efficiency.
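The reassembly step above amounts to flattening a filled group of line caches into one sequential row. A behavioral sketch in Python (the function name is hypothetical):

```python
# Hypothetical model of the reassembly control unit: once every cache
# line in a group is full, concatenate them into one flat row that a
# single sequential reader can consume.

def reassemble(line_cache_group):
    """Concatenate the rows of a filled line cache group into one row."""
    single_row = []
    for line in line_cache_group:
        single_row.extend(line)
    return single_row

# A 4x4 tile cached as four separate lines becomes one 16-element row.
group = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
assert reassemble(group) == list(range(1, 17))
```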
In some embodiments, there are two first line cache groups and two second line cache groups. When one first line cache group is filled with first data, the first reading unit is used for continuing to read first data from the main storage unit and writing it into the other first line cache group; when one second line cache group is filled with second data, the second reading unit is used for continuing to read second data from the main storage unit and writing it into the other second line cache group.
As shown in fig. 1, there are two first line buffer groups, a first line buffer group A and a first line buffer group B. The data read by the first reading unit 101 is first written into first line buffer group A, and after group A is filled, the first data read by the first reading unit 101 is written into first line buffer group B. When group B is filled and the data of group A has been reassembled, the first data read by the first reading unit 101 can again be written into group A, and so on, implementing ping-pong pipeline operation and improving first data reading efficiency.
Similarly, there are two second line cache groups, a second line cache group A and a second line cache group B. The data read by the second reading unit 102 is first written into second line cache group A, and after group A is filled, the second data read by the second reading unit 102 is written into second line cache group B. When group B is filled and the data of group A has been reassembled, the second data read by the second reading unit 102 can again be written into group A, and so on, implementing ping-pong pipeline operation and improving second data reading efficiency.
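The ping-pong scheme just described can be sketched as a small software model (class and method names are hypothetical; real hardware would do this with two register banks and a select signal):

```python
# Hypothetical model of the ping-pong line cache groups: while one group
# drains into the reassembly stage, the reader fills the other.

class PingPongLineCaches:
    """Two line cache groups of `lines` rows each, filled alternately."""

    def __init__(self, lines):
        self.lines = lines
        self.groups = [[], []]   # group A (index 0) and group B (index 1)
        self.write_idx = 0       # group currently being filled

    def write_line(self, row):
        group = self.groups[self.write_idx]
        group.append(row)
        if len(group) == self.lines:
            # Group full: the reader switches to the other group while
            # the reassembly stage drains this one.
            self.write_idx ^= 1

    def drain_full_group(self):
        # Reassemble and empty the group that is NOT being written.
        idx = self.write_idx ^ 1
        single_row = [x for row in self.groups[idx] for x in row]
        self.groups[idx] = []
        return single_row

pp = PingPongLineCaches(lines=2)
pp.write_line([1, 2])
pp.write_line([3, 4])   # group A full; subsequent writes go to group B
pp.write_line([5, 6])   # lands in group B while A awaits draining
assert pp.drain_full_group() == [1, 2, 3, 4]
```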
In order to ensure the balance of the reading schedules of the first data and the second data, in some embodiments, the circuit further includes a reading control balancing unit 108:
the reading control balancing unit 108 is configured to control the first reading unit to stop reading the first data from the main storage unit when determining that both the first line buffer groups are filled with the first data and neither of the second line buffer groups is filled with the second data, and to recover the first reading unit to continue reading data from the main storage unit when determining that at least one of the two second line buffer groups is filled with the second data;
alternatively, the read control balancing unit 108 is configured to control the second reading unit to stop reading the second data from the main storage unit when it determines that both second line buffer groups are filled with the second data and neither first line buffer group is filled with the first data, and to let the second reading unit resume reading data from the main storage unit when it determines that at least one of the two first line buffer groups is filled with the first data.
For example, a line buffer group includes 4 line buffers of 4x16-bit registers each; when all 4 line buffers are fully written, the first reassembly control unit reads the 4x16 bits of each in parallel to form one 16x16-bit (256-bit) data word and puts it into the parallel buffer unit. The parallel buffer unit is built from a memory with a bit width of 16xN bits (N is the bit precision of one data item, 16 bits in this example) and a depth of 8 (the depth can be adjusted to actual needs: the deeper it is, the more data can be buffered and the stronger the resistance to bus-efficiency fluctuation). When at least one valid data word exists in the parallel buffer unit (i.e., the output buffer unit holds a filled single line of first data or second data), the output-data ready signal is pulled high. When the data-ready signals of the parallel buffer units corresponding to feature (feature data) and weight (weight data), namely the first output buffer unit and the second output buffer unit, are both high, the convolution multiply-add array unit starts the convolution operation, reading 16xN bits of data from the feature and weight parallel buffer units each time to complete a convolution operation. Convolution is a very conventional operation in the neural network field and is not expanded on here; preferably, the convolution operation in this embodiment may be performed using the Winograd algorithm.
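The parallel read-out in the example (four 4x16-bit line buffers packed into one 256-bit word) can be modeled at the bit level; this is an illustrative sketch with a hypothetical function name and an assumed little-endian packing order:

```python
# Hypothetical model of the parallel read-out: all entries of the four
# line caches are concatenated into one wide integer, first entry in the
# least-significant bits. 4 rows x 4 entries x 16 bits = 256 bits.

def pack_line_caches(lines, width_bits=16):
    """Pack a group of line caches into a single wide integer."""
    mask = (1 << width_bits) - 1
    word = 0
    shift = 0
    for row in lines:
        for value in row:
            word |= (value & mask) << shift
            shift += width_bits
    return word

group = [[1, 2, 3, 4]] + [[0, 0, 0, 0]] * 3
word = pack_line_caches(group)
assert word == 1 | (2 << 16) | (3 << 32) | (4 << 48)
assert word.bit_length() <= 256   # fits the 256-bit word of the example
```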
Therefore, the multiply-add array unit only starts the multiply-add operation once at least one first line cache group and at least one second line cache group are filled with data. This effectively avoids the situation where the cache groups of one reassembly unit are completely filled while the other unit has not yet been written, so that convolution cannot start, yet read-data requests continue to be sent to the bus and occupy bus bandwidth.
In some embodiments, the circuit further includes an operation validity judging unit 109, which is connected to the first output buffer unit 105, the second output buffer unit 106, and the multiply-add array unit 107 respectively; the operation validity judging unit 109 is configured to send a convolution operation start signal to the multiply-add array unit 107 after receiving the first data output valid signal sent by the first output buffer unit 105 and the second data output valid signal sent by the second output buffer unit 106.
Preferably, the first data output valid signal is sent to the operation validity judging unit when the first output buffer unit is filled, and the second data output valid signal is sent to the operation validity judging unit when the second output buffer unit is filled. Therefore, the multiplication and addition array unit can start convolution operation only when the first output buffer unit stores first data required by convolution operation and the second output buffer unit stores second data required by convolution operation, and the convolution operation is performed orderly.
In some embodiments, the circuit further comprises a mode configuration unit, which is respectively connected with the first reading unit and the second reading unit; the mode configuration unit is used for correspondingly adjusting, according to the mode configuration signal, the number of first line caches in the first line cache group and the number of second line caches in the second line cache group that participate in the operation, and correspondingly adjusting the storage length range of the first output buffer unit and the second output buffer unit that participates in the operation.
Assume that the first line cache group includes 4 line caches, each line cache includes 4 data caches, and the reading unit reads the required data from the DDR in units of a 4x4 matrix. When the circuit of the invention processes a 3x3 matrix, the data reading order changes from a 4x4 left-to-right, top-to-bottom scan to a 3x3 left-to-right, top-to-bottom scan (i.e. it takes the 3x3 matrix data at the top-left corner of the 4x4 matrix). Meanwhile, only 3x16 bits of each 4x16-bit register in the line cache group are used, and a line cache counts as full in this mode once those 3x16 bits are filled. In 4x4 matrix mode, a line cache group is judged full only when all 4 of its line caches are full; in 3x3 matrix mode the group still contains 4 line caches, but only 3 of them are used, so the group is judged full when 3 line caches are full. In addition, only 9xN bits of the corresponding output buffer unit are used (N is the bit precision of one data element, here 16 bits), and the remaining 7xN bits are filled with 0, ensuring that only the first 9/16 of the output buffer unit's data participates in the convolution operation. The operation flow in 3x3 mode is otherwise the same as in 4x4 mode and is not repeated here.
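As a behavioral sketch of the 3x3-in-4x4 mode described above (the function names and list-based buffers are illustrative, not the actual register-level design): the mode amounts to keeping only the top-left 3x3 sub-matrix of each 4x4 read, and zero-filling the unused 7 of the 16 output buffer slots.

```python
def read_3x3_from_4x4(block_4x4):
    """3x3 mode: keep only the top-left 3x3 sub-matrix of a 4x4 DDR read."""
    return [row[:3] for row in block_4x4[:3]]

def pack_output_buffer(matrix_3x3, total_slots=16):
    """Fill the first 9 slots of the 16-slot output buffer; zero the remaining 7
    so that only the first 9/16 of the data participates in the operation."""
    flat = [v for row in matrix_3x3 for v in row]  # row-major, 9 values
    return flat + [0] * (total_slots - len(flat))
```

Chaining the two helpers models one channel's worth of buffered data in 3x3 mode.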
Of course, the number of line caches in a line cache group can be adjusted according to actual operation requirements. For example, when the convolution operation requires at most 8x8 matrix operations, each line cache group includes 8 line caches and each line cache includes 8 data caches, making it suitable for writing 8x8 matrix data. In addition, the circuit can be configured into different operation modes through the mode configuration unit: for example, a circuit sized for 8x8 matrix data can adjust the number of line caches participating in the operation so that, besides 8x8 matrix convolution, it also supports convolution of 6x6, 4x4, 3x3 and other NxN matrices (N a positive integer less than 8), effectively improving the overall reusability of the operation circuit.
As shown in fig. 2 and fig. 3, the first reading unit or the second reading unit reads the data to be convolved from the main storage unit per clock cycle. For example, if the current weight data and feature data are both 4x4 matrices, the first reading unit or second reading unit sequentially reads the first data or second data in the direction indicated by the arrows in fig. 2. Assuming a stride value of 2, after a reading unit finishes reading data in the arrow direction of fig. 2, the data of each channel is then read from the main storage unit in the arrow direction shown in fig. 3.
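As an illustrative sketch only (the exact arrow directions in figs. 2 and 3 cannot be recovered from the text, so the left-to-right, top-to-bottom order within a block and the two-column shift between blocks are assumptions), the address sequence of one 4x4 block read and the stride-2 step to the next block might look like:

```python
def block_read_addresses(top, left, size=4):
    """(row, col) addresses visited when reading one size x size block
    left-to-right, top-to-bottom (assumed arrow order from fig. 2)."""
    return [(top + r, left + c) for r in range(size) for c in range(size)]

def next_block_origin(top, left, stride=2):
    """With a stride of 2, the next block's window is assumed to shift two columns."""
    return (top, left + stride)
```

The 4x4 read thus visits 16 addresses per block, and consecutive blocks overlap by two columns when the stride is 2.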
As shown in fig. 4, taking the first data as weight data and the second data as feature data, both 4x4 matrices, the process by which the circuit of the invention performs convolution operation and memory access is further described below.
The first reading unit reads 4 weight data on one channel at a time, so every 4 reads complete the weight reading of one channel; the second reading unit is responsible for reading feature_data (i.e. feature data) in the channel direction, reading 4 feature_data on one channel at a time, so every 4 reads complete the feature_data reading of one channel.
The weight reorganization unit (i.e. the first reorganization unit) is responsible for placing the 4 weight data read each time into the corresponding first line cache, of which there are 4. When the weight read control unit has completed the 16 weight reads of one channel after 4 reads, the 4 line caches are correspondingly full. At this point, the weight reorganization unit reorganizes the 16 weight data into 16 parallel data (i.e. a single row of first data), which are filled one by one into the 16 parallel output buffers of the weight parallel output buffer unit (i.e. the first output buffer unit).
The weight parallel output buffer unit (i.e. the first output buffer unit) is responsible for sending the 16 buffered weight data in parallel to the multiply-add array unit for the multiply-add convolution operation, so the 4x4 matrix Winograd convolution on one channel can be completed in one pass.
The feature_data reorganization unit (i.e. the second reorganization unit) is responsible for placing the 4 feature_data read each time into the corresponding second line cache, of which there are 4. When the feature_data read control unit has completed the 16 feature_data reads of one channel after 4 reads, the 4 line caches are correspondingly full. At this point, the feature_data reorganization unit reorganizes the 16 feature_data into 16 parallel data, which are filled one by one into the 16 parallel output buffers of the feature_data parallel output buffer unit.
The feature_data parallel output buffer unit (i.e. the second output buffer unit) is responsible for sending the 16 buffered feature_data in parallel to the multiply-add array unit for the multiply-add convolution operation, so the 4x4 matrix Winograd convolution on one channel can be completed in one pass.
The multiply-add array unit is responsible for performing the Winograd 4x4 matrix multiply-add operations on the weight data and the feature_data to complete the convolution operation.
Fig. 6 is a schematic diagram illustrating the Winograd algorithm according to an embodiment of the present invention. The Winograd algorithm, simply put, trades additional addition calculations for fewer multiplication calculations. Its premise is therefore that a multiplication takes more clock cycles than an addition in the processor. The number of multiplications needed to compute a convolution with Winograd is:
μ(F(m×n, r×s)) = (m+r-1)×(n+s-1)
r × s denotes the size of the convolution kernel, and m × n denotes the output size.
Therefore, a simple comparison with a 3×3 convolution kernel and a 2×2 output shows that the sliding window or im2col approach requires 3 × 3 × 2 × 2 = 36 multiplications, while Winograd requires only (3+2-1) × (3+2-1) = 16.
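The comparison above follows directly from the multiplication-count formula; a two-line sketch makes it checkable:

```python
def winograd_muls(m, n, r, s):
    """mu(F(m x n, r x s)) = (m + r - 1) * (n + s - 1)."""
    return (m + r - 1) * (n + s - 1)

def direct_muls(m, n, r, s):
    """Sliding-window / im2col count: one multiply per kernel tap per output element."""
    return m * n * r * s
```

For F(2x2, 3x3) this gives 16 versus 36 multiplications, matching the figures quoted above.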
The proof of the Winograd method is complex and uses some knowledge from number theory, but the method itself is simple to use; it is only necessary to calculate according to the following formula:
Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A
wherein ⊙ denotes element-wise multiplication (array elements are multiplied position by position). A, G and B have different definitions depending on the output size and the convolution kernel size, and are determined in advance; the specific A, G, B for each output size and kernel size can be generated with the script at https://github.com/andravin/wincnn. g denotes the convolution kernel and d denotes the data to be convolved; g has a size of r × r and d has a size of (m + r - 1) × (m + r - 1).
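A minimal numerical sketch of the formula for F(2x2, 3x3), checked against a direct sliding-window computation. The particular A, G, B values below are the commonly published transforms for this size (an assumption here, not taken from the patent text; other sizes can be generated with the wincnn script mentioned above):

```python
import numpy as np

# Transform matrices for F(2x2, 3x3); B^T, G and A^T as commonly published.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f22_33(g, d):
    """Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A for a 3x3 kernel g and a 4x4 tile d."""
    U = G @ g @ G.T        # transformed kernel, 4x4
    V = B_T @ d @ B_T.T    # transformed data, 4x4
    return A_T @ (U * V) @ A_T.T  # 2x2 output tile

def direct_corr(g, d):
    """Direct sliding-window computation of the same 2x2 output."""
    return np.array([[np.sum(g * d[i:i + 3, j:j + 3]) for j in range(2)]
                     for i in range(2)])
```

Both functions produce the same 2x2 result for any 3x3 kernel and 4x4 input tile, which is exactly the "one 4x4 pass yields four 3x3 convolution results" property the next paragraph describes.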
Taking a 4x4 input matrix as an example: performing the optimized operation on one 4x4 matrix is, in essence, equivalent to performing the convolution of four 3x3 matrices, and a single pass over the 4x4 matrix yields, through the algorithm formula shown in fig. 6, the results of those four 3x3 convolutions.
Further, in the algorithm shown in fig. 6, the left matrix (2 rows by 4 columns) and the right matrix (4 rows by 2 columns) are the matrix data to be matrix-multiplied. The middle 4x4 matrix is the result of the multiplier operations, where the numbers 0-15 are data position numbers, arranged in sequence from top to bottom and then from left to right. After the formula operation shown in fig. 6, a 2x2 matrix result is obtained, and each element of that result is the convolution result of one of the four original 3x3 matrices, as follows:
the 1 st row and 1 st column data of the 2x2 matrix correspond to the operation result of a 3x3 matrix (hereinafter referred to as "matrix 1") composed of 9 data of "0, 1, 2, 4, 5, 6, 8, 9 and 10";
the 1 st row and 2 nd column data of the 2x2 matrix correspond to the operation result of a 3x3 matrix (hereinafter referred to as "matrix 2") composed of 9 data of "4, 5, 6, 8, 9, 10, 12, 13 and 14";
the 2 nd row and 1 st column data of the 2x2 matrix corresponds to the operation result of a 3x3 matrix (hereinafter referred to as "matrix 3") composed of 9 data of "1, 2, 3, 5, 6, 7, 9, 10 and 11";
the 2 nd row and 2 nd column data of the 2x2 matrix corresponds to the operation result of a 3x3 matrix (hereinafter referred to as "matrix 4") composed of 9 data "5, 6, 7, 9, 10, 11, 13, 14, 15".
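The four index sets above follow from numbering the 4x4 positions top-to-bottom and then left-to-right: each 2x2 output element corresponds to one 3x3 window of the 4x4 matrix. A small helper (hypothetical, for illustration only) reproduces them:

```python
def window_indices(dr, dc):
    """Position numbers (numbered top-to-bottom, then left-to-right, so position
    k sits at row k % 4, column k // 4 of the 4x4 matrix) covered by the 3x3
    window whose top-left corner is offset by (dr, dc)."""
    return sorted(4 * (dc + c) + (dr + r) for r in range(3) for c in range(3))

# matrix 1..4 correspond to offsets (0,0), (0,1), (1,0), (1,1) respectively.
```

Evaluating the helper at the four offsets reproduces exactly the data position sets listed for matrices 1 through 4.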
In order to realize the function of the above algorithm and reduce the operation amount of convolution operation, as shown in fig. 7, the present invention provides a schematic structural diagram of a multiply-add array unit, wherein the multiply-add array unit includes a multiply operation unit, a matrix operation unit, and a data buffer unit; the matrix operation unit includes a first matrix operation unit 1071 and a second matrix operation unit 1072; the multiplication operation unit comprises a plurality of multipliers;
the first matrix operation unit 1071 is configured to receive a first operation result of a multiplier connected thereto, perform a first matrix operation to obtain a first matrix operation result, and store the first matrix operation result in a corresponding data cache unit;
the second matrix operation unit 1072 is configured to obtain a first matrix operation result of the data caching unit, perform a second matrix operation, obtain a second matrix operation result, and output the second matrix operation result.
Preferably, in some embodiments, the matrix for performing the convolution optimization operation is a 4 × 4 matrix, the number of multipliers is 16, the number of first matrix operation units is 4, the number of data cache units is 8, and the number of second matrix operation units is 2; each first matrix operation unit is correspondingly connected with 4 multipliers and with 2 data cache units according to a first configuration rule; and each second matrix operation unit is correspondingly connected with 4 data cache units according to a second configuration rule.
For example, fig. 7 shows 16 multipliers numbered 0 to 15, four first matrix operation units numbered A to D, 8 data cache units numbered from row 1 column 1 through row 2 column 4, and two second matrix operation units numbered a and b. Multipliers 0 to 3 are connected to first matrix operation unit A, multipliers 4 to 7 to unit B, multipliers 8 to 11 to unit C, and multipliers 12 to 15 to unit D. After performing its first matrix operation, each first matrix operation unit produces two results, which are stored in the two data cache units connected to it. The second matrix operation unit a receives the data in the row 1 column 1, row 1 column 2, row 1 column 3 and row 1 column 4 cache units, and performs a second matrix operation on these four data to obtain the operation results of matrix 1 and matrix 2; the second matrix operation unit b receives the data in the row 2 column 1, row 2 column 2, row 2 column 3 and row 2 column 4 cache units, and performs a second matrix operation on these four data to obtain the operation results of matrix 3 and matrix 4.
As shown in fig. 8, in some embodiments, the matrix operation unit includes an addition operation unit and a complement operation unit. The complement operation unit is used for performing complement operation on the operation result of the multiplier connected with the complement operation unit and transmitting the complement operation result to the addition operation unit; the addition unit is used for adding the operation result of the multiplier connected with the addition unit or adding the operation result of the multiplier and the complement operation result.
Preferably, the multipliers include a first multiplier (i.e., multiplier 0 in fig. 7), a second multiplier (i.e., multiplier 1 in fig. 7), a third multiplier (i.e., multiplier 2 in fig. 7), and a fourth multiplier (i.e., multiplier 3 in fig. 7), and the first multiplier, the second multiplier, the third multiplier, and the fourth multiplier are respectively connected to the first matrix operation unit; the addition operation unit includes a first adder 110 and a second adder 111; the complement operation unit comprises a first complement operation unit 112 and a second complement operation unit 113;
the first multiplier, the second multiplier and the third multiplier are respectively connected with the first adder, the second multiplier is connected with the second adder, the third multiplier is connected with the first complement arithmetic unit, the fourth multiplier is connected with the second complement arithmetic unit, and the first complement arithmetic unit and the second complement arithmetic unit are respectively connected with the second adder;
the first adder 110 is configured to perform a first addition operation on the multiplication operation results output by the first multiplier, the second multiplier, and the third multiplier, obtain a first addition operation result, and output the first addition operation result;
the second adder 111 is configured to perform a second addition operation on the operation results output by the second multiplier, the first complement operation unit, and the second complement operation unit, obtain a second addition operation result, and output the second addition operation result.
The complement operation is bitwise inversion followed by adding one. Because the matrices in fig. 6 involve multiply-add operations with the element "-1", negative-number operations can be converted into complement operations to improve processing efficiency. Specifically, the number is converted to binary, each bit is inverted (every "0" becomes "1" and every "1" becomes "0"), and then 1 is added, giving the complement of the original number. In this way the original subtraction of two numbers becomes an addition of two numbers (i.e. subtracting one number from another becomes adding the complement of that number).
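The invert-then-add-one procedure can be sketched directly (the 16-bit width here is an assumption based on the 16-bit data mentioned earlier):

```python
def twos_complement(value, bits=16):
    """Invert every bit of `value`, then add 1 (mod 2**bits)."""
    mask = (1 << bits) - 1
    return ((~value) + 1) & mask

def subtract_via_complement(a, b, bits=16):
    """a - b rewritten as the addition a + twos_complement(b)."""
    mask = (1 << bits) - 1
    return (a + twos_complement(b, bits)) & mask
```

For example, 5 - 3 becomes 5 + 0xFFFD, which wraps to 2 in a 16-bit datapath.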
As shown in fig. 8, the first matrix operation unit A is connected to multipliers 0, 1, 2 and 3. Its first adder adds the multiplication results of multipliers 0, 1 and 2 and stores the resulting first intermediate result in the row 1 column 1 cache. The results of multipliers 2 and 3 first undergo complement operations in their respective complement operation units; the complement results are passed to the second adder, which adds the two complement results to the multiplication result of multiplier 1 to obtain a second intermediate result, stored in the row 2 column 1 cache. The other first matrix operation units perform the first matrix operation in the same manner as unit A: the results of their first three multipliers are added directly to give the first intermediate result, and the second multiplier's result plus the complements of the third and fourth multipliers' results gives the second intermediate result.
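A behavioral sketch of one first matrix operation unit as just described (the names and the 16-bit mask are illustrative assumptions): the first adder forms m0 + m1 + m2, and the second adder forms m1 - m2 - m3 by adding the complements of m2 and m3.

```python
MASK = 0xFFFF  # 16-bit datapath assumed from the 16-bit data width above

def complement(x):
    """Two's complement: invert every bit, then add one."""
    return ((~x) + 1) & MASK

def first_matrix_op(m0, m1, m2, m3):
    """Behavioral model of one first matrix operation unit (e.g. unit A)."""
    first = (m0 + m1 + m2) & MASK                            # first adder
    second = (m1 + complement(m2) + complement(m3)) & MASK   # == m1 - m2 - m3
    return first, second
```

These two sums match the rows of the left-hand 2x4 matrix in fig. 6 applied to the four multiplier outputs of one column group.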
The second matrix operation units a and b are similar in structure to the first matrix operation units A to D, except that their input data come from the data cache units. Specifically, as shown in fig. 7, unit a takes the data in the row 1 column 1, row 1 column 2, row 1 column 3 and row 1 column 4 caches, and unit b takes the data in the row 2 column 1, row 2 column 2, row 2 column 3 and row 2 column 4 caches.
Taking the second matrix operation unit a as an example, its internal first adder adds the intermediate result data from the row 1 column 1, row 1 column 2 and row 1 column 3 caches to obtain the convolution result of "matrix 1"; its second adder combines the intermediate result data from the row 1 column 2, row 1 column 3 and row 1 column 4 caches to obtain the convolution result of "matrix 2".
Likewise, in the second matrix operation unit b, the first adder adds the intermediate result data from the row 2 column 1, row 2 column 2 and row 2 column 3 caches to obtain the convolution result of "matrix 3", and the second adder combines the intermediate result data from the row 2 column 2, row 2 column 3 and row 2 column 4 caches to obtain the convolution result of "matrix 4".
With this scheme, the multiply-add array unit can output the convolution results of four 3x3 matrices from a single 4x4 matrix input; compared with computing them one by one, the convolution efficiency of the neural network is greatly improved.
In order for the multiply-add array unit of the present invention to support not only 4x4 matrix convolution but also 3x3 matrix convolution, in this embodiment the multiply-add array unit further includes a zero padding unit. When the matrix undergoing the optimized convolution operation is a 3x3 matrix, the zero padding unit performs a zero padding operation on the 3x3 matrix to obtain a 4x4 matrix. In short, when the input is a 3x3 matrix, its data is placed at the position of "matrix 1" in fig. 6 and all positions outside matrix 1 are set to 0: the input data of multipliers 0, 1, 2, 4, 5, 6, 8, 9 and 10 are set to the data of the 3x3 matrix in order, and the input data of the remaining multipliers are all set to 0. The operation performed by the multiply-add array unit shown in fig. 7 then yields the result of matrix 1, which is the convolution result of the input 3x3 matrix.
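A sketch of the zero padding operation (illustrative Python, not the hardware unit): the 3x3 input occupies the "matrix 1" (top-left) region of a zeroed 4x4 matrix.

```python
def zero_pad_3x3_to_4x4(m3):
    """Place a 3x3 matrix at the 'matrix 1' position of a zeroed 4x4 matrix."""
    out = [[0] * 4 for _ in range(4)]
    for r in range(3):
        for c in range(3):
            out[r][c] = m3[r][c]
    return out
```

The padded 4x4 matrix can then be fed through the same datapath as a normal 4x4 input, and only the matrix-1 output is meaningful.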
By providing the zero padding unit, the multiply-add array unit of the invention can not only process a 4x4 matrix to obtain the convolution results of four 3x3 matrices, but can also directly process a 3x3 matrix to obtain its convolution result, effectively improving the reusability of the operation circuit.
As shown in fig. 5, the inventor further provides a high-parallelism convolution operation access method, which is applied to a high-parallelism convolution operation access circuit, where the circuit includes a reading unit, a recombination unit, an output buffer unit, and a multiply-add array unit; the reading unit comprises a first reading unit and a second reading unit, the recombination unit comprises a first recombination unit and a second recombination unit, and the output buffer unit comprises a first output buffer unit and a second output buffer unit; the first reading unit is connected with the first recombination unit, and the first recombination unit is connected with the first output buffer unit; the second reading unit is connected with the second recombination unit, and the second recombination unit is connected with the second output buffer unit; the first output buffer unit and the second output buffer unit are respectively connected with the multiply-add array unit;
the first recombination unit comprises a first recombination control unit and a first line cache group, the first line cache group comprises a plurality of first line caches, and each first line cache is connected with the first recombination control unit; the second recombination unit comprises a second recombination control unit and a second line cache group, the second line cache group comprises a plurality of second line caches, and each second line cache is connected with the second recombination control unit;
the method comprises the following steps:
firstly, entering step S501, a first reading unit reads first data from a main storage unit, and writes the read first data into a first line cache group in sequence;
then, after the first line cache group is filled up, the first reassembly control unit assembles the first data cached by each cache line in the first line cache group to obtain a single-line first data after reassembly, and writes the single-line first data into the first output cache unit in step S502;
while performing step S501 and step S502, step S503 may be performed synchronously, where the second reading unit reads the second data from the main storage unit, and sequentially writes the read second data into the second line buffer group; step S504, after the second row of cache groups are filled, the second reassembly control unit assembles the second data cached by each cache line in the second row of cache groups to obtain a reassembled single-row second data, and writes the single-row second data into the second output cache unit;
after steps S503 and S504 are completed, the flow may enter step S505: after receiving the convolution operation start signal, the multiply-add array unit obtains the first data in the first output buffer unit and the second data in the second output buffer unit, performs the multiply-add operation, and outputs the operation result.
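The flow of steps S501-S505 can be sketched behaviorally for one channel (a simplification that ignores the ping-pong cache groups and the valid signals; all function names here are illustrative):

```python
def convolve_channel(read_first_line, read_second_line, multiply_add, lines=4):
    # S501 / S503: read data line by line into the two line cache groups
    first_group = [read_first_line() for _ in range(lines)]
    second_group = [read_second_line() for _ in range(lines)]
    # S502 / S504: once a group is full, reassemble it into a single parallel row
    first_row = [v for line in first_group for v in line]
    second_row = [v for line in second_group for v in line]
    # S505: both output buffers now hold valid data, so the multiply-add array starts
    return multiply_add(first_row, second_row)
```

With 4 lines of 4 values each, the reassembled rows carry 16 parallel data elements apiece, matching the 16-wide parallel output buffers described earlier.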
In some embodiments, the number of first line cache groups and the number of second line cache groups are both two; the method comprises the following steps:
when one first line cache group is filled with first data, the first reading unit continuously reads the first data from the main storage unit and writes the first data into another first line cache group;
when one second line cache group is filled with the second data, the second reading unit continues to read the second data from the main storage unit and write the second data into another second line cache group.
In some embodiments, the circuit further comprises a read control balancing unit; the method comprises the following steps:
the reading control balancing unit controls the first reading unit to stop reading the first data from the main storage unit when judging that the two first line cache groups are both filled with the first data and the two second line cache groups are not filled with the second data, and restores the first reading unit to continue reading the data from the main storage unit when judging that at least one of the two second line cache groups is filled with the second data;
or, the reading control balancing unit controls the second reading unit to stop reading the second data from the main storage unit when judging that both second line cache groups are filled with second data and neither first line cache group is filled with first data, and restores the second reading unit to continue reading data from the main storage unit when judging that at least one of the two first line cache groups is filled with first data.
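The balancing condition for one side's reading unit can be sketched as a simple predicate (an illustrative model with one "filled" flag per line cache group; the symmetric rule for the second reading unit swaps the arguments):

```python
def reader_may_read(own_groups_full, other_groups_full):
    """A reading unit is stalled when both of its own line cache groups are
    full while neither of the other side's groups is; it resumes as soon as
    at least one of the other side's line cache groups fills up."""
    return not (all(own_groups_full) and not any(other_groups_full))
```

This keeps a fully-buffered reader from issuing further bus requests until the lagging side catches up, which is exactly the bandwidth-saving behavior described above.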
In some embodiments, the circuit further includes an operation validity judging unit, and the operation validity judging unit is respectively connected to the first output buffer unit, the second output buffer unit, and the multiply-add array unit; the method comprises the following steps:
and the operation validity judging unit sends a convolution operation start signal to the multiply-add array unit after receiving both the first data output valid signal sent by the first output buffer unit and the second data output valid signal sent by the second output buffer unit.
In some embodiments, the circuit further comprises a mode configuration unit, wherein the mode configuration unit is respectively connected with the first reading unit and the second reading unit; the method comprises the following steps:
the mode configuration unit correspondingly adjusts the first cache number in the first line cache group and the second cache number in the second line cache group participating in the operation according to the mode configuration signal, and correspondingly adjusts the storage length range of the first output cache unit and the second output cache unit participating in the operation.
The high-parallelism convolution operation access method and the circuit in the technical scheme comprise the following steps: the first reading unit reads first data from the main storage unit and sequentially writes the read first data into a first line cache group; after the first row cache group is filled, the first reassembly control unit assembles the first data cached by each cache line in the first row cache group to obtain a reassembled single-row first data, and writes the single-row first data into the first output cache unit; the second reading unit reads second data from the main storage unit and sequentially writes the read second data into a second line cache group; after the second row of cache groups are filled, the second reassembly control unit assembles the second data cached by each cache line in the second row of cache groups to obtain a reassembled single-row second data, and writes the single-row second data into the second output cache unit; and the multiplication and addition array unit acquires the first data in the current first output cache unit and the second data in the second output cache unit after receiving the convolution operation starting signal, and outputs an operation result after carrying out multiplication and addition operation. According to the scheme, on the basis of ensuring high parallelism operation of convolution operation, high operation efficiency and data multiplexing are still kept, and the data bandwidth requirement and power consumption are greatly reduced, so that the area and the power consumption of the whole neural network circuit are reduced.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (10)

1. A convolution operation access circuit with high parallelism is characterized by comprising a reading unit, a recombination unit, an output buffer unit and a multiplication and addition array unit; the reading unit comprises a first reading unit and a second reading unit, the recombination unit comprises a first recombination unit and a second recombination unit, and the output buffer unit comprises a first output buffer unit and a second output buffer unit; the first reading unit is connected with the first recombination unit, and the first recombination unit is connected with the first output buffer unit; the second reading unit is connected with the second recombination unit, and the second recombination unit is connected with the second output buffer unit; the first output buffer unit and the second output buffer unit are respectively connected with the multiply-add array unit;
the first recombination unit comprises a first recombination control unit and a first line cache group, the first line cache group comprises a plurality of first line caches, and each first line cache is connected with the first recombination control unit; the second recombination unit comprises a second recombination control unit and a second line cache group, the second line cache group comprises a plurality of second line caches, and each second line cache is connected with the second recombination control unit;
the first reading unit is used for reading first data from the main storage unit and sequentially writing the read first data into a first row cache group;
the first reassembly control unit is used for assembling the first data cached by each cache line in the first line cache group after the first line cache group is filled up to obtain a reassembled single-line first data, and writing the single-line first data into the first output cache unit;
the second reading unit is used for reading second data from the main storage unit and sequentially writing the read second data into a second row cache group;
the second reassembly control unit is configured to, after the second line cache group is filled up, assemble the second data cached by each cache line in the second line cache group to obtain a reassembled single-line second data, and write the single-line second data into the second output cache unit;
and the multiplication and addition array unit is used for acquiring the first data in the current first output buffer unit and the second data in the second output buffer unit after receiving the convolution operation starting signal, and outputting an operation result after carrying out multiplication and addition operation.
2. The high-parallelism convolution operation access circuit according to claim 1, wherein the number of first line cache groups and the number of second line cache groups are both two;
when one first line cache group is filled with first data, the first reading unit is used for continuously reading the first data from the main storage unit and writing the first data into another first line cache group;
when one second line cache group is filled with the second data, the second reading unit is used for continuously reading the second data from the main storage unit and writing the second data into another second line cache group.
3. The high-parallelism convolution operation access circuit according to claim 2, further comprising a read control balancing unit;
the read control balancing unit is configured to control the first reading unit to stop reading first data from the main storage unit when it determines that both first line cache groups are filled with first data and neither second line cache group is filled with second data, and to allow the first reading unit to resume reading data from the main storage unit when it determines that at least one of the two second line cache groups is filled with second data;
or, the read control balancing unit is configured to control the second reading unit to stop reading second data from the main storage unit when it determines that both second line cache groups are filled with second data and neither first line cache group is filled with first data, and to allow the second reading unit to resume reading data from the main storage unit when it determines that at least one of the two first line cache groups is filled with first data.
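The balancing rule of claim 3 keeps the two readers from racing ahead of each other: a reader is paused when its own groups are both full while the other side has produced nothing. A simplified decision function for the first-reader direction (booleans stand in for group-full status flags; the stop/resume encoding is an assumption for illustration):

```python
def balance_first_reader(first_groups_full, second_groups_full):
    # Read control balancing unit, first-reader direction of claim 3:
    # stop when both first line cache groups are full and neither second
    # group is; resume once at least one second group fills; otherwise
    # leave the reader running. (The symmetric rule swaps the roles.)
    if all(first_groups_full) and not any(second_groups_full):
        return "stop"
    if any(second_groups_full):
        return "resume"
    return "keep_reading"

assert balance_first_reader([True, True], [False, False]) == "stop"
assert balance_first_reader([True, True], [True, False]) == "resume"
```

This prevents one data stream (e.g. feature data) from monopolizing main-storage bandwidth while the multiply-add array is still waiting on the other stream (e.g. weights).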
4. The high-parallelism convolution operation access circuit according to claim 1, further comprising an operation validity judging unit connected to the first output buffer unit, the second output buffer unit, and the multiply-add array unit, respectively;
the operation validity judging unit is configured to send a convolution operation start signal to the multiply-add array unit after receiving a first data output valid signal from the first output buffer unit and a second data output valid signal from the second output buffer unit.
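Claim 4's validity judging unit is, behaviorally, an AND of the two valid signals: the multiply-add array is started only when both output buffers hold usable data. A one-line model (signal names are illustrative):

```python
def convolution_start(first_output_valid, second_output_valid):
    # Operation validity judging unit: assert the convolution operation
    # start signal only after both output buffer units have raised their
    # data-output-valid signals.
    return first_output_valid and second_output_valid

assert convolution_start(True, True) is True
assert convolution_start(True, False) is False
```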
5. The high-parallelism convolution operation access circuit according to claim 1, further comprising a mode configuration unit connected to the first reading unit and the second reading unit, respectively;
the mode configuration unit is configured to adjust, according to a mode configuration signal, the number of first line caches in the first line cache group and the number of second line caches in the second line cache group that participate in the operation, and to correspondingly adjust the storage length range of the first output buffer unit and of the second output buffer unit that participate in the operation.
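One plausible reading of claim 5 is that the mode configuration signal tracks the convolution kernel size: a larger kernel needs more line caches per group and a longer active region in each output buffer. The mode table below is an assumption for illustration; the claim itself does not name specific modes or sizes:

```python
LINE_WIDTH = 64  # words per line cache (illustrative value)

# Hypothetical mode table: each mode selects how many line caches per
# group participate and the storage length used in each output buffer.
MODES = {
    "3x3": {"line_caches_per_group": 3, "output_buffer_len": 3 * LINE_WIDTH},
    "5x5": {"line_caches_per_group": 5, "output_buffer_len": 5 * LINE_WIDTH},
}

def configure(mode_signal):
    # Mode configuration unit: translate the mode configuration signal
    # into per-group line cache counts and output buffer length ranges.
    cfg = MODES[mode_signal]
    return cfg["line_caches_per_group"], cfg["output_buffer_len"]

assert configure("3x3") == (3, 192)
```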
6. A high-parallelism convolution operation access method, applied to a high-parallelism convolution operation access circuit, wherein the circuit comprises a reading unit, a recombination unit, an output buffer unit, and a multiply-add array unit; the reading unit comprises a first reading unit and a second reading unit, the recombination unit comprises a first recombination unit and a second recombination unit, and the output buffer unit comprises a first output buffer unit and a second output buffer unit; the first reading unit is connected with the first recombination unit, and the first recombination unit is connected with the first output buffer unit; the second reading unit is connected with the second recombination unit, and the second recombination unit is connected with the second output buffer unit; the first output buffer unit and the second output buffer unit are each connected with the multiply-add array unit;
the first recombination unit comprises a first recombination control unit and a first line cache group, the first line cache group comprises a plurality of first line caches, and each first line cache is connected with the first recombination control unit; the second recombination unit comprises a second recombination control unit and a second line cache group, the second line cache group comprises a plurality of second line caches, and each second line cache is connected with the second recombination control unit;
the method comprises the following steps:
the first reading unit reads first data from the main storage unit and sequentially writes the read first data into a first line cache group;
after the first line cache group is filled, the first recombination control unit assembles the first data cached in each line cache of the first line cache group into a single recombined line of first data, and writes the single line of first data into the first output buffer unit;
the second reading unit reads second data from the main storage unit and sequentially writes the read second data into a second line cache group;
after the second line cache group is filled, the second recombination control unit assembles the second data cached in each line cache of the second line cache group into a single recombined line of second data, and writes the single line of second data into the second output buffer unit;
and after receiving the convolution operation start signal, the multiply-add array unit obtains the first data currently in the first output buffer unit and the second data in the second output buffer unit, performs a multiply-add operation, and outputs the operation result.
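The five method steps of claim 6 can be chained into one end-to-end behavioral sketch: each side fills its line cache group row by row, recombines the rows into one single line, and the multiply-add array multiplies the two lines element-wise and accumulates the products (names and values are illustrative, not from the patent):

```python
def convolution_access(first_rows, second_rows):
    # Method of claim 6, end to end: recombine each filled line cache
    # group into a single line, then multiply-add the two lines.
    first_line = [w for row in first_rows for w in row]    # first recombination
    second_line = [w for row in second_rows for w in row]  # second recombination
    # Multiply-add array: element-wise products accumulated into a result.
    return sum(a * b for a, b in zip(first_line, second_line))

# Feature-data rows against weight rows (toy values).
result = convolution_access([[1, 2], [3, 4]], [[1, 0], [0, 1]])
print(result)  # 1*1 + 2*0 + 3*0 + 4*1 = 5
```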
7. The method according to claim 6, wherein there are two first line cache groups and two second line cache groups; the method comprises:
when one first line cache group is filled with first data, the first reading unit continues reading first data from the main storage unit and writes it into the other first line cache group;
when one second line cache group is filled with second data, the second reading unit continues reading second data from the main storage unit and writes it into the other second line cache group.
8. The high-parallelism convolution operation access method according to claim 7, wherein the circuit further comprises a read control balancing unit; the method comprises:
the read control balancing unit controls the first reading unit to stop reading first data from the main storage unit when it determines that both first line cache groups are filled with first data and neither second line cache group is filled with second data, and allows the first reading unit to resume reading data from the main storage unit when it determines that at least one of the two second line cache groups is filled with second data;
or, the read control balancing unit controls the second reading unit to stop reading second data from the main storage unit when it determines that both second line cache groups are filled with second data and neither first line cache group is filled with first data, and allows the second reading unit to resume reading data from the main storage unit when it determines that at least one of the two first line cache groups is filled with first data.
9. The method according to claim 6, wherein the circuit further comprises an operation validity judging unit connected to the first output buffer unit, the second output buffer unit, and the multiply-add array unit, respectively; the method comprises:
the operation validity judging unit sends a convolution operation start signal to the multiply-add array unit after receiving a first data output valid signal from the first output buffer unit and a second data output valid signal from the second output buffer unit.
10. The high-parallelism convolution operation access method according to claim 6, wherein the circuit further comprises a mode configuration unit connected to the first reading unit and the second reading unit, respectively; the method comprises:
the mode configuration unit adjusts, according to a mode configuration signal, the number of first line caches in the first line cache group and the number of second line caches in the second line cache group that participate in the operation, and correspondingly adjusts the storage length range of the first output buffer unit and of the second output buffer unit that participate in the operation.
CN201910848453.2A 2019-09-09 2019-09-09 High-parallelism convolution operation access method and circuit Active CN110751263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848453.2A CN110751263B (en) 2019-09-09 2019-09-09 High-parallelism convolution operation access method and circuit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910848453.2A CN110751263B (en) 2019-09-09 2019-09-09 High-parallelism convolution operation access method and circuit

Publications (2)

Publication Number Publication Date
CN110751263A CN110751263A (en) 2020-02-04
CN110751263B true CN110751263B (en) 2022-07-01

Family

ID=69276273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848453.2A Active CN110751263B (en) 2019-09-09 2019-09-09 High-parallelism convolution operation access method and circuit

Country Status (1)

Country Link
CN (1) CN110751263B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111142808B (en) * 2020-04-08 2020-08-04 浙江欣奕华智能科技有限公司 Access device and access method

Citations (4)

Publication number Priority date Publication date Assignee Title
CN108629411A (en) * 2018-05-07 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of convolution algorithm hardware realization apparatus and method
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109858622A (en) * 2019-01-31 2019-06-07 福州瑞芯微电子股份有限公司 The data of deep learning neural network carry circuit and method
WO2019165989A1 (en) * 2018-03-01 2019-09-06 华为技术有限公司 Data processing circuit for use in neural network

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
WO2019165989A1 (en) * 2018-03-01 2019-09-06 华为技术有限公司 Data processing circuit for use in neural network
CN108629411A (en) * 2018-05-07 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of convolution algorithm hardware realization apparatus and method
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109858622A (en) * 2019-01-31 2019-06-07 福州瑞芯微电子股份有限公司 The data of deep learning neural network carry circuit and method

Non-Patent Citations (1)

Title
Convolutional Neural Network Circuit Design Based on Memristor Crossbar Arrays; Hu Fei; Journal of Computer Research and Development (《计算机研究与发展》); 2018-05-15; pp. 1097-1107 *

Also Published As

Publication number Publication date
CN110751263A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN108182471B (en) Convolutional neural network reasoning accelerator and method
CN109992743B (en) Matrix multiplier
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN109034373B (en) Parallel processor and processing method of convolutional neural network
US5812993A (en) Digital hardware architecture for realizing neural network
CN110991634B (en) Artificial intelligence accelerator, equipment, chip and data processing method
CN112840356A (en) Operation accelerator, processing method and related equipment
CN111767994B (en) Neuron computing device
CN110163338B (en) Chip operation method and device with operation array, terminal and chip
CN110580519B (en) Convolution operation device and method thereof
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
CN112486901A (en) Memory computing system and method based on ping-pong buffer
CN110751263B (en) High-parallelism convolution operation access method and circuit
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
WO2023065701A1 (en) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN114219699A (en) Matching cost processing method and circuit and cost aggregation processing method
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111814972B (en) Neural network convolution operation acceleration method based on FPGA
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN116702851A (en) Pulsation array unit and pulsation array structure suitable for weight multiplexing neural network
JPH076146A (en) Parallel data processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 350003 building 18, No.89, software Avenue, Gulou District, Fuzhou City, Fujian Province

Applicant after: Ruixin Microelectronics Co.,Ltd.

Address before: 350003 building 18, No.89, software Avenue, Gulou District, Fuzhou City, Fujian Province

Applicant before: FUZHOU ROCKCHIP ELECTRONICS Co.,Ltd.

GR01 Patent grant