CN116090530A - Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number - Google Patents


Info

Publication number
CN116090530A
CN116090530A
Authority
CN
China
Prior art keywords
data
convolution
convolution kernel
kernel size
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310150887.1A
Other languages
Chinese (zh)
Inventor
窦思远
朱博源
杨冬立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Songke Intelligent Technology Co ltd
Original Assignee
Guangdong Songke Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Songke Intelligent Technology Co ltd filed Critical Guangdong Songke Intelligent Technology Co ltd
Priority to CN202310150887.1A priority Critical patent/CN116090530A/en
Publication of CN116090530A publication Critical patent/CN116090530A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a systolic array structure and method with configurable convolution kernel size and parallel calculation number, comprising the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage. The invention adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure. Data need not be fetched from memory into a buffer or cache, realizing hardware acceleration of the convolution layer; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and different degrees of parallelism, improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.

Description

Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a systolic array structure and method capable of configuring the convolution kernel size and the number of parallel calculations.
Background
With technological development, artificial intelligence is widely applied in various fields, and research on and application of convolutional neural networks have always been at its core. Neural networks can provide high-precision inference for a wide range of applications. As neural networks become deeper and more complex, demands on their calculation speed and hardware area increase. In a convolutional neural network, the convolution layers account for more than 80% of the computation, so to deploy an efficient network on a low-power SoC, hardware acceleration of the convolution layer is indispensable.
In a processor architecture, a large amount of data is stored in memory outside the processor. When the computing module needs to operate, the data must be fetched from memory into a buffer or cache and then sent to the computing module. In typical convolutional-layer calculation, the convolution operation time is much shorter than the data transfer time: a convolutional neural network needs a large amount of data, and the data access speed is far lower than the data processing speed.
A traditional convolution computing unit suffers from low data reuse, high hardware-unit consumption, and low convolution calculation efficiency. A traditional systolic array is optimized for a specific convolution kernel size and degree of parallelism, and the array structure must be redesigned whenever the network structure changes.
Disclosure of Invention
The present invention is directed to a systolic array structure and method capable of configuring the convolution kernel size and the number of parallel computations, so as to solve the above-mentioned problems in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solution: a systolic ARRAY structure with configurable convolution kernel size and parallel calculation number comprises an IFMAP_RAM unit, a PE_ARRAY unit and a WEIGHT_RAM unit, wherein the IFMAP_RAM unit is connected to the PE_ARRAY unit through an outdata interface, the PE_ARRAY unit is connected to the WEIGHT_RAM unit through taps, and the PE_ARRAY unit connects to the outside through a done interface, an outsum_final interface and an ovalid interface.
Preferably, the pe_array unit includes a multiplier and a D flip-flop, and the multiplier is an 8-bit multiplier.
Preferably, the PE_ARRAY unit contains 7×15 PE units, and the supported configuration is one of: 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels.
Preferably, in the PE_ARRAY unit, when the convolution kernel size is 7 the maximum parallelism is 2; when the kernel size is 5 the maximum parallelism is 3; when the kernel size is 3 the maximum parallelism is 10; and when the kernel size is 2 the maximum parallelism is 21.
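The maximum-parallelism figures above are consistent with simply tiling whole k×k kernels into the 7×15 PE array; the following sketch illustrates that relationship (the tiling model is an assumption for illustration, not stated explicitly in the patent):

```python
# Hypothetical tiling model: a k x k kernel occupies a k x k block of PEs,
# and parallel kernels are packed as non-overlapping blocks in the 7 x 15 array.
ROWS, COLS = 7, 15

def max_parallelism(kernel_size: int) -> int:
    """Number of k x k kernel tiles that fit in the PE array."""
    return (ROWS // kernel_size) * (COLS // kernel_size)

# Matches the stated figures: kernel 2 -> 21, 3 -> 10, 5 -> 3, 7 -> 2
for k, expected in [(2, 21), (3, 10), (5, 3), (7, 2)]:
    assert max_parallelism(k) == expected, (k, max_parallelism(k))
```

Under this model the parallelism figures fall out of the array geometry alone, which is consistent with the claim that only configuration signals, not the array structure, change between kernel sizes.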
The method for a systolic array with configurable convolution kernel size and parallel calculation number comprises the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage;
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
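Step one is weight-stationary loading: weights shift down the array one row per cycle until every row holds its kernel values. A behavioral sketch in Python, with invented names for illustration (the real design uses the sliding-enable signal to stop loading):

```python
def load_weights(weight_stream, rows=7):
    """Shift weight rows downward one row per cycle, as the sliding window
    feeds the top PE row; loading stops when the stream is exhausted
    (modeling the sliding-enable signal going to zero)."""
    array = [None] * rows              # array[0] is the top PE row
    for w in weight_stream:
        # each cycle, every occupied row passes its data down one row
        for r in range(rows - 1, 0, -1):
            array[r] = array[r - 1]
        array[0] = w                   # top row takes new data from the window
    return array

# After 3 cycles of a 3-row kernel the first-fed row has sunk to the bottom.
loaded = load_weights(["w_row2", "w_row1", "w_row0"], rows=3)
assert loaded == ["w_row0", "w_row1", "w_row2"]
```

Note the stream must be fed bottom row first so the weights land in order, which is why the loading order matters in a weight-stationary design.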
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins;
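Two consequences of this step can be sketched directly: the window output is replicated once per parallel kernel, and the rightmost PE only starts after the data has marched across the columns. A minimal model (the grouping scheme and the one-column-per-cycle latency are illustrative assumptions):

```python
def broadcast_to_groups(window_data, parallel_num):
    """Replicate one sliding-window output for each parallel kernel group
    feeding the leftmost column of the systolic array."""
    if parallel_num <= 1:
        return [list(window_data)]     # no broadcast needed
    return [list(window_data) for _ in range(parallel_num)]

def cycles_until_start(num_columns):
    """Data enters the leftmost column and moves one column per cycle, so the
    rightmost active PE sees its first input after num_columns - 1 cycles,
    at which point the operation-enable signal is raised."""
    return num_columns - 1

groups = broadcast_to_groups([1, 2, 3], parallel_num=3)
assert len(groups) == 3 and all(g == [1, 2, 3] for g in groups)
assert cycles_until_start(15) == 14    # full 15-column array
```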
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
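These cycle counts follow from pairwise (adder-tree) accumulation of the k² partial products: 4 products reduce in 2 rounds, 9 in 4, 49 in 6. (Pure pairwise reduction of 25 products takes 5 rounds; the six cycles the patent states for 5×5 presumably include an extra pipeline stage.) A sketch of the round count:

```python
def reduction_rounds(n_products: int) -> int:
    """Rounds of pairwise accumulation needed to sum n partial products:
    each round halves the count (rounding up, odd element carried forward)."""
    rounds = 0
    while n_products > 1:
        n_products = (n_products + 1) // 2
        rounds += 1
    return rounds

assert reduction_rounds(2 * 2) == 2   # 2x2 kernel: two cycles
assert reduction_rounds(3 * 3) == 4   # 3x3 kernel: four cycles
assert reduction_rounds(7 * 7) == 6   # 7x7 kernel: six cycles
assert reduction_rounds(5 * 5) == 5   # 5x5: five by pure tree depth
```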
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs must be identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation;
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
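Taken together, steps one through five compute an ordinary valid-mode 2D convolution for each parallel kernel. A plain-Python reference model of the result the array must produce (the functional specification, not the hardware dataflow):

```python
def conv2d_valid(ifmap, kernel):
    """Reference 2D convolution (valid padding, stride 1) that a
    weight-stationary systolic array implementation must reproduce."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(ifmap) - kh + 1
    ow = len(ifmap[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                ifmap[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw)
            )
    return out

ifmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]          # a 2x2 kernel (max parallelism 21)
assert conv2d_valid(ifmap, kernel) == [[6, 8], [12, 14]]
```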
Preferably, in step two, if the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3. When the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle.
Preferably, in step four, whether an output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0].
Compared with the prior art, the invention has the following beneficial effects: the invention adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure. Data need not be fetched from memory into a buffer or cache, which aids hardware acceleration of the convolution layer; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and degrees of parallelism, further improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.
Drawings
FIG. 1 is a block diagram of a system architecture of the present invention;
FIG. 2 is a diagram of a systolic array structure of the present invention;
FIG. 3 is a diagram showing the output of the systolic array to the output feature map cache;
FIG. 4 is a flow chart of the method of the present invention;
in the figure: 1. IFMAP_RAM unit; 2. PE_ARRAY unit; 3. WEIGHT_RAM unit.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to figs. 1-3, an embodiment of the present invention is provided: a systolic ARRAY structure with configurable convolution kernel size and parallel calculation number comprises an IFMAP_RAM unit 1, a PE_ARRAY unit 2 and a WEIGHT_RAM unit 3. The IFMAP_RAM unit 1 is connected to the PE_ARRAY unit 2 through an outdata interface, the PE_ARRAY unit 2 is connected to the WEIGHT_RAM unit 3 through taps, and the PE_ARRAY unit 2 connects to the outside through a done interface, an outsum_final interface and an ovalid interface. The PE_ARRAY unit 2 comprises a multiplier and a D flip-flop, the multiplier being an 8-bit multiplier. The PE_ARRAY unit 2 contains 7×15 PE units; the supported configuration is one of 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels. When the convolution kernel size is 7 the maximum parallelism is 2; when it is 5 the maximum parallelism is 3; when it is 3 the maximum parallelism is 10; and when it is 2 the maximum parallelism is 21.
Referring to fig. 4, an embodiment of the present invention is provided: the method for a systolic array with configurable convolution kernel size and parallel calculation number comprises the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage;
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins. If the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3: when the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle;
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs are identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation; whether an output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0];
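The three flags fit in a single 10-bit status word. The packing below is an illustrative assumption: the patent specifies only the field widths ovalid[2:0], kernel_size[1:0] and kernel_num[4:0], not bit positions, and the 2-bit kernel_size and 5-bit kernel_num widths are just enough for four kernel sizes (2/3/5/7) and up to 21 parallel kernels:

```python
# Hypothetical packing of the output-valid status fields into one word.
# Field widths come from the patent; bit positions are an assumption.
def pack_status(ovalid: int, kernel_size: int, kernel_num: int) -> int:
    assert 0 <= ovalid < 8           # ovalid[2:0]
    assert 0 <= kernel_size < 4      # kernel_size[1:0]: encodes sizes 2/3/5/7
    assert 0 <= kernel_num < 32      # kernel_num[4:0]: up to 21 parallel kernels
    return (ovalid << 7) | (kernel_size << 5) | kernel_num

def unpack_status(word: int):
    return (word >> 7) & 0x7, (word >> 5) & 0x3, word & 0x1F

w = pack_status(ovalid=5, kernel_size=3, kernel_num=21)
assert unpack_status(w) == (5, 3, 21)
```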
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
In summary, the advantages of the invention are that it adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and degrees of parallelism, further improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (7)

1. A systolic ARRAY structure with configurable convolution kernel size and parallel calculation count, comprising an IFMAP_RAM unit (1), a PE_ARRAY unit (2) and a WEIGHT_RAM unit (3), characterized in that: the IFMAP_RAM unit (1) establishes a data connection with the PE_ARRAY unit (2) through an outdata interface, the PE_ARRAY unit (2) establishes a data connection with the WEIGHT_RAM unit (3) through taps, and the PE_ARRAY unit (2) establishes a data connection with the outside through a done interface, an outsum_final interface and an ovalid interface.
2. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: the PE_ARRAY unit (2) comprises a multiplier and a D flip-flop, the multiplier being an 8-bit multiplier.
3. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: the PE_ARRAY unit (2) contains 7×15 PE units, and the supported configuration is one of 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels.
4. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: in the PE_ARRAY unit (2), when the convolution kernel size is 7 the maximum parallelism is 2; when it is 5 the maximum parallelism is 3; when it is 3 the maximum parallelism is 10; and when it is 2 the maximum parallelism is 21.
5. A method for a systolic array with configurable convolution kernel size and parallel calculation number, comprising: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage; characterized in that:
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins;
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs must be identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation;
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
6. The method with configurable convolution kernel size and parallel calculation number according to claim 5, characterized in that: in step two, if the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3. When the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle.
7. The method with configurable convolution kernel size and parallel calculation number according to claim 5, characterized in that: in step four, whether the output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0].
CN202310150887.1A 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number Pending CN116090530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310150887.1A CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310150887.1A CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Publications (1)

Publication Number Publication Date
CN116090530A 2023-05-09

Family

ID=86199178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310150887.1A Pending CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Country Status (1)

Country Link
CN (1) CN116090530A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116980277A (en) * 2023-09-18 2023-10-31 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN116980277B (en) * 2023-09-18 2024-01-12 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108681984B (en) Acceleration circuit of 3*3 convolution algorithm
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN108629406B (en) Arithmetic device for convolutional neural network
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
CN109284824A (en) A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies
CN116090530A (en) Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN113222129B (en) Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111610963B (en) Chip structure and multiply-add calculation engine thereof
CN201111042Y (en) Two-dimension wavelet transform integrate circuit structure
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
KR20240035999A (en) Hybrid machine learning architecture using neural processing units and compute-in-memory processing elements
Yang et al. A Parallel Processing CNN Accelerator on Embedded Devices Based on Optimized MobileNet
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
CN113869507B (en) Neural network accelerator convolution calculation device and method based on pulse array
CN112230884B (en) Target detection hardware accelerator and acceleration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination