CN112632459A - On-line computation element for deep convolution - Google Patents

On-line computation element for deep convolution

Info

Publication number
CN112632459A
Authority
CN
China
Prior art keywords
convolution
component
activation
activation value
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011525795.XA
Other languages
Chinese (zh)
Other versions
CN112632459B (en)
Inventor
张昆
钱磊
尚江卫
原昊
朱剑文
曾明勇
陆一峰
贾迅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN202011525795.XA
Publication of CN112632459A
Application granted
Publication of CN112632459B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an on-line computation component for deep convolution, comprising a standard convolution component, an accumulator, and a deep convolution component connected to the data output interface of the accumulator. The deep convolution component comprises multiple stages of activation value stations, a plurality of multipliers, a plurality of weight value stations, and at least one delay station arranged between two adjacent activation value stations; each multiplier is provided with one activation value station and one weight value station; the delay value D of each delay station equals the width of the input activation map; the weight values are preset before the convolution calculation starts; the activation values are injected into the computation component in a step-by-step advancing manner, and the value currently stored in each stage of activation value station is forwarded to the next-stage activation value station. The invention efficiently completes deep convolution calculation without disturbing the output data structure of the accumulator, can greatly improve the utilization of computing resources for deep convolution, and accelerates the computation of the whole neural network.

Description

On-line computation element for deep convolution
Technical Field
The invention relates to an online computation component for deep convolution, and belongs to the technical field of neural networks.
Background
Most of the calculations in a deep neural network are convolutions, so neural network hardware accelerators typically include specialized computation components to accelerate convolution operations. Convolution acceleration components are generally organized as multi-vector or systolic arrays; these structures (hereinafter referred to as standard convolution components) can efficiently integrate a large number of multiplier units, achieving high chip area utilization and performance.
The deep (depthwise) convolution is a special kind of convolution whose main characteristic is the absence of accumulation along the input-channel direction. As a result, a standard convolution component is very inefficient when used for deep convolution, and hardware resource utilization drops.
To accelerate convolution operations in a neural network, a multi-vector (SIMD) or systolic array hardware structure is generally adopted. The two structures are essentially equivalent, so the multi-vector structure is used as the example below.
First, the convolution operation can be described as the following 6-layer loop:
Table 1: Loop hierarchy of the convolution operation
for m in M:                  // layer 5: output channel
  for h in H:                // layer 4: output feature map height
    for w in W:              // layer 3: output feature map width
      for r in R:            // layer 2: convolution kernel height
        for s in S:          // layer 1: convolution kernel width
          for c in C:        // layer 0: input channel
            f_out[m][h][w] += ker[m][r][s][c] * activation[h+r][w+s][c]   // accumulated over R, S and C into 1 number
The order of these 6-level loops is mathematically interchangeable, and the order listed above is one of the computational orders that facilitates hardware acceleration implementations.
The standard convolution operation unit generally exploits the computation parallelism in the innermost loop (the input channel) to complete a large number (for example, 64 or 128) of input-channel iterations simultaneously in one clock cycle, realizing efficient hardware resource utilization and accelerating the computation of the neural network. Unlike standard convolution, the deep convolution operation does not have this innermost loop.
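For reference, the two loop nests can be written as the following minimal NumPy sketch (illustrative code, not taken from the patent; array shapes and names are assumptions): the standard convolution accumulates over the innermost input-channel loop, while the deep convolution variant has no such loop.

import numpy as np

def standard_conv(act, ker):
    # act: [H+R-1, W+S-1, C] input activations; ker: [M, R, S, C] weights
    M, R, S, C = ker.shape
    H = act.shape[0] - R + 1
    W = act.shape[1] - S + 1
    f_out = np.zeros((M, H, W))
    for m in range(M):                      # layer 5: output channel
        for h in range(H):                  # layer 4: output feature map height
            for w in range(W):              # layer 3: output feature map width
                for r in range(R):          # layer 2: kernel height
                    for s in range(S):      # layer 1: kernel width
                        for c in range(C):  # layer 0: input channel (accumulated)
                            f_out[m, h, w] += ker[m, r, s, c] * act[h + r, w + s, c]
    return f_out

def depthwise_conv(act, ker):
    # act: [H+K-1, W+K-1, M] activations; ker: [M, K, K], one kernel per channel,
    # with no input-channel loop to accumulate over
    M, R, S = ker.shape
    H = act.shape[0] - R + 1
    W = act.shape[1] - S + 1
    f_out = np.zeros((M, H, W))
    for m in range(M):
        for h in range(H):
            for w in range(W):
                for r in range(R):
                    for s in range(S):
                        f_out[m, h, w] += ker[m, r, s] * act[h + r, w + s, m]
    return f_out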
The structure of the multi-vector standard convolution acceleration component is shown in Fig. 1. Its design principle is that one input activation value must be multiplied by multiple weights, so different weight (kernel) data are stored in multiple compute units and the same input activation value is broadcast to all of them. Each compute unit consists of SIMD_W multipliers, so SIMD_W multiplications are completed simultaneously and their results are accumulated into 1 number. After several rounds of computation (SIMD_W multiplications per round), the complete innermost input-channel loop of the 6-layer convolution loop is finished, and the result is the accumulation over the rounds (still 1 number).
This standard convolution component design is highly general and can handle various convolution calculations (that is, even if a certain loop level in the 6-layer loop is absent, the standard component can still perform the computation), but it is inefficient for deep convolution. Specifically, with respect to the 6-layer loop structure, deep convolution has no innermost input-channel traversal. Because the standard convolution component assumes a large number of input-channel iterations, it is usually designed with a wide vector width (for example, SIMD_W = 64 or 128); during deep convolution calculation, the SIMD_W-wide multiplier array cannot be filled, so the computational efficiency drops. Furthermore, because of its general design, the standard convolution component cannot proceed to other convolution calculations of subsequent network layers before the deep convolution operation finishes, which lowers the computational efficiency of the whole network.
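A back-of-the-envelope illustration of this efficiency loss (the vector width is an assumed example consistent with the widths mentioned above):

SIMD_W = 64        # assumed vector width of the standard convolution component
useful_lanes = 1   # deep convolution has no input-channel accumulation to fill the lanes
print(f"multiplier utilization on deep convolution: {useful_lanes / SIMD_W:.1%}")  # prints 1.6%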
Disclosure of Invention
The invention aims to provide an on-line computation component for deep convolution that can efficiently complete deep convolution computation without disturbing the output data structure of the accumulator, greatly improve the utilization of computing resources for deep convolution, raise the overall efficiency of the chip, and accelerate the computation of the whole neural network.
To achieve this purpose, the invention adopts the following technical scheme: an on-line computation component for deep convolution is provided, comprising a standard convolution component, an accumulator, and a deep convolution component connected to the data output interface of the accumulator;
the standard convolution component is used for calculating standard convolution in the convolutional neural network;
the accumulator is used for sending the convolution result obtained from the standard convolution component to the deep convolution component;
the deep convolution component is used for calculating the deep convolution in the convolutional neural network;
the deep convolution component comprises multiple stages of activation value stations, a plurality of multipliers, a plurality of weight value stations, and at least one delay station arranged between two adjacent activation value stations; each multiplier is provided with one activation value station and one weight value station; the delay value D of each delay station equals the width of the input activation map; the weight values are preset before the convolution calculation starts; the activation values are injected into the multipliers in a step-by-step advancing manner, and in each clock cycle the value currently stored in each stage of activation value station is forwarded to the next-stage activation value station;
before the calculation starts, the delay value D of each delay station is set equal to the width of the input activation map according to the size of the input activation map coming from the accumulator; a delay station outputs valid data to the next-stage activation value station at its output port only after it has received D data;
the output activation map is organized into columns by output channel (M) and into rows by coordinate order; in each clock cycle, starting from the upper-left corner of the output activation map, the accumulator sends the input activation value data into the activation value stations of the deep convolution component in row-major order, from left to right within a row and then row by row from top to bottom;
when the deep convolution kernel size is K, once the input activation value data of the first K-1 rows have been sent into the activation value stations of the deep convolution component and K further input activation values have been sent, all activation value stations hold valid input activation value data at the same time; at that moment the multipliers complete the one-to-one multiplications of the input activation values with the weight values, and the multiplication results are merged by the adder-tree logic into one convolution result of the deep convolution.
Further improvements to the above technical scheme are as follows:
1. In the above scheme, the number of deep convolution components is less than or equal to the number of output channels of the output activation map; when it is less than the number of output channels, the calculation of the whole output activation map is completed in a time-division multiplexing manner.
2. In the above scheme, the number of delay stations is equal to the side length of the square convolution kernel minus 1.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
the on-line computation component of the deep convolution is characterized in that a customized deep convolution component is connected to an accumulator output interface of a standard convolution component, and the on-line computation and the customization of the component are adopted, so that the deep convolution computation is efficiently completed on the premise of not damaging an accumulator output data structure, the computation resource utilization rate of the deep convolution computation can be greatly improved, the overall efficiency of a chip is improved, the computation speed of the whole neural network is accelerated, and the problem of low hardware efficiency of the standard convolution component in the computation of the deep convolution is solved while the original design structure is not damaged.
Drawings
FIG. 1 is a diagram of a multi-vector standard convolution acceleration component;
FIG. 2 is a schematic structural diagram of the present invention;
FIG. 3 is a schematic diagram of a data storage structure in a standard convolution component accumulator;
FIG. 4 is a schematic diagram of the coupling of a standard convolution component and a deep convolution component of the present invention;
FIG. 5 is a schematic diagram of the structure of a deep convolution component;
FIG. 6 is a schematic diagram of the deep convolution component calculation;
FIG. 7 is a schematic diagram of the computation of multiple deep convolution components.
Detailed Description
Embodiment: the invention provides an on-line computation component for deep convolution, comprising a standard convolution component, an accumulator, and a deep convolution component connected to the data output interface of the accumulator;
the standard convolution component is used for calculating standard convolution in the convolutional neural network;
the accumulator is used for sending the convolution result obtained from the standard convolution component to the deep convolution component;
the deep convolution component is used for calculating the deep convolution in the convolutional neural network;
the deep convolution component comprises multiple stages of activation value stations, a plurality of multipliers, a plurality of weight value stations, and at least one delay station arranged between two adjacent activation value stations; each multiplier is provided with one activation value station and one weight value station; the delay value D of each delay station equals the width of the input activation map; the weight values are preset before the convolution calculation starts; the activation values are injected into the multipliers in a step-by-step advancing manner, and in each clock cycle the value currently stored in each stage of activation value station is forwarded to the next-stage activation value station;
before the calculation starts, the delay value D of each delay station is set equal to the width of the input activation map according to the size of the input activation map coming from the accumulator; a delay station outputs valid data to the next-stage activation value station at its output port only after it has received D data;
the output activation map is organized into columns by output channel (M) and into rows by coordinate order; in each clock cycle, starting from the upper-left corner of the output activation map, the accumulator sends the input activation value data into the activation value stations of the deep convolution component in row-major order, from left to right within a row and then row by row from top to bottom;
when the deep convolution kernel size is K, once the input activation value data of the first K-1 rows have been sent into the activation value stations of the deep convolution component and K further input activation values have been sent, all activation value stations hold valid input activation value data at the same time; at that moment the multipliers complete the one-to-one multiplications of the input activation values with the weight values, and the multiplication results are merged by the adder-tree logic into one convolution result of the deep convolution.
The number of deep convolution components is less than or equal to the number of output channels of the output activation map. When it is less than the number of output channels, the calculation of the whole output activation map is completed in a time-division multiplexing manner, which saves hardware resources while preserving performance.
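As a sketch of the time-division multiplexing idea (an illustrative scheduler, not the patent's control logic; all names are assumptions), m deep convolution components cover M output channels in ceil(M / m) rounds:

import math

def schedule_rounds(M, m):
    # Assign output channels to rounds; each round reuses the same m deep convolution components.
    rounds = math.ceil(M / m)
    return [list(range(r * m, min((r + 1) * m, M))) for r in range(rounds)]

print(schedule_rounds(M=128, m=32))   # 4 rounds of 32 channels each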
The above embodiments are further explained as follows:
the overall structure of the invention is shown in fig. 2, a depth convolution component with a customized structure is designed on a data output interface of an accumulator of a standard convolution component, and efficient depth convolution calculation is carried out by an online calculation method.
The convolution result of the general convolution component is generally stored in an accumulator, and is used for combining intermediate results of multiple rounds of circulation, and after several rounds of calculation, a complete convolution result (output characteristic diagram) is obtained in the accumulator, and the structure of the convolution result is shown in fig. 3;
in fig. 3, the output activation map is organized into a plurality of columns according to the output channels (M), and organized into a plurality of rows according to the coordinate sequence, and when the result in the accumulator is output, the result is generally read according to the row direction (that is, the data of the same seat number and different output channels are read in each beat); the complete convolution result obtained in the accumulator is typically sent to memory or once again to the convolution component for calculation.
In the invention, the accumulator result is sent directly to the deep convolution component, so that on-line deep convolution calculation is realized. Specifically, as shown in Fig. 4, the accumulator outputs feature-map data of m channels in one clock cycle, and correspondingly m independent deep convolution components carry out the convolution operation;
because this is a deep convolution calculation, the m channel values output by the accumulator do not need to be combined into one number (that is, the innermost loop listed in Table 1 does not exist). The data of each channel are therefore sent to an independent deep convolution component, which implements the multiply-accumulate calculation of layers 1 and 2 listed in Table 1, and the output of each component still corresponds to its input channel Mi.
The dedicated deep convolution component is generally designed as, but not limited to, the structure of Fig. 5;
Fig. 5 shows a component structure capable of calculating the deep convolution result of a single channel. It contains 9 stages of activation value stations, and in each clock cycle the value currently stored in each stage is forwarded to the next stage; accordingly, there are 9 multipliers and 9 weight value stations;
before the calculation starts, the delay value D of the delay stations in Fig. 5 is set equal to the width of the input activation map according to the size of the input activation map. The function of a delay station is to output valid data to the subsequent stage at its output port only after it has received D data;
in each clock cycle, starting from the upper-left corner of the output activation map, the input activation value data are fed into the deep convolution component in row-major order: from left to right within a row, and then row by row from top to bottom;
after the activation values of the first two rows have been sent into the deep convolution component and 3 further activation values have been sent, all activation value stations in Fig. 5 hold valid activation value data at the same time. At this moment the 9 multipliers simultaneously complete the one-to-one multiplications of the 9 activation values with the 9 weight values, and the 9 products are merged by the adder-tree logic into 1 result, which is one convolution result of the deep convolution. The mapping of 9 input activation values to 1 output activation value through the deep convolution component is shown in Fig. 6.
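A behavioral sketch of this streaming computation follows (an illustrative simplification under assumed names, not the patent's exact register-transfer design): the nine window taps stand in for the activation value stations, the line storage stands in for the delay stations (whose delay value D the patent sets to the input image width), and one deep convolution result is produced per clock cycle once all taps hold valid data.

from collections import deque

def depthwise_3x3_stream(act_rows, weights):
    # act_rows: the single-channel input activation map, a list of rows of width W (assumed names)
    # weights:  3x3 nested list of depthwise weights, preset before streaming starts
    W = len(act_rows[0])
    span = 2 * W + 3                      # enough storage so that all 9 window taps coexist
    line = deque(maxlen=span)             # plays the role of activation value stations + delay stations
    outputs = []
    for n, value in enumerate(v for row in act_rows for v in row):
        line.append(value)                # one activation value injected per clock cycle (raster order)
        r, c = n // W, n % W              # coordinate of the newest sample
        if r >= 2 and c >= 2:             # two full rows plus 3 values have been streamed
            acc = 0
            for dr in range(3):
                for dc in range(3):
                    tap = line[-1 - (dr * W + dc)]        # tap = one activation value station
                    acc += weights[2 - dr][2 - dc] * tap  # one of the 9 multipliers
            outputs.append(acc)           # adder tree merges the 9 products into 1 result
    return outputs

# Usage example with made-up values: a 3x4 input map and a diagonal 3x3 kernel
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
k = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(depthwise_3x3_stream(img, k))       # -> [18, 21]

In hardware the same effect is obtained with the chain of activation value stations and delay stations of Fig. 5, so no random access into a buffer is needed; the sketch only models the timing behaviour, not the register allocation.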
Because the deep convolution component shown in Fig. 5 works on a single channel, m copies of it can be instantiated to increase the calculation speed, so that the activation value data of m channels are received simultaneously in each clock cycle. To simplify the control logic, in each clock cycle the data of the same coordinate position across the m channels are sent into the m deep convolution components, achieving m-fold parallelism;
as shown in Fig. 7, since the m outputs of the deep convolution components correspond to m independent channels and each clock cycle produces the same coordinate position of the output activation map, the results of the deep convolution components can be organized in the layout shown in Fig. 3, so that they can be sent directly to the storage component, or sent again to other convolution components for a new convolution operation.
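For illustration only (a hypothetical helper; stream_fn stands for any per-channel routine such as the sketch above), the m per-channel components running in lock-step can be modeled as:

def depthwise_parallel(act_maps, kernels, stream_fn):
    # act_maps: list of m single-channel activation maps; kernels: list of m 3x3 kernels
    # stream_fn: a per-channel depthwise routine, e.g. depthwise_3x3_stream from the sketch above
    # Each entry of the result corresponds to one independent output channel,
    # matching the channel-column layout of Fig. 3.
    return [stream_fn(act, ker) for act, ker in zip(act_maps, kernels)]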
The number of output channels that can be obtained by each read operation on the accumulator is denoted acc_out_w, and the number of channels that can be processed simultaneously by the deep convolution components is denoted m.
The data format output by the deep convolution component is consistent with that of the accumulator output, so the way in which the accumulator result was originally consumed by its destinations is not affected by the invention.
The independent deep convolution components are specially designed, and the specific implementation method is not limited in the invention.
When this on-line computation component for deep convolution is adopted, a customized deep convolution component is connected to the accumulator output interface of the standard convolution component. Through on-line computation and component customization, the deep convolution calculation is completed efficiently without disturbing the output data structure of the accumulator; the utilization of computing resources for deep convolution is greatly improved, the overall efficiency of the chip is raised, and the computation of the whole neural network is accelerated, solving the problem of low hardware efficiency of the standard convolution component on deep convolution while leaving the original design structure intact.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (3)

1. An on-line computation component for deep convolution, characterized by comprising a standard convolution component, an accumulator, and a deep convolution component connected to the data output interface of the accumulator;
the standard convolution component is used for calculating standard convolution in the convolutional neural network;
the accumulator is used for sending the convolution result obtained from the standard convolution component to the deep convolution component;
the deep convolution component is used for calculating the deep convolution in the convolutional neural network;
the deep convolution component comprises multiple stages of activation value stations, a plurality of multipliers, a plurality of weight value stations, and at least one delay station arranged between two adjacent activation value stations; each multiplier is provided with one activation value station and one weight value station; the delay value D of each delay station equals the width of the input activation map; the weight values are preset before the convolution calculation starts; the activation values are injected into the multipliers in a step-by-step advancing manner, and in each clock cycle the value currently stored in each stage of activation value station is forwarded to the next-stage activation value station;
before the calculation starts, the delay value D of each delay station is set equal to the width of the input activation map according to the size of the input activation map coming from the accumulator; a delay station outputs valid data to the next-stage activation value station at its output port only after it has received D data;
the output activation map is organized into columns by output channel (M) and into rows by coordinate order; in each clock cycle, starting from the upper-left corner of the output activation map, the accumulator sends the input activation value data into the activation value stations of the deep convolution component in row-major order, from left to right within a row and then row by row from top to bottom;
when the deep convolution kernel size is K, once the input activation value data of the first K-1 rows have been sent into the activation value stations of the deep convolution component and K further input activation values have been sent, all activation value stations hold valid input activation value data at the same time; at that moment the multipliers complete the one-to-one multiplications of the input activation values with the weight values, and the multiplication results are merged by the adder-tree logic into one convolution result of the deep convolution.
2. The on-line computation component for deep convolution of claim 1, characterized in that: the number of deep convolution components is less than or equal to the number of output channels of the output activation map, and when it is less than the number of output channels, the calculation of the whole output activation map is completed in a time-division multiplexing manner.
3. The on-line computation component for deep convolution of claim 1, characterized in that: the number of delay stations is equal to the side length of the square convolution kernel minus 1.
CN202011525795.XA 2020-12-22 2020-12-22 On-line computing component for depth convolution Active CN112632459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011525795.XA CN112632459B (en) 2020-12-22 2020-12-22 On-line computing component for depth convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011525795.XA CN112632459B (en) 2020-12-22 2020-12-22 On-line computing component for depth convolution

Publications (2)

Publication Number Publication Date
CN112632459A true CN112632459A (en) 2021-04-09
CN112632459B CN112632459B (en) 2023-07-07

Family

ID=75320624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011525795.XA Active CN112632459B (en) 2020-12-22 2020-12-22 On-line computing component for depth convolution

Country Status (1)

Country Link
CN (1) CN112632459B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
DE102017117381A1 * 2016-08-11 2018-02-15 Nvidia Corporation Accelerator for sparse convolutional neural networks
WO2018120989A1 (en) * 2016-12-29 2018-07-05 华为技术有限公司 Convolution operation chip and communication device
CN111626399A (en) * 2019-02-27 2020-09-04 中国科学院半导体研究所 Convolutional neural network calculation device and data calculation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
DE102017117381A1 * 2016-08-11 2018-02-15 Nvidia Corporation Accelerator for sparse convolutional neural networks
WO2018120989A1 (en) * 2016-12-29 2018-07-05 华为技术有限公司 Convolution operation chip and communication device
CN111626399A (en) * 2019-02-27 2020-09-04 中国科学院半导体研究所 Convolutional neural network calculation device and data calculation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Peiqing; Wu Lijun: "Improved binarized convolution layer design based on FPGA" (基于FPGA的改进二值化卷积层设计), 电气开关 (Electrical Switchgear), no. 06 *

Also Published As

Publication number Publication date
CN112632459B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN106445471A (en) Processor and method for executing matrix multiplication on processor
CN111445012A (en) FPGA-based packet convolution hardware accelerator and method thereof
CN111796796B (en) FPGA storage method, calculation method, module and FPGA board based on sparse matrix multiplication
CN111079923B (en) Spark convolutional neural network system suitable for edge computing platform and circuit thereof
WO2022110386A1 (en) Data processing method and artificial intelligence processor
CN108197075B (en) Multi-core implementation method of Inceptation structure
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN111738433A (en) Reconfigurable convolution hardware accelerator
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN113168309A (en) Method, circuit and SOC for executing matrix multiplication operation
CN112395549B (en) Reconfigurable matrix multiplication acceleration system for matrix multiplication intensive algorithm
CN112862091A (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112632459B (en) On-line computing component for depth convolution
CN113158132A (en) Convolution neural network acceleration system based on unstructured sparsity
CN110018457B (en) Method for designing satellite-borne SAR echo data frame header identifier detection functional module
CN114239816B (en) Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN112905526B (en) FPGA implementation method for multiple types of convolution
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant