CN113361687A - Configurable addition tree suitable for convolutional neural network training accelerator - Google Patents


Info

Publication number
CN113361687A
CN113361687A
Authority
CN
China
Prior art keywords: order, groups, mode, adders, multiplexers
Prior art date
Legal status
Granted
Application number
CN202110597775.1A
Other languages
Chinese (zh)
Other versions
CN113361687B (en)
Inventor
刘强 (Liu Qiang)
孟浩 (Meng Hao)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110597775.1A
Publication of CN113361687A
Application granted
Publication of CN113361687B
Legal status: Active

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 - Computations with decimal numbers radix 12 or 20
    • G06F7/4912 - Adding; Subtracting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a configurable addition tree suitable for a convolutional neural network training accelerator. The tree consists of three groups of addition units, each group comprising a first-order multiplexer-and-adder structure, a second-order multiplexer-and-adder structure, and a third-order multiplexer-and-adder structure connected in series. Mode selection is performed by the multiplexers, whose outputs feed the next-stage adders in series. Compared with the prior art, the invention 1) reduces the use of addition resources at high parallelism; 2) suits both the in-kernel accumulation of conventional 3x3 forward-propagation convolution and the self-accumulation of weight-gradient computation with very large (non-fixed-size) convolution kernels; and 3) accommodates different data precisions.

Description

Configurable addition tree suitable for convolutional neural network training accelerator
Technical Field
The invention belongs to the fields of information technology and hardware acceleration of convolutional neural network training, and particularly relates to low-power, high-performance convolutional neural network training.
Background
With the wide application of artificial-intelligence technology, the design of on-line training chips has gradually become a research frontier for AI chips at home and abroad. A convolutional neural network (CNN) is a feedforward neural network widely applied in computer vision, natural language processing, and other fields. The CNN training process involves large data volumes, complex access patterns, and synchronization requirements, placing very high demands on storage space, access bandwidth, and management mechanisms. Existing hardware architectures explore efficient hardware implementations of the convolution training operators around the training algorithm, to meet the computation and storage demands of deep neural networks. The basic operators of a CNN training algorithm include convolution, pooling, activation functions, normalization, the loss function, and the derivatives of these operations; among them, the convolutional layers are a key component of the CNN and occupy a central position. A convolution single-engine architecture supporting both forward propagation and backward propagation in the CNN training process is therefore of great significance: different deep-neural-network models can be trained by mapping them onto a configurable training-accelerator architecture. FPGAs, with their strong programmability, high parallelism, and high energy efficiency, have become one of the main platforms for implementing CNN training. The forward propagation (and error back-propagation) and the weight-gradient computation of a CNN training accelerator require two different forms of accumulation.
However, existing CNN training accelerators are optimized mainly for the multiplication units; the addition units receive far less attention. Implementing the in-kernel addition tree and the self-accumulation units separately requires 17 addition units at a parallelism of one, so at high parallelism the addition units consume a large amount of computing resources.
Optimizing the addition tree to reduce this consumption of computing resources is therefore the technical problem the present invention addresses.
Disclosure of Invention
To further reduce the resource occupation of the addition units, the invention provides a configurable addition tree suitable for a convolutional neural network training accelerator. The configurable design supports both the in-kernel accumulation of forward propagation and error back-propagation and the self-accumulation of weight-gradient computation, thereby optimizing the hardware architecture for the different accumulation forms in a CNN training accelerator.
The technical scheme adopted by the invention to solve the problem is as follows:
a configurable addition tree suitable for a convolutional neural network training accelerator, composed of three groups of addition units, each group comprising a first-order multiplexer-and-adder structure, a second-order multiplexer-and-adder structure, and a third-order multiplexer-and-adder structure connected in series; mode selection is performed by the multiplexers, whose outputs feed the next-stage adders in series.
Compared with the prior art, the configurable addition tree applicable to the convolutional neural network training accelerator can achieve the following beneficial effects:
1) at high parallelism, the use of addition resources is reduced;
2) it suits both the accumulation of conventional 3x3 forward-propagation convolution and the accumulation of weight-gradient computation with very large (non-fixed-size) convolution kernels;
3) it accommodates different data precisions.
Drawings
FIG. 1 is a schematic diagram of a configurable additive tree architecture for a convolutional neural network training accelerator according to the present invention;
FIG. 2 is a schematic diagram of the accumulation modes of the configurable addition tree suitable for a convolutional neural network training accelerator according to the present invention: (a) mode 0, the convolution-kernel addition-tree mode; (b) mode 1, the self-accumulation mode.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
The configurable addition tree suitable for a convolutional neural network training accelerator combines software and hardware: different accumulation functions are realized by dynamically configuring the functional mode of the addition tree, solving the problem of large resource occupation in training-accelerator design.
FIG. 1 is a schematic diagram of a configurable addition tree architecture suitable for a convolutional neural network training accelerator according to the present invention.
In a CNN training accelerator, the adder tree is configured for convolutional layers of size 3x3. The structure of the configurable adder tree is as follows: every 3 addition units form a group; the results of the 9 multiplication units in a convolution kernel are divided into 3 groups, (a2, a1, a3), (a5, a4, a6), and (a8, a7, a9); and, through the connections of the multiplexers and addition units, two 4-level addition trees (first through fourth levels) and one 3-level addition tree are formed.
Each group of addition units comprises a first-order multiplexer-and-adder structure, a second-order multiplexer-and-adder structure, and a third-order multiplexer-and-adder structure connected in series. Mode selection is performed by the multiplexers: mode 0 is the convolution-kernel accumulation mode, and mode 1 is the self-accumulation mode. A multiplexer is placed at the input of each stage's adder, and its output feeds that adder in series. Across the whole network architecture there are three situations: forward propagation, error back-propagation, and weight-gradient computation for the different layers. The multiplexers are controlled by register configuration: if the single-engine architecture is performing the forward propagation or error back-propagation of CNN training, the addition tree is configured into mode 0; if it is performing the weight-gradient computation, the configurable addition tree enters mode 1.
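As a minimal behavioral sketch (illustrative Python, not the patent's circuit; the class and signal names are invented here), one multiplexer-and-adder stage can be modeled as follows: in mode 0 the multiplexer passes an external operand, making the adder a node of the kernel addition tree, while in mode 1 it selects the adder's own registered output through the feedback line, turning the same adder into a self-accumulator.

```python
class MuxAdderStage:
    """One multiplexer-and-adder stage of the configurable tree (sketch)."""

    def __init__(self):
        self.acc = 0  # registered adder output; also the feedback source

    def step(self, mode, tree_operand, direct_operand):
        # mode 0: the multiplexer selects the external (tree) operand
        # mode 1: the multiplexer selects the adder's own previous output
        mux_out = tree_operand if mode == 0 else self.acc
        self.acc = mux_out + direct_operand
        return self.acc
```

Calling `step(0, a2, a1)` once yields a2 + a1, while repeated `step(1, 0, x)` calls accumulate the x stream, which is the register-configured mode switch described above.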
In the first set of addition units:
in the first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a2 of the multiplication unit in the mode-0 state, and is connected to the output of the first-order adder through a feedback line in the mode-1 state; the two inputs of the first-order adder are the output of the first-order multiplexer and the result a1 of the multiplication unit;
in the second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the first-order adder in the mode-0 state, and is connected to the output of the second-order adder through a feedback line in the mode-1 state; the two inputs of the second-order adder are the output of the second-order multiplexer and the result a3 of the multiplication unit;
in the third-order multiplexer-and-adder structure, the first third-order multiplexer selects the output of the second-order adder in the mode-0 state, and is connected to the output of the third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a2 of the multiplication unit in the mode-1 state; the two inputs of the third-order adder are the outputs of the first and second third-order multiplexers.
in the second group of adding units:
in the second group's first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a5 of the multiplication unit in the mode-0 state, and is connected to the output of the second group's first-order adder through a feedback line in the mode-1 state; the two inputs of that first-order adder are the output of the multiplexer and the result a4 of the multiplication unit;
in the second group's second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the second group's first-order adder in the mode-0 state, and is connected to the output of the second group's second-order adder through a feedback line in the mode-1 state; the two inputs of that second-order adder are the output of the multiplexer and the result a6 of the multiplication unit;
in the second group's third-order multiplexer-and-adder structure, the first third-order multiplexer selects the output of the second group's second-order adder in the mode-0 state, and is connected to the output of the second group's third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a5 of the multiplication unit in the mode-1 state; the two inputs of that third-order adder are the outputs of the first and second third-order multiplexers.
in the third group of addition units:
in the third group's first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a8 of the multiplication unit in the mode-0 state, and is connected to the output of the third group's first-order adder through a feedback line in the mode-1 state; the two inputs of that first-order adder are the output of the multiplexer and the result a7 of the multiplication unit;
in the third group's second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the third group's first-order adder in the mode-0 state, and is connected to the output of the third group's second-order adder through a feedback line in the mode-1 state; the two inputs of that second-order adder are the output of the multiplexer and the result a9 of the multiplication unit;
in the third group's third-order multiplexer-and-adder structure, the first third-order multiplexer is left unconnected in the mode-0 state, and is connected to the output of the third group's third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a8 of the multiplication unit in the mode-1 state and is left unconnected in the mode-0 state; the third group's third-order adder is idle in the mode-0 state, and its two inputs are the outputs of the first and second third-order multiplexers.
the configurable addition tree requires only 9 addition units as a whole. Therefore, compared with the prior art that the convolution kernel internal addition tree mode and the self-accumulation mode are realized separately, the method can reduce the number of addition units by 47% with single parallelism, and has important significance for reducing the resources occupied by the addition units.
The configurable addition tree serves a convolution single-engine architecture, supporting its forward propagation, error back-propagation, and weight-gradient computation. It achieves good performance while remaining compatible with the two working modes, in-kernel accumulation and self-accumulation, across forward propagation, error back-propagation, and weight-gradient computation, and it reduces the number of addition units and hence the consumption of computing resources.
Within the whole CNN training accelerator, the addition tree can serve both as the in-kernel addition tree of forward propagation and as the self-accumulator of weight-gradient computation.
Fig. 2 is a schematic diagram of the accumulation modes of the proposed addition tree: the convolution-kernel addition-tree mode of mode 0 is shown in (a), and the self-accumulation mode of mode 1 in (b). For the convolution-kernel addition-tree mode:
In the first group of addition units: level 1 realizes a2+a1; level 2 realizes a2+a1+a3; level 3 realizes a2+a1+a3+a5+a4+a6; level 4 realizes a2+a1+a3+a5+a4+a6+a8+a7+a9.
In the second group of addition units: level 1 realizes a5+a4; level 2 realizes a5+a4+a6; level 3 realizes a2+a1+a3+a5+a4+a6; level 4 realizes a2+a1+a3+a5+a4+a6+a8+a7+a9.
In the third group of addition units: level 1 realizes a8+a7; level 2 realizes a8+a7+a9; level 3 realizes a2+a1+a3+a5+a4+a6+a8+a7+a9.

Claims (3)

1. A configurable addition tree suitable for a convolutional neural network training accelerator, characterized in that the configurable addition tree is composed of three groups of addition units, each group comprising a first-order multiplexer-and-adder structure, a second-order multiplexer-and-adder structure, and a third-order multiplexer-and-adder structure connected in series; mode selection is performed by the multiplexers, whose outputs feed the next-stage adders in series.
2. The configurable addition tree suitable for use in a convolutional neural network training accelerator as defined in claim 1, wherein said multiplexers provide a mode 0 and a mode 1, said mode 0 being the convolution in-kernel accumulation mode and said mode 1 being the self-accumulation mode.
3. The configurable addition tree suitable for use in a convolutional neural network training accelerator as claimed in claim 1, wherein the specific structure of said three sets of addition units is as follows:
in the first set of addition units:
in the first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a2 of the multiplication unit in the mode-0 state, and is connected to the output of the first-order adder through a feedback line in the mode-1 state; the two inputs of the first-order adder are the output of the first-order multiplexer and the result a1 of the multiplication unit;
in the second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the first-order adder in the mode-0 state, and is connected to the output of the second-order adder through a feedback line in the mode-1 state; the two inputs of the second-order adder are the output of the second-order multiplexer and the result a3 of the multiplication unit;
in the third-order multiplexer-and-adder structure, the first third-order multiplexer selects the output of the second-order adder in the mode-0 state, and is connected to the output of the third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a2 of the multiplication unit in the mode-1 state; the two inputs of the third-order adder are the outputs of the first and second third-order multiplexers;
in the second group of adding units:
in the second group's first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a5 of the multiplication unit in the mode-0 state, and is connected to the output of the second group's first-order adder through a feedback line in the mode-1 state; the two inputs of that first-order adder are the output of the multiplexer and the result a4 of the multiplication unit;
in the second group's second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the second group's first-order adder in the mode-0 state, and is connected to the output of the second group's second-order adder through a feedback line in the mode-1 state; the two inputs of that second-order adder are the output of the multiplexer and the result a6 of the multiplication unit;
in the second group's third-order multiplexer-and-adder structure, the first third-order multiplexer selects the output of the second group's second-order adder in the mode-0 state, and is connected to the output of the second group's third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a5 of the multiplication unit in the mode-1 state; the two inputs of that third-order adder are the outputs of the first and second third-order multiplexers;
in the third group of addition units:
in the third group's first-order multiplexer-and-adder structure, the first-order multiplexer selects the result a8 of the multiplication unit in the mode-0 state, and is connected to the output of the third group's first-order adder through a feedback line in the mode-1 state; the two inputs of that first-order adder are the output of the multiplexer and the result a7 of the multiplication unit;
in the third group's second-order multiplexer-and-adder structure, the second-order multiplexer selects the output of the third group's first-order adder in the mode-0 state, and is connected to the output of the third group's second-order adder through a feedback line in the mode-1 state; the two inputs of that second-order adder are the output of the multiplexer and the result a9 of the multiplication unit;
in the third group's third-order multiplexer-and-adder structure, the first third-order multiplexer is left unconnected in the mode-0 state, and is connected to the output of the third group's third-order adder through a feedback line in the mode-1 state; the second third-order multiplexer selects the result a8 of the multiplication unit in the mode-1 state and is left unconnected in the mode-0 state; the two inputs of the third group's third-order adder are the outputs of the first and second third-order multiplexers.
CN202110597775.1A 2021-05-31 2021-05-31 Configurable addition tree suitable for convolutional neural network training accelerator Active CN113361687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597775.1A CN113361687B (en) 2021-05-31 2021-05-31 Configurable addition tree suitable for convolutional neural network training accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597775.1A CN113361687B (en) 2021-05-31 2021-05-31 Configurable addition tree suitable for convolutional neural network training accelerator

Publications (2)

Publication Number Publication Date
CN113361687A (en) 2021-09-07
CN113361687B CN113361687B (en) 2023-03-24

Family

ID=77528209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597775.1A Active CN113361687B (en) 2021-05-31 2021-05-31 Configurable addition tree suitable for convolutional neural network training accelerator

Country Status (1)

Country Link
CN (1) CN113361687B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110006409A * 2009-07-14 2011-01-20 Sogang University Industry-University Cooperation Foundation (서강대학교산학협력단) Decoder using low density parity check code
CN105611269A (en) * 2015-12-18 2016-05-25 华中科技大学 Real time parallax calculation system based on FPGA
CN106203617A (en) * 2016-06-27 2016-12-07 哈尔滨工业大学深圳研究生院 A kind of acceleration processing unit based on convolutional neural networks and array structure
CN109711542A (en) * 2018-12-29 2019-05-03 西安交通大学 A kind of DNN accelerator that supporting dynamic accuracy and its implementation
US20190279083A1 (en) * 2018-03-06 2019-09-12 DinoplusAI Holdings Limited Computing Device for Fast Weighted Sum Calculation in Neural Networks
CN111898733A (en) * 2020-07-02 2020-11-06 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN112486457A (en) * 2020-11-23 2021-03-12 杭州电子科技大学 Hardware system for realizing improved FIOS modular multiplication algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Peiqi: "Key Technologies of Software-Hardware Co-Acceleration for Neural Networks" (神经网络软硬件协同加速关键技术), China Doctoral Dissertations Full-text Database, Information Science and Technology series *

Also Published As

Publication number Publication date
CN113361687B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
Yin et al. An energy-efficient reconfigurable processor for binary-and ternary-weight neural networks with flexible data bit width
Wu et al. A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs
CN109740739A (en) Neural computing device, neural computing method and Related product
CN109740754A (en) Neural computing device, neural computing method and Related product
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN115018062A (en) Convolutional neural network accelerator based on FPGA
Chang et al. Towards design methodology of efficient fast algorithms for accelerating generative adversarial networks on FPGAs
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN113361687B (en) Configurable addition tree suitable for convolutional neural network training accelerator
CN107092462B (en) 64-bit asynchronous multiplier based on FPGA
US20230128421A1 (en) Neural network accelerator
An et al. 29.3 an 8.09 tops/w neural engine leveraging bit-sparsified sign-magnitude multiplications and dual adder trees
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112149814A (en) Convolutional neural network acceleration system based on FPGA
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN113191494B (en) Efficient LSTM accelerator based on FPGA
Wen FPGA-Based Deep Convolutional Neural Network Optimization Method
CN112766479A (en) Neural network accelerator supporting channel separation convolution based on FPGA
Brown et al. Nemo-cnn: An efficient near-memory accelerator for convolutional neural networks
Wang et al. Design exploration of multi-fpgas for accelerating deep learning
Hossain et al. Energy efficient computing with heterogeneous DNN accelerators

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant