CN109740731B - Design method of self-adaptive convolution layer hardware accelerator - Google Patents


Info

Publication number
CN109740731B
CN109740731B (application CN201811537915.0A)
Authority
CN
China
Prior art keywords
convolution
convolution layer
accelerator
scheme
input
Prior art date
Legal status
Active
Application number
CN201811537915.0A
Other languages
Chinese (zh)
Other versions
CN109740731A (en
Inventor
秦华标 (Qin Huabiao)
曹钦平 (Cao Qinping)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201811537915.0A priority Critical patent/CN109740731B/en
Publication of CN109740731A publication Critical patent/CN109740731A/en
Priority to PCT/CN2019/114910 priority patent/WO2020119318A1/en
Application granted granted Critical
Publication of CN109740731B publication Critical patent/CN109740731B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a design method for a self-adaptive convolution layer hardware accelerator, comprising: designing a pool of convolution layer accelerator schemes, adaptively selecting the optimal scheme, and generating the hardware accelerator. The method first analyzes the characteristics and parallelism of the convolution layer structure, then designs four different accelerator schemes by evaluating hardware resource consumption and running speed, storing all of them in a storage area referred to as the accelerator scheme pool. Finally, it acquires the convolution layer structure and parameters from an input source, selects the optimal accelerator scheme for the given structure, and generates the final hardware accelerator. By designing the accelerator scheme pool, adaptively selecting the optimal scheme, and generating the hardware accelerator, the method makes the hardware design more flexible, reduces resource consumption, and increases the parallel operation speed of the convolution layer.

Description

Design method of self-adaptive convolution layer hardware accelerator
Technical Field
The invention relates to the design of convolutional neural network hardware accelerators, belongs to the technical field of integrated circuit hardware acceleration, and particularly relates to a method for adaptively selecting the optimal hardware acceleration scheme and generating a hardware accelerator according to different convolution layer structures.
Background
In recent years, convolutional neural networks have been widely used in image classification, object detection, speech recognition, and related fields. However, while achieving very high accuracy, convolutional neural networks demand substantial computing and memory resources, so many applications based on them must rely on large servers. At the same time, deep learning techniques such as convolutional neural networks are increasingly deployed on embedded platforms with limited resources. A convolutional neural network typically contains a large number of convolution layers that can be computed in parallel, so hardware accelerator design for convolution layers is a necessary direction of future development.
Regarding the hardware acceleration of convolution layers, current research mainly accelerates every layer with the same hardware circuit architecture, without considering the structure of each convolution layer; because this approach is not optimized for different structures, it consumes more hardware resources and lowers the parallel computing speed. Current hardware designs also mainly expose a raw hardware interface with relatively many parameters and a complex structure for a convolution layer, so circuit flexibility is very poor. Given these limitations of the prior art, corresponding accelerator schemes can be designed for different convolution layer structures, with the storage area holding all accelerator schemes referred to as the accelerator scheme pool; the convolution layer structure is obtained from an input source, the optimal scheme is selected from the accelerator scheme pool, and the hardware accelerator is finally generated. A search of the prior literature found no report of designing different accelerator schemes for different convolution layer structures and adaptively selecting the optimal one.
Disclosure of Invention
The invention provides a design method of a self-adaptive convolution layer hardware accelerator, which overcomes the defects in the existing convolution layer hardware acceleration technology.
According to the invention, four different accelerator schemes are designed, the optimal scheme is selected in a self-adaptive mode, and the hardware accelerator is generated, so that not only is the flexibility of hardware design improved, but also the resource consumption is reduced, and the operation speed is improved.
The invention is realized by the following technical scheme. The design method of the self-adaptive convolution layer hardware accelerator comprises the following steps:
(1) Analyzing the characteristics of the convolution layer structure, dividing convolution layer structures into four types according to the number of input channels and the number of convolution kernels, designing four different hardware accelerator schemes for these four structures, and storing all accelerator schemes in a storage area referred to as the accelerator scheme pool.
(2) Acquiring the convolution layer structure and parameters from an input source, selecting the optimal scheme from the accelerator scheme pool according to the convolution layer structure, constructing the corresponding convolution layer accelerator from that scheme, and, combined with the network parameters, generating the final hardware accelerator.
Further, in step (1), the user specifies a threshold N_i on the number of input channels and a threshold N_o on the number of output channels, and the convolution layer structure is divided into the following four types: the number of input channels is less than N_i and the number of output channels is less than N_o; the number of input channels is less than N_i and the number of output channels is greater than N_o; the number of input channels is greater than N_i and the number of output channels is less than N_o; the number of input channels is greater than N_i and the number of output channels is greater than N_o.
The hardware acceleration schemes are as follows:
Parallel acceleration scheme one: parallel operation over the output channels, with pipelined operation over the input channels and over the convolution windows.
Parallel acceleration scheme two: parallel operation over the output channels and the input channels, with pipelined operation over the convolution windows.
Parallel acceleration scheme three: parallel operation over the input channels, with pipelined operation over the output channels and over the convolution windows.
Parallel acceleration scheme four: parallel operation over part of the input channels and over the output channels, with pipelined operation over the input-channel parts and over the convolution windows.
The four hardware accelerator schemes are stored in a memory region referred to as the accelerator scheme pool.
Further, adaptively selecting the optimal scheme and generating the hardware accelerator comprises the following steps: first, obtain the convolution layer structure and convolution layer parameters from an input source; then select the optimal accelerator scheme from the accelerator scheme pool according to the convolution layer structure; finally, generate the final hardware accelerator from the optimal accelerator scheme and the convolution layer parameters.
Further, parallel acceleration scheme four divides the input channels into several equal parts, and for each part convolves one convolution window of its input channels with all convolution kernels; the parts of input channels are then pipelined to obtain the convolution output of one window over all input channels; the convolution windows are then pipelined to obtain the convolution output of all input channels.
Further, obtaining the convolution layer parameters includes obtaining from the input source the height and width of the input feature map; the height and width of the convolution kernels, the number of convolution kernels, and the width stride and height stride over the input feature map; and the values of the input feature map, the weights, and the biases. The hardware resources consumed by each acceleration scheme and the clock cycles required are estimated from the convolution layer parameters, and these estimates are combined with the user's task constraints to select the optimal accelerator scheme and generate the convolution layer hardware accelerator.
Further, according to the relation between the number of input channels and the number of output channels, the convolution layer structure is divided into the following four types:
first kind: the number of input channels is small, and the number of output channels is small.
Second kind: the number of input channels is small, and the number of output channels is large.
Third kind: the number of input channels is more, and the number of output channels is less.
Fourth kind: the number of input channels is large, and the number of output channels is large.
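The four-way division above can be sketched as a small classification helper (an illustrative sketch, not code from the patent; the thresholds N_i and N_o are user-specified as in the text, and boundary cases equal to a threshold are grouped with "not less than" here since the text leaves them unspecified):

```python
def classify_layer(num_input_channels: int, num_output_channels: int,
                   n_i: int, n_o: int) -> int:
    """Return the convolution layer structure type (1-4) described in the text."""
    if num_input_channels < n_i:
        # few input channels: type 1 (few outputs) or type 2 (many outputs)
        return 1 if num_output_channels < n_o else 2
    # many input channels: type 3 (few outputs) or type 4 (many outputs)
    return 3 if num_output_channels < n_o else 4
```

For example, a first layer with 3 input channels and 16 kernels would fall into the first type under thresholds N_i = 8, N_o = 32.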
Further, in the step (2), the step of obtaining the convolution layer structure and parameters from the input source is as follows:
1) Acquiring the shape of the weight tensor of the convolution layer, so as to analyze the number of convolution kernels of the convolution layer, the size of the convolution kernels and the step length;
2) Acquiring the shape of the tensor of the input feature map of the convolution layer, analyzing the size of the input feature map of the convolution layer and the number of input channels;
3) Quantizing the values of the convolutional layer input feature map and the values of the convolutional layer weights and offsets, and converting the values into a hardware format data file;
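Steps 1) and 2) above can be sketched as shape parsing. This sketch assumes an (M, N, H_k, W_k) weight-tensor layout and an (N, H, W) input-feature-map layout; the actual input-source format is not specified in this excerpt, and the function and key names are illustrative:

```python
def parse_structure(weight_shape, input_shape, strides):
    """Recover the convolution layer structure from tensor shapes."""
    m, n, h_k, w_k = weight_shape      # kernel count, channels, kernel height/width
    n_in, h, w = input_shape           # input channels, feature-map height/width
    assert n == n_in, "weight channel count must match input channel count"
    h_s, w_s = strides                 # height stride, width stride
    return {"M": m, "N": n, "H_k": h_k, "W_k": w_k,
            "H": h, "W": w, "H_s": h_s, "W_s": w_s}
```

A 16-kernel 3x3 layer over a 3-channel 32x32 map would then parse to M = 16, N = 3, W_k = H_k = 3.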
further, the step of selecting the optimal accelerator scheme is specifically as follows:
1) Judging whether the layer belongs to the first convolution layer structure; if so, preferentially adopting acceleration scheme two, otherwise executing step 2);
2) Judging whether the layer belongs to the second convolution layer structure; if so, preferentially adopting acceleration scheme one or two, otherwise executing step 3);
3) Judging whether the layer belongs to the third convolution layer structure; if so, preferentially adopting acceleration scheme three, otherwise executing step 4);
4) The layer necessarily belongs to the fourth convolution layer structure, and acceleration scheme four is preferentially adopted.
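The selection steps above reduce to a lookup from structure type to preferred scheme (a sketch; for the second structure the text allows scheme one or two, and scheme one is returned here as an arbitrary choice):

```python
def preferred_scheme(structure_type: int) -> int:
    """Map a convolution layer structure type (1-4) to its preferred scheme."""
    # type 1 -> scheme two; type 2 -> scheme one (or two); type 3 -> three; type 4 -> four
    return {1: 2, 2: 1, 3: 3, 4: 4}[structure_type]
```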
Further, in the step (2), the step of generating the final convolution layer hardware accelerator is as follows:
a. acquiring the convolution layer structure and parameters from an input source, comprising a file defining the convolution layer structure and a data file containing the convolution layer weights and biases;
b. selecting the optimal accelerator scheme from the structural parameters of the convolution layer, namely the convolution kernel size, the number of input channels, the number of output channels, and the convolution stride, and generating the corresponding convolution layer accelerator.
Further, the convolution layer parameters comprise the weights and biases, which are converted into hardware-format data files and stored in memory; the convolution layer structure comprises the number of input channels of the input feature map, the width and height of the input feature map, the number of convolution kernels (i.e., the number of output channels), the width and height of the convolution kernels, and the width stride and height stride of the convolution kernels.
Compared with the prior art, the invention has the advantages and positive effects that:
1. the invention designs four convolution layer accelerator schemes and divides convolution layer structures into four types; by using the optimal accelerator scheme for each structure, hardware resource consumption can be greatly reduced, and by employing parallel and pipelined operations, similar computing performance can be achieved at lower hardware resource cost.
2. The invention can acquire the convolution layer structure and the convolution layer parameters from the input source, adaptively select the optimal scheme and generate the hardware accelerating circuit, thereby greatly improving the flexibility and the efficiency of hardware design.
3. According to the method, by designing the accelerator scheme pool, different accelerator schemes can be selected according to different convolution layer structures, so that hardware resources are saved, and hardware parallel computing speed is improved; and the flexibility of hardware design is improved by adaptively selecting an optimal scheme.
Drawings
FIG. 1 is a schematic diagram of the output channel parallel module and parallel acceleration scheme one in an embodiment of the present invention.
FIG. 2 is a schematic diagram of parallel acceleration scheme two in an embodiment of the present invention.
FIG. 3 is a schematic diagram of the input channel parallel module and parallel acceleration scheme three in an embodiment of the present invention.
FIG. 4 is a schematic diagram of parallel acceleration scheme four in an embodiment of the present invention.
FIG. 5 is a schematic diagram of an input channel pipeline in an embodiment of the invention.
FIG. 6 is a schematic diagram of a convolution kernel pipeline in an embodiment of the present invention.
FIG. 7 is a schematic diagram of a pipeline of input channels after segmentation in an embodiment of the present invention.
FIG. 8 is a schematic diagram of a convolution window pipeline in an embodiment of the present invention.
FIG. 9 is a schematic diagram of an adaptive accelerator design flow in an embodiment of the invention.
FIG. 10 is a flow chart of adaptive selection of an optimal scheme from a convolutional layer structure in an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. Embodiments of the present invention are not limited thereto.
A design method of a self-adaptive convolution layer hardware accelerator. Let N be the number of input channels, W the width of the input feature map, H the height of the input feature map, M the number of convolution kernels (i.e., the number of output channels), W_k the convolution kernel width, H_k the convolution kernel height, W_s the convolution kernel width stride, and H_s the convolution kernel height stride. The output feature map width W_o, the output feature map height H_o, and the number G of convolution windows generated per input channel satisfy:

W_o = (W - W_k) / W_s + 1  (1)
H_o = (H - H_k) / H_s + 1  (2)
G = W_o * H_o  (3)
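As a quick check of equations (1)-(3), the output geometry can be computed directly (a sketch assuming no padding, which the text does not mention; the function name is illustrative):

```python
def output_geometry(w, h, w_k, h_k, w_s, h_s):
    """Output width W_o, height H_o, and window count G per equations (1)-(3)."""
    w_o = (w - w_k) // w_s + 1   # equation (1); integer division assumes an exact fit
    h_o = (h - h_k) // h_s + 1   # equation (2)
    return w_o, h_o, w_o * h_o   # G = W_o * H_o, equation (3)
```

For a 32x32 input with a 3x3 kernel and stride 1, this gives a 30x30 output and G = 900 convolution windows.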
the convolution operation formula is:
wherein the method comprises the steps ofOutput of g window representing mth output channel,/>The ith row and ith column values of the jth window, w, representing the nth input channel of the input profile mnij The ith row and column weights of the nth channel representing the mth convolution kernel, b m Representing the offset of the mth convolution kernel.
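The convolution of equation (4) can be sketched as a direct nested-loop evaluation over plain Python lists (illustrative reference code, not the hardware implementation; indexing follows the symbol definitions above):

```python
def conv_direct(x, w, b, w_k, h_k, w_s, h_s):
    """x[n][row][col]: input feature maps; w[m][n][i][j]: kernels; b[m]: biases."""
    n = len(x)
    h, w_in = len(x[0]), len(x[0][0])
    m = len(w)
    h_o = (h - h_k) // h_s + 1          # output height, as in equation (2)
    w_o = (w_in - w_k) // w_s + 1       # output width, as in equation (1)
    out = [[[0.0] * w_o for _ in range(h_o)] for _ in range(m)]
    for mm in range(m):                 # each output channel
        for r in range(h_o):
            for c in range(w_o):        # each convolution window
                acc = b[mm]             # bias b_m
                for nn in range(n):     # sum over input channels and kernel taps
                    for i in range(h_k):
                        for j in range(w_k):
                            acc += x[nn][r * h_s + i][c * w_s + j] * w[mm][nn][i][j]
                out[mm][r][c] = acc
    return out
```

A one-channel 3x3 input convolved with a single all-ones 2x2 kernel produces the four 2x2 block sums.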
Symbol description:
By equation (4), the intermediate result s_mn^(g) of convolving the g-th convolution window of the n-th input channel with the n-th channel of the m-th convolution kernel is

s_mn^(g) = Σ_{i=1..H_k} Σ_{j=1..W_k} x_nij^(g) * w_mnij  (5)

The output of the m-th channel for the g-th convolution window is then calculated as

y_m^(g) = Σ_{n=1..N} s_mn^(g) + b_m  (6)

Define the matrix A^(g), whose row-m, column-n entry is

A^(g)[m][n] = s_mn^(g)  (7)

the n-th column vector of A^(g),

a_.n^(g) = (s_1n^(g), s_2n^(g), ..., s_Mn^(g))^T  (8)

and the m-th row vector of A^(g),

a_m.^(g) = (s_m1^(g), s_m2^(g), ..., s_mN^(g))  (9)

Define the convolution layer bias vector b, whose m-th entry is the bias b_m of the m-th output channel,

b = (b_1, b_2, ..., b_M)^T  (10)

and the output feature map vector C^(g), whose m-th entry is the value of the m-th output channel for the g-th convolution window,

C^(g) = (y_1^(g), y_2^(g), ..., y_M^(g))^T  (11)

From (7) and (8), the result matrix A^(g) in the convolution of the g-th convolution window satisfies

A^(g) = (a_.1^(g), a_.2^(g), ..., a_.N^(g))  (12)

and from (7) and (9),

A^(g) = (a_1.^(g); a_2.^(g); ...; a_M.^(g))  (13)

From equation (6) and definitions (7), (8), (10), and (11) it can be deduced that

C^(g) = Σ_{n=1..N} a_.n^(g) + b  (14)

and from equation (6) and definition (9) it can be deduced that

y_m^(g) = a_m.^(g) · (1, 1, ..., 1)^T + b_m  (15)
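The decomposition in definitions (5)-(14) can be sketched for a single convolution window: each entry s[m][n] is the window/kernel-channel dot product of (5), and summing the columns of A^(g) and adding the bias gives C^(g) as in (14). A minimal illustration (function and variable names are assumptions):

```python
def window_outputs(xg, w, b):
    """xg[n][i][j]: g-th window of input channel n; w[m][n][i][j]; b[m] -> C^(g)."""
    m_k, n_ch = len(w), len(xg)
    # s[m][n] = s_mn^(g), equation (5); the full matrix is A^(g), equation (7)
    s = [[sum(xg[n][i][j] * w[m][n][i][j]
              for i in range(len(xg[n]))
              for j in range(len(xg[n][0])))
          for n in range(n_ch)]
         for m in range(m_k)]
    # C^(g): sum each row of A^(g) over the input channels and add b_m, per (6)/(14)
    return [sum(s[m]) + b[m] for m in range(m_k)]
```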
Step one: As shown in FIG. 1, each element of a_.n^(g) is obtained from x_n^(g) and w_.n by (5) and (8), where x_n^(g) is the g-th convolution window of the n-th input channel and w_.n denotes the n-th channel of each of the M convolution kernels. This process calculates the M output channels in parallel and is therefore called output channel parallelism. The structure of FIG. 1 is encapsulated as the output channel parallel module, whose inputs are x_n^(g) and w_.n and whose output is a_.n^(g).
Step two: As shown in FIG. 5, the output channel parallel module of step one is used to pipeline the N input channels, i.e., one input channel is fed per clock cycle; A^(g) is obtained according to (12), and the values C^(g) of all convolution output channels are obtained according to (14). As shown in FIG. 8, the G convolution windows are then pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme one. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, analysis shows that the whole convolution operation requires T + N + G clock cycles. The number of multipliers consumed is M * W_k * H_k, plus the adders of the corresponding addition tree.
Step three: As shown in FIG. 2, the output channel parallel module of FIG. 1 and (12) are applied to all N input channels simultaneously to obtain A^(g), and (14) then yields the values C^(g) of all convolution output channels. As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme two. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + G clock cycles. The number of multipliers consumed is N * M * W_k * H_k, plus the adders of the corresponding addition tree.
Step four: As shown in FIG. 3, a_m.^(g) is obtained from x^(g) and w_m by (5) and (9), and y_m^(g) is then obtained by (6), where x^(g) denotes the g-th convolution window of all N input channels and w_m denotes the m-th convolution kernel. This process calculates the N input channels in parallel, which is called input channel parallelism. The structure of FIG. 3 is encapsulated as the input channel parallel module, whose inputs are x^(g), w_m, and b_m and whose output is y_m^(g).
Step five: As shown in FIG. 6, the input channel parallel module of step four is used to pipeline the M convolution kernels, i.e., one output channel is produced per clock; A^(g) is obtained according to (13), and the values C^(g) of all convolution output channels are obtained according to (11) and (15). As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme three. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + M + G clock cycles. The number of multipliers consumed is N * W_k * H_k, plus the adders of the corresponding addition tree.
Step six: Divide the N input channels into Q parts with an equal number of channels per part,

u = ceil(N / Q)  (16)

The first Q - 1 parts each contain u input channels, while the Q-th part contains only N - u(Q - 1) input channels, so the Q-th part is padded with uQ - N input channels whose values are 0.
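The channel split of step six can be sketched as follows, assuming the consistent reading that the last part is padded with u*Q - N zero-valued channels (names are illustrative, and scalar zeros stand in for zero-valued feature maps):

```python
import math

def partition_channels(channels, q):
    """Split N input channels into q parts of u = ceil(N/q) channels, zero-padded."""
    n = len(channels)
    u = math.ceil(n / q)                      # channels per part, equation (16)
    padded = channels + [0] * (u * q - n)     # pad the last part with zero channels
    return [padded[p * u:(p + 1) * u] for p in range(q)]
```

Splitting 7 channels into Q = 3 parts gives u = 3, with two zero channels padding the last part.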
As can be seen from (12), A^(g) is divided into Q parts, such that the q-th part covers the input-channel indices [(q-1)u + 1, qu]; the convolution intermediate output of the q-th part corresponding to the g-th convolution window is

A_q^(g) = (a_.(q-1)u+1^(g), ..., a_.qu^(g))  (17)

Let the intermediate value z_mq^(g) of the m-th convolution output channel for the g-th window of the q-th part be

z_mq^(g) = Σ_{n=(q-1)u+1..qu} s_mn^(g)  (18)

Then, from equation (6),

y_m^(g) = Σ_{q=1..Q} z_mq^(g) + b_m  (19)

Let the output feature map contribution C_q^(g) of the q-th part for the g-th window be

C_q^(g) = (z_1q^(g), z_2q^(g), ..., z_Mq^(g))^T  (20)

Then, from formulas (19) and (20),

C^(g) = Σ_{q=1..Q} C_q^(g) + b  (21)
As shown in FIG. 4, the u input channels of the q-th part are processed by the output channel parallel module of FIG. 1 to obtain (17), and C_q^(g) is then obtained from (18) and (20). As shown in FIG. 7, the Q parts of input channels are pipelined, and the values C^(g) of all convolution output channels are calculated from (21). As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme four. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + Q + G clock cycles. The number of multipliers consumed is u * M * W_k * H_k, plus the adders of the corresponding addition tree.
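The cycle and multiplier counts stated for the four schemes can be collected into a small cost model (a sketch; adder counts are left out because their formulas are not legible in this excerpt, and the parameter names are assumptions):

```python
import math

def scheme_costs(n, m, w_k, h_k, g, t, q=2):
    """Clock cycles and multiplier counts for parallel acceleration schemes 1-4."""
    u = math.ceil(n / q)                      # channels per part for scheme four
    return {
        1: {"cycles": t + n + g, "multipliers": m * w_k * h_k},
        2: {"cycles": t + g,     "multipliers": n * m * w_k * h_k},
        3: {"cycles": t + m + g, "multipliers": n * w_k * h_k},
        4: {"cycles": t + q + g, "multipliers": u * m * w_k * h_k},
    }
```

For N = 8, M = 16, a 3x3 kernel, G = 900, and T = 10, scheme two is fastest (910 cycles) but needs the most multipliers (1152), while scheme four trades a few cycles for half the multipliers.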
Step eight: As shown in FIG. 9, the convolution layer parameters and the convolution layer structure are obtained from an input source. The convolution layer parameters comprise the weights w and biases b, which are converted into hardware-format data files and stored in memory. The convolution layer structure comprises the number of input channels N of the input feature map, the input feature map width W and height H, the number of convolution kernels M (i.e., the number of output channels), the convolution kernel width W_k and height H_k, and the convolution kernel width stride W_s and height stride H_s, which are stored in a parameter file. From these parameters, the optimal acceleration scheme is selected from the accelerator scheme pool, and the convolution hardware accelerator is finally generated.
Step nine: As shown in FIG. 10, the flow for selecting the optimal accelerator scheme is as follows. First, the input channel threshold N_i and the output channel threshold N_o are specified. Then:
1. judge whether a scheme is manually specified; if so, select that scheme and end, otherwise execute 2;
2. judge whether hardware resource consumption and speed requirements are given; if so, calculate the hardware resource consumption and running speed of all schemes in the accelerator scheme pool and execute 3, otherwise execute 4;
3. judge whether a scheme meeting the requirements exists; if so, select that scheme and end, otherwise execute 7;
4. if the number of input channels is less than N_i, execute 5, otherwise execute 6;
5. if the number of output channels is less than N_o, select hardware accelerator scheme two, otherwise select hardware accelerator scheme one or two; then use that scheme and execute 7;
6. if the number of output channels is less than N_o, select hardware accelerator scheme three, otherwise select hardware accelerator scheme four; then use that scheme and execute 7;
7. judge whether to continue; if so, execute 1, otherwise end.
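The step-nine flow can be sketched as a single selection function: manual override first, then a constraint check over precomputed scheme costs, then the structure-based default of steps 4-6. The constraint and cost formats are illustrative assumptions, and falling through to the structure-based default when no scheme satisfies the constraints is a simplification of the flow:

```python
def select_scheme(n, m, n_i, n_o, manual=None, constraints=None, costs=None):
    """Pick an accelerator scheme (1-4) for a layer with n inputs and m outputs."""
    if manual is not None:                    # step 1: user-specified scheme wins
        return manual
    if constraints is not None and costs is not None:   # steps 2-3
        for scheme, cost in sorted(costs.items()):
            if (cost["multipliers"] <= constraints["max_multipliers"]
                    and cost["cycles"] <= constraints["max_cycles"]):
                return scheme
    if n < n_i:                               # steps 4-5: few input channels
        return 2 if m < n_o else 1
    return 3 if m < n_o else 4                # step 6: many input channels
```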

Claims (5)

1. The design method of the self-adaptive convolution layer hardware accelerator is characterized by comprising the following steps of:
(1) Analyzing the convolution layer structure, designing four different hardware accelerator schemes for different convolution layer structures, and storing them in an accelerator scheme pool; a threshold N_i on the number of input channels and a threshold N_o on the number of output channels are specified by the user, and the convolution layer structure is divided into the following four types: first, the number of input channels is less than N_i and the number of output channels is less than N_o; second, the number of input channels is less than N_i and the number of output channels is greater than N_o; third, the number of input channels is greater than N_i and the number of output channels is less than N_o; fourth, the number of input channels is greater than N_i and the number of output channels is greater than N_o;
(2) Acquiring a convolution layer structure and convolution layer parameters from an input source, selecting an optimal accelerator scheme from an accelerator scheme pool according to the convolution layer structure, and constructing a corresponding convolution layer accelerator by the accelerator scheme;
the accelerator scheme pool contains the following hardware accelerator schemes:
the first parallel acceleration scheme is to perform parallel operation on the output channel and pipeline operation on the input channel and the convolution window respectively;
a parallel acceleration scheme II carries out parallel operation on the output channel and the input channel and carries out pipeline operation on the convolution window;
a parallel acceleration scheme III, which carries out parallel operation on the input channel and respectively carries out pipeline operation on the output channel and the convolution window;
a parallel acceleration scheme IV, which carries out parallel operation on part of input channels and output channels and respectively carries out pipeline operation on part of input channels and convolution windows;
storing four hardware accelerator schemes in a memory area, referred to as an accelerator scheme pool;
the hardware accelerator is generated by an optimal accelerator scheme and convolutional layer parameters;
the parallel acceleration scheme IV comprises the steps of dividing an input channel into a plurality of equal parts, and carrying out convolution operation on a convolution window of the plurality of input channels of each part and all convolution kernels; then carrying out pipeline operation on a plurality of input channels so as to obtain a convolution window convolution output of all the input channels; then, carrying out pipeline operation on the convolution window to obtain convolution output of all input channels;
the optimal accelerator scheme is selected as follows:
1. judging whether the layer belongs to the first convolution layer structure; if so, preferentially adopting acceleration scheme two, otherwise executing 2;
2. judging whether the layer belongs to the second convolution layer structure; if so, preferentially adopting acceleration scheme one or two, otherwise executing 3;
3. judging whether the layer belongs to the third convolution layer structure; if so, preferentially adopting acceleration scheme three, otherwise executing 4;
4. the layer necessarily belongs to the fourth convolution layer structure, and acceleration scheme four is preferentially adopted.
2. The hardware accelerator design method of claim 1, wherein the step (2) specifically comprises: obtaining from the input source the height and width of the input feature map and its number of input channels; obtaining the height and width of the convolution kernels, the number of convolution kernels, and the width stride and height stride; obtaining the values of the input feature map, the convolution layer weights, and the convolution layer biases; estimating from the convolution layer parameters the hardware resources consumed by each acceleration scheme and the clock cycles required; and combining these estimates with the user's task constraints to select the optimal accelerator scheme and generate the convolution layer accelerator.
3. The hardware accelerator design method of claim 1, wherein the specific steps of acquiring the convolution layer structure and parameters from the input source are as follows:
1) acquiring the shape of the convolution layer weight tensor, and analyzing from it the number of convolution kernels, the convolution kernel size, and the stride;
2) acquiring the shape of the convolution layer input feature map tensor, and analyzing from it the input feature map size and the number of input channels;
3) quantizing the values of the convolution layer input feature map, weights, and biases, and converting them into a hardware-format data file.
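Steps 1)–3) can be sketched as below. The tensor layouts (`(kernels, channels, kh, kw)` for weights, `(channels, h, w)` for the feature map) and the Q8.8 fixed-point format are assumptions for illustration; the claim only states that shapes are read and values are quantized, not the layout or bit format.

```python
def parse_layer(weight_shape, input_shape):
    """Derive structure fields from tensor shapes.
    Assumed layouts: weights (kernels, channels, kh, kw),
    input feature map (channels, height, width)."""
    n_kernels, _, kh, kw = weight_shape
    in_ch, h, w = input_shape
    return {"kernels": n_kernels, "kernel_hw": (kh, kw),
            "in_ch": in_ch, "feature_hw": (h, w)}

def quantize_q8_8(x):
    """Toy quantization to signed 16-bit Q8.8 fixed point, one common
    hardware number format (an assumption, not the patent's format)."""
    v = int(round(x * 256))
    return max(-32768, min(32767, v))

info = parse_layer((64, 3, 3, 3), (3, 224, 224))
q = quantize_q8_8(0.5)
```

A separate tool pass would then write the quantized values out as the hardware-format data file mentioned in step 3).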
4. The hardware accelerator design method of claim 1, wherein step (2) specifically comprises:
a. acquiring the convolution layer structure and parameters from the input source, comprising a file defining the convolution layer structure and a data file containing the convolution layer weights and biases;
b. selecting the optimal accelerator scheme according to the structural parameters of the convolution layer, namely the convolution kernel size, the number of input channels, the number of output channels, and the convolution stride, and generating the corresponding convolution layer accelerator.
5. The hardware accelerator design method of claim 1, wherein the convolution layer parameters comprise weights and biases, and the convolution layer structure comprises the number of input channels of the input feature map, the input feature map width and height, the number of convolution kernels (i.e., the number of output channels), the convolution kernel width and height, and the convolution kernel width and height strides.
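The structure fields enumerated in claim 5 map naturally onto a record type. The field names below are illustrative, and the output-size helper assumes no padding, since the claim does not mention a padding parameter.

```python
from dataclasses import dataclass

@dataclass
class ConvLayerStructure:
    """Convolution layer structure per claim 5 (field names assumed)."""
    in_channels: int     # input channels of the input feature map
    in_width: int
    in_height: int
    num_kernels: int     # equals the number of output channels
    kernel_width: int
    kernel_height: int
    stride_w: int        # width stride of the convolution kernel
    stride_h: int        # height stride of the convolution kernel

    def output_hw(self):
        # Valid (no-padding) output size -- padding is an assumption here.
        ow = (self.in_width - self.kernel_width) // self.stride_w + 1
        oh = (self.in_height - self.kernel_height) // self.stride_h + 1
        return oh, ow

s = ConvLayerStructure(3, 224, 224, 64, 3, 3, 1, 1)
```

For a 224x224 input with 3x3 kernels and unit stride, `output_hw` gives a 222x222 output map.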
CN201811537915.0A 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator Active CN109740731B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811537915.0A CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator
PCT/CN2019/114910 WO2020119318A1 (en) 2018-12-15 2019-10-31 Self-adaptive selection and design method for convolutional-layer hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811537915.0A CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator

Publications (2)

Publication Number Publication Date
CN109740731A CN109740731A (en) 2019-05-10
CN109740731B true CN109740731B (en) 2023-07-18

Family

ID=66360373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811537915.0A Active CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator

Country Status (2)

Country Link
CN (1) CN109740731B (en)
WO (1) WO2020119318A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110263923B (en) * 2019-08-12 2019-11-29 上海燧原智能科技有限公司 Tensor convolutional calculation method and system
CN110503201A (en) * 2019-08-29 2019-11-26 苏州浪潮智能科技有限公司 Neural network distributed parallel training method and device
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111047010A (en) * 2019-11-25 2020-04-21 天津大学 Method and device for reducing first-layer convolution calculation delay of CNN accelerator
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN111931909B (en) * 2020-07-24 2022-12-20 北京航空航天大学 Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN114186677A (en) * 2020-09-15 2022-03-15 中兴通讯股份有限公司 Accelerator parameter determination method and device and computer readable medium
CN112950656A (en) * 2021-03-09 2021-06-11 北京工业大学 Block convolution method for pre-reading data according to channel based on FPGA platform
CN115145839B (en) * 2021-03-31 2024-05-14 广东高云半导体科技股份有限公司 Depth convolution accelerator and method for accelerating depth convolution
CN115146767B (en) * 2021-03-31 2024-05-28 广东高云半导体科技股份有限公司 Two-dimensional convolution accelerator and method for accelerating two-dimensional convolution

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869117B (en) * 2016-03-28 2021-04-02 上海交通大学 GPU acceleration method for deep learning super-resolution technology
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing a sparse convolutional neural network accelerator
EP3346427B1 (en) * 2017-01-04 2023-12-20 STMicroelectronics S.r.l. Configurable accelerator framework, system and method
CN207440765U (en) * 2017-01-04 2018-06-01 意法半导体股份有限公司 System on chip and mobile computing device
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN107993186B (en) * 2017-12-14 2021-05-25 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108280514B (en) * 2018-01-05 2020-10-16 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN108133270B (en) * 2018-01-12 2020-08-04 清华大学 Convolutional neural network acceleration method and device
CN108805267B (en) * 2018-05-28 2021-09-10 重庆大学 Data processing method for hardware acceleration of convolutional neural network
CN108875915B (en) * 2018-06-12 2019-05-07 辽宁工程技术大学 Deep adversarial network optimization method for embedded applications
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator

Also Published As

Publication number Publication date
CN109740731A (en) 2019-05-10
WO2020119318A1 (en) 2020-06-18

Similar Documents

Publication Publication Date Title
CN109740731B (en) Design method of self-adaptive convolution layer hardware accelerator
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN112001294B (en) Vehicle body surface damage detection and mask generation method and storage device based on YOLACT++
CN110188866B (en) Feature extraction method based on attention mechanism
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN110874636A (en) Neural network model compression method and device and computer equipment
CN113222998B (en) Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
US20210182357A1 (en) System and method for model parameter optimization
CN113065653A (en) Design method of lightweight convolutional neural network for mobile terminal image classification
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
Lin et al. Aacp: Model compression by accurate and automatic channel pruning
CN109145738B (en) Dynamic video segmentation method based on weighted non-convex regularization and iterative re-constrained low-rank representation
Cao et al. Improving prediction accuracy in LSTM network model for aircraft testing flight data
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN114581789A (en) Hyperspectral image classification method and system
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN111814884A (en) Target detection network model upgrading method based on deformable convolution
CN108830802B (en) Image blur kernel estimation method based on short exposure image gradient guidance
CN110647917A (en) Model multiplexing method and system
CN116152263A (en) CM-MLP network-based medical image segmentation method
CN110457155A (en) Sample class label correction method, device, and electronic equipment
Kadu et al. Human motion classification and management based on mocap data analysis
CN113033106A (en) Steel material performance prediction method based on EBSD and deep learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant