WO2020119318A1

WO2020119318A1 - Self-adaptive selection and design method for convolutional-layer hardware accelerator

Info

Publication number: WO2020119318A1
Application number: PCT/CN2019/114910
Authority: WO
Inventors: 秦华标; 曹钦平
Original assignee: 华南理工大学
Priority date: 2018-12-15
Filing date: 2019-10-31
Publication date: 2020-06-18
Also published as: CN109740731A; CN109740731B

Abstract

Disclosed is a self-adaptive selection and design method for a convolutional-layer hardware accelerator, comprising the following steps: (1) analyzing convolutional layer structures, designing four different hardware accelerator solutions for different kinds of convolutional layer structures, and storing the four different hardware accelerator solutions in an accelerator solution pool; and (2) obtaining a convolutional layer structure and a convolutional layer parameter from an input source, selecting, according to the convolutional layer structure, an optimal accelerator solution from the accelerator solution pool, and constructing a corresponding convolutional-layer accelerator on the basis of the optimal accelerator solution. The invention is employed to design a solution pool of convolution-layer accelerators, self-adaptively select an optimal solution, and generate a hardware accelerator, thereby enabling more flexible hardware design, while also reducing resource consumption and increasing parallel operation speeds of convolution layers.

Description

An Adaptive Convolutional Layer Hardware Accelerator Design Method

Technical field

The invention relates to the design of hardware accelerators for convolutional neural networks and belongs to the technical field of hardware acceleration of integrated circuits, in particular to a method for adaptively selecting an optimal hardware acceleration scheme and generating a hardware accelerator according to different convolutional layer structures.

Background technique

In recent years, convolutional neural networks have been widely used in image classification, target detection, and language recognition. However, while achieving very high accuracy, convolutional neural networks also require more computing and memory resources, so many applications based on them must rely on large servers. Meanwhile, applying deep learning technologies such as convolutional neural networks on resource-limited embedded platforms has become the general trend. Convolutional neural networks usually contain a large number of convolutional layers that can be computed in parallel, so designing hardware accelerators for convolutional layers is an inevitable direction of future development.

Regarding the hardware acceleration design of the convolutional layer, mainstream research does not take the convolutional layer structure into account: the same hardware circuit architecture is used to accelerate every convolutional layer. Because this approach is not optimized for different structures, it consumes more hardware resources and at the same time reduces parallel computing speed. In addition, current hardware designs mainly provide a hardware interface, and are very inflexible for convolutional layers with many parameters and complex structures.

In view of these limitations of current hardware acceleration for convolutional layers, a corresponding accelerator solution can be designed for each kind of convolutional layer structure. The storage area holding all accelerator solutions is called the accelerator solution pool. The convolutional layer structure is obtained from the input source, the optimal solution is then selected from the accelerator solution pool, and finally a hardware accelerator is generated. A search of the prior-art literature found no report of designing different accelerator solutions for different convolutional layer structures and adaptively selecting the optimal one.

Summary of the invention

The present invention overcomes the deficiencies in the existing hardware acceleration technology for convolutional layers, and proposes an adaptive convolutional layer hardware accelerator design method.

By designing four different accelerator schemes, the present invention adaptively selects the optimal scheme and generates a hardware accelerator, which not only improves the flexibility of hardware design, but also reduces resource consumption and improves the calculation speed.

The object of the present invention is achieved by at least one of the following technical solutions.

An adaptive convolutional layer hardware accelerator design method includes the following steps:

(1) Analyze the convolutional layer structure, design four different hardware accelerator solutions for different convolutional layer structures, and store the four different hardware accelerator solutions in the accelerator solution pool;

(2) Obtain the convolutional layer structure and convolutional layer parameters from the input source, and then select the optimal accelerator solution from the accelerator solution pool according to the convolutional layer structure, and construct the corresponding convolutional layer accelerator from the accelerator solution.

Further, in the above step (1), the user can specify an input channel number threshold N_i and an output channel number threshold N_o, and the convolutional layer structure is divided into the following four types: the number of input channels is less than N_i and the number of output channels is less than N_o; the number of input channels is less than N_i and the number of output channels is greater than N_o; the number of input channels is greater than N_i and the number of output channels is less than N_o; the number of input channels is greater than N_i and the number of output channels is greater than N_o.
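The four-way division by thresholds can be illustrated with a short Python sketch. The function name `classify_layer` and its argument names are hypothetical, and treating a channel count equal to its threshold as "greater" is an assumption, since the text only covers the strictly smaller and strictly greater cases.

```python
def classify_layer(n_in: int, n_out: int, n_i: int, n_o: int) -> int:
    """Return the structure type (1 to 4) of a convolutional layer.

    n_in / n_out are the layer's input / output channel counts;
    n_i / n_o are the user-specified thresholds N_i and N_o.
    Assumption: a count equal to its threshold is treated as "greater".
    """
    few_in = n_in < n_i
    few_out = n_out < n_o
    if few_in and few_out:
        return 1  # fewer input channels, fewer output channels
    if few_in:
        return 2  # fewer input channels, more output channels
    if few_out:
        return 3  # more input channels, fewer output channels
    return 4      # more input channels, more output channels
```

For example, with thresholds N_i = 32 and N_o = 64, a layer with 3 input channels and 128 output channels falls into the second type.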

The hardware acceleration solution is:

Parallel acceleration scheme one, parallel operation is performed on the output channel, and pipeline operations are performed on the input channel and the convolution window respectively;

Parallel acceleration scheme two, parallel operation of the output channel and input channel, and pipeline operation of the convolution window;

Parallel acceleration scheme three, parallel operation is performed on the input channel, and pipeline operations are performed on the output channel and the convolution window, respectively;

Parallel acceleration scheme four, parallel operation is performed on some input channels and output channels, and pipeline operations are performed on some input channels and convolution windows, respectively;

The four hardware accelerator solutions are stored in the storage area, called the accelerator solution pool.

Further, adaptively selecting the optimal solution and generating the hardware accelerator includes the following steps: first, obtain the convolutional layer structure and convolutional layer parameters from the input source; then select the optimal accelerator solution from the accelerator solution pool according to the convolutional layer structure; finally, generate the final accelerator from the optimal accelerator solution and the convolutional layer parameters.

Further, the process of parallel acceleration scheme four is to divide the input channels into several equal parts, and to perform a convolution operation between one convolution window of the input channels of each part and all convolution kernels; the parts are then pipelined to obtain the convolution output of one convolution window over all input channels; the convolution windows are then pipelined to obtain the convolution output of all input channels.

Further, obtaining the convolutional layer parameters includes: obtaining, from the input source, the height and width of the input feature map and its number of input channels; obtaining the height and width of the convolution kernel, the number of convolution kernels, and the width step and height step; obtaining the input feature map, weight, and offset values; estimating, from the convolutional layer parameters, the hardware resources consumed and the clock cycles required by each acceleration scheme; and combining these estimates with the user's constraints for the task to select the optimal accelerator solution and thereby generate the convolutional layer hardware accelerator.

Further, according to the relationship between the number of input channels and the number of output channels, the convolutional layer structure is divided into the following four types:

The first type: fewer input channels and fewer output channels;

The second type: fewer input channels and more output channels;

The third type: more input channels and fewer output channels;

The fourth type: more input channels and more output channels.

Further, in the above step (2), the steps of obtaining the convolutional layer structure and parameters from the input source are as follows:

1) Obtain the shape of the weight tensor of the convolutional layer, so as to analyze the number of convolution kernels, the size and step size of the convolution kernel;

2) Obtain the shape of the tensor of the input feature map of the convolution layer, and analyze the size of the input feature map of the convolution layer and the number of input channels;

3) Obtain the values of the input feature map of the convolutional layer, and quantize and convert the convolutional layer weights and offsets into hardware-format data files.
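As an illustration of steps 1) and 2), the structure can be read off the two tensor shapes. This is only a sketch: the names `ConvStructure` and `parse_structure` are hypothetical, the layouts (M, N, H_k, W_k) for the weights and (N, H, W) for the input feature map are assumptions, and the step sizes are passed in separately because this sketch does not assume they are encoded in the tensor shape. Quantization and file conversion from step 3) are not covered.

```python
from typing import NamedTuple, Tuple


class ConvStructure(NamedTuple):
    n: int    # number of input channels N
    h: int    # input feature map height H
    w: int    # input feature map width W
    m: int    # number of convolution kernels (output channels) M
    h_k: int  # convolution kernel height H_k
    w_k: int  # convolution kernel width W_k
    h_s: int  # height step H_s
    w_s: int  # width step W_s


def parse_structure(weight_shape: Tuple[int, int, int, int],
                    input_shape: Tuple[int, int, int],
                    steps: Tuple[int, int] = (1, 1)) -> ConvStructure:
    # Assumed layouts: weights are (M, N, H_k, W_k), input is (N, H, W).
    m, n_w, h_k, w_k = weight_shape
    n, h, w = input_shape
    if n_w != n:
        raise ValueError("kernel channel count does not match input channel count")
    h_s, w_s = steps
    return ConvStructure(n, h, w, m, h_k, w_k, h_s, w_s)
```

For instance, `parse_structure((64, 3, 3, 3), (3, 224, 224))` describes a layer with N = 3, M = 64, and a 3 x 3 kernel.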

Further, the steps of selecting the optimal accelerator solution are as follows:

1). Determine whether it belongs to the first convolutional layer structure, if it is, the second acceleration scheme is preferred, otherwise, perform 2);

2). Determine whether it belongs to the second convolutional layer structure, if it is, the first or second acceleration scheme is preferred, otherwise, perform 3);

3). Determine whether it belongs to the third convolutional layer structure, if it is, the third acceleration scheme is preferred, otherwise, perform 4);

4) The structure must belong to the fourth convolutional layer structure, and the fourth acceleration scheme is preferred.
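The four-step rule above amounts to a lookup from structure type to preferred scheme. A minimal Python sketch, in which the table name `PREFERRED_SCHEMES` and the function `preferred_schemes` are hypothetical:

```python
# Preferred acceleration schemes per convolutional layer structure type,
# following the four-step rule in the text.
PREFERRED_SCHEMES = {
    1: (2,),    # first structure  -> scheme two
    2: (1, 2),  # second structure -> scheme one or scheme two
    3: (3,),    # third structure  -> scheme three
    4: (4,),    # fourth structure -> scheme four
}


def preferred_schemes(layer_type: int) -> tuple:
    """Return the preferred scheme numbers for a structure type (1 to 4)."""
    if layer_type not in PREFERRED_SCHEMES:
        raise ValueError("layer_type must be 1, 2, 3 or 4")
    return PREFERRED_SCHEMES[layer_type]
```

The second structure returns two candidates because the text leaves the choice between scheme one and scheme two open.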

Further, in the above step (2), the steps of generating the final convolutional layer hardware accelerator are as follows:

a. Obtain the convolutional layer structure and parameters from the input source, including files containing the definition of the convolutional layer structure and data files containing the convolutional layer weights and convolutional layer offsets;

b. According to the structure parameters of the convolutional layer, that is, the size of the convolution kernel, the number of input channels, the number of output channels, and the convolution step size, select the optimal accelerator scheme and generate the corresponding convolutional layer accelerator.

Further, the convolutional layer parameters include weights and offsets, which are converted into hardware-format data files and stored in memory; the convolutional layer structure includes the number of input channels of the input feature map, the width of the input feature map, the height of the input feature map, the number of convolution kernels (that is, the number of output channels), the width of the convolution kernel, the height of the convolution kernel, the width step of the convolution kernel, and the height step of the convolution kernel.

Compared with the prior art, the advantages and positive effects of the present invention are:

1. The present invention designs four convolutional layer accelerator solutions and divides convolutional layer structures into four types. Using the corresponding optimal accelerator solution for each structure can greatly reduce hardware resource consumption, and using parallel and pipeline operations achieves similar computing performance at lower hardware resource cost.

2. The invention can obtain the convolutional layer structure and convolutional layer parameters from the input source, adaptively select the optimal scheme and generate a hardware acceleration circuit, which greatly improves the flexibility and efficiency of hardware design.

3. By designing an accelerator solution pool, the method can select different accelerator schemes for different convolutional layer structures, which both saves hardware resources and increases parallel computing speed; adaptively selecting the optimal scheme increases the flexibility of hardware design.

Brief description of the drawings

FIG. 1 is a schematic diagram of the output channel parallel module and parallel acceleration scheme one in an embodiment of the present invention;

FIG. 2 is a schematic diagram of parallel acceleration scheme two in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the input channel parallel module and parallel acceleration scheme three in an embodiment of the present invention;

FIG. 4 is a schematic diagram of parallel acceleration scheme four in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the input channel pipeline in an embodiment of the present invention;

FIG. 6 is a schematic diagram of the convolution kernel pipeline in an embodiment of the present invention;

FIG. 7 is a schematic diagram of the input channel pipeline after segmentation in an embodiment of the present invention;

FIG. 8 is a schematic diagram of the convolution window pipeline in an embodiment of the present invention;

FIG. 9 is a schematic diagram of the adaptive accelerator design process in an embodiment of the present invention;

FIG. 10 is a schematic flow chart of adaptively selecting the optimal scheme by the convolutional layer structure in the embodiment of the present invention.

Detailed description

The specific implementation of the present invention will be further described below with reference to the drawings. However, the embodiments of the present invention are not limited to this.

An adaptive convolutional layer hardware accelerator design method is as follows. Based on an analysis of convolutional layer structures, the structures are divided into four types according to the number of input channels and the number of convolution kernels. Four different hardware accelerator solutions are designed for the four structures. All accelerator solutions are stored in a storage area, which is called the accelerator solution pool.

Let N be the number of input feature map channels (the number of input channels), W the width of the input feature map, H the height of the input feature map, M the number of convolution kernels (the number of output channels), W_k the width of the convolution kernel, H_k the height of the convolution kernel, W_s the width step of the convolution kernel, and H_s the height step of the convolution kernel. The output feature map width W_o, the output feature map height H_o, and the number G of convolution windows generated by each input channel then satisfy

$$W_o = \left\lfloor \frac{W - W_k}{W_s} \right\rfloor + 1 \tag{1}$$

$$H_o = \left\lfloor \frac{H - H_k}{H_s} \right\rfloor + 1 \tag{2}$$

$$G = W_o \cdot H_o \tag{3}$$
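The window count G can be computed from the layer structure as in the sketch below. It assumes the usual relation for a convolution without padding, in which each output dimension is the floor of (input size minus kernel size) divided by the step, plus one; the function name `output_size` is hypothetical.

```python
def output_size(w: int, h: int, w_k: int, h_k: int, w_s: int, h_s: int):
    """Return (W_o, H_o, G) for a convolution without padding (assumption).

    G = W_o * H_o is the number of convolution windows per input channel.
    """
    if w < w_k or h < h_k:
        raise ValueError("kernel is larger than the input feature map")
    w_o = (w - w_k) // w_s + 1
    h_o = (h - h_k) // h_s + 1
    return w_o, h_o, w_o * h_o
```

For a 7 x 7 feature map, a 3 x 3 kernel, and a step of 2 in both directions, this gives W_o = H_o = 3 and G = 9.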

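The convolution operation formula is:

$$c_m^{(g)} = \sum_{n=1}^{N} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} x_{nij}^{(g)} \, w_{mnij} + b_m \tag{4}$$

where $c_m^{(g)}$ represents the output of the g-th window of the m-th output channel, $x_{nij}^{(g)}$ represents the value in row i, column j of the g-th window of the n-th input channel of the input feature map, $w_{mnij}$ represents the weight in row i, column j of the n-th channel of the m-th convolution kernel, and $b_m$ represents the offset of the m-th convolution kernel.

Symbol description: $x_n^{(g)}$ denotes the g-th convolution window of the n-th input channel, $x^{(g)}$ denotes the g-th convolution windows of all N input channels, $w_{mn}$ denotes the n-th channel of the m-th convolution kernel, $w_{m\cdot}$ denotes all N channels of the m-th convolution kernel, and $w_{\cdot n}$ denotes the n-th channels of all M convolution kernels.

According to formula (4), the intermediate result $a_{mn}^{(g)}$ of convolving the g-th convolution window of the n-th input channel with the n-th channel of the m-th convolution kernel is as follows, where ⊙ represents the convolution operation:

$$a_{mn}^{(g)} = x_n^{(g)} \odot w_{mn} = \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} x_{nij}^{(g)} \, w_{mnij} \tag{5}$$

Then the m-th channel $c_m^{(g)}$ of the convolution output of the g-th convolution window is calculated as follows:

$$c_m^{(g)} = \sum_{n=1}^{N} a_{mn}^{(g)} + b_m \tag{6}$$

Define the matrix $A^{(g)}$, where the entry in the m-th row and n-th column is $a_{mn}^{(g)}$:

$$A^{(g)} = \left[ a_{mn}^{(g)} \right]_{M \times N} \tag{7}$$

Then the n-th column vector $a_{\cdot n}^{(g)}$ of $A^{(g)}$ is

$$a_{\cdot n}^{(g)} = \left[ a_{1n}^{(g)}, a_{2n}^{(g)}, \dots, a_{Mn}^{(g)} \right]^{T} \tag{8}$$

and the m-th row vector $a_{m \cdot}^{(g)}$ of $A^{(g)}$ is

$$a_{m \cdot}^{(g)} = \left[ a_{m1}^{(g)}, a_{m2}^{(g)}, \dots, a_{mN}^{(g)} \right] \tag{9}$$

The convolutional layer offset vector b, where the offset of the m-th output channel is $b_m$, is

$$b = \left[ b_1, b_2, \dots, b_M \right]^{T} \tag{10}$$

The output feature map vector $C^{(g)}$, where the value of the m-th output channel of the g-th convolution window is $c_m^{(g)}$, is

$$C^{(g)} = \left[ c_1^{(g)}, c_2^{(g)}, \dots, c_M^{(g)} \right]^{T} \tag{11}$$

From (7)(8), the intermediate result matrix $A^{(g)}$ of the convolution of the g-th convolution window can be written by columns:

$$A^{(g)} = \left[ a_{\cdot 1}^{(g)}, a_{\cdot 2}^{(g)}, \dots, a_{\cdot N}^{(g)} \right] \tag{12}$$

From (7)(9), it can be written by rows:

$$A^{(g)} = \left[ a_{1 \cdot}^{(g)}; a_{2 \cdot}^{(g)}; \dots; a_{M \cdot}^{(g)} \right] \tag{13}$$

From formula (6) and definitions (7)(8)(10)(11), it follows that

$$C^{(g)} = \sum_{n=1}^{N} a_{\cdot n}^{(g)} + b \tag{14}$$

From formula (6) and definition (9), it follows that

$$c_m^{(g)} = \sum_{n=1}^{N} \left( a_{m \cdot}^{(g)} \right)_n + b_m \tag{15}$$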
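The column view and the row view of the matrix A^(g) lead to the same output vector C^(g), which a small pure-Python sketch can confirm for one convolution window. The function names are hypothetical, and the nested-list layout (`x[n]` is the window of input channel n, `w[m][n]` is channel n of kernel m) is an assumption made for illustration.

```python
def window_conv(x_n, w_mn):
    """Multiply-accumulate one H_k x W_k window with one kernel channel."""
    return sum(x_n[i][j] * w_mn[i][j]
               for i in range(len(x_n)) for j in range(len(x_n[0])))


def intermediate_matrix(x, w):
    """Return A with A[m][n] = window of input channel n convolved with w[m][n]."""
    return [[window_conv(x[n], w[m][n]) for n in range(len(x))]
            for m in range(len(w))]


def output_by_columns(a, b):
    """Accumulate the columns of A one by one, then add the offsets b."""
    c = [0] * len(a)
    for n in range(len(a[0])):      # one input channel (column) at a time
        for m in range(len(a)):     # all output channels together
            c[m] += a[m][n]
    return [c_m + b_m for c_m, b_m in zip(c, b)]


def output_by_rows(a, b):
    """Sum each row of A, then add that output channel's offset."""
    return [sum(row) + b_m for row, b_m in zip(a, b)]


# Two input channels, two kernels, 2 x 2 windows.
x = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
w = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]],
     [[[2, 0], [0, 0]], [[0, 1], [0, 0]]]]
b = [1, -1]
a = intermediate_matrix(x, w)
```

Accumulating by columns corresponds to processing one input channel at a time for all output channels, while summing by rows corresponds to processing all input channels of one output channel at once.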

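Step 1: As shown in Figure 1, the convolution window $x_n^{(g)}$ of the n-th input channel is convolved with each element of $w_{\cdot n}$ (the n-th channel of each of the M convolution kernels), and the n-th column vector $a_{\cdot n}^{(g)}$ of $A^{(g)}$ is obtained by (5)(8). This process calculates the M output channels in parallel, which is called output channel parallelism, and the structure of Figure 1 is encapsulated as the output channel parallel module. The inputs of this module are $x_n^{(g)}$ and $w_{\cdot n}$, and the output is $a_{\cdot n}^{(g)}$.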

Step 2: As shown in Figure 5, the output channel parallel module from step 1 is used to pipeline the N input channels, that is, one input channel is input every clock cycle; $A^{(g)}$ is obtained according to (12), and the values $C^{(g)}$ of all convolution output channels are then obtained according to (14). As shown in Figure 8, pipeline operations are performed on the G convolution windows to obtain the complete convolution output feature map. This scheme is called parallel acceleration scheme one. Assuming that the operation of one convolution window takes T clock cycles and that additions use an adder tree, the above analysis shows that the whole convolution operation takes T+N+G clock cycles. The number of multipliers consumed is $M \cdot W_k \cdot H_k$, and the number of adders is

Step 3: As shown in Figure 2, the output channel parallel module of Figure 1 is applied in parallel to all N input channels; $A^{(g)}$ is obtained by (12), and (14) is then used to obtain the values $C^{(g)}$ of all convolution output channels. As shown in Figure 8, pipeline operations are performed on the G convolution windows to obtain the complete convolution output feature map. This scheme is called parallel acceleration scheme two. Assuming that the operation of one convolution window takes T clock cycles and that additions use an adder tree, the above analysis shows that the whole convolution operation takes T+G clock cycles. The number of multipliers consumed is $N \cdot M \cdot W_k \cdot H_k$, and the number of adders is

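Step 4: As shown in Figure 3, from $x^{(g)}$ and $w_{m \cdot}$, the m-th row vector $a_{m \cdot}^{(g)}$ of $A^{(g)}$ is obtained by (5)(9), and the value $c_m^{(g)}$ of the m-th output channel is then obtained from (6). This process calculates the N input channels in parallel, which is called input channel parallelism, and the structure of Figure 3 is encapsulated as the input channel parallel module. The inputs of this module are $x^{(g)}$, $w_{m \cdot}$ and $b_m$, and the output is $c_m^{(g)}$.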

Step 5: As shown in Figure 6, the input channel parallel module from step 4 is used to pipeline the M convolution kernels, that is, one convolution kernel (one output channel) is input every clock cycle; $A^{(g)}$ is obtained according to (13), and the values $C^{(g)}$ of all convolution output channels are then obtained according to (11) and (15). As shown in Figure 8, pipeline operations are performed on the G convolution windows to obtain the complete convolution output feature map. This scheme is called parallel acceleration scheme three. Assuming that the operation of one convolution window takes T clock cycles and that additions use an adder tree, the above analysis shows that the whole convolution operation takes T+M+G clock cycles. The number of multipliers consumed is $N \cdot W_k \cdot H_k$, and the number of adders is

Step 6: Divide the N input channels into Q parts. In order to make the amount of calculation of each part the same, let the number of input channels of each part be

$$u = \left\lceil \frac{N}{Q} \right\rceil \tag{16}$$

Each of the first Q-1 parts contains u input channels, and the Q-th part contains only N-u(Q-1) input channels, so the Q-th part is padded with uQ-N input channels whose values are 0.
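The split into Q parts can be sketched in Python as follows. This is a minimal illustration under the assumption that each part holds u = ⌈N/Q⌉ channels; the function name `partition_channels` and the use of `None` to stand for an all-zero padding channel are hypothetical choices, not part of the method.

```python
def partition_channels(n: int, q: int):
    """Split input channels 1..N into Q parts of equal size u.

    Assumption: u = ceil(N / Q). Slots beyond channel N are filled with
    None, which stands for a padding channel whose values are all 0.
    Returns (u, parts), where parts is a list of Q lists of length u.
    """
    if n <= 0 or q <= 0:
        raise ValueError("n and q must be positive")
    u = (n + q - 1) // q  # integer ceiling of n / q
    parts = []
    for k in range(1, q + 1):
        first, last = (k - 1) * u + 1, k * u
        parts.append([c if c <= n else None for c in range(first, last + 1)])
    return u, parts
```

With N = 10 and Q = 4, each part holds u = 3 channels and the last part is `[10, None, None]`, so every part costs the same amount of computation.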

It can be seen from (12) that $A^{(g)}$ can be divided into Q parts, where the subscript range of the input channels of the q-th part is $[(q-1)u+1,\ qu]$. Writing $a_{mn}^{(g)}$ for the entry in row m and column n of $A^{(g)}$ and $a_{\cdot n}^{(g)}$ for its n-th column, the intermediate convolution output of the g-th convolution window for the q-th part is

$$A_q^{(g)} = \left[ a_{\cdot (q-1)u+1}^{(g)}, \dots, a_{\cdot qu}^{(g)} \right] \tag{17}$$

Let the intermediate value $p_{mq}^{(g)}$ of the m-th convolution output channel of the g-th window of the q-th part be

$$p_{mq}^{(g)} = \sum_{n=(q-1)u+1}^{qu} a_{mn}^{(g)} \tag{18}$$

Then, from formula (6), the value $c_m^{(g)}$ of the m-th output channel is

$$c_m^{(g)} = \sum_{q=1}^{Q} p_{mq}^{(g)} + b_m \tag{19}$$

Let the output feature map $P_q^{(g)}$ of the g-th window of the q-th part be

$$P_q^{(g)} = \left[ p_{1q}^{(g)}, p_{2q}^{(g)}, \dots, p_{Mq}^{(g)} \right]^{T} \tag{20}$$

Then, from formulas (19)(20),

$$C^{(g)} = \sum_{q=1}^{Q} P_q^{(g)} + b \tag{21}$$

As shown in Figure 4, for all u input channels of the q-th part, $A_q^{(g)}$ is obtained by (17) through the operation of the output channel parallel module of Figure 1, and $P_q^{(g)}$ is then obtained from (18)(20).

As shown in Figure 7, pipeline operations are performed on the Q parts of the input channels, and the values $C^{(g)}$ of all convolution output channels are calculated by (21). As shown in Figure 8, pipeline operations are performed on the G convolution windows to obtain the complete convolution output feature map. This scheme is called parallel acceleration scheme four. Assuming that the operation of one convolution window takes T clock cycles and that additions use an adder tree, the above analysis shows that the whole convolution operation takes T+Q+G clock cycles. The number of multipliers consumed is $u \cdot M \cdot W_k \cdot H_k$, and the number of adders is
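The clock-cycle and multiplier counts stated for the four schemes can be collected in one small estimator, sketched below in Python. The function name `estimate` and its argument names are hypothetical, the per-part channel count for scheme four is assumed to be u = ⌈N/Q⌉, and adder counts are not modeled in this sketch.

```python
def estimate(scheme: int, n: int, m: int, w_k: int, h_k: int,
             g: int, t: int, q: int = 1):
    """Return (clock_cycles, multipliers) for parallel acceleration scheme 1 to 4.

    n, m: input / output channel counts; w_k, h_k: kernel width / height;
    g: number of convolution windows; t: cycles for one window operation;
    q: number of input channel parts (only used by scheme four).
    """
    window = w_k * h_k
    if scheme == 1:   # output channels parallel, input channels pipelined
        return t + n + g, m * window
    if scheme == 2:   # input and output channels parallel
        return t + g, n * m * window
    if scheme == 3:   # input channels parallel, kernels pipelined
        return t + m + g, n * window
    if scheme == 4:   # u input channels and all output channels parallel
        u = (n + q - 1) // q  # assumed u = ceil(N / Q)
        return t + q + g, u * m * window
    raise ValueError("scheme must be 1, 2, 3 or 4")
```

With N = 64, M = 32, a 3 x 3 kernel, G = 100 and T = 5, scheme two is the fastest (105 cycles) but needs 18432 multipliers, while scheme four with Q = 4 takes 109 cycles with 4608 multipliers.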

Step 8: As shown in Figure 9, the convolutional layer structure and convolutional layer parameters are obtained from the input source, the optimal accelerator solution is selected from the accelerator solution pool according to the convolutional layer structure, the corresponding convolutional layer accelerator is constructed from that solution, and the network parameters are combined with it to generate the final hardware accelerator. The convolutional layer parameters include the weights w and offsets b, which are converted into hardware-format data files and stored in memory. The convolutional layer structure includes the number of input channels N of the input feature map, the width W of the input feature map, the height H of the input feature map, the number of convolution kernels (that is, the number of output channels) M, the width W_k of the convolution kernel, the height H_k of the convolution kernel, the width step W_s of the convolution kernel, and the height step H_s of the convolution kernel; these values are stored in a parameter file. From these parameters, the optimal acceleration scheme is selected from the accelerator solution pool, and the convolution hardware accelerator is finally generated.

Step 9: As shown in Figure 10, the process of selecting the optimal accelerator solution is:

The user specifies the input channel number threshold N_i and the output channel number threshold N_o.

1. First determine whether a scheme has been manually specified; if so, select that scheme and end, otherwise go to 2;

2. Determine whether hardware resource consumption and speed requirements are given; if so, calculate the hardware resource consumption and running speed of all schemes in the accelerator solution pool and go to 3, otherwise go to 4;

3. Determine whether there is a scheme that meets the requirements; if so, select that scheme and end, otherwise go to 7;

4. If the number of input channels is less than N_i, go to 5, otherwise go to 6;

5. If the number of output channels is less than N_o, select the second hardware accelerator scheme, otherwise select the first or the second hardware accelerator scheme; then go to 7;

6. If the number of output channels is less than N_o, select the third hardware accelerator scheme, otherwise select the fourth hardware accelerator scheme; then go to 7;

7. Determine whether to continue; if so, go to 1, otherwise end.
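One pass through this flow can be sketched in Python as below. The function name `select_scheme` and its arguments are hypothetical; in particular, expressing the requirements as a pair of limits on clock cycles and multipliers is an assumption made for illustration, and the "continue" decision of step 7 is left to the caller.

```python
def select_scheme(n_in, n_out, n_i, n_o, manual=None, estimates=None, limits=None):
    """Run one pass of the selection flow and return candidate scheme numbers.

    manual:    scheme fixed by the user, or None
    estimates: {scheme: (cycles, multipliers)} for the schemes in the pool
    limits:    (max_cycles, max_multipliers), or None if no requirements given
    An empty tuple means that no scheme meets the given requirements.
    """
    if manual is not None:                            # step 1
        return (manual,)
    if limits is not None and estimates is not None:  # steps 2 and 3
        max_cycles, max_mults = limits
        return tuple(s for s, (cycles, mults) in sorted(estimates.items())
                     if cycles <= max_cycles and mults <= max_mults)
    if n_in < n_i:                                    # steps 4 and 5
        return (2,) if n_out < n_o else (1, 2)
    return (3,) if n_out < n_o else (4,)              # step 6
```

For example, with thresholds N_i = 32 and N_o = 64 and no requirements given, a layer with 256 input channels and 256 output channels yields `(4,)`.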

Claims

An adaptive convolutional layer hardware accelerator design method, which is characterized by the following steps:

(1) Analyze the convolutional layer structure, design four different hardware accelerator solutions for different convolutional layer structures, and store the four different hardware accelerator solutions in the accelerator solution pool;

(2) Obtain the convolutional layer structure and convolutional layer parameters from the input source, and then select the optimal accelerator solution from the accelerator solution pool according to the convolutional layer structure, and construct the corresponding convolutional layer accelerator from the accelerator solution.
The hardware accelerator design method according to claim 1, wherein the accelerator solution pool includes the following hardware accelerator solutions:

Parallel acceleration scheme one, parallel operation is performed on the output channel, and pipeline operations are performed on the input channel and the convolution window respectively;

Parallel acceleration scheme two, parallel operation of the output channel and input channel, and pipeline operation of the convolution window;

Parallel acceleration scheme three, parallel operation is performed on the input channel, and pipeline operations are performed on the output channel and the convolution window, respectively;

Parallel acceleration scheme four, parallel operation is performed on some input channels and output channels, and pipeline operations are performed on some input channels and convolution windows, respectively;

The four hardware accelerator solutions are stored in the storage area, called the accelerator solution pool.
The hardware accelerator design method according to claim 1, wherein the hardware accelerator is generated by an optimal accelerator scheme and convolution layer parameters.
The method for designing a hardware accelerator according to claim 2, wherein the process of parallel acceleration scheme four is to divide the input channels into several equal parts, and to perform a convolution operation between one convolution window of the input channels of each part and all convolution kernels; the parts are then pipelined to obtain the convolution output of one convolution window over all input channels; and the convolution windows are then pipelined to obtain the convolution output of all input channels.
The method for designing a hardware accelerator according to claim 3, wherein step (2) specifically comprises: obtaining, from the input source, the height and width of the input feature map and the number of input channels of the input feature map; obtaining the height and width of the convolution kernel, the number of convolution kernels, and the width step and height step; obtaining the input feature map, the weights of the convolutional layer, and the offsets of the convolutional layer; estimating, from the convolutional layer parameters, the hardware resources consumed and the clock cycles required by each acceleration scheme; and combining these estimates with the user's constraints for the task to select the optimal accelerator solution and generate the convolutional layer accelerator.
The hardware accelerator design method according to claim 1, wherein the user specifies an input channel number threshold N_i and an output channel number threshold N_o, and the convolutional layer structure is divided into the following four types: the number of input channels is less than N_i and the number of output channels is less than N_o; the number of input channels is less than N_i and the number of output channels is greater than N_o; the number of input channels is greater than N_i and the number of output channels is less than N_o; the number of input channels is greater than N_i and the number of output channels is greater than N_o.
The hardware accelerator design method according to claim 1, wherein the specific steps of obtaining the convolutional layer structure and parameters from the input source are as follows:

1) Obtain the shape of the weight tensor of the convolutional layer, so as to analyze the number of convolution kernels, the size and step size of the convolution kernel;

2) Obtain the shape of the tensor of the input feature map of the convolution layer, and analyze the size of the input feature map of the convolution layer and the number of input channels;

3) Obtain the values of the input feature map of the convolutional layer, and quantize and convert the convolutional layer weights and offsets into hardware-format data files.
The method for designing a hardware accelerator according to claim 5, wherein the steps of selecting an optimal accelerator solution are as follows:

1. Determine whether it belongs to the first type of convolutional layer structure, if it is, the second type of acceleration scheme is preferred, otherwise, go to 2;

2. Determine whether it belongs to the second convolutional layer structure. If it is, the first or second acceleration scheme is preferred, otherwise, go to 3;

3. Determine whether it belongs to the third convolutional layer structure. If it is, the third acceleration scheme is preferred, otherwise, go to 4;

4. This structure must belong to the fourth convolutional layer structure, and the fourth acceleration scheme is preferred.
The hardware accelerator design method according to claim 1, wherein step (2) specifically includes:

a. Obtain the convolutional layer structure and parameters from the input source, including files containing the definition of the convolutional layer structure and data files containing the convolutional layer weights and convolutional layer offsets;

b. According to the structure parameters of the convolution layer, that is, the size of the convolution kernel, the size of the input channel, the size of the output channel, and the size of the convolution step, the optimal accelerator scheme is selected to generate the corresponding convolution layer accelerator.
The hardware accelerator design method according to claim 1, wherein: the convolutional layer parameters include weights and offsets; the convolutional layer structure includes the number of input channels of the input feature map, the width of the input feature map, the height of the input feature map, the number of convolution kernels (that is, the number of output channels), the width of the convolution kernel, the height of the convolution kernel, the width step of the convolution kernel, and the height step of the convolution kernel.