CN108416434B - Circuit structure for accelerating convolutional layer and full-connection layer of neural network - Google Patents
- Publication number
- CN108416434B CN108416434B CN201810120895.0A CN201810120895A CN108416434B CN 108416434 B CN108416434 B CN 108416434B CN 201810120895 A CN201810120895 A CN 201810120895A CN 108416434 B CN108416434 B CN 108416434B
- Authority
- CN
- China
- Prior art keywords
- matrix
- layer
- weight
- data
- module
- Prior art date
- Legal status: Active (as listed; not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure capable of accelerating a convolutional layer and a fully-connected layer simultaneously. The circuit structure comprises five parts: a feature/weight prefetch module for data reading, a local cache for improving the data reuse rate, a matrix operation unit for realizing matrix multiplication, a temporary data accumulation module for accumulating temporary output results, and an output control module for data write-back. The circuit uses a dedicated mapping method to map the operations of the convolutional layer and the fully-connected layer onto a matrix operation unit of fixed size. The circuit adjusts the memory arrangement of the features and weights, greatly improving the memory access efficiency of the circuit. Meanwhile, the scheduling of the circuit modules adopts a pipeline mechanism, so that all hardware units are in a working state in every clock cycle, which improves the utilization rate of the hardware units and the operating efficiency of the circuit.
Description
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure for accelerating a convolution layer and a full connection layer of a neural network.
Background
In the 1960s, Hubel et al. proposed the concept of the receptive field through their study of the visual cortical cells of cats. In the 1980s, Fukushima proposed the neocognitron on the basis of the receptive-field concept; it can be regarded as the first implemented network of the convolutional-neural-network type. The neocognitron decomposes a visual pattern into a number of sub-patterns (features), which then enter hierarchically connected feature planes. It attempts to model the visual system, so that recognition can be completed even when the object is displaced or slightly deformed.
Convolutional neural networks are a variant of the multi-layer perceptron, developed from the early research of the biologists Hubel and Wiesel on the visual cortex of cats. The cells of the visual cortex form a complex architecture. Each cell is very sensitive to a sub-region of the visual input space, called its receptive field, and the receptive fields tile the entire field of view. These cells can be divided into two basic types: simple cells and complex cells. Simple cells respond maximally to edge-like stimulus patterns within the receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of a stimulus. The convolutional neural network structure includes the convolutional layer, the downsampling layer, and the fully-connected layer. Each layer has multiple feature maps; each feature map extracts one feature of the input through a convolution filter and contains multiple neurons.
Because of its huge amount of computation, a convolutional neural network is currently difficult to run locally on a mobile terminal and is mostly realized through cloud computing. More than ninety percent of the operations of a convolutional neural network fall in the convolutional layers and the fully-connected layers, and a separate accelerating circuit is usually designed for each of these two kinds of operation, which introduces extra chip area.
The invention provides a circuit structure capable of accelerating convolutional layers and fully-connected layers simultaneously: by reordering the features and weights of each layer of the neural network, both kinds of operation can be mapped onto the same matrix operation unit (an array of multipliers and adders). This improves the multiplexing efficiency of the hardware, reduces the chip area, and lets the circuit obtain a higher operation throughput per unit area.
Disclosure of Invention
The invention aims to provide a circuit structure capable of accelerating a convolution layer and a full connection layer simultaneously aiming at the operation acceleration of the convolution layer and the full connection layer of a neural network so as to improve the hardware multiplexing efficiency and reduce the chip area.
The circuit structure for accelerating the convolution layer and the full-connection layer of the neural network provided by the invention can map the convolution layer and the full-connection layer to the same matrix operation unit by a method of expanding operation; and by a method of reordering the characteristics and the weights of each layer of the neural network, the access performance loss caused by the discontinuity of the read addresses of the characteristics and the weights after expansion is reduced.
The circuit structure provided by the invention comprises a characteristic/weight prefetching module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module fetches new feature and weight data from an external memory (DRAM) and places them into the local cache, replacing old data that is no longer used. Except for the first-layer features of the neural network, all features and weights are already rearranged in the required pattern; the rearrangement of the first-layer features is performed by software. The feature/weight prefetch module therefore does not need to implement the rearrangement function;
the local cache is used for caching input data required by the matrix operation unit. Whether the convolution layer or the full connection layer is adopted, a large amount of data multiplexing exists in the operation, so that the data which can be multiplexed is stored in the local cache, and the access amount to an external memory is reduced;
the matrix operation unit is an array of multipliers and adders used to realize matrix operations. After the features and weights are rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, which are realized by invoking the matrix operation unit multiple times;
the temporary data accumulation module accumulates the data sent by the matrix operation unit. After multiple accumulations, the accumulated result (the input feature of the next network layer) is sent to the output control module;
and the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the same rearrangement mode.
In mapping convolutional-layer operations to matrix operations, it is necessary to pull the input features into a series of row vectors and expand the convolution kernels into a two-dimensional matrix. With the traditional memory space allocation, the addresses the feature/weight prefetch module needs to read would therefore not be contiguous, reducing memory access efficiency. Rearranging the features and weights guarantees the continuity of the addresses read by the feature/weight prefetch module and greatly improves the access efficiency of the circuit. The features and weights are rearranged in a certain pattern, as follows:
As in fig. 4, an input feature of size C_in*H*W is cut into H*W strips, each of length C_in, and the data in the H*W strips are written into memory at sequential addresses. Starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip (strip H*W-1) occupy addresses (H*W-1)*C_in to H*W*C_in-1. In other words, the expansion order of the features in memory is C_in => W => H (versus W => H => C_in in the traditional memory space allocation).
The convolution kernel comprises C_out sub-weight matrices of size C_in*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is C_in => W => H => C_out (versus W => H => C_in => C_out in the traditional memory space allocation).
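The rearrangement described above amounts to switching from a planar (channel-major) layout to a channel-interleaved layout. A minimal NumPy sketch under that reading (the function names and NumPy itself are my own illustration, not part of the patent):

```python
import numpy as np

def rearrange_features(feat_chw):
    """Convert a (C_in, H, W) feature map from the traditional planar
    layout (W fastest, then H, then C_in) to the layout described in
    the patent (C_in fastest, then W, then H)."""
    # After the transpose, flattening yields H*W strips of length C_in
    # at consecutive addresses, one strip per spatial position.
    return np.ascontiguousarray(feat_chw.transpose(1, 2, 0))

def rearrange_weights(weights_ochw):
    """Apply the same per-kernel rearrangement to a (C_out, C_in, H, W)
    weight tensor: C_in fastest, then W, then H, then C_out."""
    return np.ascontiguousarray(weights_ochw.transpose(0, 2, 3, 1))

# Demonstration: strip i of the rearranged feature holds the C_in
# values of spatial position i (row-major over H, then W).
feat = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # C_in=2, H=3, W=4
flat = rearrange_features(feat).reshape(-1)
# Position (h=0, w=1) is strip 1: addresses C_in*1 .. C_in*2-1.
assert list(flat[2:4]) == [feat[0, 0, 1], feat[1, 0, 1]]
```

With this layout, the prefetch module reads each strip as one contiguous burst, which is what restores the address continuity the patent requires.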
In the invention, the scheduling of the characteristic/weight pre-fetching module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module adopts a pipeline mechanism, so that all hardware units in each clock cycle are in a working state, the utilization rate of the hardware units is improved, the chip area is reduced, and the working efficiency of the circuit is improved.
The invention has the following beneficial effects: the convolutional layer and the fully-connected layer share the same arithmetic circuit, so the hardware is fully multiplexed and the circuit suits a variety of convolutional-neural-network structures. The output control module writes the outputs of each layer back to the external memory in the expected arrangement order, so the features of all layers except the first are already arranged as required and no extra cost is incurred for rearranging data. The weights of a convolutional neural network are unchanged during the inference phase, i.e. the weights need to be rearranged only once, at system initialization.
Drawings
Fig. 1 is a basic block diagram of the circuit.
FIG. 2 is a diagram illustrating conversion of full link layer operation into convolutional layer operation.
FIG. 3 is a diagram illustrating mapping of convolutional layer operations to matrix operations.
Fig. 4 is a schematic diagram of the memory arrangement of features and weights.
FIG. 5 is a schematic diagram of a decomposition of an arbitrary-scale matrix operation into multiple fixed-size matrix operations.
Detailed Description
In the present invention, a basic block diagram of a circuit capable of accelerating both the convolutional layer and the fully-connected layer is shown in fig. 1. The design works as follows. The features of each layer, together with the corresponding weights, reside in an external memory (DRAM). First, the feature/weight prefetch module reads the features and weights about to participate in the operation out of the external memory and puts them into the local cache; the new data replaces old, no-longer-used data in the cache. Then, the control circuit fetches the features and weights from the local cache in operation order and sends them to the matrix operation unit. Because the features and weights have been rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, and the output of the matrix operation unit is written into the temporary data accumulation module. After a number of matrix operations, the accumulated result forms part of the output features of the current layer. The output control module writes these partial output features back to the external memory in a specific arrangement order. After all operations of the current layer are completed, the circuit can begin operating on the next network layer.
The operation of the convolution layer and the full connection layer is mapped into a series of matrix operations, and the specific flow is described as follows:
First, the operation of the fully-connected layer is converted into a convolutional-layer operation, as shown in fig. 2. Let the input feature be a cube of shape C_in*H*W, meaning the input has C_in channels, each of size H*W. For a fully-connected layer, the usual operation is to rearrange the input into a row vector of length C_in*H*W and multiply it by a weight matrix of height C_in*H*W and width C_out. The result of the matrix multiplication is a row vector of length C_out, which is the feature the current layer passes to the next network layer. To convert the fully-connected operation into a convolutional one, the weight matrix of height C_in*H*W and width C_out is split into C_out sub-weight matrices, denoted K0, K1, K2, ..., Kn (n = C_out-1). Each sub-weight matrix is a cube of shape C_in*H*W. Each sub-weight matrix is convolved with the input features; because their shapes are identical (both C_in*H*W), each convolution yields a scalar whose value equals the inner product of the feature matrix and the weight matrix. The C_out sub-weight matrices thus yield C_out scalars, which are concatenated into a vector to obtain the output of the current network layer (the fully-connected layer). By this method, a fully-connected layer is converted into a convolution operation whose input feature and kernel are both of size C_in*H*W and whose number of output channels is C_out.
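The equivalence in this step can be checked numerically: splitting the fully-connected weight matrix column by column into input-shaped kernels and taking inner products reproduces the matrix product. A small NumPy sketch (variable names are my own, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, h, w, c_out = 3, 2, 2, 5
x = rng.random((c_in, h, w))                 # input feature cube
wt = rng.random((c_in * h * w, c_out))       # fully-connected weight matrix

# Fully-connected view: flatten the input and multiply.
fc_out = x.reshape(1, -1) @ wt               # row vector of length c_out

# Convolutional view: column k of wt, reshaped like x, is sub-kernel Kk;
# "convolving" two identically shaped cubes is one inner product (a scalar).
conv_out = np.array([np.sum(x * wt[:, k].reshape(c_in, h, w))
                     for k in range(c_out)])

assert np.allclose(fc_out.ravel(), conv_out)
```

Both views agree element-wise because flattening the input and reshaping each weight column use the same row-major order.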
Next, the convolutional-layer operation is mapped to matrix operations, as shown in FIG. 3. The input feature is of size C_in*H*W, each convolution kernel (weight) is of size C_in*K*K, and there are C_out kernels, corresponding to C_out output channels. To obtain the first pixel of each output channel, the required C_in*K*K input features are drawn into a row vector, and the C_out kernels are unfolded into a matrix of height C_out and width C_in*K*K. Multiplying the feature row vector by this weight matrix yields a row vector of length C_out, each element of which is the first pixel of one output channel. Computing all pixels requires H*W such matrix operations. By this method, the convolutional-layer operation is converted into H*W matrix operations in which the matrix has height C_out and width C_in*K*K. Such a matrix is relatively large, and its size varies from convolutional layer to convolutional layer, which is unsuitable for hardware implementation; it is therefore necessary to decompose these matrix operations into multiple matrix operations of fixed size.
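This mapping is the classic im2col idea: one output pixel across all channels is one vector-matrix product. A minimal sketch for a single output position, under my own variable names (not the patent's):

```python
import numpy as np

rng = np.random.default_rng(1)
c_in, h, w, k, c_out = 3, 5, 5, 3, 4
x = rng.random((c_in, h, w))                 # input feature, C_in*H*W
kernels = rng.random((c_out, c_in, k, k))    # C_out kernels, each C_in*K*K

# Row vector for output position (i, j): the C_in*K*K patch, flattened.
i = j = 1                                    # any position with a full patch
patch = x[:, i:i + k, j:j + k].reshape(1, -1)

# Weight matrix of height C_out and width C_in*K*K: one flattened
# kernel per row.
wmat = kernels.reshape(c_out, -1)

# One matrix operation yields the length-C_out row vector of pixels.
pixel = patch @ wmat.T
for o in range(c_out):
    assert np.isclose(pixel[0, o],
                      np.sum(x[:, i:i + k, j:j + k] * kernels[o]))
```

Repeating this for every (i, j) gives the H*W matrix operations the text describes.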
Finally, the matrix operation is decomposed into a plurality of matrix operations of fixed size.
FIG. 5 illustrates how a fixed-size H_F*W_F matrix operation unit realizes an H*W matrix operation. To realize the H*W operation, the H_F*W_F unit must be invoked ceil(H/H_F)*ceil(W/W_F) times, where ceil denotes rounding up. The data used in the first invocation is a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns 0 to H_F-1. The output of the first invocation is a vector of length W_F, which is sent as temporary data to the temporary data accumulation module. The data used in the second invocation is again a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns H_F to 2H_F-1; this realizes iteration along the column direction, and its output is again a vector of length W_F. After ceil(H/H_F) such iterations, the column-direction iteration is finished, producing ceil(H/H_F) vectors of length W_F. The sum of these vectors gives the first W_F results of the H*W matrix operation. The remaining W-W_F results are computed in the same way. Thus, a matrix operation of arbitrary size can be decomposed into multiple fixed-size matrix operations.
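The tiling scheme above can be sketched in software: a fixed w_f-by-h_f "unit" is invoked on zero-padded tiles, and partial results accumulate exactly as in the temporary data accumulation module. This is my own NumPy illustration of the scheme, not the patent's circuit:

```python
import numpy as np

def tiled_matvec(A, v, h_f=64, w_f=16):
    """Compute A @ v using only a fixed w_f x h_f matrix operation unit:
    each invocation consumes a zero-padded w_f x h_f sub-matrix and a
    length-h_f vector slice, producing a length-w_f partial result."""
    rows, cols = A.shape                    # W x H in the patent's notation
    out = np.zeros(rows)
    for w0 in range(0, rows, w_f):          # one output chunk of W_F results
        acc = np.zeros(w_f)                 # temporary data accumulator
        for h0 in range(0, cols, h_f):      # iterate along the column (H) direction
            r = min(w_f, rows - w0)
            c = min(h_f, cols - h0)
            sub = np.zeros((w_f, h_f))      # pad partial edge tiles with zeros
            sub[:r, :c] = A[w0:w0 + r, h0:h0 + c]
            vec = np.zeros(h_f)
            vec[:c] = v[h0:h0 + c]
            acc += sub @ vec                # one fixed-size unit invocation
        out[w0:w0 + min(w_f, rows - w0)] = acc[:min(w_f, rows - w0)]
    return out

rng = np.random.default_rng(2)
A = rng.random((32, 100))                   # the 100 x 32 example below: W=32 rows
v = rng.random(100)
assert np.allclose(tiled_matvec(A, v), A @ v)
```

With the default tile size, the 32-by-100 case takes ceil(100/64)*ceil(32/16) = 4 unit invocations, matching the worked example that follows.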
For example, a 100*32 matrix operation is implemented with a matrix operation unit of size 64*16 as follows. To realize the 100*32 operation, ceil(100/64)*ceil(32/16) = 4 invocations of the 64*16 unit are needed. The data used in the first invocation is the sub-matrix occupying rows 0 to 15 and columns 0 to 63 of the original matrix, as shown by the red box (i.e., the inner frame) in fig. 5(a). The output of the first invocation is a vector of length 16, which is sent as temporary data to the temporary data accumulation module. The second invocation uses the sub-matrix in rows 0 to 15 and columns 64 to 99; since it uses only 100-64 = 36 columns of the matrix operation unit, the remaining 28 columns are padded with zeros. Its output is again a vector of length 16, and the sum of this vector and the first invocation's result gives the first 16 results of the 100*32 matrix operation. The remaining 16 results are computed in the same way. Thus, one arbitrary-size matrix operation is decomposed into multiple fixed-size matrix operations.
The output result of the matrix operation unit with fixed size is stored in the temporary data accumulation module. After the accumulation is finished, the accumulation module sends the accumulated result (the input characteristic of the next layer of network) to the output control module, and the output control module is responsible for writing the accumulated result back to the external memory according to a certain arrangement sequence, so that the operation of the current layer (which can be a convolution layer or a full connection layer) is completed.
In mapping convolutional-layer operations to matrix operations, it is necessary to pull the input features into a series of row vectors and expand the convolution kernels into a matrix. With the traditional memory space allocation, the access bandwidth of the external memory would become the bottleneck of the whole system, because the addresses the feature/weight prefetch module needs to read would become discontinuous. To guarantee the continuity of the addresses of the data read by the feature/weight prefetch module, the memory arrangement of the features and weights must be adjusted.
As in fig. 4, an input feature of size C_in*H*W is cut into H*W strips, each of length C_in, and the data in the H*W strips are written into memory at sequential addresses. Starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip (strip H*W-1) occupy addresses (H*W-1)*C_in to H*W*C_in-1. In other words, the expansion order of the features in memory is C_in => W => H (versus W => H => C_in in the traditional memory space allocation).
The convolution kernel comprises C_out sub-weight matrices of size C_in*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is C_in => W => H => C_out (versus W => H => C_in => C_out in the traditional memory space allocation).
When the operation of each layer is finished, the output control module writes the layer's output back to the external memory in the expected arrangement order. The features of all layers except the first are therefore already arranged as required, and no extra cost is incurred for rearranging data. The weights of a convolutional neural network are unchanged during the inference phase, i.e. they need to be rearranged only once, at system initialization. The cost of adjusting the arrangement of the features and weights in memory is thus relatively small.
Claims (4)
1. A circuit structure for accelerating a convolution layer and a full connection layer of a neural network is characterized in that the convolution layer and the full connection layer are both mapped to the same matrix operation unit in a mode of expanding operation; the access performance loss caused by the discontinuity of the expanded feature and weight reading addresses is reduced by reordering the features and weights of each layer of the neural network; the circuit structure comprises a characteristic/weight prefetching module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module is used for fetching new feature and weight data from an external memory into a local cache, replacing old data that is no longer used; except for the first-layer features of the neural network, all other features and weights are already rearranged in the required pattern, and the rearrangement of the first-layer features is likewise performed in that pattern, so that the feature/weight prefetch module does not need to implement the rearrangement function;
the local cache is used for caching input data required by the matrix arithmetic unit;
the matrix operation unit is used for realizing the operation of a matrix; after the features and the weights are rearranged, the operation of the convolution layer and the full connection layer is mapped into a series of matrix operations, and the matrix operations are realized by calling a matrix operation module for multiple times;
the temporary data accumulation module is used for accumulating the data sent by the matrix operation module; after multiple times of accumulation, the accumulated result, namely the input characteristic of the next layer of network, is sent to an output control module;
the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the rearrangement mode;
the features and the weights are rearranged according to a certain mode, and the specific process is as follows:
an input feature of size C_in*H*W is cut into H*W strips, each of length C_in; the data in the H*W strips are written into memory at sequential addresses; starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip occupy addresses (H*W-1)*C_in to H*W*C_in-1 of the memory space;
the convolution kernel contains C_out sub-weight matrices of size C_in*H*W, and arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout.
2. The circuit structure for accelerating the convolutional layer and the fully-connected layer of the neural network as claimed in claim 1, wherein the feature/weight pre-fetching module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module are scheduled by a pipeline mechanism, so that all hardware units are in a working state every clock cycle.
3. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network as claimed in claim 1, wherein the operation of the convolutional layers and fully-connected layers is mapped to a series of matrix operations, and the specific flow is as follows:
firstly, converting the operation of a fully-connected layer into the operation of a convolutional layer; let the input feature be a cube of shape C_in*H*W, meaning the input has C_in channels, each of size H*W; for a fully-connected layer, the usual operation is to rearrange the input into a row vector of length C_in*H*W and multiply it by a weight matrix of height C_in*H*W and width C_out; to convert the fully-connected operation into a convolutional operation, the weight matrix of height C_in*H*W and width C_out is split into C_out sub-weight matrices, denoted K0, K1, K2, ..., Kn, n = C_out-1; each sub-weight matrix is a cube of shape C_in*H*W; each sub-weight matrix is convolved with the input features, the shapes of the sub-weight matrices being completely identical to that of the input, namely C_in*H*W; the result of each convolution is a scalar whose value equals the inner product of the feature matrix and the weight matrix; the C_out sub-weight matrices together yield C_out scalars; connecting these C_out scalars into a vector gives the output of the current network layer, namely the fully-connected layer; thus, a fully-connected layer is converted into a convolution operation whose input feature and kernel are both of size C_in*H*W and whose number of output channels is C_out;
secondly, mapping the operation of the convolutional layer into matrix operations; the input feature is of size C_in*H*W, each convolution kernel, i.e. weight, is of size C_in*K*K, and there are C_out kernels, corresponding to C_out output channels; to obtain the first pixel of each output channel, the required C_in*K*K input features are drawn into a row vector, and the C_out kernels are unfolded into a matrix of height C_out and width C_in*K*K; multiplying the feature row vector by the weight matrix yields a row vector of length C_out, each element of which is the first pixel of one output channel; computing all pixels requires H*W matrix operations; thus, the convolutional-layer operation is converted into H*W matrix operations, in which the matrix has height C_out and width C_in*K*K;
Finally, such a matrix operation is decomposed into a plurality of fixed-size matrix operations.
4. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network of claim 3, wherein the process of decomposing the matrix operation into a plurality of fixed-size matrix operations is:
let the matrix to be operated on be H*W and the fixed-size matrix used for the operation be H_F*W_F; ceil(H/H_F)*ceil(W/W_F) invocations of the H_F*W_F unit are therefore needed, where ceil denotes rounding up; the data used in the first invocation is a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns 0 to H_F-1; the output of the first invocation is a vector of length W_F, which is output as temporary data to the temporary data accumulation module; the data used in the second invocation is again a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns H_F to 2H_F-1, realizing iteration along the column direction; the output of the second invocation is again a vector of length W_F; after ceil(H/H_F) iterations, the column-direction iteration is finished, producing ceil(H/H_F) vectors of length W_F; the sum of these vectors is the first W_F results of the H*W matrix operation; the remaining W-W_F results are calculated by analogy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810120895.0A CN108416434B (en) | 2018-02-07 | 2018-02-07 | Circuit structure for accelerating convolutional layer and full-connection layer of neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108416434A CN108416434A (en) | 2018-08-17 |
CN108416434B true CN108416434B (en) | 2021-06-04 |
Family
ID=63126912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810120895.0A Active CN108416434B (en) | 2018-02-07 | 2018-02-07 | Circuit structure for accelerating convolutional layer and full-connection layer of neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108416434B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308194B (en) * | 2018-09-29 | 2021-08-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for storing data |
CN109375952B (en) * | 2018-09-29 | 2021-01-26 | 北京字节跳动网络技术有限公司 | Method and apparatus for storing data |
WO2020062252A1 (en) * | 2018-09-30 | 2020-04-02 | 华为技术有限公司 | Operational accelerator and compression method |
CN111045958B (en) * | 2018-10-11 | 2022-09-16 | 展讯通信(上海)有限公司 | Acceleration engine and processor |
CN109886398A (en) * | 2019-01-03 | 2019-06-14 | 曾集伟 | Neural network matrix multiplying method and Related product |
CN109816108A (en) * | 2019-02-15 | 2019-05-28 | 领目科技(上海)有限公司 | Deep learning accelerator, device and method |
CN109948787B (en) * | 2019-02-26 | 2021-01-08 | 山东师范大学 | Arithmetic device, chip and method for neural network convolution layer |
CN110032538B (en) * | 2019-03-06 | 2020-10-02 | 上海熠知电子科技有限公司 | Data reading system and method |
CN109993283B (en) * | 2019-04-12 | 2023-02-28 | 南京吉相传感成像技术研究院有限公司 | Deep convolution generation type countermeasure network acceleration method based on photoelectric calculation array |
CN110222819B (en) * | 2019-05-13 | 2021-04-20 | 西安交通大学 | Multilayer data partition combined calculation method for convolutional neural network acceleration |
CN111950718B (en) * | 2019-05-16 | 2021-12-07 | 北京知存科技有限公司 | Method for realizing progressive CNN operation by using storage and computation integrated chip |
CN112784973A (en) * | 2019-11-04 | 2021-05-11 | 北京希姆计算科技有限公司 | Convolution operation circuit, device and method |
CN113222136A (en) * | 2020-01-21 | 2021-08-06 | 北京希姆计算科技有限公司 | Convolution operation method and chip |
CN111340224B (en) * | 2020-02-27 | 2023-11-21 | 浙江芯劢微电子股份有限公司 | Accelerated design method of CNN (computer network) suitable for low-resource embedded chip |
WO2022013722A1 (en) * | 2020-07-14 | 2022-01-20 | United Microelectronics Centre (Hong Kong) Limited | Processor, logic chip and method for binarized convolution neural network |
CN112418419B (en) * | 2020-11-20 | 2022-10-11 | 复旦大学 | Data output circuit structure processed by neural network and scheduled according to priority |
CN112614175A (en) * | 2020-12-21 | 2021-04-06 | 苏州拓驰信息技术有限公司 | Injection parameter determination method for hole sealing agent injector based on characteristic decorrelation |
CN113592075B (en) * | 2021-07-28 | 2024-03-08 | 浙江芯昇电子技术有限公司 | Convolution operation device, method and chip |
CN115906948A (en) * | 2023-03-09 | 2023-04-04 | 浙江芯昇电子技术有限公司 | Full-connection-layer hardware acceleration device and method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5414623A (en) * | 1992-05-08 | 1995-05-09 | Iowa State University Research Foundation | Optoelectronic system for implementation of iterative computer tomography algorithms |
CN1688119A (en) * | 2005-04-01 | 2005-10-26 | 清华大学 | Weighting-based multi-user detection method for DS-CDMA systems |
US20130282634A1 (en) * | 2011-03-31 | 2013-10-24 | Microsoft Corporation | Deep convex network with joint use of nonlinear random projection, restricted boltzmann machine and batch-based parallelizable optimization |
CN103902762A (en) * | 2014-03-11 | 2014-07-02 | 复旦大学 | Circuit structure for conducting least square equation solving according to positive definite symmetric matrices |
CN104679895A (en) * | 2015-03-18 | 2015-06-03 | 成都影泰科技有限公司 | Medical image data storing method |
US20150170020A1 (en) * | 2013-12-13 | 2015-06-18 | Amazon Technologies, Inc. | Reducing dynamic range of low-rank decomposition matrices |
US20170046602A1 (en) * | 2015-08-14 | 2017-02-16 | International Business Machines Corporation | Learning temporal patterns from electronic health records |
CN106503797A (en) * | 2015-10-08 | 2017-03-15 | 上海兆芯集成电路有限公司 | Neural network unit with neural memory and array of neural processing units that collectively shift rows of data received from the neural memory |
CN106855853A (en) * | 2016-12-28 | 2017-06-16 | 成都数联铭品科技有限公司 | Entity relation extraction system based on deep neural network |
US20170221176A1 (en) * | 2016-01-29 | 2017-08-03 | Fotonation Limited | Convolutional neural network |
CN107454966A (en) * | 2015-05-21 | 2017-12-08 | 谷歌公司 | Prefetching weights for use in a neural network processor |
- 2018-02-07 — Application CN201810120895.0A filed in China (CN); granted as patent CN108416434B, status Active
Non-Patent Citations (2)
Title |
---|
Min Wang et al., "Fast Decoding and Hardware Design for Binary-Input Compressive Sensing", IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 2, no. 3, Sep. 30, 2012, full text *
Chen Chen et al., "An Area- and Power-Optimized Convolver Design" (一种面积与功耗优化的卷积器设计), Computer Engineering (计算机工程), Nov. 30, 2010, full text *
Also Published As
Publication number | Publication date |
---|---|
CN108416434A (en) | 2018-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108416434B (en) | Circuit structure for accelerating convolutional layer and full-connection layer of neural network | |
CN111242289B (en) | Convolutional neural network acceleration system and method with expandable scale | |
CN111178519B (en) | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method | |
Shen et al. | Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA | |
JP7329533B2 (en) | Method and accelerator apparatus for accelerating operations | |
CN108205701B (en) | System and method for executing convolution calculation | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN108763612B (en) | Circuit for accelerating operation of pooling layer of neural network | |
KR102523263B1 (en) | Systems and methods for hardware-based pooling | |
CN109409512B (en) | Flexibly configurable neural network computing unit, computing array and construction method thereof | |
US20210019594A1 (en) | Convolutional neural network accelerating device and method | |
KR20170023708A (en) | Convolutional neural network computing apparatus | |
CN108170640B (en) | Neural network operation device and operation method using same | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN112703511B (en) | Operation accelerator and data processing method | |
US20240119114A1 (en) | Matrix Multiplier and Matrix Multiplier Control Method | |
CN114461978B (en) | Data processing method and device, electronic equipment and readable storage medium | |
CN109446478B (en) | Complex covariance matrix calculation system based on iteration and reconfigurable mode | |
CN112215345A (en) | Convolutional neural network operation method and device based on Tensorcore | |
CN109993293B (en) | Deep learning accelerator suitable for heap hourglass network | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
CN112836823B (en) | Convolutional neural network back propagation mapping method based on cyclic recombination and blocking | |
CN111667052A (en) | Standard and nonstandard volume consistency transformation method for special neural network accelerator | |
Sakr et al. | Memory-efficient CMSIS-NN with replacement strategy | |
US20030187898A1 (en) | Parallel processing method of an eigenvalue problem for a shared-memory type scalar parallel computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||