CN110807513A - Convolutional neural network accelerator based on Winograd sparse algorithm - Google Patents

Convolutional neural network accelerator based on Winograd sparse algorithm

Info

Publication number
CN110807513A
Authority
CN
China
Prior art keywords
data
winograd
module
weight
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911013112.XA
Other languages
Chinese (zh)
Inventor
郭阳
徐睿
马胜
刘胜
陈海燕
王耀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201911013112.XA priority Critical patent/CN110807513A/en
Publication of CN110807513A publication Critical patent/CN110807513A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a convolutional neural network accelerator based on the Winograd sparse algorithm, comprising: a control module, responsible for data movement; a buffer module, for temporarily storing loaded data; and an operation module, for performing the Winograd sparse-algorithm computation. In the read phase, the control module issues addresses, and the input buffer and the weight buffer read data from the external DRAM. In the computation phase, the operation module reads input data, weight data and weight indexes from the buffer module to complete the convolution operation. In the send phase, once the output has completed the final accumulation, it is written back to the external DRAM through the output buffer, completing the computation. The invention has the advantages of a simple structure, ease of implementation and a good acceleration effect.

Description

Convolutional neural network accelerator based on Winograd sparse algorithm
Technical Field
The invention relates generally to the technical field of convolutional neural networks, and in particular to a convolutional neural network accelerator based on a Winograd sparse algorithm.
Background
Convolutional neural networks are now widely used across computing fields such as image recognition, recommendation systems and language processing. However, the time required to train convolutional neural networks and to run inference on them can be prohibitive. The cause is the convolutional layers themselves: they raise the computational complexity of the network and bring an enormous workload, which current CPUs and embedded processors struggle to handle.
Many schemes have been proposed to address this problem, such as GPU acceleration, or performing the convolution on hardware such as FPGAs and custom ASICs. Most of these schemes exploit the parallelism of the convolutional neural network algorithm to improve the efficiency of the convolution computation. However, constraints on GPU area, power consumption and platform suitability make GPUs difficult to deploy widely on popular mobile or embedded terminals. Customized hardware design on an FPGA or ASIC, which achieves acceleration while keeping power consumption and area under control, is therefore a highly efficient solution.
At present, however, most FPGA and ASIC schemes use direct convolution when handling convolutional-layer operations, and few schemes employ other algorithms. This optimizes computational efficiency through hardware alone, while ignoring the optimization space at the software or algorithm level. Given the trend toward ever deeper convolutional network topologies, and the higher computational complexity they bring, it is necessary to select other acceleration schemes to obtain more benefit from limited hardware resources.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a convolutional neural network accelerator based on the Winograd sparse algorithm that is simple in structure, easy to implement and delivers a good acceleration effect.
In order to solve the technical problems, the invention adopts the following technical scheme:
a convolutional neural network accelerator based on Winograd sparse algorithm, comprising:
the control module is used for taking charge of moving the data;
a buffer module buffer for temporary storage of load data,
the operation module is used for finishing the operation of the Winograd sparse algorithm;
in the reading stage, the control module sends an address, and the input cache and the weight cache read data in the external DRAM; in the data operation stage, the operation module reads input data, weight data and weight indexes from the buffer module to complete convolution operation; in the sending stage, when the output finishes the final accumulation operation, the output is sent to the external DRAM through the output cache, and the calculation is finished finally.
As a further improvement of the invention, the control module comprises:
a conversion module, for converting the data to be processed into the Winograd domain;
a zero-skipping module, for skipping all registers whose input value is 0 before feeding the remaining inputs into the parallel multiplier array, thereby reducing the pressure on the multipliers during computation;
a compression coding unit, for providing sparse storage support; and
a weight compression-coding reading unit, for storing the data and the index separately, with reads of the data and the index guided by the buffer position and the first entry of the index.
As a further improvement of the invention, the compression coding unit targets a 4 × 4 sparse matrix with a density of about 0.4. In the data structure, the two-dimensional matrix is stored linearly in a one-dimensional form: the values of the non-zero elements are stored in the vector Data, and the position of each non-zero element is stored in the vector Index, encoded as (r × 4 + c), where r is the element's row in the matrix and c is its column. The total number of non-zero elements in the matrix is stored in the first entry of Index.
As a further improvement of the invention, the weight compression-coding reading unit operates on 4 × 4 convolution weights that have been trained with sparsity added. Since the maximum value stored in Index is 16, each Index entry is stored as a 5-bit unsigned integer; Data is stored in 16-bit fixed point.
As a further improvement of the invention, the conversion module converts the input into the Winograd domain; the element-wise multiplication between the matrices is then performed, and finally the multiplier output is converted from the Winograd domain back into the spatial domain to produce the result.
As a further improvement of the invention, the buffer module uses a linear buffer unit that delivers data directly in the required one-dimensional form; when data reuse occurs during reading, the read pointer re-reads within the reuse region to reuse the data.
As a further improvement of the invention, the operation module comprises:
a processing engine, for performing the convolution of the input feature data and the weight data under the Winograd sparse algorithm; and
a processing unit, for processing the input feature data and the weight data.
As a further improvement of the invention, the processing unit (PU) is composed of sub-module processing engines (PEs) and an accumulator. The PU processes four groups of data at a time, corresponding to four input channels; four groups of weight data are loaded, and the computation produces one group of output data corresponding to the output feature map of a single channel. The four groups of inputs are distributed to the four corresponding PEs in the PU, the four PEs compute in parallel, and the results are accumulated to produce the output.
Compared with the prior art, the invention has the advantages that:
the convolutional neural network accelerator based on the Winograd sparse algorithm has the advantages of simple structure, easiness in realization and good acceleration effect, and can quickly convert data to be processed into a Winograd domain by utilizing the conversion module through the simplest addition operation. Then by skipping the module by 0, the pressure to use the multiplier in the calculation process can be reduced. And the convolution operation of the input characteristic data and the weight data under a Winograd sparse algorithm is completed by utilizing the processing engine and the processing unit, and the input characteristic data and the weight data are processed. The linear buffer design in the invention can reuse the input characteristic data to the maximum extent.
Drawings
Fig. 1 is a schematic diagram of the topology of the present invention.
FIG. 2 is a schematic diagram of the structure of a processing unit, together with pseudocode for its operation, in an embodiment of the present invention.
FIG. 3 is a diagram of compression coding in a specific application example of the present invention.
FIG. 4 is a diagram illustrating weight data storage and reading in an embodiment of the present invention.
FIG. 5 is a schematic diagram of the processing engine in a specific application example.
FIG. 6 is a schematic diagram of the structural principle of the conversion module in a specific application example of the present invention.
FIG. 7 is a schematic diagram of a linear buffer unit in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in fig. 1, the convolutional neural network accelerator based on the Winograd sparse algorithm of the present invention comprises:
a control module (top control), responsible for data movement;
a buffer module (buffer), for temporarily storing loaded data; and
an operation module (PUs), for performing the Winograd sparse-algorithm computation.
In the read phase, the control module (top control) issues addresses, and the input buffer and the weight buffer read data from the external DRAM;
in the computation phase, the operation module reads input data, weight data and weight indexes from the buffer module to complete the convolution operation;
in the send phase, once the output has completed the final accumulation, it is written back to the external DRAM through the output buffer, completing the computation.
In a specific application example, the control module of the invention comprises:
a conversion module, for converting the data to be processed into the Winograd domain; in this example, the conversion module completes the conversion very quickly using only simple additions;
a zero-skipping module, for skipping all registers whose input value is 0 before feeding the remaining inputs into the parallel multiplier array, thereby reducing the pressure on the multipliers during computation;
a compression coding unit, for providing sparse storage support: the data is no longer stored in its original matrix form; instead, the two-dimensional matrix is stored linearly in a one-dimensional form, the values of the non-zero elements are stored in the vector Data, and the position information of the non-zero elements is stored in the vector Index; and
a weight compression-coding reading unit, for storing the data and the index separately so as to save buffer space, with reads of the data and the index guided by the buffer position and the first entry of the index.
In a specific application example, the buffer module of the invention uses a linear buffer unit. Because the processing engine flattens the two-dimensional matrix into a one-dimensional vector when handling input data, a buffer with a linear structure allows the data to be delivered directly to the processing unit in the required one-dimensional form. Data reuse also occurs during reading: the read pointer re-reads within the reuse region to reuse data.
In a specific application example, the operation module of the invention comprises:
a processing engine, for performing the convolution of the input feature data and the weight data under the Winograd sparse algorithm; and
a processing unit, for processing the input feature data and the weight data; in this example the processing unit, which contains the processing engines and the accumulator, is a key building block of the computation module.
As shown in fig. 2, the processing unit (PU) is not the basic unit of the computation module; it is itself composed of sub-module processing engines (PEs) and an accumulator. The PU processes four groups of data at a time, corresponding to four input channels; accordingly, four groups of weight data are loaded, and the computation produces one group of output data corresponding to the output feature map of a single channel. To reduce the number of weight-data reads, the dataflow is weight-stationary: the next group of weights is loaded only once the current weights have been fully used. The four groups of inputs are distributed to the four corresponding PEs in the PU, the four PEs compute in parallel, and the results are accumulated to produce the output; the computation can be followed in the pseudocode of fig. 2. Note that the weights referred to here are compression-encoded data, comprising weight values and a weight index.
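To make this dataflow concrete, the following Python sketch models one PU: each PE transforms its input tile into the Winograd domain, multiplies it element-wise by its stationary Winograd-domain weight tile, and inverse-transforms the product; the accumulator sums the four 2 × 2 PE outputs into one output tile. The function names and the use of the standard Winograd F(2 × 2, 3 × 3) transform matrices are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (an assumption
# consistent with the 4x4 tiles used in this design): B^T maps a 4x4
# spatial input tile into the Winograd domain, A^T maps a 4x4
# element-wise product back to a 2x2 spatial output tile.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def pe(d, w_wd):
    """One PE: transform the 4x4 input tile d, multiply element-wise by the
    (compression-decoded) Winograd-domain weight tile, inverse-transform."""
    return A_T @ ((B_T @ d @ B_T.T) * w_wd) @ A_T.T

def pu_forward(inputs, weights_wd):
    """One PU: four PEs, one per input channel, run in parallel in hardware;
    the accumulator sums their 2x2 outputs into one tile of the output
    feature map. The weights stay loaded (weight-stationary) while
    successive input tiles stream through."""
    return sum(pe(inputs[ch], weights_wd[ch]) for ch in range(4))
```

Because the inverse transform is linear, the accumulation could equally be done in the Winograd domain before a single inverse transform; the sketch follows the per-PE structure described in the text.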
As shown in fig. 3, the compression coding scheme in a specific application example of the present invention works as follows. Because a Winograd-domain sparse network structure is used, the invention proposes its own compression coding scheme, drawing on current mainstream sparse-matrix coding formats and combining them with its own hardware features and computing requirements, and targets a 4 × 4 sparse matrix with a density of about 0.4. First, in the data structure, the data is no longer stored in its original matrix form; instead, the two-dimensional matrix is stored linearly in a one-dimensional form. The values of the non-zero elements are then stored in the vector Data, and the position of each non-zero element is stored in the vector Index, encoded as (r × 4 + c), where r is the element's row in the matrix and c is its column. In addition, the total number of non-zero elements in the matrix is stored in the first entry of Index.
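A minimal Python sketch of this encoding (the function name and the list representation are assumptions for illustration):

```python
import numpy as np

def encode_4x4(w):
    """Compression-encode a 4x4 sparse tile as described above: Data holds
    the non-zero values in row-major order; Index holds the non-zero count
    in its first entry, followed by each position encoded as r*4 + c.
    """
    flat = np.asarray(w).reshape(-1)      # store the 2-D matrix linearly
    nz = np.flatnonzero(flat)             # positions of the non-zero elements
    data = flat[nz].tolist()              # vector Data
    index = [len(nz)] + nz.tolist()       # vector Index, count first
    return data, index
```

For a 4 × 4 tile at density 0.4 (six or seven non-zero elements), Data then has six or seven entries and Index one more, which is the layout the reading procedure below relies on.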
Fig. 4 is a schematic diagram of weight storage and reading in an embodiment of the present invention. Consider 4 × 4 convolution weights that have been trained with sparsity added: the maximum value stored in Index is 16, so each Index entry is stored as a 5-bit unsigned integer, while Data is stored in 16-bit fixed point, which reduces storage while preserving accuracy. Because the two use different data widths, the data and the index are stored in separate buffers in the hardware design. The association between the data and the index depends on the buffer position and the first entry of the index: since the first entry stores the number of non-zero elements, it also gives the length of the weight-data vector and of the remaining index vector for the same group. When the first group of weights is read, the first entry in the buffer is read; in the figure its value is 6, so the read pointer advances 6 entries and reads them sequentially (the dark part of the figure); the first entry of the next group is then read to obtain that group's length, and the process repeats (the light part of the figure). By means of the first index entry, the location of each group of weights in the buffer can thus be easily found and the weight data associated with its index.
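The group-wise read can be sketched as follows, under the assumption that the separately stored index and data buffers behave as simple sequential lists:

```python
def read_weight_groups(index_buf, data_buf):
    """Walk the separately stored buffers group by group: each group starts
    with a length entry in the index buffer, and that length locates both
    the group's remaining index entries and its weight data.
    """
    groups, ip, dp = [], 0, 0
    while ip < len(index_buf):
        n = index_buf[ip]                          # first entry: non-zero count
        positions = index_buf[ip + 1: ip + 1 + n]  # this group's r*4+c positions
        values = data_buf[dp: dp + n]              # this group's weight values
        groups.append(list(zip(positions, values)))
        ip += 1 + n                                # pointer jumps past the group
        dp += n
    return groups
```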
As shown in fig. 5, a schematic diagram of the processing-engine structure in a specific application example of the present invention, the module works in three steps: first, processing the input feature map; second, completing the computation under the sparse data structure, guided by the index; and third, inversely transforming the output result, converting it from the Winograd domain back into the spatial domain.
Fig. 6 is a schematic diagram of the conversion module in an embodiment of the invention. In the first stage of the processing engine, the input must be transformed into the Winograd domain, and the conversion module performs this operation. The parameter matrices involved in the Winograd transform are very simple, involving only sign changes on the data and addition/subtraction, so no complex computation module is required.
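The patent does not spell out the transform matrices, but for 4 × 4 input tiles advanced with a stride of 2, the standard Winograd F(2 × 2, 3 × 3) parameters apply, and their input-transform matrix B^T contains only 0 and ±1 — consistent with the statement that only sign changes and additions/subtractions are needed. A minimal sketch under that assumption (matrix and function names are illustrative):

```python
import numpy as np

# Input-transform matrix B^T of the standard Winograd F(2x2, 3x3)
# algorithm (assumed here). Every entry is 0 or +-1, so the hardware
# transform needs only sign changes and add/subtract, no multipliers.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)

def input_transform(d):
    """Transform one 4x4 spatial input tile d into the Winograd domain."""
    return B_T @ d @ B_T.T    # computes B^T d B
```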
In the second stage, the element-wise multiplication between the matrices is completed. Because the weight data has been compression-encoded, the input and output data are represented as one-dimensional vectors throughout the computation flow. To let the compression encoding reduce the workload, the index is used as a control signal to gate the input and output registers that participate in the computation; all registers holding an input value of 0 are skipped, and the remaining inputs are fed into the parallel multiplier array to complete the computation and produce the output. Because the compression-coding method skips most of the multiplications, the design of the present invention places only 8 multipliers in each PE for the 16-bit inputs; later experiments show that 8 multipliers are sufficient.
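A behavioral sketch of this index-gated multiplication follows (a software model only; in hardware the surviving products are spread across the 8-multiplier array in parallel). Here the gating is driven by the weight index from the compression encoding, which already skips every zero weight; gating away zero-valued inputs as well, as the zero-skipping module does, would suppress additional products. All names are assumptions:

```python
def sparse_multiply(input_wd, weight_data, weight_index):
    """Index-gated element-wise multiply of a flattened 16-entry
    Winograd-domain input tile with one compression-encoded weight group.
    Only the positions listed in the index reach a multiplier; all other
    output registers keep their default value of 0.
    """
    out = [0] * 16
    n = weight_index[0]                  # first entry: non-zero count
    for i in range(n):                   # at most n multiplications issued
        pos = weight_index[1 + i]        # position r*4 + c in the tile
        out[pos] = input_wd[pos] * weight_data[i]
    return out
```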
In the last stage, the multiplier output is transformed by the conversion module from the Winograd domain back into the spatial domain, completing the output of the result.
Fig. 7 shows the buffer module for the input features in an embodiment of the present invention. Because the processing unit flattens the two-dimensional matrix into a one-dimensional vector when handling input data, the invention designs and uses a buffer with a linear structure, so that data can be delivered directly to the processing module in the required one-dimensional form. Taking the figure as an example: viewed horizontally, the storage is linear, and the different data blocks under the same channel of the feature map are read into the linear buffer, corresponding to the small squares in the figure; viewed vertically, the data of the different channels (TN) are stored. Because the Winograd algorithm in this design processes one 4 × 4 matrix block (TH × TW in the figure) at a time, data under the same channel is selected, during the read from DRAM to buffer, by sliding a 4 × 4 window over the data vertically with a stride of 2. This process naturally reuses data: after the first data block has been read, only the next two rows of data need to be read (the dark part of the figure is the reused part). The reused data corresponds to the dotted part of the linear buffer; the read pointer re-reads within that region to reuse the data, and the size of the reusable data is about H × TW.
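The overlap that creates this reuse can be sketched directly from the tiling parameters in the text (4 × 4 tiles, stride 2; function and argument names are assumptions):

```python
import numpy as np

def winograd_tiles(channel, tile=4, stride=2):
    """Yield the overlapping 4x4 tiles that a 4x4 window sliding with
    stride 2 selects from one input channel (a 2-D numpy array).
    Consecutive vertical tiles share two rows, so after the first tile
    only two new rows must be fetched from DRAM; those shared rows are
    what the linear buffer re-reads instead of re-fetching.
    """
    h, w = channel.shape
    for r in range(0, h - tile + 1, stride):
        for c in range(0, w - tile + 1, stride):
            yield channel[r:r + tile, c:c + tile]

# e.g. list(winograd_tiles(np.arange(36).reshape(6, 6))) yields four tiles,
# each overlapping its vertical neighbour by two rows.
```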
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to the scope of protection of the present invention. It should be noted that modifications and refinements that do not depart from the principle of the invention will occur to those skilled in the art and are also within the scope of protection of the invention.

Claims (8)

1. A convolutional neural network accelerator based on a Winograd sparse algorithm, characterized by comprising:
a control module, responsible for data movement;
a buffer module, for temporarily storing loaded data; and
an operation module, for performing the Winograd sparse-algorithm computation;
wherein, in a read phase, the control module issues addresses and the input buffer and the weight buffer read data from the external DRAM; in a computation phase, the operation module reads input data, weight data and weight indexes from the buffer module to complete the convolution operation; and in a send phase, once the output has completed the final accumulation, it is written back to the external DRAM through the output buffer, completing the computation.
2. The convolutional neural network accelerator based on the Winograd sparse algorithm according to claim 1, wherein the control module comprises:
a conversion module, for converting the data to be processed into the Winograd domain;
a zero-skipping module, for skipping all registers whose input value is 0 before feeding the remaining inputs into the parallel multiplier array, thereby reducing the pressure on the multipliers during computation;
a compression coding unit, for providing sparse storage support; and
a weight compression-coding reading unit, for storing the data and the index separately, with reads of the data and the index guided by the buffer position and the first entry of the index.
3. The convolutional neural network accelerator based on the Winograd sparse algorithm according to claim 2, wherein the compression coding unit targets a 4 × 4 sparse matrix with a density of about 0.4; the two-dimensional matrix is stored linearly in a one-dimensional form, the values of the non-zero elements are stored in the vector Data, and the position of each non-zero element is stored in the vector Index, encoded as (r × 4 + c), where r is the element's row in the matrix and c is its column; and the total number of non-zero elements in the matrix is stored in the first entry of Index.
4. The convolutional neural network accelerator based on the Winograd sparse algorithm according to claim 2, wherein the weight compression-coding reading unit operates on 4 × 4 convolution weights that have been trained with sparsity added; the maximum value stored in Index is 16, so each Index entry is stored as a 5-bit unsigned integer, and Data is stored in 16-bit fixed point.
5. The convolutional neural network accelerator based on the Winograd sparse algorithm according to claim 2, wherein the conversion module is configured to convert the input into the Winograd domain; the element-wise multiplication between the matrices is then performed, and finally the multiplier output is converted from the Winograd domain back into the spatial domain to produce the result.
6. The convolutional neural network accelerator based on the Winograd sparse algorithm according to any one of claims 1-5, wherein the buffer module uses a linear buffer unit that delivers data directly in the required one-dimensional form; and when data reuse occurs during reading, the read pointer re-reads within the reuse region to reuse the data.
7. The convolutional neural network accelerator based on the Winograd sparse algorithm according to any one of claims 1-5, wherein the operation module comprises:
a processing engine, for performing the convolution of the input feature data and the weight data under the Winograd sparse algorithm; and
a processing unit, for processing the input feature data and the weight data.
8. The convolutional neural network accelerator based on the Winograd sparse algorithm according to claim 7, wherein the processing unit (PU) is composed of sub-module processing engines (PEs) and an accumulator; the PU processes four groups of data at a time, corresponding to four input channels; four groups of weight data are loaded, and the computation produces one group of output data corresponding to the output feature map of a single channel; and the four groups of inputs are distributed to the four corresponding PEs in the PU, the four PEs compute in parallel, and the results are accumulated to produce the output.
CN201911013112.XA 2019-10-23 2019-10-23 Convolutional neural network accelerator based on Winograd sparse algorithm Pending CN110807513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911013112.XA CN110807513A (en) 2019-10-23 2019-10-23 Convolutional neural network accelerator based on Winograd sparse algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911013112.XA CN110807513A (en) 2019-10-23 2019-10-23 Convolutional neural network accelerator based on Winograd sparse algorithm

Publications (1)

Publication Number Publication Date
CN110807513A true CN110807513A (en) 2020-02-18

Family

ID=69488998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911013112.XA Pending CN110807513A (en) 2019-10-23 2019-10-23 Convolutional neural network accelerator based on Winograd sparse algorithm

Country Status (1)

Country Link
CN (1) CN110807513A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐睿 et al.: "Design and research of a convolutional neural network accelerator based on the Winograd sparse algorithm", 《计算机工程与科学》 (Computer Engineering & Science) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882028A (en) * 2020-06-08 2020-11-03 北京大学深圳研究生院 Convolution operation device for convolution neural network
WO2022067508A1 (en) * 2020-09-29 2022-04-07 华为技术有限公司 Neural network accelerator, and acceleration method and device
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113077047A (en) * 2021-04-08 2021-07-06 华南理工大学 Convolutional neural network accelerator based on feature map sparsity
CN113077047B (en) * 2021-04-08 2023-08-22 华南理工大学 Convolutional neural network accelerator based on feature map sparsity
CN113592702A (en) * 2021-08-06 2021-11-02 厘壮信息科技(苏州)有限公司 Image algorithm accelerator, system and method based on deep convolutional neural network
CN113835758A (en) * 2021-11-25 2021-12-24 之江实验室 Winograd convolution implementation method based on vector instruction accelerated computation
CN115878957A (en) * 2022-12-29 2023-03-31 珠海市欧冶半导体有限公司 Matrix multiplication accelerating device and method
CN115878957B (en) * 2022-12-29 2023-08-29 珠海市欧冶半导体有限公司 Matrix multiplication acceleration device and method
CN116032432A (en) * 2023-02-17 2023-04-28 重庆邮电大学 Density adjustment method based on sparse network coding

Similar Documents

Publication Publication Date Title
CN110807513A (en) Convolutional neural network accelerator based on Winograd sparse algorithm
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
Jiao et al. Accelerating low bit-width convolutional neural networks with embedded FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN107239829B (en) Method for optimizing artificial neural network
US10691996B2 (en) Hardware accelerator for compressed LSTM
EP4258182A2 (en) Accelerated mathematical engine
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
WO2020073211A1 (en) Operation accelerator, processing method, and related device
KR20180073118A (en) Convolutional neural network processing method and apparatus
CN112200300B (en) Convolutional neural network operation method and device
CN112673383A (en) Data representation of dynamic precision in neural network cores
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN116362312A (en) Neural network acceleration device, method, equipment and computer storage medium
US20210357734A1 (en) Z-first reference neural processing unit for mapping winograd convolution and a method thereof
CN110910434A (en) Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN111652359B (en) Multiplier array for matrix operations and multiplier array for convolution operations
CN115496181A (en) Chip adaptation method, device, chip and medium of deep learning model
CN108184127A (en) A kind of configurable more dimension D CT mapping hardware multiplexing architectures
KR101722215B1 (en) Apparatus and method for discrete cosine transform
CN114897133A (en) Universal configurable Transformer hardware accelerator and implementation method thereof
CN115913245A (en) Data encoding method, data decoding method, and data processing apparatus
KR101527103B1 (en) Device and method for discrete cosine transform
Wang et al. Acceleration and implementation of convolutional neural network based on FPGA
Moon et al. Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and Latency Waste

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200218

RJ01 Rejection of invention patent application after publication