CN108647773B - Hardware interconnection system for a reconfigurable convolutional neural network - Google Patents

Hardware interconnection system for a reconfigurable convolutional neural network

Info

Publication number
CN108647773B
CN108647773B
Authority
CN
China
Prior art keywords
module
neural network
basic
convolutional neural
calculation
Prior art date
Legal status
Active
Application number
CN201810358443.6A
Other languages
Chinese (zh)
Other versions
CN108647773A (en)
Inventor
曹伟
王伶俐
谢亮
罗成
范锡添
周学功
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2021-07-23
Application filed by Fudan University
Priority to CN201810358443.6A
Publication of CN108647773A
Application granted
Publication of CN108647773B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of hardware design for image processing algorithms, and specifically relates to a hardware interconnection architecture for a reconfigurable convolutional neural network. The interconnection architecture of the invention comprises: a data and parameter off-chip cache module, which caches the input pixel data of the picture to be processed and the input parameters used during convolutional neural network computation; a basic computing unit array module, which implements the core computation of the convolutional neural network; and an arithmetic logic unit computing module, which processes the results of the basic computing unit array to implement the downsampling layer, the activation function, and partial-sum accumulation. The basic computing units are interconnected as a two-dimensional array: along the row direction they share input data and compute in parallel with different parameter data; along the column direction, results are passed down row by row and serve as inputs to the next row. Through this structured interconnection, the invention improves data reuse while reducing bandwidth requirements.

Description

Hardware interconnection system for a reconfigurable convolutional neural network
Technical Field
The invention belongs to the technical field of hardware design of image processing algorithms, and particularly relates to a hardware interconnection system of a reconfigurable convolutional neural network.
Background
With the rise of artificial intelligence, deep learning is being applied ever more widely to computer vision, speech recognition, and other big-data applications. Convolutional neural networks, an important algorithm model in deep learning, are widely used in image classification, face recognition, video detection, speech recognition, and similar tasks.
As image-recognition accuracy improves, convolutional neural networks become ever more complex and their computational requirements grow. Conventional general-purpose processors, which carry large amounts of resources redundant for this workload, are therefore poorly suited to large convolutional neural networks and cannot meet practical needs in many scenarios. Accelerating convolutional neural networks with dedicated hardware platforms has thus gradually become the mainstream engineering approach; implementing the algorithm with a parallel, pipelined hardware architecture yields a good speedup and enables real-time processing.
Although the performance of hardware platforms is far higher than that of traditional general-purpose processors, the network structures of convolutional neural networks are increasingly complex and must handle convolution kernels of various sizes. Because hardware structures offer little flexibility, a hardware processor can improve the acceleration of only certain specific networks and cannot accelerate other networks efficiently. The demand for a reconfigurable convolutional neural network hardware architecture applicable to convolution kernels of any size and to various network structures is therefore becoming ever more urgent.
For reconfigurable convolutional neural network hardware, current research faces two main difficulties: how to configure convolution kernels of different sizes on a fixed hardware structure, and how to keep that structure highly efficient for any convolution kernel. In the prior art, a hardware structure implementing the most common 3×3 convolution template is designed as the basic unit; data is then passed through the interconnections between basic units, and several basic units together implement other kernel types (for example, 3 basic units are needed to implement one 5×5 convolution template). Under this scheme, the number of basic units in the hardware is fixed while the number needed to realize one convolution kernel varies, so some basic units sit idle when convolutions of certain sizes are realized, and one-hundred-percent utilization cannot be guaranteed. In addition, not every basic unit produces valid output under this scheme: all outputs must first be buffered and the valid data selected afterwards, wasting a large amount of extra storage resources.
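To make the utilization penalty concrete, the following sketch (ours for illustration, not taken from the patent; Python is used throughout these examples) counts how many fixed 3×3 basic units a K×K template occupies under the prior-art scheme and what fraction of their multipliers do useful work:

```python
# Illustrative model of the prior-art mapping: a KxK convolution template
# tiled onto ceil(K*K / 9) fixed 3x3 basic units, some multipliers idle.
import math

def prior_art_utilization(k: int, unit_size: int = 3) -> float:
    """Fraction of multipliers doing useful work for a KxK kernel."""
    multiplies_needed = k * k
    units_needed = math.ceil(multiplies_needed / unit_size ** 2)
    return multiplies_needed / (units_needed * unit_size ** 2)

for k in (1, 3, 5, 7):
    print(f"{k}x{k} kernel: {prior_art_utilization(k):.0%} utilization")
# 1x1 -> 11%, 3x3 -> 100%, 5x5 -> 93% (25 of 27), 7x7 -> 91% (49 of 54)
```

Only kernel sizes that are multiples of the basic unit reach full utilization, which is precisely the inefficiency the present invention sets out to remove.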
Many existing image-processing hardware accelerators design an optimal hardware structure for one algorithm and then support other algorithms through interconnection. Because algorithms are diverse while the hardware scale is fixed, resource utilization and computing performance drop whenever the network model or the computation type changes. Moreover, in the field of image processing, and in neural network algorithms in particular, bandwidth is, alongside compute-unit utilization, a key factor limiting computing performance.
What is needed is a way to increase data reuse through structured interconnection while reducing bandwidth requirements.
Disclosure of Invention
The invention aims to provide a hardware interconnection system for a reconfigurable convolutional neural network that improves data reuse through structured interconnection and reduces bandwidth requirements.
The invention provides a hardware interconnection system for a reconfigurable convolutional neural network, applied to the field of image processing, the system comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module and used to implement the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing unit array are passed down row by row and serve as inputs to the next row;
an arithmetic logic unit computing module, connected to the basic computing unit array module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
and a control module, connected to the basic computing unit array module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters.
Optionally, when the arithmetic logic unit computing module and the basic computing unit array module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
Optionally, the basic computing unit array module is built by interconnection, and the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
Optionally, the hardware interconnection system is implemented on either a field-programmable gate array or an integrated circuit.
According to the specific embodiments provided herein, the invention achieves the following technical effects. A new mapping scheme maps the convolutional neural network so that the degrees of freedom of different network models are shifted from the spatial level to the temporal level, allowing a fixed hardware interconnection system to maintain one-hundred-percent resource utilization for any convolutional neural network. Within the basic computing unit module, the digital signal processing (DSP) units are fully utilized, and a flexible interconnection structure supports different computing modes and variable data bit widths. Pipelined parallel computation between rows and input-data reuse across columns reduce the input-data and output bandwidth requirements.
Drawings
Fig. 1 is a schematic structural diagram of a hardware interconnection system of a convolutional neural network provided in the present invention.
Fig. 2 is a schematic diagram of a specific mapping of the mapping scheme provided by the present invention.
Fig. 3 is a schematic diagram of an internal structure of the basic computing module provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
The invention aims to provide a hardware interconnection system for a convolutional neural network that improves data reuse through structured interconnection and reduces bandwidth requirements.
A hardware interconnection system for a reconfigurable convolutional neural network, used in the field of image processing, the interconnection system comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module and used to implement the core image-processing computation, the convolutional neural network; the basic computing modules are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing modules are passed down row by row and serve as inputs to the next row;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
an arithmetic logic unit computing module, connected to the basic computing module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
and a control module, connected to the basic computing module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters; with it, efficient and flexible computing units and interconnection structures can be configured for convolutional neural networks of different network structures, as well as for the distance, vector, and digital-signal-processing computations found in image-processing methods.
Optionally, when the arithmetic logic unit computing module and the basic computing module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
Optionally, the basic computing module is built by interconnection, and the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
Optionally, the system is implemented on either a field-programmable gate array or an integrated circuit.
As shown in Fig. 1, the data and parameter off-chip cache module 1 is configured as follows: externally input image data and parameters are first buffered in this module until read out by the basic computing unit array module, and computed results are likewise cached here until called again by the arithmetic logic unit computing module or output. The data and parameter off-chip cache module acts as a large first-in first-out (FIFO) buffer for absorbing the large mismatch between the processing speed of the basic computing unit array module and the external input speed; without it, sending external data directly to the internal basic computing unit array would require relatively complex handshake signals to guarantee that no data is dropped or retransmitted when the two speeds do not match.
The basic computing unit array module 2, connected to the data and parameter off-chip cache module 1, implements the core computation of the convolutional neural network; the interconnection between the basic computing units is shown in Fig. 1. The interconnection system forms a multi-stage pipeline in which every stage performs the same function: multiplying an input feature map by a parameter and adding the partial sum produced by the previous stage. When the number of input feature maps exceeds the number of rows of the basic computing unit array, the partial results are sent from the off-chip cache module back to the arithmetic logic unit computing module for continued accumulation into the final output feature map. One cycle after the first feature map begins computing in the first pipeline stage, the second feature map begins computing in the second stage, so the first stage delivers its result exactly when the second stage needs it for the addition; when the pipeline is completely filled, high parallelism and high processing performance are achieved. For the pipeline computation to be efficient, the number of rows/columns of the basic computing unit array should, as far as possible, be a factor of the number of input/output feature maps.
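As a functional model of this dataflow, the following minimal sketch (our own, with illustrative names; cycle-level timing is ignored and each stage is modeled as a whole-feature-map operation) shows one pass of the row pipeline and the role of the arithmetic logic unit module when the input maps outnumber the rows:

```python
# Each pipeline row multiplies its input feature map by its parameter and
# adds the partial sum handed down from the row above; only the last row's
# result leaves the array, so intermediate rows need no output buffers.
import numpy as np

def pe_array_pass(input_fmaps: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """input_fmaps: (rows, H, W), one feature map per pipeline stage.
    weights: (rows,), one 1x1 parameter per stage."""
    partial = np.zeros_like(input_fmaps[0])
    for r in range(input_fmaps.shape[0]):   # one pipeline stage per row
        partial = partial + input_fmaps[r] * weights[r]
    return partial                           # single output per column

# More input maps than rows: the ALU module accumulates successive passes,
# reading the earlier partial sum back from the off-chip cache.
maps = np.random.rand(8, 16, 16)
w = np.random.rand(8)
final = pe_array_pass(maps[:4], w[:4]) + pe_array_pass(maps[4:], w[4:])
```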
The arithmetic logic unit computing module 3, connected to the basic computing unit array module 2, implements the partial-sum accumulation of convolution, the downsampling layer, and the activation-function layer. It is configured specifically as follows: in a convolutional layer, after the results obtained from the last row of the basic computing unit array are stored in the off-chip cache module, any results that are still partial sums are read back from the off-chip cache module, sent to the arithmetic logic unit computing module for further accumulation with the last-row results of the next group of the basic computing unit array, and stored again in the off-chip cache module, thereby accumulating the results of several groups of input feature maps. In the downsampling layer, the arithmetic logic unit computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, according to whether max-value or average-value downsampling is required, selects a comparator or an adder as the downsampling function. In the activation-function layer, the ReLU function in this example, the arithmetic logic unit computing module implements the activation function using a data selector (Mux).
Besides computing convolutional neural networks, the hardware interconnection system of the reconfigurable convolutional neural network also effectively supports distance computation, linear-algebra computation, and image-signal-processing computation.
The control module 4, connected to the modules above, generates the control signals for the remaining modules. It is configured specifically as follows: it receives and decodes external instructions, generates the enable signals and addresses with which the off-chip cache module reads and writes external data and the basic computing unit array reads and writes the off-chip cache module, and generates the data- and function-gating signals inside the basic computing unit array module and the arithmetic logic unit computing module.
The basic computing unit array 2 described above is organized as a multi-stage pipeline. During mapping, the image data occupying the same position of the convolution template within the sliding convolution window (i.e., multiplied by the same parameter) are extracted to form a new input feature map; in other words, the convolution template is split into a number of 1×1 convolution templates, and the results computed by the split templates are accumulated to recover the original convolution result. Taking the 5×5 convolution template as an example, the mapping is shown in Fig. 2. In this way, a convolution template of any size can be divided into 1×1 templates, so the structure stays fully efficient for a convolution kernel of any size. Furthermore, the inter-row pipeline structure uses the DSP function efficiently while reducing the number of output ports from the number of computing units to the number of array columns; the results of all computing units except the last row therefore need no extra buffer space, lowering the demand for on-chip storage resources.
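The equivalence behind this mapping can be checked numerically. The sketch below (our NumPy model; the function names are illustrative, and stride 1 with no padding is assumed) splits a K×K template into K×K single-parameter products of regrouped feature maps, as in Fig. 2, and verifies that accumulating them recovers the direct convolution:

```python
# A KxK convolution rewritten as K*K 1x1 convolutions: for each kernel
# position (i, j), the pixels that multiply w[i, j] form one new input map.
import numpy as np

def conv_direct(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = w.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

def conv_as_1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = w.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for i in range(K):
        for j in range(K):
            # The regrouped map needs only the single parameter w[i, j].
            out += x[i:i + H, j:j + W] * w[i, j]
    return out

x, w = np.random.rand(8, 8), np.random.rand(5, 5)  # any kernel size works
assert np.allclose(conv_direct(x, w), conv_as_1x1(x, w))
```

Because every regrouped map is multiplied by exactly one weight, a partitioned input feature map needs only a single convolution-kernel parameter, as claim 2 states.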
The arithmetic logic unit computing module 3 described above links the downsampling layer, the activation-function layer, and the convolutional layer. Because the downsampling and activation-function layers need no parameters and involve only simple operations, the system does not map these two layers onto the basic computing unit array; instead, to reduce the external data-bandwidth requirement and improve computing efficiency, the comparison, addition, or data-selection operation is completed in the arithmetic logic unit computing module before the convolutional-layer result is output. Concretely, for max downsampling the arithmetic logic unit computing module is configured as a comparator; for average downsampling, the denominator is divided out of the preceding convolutional layer's weights in advance, so the module need only be configured as an adder; and for the activation function ReLU the module is configured as a data selector (Mux).
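A hedged functional sketch of these three configurations (the mode names and the function itself are ours; the hardware selects among a comparator, an adder, and a data selector rather than branching in software):

```python
# Model of the ALU module's three post-convolution configurations.
import numpy as np

def alu(mode: str, a, b=None):
    if mode == "max":    # comparator: max-value downsampling
        return np.maximum(a, b)
    if mode == "avg":    # adder only: the 1/N denominator has already been
        return a + b     # folded into the previous conv layer's weights
    if mode == "relu":   # data selector (Mux): forward the value or zero
        return np.where(a > 0, a, np.zeros_like(a))
    raise ValueError(f"unknown mode: {mode}")
```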
In the control module 4 described above, the control signals generated for the basic computing unit array are transmitted stage by stage alongside the data, matching the computation timing of each pipeline stage. Because the same control signal is delayed over multiple stages, the number of computing units each copy must drive falls from the total number of computing units to the number of columns, reducing the fan-out of the control signal and allowing a higher chip operating frequency.
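The timing and fan-out argument can be modeled as a shift register on the control path. In this sketch (our assumption: one register per pipeline row, so each registered copy of a control bit drives only the computing units of one row), row r sees every control bit r cycles after row 0, in step with its data:

```python
# Control bits re-registered once per row: matched timing, and each register
# drives one row of computing units instead of the whole array.
from collections import deque

def delayed_controls(ctrl_stream, rows):
    """Yield, per cycle, the control bit currently seen by each row."""
    regs = deque([0] * rows, maxlen=rows)  # one register per pipeline stage
    for c in ctrl_stream:
        regs.appendleft(c)                 # shift the control down the rows
        yield list(regs)                   # regs[r] = control bit at row r

for view in delayed_controls([1, 0, 1, 1], rows=3):
    print(view)  # [1,0,0] [0,1,0] [1,0,1] [1,1,0]: row r lags by r cycles
```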
The invention realizes a reconfigurable convolutional neural network hardware system with a new interconnection structure and a new mapping scheme. Resource utilization reaches 100% for convolution kernels of any size, yielding high performance at low hardware resource cost.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A hardware interconnection system for a reconfigurable convolutional neural network, applied to the field of image processing, characterized by comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module; the basic computing unit array module is used to implement the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing unit array are passed down row by row and serve as inputs to the next row;
an arithmetic logic unit computing module, connected to the basic computing unit array module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
and a control module, connected to the basic computing unit array module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters.
2. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein, when the arithmetic logic unit computing module and the basic computing unit array module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
3. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the interior of the basic computing unit array module is interconnected such that the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
4. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the hardware interconnection system is implemented on either a field-programmable gate array or an integrated circuit.
5. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the basic computing unit array modules are interconnected as a two-dimensional array forming a multi-stage pipeline, every stage of which performs the same function: multiplying the input feature map by the parameters and adding the partial sum to the result obtained from the previous stage; input data is computed from the first pipeline stage onward, stage by stage, until the last stage stores its result in the off-chip cache module; when the number of input feature maps exceeds the number of rows of the basic computing unit array, the results are sent from the off-chip cache module back to the arithmetic logic unit computing module for continued accumulation into the final output feature map; one cycle after the first feature map begins computing in the first pipeline stage, the second feature map begins computing in the second stage, so that the first stage delivers its result exactly when it is needed for the addition in the second stage; when the pipeline is completely filled, high parallelism and high processing performance are achieved.
6. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the arithmetic logic unit computing module is used to implement the partial-sum accumulation of convolution, the downsampling layer, and the activation-function layer, configured specifically as follows: in the convolutional layer, after the results obtained from the last row of the basic computing unit array are stored in the off-chip cache module, any results that are still partial sums are read from the off-chip cache module, sent to the arithmetic logic unit computing module for further accumulation with the last-row results of the next group of the basic computing unit array, and stored again in the off-chip cache module, thereby accumulating the results of several groups of input feature maps; in the downsampling layer, the arithmetic logic unit computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, according to whether max-value or average-value downsampling is required, selects a comparator or an adder as the downsampling function; in the activation-function layer, the arithmetic logic unit computing module implements the activation function using a data selector (Mux).
7. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the control module is used to generate the control signals for the remaining modules, configured specifically as follows: it receives and decodes external instructions, generates the enable signals and addresses with which the off-chip cache module reads and writes external data and the basic computing unit array reads and writes the off-chip cache module, and generates the data- and function-gating signals inside the basic computing unit array module and the arithmetic logic unit computing module.
CN201810358443.6A 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network Active CN108647773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810358443.6A CN108647773B (en) 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810358443.6A CN108647773B (en) 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network

Publications (2)

Publication Number Publication Date
CN108647773A CN108647773A (en) 2018-10-12
CN108647773B true CN108647773B (en) 2021-07-23

Family

ID=63747074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810358443.6A Active CN108647773B (en) Hardware interconnection system for a reconfigurable convolutional neural network

Country Status (1)

Country Link
CN (1) CN108647773B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102020100209A1 (en) * 2019-01-21 2020-07-23 Samsung Electronics Co., Ltd. Neural network device, neural network system and method for processing a neural network model by using a neural network system
CN109902821B (en) * 2019-03-06 2021-03-16 苏州浪潮智能科技有限公司 Data processing method and device and related components
CN110503189B (en) * 2019-08-02 2021-10-08 腾讯科技(深圳)有限公司 Data processing method and device
US11556450B2 (en) * 2019-10-11 2023-01-17 International Business Machines Corporation Hybrid data-model parallelism for efficient deep learning
CN112766453A (en) * 2019-10-21 2021-05-07 华为技术有限公司 Data processing device and data processing method
CN111582451B (en) * 2020-05-08 2022-09-06 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN113971261A (en) * 2020-07-23 2022-01-25 中科亿海微电子科技(苏州)有限公司 Convolution operation device, convolution operation method, electronic device, and medium
WO2022126630A1 (en) * 2020-12-18 2022-06-23 清华大学 Reconfigurable processor and method for computing multiple neural network activation functions thereon
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set
CN113191491B (en) * 2021-03-16 2022-08-09 杭州慧芯达科技有限公司 Multi-dimensional parallel artificial intelligence processor architecture
CN113064852B (en) * 2021-03-24 2022-06-10 珠海一微半导体股份有限公司 Reconfigurable processor and configuration method
CN113240074B (en) * 2021-04-15 2022-12-06 中国科学院自动化研究所 Reconfigurable neural network processor
CN114416182B (en) * 2022-03-31 2022-06-17 深圳致星科技有限公司 FPGA accelerator and chip for federal learning and privacy computation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552732B2 (en) * 2016-08-22 2020-02-04 Kneron Inc. Multi-layer neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (en) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
CN107836001A (en) * 2015-06-29 2018-03-23 微软技术许可有限责任公司 Convolutional neural networks on hardware accelerator
CN107851214A (en) * 2015-07-23 2018-03-27 米雷普里卡技术有限责任公司 For the performance enhancement of two-dimensional array processor
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107832839A (en) * 2017-10-31 2018-03-23 北京地平线信息技术有限公司 The method and apparatus for performing the computing in convolutional neural networks
CN107832699A (en) * 2017-11-02 2018-03-23 北方工业大学 Method and device for testing interest point attention degree based on array lens
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Also Published As

Publication number Publication date
CN108647773A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647773B (en) Hardware interconnection system for a reconfigurable convolutional neural network
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111416743B (en) Convolutional network accelerator, configuration method and computer readable storage medium
WO2020258527A1 (en) Deep neural network hardware accelerator based on power exponent quantisation
CN106022468A (en) Artificial neural network processor integrated circuit and design method therefor
CN107256424B (en) Three-value weight convolution network processing system and method
CN111488983A (en) Lightweight CNN model calculation accelerator based on FPGA
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN114219699B (en) Matching cost processing method and circuit and cost aggregation processing method
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
WO2021243489A1 (en) Data processing method and apparatus for neural network
US20230128421A1 (en) Neural network accelerator
CN112508174B (en) Weight binary neural network-oriented pre-calculation column-by-column convolution calculation unit
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN110163793B (en) Convolution calculation acceleration method and device
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN116882467B (en) Edge-oriented multimode configurable neural network accelerator circuit structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant