WO2019136747A1 - Deconvolution device and artificial intelligence processing device applying the same - Google Patents

Deconvolution device and artificial intelligence processing device applying the same

Info

Publication number
WO2019136747A1
WO2019136747A1 (PCT/CN2018/072659)
Authority
WO
WIPO (PCT)
Prior art keywords
deconvolution
buffer
data
output
parameter
Application number
PCT/CN2018/072659
Other languages
French (fr)
Chinese (zh)
Inventor
肖梦秋 (Xiao Mengqiu)
Original Assignee
深圳鲲云信息科技有限公司 (Shenzhen Kunyun Information Technology Co., Ltd.)
Application filed by 深圳鲲云信息科技有限公司 (Shenzhen Kunyun Information Technology Co., Ltd.)
Priority to PCT/CN2018/072659 (WO2019136747A1)
Priority to CN201880002766.XA (CN110178146B)
Publication of WO2019136747A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Microcomputers (AREA)
  • Complex Calculations (AREA)

Abstract

A deconvolution device (100) and an artificial intelligence processing device applying the same. The deconvolution device is electrically connected to an external memory (200), and the external memory (200) stores data to be processed and weight parameters. The deconvolution device (100) comprises a parameter buffer (110), an input buffer, a deconvolution operation circuit (140), and an output buffer (150). The parameter buffer (110) receives and outputs the weight parameters. The input buffer comprises multiple connected line buffers for receiving and outputting the data to be processed, wherein each time every line buffer outputs one data element, the outputs are assembled into one column of data. The deconvolution operation circuit (140) receives the data to be processed from the input buffer and the weight parameters from the parameter buffer (110), performs the deconvolution operation, and outputs a deconvolution result. The output buffer (150) receives the deconvolution result and outputs it to the external memory (200). The device thereby effectively solves the prior-art problem that software implementations process slowly and place high performance demands on the processor.

Description

Deconvolution device and artificial intelligence processing device applying the same

Technical Field
The present invention relates to the field of processor technologies, in particular to the field of artificial intelligence processors, and specifically to a deconvolution device and an artificial intelligence processing device applying the same.
Background
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a limited receptive field; such networks perform well on large-scale image processing. A CNN comprises convolutional layers and pooling layers.
CNNs have become a research hotspot in many scientific fields, especially pattern classification. Because the network avoids complex image pre-processing and can accept the original image directly as input, it has been widely adopted.
Generally, the basic structure of a CNN comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted; once a local feature has been extracted, its positional relationship to the other features is also fixed. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights. The feature mapping structure uses a sigmoid function with a small influence kernel as the activation function, which makes the feature maps shift-invariant. Furthermore, since the neurons on one mapping plane share weights, the number of free network parameters is reduced. Each convolutional layer is followed by a computing layer for local averaging and secondary extraction; this characteristic two-stage feature extraction structure reduces the feature resolution.
CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided when the network is used; features are learned implicitly from the training data. Moreover, because the neurons on the same feature mapping plane share weights, the network can learn in parallel, a major advantage of convolutional networks over networks in which neurons are fully interconnected. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing: its layout is closer to an actual biological neural network, weight sharing reduces the complexity of the network, and, in particular, the fact that an image with a multi-dimensional input vector can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
At present, deconvolution neural networks are implemented by software running on one processor or on multiple distributed processors. As the complexity of the deconvolution neural network increases, the processing speed becomes correspondingly slower, and the performance demanded of the processor grows higher and higher.
Summary of the Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a deconvolution device and an artificial intelligence processing device applying the same, so as to solve the prior-art problems that deconvolution neural networks implemented in software are slow and place high performance demands on the processor.
To achieve the above and other related objects, the present invention provides a deconvolution device electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters. The deconvolution device includes a parameter buffer, an input buffer, a deconvolution operation circuit, and an output buffer. The parameter buffer is configured to receive and output the weight parameters. The input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one data element, the outputs together form one column of data. The deconvolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform the deconvolution operation accordingly, and output a deconvolution result. The output buffer is configured to receive the deconvolution result and output it to the external memory.
In an embodiment of the invention, the input buffer includes a first line buffer that receives the pixel data of the feature map to be processed element by element, outputs one row of pixel data at a time after filtering, and stores the feature map input to each deconvolution layer.
In an embodiment of the invention, the first line buffer outputs the row pixel data of each deconvolution layer in turn and, while outputting the row pixel data of each deconvolution layer, outputs the row pixel data of each channel in turn.
In an embodiment of the invention, the input buffer further includes at least one second line buffer, configured to fetch the weight parameters of the respective filters from the external memory and feed them into the parameter buffer in turn.
In an embodiment of the invention, the deconvolution operation circuit includes a plurality of deconvolution cores running in parallel, each deconvolution core containing a multiplier for performing the deconvolution operation, and an adder tree that accumulates the outputs of the multipliers. Each deconvolution core takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data element by element through the deconvolution operation.
In an embodiment of the invention, the output buffer includes a plurality of parallel FIFO memories, in which the channel data passing through the same filter are accumulated and stored in the same FIFO memory, and a data selector that returns each partial accumulation to the adder tree until the adder tree outputs the final accumulated result.
In an embodiment of the invention, the deconvolution device further includes a pooling operation circuit connected between the output buffer and the external memory, which pools the deconvolution result before outputting it to the external memory.
In an embodiment of the invention, the internal components of the deconvolution device are connected to each other, and the deconvolution device is connected to the external memory, through first-in first-out (FIFO) data interfaces.
The present invention also provides an artificial intelligence processing device that includes the deconvolution device described above.
As described above, the deconvolution device of the present invention and the artificial intelligence processing device applying it have the following beneficial effects:
The deconvolution device of the invention is built from hardware such as a parameter buffer, an input buffer, a deconvolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces, and can process highly complex deconvolution neural network algorithms at high speed. It therefore effectively solves the prior-art problems of slow processing and high processor performance requirements caused by software implementations.
Brief Description of the Drawings
FIG. 1 is a schematic diagram showing the overall architecture of a deconvolution device.
FIG. 2 is a schematic diagram showing the inputs and outputs of a deconvolution device according to the present invention.
Reference Numerals
100  Deconvolution device
110  Parameter buffer
120  First line buffer
130  Second line buffer
140  Deconvolution operation circuit
150  Output buffer
160  Pooling operation circuit
200  External memory
Detailed Description
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the disclosure of this specification. The present invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that, where they do not conflict, the following embodiments and the features in them may be combined with one another.
It should be noted that, as shown in FIG. 1 and FIG. 2, the illustrations provided in the following embodiments describe the basic concept of the invention only schematically; the drawings show only the components related to the invention rather than the actual number, shapes, and sizes of components in a real implementation. In practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complicated.
The purpose of this embodiment is to provide a deconvolution device and an artificial intelligence processing device applying the same, so as to solve the prior-art problems that deconvolution neural networks implemented in software are slow and place high performance demands on the processor. The principle and implementation of the deconvolution device of this embodiment and the artificial intelligence processing device applying it are described in detail below, so that those skilled in the art can understand them without creative work.
Specifically, as shown in FIG. 1, this embodiment provides a deconvolution device 100 electrically connected to an external memory 200, wherein the external memory 200 stores the data to be processed and the weight parameters. The deconvolution device 100 includes a parameter buffer 110, an input buffer, a deconvolution operation circuit 140, and an output buffer 150.
The first data to be processed contains multiple channels of data. The first weight parameter contains multiple layers of sub-parameters, each layer of sub-parameters corresponding one-to-one to a channel of data. There are multiple deconvolution operation circuits 140, which compute the deconvolution results of the respective channels in parallel, in one-to-one correspondence.
In this embodiment, the parameter buffer 110 (Con_reg in FIG. 2) is used to receive and output the weight parameters (Weight in FIG. 2). The parameter buffer 110 includes a FIFO memory in which the weight parameters are stored. The configured parameters of the input buffer, the deconvolution operation circuit 140, and the output buffer 150 are also stored in the parameter buffer 110.
In this embodiment, the input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one data element, the outputs together form one column of data.
The input buffer comprises a first line buffer 120 (the RAM shown in FIG. 2) and a second line buffer 130 (Coef_reg in FIG. 2). Together, the first line buffer 120 and the second line buffer 130 process a stream of 1×1 pixel inputs into K×K pixel outputs, where K is the size of the deconvolution kernel. The input buffer is described in detail below.
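To make the line-buffer mechanism concrete, the following is a minimal software sketch of how a chain of line buffers can turn a raster pixel stream into K×K windows. It is illustrative only: the class name LineBufferWindowizer and its interface are hypothetical, the sketch assumes unit stride and no padding, and the hardware realization in the patent may differ.

    from collections import deque

    class LineBufferWindowizer:
        # Model of chained line buffers that turn the raster pixel stream
        # of a W-wide feature map into K x K sliding windows (stride 1).
        def __init__(self, K, W):
            self.K, self.W = K, W
            self.rows = deque(maxlen=K)   # the last K complete rows
            self.current = []             # the row currently streaming in

        def push(self, pixel):
            # Feed one pixel; return any K x K windows that become ready.
            self.current.append(pixel)
            windows = []
            if len(self.current) == self.W:        # a row boundary was reached
                self.rows.append(self.current)
                self.current = []
                if len(self.rows) == self.K:       # the buffers are primed
                    band = list(self.rows)
                    windows = [[row[x:x + self.K] for row in band]
                               for x in range(self.W - self.K + 1)]
            return windows

Feeding W×H pixels one per call then yields one band of windows per completed row once K rows have been buffered, mirroring the 1×1-in, K×K-out behavior described above.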
Specifically, in this embodiment, the first line buffer 120 receives the pixel data of the feature map to be processed element by element, outputs one row of pixel data at a time after filtering, and stores the feature map input to each deconvolution layer. The number of data elements in each output row equals the number of parallel filters.
In this embodiment, the first line buffer 120 includes a RAM, and the input pixel data of the feature map of each deconvolution layer is cached in this RAM to improve the locality of pixel storage.
In this embodiment, the first line buffer 120 outputs the row pixel data of each deconvolution layer in turn and, while outputting the row pixel data of a given deconvolution layer, outputs the row pixel data of each channel in turn. That is, the first line buffer 120 first outputs the pixel data of the first channel; after the pixel data of the first channel has been processed, it starts to output the pixel data of the second channel; and once the pixel data of all channels of one deconvolution layer have been output, it moves on to the channels of the next deconvolution layer. The first line buffer 120 iterates in this way, using different filters, from the first deconvolution layer to the last.
In this embodiment, the input buffer further includes at least one second line buffer 130. As shown in FIG. 2, the second line buffer 130 (Coef_reg in FIG. 2) contains a FIFO memory and is used to fetch the weight parameters of the respective filters from the external memory 200 and feed them into the parameter buffer in turn. The second line buffer 130 is connected to the external memory 200 through first-in first-out data interfaces (the SIF blocks shown in FIG. 2). The pixel data output by the second line buffer 130 is pixel data in the form of a k×k matrix.
In this embodiment, the deconvolution operation circuit 140 is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer 110, perform the deconvolution operation accordingly, and output the deconvolution result.
Specifically, in this embodiment, the deconvolution operation circuit 140 includes a plurality of deconvolution cores running in parallel, each containing a multiplier for performing the deconvolution operation, and an adder tree that accumulates the outputs of the multipliers. Each deconvolution core takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data element by element through the deconvolution operation.
That is, the deconvolution operation circuit 140 includes a plurality of multipliers, where the matrix used by a multiplier is the transpose of the matrix used by the corresponding convolver. In each clock cycle, the K×K matrix of pixel data fed to the deconvolution core is multiplied by each column of the multiplier's transposed matrix to obtain one column of outputs, which are stored in the k FIFO memories of the output buffer 150, respectively.
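As a concrete illustration of this transposed-matrix view, the sketch below computes a single-channel deconvolution by scattering each input pixel's K×K weighted contribution into an output accumulator, which is the usual software formulation of transposed convolution. It is a behavioral model under assumed stride s and padding p (the function name and parameters are choices made here, not taken from the patent), not the hardware datapath itself.

    import numpy as np

    def deconv2d_single_channel(x, w, s=1, p=0):
        # Behavioral model of one channel of deconvolution (transposed
        # convolution): each input pixel scatters a K x K weighted patch
        # into the output, the inverse access pattern of convolution.
        H, W = x.shape
        K = w.shape[0]
        H2 = (H - 1) * s - 2 * p + K
        W2 = (W - 1) * s - 2 * p + K
        y = np.zeros((H2 + 2 * p, W2 + 2 * p))     # padded accumulator
        for i in range(H):
            for j in range(W):
                y[i * s:i * s + K, j * s:j * s + K] += x[i, j] * w
        return y[p:p + H2, p:p + W2] if p else y

For example, a 2×2 input with a 3×3 kernel, s = 1 and p = 0 produces a 4×4 output.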
For example, an image has three channels of data, R, G, and B, that is, three two-dimensional matrices. Suppose the first weight parameter, that is, the filter, has a depth of 3 and thus three layers of sub-weight parameters (three two-dimensional matrices), each of size K×K, with K an odd number, say 3, which are deconvolved with the three channels respectively. When a data cube of Pv×k×3 (Pv>K) is taken from the first data to be processed, say with Pv = 5, the filter must pass through the deconvolution operation circuit 140 three times with that data cube before the computation is complete. Preferably, a corresponding number of three deconvolution operation circuits 140 can be provided, so that each can perform, in parallel within one clock cycle, the deconvolution operation of the channel it is responsible for.
In this embodiment, the output buffer 150 is configured to receive the deconvolution result and output it to the external memory 200.
Specifically, the output buffer 150 receives the deconvolution result of each channel and then accumulates the deconvolution results of all channels; the result is temporarily stored in the output buffer 150.
Specifically, in this embodiment, the output buffer 150 includes a plurality of parallel FIFO memories, in which the channel data passing through the same filter are accumulated and stored in the same FIFO memory, and a data selector (MUX) that returns each partial accumulation to the adder tree until the adder tree outputs the final accumulated result.
Each FIFO memory outputs pixel data in the form of a K×W×H matrix, and the output of one filter is stored across K FIFO memories. In addition, the data selector (MUX) also serializes the data stream down to 1×1, outputting one pixel at a time.
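A minimal sketch of this accumulate-across-channels loop follows, with the FIFO bank modeled as a plain array and the MUX feedback as rereading the running sum; the function name and data layout are assumptions made here for illustration.

    import numpy as np

    def accumulate_filter_output(channel_results):
        # Model of the output buffer: the partial sum held in the FIFO
        # bank is fed back (via the MUX) to the adder tree and combined
        # with each new channel's deconvolution result in turn.
        acc = np.zeros_like(channel_results[0])    # FIFO bank contents
        for result in channel_results:             # one per input channel
            acc = acc + result                     # adder tree + MUX feedback
        return acc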
In this embodiment, the deconvolution device 100 further includes a pooling operation circuit 160, connected between the output buffer 150 and the external memory 200, for pooling the deconvolution result before outputting it to the external memory 200.
The pooling operation circuit 160 provides max pooling over every two rows of pixel data, and it also contains a FIFO memory for storing each row of pixel data.
Specifically, the pooling mode may be max pooling or average pooling, and either can be implemented with logic circuits.
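The row-FIFO pooling described above can be modeled as follows: one row is held in the FIFO and combined with the next incoming row, then reduced pairwise along the row. This is a sketch of 2×2 max pooling under that assumption; average pooling would replace max with a mean. The function name is hypothetical.

    def max_pool_2x2(rows):
        # Model of the pooling circuit: a row FIFO holds one row of
        # pixels; when the next row arrives, a vertical then a
        # horizontal 2:1 max reduction produce one pooled output row
        # per pair of input rows.
        fifo = None
        pooled = []
        for row in rows:
            if fifo is None:
                fifo = row                       # buffer the first row of a pair
            else:
                merged = [max(a, b) for a, b in zip(fifo, row)]   # vertical max
                pooled.append([max(merged[i], merged[i + 1])      # horizontal max
                               for i in range(0, len(merged) - 1, 2)])
                fifo = None
        return pooled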
In this embodiment, the internal components of the deconvolution device 100 are connected to each other, and the deconvolution device 100 is connected to the external memory 200, through first-in first-out data interfaces.
Specifically, a first-in first-out data interface includes a first-in first-out memory, a first logic unit, and a second logic unit.
The first-in first-out memory has, on its upstream side, a write-enable pin, a data input pin, and a memory-full flag pin, and, on its downstream side, a read-enable pin, a data output pin, and a memory-empty flag pin.
The first logic unit is connected to the upstream object, the write-enable pin, and the memory-full flag pin. On receiving a write request from the upstream object, it determines from the signal on the memory-full flag pin whether the first-in first-out memory is full; if not, it sends an enable signal to the write-enable pin to make the first-in first-out memory writable; otherwise, it keeps the first-in first-out memory non-writable.
Specifically, the first logic unit includes: a first inverter, whose input is connected to the memory-full flag pin and whose output is led out as a first flag terminal for connecting the upstream object; and a first AND gate, whose first input is connected to the first data-valid flag terminal, whose second input is connected to the upstream data-valid terminal for connecting the upstream object, and whose output is connected to the write-enable pin.
The second logic unit is connected to the downstream object, the read-enable pin, and the memory-empty flag pin. On receiving a read request from the downstream object, it determines from the signal on the memory-empty flag pin whether the first-in first-out memory is empty; if not, it sends an enable signal to the read-enable pin to make the first-in first-out memory readable; otherwise, it keeps the first-in first-out memory unreadable.
Specifically, the second logic unit includes: a second inverter, whose input is connected to the memory-empty flag pin and whose output is led out as the downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input is connected to the downstream data-valid terminal and whose second input is connected to the downstream data-valid flag terminal for connecting the downstream object.
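In combinational terms, each logic unit reduces to an inverter plus an AND gate, that is, a standard FIFO valid/ready handshake. The sketch below models one evaluation of these gates; the signal names are assumptions made here for readability.

    def fifo_interface_cycle(full, empty, upstream_valid, downstream_valid):
        # First logic unit: the inverter on the memory-full flag, ANDed
        # with the upstream data-valid signal, drives the write enable.
        writable = not full                        # first inverter
        write_en = writable and upstream_valid     # first AND gate

        # Second logic unit: the inverter on the memory-empty flag, ANDed
        # with the downstream data-valid flag, drives the read enable.
        readable = not empty                       # second inverter
        read_en = readable and downstream_valid    # second AND gate
        return write_en, read_en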
In this embodiment, the deconvolution device 100 operates as follows:
The data to be processed is read from the external memory 200 through a first-in first-out data interface and stored in the BRAM of the first line buffer 120 (Conv_in_cache in FIG. 2).
The data to be processed consists of the feature map and the convolution parameters. The feature map has size N_C × W1 × H1, and the convolution parameters include the number of filters N_F, the kernel size k×k, the stride s, and the boundary padding p.
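The steps below end with an output feature map of size N_F × W2 × H2 without spelling out W2 and H2; for a transposed convolution with these parameters, the standard relation, stated here as an inference rather than as part of the patent text, is:

    W2 = s × (W1 - 1) + k - 2p
    H2 = s × (H1 - 1) + k - 2p

For instance, W1 = 4, k = 3, s = 2, and p = 0 would give W2 = 9.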
The second line buffer 130 reads the N_F (N_C × k × k) weight parameters (one channel) from the external memory 200 through the first-in first-out data interface (SIF) and then stores them in the parameter buffer 110.
Once the parameter buffer 110 has been loaded with a weight parameter, the device starts receiving the pixel data of the feature map to be processed; through the processing of the first line buffer 120 and the second line buffer 130, the deconvolution operation circuit 140 receives k×k pixel data per clock cycle.
The deconvolution operation circuit 140 performs the deconvolution and accumulation on the input data of each channel (the feature map input to each channel having height H and width W), and then outputs the result of each channel to the output buffer 150.
The different input channels are visited cyclically, and the output buffer 150 accumulates the data results of each channel until a feature map of N_F × W2 × H2 is obtained.
The pooling operation circuit 160 can then receive the N_F × W2 × H2 pixel data, pool it, and output the feature map, or the feature map can be output directly from the output buffer 150.
After the pooling operation circuit 160 or the output buffer 150 outputs the feature map processed by one filter, the parameter buffer 110 is reloaded with a new weight parameter, and the above pixel-processing procedure is repeated iteratively with different filters until the pixel processing of all deconvolution layers is complete.
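Putting the steps above together, the schedule can be restated in software as the sketch below. It reuses the hypothetical helpers sketched earlier (deconv2d_single_channel and max_pool_2x2) and is a behavioral summary of the dataflow, not the hardware implementation.

    import numpy as np

    def run_deconvolution_layer(feature_map, weights, s=1, p=0, pool=False):
        # feature_map: N_C channels of H1 x W1 pixels.
        # weights: N_F filters, each with N_C sub-parameter matrices of k x k.
        # Mirrors the device schedule: for each filter, load its weights,
        # cycle through the input channels, accumulate the per-channel
        # results in the output buffer, then optionally pool and write out.
        outputs = []
        for filt in weights:                        # reload the parameter buffer
            acc = None
            for channel, w in zip(feature_map, filt):       # cycle the channels
                result = deconv2d_single_channel(channel, w, s, p)
                acc = result if acc is None else acc + result   # output buffer
            if pool:
                acc = np.array(max_pool_2x2(acc.tolist()))      # pooling circuit
            outputs.append(acc)                     # write to external memory
        return np.stack(outputs)                    # N_F x H2 x W2 feature map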
This embodiment also provides an artificial intelligence processing device that includes the deconvolution device 100 described above. The deconvolution device 100 has been described in detail above and is not described again here.
The artificial intelligence processor includes a programmable logic circuit (PL) and a processing system circuit (PS). The processing system circuit includes a central processing unit, which may be implemented by an MCU, SoC, FPGA, or DSP, for example an embedded processor chip of the ARM architecture. The central processing unit is communicatively coupled to the external memory 200, which is, for example, RAM or ROM, such as third- or fourth-generation DDR SDRAM; the central processing unit can read data from and write data to the external memory 200.
In summary, the deconvolution device of the present invention is built from hardware such as a parameter buffer, an input buffer, a deconvolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces, and can process highly complex deconvolution neural network algorithms at high speed, effectively solving the prior-art problems of slow processing and high processor performance requirements caused by software implementations. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial value.
The above embodiments merely illustrate the principles of the invention and its effects and are not intended to limit the invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (9)

  1. A deconvolution device electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters, characterized in that the deconvolution device comprises: a parameter buffer, an input buffer, a deconvolution operation circuit, and an output buffer;
    the parameter buffer is configured to receive and output the weight parameters;
    the input buffer comprises a plurality of connected line buffers for receiving and outputting the data to be processed, wherein each time every line buffer outputs one data element, the outputs together form one column of data;
    the deconvolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform the deconvolution operation accordingly, and output a deconvolution result;
    the output buffer is configured to receive the deconvolution result and output it to the external memory.
  2. The deconvolver according to claim 1, wherein the input buffer comprises:
    a first line buffer, which receives the pixel data of the feature map to be processed bit by bit, outputs row pixel data simultaneously after filtering, and stores the input feature map of each deconvolution layer.
  3. The deconvolver according to claim 2, wherein the first line buffer outputs the row pixel data of each deconvolution layer in sequence and, when outputting the row pixel data of each deconvolution layer, outputs the row pixel data of each channel in sequence.
  4. The deconvolver according to claim 2, wherein the input buffer further comprises:
    at least one second line buffer, configured to acquire the weight parameters of each filter from the external memory and input them to the parameter buffer in sequence.
  5. The deconvolver according to claim 1, wherein the deconvolution operation circuit comprises:
    a plurality of deconvolution kernels running in parallel, each comprising multipliers for performing the deconvolution operation;
    an adder tree, which accumulates the output results of the plurality of multipliers;
    wherein each deconvolution kernel receives pixel data in the form of a K×K matrix and, based on the input pixel data and the weight parameters, outputs pixel data bit by bit through the deconvolution operation.
  6. The deconvolver according to claim 5, wherein the output buffer comprises:
    at least two FIFO memories in parallel, wherein the channel data processed by the same filter is accumulated and stored in the same FIFO memory;
    a data selector, configured to return each accumulation result to the adder tree until the adder tree outputs the final accumulation result.
  7. The deconvolver according to claim 1, further comprising:
    a pooling operation circuit, connected between the output buffer and the external memory, configured to pool the deconvolution result and output it to the external memory.
  8. The deconvolver according to claim 1, wherein the internal components of the deconvolver are connected to one another, and the deconvolver is connected to the external memory, through first-in-first-out data interfaces.
  9. An artificial intelligence processing apparatus, comprising the deconvolver according to any one of claims 1 to 8.
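As a minimal sketch of the accumulation scheme recited in claims 5 and 6 above: partial sums for one filter sit in a FIFO, and a data selector feeds each stored partial back to the adder as the next channel arrives, so final pixel values leave the FIFO only after the last channel has been added. The container and function names here are hypothetical, chosen only to make the feedback loop concrete.

    from collections import deque

    def accumulate_channels(channel_streams):
        # channel_streams: one list per input channel, each holding the
        # partial pixel sums this filter produces for that channel, in the
        # same output-position order.
        fifo = deque(channel_streams[0])         # first channel primes the FIFO
        for stream in channel_streams[1:]:       # remaining channels of the filter
            for px in stream:                    # selector returns stored partial
                fifo.append(fifo.popleft() + px) # ...to the adder, then re-stores
        return list(fifo)                        # final sums leave for pooling/DDR

    print(accumulate_channels([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
    # -> [111, 222, 333]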
PCT/CN2018/072659 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same WO2019136747A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same
CN201880002766.XA CN110178146B (en) 2018-01-15 2018-01-15 Deconvolutor and artificial intelligence processing device applied by deconvolutor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same

Publications (1)

Publication Number Publication Date
WO2019136747A1 true WO2019136747A1 (en) 2019-07-18

Family

ID=67218472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same

Country Status (2)

Country Link
CN (1) CN110178146B (en)
WO (1) WO2019136747A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10497089B2 (en) * 2016-01-29 2019-12-03 Fotonation Limited Convolutional neural network
CN106022468B (en) * 2016-05-17 2018-06-01 成都启英泰伦科技有限公司 the design method of artificial neural network processor integrated circuit and the integrated circuit
CN106355244B (en) * 2016-08-30 2019-08-13 深圳市诺比邻科技有限公司 The construction method and system of convolutional neural networks
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107403117A (en) * 2017-07-28 2017-11-28 西安电子科技大学 Three dimensional convolution device based on FPGA

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 The neutral net forward direction arithmetic hardware structure quantified based on power weight
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081770A1 * 2019-09-17 2021-03-18 Gowin Semiconductor Corporation System architecture based on soc fpga for edge artificial intelligence computing
US11544544B2 (en) * 2019-09-17 2023-01-03 Gowin Semiconductor Corporation System architecture based on SoC FPGA for edge artificial intelligence computing

Also Published As

Publication number Publication date
CN110178146B (en) 2023-05-12
CN110178146A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
WO2019136764A1 (en) Convolutor and artificial intelligent processing device applied thereto
US10394929B2 (en) Adaptive execution engine for convolution computing systems
CN110050267B (en) System and method for data management
US11720523B2 (en) Performing concurrent operations in a processing element
CN110366732B (en) Method and apparatus for matrix processing in convolutional neural networks
CN109844738A (en) Arithmetic processing circuit and identifying system
Liu et al. Fg-net: A fast and accurate framework for large-scale lidar point cloud understanding
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN109859178B (en) FPGA-based infrared remote sensing image real-time target detection method
US20190213439A1 (en) Switchable propagation neural network
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN110766127B (en) Neural network computing special circuit and related computing platform and implementation method thereof
EP3093757A2 (en) Multi-dimensional sliding window operation for a vector processor
WO2019136751A1 (en) Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN108717571A (en) A kind of acceleration method and device for artificial intelligence
WO2023123919A1 (en) Data processing circuit, data processing method, and related product
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
US20220012569A1 (en) Computer-implemented data processing method, micro-controller system and computer program product
WO2019136747A1 (en) Deconvolver and an artificial intelligence processing device applied by same
CN112016522B (en) Video data processing method, system and related components
CN114359662A (en) Implementation method of convolutional neural network based on heterogeneous FPGA and fusion multiresolution
WO2019136761A1 (en) Three-dimensional convolution device for recognizing human action
Dawwd The multi 2D systolic design and implementation of Convolutional Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18900066

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18900066

Country of ref document: EP

Kind code of ref document: A1