WO2019136747A1 - Deconvolution device and artificial intelligence processing device applying the same - Google Patents

Deconvolution device and artificial intelligence processing device applying the same

Info

Publication number
WO2019136747A1
WO2019136747A1 (PCT/CN2018/072659)
Authority
WO
WIPO (PCT)
Prior art keywords
deconvolution
buffer
data
output
parameter
Application number
PCT/CN2018/072659
Other languages
French (fr)
Chinese (zh)
Inventor
肖梦秋 (Xiao Mengqiu)
Original Assignee
深圳鲲云信息科技有限公司 (Shenzhen Kunyun Information Technology Co., Ltd.)
Application filed by 深圳鲲云信息科技有限公司 (Shenzhen Kunyun Information Technology Co., Ltd.)
Priority to PCT/CN2018/072659 (WO2019136747A1)
Priority to CN201880002766.XA (CN110178146B)
Publication of WO2019136747A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Microcomputers (AREA)
  • Complex Calculations (AREA)

Abstract

A deconvolution device (100) and an artificial intelligence processing device applying the same. The deconvolution device is electrically connected to an external memory (200), and the external memory (200) stores data to be processed and weight parameters. The deconvolution device (100) comprises a parameter buffer (110), an input buffer, a deconvolution operation circuit (140), and an output buffer (150). The parameter buffer (110) receives and outputs the weight parameters. The input buffer comprises multiple connected line buffers for receiving and outputting the data to be processed, wherein each time every line buffer outputs one data element, the outputs are assembled into one column of data. The deconvolution operation circuit (140) receives the data to be processed from the input buffer and the weight parameters from the parameter buffer (110), performs the deconvolution operation, and outputs a deconvolution result. The output buffer (150) receives the deconvolution result and outputs it to the external memory (200). The device thereby effectively solves the prior-art problem that software implementations process slowly and place high performance demands on the processor.

Description

Deconvolution device and artificial intelligence processing device applying the same

Technical Field
The present invention relates to the field of processor technologies, in particular to the field of artificial intelligence processors, and specifically to a deconvolution device and an artificial intelligence processing device applying the same.
Background
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a limited receptive field; such networks perform well on large-scale image processing. A CNN comprises convolutional layers and pooling layers.
CNNs have become a research hotspot in many scientific fields, especially pattern classification. Because the network avoids complex image pre-processing and can accept the original image directly as input, it has been widely adopted.
Generally, the basic structure of a CNN comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted; once a local feature has been extracted, its positional relationship to the other features is also fixed. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights. The feature mapping structure uses a sigmoid function with a small influence kernel as the activation function, which makes the feature maps shift-invariant. Furthermore, since the neurons on one mapping plane share weights, the number of free network parameters is reduced. Each convolutional layer is followed by a computing layer for local averaging and secondary extraction; this characteristic two-stage feature extraction structure reduces the feature resolution.
CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided when the network is used; features are learned implicitly from the training data. Moreover, because the neurons on the same feature mapping plane share weights, the network can learn in parallel, a major advantage of convolutional networks over networks in which neurons are fully interconnected. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing: its layout is closer to an actual biological neural network, weight sharing reduces the complexity of the network, and, in particular, the fact that an image with a multi-dimensional input vector can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
At present, deconvolution neural networks are implemented by software running on one processor or on multiple distributed processors. As the complexity of the deconvolution neural network increases, the processing speed becomes correspondingly slower, and the performance demanded of the processor grows higher and higher.
Summary of the Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a deconvolution device and an artificial intelligence processing device applying the same, so as to solve the prior-art problems that deconvolution neural networks implemented in software are slow and place high performance demands on the processor.
To achieve the above and other related objects, the present invention provides a deconvolution device electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters. The deconvolution device includes a parameter buffer, an input buffer, a deconvolution operation circuit, and an output buffer. The parameter buffer is configured to receive and output the weight parameters. The input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one data element, the outputs together form one column of data. The deconvolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform the deconvolution operation accordingly, and output a deconvolution result. The output buffer is configured to receive the deconvolution result and output it to the external memory.
In an embodiment of the invention, the input buffer includes a first line buffer that receives the pixel data of the feature map to be processed element by element, outputs one row of pixel data at a time after filtering, and stores the feature map input to each deconvolution layer.
In an embodiment of the invention, the first line buffer outputs the row pixel data of each deconvolution layer in turn and, while outputting the row pixel data of each deconvolution layer, outputs the row pixel data of each channel in turn.
In an embodiment of the invention, the input buffer further includes at least one second line buffer, configured to fetch the weight parameters of the respective filters from the external memory and feed them into the parameter buffer in turn.
In an embodiment of the invention, the deconvolution operation circuit includes a plurality of deconvolution cores running in parallel, each deconvolution core containing a multiplier for performing the deconvolution operation, and an adder tree that accumulates the outputs of the multipliers. Each deconvolution core takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data element by element through the deconvolution operation.
In an embodiment of the invention, the output buffer includes a plurality of parallel FIFO memories, in which the channel data passing through the same filter are accumulated and stored in the same FIFO memory, and a data selector that returns each partial accumulation to the adder tree until the adder tree outputs the final accumulated result.
In an embodiment of the invention, the deconvolution device further includes a pooling operation circuit connected between the output buffer and the external memory, which pools the deconvolution result before outputting it to the external memory.
In an embodiment of the invention, the internal components of the deconvolution device are connected to each other, and the deconvolution device is connected to the external memory, through first-in first-out (FIFO) data interfaces.
The present invention also provides an artificial intelligence processing device that includes the deconvolution device described above.
As described above, the deconvolution device of the present invention and the artificial intelligence processing device applying it have the following beneficial effects:
The deconvolution device of the invention is built from hardware such as a parameter buffer, an input buffer, a deconvolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces, and can process highly complex deconvolution neural network algorithms at high speed. It therefore effectively solves the prior-art problems of slow processing and high processor performance requirements caused by software implementations.
Brief Description of the Drawings
FIG. 1 is a schematic diagram showing the overall architecture of a deconvolution device.
FIG. 2 is a schematic diagram showing the inputs and outputs of a deconvolution device according to the present invention.
Reference Numerals
100  Deconvolution device
110  Parameter buffer
120  First line buffer
130  Second line buffer
140  Deconvolution operation circuit
150  Output buffer
160  Pooling operation circuit
200  External memory
Detailed Description
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the disclosure of this specification. The present invention may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that, where they do not conflict, the following embodiments and the features in them may be combined with one another.
It should be noted that, as shown in FIG. 1 and FIG. 2, the illustrations provided in the following embodiments describe the basic concept of the invention only schematically; the drawings show only the components related to the invention rather than the actual number, shapes, and sizes of components in a real implementation. In practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complicated.
The purpose of this embodiment is to provide a deconvolution device and an artificial intelligence processing device applying the same, so as to solve the prior-art problems that deconvolution neural networks implemented in software are slow and place high performance demands on the processor. The principle and implementation of the deconvolution device of this embodiment and the artificial intelligence processing device applying it are described in detail below, so that those skilled in the art can understand them without creative work.
Specifically, as shown in FIG. 1, this embodiment provides a deconvolution device 100 electrically connected to an external memory 200, wherein the external memory 200 stores the data to be processed and the weight parameters. The deconvolution device 100 includes a parameter buffer 110, an input buffer, a deconvolution operation circuit 140, and an output buffer 150.
The first data to be processed contains multiple channels of data. The first weight parameter contains multiple layers of sub-parameters, each layer of sub-parameters corresponding one-to-one to a channel of data. There are multiple deconvolution operation circuits 140, which compute the deconvolution results of the respective channels in parallel, in one-to-one correspondence.
In this embodiment, the parameter buffer 110 (Con_reg in FIG. 2) is used to receive and output the weight parameters (Weight in FIG. 2). The parameter buffer 110 includes a FIFO memory in which the weight parameters are stored. The configured parameters of the input buffer, the deconvolution operation circuit 140, and the output buffer 150 are also stored in the parameter buffer 110.
In this embodiment, the input buffer includes a plurality of connected line buffers for receiving and outputting the data to be processed; each time every line buffer outputs one data element, the outputs together form one column of data.
The input buffer comprises a first line buffer 120 (the RAM shown in FIG. 2) and a second line buffer 130 (Coef_reg in FIG. 2). Together, the first line buffer 120 and the second line buffer 130 process a stream of 1×1 pixel inputs into K×K pixel outputs, where K is the size of the deconvolution kernel. The input buffer is described in detail below.
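To make the line-buffer mechanism concrete, the following is a minimal software sketch of how a chain of line buffers can turn a raster pixel stream into K×K windows. It is illustrative only: the class name LineBufferWindowizer and its interface are hypothetical, the sketch assumes unit stride and no padding, and the hardware realization in the patent may differ.

    from collections import deque

    class LineBufferWindowizer:
        # Model of chained line buffers that turn the raster pixel stream
        # of a W-wide feature map into K x K sliding windows (stride 1).
        def __init__(self, K, W):
            self.K, self.W = K, W
            self.rows = deque(maxlen=K)   # the last K complete rows
            self.current = []             # the row currently streaming in

        def push(self, pixel):
            # Feed one pixel; return any K x K windows that become ready.
            self.current.append(pixel)
            windows = []
            if len(self.current) == self.W:        # a row boundary was reached
                self.rows.append(self.current)
                self.current = []
                if len(self.rows) == self.K:       # the buffers are primed
                    band = list(self.rows)
                    windows = [[row[x:x + self.K] for row in band]
                               for x in range(self.W - self.K + 1)]
            return windows

Feeding W×H pixels one per call then yields one band of windows per completed row once K rows have been buffered, mirroring the 1×1-in, K×K-out behavior described above.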
Specifically, in this embodiment, the first line buffer 120 receives the pixel data of the feature map to be processed element by element, outputs one row of pixel data at a time after filtering, and stores the feature map input to each deconvolution layer. The number of data elements in each output row equals the number of parallel filters.
In this embodiment, the first line buffer 120 includes a RAM, and the input pixel data of the feature map of each deconvolution layer is cached in this RAM to improve the locality of pixel storage.
In this embodiment, the first line buffer 120 outputs the row pixel data of each deconvolution layer in turn and, while outputting the row pixel data of a given deconvolution layer, outputs the row pixel data of each channel in turn. That is, the first line buffer 120 first outputs the pixel data of the first channel; after the pixel data of the first channel has been processed, it starts to output the pixel data of the second channel; and once the pixel data of all channels of one deconvolution layer have been output, it moves on to the channels of the next deconvolution layer. The first line buffer 120 iterates in this way, using different filters, from the first deconvolution layer to the last.
In this embodiment, the input buffer further includes at least one second line buffer 130. As shown in FIG. 2, the second line buffer 130 (Coef_reg in FIG. 2) contains a FIFO memory and is used to fetch the weight parameters of the respective filters from the external memory 200 and feed them into the parameter buffer in turn. The second line buffer 130 is connected to the external memory 200 through first-in first-out data interfaces (the SIF blocks shown in FIG. 2). The pixel data output by the second line buffer 130 is pixel data in the form of a k×k matrix.
In this embodiment, the deconvolution operation circuit 140 is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer 110, perform the deconvolution operation accordingly, and output the deconvolution result.
Specifically, in this embodiment, the deconvolution operation circuit 140 includes a plurality of deconvolution cores running in parallel, each containing a multiplier for performing the deconvolution operation, and an adder tree that accumulates the outputs of the multipliers. Each deconvolution core takes pixel data in the form of a K×K matrix as input and, from the input pixel data and the weight parameters, outputs pixel data element by element through the deconvolution operation.
That is, the deconvolution operation circuit 140 includes a plurality of multipliers, where the matrix used by a multiplier is the transpose of the matrix used by the corresponding convolver. In each clock cycle, the K×K matrix of pixel data fed to the deconvolution core is multiplied by each column of the multiplier's transposed matrix to obtain one column of outputs, which are stored in the k FIFO memories of the output buffer 150, respectively.
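As a concrete illustration of this transposed-matrix view, the sketch below computes a single-channel deconvolution by scattering each input pixel's K×K weighted contribution into an output accumulator, which is the usual software formulation of transposed convolution. It is a behavioral model under assumed stride s and padding p (the function name and parameters are choices made here, not taken from the patent), not the hardware datapath itself.

    import numpy as np

    def deconv2d_single_channel(x, w, s=1, p=0):
        # Behavioral model of one channel of deconvolution (transposed
        # convolution): each input pixel scatters a K x K weighted patch
        # into the output, the inverse access pattern of convolution.
        H, W = x.shape
        K = w.shape[0]
        H2 = (H - 1) * s - 2 * p + K
        W2 = (W - 1) * s - 2 * p + K
        y = np.zeros((H2 + 2 * p, W2 + 2 * p))     # padded accumulator
        for i in range(H):
            for j in range(W):
                y[i * s:i * s + K, j * s:j * s + K] += x[i, j] * w
        return y[p:p + H2, p:p + W2] if p else y

For example, a 2×2 input with a 3×3 kernel, s = 1 and p = 0 produces a 4×4 output.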
For example, an image has three channels of data, R, G, and B, that is, three two-dimensional matrices. Suppose the first weight parameter, that is, the filter, has a depth of 3 and thus three layers of sub-weight parameters (three two-dimensional matrices), each of size K×K, with K an odd number, say 3, which are deconvolved with the three channels respectively. When a data cube of Pv×k×3 (Pv>K) is taken from the first data to be processed, say with Pv = 5, the filter must pass through the deconvolution operation circuit 140 three times with that data cube before the computation is complete. Preferably, a corresponding number of three deconvolution operation circuits 140 can be provided, so that each can perform, in parallel within one clock cycle, the deconvolution operation of the channel it is responsible for.
In this embodiment, the output buffer 150 is configured to receive the deconvolution result and output it to the external memory 200.
Specifically, the output buffer 150 receives the deconvolution result of each channel and then accumulates the deconvolution results of all channels; the result is temporarily stored in the output buffer 150.
Specifically, in this embodiment, the output buffer 150 includes a plurality of parallel FIFO memories, in which the channel data passing through the same filter are accumulated and stored in the same FIFO memory, and a data selector (MUX) that returns each partial accumulation to the adder tree until the adder tree outputs the final accumulated result.
Each FIFO memory outputs pixel data in the form of a K×W×H matrix, and the output of one filter is stored across K FIFO memories. In addition, the data selector (MUX) also serializes the data stream down to 1×1, outputting one pixel at a time.
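A minimal sketch of this accumulate-across-channels loop follows, with the FIFO bank modeled as a plain array and the MUX feedback as rereading the running sum; the function name and data layout are assumptions made here for illustration.

    import numpy as np

    def accumulate_filter_output(channel_results):
        # Model of the output buffer: the partial sum held in the FIFO
        # bank is fed back (via the MUX) to the adder tree and combined
        # with each new channel's deconvolution result in turn.
        acc = np.zeros_like(channel_results[0])    # FIFO bank contents
        for result in channel_results:             # one per input channel
            acc = acc + result                     # adder tree + MUX feedback
        return acc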
In this embodiment, the deconvolution device 100 further includes a pooling operation circuit 160, connected between the output buffer 150 and the external memory 200, for pooling the deconvolution result before outputting it to the external memory 200.
The pooling operation circuit 160 provides max pooling over every two rows of pixel data, and it also contains a FIFO memory for storing each row of pixel data.
Specifically, the pooling mode may be max pooling or average pooling, and either can be implemented with logic circuits.
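The row-FIFO pooling described above can be modeled as follows: one row is held in the FIFO and combined with the next incoming row, then reduced pairwise along the row. This is a sketch of 2×2 max pooling under that assumption; average pooling would replace max with a mean. The function name is hypothetical.

    def max_pool_2x2(rows):
        # Model of the pooling circuit: a row FIFO holds one row of
        # pixels; when the next row arrives, a vertical then a
        # horizontal 2:1 max reduction produce one pooled output row
        # per pair of input rows.
        fifo = None
        pooled = []
        for row in rows:
            if fifo is None:
                fifo = row                       # buffer the first row of a pair
            else:
                merged = [max(a, b) for a, b in zip(fifo, row)]   # vertical max
                pooled.append([max(merged[i], merged[i + 1])      # horizontal max
                               for i in range(0, len(merged) - 1, 2)])
                fifo = None
        return pooled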
In this embodiment, the internal components of the deconvolution device 100 are connected to each other, and the deconvolution device 100 is connected to the external memory 200, through first-in first-out data interfaces.
Specifically, a first-in first-out data interface includes a first-in first-out memory, a first logic unit, and a second logic unit.
The first-in first-out memory has, on its upstream side, a write-enable pin, a data input pin, and a memory-full flag pin, and, on its downstream side, a read-enable pin, a data output pin, and a memory-empty flag pin.
The first logic unit is connected to the upstream object, the write-enable pin, and the memory-full flag pin. On receiving a write request from the upstream object, it determines from the signal on the memory-full flag pin whether the first-in first-out memory is full; if not, it sends an enable signal to the write-enable pin to make the first-in first-out memory writable; otherwise, it keeps the first-in first-out memory non-writable.
Specifically, the first logic unit includes: a first inverter, whose input is connected to the memory-full flag pin and whose output is led out as a first flag terminal for connecting the upstream object; and a first AND gate, whose first input is connected to the first data-valid flag terminal, whose second input is connected to the upstream data-valid terminal for connecting the upstream object, and whose output is connected to the write-enable pin.
The second logic unit is connected to the downstream object, the read-enable pin, and the memory-empty flag pin. On receiving a read request from the downstream object, it determines from the signal on the memory-empty flag pin whether the first-in first-out memory is empty; if not, it sends an enable signal to the read-enable pin to make the first-in first-out memory readable; otherwise, it keeps the first-in first-out memory unreadable.
Specifically, the second logic unit includes: a second inverter, whose input is connected to the memory-empty flag pin and whose output is led out as the downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input is connected to the downstream data-valid terminal and whose second input is connected to the downstream data-valid flag terminal for connecting the downstream object.
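In combinational terms, each logic unit reduces to an inverter plus an AND gate, that is, a standard FIFO valid/ready handshake. The sketch below models one evaluation of these gates; the signal names are assumptions made here for readability.

    def fifo_interface_cycle(full, empty, upstream_valid, downstream_valid):
        # First logic unit: the inverter on the memory-full flag, ANDed
        # with the upstream data-valid signal, drives the write enable.
        writable = not full                        # first inverter
        write_en = writable and upstream_valid     # first AND gate

        # Second logic unit: the inverter on the memory-empty flag, ANDed
        # with the downstream data-valid flag, drives the read enable.
        readable = not empty                       # second inverter
        read_en = readable and downstream_valid    # second AND gate
        return write_en, read_en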
In this embodiment, the deconvolution device 100 operates as follows:
The data to be processed is read from the external memory 200 through a first-in first-out data interface and stored in the BRAM of the first line buffer 120 (Conv_in_cache in FIG. 2).
The data to be processed consists of the feature map and the convolution parameters. The feature map has size N_C × W1 × H1, and the convolution parameters include the number of filters N_F, the kernel size k×k, the stride s, and the boundary padding p.
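The steps below end with an output feature map of size N_F × W2 × H2 without spelling out W2 and H2; for a transposed convolution with these parameters, the standard relation, stated here as an inference rather than as part of the patent text, is:

    W2 = s × (W1 - 1) + k - 2p
    H2 = s × (H1 - 1) + k - 2p

For instance, W1 = 4, k = 3, s = 2, and p = 0 would give W2 = 9.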
The second line buffer 130 reads the N_F (N_C × k × k) weight parameters (one channel) from the external memory 200 through the first-in first-out data interface (SIF) and then stores them in the parameter buffer 110.
Once the parameter buffer 110 has been loaded with a weight parameter, the device starts receiving the pixel data of the feature map to be processed; through the processing of the first line buffer 120 and the second line buffer 130, the deconvolution operation circuit 140 receives k×k pixel data per clock cycle.
The deconvolution operation circuit 140 performs the deconvolution and accumulation on the input data of each channel (the feature map input to each channel having height H and width W), and then outputs the result of each channel to the output buffer 150.
The different input channels are visited cyclically, and the output buffer 150 accumulates the data results of each channel until a feature map of N_F × W2 × H2 is obtained.
The pooling operation circuit 160 can then receive the N_F × W2 × H2 pixel data, pool it, and output the feature map, or the feature map can be output directly from the output buffer 150.
After the pooling operation circuit 160 or the output buffer 150 outputs the feature map processed by one filter, the parameter buffer 110 is reloaded with a new weight parameter, and the above pixel-processing procedure is repeated iteratively with different filters until the pixel processing of all deconvolution layers is complete.
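Putting the steps above together, the schedule can be restated in software as the sketch below. It reuses the hypothetical helpers sketched earlier (deconv2d_single_channel and max_pool_2x2) and is a behavioral summary of the dataflow, not the hardware implementation.

    import numpy as np

    def run_deconvolution_layer(feature_map, weights, s=1, p=0, pool=False):
        # feature_map: N_C channels of H1 x W1 pixels.
        # weights: N_F filters, each with N_C sub-parameter matrices of k x k.
        # Mirrors the device schedule: for each filter, load its weights,
        # cycle through the input channels, accumulate the per-channel
        # results in the output buffer, then optionally pool and write out.
        outputs = []
        for filt in weights:                        # reload the parameter buffer
            acc = None
            for channel, w in zip(feature_map, filt):       # cycle the channels
                result = deconv2d_single_channel(channel, w, s, p)
                acc = result if acc is None else acc + result   # output buffer
            if pool:
                acc = np.array(max_pool_2x2(acc.tolist()))      # pooling circuit
            outputs.append(acc)                     # write to external memory
        return np.stack(outputs)                    # N_F x H2 x W2 feature map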
This embodiment also provides an artificial intelligence processing device that includes the deconvolution device 100 described above. The deconvolution device 100 has been described in detail above and is not described again here.
The artificial intelligence processor includes a programmable logic circuit (PL) and a processing system circuit (PS). The processing system circuit includes a central processing unit, which may be implemented by an MCU, SoC, FPGA, or DSP, for example an embedded processor chip of the ARM architecture. The central processing unit is communicatively coupled to the external memory 200, which is, for example, RAM or ROM, such as third- or fourth-generation DDR SDRAM; the central processing unit can read data from and write data to the external memory 200.
In summary, the deconvolution device of the present invention is built from hardware such as a parameter buffer, an input buffer, a deconvolution operation circuit, an output buffer, a pooling operation circuit, and first-in first-out data interfaces, and can process highly complex deconvolution neural network algorithms at high speed, effectively solving the prior-art problems of slow processing and high processor performance requirements caused by software implementations. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial value.
The above embodiments merely illustrate the principles of the invention and its effects and are not intended to limit the invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (9)

  1. A deconvolution device electrically connected to an external memory, wherein the external memory stores data to be processed and weight parameters, characterized in that the deconvolution device comprises: a parameter buffer, an input buffer, a deconvolution operation circuit, and an output buffer;
    the parameter buffer is configured to receive and output the weight parameters;
    the input buffer comprises a plurality of connected line buffers for receiving and outputting the data to be processed, wherein each time every line buffer outputs one data element, the outputs together form one column of data;
    the deconvolution operation circuit is configured to receive the data to be processed from the input buffer and the weight parameters from the parameter buffer, perform the deconvolution operation accordingly, and output a deconvolution result;
    the output buffer is configured to receive the deconvolution result and output it to the external memory.
  2. The deconvolver according to claim 1, wherein the input buffer comprises:
    a first line buffer, which receives the pixel data of the feature map to be processed bit by bit, outputs row pixel data simultaneously after filtering, and stores the input feature map of each deconvolution layer.
  3. The deconvolver according to claim 2, wherein the first line buffer outputs the row pixel data of each deconvolution layer in sequence and, when outputting the row pixel data of each deconvolution layer, outputs the row pixel data of each channel in sequence.
  4. The deconvolver according to claim 2, wherein the input buffer further comprises:
    at least one second line buffer, configured to acquire the weight parameters of each filter from the external memory and input them to the parameter buffer in sequence.
  5. The deconvolver according to claim 1, wherein the deconvolution operation circuit comprises:
    a plurality of deconvolution kernels running in parallel, each comprising multipliers for performing the deconvolution operation;
    an adder tree, which accumulates the output results of the plurality of multipliers;
    wherein each deconvolution kernel receives pixel data in the form of a K×K matrix and, based on the input pixel data and the weight parameters, outputs pixel data bit by bit through the deconvolution operation.
  6. The deconvolver according to claim 5, wherein the output buffer comprises:
    at least two FIFO memories in parallel, wherein the channel data processed by the same filter is accumulated and stored in the same FIFO memory;
    a data selector, configured to return each accumulation result to the adder tree until the adder tree outputs the final accumulation result.
  7. The deconvolver according to claim 1, further comprising:
    a pooling operation circuit, connected between the output buffer and the external memory, configured to pool the deconvolution result and output it to the external memory.
  8. The deconvolver according to claim 1, wherein the internal components of the deconvolver are connected to one another, and the deconvolver is connected to the external memory, through first-in-first-out data interfaces.
  9. An artificial intelligence processing apparatus, comprising the deconvolver according to any one of claims 1 to 8.
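As a minimal sketch of the accumulation scheme recited in claims 5 and 6 above: partial sums for one filter sit in a FIFO, and a data selector feeds each stored partial back to the adder as the next channel arrives, so final pixel values leave the FIFO only after the last channel has been added. The container and function names here are hypothetical, chosen only to make the feedback loop concrete.

    from collections import deque

    def accumulate_channels(channel_streams):
        # channel_streams: one list per input channel, each holding the
        # partial pixel sums this filter produces for that channel, in the
        # same output-position order.
        fifo = deque(channel_streams[0])         # first channel primes the FIFO
        for stream in channel_streams[1:]:       # remaining channels of the filter
            for px in stream:                    # selector returns stored partial
                fifo.append(fifo.popleft() + px) # ...to the adder, then re-stores
        return list(fifo)                        # final sums leave for pooling/DDR

    print(accumulate_channels([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
    # -> [111, 222, 333]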
PCT/CN2018/072659 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same WO2019136747A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same
CN201880002766.XA CN110178146B (en) 2018-01-15 2018-01-15 Deconvolutor and artificial intelligence processing device applied by deconvolutor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same

Publications (1)

Publication Number Publication Date
WO2019136747A1 true WO2019136747A1 (en) 2019-07-18

Family

ID=67218472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072659 WO2019136747A1 (en) 2018-01-15 2018-01-15 Deconvolver and an artificial intelligence processing device applied by same

Country Status (2)

Country Link
CN (1) CN110178146B (en)
WO (1) WO2019136747A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10497089B2 (en) * 2016-01-29 2019-12-03 Fotonation Limited Convolutional neural network
CN106022468B (en) * 2016-05-17 2018-06-01 成都启英泰伦科技有限公司 the design method of artificial neural network processor integrated circuit and the integrated circuit
CN106355244B (en) * 2016-08-30 2019-08-13 深圳市诺比邻科技有限公司 The construction method and system of convolutional neural networks
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107403117A (en) * 2017-07-28 2017-11-28 西安电子科技大学 Three dimensional convolution device based on FPGA

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 The neutral net forward direction arithmetic hardware structure quantified based on power weight
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081770A1 * 2019-09-17 2021-03-18 Gowin Semiconductor Corporation System architecture based on soc fpga for edge artificial intelligence computing
US11544544B2 (en) * 2019-09-17 2023-01-03 Gowin Semiconductor Corporation System architecture based on SoC FPGA for edge artificial intelligence computing

Also Published As

Publication number Publication date
CN110178146B (en) 2023-05-12
CN110178146A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
WO2019136764A1 (en) Convolutor and artificial intelligent processing device applied thereto
US10394929B2 (en) Adaptive execution engine for convolution computing systems
CN110050267B (en) System and method for data management
US11720523B2 (en) Performing concurrent operations in a processing element
CN110366732B (en) Method and apparatus for matrix processing in convolutional neural networks
CN109844738A (en) Arithmetic processing circuit and identifying system
Liu et al. Fg-net: A fast and accurate framework for large-scale lidar point cloud understanding
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN109859178B (en) FPGA-based infrared remote sensing image real-time target detection method
US20190213439A1 (en) Switchable propagation neural network
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN110766127B (en) Neural network computing special circuit and related computing platform and implementation method thereof
EP3093757A2 (en) Multi-dimensional sliding window operation for a vector processor
WO2019136751A1 (en) Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN108717571A (en) A kind of acceleration method and device for artificial intelligence
WO2023123919A1 (en) Data processing circuit, data processing method, and related product
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
US20220012569A1 (en) Computer-implemented data processing method, micro-controller system and computer program product
WO2019136747A1 (en) Deconvolver and an artificial intelligence processing device applied by same
CN112016522B (en) Video data processing method, system and related components
CN114359662A (en) Implementation method of convolutional neural network based on heterogeneous FPGA and fusion multiresolution
WO2019136761A1 (en) Three-dimensional convolution device for recognizing human action
Dawwd The multi 2D systolic design and implementation of Convolutional Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18900066

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18900066

Country of ref document: EP

Kind code of ref document: A1