WO2019136762A1 - Artificial intelligence processor and processing method applied thereto - Google Patents

Artificial intelligence processor and processing method applied thereto

Info

Publication number
WO2019136762A1
WO2019136762A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
artificial intelligence
convolution
processing module
output
Prior art date
Application number
PCT/CN2018/072676
Other languages
English (en)
French (fr)
Inventor
肖梦秋
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Priority to CN201880002767.4A priority Critical patent/CN109564638B/zh
Priority to PCT/CN2018/072676 priority patent/WO2019136762A1/zh
Publication of WO2019136762A1 publication Critical patent/WO2019136762A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of integrated circuit technology, and in particular to an artificial intelligence processor and a processing method applied thereto.
  • Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
  • the artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational workload is enormous; AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). It is clear that today, as artificial intelligence ushers in a new wave of revival, traditional processors are becoming a bottleneck that hinders the spread of artificial intelligence.
  • an artificial intelligence processor comprising: a programmable logic circuit, comprising: a convolution processing module, communicatively coupled to an external memory, wherein the external memory stores first data to be processed and a first weight parameter;
  • the convolution processing module includes: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer;
  • the first parameter cache is used for reading and outputting the first weight parameter;
  • the first input buffer includes: a plurality of connected line buffers for reading and outputting the first to-be-processed data; wherein, for each bit of data output by each of the line buffers, the outputs are assembled to form one column of data;
  • the convolution operation circuit is configured to read the first to-be-processed data from the first input buffer, and read the first weight parameter from the first parameter cache, thereby performing a convolution operation And outputting a convolution operation result;
  • the first output buffer is configured to receive the convolution operation result, and output the convolution operation result to the external memory.
  • the first input buffer and/or the first parameter cache includes: a plurality of connected line buffers for reading and outputting the first to-be-processed data and/or the first weight parameter; wherein, for each bit of data output by each of the line buffers, the outputs are assembled to form one column of data.
  • the convolution processing module further includes: a pooling operation circuit, configured to perform pooling on the convolution operation result and output the result to the external memory.
  • the programmable logic portion further includes: a fully connected operation circuit for classifying and outputting according to the convolution operation result.
  • the artificial intelligence processor includes: a first DMA communicatively connected between the external data memory and the convolution processing module.
  • the internal components of the convolution processing module, and the convolution processing module and the external memory, are connected by first-in first-out data interfaces.
  • the artificial intelligence processor further includes: a processing system circuit, comprising: a central processing module, configured to configure an operating parameter of the convolution processing module in the programmable logic circuit.
  • the first to-be-processed data includes a plurality of channel data;
  • the first weight parameter includes multiple layers of sub-parameters, and each layer of sub-parameters corresponds one-to-one to a respective channel data;
  • there are a plurality of the convolution operation circuits, for calculating the convolution operation results of the respective channel data in parallel in one-to-one correspondence; and/or there are a plurality of the first weight parameters and a plurality of the convolution operation circuits, for calculating in parallel, in one-to-one correspondence, the convolution operation results of each layer of sub-parameters of the plurality of first weight parameters with the respective channel data.
  • the programmable logic circuit further includes: a deconvolution processing module communicatively connected to the external memory, wherein the external memory stores the second to-be-processed data and the second weight parameter;
  • the deconvolution processing module includes: a second parameter cache, a second input buffer, a deconvolution operation circuit, and a second output buffer; the second parameter cache is configured to read and output the second weight parameter; the second input buffer comprises: a plurality of connected line buffers for reading and outputting the second to-be-processed data; wherein, for each bit of data output by each of the line buffers, the outputs are assembled to form one column of data;
  • the deconvolution operation circuit is configured to read the second to-be-processed data from the second input buffer and the second weight parameter from the second parameter cache, so as to perform a deconvolution operation and output a deconvolution operation result; the second output buffer is configured to receive the deconvolution operation result and output it to the external memory.
  • the artificial intelligence processor includes: a shared buffer, serving as both the first input buffer and the second input buffer, through which the convolution operation circuit and the deconvolution operation circuit transfer the data received from the external memory in a time-division multiplexed manner.
  • the artificial intelligence processor includes: a second DMA communicatively connected between the external data memory and the deconvolution processing module.
  • the internal components of the deconvolution processing module, and the deconvolution processing module and the external memory, are connected by first-in first-out data interfaces.
  • the artificial intelligence processor further includes: a processing system circuit, comprising: a central processing module, configured to configure the operating parameters of the convolution processing module and the deconvolution processing module in the programmable logic circuit.
  • in the artificial intelligence processor, the type of the second data to be processed includes the convolution operation result.
  • the first-in first-out data interface includes: a first-in first-out memory, including: an upstream writable enable pin, a data input pin, and a memory-full status flag pin; and a downstream readable enable pin, a data output pin, and a memory-empty status flag pin; a first logic unit, connected to the upstream object, the writable enable pin, and the memory-full status flag pin, for determining, upon receiving a write request of the upstream object, whether the first-in first-out memory is full according to the signal on the memory-full status flag pin; if not full, an enable signal is sent to the writable enable pin to make the first-in first-out memory writable; otherwise, the first-in first-out memory is made non-writable; a second logic unit, connected to the downstream object, the readable enable pin, and the memory-empty status flag pin, for determining, upon receiving a read request of the downstream object, whether the first-in first-out memory is empty according to the signal on the memory-empty status flag pin; if not empty, an enable signal is sent to the readable enable pin to make the first-in first-out memory readable; otherwise, the first-in first-out memory is made non-readable.
  • the first logic unit includes: a first inverter, whose input terminal is connected to the memory-full status flag pin and whose output terminal leads out a first flag terminal for connecting the upstream object; and a first AND gate, whose first input terminal is connected to the first data-valid flag terminal, whose second input terminal is connected to the upstream data-valid terminal for connecting the upstream object, and whose output terminal is connected to the writable enable pin;
  • the second logic unit includes: a second inverter, whose input terminal is connected to the memory-empty status flag pin and whose output terminal leads out a downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input terminal is connected to the downstream data-valid terminal and whose second input terminal is connected to the downstream data-valid flag terminal for connecting the downstream object.
  • the type of the central processing unit includes: an MCU, an SoC, an FPGA, or a DSP.
  • the present invention provides an artificial intelligence processing method, which is applied to the artificial intelligence processor; the method includes: reading first to-be-processed data and a first weight parameter from an external memory; performing a convolution operation according to the first to-be-processed data and the first weight parameter and outputting a convolution operation result; and outputting the convolution operation result to the external memory.
  • the present invention provides an artificial intelligence processing method, which is applied to the artificial intelligence processor, the method comprising: reading second to-be-processed data and a second weight parameter from an external memory; performing a deconvolution operation based on the second to-be-processed data and the second weight parameter and outputting a deconvolution operation result; and outputting the deconvolution operation result to the external memory.
  • the type of the second data to be processed includes the result of the convolution operation.
  • the present invention provides an artificial intelligence processor and a processing method applied thereto, the artificial intelligence processor comprising: a programmable logic circuit comprising: a convolution processing module communicatively connected to an external memory, wherein the external memory stores the first to-be-processed data and the first weight parameter; the convolution processing module includes: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer; the artificial intelligence processor can implement a convolutional neural network algorithm through hardware logic circuits to solve the problems of the prior art.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence processor according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a convolution processing module according to an embodiment of the present invention.
  • FIG. 3 is a schematic flow chart showing a processing method of the artificial intelligence processor in the embodiment of FIG. 1.
  • FIG. 4 is a schematic structural diagram of an artificial intelligence processor according to still another embodiment of the present invention.
  • FIG. 5 is a block diagram showing the structure of a convolution operation circuit and a deconvolution operation circuit using a shared buffer according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a FIFO interface according to an embodiment of the present invention.
  • the invention relates to artificial intelligence technology, and in particular to the design of a dedicated processor for artificial intelligence processing, to solve the prior-art problem of the low efficiency of running artificial intelligence algorithms on existing processor chip architectures; the artificial intelligence processor of the present invention can be used for the arithmetic implementation of convolutional neural network algorithms.
  • FIG. 1 a schematic structural diagram of an artificial intelligence processor 100 in an embodiment of the present invention is shown.
  • the artificial intelligence processor 100 includes: a programmable logic circuit (PL) and a processing system circuit (PS).
  • the processing system circuit includes a central processing unit 101, which can be implemented by an MCU, an SoC, an FPGA, a DSP, or the like, such as an embedded processor chip of the ARM architecture; the central processing unit 101 is communicatively coupled to the external memory 102;
  • the external memory 102 is, for example, a RAM or a ROM memory, such as DDR3 or DDR4 SDRAM; the central processing unit 101 can read and write data to the external memory 102.
  • the programmable logic circuit (PL) is also communicatively coupled to the external memory 102.
  • the programmable logic circuit (PL) may include a DMA (Direct Memory Access) circuit to access the external memory 102 directly and quickly.
  • the programmable logic circuit can implement various functional circuit modules therein by programming the FPGA.
  • the programmable logic circuit includes: a convolution processing module 103; the convolution processing module 103 can implement a convolution operation.
  • the first to-be-processed data and the weight parameters (such as the filter used in the convolution operation, which contains the weight matrix) used by the convolution processing module 103 may be stored in the external memory 102, and the convolution processing module 103 may perform a convolution operation based on the first to-be-processed data and the first weight parameter and output the convolution result to the external memory 102; according to the number of convolution layers required by the convolutional neural network algorithm, the convolution operation results in the external memory 102 can be input repeatedly for further convolution operations.
  • the central processing unit 101 is communicatively coupled to the convolution processing module 103, and can be used to set parameters, such as the number of convolution kernel filters, the height and width K of the filter, the number of input channels, the number of output channels, and the step size.
  • the central processor 101 can also provide a clock signal to the convolution processing module 103.
  • the convolution processing module 200 includes: a first parameter cache 201, a first input buffer 202, a convolution operation circuit 203, and a first output buffer 204.
  • the first parameter cache 201 is configured to read and output the first weight parameter.
  • the first to-be-processed data is, for example, image data. Since each pixel position includes three pixel values, R, G, and B, one image can be represented by three channels of data (Channels), so the depth of the filter included in the first weight parameter also needs to be 3 layers.
  • the first input buffer 202 is configured to read and output the first to-be-processed data.
  • the first input buffer 202 may include: a plurality of connected line buffers (Line FIFOs) for reading and outputting the first to-be-processed data and/or the first weight parameter.
  • for each bit of data output by each of the line buffers, the outputs are assembled to form one column of data, and the multi-column output can form a matrix; optionally, in the case of multiple channels, the filter also has multiple layers, and the first parameter cache 201 can likewise store and output the first weight parameter using multiple line buffers.
  • the convolution operation circuit 203 is configured to read the first to-be-processed data from the first input buffer 202, and read the first weight parameter from the first parameter cache 201, thereby performing a convolution operation and outputting a convolution The result of the operation.
  • the convolution operation consists of multiplications and additions, so the convolution operation circuit 203 may be a circuit composed of interconnected multipliers and adders.
  • the first output buffer 204 is configured to receive the convolution operation result, and output the convolution operation result to the external memory 102.
  • the first output buffer 204 includes at least two buffers (for example, FIFOs): in one clock cycle, the first buffer is used to write in the convolution operation results while the second buffer outputs to the external memory 102; in the next clock cycle, the first buffer and the second buffer swap roles, the first buffer outputting to the external memory 102 and the second buffer writing in the convolution operation results.
  • the convolution processing module 103 may further include: a pooling operation circuit for pooling the convolution operation result and outputting it to the external memory 102; specifically, the pooling method may be max pooling or average pooling, and either can be implemented by logic circuits.
  • the programmable logic portion may further include: a fully connected operation circuit for classifying and outputting according to the convolution operation result.
  • the first to-be-processed data may include a plurality of channel data;
  • the first weight parameter includes a plurality of sub-parameters, and each layer of sub-parameters respectively correspond to respective channel data;
  • there are a plurality of the convolution operation circuits 203, for calculating the convolution operation results of the respective channel data in parallel in one-to-one correspondence.
  • for example, the image has three channels of data, R, G, and B, that is, three two-dimensional matrices; suppose the depth of the first weight parameter, that is, of the filter, is 3, so that it has three layers of sub-weight parameters, that is, three two-dimensional matrices, each of size K*K with K an odd number such as 3; each layer is convolved with its corresponding channel, and the convolution operation circuits 203 can be provided in a corresponding number of three, thereby simultaneously performing the convolutions between one filter and all the channels.
  • FIG. 3 the processing flow of the artificial intelligence processor 100 in the embodiment of FIG. 1 is shown:
  • Step S1: the artificial intelligence processor 100 obtains the first to-be-processed data and the first weight parameter from the external memory 102;
  • Step S2: the convolution processing module 103 performs a convolution operation according to the input first to-be-processed data and the first weight parameter;
  • Step S3: the convolution operation result is output to the external memory 102.
  • the above S1 to S3 processing may be repeated: the convolution operation results in the external memory 102 may be repeatedly read and input into the convolution processing module 103 for arithmetic processing, and then output back to the external memory 102.
  • the central processing unit can control each round of processing by means of a clock signal, and each time such a round is required, the central processing unit can configure the convolution processing module 103 with operating parameters corresponding to the requirements of the different convolution layers and pooling layers.
  • when the artificial intelligence processor 100 performs the first pass, the data to be processed and a first weight parameter are acquired from the external memory 102; the convolution operation circuit in the convolution processing module 103 convolves the first to-be-processed data and outputs to the pooling operation circuit, which produces the first operation result data, and the first operation result data is output to the external memory 102; when the artificial intelligence processor 100 performs the second pass, the first operation result data is obtained from the external memory together with the next first weight parameter, then the second convolution operation and pooling operation are performed to obtain the second operation result data, and the final result is obtained through the operation of the fully connected operation circuit; of course, in some convolutional neural network models,
  • the fully connected layer can be replaced by a convolutional layer, in which case the fully connected operation circuit is not necessary.
  • the programmable logic circuit of the artificial intelligence processor 400 includes, in addition to the convolution processing module 403, a deconvolution processing module 404, communicatively coupled to the external memory 402;
  • the external memory 402 stores the second to-be-processed data and the second weight parameter;
  • the deconvolution processing module 404 includes: a second parameter cache, a second input buffer, a deconvolution operation circuit, and a second output buffer;
  • the second parameter cache is configured to read and output the second weight parameter;
  • the second input buffer includes: a plurality of connected line buffers for reading and outputting the second to-be-processed data;
  • for each bit of data output by each of the line buffers, the outputs are assembled to form one column of data;
  • the deconvolution operation circuit is configured to read the second to-be-processed data from the second input buffer and the second weight parameter from the second parameter cache,
  • so as to perform a deconvolution operation and output a deconvolution operation result;
  • the second output buffer is configured to receive the deconvolution operation result and output it to the external memory 402.
  • the implementation of the deconvolution processing module 404 is similar to that of the convolution processing module 403: it is likewise implemented by caches and logic operation circuits, with similar connection arrangements, for example connection to the external memory 402 by DMA;
  • its processing likewise obtains the second to-be-processed data and the second weight parameter from the external memory 402, and the result of the deconvolution operation is output back to the external memory 402.
  • the central processing unit 401 can also set operating parameters for the deconvolution processing module 404 to meet the requirements of different deconvolution layer operations.
  • the convolution processing module 403 can also be used in cooperation with the deconvolution processing module 404; for example, in a convolutional neural network model for image semantic segmentation, operations of a plurality of convolution layers may first be performed on the original image to obtain an operation result, and then a corresponding plurality of deconvolution operations may be performed on the convolution operation result to obtain a feature image of the original image size; in the embodiments of the present invention, with reference to the above, it can be inferred that this can be achieved by using the convolution processing module 403 and the deconvolution processing module 404 in succession.
  • the artificial intelligence processor includes: a shared cache 501, serving as both the first input buffer and the second input buffer, through which the convolution operation circuit 502 and the deconvolution operation circuit 503 transfer the data received from the external memory in a time-division multiplexed manner; since on-chip memory resources are limited, and the convolution processing module and the deconvolution processing module are not used at the same time, the shared cache can be used by the convolution processing module and the deconvolution processing module in a time-sharing manner, which can halve the input buffering required.
  • the convolution processing module is connected to the external memory through a first-in first-out data interface.
  • in the artificial intelligence processor, the type of the second data to be processed includes the convolution operation result.
  • the first-in first-out data interface includes: a first-in first-out memory (FIFO), including: an upstream writable enable pin (write), a data input pin (data_in), and a memory-full status flag pin (full); and a downstream readable enable pin (read), a data output pin (data_out), and a memory-empty status flag pin (empty);
  • the first logic unit 601 is connected to the upstream object, the writable enable pin, and the memory-full status flag pin, and is configured to determine, upon receiving a write request of the upstream object, whether the first-in first-out memory is full according to the signal on the memory-full status flag pin; if not full, an enable signal is sent to the writable enable pin to make the first-in first-out memory writable; otherwise, the first-in first-out memory is made non-writable;
  • the second logic unit 602 is connected to the downstream object, the readable enable pin, and the memory-empty status flag pin, and is configured to determine, upon receiving a read request of the downstream object, whether the first-in first-out memory is empty according to the signal on the memory-empty status flag pin; if not empty, an enable signal is sent to the readable enable pin to make the first-in first-out memory readable; otherwise, the first-in first-out memory is made non-readable.
  • the first logic unit 601 includes: a first inverter 603, whose input terminal is connected to the memory-full status flag pin and whose output terminal leads out a first flag terminal for connecting the upstream object; and a first AND gate 604, whose first input terminal is connected to the first data-valid flag terminal, whose second input terminal is connected to the upstream data-valid terminal for connecting the upstream object, and whose output terminal is connected to the writable enable pin;
  • the second logic unit 602 includes: a second inverter 605, whose input terminal is connected to the memory-empty status flag pin and whose output terminal leads out a downstream data-valid terminal for connecting the downstream object; and a second AND gate 606, whose first input terminal is connected to the downstream data-valid terminal and whose second input terminal is connected to the downstream data-valid flag terminal for connecting the downstream object.
  • the FIFO can be written when the writable enable pin (write) is set to "1", and is not writable when it is set to "0"; the memory-full status flag pin (full) is set to "1" when the FIFO is full; the FIFO is readable when the readable enable pin (read) is set to "1", and is unreadable when it is set to "0"; the memory-empty status flag pin (empty) is set to "1" when the FIFO is empty.
  • when the memory-full status flag pin (full) is set to "1" because the FIFO is full, the first inverter 603 outputs "0" to one input of the first AND gate 604, so that the output of the first AND gate 604 is "0" and the FIFO is not writable; when the memory-empty status flag pin (empty) is set to "1" because the FIFO is empty, the second inverter 605 outputs "0" to one input of the second AND gate 606, so that the second AND gate 606 outputs "0" and the FIFO is unreadable.
  • in other embodiments, the digital voltage values that indicate the state of each pin may also be replaced, for example enabling when set to "0", and are not limited to the above embodiment; in addition, the first logic unit 601 and the second logic unit 602 can also employ other logic operation devices, and are not limited to FIG. 6.
  • through the connection of the components of the artificial intelligence processor of the present invention by the first-in first-out data interface, the pipeline process implementing the data processing can be controlled.
  • the present invention provides an artificial intelligence processor and a processing method applied thereto, the artificial intelligence processor comprising: a programmable logic circuit, comprising: a convolution processing module communicatively connected to an external memory, wherein the external memory stores first data to be processed and a first weight parameter; the convolution processing module includes: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer; the artificial intelligence processor
  • can implement a convolutional neural network algorithm through hardware logic circuits to solve the problems of the prior art.
  • the invention effectively overcomes various shortcomings in the prior art and has high industrial utilization value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Logic Circuits (AREA)
  • Stored Programmes (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An artificial intelligence processor and a processing method applied thereto. The artificial intelligence processor (100) comprises: a programmable logic circuit, which comprises: a convolution processing module (103) communicatively connected to an external memory (102), wherein the external memory (102) stores first data to be processed and a first weight parameter; the convolution processing module (103) comprises: a first parameter cache (201), a first input buffer, a convolution operation circuit (203), and a first output buffer. The artificial intelligence processor implements a convolutional neural network algorithm with hardware logic circuits, solving the problems of the prior art.

Description

Artificial intelligence processor and processing method applied thereto
Technical Field
The present invention relates to the field of integrated circuit technology, and in particular to an artificial intelligence processor and a processing method applied thereto.
Background Art
Artificial Intelligence, abbreviated in English as AI, is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
Artificial intelligence algorithms are neural network model algorithms that simulate the human brain, and their computational workload is enormous; AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, today, as artificial intelligence enjoys a new wave of revival, traditional processors are becoming the bottleneck hindering the spread of artificial intelligence.
Therefore, how to design a chip dedicated to neural networks that is efficient, low-power, and compact, and that can be embedded in intelligent terminal devices, has become a technical problem the industry urgently needs to solve.
Summary of the Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide an artificial intelligence processor and a processing method applied thereto, so as to solve the problems in the prior art.
To achieve the above and other related objects, the present invention provides an artificial intelligence processor, comprising: a programmable logic circuit, which comprises: a convolution processing module communicatively connected to an external memory, wherein the external memory stores first data to be processed and a first weight parameter; the convolution processing module comprises: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer; the first parameter cache is used to read and output the first weight parameter; the first input buffer comprises a plurality of connected line buffers used to read and output the first data to be processed, wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data; the convolution operation circuit is used to read the first data to be processed from the first input buffer and the first weight parameter from the first parameter cache, so as to perform a convolution operation and output a convolution operation result; the first output buffer is used to receive the convolution operation result and output the convolution operation result to the external memory.
In an embodiment of the present invention, the first input buffer and/or the first parameter cache comprise: a plurality of connected line buffers used to read and output the first data to be processed and/or the first weight parameter, wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data.
In an embodiment of the present invention, the convolution processing module further comprises: a pooling operation circuit used to pool the convolution operation result and output it to the external memory.
In an embodiment of the present invention, the programmable logic portion further comprises: a fully connected operation circuit used to classify and output according to the convolution operation result.
In an embodiment of the present invention, the artificial intelligence processor comprises: a first DMA communicatively connected between the external data memory and the convolution processing module.
In an embodiment of the present invention, the internal components of the convolution processing module, and the convolution processing module and the external memory, are connected through first-in first-out data interfaces.
In an embodiment of the present invention, the artificial intelligence processor further comprises: a processing system circuit, which comprises: a central processing module used to configure the operating parameters of the convolution processing module in the programmable logic circuit.
In an embodiment of the present invention, the first data to be processed contains a plurality of channel data; the first weight parameter contains multiple layers of sub-parameters, each layer of sub-parameters corresponding one-to-one to a respective channel data; there are a plurality of the convolution operation circuits, used to compute the convolution operation results of the respective channel data in parallel in one-to-one correspondence; and/or there are a plurality of the first weight parameters and a plurality of the convolution operation circuits, used to compute in parallel, in one-to-one correspondence, the convolution operation results of each layer of sub-parameters of the plurality of first weight parameters with the respective channel data.
In an embodiment of the present invention, the programmable logic circuit further comprises: a deconvolution processing module communicatively connected to the external memory, wherein the external memory stores second data to be processed and a second weight parameter; the deconvolution processing module comprises: a second parameter cache, a second input buffer, a deconvolution operation circuit, and a second output buffer; the second parameter cache is used to read and output the second weight parameter; the second input buffer comprises a plurality of connected line buffers used to read and output the second data to be processed, wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data; the deconvolution operation circuit is used to read the second data to be processed from the second input buffer and the second weight parameter from the second parameter cache, so as to perform a deconvolution operation and output a deconvolution operation result; the second output buffer is used to receive the deconvolution operation result and output the deconvolution operation result to the external memory.
In an embodiment of the present invention, the artificial intelligence processor comprises: a shared cache serving as both the first input buffer and the second input buffer, through which the convolution operation circuit and the deconvolution operation circuit transfer the data received from the external memory in a time-division multiplexed manner.
In an embodiment of the present invention, the artificial intelligence processor comprises: a second DMA communicatively connected between the external data memory and the deconvolution processing module.
In an embodiment of the present invention, the internal components of the deconvolution processing module, and the deconvolution processing module and the external memory, are connected through first-in first-out data interfaces.
In an embodiment of the present invention, the artificial intelligence processor further comprises: a processing system circuit, which comprises: a central processing module used to configure the operating parameters of the convolution processing module and the deconvolution processing module in the programmable logic circuit.
In an embodiment of the present invention, in the artificial intelligence processor the type of the second data to be processed includes the convolution operation result.
In an embodiment of the present invention, the first-in first-out data interface comprises: a first-in first-out memory, comprising: an upstream writable enable pin, a data input pin, and a memory-full status flag pin; and a downstream readable enable pin, a data output pin, and a memory-empty status flag pin; a first logic unit connected to the upstream object, the writable enable pin, and the memory-full status flag pin, used to determine, upon receiving a write request from the upstream object, whether the first-in first-out memory is full according to the signal on the memory-full status flag pin; if not full, an enable signal is sent to the writable enable pin to make the first-in first-out memory writable; otherwise the first-in first-out memory is made non-writable; and a second logic unit connected to the downstream object, the readable enable pin, and the memory-empty status flag pin, used to determine, upon receiving a read request from the downstream object, whether the first-in first-out memory is empty according to the signal on the memory-empty status flag pin; if not empty, an enable signal is sent to the readable enable pin to make the first-in first-out memory readable; otherwise the first-in first-out memory is made non-readable.
In an embodiment of the present invention, the first logic unit comprises: a first inverter, whose input terminal is connected to the memory-full status flag pin and whose output terminal leads out a first flag terminal for connecting the upstream object; and a first AND gate, whose first input terminal is connected to the first data-valid flag terminal, whose second input terminal is connected to the upstream data-valid terminal for connecting the upstream object, and whose output terminal is connected to the writable enable pin; the second logic unit comprises: a second inverter, whose input terminal is connected to the memory-empty status flag pin and whose output terminal leads out a downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input terminal is connected to the downstream data-valid terminal and whose second input terminal is connected to the downstream data-valid flag terminal for connecting the downstream object.
In an embodiment of the present invention, the type of the central processor includes: an MCU, an SoC, an FPGA, or a DSP.
To achieve the above and other related objects, the present invention provides an artificial intelligence processing method applied to the artificial intelligence processor; the method comprises: reading first data to be processed and a first weight parameter from an external memory; performing a convolution operation according to the first data to be processed and the first weight parameter and outputting a convolution operation result; and outputting the convolution operation result to the external memory.
To achieve the above and other related objects, the present invention provides an artificial intelligence processing method applied to the artificial intelligence processor, the method comprising: reading second data to be processed and a second weight parameter from an external memory; performing a deconvolution operation according to the second data to be processed and the second weight parameter and outputting a deconvolution operation result; and outputting the deconvolution operation result to the external memory.
In an embodiment of the present invention, the type of the second data to be processed includes the convolution operation result.
As described above, the present invention provides an artificial intelligence processor and a processing method applied thereto, the artificial intelligence processor comprising: a programmable logic circuit, which comprises: a convolution processing module communicatively connected to an external memory, wherein the external memory stores first data to be processed and a first weight parameter; the convolution processing module comprises: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer; the artificial intelligence processor implements a convolutional neural network algorithm with hardware logic circuits, solving the problems of the prior art.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence processor in an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of a convolution processing module in an embodiment of the present invention.
FIG. 3 is a schematic flowchart of a processing method of the artificial intelligence processor in the embodiment of FIG. 1.
FIG. 4 is a schematic structural diagram of an artificial intelligence processor in a further embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a convolution operation circuit and a deconvolution operation circuit using a shared cache in an embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a FIFO interface in an embodiment of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described below by way of specific concrete examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can likewise be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments can be combined with one another.
It should be noted that the drawings provided in the following embodiments only illustrate the basic concept of the present invention in a schematic manner, so the drawings show only the components related to the present invention rather than being drawn according to the number, shape, and size of the components in actual implementation; in actual implementation, the form, quantity, and proportion of each component may vary freely, and the component layout may also be more complex.
The present invention relates to artificial intelligence technology, and specifically solves the prior-art problem of the low efficiency of running artificial intelligence algorithms on existing processor chip architectures by designing a processor dedicated to artificial intelligence processing; the artificial intelligence processor of the present invention can be used for the arithmetic implementation of convolutional neural network algorithms.
As shown in FIG. 1, a schematic structural diagram of an artificial intelligence processor 100 in an embodiment of the present invention is presented.
The artificial intelligence processor 100 comprises: a programmable logic circuit (PL) and a processing system circuit (PS). The processing system circuit includes a central processor 101, which can be implemented by an MCU, SoC, FPGA, or DSP, for example an embedded processor chip of the ARM architecture; the central processor 101 is communicatively connected to an external memory 102, which is, for example, a RAM or ROM memory such as DDR3 or DDR4 SDRAM; the central processor 101 can read and write data to the external memory 102.
The programmable logic circuit (PL) is also communicatively connected to the external memory 102; preferably, the programmable logic circuit (PL) may include a DMA (Direct Memory Access) circuit so as to access the external memory 102 directly and quickly.
The programmable logic circuit can implement its various functional circuit modules by programming an FPGA.
Specifically, the programmable logic circuit (PL) includes: a convolution processing module 103, which can implement convolution operations. The external memory 102 can store the first data to be processed and the weight parameters (such as the filters used in the convolution operation, which contain the weight matrices) for the convolution processing module 103 to process; the convolution processing module 103 can perform convolution operations according to the first data to be processed and the first weight parameter and output the convolution results to the external memory 102; according to the number of convolution layers that the convolutional neural network algorithm needs to implement, the convolution operation results in the external memory 102 can be read back repeatedly as input for further convolution operations.
The central processor 101 is communicatively connected to the convolution processing module 103 and can be used to set its parameters, for example the number of convolution-kernel filters, the filter height and width K, the number of input channels, the number of output channels, and the stride; the central processor 101 can also provide a clock signal to the convolution processing module 103.
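By way of illustration only, the per-layer operating parameters that the central processor 101 writes to the convolution processing module could be modeled as follows; the class and field names are assumptions made for this sketch, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class ConvLayerConfig:
    """Hypothetical register set the central processor writes before
    each convolution pass; all field names are illustrative."""
    num_filters: int   # number of convolution-kernel filters
    kernel_size: int   # filter height and width K
    in_channels: int   # number of input channels
    out_channels: int  # number of output channels
    stride: int        # convolution step size

# e.g. a 3x3 convolution over an RGB image producing 16 feature maps
layer0 = ConvLayerConfig(num_filters=16, kernel_size=3,
                         in_channels=3, out_channels=16, stride=1)
```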
As shown in FIG. 2, a schematic diagram of the specific structure of the convolution processing module 103 of the FIG. 1 embodiment is presented; in this embodiment, the convolution processing module 200 includes: a first parameter cache 201, a first input buffer 202, a convolution operation circuit 203, and a first output buffer 204.
The first parameter cache 201 is used to read and output the first weight parameter. In an embodiment of the present invention, the first data to be processed is, for example, image data; since each pixel position includes three pixel values, R, G, and B, one image can be represented by three channels of data (Channels), so the depth of the filter contained in the first weight parameter also needs to be 3 layers.
The first input buffer 202 is used to read and output the first data to be processed. In an embodiment of the present invention, the first input buffer 202 may include: a plurality of connected line buffers (Line FIFOs) used to read and output the first data to be processed and/or the first weight parameter; for each bit of data output by each line buffer, the outputs are assembled into one column of data, and multiple column outputs form a matrix. Optionally, in the case of multiple Channels, the filter also has multiple layers, and the first parameter cache 201 can likewise use multiple line buffers to store and output the first weight parameter.
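As a behavioral illustration of this arrangement (a software sketch under the assumption of K chained row FIFOs, not the patent's hardware design), once K-1 rows of a row-major pixel stream are buffered, each newly pushed pixel completes one vertically aligned K-element column:

```python
from collections import deque

class LineBuffers:
    """Model of K chained line buffers over an image of fixed width:
    pixels stream in row-major order; once K-1 full rows plus one
    pixel are held, every push yields one K-element column."""
    def __init__(self, k, width):
        self.k, self.width = k, width
        self.window = deque(maxlen=(k - 1) * width + 1)

    def push(self, pixel):
        self.window.append(pixel)
        if len(self.window) == self.window.maxlen:
            # elements one image-width apart are vertically aligned
            return [self.window[i * self.width] for i in range(self.k)]
        return None

# streaming a 4-wide image through 3 line buffers:
lb = LineBuffers(3, 4)
cols = [c for p in range(16) for c in [lb.push(p)] if c is not None]
assert cols[0] == [0, 4, 8]   # first column: rows 0, 1, 2 of column 0
```

Successive columns then form the matrix fed to the convolution operation circuit, matching the column-per-output behavior described above.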
The convolution operation circuit 203 is used to read the first data to be processed from the first input buffer 202 and the first weight parameter from the first parameter cache 201, so as to perform a convolution operation and output a convolution operation result. Specifically, the operations involved in convolution are multiplication and addition, so the convolution operation circuit 203 may be a circuit composed of interconnected multipliers and adders.
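Functionally, such a multiplier-adder network computes a sum of element-wise products over each K*K window. The following plain-Python sketch shows that multiply-accumulate behavior (an illustration of the operation performed, not of the circuit itself):

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (no padding) over list-of-lists inputs;
    each output point is one pass through multipliers and adders."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = 0                          # adder-tree accumulator
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out
```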
The first output buffer 204 is used to receive the convolution operation result and output the convolution operation result to the external memory 102. Preferably, the first output buffer 204 includes at least two buffers (for example, FIFOs): within one clock cycle, the first buffer is used to write in convolution operation results while the second buffer outputs to the external memory 102; in the next clock cycle, the first buffer and the second buffer swap roles, the first buffer outputting to the external memory 102 and the second buffer writing in convolution operation results.
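A minimal software model of this double-buffering (ping-pong) scheme, with a plain list standing in for the external memory, might look like this; the class and method names are assumptions for the sketch:

```python
class PingPongBuffer:
    """Two buffers that swap roles every clock cycle: one accepts new
    convolution results while the other drains to external memory."""
    def __init__(self):
        self.bufs = [[], []]
        self.write_idx = 0                         # buffer being filled

    def cycle(self, results, memory):
        self.bufs[self.write_idx].extend(results)  # fill one buffer
        drain = self.bufs[1 - self.write_idx]      # drain the other
        memory.extend(drain)
        drain.clear()
        self.write_idx = 1 - self.write_idx        # swap roles
```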
In an embodiment of the present invention, the convolution processing module 103 may further include: a pooling operation circuit used to pool the convolution operation result and output it to the external memory 102; specifically, the pooling method may be max pooling or average pooling, and either can be implemented by logic circuits.
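Both pooling modes reduce to simple selection or averaging logic over each window. As an illustration, a sketch assuming 2x2 windows with stride 2 (the window size is an assumption, not specified by the patent):

```python
def pool_2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2 over a list-of-lists feature map."""
    out = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            patch = [feature_map[i][j],     feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(max(patch) if mode == "max" else sum(patch) / 4)
        out.append(row)
    return out
```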
In an embodiment of the present invention, the programmable logic portion may further include: a fully connected operation circuit used to classify and output according to the convolution operation result.
In an embodiment of the present invention, since the first data to be processed may contain a plurality of channel data, and the first weight parameter contains multiple layers of sub-parameters with each layer corresponding one-to-one to a respective channel data, there are a plurality of the convolution operation circuits 203, used to compute the convolution operation results of the respective channel data in parallel in one-to-one correspondence.
For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices. Suppose the depth of the first weight parameter, i.e., of the filter, is 3, so that it has three layers of sub-weight parameters, i.e., three two-dimensional matrices, each of size K*K; suppose K is the odd number 3, and each layer is convolved with its corresponding Channel; the convolution operation circuits 203 can then be provided in a corresponding number of three, so that the convolutions between one filter and all the Channels proceed simultaneously. Further, when a Pv*K*K data cube (Pv>K) is taken from the first data to be processed, say with Pv of 5 and K of 3, the filter must pass through the convolution operation circuits three times to finish operating on that data cube; correspondingly, nine convolution operation circuits can be provided so that the convolutions of each filter with each Channel can be performed in parallel within one clock cycle.
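A behavioral sketch of this channel parallelism, reusing conv2d_valid from the earlier sketch and letting a thread pool stand in for the parallel convolution operation circuits (the function name and structure are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def conv_multichannel(channels, filt):
    """One filter over all channels at once: per-channel convolutions
    run in parallel, then the partial results are summed."""
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        partials = list(pool.map(
            lambda pair: conv2d_valid(pair[0], pair[1]),
            zip(channels, filt)))        # one "circuit" per channel
    h, w = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(w)]
            for i in range(h)]
```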
As shown in FIG. 3, the processing flow of the artificial intelligence processor 100 of the FIG. 1 embodiment is presented:
Step S1: the artificial intelligence processor 100 obtains the first data to be processed and the first weight parameter from the external memory 102;
Step S2: the convolution processing module 103 performs a convolution operation according to the input first data to be processed and first weight parameter;
Step S3: the convolution operation result is output to the external memory 102.
If a convolutional neural network with multiple convolution layers and pooling layers is to be implemented, the above processing of S1 to S3 can be carried out repeatedly, with the convolution operation results in the external memory 102 read back repeatedly and input into the convolution processing module 103 for arithmetic processing before being output back to the external memory 102. It should be noted that the central processor can control each round of processing with a clock signal, and each time such a round is required, the central processor can configure the convolution processing module 103 with operating parameters matching the requirements of the different convolution and pooling layers.
For example, if a convolutional neural network model is "convolution layer - pooling layer - convolution layer - pooling layer - fully connected layer", then in concrete implementation, when the artificial intelligence processor 100 performs the first pass, the data to be processed and one first weight parameter are obtained from the external memory 102; the convolution operation circuit in the convolution processing module 103 convolves the first data to be processed and outputs to the pooling operation circuit, which after operating yields the first operation result data, which is output to the external memory 102. When the artificial intelligence processor 100 performs the second pass, it obtains the first operation result data from the external memory along with the next first weight parameter, then performs the second convolution and pooling operations to obtain the second operation result data, and the final result is obtained through the operation of the fully connected operation circuit. Of course, in some convolutional neural network models the fully connected layer can be replaced by a convolution layer, in which case the fully connected operation circuit is not essential.
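The repeated S1-S3 flow amounts to a loop over layer configurations, with the external memory holding the intermediate result between passes. A pseudo-Python rendering (the dictionary keys and the process_module signature are assumptions for this sketch):

```python
def run_network(external_memory, layer_configs, process_module):
    """S1: read data and weights; S2: run the configured module;
    S3: write the result back for the next pass."""
    data = external_memory["input"]
    for cfg in layer_configs:                        # conv/pool/... layers
        weights = external_memory[cfg["weights_key"]]    # S1
        data = process_module(data, weights, cfg)        # S2
        external_memory["intermediate"] = data           # S3
    return data
```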
As shown in FIG. 4, in this embodiment, besides the convolution processing module 403, the programmable logic circuit of the artificial intelligence processor 400 further includes: a deconvolution processing module 404 communicatively connected to an external memory 402, wherein the external memory 402 stores second data to be processed and a second weight parameter; the deconvolution processing module 404 includes: a second parameter cache, a second input buffer, a deconvolution operation circuit, and a second output buffer; the second parameter cache is used to read and output the second weight parameter; the second input buffer comprises a plurality of connected line buffers used to read and output the second data to be processed, wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data; the deconvolution operation circuit is used to read the second data to be processed from the second input buffer and the second weight parameter from the second parameter cache, so as to perform a deconvolution operation and output a deconvolution operation result; the second output buffer is used to receive the deconvolution operation result and output the deconvolution operation result to the external memory 402.
The deconvolution processing module 404 is implemented in a similar way to the convolution processing module 403, namely through caches and logic operation circuits, with similar connection arrangements, for example connection to the external memory 402 through DMA; its processing likewise obtains the second data to be processed and the second weight parameter from the external memory 402 and outputs the result of the deconvolution operation back to the external memory 402; the central processor 401 can also set operating parameters for the deconvolution processing module 404 to meet the requirements of different deconvolution layer operations.
Some of the other designs applicable to the convolution processing module 403 described above can also be applied to embodiments of this deconvolution processing module 404; those skilled in the art should be able to recognize this without ambiguity, so the applicant does not elaborate on those embodiments.
It should be particularly noted that in the embodiment shown in FIG. 4, the convolution processing module 403 can also be used in cooperation with the deconvolution processing module 404. For example, in a convolutional neural network model for image semantic segmentation, operations of multiple convolution layers may first be performed on the original image to obtain an operation result, and then multiple corresponding deconvolution operations may be performed on the convolution operation result to obtain a feature image of the original image size; in the embodiments of the present invention, with reference to the foregoing, it can be inferred that this can be achieved by using the convolution processing module 403 and the deconvolution processing module 404 in succession.
Optionally, as shown in FIG. 5, the artificial intelligence processor includes: a shared cache 501 serving as both the first input buffer and the second input buffer, through which the convolution operation circuit 502 and the deconvolution operation circuit 503 transfer the data received from the external memory in a time-division multiplexed manner. Since on-chip memory resources are limited, and the convolution processing module and the deconvolution processing module are not used at the same time, a cache shared by the two and used by the convolution processing module and the deconvolution processing module in a time-sharing manner can halve the input buffering required.
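A minimal sketch of such time-shared ownership, assuming a simple acquire/release arbitration (the method names are illustrative, not from the patent):

```python
class SharedInputBuffer:
    """One physical buffer granted alternately to the convolution and
    deconvolution circuits; since the two never run at the same time,
    sharing halves the on-chip input-buffer requirement."""
    def __init__(self):
        self.data = []
        self.owner = None                 # "conv" or "deconv"

    def acquire(self, who):
        assert self.owner is None, "buffer already in use"
        self.owner = who
        return self.data

    def release(self, who):
        assert self.owner == who
        self.data.clear()
        self.owner = None
```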
Optionally, in the embodiments of FIG. 1 to FIG. 5, the internal components of the convolution processing module and/or the deconvolution processing module, and the connection between the convolution processing module and/or the deconvolution processing module and the external memory, are made through first-in first-out data interfaces.
In an embodiment of the present invention, in the artificial intelligence processor the type of the second data to be processed includes the convolution operation result.
As shown in FIG. 6, the structure of the first-in first-out data interface in an embodiment is presented. The first-in first-out data interface comprises: a first-in first-out memory (FIFO), comprising: an upstream writable enable pin (write), a data input pin (data_in), and a memory-full status flag pin (full); and a downstream readable enable pin (read), a data output pin (data_out), and a memory-empty status flag pin (empty); a first logic unit 601, connected to the upstream object, the writable enable pin, and the memory-full status flag pin, used to determine, upon receiving a write request from the upstream object, whether the first-in first-out memory is full according to the signal on the memory-full status flag pin; if not full, an enable signal is sent to the writable enable pin to make the first-in first-out memory writable; otherwise the first-in first-out memory is made non-writable; and a second logic unit 602, connected to the downstream object, the readable enable pin, and the memory-empty status flag pin, used to determine, upon receiving a read request from the downstream object, whether the first-in first-out memory is empty according to the signal on the memory-empty status flag pin; if not empty, an enable signal is sent to the readable enable pin to make the first-in first-out memory readable; otherwise the first-in first-out memory is made non-readable.
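Behaviorally, the interface gates writes on the full flag and reads on the empty flag; a software model of that contract (a sketch of the behavior, not the RTL) could be:

```python
from collections import deque

class Fifo:
    """Model of the first-in first-out data interface: writes are
    gated by the full flag, reads by the empty flag."""
    def __init__(self, depth):
        self.q = deque()
        self.depth = depth

    @property
    def full(self):
        return len(self.q) == self.depth

    @property
    def empty(self):
        return len(self.q) == 0

    def write(self, data_in):
        if self.full:               # first logic unit: block when full
            return False            # write enable deasserted
        self.q.append(data_in)      # write enable asserted
        return True

    def read(self):
        if self.empty:              # second logic unit: block when empty
            return None             # read enable deasserted
        return self.q.popleft()     # read enable asserted
```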
In the embodiment shown in FIG. 6, the first logic unit 601 comprises: a first inverter 603, whose input terminal is connected to the memory-full status flag pin and whose output terminal leads out a first flag terminal for connecting the upstream object; and a first AND gate 604, whose first input terminal is connected to the first data-valid flag terminal, whose second input terminal is connected to the upstream data-valid terminal for connecting the upstream object, and whose output terminal is connected to the writable enable pin. The second logic unit 602 comprises: a second inverter 605, whose input terminal is connected to the memory-empty status flag pin and whose output terminal leads out a downstream data-valid terminal for connecting the downstream object; and a second AND gate 606, whose first input terminal is connected to the downstream data-valid terminal and whose second input terminal is connected to the downstream data-valid flag terminal for connecting the downstream object.
In this embodiment, the FIFO is writable when the writable enable pin (write) is set to "1" and non-writable when it is set to "0"; the memory-full status flag pin (full) is set to "1" when the FIFO is full; the FIFO is readable when the readable enable pin (read) is set to "1" and non-readable when it is set to "0"; the memory-empty status flag pin (empty) is set to "1" when the FIFO is empty.
As shown in the figure, when the memory-full status flag pin (full) is set to "1" because the FIFO is full, the first inverter 603 outputs "0" to one input of the first AND gate 604, so that the first AND gate 604 outputs "0" and the FIFO is non-writable; when the memory-empty status flag pin (empty) is set to "1" because the FIFO is empty, the second inverter 605 outputs "0" to one input of the second AND gate 606, so that the second AND gate 606 outputs "0" and the FIFO is non-readable.
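Reduced to Boolean form, each logic unit is nothing more than an inverted status flag ANDed with the peer's valid signal, which a two-function sketch captures:

```python
def write_enable(full_flag, upstream_valid):
    """First logic unit: inverter on 'full', ANDed with upstream valid."""
    return (not full_flag) and upstream_valid

def read_enable(empty_flag, downstream_valid):
    """Second logic unit: inverter on 'empty', ANDed with downstream valid."""
    return (not empty_flag) and downstream_valid
```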
Of course, in other embodiments, the digital voltage values that represent the states of the pins can be substituted, for example enabling on "0", and are not limited to the above embodiment; in addition, the first logic unit 601 and the second logic unit 602 can also adopt other logic operation devices and are not limited to FIG. 6.
Through the connection of the components of the artificial intelligence processor of the present invention by this first-in first-out data interface, the pipeline process that realizes the data processing can be controlled.
In summary, the present invention provides an artificial intelligence processor and a processing method applied thereto, the artificial intelligence processor comprising: a programmable logic circuit, which comprises: a convolution processing module communicatively connected to an external memory, wherein the external memory stores first data to be processed and a first weight parameter; the convolution processing module comprises: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer; the artificial intelligence processor implements a convolutional neural network algorithm with hardware logic circuits, solving the problems of the prior art.
The present invention effectively overcomes various shortcomings in the prior art and has high industrial utilization value.
The above embodiments only exemplarily illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology can modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes completed by those with ordinary knowledge in the technical field without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (20)

  1. An artificial intelligence processor, characterized by comprising:
    a programmable logic circuit, which comprises:
    a convolution processing module communicatively connected to an external memory, wherein the external memory stores first data to be processed and a first weight parameter; the convolution processing module comprises: a first parameter cache, a first input buffer, a convolution operation circuit, and a first output buffer;
    the first parameter cache, for reading and outputting the first weight parameter;
    the first input buffer, for reading and outputting the first data to be processed;
    the convolution operation circuit, for reading the first data to be processed from the first input buffer and the first weight parameter from the first parameter cache, so as to perform a convolution operation and output a convolution operation result;
    the first output buffer, for receiving the convolution operation result and outputting the convolution operation result to the external memory.
  2. The artificial intelligence processor according to claim 1, characterized in that the first input buffer and/or the first parameter cache comprise: a plurality of connected line buffers for reading and outputting the first data to be processed and/or the first weight parameter; wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data.
  3. The artificial intelligence processor according to claim 1, characterized in that the convolution processing module further comprises: a pooling operation circuit for pooling the convolution operation result and outputting it to the external memory.
  4. The artificial intelligence processor according to claim 1, characterized in that the programmable logic portion further comprises: a fully connected operation circuit for classifying and outputting according to the convolution operation result.
  5. The artificial intelligence processor according to claim 1, characterized by comprising: a first DMA communicatively connected between the external data memory and the convolution processing module.
  6. The artificial intelligence processor according to claim 1, characterized in that the internal components of the convolution processing module, and the convolution processing module and the external memory, are connected through first-in first-out data interfaces.
  7. The artificial intelligence processor according to claim 1, characterized by further comprising: a processing system circuit, which comprises: a central processing module for configuring the operating parameters of the convolution processing module in the programmable logic circuit.
  8. The artificial intelligence processor according to claim 1, characterized in that the first data to be processed contains a plurality of channel data; the first weight parameter contains multiple layers of sub-parameters, each layer of sub-parameters corresponding one-to-one to a respective channel data; there are a plurality of the convolution operation circuits, for computing the convolution operation results of the respective channel data in parallel in one-to-one correspondence; and/or there are a plurality of the first weight parameters and a plurality of the convolution operation circuits, for computing in parallel, in one-to-one correspondence, the convolution operation results of each layer of sub-parameters of the plurality of first weight parameters with the respective channel data.
  9. The artificial intelligence processor according to claim 1, characterized in that the programmable logic circuit further comprises:
    a deconvolution processing module communicatively connected to the external memory, wherein the external memory stores second data to be processed and a second weight parameter; the deconvolution processing module comprises: a second parameter cache, a second input buffer, a deconvolution operation circuit, and a second output buffer;
    the second parameter cache, for reading and outputting the second weight parameter;
    the second input buffer, comprising: a plurality of connected line buffers for reading and outputting the second data to be processed; wherein, for each bit of data output by each line buffer, the outputs are assembled into one column of data;
    the deconvolution operation circuit, for reading the second data to be processed from the second input buffer and the second weight parameter from the second parameter cache, so as to perform a deconvolution operation and output a deconvolution operation result;
    the second output buffer, for receiving the deconvolution operation result and outputting the deconvolution operation result to the external memory.
  10. The artificial intelligence processor according to claim 9, characterized by comprising: a shared cache serving as both the first input buffer and the second input buffer, through which the convolution operation circuit and the deconvolution operation circuit transfer the data received from the external memory in a time-division multiplexed manner.
  11. The artificial intelligence processor according to claim 9, characterized by comprising: a second DMA communicatively connected between the external data memory and the deconvolution processing module.
  12. The artificial intelligence processor according to claim 9, characterized in that the internal components of the deconvolution processing module, and the deconvolution processing module and the external memory, are connected through first-in first-out data interfaces.
  13. The artificial intelligence processor according to claim 9, characterized by further comprising: a processing system circuit, which comprises: a central processing module for configuring the operating parameters of the convolution processing module and the deconvolution processing module in the programmable logic circuit.
  14. The artificial intelligence processor according to claim 9, characterized in that the type of the second data to be processed includes the convolution operation result.
  15. The artificial intelligence processor according to claim 6 or 12, characterized in that the first-in first-out data interface comprises:
    a first-in first-out memory, comprising: an upstream writable enable pin, a data input pin, and a memory-full status flag pin; and a downstream readable enable pin, a data output pin, and a memory-empty status flag pin;
    a first logic unit connected to the upstream object, the writable enable pin, and the memory-full status flag pin, for determining, upon receiving a write request from the upstream object, whether the first-in first-out memory is full according to the signal on the memory-full status flag pin; if not full, an enable signal is sent to the writable enable pin to make the first-in first-out memory writable; otherwise the first-in first-out memory is made non-writable;
    a second logic unit connected to the downstream object, the readable enable pin, and the memory-empty status flag pin, for determining, upon receiving a read request from the downstream object, whether the first-in first-out memory is empty according to the signal on the memory-empty status flag pin; if not empty, an enable signal is sent to the readable enable pin to make the first-in first-out memory readable; otherwise the first-in first-out memory is made non-readable.
  16. The artificial intelligence processor according to claim 15, characterized in that the first logic unit comprises: a first inverter, whose input terminal is connected to the memory-full status flag pin and whose output terminal leads out a first flag terminal for connecting the upstream object; and a first AND gate, whose first input terminal is connected to the first data-valid flag terminal, whose second input terminal is connected to the upstream data-valid terminal for connecting the upstream object, and whose output terminal is connected to the writable enable pin; the second logic unit comprises: a second inverter, whose input terminal is connected to the memory-empty status flag pin and whose output terminal leads out a downstream data-valid terminal for connecting the downstream object; and a second AND gate, whose first input terminal is connected to the downstream data-valid terminal and whose second input terminal is connected to the downstream data-valid flag terminal for connecting the downstream object.
  17. The artificial intelligence processor according to claim 7 or 13, characterized in that the type of the central processor includes: an MCU, an SoC, an FPGA, or a DSP.
  18. An artificial intelligence processing method, characterized in that it is applied to the artificial intelligence processor according to any one of claims 1 to 17; the method comprises:
    reading first data to be processed and a first weight parameter from an external memory;
    performing a convolution operation according to the first data to be processed and the first weight parameter and outputting a convolution operation result;
    outputting the convolution operation result to the external memory.
  19. An artificial intelligence processing method, characterized in that it is applied to the artificial intelligence processor according to claim 9, the method comprising:
    reading second data to be processed and a second weight parameter from an external memory;
    performing a deconvolution operation according to the second data to be processed and the second weight parameter and outputting a deconvolution operation result;
    outputting the deconvolution operation result to the external memory.
  20. The artificial intelligence processing method according to claim 19, characterized in that the type of the second data to be processed includes the convolution operation result.
PCT/CN2018/072676 2018-01-15 2018-01-15 Artificial intelligence processor and processing method applied thereto WO2019136762A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880002767.4A priority Critical patent/CN109564638B/zh 2018-01-15 Artificial intelligence processor and processing method applied thereto
PCT/CN2018/072676 priority patent/WO2019136762A1/zh 2018-01-15 Artificial intelligence processor and processing method applied thereto

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072676 WO2019136762A1 (zh) 2018-01-15 2018-01-15 Artificial intelligence processor and processing method applied thereto

Publications (1)

Publication Number Publication Date
WO2019136762A1 true WO2019136762A1 (zh) 2019-07-18

Family

ID=65872638

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072676 WO2019136762A1 (zh) 2018-01-15 2018-01-15 Artificial intelligence processor and processing method applied thereto

Country Status (2)

Country Link
CN (1) CN109564638B (zh)
WO (1) WO2019136762A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111857989A * 2020-06-22 2020-10-30 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on the artificial intelligence chip
CN111914996A * 2020-06-30 2020-11-10 华为技术有限公司 Method for extracting data features and related apparatus
CN112308762A * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device
CN112349419A * 2020-08-27 2021-02-09 北京颢云信息科技股份有限公司 Real-world research method based on medical data and artificial intelligence
US20210081770A1 (en) * 2019-09-17 2021-03-18 GOWN Semiconductor Corporation System architecture based on soc fpga for edge artificial intelligence computing

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992225B * 2019-04-04 2022-02-22 中科寒武纪科技股份有限公司 Data output method and related device
CN110110850A * 2019-04-29 2019-08-09 山东浪潮人工智能研究院有限公司 FPGA-based implementation method for a forward/backward reusable processing unit
CN110928216B * 2019-11-14 2020-12-15 深圳云天励飞技术有限公司 Artificial intelligence device
US11706076B2 (en) 2020-01-23 2023-07-18 Novnet Computing System Tech Co., Ltd. Computer system with computing devices, communication device, task processing device
CN110928693B * 2020-01-23 2021-01-15 飞诺门阵(北京)科技有限公司 Computing device and resource allocation method
CN111343106B * 2020-02-25 2023-03-24 母国标 Multi-channel intermediate-frequency digital signal processing device and method
CN111752887B * 2020-06-22 2024-03-15 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on the artificial intelligence chip
CN111813721B * 2020-07-15 2022-09-09 深圳鲲云信息科技有限公司 Neural network data processing method, apparatus, device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217263A1 (en) * 2015-01-23 2016-07-28 Panasonic Intellectual Property Management Co., Ltd. Image processing apparatus, image processing method, image display system, and storage medium
CN106530227A * 2016-10-27 2017-03-22 北京小米移动软件有限公司 Image restoration method and device
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN107480782A * 2017-08-14 2017-12-15 电子科技大学 On-chip learning neural network processor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3166075B1 (en) * 2015-11-05 2020-08-05 Facebook, Inc. Systems and methods for processing content using convolutional neural networks
CN106597920B * 2016-11-16 2019-07-26 西安电子科技大学 Control system for controlling an HPI interface based on a NIOS embedded processor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217263A1 (en) * 2015-01-23 2016-07-28 Panasonic Intellectual Property Management Co., Ltd. Image processing apparatus, image processing method, image display system, and storage medium
CN106530227A * 2016-10-27 2017-03-22 北京小米移动软件有限公司 Image restoration method and device
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN107480782A * 2017-08-14 2017-12-15 电子科技大学 On-chip learning neural network processor

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081770A1 (en) * 2019-09-17 2021-03-18 GOWN Semiconductor Corporation System architecture based on soc fpga for edge artificial intelligence computing
US11544544B2 (en) * 2019-09-17 2023-01-03 Gowin Semiconductor Corporation System architecture based on SoC FPGA for edge artificial intelligence computing
CN111857989A * 2020-06-22 2020-10-30 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on the artificial intelligence chip
CN111857989B * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on the artificial intelligence chip
CN111914996A * 2020-06-30 2020-11-10 华为技术有限公司 Method for extracting data features and related apparatus
CN112349419A * 2020-08-27 2021-02-09 北京颢云信息科技股份有限公司 Real-world research method based on medical data and artificial intelligence
CN112308762A * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device

Also Published As

Publication number Publication date
CN109564638B (zh) 2023-05-26
CN109564638A (zh) 2019-04-02

Similar Documents

Publication Publication Date Title
WO2019136762A1 (zh) Artificial intelligence processor and processing method applied thereto
CN109284817B (zh) 深度可分离卷积神经网络处理架构/方法/系统及介质
JP6857286B2 (ja) ニューラルネットワークアレイの性能の改善
TWI811291B (zh) 深度學習加速器及加快深度學習操作的方法
WO2019201656A1 (en) Method for accelerating operations and accelerator apparatus
WO2019201657A1 (en) Accelerator and system for accelerating operations
WO2020199476A1 (zh) Systolic-array-based neural network acceleration method and apparatus, computer device, and storage medium
EP3161793B1 (en) Adaptive partition mechanism with arbitrary tile shape for tile based rendering gpu architecture
US11232360B1 (en) Lossless tiling in convolution networks—weight gradient calculation
US9449131B2 (en) Extracting system architecture in high level synthesis
JPWO2019216376A1 (ja) 演算処理装置
US11195080B1 (en) Lossless tiling in convolution networks—tiling configuration
WO2023045445A1 (zh) Data processing apparatus, data processing method, and related products
WO2023123919A1 (zh) Data processing circuit, data processing method, and related products
CN110738317A (zh) FPGA-based deformable convolution network operation method, apparatus, and system
US20240168913A1 (en) Lossless tiling in convolution networks - tiling configuration between two sections
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN108701102A (zh) Direct memory access controller, data reading method, and data writing method
US20200082898A1 (en) Multi-level memory hierarchy
WO2019136747A1 (zh) Deconvolver and artificial intelligence processing device applied thereto
CN111382856B (zh) Data processing apparatus, method, chip, and electronic device
CN110929854B (zh) Data processing method and apparatus, and hardware accelerator
CN106383936B (zh) FPGA memory splitting method
US11227207B1 (en) Lossless tiling in convolution networks—section boundaries
WO2023087698A1 (zh) Computing apparatus and method for performing convolution operations, and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18900070

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18900070

Country of ref document: EP

Kind code of ref document: A1