CN109739556A - A general-purpose deep learning processor based on multi-parallel cache interaction and computation - Google Patents

A general-purpose deep learning processor based on multi-parallel cache interaction and computation

Info

Publication number
CN109739556A
CN109739556A
Authority
CN
China
Prior art keywords
module
unit
instruction
data
gating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811528451.7A
Other languages
Chinese (zh)
Other versions
CN109739556B (en)
Inventor
禹霁阳
汪路元
程博文
李宗凌
刘伟伟
牛跃华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Spacecraft System Engineering
Original Assignee
Beijing Institute of Spacecraft System Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Spacecraft System Engineering filed Critical Beijing Institute of Spacecraft System Engineering
Priority to CN201811528451.7A priority Critical patent/CN109739556B/en
Publication of CN109739556A publication Critical patent/CN109739556A/en
Application granted granted Critical
Publication of CN109739556B publication Critical patent/CN109739556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)

Abstract

The invention relates to a general-purpose deep learning processor based on multi-parallel cache interaction and computation. It targets the frequent multiply-accumulate operations and frequent parameter accesses that arise when deep learning networks compute convolutions and fully connected layers. Using vectorized multi-cache parallel interactive computation, it reduces fragmented data access by sharing parameters and re-retrieves results from a cache of already-executed instructions. This raises the parallelism of the computation process, improves the efficiency of accesses that repeat the same instruction and the same parameters, reduces the occupation of the hardware floating-point calculators by repeated computation instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction/data stream, and improves the real-time performance of deep learning network computation.

Description

A general-purpose deep learning processor based on multi-parallel cache interaction and computation
Technical field
The present invention relates to general-purpose deep learning processors based on multi-parallel cache interaction and computation, and in particular to the general-purpose acceleration of the convolution and fully connected computations in large deep learning networks.
Background art
Deep learning networks play a significant role in autonomous detection and recognition, judgement and prediction, and other pattern-recognition tasks in artificial intelligence. However, a deep learning network must carry out a large number of matrix and vector computations simultaneously; the computational load of the algorithms is very large and the rate of fragmented parameter access is high, which places demanding requirements on the processor architecture. At the same time, in embedded applications, and especially in aerospace, constraints on power consumption, volume, and area rule out most commercial GPU processors and microprocessors.
In addition, existing commercial GPU computers and microprocessors are largely based on computation over parallel register files. Even a processor such as the Titan 1080p, which has vector data-movement operations, drives thousands of internal nodes in parallel over each datum during computation; this occupies enormous hardware resources and also consumes a great deal of power. Among other accelerators, the Cambricon chip must route similar input/output results through its MLU modules during processing, completing the selected input/output operations according to the instruction type determined during decoding. This acceleration scheme operates through multiple modules in parallel and combines HotBuf and ColdBuf to reduce or merge convolution computations and operations that share the same parameters. Such an architecture can effectively improve convolution-kernel performance, but it does not necessarily perform well on unstructured deep learning networks. The AI processor designed by DeePhi Technology features variable hardware resources; by means of NPU nodes it optimizes the computation flow at compile time and compresses parameters to reduce the computational load. However, this only reaches optimal speed when the program structure is matched accordingly, and in practical large convolution computations the intermediate data are usually called by other processes, so it is difficult to achieve real optimization across all complex deep networks. In particular, DeePhi holds that simplifying the number system used in computation can effectively solve the problem of enormous computational load; in practice, 8-16 bit quantization of the key nodes of a deep learning network aimed at small targets can have disastrous effects.
At present, algorithms for accelerating deep convolutional networks can only be implemented through the parallel computation of multiple computing units. Although commercial GPU processors and dedicated IP cores exist, they are expensive and structurally complex, and remain far from miniaturized embedded applications. Designing a general-purpose deep learning processor based on multi-parallel cache interaction and computation can effectively meet the urgent needs of current low-power, miniaturized embedded AI processor development.
Summary of the invention
The technical problem solved by the invention is as follows: in the prior art, the convolution computation process in large learning networks is complex, power consumption is high, and fragmented repeated accesses and the computational load are large. A general-purpose deep learning processor based on multi-parallel cache interaction and computation is therefore proposed.
The present invention solves the above technical problem through the following technical solution:
A general-purpose deep learning processor based on multi-parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, in which:
Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-computed data signal arriving from the external bus; it merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module, and updates its data according to the register update signal returned by the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module and, according to the first gating instruction signal sent by the instruction processing module, selects among the cached data in the second control signal for output; after gating control over the to-be-computed data, it outputs a third control signal containing the operands required by the current computation and sends it to the data computation module, while sending a register update signal back to the cache forwarding module;
Data computation module: receives the third control signal sent from the first gating module, computes on the to-be-computed data using the operands required by the current computation contained in the third control signal, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and performs secondary gating control according to the second gating instruction signal sent by the instruction processing module, forwarding the fixed-point, floating-point, or logic computation result to the third gating module and to the executed-instruction cache module;
Executed-instruction cache module: receives the computation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision signal, containing the instruction-conflict retrieval result, that determines whether some other computation is still in progress;
Third gating module: receives the computation result sent by the second gating module and, according to the third gating instruction signal sent by the instruction processing module and the conflict-retrieval result sent by the executed-instruction cache module, generates the instruction update signal with which the cache forwarding module performs instruction updates, and sends the instruction update signal to the instruction processing module, forming a computation closed loop;
Instruction processing module: receives the instruction update signal, used to select the instruction to be executed, sent by the third gating module, generates the corresponding instruction code, and decodes it to output the first control signal; the first control signal obtained after decoding, used to select the corresponding register, is sent to the cache forwarding module, while the first, second, and third gating instruction signals are generated and sent respectively to the first, second, and third gating modules to gate the output data.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, in which:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, and sends the resulting control signal to the cache forwarding module, while generating the first, second, and third gating instruction signals and sending them respectively to the first, second, and third gating modules to gate the output data.
The cache forwarding module comprises a register bank unit, an internal cache unit, and a peripheral interface control unit, in which:
Register bank unit: receives the decoded first control signal sent by the instruction processing module, selects the corresponding register, and sends the register data in the control signal to the internal cache unit;
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data, containing the to-be-computed data, to the internal cache unit;
Internal cache unit: receives the register data sent by the register bank unit and the external signal, containing the to-be-computed data, sent by the peripheral interface control unit, and sends all the signal data to the first gating module.
The data computation module comprises a fixed-point computation unit, a floating-point computation unit, and a logic computation unit, in which:
Fixed-point computation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point computation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic computation unit: receives the third control signal sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
The floating-point computation unit comprises floating-point computation sub-units, interaction control sub-units, and bus interaction sub-units, in which:
Floating-point computation sub-unit: receives the third control signal sent from the first gating module and performs floating-point computation; it also receives read instructions from the interaction control sub-unit, which reads out the floating-point result and sends it to the second gating module;
Interaction control sub-unit: when a floating-point computation sub-unit begins computing, sends read instructions through the bus interaction sub-unit to the corresponding or adjacent floating-point computation sub-units, and sends the computation result to the second gating module;
Bus interaction sub-unit: carries the instruction and data interaction of the floating-point computation sub-units and the interaction control sub-units.
The floating-point computation sub-unit comprises a floating-point pipelined vector calculator, a left-operand data-stream memory FIFO, a right-operand data-stream memory FIFO, an output-result data-stream memory FIFO, a left-operand register RL, a right-operand register RR, and an output-result register RX, in which:
Left-operand register RL: receives the interaction control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Right-operand register RR: receives the interaction control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Left/right-operand data-stream memory FIFO: according to the interaction control signal output by the bus interaction sub-unit, caches the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined computation;
Floating-point pipelined vector calculator: computes on the written data and sends the result to the output-result register RX and the output-result data-stream memory FIFO;
Output-result register RX: receives the result of the floating-point pipelined vector calculator, caches it, and then examines the operands of the computation: if an operand is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output-result data-stream memory FIFO: receives the results of the floating-point pipelined vector calculator and caches them; if the accumulated value of the data-stream FIFO is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise it is output to the external bus immediately.
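The zero short-circuit performed by the output-result register RX can be sketched as follows. This is a minimal behavioural model of the rule stated above (a zero operand under multiplication or division bypasses the pipeline result); the function and operand names are illustrative, not taken from the patent:

```python
from enum import Enum

class Op(Enum):
    ADD = 0
    MUL = 1
    DIV = 2
    SIGMOID = 3

def rx_output(op: Op, left: float, right: float, pipeline_result: float) -> float:
    """Model of the output-result register RX: if either operand is 0 and the
    operation is multiplication or division, record the result as 0 and send
    it to the external bus; otherwise forward the pipeline result as-is."""
    if op in (Op.MUL, Op.DIV) and (left == 0.0 or right == 0.0):
        return 0.0
    return pipeline_result
```

In the sparse parameter sets typical of deep networks this lets many multiplications complete without occupying the floating-point calculator at all.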
The interaction control sub-unit comprises an input interactive access control module, a control-signal decoding module, and an interaction output module, in which:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the resulting data and control signals to the interaction output module;
Control-signal decoding module: receives the external computation instruction sent by the first gating module, decodes it, and sends the resulting control signal to the interaction output module;
Interaction output module: sends the output data of the input interactive access control module or the control-signal decoding module to the second gating module through the bus interaction sub-unit.
There are four floating-point pipelined vector calculators.
A general-purpose deep network computation method with multi-parallel computation and caching proceeds as follows:
(1) An instruction is selected and output from the instruction pool according to the generated address pointer; after the instruction is decoded, the first control signal is output, and the first, second, and third gating instruction signals are generated for data gating;
(2) The data in the first control signal are cached; the to-be-computed data are obtained over the external access bus, merged with the data in the first control signal, and output as the second control signal;
(3) Gating control is applied to the second control signal using the first gating instruction signal, the to-be-computed data are computed according to the gated third control signal, and the cached data are updated with the computation result;
(4) The computation result is gated out according to the second gating instruction signal, while parallel-instruction conflict retrieval is performed and the conflict-retrieval result is output;
(5) According to the third gating instruction signal, the computation result, and the conflict-retrieval result, the register update signal is sent to update the data cache, while the instruction address is updated, forming a computation closed loop.
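Steps (1)-(5) amount to a fetch-decode-gate-compute-update loop. The following is a behavioural sketch of that closed loop only; the instruction tuple layout and cache keys are hypothetical conveniences, not part of the patent:

```python
def run_closed_loop(instruction_pool, cache, steps):
    """Behavioural sketch of the five-step computation closed loop: fetch an
    instruction by address pointer, gate the cached operands, compute, then
    update the data cache and the instruction address."""
    pc = 0  # instruction address pointer, step (1)
    for _ in range(steps):
        op, a_key, b_key, dst = instruction_pool[pc]   # (1) fetch and decode
        a, b = cache[a_key], cache[b_key]              # (2)-(3) gate operands
        result = {"add": a + b, "mul": a * b}[op]      # (3) compute
        cache[dst] = result                            # (5) data cache update
        pc = (pc + 1) % len(instruction_pool)          # (5) address update
    return cache
```

The return of the instruction update signal to the instruction processing module corresponds to the `pc` update that closes the loop.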
Compared with the prior art, the present invention has the following advantages:
(1) The general-purpose deep learning processor based on multi-parallel cache interaction and computation provided by the invention addresses the large vector computation load and high computational redundancy of the deep learning convolutional network computation process. By caching already-executed instructions and generating an instruction's result quickly through a search of that cache, it reduces the occupation of the hardware floating-point calculators by repeated computation instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction/data stream, and raises the speed of vector and matrix computation in convolution-kernel computation;
(2) For the high data redundancy between node computations in fully connected computation, where instruction fetch and decode are comparatively fast relative to convolution, the invention proposes a floating-point computation unit that guarantees parallel, interactive, pipelined computation of the incoming floating-point instructions: at the same moment it can complete vector computations of four kinds of operation (floating-point addition, multiplication, division, and Sigmoid) in parallel, reducing the rate of fragmented parameter access.
Description of the drawings
Fig. 1 is a structural schematic diagram of the floating-point computation unit provided by the invention;
Fig. 2 is a structural schematic diagram of the floating-point computation sub-unit provided by the invention;
Fig. 3 is a structural schematic diagram of the interaction control sub-unit provided by the invention;
Fig. 4 is a structural schematic diagram of the computing-unit bus interaction control module provided by the invention;
Fig. 5 is a structural schematic diagram of the executed-instruction cache module provided by the invention;
Fig. 6 is the internal block diagram of the processor provided by the invention.
Specific embodiments
The present invention relates to general-purpose deep learning processors based on multi-parallel cache interaction and computation, in particular to the frequent multiply-accumulate computations in large deep convolutional networks and fully connected networks.
The present invention will be further described with reference to the accompanying drawings.
The invention mainly comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module. The main purpose of the invention is to execute the convolution and fully connected computations in large deep learning networks quickly, so as to complete them in real time. As shown in Fig. 6, after an instruction update, the instruction pool unit outputs the instruction code according to the instruction address generation unit and passes it to the decoding unit, which outputs the first control signal. The first control signal enters the register bank unit in the cache forwarding module, which also receives the register update and cache update signals; together with the output data of the peripheral interface control unit it feeds the internal cache unit, and the internal cache unit outputs the second control signal to the first gating module. The first gating module, according to the second control signal and the first gating instruction signal, outputs the register update, peripheral update, and third control signals to the register bank unit, the peripheral interface control unit, and the data computation module respectively; after computation, the fixed-point, floating-point, and logic results are sent to the second gating module. The second gating module receives the fixed-point, floating-point, and logic results together with the second gating instruction signal, and sends the parallel-instruction conflict retrieval signal to the third gating module and the executed-instruction cache module. The executed-instruction cache module receives the parallel-instruction conflict retrieval signal, judges whether a conflicting or already-executed instruction exists, and returns the conflict-retrieval result to the third gating module. The third gating module receives the parallel-instruction conflict retrieval signal, the conflict-retrieval result, and the third gating instruction signal, and outputs the register/cache/peripheral update signals and the instruction update signal to the instruction processing module.
Whether vector computation can be used is judged from the instruction; floating-point numbers in the computation are handled by the floating-point computation unit shown in Fig. 1. The module contains four four-input/output RAM caches, floating-point computation sub-units for addition, multiplication, division, and Sigmoid respectively, and four bus interaction sub-units for sharing and effectively distributing data. Floating-point data are judged by the floating-point computation sub-unit as to which four-input/output RAM cache they are written into, and the instruction code determines whether that four-input/output RAM cache needs to be backed up to its two adjacent four-input/output RAM caches. The floating-point vector data in the four-input/output RAM caches are then taken out and streamed into the pipelined input/output vector floating-point calculator, and the instruction code determines whether the output data need to be backed up into the two adjacent bus interaction sub-units; if so, the computing-unit bus interaction control transfers or backs up the computed data to the other computing modules.
The floating-point computation sub-unit, as shown in Fig. 2, comprises the left/right-operand data-stream memory FIFOs, the output-result data-stream memory FIFO, the left-operand register RL, the right-operand register RR, and the output-result register RX. The module first checks whether the external four-input/output RAM has data to process. If there are data to process and they are scalar, the left operand is read into the left-operand register RL and the right operand into the right-operand register RR; after one stage of latching they are fed into the floating-point pipelined calculator, which performs the addition/multiplication/division/Sigmoid operation, and the result is placed in the output-result register RX, where it waits for the internal bus interface to take it out and write it into the four-input/output RAM. If there are data to process and they are a vector, the left operands are read continuously into the left-operand data-stream memory FIFO and the right operands into the right-operand data-stream memory FIFO, while being streamed into the floating-point pipelined calculator, which performs the addition/multiplication/division/Sigmoid operation; the results are placed in the output-result data-stream memory FIFO and wait for the internal bus interface to take them out and write them into the four-input/output RAM.
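The scalar/vector split described above can be sketched as follows. This is a behavioural model only, with Python deques standing in for the hardware FIFOs; the function name and operation table are illustrative assumptions, not from the patent:

```python
from collections import deque

def fp_subunit(op, left, right):
    """Model of the floating-point computation sub-unit: scalar operands go
    through the RL/RR register path; vector operands stream through the
    left/right FIFOs into the pipelined calculator element by element."""
    calc = {"add": lambda a, b: a + b,
            "mul": lambda a, b: a * b}[op]
    if isinstance(left, (int, float)):           # scalar path: RL/RR -> RX
        return calc(left, right)
    fifo_l, fifo_r = deque(left), deque(right)   # vector path: operand FIFOs
    out_fifo = deque()                           # output-result FIFO
    while fifo_l:                                # pipelined streaming
        out_fifo.append(calc(fifo_l.popleft(), fifo_r.popleft()))
    return list(out_fifo)
```

Division and Sigmoid would follow the same two paths; only addition and multiplication are modelled here to keep the sketch short.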
The interaction control sub-unit, as shown in Fig. 3, is mainly used to distribute the computation information in the externally input third control signal to each computing unit and cache. It comprises the input interactive access control module, the control-signal decoding module, and the interaction output module. The interaction output module is subdivided into internal modules for floating-point addition/multiplication interaction output judgement, floating-point addition/division interaction output judgement, floating-point division/Sigmoid interaction output judgement, floating-point multiplication/Sigmoid interaction output judgement, floating-point addition register/FIFO interaction output judgement, floating-point multiplication register/FIFO interaction output judgement, floating-point Sigmoid register/FIFO interaction output judgement, and floating-point division register/FIFO interaction output judgement. First, the input information is obtained through the external third control signal and the instruction is judged to be a scalar computation instruction or a vector computation instruction; data are then output to the corresponding four-input/output RAM, while it is judged whether the instruction can interact, and the interaction control information for the computing-unit bus interaction control module is generated from the interaction information. Then the computation result in the four-input/output RAM is judged to be scalar data or vector data, and the result is taken out and handed to the instruction execution module.
The computing-unit bus interaction control module, as shown in Fig. 4, is mainly used for data interaction between the four input/output caches. According to the interaction control information, it judges whether the current four-input/output cache X and four-input/output cache Y need to interact. The interaction modes are: X and Y are exchanged; X is backed up to Y; Y is backed up to X; X and Y remain unchanged. The delay information of the exchange process is sent to the pipelined input/output vector floating-point calculator modules X and Y.
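The four interaction modes can be captured by a small state function. This is a minimal model of the mode table above; the `Mode` names are chosen here for illustration and do not appear in the patent:

```python
from enum import Enum

class Mode(Enum):
    EXCHANGE = 0   # X and Y are exchanged
    X_TO_Y = 1     # X is backed up to Y
    Y_TO_X = 2     # Y is backed up to X
    KEEP = 3       # X and Y remain unchanged

def interact(mode: Mode, x: list, y: list) -> tuple:
    """Model of the bus interaction control module acting on caches X and Y;
    returns the new (X, Y) contents after the selected interaction."""
    if mode is Mode.EXCHANGE:
        return y[:], x[:]
    if mode is Mode.X_TO_Y:
        return x[:], x[:]
    if mode is Mode.Y_TO_X:
        return y[:], y[:]
    return x[:], y[:]
```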
The executed-instruction cache module, as shown in Fig. 5, is mainly used for the quick execution of not-yet-executed instructions. Executed instructions are kept, together with their output results, in the executed-instruction fragment cache; the fragment cache uses a new-fragment-overwrites-old-fragment policy. The module receives the parallel-instruction conflict retrieval output by the second gating module and searches the fragment cache for the instruction to be executed in the instruction pool. If there is a matching result, the computation result in the search result is taken out directly, the instruction no longer enters the computing module, and the fragment is deleted from the cache. If the instruction has already entered the execution stage during the search, the search process has generated a conflict; the search for that instruction is stopped and the search process for the next instruction begins. The conflict-retrieval result is sent to the third gating module.
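The fragment cache behaves like a small memoization table with consume-on-hit and overwrite-on-insert. A minimal sketch follows; using (opcode, operands) as the lookup key is an assumption made here for illustration, not a detail specified by the patent:

```python
class ExecutedInstructionCache:
    """Fragment cache of executed instructions: a hit returns the stored
    result and deletes the fragment, so the instruction never enters the
    computing module; storing the same instruction again overwrites the
    old fragment (new-overwrites-old policy)."""

    def __init__(self):
        self._fragments = {}

    def store(self, op, operands, result):
        # New fragment overwrites any old fragment for the same instruction.
        self._fragments[(op, tuple(operands))] = result

    def lookup(self, op, operands):
        """Return (hit, result); on a hit the fragment is consumed."""
        key = (op, tuple(operands))
        if key in self._fragments:
            return True, self._fragments.pop(key)
        return False, None
```

A conflict (the instruction entering execution mid-search) would simply abort the `lookup` in hardware; that timing aspect is not modelled here.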
The structure of the learning processor provided by the invention is as follows:
A general-purpose deep learning processor based on multi-parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, in which:
Instruction processing module: receives the parallel-instruction signal or single-instruction signal, used to select the instruction to be executed, sent by the third gating module, generates the corresponding instruction code, and decodes it for output; the first control signal obtained after decoding, containing the operand control address used to select the corresponding register, is sent to the cache forwarding module, while the first, second, and third gating signals are generated and sent respectively to the first, second, and third gating modules;
The first, second, and third gating signals are multi-bit signals used to select the outputs of the first, second, and third gating modules. The two high bits of the first gating signal select the output of the first gating module: when they are '00' the data issued by the register bank unit are output; when '01', the data issued by the internal cache unit; when '10', the output data of the peripheral interface control unit. The two high bits of the second gating signal select the output of the second gating module: when they are '00' the fixed-point result is output; when '01', the floating-point result; when '10', the logic result. The high bit of the third gating signal selects the output of the third gating module: when it is '0' the output of the second gating module is output; otherwise the executed-instruction search result is output.
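The bit-field decoding above amounts to three multiplexers. A minimal sketch of one such multiplexer under the bit assignments just described (the function name and dictionary keys are illustrative):

```python
def gate(select_bits: str, inputs: dict):
    """Multiplexer model of the first gating module: the two high bits of
    the gating signal choose which input source drives the output."""
    table = {"00": "register_bank",   # data from the register bank unit
             "01": "internal_cache",  # data from the internal cache unit
             "10": "peripheral"}      # data from the peripheral interface
    return inputs[table[select_bits]]
```

The second gating module is the same structure over {fixed-point, floating-point, logic} results, and the third gating module is a one-bit version over {gated result, executed-instruction search result}.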
Cache forwarding module: receives the register update signal, the first control signal, the external bus access data, and the peripheral update signal, selects the corresponding register and performs internal caching, and sends the second control signal, containing the computation data address, to the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module; the second control signal contains the output data information of the register bank unit, the internal cache unit, and the peripheral interface unit, which enter the first gating module merged as the second control signal; which of the three second-control-signal sources is finally output is judged according to the first gating signal output by the decoding unit. After gating control over the to-be-computed data, the third control signal, containing the data required by the current computation, is output and sent to the data computation module and the executed-instruction cache module. The first gating signal is the control signal sent by the decoding unit to the first gating module, used to control whether the data of the register, the internal cache, or the peripheral interface are gated through;
Data computation module: receives the third control signal sent from the first gating module; the third control signal contains the data and addresses required by the computation; the resulting fixed-point, floating-point, and logic computation results are sent to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and performs secondary gating control according to the second gating signal output by the decoding unit; the fixed-point, floating-point, or logic result is forwarded to the third gating module and the executed-instruction cache module. The second gating signal is the control signal sent by the decoding unit to the second gating module, used to select which computation type's output result is gated through;
Executed-instruction cache module: receives the parallel-instruction conflict retrieval information gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module the conflict-retrieval result that determines whether some other computation is still in progress;
Third gating module: receives the parallel-instruction conflict retrieval information sent by the second gating module and, according to the third gating instruction signal sent by the decoding unit and the conflict-retrieval result sent by the executed-instruction cache module, outputs the instruction update signal to the instruction processing module, forming a computation closed loop; it sends the register, cache, and peripheral update signals for the update control of the register bank unit, the internal cache unit, and the peripheral interface control unit respectively.
The instruction processing module includes an instruction address generation unit, an instruction pool unit, and a decoding unit, in which:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the corresponding instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, decodes it, and sends the resulting control signal to the cache forwarding module; at the same time it generates the first/second/third gating command signals and sends them to the first/second/third gating modules, respectively.
The cache forwarding module includes a register group unit, an internal buffer unit, and a peripheral interface control unit, in which:
Register group unit: receives the first control signal sent by the decoding unit, selects the corresponding register, and sends the data to the internal buffer unit;
Internal buffer unit: receives and buffers the data sent by the register group unit;
Peripheral interface control unit: receives external-bus access data, manages the data according to the peripheral update control signal, and sends the data to the internal buffer unit.
The data computation module includes a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, in which:
Fixed-point calculation unit: receives the required data and addresses sent from the first gating module, performs fixed-point calculation, and sends the fixed-point calculation result to the second gating module;
Floating-point calculation unit: receives the required data and addresses sent from the first gating module, performs floating-point calculation, and sends the floating-point calculation result to the second gating module;
Logic calculation unit: receives the required data and addresses sent from the first gating module, performs logic calculation, and sends the logic calculation result to the second gating module.
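The dispatch described above, in which one third control signal is routed to the fixed-point, floating-point, or logic unit according to the operation class, can be sketched as follows (the operation mnemonics and the stand-in compute functions are hypothetical, not taken from the patent):

```python
def dispatch(op, left, right):
    """Route an operation to the fixed-point, floating-point or logic unit.

    All three units sit behind the first gating module; only the unit
    whose table contains the opcode produces the forwarded result.
    """
    fixed_ops = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}
    float_ops = {"ADDF": lambda a, b: a + b, "MULF": lambda a, b: a * b}
    logic_ops = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b}
    for unit in (fixed_ops, float_ops, logic_ops):
        if op in unit:
            return unit[op](left, right)
    raise ValueError(f"unknown operation: {op}")
```

In hardware the three units would compute in parallel and the second gating module would pick one result; the sequential lookup here is only a behavioral shorthand.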
The floating-point calculation unit includes a floating-point calculation sub-unit, an interactive control sub-unit, and a bus interaction sub-unit, in which:
Floating-point calculation sub-unit: receives the control signal, operands, and control addresses sent from the first gating module and performs floating-point calculation; it also accepts control instructions from the interactive control sub-unit, which reads the floating-point calculation result and sends it to the second gating module;
Interactive control sub-unit: receives the third control signal sent from the first gating module, generates control instructions, and sends them to the floating-point calculation sub-unit and the bus interaction sub-unit; through the bus interaction sub-unit it directs the corresponding or adjacent floating-point calculation sub-unit to compute synchronously, while exchanging control information with the floating-point calculation sub-unit;
Bus interaction sub-unit: sends control signals to the floating-point calculation sub-unit and the interactive control sub-unit, and carries the information exchange between them.
The floating-point calculation sub-unit includes a floating-point pipelined vector calculator, left/right operand data-stream FIFOs, an output-result data-stream FIFO, left/right operand registers RL/RR, and an output result register RX, in which:
Floating-point pipelined vector calculator: computes on all written input data and sends the results to the output result register RX and the output-result data-stream FIFO;
Left/right operand registers RL/RR: receive the control signal output by the bus interaction sub-unit, buffer the written operand data, and send it to the designated floating-point pipelined vector calculator for calculation;
Left/right operand data-stream FIFOs: according to the control signal output by the bus interaction sub-unit, buffer the written FIFO data values and send them to the designated floating-point pipelined vector calculator for pipelined calculation;
Output result register RX: receives the result of the floating-point pipelined vector calculator, buffers it, and checks the operands: if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output-result data-stream FIFO: receives and buffers the result of the floating-point pipelined vector calculator; if the summed value of the data-stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise it is output to the external bus immediately. This process reduces fragmented data accesses and repeated computation.
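The zero short-circuit rule applied by the output result register RX and the output FIFO, which treats a multiplication or division with a zero operand as an immediate zero result rather than occupying the pipelined calculator, can be sketched as follows (a simplification; the function and mnemonic names are hypothetical, and real hardware would need to treat division by zero as a separate exceptional case):

```python
def float_result(op, left, right):
    """Return the value placed on the external bus for one float operation.

    Following the patent's rule: if an operand is 0 and the operation is
    multiplication or division, the result is recorded as 0 without
    passing through the pipelined vector calculator; otherwise the
    normally computed result is output.
    """
    if op in ("MULF", "DIVF") and (left == 0.0 or right == 0.0):
        return 0.0  # short-circuit: skip the floating-point pipeline
    if op == "MULF":
        return left * right
    if op == "DIVF":
        return left / right
    raise ValueError(f"unsupported operation: {op}")
```

Because convolution and fully-connected layers contain many zero-valued weights and activations, this pre-judgment removes a large share of pipeline occupancy without changing the multiplication results.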
The interactive control sub-unit includes an input interactive access control module, a control-signal decoding module, and an interactive output module, in which:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the resulting first-level operand control signal to the control-signal decoding module;
Control-signal decoding module: receives the first-level operand control signal sent by the input interactive access control module, decodes it again, and sends the resulting second-level operand control signal to the interactive output module;
Interactive output module: according to the instruction decoding result output by the instruction pool unit, decides whether the second-level operand control signal sent by the control-signal decoding module is needed to perform four-bus shared buffering and data reads for the floating-point calculators.
The present invention is further described below with reference to the accompanying drawings and an embodiment.
The present invention relates to a general-purpose deep-learning processor architecture based on multi-parallel cache interaction and computation, aimed especially at the frequent multiply-accumulate operations in computing large-scale deep convolutional networks and fully-connected networks. The implementation steps are explained using the following common convolution-node calculation:
F_{N×N} = E_{N×N} Sigmoid((A_{N×N} B_{N×N} + C_{N×N}) ./ D_{N×N})        (1)
where N is a positive integer, Sigmoid(x) = (1 + e^{-x})^{-1}, and the Sigmoid calculation is abbreviated SIGMF.
A conventional processor evaluates the above formula one number at a time; even processors such as those from Cambricon, or DeePhi's "Tingtao" processor, can only complete it as pipelined scalar operations after code optimization. In the processor architecture of this patent, formula (1) can be completed as matrix computations, in a parallel pipelined fashion, in no more than a few computation cycles. Moreover, coding formula (1) on a conventional processor needs at least five conditional judgments and two nested loops, whereas with the instruction set designed for this processor only five lines of assembly code are required, as follows:
MULF.M  A_{N×N}, B_{N×N}, T_{N×N}, N
ADDF.M  T_{N×N}, C_{N×N}, T_{N×N}, N
DIVF.M  T_{N×N}, D_{N×N}, T_{N×N}, N
SIGMF.M T_{N×N}, T_{N×N}, N
MULF.M  T_{N×N}, E_{N×N}, F_{N×N}, N
After compilation, the five lines of assembly above are converted to machine code and loaded into the instruction pool unit through the external debugging interface and the debugging control unit. The whole processor then operates according to the instruction codes output by the instruction pool unit.
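For reference, the computation performed by the five instructions is the matrix expression of formula (1). A plain-Python sketch of the same sequence follows (T models the intermediate matrix register; since step <9> describes MULF.M on matrices as a matrix product, both multiplications are modeled here as matrix products, which is an interpretation rather than a statement from the patent):

```python
import math

def matmul(X, Y):
    """Matrix product of two equal-size square matrices (lists of lists)."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def elementwise(X, Y, op):
    """Apply op element-by-element to two equal-size matrices."""
    return [[op(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def formula_1(A, B, C, D, E):
    """Reference model of the five-instruction sequence for formula (1)."""
    T = matmul(A, B)                            # MULF.M  A, B, T, N
    T = elementwise(T, C, lambda a, b: a + b)   # ADDF.M  T, C, T, N
    T = elementwise(T, D, lambda a, b: a / b)   # DIVF.M  T, D, T, N  (./)
    T = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in T]  # SIGMF.M
    return matmul(T, E)                         # MULF.M  T, E, F, N
```

Each line of the function corresponds to one assembly instruction; on the patented processor every line is a single pipelined matrix instruction rather than a pair of nested loops.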
<1> Execute the assembly statement MULF.M A_{N×N}, B_{N×N}, T_{N×N}, N: the two matrices at addresses A_{N×N} and B_{N×N} are fetched and stored in four-port input/output RAM D of the floating-point calculation unit, and the matrix-vector calculation parameter N is written to computing-unit bus interaction control module D. With reference to the processor internal structure block diagram of Fig. 6, the operation and calculation flow of this instruction are as follows:
(1) First, the instruction is fetched from the instruction pool unit according to the address output by the instruction address generation unit and sent to the decoding unit;
(2) The decoding unit selects the data input to the cache forwarding module according to the size of N: the peripheral interface control unit fetches the data corresponding to addresses A_{N×N}, B_{N×N}, and T_{N×N}. If N = 1 the instruction is treated as an ordinary scalar multiplication and the data enter the register group unit; if N is not equal to 1 the operation is treated as a vector calculation and the data enter the internal buffer unit;
(3) According to the buffer update signal from the third gating module, the cache forwarding module decides whether the new input data should replace the data already held in the internal buffer unit. Using the past-instruction retrieval result, it checks whether this instruction is identical to the previously executed one; if so, no data update is performed in the register group unit or the internal buffer unit of the cache forwarding module, and the previously stored data of the register group unit or internal buffer unit is output directly to the first gating module;
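The re-retrieval behavior of step (3), skipping the register and buffer update when the incoming instruction is identical to the previously executed one and replaying the stored data instead, amounts to a one-entry memoization of the operand fetch. A sketch (class and parameter names are hypothetical):

```python
class CacheForward:
    """One-entry memo of the last executed instruction and its operands."""

    def __init__(self):
        self.last_instr = None
        self.cached_data = None

    def fetch(self, instr, load_operands):
        """Return operands for `instr`, reloading only when it changed.

        `load_operands` stands in for the (expensive) register-group /
        internal-buffer / peripheral-interface access path.
        """
        if instr != self.last_instr:
            self.cached_data = load_operands(instr)
            self.last_instr = instr
        return self.cached_data  # identical instruction: replay stored data
```

In the repeated convolution-kernel loops of a deep network, consecutive identical instructions are common, so this single-entry check already removes many redundant cache accesses.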
(4) After receiving the data, the first gating module applies data gating control to decide which part of the data computation module the data enter; since this instruction is a floating-point multiplication, the operands enter the floating-point calculation unit;
(5) After the floating-point calculation unit in the data computation module receives the data, it first places them into the interactive control sub-unit, which selects an output channel according to the calculation mode and data length. This instruction is a floating-point vector calculation, so the floating-point multiplication register/FIFO interactive output module of the interactive control writes the operand data to four-port input/output RAM D, and the calculation information enters bus interaction sub-unit D;
(6) From a look-ahead at the next instruction, bus interaction sub-unit D determines that the T_{N×N} data will also be used by the floating-point addition sub-unit; therefore, after the T_{N×N} data are written to four-port input/output RAM D, it directs DMA D to copy the T data backed up in RAM D, together with the A_{N×N} data, to RAM A;
(7) Four-port input/output RAM D feeds the operand data to the floating-point multiplication sub-unit, where the operands are placed in the left/right operand data-stream FIFOs, or in the left/right operand registers RL/RR when N = 1. If the left/right operand registers RL/RR are 0, or the summed input of the left operand data-stream FIFO is 0, the floating-point multiplication or vector calculation is deemed to output 0, and the output result register RX or output data-stream FIFO outputs 0; otherwise the result is computed normally and output via RAM D;
(8) The data computation module outputs the multiplication result to the second gating module, which outputs the result according to the second gating command signal from the decoding unit; at the same time it sends the parallel-instruction conflict retrieval information to the third gating module and queries the executed-instruction cache module as to whether the data computation module currently has other calculations in flight, preventing parallel instruction conflicts; in case of conflict, the current output is delayed and made to wait;
(9) The third gating module judges the retrieval result according to the third gating command signal of the decoding unit and the conflict decision of the executed-instruction cache module, and outputs the instruction, register, cache, and peripheral update information. For this instruction, the result is stored in the register group unit when N = 1, or stored at the address corresponding to T_{N×N} via the peripheral update when N is not equal to 1. The address of the next instruction is then generated; here the instruction address is simply incremented by 1, and the next instruction ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N is executed;
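The write-back and sequencing rule of step (9), where a scalar result (N = 1) goes to the register group, a vector result (N not equal to 1) is written back through the peripheral update path, and the instruction address advances by one, can be sketched as follows (all names are hypothetical):

```python
def write_back_and_advance(result, n, pc, registers, memory, addr):
    """Commit one instruction's result and produce the next address.

    With N == 1 the result is a scalar and lands in the register group;
    otherwise the peripheral update stores the vector/matrix result at
    its destination address. The next address is simply pc + 1 here,
    since this instruction sequence contains no branches.
    """
    if n == 1:
        registers.append(result)   # register-group write-back
    else:
        memory[addr] = result      # peripheral update to T's address
    return pc + 1
```

The returned address feeds the instruction address generation unit, closing the calculation loop described for the third gating module.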
<2> Execute ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N: the matrix at address C_{N×N} is fetched and stored in four-port input/output RAM A of the floating-point calculation unit; the T_{N×N} data are passed from computing-unit bus interaction control module D to computing-unit bus interaction control module A and then written into the floating-point addition sub-unit;
<3> The data enter the floating-point addition sub-unit through four-port input/output RAM A, which computes the sum of the two matrices T_{N×N} and C_{N×N}; the result is passed to four-port input/output RAM A by computing-unit bus interaction control module A and returned over the internal bus to matrix-vector register T_{N×N};
<4> Execute DIVF.M T_{N×N}, D_{N×N}, T_{N×N}, N: the matrix at address D_{N×N} is fetched and stored in four-port input/output RAM B of the floating-point calculation unit; the T_{N×N} data are passed from computing-unit bus interaction control module A to computing-unit bus interaction control module B and then written into the floating-point division sub-unit;
<5> The data enter the pipelined input/output vector floating-point divider through four-port input/output RAM B, which computes the element-wise division of the two matrices T_{N×N} and D_{N×N}; the result is passed to four-port input/output RAM B by computing-unit bus interaction control module B and returned over the internal bus to matrix-vector register T_{N×N};
<6> Execute SIGMF.M T_{N×N}, T_{N×N}, N: the T_{N×N} data are passed via computing-unit bus interaction control module B to computing-unit bus interaction control module C and then written into the floating-point Sigmoid calculation sub-unit;
<7> The Sigmoid function of T_{N×N} is computed; the result is passed to four-port input/output RAM C by computing-unit bus interaction control module C and returned over the internal bus to matrix-vector register T_{N×N};
<8> Execute MULF.M T_{N×N}, E_{N×N}, F_{N×N}, N: the matrix at address E_{N×N} is fetched and stored in four-port input/output RAM D of the floating-point calculation unit; the T_{N×N} data are passed via computing-unit bus interaction control module C to computing-unit bus interaction control module D and then written into the floating-point multiplication sub-unit;
<9> The data enter the floating-point multiplication sub-unit through four-port input/output RAM D, where the product of matrices T_{N×N} and E_{N×N} is computed; the result is returned to four-port input/output RAM D via computing-unit bus interaction control module D and over the internal bus to matrix-vector register F_{N×N}, completing the calculation task.
Since the execution of the other instructions is similar to the first, the detailed description is not repeated here.
From the execution of the concrete operations above it can be seen that step (3) of the first instruction reduces the frequency of peripheral data interactions, and with it the computation latency caused by inter-cache communication; in deep-learning networks, whose convolution-kernel calculations easily run to millions or even hundreds of millions of iterations, this saves a large amount of cache-access time. Step (6) of the first instruction backs up frequently used data for the short term according to its relevance to the following instructions, likewise reducing the internal/external cache interaction time during computation and, in particular, lowering the rate of fragmented parameter accesses. Step (7) pre-judges the input data during computation and directly outputs the result for zero-valued data, reducing computational overhead. Finally, because the present processor is designed with both fixed-point and floating-point computation modules, it avoids the severe drop in model accuracy caused by computation error in other designs.
The contents not described in detail in the present invention belong to techniques well known to those skilled in the art.

Claims (9)

1. A general-purpose deep learning processor based on multi-parallel cache interaction and computation, characterized by comprising an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein: Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, internally buffers the to-be-calculated data signal delivered by the external bus, merges the first control signal and the to-be-calculated data into a second control signal and sends it to the first gating module, and then updates its data according to the register update signal returned by the first gating module; First gating module: receives the second control signal sent by the cache forwarding module and selects and outputs the buffered data within the second control signal according to the first gating command signal sent by the instruction processing module; after gating control according to the to-be-calculated data, outputs a third control signal containing the operands required for this calculation and sends it to the data computation module, while sending a register update signal to the cache forwarding module; Data computation module: receives the third control signal sent from the first gating module, calculates the to-be-calculated data according to the operands required for this calculation in the third control signal, and sends the resulting fixed-point calculation result, floating-point calculation result, and logic calculation result to the second gating module; Second gating module: receives the fixed-point, floating-point, and logic calculation results sent from the data computation module, performs secondary gating control according to the second gating command signal sent by the instruction processing module, and forwards the fixed-point, floating-point, or logic calculation result to the third gating module and the executed-instruction cache module; Executed-instruction cache module: receives the gated calculation result from the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision command signal, containing the conflict retrieval result, for determining whether other calculations are in progress; Third gating module: receives the calculation result sent by the second gating module and, according to the third gating command signal sent by the instruction processing module and the conflict retrieval result sent by the executed-instruction cache module, generates an instruction update signal for updating the cache forwarding module's instructions and sends the instruction update signal to the instruction processing module to form a closed calculation loop; Instruction processing module: receives the instruction update signal, sent by the third gating module, for selecting the instruction to be executed and generates the corresponding instruction code; decodes it and outputs the first control signal, sending the decoded first control signal for selecting the corresponding register to the cache forwarding module, while generating the first/second/third gating command signals and sending them respectively to the first/second/third gating modules for output-data gating.
2. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 1, characterized in that the instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein: Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit; Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit; Decoding unit: receives the instruction code sent by the instruction pool unit, decodes it and sends the resulting control signal to the cache forwarding module, while generating the first/second/third gating command signals and sending them respectively to the first/second/third gating modules for output-data gating.
3. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 2, characterized in that the cache forwarding module comprises a register group unit, an internal buffer unit, and a peripheral interface control unit, wherein: Register group unit: receives the first control signal for decoding sent by the instruction processing module, selects the corresponding register, and sends the register data within that control signal to the internal buffer unit; Peripheral interface control unit: receives the to-be-calculated data sent by the external-bus access interface and sends the external signal data containing the to-be-calculated data to the internal buffer unit; Internal buffer unit: receives the register data sent by the register group unit and the external signals containing the to-be-calculated data sent by the peripheral interface control unit, and sends all the signal data as the second control signal to the first gating module.
4. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 1, characterized in that the data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein: Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point calculation, and sends the fixed-point calculation result to the second gating module; Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point calculation, and sends the floating-point calculation result to the second gating module; Logic calculation unit: receives the third control signal sent from the first gating module, performs logic calculation, and sends the logic calculation result to the second gating module.
5. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 4, characterized in that the floating-point calculation unit comprises a floating-point calculation sub-unit, an interactive control sub-unit, and a bus interaction sub-unit, wherein: Floating-point calculation sub-unit: receives the third control signal sent from the first gating module and performs floating-point calculation; upon a read instruction from the interactive control sub-unit, the interactive control sub-unit reads the floating-point calculation result and sends it to the second gating module; Interactive control sub-unit: when the floating-point calculation sub-unit starts calculating, sends read instructions through the bus interaction sub-unit to the corresponding or adjacent floating-point calculation sub-unit, and sends the calculation result to the second gating module; Bus interaction sub-unit: carries the instruction and data interaction of the floating-point calculation sub-unit and the interactive control sub-unit.
6. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 5, characterized in that the floating-point calculation sub-unit comprises a floating-point pipelined vector calculator, a left operand data-stream FIFO, a right operand data-stream FIFO, an output-result data-stream FIFO, a left operand register RL, a right operand register RR, and an output result register RX, wherein: Left operand register RL: receives the interaction control signal output by the bus interaction sub-unit, buffers the written operand data, and sends it to the designated floating-point pipelined vector calculator for calculation; Right operand register RR: receives the interaction control signal output by the bus interaction sub-unit, buffers the written operand data, and sends it to the designated floating-point pipelined vector calculator for calculation; Left/right operand data-stream FIFOs: according to the interaction control signal output by the bus interaction sub-unit, buffer the written FIFO data values and send them to the designated floating-point pipelined vector calculator for pipelined calculation; Floating-point pipelined vector calculator: computes on the written data and sends the results to the output result register RX and the output-result data-stream FIFO; Output result register RX: receives the result of the floating-point pipelined vector calculator, buffers it, and checks the calculation operands: if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise it is output directly to the external bus immediately; Output-result data-stream FIFO: receives and buffers the result of the floating-point pipelined vector calculator; if the summed value of the data-stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise it is output directly to the external bus immediately.
7. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 5, characterized in that the interactive control sub-unit comprises an input interactive access control module, a control-signal decoding module, and an interactive output module, wherein: Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the decoded data and control signal to the interactive output module; Control-signal decoding module: receives the external calculation instruction sent by the first gating module, decodes it, and sends the resulting control signal to the interactive output module; Interactive output module: sends the output data of the input interactive access control module or of the control-signal decoding module to the second gating module through the bus interaction sub-unit.
8. The general-purpose deep learning processor based on multi-parallel cache interaction and computation according to claim 6, characterized in that the number of floating-point pipelined vector calculators is four.
9. A general-purpose deep network computing method with multi-parallel computing and caching, characterized by the following steps: (1) selecting and outputting an instruction from the instruction pool according to the generated address pointer, decoding the instruction and outputting a first control signal, and generating first/second/third gating command signals for data gating; (2) buffering the data within the first control signal, obtaining the to-be-calculated data over the external access bus, merging it with the data in the first control signal, and outputting a second control signal; (3) using the first gating command signal to gate the second control signal, performing the calculation on the to-be-calculated data according to the gated third control signal, and updating the buffered data with the calculation result; (4) gating and outputting the calculation result according to the second gating command signal, performing parallel-instruction conflict retrieval at the same time, and outputting the conflict decision retrieval result; (5) according to the third gating command signal, the calculation result, and the conflict decision retrieval result, sending a register update signal to update the data cache while updating the instruction address, forming a closed calculation loop.
CN201811528451.7A 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation Active CN109739556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811528451.7A CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811528451.7A CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Publications (2)

Publication Number Publication Date
CN109739556A true CN109739556A (en) 2019-05-10
CN109739556B CN109739556B (en) 2021-03-26

Family

ID=66359421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811528451.7A Active CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Country Status (1)

Country Link
CN (1) CN109739556B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5961628A (en) * 1997-01-28 1999-10-05 Samsung Electronics Co., Ltd. Load and store unit for a vector processor
CN1387649A (en) * 1999-08-31 2002-12-25 英特尔公司 Parallel processor architecture
CN101751244A (en) * 2010-01-04 2010-06-23 清华大学 Microprocessor
CN101986263A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Method and microprocessor for supporting single instruction stream and multi-instruction stream dynamic switching execution
CN106445468A (en) * 2015-10-08 2017-02-22 上海兆芯集成电路有限公司 Direct execution of execution unit for loading micro-operation of framework cache file by employing framework instruction of processor
US10073696B2 (en) * 2013-07-15 2018-09-11 Texas Instruments Incorporated Streaming engine with cache-like stream data storage and lifetime tracking


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG CHUAN: "Research and Implementation of Parallel Computing Methods for the MPCore Multi-core Processor", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817638A (en) * 2019-11-18 2021-05-18 北京希姆计算科技有限公司 A data processing device and method
US11782722B2 (en) 2020-06-30 2023-10-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Input and output interfaces for transmitting complex computing information between AI processors and computing components of a special function unit
CN113051212A (en) * 2021-03-02 2021-06-29 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic device, and storage medium
CN113051212B (en) * 2021-03-02 2023-12-05 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic equipment and storage medium
CN113806250A (en) * 2021-09-24 2021-12-17 中国人民解放军国防科技大学 Method for coordinating general processor core and vector component, interface and processor

Also Published As

Publication number Publication date
CN109739556B (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN109739556A (en) A general-purpose deep learning processor based on multi-parallel cache interaction and computation
Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
Chen et al. ReGNN: A redundancy-eliminated graph neural networks accelerator
Mohaidat et al. A survey on neural network hardware accelerators
CN112784970B (en) A hardware accelerator, data processing method, system-on-chip and medium
CN114781632A (en) Deep Neural Network Accelerator Based on Dynamic Reconfigurable Systolic Tensor Computation Engine
CN110516810A (en) A quantum program processing method, device, storage medium and electronic device
CN108052347A (en) A kind of device for executing instruction selection, method and command mappings method
CN109271138A (en) A kind of chain type multiplication structure multiplied suitable for big dimensional matrix
Wang et al. Cosa: Co-operative systolic arrays for multi-head attention mechanism in neural network using hybrid data reuse and fusion methodologies
CN113055060B (en) Coarse-grained reconfigurable architecture system for large-scale MIMO signal detection
WO2023092620A1 (en) Risc-v-based three-dimensional interconnection many-core processor architecture and operating method therefor
Zhang et al. Efficient neighbor-sampling-based gnn training on cpu-fpga heterogeneous platform
Zhao et al. Rf-risa: A novel flexible random forest accelerator based on fpga
Zhu et al. Taming unstructured sparsity on GPUs via latency-aware optimization
Chen et al. Rubik: A hierarchical architecture for efficient graph learning
CN112232517B (en) An artificial intelligence acceleration engine and artificial intelligence processor
CN119272827A (en) Data processing method of neural network, neural network and chip
Liu et al. A cloud server oriented FPGA accelerator for LSTM recurrent neural network
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
WO2021217502A1 (en) Computing architecture
Xiang et al. Accelerating CNN algorithm with fine-grained dataflow architectures
Gao et al. FPGA-based accelerator for independently recurrent neural network
US11714649B2 (en) RISC-V-based 3D interconnected multi-core processor architecture and working method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant