CN109739556A - General deep learning processor based on multi-channel parallel cache interaction and computation


Info

Publication number
CN109739556A
CN109739556A
Authority
CN
China
Prior art keywords
module
unit
sent
data
gating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811528451.7A
Other languages
Chinese (zh)
Other versions
CN109739556B (en)
Inventor
禹霁阳
汪路元
程博文
李宗凌
刘伟伟
牛跃华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Spacecraft System Engineering
Original Assignee
Beijing Institute of Spacecraft System Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Spacecraft System Engineering filed Critical Beijing Institute of Spacecraft System Engineering
Priority to CN201811528451.7A priority Critical patent/CN109739556B/en
Publication of CN109739556A publication Critical patent/CN109739556A/en
Application granted granted Critical
Publication of CN109739556B publication Critical patent/CN109739556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)

Abstract

The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation. It primarily addresses the frequent multiply-accumulate operations and frequent parameter accesses in the convolution and fully-connected computations of deep learning. Using multi-cache parallel interactive vector computation, it reduces fragmented accesses to shared parameters and data; by retrieving from a cache of executed instructions, it improves the parallelism of the computation process and the efficiency of executing identical instructions and accessing identical parameters, reduces the occupation of hardware floating-point calculators by repeated instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction and data streams, and improves the real-time performance of deep learning network computation.

Description

General deep learning processor based on multi-channel parallel cache interaction and computation
Technical field
The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation, and in particular to the general acceleration of convolution and fully-connected computations in large-scale deep learning networks.
Background technique
Deep learning networks play a significant role in the autonomous detection and recognition, judgment and prediction, and other pattern-recognition tasks of artificial intelligence. However, a deep learning network must perform a large number of matrix and vector computations simultaneously: the computational load of the algorithms is very large and the rate of fragmented parameter accesses is high, which places high demands on the architecture design of the processor. Meanwhile, in embedded applications, and especially in the aerospace field, power consumption, volume, and area restrict the use of most commercial GPU processors and microprocessors.
In addition, existing commercial GPU computers and microprocessors are mostly based on computation over parallel register files. Even a processor such as the Titan 1080p, which has vector data-move operations, parallelizes over every datum using thousands of nodes in its internal computation process; this occupies enormous hardware resources while also consuming a great deal of power. Among other accelerators, similar input/output results in the Cambricon chip must pass through the MLU modules during processing, and the selected input/output operation is completed according to the judgment of different instruction types during decoding; this acceleration scheme operates through multiple parallel modules and combines HotBuf and ColdBuf to reduce or merge convolution computations with identical parameters. Such an architecture can effectively improve convolution-kernel computation performance, but does not necessarily perform well on unstructured deep learning networks. The AI processor designed by DeePhi Technology features variable hardware resources; by means of NPU nodes it optimizes the computation process at compile time and compresses parameters to reduce the computational load. However, this only reaches optimal speed when the program structure cooperates accordingly, and in practical large convolution computations the intermediate results are usually called by other processes, so it is difficult to achieve real optimization for all complex deep networks. In particular, DeePhi holds that simplifying the number system in the computation process can effectively solve the problem of huge computational load, yet in practice the 8-16 bit quantization of key nodes of a deep learning network oriented to small targets may have disastrous effects.
Current algorithm implementations for accelerating deep convolutional networks can only be completed by the parallel computation of multiple computing units. Although commercial GPU processors and dedicated IP exist, they are expensive and structurally complex, and remain far from miniaturized embedded applications. Designing a general deep learning processor based on multi-channel parallel cache interaction and computation can effectively meet the urgent needs of current low-power, miniaturized embedded AI processor development.
Summary of the invention
The technical problem solved by the invention: in the prior art, the convolution computation process in large-scale learning networks is complicated, power consumption is high, fragmented accesses are repeated, and the computational load is heavy; the invention proposes a general deep learning processor based on multi-channel parallel cache interaction and computation.
The present invention solves the above technical problem through the following technical solution:
A general deep learning processor based on multi-channel parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-computed data signal fed in from the external bus; it merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module, and updates its data according to the register update signal returned by the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module and, according to the first gating instruction signal sent by the instruction processing module, selects among the cached data in the second control signal for output; after gating control according to the to-be-computed data, it outputs a third control signal containing the operands required for the current computation and sends it to the data computation module, while sending a register update signal to the cache forwarding module;
Data computation module: receives the third control signal sent from the first gating module, computes the to-be-computed data according to the operands required for the current computation in the third control signal, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and, according to the second gating instruction signal sent by the instruction processing module, performs secondary gating control, forwarding the fixed-point, floating-point, or logic computation result to the third gating module and the executed-instruction cache module;
Executed-instruction cache module: receives the computation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision signal, containing the instruction-conflict retrieval result, used to determine whether another computation is in progress;
Third gating module: receives the computation result sent by the second gating module and, according to the third gating instruction signal sent by the instruction processing module and the conflict retrieval result sent by the executed-instruction cache module, generates the update signal used by the cache forwarding module for instruction updating, and sends the instruction update signal to the instruction processing module to form a computation closed loop;
Instruction processing module: receives the instruction update signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code, decodes it, and outputs the first control signal; the first control signal obtained after decoding, which is used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating instruction signals are generated and sent respectively to the first/second/third gating modules for output data gating.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules for output data gating.
The cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the decoded first control signal sent by the instruction processing module and selects the corresponding register, while sending the register data in the control signal to the inner buffer unit;
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data containing the to-be-computed data to the inner buffer unit;
Inner buffer unit: receives the register data sent by the register group unit and the external signal containing to-be-computed data sent by the peripheral interface control unit, and sends all signal data to the first gating module.
The data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the third control signal sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
The floating-point calculation unit comprises floating-point computation sub-units, interactive control sub-units, and bus interaction sub-units, wherein:
Floating-point computation sub-unit: receives the third control signal sent from the first gating module and performs floating-point computation; it also receives the read instruction of the interactive control sub-unit, through which the floating-point result is read out and sent to the second gating unit;
Interactive control sub-unit: when a floating-point computation sub-unit starts computing, sends a read instruction through the bus interaction sub-unit to the corresponding or adjacent floating-point computation sub-unit, and sends the computation result to the second gating module;
Bus interaction sub-unit: carries out the instruction and data interaction of the floating-point computation sub-units and interactive control sub-units.
The floating-point computation sub-unit comprises a floating-point pipelined vector calculator, a left operand data stream memory FIFO, a right operand data stream memory FIFO, an output result data stream memory FIFO, a left operand register RL, a right operand register RR, and an output result register RX, wherein:
Left operand register RL: receives the interactive control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Right operand register RR: receives the interactive control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Left/right operand data stream memory FIFO: according to the interactive control signal output by the bus interaction sub-unit, caches the written FIFO data value and sends it to the designated floating-point pipelined vector calculator for pipelined computation;
Floating-point pipelined vector calculator: computes on the written data and sends the result to the output result register RX and the output result data stream memory FIFO;
Output result register RX: receives the computation result of the floating-point pipelined vector calculator, caches it, and then judges the computation operands: if an operand is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives the computation result of the floating-point pipelined vector calculator and caches it; if the summed value of the FIFO data stream is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately.
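By way of illustration only, a minimal C sketch of this zero short-circuit follows; the operation encoding and function names are assumptions, not the patent's signal definitions.

```c
/* Behavioural sketch (not the patent's RTL) of the zero-operand
 * short-circuit described above: for multiplication and division, a zero
 * operand (or a zero FIFO sum for vectors) lets the output stage record 0
 * immediately instead of waiting for the pipelined calculator. */
#include <stdio.h>

typedef enum { OP_ADD, OP_MUL, OP_DIV, OP_SIGM } fp_op_t;

static float output_stage(fp_op_t op, float operand_sum, float pipeline_result) {
    if ((op == OP_MUL || op == OP_DIV) && operand_sum == 0.0f)
        return 0.0f;             /* short-circuit: result recorded as 0 */
    return pipeline_result;      /* otherwise forward the pipeline output immediately */
}

int main(void) {
    printf("%g\n", output_stage(OP_MUL, 0.0f, 3.5f)); /* 0: short-circuited */
    printf("%g\n", output_stage(OP_ADD, 0.0f, 3.5f)); /* 3.5: addition unaffected */
    return 0;
}
```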
The interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interactive output module, wherein:
Input interactive access control module: receives the third control signal sent by the first gating module and decodes it, then sends the resulting data and control signals to the interactive output module;
Control signal decoding module: receives the external computation instruction sent by the first gating module and decodes it, then sends the resulting control signal to the interactive output module;
Interactive output module: sends the output data of the input interactive access control module or the control signal decoding module to the second gating module through the bus interaction sub-unit.
The number of floating-point pipelined vector calculators is four.
A general deep-network computation method with multi-channel parallel computation and caching proceeds as follows:
(1) select and output an instruction from the instruction pool according to the generated address pointer, decode the instruction and output the first control signal, and generate the first/second/third gating instruction signals for data gating;
(2) cache the data in the first control signal, obtain the to-be-computed data over the external access bus, and merge it with the data in the first control signal to output the second control signal;
(3) perform gating control on the second control signal using the first gating instruction signal, compute on the to-be-computed data according to the gated third control signal, and update the cached data with the computation result;
(4) gate the computation result according to the second gating instruction signal, while performing parallel-instruction conflict retrieval and outputting the conflict retrieval result;
(5) according to the third gating instruction signal, the computation result, and the conflict retrieval result, send the register update signal to update the data cache, and update the instruction address to form the computation closed loop.
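Schematically, the five steps form a fetch-gate-compute-retrieve-update loop. A C rendering of that loop shape follows; the payloads and function bodies are placeholders invented for illustration, only the step order comes from the method above.

```c
/* Schematic rendering of the five-step computation closed loop. */
#include <stdbool.h>
#include <stdio.h>

static int  step1_fetch_decode(int pc)           { return pc;        } /* instruction -> 1st control signal  */
static int  step2_cache_merge(int ctrl1)         { return ctrl1 + 1; } /* merge cached + bus data -> 2nd sig */
static int  step3_gate_compute(int ctrl2)        { return ctrl2 * 2; } /* gate -> 3rd signal -> compute      */
static bool step4_conflict_retrieval(int result) { (void)result; return false; }
static int  step5_update(int pc, bool conflict)  { return conflict ? pc : pc + 1; }

int main(void) {
    int pc = 0;
    for (int cycle = 0; cycle < 3; cycle++) {
        int c1 = step1_fetch_decode(pc);
        int c2 = step2_cache_merge(c1);
        int r  = step3_gate_compute(c2);
        bool k = step4_conflict_retrieval(r);
        pc = step5_update(pc, k);             /* instruction-address update closes the loop */
        printf("cycle=%d result=%d next_pc=%d\n", cycle, r, pc);
    }
    return 0;
}
```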
The advantages of the present invention over the prior art are as follows:
(1) The general deep learning processor based on multi-channel parallel cache interaction and computation provided by the invention addresses the heavy vector computation and high computational repetition of deep learning convolutional network computations. By caching executed instructions, searching that cache for instructions about to execute, and thereby producing their results quickly, it reduces the occupation of hardware floating-point calculators by repeated instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction and data streams, and improves the speed of vector and matrix computation inside convolution kernels;
(2) Addressing the high data repetition between node computations in fully-connected computation, where instruction fetching and decoding are faster than in the convolution process, the invention proposes a floating-point calculation unit that guarantees parallel, interactive pipelined computation of incoming floating-point instructions: at the same moment it can complete, in parallel, vector computations of four kinds of operations, namely floating-point addition, multiplication, division, and Sigmoid, and it reduces the rate of fragmented parameter accesses.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the floating-point calculation unit provided by the invention;
Fig. 2 is a structural schematic diagram of the floating-point computation sub-unit provided by the invention;
Fig. 3 is a structural schematic diagram of the interactive control sub-unit provided by the invention;
Fig. 4 is a structural schematic diagram of the computing-unit bus interactive control module provided by the invention;
Fig. 5 is a structural schematic diagram of the executed-instruction cache module provided by the invention;
Fig. 6 is the internal block diagram of the processor provided by the invention.
Specific embodiment
The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation, especially for the frequent multiply-accumulate computations of large-scale deep convolutional networks and fully-connected networks.
The present invention will be further described below with reference to the accompanying drawings.
The invention mainly comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module. The main object of the invention is the quick execution of convolution and fully-connected computations in large-scale deep learning networks, so as to complete them in real time. As shown in Fig. 6, after an instruction update, the instruction pool unit outputs the instruction according to the instruction address generation unit and it enters the decoding unit, which outputs the first control signal. The first control signal enters the register group unit in the cache forwarding module, which also receives the register update and buffer update signals; together with the output data of the peripheral interface control unit it feeds the inner buffer unit, and the inner buffer unit outputs the second control signal to the first gating module. The first gating module, according to the second control signal and the first gating instruction signal, outputs the register update, the peripheral update, and the third control signal to the register group unit, the peripheral interface control unit, and the data computation module respectively; after computation, the fixed-point, floating-point, and logic results are sent to the second gating module. The second gating module receives the fixed-point, floating-point, and logic results and the second gating instruction signal, and sends the parallel-instruction conflict retrieval signal to the third gating module and the executed-instruction cache module. The executed-instruction cache module receives the parallel-instruction conflict retrieval signal, judges whether there is a conflicting or already-executed instruction, and returns the conflict retrieval result to the third gating module. The third gating module receives the parallel-instruction conflict retrieval signal, the conflict retrieval result, and the third gating instruction signal, outputs the register/cache/peripheral update signals, and outputs the instruction update signal to the instruction processing module.
Whether vector computation can be used is judged from the instruction; floating-point numbers in the computation process are handled by the floating-point calculation unit shown in Fig. 1. The module includes four four-I/O (four-input/output) RAM caches, the floating-point computation sub-units used respectively for addition, multiplication, division, and Sigmoid computation, and four bus interaction sub-units for the sharing and effective distribution of data. Floating-point data passes through a floating-point computation sub-unit; which four-I/O RAM cache it is written into is judged from the instruction code, as is whether that four-I/O RAM cache needs to be backed up to the two adjacent four-I/O RAM caches. The floating-point vector data in the four-I/O RAM cache is then taken out and fed into the pipelined input/output vector floating-point calculator, and the instruction code determines whether the output data needs to be backed up into the two adjacent bus interaction sub-units; if so, the computing-unit bus interactive control transfers or backs up the computation data to other computing modules.
The floating-point computation sub-unit, as shown in Fig. 2, includes the left/right operand data stream memory FIFOs, the output result data stream memory FIFO, the left operand register RL, the right operand register RR, and the output result register RX. The module first detects whether the external four-I/O RAM has data to process. If there is and the data to process is a single value, the left operand is read into the left operand register RL and the right operand into the right operand register RR; after one latch stage they are fed into the floating-point pipelined calculator for the addition/multiplication/division/Sigmoid operation, and the result is placed into the output result register RX, where it waits for the internal bus interface to take it out into the four-I/O RAM. If the data to process is a vector, the left operands are read continuously into the left operand data stream FIFO and the right operands into the right operand data stream FIFO, while being streamed into the floating-point pipelined calculator for the addition/multiplication/division/Sigmoid operation; the results are placed into the output result data stream FIFO, where they wait for the internal bus interface to take them out into the four-I/O RAM.
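A hypothetical C model of this scalar/vector dispatch is sketched below; the FIFO depth and the stand-in operation are assumptions, only the two paths come from the description above.

```c
/* Scalar data goes through the RL/RR operand registers; a vector is
 * streamed through the left/right operand FIFOs into the pipeline. */
#include <stdio.h>

#define FIFO_DEPTH 64

static float pipeline_op(float l, float r) { return l + r; } /* stand-in for add/mul/div/Sigmoid */

static void fp_sub_unit(const float *left, const float *right, float *out, int n) {
    if (n == 1) {
        float rl = left[0], rr = right[0];    /* latch into RL / RR */
        out[0] = pipeline_op(rl, rr);         /* result lands in RX */
    } else {
        float lf[FIFO_DEPTH], rf[FIFO_DEPTH]; /* operand stream FIFOs */
        for (int i = 0; i < n; i++) { lf[i] = left[i]; rf[i] = right[i]; }
        for (int i = 0; i < n; i++)           /* streamed into the pipeline */
            out[i] = pipeline_op(lf[i], rf[i]); /* results land in the output FIFO */
    }
}

int main(void) {
    float a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, c[3];
    fp_sub_unit(a, b, c, 3);
    printf("%g %g %g\n", c[0], c[1], c[2]);   /* 5 7 9 */
    return 0;
}
```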
The interactive control sub-unit, as shown in Fig. 3, is mainly used to distribute the computation information in the external third control signal to each computing unit and cache. It includes the input interactive access control module, the control signal decoding module, and the interactive output module. The interactive output module is in turn divided into internal modules for floating addition/multiplication interactive output judgment, floating addition/division interactive output judgment, floating division/Sigmoid interactive output judgment, floating multiplication/Sigmoid interactive output judgment, floating addition register/FIFO interactive output judgment, floating multiplication register/FIFO interactive output judgment, floating Sigmoid register/FIFO interactive output judgment, and floating division register/FIFO interactive output judgment. First, the input information is obtained from the external third control signal, and it is judged whether the instruction is a single computation instruction or a vector computation instruction; data is then output to the corresponding four-I/O RAM while judging whether the instruction can interact, and the interaction control information for the computing-unit bus interactive control module is generated from the interaction information. Then it is judged whether the computation result in the four-I/O RAM is a single datum or vector data, and the computation result is taken out and given to the instruction execution module.
The computing-unit bus interactive control module, as shown in Fig. 4, is mainly used for data interaction between the four I/O caches. According to the interactive control module information, it judges whether the current four-I/O cache X and four-I/O cache Y need to interact. The interaction modes are: X and Y are exchanged; X is backed up to Y; Y is backed up to X; X and Y remain unchanged. The delay time information of the exchange process is sent to the pipelined input/output vector floating-point calculator modules X and Y.
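A short C sketch of the four interaction modes follows; the enum labels are invented, only the four behaviours come from the description.

```c
/* The four cache-interaction modes between caches X and Y. */
#include <stdio.h>

typedef enum { XY_SWAP, XY_X_TO_Y, XY_Y_TO_X, XY_UNCHANGED } xy_mode_t;

static void interact(float *x, float *y, int n, xy_mode_t mode) {
    for (int i = 0; i < n; i++) {
        switch (mode) {
        case XY_SWAP:      { float t = x[i]; x[i] = y[i]; y[i] = t; } break;
        case XY_X_TO_Y:    y[i] = x[i]; break;   /* X backed up to Y */
        case XY_Y_TO_X:    x[i] = y[i]; break;   /* Y backed up to X */
        case XY_UNCHANGED: break;
        }
    }
}

int main(void) {
    float x[2] = {1, 2}, y[2] = {9, 9};
    interact(x, y, 2, XY_X_TO_Y);
    printf("y = {%g, %g}\n", y[0], y[1]);        /* {1, 2} */
    return 0;
}
```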
The executed-instruction cache module, as shown in Fig. 5, is mainly used for the quick execution of instructions not yet executed. Executed instructions and their output results are kept in the executed-instruction fragment cache, where new fragments overwrite old fragments. On receiving the parallel-instruction conflict retrieval output by the second gating module, the module searches the fragment cache for the instructions to be executed in the instruction pool. If there is a match, the computation result in the search result is taken out directly, the instruction no longer enters the computing module, and the fragment is deleted from the cache. If the instruction has already entered the execution stage during the search, the search process conflicts with it; the search for that instruction is stopped and the next instruction search begins. The conflict retrieval result is sent to the third gating module.
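The fragment cache behaves like a small memoization table. A minimal C sketch under that reading follows; the data layout and capacity are assumptions.

```c
/* Finished instructions are memoized with their results, new fragments
 * overwrite old ones, and a matching pending instruction reuses the
 * cached result instead of entering the compute modules. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FRAGS 8

typedef struct { uint64_t instr; float result; bool valid; } frag_t;

static frag_t cache[FRAGS];
static int next_slot;                      /* new fragment covers the old one */

static void record(uint64_t instr, float result) {
    cache[next_slot] = (frag_t){ instr, result, true };
    next_slot = (next_slot + 1) % FRAGS;
}

static bool lookup(uint64_t instr, float *out) {
    for (int i = 0; i < FRAGS; i++)
        if (cache[i].valid && cache[i].instr == instr) {
            *out = cache[i].result;        /* take the result directly ... */
            cache[i].valid = false;        /* ... and delete the fragment  */
            return true;                   /* instruction skips the calculators */
        }
    return false;
}

int main(void) {
    float r = 0.0f;
    record(0x1001, 42.0f);                 /* instruction 0x1001 finished with 42 */
    bool hit = lookup(0x1001, &r);
    printf("hit=%d r=%g\n", hit, r);       /* hit=1 r=42 */
    hit = lookup(0x1001, &r);
    printf("hit=%d\n", hit);               /* hit=0: the fragment was consumed */
    return 0;
}
```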
The structure of the learning processor provided by the invention is as follows:
A general deep learning processor based on multi-channel parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Instruction processing module: receives the parallel instruction signal or single instruction signal sent by the third gating module for selecting the instruction to be executed, generates the corresponding instruction code, and decodes it for output; the first control signal obtained after decoding, containing the operand control address used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating signals are generated and sent respectively to the first/second/third gating modules;
The first/second/third gating signals are multi-bit signals used to select the outputs of the first/second/third gating modules. The high two bits of the first gating signal select the output of the first gating module: '00' outputs the data issued by the register group unit, '01' the data issued by the inner buffer unit, and '10' the output data of the peripheral interface control unit. The high two bits of the second gating signal select the output of the second gating module: '00' outputs the fixed-point result, '01' the floating-point result, and '10' the logic result. The high bit of the third gating signal selects the output of the third gating module: '0' outputs the result of the second gating module and '1' the executed-instruction search result.
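A behavioural C sketch of these three selectors, using the bit encodings just given, follows; the module outputs are stubbed with plain floats.

```c
#include <stdio.h>

/* first gating module: '00' register group, '01' inner buffer, '10' peripheral */
static float first_gate(unsigned hi2, float reg_grp, float inner_buf, float periph) {
    switch (hi2 & 0x3u) {
    case 0x0: return reg_grp;
    case 0x1: return inner_buf;
    case 0x2: return periph;
    default:  return 0.0f;   /* '11' is not assigned in the description */
    }
}

/* second gating module: '00' fixed-point, '01' floating-point, '10' logic */
static float second_gate(unsigned hi2, float fx, float fp, float lg) {
    return (hi2 & 0x3u) == 0x0 ? fx : (hi2 & 0x3u) == 0x1 ? fp : lg;
}

/* third gating module: '0' compute result, '1' executed-instruction search result */
static float third_gate(unsigned hi1, float compute, float cached) {
    return (hi1 & 0x1u) ? cached : compute;
}

int main(void) {
    printf("%g\n", first_gate(0x1, 1.f, 2.f, 3.f));  /* 2: inner buffer   */
    printf("%g\n", second_gate(0x1, 4.f, 5.f, 6.f)); /* 5: floating-point */
    printf("%g\n", third_gate(0x0, 7.f, 8.f));       /* 7: compute result */
    return 0;
}
```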
Cache forwarding module: receives the register update signal, the first control signal, the external bus access data, and the peripheral update signal; it selects the corresponding register, performs internal caching, and sends the second control signal, containing the computation data address, to the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module; the second control signal contains the output data information of the register group unit, the inner buffer unit, and the peripheral interface unit, merged as the input of the first gating module, and which of the three sources is finally output is judged according to the first gating signal output by the decoding unit; after gating control according to the to-be-computed data, the third control signal containing the data required for this computation is output and sent to the data computation module and the executed-instruction cache module; the first gating signal is the control signal sent by the decoding unit to the first gating module and controls whether the gated path selects the data of the register, the inner buffer, or the peripheral interface;
Data computation module: receives the third control signal sent from the first gating module, containing the data and addresses required for the computation, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and performs secondary gating control according to the second gating signal output by the decoding unit, forwarding the fixed-point, floating-point, or logic result to the third gating module and the executed-instruction cache module; the second gating signal is the control signal sent by the decoding unit to the second gating module and controls which computation type's output the gated path selects;
Executed-instruction cache module: receives the parallel-instruction conflict retrieval information gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module the conflict retrieval result used to determine whether another computation is in progress;
Third gating module: receives the parallel-instruction conflict retrieval information sent by the second gating module; according to the third gating instruction signal sent by the decoding unit and the conflict retrieval result sent by the executed-instruction cache module, it outputs the instruction update signal to the instruction processing module to form the computation closed loop, and sends the register, cache, and peripheral update signals used respectively for the update control of the register group unit, the inner buffer unit, and the peripheral interface control unit.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the parallel instruction signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules.
The cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the control signal sent by the decoding unit, selects the corresponding register, and sends the data to the inner buffer unit;
Inner buffer unit: receives the data sent by the register group unit and caches it;
Peripheral interface control unit: receives the external bus access data, manages the data according to the peripheral update control signal, and sends the data to the inner buffer unit.
The data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the required data and addresses sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the required data and addresses sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the required data and addresses sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
The floating-point calculation unit comprises floating-point computation sub-units, interactive control sub-units, and bus interaction sub-units, wherein:
Floating-point computation sub-unit: receives the control signal, operands, and control addresses sent from the first gating module and performs floating-point computation; it also receives the control instruction of the interactive control sub-unit, through which the floating-point result is read out and sent to the second gating unit;
Interactive control sub-unit: receives the third control signal sent from the first gating module, generates a control instruction, and sends it to the floating-point computation sub-unit and the bus interaction sub-unit; through the bus interaction sub-unit it directs the corresponding or adjacent floating-point computation sub-unit to compute synchronously, while exchanging control information with the floating-point computation sub-unit;
Bus interaction sub-unit: sends control signals to the floating-point computation sub-unit and interactive control sub-unit and carries out their information interaction.
The floating-point computation sub-unit comprises a floating-point pipelined vector calculator, left/right operand data stream memory FIFOs, an output result data stream memory FIFO, left/right operand registers RL/RR, and an output result register RX, wherein:
Floating-point pipelined vector calculator: computes on all written input data and sends the results to the output result register RX and the output result data stream memory FIFO;
Left/right operand registers RL/RR: receive the control signal output by the bus interaction sub-unit, cache the written data operands, and send them to the designated floating-point pipelined vector calculator for computation;
Left/right operand data stream memory FIFO: according to the control signal output by the bus interaction sub-unit, caches the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined computation;
Output result register RX: receives the computation result of the floating-point pipelined vector calculator, caches it, and then judges the computation operands: if an operand is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives the computation result of the floating-point pipelined vector calculator and caches it; if the summed value of the FIFO data stream is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately. This process reduces fragmented data accesses and repeated computation.
The interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interactive output module, wherein:
Input interactive access control module: receives the required third control signal sent by the first gating module and decodes it, then sends the resulting first-level operand control signal to the control signal decoding module;
Control signal decoding module: receives the first-level operand control signal sent by the input interactive access control module, decodes it again, and sends the resulting second-level operand control signal to the interactive output module;
Interactive output module: judges, according to the instruction decoding result output by the instruction pool unit, whether the second-level operand control signal sent by the control signal decoding module should be used to read data from the four bus shared caches and the floating-point calculators for computation.
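As an illustration of this two-stage decode chain, a C sketch follows; the field layout is hypothetical, only the stage order comes from the description.

```c
/* Two-stage decode: first-level operand control, then second-level
 * operand control driving the cache/calculator data reads. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t op; uint8_t is_vector; } level1_t;     /* first-level operand control  */
typedef struct { uint8_t ram_sel; uint8_t use_fifo; } level2_t; /* second-level operand control */

static level1_t decode_level1(uint32_t third_ctrl) {
    return (level1_t){ .op = third_ctrl & 0xF, .is_vector = (third_ctrl >> 4) & 1 };
}

static level2_t decode_level2(level1_t l1) {
    /* vector computations stream through FIFOs; scalars use the registers */
    return (level2_t){ .ram_sel = l1.op & 0x3, .use_fifo = l1.is_vector };
}

int main(void) {
    level2_t l2 = decode_level2(decode_level1(0x12));
    printf("ram=%u fifo=%u\n", (unsigned)l2.ram_sel, (unsigned)l2.use_fifo);
    return 0;
}
```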
The present invention will be further described below with reference to the accompanying drawings and an embodiment.
The present invention relates to a general deep learning processor architecture based on multi-channel parallel cache interaction and computation, especially for the frequent multiply-accumulate computations in large-scale deep convolutional networks and fully-connected networks. Taking the following computation, common at convolution-operation nodes, as an example, the implementation steps are explained.
$F_{N\times N}=E_{N\times N}\,\mathrm{Sigmoid}\big((A_{N\times N}B_{N\times N}+C_{N\times N})\;./\;D_{N\times N}\big)$ (1)
where $N$ is a positive integer, "./" denotes element-wise division, and $\mathrm{Sigmoid}(x)=(1+e^{-x})^{-1}$; the Sigmoid computation is abbreviated below as SIGMF.
Conventional processors operate on the above formula number by number; even processors such as the Cambricon and DeePhi's Tingtao can only complete pipelined operations under code-optimized conditions. In the processor architecture designed in this patent, the computation of formula (1) can be completed in only a few computation cycles by matrix computation instructions in a parallel pipelined manner. Moreover, coding formula (1) on a conventional processor requires at least five condition judgments and two nested loops, whereas with the instruction set of the processor designed in this patent only five lines of assembly code are needed, as follows:
MULF.M  A_{N×N}, B_{N×N}, T_{N×N}, N
ADDF.M  T_{N×N}, C_{N×N}, T_{N×N}, N
DIVF.M  T_{N×N}, D_{N×N}, T_{N×N}, N
SIGMF.M T_{N×N}, T_{N×N}, N
MULF.M  T_{N×N}, E_{N×N}, F_{N×N}, N
The above five lines of assembly code are compiled into machine code, which enters the instruction pool unit through the external debugging interface and debugging control unit. The whole processor then works according to the instruction code output by the instruction pool unit.
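For reference, a plain C rendering of what this five-instruction sequence computes for formula (1) follows. Reading MULF.M as a matrix product and ADDF.M/DIVF.M/SIGMF.M as element-wise operations is our interpretation of the walkthrough below; the function names are invented and the code models the arithmetic only, not the processor.

```c
#include <math.h>
#include <stdio.h>

#define N 2

static void mulf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++) s += a[i][k] * b[k][j];
            t[i][j] = s;                                  /* matrix product */
        }
}

static void addf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = a[i][j] + b[i][j];
}

static void divf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = a[i][j] / b[i][j]; /* "./" */
}

static void sigmf_m(float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = 1.0f / (1.0f + expf(-t[i][j]));
}

int main(void) {
    float A[N][N] = {{1, 2}, {3, 4}}, B[N][N] = {{1, 0}, {0, 1}};
    float C[N][N] = {{0, 0}, {0, 0}}, D[N][N] = {{1, 1}, {1, 1}};
    float E[N][N] = {{1, 0}, {0, 1}}, T[N][N], F[N][N];

    mulf_m(A, B, T);   /* MULF.M  A, B, T, N */
    addf_m(T, C, T);   /* ADDF.M  T, C, T, N */
    divf_m(T, D, T);   /* DIVF.M  T, D, T, N */
    sigmf_m(T);        /* SIGMF.M T, T, N    */
    mulf_m(T, E, F);   /* MULF.M  T, E, F, N */

    printf("F[0][0] = %f\n", F[0][0]);  /* Sigmoid(1) ≈ 0.731059 */
    return 0;
}
```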
<1> Execute the assembly statement MULF.M A_{N×N}, B_{N×N}, T_{N×N}, N: the two matrices at addresses A_{N×N} and B_{N×N} are taken out and stored into four-I/O RAM D of the floating-point calculation unit, and the matrix-vector computation parameter N is written to computing-unit bus interactive control module D. With reference to the processor internal structure block diagram of Fig. 6, the operation and computation flow of this instruction are as follows:
(1) First, this instruction is taken out of the instruction pool unit according to the address output by the instruction address generation unit and sent to the decoding unit;
(2) The decoding unit selects the data input to the cache forwarding module according to the size of N, and the peripheral interface control unit takes out the data corresponding to addresses A_{N×N}, B_{N×N}, and T_{N×N}; if N = 1 the instruction is treated as an ordinary scalar multiplication and the data is input to the register group unit, and if N is not equal to 1 the operation is treated as a vector computation and the data is input to the inner buffer unit;
(3) The cache forwarding module judges, according to the buffer update signal of the third gating module, whether the new input data should replace the existing data of the inner buffer unit, and judges from the past-instruction retrieval result whether this instruction is identical to the previously executed one; if identical, no data update is performed in the register group unit or inner buffer unit of the cache forwarding module, and the previously stored data of the register group unit or inner buffer unit is output directly to the first gating module;
(4) After receiving the data, the first gating module judges according to the data gating control which part of the data computation module the data is input to; since this instruction is a floating-point multiplication, the operands enter the floating-point calculation unit;
(5) After receiving the data, the floating-point calculation unit in the data computation module first places the data into the interactive control sub-unit, which selects different output channels according to the computation mode and data length; this instruction is a floating-point vector computation, so the operand data is output to four-I/O RAM D by the floating multiplication register/FIFO interactive output module of the interactive control, and the computation information enters bus interaction sub-unit D;
(6) Bus interaction sub-unit D judges from the next instruction that the T_{N×N} data will be used again for computation in the floating addition sub-unit; therefore, after the T_{N×N} data is input to four-I/O RAM D, the DMA from D to A is controlled to back up RAM T in D to RAM A;
(7) Four-I/O RAM D inputs the operand data to the floating multiplication computation sub-unit, placing the operands into the left/right operand data stream FIFOs, or into the left/right operand registers RL/RR if N = 1; if the left/right operand register RL/RR equals 0, or the summed result of the left operand data stream FIFO input is 0, the output of this floating-point multiplication or vector computation is taken to be 0 and the output result register RX or output stream FIFO outputs 0; otherwise the output result is computed normally and returned to RAM D;
(8) The data computation module outputs the multiplication result to the second gating module, which outputs the computation result according to the second gating instruction signal of the decoding unit and at the same time outputs the parallel-instruction conflict retrieval information to the third gating module and the executed-instruction cache module; whether the current data computation module has other computations in progress is judged so as to prevent parallel computation instruction conflicts, and in case of conflict the current output is delayed and made to wait;
(9) The third gating module, according to the third gating instruction signal of the decoding unit and the conflict retrieval result of the executed-instruction cache module, outputs the instruction, register, cache, and peripheral update information; for this instruction, when N = 1 the result is stored in the register group unit, and when N is not equal to 1 it is stored through the peripheral update at the address corresponding to T_{N×N}; the address of the next instruction is then generated, here by adding 1 to this instruction's address, and the next instruction ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N is executed;
<2> Execute ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N: the matrix at address C_{N×N} is taken out and stored into four-I/O RAM A of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module D to computing-unit bus interactive control module A and then written into the floating addition computation sub-unit;
<3> The data enters the floating addition computation sub-unit through four-I/O RAM A, and the sum of the two matrices T_{N×N} and C_{N×N} is computed; the result is passed to four-I/O RAM A by computing-unit bus interactive control module A and returned over the internal bus to the matrix-vector register T_{N×N};
<4> Execute DIVF.M T_{N×N}, D_{N×N}, T_{N×N}, N: the matrix at address D_{N×N} is taken out and stored into four-I/O RAM B of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module A to computing-unit bus interactive control module B and then written into the floating division computation sub-unit;
<5> The data enters the pipelined input/output vector floating-point calculator through four-I/O RAM B, and the element-wise division of the two matrices T_{N×N} and D_{N×N} is computed; the result is passed to four-I/O RAM B by computing-unit bus interactive control module B and returned over the internal bus to the matrix-vector register T_{N×N};
<6> Execute SIGMF.M T_{N×N}, T_{N×N}, N: the T_{N×N} data is passed from computing-unit bus interactive control module B to computing-unit bus interactive control module C and then written into the floating Sigmoid computation sub-unit;
<7> The Sigmoid function of T_{N×N} is computed; the result is passed to four-I/O RAM C by computing-unit bus interactive control module C and returned over the internal bus to the matrix-vector register T_{N×N};
<8> Execute MULF.M T_{N×N}, E_{N×N}, F_{N×N}, N: the matrix at address E_{N×N} is taken out and stored into four-I/O RAM D of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module C to computing-unit bus interactive control module D and then written into the floating multiplication computation sub-unit;
<9> The data enters the floating multiplication computation sub-unit through four-I/O RAM D, and the product of the matrices T_{N×N} and E_{N×N} is computed; the result is returned to four-I/O RAM D via computing-unit bus interactive control module D and then over the internal bus to the matrix-vector register F_{N×N}, completing this computation task.
Since the execution of the other instructions is similar to the first, the detailed description is not repeated here.
From the execution of the above concrete operations it can be seen that step (3) of the first instruction's execution reduces the frequency of peripheral data interaction and the computation delay caused by communication between caches; in a deep learning network, whose convolution-kernel computations easily run to millions or even hundreds of millions of invocations, this saves a great deal of cache access time. Step (6) backs up frequently used data for a short time according to correlation with the following instructions, likewise reducing the time spent on data interaction between internal and external caches during computation, and in particular reducing the rate of fragmented parameter accesses. Step (7) pre-judges the input data during computation and directly outputs the result for zero-valued data, reducing computation overhead. Finally, because the present processor is designed with both fixed-point and floating-point computation modules, it avoids the severe loss of model accuracy caused by computation error in other designs.
Content not described in detail in the present specification belongs to the well-known art of those skilled in the art.

Claims (9)

1. A general deep learning processor based on multi-channel parallel cache interaction and computation, characterized by comprising an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-computed data signal fed in from the external bus; it merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module, and updates its data according to the register update signal returned by the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module and, according to the first gating instruction signal sent by the instruction processing module, selects among the cached data in the second control signal for output; after gating control according to the to-be-computed data, it outputs a third control signal containing the operands required for the current computation and sends it to the data computation module, while sending a register update signal to the cache forwarding module;
Data computation module: receives the third control signal sent from the first gating module, computes the to-be-computed data according to the operands required for the current computation in the third control signal, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and, according to the second gating instruction signal sent by the instruction processing module, performs secondary gating control, forwarding the fixed-point, floating-point, or logic computation result to the third gating module and the executed-instruction cache module;
Executed-instruction cache module: receives the computation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision signal, containing the instruction-conflict retrieval result, used to determine whether another computation is in progress;
Third gating module: receives the computation result sent by the second gating module and, according to the third gating instruction signal sent by the instruction processing module and the conflict retrieval result sent by the executed-instruction cache module, generates the update signal used by the cache forwarding module for instruction updating, and sends the instruction update signal to the instruction processing module to form a computation closed loop;
Instruction processing module: receives the instruction update signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code, decodes it, and outputs the first control signal; the first control signal obtained after decoding, which is used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating instruction signals are generated and sent respectively to the first/second/third gating modules for output data gating.
2. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 1, characterized in that the instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules for output data gating.
3. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 2, characterized in that the cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the decoded first control signal sent by the instruction processing module and selects the corresponding register, while sending the register data in the control signal to the inner buffer unit;
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data containing the to-be-computed data to the inner buffer unit;
Inner buffer unit: receives the register data sent by the register group unit and the external signal containing to-be-computed data sent by the peripheral interface control unit, and sends all signal data to the first gating module.
4. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 1, characterized in that the data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the third control signal sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
5. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 4, characterized in that the floating-point calculation unit comprises floating-point calculation sub-units, an interactive control sub-unit, and a bus interaction sub-unit, wherein:
Floating-point calculation sub-unit: receives the third control signal sent from the first gating module and performs floating-point calculation; it also receives read instructions from the interactive control sub-unit, through which the floating-point results are read out and sent to the second gating module;
Interactive control sub-unit: when a floating-point calculation sub-unit starts calculating, sends read instructions through the bus interaction sub-unit to the corresponding or adjacent floating-point calculation sub-units, and sends the calculation results to the second gating module;
Bus interaction sub-unit: carries the instruction and data interaction between the floating-point calculation sub-units and the interactive control sub-unit.
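The useful property of claim 5 is that a result already produced by a corresponding or adjacent floating-point calculation sub-unit can be read back over the bus instead of being recomputed. The sketch below models the bus as a published-result cache; that cache and its keying scheme are assumptions, since the claim only specifies read instructions:

```python
# Sketch of claim 5's read-sharing path (hypothetical names and keying).

class BusInteraction:
    # Bus interaction sub-unit: carries instruction/data traffic between
    # floating-point sub-units and the interactive control sub-unit.
    def __init__(self):
        self.published = {}                    # results posted by sub-units

    def publish(self, key, value):
        self.published[key] = value

    def read(self, key):
        return self.published.get(key)         # the "read instruction"

class InteractiveControl:
    def __init__(self, bus):
        self.bus = bus

    def on_calculation_start(self, operand_key):
        # A hit means an adjacent sub-unit already produced this result,
        # so it can go straight to the second gating module.
        return self.bus.read(operand_key)

bus = BusInteraction()
bus.publish(("mul", 3.0, 4.0), 12.0)
print(InteractiveControl(bus).on_calculation_start(("mul", 3.0, 4.0)))  # -> 12.0
```

This read-before-compute pattern is what lets sub-units that share parameters avoid repeating the same multiply-accumulate work.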
6. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 5, characterized in that the floating-point calculation sub-unit comprises a floating-point pipelined vector calculator, a left-operand data stream memory FIFO, a right-operand data stream memory FIFO, an output result data stream memory FIFO, a left-operand register RL, a right-operand register RR, and an output result register RX, wherein:
Left-operand register RL: receives the interactive control signal output by the bus interaction sub-unit, buffers the written data operand, and sends it to the designated floating-point pipelined vector calculator for calculation;
Right-operand register RR: receives the interactive control signal output by the bus interaction sub-unit, buffers the written data operand, and sends it to the designated floating-point pipelined vector calculator for calculation;
Left/right-operand data stream memory FIFO: according to the interactive control signal output by the bus interaction sub-unit, buffers the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined calculation;
Floating-point pipelined vector calculator: calculates on the written data and sends the result to the output result register RX and the output result data stream memory FIFO;
Output result register RX: receives the result of the floating-point pipelined vector calculator, buffers it, and then inspects the operands of the calculation; if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives and buffers the results of the floating-point pipelined vector calculator; if the summed value of the data stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately.
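The RX rule of claim 6 is an algebraic shortcut: a product with a zero operand, or zero divided by anything non-zero, is zero by inspection, so the register can emit 0 without depending on the floating-point pipeline. A minimal sketch, with the operand ordering (left operand as the dividend) assumed:

```python
# Sketch of the zero shortcut in claim 6's output result register RX.
# Treating the left operand as the dividend for division is an assumption.

def rx_output(op, left, right, pipeline_result):
    zero_by_inspection = (
        (op == "mul" and (left == 0.0 or right == 0.0)) or
        (op == "div" and left == 0.0)          # 0 / x == 0 for x != 0
    )
    if zero_by_inspection:
        return 0.0            # recorded as 0 and sent to the external bus
    return pipeline_result    # otherwise the computed value goes out immediately

print(rx_output("mul", 0.0, 7.5, pipeline_result=None))  # -> 0.0, pipeline unused
```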
7. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 5, characterized in that the interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interaction output module, wherein:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the decoded data and control signals to the interaction output module;
Control signal decoding module: receives the external calculation instructions sent by the first gating module, decodes them, and sends the decoded control signals to the interaction output module;
Interaction output module: sends the data output by the input interactive access control module or the control signal decoding module to the second gating module through the bus interaction sub-unit.
8. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 6, characterized in that the number of floating-point pipelined vector calculators is four.
9. A general deep network calculation method based on multi-parallel calculation and caching, characterized in that the steps are as follows:
(1) select an instruction from the instruction pool according to the generated address pointer and output it; decode the instruction, output the first control signal, and generate the first/second/third gating command signals for data gating;
(2) buffer the data in the first control signal, obtain the data to be calculated through the external access bus, merge it with the data in the first control signal, and output the second control signal;
(3) gate the second control signal with the first gating command signal, calculate the data to be calculated according to the gated third control signal, and update the cached data with the calculation results;
(4) gate the calculation results for output according to the second gating command signal, while performing parallel instruction conflict retrieval and outputting the conflict judgment retrieval result;
(5) according to the third gating command signal, the calculation results, the conflict judgment retrieval result, and the register update signal, update the data cache, while updating the instruction address, forming a closed calculation loop.
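The five steps of claim 9 close into a single fetch, merge, compute, and update cycle. The compact loop below is one way to picture it; the two-field instructions, the operand source, and the dictionary cache are illustrative assumptions, and the conflict retrieval of step (4) is omitted:

```python
# Compact sketch of claim 9's five-step closed calculation loop.

def closed_loop(instruction_pool, external_operands, cycles=8):
    pc, cache = 0, {}
    for _ in range(cycles):
        # (1) choose an instruction by address pointer and decode it
        op, reg = instruction_pool[pc % len(instruction_pool)]
        # (2) fetch the data to be calculated over the external access bus
        operand = external_operands.get(reg, 0)
        # (3) calculate on the gated data
        if op == "mul":
            result = cache.get(reg, 1) * operand
        else:
            result = cache.get(reg, 0) + operand
        # (4)+(5) gate the result out, refresh the cache, and advance the
        # instruction address, which closes the loop
        cache[reg] = result
        pc += 1
    return cache

print(closed_loop([("add", 0), ("mul", 1)], {0: 2, 1: 3}))
# four passes over each instruction -> {0: 8, 1: 81}
```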
CN201811528451.7A 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation Active CN109739556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811528451.7A CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Publications (2)

Publication Number Publication Date
CN109739556A (en) 2019-05-10
CN109739556B CN109739556B (en) 2021-03-26

Family

ID=66359421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811528451.7A Active CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Country Status (1)

Country Link
CN (1) CN109739556B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5961628A (en) * 1997-01-28 1999-10-05 Samsung Electronics Co., Ltd. Load and store unit for a vector processor
CN1387649A * 1999-08-31 2002-12-25 Intel Corporation Parallel processor architecture
CN101751244A * 2010-01-04 2010-06-23 Tsinghua University Microprocessor
CN101986263A * 2010-11-25 2011-03-16 National University of Defense Technology Method and microprocessor supporting dynamic switching between single-instruction-stream and multi-instruction-stream execution
US10073696B2 * 2013-07-15 2018-09-11 Texas Instruments Incorporated Streaming engine with cache-like stream data storage and lifetime tracking
CN106445468A * 2015-10-08 2017-02-22 Shanghai Zhaoxin Semiconductor Co., Ltd. Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG CHUAN: "Research and Implementation of Parallel Computing Methods for the MPCore Multi-core Processor", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817638A (en) * 2019-11-18 2021-05-18 北京希姆计算科技有限公司 Data processing device and method
US11782722B2 (en) 2020-06-30 2023-10-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Input and output interfaces for transmitting complex computing information between AI processors and computing components of a special function unit
CN113051212A (en) * 2021-03-02 2021-06-29 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic device, and storage medium
CN113051212B (en) * 2021-03-02 2023-12-05 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic equipment and storage medium
CN113806250A (en) * 2021-09-24 2021-12-17 中国人民解放军国防科技大学 Method for coordinating general processor core and vector component, interface and processor

Also Published As

Publication number Publication date
CN109739556B (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
Chen et al. Regnn: A redundancy-eliminated graph neural networks accelerator
CN118690805A (en) Processing apparatus and processing method
CN109739556A (en) A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated
CN105468439A (en) Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN108052347A Device and method for executing instruction selection, and instruction mapping method
CN116301920B (en) Compiling system for deploying CNN model to high-performance accelerator based on FPGA
US20230394110A1 (en) Data processing method, apparatus, device, and medium
CN116401502B (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN114995822A Deep learning compiler optimization method specialized for CNN accelerators
CN112232517B Artificial intelligence acceleration engine and artificial intelligence processor
Chen et al. Rubik: A hierarchical architecture for efficient graph learning
Zhu et al. Taming unstructured sparsity on GPUs via latency-aware optimization
CN110047477A Optimization method, device, and system for a weighted finite-state transducer
Chen et al. Exploiting on-chip heterogeneity of versal architecture for GNN inference acceleration
Wang et al. COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network using Hybrid Data Reuse and Fusion Methodologies
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
CN111522776B (en) Computing architecture
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
Janssen et al. A specification invariant technique for regularity improvement between flow-graph clusters
Lin et al. swFLOW: A dataflow deep learning framework on Sunway TaihuLight supercomputer
CN116090519A (en) Compiling method of convolution operator and related product
US11714649B2 (en) RISC-V-based 3D interconnected multi-core processor architecture and working method thereof
CN106095730B FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant