CN109739556A - General deep learning processor based on multi-channel parallel cache interaction and computation


Info

Publication number
CN109739556A
CN109739556A
Authority
CN
China
Prior art keywords
module
unit
sent
data
gating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811528451.7A
Other languages
Chinese (zh)
Other versions
CN109739556B (en)
Inventor
禹霁阳
汪路元
程博文
李宗凌
刘伟伟
牛跃华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Spacecraft System Engineering
Original Assignee
Beijing Institute of Spacecraft System Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Spacecraft System Engineering filed Critical Beijing Institute of Spacecraft System Engineering
Priority to CN201811528451.7A priority Critical patent/CN109739556B/en
Publication of CN109739556A publication Critical patent/CN109739556A/en
Application granted granted Critical
Publication of CN109739556B publication Critical patent/CN109739556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)

Abstract

The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation. It primarily addresses the frequent multiply-accumulate operations and frequent parameter accesses in the convolution and fully-connected computations of deep learning. Using multi-cache parallel interactive vector computation, it reduces fragmented accesses to shared parameters and data; by retrieving from a cache of executed instructions, it improves the parallelism of the computation process and the efficiency of executing identical instructions and accessing identical parameters, reduces the occupation of hardware floating-point calculators by repeated instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction and data streams, and improves the real-time performance of deep learning network computation.

Description

General deep learning processor based on multi-channel parallel cache interaction and computation
Technical field
The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation, and in particular to the general acceleration of convolution and fully-connected computations in large-scale deep learning networks.
Background technique
Deep learning networks play a significant role in the autonomous detection and recognition, judgment and prediction, and other pattern-recognition tasks of artificial intelligence. However, a deep learning network must perform a large number of matrix and vector computations simultaneously: the computational load of the algorithms is very large and the rate of fragmented parameter accesses is high, which places high demands on the architecture design of the processor. Meanwhile, in embedded applications, and especially in the aerospace field, power consumption, volume, and area restrict the use of most commercial GPU processors and microprocessors.
In addition, existing commercial GPU computers and microprocessors are mostly based on computation over parallel register files. Even a processor such as the Titan 1080p, which has vector data-move operations, parallelizes over every datum using thousands of nodes in its internal computation process; this occupies enormous hardware resources while also consuming a great deal of power. Among other accelerators, similar input/output results in the Cambricon chip must pass through the MLU modules during processing, and the selected input/output operation is completed according to the judgment of different instruction types during decoding; this acceleration scheme operates through multiple parallel modules and combines HotBuf and ColdBuf to reduce or merge convolution computations with identical parameters. Such an architecture can effectively improve convolution-kernel computation performance, but does not necessarily perform well on unstructured deep learning networks. The AI processor designed by DeePhi Technology features variable hardware resources; by means of NPU nodes it optimizes the computation process at compile time and compresses parameters to reduce the computational load. However, this only reaches optimal speed when the program structure cooperates accordingly, and in practical large convolution computations the intermediate results are usually called by other processes, so it is difficult to achieve real optimization for all complex deep networks. In particular, DeePhi holds that simplifying the number system in the computation process can effectively solve the problem of huge computational load, yet in practice the 8-16 bit quantization of key nodes of a deep learning network oriented to small targets may have disastrous effects.
Current algorithm implementations for accelerating deep convolutional networks can only be completed by the parallel computation of multiple computing units. Although commercial GPU processors and dedicated IP exist, they are expensive and structurally complex, and remain far from miniaturized embedded applications. Designing a general deep learning processor based on multi-channel parallel cache interaction and computation can effectively meet the urgent needs of current low-power, miniaturized embedded AI processor development.
Summary of the invention
The technical problem solved by the invention: in the prior art, the convolution computation process in large-scale learning networks is complicated, power consumption is high, fragmented accesses are repeated, and the computational load is heavy; the invention proposes a general deep learning processor based on multi-channel parallel cache interaction and computation.
The present invention solves the above technical problem through the following technical solution:
A general deep learning processor based on multi-channel parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-computed data signal fed in from the external bus; it merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module, and updates its data according to the register update signal returned by the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module and, according to the first gating instruction signal sent by the instruction processing module, selects among the cached data in the second control signal for output; after gating control according to the to-be-computed data, it outputs a third control signal containing the operands required for the current computation and sends it to the data computation module, while sending a register update signal to the cache forwarding module;
Data computation module: receives the third control signal sent from the first gating module, computes the to-be-computed data according to the operands required for the current computation in the third control signal, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and, according to the second gating instruction signal sent by the instruction processing module, performs secondary gating control, forwarding the fixed-point, floating-point, or logic computation result to the third gating module and the executed-instruction cache module;
Executed-instruction cache module: receives the computation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision signal, containing the instruction-conflict retrieval result, used to determine whether another computation is in progress;
Third gating module: receives the computation result sent by the second gating module and, according to the third gating instruction signal sent by the instruction processing module and the conflict retrieval result sent by the executed-instruction cache module, generates the update signal used by the cache forwarding module for instruction updating, and sends the instruction update signal to the instruction processing module to form a computation closed loop;
Instruction processing module: receives the instruction update signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code, decodes it, and outputs the first control signal; the first control signal obtained after decoding, which is used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating instruction signals are generated and sent respectively to the first/second/third gating modules for output data gating.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules for output data gating.
The cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the decoded first control signal sent by the instruction processing module and selects the corresponding register, while sending the register data in the control signal to the inner buffer unit;
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data containing the to-be-computed data to the inner buffer unit;
Inner buffer unit: receives the register data sent by the register group unit and the external signal containing to-be-computed data sent by the peripheral interface control unit, and sends all signal data to the first gating module.
The data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the third control signal sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
The floating-point calculation unit comprises floating-point computation sub-units, interactive control sub-units, and bus interaction sub-units, wherein:
Floating-point computation sub-unit: receives the third control signal sent from the first gating module and performs floating-point computation; it also receives the read instruction of the interactive control sub-unit, through which the floating-point result is read out and sent to the second gating unit;
Interactive control sub-unit: when a floating-point computation sub-unit starts computing, sends a read instruction through the bus interaction sub-unit to the corresponding or adjacent floating-point computation sub-unit, and sends the computation result to the second gating module;
Bus interaction sub-unit: carries out the instruction and data interaction of the floating-point computation sub-units and interactive control sub-units.
The floating-point computation sub-unit comprises a floating-point pipelined vector calculator, a left operand data stream memory FIFO, a right operand data stream memory FIFO, an output result data stream memory FIFO, a left operand register RL, a right operand register RR, and an output result register RX, wherein:
Left operand register RL: receives the interactive control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Right operand register RR: receives the interactive control signal output by the bus interaction sub-unit, caches the written data operand, and sends it to the designated floating-point pipelined vector calculator for computation;
Left/right operand data stream memory FIFO: according to the interactive control signal output by the bus interaction sub-unit, caches the written FIFO data value and sends it to the designated floating-point pipelined vector calculator for pipelined computation;
Floating-point pipelined vector calculator: computes on the written data and sends the result to the output result register RX and the output result data stream memory FIFO;
Output result register RX: receives the computation result of the floating-point pipelined vector calculator, caches it, and then judges the computation operands: if an operand is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives the computation result of the floating-point pipelined vector calculator and caches it; if the summed value of the FIFO data stream is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately.
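By way of illustration only, a minimal C sketch of this zero short-circuit follows; the operation encoding and function names are assumptions, not the patent's signal definitions.

```c
/* Behavioural sketch (not the patent's RTL) of the zero-operand
 * short-circuit described above: for multiplication and division, a zero
 * operand (or a zero FIFO sum for vectors) lets the output stage record 0
 * immediately instead of waiting for the pipelined calculator. */
#include <stdio.h>

typedef enum { OP_ADD, OP_MUL, OP_DIV, OP_SIGM } fp_op_t;

static float output_stage(fp_op_t op, float operand_sum, float pipeline_result) {
    if ((op == OP_MUL || op == OP_DIV) && operand_sum == 0.0f)
        return 0.0f;             /* short-circuit: result recorded as 0 */
    return pipeline_result;      /* otherwise forward the pipeline output immediately */
}

int main(void) {
    printf("%g\n", output_stage(OP_MUL, 0.0f, 3.5f)); /* 0: short-circuited */
    printf("%g\n", output_stage(OP_ADD, 0.0f, 3.5f)); /* 3.5: addition unaffected */
    return 0;
}
```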
The interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interactive output module, wherein:
Input interactive access control module: receives the third control signal sent by the first gating module and decodes it, then sends the resulting data and control signals to the interactive output module;
Control signal decoding module: receives the external computation instruction sent by the first gating module and decodes it, then sends the resulting control signal to the interactive output module;
Interactive output module: sends the output data of the input interactive access control module or the control signal decoding module to the second gating module through the bus interaction sub-unit.
The number of floating-point pipelined vector calculators is four.
A general deep-network computation method with multi-channel parallel computation and caching proceeds as follows:
(1) select and output an instruction from the instruction pool according to the generated address pointer, decode the instruction and output the first control signal, and generate the first/second/third gating instruction signals for data gating;
(2) cache the data in the first control signal, obtain the to-be-computed data over the external access bus, and merge it with the data in the first control signal to output the second control signal;
(3) perform gating control on the second control signal using the first gating instruction signal, compute on the to-be-computed data according to the gated third control signal, and update the cached data with the computation result;
(4) gate the computation result according to the second gating instruction signal, while performing parallel-instruction conflict retrieval and outputting the conflict retrieval result;
(5) according to the third gating instruction signal, the computation result, and the conflict retrieval result, send the register update signal to update the data cache, and update the instruction address to form the computation closed loop.
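Schematically, the five steps form a fetch-gate-compute-retrieve-update loop. A C rendering of that loop shape follows; the payloads and function bodies are placeholders invented for illustration, only the step order comes from the method above.

```c
/* Schematic rendering of the five-step computation closed loop. */
#include <stdbool.h>
#include <stdio.h>

static int  step1_fetch_decode(int pc)           { return pc;        } /* instruction -> 1st control signal  */
static int  step2_cache_merge(int ctrl1)         { return ctrl1 + 1; } /* merge cached + bus data -> 2nd sig */
static int  step3_gate_compute(int ctrl2)        { return ctrl2 * 2; } /* gate -> 3rd signal -> compute      */
static bool step4_conflict_retrieval(int result) { (void)result; return false; }
static int  step5_update(int pc, bool conflict)  { return conflict ? pc : pc + 1; }

int main(void) {
    int pc = 0;
    for (int cycle = 0; cycle < 3; cycle++) {
        int c1 = step1_fetch_decode(pc);
        int c2 = step2_cache_merge(c1);
        int r  = step3_gate_compute(c2);
        bool k = step4_conflict_retrieval(r);
        pc = step5_update(pc, k);             /* instruction-address update closes the loop */
        printf("cycle=%d result=%d next_pc=%d\n", cycle, r, pc);
    }
    return 0;
}
```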
The advantages of the present invention over the prior art are as follows:
(1) The general deep learning processor based on multi-channel parallel cache interaction and computation provided by the invention addresses the heavy vector computation and high computational repetition of deep learning convolutional network computations. By caching executed instructions, searching that cache for instructions about to execute, and thereby producing their results quickly, it reduces the occupation of hardware floating-point calculators by repeated instructions, cuts the highly repetitive operations of deep learning network computation at the level of the instruction and data streams, and improves the speed of vector and matrix computation inside convolution kernels;
(2) Addressing the high data repetition between node computations in fully-connected computation, where instruction fetching and decoding are faster than in the convolution process, the invention proposes a floating-point calculation unit that guarantees parallel, interactive pipelined computation of incoming floating-point instructions: at the same moment it can complete, in parallel, vector computations of four kinds of operations, namely floating-point addition, multiplication, division, and Sigmoid, and it reduces the rate of fragmented parameter accesses.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the floating-point calculation unit provided by the invention;
Fig. 2 is a structural schematic diagram of the floating-point computation sub-unit provided by the invention;
Fig. 3 is a structural schematic diagram of the interactive control sub-unit provided by the invention;
Fig. 4 is a structural schematic diagram of the computing-unit bus interactive control module provided by the invention;
Fig. 5 is a structural schematic diagram of the executed-instruction cache module provided by the invention;
Fig. 6 is the internal block diagram of the processor provided by the invention.
Specific embodiment
The present invention relates to a general deep learning processor based on multi-channel parallel cache interaction and computation, especially for the frequent multiply-accumulate computations of large-scale deep convolutional networks and fully-connected networks.
The present invention will be further described below with reference to the accompanying drawings.
The invention mainly comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module. The main object of the invention is the quick execution of convolution and fully-connected computations in large-scale deep learning networks, so as to complete them in real time. As shown in Fig. 6, after an instruction update, the instruction pool unit outputs the instruction according to the instruction address generation unit and it enters the decoding unit, which outputs the first control signal. The first control signal enters the register group unit in the cache forwarding module, which also receives the register update and buffer update signals; together with the output data of the peripheral interface control unit it feeds the inner buffer unit, and the inner buffer unit outputs the second control signal to the first gating module. The first gating module, according to the second control signal and the first gating instruction signal, outputs the register update, the peripheral update, and the third control signal to the register group unit, the peripheral interface control unit, and the data computation module respectively; after computation, the fixed-point, floating-point, and logic results are sent to the second gating module. The second gating module receives the fixed-point, floating-point, and logic results and the second gating instruction signal, and sends the parallel-instruction conflict retrieval signal to the third gating module and the executed-instruction cache module. The executed-instruction cache module receives the parallel-instruction conflict retrieval signal, judges whether there is a conflicting or already-executed instruction, and returns the conflict retrieval result to the third gating module. The third gating module receives the parallel-instruction conflict retrieval signal, the conflict retrieval result, and the third gating instruction signal, outputs the register/cache/peripheral update signals, and outputs the instruction update signal to the instruction processing module.
Whether vector computation can be used is judged from the instruction; floating-point numbers in the computation process are handled by the floating-point calculation unit shown in Fig. 1. The module includes four four-I/O (four-input/output) RAM caches, the floating-point computation sub-units used respectively for addition, multiplication, division, and Sigmoid computation, and four bus interaction sub-units for the sharing and effective distribution of data. Floating-point data passes through a floating-point computation sub-unit; which four-I/O RAM cache it is written into is judged from the instruction code, as is whether that four-I/O RAM cache needs to be backed up to the two adjacent four-I/O RAM caches. The floating-point vector data in the four-I/O RAM cache is then taken out and fed into the pipelined input/output vector floating-point calculator, and the instruction code determines whether the output data needs to be backed up into the two adjacent bus interaction sub-units; if so, the computing-unit bus interactive control transfers or backs up the computation data to other computing modules.
The floating-point computation sub-unit, as shown in Fig. 2, includes the left/right operand data stream memory FIFOs, the output result data stream memory FIFO, the left operand register RL, the right operand register RR, and the output result register RX. The module first detects whether the external four-I/O RAM has data to process. If there is and the data to process is a single value, the left operand is read into the left operand register RL and the right operand into the right operand register RR; after one latch stage they are fed into the floating-point pipelined calculator for the addition/multiplication/division/Sigmoid operation, and the result is placed into the output result register RX, where it waits for the internal bus interface to take it out into the four-I/O RAM. If the data to process is a vector, the left operands are read continuously into the left operand data stream FIFO and the right operands into the right operand data stream FIFO, while being streamed into the floating-point pipelined calculator for the addition/multiplication/division/Sigmoid operation; the results are placed into the output result data stream FIFO, where they wait for the internal bus interface to take them out into the four-I/O RAM.
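A hypothetical C model of this scalar/vector dispatch is sketched below; the FIFO depth and the stand-in operation are assumptions, only the two paths come from the description above.

```c
/* Scalar data goes through the RL/RR operand registers; a vector is
 * streamed through the left/right operand FIFOs into the pipeline. */
#include <stdio.h>

#define FIFO_DEPTH 64

static float pipeline_op(float l, float r) { return l + r; } /* stand-in for add/mul/div/Sigmoid */

static void fp_sub_unit(const float *left, const float *right, float *out, int n) {
    if (n == 1) {
        float rl = left[0], rr = right[0];    /* latch into RL / RR */
        out[0] = pipeline_op(rl, rr);         /* result lands in RX */
    } else {
        float lf[FIFO_DEPTH], rf[FIFO_DEPTH]; /* operand stream FIFOs */
        for (int i = 0; i < n; i++) { lf[i] = left[i]; rf[i] = right[i]; }
        for (int i = 0; i < n; i++)           /* streamed into the pipeline */
            out[i] = pipeline_op(lf[i], rf[i]); /* results land in the output FIFO */
    }
}

int main(void) {
    float a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, c[3];
    fp_sub_unit(a, b, c, 3);
    printf("%g %g %g\n", c[0], c[1], c[2]);   /* 5 7 9 */
    return 0;
}
```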
The interactive control sub-unit, as shown in Fig. 3, is mainly used to distribute the computation information in the external third control signal to each computing unit and cache. It includes the input interactive access control module, the control signal decoding module, and the interactive output module. The interactive output module is in turn divided into internal modules for floating addition/multiplication interactive output judgment, floating addition/division interactive output judgment, floating division/Sigmoid interactive output judgment, floating multiplication/Sigmoid interactive output judgment, floating addition register/FIFO interactive output judgment, floating multiplication register/FIFO interactive output judgment, floating Sigmoid register/FIFO interactive output judgment, and floating division register/FIFO interactive output judgment. First, the input information is obtained from the external third control signal, and it is judged whether the instruction is a single computation instruction or a vector computation instruction; data is then output to the corresponding four-I/O RAM while judging whether the instruction can interact, and the interaction control information for the computing-unit bus interactive control module is generated from the interaction information. Then it is judged whether the computation result in the four-I/O RAM is a single datum or vector data, and the computation result is taken out and given to the instruction execution module.
The computing-unit bus interactive control module, as shown in Fig. 4, is mainly used for data interaction between the four I/O caches. According to the interactive control module information, it judges whether the current four-I/O cache X and four-I/O cache Y need to interact. The interaction modes are: X and Y are exchanged; X is backed up to Y; Y is backed up to X; X and Y remain unchanged. The delay time information of the exchange process is sent to the pipelined input/output vector floating-point calculator modules X and Y.
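A short C sketch of the four interaction modes follows; the enum labels are invented, only the four behaviours come from the description.

```c
/* The four cache-interaction modes between caches X and Y. */
#include <stdio.h>

typedef enum { XY_SWAP, XY_X_TO_Y, XY_Y_TO_X, XY_UNCHANGED } xy_mode_t;

static void interact(float *x, float *y, int n, xy_mode_t mode) {
    for (int i = 0; i < n; i++) {
        switch (mode) {
        case XY_SWAP:      { float t = x[i]; x[i] = y[i]; y[i] = t; } break;
        case XY_X_TO_Y:    y[i] = x[i]; break;   /* X backed up to Y */
        case XY_Y_TO_X:    x[i] = y[i]; break;   /* Y backed up to X */
        case XY_UNCHANGED: break;
        }
    }
}

int main(void) {
    float x[2] = {1, 2}, y[2] = {9, 9};
    interact(x, y, 2, XY_X_TO_Y);
    printf("y = {%g, %g}\n", y[0], y[1]);        /* {1, 2} */
    return 0;
}
```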
The executed-instruction cache module, as shown in Fig. 5, is mainly used for the quick execution of instructions not yet executed. Executed instructions and their output results are kept in the executed-instruction fragment cache, where new fragments overwrite old fragments. On receiving the parallel-instruction conflict retrieval output by the second gating module, the module searches the fragment cache for the instructions to be executed in the instruction pool. If there is a match, the computation result in the search result is taken out directly, the instruction no longer enters the computing module, and the fragment is deleted from the cache. If the instruction has already entered the execution stage during the search, the search process conflicts with it; the search for that instruction is stopped and the next instruction search begins. The conflict retrieval result is sent to the third gating module.
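The fragment cache behaves like a small memoization table. A minimal C sketch under that reading follows; the data layout and capacity are assumptions.

```c
/* Finished instructions are memoized with their results, new fragments
 * overwrite old ones, and a matching pending instruction reuses the
 * cached result instead of entering the compute modules. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FRAGS 8

typedef struct { uint64_t instr; float result; bool valid; } frag_t;

static frag_t cache[FRAGS];
static int next_slot;                      /* new fragment covers the old one */

static void record(uint64_t instr, float result) {
    cache[next_slot] = (frag_t){ instr, result, true };
    next_slot = (next_slot + 1) % FRAGS;
}

static bool lookup(uint64_t instr, float *out) {
    for (int i = 0; i < FRAGS; i++)
        if (cache[i].valid && cache[i].instr == instr) {
            *out = cache[i].result;        /* take the result directly ... */
            cache[i].valid = false;        /* ... and delete the fragment  */
            return true;                   /* instruction skips the calculators */
        }
    return false;
}

int main(void) {
    float r = 0.0f;
    record(0x1001, 42.0f);                 /* instruction 0x1001 finished with 42 */
    bool hit = lookup(0x1001, &r);
    printf("hit=%d r=%g\n", hit, r);       /* hit=1 r=42 */
    hit = lookup(0x1001, &r);
    printf("hit=%d\n", hit);               /* hit=0: the fragment was consumed */
    return 0;
}
```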
The structure of the learning processor provided by the invention is as follows:
A general deep learning processor based on multi-channel parallel cache interaction and computation comprises an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Instruction processing module: receives the parallel instruction signal or single instruction signal sent by the third gating module for selecting the instruction to be executed, generates the corresponding instruction code, and decodes it for output; the first control signal obtained after decoding, containing the operand control address used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating signals are generated and sent respectively to the first/second/third gating modules;
The first/second/third gating signals are multi-bit signals used to select the outputs of the first/second/third gating modules. The high two bits of the first gating signal select the output of the first gating module: '00' outputs the data issued by the register group unit, '01' the data issued by the inner buffer unit, and '10' the output data of the peripheral interface control unit. The high two bits of the second gating signal select the output of the second gating module: '00' outputs the fixed-point result, '01' the floating-point result, and '10' the logic result. The high bit of the third gating signal selects the output of the third gating module: '0' outputs the result of the second gating module and '1' the executed-instruction search result.
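A behavioural C sketch of these three selectors, using the bit encodings just given, follows; the module outputs are stubbed with plain floats.

```c
#include <stdio.h>

/* first gating module: '00' register group, '01' inner buffer, '10' peripheral */
static float first_gate(unsigned hi2, float reg_grp, float inner_buf, float periph) {
    switch (hi2 & 0x3u) {
    case 0x0: return reg_grp;
    case 0x1: return inner_buf;
    case 0x2: return periph;
    default:  return 0.0f;   /* '11' is not assigned in the description */
    }
}

/* second gating module: '00' fixed-point, '01' floating-point, '10' logic */
static float second_gate(unsigned hi2, float fx, float fp, float lg) {
    return (hi2 & 0x3u) == 0x0 ? fx : (hi2 & 0x3u) == 0x1 ? fp : lg;
}

/* third gating module: '0' compute result, '1' executed-instruction search result */
static float third_gate(unsigned hi1, float compute, float cached) {
    return (hi1 & 0x1u) ? cached : compute;
}

int main(void) {
    printf("%g\n", first_gate(0x1, 1.f, 2.f, 3.f));  /* 2: inner buffer   */
    printf("%g\n", second_gate(0x1, 4.f, 5.f, 6.f)); /* 5: floating-point */
    printf("%g\n", third_gate(0x0, 7.f, 8.f));       /* 7: compute result */
    return 0;
}
```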
Cache forwarding module: receives the register update signal, the first control signal, the external bus access data, and the peripheral update signal; it selects the corresponding register, performs internal caching, and sends the second control signal, containing the computation data address, to the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module; the second control signal contains the output data information of the register group unit, the inner buffer unit, and the peripheral interface unit, merged as the input of the first gating module, and which of the three sources is finally output is judged according to the first gating signal output by the decoding unit; after gating control according to the to-be-computed data, the third control signal containing the data required for this computation is output and sent to the data computation module and the executed-instruction cache module; the first gating signal is the control signal sent by the decoding unit to the first gating module and controls whether the gated path selects the data of the register, the inner buffer, or the peripheral interface;
Data computation module: receives the third control signal sent from the first gating module, containing the data and addresses required for the computation, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and performs secondary gating control according to the second gating signal output by the decoding unit, forwarding the fixed-point, floating-point, or logic result to the third gating module and the executed-instruction cache module; the second gating signal is the control signal sent by the decoding unit to the second gating module and controls which computation type's output the gated path selects;
Executed-instruction cache module: receives the parallel-instruction conflict retrieval information gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module the conflict retrieval result used to determine whether another computation is in progress;
Third gating module: receives the parallel-instruction conflict retrieval information sent by the second gating module; according to the third gating instruction signal sent by the decoding unit and the conflict retrieval result sent by the executed-instruction cache module, it outputs the instruction update signal to the instruction processing module to form the computation closed loop, and sends the register, cache, and peripheral update signals used respectively for the update control of the register group unit, the inner buffer unit, and the peripheral interface control unit.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the parallel instruction signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules.
The cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the control signal sent by the decoding unit, selects the corresponding register, and sends the data to the inner buffer unit;
Inner buffer unit: receives the data sent by the register group unit and caches it;
Peripheral interface control unit: receives the external bus access data, manages the data according to the peripheral update control signal, and sends the data to the inner buffer unit.
The data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the required data and addresses sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the required data and addresses sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the required data and addresses sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
The floating-point calculation unit comprises floating-point computation sub-units, interactive control sub-units, and bus interaction sub-units, wherein:
Floating-point computation sub-unit: receives the control signal, operands, and control addresses sent from the first gating module and performs floating-point computation; it also receives the control instruction of the interactive control sub-unit, through which the floating-point result is read out and sent to the second gating unit;
Interactive control sub-unit: receives the third control signal sent from the first gating module, generates a control instruction, and sends it to the floating-point computation sub-unit and the bus interaction sub-unit; through the bus interaction sub-unit it directs the corresponding or adjacent floating-point computation sub-unit to compute synchronously, while exchanging control information with the floating-point computation sub-unit;
Bus interaction sub-unit: sends control signals to the floating-point computation sub-unit and interactive control sub-unit and carries out their information interaction.
The floating-point computation sub-unit comprises a floating-point pipelined vector calculator, left/right operand data stream memory FIFOs, an output result data stream memory FIFO, left/right operand registers RL/RR, and an output result register RX, wherein:
Floating-point pipelined vector calculator: computes on all written input data and sends the results to the output result register RX and the output result data stream memory FIFO;
Left/right operand registers RL/RR: receive the control signal output by the bus interaction sub-unit, cache the written data operands, and send them to the designated floating-point pipelined vector calculator for computation;
Left/right operand data stream memory FIFO: according to the control signal output by the bus interaction sub-unit, caches the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined computation;
Output result register RX: receives the computation result of the floating-point pipelined vector calculator, caches it, and then judges the computation operands: if an operand is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives the computation result of the floating-point pipelined vector calculator and caches it; if the summed value of the FIFO data stream is 0 and the computation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately. This process reduces fragmented data accesses and repeated computation.
The interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interactive output module, wherein:
Input interactive access control module: receives the required third control signal sent by the first gating module and decodes it, then sends the resulting first-level operand control signal to the control signal decoding module;
Control signal decoding module: receives the first-level operand control signal sent by the input interactive access control module, decodes it again, and sends the resulting second-level operand control signal to the interactive output module;
Interactive output module: judges, according to the instruction decoding result output by the instruction pool unit, whether the second-level operand control signal sent by the control signal decoding module should be used to read data from the four bus shared caches and the floating-point calculators for computation.
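As an illustration of this two-stage decode chain, a C sketch follows; the field layout is hypothetical, only the stage order comes from the description.

```c
/* Two-stage decode: first-level operand control, then second-level
 * operand control driving the cache/calculator data reads. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t op; uint8_t is_vector; } level1_t;     /* first-level operand control  */
typedef struct { uint8_t ram_sel; uint8_t use_fifo; } level2_t; /* second-level operand control */

static level1_t decode_level1(uint32_t third_ctrl) {
    return (level1_t){ .op = third_ctrl & 0xF, .is_vector = (third_ctrl >> 4) & 1 };
}

static level2_t decode_level2(level1_t l1) {
    /* vector computations stream through FIFOs; scalars use the registers */
    return (level2_t){ .ram_sel = l1.op & 0x3, .use_fifo = l1.is_vector };
}

int main(void) {
    level2_t l2 = decode_level2(decode_level1(0x12));
    printf("ram=%u fifo=%u\n", (unsigned)l2.ram_sel, (unsigned)l2.use_fifo);
    return 0;
}
```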
The present invention will be further described below with reference to the accompanying drawings and an embodiment.
The present invention relates to a general deep learning processor architecture based on multi-channel parallel cache interaction and computation, especially for the frequent multiply-accumulate computations in large-scale deep convolutional networks and fully-connected networks. Taking the following computation, common at convolution-operation nodes, as an example, the implementation steps are explained.
$F_{N\times N}=E_{N\times N}\,\mathrm{Sigmoid}\big((A_{N\times N}B_{N\times N}+C_{N\times N})\;./\;D_{N\times N}\big)$ (1)
where $N$ is a positive integer, "./" denotes element-wise division, and $\mathrm{Sigmoid}(x)=(1+e^{-x})^{-1}$; the Sigmoid computation is abbreviated below as SIGMF.
Conventional processors operate on the above formula number by number; even processors such as the Cambricon and DeePhi's Tingtao can only complete pipelined operations under code-optimized conditions. In the processor architecture designed in this patent, the computation of formula (1) can be completed in only a few computation cycles by matrix computation instructions in a parallel pipelined manner. Moreover, coding formula (1) on a conventional processor requires at least five condition judgments and two nested loops, whereas with the instruction set of the processor designed in this patent only five lines of assembly code are needed, as follows:
MULF.M  A_{N×N}, B_{N×N}, T_{N×N}, N
ADDF.M  T_{N×N}, C_{N×N}, T_{N×N}, N
DIVF.M  T_{N×N}, D_{N×N}, T_{N×N}, N
SIGMF.M T_{N×N}, T_{N×N}, N
MULF.M  T_{N×N}, E_{N×N}, F_{N×N}, N
The above five lines of assembly code are compiled into machine code, which enters the instruction pool unit through the external debugging interface and debugging control unit. The whole processor then works according to the instruction code output by the instruction pool unit.
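For reference, a plain C rendering of what this five-instruction sequence computes for formula (1) follows. Reading MULF.M as a matrix product and ADDF.M/DIVF.M/SIGMF.M as element-wise operations is our interpretation of the walkthrough below; the function names are invented and the code models the arithmetic only, not the processor.

```c
#include <math.h>
#include <stdio.h>

#define N 2

static void mulf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++) s += a[i][k] * b[k][j];
            t[i][j] = s;                                  /* matrix product */
        }
}

static void addf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = a[i][j] + b[i][j];
}

static void divf_m(float a[N][N], float b[N][N], float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = a[i][j] / b[i][j]; /* "./" */
}

static void sigmf_m(float t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) t[i][j] = 1.0f / (1.0f + expf(-t[i][j]));
}

int main(void) {
    float A[N][N] = {{1, 2}, {3, 4}}, B[N][N] = {{1, 0}, {0, 1}};
    float C[N][N] = {{0, 0}, {0, 0}}, D[N][N] = {{1, 1}, {1, 1}};
    float E[N][N] = {{1, 0}, {0, 1}}, T[N][N], F[N][N];

    mulf_m(A, B, T);   /* MULF.M  A, B, T, N */
    addf_m(T, C, T);   /* ADDF.M  T, C, T, N */
    divf_m(T, D, T);   /* DIVF.M  T, D, T, N */
    sigmf_m(T);        /* SIGMF.M T, T, N    */
    mulf_m(T, E, F);   /* MULF.M  T, E, F, N */

    printf("F[0][0] = %f\n", F[0][0]);  /* Sigmoid(1) ≈ 0.731059 */
    return 0;
}
```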
<1> Execute the assembly statement MULF.M A_{N×N}, B_{N×N}, T_{N×N}, N: the two matrices at addresses A_{N×N} and B_{N×N} are taken out and stored into four-I/O RAM D of the floating-point calculation unit, and the matrix-vector computation parameter N is written to computing-unit bus interactive control module D. With reference to the processor internal structure block diagram of Fig. 6, the operation and computation flow of this instruction are as follows:
(1) First, this instruction is taken out of the instruction pool unit according to the address output by the instruction address generation unit and sent to the decoding unit;
(2) The decoding unit selects the data input to the cache forwarding module according to the size of N, and the peripheral interface control unit takes out the data corresponding to addresses A_{N×N}, B_{N×N}, and T_{N×N}; if N = 1 the instruction is treated as an ordinary scalar multiplication and the data is input to the register group unit, and if N is not equal to 1 the operation is treated as a vector computation and the data is input to the inner buffer unit;
(3) The cache forwarding module judges, according to the buffer update signal of the third gating module, whether the new input data should replace the existing data of the inner buffer unit, and judges from the past-instruction retrieval result whether this instruction is identical to the previously executed one; if identical, no data update is performed in the register group unit or inner buffer unit of the cache forwarding module, and the previously stored data of the register group unit or inner buffer unit is output directly to the first gating module;
(4) After receiving the data, the first gating module judges according to the data gating control which part of the data computation module the data is input to; since this instruction is a floating-point multiplication, the operands enter the floating-point calculation unit;
(5) After receiving the data, the floating-point calculation unit in the data computation module first places the data into the interactive control sub-unit, which selects different output channels according to the computation mode and data length; this instruction is a floating-point vector computation, so the operand data is output to four-I/O RAM D by the floating multiplication register/FIFO interactive output module of the interactive control, and the computation information enters bus interaction sub-unit D;
(6) Bus interaction sub-unit D judges from the next instruction that the T_{N×N} data will be used again for computation in the floating addition sub-unit; therefore, after the T_{N×N} data is input to four-I/O RAM D, the DMA from D to A is controlled to back up RAM T in D to RAM A;
(7) Four-I/O RAM D inputs the operand data to the floating multiplication computation sub-unit, placing the operands into the left/right operand data stream FIFOs, or into the left/right operand registers RL/RR if N = 1; if the left/right operand register RL/RR equals 0, or the summed result of the left operand data stream FIFO input is 0, the output of this floating-point multiplication or vector computation is taken to be 0 and the output result register RX or output stream FIFO outputs 0; otherwise the output result is computed normally and returned to RAM D;
(8) The data computation module outputs the multiplication result to the second gating module, which outputs the computation result according to the second gating instruction signal of the decoding unit and at the same time outputs the parallel-instruction conflict retrieval information to the third gating module and the executed-instruction cache module; whether the current data computation module has other computations in progress is judged so as to prevent parallel computation instruction conflicts, and in case of conflict the current output is delayed and made to wait;
(9) The third gating module, according to the third gating instruction signal of the decoding unit and the conflict retrieval result of the executed-instruction cache module, outputs the instruction, register, cache, and peripheral update information; for this instruction, when N = 1 the result is stored in the register group unit, and when N is not equal to 1 it is stored through the peripheral update at the address corresponding to T_{N×N}; the address of the next instruction is then generated, here by adding 1 to this instruction's address, and the next instruction ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N is executed;
<2> Execute ADDF.M T_{N×N}, C_{N×N}, T_{N×N}, N: the matrix at address C_{N×N} is taken out and stored into four-I/O RAM A of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module D to computing-unit bus interactive control module A and then written into the floating addition computation sub-unit;
<3> The data enters the floating addition computation sub-unit through four-I/O RAM A, and the sum of the two matrices T_{N×N} and C_{N×N} is computed; the result is passed to four-I/O RAM A by computing-unit bus interactive control module A and returned over the internal bus to the matrix-vector register T_{N×N};
<4> Execute DIVF.M T_{N×N}, D_{N×N}, T_{N×N}, N: the matrix at address D_{N×N} is taken out and stored into four-I/O RAM B of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module A to computing-unit bus interactive control module B and then written into the floating division computation sub-unit;
<5> The data enters the pipelined input/output vector floating-point calculator through four-I/O RAM B, and the element-wise division of the two matrices T_{N×N} and D_{N×N} is computed; the result is passed to four-I/O RAM B by computing-unit bus interactive control module B and returned over the internal bus to the matrix-vector register T_{N×N};
<6> Execute SIGMF.M T_{N×N}, T_{N×N}, N: the T_{N×N} data is passed from computing-unit bus interactive control module B to computing-unit bus interactive control module C and then written into the floating Sigmoid computation sub-unit;
<7> The Sigmoid function of T_{N×N} is computed; the result is passed to four-I/O RAM C by computing-unit bus interactive control module C and returned over the internal bus to the matrix-vector register T_{N×N};
<8> Execute MULF.M T_{N×N}, E_{N×N}, F_{N×N}, N: the matrix at address E_{N×N} is taken out and stored into four-I/O RAM D of the floating-point calculation unit; the T_{N×N} data is passed from computing-unit bus interactive control module C to computing-unit bus interactive control module D and then written into the floating multiplication computation sub-unit;
<9> The data enters the floating multiplication computation sub-unit through four-I/O RAM D, and the product of the matrices T_{N×N} and E_{N×N} is computed; the result is returned to four-I/O RAM D via computing-unit bus interactive control module D and then over the internal bus to the matrix-vector register F_{N×N}, completing this computation task.
Since the execution of the other instructions is similar to the first, the detailed description is not repeated here.
From the execution of the above concrete operations it can be seen that step (3) of the first instruction's execution reduces the frequency of peripheral data interaction and the computation delay caused by communication between caches; in a deep learning network, whose convolution-kernel computations easily run to millions or even hundreds of millions of invocations, this saves a great deal of cache access time. Step (6) backs up frequently used data for a short time according to correlation with the following instructions, likewise reducing the time spent on data interaction between internal and external caches during computation, and in particular reducing the rate of fragmented parameter accesses. Step (7) pre-judges the input data during computation and directly outputs the result for zero-valued data, reducing computation overhead. Finally, because the present processor is designed with both fixed-point and floating-point computation modules, it avoids the severe loss of model accuracy caused by computation error in other designs.
Content not described in detail in the present specification belongs to the well-known art of those skilled in the art.

Claims (9)

1. A general deep learning processor based on multi-channel parallel cache interaction and computation, characterized by comprising an instruction processing module, a cache forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Cache forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-computed data signal fed in from the external bus; it merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module, and updates its data according to the register update signal returned by the first gating module;
First gating module: receives the second control signal sent by the cache forwarding module and, according to the first gating instruction signal sent by the instruction processing module, selects among the cached data in the second control signal for output; after gating control according to the to-be-computed data, it outputs a third control signal containing the operands required for the current computation and sends it to the data computation module, while sending a register update signal to the cache forwarding module;
Data computation module: receives the third control signal sent from the first gating module, computes the to-be-computed data according to the operands required for the current computation in the third control signal, and sends the resulting fixed-point, floating-point, and logic computation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic computation results sent from the data computation module and, according to the second gating instruction signal sent by the instruction processing module, performs secondary gating control, forwarding the fixed-point, floating-point, or logic computation result to the third gating module and the executed-instruction cache module;
Executed-instruction cache module: receives the computation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision signal, containing the instruction-conflict retrieval result, used to determine whether another computation is in progress;
Third gating module: receives the computation result sent by the second gating module and, according to the third gating instruction signal sent by the instruction processing module and the conflict retrieval result sent by the executed-instruction cache module, generates the update signal used by the cache forwarding module for instruction updating, and sends the instruction update signal to the instruction processing module to form a computation closed loop;
Instruction processing module: receives the instruction update signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code, decodes it, and outputs the first control signal; the first control signal obtained after decoding, which is used to select the corresponding register, is sent to the cache forwarding module, while the first/second/third gating instruction signals are generated and sent respectively to the first/second/third gating modules for output data gating.
2. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 1, characterized in that the instruction processing module comprises an instruction address generation unit, an instruction pool unit, and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the cache forwarding module, and generates the first/second/third gating instruction signals, sent respectively to the first/second/third gating modules for output data gating.
3. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 2, characterized in that the cache forwarding module comprises a register group unit, an inner buffer unit, and a peripheral interface control unit, wherein:
Register group unit: receives the decoded first control signal sent by the instruction processing module and selects the corresponding register, while sending the register data in the control signal to the inner buffer unit;
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data containing the to-be-computed data to the inner buffer unit;
Inner buffer unit: receives the register data sent by the register group unit and the external signal containing to-be-computed data sent by the peripheral interface control unit, and sends all signal data to the first gating module.
4. The general deep learning processor based on multi-channel parallel cache interaction and computation according to claim 1, characterized in that the data computation module comprises a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, wherein:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and sends the fixed-point result to the second gating module;
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and sends the floating-point result to the second gating module;
Logic calculation unit: receives the third control signal sent from the first gating module, performs logic computation, and sends the logic result to the second gating module.
5. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 4, characterized in that the floating-point calculation unit comprises floating-point calculation sub-units, an interactive control sub-unit, and a bus interaction sub-unit, wherein:
Floating-point calculation sub-unit: receives the third control signal sent from the first gating module and performs floating-point calculation; it also receives read instructions from the interactive control sub-unit, through which the floating-point results are read out and sent to the second gating module;
Interactive control sub-unit: when a floating-point calculation sub-unit starts calculating, sends read instructions through the bus interaction sub-unit to the corresponding or adjacent floating-point calculation sub-units, and sends the calculation results to the second gating module;
Bus interaction sub-unit: carries the instruction and data interaction between the floating-point calculation sub-units and the interactive control sub-unit.
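The useful property of claim 5 is that a result already produced by a corresponding or adjacent floating-point calculation sub-unit can be read back over the bus instead of being recomputed. The sketch below models the bus as a published-result cache; that cache and its keying scheme are assumptions, since the claim only specifies read instructions:

```python
# Sketch of claim 5's read-sharing path (hypothetical names and keying).

class BusInteraction:
    # Bus interaction sub-unit: carries instruction/data traffic between
    # floating-point sub-units and the interactive control sub-unit.
    def __init__(self):
        self.published = {}                    # results posted by sub-units

    def publish(self, key, value):
        self.published[key] = value

    def read(self, key):
        return self.published.get(key)         # the "read instruction"

class InteractiveControl:
    def __init__(self, bus):
        self.bus = bus

    def on_calculation_start(self, operand_key):
        # A hit means an adjacent sub-unit already produced this result,
        # so it can go straight to the second gating module.
        return self.bus.read(operand_key)

bus = BusInteraction()
bus.publish(("mul", 3.0, 4.0), 12.0)
print(InteractiveControl(bus).on_calculation_start(("mul", 3.0, 4.0)))  # -> 12.0
```

This read-before-compute pattern is what lets sub-units that share parameters avoid repeating the same multiply-accumulate work.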
6. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 5, characterized in that the floating-point calculation sub-unit comprises a floating-point pipelined vector calculator, a left-operand data stream memory FIFO, a right-operand data stream memory FIFO, an output result data stream memory FIFO, a left-operand register RL, a right-operand register RR, and an output result register RX, wherein:
Left-operand register RL: receives the interactive control signal output by the bus interaction sub-unit, buffers the written data operand, and sends it to the designated floating-point pipelined vector calculator for calculation;
Right-operand register RR: receives the interactive control signal output by the bus interaction sub-unit, buffers the written data operand, and sends it to the designated floating-point pipelined vector calculator for calculation;
Left/right-operand data stream memory FIFO: according to the interactive control signal output by the bus interaction sub-unit, buffers the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined calculation;
Floating-point pipelined vector calculator: calculates on the written data and sends the result to the output result register RX and the output result data stream memory FIFO;
Output result register RX: receives the result of the floating-point pipelined vector calculator, buffers it, and then inspects the operands of the calculation; if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately;
Output result data stream memory FIFO: receives and buffers the results of the floating-point pipelined vector calculator; if the summed value of the data stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus; otherwise the result is output to the external bus immediately.
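The RX rule of claim 6 is an algebraic shortcut: a product with a zero operand, or zero divided by anything non-zero, is zero by inspection, so the register can emit 0 without depending on the floating-point pipeline. A minimal sketch, with the operand ordering (left operand as the dividend) assumed:

```python
# Sketch of the zero shortcut in claim 6's output result register RX.
# Treating the left operand as the dividend for division is an assumption.

def rx_output(op, left, right, pipeline_result):
    zero_by_inspection = (
        (op == "mul" and (left == 0.0 or right == 0.0)) or
        (op == "div" and left == 0.0)          # 0 / x == 0 for x != 0
    )
    if zero_by_inspection:
        return 0.0            # recorded as 0 and sent to the external bus
    return pipeline_result    # otherwise the computed value goes out immediately

print(rx_output("mul", 0.0, 7.5, pipeline_result=None))  # -> 0.0, pipeline unused
```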
7. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 5, characterized in that the interactive control sub-unit comprises an input interactive access control module, a control signal decoding module, and an interaction output module, wherein:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the decoded data and control signals to the interaction output module;
Control signal decoding module: receives the external calculation instructions sent by the first gating module, decodes them, and sends the decoded control signals to the interaction output module;
Interaction output module: sends the data output by the input interactive access control module or the control signal decoding module to the second gating module through the bus interaction sub-unit.
8. The general deep learning processor based on multi-parallel cache interaction and calculation according to claim 6, characterized in that the number of floating-point pipelined vector calculators is four.
9. A general deep network calculation method based on multi-parallel calculation and caching, characterized in that the steps are as follows:
(1) select an instruction from the instruction pool according to the generated address pointer and output it; decode the instruction, output the first control signal, and generate the first/second/third gating command signals for data gating;
(2) buffer the data in the first control signal, obtain the data to be calculated through the external access bus, merge it with the data in the first control signal, and output the second control signal;
(3) gate the second control signal with the first gating command signal, calculate the data to be calculated according to the gated third control signal, and update the cached data with the calculation results;
(4) gate the calculation results for output according to the second gating command signal, while performing parallel instruction conflict retrieval and outputting the conflict judgment retrieval result;
(5) according to the third gating command signal, the calculation results, the conflict judgment retrieval result, and the register update signal, update the data cache, while updating the instruction address, forming a closed calculation loop.
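The five steps of claim 9 close into a single fetch, merge, compute, and update cycle. The compact loop below is one way to picture it; the two-field instructions, the operand source, and the dictionary cache are illustrative assumptions, and the conflict retrieval of step (4) is omitted:

```python
# Compact sketch of claim 9's five-step closed calculation loop.

def closed_loop(instruction_pool, external_operands, cycles=8):
    pc, cache = 0, {}
    for _ in range(cycles):
        # (1) choose an instruction by address pointer and decode it
        op, reg = instruction_pool[pc % len(instruction_pool)]
        # (2) fetch the data to be calculated over the external access bus
        operand = external_operands.get(reg, 0)
        # (3) calculate on the gated data
        if op == "mul":
            result = cache.get(reg, 1) * operand
        else:
            result = cache.get(reg, 0) + operand
        # (4)+(5) gate the result out, refresh the cache, and advance the
        # instruction address, which closes the loop
        cache[reg] = result
        pc += 1
    return cache

print(closed_loop([("add", 0), ("mul", 1)], {0: 2, 1: 3}))
# four passes over each instruction -> {0: 8, 1: 81}
```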
CN201811528451.7A 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation Active CN109739556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811528451.7A CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Publications (2)

Publication Number Publication Date
CN109739556A (en) 2019-05-10
CN109739556B CN109739556B (en) 2021-03-26

Family

ID=66359421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811528451.7A Active CN109739556B (en) 2018-12-13 2018-12-13 General deep learning processor based on multi-parallel cache interaction and calculation

Country Status (1)

Country Link
CN (1) CN109739556B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5961628A (en) * 1997-01-28 1999-10-05 Samsung Electronics Co., Ltd. Load and store unit for a vector processor
CN1387649A * 1999-08-31 2002-12-25 Intel Corporation Parallel processor architecture
CN101751244A * 2010-01-04 2010-06-23 Tsinghua University Microprocessor
CN101986263A * 2010-11-25 2011-03-16 National University of Defense Technology Method and microprocessor supporting dynamic switching between single-instruction-stream and multi-instruction-stream execution
US10073696B2 * 2013-07-15 2018-09-11 Texas Instruments Incorporated Streaming engine with cache-like stream data storage and lifetime tracking
CN106445468A * 2015-10-08 2017-02-22 Shanghai Zhaoxin Semiconductor Co., Ltd. Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG CHUAN: "Research and Implementation of Parallel Computing Methods for the MPCore Multi-core Processor", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817638A (en) * 2019-11-18 2021-05-18 北京希姆计算科技有限公司 Data processing device and method
US11782722B2 (en) 2020-06-30 2023-10-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Input and output interfaces for transmitting complex computing information between AI processors and computing components of a special function unit
CN113051212A (en) * 2021-03-02 2021-06-29 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic device, and storage medium
CN113051212B (en) * 2021-03-02 2023-12-05 长沙景嘉微电子股份有限公司 Graphics processor, data transmission method, data transmission device, electronic equipment and storage medium
CN113806250A (en) * 2021-09-24 2021-12-17 中国人民解放军国防科技大学 Method for coordinating general processor core and vector component, interface and processor

Also Published As

Publication number Publication date
CN109739556B (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
Chen et al. Regnn: A redundancy-eliminated graph neural networks accelerator
CN118690805A (en) Processing apparatus and processing method
CN109739556A (en) A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated
CN105468439A (en) Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN108052347A Device and method for executing instruction selection, and instruction mapping method
CN116301920B (en) Compiling system for deploying CNN model to high-performance accelerator based on FPGA
US20230394110A1 (en) Data processing method, apparatus, device, and medium
CN116401502B (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN114995822A Deep learning compiler optimization method specialized for CNN accelerators
CN112232517B Artificial intelligence acceleration engine and artificial intelligence processor
Chen et al. Rubik: A hierarchical architecture for efficient graph learning
Zhu et al. Taming unstructured sparsity on GPUs via latency-aware optimization
CN110047477A Optimization method, device, and system for a weighted finite-state transducer
Chen et al. Exploiting on-chip heterogeneity of versal architecture for GNN inference acceleration
Wang et al. COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network using Hybrid Data Reuse and Fusion Methodologies
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
CN111522776B (en) Computing architecture
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
Janssen et al. A specification invariant technique for regularity improvement between flow-graph clusters
Lin et al. swFLOW: A dataflow deep learning framework on Sunway TaihuLight supercomputer
CN116090519A (en) Compiling method of convolution operator and related product
US11714649B2 (en) RISC-V-based 3D interconnected multi-core processor architecture and working method thereof
CN106095730B FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant