CN109739556A - A general deep learning processor based on multi-way parallel cache interaction and computation - Google Patents
Publication number: CN109739556A · Application number: CN201811528451.7A · Authority: CN (China) · Legal status: Granted
Abstract
The present invention relates to a general deep learning processor based on multi-way parallel cache interaction and computation. It targets the frequent multiply-accumulate operations and frequent parameter accesses that arise in the convolutional and fully-connected stages of deep learning. By performing multi-way parallel interactive computation with vector operations, it reduces fragmented accesses to shared parameters and data; by retrieving results from a cache of already-executed instructions, it raises the parallelism of the computation and the efficiency of repeated instructions and repeated parameter accesses, and reduces the occupation of the hardware floating-point calculators by repeated instructions. The highly repetitive operations of deep learning network computation are thus reduced at the level of the instruction and data streams, improving the real-time performance of deep learning network computation.
Description
Technical field
The present invention relates to a general deep learning processor based on multi-way parallel cache interaction and computation, and in particular to the general acceleration of convolutional and fully-connected computation in large-scale deep learning networks.
Background technique
Deep learning networks are of great significance in the autonomous detection and recognition, judgement and prediction, and other pattern-recognition tasks of artificial intelligence. However, a deep learning network must perform a large number of matrix and vector computations simultaneously; the computational load of the algorithms is very large and the rate of fragmented parameter accesses is high, which places high demands on the architecture of the processor. Meanwhile, in embedded applications, and especially in the aerospace field, constraints on power consumption, volume and area rule out most commercial GPU processors and microprocessors.
In addition, existing commercial GPU computers and microprocessors are mostly based on processing over parallel register files. Even a processor such as the Titan 1080p, which supports vector data-movement operations, uses thousands of nodes in its internal computation to parallelize over each datum; this occupies enormous hardware resources and also consumes a great deal of energy. In other accelerators, such as the Cambricon chips, input and output results must pass through an MLU module during processing, and the selected input/output operations are completed according to the instruction type determined during decoding. This acceleration scheme operates through multiple parallel modules and combines HotBuf and ColdBuf to reduce or merge convolutional computations with identical parameters. Such an architecture can effectively improve convolution-kernel performance, but does not necessarily perform well on unstructured deep learning networks. The AI processor designed by DeePhi Tech features variable hardware resources: by means of NPU nodes, it optimizes the computation flow at compile time and compresses parameters to reduce the computational load. However, this only reaches optimal speed when the program structure provides the corresponding support, and in practical large convolutional computations the intermediate results are usually called by other processes, so it is difficult to achieve real optimization for complex deep networks. In particular, DeePhi holds that simplifying the computation system within the computation flow can effectively solve the problem of huge computational load; in practice, however, 8-16 bit quantization of the key nodes of a deep learning network aimed at small targets may have disastrous effects.
At present, accelerated algorithms for deep convolutional networks can only be implemented through the parallel computation of multiple computing units. Although commercial GPU processors and dedicated IP cores exist, they are expensive and structurally complex, and remain far from miniaturized embedded applications. Designing a general deep learning processor based on multi-way parallel cache interaction and computation can effectively meet the urgent needs of current low-power, miniaturized embedded AI processor development.
Summary of the invention
The technical problem solved by the invention is as follows: in the prior art, the convolutional computation flow in large-scale learning networks is complex, power consumption is high, and fragmented repeated accesses and heavy computation are problems. The invention proposes a general deep learning processor based on multi-way parallel cache interaction and computation.
The present invention solves above-mentioned technical problem and is achieved by following technical solution:
A general deep learning processor based on multi-way parallel cache interaction and computation, comprising an instruction processing module, a Buffer forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Buffer forwarding module: receives the first control signal sent by the instruction processing module and selects the corresponding register; internally caches the to-be-computed data signal fed in from the external bus; merges the first control signal and the to-be-computed data into a second control signal and sends it to the first gating module; and updates its data according to the register update signal returned by the first gating module.
First gating module: receives the second control signal sent by the Buffer forwarding module and, according to the first gating command signal sent by the instruction processing module, selects among the cached data in the second control signal for output. After gating control according to the to-be-computed data, it outputs a third control signal containing the operands required for this computation, sends it to the data computation module, and at the same time sends a register update signal back to the Buffer forwarding module.
Data computation module: receives the third control signal sent from the first gating module, computes on the to-be-computed data according to the operands required for this computation contained in the third control signal, and sends the resulting fixed-point, floating-point and logic calculation results to the second gating module.
Second gating module: receives the fixed-point, floating-point and logic calculation results sent from the data computation module and, according to the second gating command signal sent by the instruction processing module, performs secondary gating control, forwarding the fixed-point result, the floating-point result or the logic result to the third gating module and the executed-instruction cache module.
Executed-instruction cache module: receives the calculation result gated out by the second gating module, performs a parallel instruction-conflict retrieval to determine whether another computation is in progress, and sends a decision instruction signal containing the instruction-conflict retrieval result to the third gating module.
Third gating module: receives the calculation result sent by the second gating module and, according to the third gating command signal sent by the instruction processing module and the conflict-retrieval result sent by the executed-instruction cache module, generates the instruction update signal with which the Buffer forwarding module performs its instruction update, and sends the instruction update signal to the instruction processing module to form the computation closed loop.
Instruction processing module: receives the instruction update signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code and decodes it; sends the first control signal obtained after decoding, which is used to select the corresponding register, to the Buffer forwarding module; and at the same time generates the first, second and third gating command signals, which are sent respectively to the first, second and third gating modules to gate the output data.
The instruction processing module comprises an instruction address generation unit, an instruction pool unit and a decoding unit, wherein:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit.
Instruction pool unit: receives the corresponding program address pointer sent by the instruction address generation unit, addresses the instruction code, and sends the instruction code to the decoding unit.
Decoding unit: receives the instruction code sent by the instruction pool unit, decodes it, sends the resulting control signal to the Buffer forwarding module, and at the same time generates the first, second and third gating command signals, which are sent respectively to the first, second and third gating modules to gate the output data.
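The address-generation, instruction-pool and decode path above can be sketched as follows. This is an illustrative, non-normative model: the patent does not specify an instruction encoding, so the bit fields used here (a register-select nibble plus three gating fields) are hypothetical.

```python
def fetch_and_decode(instruction_pool, address_pointer):
    """Sketch of the address-generation -> instruction-pool -> decode path.
    The encoding is hypothetical: the high nibble selects the register
    (first control signal), the low bits carry the three gating commands."""
    code = instruction_pool[address_pointer]
    first_control = (code >> 4) & 0xF            # register select for the Buffer forwarding module
    gating = ((code >> 2) & 0x3,                 # first gating command
              (code >> 1) & 0x1,                 # second gating command
              code & 0x1)                        # third gating command
    return first_control, gating
```

The point is only the separation of concerns: the pool is addressed by the pointer, and a single decode step yields both the register-select control signal and the three gating commands.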
The Buffer forwarding module comprises a register group unit, an inner buffer unit and a peripheral interface control unit, wherein:
Register group unit: receives the decoded first control signal sent by the instruction processing module, selects the corresponding register, and sends the register data contained in the control signal to the inner buffer unit.
Peripheral interface control unit: receives the to-be-computed data sent over the external bus access interface and sends the external signal data containing the to-be-computed data to the inner buffer unit.
Inner buffer unit: receives the register data sent by the register group unit and the external signal containing the to-be-computed data sent by the peripheral interface control unit, and sends all the signal data to the first gating module.
The data computation module comprises a fixed-point calculation unit, a floating-point calculation unit and a logic computing unit, wherein:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point computation, and at the same time sends the fixed-point result to the second gating module.
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point computation, and at the same time sends the floating-point result to the second gating module.
Logic computing unit: receives the third control signal sent from the first gating module, performs logic computation, and at the same time sends the logic result to the second gating module.
The floating-point calculation unit comprises floating-point calculation sub-units, interactive control sub-units and bus interaction sub-units, wherein:
Floating-point calculation sub-unit: receives the third control signal sent from the first gating module and performs floating-point computation; at the same time it receives the read instruction of the interactive control sub-unit, and the floating-point result is read out through the interactive control sub-unit and sent to the second gating module.
Interactive control sub-unit: when a floating-point calculation sub-unit starts computing, sends a read instruction to the corresponding or adjacent floating-point calculation sub-unit through the bus interaction sub-unit, and sends the calculation result to the second gating module.
Bus interaction sub-unit: carries out the instruction and data interaction between the floating-point calculation sub-units and the interactive control sub-units.
The floating-point calculation sub-unit comprises a floating-point pipelined vector calculator, a left-operand data-stream memory FIFO, a right-operand data-stream memory FIFO, an output result data-stream memory FIFO, a left operand register RL, a right operand register RR, and an output result register RX, wherein:
Left operand register RL: receives the interactive control signal output by the bus interaction sub-unit, caches the written operand, and sends it to the designated floating-point pipelined vector calculator for computation.
Right operand register RR: receives the interactive control signal output by the bus interaction sub-unit, caches the written operand, and sends it to the designated floating-point pipelined vector calculator for computation.
Left/right operand data-stream memory FIFO: according to the interactive control signal output by the bus interaction sub-unit, caches the written FIFO data values and sends them to the designated floating-point pipelined vector calculator for pipelined computation.
Floating-point pipelined vector calculator: computes on the written data and sends the calculation result to the output result register RX and the output result data-stream memory FIFO.
Output result register RX: receives the calculation result of the floating-point pipelined vector calculator, caches it, and then checks the operand: if the operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus immediately; otherwise the result is output directly to the external bus.
Output result data-stream memory FIFO: receives the calculation results of the floating-point pipelined vector calculator and caches them; if the summed value of the FIFO data stream is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent to the external bus immediately; otherwise the results are output directly to the external bus.
The interactive control sub-unit comprises an input interactive access control module, a control signal interpretation module and an interaction output module, wherein:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the decoded data and control signal to the interaction output module.
Control signal interpretation module: receives the external computation instruction sent by the first gating module, decodes it, and sends the decoded control signal to the interaction output module.
Interaction output module: sends the output data of the input interactive access control module or the control signal interpretation module to the second gating module through the bus interaction sub-unit.
The number of floating-point pipelined vector calculators is four.
A general deep network computation method with multi-way parallel computation and caching, the steps being as follows:
(1) select and output an instruction from the instruction pool according to the generated address pointer, decode the instruction and output the first control signal, and generate the first, second and third gating command signals to gate the data;
(2) cache the data in the first control signal, obtain the to-be-computed data over the external access bus, merge it with the data in the first control signal, and output the second control signal;
(3) gate the second control signal with the first gating command signal, compute on the to-be-computed data according to the gated third control signal, and at the same time update the cached data with the calculation result;
(4) gate the calculation result out according to the second gating command signal, perform the parallel instruction-conflict retrieval at the same time, and output the conflict-judgement retrieval result;
(5) according to the third gating command signal, the calculation result and the conflict-judgement retrieval result, send the register update signal to update the data cache, and at the same time update the instruction address, forming the computation closed loop.
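The five steps above can be sketched as a highly simplified software model of the closed loop: fetch and decode, merge cached and external data, compute and update the cache, forward the result, advance the instruction address. All module behavior is collapsed into a few lines and the two opcodes are hypothetical; only the loop structure mirrors the method.

```python
def run_closed_loop(instruction_pool, external_data, steps=3):
    """Illustrative model of the five-step computation closed loop.
    instruction_pool: list of (opcode, operand-name) pairs.
    external_data: data fetched over the external access bus."""
    address = 0          # step (1): instruction address pointer
    cache = {}           # data cache of the Buffer forwarding module
    results = []
    for _ in range(steps):
        # (1) fetch and decode the instruction at the current address
        opcode, operand = instruction_pool[address % len(instruction_pool)]
        # (2) merge cached data with data fetched over the external bus
        to_compute = cache.get(operand, external_data[operand])
        # (3) compute and update the cache with the calculation result
        result = to_compute * 2 if opcode == "mul2" else to_compute + 1
        cache[operand] = result
        # (4)-(5) forward the gated result and advance the address (closed loop)
        results.append(result)
        address += 1
    return results
```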
Compared with the prior art, the advantages of the present invention are:
(1) The general deep learning processor based on multi-way parallel cache interaction and computation provided by the invention addresses the heavy vector computation and high computational redundancy of the deep learning convolutional network computation flow. By caching already-executed instructions, searching that cache for instructions about to be executed, and thereby producing instruction results quickly, it reduces the occupation of the hardware floating-point calculators by repeated instructions, reduces the highly repetitive operations of deep learning network computation at the level of the instruction and data streams, and improves the speed of vector and matrix computation within convolution-kernel computation.
(2) Addressing the high data redundancy between node computations in fully-connected computation, and the comparatively fast instruction fetch and decode process relative to convolutional computation, the invention proposes a floating-point calculation unit that guarantees parallel, interactive pipelined computation of the input floating-point instructions: in the same cycle it can complete, in parallel, vector computations of four kinds of operation (floating-point addition, multiplication, division and Sigmoid), thereby reducing the rate of fragmented parameter accesses.
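The four operation types named in advantage (2) can be sketched as follows. This is a purely illustrative software model: the four lists are computed sequentially here, whereas the hardware runs them in four parallel pipelined calculators in the same cycle.

```python
import math

def vector_compute(lefts, rights):
    """Sketch of the four operation types the floating-point unit supports:
    elementwise add, multiply, divide, and Sigmoid (applied to the left
    operands). In hardware these run in four parallel pipelined
    calculators; here they are computed one after another."""
    return {
        "add": [l + r for l, r in zip(lefts, rights)],
        "mul": [l * r for l, r in zip(lefts, rights)],
        "div": [l / r for l, r in zip(lefts, rights)],
        "sigmoid": [1.0 / (1.0 + math.exp(-l)) for l in lefts],
    }
```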
Description of the drawings
Fig. 1 is a structural schematic diagram of the floating-point calculation unit provided by the invention;
Fig. 2 is a structural schematic diagram of the floating-point calculation sub-unit provided by the invention;
Fig. 3 is a structural schematic diagram of the interactive control sub-unit provided by the invention;
Fig. 4 is a structural schematic diagram of the computing-unit bus interactive control module provided by the invention;
Fig. 5 is a structural schematic diagram of the executed-instruction cache module provided by the invention;
Fig. 6 is the internal architecture diagram of the processor provided by the invention.
Specific embodiment
The present invention relates to a general deep learning processor based on multi-way parallel cache interaction and computation, aimed especially at the frequent multiply-accumulate computations in large-scale deep convolutional networks and fully-connected networks.
The present invention is further described below with reference to the accompanying drawings.
The invention mainly comprises an instruction processing module, a Buffer forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module. The main object of the invention is to execute the convolutional and fully-connected computations in large-scale deep learning networks quickly, so as to achieve real-time completion. As shown in Fig. 6, after an instruction update, the instruction pool unit outputs an instruction according to the instruction address generation unit and passes it to the decoding unit, which outputs the first control signal. The first control signal enters the register group unit in the Buffer forwarding module, which also receives the register update and buffer update signals; it is fed, together with the output data of the peripheral interface control unit, into the inner buffer unit, and the inner buffer unit outputs the second control signal to the first gating module. The first gating module, according to the second control signal and the first gating command signal, outputs the register update signal, the peripheral update signal and the third control signal to the register group unit, the peripheral interface control unit and the data computation module respectively; after computation, the fixed-point, floating-point and logic results are sent to the second gating module. The second gating module receives the fixed-point, floating-point and logic results and the second gating command signal, and sends the parallel instruction-conflict retrieval signal to the third gating module and the executed-instruction cache module. The executed-instruction cache module receives the parallel instruction-conflict retrieval signal, judges whether the instruction conflicts with an executed instruction, and returns the conflict-judgement retrieval result to the third gating module. The third gating module receives the parallel instruction-conflict retrieval signal, the conflict-judgement retrieval result and the third gating command signal, and outputs the register/buffer/peripheral update signals and the instruction update signal to the instruction processing module.
During computation, each instruction is judged as to whether it can be computed with vector operations; for floating-point computation, the floating-point calculation unit is used, as shown in Fig. 1. The module comprises four four-input/output RAM caches, floating-point calculation sub-units for addition, multiplication, division and Sigmoid respectively, and four bus interaction sub-units for the sharing and effective distribution of data. The floating-point calculation sub-unit judges into which four-input/output RAM cache the floating-point data is input, and the instruction code determines whether that four-input/output RAM cache needs to be backed up to the two adjacent four-input/output RAM caches. The floating-point vector data in the four-input/output RAM cache is then taken out and streamed into the pipelined input/output vector floating-point calculator, and the instruction code determines whether the output data needs to be backed up into the two adjacent bus interaction sub-units; if so, the computing-unit bus interactive control moves or backs up the computed data to the other computing modules.
The floating-point calculation sub-unit, as shown in Fig. 2, comprises the left/right operand data-stream memory FIFOs, the output result data-stream memory FIFO, the left operand register RL, the right operand register RR, and the output result register RX. The module first detects whether the external four-input/output RAM has data to be processed. If there is, and the data to be processed is a single value, the left operand is read into the left operand register RL and the right operand into the right operand register RR; after one stage of latching they are input into the floating-point pipelined calculator, which performs the addition/multiplication/division/Sigmoid operation, and the result is fed into the output result register RX, where it waits for the internal bus interface to take it out and feed it into the four-input/output RAM. If the data to be processed is a vector, the left operands are read continuously into the left-operand data-stream memory FIFO and the right operands into the right-operand data-stream memory FIFO, while being streamed into the floating-point pipelined calculator, which performs the addition/multiplication/division/Sigmoid operation; the results are fed into the output result data-stream memory FIFO, waiting for the internal bus interface to take them out and feed them into the four-input/output RAM.
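The scalar/vector split described above can be sketched as follows. This is an illustrative model only: the RL/RR registers and the FIFOs are reduced to a tuple and deques, and the pipelined calculator is reduced to a callable.

```python
from collections import deque

def dispatch(data, op):
    """Sketch of the scalar/vector split in the floating-point sub-unit:
    a single operand pair goes through the RL/RR registers, while a
    vector of pairs streams through the left/right operand FIFOs into
    the pipelined calculator and out through the result FIFO."""
    if isinstance(data, tuple):                 # single value -> RL/RR registers
        rl, rr = data
        return op(rl, rr)                       # one pipelined calculation
    left_fifo = deque(l for l, _ in data)       # vector -> operand FIFOs
    right_fifo = deque(r for _, r in data)
    out_fifo = deque()                          # output result data-stream FIFO
    while left_fifo:
        out_fifo.append(op(left_fifo.popleft(), right_fifo.popleft()))
    return list(out_fifo)
```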
The interactive control sub-unit, as shown in Fig. 3, is mainly used to distribute the computation information in the externally input third control signal to the computing units and caches. It comprises the input interactive access control module, the control signal interpretation module and the interaction output module. The interaction output module is in turn divided into internal modules for floating-point addition/multiplication interaction output judgement, floating-point addition/division interaction output judgement, floating-point division/Sigmoid interaction output judgement, floating-point multiplication/Sigmoid interaction output judgement, floating-point addition register/FIFO interaction output judgement, floating-point multiplication register/FIFO interaction output judgement, floating-point Sigmoid register/FIFO interaction output judgement, and floating-point division register/FIFO interaction output judgement. First, the input information is obtained from the external third control signal, the instruction is judged to be a single computation instruction or a vector computation instruction, and the data is output to the corresponding four-input/output RAM; at the same time it is judged whether the instruction can interact, and the interactive control information for the computing-unit bus interactive control module is generated from the interaction information. Then the calculation result in the four-input/output RAM is judged to be single data or vector data, and the calculation result is taken out and given to the instruction execution module.
The computing-unit bus interactive control module, as shown in Fig. 4, is mainly used for data interaction between the four input/output caches. According to the information from the interactive control module, it judges whether the current four-input/output cache X and four-input/output cache Y need to interact. The interaction modes are: X and Y are exchanged; X is backed up to Y; Y is backed up to X; X and Y remain unchanged. The delay information of the exchange process is sent to the pipelined input/output vector floating-point calculator modules X and Y.
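The four interaction modes above can be sketched as a small selector over the cache contents of X and Y. This is an illustrative model; the mode names are hypothetical labels for the four cases listed in the text.

```python
def interact(x, y, mode):
    """Sketch of the four cache-interaction modes between four-input/output
    caches X and Y: exchange, backup X->Y, backup Y->X, or leave both
    unchanged. Returns the new (X, Y) contents."""
    if mode == "swap":          # X and Y are exchanged
        return y, x
    if mode == "x_to_y":        # X is backed up to Y
        return x, x
    if mode == "y_to_x":        # Y is backed up to X
        return y, y
    if mode == "hold":          # X and Y remain unchanged
        return x, y
    raise ValueError(f"unknown interaction mode: {mode}")
```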
The executed-instruction cache module, as shown in Fig. 5, is mainly used for the quick execution of instructions that have not yet been executed. Executed instructions and their output results are kept in the executed-instruction fragment cache, which uses a new-fragment-overwrites-old-fragment strategy. On receiving the parallel instruction-conflict retrieval output by the second gating module, the module searches the fragment cache for the instruction about to be executed from the instruction pool. If there is a matching result, the calculation result in the search result is taken out directly, the instruction no longer enters the computing module, and the fragment is deleted from the cache. If the instruction has already entered the execution stage during the search, the search process has produced a conflict: the search for that instruction is stopped and the search process for the next instruction begins. The conflict-judgement retrieval result is sent to the third gating module.
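The behavior of the executed-instruction fragment cache described above can be sketched as follows. This is an illustrative software model under stated assumptions: the capacity, the string-keyed fragments and the three outcome labels are hypothetical; only the hit/miss/conflict logic, the consume-on-hit deletion and the overwrite-oldest policy follow the text.

```python
from collections import OrderedDict

class ExecutedInstructionCache:
    """Sketch of the executed-instruction fragment cache: completed
    instructions and their results are retained, new fragments overwrite
    the oldest ones, and a hit lets a pending instruction skip the
    computing module. A search is abandoned (conflict) if the pending
    instruction has already entered the execution stage."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fragments = OrderedDict()           # instruction -> cached result

    def record(self, instruction, result):
        if len(self.fragments) >= self.capacity:
            self.fragments.popitem(last=False)   # new fragment overwrites oldest
        self.fragments[instruction] = result

    def search(self, instruction, already_executing=False):
        if already_executing:
            return ("conflict", None)            # stop this search, move on
        if instruction in self.fragments:
            result = self.fragments.pop(instruction)  # hit: take result, delete fragment
            return ("hit", result)
        return ("miss", None)
```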
The learning processor structure provided by the invention is as follows:
A general deep learning processor based on multi-way parallel cache interaction and computation, comprising an instruction processing module, a Buffer forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, wherein:
Instruction processing module: receives the parallel instruction signal or single instruction signal sent by the third gating module, which is used to select the instruction to be executed, generates the corresponding instruction code and decodes it for output; sends the first control signal obtained after decoding, which contains the operand control address used to select the corresponding register, to the Buffer forwarding module; and at the same time generates the first, second and third gating signals, which are sent respectively to the first, second and third gating modules.
The first, second and third gating signals are multi-bit signals used to select the outputs of the first, second and third gating modules. The high two bits of the first gating signal select the output of the first gating module: when they are '00', the data issued by the register group unit is output; when '01', the data sent by the inner buffer unit; when '10', the output data of the peripheral interface control unit. The high two bits of the second gating signal select the output of the second gating module: when they are '00', the fixed-point result is output; when '01', the floating-point result; when '10', the logic result. The high bit of the third gating signal selects the output of the third gating module: when it is '0', the output of the second gating module is output; when it is '1', the executed-instruction search result.
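The three gating encodings above can be modeled as simple lookup tables, one per gating module. This is an illustrative sketch; the table and source names are hypothetical, while the bit encodings follow the text.

```python
# Hypothetical model of the three gating selectors. The bit encodings
# ('00'/'01'/'10' for the first two gates, '0'/'1' for the third) follow
# the description; everything else is illustrative.
FIRST_GATE = {0b00: "register_group", 0b01: "inner_buffer", 0b10: "peripheral_interface"}
SECOND_GATE = {0b00: "fixed_point_result", 0b01: "floating_point_result", 0b10: "logic_result"}
THIRD_GATE = {0b0: "second_gating_output", 0b1: "executed_instruction_search_result"}

def select(gate_table, signal_bits):
    """Return the output source chosen by the high bits of a gating signal."""
    try:
        return gate_table[signal_bits]
    except KeyError:
        raise ValueError(f"undefined gating encoding: {signal_bits:#b}")
```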
Buffer forwarding module: receives the register update signal, the first control signal, the external bus access data and the peripheral update signal, selects the corresponding register and performs inner caching, and sends the second control signal, which contains the addresses of the data to be computed, to the first gating module.
First gating module: receives the second control signal sent by the Buffer forwarding module, which contains the output data information of the register group unit, the inner buffer unit and the peripheral interface unit. After these are merged as the second control signal and input into the first gating module, the first gating signal output by the decoding unit determines which of the three second-control-signal sources is finally output. After gating control according to the to-be-computed data, the module outputs the third control signal, which contains the data required for this computation, and sends it to the data computation module and the executed-instruction cache module. The first gating signal is the control signal sent by the decoding unit to the first gating module, and is used to control whether the channel gates the data of the register, the inner buffer or the peripheral interface.
Data computation module: receives the third control signal sent from the first gating module, which contains the data and addresses required for the computation, and sends the resulting fixed-point, floating-point and logic calculation results to the second gating module.
Second gating module: receives the fixed-point, floating-point and logic calculation results sent from the data computation module and performs secondary gating control according to the second gating signal output by the decoding unit, forwarding the fixed-point result, the floating-point result or the logic result to the third gating module and the executed-instruction cache module. The second gating signal is the control signal sent by the decoding unit to the second gating module, and is used to control which calculation type's output result the channel selects.
Executed-instruction cache module: receives the parallel instruction-conflict retrieval information gated out by the second gating module, performs the parallel instruction-conflict retrieval to determine whether another computation is in progress, and sends the conflict-judgement retrieval result to the third gating module.
Third gating module: receives the parallel instruction-conflict retrieval information sent by the second gating module and, according to the third gating command signal sent by the decoding unit and the conflict-judgement retrieval result sent by the executed-instruction cache module, outputs the instruction update signal to the instruction processing module, forming the computation closed loop. It also sends the register, buffer and peripheral update signals, which are used respectively for the update control of the register group unit, the inner buffer unit and the peripheral interface control unit.
The instruction processing module includes an instruction address generation unit, an instruction pool unit, and a decoding unit, in which:
Instruction address generation unit: selects the corresponding program address pointer according to the parallel instruction signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the corresponding instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the Buffer forwarding module, and at the same time generates the first, second, and third gating command signals, which are sent to the first, second, and third gating modules, respectively.
The Buffer forwarding module includes a register group unit, an internal cache unit, and a peripheral interface control unit, in which:
Register group unit: receives the control signal sent by the decoding unit, selects the corresponding register, and sends the data to the internal cache unit;
Internal cache unit: receives and caches the data sent by the register group unit;
Peripheral interface control unit: receives data accessed over the external bus, performs data flow control according to the peripheral update control signal, and sends the data to the internal cache unit.
The data computation module includes a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, in which:
Fixed-point calculation unit: receives the required data and addresses sent from the first gating module, performs fixed-point calculation, and sends the fixed-point calculation result to the second gating module;
Floating-point calculation unit: receives the required data and addresses sent from the first gating module, performs floating-point calculation, and sends the floating-point calculation result to the second gating module;
Logic calculation unit: receives the required data and addresses sent from the first gating module, performs logic calculation, and sends the logic calculation result to the second gating module.
The floating-point calculation unit includes a floating-point calculation sub-unit, an interactive control sub-unit, and a bus interaction sub-unit, in which:
Floating-point calculation sub-unit: receives the control signal, operands, and control addresses sent from the first gating module and performs floating-point calculation; it also receives the control instruction of the interactive control sub-unit, through which the floating-point calculation result is read out and sent to the second gating unit;
Interactive control sub-unit: receives the third control signal sent from the first gating module, generates control instructions, and sends them to the floating-point calculation sub-unit and the bus interaction sub-unit; through the bus interaction sub-unit it triggers the corresponding or adjacent floating-point calculation sub-unit to calculate synchronously, while exchanging control information with the floating-point calculation sub-unit;
Bus interaction sub-unit: sends control signals to the floating-point calculation sub-unit and the interactive control sub-unit, and carries out the information exchange between the floating-point calculation sub-unit and the interactive control sub-unit.
The floating-point calculation sub-unit includes a floating-point pipeline vector calculator, left/right operand data-stream FIFOs, an output result data-stream FIFO, left/right operand registers RL/RR, and an output result register RX, in which:
Floating-point pipeline vector calculator: calculates on all written input data and sends the calculated result to the output result register RX and the output result data-stream FIFO;
Left/right operand registers RL/RR: receive the control signal output by the bus interaction sub-unit, cache the written operand data, and send it to the designated floating-point pipeline vector calculator for calculation;
Left/right operand data-stream FIFOs: according to the control signal output by the bus interaction sub-unit, cache the written FIFO data values and send them to the designated floating-point pipeline vector calculator for pipelined calculation;
Output result register RX: receives the calculated result of the floating-point pipeline vector calculator, caches it, and then judges the operands; if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent directly to the external bus, otherwise the result is output to the external bus as calculated;
Output result data-stream FIFO: receives and caches the calculated result of the floating-point pipeline vector calculator; if the summed value of the data-stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent directly to the external bus, otherwise the result is output to the external bus as calculated. This process reduces fragmented data accesses and repeated calculations.
The interactive control sub-unit includes an input interactive access control module, a control signal interpretation module, and an interactive output module, in which:
Input interactive access control module: receives the required third control signal sent by the first gating module, decodes it, and sends the resulting first-level operand control signal to the control signal interpretation module;
Control signal interpretation module: receives the first-level operand control signal sent by the input interactive access control module, decodes it again, and sends the resulting second-level operand control signal to the interactive output module;
Interactive output module: according to the instruction-decoding result output by the instruction pool unit, judges whether the second-level operand control signal sent by the control signal interpretation module is needed to carry out the four-bus shared-cache and floating-point calculator data reads.
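The two-level decode above can be sketched as a small pipeline. The field layout (a 4-bit operand-select field followed by enable bits) is purely an assumption for illustration; the patent does not specify the encoding.

```python
# Illustrative two-stage decode of the third control signal: a first decode
# yields a first-level operand control word, a second decode yields the
# read enables, and the output stage issues it only when enabled.
def decode_stage1(third_ctrl):
    # First decode: split off a hypothetical 4-bit operand-select field.
    return {"operand_sel": third_ctrl & 0xF, "rest": third_ctrl >> 4}

def decode_stage2(level1):
    # Second decode: derive shared-cache / floating-point read enables.
    return {"operand_sel": level1["operand_sel"],
            "bus_read": bool(level1["rest"] & 0x1),
            "fpu_read": bool(level1["rest"] & 0x2)}

def interact_output(level2, issue_enabled):
    # Issue only when the instruction-pool decode result enables it.
    return level2 if issue_enabled else None
```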
The present invention is further described below with reference to the accompanying drawings and an embodiment.
The present invention relates to a general deep-learning processor architecture based on multiple parallel cache interaction and calculation, aimed especially at the frequent multiply-accumulate operations in large-scale deep convolutional networks and fully connected networks. The implementation steps are explained by taking the following calculation, common at convolution nodes, as an example:
F_{N×N} = E_{N×N} · Sigmoid((A_{N×N} B_{N×N} + C_{N×N}) ./ D_{N×N})   (1)
where N is a positive integer, ./ denotes element-wise division, and Sigmoid(x) = (1 + e^{-x})^{-1}, abbreviated below as SIGMF.
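Formula (1) can be modeled in a few lines of NumPy as a software reference. This is an interpretive sketch: the juxtapositions A·B and E·Sigmoid(...) are read as matrix products and "./" as element-wise division, since the patent gives the formula without further notation.

```python
# Reference model of formula (1): F = E * Sigmoid((A*B + C) ./ D).
import numpy as np

def sigmoid(x):
    # Sigmoid(x) = (1 + e^{-x})^{-1}, abbreviated SIGMF in the patent.
    return 1.0 / (1.0 + np.exp(-x))

def formula_1(A, B, C, D, E):
    T = A @ B + C        # matrix product plus bias
    T = T / D            # element-wise "./" division
    return E @ sigmoid(T)
```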
A conventional processor evaluates the above formula number by number; even the processors of companies such as Cambricon and DeePhi can only complete it as pipelined operations after code optimization. In the processor architecture of the present patent, the calculation of formula (1) can be completed in only a few calculation cycles by matrix-vector instructions executed in a parallel pipelined manner. Moreover, coding formula (1) for a conventional processor requires at least five condition judgments and two loops, whereas with the instruction set designed for the present patent only five lines of assembly code are needed, as follows:
MULF.M  A[N×N], B[N×N], T[N×N], N
ADDF.M  T[N×N], C[N×N], T[N×N], N
DIVF.M  T[N×N], D[N×N], T[N×N], N
SIGMF.M T[N×N], T[N×N], N
MULF.M  T[N×N], E[N×N], F[N×N], N
After compilation, the above five lines of assembly code produce machine code, which enters the instruction pool unit under the control of the external debug interface and debug control unit. The entire processor then works according to the instruction codes output by the instruction pool unit.
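The five-instruction sequence can be checked against formula (1) with a toy dispatch model. The mnemonics come from the patent's listing; the register file, operand ordering, and the reading of MULF.M as a matrix product are illustrative assumptions, not the processor's actual decoder.

```python
# Toy interpreter for the five-line matrix instruction sequence above.
import numpy as np

OPS = {
    "MULF.M": lambda a, b: a @ b,                    # modeled as matrix product
    "ADDF.M": lambda a, b: a + b,
    "DIVF.M": lambda a, b: a / b,                    # element-wise "./"
    "SIGMF.M": lambda a: 1.0 / (1.0 + np.exp(-a)),   # Sigmoid(x) = (1+e^-x)^-1
}

def run(program, regs):
    # Each entry: (mnemonic, src..., dst). The trailing N operand of the real
    # instruction set (vector length) is implicit in the array shapes here.
    for mnem, *operands in program:
        *srcs, dst = operands
        regs[dst] = OPS[mnem](*(regs[s] for s in srcs))
    return regs

PROGRAM = [
    ("MULF.M", "A", "B", "T"),
    ("ADDF.M", "T", "C", "T"),
    ("DIVF.M", "T", "D", "T"),
    ("SIGMF.M", "T", "T"),
    ("MULF.M", "T", "E", "F"),
]
```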
<1> The assembly statement MULF.M A[N×N], B[N×N], T[N×N], N is executed: the two matrix data at addresses A[N×N] and B[N×N] are fetched and stored in four-input/output RAM D of the floating-point calculation unit, and the matrix-vector calculation parameter N is written to computation-unit bus interaction control module D. In the processor internal structure block diagram of Fig. 6, the operation and calculation flow of this instruction are as follows:
(1) First, this instruction is fetched from the instruction pool unit according to the address output by the instruction address generation unit and sent to the decoding unit;
(2) The decoding unit selects the data input to the Buffer forwarding module according to the size of N, fetching the data corresponding to addresses A[N×N], B[N×N], and T[N×N] through the peripheral interface control unit. If N = 1, the instruction is regarded as an ordinary scalar multiplication and the data are input to the register group unit; if N is not equal to 1, the operation is regarded as a vector calculation and the data are input to the internal cache unit;
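The N-based routing in step (2) amounts to a simple dispatch, sketched below. The destination names mirror the patent's units; the function itself and its signature are illustrative assumptions.

```python
# Model of step (2): N = 1 is treated as a scalar operation routed to the
# register group, while N > 1 is treated as a vector/matrix operation routed
# to the internal cache.
def route_operands(n, operands, register_group, inner_cache):
    if n == 1:
        register_group.extend(operands)   # ordinary (scalar) data path
        return "register_group"
    inner_cache.extend(operands)          # vector calculation path
    return "inner_cache"
```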
(3) The Buffer forwarding module judges, according to the cache update signal of the third gating module, whether the new input data should replace the past data held in the internal cache unit, and judges, according to the past instruction retrieval result, whether this instruction is identical to the previous one. If it is identical, the register group unit and internal cache unit of the Buffer forwarding module are not updated; instead, the previously stored data of the register group unit or internal cache unit are output directly to the first gating module;
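The repeated-instruction reuse in step (3) can be sketched as a one-entry cache. The class and method names are illustrative, not the patent's; the point is that a matching instruction skips the costly operand re-fetch entirely.

```python
# Model of step (3): if the incoming instruction matches the previous one,
# the cached operands are forwarded unchanged instead of being re-fetched.
class CacheForward:
    def __init__(self):
        self.last_instruction = None
        self.cached_operands = None

    def forward(self, instruction, fetch_operands):
        if instruction != self.last_instruction:
            # Costly peripheral access, performed only on a miss.
            self.cached_operands = fetch_operands(instruction)
            self.last_instruction = instruction
        # Hit: pass the previously stored operands through as-is.
        return self.cached_operands
```

For the millions of repeated convolution-kernel instructions of a deep network, this is where the saved cache-access time comes from.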
(4) After receiving the data, the first gating module judges, under data gating control, which part of the data computation module the data are input to; since this instruction is a floating-point multiplication, the operands enter the floating-point calculation unit;
(5) After receiving the data, the floating-point calculation unit in the data computation module first places the data into the interactive control sub-unit, which selects different output channels according to the data calculation mode and length. This instruction is a floating-point vector calculation, so the floating-point multiplication register/FIFO interaction output module in the interactive control sub-unit outputs the operand data to four-input/output RAM D, and the calculation information enters bus interaction sub-unit D;
(6) By inspecting the next instruction, bus interaction sub-unit D judges that the T[N×N] data can still be used for calculation by the floating-point addition sub-unit; therefore, after the T[N×N] data are input to four-input/output RAM D, the RAM-controlled DMA D-to-A backs up the T[N×N] data in RAM D to RAM A;
(7) Four-input/output RAM D inputs the operand data to the floating-point multiplication calculation sub-unit, where the operands are placed into the left/right operand data-stream FIFOs, or stored in the left/right operand registers RL/RR if N = 1. If the left/right operand register RL/RR equals 0, or the summed result of the data input to the left operand data-stream FIFO is 0, the output of this floating-point multiplication or vector calculation is judged to be 0 and the output result register RX or output data-stream FIFO outputs 0; otherwise the result is calculated and output to RAM D normally;
(8) The data computation module outputs the multiplication result to the second gating module, which outputs the result according to the second gating command signal from the decoding unit and at the same time outputs parallel-instruction conflict retrieval information to the third gating module and the executed-instruction cache module, judging whether the current data computation module has other calculations in progress; this prevents parallel computation instruction conflicts, and in the case of a conflict the current output is delayed and made to wait;
(9) The third gating module judges the retrieval result according to the third gating command signal of the decoding unit and the conflict judgment of the executed-instruction cache module, and outputs the instruction, register, cache, and peripheral update information. For this instruction, when N = 1 the result is stored in the register group unit; when N is not equal to 1 the result is stored at the address corresponding to T[N×N] through a peripheral update. The address of the next instruction is then generated, here by adding 1 to the current instruction address, and the next instruction ADDF.M T[N×N], C[N×N], T[N×N], N is executed;
<2> ADDF.M T[N×N], C[N×N], T[N×N], N is executed: the matrix data at address C[N×N] are fetched and stored in four-input/output RAM A of the floating-point calculation unit; the T[N×N] data are passed from computation-unit bus interaction control module D to computation-unit bus interaction control module A and then written into the floating-point addition calculation sub-unit;
<3> The data enter the floating-point addition calculation sub-unit through four-input/output RAM A, and the sum of the two matrices T[N×N] and C[N×N] is calculated; the result is passed to four-input/output RAM A by computation-unit bus interaction control module A and returned to matrix-vector register T[N×N] over the internal bus;
<4> DIVF.M T[N×N], D[N×N], T[N×N], N is executed: the matrix data at address D[N×N] are fetched and stored in four-input/output RAM B of the floating-point calculation unit; the T[N×N] data are passed from computation-unit bus interaction control module A to computation-unit bus interaction control module B and then written into the floating-point division calculation sub-unit;
<5> The data enter the pipelined input/output vector floating-point divider through four-input/output RAM B, and the element-wise division of the two matrices T[N×N] and D[N×N] is calculated; the result is passed to four-input/output RAM B by computation-unit bus interaction control module B and returned to matrix-vector register T[N×N] over the internal bus;
<6> SIGMF.M T[N×N], T[N×N], N is executed: the T[N×N] data are passed from computation-unit bus interaction control module B to computation-unit bus interaction control module C and then written into the floating-point Sigmoid calculation sub-unit;
<7> The Sigmoid function of T[N×N] is calculated; the result is passed to four-input/output RAM C by computation-unit bus interaction control module C and returned to matrix-vector register T[N×N] over the internal bus;
<8> MULF.M T[N×N], E[N×N], F[N×N], N is executed: the matrix data at address E[N×N] are fetched and stored in four-input/output RAM D of the floating-point calculation unit; the T[N×N] data are passed from computation-unit bus interaction control module C to computation-unit bus interaction control module D and then written into the floating-point multiplication calculation sub-unit;
<9> The data enter the floating-point multiplication calculation sub-unit through four-input/output RAM D, the product of the matrices T[N×N] and E[N×N] is calculated, and the result is returned to four-input/output RAM D via computation-unit bus interaction control module D and then to matrix-vector register F[N×N] over the internal bus, completing the calculation task.
Since the calculation of the other instructions is similar to that of the first, it is not described again in detail here.
From the execution of the above concrete operations it can be seen that step (3) of the first instruction's execution reduces the frequency of peripheral data interactions and the data calculation delay caused by communication between caches; in a deep-learning network, where convolution-kernel calculations easily run to millions or even hundreds of millions of times, this can save a great deal of cache access time. Step (6) of the first instruction's execution backs up frequently used data in the short term according to the correlation with the following instruction, likewise reducing the internal/external cache data interaction access time during calculation and, in particular, reducing the frequency of fragmented parameter accesses. Step (7) of the first instruction's execution pre-judges the input data during calculation and directly outputs the result for zero-valued data, reducing calculation overhead. Finally, because the present processor design includes both fixed-point and floating-point calculation modules, it avoids the sharp drop in model accuracy caused by calculation error in other designs.
Content not described in detail in this specification belongs to techniques well known to those skilled in the art.
Claims (9)
1. A general deep-learning processor based on multiple parallel cache interaction and calculation, characterized by comprising an instruction processing module, a Buffer forwarding module, a first gating module, a data computation module, a second gating module, a third gating module, and an executed-instruction cache module, in which:
Buffer forwarding module: receives the first control signal sent by the instruction processing module, selects the corresponding register, and internally caches the to-be-calculated data signal sent in over the external bus; it merges the first control signal and the to-be-calculated data into a second control signal and sends it to the first gating module, and further performs data updates according to the register update signal sent back by the first gating module;
First gating module: receives the second control signal sent by the Buffer forwarding module and, according to the first gating command signal sent by the instruction processing module, selectively outputs the cached data in the second control signal; after gating control of the to-be-calculated data it outputs a third control signal containing the operands required for this calculation, sends it to the data computation module, and at the same time sends the register update signal to the Buffer forwarding module;
Data computation module: receives the third control signal sent from the first gating module, calculates on the to-be-calculated data with the operands, required for this calculation, contained in the third control signal, and sends the resulting fixed-point, floating-point, and logic calculation results to the second gating module;
Second gating module: receives the fixed-point, floating-point, and logic calculation results sent from the data computation module and performs secondary gating control according to the second gating command signal sent by the instruction processing module, forwarding the fixed-point, floating-point, or logic calculation result to the third gating module and the executed-instruction cache module;
Executed-instruction cache module: receives the calculation result gated out by the second gating module, performs parallel-instruction conflict retrieval, and sends to the third gating module a decision instruction signal, containing the instruction-conflict retrieval judgment result, for determining whether another calculation is in progress;
Third gating module: receives the calculation result sent by the second gating module and, according to the third gating command signal sent by the instruction processing module and the conflict-retrieval judgment result sent by the executed-instruction cache module, generates the instruction update signal with which the Buffer forwarding module performs instruction updates, and sends the instruction update signal to the instruction processing module to form a computation closed loop;
Instruction processing module: receives the instruction update signal, sent by the third gating module, for selecting the instruction to be executed, generates the corresponding instruction code, outputs the first control signal after decoding, sends the first control signal, obtained from decoding and used to select the corresponding register, to the Buffer forwarding module, and at the same time generates the first, second, and third gating command signals, which are sent to the first, second, and third gating modules, respectively, for output data gating.
2. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 1, characterized in that the instruction processing module includes an instruction address generation unit, an instruction pool unit, and a decoding unit, in which:
Instruction address generation unit: selects the corresponding program address pointer according to the instruction update signal sent by the third gating module and sends it to the instruction pool unit;
Instruction pool unit: receives the program address pointer sent by the instruction address generation unit, addresses the corresponding instruction code, and sends the instruction code to the decoding unit;
Decoding unit: receives the instruction code sent by the instruction pool unit, performs instruction decoding, sends the resulting control signal to the Buffer forwarding module, and at the same time generates the first, second, and third gating command signals, which are sent to the first, second, and third gating modules, respectively, for output data gating.
3. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 2, characterized in that the Buffer forwarding module includes a register group unit, an internal cache unit, and a peripheral interface control unit, in which:
Register group unit: receives the decoded first control signal sent by the instruction processing module, selects the corresponding register, and sends the register data in the control signal to the internal cache unit;
Peripheral interface control unit: receives the to-be-calculated data sent by the external bus access interface and sends the external signal data containing the to-be-calculated data to the internal cache unit;
Internal cache unit: receives the register data sent by the register group unit and the external signal, containing the to-be-calculated data, sent by the peripheral interface control unit, and sends all the signal data to the first gating module.
4. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 1, characterized in that the data computation module includes a fixed-point calculation unit, a floating-point calculation unit, and a logic calculation unit, in which:
Fixed-point calculation unit: receives the third control signal sent from the first gating module, performs fixed-point calculation, and sends the fixed-point calculation result to the second gating module;
Floating-point calculation unit: receives the third control signal sent from the first gating module, performs floating-point calculation, and sends the floating-point calculation result to the second gating module;
Logic calculation unit: receives the third control signal sent from the first gating module, performs logic calculation, and sends the logic calculation result to the second gating module.
5. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 4, characterized in that the floating-point calculation unit includes a floating-point calculation sub-unit, an interactive control sub-unit, and a bus interaction sub-unit, in which:
Floating-point calculation sub-unit: receives the third control signal sent from the first gating module and performs floating-point calculation; it also receives the read instruction of the interactive control sub-unit, through which the floating-point calculation result is read out and sent to the second gating unit;
Interactive control sub-unit: when the floating-point calculation sub-unit starts calculating, sends a read instruction to the corresponding or adjacent floating-point calculation sub-unit through the bus interaction sub-unit, and sends the calculation result to the second gating module;
Bus interaction sub-unit: carries out the instruction and data interaction between the floating-point calculation sub-unit and the interactive control sub-unit.
6. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 5, characterized in that the floating-point calculation sub-unit includes a floating-point pipeline vector calculator, a left operand data-stream FIFO, a right operand data-stream FIFO, an output result data-stream FIFO, a left operand register RL, a right operand register RR, and an output result register RX, in which:
Left operand register RL: receives the interactive control signal output by the bus interaction sub-unit, caches the written operand data, and sends it to the designated floating-point pipeline vector calculator for calculation;
Right operand register RR: receives the interactive control signal output by the bus interaction sub-unit, caches the written operand data, and sends it to the designated floating-point pipeline vector calculator for calculation;
Left/right operand data-stream FIFOs: according to the interactive control signal output by the bus interaction sub-unit, cache the written FIFO data values and send them to the designated floating-point pipeline vector calculator for pipelined calculation;
Floating-point pipeline vector calculator: calculates on the written data and sends the calculated result to the output result register RX and the output result data-stream FIFO;
Output result register RX: receives the calculated result of the floating-point pipeline vector calculator, caches it, and then judges the operands; if an operand is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent directly to the external bus, otherwise the result is output to the external bus as calculated;
Output result data-stream FIFO: receives and caches the calculated result of the floating-point pipeline vector calculator; if the summed value of the data-stream FIFO is 0 and the calculation type is multiplication or division, the result is recorded as 0 and sent directly to the external bus, otherwise the result is output to the external bus as calculated.
7. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 5, characterized in that the interactive control sub-unit includes an input interactive access control module, a control signal interpretation module, and an interactive output module, in which:
Input interactive access control module: receives the third control signal sent by the first gating module, decodes it, and sends the resulting data and control signal to the interactive output module;
Control signal interpretation module: receives the external calculation instruction sent by the first gating module, decodes it, and sends the resulting control signal to the interactive output module;
Interactive output module: sends the data output by the input interactive access control module or the control signal interpretation module to the second gating module through the bus interaction sub-unit.
8. The general deep-learning processor based on multiple parallel cache interaction and calculation according to claim 6, characterized in that the number of floating-point pipeline vector calculators is four.
9. A general deep-network calculation method with multiple parallel calculation and caching, characterized in that the steps are as follows:
(1) an instruction is selected and output from the instruction pool according to the generated address pointer; after the instruction is decoded, a first control signal is output, and the first, second, and third gating command signals are generated for data gating;
(2) the data in the first control signal are cached, the to-be-calculated data are obtained over the external access bus, and they are merged with the data in the first control signal to output a second control signal;
(3) gating control is applied to the second control signal using the first gating command signal, the to-be-calculated data are calculated according to the gated third control signal, and the cached data are updated with the calculation result;
(4) the calculation result is gated and output according to the second gating command signal, while parallel-instruction conflict retrieval is performed and the conflict-retrieval judgment result is output;
(5) according to the third gating command signal, the calculation result, and the conflict-retrieval judgment result, a register update signal is sent to update the data cache, and at the same time the instruction address is updated, forming a computation closed loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811528451.7A CN109739556B (en) | 2018-12-13 | 2018-12-13 | General deep learning processor based on multi-parallel cache interaction and calculation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109739556A true CN109739556A (en) | 2019-05-10 |
CN109739556B CN109739556B (en) | 2021-03-26 |
Family
ID=66359421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811528451.7A Active CN109739556B (en) | 2018-12-13 | 2018-12-13 | General deep learning processor based on multi-parallel cache interaction and calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739556B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112817638A (en) * | 2019-11-18 | 2021-05-18 | 北京希姆计算科技有限公司 | Data processing device and method |
CN113051212A (en) * | 2021-03-02 | 2021-06-29 | 长沙景嘉微电子股份有限公司 | Graphics processor, data transmission method, data transmission device, electronic device, and storage medium |
CN113806250A (en) * | 2021-09-24 | 2021-12-17 | 中国人民解放军国防科技大学 | Method for coordinating general processor core and vector component, interface and processor |
US11782722B2 (en) | 2020-06-30 | 2023-10-10 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Input and output interfaces for transmitting complex computing information between AI processors and computing components of a special function unit |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5961628A (en) * | 1997-01-28 | 1999-10-05 | Samsung Electronics Co., Ltd. | Load and store unit for a vector processor |
CN1387649A (en) * | 1999-08-31 | 2002-12-25 | 英特尔公司 | Parallel processor architecture |
CN101751244A (en) * | 2010-01-04 | 2010-06-23 | 清华大学 | Microprocessor |
CN101986263A (en) * | 2010-11-25 | 2011-03-16 | 中国人民解放军国防科学技术大学 | Method and microprocessor for supporting single instruction stream and multi-instruction stream dynamic switching execution |
CN106445468A (en) * | 2015-10-08 | 2017-02-22 | 上海兆芯集成电路有限公司 | Direct execution of execution unit for loading micro-operation of framework cache file by employing framework instruction of processor |
US10073696B2 (en) * | 2013-07-15 | 2018-09-11 | Texas Instruments Incorporated | Streaming engine with cache-like stream data storage and lifetime tracking |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5961628A (en) * | 1997-01-28 | 1999-10-05 | Samsung Electronics Co., Ltd. | Load and store unit for a vector processor |
CN1387649A (en) * | 1999-08-31 | 2002-12-25 | 英特尔公司 | Parallel processor architecture |
CN101751244A (en) * | 2010-01-04 | 2010-06-23 | 清华大学 | Microprocessor |
CN101986263A (en) * | 2010-11-25 | 2011-03-16 | 中国人民解放军国防科学技术大学 | Method and microprocessor for supporting single instruction stream and multi-instruction stream dynamic switching execution |
US10073696B2 (en) * | 2013-07-15 | 2018-09-11 | Texas Instruments Incorporated | Streaming engine with cache-like stream data storage and lifetime tracking |
CN106445468A (en) * | 2015-10-08 | 2017-02-22 | 上海兆芯集成电路有限公司 | Direct execution by an execution unit of micro-operations that load the architectural register file, using the processor's architectural instructions |
Non-Patent Citations (1)
Title |
---|
Yang Chuan: "Research and Implementation of a Parallel Computing Method for the MPCore Multi-core Processor", China Masters' Theses Full-text Database (Information Science & Technology) * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112817638A (en) * | 2019-11-18 | 2021-05-18 | 北京希姆计算科技有限公司 | Data processing device and method |
US11782722B2 (en) | 2020-06-30 | 2023-10-10 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Input and output interfaces for transmitting complex computing information between AI processors and computing components of a special function unit |
CN113051212A (en) * | 2021-03-02 | 2021-06-29 | 长沙景嘉微电子股份有限公司 | Graphics processor, data transmission method, data transmission device, electronic device, and storage medium |
CN113051212B (en) * | 2021-03-02 | 2023-12-05 | 长沙景嘉微电子股份有限公司 | Graphics processor, data transmission method, data transmission device, electronic equipment and storage medium |
CN113806250A (en) * | 2021-09-24 | 2021-12-17 | 中国人民解放军国防科技大学 | Method for coordinating general processor core and vector component, interface and processor |
Also Published As
Publication number | Publication date |
---|---|
CN109739556B (en) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lu et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
CN111897579B (en) | Image data processing method, device, computer equipment and storage medium | |
Chen et al. | Regnn: A redundancy-eliminated graph neural networks accelerator | |
CN118690805A (en) | Processing apparatus and processing method | |
CN109739556A (en) | A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated | |
CN105468439A (en) | Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework | |
CN108052347A (en) | Device and method for instruction selection, and instruction mapping method | |
CN116301920B (en) | Compiling system for deploying CNN model to high-performance accelerator based on FPGA | |
US20230394110A1 (en) | Data processing method, apparatus, device, and medium | |
CN116401502B (en) | Method and device for optimizing Winograd convolution based on NUMA system characteristics | |
CN114995822A (en) | Deep learning compiler optimization method special for CNN accelerator | |
CN112232517B (en) | Artificial intelligence acceleration engine and artificial intelligence processor | |
Chen et al. | Rubik: A hierarchical architecture for efficient graph learning | |
Zhu et al. | Taming unstructured sparsity on GPUs via latency-aware optimization | |
CN110047477A (en) | Optimization method, device, and system for a weighted finite-state transducer | |
Chen et al. | Exploiting on-chip heterogeneity of versal architecture for GNN inference acceleration | |
Wang et al. | COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network using Hybrid Data Reuse and Fusion Methodologies | |
CN113157638B (en) | Low-power-consumption in-memory calculation processor and processing operation method | |
CN111522776B (en) | Computing architecture | |
Shang et al. | LACS: A high-computational-efficiency accelerator for CNNs | |
Janssen et al. | A specification invariant technique for regularity improvement between flow-graph clusters | |
Lin et al. | swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer | |
CN116090519A (en) | Compiling method of convolution operator and related product | |
US11714649B2 (en) | RISC-V-based 3D interconnected multi-core processor architecture and working method thereof | |
CN106095730B (en) | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||