CN105589677A - Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof - Google Patents

Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof

Info

Publication number
CN105589677A
Authority
CN
China
Prior art keywords
matrix
row
data
control module
multiplier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410653363.5A
Other languages
Chinese (zh)
Inventor
陶耀东
周磊涛
李锁
齐济
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Gaojing Numerical Control Intelligent Technology Co Ltd
Original Assignee
Shenyang Gaojing Numerical Control Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Gaojing Numerical Control Intelligent Technology Co Ltd
Priority to CN201410653363.5A
Publication of CN105589677A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention provides an FPGA (Field Programmable Gate Array)-based systolic structure matrix multiplier and an implementation method thereof. The systolic structure matrix multiplier comprises a multiplier array composed of M × R node units, a clock control module, a data input control module, and a data output control module. The M × R node units are interconnected as a two-dimensional mesh array of M rows and R columns, perform multiply-accumulate calculations on data, and form systolic structures along both the row and the column data directions. The clock control module provides the clocks, controls the whole calculation process, and records the current state of the multiplier. The data input control module controls input of the input matrices by rows or by columns so that the input satisfies the time-alignment input rule. The data output control module outputs the calculation results by rows or by columns under the control of an output clock. With the implementation method provided by the invention, data enter the matrix multiplier by rows or by columns according to the input rule of the systolic structure, multiply-add operations are performed, and the calculation results are output. The FPGA-based systolic structure matrix multiplier provided by the invention offers very high calculation performance, good modularity, and convenient reconfiguration.

Description

Systolic structure matrix multiplier based on FPGA and implementation method thereof
Technical field
The present invention relates to the fields of FPGA technology and high-performance computing, and specifically to an FPGA-based systolic structure matrix multiplier and an implementation method thereof.
Background art
Matrix multiplication is a basic operation in scientific computing, digital signal processing, and related areas, and is widely used. Its computational performance directly affects the overall performance of a system. As data volumes grow and the required precision rises, matrix multiplication is gradually becoming the bottleneck of system computing performance.
Matrix multiplication has conventionally been implemented either in software on a general-purpose processor or in hardware on a dedicated digital signal processor (Digital Signal Processor, DSP). These approaches are mature, well supported by tools, and simple to program, but they are constrained by their internal structure and platform. They are unsuitable where real-time and reliability requirements are high, and also unsuitable for small embedded systems. Because general-purpose processors and DSPs are limited by factors such as the device clock frequency, they cannot reach very high computing performance.
With the continuous improvement of FPGA technology and its manufacturing processes in recent years, FPGAs can be applied to complex, compute-intensive tasks. The programmability specific to FPGAs allows them to handle varied requirements flexibly and to implement a wide range of logic functions, which gives the system good scalability. Their inherent fine-grained parallelism and high performance-to-power ratio make FPGAs one of the best choices for accelerating embedded systems.
The systolic structure shown in Fig. 1 is a linear structure that is inherently pipelined. Operating differing numbers of such linear structures in parallel along different directions yields high data throughput and a high clock frequency. A systolic structure is simple, regular, and modular, and only a small number of nodes perform I/O with the outside. This lets the system keep a good processing speed while staying balanced against the external I/O bandwidth, which makes it well suited to FPGA implementation.
Most current FPGA implementations of parallel matrix multiplication broadcast the data of one row and one column to N nodes. As a result, the fan-out of the input module is excessive, routing delay increases, there is no pipeline structure, and errors occur easily, so the overall matrix multiplier struggles to reach high performance, is unreliable, and has difficulty meeting real-time computing requirements. In addition, data input control is complex, results are hard to read out after computation, and the design as a whole is not modular, so most of it must be modified when it is ported.
Summary of the invention
In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide an FPGA-based systolic structure matrix multiplier and an implementation method thereof that address the following problems of existing matrix multipliers: excessive fan-out of the input module, difficulty reaching high overall performance, complex data input control, difficulty reading out results after computation, and a lack of modularity that forces most of the design to be modified when it is ported.
To achieve the above object, the technical solution adopted by the present invention is an FPGA-based systolic structure matrix multiplier comprising: a multiplier array, a clock control module, a data input control module, and a data output control module;
The multiplier array is a two-dimensional mesh array of M rows and R columns, composed of M × R node units with adjacent node units interconnected, and is used to compute the product of input matrix A and matrix B;
The clock control module outputs two clocks, supplied respectively to the data input control module and the data output control module;
The data input control module inputs matrix A and matrix B in parallel under the control of the clock control module, where matrix A has M rows and N columns and matrix B has N rows and R columns;
The data output control module outputs matrix C, where matrix C has M rows and R columns.
Each node unit consists of one multiplier, one adder, and one register for storing the computation result. The multiplier receives the input data of the node unit and sends the product to the adder. The register stores the accumulated sum from the previous clock cycle. The adder adds the product from the multiplier to the accumulated sum held in the register from the previous clock cycle, and sends the result to the register, which updates its stored value.
The multiplier inside the node unit uses a dedicated hard multiplier block inside the FPGA.
Under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array.
Under the control of the clock control module, the data output control module outputs the results by rows or by columns.
An implementation method for an FPGA-based systolic structure matrix multiplier comprises the following steps:
The data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module;
The input data are multiplied and accumulated in each node unit of the multiplier array, and under the control of the clock control module all node units complete their computation, until no more data are input;
The data output control module outputs matrix C, of M rows and R columns, as the matrix multiplication result under the control of the clock control module.
When the data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module, the following rules apply:
Matrix A has M rows and N columns, and $A_{i,k}$ is an element of matrix A, with i = 1, 2, ..., M and k = 1, 2, ..., N. Matrix B has N rows and R columns, and $B_{k,j}$ is an element of matrix B, with k = 1, 2, ..., N and j = 1, 2, ..., R.
The row vectors of matrix A enter the corresponding rows of the multiplier array in ascending order of row number, and adjacent row vectors enter the multiplier array one clock cycle apart; that is, the element $A_{i,k}$ in row i, column k of matrix A and the element $A_{i-1,k-1}$ in row i-1, column k-1 of matrix A enter the multiplier array at the same time;
The column vectors of matrix B enter the corresponding columns of the multiplier array in ascending order of column number, and adjacent column vectors enter the multiplier array one clock cycle apart; that is, the element $B_{k,j}$ in row k, column j of matrix B and the element $B_{k-1,j-1}$ in row k-1, column j-1 of matrix B enter the multiplier array at the same time;
Matrix A entering the multiplier array by rows and matrix B entering the multiplier array by columns proceed in parallel in time; that is, $A_{i,k}$ and $B_{k,j}$ enter the multiplier array in the same clock cycle, until all elements of matrix A and matrix B have passed through the full rows and full columns of the multiplier array composed of M × R node units, so that the data reaching each node unit satisfy the time-alignment input rule.
When the input data are multiplied and accumulated in each node unit of the multiplier array, each individual node unit performs the following steps:
1) receive the data from the preceding nodes in the row and column systolic directions;
2) compute the product of the two data values and add it to the previously stored result;
3) store the accumulated value, pass the received row input data to the next node in the same row, and pass the received column input data to the next node in the same column.
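The three node steps above can be sketched as a small behavioral model. This is only an illustrative Python sketch, not the patent's hardware description: the class name `NodeUnit`, the method names, and the use of plain integers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NodeUnit:
    """Behavioral model of one node (hypothetical name, not from the source)."""
    acc: int = 0  # plays the role of the result register

    def reset(self) -> None:
        # Clear the stored sum before a new matrix multiplication.
        self.acc = 0

    def step(self, row_in: int, col_in: int) -> Tuple[int, int]:
        # Step 1: row_in and col_in are the values received from the
        # preceding nodes in the row and column directions.
        # Step 2: multiply them and add the product to the stored sum.
        self.acc += row_in * col_in
        # Step 3: forward both operands unchanged to the next nodes.
        return row_in, col_in


node = NodeUnit()
for a, b in [(1, 2), (2, 3), (3, 4)]:
    row_out, col_out = node.step(a, b)
```

After the three calls, `node.acc` holds 1·2 + 2·3 + 3·4 = 20, and the last operand pair has been passed on unchanged.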
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention adopts a matrix array with a systolic structure that is fully pipelined and places very low fan-out demands on the input side, so it can reach very high computing performance. In testing, an implementation on an Altera DE2 development board reached a maximum clock frequency of nearly 200 MHz, and a relatively high clock frequency is maintained as the matrix size grows.
2. The present invention controls the whole matrix multiplier with two clocks, so that the input/computation process and the output process use independent clocks. This is easy to operate, avoids erroneous operation, and suits cases where the computing side and the reading side use different clocks.
3. The present invention is highly modular and is particularly suitable for use as a module in an embedded system. For different computation sizes, because all nodes in the array are identical, the array only needs to be rearranged, which makes reconfiguration convenient.
4. The present invention has good low-power characteristics. Because two clocks control the computation process, the clock is turned off when no computation is in progress, which greatly reduces the overall power consumption of the matrix multiplier and suits use in embedded systems.
Brief description of the drawings
Fig. 1 is a schematic diagram of the systolic array in the method of the present invention;
Fig. 2 is a schematic diagram of the internal structure of the matrix multiplier in the method of the present invention;
Fig. 3 is a schematic diagram of the clock control module of the matrix multiplier in the method of the present invention;
Fig. 4 is a diagram of the internal structure of a node unit in the method of the present invention;
Fig. 5 is a diagram of an input example for the matrix multiplier in the method of the present invention;
Fig. 6 is a diagram of the data input and computation states of the matrix multiplier at different input/computation clock cycles in the method of the present invention.
Detailed description of the invention
The present invention is described in further detail below with reference to the drawings and embodiments.
As shown in Fig. 2, an FPGA-based systolic structure matrix multiplier comprises a multiplier array, a clock control module, a data input control module, and a data output control module. The multiplier array is a two-dimensional mesh array of M rows and R columns, composed of M × R node units with adjacent node units interconnected, and computes the product of input matrix A and matrix B. The clock control module outputs two clocks, supplied respectively to the data input control module and the data output control module. The data input control module inputs matrix A and matrix B in parallel under the control of the clock control module, where matrix A has M rows and N columns and matrix B has N rows and R columns. The data output control module outputs matrix C, where matrix C has M rows and R columns. Under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array. Under the control of the clock control module, the data output control module outputs the results by rows or by columns. Matrix C has M rows and R columns, and C = A × B.
Fig. 3 is a schematic diagram of the clock control module of the matrix multiplier in the method of the present invention. The matrix multiplier uses two clocks, the input/computation clock CLK1 and the output clock CLK2. The input/computation clock comes from the computing side and is drawn as a black line in the figure; the output clock comes from the reading side and is drawn as a dashed line. The input/computation clock mainly serves as the CLK1 clock of the input control module and the nodes; the output clock mainly serves as the clock of the data output module and the CLK2 clock of the nodes. The clock control module controls the whole matrix computation flow. The computing side requests a computation with Req1, and the module replies to the computing side with Ack1 according to its own state. The reading side requests the results with Req2, and the module replies to the reading side with Ack2 according to its own state. In the input and computation state, the input/computation clock CLK1 is enabled and the output clock CLK2 is turned off; in the output state, CLK1 is turned off and CLK2 is enabled.
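The handshake and clock gating described above can be illustrated with a small state machine. This is a hedged sketch only: the four state names, the `compute_done` and `output_done` events, and the policy of refusing a request in the wrong state are assumptions for illustration; the source specifies only Req1/Ack1, Req2/Ack2, and which clock is enabled in each phase.

```python
from enum import Enum, auto


class State(Enum):
    # State names are hypothetical; the source only says the module
    # answers requests "according to its own state".
    IDLE = auto()
    COMPUTE = auto()
    RESULT_READY = auto()
    OUTPUT = auto()


class ClockControl:
    def __init__(self) -> None:
        self.state = State.IDLE

    @property
    def clk1_enabled(self) -> bool:
        # CLK1 runs only during input and computation.
        return self.state is State.COMPUTE

    @property
    def clk2_enabled(self) -> bool:
        # CLK2 runs only during output.
        return self.state is State.OUTPUT

    def req1(self) -> bool:
        """Computation request from the computing side; returns Ack1."""
        if self.state is State.IDLE:
            self.state = State.COMPUTE
            return True
        return False

    def compute_done(self) -> None:
        # Assumed internal event: all input cycles have elapsed.
        if self.state is State.COMPUTE:
            self.state = State.RESULT_READY

    def req2(self) -> bool:
        """Read request from the reading side; returns Ack2."""
        if self.state is State.RESULT_READY:
            self.state = State.OUTPUT
            return True
        return False

    def output_done(self) -> None:
        # Assumed internal event: all results have been shifted out.
        if self.state is State.OUTPUT:
            self.state = State.IDLE


ctrl = ClockControl()
early_read = ctrl.req2()          # refused: nothing has been computed yet
ack1 = ctrl.req1()
clocks_during_compute = (ctrl.clk1_enabled, ctrl.clk2_enabled)
ctrl.compute_done()
clocks_when_waiting = (ctrl.clk1_enabled, ctrl.clk2_enabled)
ack2 = ctrl.req2()
clocks_during_output = (ctrl.clk1_enabled, ctrl.clk2_enabled)
```

In this model both clocks are off while the module waits between phases, which matches the low-power intent stated in the source, although the exact gating conditions in the real design may differ.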
As shown in Fig. 4, the node unit is the basic unit of the multiplier array and consists of one multiplier, one adder, and one register for storing the computation result. The multiplier receives the input data of the node unit and sends the product to the adder. The register stores the accumulated sum from the previous clock cycle. The adder adds the product from the multiplier to the accumulated sum held in the register from the previous clock cycle, and sends the result to the register, which updates its stored value. In the figure, Row_in and Col_in are the input data of the node, CLK1 is the synchronous clock, and rstn is the asynchronous reset signal; before each new matrix multiplication, a reset is required to clear all data in the nodes. Under the control of the synchronous clock, Row_in and Col_in are the two inputs of the multiplier, and their product is one input of the accumulator. REG is the storage element in the node; it holds the previous accumulated result and provides the other input of the accumulator. The accumulator adds the product to the data in REG and writes the result back to REG, completing one multiply-accumulate operation. When the nodes in the array no longer receive input data, the data in REG of each node is the final result. When CLK2 is enabled, the data in REG is output, and the data input from the preceding node is accepted and stored in REG.
An implementation method for an FPGA-based systolic structure matrix multiplier is characterized in that it comprises the following steps:
The data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module;
The input data are multiplied and accumulated in each node unit of the multiplier array, and under the control of the clock control module all node units complete their computation, until no more data are input;
The data output control module outputs matrix C, of M rows and R columns, as the matrix multiplication result under the control of the clock control module.
When the data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module, the following rules apply:
Matrix A has M rows and N columns, and $A_{i,k}$ is an element of matrix A, with i = 1, 2, ..., M and k = 1, 2, ..., N. Matrix B has N rows and R columns, and $B_{k,j}$ is an element of matrix B, with k = 1, 2, ..., N and j = 1, 2, ..., R.
The row vectors of matrix A enter the corresponding rows of the multiplier array in ascending order of row number, and adjacent row vectors enter the multiplier array one clock cycle apart; that is, the element $A_{i,k}$ in row i, column k of matrix A and the element $A_{i-1,k-1}$ in row i-1, column k-1 of matrix A enter the multiplier array at the same time;
The column vectors of matrix B enter the corresponding columns of the multiplier array in ascending order of column number, and adjacent column vectors enter the multiplier array one clock cycle apart; that is, the element $B_{k,j}$ in row k, column j of matrix B and the element $B_{k-1,j-1}$ in row k-1, column j-1 of matrix B enter the multiplier array at the same time;
Matrix A entering the multiplier array by rows and matrix B entering the multiplier array by columns proceed in parallel in time; that is, $A_{i,k}$ and $B_{k,j}$ enter the multiplier array in the same clock cycle, until all elements of matrix A and matrix B have passed through the full rows and full columns of the multiplier array composed of M × R node units, so that the data reaching each node unit satisfy the time-alignment input rule.
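The time-alignment rule can be checked with a short scheduling sketch. Several things here are assumptions rather than statements from the source: cycles are counted from 0, each operand advances one node per cycle, and the elements within a row of A (or a column of B) are fed in descending k. That feeding order is chosen only because it makes the stated relation between $A_{i,k}$ and $A_{i-1,k-1}$ hold under a one-cycle skew between adjacent rows; the function names are hypothetical.

```python
def a_entry_cycle(i: int, k: int, n: int) -> int:
    """Cycle at which A[i,k] enters row i (1-based i, k; assumed schedule)."""
    # Row i is delayed by i-1 cycles; within a row, larger k goes first.
    return (i - 1) + (n - k)


def b_entry_cycle(k: int, j: int, n: int) -> int:
    """Cycle at which B[k,j] enters column j (1-based k, j; assumed schedule)."""
    return (j - 1) + (n - k)


def a_arrival(i: int, j: int, k: int, n: int) -> int:
    # A[i,k] moves along row i and needs j-1 hops to reach node (i, j).
    return a_entry_cycle(i, k, n) + (j - 1)


def b_arrival(i: int, j: int, k: int, n: int) -> int:
    # B[k,j] moves along column j and needs i-1 hops to reach node (i, j).
    return b_entry_cycle(k, j, n) + (i - 1)


M, N, R = 3, 4, 5
skew_ok = all(
    a_entry_cycle(i, k, N) == a_entry_cycle(i - 1, k - 1, N)
    for i in range(2, M + 1) for k in range(2, N + 1)
)
aligned = all(
    a_arrival(i, j, k, N) == b_arrival(i, j, k, N)
    for i in range(1, M + 1) for j in range(1, R + 1) for k in range(1, N + 1)
)
```

Under these assumptions, `aligned` confirms that the two operands of every product $A_{i,k} \times B_{k,j}$ reach node (i, j) in the same cycle, which is the property the node's multiply-accumulate step relies on.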
When the input data are multiplied and accumulated in each node unit of the multiplier array, each individual node unit performs the following steps:
1) receive the data from the preceding nodes in the row and column systolic directions;
2) compute the product of the two data values and add it to the previously stored result;
3) store the accumulated value, pass the received row input data to the next node in the same row, and pass the received column input data to the next node in the same column.
The formula for the output matrix C is $C_{i,j} = \sum_{k=1}^{N} A_{i,k} \times B_{k,j}$, where i = 1, 2, ..., M; k = 1, 2, ..., N; j = 1, 2, ..., R.
During output, all node actions are completed under the control of CLK2: each node outputs the result in its accumulator to the next node in the same direction, accepts the data of the preceding node, and stores it in its accumulator. This repeats until all result data have been output. The input and computation time of the data input control module is (M+N+R-1) CLK1 clock cycles; outputting the results by rows through the data output control module takes M CLK2 clock cycles, and outputting them by columns takes R CLK2 clock cycles.
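The cycle counts above can be illustrated with a sketch of the output phase, in which each node hands its stored value to the next node once per CLK2 cycle. The function names, the shift direction (toward the last row or last column), and the zero fill behind the shifted data are assumptions for illustration only.

```python
from typing import List


def compute_cycles(m: int, n: int, r: int) -> int:
    """CLK1 cycles for input and computation, using the count stated in the text."""
    return m + n + r - 1


def shift_out_by_row(regs: List[List[int]]) -> List[List[int]]:
    """Emit one full row per CLK2 cycle; takes M cycles for M rows."""
    regs = [row[:] for row in regs]
    emitted = []
    for _ in range(len(regs)):
        emitted.append(regs[-1])                  # last row leaves the array
        regs = [[0] * len(regs[0])] + regs[:-1]   # every row moves down by one
    return emitted


def shift_out_by_column(regs: List[List[int]]) -> List[List[int]]:
    """Emit one full column per CLK2 cycle; takes R cycles for R columns."""
    regs = [row[:] for row in regs]
    emitted = []
    for _ in range(len(regs[0])):
        emitted.append([row[-1] for row in regs])  # last column leaves
        regs = [[0] + row[:-1] for row in regs]    # every column moves right
    return emitted


results = [[1, 2, 3], [4, 5, 6]]   # M = 2 rows, R = 3 columns
by_row = shift_out_by_row(results)
by_col = shift_out_by_column(results)
```

With this shift direction the last row (or column) is emitted first, so a reader would reverse the received order to rebuild C; a design that shifts the other way would emit the first row first.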
As shown in Fig. 5, an embodiment of the present invention with 3 × 3 node units is implemented as follows. The input matrices A and B are both 3 × 3 arrays, and matrix C, with C = A × B, is to be obtained:
(1) Configure the node array as 3 rows by 3 columns, interconnected as a two-dimensional mesh.
(2) Configure the corresponding input control module and data output module according to the matrix size, so that they support parallel input and output for 3 rows and columns.
(3) Configure the corresponding clock control module according to the matrix size.
(4) Reset the whole matrix multiplier and wait for the computing side to send the computation request Req1; the current state is checked, the module replies with Ack1 to indicate that computation can proceed, and CLK1 is enabled.
(5) The data input control module feeds in the row and column data according to the input rule.
(6) After (N+N+N-1) = 8 CLK1 clock cycles (here M = N = R = 3), the computation is complete and CLK1 is turned off.
(7) Wait for the reading side to send the read request Req2; the clock control module replies with Ack2, indicating that the results can be read, according to the current state.
(8) Enable CLK2; the matrix outputs its data row by row or column by column.
(9) After N = 3 CLK2 clock cycles, the data have been output by rows or by columns.
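Fig. 6 shows the data input and computation states of the matrix multiplier at different input/computation clock cycles in the method of the present invention. For the input matrices

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix}, \quad B = \begin{bmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{bmatrix} = \begin{bmatrix} 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{bmatrix},$$

the output matrix is

$$C = \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} \\ c_{2,1} & c_{2,2} & c_{2,3} \\ c_{3,1} & c_{3,2} & c_{3,3} \end{bmatrix} = \begin{bmatrix} 20 & 20 & 20 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{bmatrix},$$

so that

$$\begin{bmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix} \times \begin{bmatrix} 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{bmatrix} = \begin{bmatrix} 20 & 20 & 20 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{bmatrix}$$

holds, satisfying A × B = C.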
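The worked product above can be reproduced with a cycle-by-cycle software model of the array. This is a minimal sketch under stated assumptions, not the patented circuit: every node forwards its operands to its right and lower neighbors with a one-cycle delay, idle inputs are padded with zeros, elements are fed in descending k within each row of A and column of B, and the loop runs for the M+N+R-1 cycles quoted in the text. The function name `systolic_matmul` is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate an M x R mesh of multiply-accumulate nodes (illustrative model)."""
    m, n, r = len(a), len(b), len(b[0])
    acc = [[0] * r for _ in range(m)]     # result register of each node
    a_reg = [[0] * r for _ in range(m)]   # operand each node forwards to the right
    b_reg = [[0] * r for _ in range(m)]   # operand each node forwards downward
    for t in range(m + n + r - 1):        # cycle count taken from the text
        new_a = [[0] * r for _ in range(m)]
        new_b = [[0] * r for _ in range(m)]
        for i in range(m):
            for j in range(r):
                if j == 0:
                    s = t - i             # row i starts i cycles late
                    a_in = a[i][n - 1 - s] if 0 <= s < n else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j             # column j starts j cycles late
                    b_in = b[n - 1 - s][j] if 0 <= s < n else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate in the node
                new_a[i][j] = a_in        # pass operands on next cycle
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc


A = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
B = [[2, 2, 2], [3, 3, 3], [4, 4, 4]]
C = systolic_matmul(A, B)
```

For these operands every entry of `C` is 1·2 + 2·3 + 3·4 = 20. In this model the last product is accumulated before the final cycle of the loop, so the quoted cycle count is sufficient here; the exact latency of a hardware implementation depends on where its registers are placed.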
The invention can be put into practice according to the above embodiment.

Claims (8)

1. An FPGA-based systolic structure matrix multiplier, characterized in that it comprises: a multiplier array, a clock control module, a data input control module, and a data output control module;
The multiplier array is a two-dimensional mesh array of M rows and R columns, composed of M × R node units with adjacent node units interconnected, and is used to compute the product of input matrix A and matrix B;
The clock control module outputs two clocks, supplied respectively to the data input control module and the data output control module;
The data input control module inputs matrix A and matrix B in parallel under the control of the clock control module, where matrix A has M rows and N columns and matrix B has N rows and R columns;
The data output control module outputs matrix C, where matrix C has M rows and R columns.
2. The FPGA-based systolic structure matrix multiplier according to claim 1, characterized in that each node unit consists of one multiplier, one adder, and one register for storing the computation result; the multiplier receives the input data of the node unit and sends the product to the adder; the register stores the accumulated sum from the previous clock cycle; the adder adds the product from the multiplier to the accumulated sum held in the register from the previous clock cycle, and sends the result to the register, which updates its stored value.
3. The FPGA-based systolic structure matrix multiplier according to claim 2, characterized in that the multiplier inside the node unit uses a dedicated hard multiplier block inside the FPGA.
4. The FPGA-based systolic structure matrix multiplier according to claim 1, characterized in that, under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array.
5. The FPGA-based systolic structure matrix multiplier according to claim 1, characterized in that, under the control of the clock control module, the data output control module outputs the results by rows or by columns.
6. An implementation method for an FPGA-based systolic structure matrix multiplier, characterized in that it comprises the following steps:
The data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module;
The input data are multiplied and accumulated in each node unit of the multiplier array, and under the control of the clock control module all node units complete their computation, until no more data are input;
The data output control module outputs matrix C, of M rows and R columns, as the matrix multiplication result under the control of the clock control module.
7. The implementation method for an FPGA-based systolic structure matrix multiplier according to claim 6, characterized in that, when the data input control module inputs matrix A, of M rows and N columns, and matrix B, of N rows and R columns, into the multiplier array in parallel under the control of the clock control module, the following rules apply:
Matrix A has M rows and N columns, and $A_{i,k}$ is an element of matrix A, with i = 1, 2, ..., M and k = 1, 2, ..., N. Matrix B has N rows and R columns, and $B_{k,j}$ is an element of matrix B, with k = 1, 2, ..., N and j = 1, 2, ..., R.
The row vectors of matrix A enter the corresponding rows of the multiplier array in ascending order of row number, and adjacent row vectors enter the multiplier array one clock cycle apart; that is, the element $A_{i,k}$ in row i, column k of matrix A and the element $A_{i-1,k-1}$ in row i-1, column k-1 of matrix A enter the multiplier array at the same time;
The column vectors of matrix B enter the corresponding columns of the multiplier array in ascending order of column number, and adjacent column vectors enter the multiplier array one clock cycle apart; that is, the element $B_{k,j}$ in row k, column j of matrix B and the element $B_{k-1,j-1}$ in row k-1, column j-1 of matrix B enter the multiplier array at the same time;
Matrix A entering the multiplier array by rows and matrix B entering the multiplier array by columns proceed in parallel in time; that is, $A_{i,k}$ and $B_{k,j}$ enter the multiplier array in the same clock cycle, until all elements of matrix A and matrix B have passed through the full rows and full columns of the multiplier array composed of M × R node units, so that the data reaching each node unit satisfy the time-alignment input rule.
8. The implementation method for an FPGA-based systolic structure matrix multiplier according to claim 6, characterized in that, when the input data are multiplied and accumulated in each node unit of the multiplier array, each individual node unit performs the following steps:
1) receive the data from the preceding nodes in the row and column systolic directions;
2) compute the product of the two data values and add it to the previously stored result;
3) store the accumulated value, pass the received row input data to the next node in the same row, and pass the received column input data to the next node in the same column.
CN201410653363.5A 2014-11-17 2014-11-17 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof Pending CN105589677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410653363.5A CN105589677A (en) 2014-11-17 2014-11-17 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410653363.5A CN105589677A (en) 2014-11-17 2014-11-17 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof

Publications (1)

Publication Number Publication Date
CN105589677A true CN105589677A (en) 2016-05-18

Family

ID=55929290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410653363.5A Pending CN105589677A (en) 2014-11-17 2014-11-17 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof

Country Status (1)

Country Link
CN (1) CN105589677A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958704A (en) * 2017-05-18 2018-12-07 华为技术有限公司 A kind of data processing equipment and method
CN109144469A (en) * 2018-07-23 2019-01-04 上海亮牛半导体科技有限公司 Pipeline organization neural network matrix operation framework and method
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN109271138A (en) * 2018-08-10 2019-01-25 合肥工业大学 A kind of chain type multiplication structure multiplied suitable for big dimensional matrix
CN109902063A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of System on Chip/SoC being integrated with two-dimensional convolution array
CN109992743A (en) * 2017-12-29 2019-07-09 华为技术有限公司 Matrix multiplier
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 A kind of integrated circuit based on convolutional neural networks algorithm accelerates the method and system of calculating
CN110648313A (en) * 2019-09-05 2020-01-03 北京智行者科技有限公司 Laser stripe center line fitting method based on FPGA
CN110673824A (en) * 2018-07-03 2020-01-10 赛灵思公司 Matrix vector multiplication circuit and circular neural network hardware accelerator
CN112306922A (en) * 2020-11-12 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Multi-data-pair multi-port arbitration method and related device
CN112865960A (en) * 2020-12-31 2021-05-28 广州万协通信息技术有限公司 System, method and device for realizing high-speed key chain pre-calculation based on stream cipher
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof
CN114580628A (en) * 2022-03-14 2022-06-03 北京宏景智驾科技有限公司 Efficient quantization acceleration method and hardware circuit for neural network convolution layer
CN114816331A (en) * 2017-11-03 2022-07-29 畅想科技有限公司 Hardware unit for performing matrix multiplication with clock gating

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089840A (en) * 2007-07-12 2007-12-19 浙江大学 Matrix multiplication parallel computing system based on multi-FPGA
CN101794210A (en) * 2010-04-07 2010-08-04 上海交通大学 General matrix floating point multiplier based on FPGA (Field Programmable Gate Array)
CN102629189A (en) * 2012-03-15 2012-08-08 湖南大学 Water floating point multiply-accumulate method based on FPGA
CN102662623A (en) * 2012-04-28 2012-09-12 电子科技大学 Parallel matrix multiplier based on single FPGA and implementation method thereof

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958704A (en) * 2017-05-18 2018-12-07 华为技术有限公司 Data processing device and method
CN108958704B (en) * 2017-05-18 2020-12-15 华为技术有限公司 Data processing device and method
US11720646B2 (en) 2017-07-07 2023-08-08 Huawei Technologies Co., Ltd. Operation accelerator
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
US11321423B2 (en) 2017-07-07 2022-05-03 Huawei Technologies Co., Ltd. Operation accelerator
CN112214726B (en) * 2017-07-07 2024-05-03 华为技术有限公司 Operation accelerator
CN112214727A (en) * 2017-07-07 2021-01-12 华为技术有限公司 Operation accelerator
CN112214726A (en) * 2017-07-07 2021-01-12 华为技术有限公司 Operation accelerator
CN114816331A (en) * 2017-11-03 2022-07-29 畅想科技有限公司 Hardware unit for performing matrix multiplication with clock gating
CN114816331B (en) * 2017-11-03 2024-01-26 畅想科技有限公司 Hardware unit for performing matrix multiplication with clock gating
CN111859273A (en) * 2017-12-29 2020-10-30 华为技术有限公司 Matrix multiplier
CN109992743B (en) * 2017-12-29 2020-06-16 华为技术有限公司 Matrix multiplier
CN109992743A (en) * 2017-12-29 2019-07-09 华为技术有限公司 Matrix multiplier
US11934481B2 (en) 2017-12-29 2024-03-19 Huawei Technologies Co., Ltd. Matrix multiplier
US11334648B2 (en) 2017-12-29 2022-05-17 Huawei Technologies Co., Ltd. Matrix multiplier
CN110673824A (en) * 2018-07-03 2020-01-10 赛灵思公司 Matrix vector multiplication circuit and recurrent neural network hardware accelerator
CN109144469A (en) * 2018-07-23 2019-01-04 上海亮牛半导体科技有限公司 Pipeline-structured neural network matrix operation architecture and method
CN109271138A (en) * 2018-08-10 2019-01-25 合肥工业大学 Chained multiplication structure suitable for large-dimension matrix multiplication
CN109902063A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 System chip integrated with two-dimensional convolution array
CN109902063B (en) * 2019-02-01 2023-08-22 京微齐力(北京)科技有限公司 System chip integrated with two-dimensional convolution array
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110648313A (en) * 2019-09-05 2020-01-03 北京智行者科技有限公司 Laser stripe center line fitting method based on FPGA
CN112306922B (en) * 2020-11-12 2023-09-22 山东云海国创云计算装备产业创新中心有限公司 Multi-data-to-multi-port arbitration method and related device
CN112306922A (en) * 2020-11-12 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Multi-data-to-multi-port arbitration method and related device
CN112865960A (en) * 2020-12-31 2021-05-28 广州万协通信息技术有限公司 System, method and device for realizing high-speed key chain pre-calculation based on stream cipher
WO2023092669A1 (en) * 2021-11-26 2023-06-01 南方科技大学 Multi-precision accelerator based on systolic array and data processing method therefor
CN114237551B (en) * 2021-11-26 2022-11-11 南方科技大学 Multi-precision accelerator based on systolic array and data processing method thereof
CN114237551A (en) * 2021-11-26 2022-03-25 南方科技大学 Multi-precision accelerator based on systolic array and data processing method thereof
CN114580628A (en) * 2022-03-14 2022-06-03 北京宏景智驾科技有限公司 Efficient quantization acceleration method and hardware circuit for neural network convolution layer

Similar Documents

Publication Publication Date Title
CN105589677A (en) Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN111684473B (en) Improving performance of neural network arrays
JP7268996B2 (en) Systems and methods for computation
EP3373210B1 (en) Transposing neural network matrices in hardware
WO2018120989A1 (en) Convolution operation chip and communication device
CN106875013B (en) System and method for multi-core optimized recurrent neural networks
CN106203621B (en) Processor for convolutional neural network computation
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
EP3513357A1 (en) Tensor operations and acceleration
JP5408913B2 (en) Fast and efficient matrix multiplication hardware module
CN109992743A (en) Matrix multiplier
Heller et al. Systolic networks for orthogonal decompositions
US10713214B1 (en) Hardware accelerator for outer-product matrix multiplication
CN103970720B (en) Large-scale coarse-grained embedded reconfigurable system and processing method thereof
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
Danielsson Serial/parallel convolvers
CN100465876C (en) Matrix multiplier device based on single FPGA
CN109284475B (en) Matrix convolution calculating device and matrix convolution calculating method
Wu et al. Compute-efficient neural-network acceleration
CN111291323A (en) Matrix multiplication processor based on systolic array and data processing method thereof
Cho et al. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
Mohanty et al. Design and performance analysis of fixed-point Jacobi SVD algorithm on reconfigurable system
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160518