CN210776651U - FPGA parallel fast multiplier module for vector and matrix - Google Patents


Info

Publication number
CN210776651U
CN210776651U · application CN201921019111.1U
Authority
CN
China
Prior art keywords
output port
memory
multiplier
port
accumulator
Prior art date
Legal status
Expired - Fee Related
Application number
CN201921019111.1U
Other languages
Chinese (zh)
Inventor
杨旭辉
祁昌禹
马芳兰
徐武德
张红霞
马宏伟
杨国辉
巩学芳
郑礴
韩根亮
Current Assignee
INSTITUTE OF SENSOR TECHNOLOGY GANSU ACADEMY OF SCIENCE
Original Assignee
INSTITUTE OF SENSOR TECHNOLOGY GANSU ACADEMY OF SCIENCE
Priority date
Filing date
Publication date
Application filed by INSTITUTE OF SENSOR TECHNOLOGY GANSU ACADEMY OF SCIENCE filed Critical INSTITUTE OF SENSOR TECHNOLOGY GANSU ACADEMY OF SCIENCE
Priority to CN201921019111.1U priority Critical patent/CN210776651U/en
Application granted granted Critical
Publication of CN210776651U publication Critical patent/CN210776651U/en

Landscapes

  • Complex Calculations (AREA)

Abstract

The FPGA parallel fast multiplier module for vectors and matrices removes the repeated addressing that existing methods require during calculation, effectively reduces the number of memory accesses and the memory access time, improves the calculation speed, and realizes parallel multiplication of a vector and a matrix, providing an implementation of a vector-matrix multiplier. The technical scheme of the utility model is as follows: the structure consists of n+1 FIFO queue memories, n multipliers, n accumulators, n buffers and n controllers. Each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller has 1 input port and 2 output ports.

Description

FPGA parallel fast multiplier module for vector and matrix
Technical Field
The invention belongs to the field of information communication, and particularly relates to a vector and matrix FPGA parallel fast multiplier.
Background
The multiplication of a vector by a matrix is among the most basic operations in modern signal processing, widely applied to feature extraction in image processing, sparse signal processing, data compression in machine learning, and process control in automatic control. Vector-matrix multiplication is time-consuming, computationally complex and memory-intensive, and its performance directly affects the overall performance of the system.
In recent years, with the rapid development of FPGA technology, FPGAs have integrated acquisition, control, processing and transmission functions into a single chip, shortening the development cycle; parallel computing greatly increases programmable flexibility, and as process and precision improve, FPGAs are increasingly applied to computation-intensive applications. Based on the design principles and architecture of the FPGA, parallel processing can be realized quickly and effectively, and the computing speed improved, by designing multiple parallel computing modules. However, vector-matrix multiplication designs on FPGAs mostly adopt serial methods, which suffer from long latency, poor scalability, and bandwidth requirements that grow with dimensionality. The existing processing modes are therefore complex to control, cannot pipeline real-time data, have high computational complexity and large memory consumption, and are difficult to realize.
Disclosure of Invention
Aiming at the defects of the prior methods, the invention provides a vector-matrix FPGA parallel fast multiplier that removes the repeated addressing required by existing calculation methods, effectively reduces the number of memory accesses and the memory access time, improves the calculation speed, and realizes parallel multiplication of a vector and a matrix, together with an implementation of the vector-matrix multiplier.
The technical scheme of the invention is as follows:
The structure comprises n+1 FIFO queue memories (memory (0), memory (1), memory (2), …, memory (n)), n multipliers (M1, M2, …, Mn), n accumulators (A1, A2, …, An), n buffers (Buf1, Buf2, …, Bufn) and n controllers (controller 1, controller 2, …, controller n).
Each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller has 1 input port and 2 output ports.
The connection relationship of the components is as follows:
The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (1) is connected to the other input port of multiplier M1; the output port of multiplier M1 is connected to one input port of accumulator A1; the other input port of accumulator A1 is connected to output port 1 of controller 1; the output port of accumulator A1 is connected to the input port of buffer Buf1; the output port of buffer Buf1 is connected to the input port of controller 1; output port 2 of controller 1 is the final result output port out1.

The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (2) is connected to the other input port of multiplier M2; the output port of multiplier M2 is connected to one input port of accumulator A2; the other input port of accumulator A2 is connected to output port 1 of controller 2; the output port of accumulator A2 is connected to the input port of buffer Buf2; the output port of buffer Buf2 is connected to the input port of controller 2; output port 2 of controller 2 is the final result output port out2.

……

The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (n) is connected to the other input port of multiplier Mn; the output port of multiplier Mn is connected to one input port of accumulator An; the other input port of accumulator An is connected to output port 1 of controller n; the output port of accumulator An is connected to the input port of buffer Bufn; the output port of buffer Bufn is connected to the input port of controller n; output port 2 of controller n is the final result output port outn.

The operation steps are as follows:
S1: The m-dimensional vector X is stored in memory (0), i.e. the values stored in memory (0) are x1, x2, …, xm. For convenience of storage, the matrix W is converted into an n × m matrix, and its n row vectors are stored in memory (1), memory (2), …, memory (n) respectively; i.e. the values stored in memory (1) are w11, w21, …, wm1; the values stored in memory (2) are w12, w22, …, wm2; and so on, the values stored in memory (n) are w1n, w2n, …, wmn.

S2: Fetch the 1st element x1 from memory (0) and the 1st elements w11, w12, …, w1n from memory (1), memory (2), …, memory (n). Send x1 to each of the multipliers M1, M2, …, Mn, and send w11, w12, …, w1n to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of x1 with w11, w12, …, w1n; then send the products to the corresponding accumulators A1, A2, …, An.

S3: Fetch the 2nd element x2 from memory (0) and the 2nd elements w21, w22, …, w2n from memory (1), memory (2), …, memory (n). Send x2 to each of the multipliers M1, M2, …, Mn, and send w21, w22, …, w2n to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of x2 with w21, w22, …, w2n; then send the products to the corresponding accumulators A1, A2, …, An.

S4: Fetch the i-th element xi (3 ≤ i ≤ m) from memory (0) and the i-th elements wi1, wi2, …, win from memory (1), memory (2), …, memory (n). Send xi to each of the multipliers M1, M2, …, Mn, and send wi1, wi2, …, win to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of xi with wi1, wi2, …, win; then send the products to the corresponding accumulators A1, A2, …, An.

S5: If i < m, set i = i + 1 and repeat step S4; otherwise go to step S6.

S6: Accumulators A1, A2, …, An store their output results in buffers Buf1, Buf2, …, Bufn respectively. Controller 1, controller 2, …, controller n judge whether the operation of vector X and matrix W is finished; if not, the results in Buf1, Buf2, …, Bufn are fed back to accumulators A1, A2, …, An; if finished, the results in Buf1, Buf2, …, Bufn are output, i.e. the result of multiplying the vector X by the matrix W is [out1, out2, …, outn].
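The dataflow of steps S1–S6 can be sketched in software. The following minimal Python model (the names `vector_matrix_multiply`, `mem0` and `mems` are illustrative, not from the patent) keeps memory (0) and memory (1)…memory (n) as FIFO queues and runs one multiply-accumulate per lane per cycle:

```python
from collections import deque

def vector_matrix_multiply(x, wt_rows):
    """Model of the n-lane parallel multiplier module.

    x       : m-dimensional vector X, stored in memory (0) as a FIFO.
    wt_rows : the n row vectors of the transposed matrix W^T; row k is
              stored in memory (k+1) as a FIFO.
    """
    m, n = len(x), len(wt_rows)
    mem0 = deque(x)                        # memory (0): shared vector FIFO
    mems = [deque(r) for r in wt_rows]     # memory (1) .. memory (n)
    acc = [0] * n                          # accumulators A1 .. An

    for _ in range(m):                     # steps S2-S5: one element per cycle
        xi = mem0.popleft()                # common operand, broadcast to all lanes
        for k in range(n):                 # n-way parallel multiply (M1 .. Mn)
            acc[k] += xi * mems[k].popleft()

    return acc                             # step S6: controllers emit out1 .. outn
```

For example, with X = (1, 2, 3) and a 3 × 2 matrix W whose rows are (1, 2), (3, 4), (5, 6), the lane memories hold the rows (1, 3, 5) and (2, 4, 6) of Wᵀ, and the model returns [22, 28].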
Drawings
FIG. 1 is a schematic view of the structure
Detailed Description
As shown in FIG. 1, for clarity of description, a controller module is added to control whether the multiplication is complete: if not, the result in the buffer Buf is fed back to the accumulator A; if the calculation is complete, the result is output at Out.
The technical scheme of the invention is as follows:
the FPGA parallel fast multiplier principle of the vector and matrix is as follows:
A vector X = (x1, x2, x3, …, xm) and an m × n matrix

W = [ w11  w12  …  w1n
      w21  w22  …  w2n
      ⋮    ⋮         ⋮
      wm1  wm2  …  wmn ]
To compute X × W with the conventional calculation method, the following operations are required:

1) fetch x1 and w11, compute x1 × w11;

2) fetch x2 and w21, compute x2 × w21;

3) fetch x3 and w31, compute x3 × w31;

……

4) fetch xm and wm1, compute xm × wm1;

5) compute Σ xi × wi1, obtaining X × W1 and completing the product of the vector X with the first column of the matrix W.
The product of the vector with one matrix column thus requires 2m memory accesses; by extension, the product of the m-dimensional vector X with the m × n matrix W requires 2mn memory accesses, and the parallelism is very poor.
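The 2mn access count of the conventional column-wise method can be checked with a short sketch (the function name `serial_product` is illustrative; each operand fetch is counted as one memory access, matching the accounting above):

```python
def serial_product(x, w):
    """Conventional method: for each column j of W, fetch x_i and w_ij
    pairwise and accumulate -- 2m accesses per column, 2mn in total."""
    m, n = len(x), len(w[0])
    accesses = 0
    out = []
    for j in range(n):                # one matrix column at a time
        s = 0
        for i in range(m):
            s += x[i] * w[i][j]       # one fetch of x_i, one fetch of w_ij
            accesses += 2
        out.append(s)
    return out, accesses
```

For X = (1, 2, 3) and the 3 × 2 example matrix, this returns the product [22, 28] together with 2 × 3 × 2 = 12 memory accesses.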
If W is transposed,

Wᵀ = [ w11  w21  …  wm1
       w12  w22  …  wm2
       ⋮    ⋮         ⋮
       w1n  w2n  …  wmn ]

then the following operations may be performed:

Take the first element x1 of X as the common operand and the first column of Wᵀ; n-way parallel multiplication yields

(x1 × w11, x1 × w12, …, x1 × w1n).

Take the second element x2 of X as the common operand and the second column of Wᵀ; n-way parallel multiplication yields

(x2 × w21, x2 × w22, …, x2 × w2n).

……

Take the m-th element xm of X as the common operand and the m-th column of Wᵀ; n-way parallel multiplication yields

(xm × wm1, xm × wm2, …, xm × wmn).

Accumulating the above results gives the product of the vector X and the matrix W:

X × W = (Σ xi × wi1, Σ xi × wi2, …, Σ xi × win).
This method accesses memory mn + m times, and when n > 1, 2mn > mn + m. That is, the second method requires fewer memory accesses and is easy to parallelize; when designing a parallel module, its structure can be as shown in FIG. 1:
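The mn + m count of the transposed, broadcast method can be checked the same way (the function name `broadcast_product` is illustrative; the key point is that each xi is fetched once and reused by all n lanes):

```python
def broadcast_product(x, w):
    """Transposed method: fetch each x_i once (m accesses) and each
    w_ij once (mn accesses) -- mn + m memory accesses in total."""
    m, n = len(x), len(w[0])
    accesses = 0
    acc = [0] * n
    for i in range(m):
        xi = x[i]                     # single fetch, broadcast to n lanes
        accesses += 1
        for j in range(n):
            acc[j] += xi * w[i][j]    # one fetch of w_ij per lane
            accesses += 1
    return acc, accesses
```

For the same 3 × 2 example this yields [22, 28] with only 3 × 2 + 3 = 9 accesses, versus 12 for the serial method, illustrating 2mn > mn + m for n > 1.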
the execution process of the multiplier in the FPGA comprises the following steps:
If the m-dimensional vector X is multiplied by the m × n matrix W, the multiplication process is as follows:
(1) Store the m-dimensional vector X in memory (0), i.e. the values stored in memory (0) are x1, x2, …, xm. For convenience of storage, the matrix W is converted into an n × m matrix, and its n row vectors are stored in memory (1), memory (2), …, memory (n) respectively; i.e. the values stored in memory (1) are w11, w21, …, wm1; the values stored in memory (2) are w12, w22, …, wm2; and so on, the values stored in memory (n) are w1n, w2n, …, wmn.

(2) Fetch the 1st element x1 from memory (0) and the 1st elements w11, w12, …, w1n from memory (1), memory (2), …, memory (n). Send x1 to each of the multipliers M1, M2, …, Mn, and send w11, w12, …, w1n to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of x1 with w11, w12, …, w1n; then send the products to the corresponding accumulators A1, A2, …, An.

(3) Fetch the 2nd element x2 from memory (0) and the 2nd elements w21, w22, …, w2n from memory (1), memory (2), …, memory (n). Send x2 to each of the multipliers M1, M2, …, Mn, and send w21, w22, …, w2n to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of x2 with w21, w22, …, w2n; then send the products to the corresponding accumulators A1, A2, …, An.

(4) And so on, until the m-th element xm is fetched from memory (0) together with the m-th elements wm1, wm2, …, wmn from memory (1), memory (2), …, memory (n). Send xm to each of the multipliers M1, M2, …, Mn, and send wm1, wm2, …, wmn to M1, M2, …, Mn in turn, implementing the n-way parallel multiplication of xm with wm1, wm2, …, wmn; then send the products to the corresponding accumulators A1, A2, …, An.

(5) Accumulators A1, A2, …, An store their output results in buffers Buf1, Buf2, …, Bufn respectively. Controller 1, controller 2, …, controller n judge whether the operation of vector X and matrix W is finished; if not, the results in Buf1, Buf2, …, Bufn are fed back to accumulators A1, A2, …, An; if finished, the results in Buf1, Buf2, …, Bufn are output, i.e. the result of multiplying the vector X by the matrix W is [out1, out2, …, outn].
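Steps (1)–(5) can be checked end-to-end against the direct definition of X × W. The following sketch (the names `module_result` and `reference` are illustrative) broadcasts each xi to the n lanes exactly as the module does, with each lane k holding row k of Wᵀ:

```python
import random

def module_result(x, wt_rows):
    """Steps (1)-(5): broadcast each x_i to n lanes and accumulate
    against the i-th element of each stored row of W^T."""
    n = len(wt_rows)
    acc = [0] * n
    for i, xi in enumerate(x):
        for k in range(n):
            acc[k] += xi * wt_rows[k][i]
    return acc

def reference(x, w):
    """Direct definition of X * W, used for checking."""
    m, n = len(x), len(w[0])
    return [sum(x[i] * w[i][j] for i in range(m)) for j in range(n)]

# Randomized check of the equivalence on a 5-dimensional vector
# and a 5 x 4 matrix.
random.seed(0)
m, n = 5, 4
x = [random.randint(-9, 9) for _ in range(m)]
w = [[random.randint(-9, 9) for _ in range(n)] for _ in range(m)]
wt = [[w[i][k] for i in range(m)] for k in range(n)]   # rows of W^T
assert module_result(x, wt) == reference(x, w)
```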
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, improvement and the like made within the content and principle of the present invention shall be included in the protection scope of the present invention.

Claims (1)

1. An FPGA parallel fast multiplier module for vector and matrix, characterized in that its structure is as follows:
the structure consists of n+1 FIFO (first-in first-out) queue memories, n multipliers, n accumulators, n buffers and n controllers;
each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller is provided with 1 input port and 2 output ports;
the connection relationship of the components is as follows:
the output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (1) is connected to the other input port of multiplier M1; the output port of multiplier M1 is connected to one input port of accumulator A1; the other input port of accumulator A1 is connected to output port 1 of controller 1; the output port of accumulator A1 is connected to the input port of buffer Buf1; the output port of buffer Buf1 is connected to the input port of controller 1; output port 2 of controller 1 is the final result output port out1;

the output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (2) is connected to the other input port of multiplier M2; the output port of multiplier M2 is connected to one input port of accumulator A2; the other input port of accumulator A2 is connected to output port 1 of controller 2; the output port of accumulator A2 is connected to the input port of buffer Buf2; the output port of buffer Buf2 is connected to the input port of controller 2; output port 2 of controller 2 is the final result output port out2;

and so on, the output port of memory (n) is connected to the other input port of multiplier Mn; the output port of multiplier Mn is connected to one input port of accumulator An; the other input port of accumulator An is connected to output port 1 of controller n; the output port of accumulator An is connected to the input port of buffer Bufn; the output port of buffer Bufn is connected to the input port of controller n; output port 2 of controller n is the final result output port outn.
CN201921019111.1U 2019-07-02 2019-07-02 FPGA parallel fast multiplier module for vector and matrix Expired - Fee Related CN210776651U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201921019111.1U CN210776651U (en) 2019-07-02 2019-07-02 FPGA parallel fast multiplier module for vector and matrix


Publications (1)

Publication Number Publication Date
CN210776651U (en) 2020-06-16

Family

ID=71047003



Legal Events

Date Code Title Description
GR01 Patent grant

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200616

Termination date: 20210702