CN210776651U - FPGA parallel fast multiplier module for vector and matrix
- Publication number: CN210776651U
- Application number: CN201921019111.1U
- Authority: CN (China)
- Prior art keywords: output port, memory, multiplier, port, accumulator
- Prior art date: 2019-07-02
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Complex Calculations (AREA)
Abstract
The FPGA parallel fast multiplier module for vectors and matrices eliminates the repeated addressing required by existing methods during calculation, effectively reduces the number of memory accesses and the memory access time, improves the calculation speed, realizes parallel multiplication of a vector with a matrix, and provides an implementation of a vector-matrix multiplier. The technical scheme of the utility model is structured as follows: the module consists of n+1 FIFO queue structure memories, n multipliers, n accumulators, n buffers and n controllers. Each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller has 1 input port and 2 output ports.
Description
Technical Field
The invention belongs to the field of information and communication, and particularly relates to an FPGA parallel fast multiplier for vectors and matrices.
Background
Vector-matrix multiplication is one of the most basic operations in modern signal processing and is widely used in feature extraction in image processing, sparse signal processing, data compression in machine learning, and process control in automatic control. It is time-consuming, computationally complex and memory-intensive, and its performance directly affects the overall performance of the system.
In recent years, with the rapid development of FPGA technology, FPGAs have integrated acquisition, control, processing and transmission functions into a single chip, shortening the development cycle, while parallel computing greatly increases programmable flexibility; with improvements in process and precision, FPGAs are increasingly used in computation-intensive applications. Based on the design principles and architecture of the FPGA, parallel processing can be realized quickly and effectively, and the computing speed improved, by designing multiple parallel computing modules. However, most existing FPGA-based vector-matrix multiplier designs adopt a serial approach, which suffers from long latency, poor scalability, and bandwidth that doubles as the dimensionality grows. Consequently, the existing processing schemes require complex control, cannot pipeline real-time data, have high computational complexity and large memory consumption, and are difficult to implement.
Disclosure of Invention
The invention aims to provide an FPGA parallel fast multiplier for vectors and matrices that addresses the shortcomings of existing methods: it removes the repeated addressing required during calculation, effectively reduces the number of memory accesses and the memory access time, improves the calculation speed, realizes parallel vector-matrix multiplication, and provides an implementation of a vector-matrix multiplier.
The technical scheme of the invention is as follows. The structure comprises n+1 FIFO queue structure memories (memory (0), memory (1), memory (2), …, memory (n)), n multipliers (M1, M2, …, Mn), n accumulators (A1, A2, …, An), n buffers (Buf1, Buf2, …, Bufn) and n controllers (controller 1, controller 2, …, controller n).
Each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller has 1 input port and 2 output ports.
The connection relationship of the components is as follows:
The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (1) is connected to the other input port of multiplier M1; the output port of multiplier M1 is connected to one input port of accumulator A1; the other input port of accumulator A1 is connected to output port 1 of controller 1; the output port of accumulator A1 is connected to the input port of buffer Buf1; the output port of buffer Buf1 is connected to the input port of controller 1; and output port 2 of controller 1 is the final result output port out1;
The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (2) is connected to the other input port of multiplier M2; the output port of multiplier M2 is connected to one input port of accumulator A2; the other input port of accumulator A2 is connected to output port 1 of controller 2; the output port of accumulator A2 is connected to the input port of buffer Buf2; the output port of buffer Buf2 is connected to the input port of controller 2; and output port 2 of controller 2 is the final result output port out2;
……
The output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (n) is connected to the other input port of multiplier Mn; the output port of multiplier Mn is connected to one input port of accumulator An; the other input port of accumulator An is connected to output port 1 of controller n; the output port of accumulator An is connected to the input port of buffer Bufn; the output port of buffer Bufn is connected to the input port of controller n; and output port 2 of controller n is the final result output port outn.
The operation steps are as follows:
S1: The m-dimensional vector X is stored in memory (0), i.e., the values stored in memory (0) are x1, x2, …, xm. For convenience of storage, the matrix W is converted into an n×m-dimensional matrix, and its n row vectors are stored in memory (1), memory (2), …, memory (n), respectively; i.e., the values stored in memory (1) are w11, w21, …, wm1, the values stored in memory (2) are w12, w22, …, wm2, and by analogy, the values stored in memory (n) are w1n, w2n, …, wmn;
S2: Fetch the 1st element x1 from memory (0) and the 1st elements w11, w12, …, w1n from memory (1), memory (2), …, memory (n). Send x1 to each of the multipliers M1, M2, …, Mn, and send w11, w12, …, w1n to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of x1 with w11, w12, …, w1n; then send the products to the corresponding accumulators A1, A2, …, An;
S3: Fetch the 2nd element x2 from memory (0) and the 2nd elements w21, w22, …, w2n from memory (1), memory (2), …, memory (n). Send x2 to each of the multipliers M1, M2, …, Mn, and send w21, w22, …, w2n to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of x2 with w21, w22, …, w2n; then send the products to the corresponding accumulators A1, A2, …, An;
S4: Fetch the i-th (m > i ≥ 3) element xi from memory (0) and the i-th elements wi1, wi2, …, win from memory (1), memory (2), …, memory (n). Send xi to each of the multipliers M1, M2, …, Mn, and send wi1, wi2, …, win to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of xi with wi1, wi2, …, win; then send the products to the corresponding accumulators A1, A2, …, An;
S5: If i < m, let i = i + 1 and repeat step S4; otherwise, go to step S6;
S6: The accumulators A1, A2, …, An store their output results in the buffers Buf1, Buf2, …, Bufn, respectively. Controller 1, controller 2, …, controller n judge whether the operation on vector X and matrix W is finished; if not, the results in Buf1, Buf2, …, Bufn are fed back to the accumulators A1, A2, …, An; if finished, the results in Buf1, Buf2, …, Bufn are output, i.e., the result [out1, out2, …, outn] of multiplying the vector X by the matrix W.
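For readability, the following is a minimal software sketch of steps S1–S6 (Python; the function and variable names are illustrative and the n lane multiplications, which the hardware performs in parallel, are written here as an ordinary loop). It models memory (0) holding X, memories (1)…(n) holding the rows of the transposed matrix as FIFO queues, and n multiply-accumulate lanes each consuming one element per step.

```python
from collections import deque

def parallel_vector_matrix_multiply(x, w):
    """Simulate steps S1-S6 for an m-dimensional vector x and an m x n matrix w."""
    m, n = len(w), len(w[0])
    assert len(x) == m

    # S1: memory (0) holds x; memory (k) holds column k of w,
    # i.e. row k of the transposed (n x m) matrix, as a FIFO queue.
    mem0 = deque(x)
    mems = [deque(w[i][k] for i in range(m)) for k in range(n)]

    acc = [0] * n                      # accumulators A1..An (buffered in Buf1..Bufn)
    for _ in range(m):                 # S2-S5: one element of x per step
        xi = mem0.popleft()            # common operand broadcast to all n lanes
        for k in range(n):             # n-way parallel multiplication (a loop here)
            acc[k] += xi * mems[k].popleft()

    # S6: the controllers detect completion and release [out1, ..., outn]
    return acc

# usage: x = [1, 2, 3], w = 3x2 matrix -> [1*1+2*3+3*5, 1*2+2*4+3*6] = [22, 28]
print(parallel_vector_matrix_multiply([1, 2, 3], [[1, 2], [3, 4], [5, 6]]))
```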
Drawings
FIG. 1 is a schematic view of the structure.
Detailed Description
As shown in FIG. 1, for clarity of description a controller module is added to control whether the multiplication has been completed: if not, the buffered result Buf is fed back to the accumulator A1; if the calculation is complete, the result Out is output.
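As a rough illustration only (the element counter and the names below are assumptions, not taken from the drawing), each controller's decision can be modelled as: while fewer than m elements have been accumulated, route the buffered partial sum back to the accumulator; once all m elements have been consumed, drive the final output port.

```python
def controller_step(buf_value, elements_consumed, m):
    """Hypothetical model of one controller lane: feed back or emit the result."""
    if elements_consumed < m:
        # not finished: return the buffered partial sum to the accumulator input
        return ("feedback", buf_value)
    # finished: the buffered value becomes the final result on the out port
    return ("output", buf_value)
```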
The technical scheme of the invention is as follows:
the FPGA parallel fast multiplier principle of the vector and matrix is as follows:
To compute X × W with the conventional calculation method, the following operations are required:
1) Fetch x1 and w11, compute x1*w11;
2) Fetch x2 and w21, compute x2*w21;
3) Fetch x3 and w31, compute x3*w31;
……
m) Fetch xm and wm1, compute xm*wm1;
m+1) Compute Σ xi*wi1 (i = 1, …, m) to obtain X·W1, completing the product of the vector X with the first column of the matrix W.
The product of the vector with one matrix column requires 2m memory accesses; by analogy, the product of the m-dimensional vector X with the m×n-dimensional matrix W requires 2mn memory accesses, and the parallelism is very poor.
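The access count can be checked with a short sketch of the conventional serial method (illustrative Python, not part of the patent): every product fetches one element of X and one element of W, so one column costs 2m accesses and the whole m×n matrix costs 2mn.

```python
def serial_multiply_with_access_count(x, w):
    """Conventional column-by-column computation of x * w with a memory-access counter."""
    m, n = len(w), len(w[0])
    accesses = 0
    result = []
    for k in range(n):                 # one column of w at a time
        total = 0
        for i in range(m):
            xi = x[i]; wik = w[i][k]   # fetch x_i and w_ik: 2 accesses per product
            accesses += 2
            total += xi * wik
        result.append(total)
    return result, accesses            # accesses == 2 * m * n

print(serial_multiply_with_access_count([1, 2, 3], [[1, 2], [3, 4], [5, 6]]))
# ([22, 28], 12)  ->  2*m*n = 2*3*2 = 12
```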
By contrast, the present method proceeds as follows. Take the first element x1 of X as the common operand and take the first column of W^T (the first row of W); n-way parallel multiplication then yields x1*w11, x1*w12, …, x1*w1n.
Take the second element x2 of X as the common operand and take the second column of W^T (the second row of W); n-way parallel multiplication then yields x2*w21, x2*w22, …, x2*w2n.
……
Take the m-th element xm of X as the common operand and take the m-th column of W^T (the m-th row of W); n-way parallel multiplication then yields xm*wm1, xm*wm2, …, xm*wmn.
Accumulating the above results gives the product of the vector X and the matrix W: X·W = [Σ xi*wi1, Σ xi*wi2, …, Σ xi*win] = [out1, out2, …, outn], where each sum runs over i = 1, …, m.
This method accesses memory mn + m times, and when n > 1, 2mn > mn + m. That is, the second method requires fewer memory accesses and is easy to parallelize; when designing a parallel module, its structure can be as shown in FIG. 1.
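For comparison, a sketch of the element-broadcast scheme described above (again illustrative Python): each step fetches xi once and one stored element per lane, so the whole product costs m + mn accesses and returns the same result as the serial method.

```python
def broadcast_multiply_with_access_count(x, w):
    """Element-broadcast computation of x * w, counting memory accesses (m + m*n)."""
    m, n = len(w), len(w[0])
    accesses = 0
    acc = [0] * n
    for i in range(m):
        xi = x[i]                      # x_i fetched once and broadcast to all n lanes
        accesses += 1
        for k in range(n):
            acc[k] += xi * w[i][k]     # one access per lane for w_ik
            accesses += 1
    return acc, accesses               # accesses == m + m * n

print(broadcast_multiply_with_access_count([1, 2, 3], [[1, 2], [3, 4], [5, 6]]))
# ([22, 28], 9)  ->  m + m*n = 3 + 6 = 9 < 12 = 2*m*n
```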
The execution process of the multiplier in the FPGA comprises the following steps:
If the m-dimensional vector X is multiplied by the m×n-dimensional matrix W, the multiplication proceeds as follows:
(1) Store the m-dimensional vector X in memory (0), i.e., the values stored in memory (0) are x1, x2, …, xm. For convenience of storage, the matrix W is converted into an n×m-dimensional matrix, and its n row vectors are stored in memory (1), memory (2), …, memory (n), respectively; i.e., the values stored in memory (1) are w11, w21, …, wm1, the values stored in memory (2) are w12, w22, …, wm2, and by analogy, the values stored in memory (n) are w1n, w2n, …, wmn.
(2) Fetch the 1st element x1 from memory (0) and the 1st elements w11, w12, …, w1n from memory (1), memory (2), …, memory (n). Send x1 to each of the multipliers M1, M2, …, Mn, and send w11, w12, …, w1n to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of x1 with w11, w12, …, w1n; then send the products to the corresponding accumulators A1, A2, …, An.
(3) Fetch the 2nd element x2 from memory (0) and the 2nd elements w21, w22, …, w2n from memory (1), memory (2), …, memory (n). Send x2 to each of the multipliers M1, M2, …, Mn, and send w21, w22, …, w2n to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of x2 with w21, w22, …, w2n; then send the products to the corresponding accumulators A1, A2, …, An.
(4) And so on, until the m-th element xm is fetched from memory (0) and the m-th elements wm1, wm2, …, wmn from memory (1), memory (2), …, memory (n). Send xm to each of the multipliers M1, M2, …, Mn, and send wm1, wm2, …, wmn to M1, M2, …, Mn in turn, realizing the n-way parallel multiplication of xm with wm1, wm2, …, wmn; then send the products to the corresponding accumulators A1, A2, …, An.
(5) The accumulators A1, A2, …, An store their output results in the buffers Buf1, Buf2, …, Bufn, respectively. Controller 1, controller 2, …, controller n judge whether the operation on vector X and matrix W is finished; if not, the results in Buf1, Buf2, …, Bufn are fed back to the accumulators A1, A2, …, An; if finished, the results in Buf1, Buf2, …, Bufn are output, i.e., the result [out1, out2, …, outn] of multiplying the vector X by the matrix W.
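A small worked trace of steps (1)–(5) for m = 3 and n = 2 (the numbers are chosen here purely for illustration): with X = [1, 2, 3] and W = [[1, 2], [3, 4], [5, 6]], the partial sums in the two accumulator lanes evolve as follows.

```python
x = [1, 2, 3]
w = [[1, 2], [3, 4], [5, 6]]            # m = 3, n = 2

acc = [0, 0]                            # accumulators A1, A2
for i, xi in enumerate(x):              # steps (2), (3), (4): one element of x per step
    acc = [acc[k] + xi * w[i][k] for k in range(2)]
    print(f"after x{i + 1}: A1 = {acc[0]}, A2 = {acc[1]}")
# after x1: A1 = 1, A2 = 2
# after x2: A1 = 7, A2 = 10
# after x3: A1 = 22, A2 = 28  -> step (5) outputs [out1, out2] = [22, 28] = X * W
```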
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments, or equivalents substituted for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, improvement and the like made within the content and principle of the present invention shall be included in its protection scope.
Claims (1)
1. An FPGA parallel fast multiplier module for vectors and matrices, characterized in that its structure is as follows:
the structure consists of n+1 FIFO (first-in first-out) queue structure memories, n multipliers, n accumulators, n buffers and n controllers;
each memory has 1 input port and 1 output port; each multiplier has 2 input ports and 1 output port; each accumulator has 2 input ports and 1 output port; each buffer has 1 input port and 1 output port; each controller is provided with 1 input port and 2 output ports;
the connection relationship of the components is as follows:
the output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (1) is connected to the other input port of multiplier M1; the output port of multiplier M1 is connected to one input port of accumulator A1; the other input port of accumulator A1 is connected to output port 1 of controller 1; the output port of accumulator A1 is connected to the input port of buffer Buf1; the output port of buffer Buf1 is connected to the input port of controller 1; and output port 2 of controller 1 is the final result output port out1;
the output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (2) is connected to the other input port of multiplier M2; the output port of multiplier M2 is connected to one input port of accumulator A2; the other input port of accumulator A2 is connected to output port 1 of controller 2; the output port of accumulator A2 is connected to the input port of buffer Buf2; the output port of buffer Buf2 is connected to the input port of controller 2; and output port 2 of controller 2 is the final result output port out2;
and so on, until the output port of memory (0) is connected to one input port of each of the multipliers M1, M2, …, Mn; the output port of memory (n) is connected to the other input port of multiplier Mn; the output port of multiplier Mn is connected to one input port of accumulator An; the other input port of accumulator An is connected to output port 1 of controller n; the output port of accumulator An is connected to the input port of buffer Bufn; the output port of buffer Bufn is connected to the input port of controller n; and output port 2 of controller n is the final result output port outn.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201921019111.1U | 2019-07-02 | 2019-07-02 | FPGA parallel fast multiplier module for vector and matrix |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN210776651U (en) | 2020-06-16 |
Family
ID=71047003

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201921019111.1U (Expired - Fee Related) | FPGA parallel fast multiplier module for vector and matrix | 2019-07-02 | 2019-07-02 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN210776651U (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200616; Termination date: 20210702 |