CN109144469B - Pipeline structure neural network matrix operation architecture and method - Google Patents

Pipeline structure neural network matrix operation architecture and method

Info

Publication number
CN109144469B
CN109144469B (application CN201810813920.3A)
Authority
CN
China
Prior art keywords
input
matrix
vector
multiplication
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810813920.3A
Other languages
Chinese (zh)
Other versions
CN109144469A (en)
Inventor
王照钢
毛劲松
徐栋麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lightning Semiconductor Technology Co ltd
Original Assignee
Shanghai Lightning Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lightning Semiconductor Technology Co ltd filed Critical Shanghai Lightning Semiconductor Technology Co ltd
Priority to CN201810813920.3A priority Critical patent/CN109144469B/en
Publication of CN109144469A publication Critical patent/CN109144469A/en
Application granted granted Critical
Publication of CN109144469B publication Critical patent/CN109144469B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491Computations with decimal numbers radix 12 or 20.
    • G06F7/498Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a pipeline structure neural network matrix operation architecture, which comprises: an accelerator, realized as a digital circuit, for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns. The pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output; the multiply-add of the input vector A with the next column block of the input matrix B is then performed and its result output; and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.

Description

Pipeline structure neural network matrix operation architecture and method
Technical Field
The invention relates to the technical field of digital integrated circuit design, and in particular to a pipeline structure neural network matrix operation architecture and method.
Background
For a group of input data, such as the feature vectors of a speech signal or two-dimensional image data, the corresponding morpheme information or image label can be obtained through computation with a neural network model. However, from data input to the final output, this computation often consumes a great deal of computing and storage resources.
The performance of an integrated circuit is mainly evaluated in terms of data processing speed, stability, material cost, and occupied area, and the data processing scheme largely determines the computing speed. Chip designers therefore seek various optimizations of the processing algorithm in order to achieve high efficiency, lower cost, and better product performance. Existing neural network matrix computing architectures generally have the following disadvantages:
1. The dimensions of the matrix operation are fixed, so the operation scale cannot adapt to different networks.
2. Computation is usually performed by a CPU occupying memory such as RAM. This is a software operation whose speed depends on the CPU's operating frequency; when the scale is large, a large amount of memory is consumed and the computation efficiency is very low.
3. When matrix-vector multiplication is realized on a DSP processor, the operation is typically serial, so execution efficiency is low and latency is long. The input vector and weight matrix must be pre-stored in RAM, and intermediate variables produced during the calculation must also be written out, further increasing storage and bandwidth overhead.
Disclosure of Invention
The invention aims to provide a pipeline structure neural network matrix operation architecture and method. A digital circuit realizes an accelerator comprising an array of multiply-accumulate (MAC) units, a counter matched with the MAC units, and a shifter. Data are fed in cyclically and, as in a pipeline, partial results are accumulated and returned to zero on schedule, so that matrix-vector multiply operations can be executed in parallel. The processing speed is greatly improved compared with CPU and DSP processing, and intermediate results are stored locally without consuming extra storage. In addition, a controller dynamically configures the dimensions of the matrix and vector participating in the multiply-add operation, the number of counter pulses, and the shift depth of the shifter.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
a pipeline structured neural network matrix operation architecture, comprising:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
The pipeline structure neural network matrix operation architecture, wherein the accelerator comprises:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises a plurality of fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block; after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the next column block of the input matrix B is executed;
a counter for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B; the pulse generates a pipeline reset signal delivered through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clears its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and a plurality of third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices.
The pipeline structure neural network matrix operation architecture further comprises:
a controller, connected with the accelerator, for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
The pipeline structure neural network matrix operation architecture, wherein:
the controller is realized by a CPU.
The pipeline structure neural network matrix operation architecture, wherein:
the number of fixed-point multiply-add devices and of third register chains is the same as the number of columns contained in each column block of the input matrix B.
A method for pipeline structure neural network matrix operation implemented by a digital circuit, comprising:
performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
Compared with the prior art, the invention has the following advantages: hardware acceleration of high-speed matrix-vector multiply-add operations provides fast neural network acceleration, so that once the data are input and the model is loaded, the result is computed in real time by the operation architecture. This greatly improves the operation speed and efficiency of the neural network and in turn accelerates image or speech recognition.
Drawings
FIG. 1 is a block diagram of the structure of the present invention;
FIG. 2 is a block diagram of the accelerator according to the present invention;
FIG. 3 is a block diagram of the specific structure of the accelerator according to an embodiment of the present invention.
Detailed Description
The invention will be further described by the following detailed description of a preferred embodiment, taken in conjunction with the accompanying drawings.
As shown in fig. 1, the present invention proposes a pipeline structure neural network matrix operation architecture, which includes:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
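As a behavioral illustration only (the patent describes a hardware pipeline, not software), the column-block iteration just described can be sketched in a few lines of Python; the function name `pipelined_matvec` and the `block_cols` parameter, which stands in for the number of parallel multiply-add devices, are illustrative:

```python
import numpy as np

def pipelined_matvec(A, B, block_cols=32):
    """Behavioral sketch of the blocked multiply-add: B is processed one
    column block at a time, and each block's partial result is emitted as
    soon as its m multiply-add steps finish."""
    m, n = B.shape
    assert A.shape == (m,)
    D = np.empty(n, dtype=A.dtype)
    for j in range(0, n, block_cols):           # one iteration per column block
        block = B[:, j:j + block_cols]          # up to block_cols columns of B
        acc = np.zeros(block.shape[1], dtype=A.dtype)
        for k in range(m):                      # m multiply-add steps per block
            acc += A[k] * block[k, :]           # all columns of the block advance in parallel
        D[j:j + block.shape[1]] = acc           # output result; accumulators return to zero
    return D
```

Concatenating the per-block outputs reproduces the full product A×B, which is exactly what the iterate-until-last-block description asserts.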
As shown in fig. 2, specifically, the accelerator includes:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises a plurality of fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block (in the figure, i denotes the i-th column of the matrix B, B[i] denotes all elements of the i-th column of B, and the number of columns processed by one iteration of the multiply-add operation is x+1); after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the next column block of the input matrix B is executed;
a counter (which may be a loop counter or a timer) for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B; the reset pulse generates a pipeline reset signal delivered through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clears its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and a plurality of third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices.
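A minimal software model of one multiply-add device may clarify the RC reset behavior described above. This is a sketch under the assumption that each device keeps a local accumulator c and computes c = c + a·b per beat; the class name is hypothetical:

```python
class FixedPointMAC:
    """Behavioral model of one fixed-point multiply-add device: the
    intermediate accumulation stays local to the device until the
    counter's reset pulse outputs it and returns it to zero."""

    def __init__(self):
        self.c = 0  # local accumulator; never written to external memory

    def step(self, a, b):
        """One beat: accumulate the product of a vector element and a matrix element."""
        self.c += a * b

    def reset_pulse(self):
        """RC reset-pulse enable: output the accumulated result and return to zero."""
        result, self.c = self.c, 0
        return result
```

Streaming the m elements of A against one column of B through `step` and then firing `reset_pulse` yields that column's dot product, with no intermediate value ever leaving the device — which is why no extra storage is consumed.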
The pipeline structure neural network matrix operation architecture further comprises:
a controller, which can be realized by a CPU, connected with the accelerator and used for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
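The controller's role is pure configuration. A hypothetical register map for it might look as follows; the field names and the `iterations` helper are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Sketch of the values the CPU controller writes before an operation."""
    m: int            # length of A == rows of B; the counter fires a reset pulse every m beats
    n: int            # columns of B
    shift_depth: int  # shift depth for the elements of input vector A

    def iterations(self, num_macs: int = 32) -> int:
        """Column-block iterations needed to cover all n columns of B."""
        return -(-self.n // num_macs)  # ceiling division
```

Because m, n, and the shift depth are written at run time rather than fixed in the circuit, the same accelerator adapts to different neural network scales, addressing the first disadvantage listed in the Background.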
In this embodiment, the number of fixed-point multiply-add devices and the number of third register chains are each equal to the number of columns contained in each column block of the input matrix B.
Specifically, the parallelism can be further expanded: increasing the number of fixed-point multiply-add devices, for example, allows more columns to be calculated at a time, further reducing the number of iterations.
The implementation of the computing architecture of the present invention is further described in conjunction with a preferred embodiment:
as shown in fig. 3, the structure can calculate 32 columns of multiply-add operations at a time, and two matrices are a in the example: 1*m and B: m is n, m and n can be configured to adapt to different neural network scales, in the example, the counter (Timer) is a loop counter or Timer, a reset pulse is output to perform accumulation result output and return to zero every time m times of multiply-add operation is completed, the MAC in the figure is a 16-bit fixed-point multiply-add device, which is total 32, i.e. in this embodiment, x=31, the number of columns of multiply-add operation performed by one iteration is 32, each time the MAC performs multiply-add operation, the operation formula is c=c+a, where a is a certain value in the input vector a, B is a certain value in the input matrix B, c is the accumulation result, and RC is the reset pulse enable. The whole matrix operation process is that the input vector a inputs 1 row and m columns of elements continuously to a second Register Chain (Register Chain), meanwhile, the input matrix B inputs every 32 columns of m rows and n columns as a unit, after the input matrix B inputs all m rows of elements corresponding to the 32 columns, the operation of D [1:32] =a [1:m ] =b [1:m ] [1:32] is completed, and then the next iteration is performed, wherein each iteration needs to input all elements of the input vector a, but different column blocks of B are selected, for example, the first iteration selects the 1 st to 32 th columns of B, and the second iteration selects the 33 rd to 64 th columns of B. 
After each iteration finishes, its result is output and the accumulation procedure is repeated; only the elements of B differ between iterations, while the accumulation rule stays the same. After the last 32-wide accumulation is completed, a new matrix arranged as a 1×n array has been obtained; at this point the counter performs one reset clear, and each such clear marks the completion of one operation on the two matrices. If a new matrix operation is started, the same procedure is repeated. The advantages of this matrix multiplication are as follows. First, it effectively reduces the energy consumption and latency that a CPU or DSP would incur for the same operation. Second, it avoids the CPU pattern of reading data, decoding, analyzing, executing, and finally outputting results: here the data are fed directly into the register chains and processed beat by beat, so no decoding is needed. Third, the design can be sized flexibly: for relatively large matrices the register chains can be made 32-bit or 64-bit wide, while to save hardware area and material cost the circuit can instead be designed as a 16-bit fixed-point matrix operation circuit, at the price of more loop iterations. Fourth, this matrix operation hardware circuit improves the utilization of the adders in the circuit and saves material cost.
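The embodiment's beat-level flow, including the Timer firing once every m multiply-adds, can be modeled in plain Python. This is a software stand-in only; `run_embodiment` and its loop structure are illustrative, with `num_macs` playing the role of the 32 MAC devices:

```python
def run_embodiment(A, B, num_macs=32):
    """Beat-level sketch of the Fig. 3 flow: A is streamed once per
    iteration while one num_macs-wide column block of B streams in;
    after m beats the Timer fires, the accumulators are output into D,
    and they return to zero for the next block."""
    m, n = len(B), len(B[0])
    D = []
    for j in range(0, n, num_macs):             # one iteration per column block of B
        width = min(num_macs, n - j)            # last block may be narrower
        acc = [0] * width                       # one accumulator per MAC device
        beats = 0                               # the Timer counts multiply-add beats
        for k in range(m):                      # A[k] reaches every MAC via the register chain
            for i in range(width):
                acc[i] += A[k] * B[k][j + i]    # c = c + a*b in each MAC
            beats += 1
        assert beats == m                       # Timer fires the reset pulse here
        D.extend(acc)                           # results output; accumulators cleared
    return D
```

Running this for a 1×3 vector against a 3×4 matrix with `num_macs=2` takes two iterations and reproduces the ordinary matrix-vector product, mirroring the "same repeated operation" described above.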
The invention also provides a method for realizing pipeline structure neural network matrix operation by a digital circuit, which comprises the following steps:
performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
In summary, the invention uses a large array of fixed-point multiply-add devices integrated with a counter and feeds data in cyclically. Like a pipeline, partial results are accumulated and returned to zero on schedule, so one large matrix-vector operation is split into small matrix-vector operations and x+1 parallel vector multiplications can be performed in sequence. Matrix-vector multiplication is thus executed in parallel, and compared with conventional CPU and DSP processing the speed is greatly improved. Intermediate results are stored locally without extra storage cost: for example, the m multiply-add results of the vector A with any column of the matrix B remain in the corresponding multiply-add device, with no data movement required. Input and output data are accessed sequentially, with the data to be calculated shifted in order through the shift registers. The controller only needs to read in the data of the input vector A in advance and read each column block of the input matrix B in batches; once all data have been read in, the matrix-vector multiplication is complete.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (4)

1. A pipeline architecture neural network matrix operation architecture, comprising:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the product D of the input vector A and the input matrix B is obtained;
wherein, the accelerator includes:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises x+1 fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block; after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the input vector A with the next column block of the input matrix B is executed;
a counter for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B, the pulse generating a pipeline reset signal through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clearing its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and x+1 third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices;
wherein the number of fixed-point multiply-add devices and of third register chains equals the number of columns contained in each column block of the input matrix B; the number of columns processed by one iteration of the multiply-add operation is x+1, each iteration inputs all elements of the input vector A but selects a different column block of the input matrix B, and the input matrix B is fed in units of x+1 of its n columns.
2. The pipeline architecture neural network matrix operation architecture of claim 1, further comprising:
a controller, connected with the accelerator, for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
3. The pipeline architecture neural network matrix operation architecture of claim 2, wherein:
the controller is realized by a CPU.
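The parameters that the CPU controller of claims 2 and 3 writes into the accelerator can be summarized in a short sketch (the names and structure below are illustrative assumptions, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Hypothetical register set a CPU controller would configure,
    per claims 2 and 3: vector/matrix dimensions and counter pulses."""
    a_cols: int       # number of columns of input vector A (equals m)
    b_rows: int       # m, rows of input matrix B
    b_cols: int       # n, columns of input matrix B
    block_width: int  # x + 1, columns per column block

    def counter_pulses(self):
        # One reset pulse per completed column block: ceil(n / (x+1))
        return -(-self.b_cols // self.block_width)

cfg = AcceleratorConfig(a_cols=2, b_rows=2, b_cols=4, block_width=2)
```

Here `counter_pulses` models the "number of counter pulses" the controller configures: one pipeline-reset event per column block of B.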
4. A method for pipeline structure neural network matrix operations implemented by digital circuits, comprising:
performing operations using the pipeline architecture neural network matrix operation architecture of claim 1; performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a 1×m vector, B is an m×n matrix, and D is the 1-row, n-column output vector of the matrix operation;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of different column blocks; the input vector A and the first column block of the input matrix B are multiplied and added and the result is output; the multiply-add of the input vector A and the next column block of the input matrix B then proceeds and its result is output; this iteration is repeated until the multiply-add of the input vector A and the last column block of the input matrix B is completed and its result is output, thereby obtaining the multiplication result D of the input vector A and the input matrix B.
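As a behavioural reference for the method of claim 4 (a plain software model under the stated dimensions, not the claimed digital circuit), D = A×B can be computed block by block, with one accumulator standing in for each fixed-point multiply-add device:

```python
def pipelined_matvec(A, B, block_width):
    """Compute D = A x B (A is 1 x m, B is m x n) column block by column
    block, emitting each block's partial results before moving on, as in
    the claimed pipelined multiply-add method."""
    m, n = len(B), len(B[0])
    assert len(A) == m
    D = []
    for start in range(0, n, block_width):          # iterate over column blocks
        for col in range(start, min(start + block_width, n)):
            acc = 0                                  # one multiply-add device per column
            for k in range(m):                       # A's elements stream in sequentially
                acc += A[k] * B[k][col]
            D.append(acc)                            # output this block's column result
    return D

A = [1, 2]                     # 1 x m vector, m = 2
B = [[1, 2, 3, 4],
     [5, 6, 7, 8]]             # m x n matrix, n = 4
D = pipelined_matvec(A, B, 2)  # -> [11, 14, 17, 20]
```

Note that the block order only changes when each partial result is emitted; the final D is identical to an ordinary vector-matrix product.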
CN201810813920.3A 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method Active CN109144469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810813920.3A CN109144469B (en) 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method


Publications (2)

Publication Number Publication Date
CN109144469A CN109144469A (en) 2019-01-04
CN109144469B true CN109144469B (en) 2023-12-05

Family

ID=64801554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810813920.3A Active CN109144469B (en) 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method

Country Status (1)

Country Link
CN (1) CN109144469B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276047B (en) * 2019-05-18 2023-01-17 南京惟心光电系统有限公司 Method for performing matrix vector multiplication operation by using photoelectric calculation array
CN110738311A (en) * 2019-10-14 2020-01-31 哈尔滨工业大学 LSTM network acceleration method based on high-level synthesis
CN110889259B (en) * 2019-11-06 2021-07-09 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix
CN112434256B (en) * 2020-12-03 2022-09-13 海光信息技术股份有限公司 Matrix multiplier and processor
US20220293174A1 (en) * 2021-03-09 2022-09-15 International Business Machines Corporation Resistive memory device for matrix-vector multiplications
CN113266559B (en) * 2021-05-21 2022-10-28 华能秦煤瑞金发电有限责任公司 Neural network-based wireless detection method for concrete delivery pump blockage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662623A (en) * 2012-04-28 2012-09-12 电子科技大学 Parallel matrix multiplier based on single field programmable gate array (FPGA) and implementation method for parallel matrix multiplier
CN104572011A (en) * 2014-12-22 2015-04-29 上海交通大学 FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5008833A (en) * 1988-11-18 1991-04-16 California Institute Of Technology Parallel optoelectronic neural network processors
CN105589677A (en) * 2014-11-17 2016-05-18 沈阳高精数控智能技术股份有限公司 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof


Also Published As

Publication number Publication date
CN109144469A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
US11698773B2 (en) Accelerated mathematical engine
Nguyen et al. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
US8051124B2 (en) High speed and efficient matrix multiplication hardware module
US8443170B2 (en) Apparatus and method for performing SIMD multiply-accumulate operations
US5880981A (en) Method and apparatus for reducing the power consumption in a programmable digital signal processor
GB2474901A (en) Multiply-accumulate instruction which adds or subtracts based on a predicate value
Huynh Deep neural network accelerator based on FPGA
CN110674927A (en) Data recombination method for pulse array structure
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Sun et al. An I/O bandwidth-sensitive sparse matrix-vector multiplication engine on FPGAs
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN116710912A (en) Matrix multiplier and control method thereof
US6622153B1 (en) Virtual parallel multiplier-accumulator
CN112074810B (en) Parallel processing apparatus
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN110659014B (en) Multiplier and neural network computing platform
CN110647309A (en) High-speed big bit width multiplier
CN109343826B (en) Reconfigurable processor operation unit for deep learning
Zhuo et al. High-performance and area-efficient reduction circuits on FPGAs
US11789701B2 (en) Controlling carry-save adders in multiplication
Hormigo-Jiménez et al. High-Throughput DTW accelerator with minimum area in AMD FPGA by HLS
ŞTEFAN Integral Parallel Computation
Zhang et al. New approach for multiple vector reduction on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant