CN116382617A - Singular value decomposition accelerator with parallel ordering function based on FPGA - Google Patents
Singular value decomposition accelerator with parallel ordering function based on FPGA
- Publication number
- CN116382617A (application CN202310669739.0A)
- Authority
- CN
- China
- Prior art keywords
- singular value
- value decomposition
- column
- fpga
- column vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/06—Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
- G06F7/08—Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7817—Specially adapted for signal processing, e.g. Harvard architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/06—Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
- G06F7/20—Comparing separate sets of record carriers arranged in the same sequence to determine whether at least some of the data in one set is identical with that in the other set or sets
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Data Mining & Analysis (AREA)
- Computational Mathematics (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Complex Calculations (AREA)
- Apparatus For Radiation Diagnosis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an FPGA-based singular value decomposition accelerator with a parallel ordering function, comprising an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, and 2k blocks of internal BRAM. The k one-sided Jacobi orthogonal transformation circuits generate the norms α and β in parallel, and the rotation matrix J is handled case by case according to the size relation between α and β. One-sided Jacobi calculation is executed from round 1 to round k under a round-robin scheduling state machine. In rounds k+1 to n-1 the same rule is kept, except that the norms α and β are exchanged for every pair of column vectors other than the last pair, and the rotation matrix J of those pairs is replaced by its transpose $J^T$. The iteration is repeated until convergence. The invention completes the sorting of the singular values synchronously within the existing singular value decomposition process, eliminates the time required by a separate sorting pass, saves the hardware resources that would otherwise be dedicated to the sorting function, and significantly improves the hardware acceleration effect.
Description
Technical Field
The invention relates to the field of signal processing, in particular to a singular value decomposition accelerator with a parallel ordering function based on an FPGA.
Background
Matrix singular value decomposition is a classical and important technique in the field of signal processing, and plays an important role in data dimension reduction, hyperspectral image processing, robot positioning and navigation, artificial intelligence recommendation algorithms, and the like. Singular value decomposition projects a matrix onto different subspaces through orthogonal transformations, so that the principal components can be extracted effectively to achieve dimension reduction; singular value decomposition operators or accelerators are therefore often integrated into CPU, GPU, AI processor, and FPGA systems to improve performance. However, singular value decomposition itself involves complex computation, and an important question is how to arrange the singular values and their corresponding singular vectors in descending order while the decomposition is being computed, rather than in a separate sorting pass over the results.
Most current singular value decomposition schemes implemented on very-large-scale integrated circuits (Very Large Scale Integration, VLSI), including a matrix algorithm library provided by a certain company, separate the singular value decomposition and sorting processes: the singular value decomposition is computed first, and the singular values and singular vectors are sorted afterwards. The sorting operation and the singular value decomposition therefore execute serially, which increases the overall latency, and dedicated sorting circuitry is required to implement the sorting function, which adds hardware resource overhead.
The invention patent with application number CN201010151981.1 describes singular value decomposition, sorting of the singular values by size, and construction of an image from the first N singular values and their corresponding singular vectors. In that scheme the singular value decomposition and the sorting are performed serially, which requires additional time and computing resources.
The patent application CN2202111040096.0 describes integrating a singular value decomposition operator into the lifting AI processor to improve its performance, including an application that selects the K largest singular values to approximate the original matrix, but it does not describe how the integrated singular value decomposition operator implements singular value sorting.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an FPGA-based singular value decomposition accelerator with a parallel ordering function. Building on the classical one-sided Jacobi algorithm, it handles the norms α and β of the column vectors and the corresponding rotation matrix J differently in different rounds, so that the norms of all column vectors tend toward descending order as a whole; after several sweeps the column vector norms converge to a descending arrangement. A square root is then taken of each second-order norm to obtain the corresponding singular value, and each column vector is divided by its singular value to obtain the corresponding left singular vector. During the matrix singular value decomposition the invention sorts the singular values and singular vectors in parallel, hiding the latency of the sorting operation inside the decomposition itself, so that two steps that were originally performed serially become a single parallel step. This improves overall real-time performance and saves the hardware resources that would otherwise be dedicated to the sorting function.
The object of the invention is achieved by the following technical solution:
In one aspect, the FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k blocks of internal BRAM; the singular value decomposition accelerator performs singular value decomposition by the following steps:
S1: matrix data of m rows and n columns is written from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, with each column stored in one BRAM; every two adjacent BRAMs are grouped in sequence into a pair, so that the n BRAMs form k = n/2 pairs; here n is even, and if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make it even;
S2: the k one-sided Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ of the column vectors held in the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the one-sided Jacobi algorithm; otherwise, an alternative rotation matrix J is generated (both matrices appear as equation images in the original publication); the k rotation matrices $J_i$, $i = 1, 2, \dots, k$, corresponding to the k BRAM pairs are generated synchronously;
S4: the k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are stored in the n BRAMs;
S5: the column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated so that k rounds are executed in total;
S6: round k+1 is executed, comprising the following sub-steps:
S6.1: the second-order norms α and β and the inner product γ of the k BRAM pairs are calculated, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k-1;
S6.2: the same operation as S3 is performed to synchronously generate the k rotation matrices $J_i$, $i = 1, 2, \dots, k$;
S6.3: the last rotation matrix $J_k$ is kept unchanged, and each of the remaining rotation matrices is replaced by its transpose;
S6.4: the same operation as S4 is performed;
S7: the column vector data stored in the BRAMs is exchanged according to the round-robin scheduling state machine, and the same operations as S6 are executed for rounds k+2 to n-1; completing round n-1 completes one "sweep";
S8: steps S2 to S7 are repeated, executing multiple sweeps until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest.
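The round structure of steps S1 to S7 can be summarised in a few lines. The following Python sketch is illustrative only: the function name `round_plan` and the dictionary keys are hypothetical and do not come from the patent. It pads an odd column count as in S1 and lists, for each round of one sweep, which pairs exchange α and β and use the transposed rotation matrix.

```python
def round_plan(num_columns):
    """Return (n, k, plan) for one sweep; plan[r - 1] describes round r.

    Hypothetical helper: the names and the layout of the result are
    illustrative assumptions, not part of the patent text.
    """
    n = num_columns + (num_columns % 2)    # S1: append one all-zero column if odd
    k = n // 2                             # number of column pairs (and circuits)
    plan = []
    for r in range(1, n):                  # one sweep consists of rounds 1 .. n-1
        if r <= k:
            swapped_pairs = []             # S2-S5: plain rule for every pair
        else:
            swapped_pairs = list(range(1, k))   # S6-S7: pairs 1..k-1, not pair k
        plan.append({"round": r, "swap_and_transpose": swapped_pairs})
    return n, k, plan


n, k, plan = round_plan(5)
print(n, k)       # 6 3
print(plan[2])    # {'round': 3, 'swap_and_transpose': []}
print(plan[3])    # {'round': 4, 'swap_and_transpose': [1, 2]}
```

For a 5-column input the sketch pads to n = 6, so rounds 1 to 3 use the plain rule and rounds 4 and 5 apply the exchanged rule to pairs 1 and 2 only.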
Further, the one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a k-th rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.
Further, the round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β, and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the results of the Jacobi orthogonal rotation calculation back to the BRAMs.
Further, under the initial column vector index rule, the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n.
Further, in the first k rounds of each sweep, the column vector index of the upper row is always greater than that of the lower row; in rounds k+1 to n-1 of each sweep, the column vector index of the upper row is always smaller than that of the lower row, except in the last column.
Further, the second-order norms of the lower-row column vectors are α, namely $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_k$; the second-order norms of the upper-row column vectors are β, namely $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.
Further, while S8 is being executed, the second-order norms of the column vectors show an overall descending trend during the first few sweeps, apart from local oscillation, and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \dots \ge \alpha_k \ge \beta_k$.
Further, the generation formulas of cos θ and sin θ are as follows:
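The formula itself did not survive in this text. For orientation only, the textbook one-sided (Hestenes) Jacobi rotation for a column pair with inner product γ ≠ 0 is commonly written as below, treating α and β as squared 2-norms (consistent with the later square-root step). The auxiliary symbols ζ and t are introduced here for illustration, and the patent's own expressions may differ in form or sign convention.

```latex
\zeta = \frac{\beta - \alpha}{2\gamma}, \qquad
t = \frac{\operatorname{sign}(\zeta)}{|\zeta| + \sqrt{1 + \zeta^{2}}}, \qquad
\cos\theta = \frac{1}{\sqrt{1 + t^{2}}}, \qquad
\sin\theta = t\,\cos\theta .
```

When γ = 0 the pair is already orthogonal, and cos θ = 1, sin θ = 0 can be used.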
Further, after S8 has been performed and the convergence condition is satisfied, a square root is taken of each column vector second-order norm $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \dots, \alpha_k, \beta_k$ to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \dots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \dots \ge \sigma_{n-1} \ge \sigma_n$, and the results $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \dots, \sigma_{n-1}, \sigma_n$ are written sequentially to the external DDR memory through the AXI interface.
Further, each column vector $u_1, u_2, u_3, \dots, u_n$ that satisfies convergence in S8 is divided by its corresponding singular value $\sigma_1, \sigma_2, \sigma_3, \dots, \sigma_n$ to obtain the corresponding left singular vectors $u_1/\sigma_1, u_2/\sigma_2, u_3/\sigma_3, \dots, u_n/\sigma_n$, and the results $u_1/\sigma_1, u_2/\sigma_2, u_3/\sigma_3, \dots, u_n/\sigma_n$ are written sequentially to the external DDR memory through the AXI interface.
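The final square-root and division steps can be sketched in a few lines of Python. This is a minimal illustration, not the hardware data path: the function name `finalize` is hypothetical, columns are plain lists of floats, and the handling of all-zero columns (such as a padding column) is an assumption made here.

```python
import math


def finalize(columns):
    """Turn converged, mutually orthogonal columns into singular values
    and left singular vectors (hypothetical helper for illustration)."""
    sigmas, left_vectors = [], []
    for u in columns:
        alpha = sum(x * x for x in u)     # squared 2-norm of the column
        sigma = math.sqrt(alpha)          # singular value
        sigmas.append(sigma)
        # Assumption: an all-zero column has no direction and is kept as zeros.
        left_vectors.append([x / sigma for x in u] if sigma > 0.0 else list(u))
    return sigmas, left_vectors


sigmas, U = finalize([[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]])
print(sigmas)   # [5.0, 2.0]
print(U)        # [[0.6, 0.0, 0.8], [0.0, 1.0, 0.0]]
```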
The beneficial effects of the invention are as follows:
The invention is particularly suitable for VLSI-based (including FPGA-based) implementations of matrix singular value decomposition. It sorts the singular values and singular vectors in parallel within the cyclic iterative calculation of the singular value decomposition and hides the sorting latency inside the overall decomposition, so that two steps that were originally executed serially become a single parallel, synchronous step. This improves the overall real-time performance of singular value decomposition; in particular, for application scenarios such as image compression and principal component analysis, the larger singular values and their corresponding singular vectors can be extracted more quickly. In addition, the invention saves the hardware resources that would otherwise be dedicated to the sorting function.
Drawings
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
FIG. 1 is a block diagram of a singular value decomposition accelerator with parallel ordering function;
FIG. 2 is a circuit diagram of a detailed control channel and data channel of a singular value decomposition accelerator with parallel ordering function;
FIG. 3 is a schematic diagram of the one-sided Jacobi algorithm for a 512-row by 512-column matrix based on the round-robin state machine;
FIG. 4 is a schematic diagram of the column vector exchange operations in one sweep for a column dimension of 6;
FIG. 5 is a diagram of the column vector norm size relations in one sweep for a column dimension of 6;
FIG. 6 is a graph showing the descending trend of part of the column vector norms of a 512-row by 512-column matrix over 5 sweeps.
Detailed Description
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, explanation of technical terms is given:
(1) VLSI: Very Large Scale Integration, i.e. very-large-scale integrated circuits.
(2) FPGA: Field Programmable Gate Array.
(3) BRAM: Block RAM, the block RAM inside the FPGA.
(4) Jacobi: in the invention, refers to one-sided Jacobi rotation, which is commonly used for FPGA-based matrix singular value decomposition.
(5) round-robin: polling scheduling, the scheduling mechanism commonly used in one-sided Jacobi rotation singular value decomposition.
(6) DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory, the external DDR storage.
The FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k blocks of internal BRAM; the singular value decomposition accelerator performs singular value decomposition by the following steps:
S1: matrix data of m rows and n columns is written from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, with each column stored in one BRAM; every two adjacent BRAMs are grouped in sequence into a pair, so that the n BRAMs form k = n/2 pairs; here n is even, and if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make it even;
S2: the k one-sided Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ of the column vectors held in the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the one-sided Jacobi algorithm; otherwise, an alternative rotation matrix J is generated (both matrices appear as equation images in the original publication); the k rotation matrices $J_i$, $i = 1, 2, \dots, k$, corresponding to the k BRAM pairs are generated synchronously;
Here, the generation formulas of cos θ and sin θ are as follows:
S4: the k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are temporarily stored in the n BRAMs;
S5: the column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated so that k rounds are executed in total;
S6: round k+1 is executed, comprising the following sub-steps:
S6.1: the second-order norms α and β and the inner product γ of the k BRAM pairs are calculated, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k-1;
S6.2: the same operation as S3 is performed to synchronously generate the k rotation matrices $J_i$, $i = 1, 2, \dots, k$;
S6.3: the last rotation matrix $J_k$ is kept unchanged, and each of the remaining rotation matrices $J_i$ is replaced by its transpose $J_i^T$, $i = 1, 2, \dots, k-1$;
S6.4: the same operation as S4 is performed;
S7: the column vector data stored in the BRAMs is exchanged according to the round-robin scheduling state machine, and the same operations as S6 are executed for rounds k+2 to n-1; completing round n-1 completes one "sweep";
S8: steps S2 to S7 are repeated, executing multiple sweeps until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest by BRAM sequence number.
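Steps S1 to S8 can be mimicked in software to see how sorting falls out of the sweeps. The Python sketch below is a rough model under stated assumptions, not the patented circuit: the function names (`rotate_pair`, `exchange`, `sorted_jacobi`) are hypothetical; the rotation uses the textbook one-sided Jacobi formulas; an explicit column swap stands in for the patent's case-dependent rotation matrix J and its transpose, whose exact forms are not legible in this text; an exchange is performed after every round, including the last; and the stopping rule (`tol`, `max_sweeps`, no swap in the last sweep) is chosen here for illustration.

```python
import math


def rotate_pair(a, b, larger_first):
    """Orthogonalise columns a and b; return (a', b', off, swapped)."""
    alpha = sum(x * x for x in a)                 # squared norm of first column
    beta = sum(x * x for x in b)                  # squared norm of second column
    gamma = sum(x * y for x, y in zip(a, b))      # inner product
    off = abs(gamma) / math.sqrt(alpha * beta) if alpha * beta > 0.0 else 0.0
    if gamma != 0.0:
        zeta = (beta - alpha) / (2.0 * gamma)
        t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = c * t
        a, b = ([c * x - s * y for x, y in zip(a, b)],
                [s * x + c * y for x, y in zip(a, b)])
        alpha, beta = alpha - t * gamma, beta + t * gamma   # norms after rotation
    # Assumed stand-in for the patent's J / J^T cases: swap if in the wrong order.
    swapped = alpha < beta if larger_first else alpha > beta
    if swapped:
        a, b = b, a
    return a, b, off, swapped


def exchange(slots):
    """Round-robin move: the last slot is fixed, the others advance on a ring."""
    n = len(slots)
    if n == 2:
        return slots
    new = slots[:]
    for j in range(0, n - 2, 2):      # lower row (BRAM_1, 3, ...) moves right
        new[j + 2] = slots[j]
    new[n - 3] = slots[n - 2]         # BRAM_{n-1} moves diagonally to BRAM_{n-2}
    for j in range(3, n - 1, 2):      # upper row (BRAM_4, 6, ...) moves left
        new[j - 2] = slots[j]
    new[0] = slots[1]                 # BRAM_2 moves down to BRAM_1
    return new


def sorted_jacobi(columns, tol=1e-12, max_sweeps=60):
    cols = [list(col) for col in columns]
    if len(cols) % 2:                                   # S1: pad to an even count
        cols.append([0.0] * len(cols[0]))
    n, k = len(cols), len(cols) // 2
    for _ in range(max_sweeps):                         # S8: repeat sweeps
        worst, any_swap = 0.0, False
        for rnd in range(1, n):                         # rounds 1 .. n-1
            for p in range(k):
                # Rounds 1..k and the last pair: larger norm goes to the lower
                # row; otherwise (S6/S7) the roles of alpha and beta are exchanged.
                larger_first = rnd <= k or p == k - 1
                a, b, off, sw = rotate_pair(cols[2 * p], cols[2 * p + 1], larger_first)
                cols[2 * p], cols[2 * p + 1] = a, b
                worst, any_swap = max(worst, off), any_swap or sw
            cols = exchange(cols)
        if worst < tol and not any_swap:
            break
    return [math.sqrt(sum(x * x for x in col)) for col in cols], cols


sigmas, _ = sorted_jacobi([[1.0, 0.0], [1.0, 1.0]])
print(sigmas)   # approximately [1.618, 0.618]
```

For the two-column example the exact singular values are (√5 + 1)/2 and (√5 − 1)/2, and the larger one ends up in the first slot without any separate sorting pass.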
While S8 is being executed, the second-order norms of the column vectors show an overall descending trend during the first few sweeps, apart from local oscillation, and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \dots \ge \alpha_k \ge \beta_k$.
In this embodiment, after S8 has been performed and the convergence condition is satisfied, a square root is taken of each column vector second-order norm $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \dots, \alpha_k, \beta_k$ stored for the BRAMs to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \dots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \dots \ge \sigma_{n-1} \ge \sigma_n$, and the results are written sequentially to the external DDR memory through the AXI interface. Further, each column vector $u_1, u_2, u_3, \dots, u_n$ that satisfies convergence in S8 may be divided by its corresponding singular value $\sigma_1, \sigma_2, \sigma_3, \dots, \sigma_n$ to obtain the corresponding left singular vectors $u_1/\sigma_1, u_2/\sigma_2, u_3/\sigma_3, \dots, u_n/\sigma_n$, and these results are likewise written sequentially to the external DDR memory through the AXI interface.
The one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a k-th rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.
The round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β, and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the results of the Jacobi orthogonal rotation calculation back to the BRAMs.
In addition, in the singular value decomposition process of the invention, under the initial column vector index rule the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n. In the first k rounds of each sweep, the column vector index of the upper row is always greater than that of the lower row; in rounds k+1 to n-1 of each sweep, the column vector index of the upper row is always smaller than that of the lower row, except in the last column. Further, the second-order norms of the lower-row column vectors are α, namely $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_k$; the second-order norms of the upper-row column vectors are β, namely $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.
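The index rule can be checked on a small case. The Python sketch below lists, for n = 6 columns, which original column index sits in each lower-row and upper-row position in every round of one sweep. It assumes the ring order of the exchange described in step 5 of the embodiment (lower row moves right, upper row moves left, last column fixed); the function name `sweep_layouts` is hypothetical, and the output has not been compared against FIG. 4.

```python
def sweep_layouts(n):
    """Return [(round, lower_row, upper_row)] of column indices for one sweep.

    Hypothetical helper; assumes the ring BRAM_1 -> 3 -> ... -> n-1 -> n-2
    -> ... -> 2 -> BRAM_1 with BRAM_n fixed.
    """
    ring = list(range(1, n, 2)) + list(range(n - 2, 0, -2))  # BRAMs in ring order
    result = []
    for r in range(n - 1):                  # r exchanges precede round r + 1
        held = {n: n}                       # BRAM_n always keeps column n
        for i, bram in enumerate(ring):
            held[bram] = ring[(i - r) % (n - 1)]
        lower = [held[j] for j in range(1, n, 2)]       # odd-numbered BRAMs
        upper = [held[j] for j in range(2, n + 1, 2)]   # even-numbered BRAMs
        result.append((r + 1, lower, upper))
    return result


for rnd, lower, upper in sweep_layouts(6):
    print(rnd, "upper", upper, "lower", lower)
# 1 upper [2, 4, 6] lower [1, 3, 5]
# 2 upper [4, 5, 6] lower [2, 1, 3]
# 3 upper [5, 3, 6] lower [4, 2, 1]
# 4 upper [3, 1, 6] lower [5, 4, 2]
# 5 upper [1, 2, 6] lower [3, 5, 4]
```

In rounds 1 to 3 (k = 3) every upper index exceeds the lower index below it; in rounds 4 and 5 the relation is reversed everywhere except in the last column, which matches the rule stated above.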
The method of the present invention is explained and illustrated in the following by a specific example.
This embodiment is described using the singular value decomposition of a 512-row by 512-column matrix. The matrix elements are single-precision floating-point numbers conforming to the IEEE 754 standard, and a Xilinx XC7V690T-3FFG1761 FPGA is selected as the target hardware for deployment and verification. The smallest physical unit of internal BRAM in this FPGA is an 18 Kb block; a column vector of 512 single-precision floating-point numbers fits in exactly one 18 Kb BRAM block, so 512 BRAM blocks are needed in total.
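The storage arithmetic behind this choice is short enough to spell out. The sketch below assumes that "18 Kb" means 18 × 1024 bits and that one column is stored as 512 words of 32 bits; the constant names are illustrative.

```python
BITS_PER_FLOAT32 = 32            # IEEE 754 single precision
COLUMN_DEPTH = 512               # elements per column vector
BRAM_BITS = 18 * 1024            # one 18 Kb block (assumed to be 18 * 1024 bits)
NUM_COLUMNS = 512

column_bits = COLUMN_DEPTH * BITS_PER_FLOAT32
print(column_bits)                   # 16384
print(column_bits <= BRAM_BITS)      # True: one column fits in one block
print(NUM_COLUMNS)                   # 512 blocks, one per column
```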
The specific implementation process of this embodiment is as follows:
Step 1: through the AXI interface, the 512-row by 512-column matrix data is read from the external DDR memory and written column by column into the corresponding 512 BRAM blocks inside the FPGA. Column 1 is written to the 1st BRAM and column 2 to the 2nd BRAM; the two form the 1st pair, which constitutes the internal storage of the 1# one-sided Jacobi orthogonal transformation circuit. Column 3 is written to the 3rd BRAM and column 4 to the 4th BRAM; the two form the 2nd pair, which constitutes the internal storage of the 2# one-sided Jacobi orthogonal transformation circuit. This continues in the same way until column 511 is written to the 511th BRAM and column 512 to the 512th BRAM; the two form the 256th pair, which constitutes the internal storage of the 256# one-sided Jacobi orthogonal transformation circuit, as shown in FIG. 1.
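The initial placement of step 1 reduces to simple index arithmetic. The following sketch uses a hypothetical function `placement` with 1-based numbering; the "lower"/"upper" labels follow the index rule given earlier (odd columns in the lower row, even columns in the upper row).

```python
def placement(column):
    """Map a 1-based column number to (BRAM number, circuit number, row)
    for the initial layout. Hypothetical helper for illustration."""
    bram = column                        # column j is written to the j-th BRAM
    circuit = (column + 1) // 2          # columns 2i-1 and 2i feed circuit i#
    row = "lower" if column % 2 else "upper"
    return bram, circuit, row


print(placement(1))     # (1, 1, 'lower')
print(placement(4))     # (4, 2, 'upper')
print(placement(512))   # (512, 256, 'upper')
```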
Step 2: The k single-sided Jacobi orthogonal transformation circuits in the FPGA in FIG. 1 calculate, synchronously and in parallel, the second-order norms α, β and the inner product γ of the column vectors stored in the 256 pairs of BRAM groups from step 1. That is, α_1 is the second-order norm calculated from the 1st BRAM block, β_1 is the second-order norm calculated from the 2nd BRAM block, and γ_1 is the inner product of the two; α_2 is the second-order norm calculated from the 3rd BRAM block, β_2 is the second-order norm calculated from the 4th BRAM block, and γ_2 is the inner product of the two; and so on, until α_256 is the second-order norm calculated from the 511th BRAM block, β_256 is the second-order norm calculated from the 512th BRAM block, and γ_256 is the inner product of the two.
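A minimal software sketch of the step 2 quantities follows. It assumes that the "second-order norm" is accumulated as a sum of squares (this reading is suggested by step 8, where a square root is taken to obtain the singular values); the function names are hypothetical, and the loop stands in for what the hardware does in parallel.

```python
def pair_statistics(a, b):
    """Return (alpha, beta, gamma) for one pair of columns.

    alpha and beta are sums of squares (an assumption, see the lead-in);
    gamma is the inner product of the two columns.
    """
    alpha = sum(x * x for x in a)
    beta = sum(y * y for y in b)
    gamma = sum(x * y for x, y in zip(a, b))
    return alpha, beta, gamma

def all_pair_statistics(columns):
    """Apply pair_statistics to columns (1,2), (3,4), ...; sequential stand-in."""
    return [pair_statistics(columns[i], columns[i + 1])
            for i in range(0, len(columns), 2)]

print(pair_statistics([3.0, 4.0], [1.0, 2.0]))  # (25.0, 5.0, 11.0)
```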
Step 3: Taking the 1# single-sided Jacobi orthogonal transformation circuit as an example, cos θ_1 and sin θ_1 are generated according to the single-sided Jacobi algorithm. The formula is as follows:
In order to realize the parallel ordering function, the following special treatment is applied to the output of the α and β comparison circuit:
By analogy, the 256 rotation matrices J_1, J_2, …, J_256 are generated synchronously and in parallel.
Step 4: The 256 single-sided Jacobi orthogonal transformation circuits synchronously execute the single-sided Jacobi orthogonal rotation calculation. Each column vector is updated from its current value in round 1 to the vector obtained by the round-1 orthogonal rotation transformation: the 1st pair of column vectors is rotated by J_1, the 2nd pair of column vectors is rotated by J_2, and so on, until the 256th pair of column vectors is rotated by J_256.
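Steps 3 and 4 can be illustrated with the textbook one-sided (Hestenes) Jacobi rotation. The sketch below is not the patent's exact formula or circuit: the rotation uses the standard small-angle solution, and the ordering treatment is approximated, as an assumption, by swapping the two rotated columns when α < β so that the first column of the pair keeps the larger norm. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation(alpha, beta, gamma):
    """Textbook one-sided Jacobi rotation (cos, sin) that orthogonalizes a pair."""
    if gamma == 0.0:
        return 1.0, 0.0  # columns already orthogonal
    zeta = (beta - alpha) / (2.0 * gamma)
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t

def rotate_pair(a, b):
    """Rotate one column pair; the larger-norm result is returned first.

    The swap for alpha < beta is an assumed stand-in for the patent's
    special treatment of the rotation matrix.
    """
    alpha, beta, gamma = dot(a, a), dot(b, b), dot(a, b)
    c, s = rotation(alpha, beta, gamma)
    new_a = [c * x - s * y for x, y in zip(a, b)]
    new_b = [s * x + c * y for x, y in zip(a, b)]
    if alpha < beta:
        new_a, new_b = new_b, new_a
    return new_a, new_b

u1, u2 = rotate_pair([1.0, 2.0, 0.0], [2.0, 1.0, 1.0])
print(abs(dot(u1, u2)) < 1e-12, dot(u1, u1) >= dot(u2, u2))  # True True
```

With the small-angle rotation, the column that had the larger norm before the rotation still has the larger norm afterwards, so a single comparison of α and β is enough to decide the output order.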
Step 5: FIG. 2 shows in detail the control channel and the data channel inside the accelerator for 1 pair of column vectors. The round-robin scheduling mechanism state machine is responsible for overall flow control and, according to the current round, controls the data input and the calculation results of each unit module. After the 256 pairs of column vectors have undergone the single-sided Jacobi orthogonal rotation transformation, the updated column vectors are exchanged according to the round-robin scheduling mechanism in FIG. 3. The specific method is as follows: the last column, u_512, is fixed, and the other column vectors are rotated concurrently in the counter-clockwise direction, i.e. u_1 is passed to the right to u_3, u_3 is passed to the right to u_5, …, u_509 is passed to the right to u_511, u_511 is passed diagonally to u_510, u_510 is passed to the left to u_508, u_508 is passed to the left to u_506, …, u_4 is passed to the left to u_2, and u_2 is passed down to u_1. In terms of the data scheduling exchange in FIG. 1, the data stored in BRAM_4 is passed to BRAM_2, the data stored in BRAM_6 is passed to BRAM_4, …; the data stored in BRAM_1 is passed to BRAM_3, the data stored in BRAM_3 is passed to BRAM_5, …; the data stored in BRAM_2 is passed to BRAM_1, and the data stored in BRAM_512 remains unchanged. Steps 2 to 4 are then repeated, and k=512/2=256 rounds of the above operations are executed in total; the results are shown in FIG. 3.
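The exchange described in step 5 can be modelled on two lists of column indices. This is a sketch under the assumption that the lower row initially holds the odd columns and the upper row the even columns, as stated in the index rule; `exchange` and `sweep_pairs` are hypothetical names.

```python
def exchange(lower, upper):
    """One round-robin exchange with the last upper-row column held fixed.

    Lower row shifts right, its last entry moves diagonally into the upper row,
    the upper row shifts left, and its first entry drops into the lower row.
    """
    k = len(lower)
    if k < 2:
        return list(lower), list(upper)
    new_lower = [upper[0]] + lower[:k - 1]
    new_upper = upper[1:k - 1] + [lower[k - 1], upper[k - 1]]
    return new_lower, new_upper

def sweep_pairs(n):
    """Column pairs (lower, upper) met in each of the n-1 rounds of one sweep."""
    lower = list(range(1, n, 2))      # columns 1, 3, ..., n-1
    upper = list(range(2, n + 1, 2))  # columns 2, 4, ..., n
    rounds = []
    for _ in range(n - 1):
        rounds.append(list(zip(lower, upper)))
        lower, upper = exchange(lower, upper)
    return rounds

lower, upper = exchange(list(range(1, 512, 2)), list(range(2, 513, 2)))
print(lower[:3], upper[-2:])  # [2, 1, 3] [511, 512]
```

Because one column stays fixed while the others circulate, every pair of columns meets exactly once in the n-1 rounds of a sweep, which is the usual property of a round-robin schedule.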
Step 6: The (k+1)-th round, i.e. round 256+1=257, is entered. It comprises the following sub-steps:
Step 6.1: Similarly to step 2, the second-order norms α, β and the inner product γ of the 256 pairs of column vectors are calculated, but with special processing: the relationship between the norms α_256 and β_256 of the last pair of column vectors remains unchanged, while the norm values α_i and β_i of the remaining column vector pairs are interchanged, i.e. the values are exchanged between α_1 and β_1, between α_2 and β_2, …, and between α_255 and β_255. Note that the positions of the column vectors themselves are unchanged.
Step 6.2: After the special processing of step 6.1, the same operation as in step 3 is executed to generate the 256 rotation matrices J_1, J_2, …, J_256.
Step 6.3: The rotation matrix J_256 of the last pair of column vectors is kept unchanged, and each of the remaining rotation matrices J_i (i=1, 2, …, 255) is replaced by its transpose J_i^T.
Step 6.4: The single-sided Jacobi orthogonal rotation transformation is performed synchronously on the 256 pairs of column vectors.
Step 7: Step 6 is executed repeatedly until the 511th round of operation is completed.
For ease of description and understanding, a matrix with 6 columns is used to illustrate the overall process and the ordering result of the column vector second-order norms. As shown in FIG. 4, during one sweep, the first k rounds are the first 3 steps, shown in the upper half of the figure. In the subsequent rounds k+1 through n-1, i.e. steps 4 and 5 in the lower half of the figure, the index of u_6 is 6, which is always greater than the other column indices, so the second-order norms α_3, β_3 of the last pair of column vectors are left unchanged, while the values of the second-order norms of the first 2 pairs of column vectors are exchanged, although the column vectors themselves are not exchanged. That is, in step 4, α_1 equals the second-order norm of column vector u_3, β_1 equals that of u_5, α_2 equals that of u_1, and β_2 equals that of u_4; step 5 operates in the same way. After the orthogonal transformation, the change in magnitude and the overall trend of the column vector second-order norms are shown in FIG. 5. The exchange processing of the orthogonal rotation matrix J and the second-order norms in the invention causes the second-order norms of the column vectors to be ordered from large to small by column index.
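The 6-column example can be traced in software. The sketch below assumes the round-robin exchange described in step 5 and the value exchange of step 6.1, and reports, for each round, which column supplies α_i and which supplies β_i; the function names are hypothetical.

```python
def exchange(lower, upper):
    """Round-robin exchange with the last upper-row column fixed (needs k >= 2)."""
    k = len(lower)
    return ([upper[0]] + lower[:k - 1],
            upper[1:k - 1] + [lower[k - 1], upper[k - 1]])

def norm_assignment(n):
    """Per round, list (alpha_column, beta_column) for each pair; n even, n >= 4."""
    k = n // 2
    lower = list(range(1, n, 2))      # odd column indices
    upper = list(range(2, n + 1, 2))  # even column indices
    rounds = []
    for r in range(1, n):
        pairs = []
        for i in range(k):
            # Rounds k+1 .. n-1: exchange the norm values of all pairs but the last.
            swap_values = r > k and i < k - 1
            pairs.append((upper[i], lower[i]) if swap_values else (lower[i], upper[i]))
        rounds.append(pairs)
        lower, upper = exchange(lower, upper)
    return rounds

rounds = norm_assignment(6)
print(rounds[3])  # step 4: [(3, 5), (1, 4), (2, 6)]
```

In this trace α_i always comes from the column with the smaller index in every round, which is consistent with the larger norm being steered toward the smaller column index.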
Step 8: For the matrix size of 512 rows by 512 columns, 6 sweeps are executed to meet the preset convergence condition. At this point, square root computation is performed in parallel on the column vector norms of the 512 columns to obtain the singular values σ_1, σ_2, σ_3, …, σ_512, with σ_1 ≥ σ_2 ≥ σ_3 ≥ … ≥ σ_512. Each column vector is then divided by its corresponding singular value to obtain the left singular vectors, i.e. u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_512/σ_512, which completes the singular value decomposition task. FIG. 6 shows an excerpt of the column vector second-order norm values: after 1 sweep the values as a whole show a descending trend but with local oscillation, and after 4 sweeps the column vector second-order norm values are essentially monotonically decreasing.
Step 9: The singular values in descending order obtained in step 8 are written back to the external DDR memory through the AXI interface.
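The final computation of step 8 is small enough to show directly. This sketch assumes the stored norm of each column is its sum of squares and, as a further assumption not stated in the source, leaves an all-zero column untouched instead of dividing by zero; `finalize` is a hypothetical name.

```python
import math

def finalize(columns):
    """Return (singular_values, left_singular_vectors) from converged columns."""
    sigmas, left = [], []
    for col in columns:
        sigma = math.sqrt(sum(x * x for x in col))  # square root of the sum of squares
        sigmas.append(sigma)
        # Assumption: a zero column (e.g. padding) is passed through unchanged.
        left.append([x / sigma for x in col] if sigma > 0.0 else list(col))
    return sigmas, left

sigmas, left = finalize([[3.0, 4.0], [0.0, 1.0]])
print(sigmas, left[0])  # [5.0, 1.0] [0.6, 0.8]
```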
The FPGA operation results show that, on the XC7V690T-3FFG1761 target hardware running at a 200 MHz clock, singular value decomposition of a single-precision floating-point matrix of 512 rows by 512 columns completes in 52.9 milliseconds, with the singular values and singular vectors arranged in descending order. By comparison, the matrix singular value decomposition in the Solver library published by Xilinx takes 1.687 seconds on an Alveo U250 accelerator card for a real symmetric single-precision floating-point matrix of 512 rows by 512 columns, and ordering the singular values and singular vectors there requires an additional functional circuit, which consumes further time.
The embodiments of the present invention show that, for matrix singular value decomposition based on VLSI (including FPGA), the special treatment of α_i, β_i and the rotation matrix J in the present invention arranges the singular values and singular vectors in descending order in parallel with the decomposition itself, thereby saving the time and hardware cost of a dedicated sorting task. The invention therefore improves the real-time performance of singular value decomposition and saves hardware resources.
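The reported timings can be put side by side with simple arithmetic. The snippet below uses only the figures quoted above; the derived cycle count and ratio are illustrative, and the ratio compares two different devices and two different implementations, so it should be read as a rough indication only.

```python
CLOCK_HZ = 200_000_000      # 200 MHz clock reported for the XC7V690T run
FPGA_TIME_US = 52_900       # 52.9 ms reported for the 512 x 512 decomposition
SOLVER_TIME_US = 1_687_000  # 1.687 s reported for the Xilinx Solver library on Alveo U250

cycles = FPGA_TIME_US * CLOCK_HZ // 1_000_000
ratio = SOLVER_TIME_US / FPGA_TIME_US

print(cycles)           # 10580000 clock cycles in 52.9 ms at 200 MHz
print(round(ratio, 1))  # 31.9
```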
Corresponding to the embodiment of the FPGA-based singular value decomposition accelerator with the parallel ordering function, the invention also provides an embodiment of the FPGA-based singular value decomposition system with the parallel ordering function.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (10)
1. An FPGA-based singular value decomposition accelerator with a parallel ordering function, characterized by comprising an external DDR memory, an AXI interface, k single-sided Jacobi orthogonal transformation circuits, a round-robin scheduling mechanism state machine and 2k blocks of internal memory BRAM; the singular value decomposition accelerator performs singular value decomposition by the following steps:
S1: writing matrix data of m rows and n columns from the external DDR memory into the BRAMs in the FPGA through the AXI interface, wherein each column corresponds to 1 BRAM block, and every two adjacent blocks of the n BRAM blocks are combined in sequence into a pair, giving k=n/2 pairs; wherein n is an even number; if the number of columns of the matrix is originally odd, 1 all-zero column is appended at the end to make it even;
S2: the k single-sided Jacobi orthogonal transformation circuits calculate in parallel the second-order norms α, β and the inner product γ of the k pairs of BRAM groups;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, one form of the rotation matrix is generated according to the single-sided Jacobi algorithm; otherwise, the other form of the rotation matrix is generated; the k rotation matrices J_i, i=1,2,…,k, corresponding to the k pairs of BRAMs are generated synchronously;
S4: the k single-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are stored in the n BRAM blocks;
S5: exchanging column vectors according to the round-robin scheduling mechanism state machine, repeating steps S2-S5, and executing k rounds of operations in total;
S6: executing the (k+1)-th round, comprising the following sub-steps:
S6.1: calculating the second-order norms α, β and the inner product γ of the k pairs of BRAM groups, and exchanging the values of the two second-order norms in each of BRAM groups 1 to k-1;
S6.2: performing the same operation as S3 to synchronously generate the k rotation matrices J_i, i=1,2,…,k;
S6.3: keeping the last rotation matrix J_k unchanged, and replacing each of the remaining rotation matrices with its transpose;
S6.4: performing the same operation as S4;
S7: exchanging the column vector data stored in each BRAM according to the round-robin scheduling mechanism state machine, and executing the same operations as S6 for rounds k+2 to n-1 until the (n-1)-th round is completed, which completes one "sweep" operation;
S8: repeating steps S2-S7 and executing a plurality of "sweep" operations until the iteration termination condition is met, which completes the singular value decomposition task, with the singular values of the column vectors stored in the BRAM blocks arranged from large to small.
2. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the single-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a kth rotation matrix decision module, a less-than-k-round decision module, a rotation matrix J generation module, a single-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.
3. The FPGA-based singular value decomposition accelerator with parallel ordering function of claim 1, wherein the round-robin scheduling mechanism state machine controls the generation of data streams and control streams for each single-sided jacobian orthogonal transform circuit, including the reading of BRAM, the computation of α, β, γ and α, β exchange, the computation of cos θ and sin θ, the generation of rotation matrix J, and the write-back operation of jacobian orthogonal rotation computation results to BRAM.
4. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein the initial column vector index rule is that the column vector index of the lower row is odd, respectively column vectors 1, 3, 5 … n-1, and the column vector index of the upper row is even, respectively column vectors 2, 4, 6 … n.
5. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 2, wherein the column vector index of the upper row is always greater than the column vector index of the lower row for the first k-cycles of each "sweep"; in the k+1 through n-1 th round of each "sweep", the column vector index of the upper row is always smaller than the column vector index of the lower row except for the last column.
6. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein the second-order norms of the lower-row column vectors are α, namely α_1, α_2, α_3, …, α_k; the second-order norms of the upper-row column vectors are β, namely β_1, β_2, β_3, …, β_k.
7. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein after S8 is performed, local oscillation exists in the first few "sweeps" while the second-order norms of the column vectors show an overall descending trend, and finally α_1 ≥ β_1 ≥ α_2 ≥ β_2 ≥ α_3 ≥ … ≥ α_k ≥ β_k is achieved.
9. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein after S8 is executed and the convergence condition is satisfied, square root calculation is performed on the second-order norms α_1, β_1, α_2, β_2, α_3, …, α_k, β_k of the column vectors to obtain the corresponding singular values σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_n-1, σ_n, with σ_1 ≥ σ_2 ≥ σ_3 ≥ σ_4 ≥ σ_5 ≥ … ≥ σ_n-1 ≥ σ_n, and the results σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_n-1, σ_n are written sequentially to the external DDR memory through the AXI interface.
10. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 9, wherein each column vector u_1, u_2, u_3, …, u_n satisfying convergence in S8 is divided by its corresponding singular value σ_1, σ_2, σ_3, …, σ_n to obtain the corresponding left singular vectors u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n, and the results u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n are written sequentially to the external DDR memory through the AXI interface.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310669739.0A CN116382617B (en) | 2023-06-07 | 2023-06-07 | Singular value decomposition accelerator with parallel ordering function based on FPGA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116382617A true CN116382617A (en) | 2023-07-04 |
CN116382617B CN116382617B (en) | 2023-08-29 |
Family
ID=86961959
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310669739.0A Active CN116382617B (en) | 2023-06-07 | 2023-06-07 | Singular value decomposition accelerator with parallel ordering function based on FPGA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116382617B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118153494A (en) * | 2024-05-11 | 2024-06-07 | 南京邮电大学 | Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060173948A1 (en) * | 2005-01-28 | 2006-08-03 | Bae Systems Information And Electronic Systems Integration Inc | Scalable 2X2 rotation processor for singular value decomposition |
CN101390351A (en) * | 2004-11-15 | 2009-03-18 | 高通股份有限公司 | Eigenvalue decomposition and singular value decomposition of matrices using jacobi rotation |
CN106528490A (en) * | 2016-11-30 | 2017-03-22 | 郑州云海信息技术有限公司 | FPGA (Field Programmable Gate Array) heterogeneous accelerated computing device and system |
CN107506173A (en) * | 2017-08-30 | 2017-12-22 | 郑州云海信息技术有限公司 | A kind of accelerated method, the apparatus and system of singular value decomposition computing |
KR20190059033A (en) * | 2017-11-22 | 2019-05-30 | 한국전자통신연구원 | Input vector generating apparatus and method using singular vaule decomposition for deep neural network speech recognition system |
CN112596701A (en) * | 2021-03-05 | 2021-04-02 | 之江实验室 | FPGA acceleration realization method based on unilateral Jacobian singular value decomposition |
CN113536228A (en) * | 2021-09-16 | 2021-10-22 | 之江实验室 | FPGA acceleration implementation method for matrix singular value decomposition |
US11190244B1 (en) * | 2020-07-31 | 2021-11-30 | Samsung Electronics Co., Ltd. | Low complexity algorithms for precoding matrix calculation |
CN115659880A (en) * | 2022-09-01 | 2023-01-31 | 重庆邮电大学 | Hardware circuit and method of principal component analysis algorithm based on singular value decomposition |
CN116170601A (en) * | 2023-04-25 | 2023-05-26 | 之江实验室 | Image compression method based on four-column vector block singular value decomposition |
Non-Patent Citations (1)
Title |
---|
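Ying Jun (应俊); Zhu Yunpeng (朱云鹏): "FPGA implementation of matrix singular value decomposition based on CORDIC" (基于CORDIC矩阵奇异值分解的FPGA实现), Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 03 * |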
Also Published As
Publication number | Publication date |
---|---|
CN116382617B (en) | 2023-08-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||