CN114003196B - Matrix operation device and matrix operation method - Google Patents

Publication number
CN114003196B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111028539.4A
Other languages
Chinese (zh)
Other versions
CN114003196A (en
Inventor
请求不公布姓名
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a matrix operation device and a matrix operation method. The matrix operation device comprises a storage unit, a first selection circuit, a multiplier array, a second selection circuit and an accumulator array. When the first matrix is the sparse matrix, the first selection circuit extracts a first element value from the first matrix according to the row index i and the column index j of any entry of the non-zero index table, and provides the first element value to a first input of each multiplier in a corresponding column of the multiplier array. The first selection circuit extracts a plurality of second element values from a corresponding column of the second matrix according to the row index i and provides these second element values to the second inputs of the multipliers in the corresponding column of the multiplier array. The second selection circuit selects a selected column from a plurality of columns of the accumulator array according to the column index j and transmits the outputs of the multipliers in the corresponding column of the multiplier array to the inputs of the accumulators in the selected column of the accumulator array.

Description

Matrix operation device and matrix operation method
Technical Field
The present invention relates to a matrix operation device and a matrix operation method.
Background
In artificial intelligence (AI) and neural networks, a large number of matrix multiplication computations are performed. General matrix multiplication (GEMM) is a core operation of deep learning. For example, natural language processing (NLP) models contain a large number of GEMM computations. Computer vision (CV) models also contain many convolution operations that are based on GEMM, and some typical convolution operations are themselves GEMM computations. To accelerate GEMM computation, the model matrix may be sparsified after training, forcing the model weights whose absolute values are less than a threshold to 0. After such sparsification by absolute weight value, the zeros in the model matrix are generally distributed at random, with no fixed structure. In addition, the rectified linear unit (ReLU) is an activation function commonly used in CV models. The ReLU computation converts the negative values of a matrix to 0. After the ReLU computation, the real-time intermediate tensor that serves as one of the matrix inputs therefore contains a large number of randomly distributed zeros.
As a result, the operations of conventional artificial intelligence and neural networks involve matrix multiplication on a large number of sparse matrices. In particular, accelerating matrix multiplication when the proportion of zero elements is between 5% and 95% has a great influence on the overall performance of artificial intelligence and neural networks. How to accelerate matrix multiplication on sparse matrices more generally and effectively is therefore one of the important technical subjects in this field.
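The two sources of zeros described above can be illustrated with a small sketch. The function names `prune_by_threshold`, `relu` and `zero_ratio`, the sample values and the threshold of 0.05 are illustrative assumptions and do not come from the patent text.

```python
def prune_by_threshold(weights, threshold):
    """Force every weight whose absolute value is below `threshold` to 0."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def relu(matrix):
    """Rectified linear unit: negative entries become 0."""
    return [[x if x > 0 else 0.0 for x in row] for row in matrix]


def zero_ratio(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


# Hypothetical trained weights; small magnitudes are pruned to zero.
weights = [[0.9, -0.02, 0.4], [0.01, -0.7, 0.03]]
pruned = prune_by_threshold(weights, threshold=0.05)

# Hypothetical pre-activation values; ReLU zeroes the negative ones.
activations = relu([[1.5, -2.0], [-0.5, 3.0]])
```

In both cases the positions of the zeros depend on the data, which is why no fixed sparsity pattern can be assumed.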
Disclosure of Invention
The invention provides a matrix operation device and a matrix operation method for accelerating matrix multiplication on a sparse matrix.
In an embodiment according to the present invention, the matrix operation device includes a storage unit, a first selection circuit, a multiplier array, a second selection circuit, and an accumulator array. The storage unit is adapted to store a first matrix and a second matrix. The first selection circuit is coupled to the storage unit and the multiplier array. According to a row index and a column index of any entry of a non-zero index table, the first selection circuit extracts a corresponding element value from one of the first matrix and the second matrix as a first element value and provides it to a multiplier first input of each of a plurality of multipliers in a corresponding column or a corresponding row of the multiplier array. According to the row index or the column index, the first selection circuit extracts all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values. The first selection circuit provides each of these second element values to a multiplier second input of a corresponding one of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array. Each of the plurality of multipliers of the multiplier array is configured to perform a product operation. The second selection circuit is coupled to the multiplier array and the accumulator array. The second selection circuit selects a selected column or a selected row from the accumulator array according to the column index or the row index. The second selection circuit transmits the output of each multiplier in the corresponding column or the corresponding row of the multiplier array to the input of a corresponding one of the accumulators in the selected column or the selected row of the accumulator array. Each accumulator of the accumulator array is configured to perform an accumulation operation.
In an embodiment according to the present invention, the matrix operation method includes: storing a first matrix and a second matrix in a storage unit; extracting, according to a row index and a column index of any entry of a non-zero index table, a corresponding element value from one of the first matrix and the second matrix as a first element value, and providing it to a multiplier first input of each of a plurality of multipliers in a corresponding column or a corresponding row of a multiplier array; extracting, according to the row index or the column index, all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values; providing each of the second element values to a multiplier second input of a corresponding one of the multipliers in the corresponding column or the corresponding row of the multiplier array; performing a product operation by each of the plurality of multipliers of the multiplier array; selecting a selected column or a selected row from an accumulator array according to the column index or the row index; transmitting the output of each multiplier in the corresponding column or the corresponding row of the multiplier array to the input of a corresponding one of the accumulators in the selected column or the selected row of the accumulator array; and performing an accumulation operation by each of a plurality of accumulators of the accumulator array.
Based on the above, the non-zero index table according to the embodiments of the present invention may provide the row index and the column index of the first matrix (i.e., the row position and the column position of the non-zero element in the first matrix). Based on the selection operation (switching operation) of the selection circuit, zero elements of the first matrix can be effectively excluded without occupying the computing resources of the multiplier array. Thus, the matrix operation device can accelerate matrix multiplication operation on a sparse matrix (for example, a first matrix).
Drawings
FIG. 1 is a schematic circuit block diagram of a matrix operation device according to an embodiment of the invention.
FIG. 2 is a schematic circuit block diagram of the multiply-add operation unit of FIG. 1 according to an embodiment of the invention.
FIG. 3 is a schematic circuit block diagram of a matrix operation device according to another embodiment of the invention.
FIG. 4 is a flowchart of a matrix operation method according to another embodiment of the invention.
FIG. 5 is a schematic circuit block diagram of the first selection circuit, the multiplier array and the accumulator array of FIG. 3 according to an embodiment of the invention.
FIG. 6 is a schematic circuit block diagram of the first selection circuit, the multiplier array and the accumulator array of FIG. 3 according to another embodiment of the invention.
Description of the reference numerals
100, 300: matrix operation device
210: multiplier
220, ACC: accumulator
221: adder
222: buffer
310: storage unit
320: first selection circuit
321, 323: scanning circuit
322, 324: selector
330: multiplier array
340: second selection circuit
350: accumulator array
A, B: matrix
$a_{1,1}$, $a_{1,2}$, $a_{1,n}$, $a_{1,p}$, $a_{2,1}$, $a_{2,2}$, $a_{2,n}$, $a_{2,p}$, $a_{m,1}$, $a_{m,2}$, $a_{m,n}$, $a_{m,p}$, $b_{1,1}$, $b_{1,2}$, $b_{1,k}$, $b_{2,1}$, $b_{2,2}$, $b_{2,k}$, $b_{n,1}$, $b_{n,2}$, $b_{n,k}$, $b_{p,1}$, $b_{p,2}$, $b_{p,k}$, $c_{1,1}$, $c_{1,2}$, $c_{1,k}$, $c_{2,1}$, $c_{2,2}$, $c_{2,k}$, $c_{m,1}$, $c_{m,2}$, $c_{m,k}$: element
i: row index
j: column index
$MA_{1,1}$, $MA_{1,2}$, $MA_{1,k}$, $MA_{2,1}$, $MA_{2,2}$, $MA_{2,k}$, $MA_{m,1}$, $MA_{m,2}$, $MA_{m,k}$: multiply-add operation unit
S410, S420, S430, S440, S450: step
T[s]: non-zero index table
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, that connection may be a direct connection or an indirect connection via other devices and connections. The terms "first", "second" and the like in the description (including the claims) are used to name components; they do not limit the number of components (neither an upper nor a lower bound) and do not limit the order of the components. Components, elements, or steps that use the same reference numerals or the same terminology in different embodiments may be understood with reference to one another.
As an example, natural language processing (NLP) models contain a large number of general matrix multiplication (GEMM) computations. To accelerate them, the model matrix may be sparsified after training, forcing the model weights whose absolute values are less than a threshold to 0. After sparsification by absolute weight value, the zeros in the model matrix are randomly distributed. The methods proposed in the embodiments below are suitable for accelerating such sparsified NLP models. Computer vision (CV) models also contain many convolution operations that are based on GEMM, and some typical convolution operations are themselves GEMM computations. The rectified linear unit (ReLU) is an activation function commonly used in CV models. After the ReLU computation, the real-time intermediate tensor that serves as one of the matrix inputs contains a large number of randomly distributed zeros, and the acceleration method proposed in the embodiments below also applies to this scenario. Equation 1 below describes a matrix multiplication. In equation 1, matrix A is an m×n matrix, matrix B is an n×k matrix, and matrix C is an m×k matrix, where m, n, and k are integers determined according to the actual design.
$$C = A \times B \qquad \text{(equation 1)}$$
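Written element by element, equation 1 is the usual sum of products. The index names u and v below are chosen only for illustration (they are not used in the patent text, and they avoid a clash with the row index i and column index j of the non-zero index table); p is the summation index over the shared dimension n.

```latex
c_{u,v} \;=\; \sum_{p=1}^{n} a_{u,p}\, b_{p,v},
\qquad u = 1,\dots,m, \quad v = 1,\dots,k .
```

Each term of the sum pairs one element from column p of matrix A with one element from row p of matrix B.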
FIG. 1 is a schematic circuit block diagram of a matrix operation device 100 according to an embodiment of the invention. Please refer to FIG. 1 and equation 1. The matrix operation device 100 may perform the calculation of equation 1, multiplying matrix A by matrix B to generate the product matrix C. The matrix operation device 100 shown in FIG. 1 includes a plurality of multiply-add operation units (multiplication accumulation cells, MAC) $MA_{1,1}$, $MA_{1,2}$, …, $MA_{1,k}$, $MA_{2,1}$, $MA_{2,2}$, …, $MA_{2,k}$, …, $MA_{m,1}$, $MA_{m,2}$, …, $MA_{m,k}$. As shown in FIG. 1, the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ may receive the elements $b_{p,1}$, $b_{p,2}$, …, $b_{p,k}$ of row p of matrix B and the elements $a_{1,p}$, $a_{2,p}$, …, $a_{m,p}$ of column p of matrix A, where p is an integer from 1 to n. The multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ may perform the matrix multiplication based on the elements $a_{1,p}$ to $a_{m,p}$ of matrix A and the elements $b_{p,1}$ to $b_{p,k}$ of matrix B. For example, when p is 1, the multiply-add operation unit $MA_{1,1}$ receives element $a_{1,1}$ of matrix A and element $b_{1,1}$ of matrix B to perform the multiply-add operation. When p is n, the multiply-add operation unit $MA_{1,1}$ receives element $a_{1,n}$ of matrix A and element $b_{n,1}$ of matrix B to perform the multiply-add operation.
By analogy, each row of matrix B and each column of matrix A may be provided in sequence to the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ to perform the multiply-add operation. After the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ have completed the multiply-add operations for all rows of matrix B and all columns of matrix A, the matrix operation device 100 has completed the matrix multiplication of matrix A and matrix B. The matrix operation device 100 can then output the product accumulated values stored in all the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$, namely the elements $c_{1,1}$, $c_{1,2}$, …, $c_{1,k}$, $c_{2,1}$, $c_{2,2}$, …, $c_{2,k}$, …, $c_{m,1}$, $c_{m,2}$, …, $c_{m,k}$ of matrix C.
Take the multiply-add operation unit $MA_{1,1}$ as an example. First, during a first operation period (p is 1), the multiply-add operation unit $MA_{1,1}$ may calculate the product of element $a_{1,1}$ and element $b_{1,1}$ and store it as the product accumulated value (element $c_{1,1}$). Then, in a second operation period (p is 2) following the first operation period, the multiply-add operation unit $MA_{1,1}$ calculates the product of element $a_{1,2}$ and element $b_{2,1}$. The multiply-add operation unit $MA_{1,1}$ may then add the product of $a_{1,2}$ and $b_{2,1}$ to the stored product accumulated value (i.e., the product of $a_{1,1}$ and $b_{1,1}$) and update the product accumulated value with this sum. At this point, the product accumulated value (element $c_{1,1}$) is $a_{1,1} b_{1,1} + a_{1,2} b_{2,1}$. Similarly, after all rows of matrix B and all columns of matrix A have been provided to the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ (after the end of the n-th operation period, i.e., p is n), the multiply-add operation unit $MA_{1,1}$ has calculated the accumulated value of the products of elements $a_{1,1}$ to $a_{1,n}$ with elements $b_{1,1}$ to $b_{n,1}$ respectively, i.e., $a_{1,1} b_{1,1} + a_{1,2} b_{2,1} + \dots + a_{1,n} b_{n,1}$.
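The operation periods described for FIG. 1 can be mimicked in software as follows. This is a minimal sketch, not the circuit itself: the function name `dense_mac_array` and the sample matrices are assumptions, indices are 0-based, and each pass of the outer loop stands for one operation period in which column p of matrix A and row p of matrix B are broadcast to the m × k multiply-add operation units.

```python
def dense_mac_array(A, B):
    """Simulate an m x k multiply-add array over n operation periods.

    A is m x n, B is n x k (lists of lists). Returns (C, periods).
    """
    m, n, k = len(A), len(B), len(B[0])
    # One product accumulated value per multiply-add unit, initially 0.
    acc = [[0] * k for _ in range(m)]
    for p in range(n):                        # operation period p (0-based)
        col_a = [A[r][p] for r in range(m)]   # column p of matrix A
        row_b = B[p]                          # row p of matrix B
        for r in range(m):
            for c in range(k):
                # Unit (r, c) multiplies its two inputs and accumulates.
                acc[r][c] += col_a[r] * row_b[c]
    return acc, n


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[7, 0, 8], [0, 9, 0]]     # 2 x 3
C, periods = dense_mac_array(A, B)
```

Note that the number of periods equals n regardless of how many elements of B are zero, which is the inefficiency the later embodiments address.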
FIG. 2 is a schematic circuit block diagram of the multiply-add operation unit $MA_{1,1}$ of FIG. 1 according to an embodiment of the invention. The other multiply-add operation units shown in FIG. 1 (e.g., $MA_{1,2}$ to $MA_{m,k}$) may be understood by analogy with the multiply-add operation unit $MA_{1,1}$ shown in FIG. 2 and are not described again here. Please refer to FIG. 1 and FIG. 2. The multiply-add operation unit $MA_{1,1}$ shown in FIG. 2 includes a multiplier 210 and an accumulator 220. A first input of the multiplier 210 is coupled to a storage unit (not shown in FIG. 1) via a data line to receive element $b_{p,1}$ of matrix B. A second input of the multiplier 210 is coupled to the storage unit (not shown in FIG. 1) via a data line to receive element $a_{1,p}$ of matrix A. An input of the accumulator 220 is coupled to an output of the multiplier 210 to receive the product of element $a_{1,p}$ and element $b_{p,1}$. The accumulator 220 may accumulate a plurality of product values to generate element $c_{1,1}$ of matrix C.
The accumulator 220 shown in FIG. 2 includes an adder 221 and a buffer 222. An adder first input of the adder 221 serves as the input of the accumulator 220 and is coupled to the output of the multiplier 210 to receive the product of element $a_{1,p}$ and element $b_{p,1}$. An adder second input of the adder 221 is coupled to an output of the buffer 222 to receive the old product accumulation result. The output of the adder 221 is coupled to the input of the buffer 222 to write the new product accumulation result into the buffer 222. The buffer 222 can therefore accumulate a plurality of product values to generate element $c_{1,1}$ of matrix C.
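The structure of FIG. 2 (a multiplier feeding an accumulator made of an adder and a buffer) maps naturally onto two small classes. The class and method names below are hypothetical and the sketch is only a behavioral model, with one call to `step` standing for one operation period.

```python
class Accumulator:
    """Adder plus buffer: the buffer holds the running sum of products."""

    def __init__(self):
        self.buffer = 0  # old product accumulation result

    def accumulate(self, product):
        # The adder sums the incoming product and the buffered value,
        # and the sum is written back into the buffer.
        self.buffer = self.buffer + product
        return self.buffer


class MultiplyAddUnit:
    """One multiplier whose output feeds one accumulator."""

    def __init__(self):
        self.accumulator = Accumulator()

    def step(self, a_value, b_value):
        product = a_value * b_value          # multiplier
        return self.accumulator.accumulate(product)


# Element c_{1,1} for a row (1, 2, 3) of A and a column (4, 5, 6) of B.
unit = MultiplyAddUnit()
for a_value, b_value in zip([1, 2, 3], [4, 5, 6]):
    c_11 = unit.step(a_value, b_value)
```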
FIG. 3 is a schematic circuit block diagram of a matrix operation device 300 according to another embodiment of the invention. The matrix operation device 300 shown in FIG. 3 includes a storage unit 310, a first selection circuit 320, a multiplier array 330, a second selection circuit 340, and an accumulator array 350. The storage unit 310 is adapted to store a first matrix (one of matrix A and matrix B) and a second matrix (the other of matrix A and matrix B). The first selection circuit 320 is coupled to the storage unit 310 to read matrix A and matrix B. The first selection circuit 320 may compress matrix A or matrix B into a non-zero index table T[s] so as to take full advantage of dense multiply and accumulate.
For a general matrix multiplication A×B, both matrix A and matrix B may be sparse matrices. The method selects one of the two (the matrix with the relatively higher proportion of zeros) as the sparse matrix to accelerate. The matrix operation device 300 may compare the sparsity of matrix A and matrix B to determine which of them is treated as the sparse matrix. For ease of illustration and without loss of generality, it is assumed here that matrix B is the sparse matrix and has s non-zero elements. The first selection circuit 320 may scan matrix B to generate the non-zero index table T[s], which records the position (row index i and column index j) of each non-zero element of matrix B. The matrix operation device 300 may perform the calculation of equation 2 to multiply matrix A by matrix B and generate the product matrix C, where t = 0, 1, … indexes the batches, the number of batches is int((s+k-1)/k), and int() is the round-down function. Benefits of the matrix operation device 300 include: 1. random zeros without a pattern are supported; 2. weight sparsity or activation sparsity is supported; 3. the acceleration is no longer limited by the least sparse row or column; 4. the computational speedup a satisfies (n×k)/(s+k-1) ≤ a ≤ (n×k)/s.
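The scan, the batch count, and the speedup bounds above can be checked numerically with a short sketch. Several things here are assumptions rather than statements from the patent: the helper names, the row-by-row scan order, the rule that ties in sparsity go to matrix B, and the cost model in which the dense array needs n operation periods while the sparse scheme needs one period per batch of at most k entries. The sketch also assumes s is at least 1.

```python
def zero_ratio(matrix):
    """Fraction of entries that are zero."""
    total = sum(len(row) for row in matrix)
    return sum(1 for row in matrix for x in row if x == 0) / total


def build_nonzero_index_table(matrix):
    """Record the 1-based (i, j) of every non-zero element, row by row."""
    return [(i, j)
            for i, row in enumerate(matrix, start=1)
            for j, value in enumerate(row, start=1)
            if value != 0]


def batch_count(s, k):
    """int((s + k - 1) / k): number of batches of at most k entries."""
    return (s + k - 1) // k


A = [[1, 2, 0, 4], [5, 6, 7, 8]]                    # m = 2, n = 4
B = [[0, 3, 0], [0, 0, 0], [2, 0, 0], [0, 0, 5]]    # n = 4, k = 3

# Treat the matrix with the higher proportion of zeros as the sparse one.
sparse_name = "B" if zero_ratio(B) >= zero_ratio(A) else "A"

table = build_nonzero_index_table(B)
n, k, s = len(B), len(B[0]), len(table)
batches = batch_count(s, k)

# Assumed cost model: n periods when dense, one period per batch when sparse.
speedup = n / batches
lower, upper = (n * k) / (s + k - 1), (n * k) / s
```

With these sample values the three non-zero elements fit into a single batch, so the speedup reaches the upper bound (n×k)/s.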
Fig. 4 is a flowchart of a matrix operation method according to another embodiment of the invention. Please refer to fig. 3 and fig. 4. In step S410, the storage unit 310 stores a first matrix (one of the matrix a and the matrix B) and a second matrix (the other of the matrix a and the matrix B). For convenience of explanation, it is assumed herein that the first matrix is matrix B and the second matrix is matrix a. In other embodiments, the first matrix may be matrix a and the second matrix may be matrix B.
In step S420 and step S430, the first selection circuit 320 may read matrix A and matrix B from the storage unit 310. The first selection circuit 320 is also coupled to the multiplier array 330. It is assumed here that the multiplier array 330 includes m × k multipliers. According to the row index i and the column index j of any entry of the non-zero index table T[s], the first selection circuit 320 may extract a corresponding element value from one of the first matrix and the second matrix and provide it to the multiplier first input of each multiplier in a corresponding column or a corresponding row of the multiplier array 330 (step S420). According to the row index i or the column index j of that entry, the first selection circuit 320 may further extract all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values, and provide each second element value to the multiplier second input of a corresponding multiplier in the corresponding column or the corresponding row of the multiplier array 330 (step S430).
In step S440, each multiplier of the multiplier array 330 may perform a product operation. The second selection circuit 340 is coupled to the multiplier array 330 and the accumulator array 350. It is assumed here that the accumulator array 350 includes m × k accumulators. In step S450, the second selection circuit 340 may select a selected column or a selected row from the accumulator array 350 according to the column index j or the row index i of the non-zero index table T[s], and may transmit the output of each multiplier in the corresponding column or the corresponding row of the multiplier array 330 to the input of a corresponding accumulator in the selected column or the selected row of the accumulator array 350. In step S460, each accumulator of the accumulator array 350 may perform an accumulation operation to generate matrix C.
For example, assume that the first matrix (e.g., matrix B) is the sparse matrix. In this case, the first selection circuit 320 may extract a corresponding element value from the first matrix as a first element value according to the row index i and the column index j of any entry of the non-zero index table T[s], and may provide the first element value to the multiplier first input of each of a plurality of multipliers in a corresponding column of the multiplier array 330 (step S420). The first selection circuit 320 may extract a whole column of element values from the corresponding column of the second matrix as the second element values according to the row index i of the non-zero index table T[s], and may provide each of the second element values to the multiplier second input of a corresponding one of the multipliers in the corresponding column of the multiplier array 330 (step S430). When the first matrix is the sparse matrix, the second selection circuit 340 may select a selected column from a plurality of columns of the accumulator array 350 according to the column index j of the non-zero index table T[s], and may transmit the output of each of the multipliers in the corresponding column of the multiplier array 330 to the input of a corresponding one of the plurality of accumulators in the selected column of the accumulator array 350 (step S450).
For another example, assume that the second matrix (e.g., matrix A) is the sparse matrix. In this case, the first selection circuit 320 may extract a corresponding element value from the second matrix as the first element value according to the row index i and the column index j of any entry of the non-zero index table T[s], and may provide the first element value to the multiplier first input of each of a plurality of multipliers in a corresponding row of the multiplier array 330 (step S420). The first selection circuit 320 may extract a whole row of element values from the corresponding row of the first matrix as the second element values according to the column index j of the non-zero index table T[s], and may provide each of the second element values to the multiplier second input of a corresponding one of the multipliers in the corresponding row of the multiplier array 330 (step S430). When the second matrix is the sparse matrix, the second selection circuit 340 may select a selected row from a plurality of rows of the accumulator array 350 according to the row index i of the non-zero index table T[s], and may transmit the output of each of the multipliers in the corresponding row of the multiplier array 330 to the input of a corresponding one of the plurality of accumulators in the selected row of the accumulator array 350 (step S450).
FIG. 5 is a block diagram of the first selection circuit 320, the multiplier array 330 and the accumulator array 350 shown in FIG. 3 according to an embodiment of the present invention. The embodiment shown in fig. 5 assumes that the first matrix is defined as a sparse matrix. The first selection circuit 320 shown in fig. 5 includes a scanning circuit 321 and a selector 322. The scan circuit 321 is coupled to the memory cell 310 to read the first matrix (e.g., matrix B). The scan circuit 321 may scan the matrix B to generate a non-zero index table T [ s ]]. For example, assume element B of matrix B 1,1 Is zero and element B of matrix B 1,2 If not zero, then scan circuit 321 may discard (not record) element b 1,1 The row index i and the column index j "1,1", and element b 1,2 The row index i and column index j "1,2" are recorded in the non-zero index table T [ s ]]。
The selector 322 is coupled to the scan circuit 321, the memory unit 310 and the multiplier array 330. Based on the row index i and the column index j recorded in the non-zero index table T s, the selector 322 may extract the corresponding element value from the first matrix (e.g., matrix B) to the multiplier first input of each multiplier in a corresponding column of the multiplier array 330. Based on the row index i recorded in the non-zero index table T s, the selector 322 may extract the whole column element values from the corresponding column of the second matrix (e.g., matrix a) to the multiplier second inputs of the plurality of multipliers in the corresponding column of the multiplier array 330. Based on the column index j recorded in the non-zero index table T s, the second selection circuit 340 may select a selected column from the plurality of columns of the accumulator array 350, and the second selection circuit 340 may transmit the output of the multipliers in the corresponding column of the multiplier array 330 to the input of the accumulator ACC in the selected column of the accumulator array 350. Each accumulator ACC of accumulator array 350 may perform an accumulation operation to generate elements of matrix C. Each accumulator ACC may be analogized with reference to the description of the accumulator 220 shown in fig. 2, and thus will not be described in detail herein.
For example, assume that the first matrix is matrix B and the second matrix is matrix a. Assume again that non-zero index table T [ s ]]Is divided into one or more batches according to the number k of columns of multiplier array 330 and accumulator array 350, wherein non-zero index table T [ s ]]At most, there are k batches of data (k elements "row index i and column index j"). For example, a non-zero index table T [ s ]]The y-th data of the current lot of (a) includes a row index i and a column index j of a certain element of the matrix B. Based on non-zero index table ts]The first selection circuit 320 may extract the element value B located in the ith row and jth column from the matrix B i,j As a first element value, and the first selection circuit 320 may select the first element value b i,j A multiplier first input provided to each multiplier in the y-th column of multiplier array 330. According to a non-zero index table T [ s ]]The first selection circuit 320 may also extract the whole column element value a from the ith column of matrix a for the row index i of the current batch of the y-th pen data 1,i ~a m,i As a second element value. The first selection circuit 320 can select the second element values a 1,i ~a m,i Is provided to a multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of multiplier array 330. For example, the first selection circuit 320 may compare the element value a 1,1 The element value a is provided to the multiplier second input of the first multiplier in column y of multiplier array 330 2,1 A multiplier second input provided to a second multiplier in a y-th column of multiplier array 330 and for inputting an element value a m,1 A multiplier second input provided to an mth multiplier in a y-th column of multiplier array 330.
Based on the column index j of the y-th entry of the current batch of the non-zero index table T[s], the second selection circuit 340 may select the j-th column from the plurality of columns of the accumulator array 350, and the second selection circuit 340 may transmit the output of each of the plurality of multipliers in the y-th column of the multiplier array 330 to the input of the corresponding one of the plurality of accumulators ACC in the j-th column of the accumulator array 350, respectively. For example, the second selection circuit 340 may transmit the output of the first multiplier in the y-th column of the multiplier array 330 to the input of the first accumulator ACC in the j-th column of the accumulator array 350, the output of the second multiplier in the y-th column of the multiplier array 330 to the input of the second accumulator ACC in the j-th column of the accumulator array 350, and the output of the m-th multiplier in the y-th column of the multiplier array 330 to the input of the m-th accumulator ACC in the j-th column of the accumulator array 350.
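The dataflow for a sparse matrix B can be sketched in software. The following Python sketch is only an illustration under stated assumptions: the function names scan_nonzero and multiply_sparse_b are hypothetical, indices are 0-based rather than 1-based, the table is scanned in row-major order, and the products of one batch are added to the accumulators one after another. How the hardware handles two entries of the same batch that target the same accumulator column is not specified in this passage.

```python
def scan_nonzero(mat):
    """Return 0-based (row, col) pairs of the non-zero elements (row-major order assumed)."""
    return [(i, j) for i, row in enumerate(mat) for j, v in enumerate(row) if v != 0]


def multiply_sparse_b(a, b):
    """Compute C = A x B, driven by the non-zero index table of B.

    a: m x n matrix (list of rows), b: n x k matrix (list of rows).
    """
    m = len(a)
    k = len(b[0])
    table = scan_nonzero(b)               # plays the role of T[s]
    acc = [[0] * k for _ in range(m)]     # accumulator array: m rows, k columns

    # Split the table into batches of at most k entries,
    # one entry per column of the multiplier array.
    for start in range(0, len(table), k):
        batch = table[start:start + k]
        for y, (i, j) in enumerate(batch):
            first = b[i][j]                          # first input, shared by column y
            second = [a[r][i] for r in range(m)]     # whole i-th column of A
            products = [first * s for s in second]   # outputs of multiplier column y
            for r in range(m):                       # routed to accumulator column j
                acc[r][j] += products[r]
    return acc


A = [[1, 2], [3, 4], [5, 6]]
B = [[0, 7, 0], [8, 0, 9]]
C = multiply_sparse_b(A, B)
```

Only the three non-zero elements of B trigger multiplications in this example; the zero elements are skipped entirely.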
FIG. 6 is a block diagram of the first selection circuit 320, the multiplier array 330, and the accumulator array 350 shown in FIG. 3 according to another embodiment of the present invention. The embodiment shown in FIG. 6 assumes that the second matrix is the sparse matrix. The first selection circuit 320 shown in FIG. 6 includes a scanning circuit 323 and a selector 324. The scanning circuit 323 is coupled to the memory unit 310 to read the second matrix (e.g., matrix A). The scanning circuit 323 may scan the matrix A to generate a non-zero index table T[s]. For example, assuming that the element a_{1,1} of the matrix A is zero and the element a_{2,1} of the matrix A is not zero, the scanning circuit 323 may discard (not record) the row index i and column index j "1,1" of the element a_{1,1}, and may record the row index i and column index j "2,1" of the element a_{2,1} in the non-zero index table T[s].
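The scan step can be sketched as follows. This is a minimal illustration rather than a description of the circuit: the function name build_nonzero_index_table, the use of a Python list for T[s], and the row-major scan order are assumptions introduced here, since the text does not state the order in which elements are visited. Indices are 1-based to match the notation above.

```python
def build_nonzero_index_table(matrix):
    """Record the 1-based (row index i, column index j) of every non-zero element."""
    table = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                table.append((i, j))   # recorded in the table
            # zero elements are discarded (not recorded)
    return table


# a_{1,1} is zero and a_{2,1} is non-zero, as in the example above.
A = [[0, 5],
     [3, 0]]
table = build_nonzero_index_table(A)
```

Here the pair (1, 1) is absent from the table and the pair (2, 1) is present.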
The selector 324 is coupled to the scanning circuit 323, the memory unit 310, and the multiplier array 330. Based on the row index i and the column index j recorded in the non-zero index table T[s], the selector 324 may extract the corresponding element value from the second matrix (e.g., matrix A) and provide it to the multiplier first input of each multiplier in a corresponding row of the multiplier array 330. Based on the column index j recorded in the non-zero index table T[s], the selector 324 may extract the whole row of element values from the corresponding row of the first matrix (e.g., matrix B) and provide them to the multiplier second inputs of the plurality of multipliers in the corresponding row of the multiplier array 330. Based on the row index i recorded in the non-zero index table T[s], the second selection circuit 340 may select a selected row from the plurality of rows of the accumulator array 350 and transmit the outputs of the multipliers in the corresponding row of the multiplier array 330 to the inputs of the accumulators ACC in the selected row of the accumulator array 350. Each accumulator ACC of the accumulator array 350 may perform an accumulation operation to generate an element of the matrix C. Each accumulator ACC may be understood by analogy with the description of the accumulator 220 shown in FIG. 2, and is therefore not described again here.
For example, assume that the first matrix is matrix B and the second matrix is matrix A. Assume further that the data of the non-zero index table T[s] is divided into one or more batches according to the number m of rows of the multiplier array 330 and the accumulator array 350, where each batch of the non-zero index table T[s] contains at most m entries (m pairs of "row index i and column index j"). For example, the y-th entry of the current batch of the non-zero index table T[s] includes the row index i and the column index j of a certain element of the matrix A. Based on the row index i and the column index j of the y-th entry of the current batch of the non-zero index table T[s], the first selection circuit 320 may extract the element value a_{i,j} located in the i-th row and j-th column of the matrix A as a first element value, and may provide the first element value a_{i,j} to the multiplier first input of each multiplier in the y-th row of the multiplier array 330. Based on the column index j of the y-th entry of the current batch of the non-zero index table T[s], the first selection circuit 320 may also extract the whole row of element values b_{j,1} to b_{j,k} from the j-th row of the matrix B as second element values. The first selection circuit 320 may provide each of the second element values b_{j,1} to b_{j,k} to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array 330. For example, when j = 1, the first selection circuit 320 may provide the element value b_{1,1} to the multiplier second input of the first multiplier in the y-th row of the multiplier array 330, provide the element value b_{1,2} to the multiplier second input of the second multiplier in the y-th row of the multiplier array 330, and provide the element value b_{1,k} to the multiplier second input of the k-th multiplier in the y-th row of the multiplier array 330.
Based on the row index i of the y-th entry of the current batch of the non-zero index table T[s], the second selection circuit 340 may select the i-th row from the plurality of rows of the accumulator array 350, and the second selection circuit 340 may transmit the output of each of the plurality of multipliers in the y-th row of the multiplier array 330 to the input of the corresponding one of the plurality of accumulators ACC in the i-th row of the accumulator array 350, respectively. For example, the second selection circuit 340 may transmit the output of the first multiplier in the y-th row of the multiplier array 330 to the input of the first accumulator ACC in the i-th row of the accumulator array 350, the output of the second multiplier in the y-th row of the multiplier array 330 to the input of the second accumulator ACC in the i-th row of the accumulator array 350, and the output of the k-th multiplier in the y-th row of the multiplier array 330 to the input of the k-th accumulator ACC in the i-th row of the accumulator array 350.
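The row-wise dataflow for a sparse matrix A can be sketched in the same spirit. This Python sketch is an illustration under stated assumptions: the function name multiply_sparse_a is hypothetical, indices are 0-based, the non-zero index table is built in row-major order, and products within a batch are added to the accumulators sequentially, which is a software simplification and not a claim about the circuit timing.

```python
def multiply_sparse_a(a, b):
    """Compute C = A x B, driven by the non-zero index table of A.

    a: m x n matrix (list of rows), b: n x k matrix (list of rows).
    """
    m = len(a)
    k = len(b[0])
    # Non-zero index table of A: 0-based (row index i, column index j) pairs.
    table = [(i, j) for i, row in enumerate(a) for j, v in enumerate(row) if v != 0]
    acc = [[0] * k for _ in range(m)]     # accumulator array: m rows, k columns

    # Split the table into batches of at most m entries,
    # one entry per row of the multiplier array.
    for start in range(0, len(table), m):
        batch = table[start:start + m]
        for y, (i, j) in enumerate(batch):
            first = a[i][j]                          # first input, shared by row y
            second = b[j]                            # whole j-th row of B
            products = [first * s for s in second]   # outputs of multiplier row y
            for c in range(k):                       # routed to accumulator row i
                acc[i][c] += products[c]
    return acc


A = [[0, 2], [3, 0], [0, 0]]
B = [[1, 2, 3], [4, 5, 6]]
C = multiply_sparse_a(A, B)
```

In this example only two entries are processed, and the third row of C stays zero because the third row of A contains no non-zero element.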
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (18)

1. A matrix operation device, characterized in that the matrix operation device comprises:
a storage unit adapted to store the first matrix and the second matrix;
a multiplier array, wherein each of a plurality of multipliers of the multiplier array is configured to perform a product operation;
a first selection circuit coupled to the storage unit and the multiplier array, wherein the first selection circuit extracts a corresponding element value from one of the first matrix and the second matrix according to a row index and a column index of any one of the data of the non-zero index table to a multiplier first input of each of a plurality of multipliers in a corresponding column or a corresponding row of the multiplier array, the first selection circuit extracts all element values from the corresponding column or a corresponding row of the other of the first matrix and the second matrix according to the row index or the column index as a plurality of second element values, and the first selection circuit provides each of the plurality of second element values to a multiplier second input of one of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array, respectively;
an accumulator array, wherein each of a plurality of accumulators of the accumulator array is configured to perform an accumulation operation; and
a second selection circuit coupled to the multiplier array and the accumulator array, wherein the second selection circuit selects a selected column or a selected row from the accumulator array according to the column index or the row index, and the second selection circuit transmits an output of each multiplier in the corresponding column or the corresponding row of the multiplier array to an input of a corresponding accumulator of a plurality of accumulators in the selected column or the selected row of the accumulator array, respectively.
2. The matrix operation device of claim 1 wherein,
when the first matrix is a sparse matrix, the first selection circuit extracts a corresponding element value from the first matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding column of the multiplier array, the first selection circuit extracts an entire column element value from the corresponding column of the second matrix as the plurality of second element values according to the row index of the non-zero index table, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of one of a plurality of multipliers in the corresponding column of the multiplier array, respectively; and
when the first matrix is the sparse matrix, the second selection circuit selects a selected column from a plurality of columns of the accumulator array according to the column index of the non-zero index table, and the second selection circuit transmits an output of each of the plurality of multipliers in the corresponding column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected column of the accumulator array, respectively.
3. The matrix operation device of claim 1 wherein,
when the second matrix is a sparse matrix, the first selection circuit extracts a corresponding element value from the second matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding row of the multiplier array, the first selection circuit extracts an entire row of element values from the corresponding row of the first matrix as the plurality of second element values according to the column index of the non-zero index table, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of one of a plurality of multipliers in the corresponding row of the multiplier array, respectively; and
when the second matrix is the sparse matrix, the second selection circuit selects a selected row from a plurality of rows of the accumulator array according to the row index of the non-zero index table, and the second selection circuit transmits an output of each of the plurality of multipliers in the corresponding row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected row of the accumulator array, respectively.
4. The matrix operation device according to claim 1 wherein said first selection circuit comprises:
a scanning circuit coupled to the storage unit to read the first matrix or the second matrix, wherein the scanning circuit scans the first matrix to generate the non-zero index table when the first matrix is a sparse matrix or scans the second matrix to generate the non-zero index table when the second matrix is the sparse matrix; and
a selector coupled to the scanning circuit, the storage unit, and the multiplier array, wherein the selector is configured to:
when the first matrix is taken as the sparse matrix, the selector extracts the corresponding element values from the first matrix to the multiplier first input of each multiplier in the corresponding column of the multiplier array according to the row index and the column index of the non-zero index table, and extracts the whole column element values from the corresponding column of the second matrix to the multiplier second inputs of the plurality of multipliers in the corresponding column of the multiplier array according to the row index; or alternatively
when the second matrix is taken as the sparse matrix, the selector extracts the corresponding element value from the second matrix to the multiplier first input of each multiplier in the corresponding row of the multiplier array according to the row index and the column index of the non-zero index table, and extracts the whole row of element values from the corresponding row of the first matrix to the multiplier second inputs of the multipliers in the corresponding row of the multiplier array according to the column index.
5. The matrix operation device of claim 1 wherein any one of the accumulators of the accumulator array comprises:
an adder having an adder first input serving as an input of said any one of the accumulators; and
a buffer having an input coupled to the output of the adder, wherein the output of the buffer is coupled to the adder second input of the adder.
6. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the first selection circuit extracts an element value located in an i-th row and a j-th column from the first matrix as a first element value, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in a y-th column of the multiplier array, the first selection circuit extracts an entire column element value from the i-th column of the second matrix as the plurality of second element values according to the row index i, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of the multiplier array, respectively.
7. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the second selection circuit selects a j-th column from a plurality of columns of the accumulator array according to the column index j, and the second selection circuit transmits an output of each of the plurality of multipliers in the y-th column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the j-th column of the accumulator array, respectively.
8. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the first selection circuit extracts an element value located in an i-th row and a j-th column from the second matrix as a first element value, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in a y-th row of the multiplier array, the first selection circuit extracts an entire row element value from the j-th row of the first matrix as the plurality of second element values according to the column index j, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array, respectively.
9. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the second selection circuit selects an i-th row from a plurality of rows of the accumulator array according to the row index i, and the second selection circuit transmits an output of each of the plurality of multipliers in the y-th row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the i-th row of the accumulator array, respectively.
10. A matrix operation method, characterized in that the matrix operation method comprises:
storing the first matrix and the second matrix by a storage unit of the matrix operation device;
extracting a corresponding element value from one of the first matrix and the second matrix according to a row index and a column index of any one data of a non-zero index table to a multiplier first input end of each of a plurality of multipliers in a corresponding column or a corresponding row of a multiplier array of the matrix operation device;
extracting all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values according to the row index or the column index, and providing each of the plurality of second element values to a multiplier second input of a corresponding multiplier of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array, respectively;
performing a product operation by each of a plurality of multipliers of the multiplier array;
selecting a selected column or a selected row from an accumulator array of the matrix operation device according to the column index or the row index, and transmitting an output of each multiplier in the corresponding column or the corresponding row of the multiplier array to an input of a corresponding accumulator of a plurality of accumulators in the selected column or the selected row, respectively; and
an accumulation operation is performed by each of a plurality of accumulators of the accumulator array.
11. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:
extracting a corresponding element value from the first matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table when the first matrix is a sparse matrix, and providing the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding column of the multiplier array;
extracting an entire column of element values from the corresponding column of the second matrix as the plurality of second element values according to the row index of the non-zero index table when the first matrix is the sparse matrix, and providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the corresponding column of the multiplier array, respectively; and
when the first matrix is the sparse matrix, a selected column is selected from a plurality of columns of the accumulator array according to the column index of the non-zero index table, and an output of each of the plurality of multipliers in the corresponding column of the multiplier array is respectively transmitted to an input of a corresponding one of a plurality of accumulators in the selected column of the accumulator array.
12. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:
extracting corresponding element values from the second matrix as first element values according to the row index and the column index of any one of the data of the non-zero index table when the second matrix is taken as a sparse matrix, and providing the first element values to the multiplier first input of each of a plurality of multipliers in the corresponding row of the multiplier array;
extracting an entire row of element values from the corresponding row of the first matrix as the plurality of second element values according to the column index of the non-zero index table when the second matrix is taken as the sparse matrix, and providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the corresponding row of the multiplier array, respectively; and
when the second matrix is the sparse matrix, a selected row is selected from a plurality of rows of the accumulator array according to the row index of the non-zero index table, and an output of each of the plurality of multipliers in the corresponding row of the multiplier array is respectively transmitted to an input of a corresponding one of a plurality of accumulators in the selected row of the accumulator array.
13. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:
scanning the first matrix to generate the non-zero index table when the first matrix is taken as a sparse matrix; and
when the second matrix is the sparse matrix, the second matrix is scanned to generate the non-zero index table.
14. The matrix operation method according to claim 10, wherein the y-th data of the current lot of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:
extracting element values located in an ith row and a jth column from the first matrix as first element values;
providing the first element value to the multiplier first input of each of a plurality of multipliers in a y-th column of the multiplier array;
extracting whole column element values from an ith column of the second matrix as the plurality of second element values according to the row index i; and
each of the plurality of second element values is provided to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of the multiplier array, respectively.
15. The matrix operation method according to claim 10, wherein the y-th data of the current lot of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:
selecting a j-th column from a plurality of columns of the accumulator array according to the column index j; and
the output of each of the plurality of multipliers in the y-th column of the multiplier array is respectively transmitted to an input of a corresponding one of a plurality of accumulators in the j-th column of the accumulator array.
16. The matrix operation method according to claim 10, wherein the y-th data of the current lot of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:
extracting element values located in an ith row and a jth column from the second matrix as first element values;
providing the first element value to the multiplier first input of each of a plurality of multipliers in a y-th row of the multiplier array;
extracting an entire row of element values from a j-th row of the first matrix as the plurality of second element values according to the column index j; and
each of the plurality of second element values is provided to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array, respectively.
17. The matrix operation method according to claim 10, wherein the y-th data of the current lot of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:
selecting an ith row from a plurality of rows of the accumulator array according to the row index i; and
the output of each of the plurality of multipliers in the y-th row of the multiplier array is respectively transmitted to an input of a corresponding one of a plurality of accumulators in the i-th row of the accumulator array.
18. The matrix operation method according to claim 10, further comprising:
and comparing the sparsity of the first matrix and the second matrix to determine the sparse matrix.
CN202111028539.4A 2021-09-02 2021-09-02 Matrix operation device and matrix operation method Active CN114003196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111028539.4A CN114003196B (en) 2021-09-02 2021-09-02 Matrix operation device and matrix operation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111028539.4A CN114003196B (en) 2021-09-02 2021-09-02 Matrix operation device and matrix operation method

Publications (2)

Publication Number Publication Date
CN114003196A CN114003196A (en) 2022-02-01
CN114003196B true CN114003196B (en) 2024-04-09

Family

ID=79921217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111028539.4A Active CN114003196B (en) 2021-09-02 2021-09-02 Matrix operation device and matrix operation method

Country Status (1)

Country Link
CN (1) CN114003196B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 Hardware accelerator and method for implementing an RNN neural network based on FPGA
CN110325988A (en) * 2017-01-22 2019-10-11 Gsi 科技公司 Sparse matrix multiplication in associated memory devices
CN112734024A (en) * 2020-04-17 2021-04-30 神亚科技股份有限公司 Processing apparatus for performing convolutional neural network operations and method of operation thereof
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6102645B2 (en) * 2013-09-11 2017-03-29 富士通株式会社 Product-sum operation circuit and product-sum operation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
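Wu Shuquan, Wang Qian, Xie Yunxiang. Design and implementation of a matrix multiplier suitable for solving harmonic elimination models. Journal of South China University of Technology (Natural Science Edition). 2003, (No. 08), full text. *
Bao Li; Chen Qingli. Improvement of sparse matrix multiplication based on triples. Journal of Leshan Normal University. 2015, (No. 08), full text. *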

Also Published As

Publication number Publication date
CN114003196A (en) 2022-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB02 Change of applicant information

Country or region after: China

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China
