CN114003196B

CN114003196B - Matrix operation device and matrix operation method

Info

Publication number: CN114003196B
Application number: CN202111028539.4A
Authority: CN
Inventors: Name withheld at request
Original assignee: Shanghai Biren Intelligent Technology Co Ltd
Current assignee: Shanghai Biren Intelligent Technology Co Ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2024-04-09
Anticipated expiration: 2041-09-02
Also published as: CN114003196A

Abstract

The invention provides a matrix operation device and a matrix operation method. The matrix operation device comprises a storage unit, a first selection circuit, a multiplier array, a second selection circuit and an accumulator array. When the first matrix is taken as a sparse matrix, the first selection circuit extracts a first element value from the first matrix according to a row index i and a column index j of any one data of the non-zero index table, and provides the first element value to a first input end of each multiplier in a corresponding column of the multiplier array. The first selection circuit extracts a plurality of second element values from corresponding columns of the second matrix according to the row index i and supplies these second element values to the second inputs of the multipliers in corresponding columns of the multiplier array. The second selection circuit selects a selected column from a plurality of columns of the accumulator array according to the column index j and transmits the output of the multipliers in the corresponding column of the multiplier array to the inputs of the accumulators in the selected column of the accumulator array.

Description

Matrix operation device and matrix operation method

Technical Field

The present invention relates to computing technology, and more particularly, to a matrix operation device and a matrix operation method.

Background

In artificial intelligence (AI) and neural networks, a large number of matrix multiplications are performed. General matrix multiplication (GEMM) is a core operation of deep learning. For example, natural language processing (NLP) models contain a large number of GEMM computations. Computer vision (CV) models also contain many convolution operations that are based on GEMM, and some typical convolution operations are themselves GEMM computations. To accelerate GEMM computation, the model matrix may be sparsified after training, forcing those model weights whose absolute values are less than a threshold to 0. After this general sparsification by absolute weight value, the zeros in the model matrix are randomly distributed and have no fixed structure. In addition, the rectified linear unit (ReLU) is an activation function commonly used in CV models. The ReLU computation converts the negative values of a matrix to 0. As a result, a real-time intermediate tensor that serves as one of the matrix inputs contains a large number of randomly distributed zeros.
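The two sources of zeros described above can be illustrated with a minimal Python sketch. The function names, the threshold value and the sample values are assumptions made for illustration only; they do not come from the specification.

```python
def prune_by_magnitude(weights, threshold):
    """Force every weight whose absolute value is below `threshold` to 0."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def relu(tensor):
    """Convert every negative value of the tensor to 0."""
    return [[max(0.0, x) for x in row] for row in tensor]


def zero_ratio(matrix):
    """Fraction of the elements of `matrix` that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


# Hypothetical trained weights and intermediate activations.
weights = [[0.8, -0.05, 0.02], [-0.6, 0.01, 0.9]]
pruned = prune_by_magnitude(weights, threshold=0.1)
activations = relu([[1.5, -2.0, 0.3], [-0.7, -0.1, 2.2]])

print(pruned, zero_ratio(pruned))
print(activations, zero_ratio(activations))
```

In both cases the positions of the zeros depend on the data, which is why no fixed zero pattern can be assumed.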

As a result, conventional artificial intelligence and neural network operations perform matrix multiplication on a large number of sparse matrices. In particular, accelerating matrix multiplication where the proportion of zeros is between 5% and 95% has a great influence on the overall performance of artificial intelligence and neural networks. How to accelerate matrix multiplication on sparse matrices more generally and effectively is therefore one of the important technical subjects in this field.

Disclosure of Invention

The invention provides a matrix operation device and a matrix operation method, which are used for accelerating matrix multiplication operation of a sparse matrix (sparse matrix).

In an embodiment according to the present invention, the matrix operation device includes a storage unit, a first selection circuit, a multiplier array, a second selection circuit, and an accumulator array. The storage unit is adapted to store the first matrix and the second matrix. The first selection circuit is coupled to the storage unit and the multiplier array. According to a row index and a column index of any one piece of data of the non-zero index table, the first selection circuit extracts a corresponding element value from one of the first matrix and the second matrix and provides it to a multiplier first input of each of a plurality of multipliers in a corresponding column or a corresponding row of the multiplier array. According to the row index or the column index, the first selection circuit extracts all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values. The first selection circuit supplies each of these second element values to a multiplier second input of a corresponding one of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array. Each of the plurality of multipliers of the multiplier array is configured to perform a product operation. The second selection circuit is coupled to the multiplier array and the accumulator array. The second selection circuit selects a selected column or a selected row from the accumulator array according to the column index or the row index. The second selection circuit transmits the output of each multiplier in the corresponding column or the corresponding row of the multiplier array to the input of a corresponding one of the accumulators in the selected column or the selected row of the accumulator array. Each accumulator of the accumulator array is configured to perform an accumulation operation.

In an embodiment according to the present invention, the matrix operation method includes: storing the first matrix and the second matrix by a storage unit; extracting a corresponding element value from one of the first matrix and the second matrix according to a row index and a column index of any one data of the non-zero index table to a multiplier first input end of each of a plurality of multipliers in a corresponding column or a corresponding row of the multiplier array; extracting all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values according to the row index or the column index; providing each of the second element values to a multiplier second input of a corresponding one of the multipliers in the corresponding column or row of the multiplier array; performing a product operation by each of a plurality of multipliers of the multiplier array; selecting a selected column or a selected row from an accumulator array according to the column index or the row index; transmitting the output of each multiplier in said corresponding column or said corresponding row of the array of multipliers to the input of a corresponding one of the accumulators in said selected column or said selected row of the array of accumulators; and performing an accumulation operation by each of a plurality of accumulators of the accumulator array.

Based on the above, the non-zero index table according to the embodiments of the present invention may provide the row index and the column index of the first matrix (i.e., the row position and the column position of the non-zero element in the first matrix). Based on the selection operation (switching operation) of the selection circuit, zero elements of the first matrix can be effectively excluded without occupying the computing resources of the multiplier array. Thus, the matrix operation device can accelerate matrix multiplication operation on a sparse matrix (for example, a first matrix).

Drawings

Fig. 1 is a schematic circuit block diagram of a matrix computing device according to an embodiment of the invention.

FIG. 2 is a schematic circuit block diagram of the multiply-add unit of FIG. 1 according to an embodiment of the present invention.

Fig. 3 is a circuit block diagram of a matrix operation device according to another embodiment of the present invention.

Fig. 4 is a flowchart of a matrix operation method according to another embodiment of the invention.

FIG. 5 is a block diagram of the first selection circuit, the multiplier array and the accumulator array of FIG. 3 according to one embodiment of the present invention.

FIG. 6 is a block diagram of the first selection circuit, the multiplier array and the accumulator array of FIG. 3 according to another embodiment of the present invention.

Description of the reference numerals

100, 300: matrix operation device

210: multiplier

220, ACC: accumulator

221: adder

222: buffer

310: storage unit

320: first selection circuit

321, 323: scan circuit

322, 324: selector

330: multiplier array

340: second selection circuit

350: accumulator array

A, B: matrix

$a_{1,1}$, $a_{1,2}$, $a_{1,n}$, $a_{1,p}$, $a_{2,1}$, $a_{2,2}$, $a_{2,n}$, $a_{2,p}$, $a_{m,1}$, $a_{m,2}$, $a_{m,n}$, $a_{m,p}$, $b_{1,1}$, $b_{1,2}$, $b_{1,k}$, $b_{2,1}$, $b_{2,2}$, $b_{2,k}$, $b_{n,1}$, $b_{n,2}$, $b_{n,k}$, $b_{p,1}$, $b_{p,2}$, $b_{p,k}$, $c_{1,1}$, $c_{1,2}$, $c_{1,k}$, $c_{2,1}$, $c_{2,2}$, $c_{2,k}$, $c_{m,1}$, $c_{m,2}$, $c_{m,k}$: element

i: row index

j: column index

$MA_{1,1}$, $MA_{1,2}$, $MA_{1,k}$, $MA_{2,1}$, $MA_{2,2}$, $MA_{2,k}$, $MA_{m,1}$, $MA_{m,2}$, $MA_{m,k}$: multiply-add operation unit

S410, S420, S430, S440, S450: step

T[s]: non-zero index table

Detailed Description

Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.

The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection or an indirect connection via other devices and connections. The terms "first", "second" and the like in the description (including the claims) are used to name components; they do not impose an upper or lower limit on the number of components, nor do they limit the order of the components. Components, elements or steps that use the same reference numerals or the same terminology in different embodiments may refer to one another's descriptions.

As an example, natural language processing (NLP) models contain a large number of general matrix multiplication (GEMM) computations. For acceleration, the model matrix may be sparsified after training, forcing those model weights whose absolute values are less than a threshold to 0. After sparsification by absolute weight value, the zeros in the model matrix are randomly distributed. The methods proposed in the embodiments described below are suitable for accelerating such sparsified NLP models. Computer vision (CV) models also contain many convolution operations that are based on GEMM, and some typical convolution operations are themselves GEMM computations. The rectified linear unit (ReLU) is an activation function commonly used in CV models. After the ReLU computation, a real-time intermediate tensor that serves as one of the matrix inputs contains a large number of randomly distributed zeros, and this scenario is also suited to the acceleration method proposed in the following embodiments. Equation 1 below illustrates a matrix multiplication. In Equation 1, matrix A is an m×n matrix, matrix B is an n×k matrix, and matrix C is an m×k matrix, where m, n and k are integers determined according to the actual design.

$C = A \times B$ (Equation 1)

Fig. 1 is a schematic circuit block diagram of a matrix operation device 100 according to an embodiment of the invention. Please refer to Fig. 1 and Equation 1. The matrix operation device 100 may perform the calculation of Equation 1 to multiply the matrix A by the matrix B and generate the product matrix C. The matrix operation device 100 shown in Fig. 1 includes a plurality of multiply-add operation units (multiply-accumulate cells, MAC) $MA_{1,1}$, $MA_{1,2}$, …, $MA_{1,k}$, $MA_{2,1}$, $MA_{2,2}$, …, $MA_{2,k}$, …, $MA_{m,1}$, $MA_{m,2}$, …, $MA_{m,k}$. The multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ can receive the elements $b_{p,1}$, $b_{p,2}$, …, $b_{p,k}$ of row p of the matrix B, and can receive the elements $a_{1,p}$, $a_{2,p}$, …, $a_{m,p}$ of column p of the matrix A, as shown in Fig. 1, where p is an integer from 1 to n. The multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ can perform the matrix multiplication based on the elements $a_{1,p}$ to $a_{m,p}$ of the matrix A and the elements $b_{p,1}$ to $b_{p,k}$ of the matrix B. For example, when p is 1, the multiply-add operation unit $MA_{1,1}$ receives the element $a_{1,1}$ of the matrix A and the element $b_{1,1}$ of the matrix B to perform the multiply-add operation. When p is n, the multiply-add operation unit $MA_{1,1}$ receives the element $a_{1,n}$ of the matrix A and the element $b_{n,1}$ of the matrix B to perform the multiply-add operation.

By analogy, each row of the matrix B and each column of the matrix A may be sequentially provided to the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ to perform the multiply-add operation. After the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ have completed the multiply-add operations for all rows of the matrix B and all columns of the matrix A, the matrix operation device 100 has completed the matrix multiplication of the matrix A and the matrix B. The matrix operation device 100 can then output the accumulated product values stored in all the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ as the matrix C, namely the elements $c_{1,1}$, $c_{1,2}$, …, $c_{1,k}$, $c_{2,1}$, $c_{2,2}$, …, $c_{2,k}$, …, $c_{m,1}$, $c_{m,2}$, …, $c_{m,k}$ of the matrix C.

Take the multiply-add operation unit $MA_{1,1}$ as an example. First, during a first operation period (p is 1), the multiply-add operation unit $MA_{1,1}$ can calculate the product of the element $a_{1,1}$ and the element $b_{1,1}$, and store the product as the accumulated product value (element $c_{1,1}$). Then, in a second operation period (p is 2) following the first operation period, the multiply-add operation unit $MA_{1,1}$ calculates the product of the element $a_{1,2}$ and the element $b_{2,1}$. The multiply-add operation unit $MA_{1,1}$ may then add the product of $a_{1,2}$ and $b_{2,1}$ to the stored accumulated product value (i.e., the product of $a_{1,1}$ and $b_{1,1}$) and update the accumulated product value with the sum. At this time, the accumulated product value (element $c_{1,1}$) is $a_{1,1} b_{1,1} + a_{1,2} b_{2,1}$. Similarly, after all rows of the matrix B and all columns of the matrix A have been provided to the multiply-add operation units $MA_{1,1}$ to $MA_{m,k}$ (after the end of the n-th operation period, i.e., p is n), the multiply-add operation unit $MA_{1,1}$ has calculated the accumulated product value of the elements $a_{1,1}$ to $a_{1,n}$ with the elements $b_{1,1}$ to $b_{n,1}$ respectively, i.e., $a_{1,1} b_{1,1} + a_{1,2} b_{2,1} + \dots + a_{1,n} b_{n,1}$.
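The period-by-period accumulation of Fig. 1 can be sketched in software as follows. This is only a behavioral model under the assumption that each operation period p feeds column p of A and row p of B to all m×k units at once; the function name and sample matrices are hypothetical.

```python
def dense_mac_array(a, b):
    """Model the m x k multiply-add units: one operation period per value of p."""
    m, n, k = len(a), len(b), len(b[0])
    acc = [[0] * k for _ in range(m)]  # accumulated product value held by each unit
    for p in range(n):                 # operation period p (0-based here)
        for r in range(m):             # unit row r receives a[r][p]
            for c in range(k):         # unit column c receives b[p][c]
                acc[r][c] += a[r][p] * b[p][c]
    return acc


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = dense_mac_array(A, B)
print(C)
```

Every element of A and B is multiplied regardless of its value, so zero elements still occupy an operation period in this dense scheme.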

Fig. 2 is a schematic circuit block diagram of the multiply-add operation unit $MA_{1,1}$ shown in Fig. 1 according to an embodiment of the invention. The other multiply-add operation units shown in Fig. 1 (e.g., the multiply-add operation units $MA_{1,2}$ to $MA_{m,k}$) can be understood by analogy with the description of the multiply-add operation unit $MA_{1,1}$ shown in Fig. 2, and are not described again here. Please refer to Fig. 1 and Fig. 2. The multiply-add operation unit $MA_{1,1}$ shown in Fig. 2 includes a multiplier 210 and an accumulator 220. The first input of the multiplier 210 is coupled to a storage unit (not shown in Fig. 1) via a data line to receive the element $b_{p,1}$ of the matrix B. The second input of the multiplier 210 is coupled to the storage unit (not shown in Fig. 1) via a data line to receive the element $a_{1,p}$ of the matrix A. An input of the accumulator 220 is coupled to an output of the multiplier 210 to receive the product of the element $a_{1,p}$ and the element $b_{p,1}$. The accumulator 220 may accumulate a plurality of product values to generate the element $c_{1,1}$ of the matrix C.

The accumulator 220 shown in Fig. 2 includes an adder 221 and a buffer 222. The adder first input of the adder 221 serves as the input of the accumulator 220 and is coupled to the output of the multiplier 210 to receive the product of the element $a_{1,p}$ and the element $b_{p,1}$. The adder second input of the adder 221 is coupled to the output of the buffer 222 to receive the old product accumulation result. The output of the adder 221 is coupled to the input of the buffer 222 to update the buffer 222 with the new product accumulation result. Therefore, the buffer 222 can hold the accumulation of a plurality of product values, which yields the element $c_{1,1}$ of the matrix C.
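The feedback loop formed by the adder 221 and the buffer 222 can be modeled with a small class. This is an illustrative sketch, not a description of the actual circuit; the class name, method names and input values are assumptions.

```python
class Accumulator:
    """Behavioral model of accumulator 220: an adder feeding a buffer."""

    def __init__(self):
        self.buffer = 0  # models buffer 222, holding the old accumulation result

    def accumulate(self, product):
        # Models adder 221: new product plus the old result read from the buffer,
        # with the sum written back to the buffer.
        self.buffer = self.buffer + product
        return self.buffer


def multiply_add(a_values, b_values):
    """Model one multiply-add unit: multiplier 210 followed by accumulator 220."""
    acc = Accumulator()
    for a, b in zip(a_values, b_values):
        acc.accumulate(a * b)  # the multiplier output goes to the accumulator input
    return acc.buffer


# Row 1 of a hypothetical A and column 1 of a hypothetical B.
c11 = multiply_add([1, 2, 3], [4, 5, 6])
print(c11)
```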

Fig. 3 is a circuit block diagram of a matrix operation device 300 according to another embodiment of the invention. The matrix operation device 300 shown in Fig. 3 includes a storage unit 310, a first selection circuit 320, a multiplier array 330, a second selection circuit 340, and an accumulator array 350. The storage unit 310 is adapted to store a first matrix (one of the matrix A and the matrix B) and a second matrix (the other of the matrix A and the matrix B). The first selection circuit 320 is coupled to the storage unit 310 to read the matrix A and the matrix B. The first selection circuit 320 may compress the matrix A or the matrix B into a non-zero index table T[s] to take full advantage of dense multiply and accumulate operations.

For a general matrix multiplication A×B, both the matrix A and the matrix B may be sparse matrices. The method selects one of the two (the matrix with the relatively higher proportion of zeros) as the sparse matrix to accelerate. The matrix operation device 300 may compare the sparsity of the matrix A and the matrix B to determine which of them is treated as the sparse matrix. For ease of illustration and without loss of generality, it is assumed here that the matrix B is the sparse matrix and that the matrix B has s non-zero elements. The first selection circuit 320 may scan the matrix B to generate the non-zero index table T[s]. The position of each non-zero element of the matrix B (row index i and column index j) is recorded in the non-zero index table T[s]. The matrix operation device 300 may perform the calculation of equation 2 to multiply the matrix A by the matrix B to generate the product matrix C. Here t = 0, 1, … indexes the batches of the non-zero index table T[s], the number of batches is int((s+k-1)/k), and int() is the round-down function. Benefits of the matrix operation device 300 include: 1. random zeros without a fixed pattern are supported; 2. weight sparsity or activation sparsity is supported; 3. performance is no longer limited by the least sparse row or column; 4. the computation speedup a satisfies (n×k)/(s+k-1) <= a <= (n×k)/s.
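The speedup bounds stated above can be checked numerically. The sketch below assumes that the dense scheme of Fig. 1 needs n operation periods and that the sparse scheme needs one operation period per batch of at most k non-zero entries; this cycle model, the function names and the sample sizes are assumptions made for illustration.

```python
def batch_count(s, k):
    """Number of batches: int((s + k - 1) / k), i.e. s / k rounded up."""
    return (s + k - 1) // k


def speedup(n, k, s):
    """Assumed speedup: n dense operation periods versus one period per batch."""
    return n / batch_count(s, k)


n, k, s = 64, 16, 100  # hypothetical sizes: B is 64 x 16 with 100 non-zero elements
batches = batch_count(s, k)
a = speedup(n, k, s)
lower = (n * k) / (s + k - 1)
upper = (n * k) / s
print(batches, lower, a, upper)
```

Under this model the lower bound is reached when s + k - 1 is a multiple of k (a nearly empty last batch) and the upper bound when s is a multiple of k (every batch full).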

Fig. 4 is a flowchart of a matrix operation method according to another embodiment of the invention. Please refer to fig. 3 and fig. 4. In step S410, the storage unit 310 stores a first matrix (one of the matrix a and the matrix B) and a second matrix (the other of the matrix a and the matrix B). For convenience of explanation, it is assumed herein that the first matrix is matrix B and the second matrix is matrix a. In other embodiments, the first matrix may be matrix a and the second matrix may be matrix B.

In step S420 and step S430, the first selection circuit 320 may read the matrix A and the matrix B from the storage unit 310. The first selection circuit 320 is also coupled to the multiplier array 330. It is assumed here that the multiplier array 330 includes m×k multipliers. According to the row index i and the column index j of any one piece of data of the non-zero index table T[s], the first selection circuit 320 may extract a corresponding element value from one of the first matrix and the second matrix and provide it to the multiplier first input of each multiplier in the corresponding column or the corresponding row of the multiplier array 330 (step S420). According to the row index i or the column index j of the non-zero index table T[s], the first selection circuit 320 may further extract all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values, and provide each second element value to the multiplier second input of a corresponding multiplier in the corresponding column or the corresponding row of the multiplier array 330 (step S430).

In step S440, each multiplier of the multiplier array 330 may perform a product operation. The second selection circuit 340 is coupled to the multiplier array 330 and the accumulator array 350. It is assumed here that the accumulator array 350 includes m×k accumulators. In step S450, the second selection circuit 340 may select a selected column or a selected row from the accumulator array 350 according to the column index j or the row index i of the non-zero index table T[s], and the second selection circuit 340 may transmit the output of each multiplier in the corresponding column or the corresponding row of the multiplier array 330 to the input of a corresponding accumulator in the selected column or the selected row of the accumulator array 350. In step S460, each accumulator of the accumulator array 350 may perform an accumulation operation to generate the matrix C.

For example, assume that the first matrix (e.g., the matrix B) is the sparse matrix. In this case, the first selection circuit 320 may extract a corresponding element value from the first matrix as a first element value according to the row index i and the column index j of any one piece of data of the non-zero index table T[s] (step S420). In step S420, the first selection circuit 320 may provide the first element value to the multiplier first input of each of a plurality of multipliers in a corresponding column of the multiplier array 330. According to the row index i of the non-zero index table T[s], the first selection circuit 320 may extract the element values of a whole column from the corresponding column of the second matrix as the second element values. The first selection circuit 320 may provide each of the second element values to the multiplier second input of a corresponding one of the multipliers in the corresponding column of the multiplier array 330 (step S430). When the first matrix is the sparse matrix, the second selection circuit 340 may select a selected column from the plurality of columns of the accumulator array 350 according to the column index j of the non-zero index table T[s] (step S450). In step S450, the second selection circuit 340 may transmit the output of each of the multipliers in the corresponding column of the multiplier array 330 to the input of a corresponding one of the plurality of accumulators in the selected column of the accumulator array 350.

As another example, assume that the second matrix (e.g., the matrix A) is the sparse matrix. In this case, the first selection circuit 320 may extract a corresponding element value from the second matrix as the first element value according to the row index i and the column index j of any one piece of data of the non-zero index table T[s] (step S420). In step S420, the first selection circuit 320 may provide the first element value to the multiplier first input of each of a plurality of multipliers in a corresponding row of the multiplier array 330. According to the column index j of the non-zero index table T[s], the first selection circuit 320 may extract the element values of a whole row from the corresponding row of the first matrix as the second element values. The first selection circuit 320 may provide each of the second element values to the multiplier second input of a corresponding one of the multipliers in the corresponding row of the multiplier array 330 (step S430). When the second matrix is the sparse matrix, the second selection circuit 340 may select a selected row from the plurality of rows of the accumulator array 350 according to the row index i of the non-zero index table T[s] (step S450). In step S450, the second selection circuit 340 may transmit the output of each of the multipliers in the corresponding row of the multiplier array 330 to the input of a corresponding one of the plurality of accumulators in the selected row of the accumulator array 350.
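The two cases above can be summarized as per-entry update rules derived from $C = A \times B$. The running indices r and q are introduced here only for illustration and do not appear in the specification.

```latex
% Matrix B is the sparse matrix: entry (i, j) of T[s] updates column j of C
c_{r,j} \leftarrow c_{r,j} + a_{r,i}\, b_{i,j}, \qquad r = 1, \dots, m

% Matrix A is the sparse matrix: entry (i, j) of T[s] updates row i of C
c_{i,q} \leftarrow c_{i,q} + a_{i,j}\, b_{j,q}, \qquad q = 1, \dots, k
```

Summing the first rule over all non-zero $b_{i,j}$, or the second over all non-zero $a_{i,j}$, gives the same result as the full sum over the shared dimension, because the omitted terms are exactly the products with a zero factor.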

Fig. 5 is a block diagram of the first selection circuit 320, the multiplier array 330 and the accumulator array 350 shown in Fig. 3 according to an embodiment of the present invention. The embodiment shown in Fig. 5 assumes that the first matrix is the sparse matrix. The first selection circuit 320 shown in Fig. 5 includes a scan circuit 321 and a selector 322. The scan circuit 321 is coupled to the storage unit 310 to read the first matrix (e.g., the matrix B). The scan circuit 321 may scan the matrix B to generate the non-zero index table T[s]. For example, assuming that the element $b_{1,1}$ of the matrix B is zero and the element $b_{1,2}$ of the matrix B is not zero, the scan circuit 321 may discard (not record) the row index i and the column index j "1,1" of the element $b_{1,1}$, and record the row index i and the column index j "1,2" of the element $b_{1,2}$ in the non-zero index table T[s].
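The behavior of the scan circuit can be sketched as a function that records the position of every non-zero element. The row-major scan order, the function name and the sample matrix are assumptions; indices are 1-based to match the notation of the description.

```python
def scan_nonzero(matrix):
    """Return the non-zero index table: a list of 1-based (i, j) positions."""
    table = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:            # positions of zero elements are discarded
                table.append((i, j))
    return table


# Hypothetical sparse matrix B: b_{1,1} is zero and b_{1,2} is not.
B = [[0, 7, 0],
     [0, 0, 0],
     [3, 0, 5]]
T = scan_nonzero(B)
print(T)
```

The length of the resulting table is the number s of non-zero elements.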

The selector 322 is coupled to the scan circuit 321, the storage unit 310 and the multiplier array 330. Based on the row index i and the column index j recorded in the non-zero index table T[s], the selector 322 may extract the corresponding element value from the first matrix (e.g., the matrix B) and provide it to the multiplier first input of each multiplier in a corresponding column of the multiplier array 330. Based on the row index i recorded in the non-zero index table T[s], the selector 322 may extract the element values of a whole column from the corresponding column of the second matrix (e.g., the matrix A) and provide them to the multiplier second inputs of the plurality of multipliers in the corresponding column of the multiplier array 330. Based on the column index j recorded in the non-zero index table T[s], the second selection circuit 340 may select a selected column from the plurality of columns of the accumulator array 350, and may transmit the outputs of the multipliers in the corresponding column of the multiplier array 330 to the inputs of the accumulators ACC in the selected column of the accumulator array 350. Each accumulator ACC of the accumulator array 350 may perform an accumulation operation to generate the elements of the matrix C. Each accumulator ACC can be understood by analogy with the description of the accumulator 220 shown in Fig. 2, and is not described again here.

For example, assume that the first matrix is the matrix B and the second matrix is the matrix A. Assume also that the non-zero index table T[s] is divided into one or more batches according to the number k of columns of the multiplier array 330 and the accumulator array 350, wherein each batch of the non-zero index table T[s] contains at most k pieces of data (k entries of "row index i and column index j"). For example, the y-th piece of data of the current batch of the non-zero index table T[s] includes the row index i and the column index j of a certain element of the matrix B. Based on the row index i and the column index j of the y-th piece of data of the current batch of the non-zero index table T[s], the first selection circuit 320 may extract the element value $b_{i,j}$ located in the i-th row and the j-th column of the matrix B as the first element value, and may provide the first element value $b_{i,j}$ to the multiplier first input of each multiplier in the y-th column of the multiplier array 330. According to the row index i of the y-th piece of data of the current batch of the non-zero index table T[s], the first selection circuit 320 may also extract the element values $a_{1,i}$ to $a_{m,i}$ of the whole i-th column of the matrix A as the second element values. The first selection circuit 320 may provide each of the second element values $a_{1,i}$ to $a_{m,i}$ to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of the multiplier array 330. For example, when the row index i is 1, the first selection circuit 320 may provide the element value $a_{1,1}$ to the multiplier second input of the first multiplier in the y-th column of the multiplier array 330, provide the element value $a_{2,1}$ to the multiplier second input of the second multiplier in the y-th column of the multiplier array 330, and provide the element value $a_{m,1}$ to the multiplier second input of the m-th multiplier in the y-th column of the multiplier array 330.

Based on the column index j of the y-th piece of data of the current batch of the non-zero index table T[s], the second selection circuit 340 may select the j-th column from the plurality of columns of the accumulator array 350, and may transmit the output of each of the plurality of multipliers in the y-th column of the multiplier array 330 to the input of the corresponding one of the plurality of accumulators ACC in the j-th column of the accumulator array 350. For example, the second selection circuit 340 may transmit the output of the first multiplier in the y-th column of the multiplier array 330 to the input of the first accumulator ACC in the j-th column of the accumulator array 350, the output of the second multiplier in the y-th column of the multiplier array 330 to the input of the second accumulator ACC in the j-th column of the accumulator array 350, and the output of the m-th multiplier in the y-th column of the multiplier array 330 to the input of the m-th accumulator ACC in the j-th column of the accumulator array 350.
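The data flow of the Fig. 5 embodiment (matrix B sparse) can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the circuit itself: the function name and sample matrices are hypothetical, the table is built in row-major order, and the entries of a batch are processed one after another rather than in parallel.

```python
def sparse_b_matmul(a, b):
    """Compute C = A x B while touching only the non-zero elements of B."""
    m, k = len(a), len(b[0])
    # Non-zero index table T[s] with 1-based (i, j) positions.
    table = [(i, j) for i, row in enumerate(b, 1) for j, v in enumerate(row, 1) if v != 0]
    acc = [[0] * k for _ in range(m)]          # accumulator array, m x k
    for start in range(0, len(table), k):      # one batch holds at most k entries
        batch = table[start:start + k]
        for y, (i, j) in enumerate(batch):     # entry y drives multiplier column y
            first = b[i - 1][j - 1]                      # first element value b_{i,j}
            second = [a[r][i - 1] for r in range(m)]     # column i of A
            products = [first * x for x in second]       # outputs of multiplier column y
            for r in range(m):                 # routed to accumulator column j
                acc[r][j - 1] += products[r]
    return acc


A = [[1, 2, 3], [4, 5, 6]]
B = [[0, 7], [0, 0], [3, 0]]
C = sparse_b_matmul(A, B)
print(C)
```

Only two entries of B are processed here instead of three full rows. Because the sketch is sequential, two entries of the same batch that target the same accumulator column simply add up; how the hardware handles that case is not covered by this sketch.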

Fig. 6 is a block diagram of the first selection circuit 320, the multiplier array 330 and the accumulator array 350 shown in Fig. 3 according to another embodiment of the present invention. The embodiment shown in Fig. 6 assumes that the second matrix is the sparse matrix. The first selection circuit 320 shown in Fig. 6 includes a scan circuit 323 and a selector 324. The scan circuit 323 is coupled to the storage unit 310 to read the second matrix (e.g., the matrix A). The scan circuit 323 may scan the matrix A to generate the non-zero index table T[s]. For example, assuming that the element $a_{1,1}$ of the matrix A is zero and the element $a_{2,1}$ of the matrix A is not zero, the scan circuit 323 may discard (not record) the row index i and the column index j "1,1" of the element $a_{1,1}$, and record the row index i and the column index j "2,1" of the element $a_{2,1}$ in the non-zero index table T[s].

The selector 324 is coupled to the scan circuit 323, the storage unit 310 and the multiplier array 330. Based on the row index i and the column index j recorded in the non-zero index table T[s], the selector 324 may extract the corresponding element value from the second matrix (e.g., the matrix A) and provide it to the multiplier first input of each multiplier in a corresponding row of the multiplier array 330. Based on the column index j recorded in the non-zero index table T[s], the selector 324 may extract the element values of a whole row from the corresponding row of the first matrix (e.g., the matrix B) and provide them to the multiplier second inputs of the plurality of multipliers in the corresponding row of the multiplier array 330. Based on the row index i recorded in the non-zero index table T[s], the second selection circuit 340 may select a selected row from the plurality of rows of the accumulator array 350, and may transmit the outputs of the multipliers in the corresponding row of the multiplier array 330 to the inputs of the accumulators ACC in the selected row of the accumulator array 350. Each accumulator ACC of the accumulator array 350 may perform an accumulation operation to generate the elements of the matrix C. Each accumulator ACC can be understood by analogy with the description of the accumulator 220 shown in Fig. 2, and is not described again here.

For example, assume that the first matrix is matrix B and the second matrix is matrix A. Assume further that the non-zero index table T[s] is divided into one or more batches according to the number m of rows of the multiplier array 330 and the accumulator array 350, wherein each batch of the non-zero index table T[s] contains at most m entries (m pairs of "row index i and column index j"). For example, the y-th entry of the current batch of the non-zero index table T[s] includes a row index i and a column index j of a certain element of the matrix A. Based on the row index i and the column index j of this entry, the first selection circuit 320 may extract the element value a_i,j located in the i-th row and j-th column from the matrix A as a first element value, and the first selection circuit 320 may provide the first element value a_i,j to the multiplier first input of each multiplier in the y-th row of the multiplier array 330. According to the column index j of the y-th entry of the non-zero index table T[s], the first selection circuit 320 may also extract the whole-row element values b_j,1 to b_j,k from the j-th row of the matrix B as second element values. The first selection circuit 320 may provide each of the second element values b_j,1 to b_j,k to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array 330. For example, the first selection circuit 320 may provide the element value b_j,1 to the multiplier second input of the first multiplier in the y-th row of the multiplier array 330, provide the element value b_j,2 to the multiplier second input of the second multiplier in the y-th row, and provide the element value b_j,k to the multiplier second input of the k-th multiplier in the y-th row.
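The batching and the routing of operands to one multiplier row can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit of FIG. 6. The helper names `split_into_batches` and `multiplier_row_inputs`, the example matrices and the value m = 2 are hypothetical.

```python
def split_into_batches(table, m):
    """Split the non-zero index table into consecutive batches of at most
    m entries, where m models the number of multiplier-array rows."""
    return [table[start:start + m] for start in range(0, len(table), m)]


def multiplier_row_inputs(A, B, entry):
    """For one table entry (i, j), return the operands of one multiplier row:
    a_i,j broadcast to every first input, and row j of B (b_j,1 .. b_j,k)
    spread across the second inputs. Indices are 1-based as in the text."""
    i, j = entry
    a_ij = A[i - 1][j - 1]
    b_row = B[j - 1]
    first_inputs = [a_ij] * len(b_row)
    second_inputs = list(b_row)
    return first_inputs, second_inputs


# Hypothetical data: A is the sparse second matrix, B is the first matrix.
A = [[0, 5],
     [3, 0]]
B = [[1, 2, 3],
     [4, 5, 6]]
table = [(1, 2), (2, 1)]  # non-zero positions of A
batches = split_into_batches(table, m=2)

# The first entry of the first batch drives multiplier row y = 1.
first, second = multiplier_row_inputs(A, B, batches[0][0])
products = [a * b for a, b in zip(first, second)]
```

Here the entry "1,2" selects a_1,2 = 5 and the second row of B, so the multiplier row produces one partial product per column of B.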

Based on the row index i of the y-th entry of the current batch of the non-zero index table T[s], the second selection circuit 340 may select the i-th row from the plurality of rows of the accumulator array 350, and the second selection circuit 340 may transmit the output of each of the plurality of multipliers in the y-th row of the multiplier array 330 to the input of the corresponding accumulator of the plurality of accumulators ACC in the i-th row of the accumulator array 350, respectively. For example, the second selection circuit 340 may transmit the output of the first multiplier in the y-th row of the multiplier array 330 to the input of the first accumulator ACC in the i-th row of the accumulator array 350, the output of the second multiplier in the y-th row of the multiplier array 330 to the input of the second accumulator ACC in the i-th row of the accumulator array 350, and the output of the k-th multiplier in the y-th row of the multiplier array 330 to the input of the k-th accumulator ACC in the i-th row of the accumulator array 350.
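Taken together, the data flow of this embodiment can be modeled by the following sketch, which accumulates only the products of non-zero elements of matrix A and arrives at C = A × B. The function name `sparse_second_matrix_multiply`, the row-major scan order and the sizing of the accumulator model to the rows of A are assumptions of this sketch. Entries of one batch are processed one after another here, so the timing of the hardware, including how entries of the same batch that share a row index i are handled, is not modeled.

```python
def sparse_second_matrix_multiply(A, B, m):
    """Software model of the data flow with A as the sparse matrix.
    m models the number of multiplier-array rows (batch size)."""
    k = len(B[0])
    # Scanning step: 1-based (i, j) of every non-zero element of A.
    table = [(i, j)
             for i, row in enumerate(A, start=1)
             for j, value in enumerate(row, start=1)
             if value != 0]
    # Accumulator model: one row per row of A, k accumulators per row.
    acc = [[0] * k for _ in range(len(A))]
    for start in range(0, len(table), m):
        batch = table[start:start + m]  # at most m entries per batch
        for y, (i, j) in enumerate(batch, start=1):
            # Multiplier row y: a_i,j times each of b_j,1 .. b_j,k.
            a_ij = A[i - 1][j - 1]
            products = [a_ij * b for b in B[j - 1]]
            # Second selection: route the products to accumulator row i.
            for x in range(k):
                acc[i - 1][x] += products[x]
    return acc


# Hypothetical example data.
A = [[0, 5],
     [3, 0]]
B = [[1, 2, 3],
     [4, 5, 6]]
C = sparse_second_matrix_multiply(A, B, m=2)
```

Because zero elements of A never enter the table, no multiplier is spent on products that would contribute nothing to C.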

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A matrix operation device, characterized in that the matrix operation device comprises:

a storage unit adapted to store the first matrix and the second matrix;

a multiplier array, wherein each of a plurality of multipliers of the multiplier array is configured to perform a product operation;

a first selection circuit coupled to the storage unit and the multiplier array, wherein the first selection circuit extracts a corresponding element value from one of the first matrix and the second matrix according to a row index and a column index of any one of the data of the non-zero index table to a multiplier first input of each of a plurality of multipliers in a corresponding column or a corresponding row of the multiplier array, the first selection circuit extracts all element values from the corresponding column or a corresponding row of the other of the first matrix and the second matrix according to the row index or the column index as a plurality of second element values, and the first selection circuit provides each of the plurality of second element values to a multiplier second input of one of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array, respectively;

an accumulator array, wherein each of a plurality of accumulators of the accumulator array is configured to perform an accumulation operation; and

a second selection circuit coupled to the multiplier array and the accumulator array, wherein the second selection circuit selects a selected column or a selected row from the accumulator array according to the column index or the row index, and the second selection circuit transmits an output of each multiplier in the corresponding column or the corresponding row of the multiplier array to an input of a corresponding accumulator of a plurality of accumulators in the selected column or the selected row of the accumulator array, respectively.

2. The matrix operation device of claim 1 wherein,

when the first matrix is a sparse matrix, the first selection circuit extracts a corresponding element value from the first matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding column of the multiplier array, the first selection circuit extracts an entire column element value from the corresponding column of the second matrix as the plurality of second element values according to the row index of the non-zero index table, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of one of a plurality of multipliers in the corresponding column of the multiplier array, respectively; and

when the first matrix is the sparse matrix, the second selection circuit selects a selected column from a plurality of columns of the accumulator array according to the column index of the non-zero index table, and the second selection circuit transmits an output of each of the plurality of multipliers in the corresponding column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected column of the accumulator array, respectively.

3. The matrix operation device of claim 1 wherein,

when the second matrix is a sparse matrix, the first selection circuit extracts a corresponding element value from the second matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding row of the multiplier array, the first selection circuit extracts an entire row of element values from the corresponding row of the first matrix as the plurality of second element values according to the column index of the non-zero index table, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of one of a plurality of multipliers in the corresponding row of the multiplier array, respectively; and

when the second matrix is the sparse matrix, the second selection circuit selects a selected row from a plurality of rows of the accumulator array according to the row index of the non-zero index table, and the second selection circuit transmits an output of each of the plurality of multipliers in the corresponding row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected row of the accumulator array, respectively.

4. The matrix operation device according to claim 1 wherein said first selection circuit comprises:

a scanning circuit coupled to the storage unit to read the first matrix or the second matrix, wherein the scanning circuit scans the first matrix to generate the non-zero index table when the first matrix is taken as a sparse matrix, or scans the second matrix to generate the non-zero index table when the second matrix is taken as the sparse matrix; and

a selector coupled to the scanning circuit, the storage unit, and the multiplier array, wherein the selector is configured to:

when the first matrix is taken as the sparse matrix, the selector extracts the corresponding element value from the first matrix to the multiplier first input of each multiplier in the corresponding column of the multiplier array according to the row index and the column index of the non-zero index table, and extracts the whole column element values from the corresponding column of the second matrix to the multiplier second inputs of the plurality of multipliers in the corresponding column of the multiplier array according to the row index; or

when the second matrix is taken as the sparse matrix, the selector extracts the corresponding element value from the second matrix to the multiplier first input of each multiplier in the corresponding row of the multiplier array according to the row index and the column index of the non-zero index table, and extracts the whole row element values from the corresponding row of the first matrix to the multiplier second inputs of the plurality of multipliers in the corresponding row of the multiplier array according to the column index.

5. The matrix operation device of claim 1 wherein any one of the accumulators of the accumulator array comprises:

an adder having an adder first input serving as the input of said any one of the accumulators; and

a buffer having an input coupled to the output of the adder, wherein the output of the buffer is coupled to the adder second input of the adder.

6. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the first selection circuit extracts an element value located in an i-th row and a j-th column from the first matrix as a first element value, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in a y-th column of the multiplier array, the first selection circuit extracts an entire column element value from the i-th column of the second matrix as the plurality of second element values according to the row index i, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of the multiplier array, respectively.

7. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the second selection circuit selects a j-th column from a plurality of columns of the accumulator array according to the column index j, and the second selection circuit transmits an output of each of the plurality of multipliers in the y-th column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the j-th column of the accumulator array, respectively.

8. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the first selection circuit extracts an element value located in an i-th row and a j-th column from the second matrix as a first element value, the first selection circuit supplies the first element value to the multiplier first input of each of a plurality of multipliers in a y-th row of the multiplier array, the first selection circuit extracts an entire row element value from the j-th row of the first matrix as the plurality of second element values according to the column index j, and the first selection circuit supplies each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array, respectively.

9. The matrix operation device according to claim 1 wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the second selection circuit selects an i-th row from a plurality of rows of the accumulator array according to the row index i, and the second selection circuit transmits an output of each of the plurality of multipliers in the y-th row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the i-th row of the accumulator array, respectively.

10. A matrix operation method, characterized in that the matrix operation method comprises:

storing the first matrix and the second matrix by a storage unit of the matrix operation device;

extracting a corresponding element value from one of the first matrix and the second matrix according to a row index and a column index of any one data of a non-zero index table to a multiplier first input end of each of a plurality of multipliers in a corresponding column or a corresponding row of a multiplier array of the matrix operation device;

extracting all element values from a corresponding column or a corresponding row of the other of the first matrix and the second matrix as a plurality of second element values according to the row index or the column index, and providing each of the plurality of second element values to a multiplier second input of a corresponding multiplier of the plurality of multipliers in the corresponding column or the corresponding row of the multiplier array, respectively;

performing a product operation by each of a plurality of multipliers of the multiplier array;

selecting a selected column or a selected row from an accumulator array of the matrix operation device according to the column index or the row index, and transmitting an output of each multiplier in the corresponding column or the corresponding row of the multiplier array to an input of a corresponding accumulator of a plurality of accumulators in the selected column or the selected row, respectively; and

performing an accumulation operation by each of a plurality of accumulators of the accumulator array.

11. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:

extracting a corresponding element value from the first matrix as a first element value according to the row index and the column index of any one of the data of the non-zero index table when the first matrix is a sparse matrix, and providing the first element value to the multiplier first input of each of a plurality of multipliers in the corresponding column of the multiplier array;

extracting an entire column of element values from the corresponding column of the second matrix as the plurality of second element values according to the row index of the non-zero index table when the first matrix is the sparse matrix, and providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the corresponding column of the multiplier array, respectively; and

selecting a selected column from a plurality of columns of the accumulator array according to the column index of the non-zero index table when the first matrix is the sparse matrix, and transmitting an output of each of the plurality of multipliers in the corresponding column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected column of the accumulator array, respectively.

12. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:

extracting corresponding element values from the second matrix as first element values according to the row index and the column index of any one of the data of the non-zero index table when the second matrix is taken as a sparse matrix, and providing the first element values to the multiplier first input of each of a plurality of multipliers in the corresponding row of the multiplier array;

extracting an entire row of element values from the corresponding row of the first matrix as the plurality of second element values according to the column index of the non-zero index table when the second matrix is taken as the sparse matrix, and providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the corresponding row of the multiplier array, respectively; and

selecting a selected row from a plurality of rows of the accumulator array according to the row index of the non-zero index table when the second matrix is the sparse matrix, and transmitting an output of each of the plurality of multipliers in the corresponding row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the selected row of the accumulator array, respectively.

13. The matrix operation method according to claim 10, characterized in that the matrix operation method further comprises:

scanning the first matrix to generate the non-zero index table when the first matrix is taken as a sparse matrix; and

scanning the second matrix to generate the non-zero index table when the second matrix is taken as the sparse matrix.

14. The matrix operation method according to claim 10, wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:

extracting an element value located in an i-th row and a j-th column from the first matrix as a first element value;

providing the first element value to the multiplier first input of each of a plurality of multipliers in a y-th column of the multiplier array;

extracting whole column element values from an i-th column of the second matrix as the plurality of second element values according to the row index i; and

providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th column of the multiplier array, respectively.

15. The matrix operation method according to claim 10, wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:

selecting a j-th column from a plurality of columns of the accumulator array according to the column index j; and

transmitting the output of each of the plurality of multipliers in the y-th column of the multiplier array to an input of a corresponding one of a plurality of accumulators in the j-th column of the accumulator array, respectively.

16. The matrix operation method according to claim 10, wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:

extracting an element value located in an i-th row and a j-th column from the second matrix as a first element value;

providing the first element value to the multiplier first input of each of a plurality of multipliers in a y-th row of the multiplier array;

extracting an entire row of element values from a j-th row of the first matrix as the plurality of second element values according to the column index j; and

providing each of the plurality of second element values to the multiplier second input of a corresponding one of the plurality of multipliers in the y-th row of the multiplier array, respectively.

17. The matrix operation method according to claim 10, wherein the y-th data of the current batch of the non-zero index table includes a row index i and a column index j, the matrix operation method further comprising:

selecting an i-th row from a plurality of rows of the accumulator array according to the row index i; and

transmitting the output of each of the plurality of multipliers in the y-th row of the multiplier array to an input of a corresponding one of a plurality of accumulators in the i-th row of the accumulator array, respectively.

18. The matrix operation method according to claim 10, further comprising:

comparing the sparsity of the first matrix with the sparsity of the second matrix to determine which of them is taken as the sparse matrix.