WO2020103883A1 - Method, circuit, and SoC for performing matrix multiplication - Google Patents

Method, circuit, and SoC for performing matrix multiplication

Info

Publication number
WO2020103883A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
index
rows
columns
column
Prior art date
Application number
PCT/CN2019/119794
Other languages
English (en)
French (fr)
Inventor
何雷骏
徐斌
王开兴
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202111444099.0A priority Critical patent/CN114138231B/zh
Priority to EP22188228.5A priority patent/EP4102354A1/en
Priority to EP19888082.5A priority patent/EP3876092B1/en
Priority to ES19888082T priority patent/ES2943886T3/es
Priority to CN201980076521.6A priority patent/CN113168309A/zh
Publication of WO2020103883A1 publication Critical patent/WO2020103883A1/zh
Priority to US17/324,533 priority patent/US11263292B2/en
Priority to US17/568,538 priority patent/US11397791B2/en
Priority to US17/841,162 priority patent/US11860970B2/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 - Multiplying; Dividing
    • G06F7/523 - Multiplying only
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 - Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 - Sum of products

Definitions

  • The present application relates to the field of data processing, and in particular to a method, a circuit, and a system on chip (SOC) for performing matrix multiplication.
  • Artificial intelligence technology is widely used in terminals, at the edge, and in the cloud to implement functions such as image recognition, object detection, and speech translation. Artificial intelligence is usually implemented with deep learning networks. In a deep learning network, the operators that dominate performance and computation, such as convolution and inner product, can account for up to 99% of the total computation, and these operators can be expanded into matrix-by-matrix multiplications. More generally, the matrix is a common form of data representation, and matrix-by-matrix multiplication is also frequently used in other fields.
  • In the related art, a matrix-by-matrix multiplication is often split into vector-by-matrix multiplications.
  • matrix A is a matrix of m rows and k columns
  • matrix B is a matrix of k rows and n columns
  • m, k, and n are positive integers.
  • n processing elements (PEs) are each provided with the index module of one of the n columns of matrix B (used to locate the non-zero elements), and each of the n PEs then reads data from a row of matrix A according to the index module of the column of matrix B it has acquired and multiplies the read data by that column of matrix B.
  • However, each PE reads data from a row of matrix A according to the index module of one column of matrix B. The data needed from a row is scattered, and since matrix A is usually stored across multiple locations in memory, a PE that tries to read its data from matrix A in one access is likely to hit address conflicts when fetching from several memory addresses at once, so the data cannot be read. Furthermore, because matrix B has to be read once for each of the m rows of matrix A, the whole operation reads matrix B m times in total, which results in low data reusability and consumes more processing resources.
  • the present application provides a method, circuit, and SOC for performing matrix multiplication operations, which can solve the problems of low operation efficiency, address conflicts when reading data, and low data reusability in the related art.
  • the technical solution is as follows:
  • According to a first aspect, a method for performing matrix multiplication includes: acquiring a matrix A1, a matrix B2, and an index matrix, where the matrix A1 is a matrix of m rows and k columns, the matrix B2 is a matrix of t rows and n columns, the index matrix is a matrix of t rows and n columns, m, k, t, and n are all positive integers, and t is less than or equal to k; generating n matrices A2 according to the index matrix and the matrix A1, where the n matrices A2 are all matrices of m rows and t columns, the n matrices A2 correspond one-to-one, in order, to the n columns of the index matrix, the t columns of each matrix A2 correspond one-to-one, in order, to the t elements of the corresponding column of the index matrix, and each column of each matrix A2 is the column of the matrix A1 indicated by the corresponding element of the index matrix; and generating a matrix C according to the n matrices A2 and the matrix B2, where the matrix C is a matrix of m rows and n columns, the n columns of the matrix C correspond one-to-one, in order, to the n matrices A2 and to the n columns of the matrix B2, and each column of the matrix C is the product of the corresponding matrix A2 and the corresponding column of the matrix B2.
  • the matrix B2 contains all non-zero elements (valid data) in the matrix B1.
  • The index matrix records the positions of the elements of matrix B2 in matrix B1; that is, it is an index into matrix B1 that covers all non-zero elements of matrix B1.
  • In this way, the n matrices A2 can be read from the matrix A1 in a single pass according to the index matrix, and the n matrices A2 can then be multiplied by the n columns of the matrix B2 to obtain the matrix C. Because the matrix multiplication is completed with only one read of the data of the matrix A1, data reusability is maximized and processing resources are saved. Moreover, since the n matrices A2 have the same size, the multiplications of the n matrices A2 with the n columns of the matrix B2 can be performed in parallel and completed in the same amount of time, which saves computation time and improves computational efficiency.
  • Acquiring the matrix B2 and the index matrix may include: acquiring a matrix B1, where the matrix B1 is a matrix of k rows and n columns, the elements of each of the n columns of the matrix B1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, the number of non-zero elements in each group is less than or equal to k2, the preset number is k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2; generating the matrix B2 according to the matrix B1, where the n columns of the matrix B2 correspond one-to-one, in order, to the n columns of the matrix B1, and the elements of each column of the matrix B2 include all the non-zero elements of the preset number of groups, in order, of the corresponding column of the matrix B1; and generating the index matrix according to the matrix B1, where the n columns of the index matrix correspond one-to-one, in order, to the n columns of the matrix B2, and the elements of each column of the index matrix are the row labels, in the matrix B1, of the elements arranged in order in the corresponding column of the matrix B2.
  • That is, the matrix B1 is a matrix that satisfies conditional sparseness.
  • The matrix B1 can be obtained through neural network training. For example, in a deep learning scenario, by controlling the training process of the deep learning network, the distribution of the parameters of operators such as convolution and inner product can be trained to follow the conditionally sparse distribution rule, and a parameter matrix satisfying conditional sparseness is obtained as the matrix B1.
  • In this way, the number of non-zero elements in each of the n columns of the matrix B1 is kept within a certain range, so the range of the data index is effectively controlled, which effectively reduces the size of the index matrix and ensures engineering feasibility.
  • In addition, only one index matrix is needed to complete the matrix multiplication, so fewer logic resources are consumed.
  • For any non-zero element in the matrix B2, its row label in the matrix B1 is the row number of the row to which that non-zero element belongs in the matrix B1; for any element in the matrix B2 that is zero, its row label in the matrix B1 is a first character.
  • The column of the matrix A1 indicated by an element of the index matrix is the column of the matrix A1 whose column number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding column of the corresponding matrix A2 consists of m second characters.
  • The first character and the second character make it possible to meet the requirement of matrix element alignment.
  • In other words, the elements of t columns of the matrix A1 are read directly to form one matrix A2.
  • The k columns of the matrix A1 may be stored at multiple addresses in the memory.
  • Because the required data can be read from the memory in a single pass according to the index matrix, the memory access bandwidth required for reading the data is greatly reduced, and address conflicts that might otherwise occur when reading data from the memory are eliminated.
  • According to a second aspect, a method for performing matrix multiplication includes: acquiring a matrix B1, a matrix A2, and an index matrix, where the matrix B1 is a matrix of k rows and n columns, the matrix A2 is a matrix of m rows and t columns, the index matrix is a matrix of m rows and t columns, k, n, m, and t are all positive integers, and t is less than or equal to k; generating m matrices B2 according to the index matrix and the matrix B1, where the m matrices B2 are all matrices of t rows and n columns, the m matrices B2 correspond one-to-one, in order, to the m rows of the index matrix, the t rows of each matrix B2 correspond one-to-one, in order, to the t elements of the corresponding row of the index matrix, and each row of each matrix B2 is the row of the matrix B1 indicated by the corresponding element of the index matrix; and generating a matrix C according to the matrix A2 and the m matrices B2, where the matrix C is a matrix of m rows and n columns, the m rows of the matrix C correspond one-to-one, in order, to the m rows of the matrix A2 and to the m matrices B2, and each row of the matrix C is the product of the corresponding row of the matrix A2 and the corresponding matrix B2.
  • the matrix A2 contains all non-zero elements (valid data) in the matrix A1.
  • The index matrix records the positions of the elements of the matrix A2 in the matrix A1; that is, it is an index into the matrix A1 that covers all non-zero elements of the matrix A1.
  • In this way, the m matrices B2 can be read from the matrix B1 in a single pass, and the m rows of the matrix A2 can then be multiplied by the m matrices B2 to obtain the matrix C. Because the matrix multiplication is completed with only one read of the data of the matrix B1, data reusability is maximized and processing resources are saved.
  • Acquiring the matrix A2 and the index matrix may include: acquiring a matrix A1, where the matrix A1 is a matrix of m rows and k columns, the elements of each of the m rows of the matrix A1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, the number of non-zero elements in each group is less than or equal to k2, the preset number is k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2; generating the matrix A2 according to the matrix A1, where the m rows of the matrix A2 correspond one-to-one, in order, to the m rows of the matrix A1, and the elements of each row of the matrix A2 include all the non-zero elements of the preset number of groups, in order, of the corresponding row of the matrix A1; and generating the index matrix according to the matrix A1, where the m rows of the index matrix correspond one-to-one, in order, to the m rows of the matrix A2, and the elements of each row of the index matrix are the column labels, in the matrix A1, of the elements arranged in order in the corresponding row of the matrix A2.
  • That is, the matrix A1 is a matrix that satisfies conditional sparseness.
  • The matrix A1 can be obtained through neural network training. For example, in a deep learning scenario, by controlling the training process of the deep learning network, the distribution of the parameters of operators such as convolution and inner product can be trained to follow the conditionally sparse distribution rule, and a parameter matrix satisfying conditional sparseness is obtained as the matrix A1.
  • In this way, the number of non-zero elements in each of the m rows of the matrix A1 is kept within a certain range, so the range of the data index is effectively controlled, which effectively reduces the size of the index matrix and ensures engineering feasibility.
  • In addition, only one index matrix is needed to complete the matrix multiplication, so fewer logic resources are consumed.
  • For any non-zero element in the matrix A2, its column label in the matrix A1 is the column number of the column to which that non-zero element belongs in the matrix A1; for any element in the matrix A2 that is zero, its column label in the matrix A1 is a first character.
  • The row of the matrix B1 indicated by an element of the index matrix is the row of the matrix B1 whose row number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding row of the corresponding matrix B2 consists of n second characters.
  • The first character and the second character make it possible to meet the requirement of matrix element alignment.
  • In other words, the elements of t rows of the matrix B1 are read directly to form one matrix B2.
  • The k rows of the matrix B1 may be stored at multiple addresses in the memory.
  • Because the required data can be read from the memory in a single pass according to the index matrix, the memory access bandwidth required for reading the data is greatly reduced, and address conflicts that might otherwise occur when reading data from the memory are eliminated.
  • According to a third aspect, a circuit for performing matrix multiplication includes:
  • an acquisition circuit, configured to acquire a matrix A1, a matrix B2, and an index matrix, where the matrix A1 is a matrix of m rows and k columns, the matrix B2 is a matrix of t rows and n columns, the index matrix is a matrix of t rows and n columns, m, k, t, and n are all positive integers, and t is less than or equal to k;
  • a data selection circuit, configured to generate n matrices A2 according to the index matrix and the matrix A1, where the n matrices A2 are all matrices of m rows and t columns, the n matrices A2 correspond one-to-one, in order, to the n columns of the index matrix, the t columns of each matrix A2 correspond one-to-one, in order, to the t elements of the corresponding column of the index matrix, and each column of each matrix A2 is the column of the matrix A1 indicated by the corresponding element of the index matrix; and
  • a computing unit array, configured to generate a matrix C according to the n matrices A2 and the matrix B2, where the matrix C is a matrix of m rows and n columns, the n columns of the matrix C correspond one-to-one, in order, to the n matrices A2 and to the n columns of the matrix B2, and each column of the matrix C is the product of the corresponding matrix A2 and the corresponding column of the matrix B2.
  • When acquiring the matrix B2 and the index matrix, the acquisition circuit is specifically configured to:
  • acquire a matrix B1, where the matrix B1 is a matrix of k rows and n columns, the elements of each of the n columns of the matrix B1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, the number of non-zero elements in each group is less than or equal to k2, the preset number is k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2;
  • generate the matrix B2 according to the matrix B1, where the n columns of the matrix B2 correspond one-to-one, in order, to the n columns of the matrix B1, and the elements of each column of the matrix B2 include all the non-zero elements of the preset number of groups, in order, of the corresponding column of the matrix B1; and
  • generate the index matrix according to the matrix B1, where the n columns of the index matrix correspond one-to-one, in order, to the n columns of the matrix B2, and the elements of each column of the index matrix are the row labels, in the matrix B1, of the elements arranged in order in the corresponding column of the matrix B2.
  • For any non-zero element in the matrix B2, its row label in the matrix B1 is the row number of the row to which that non-zero element belongs in the matrix B1; for any element in the matrix B2 that is zero, its row label in the matrix B1 is the first character.
  • The column of the matrix A1 indicated by an element of the index matrix is the column of the matrix A1 whose column number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding column of the corresponding matrix A2 consists of m second characters.
  • the matrix B1 is obtained through neural network training.
  • The circuit may further include a first memory, where the first memory is configured to store the matrix A1, the matrix B2, and the index matrix; accordingly, the acquisition circuit is configured to read the matrix A1, the matrix B2, and the index matrix from the first memory.
  • According to a fourth aspect, a circuit for performing matrix multiplication includes:
  • an acquisition circuit, configured to acquire a matrix B1, a matrix A2, and an index matrix, where the matrix B1 is a matrix of k rows and n columns, the matrix A2 is a matrix of m rows and t columns, the index matrix is a matrix of m rows and t columns, k, n, m, and t are all positive integers, and t is less than or equal to k;
  • a data selection circuit, configured to generate m matrices B2 according to the index matrix and the matrix B1, where the m matrices B2 are all matrices of t rows and n columns, the m matrices B2 correspond one-to-one, in order, to the m rows of the index matrix, the t rows of each matrix B2 correspond one-to-one, in order, to the t elements of the corresponding row of the index matrix, and each row of each matrix B2 is the row of the matrix B1 indicated by the corresponding element of the index matrix; and
  • a computing unit array, configured to generate a matrix C according to the matrix A2 and the m matrices B2, where the matrix C is a matrix of m rows and n columns, the m rows of the matrix C correspond one-to-one, in order, to the m rows of the matrix A2 and to the m matrices B2, and each row of the matrix C is the product of the corresponding row of the matrix A2 and the corresponding matrix B2.
  • When acquiring the matrix A2 and the index matrix, the acquisition circuit is specifically configured to:
  • acquire a matrix A1, where the matrix A1 is a matrix of m rows and k columns, the elements of each of the m rows of the matrix A1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, the number of non-zero elements in each group is less than or equal to k2, the preset number is k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2;
  • generate the matrix A2 according to the matrix A1, where the m rows of the matrix A2 correspond one-to-one, in order, to the m rows of the matrix A1, and the elements of each row of the matrix A2 include all the non-zero elements of the preset number of groups, in order, of the corresponding row of the matrix A1; and
  • generate the index matrix according to the matrix A1, where the m rows of the index matrix correspond one-to-one, in order, to the m rows of the matrix A2, and the elements of each row of the index matrix are the column labels, in the matrix A1, of the elements arranged in order in the corresponding row of the matrix A2.
  • For any non-zero element in the matrix A2, its column label in the matrix A1 is the column number of the column to which that non-zero element belongs in the matrix A1; for any element in the matrix A2 that is zero, its column label in the matrix A1 is the first character.
  • The row of the matrix B1 indicated by an element of the index matrix is the row of the matrix B1 whose row number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding row of the corresponding matrix B2 consists of n second characters.
  • the matrix A1 is obtained through neural network training.
  • The circuit may further include a first memory, where the first memory is configured to store the matrix B1, the matrix A2, and the index matrix; accordingly, the acquisition circuit is configured to read the matrix B1, the matrix A2, and the index matrix from the first memory.
  • According to another aspect, an SOC is provided, including the circuit for performing matrix multiplication described in the third aspect.
  • the SOC also includes a processing core for controlling the circuit that performs matrix multiplication.
  • The SOC may further include a second memory, where the second memory is configured to store the matrix A1, the matrix B2, and the index matrix; accordingly, the acquisition circuit is configured to read the matrix A1, the matrix B2, and the index matrix from the second memory.
  • According to another aspect, an SOC is provided, including the circuit for performing matrix multiplication described in the fourth aspect.
  • the SOC also includes a processing core for controlling the circuit that performs matrix multiplication.
  • The SOC may further include a second memory, where the second memory is configured to store the matrix B1, the matrix A2, and the index matrix; accordingly, the acquisition circuit is configured to read the matrix B1, the matrix A2, and the index matrix from the second memory.
  • A computer-readable storage medium is further provided, having instructions stored therein that, when executed on a computer, cause the computer to perform the method for performing matrix multiplication described in the first aspect.
  • A computer-readable storage medium is further provided, having instructions stored therein that, when executed on a computer, cause the computer to perform the method for performing matrix multiplication described in the second aspect.
  • a computer program product containing instructions, which when executed on a computer, causes the computer to perform the method for performing matrix multiplication described in the first aspect.
  • a computer program product containing instructions, which when executed on a computer, causes the computer to execute the method for performing matrix multiplication described in the second aspect above.
  • In the embodiments of the present application, m matrices B2 are generated. Because the required data can be read in a single pass, according to the index matrix, from the matrix B1 stored in the memory, the memory access bandwidth required for reading the data is greatly reduced and address conflicts that might occur when reading data from the memory are eliminated. Finally, the matrix C is generated according to the matrix A2 and the m matrices B2. Since the m matrices B2 have the same size, the multiplications of the m rows of the matrix A2 with the m matrices B2 can be executed in parallel and completed in the same amount of time, which saves computation time and improves computational efficiency. The matrix multiplication is completed with only one read of the data of the matrix B1, so data reusability is maximized and processing resources are saved.
  • FIG. 1 is a schematic diagram of a matrix B1 provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a matrix A1 provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for performing matrix multiplication operation provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of another matrix B1 provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a matrix B2 provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an index matrix provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another matrix A1 provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of n matrices A2 provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a matrix C provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of another method for performing matrix multiplication operation provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another matrix A1 provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a matrix A2 provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of another index matrix provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of another matrix B1 provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of m matrices B2 provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of another matrix C provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a circuit for performing matrix multiplication operations provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of another circuit for performing matrix multiplication operations provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a computing unit array provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of an SOC provided by an embodiment of the present application.
  • artificial intelligence technology is widely used in terminals, edge side, cloud, etc., to achieve image recognition, target detection, voice translation and other functions.
  • Artificial intelligence technology is often implemented through deep learning networks.
  • Deep learning networks are, for example, based on neural networks.
  • matrix is a common form of data expression, and matrix multiplication is often used in other fields.
  • The method for performing matrix multiplication provided by the embodiments of the present application is applied to matrix-by-matrix computation scenarios in deep learning or in other fields.
  • a matrix that satisfies the conditional sparseness can be obtained first.
  • the developer can control the training process of the deep learning network to obtain a matrix that satisfies the conditional sparseness.
  • the matrix that satisfies the conditional sparseness can also be obtained in other ways, which is not limited in the embodiments of the present application.
  • matrix A1 is multiplied by matrix B1
  • matrix A1 is a matrix of m rows and k columns
  • matrix B1 is a matrix of k rows and n columns
  • m, k, and n are all positive integers.
  • Matrix B1 satisfying conditional sparseness means that the elements of each of the n columns of matrix B1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, and the number of non-zero elements in each group is less than or equal to k2, where the preset number is a positive integer equal to k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2.
  • For example, assume that matrix B1 is a matrix of 16 rows and 16 columns, the elements of each of the 16 columns of matrix B1 are divided into two groups, the number of elements in each group is 8, and the number of non-zero elements in each group is less than or equal to 2.
  • Matrix B1 can then be as shown in FIG. 1. In this case, every 8 consecutive elements in each of the 16 columns of matrix B1 are sparsified to no more than 2 non-zero elements, and the corresponding conditional sparsity rate is 25%.
  • Matrix A1 satisfying conditional sparseness means that the elements of each of the m rows of matrix A1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, and the number of non-zero elements in each group is less than or equal to k2, where the preset number is a positive integer equal to k/k1, k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2.
  • For example, assume that matrix A1 is a matrix of 5 rows and 4 columns, the elements of each of the 5 rows of matrix A1 are divided into two groups, the number of elements in each group is 2, and the number of non-zero elements in each group is less than or equal to 1.
  • Matrix A1 can then be as shown in FIG. 2. In this case, every 2 consecutive elements in each of the 5 rows of matrix A1 are sparsified to no more than 1 non-zero element, and the corresponding conditional sparsity rate is 50%.
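  • As a minimal illustrative sketch (assuming NumPy, 0-based indices, and the hypothetical function names below), conditional sparseness can be checked as follows; for the 16 x 16 matrix B1 above, k1 = 8 and k2 = 2 give a conditional sparsity rate of k2/k1 = 25%, and for the 5 x 4 matrix A1 above, k1 = 2 and k2 = 1 give 50%.

        import numpy as np

        def satisfies_conditional_sparseness_cols(B1, k1, k2):
            # Column-wise rule used for matrix B1: every group of k1 consecutive
            # elements in each column holds at most k2 non-zero elements.
            k, n = B1.shape
            if k % k1 != 0 or k1 % k2 != 0:       # k divisible by k1, k1 divisible by k2
                return False
            groups = B1.reshape(k // k1, k1, n)   # group g covers rows g*k1 .. g*k1 + k1 - 1
            return bool(np.all(np.count_nonzero(groups, axis=1) <= k2))

        def satisfies_conditional_sparseness_rows(A1, k1, k2):
            # Row-wise rule used for matrix A1: check the transpose column-wise.
            return satisfies_conditional_sparseness_cols(A1.T, k1, k2)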
  • When matrix A1 is multiplied by matrix B1, if matrix B1 is a matrix that satisfies conditional sparseness, the product of matrix A1 and matrix B1 can be determined by the matrix multiplication method provided in the embodiment of FIG. 3 below; if matrix A1 is a matrix that satisfies conditional sparseness, the product of matrix A1 and matrix B1 can be determined by the matrix multiplication method provided in the embodiment of FIG. 10 below.
  • FIG. 3 is a flowchart of a method for performing matrix multiplication operation provided by an embodiment of the present application. Referring to FIG. 3, the method includes:
  • Step 301 Obtain matrix A1, matrix B2 and index matrix.
  • matrix A1 is a matrix of m rows and k columns, and m and k are both positive integers.
  • Matrix A1 can be the multiplicand in the matrix multiplication of any operator (such as convolution or inner product) in a deep learning network, and matrix A1 can be a data matrix. Matrix A1 can also be the multiplicand in a matrix-by-matrix computation in other applications, which is not limited in the embodiments of the present application.
  • matrix B2 is a matrix of t rows and n columns, t and n are both positive integers, and t is less than or equal to k.
  • Matrix B2 contains all non-zero elements (valid data) in matrix B1.
  • Matrix B1 is a matrix of k rows and n columns.
  • Matrix B1 can be the multiplier in the matrix multiplication of any operator in a deep learning network, and matrix B1 can be a parameter matrix. Matrix B1 can also be the multiplier in a matrix-by-matrix computation in other applications, which is not limited in the embodiments of the present application.
  • the index matrix is a matrix of t rows and n columns.
  • The index matrix records the positions of the elements of matrix B2 in matrix B1; that is, it is an index into matrix B1 that covers all non-zero elements of matrix B1.
  • When acquiring the matrix A1, the matrix A1 may be read directly from the memory.
  • When acquiring the matrix B2 and the index matrix, they may be read directly from the memory; alternatively, the matrix B1 may be acquired first, and the matrix B2 and the index matrix may then be generated according to the matrix B1. Specifically, the matrix B2 and the index matrix can both be generated from the matrix B1, or the matrix B2 can be generated from the matrix B1 first and the index matrix then generated from the matrix B1 and the matrix B2. The specific generation algorithm is not limited, as long as the generated matrices meet the definition of each matrix.
  • When acquiring the matrix B1, the matrix B1 may be read directly from the memory.
  • The matrix B1 may be a matrix that satisfies conditional sparseness; that is, the elements of each of the n columns of the matrix B1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, and the number of non-zero elements in each group is less than or equal to k2, where the preset number is k/k1 (also equal to t divided by k2), k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2.
  • the matrix B1 can be obtained through neural network training.
  • the distribution of the parameters of operators such as convolution and inner product can be trained to meet the sparse distribution rule by controlling the training process of the deep learning network.
  • matrix B1 can also be obtained by other methods, which is not limited in the embodiment of the present application.
  • The data types of the elements in matrix A1 and matrix B1 can be set in advance according to actual needs, for example integer, floating point, or any custom format, and the values of m, k, n, k1, k2, and the preset number can also be set in advance according to actual requirements, for example determined reasonably from the sparseness of the neural network and the computing power of the hardware, which is not limited in the embodiments of the present application.
  • For example, assume that matrix B1 is a matrix of 16 rows and 16 columns, the elements of each of the 16 columns of matrix B1 form a single group, the number of elements in each group is 16, and the number of non-zero elements in each group is less than or equal to 4. The matrix B1 may then be as shown in FIG. 4. In this case, every 16 consecutive elements in each of the 16 columns of matrix B1 are sparsified to no more than 4 non-zero elements, and the corresponding conditional sparsity rate is 25%.
  • The n columns of the matrix B2 correspond one-to-one, in order, to the n columns of the matrix B1, and the elements of each column of the matrix B2 include all the non-zero elements of the preset number of groups, in order, of the corresponding column of the matrix B1. That is, for each of the n columns of the matrix B1, for example for the i-th column, k2 elements containing all the non-zero elements are selected, in order, from each of the preset number of groups of the i-th column of the matrix B1 and used as the i-th column of the matrix B2, so that the matrix B2 is obtained, where i is an integer greater than or equal to 1 and less than or equal to n.
  • the preset value is 1, and the matrix B1 may be as shown in FIG. 4 at this time.
  • The n columns of the index matrix correspond one-to-one, in order, to the n columns of the matrix B2, and the elements of each column of the index matrix are the row labels, in the matrix B1, of the elements arranged in order in the corresponding column of the matrix B2. That is, for each of the n columns of the matrix B2, for example for the i-th column, the row label in the matrix B1 of each element of the i-th column of the matrix B2 is taken, in order, as the i-th column of the index matrix, so that the index matrix is obtained.
  • Conditional sparseness is introduced in the embodiments of the present application so that the number of non-zero elements in each of the n columns of matrix B1 is kept within a certain range, which effectively controls the range of the data index, effectively reduces the size of the index matrix, and ensures engineering feasibility.
  • In addition, the matrix multiplication can subsequently be completed using only this one index matrix, so fewer logic resources are consumed.
  • For any non-zero element in the matrix B2, its row label in the matrix B1 is the row number of the row to which that non-zero element belongs in the matrix B1.
  • For any element in the matrix B2 that is zero, its row label in the matrix B1 is the first character.
  • The introduction of the first character makes it possible to meet the requirement of matrix element alignment.
  • The first character can be set in advance and, in a specific implementation, can be any value, for example X or Y, which is not limited in the embodiments of the present application.
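  • As a minimal illustrative sketch (assuming NumPy, 0-based row labels, and a hypothetical placeholder FIRST_CHAR = -1 standing in for the first character), the matrix B2 and the index matrix could be generated from a conditionally sparse matrix B1 roughly as follows; for the FIG. 4 example (k = 16, k1 = 16, k2 = 4), this yields a 4 x 16 matrix B2 and a 4 x 16 index matrix, consistent with the sizes described for FIG. 5 and FIG. 6.

        import numpy as np

        FIRST_CHAR = -1   # hypothetical "first character" marking padded (zero) slots

        def compress_columns(B1, k1, k2):
            # From every group of k1 consecutive rows of each column of B1 (k x n),
            # keep k2 slots containing all of the group's non-zeros; B2 is t x n and
            # the index matrix (t x n) records each kept element's row label in B1.
            k, n = B1.shape
            t = (k // k1) * k2
            B2 = np.zeros((t, n), dtype=B1.dtype)
            index = np.full((t, n), FIRST_CHAR, dtype=np.int64)
            for j in range(n):
                out = 0
                for g in range(k // k1):
                    rows = np.arange(g * k1, (g + 1) * k1)
                    nz = rows[B1[rows, j] != 0]       # row labels of the group's non-zeros
                    assert len(nz) <= k2              # conditional sparseness of B1
                    for r in nz:
                        B2[out, j] = B1[r, j]
                        index[out, j] = r
                        out += 1
                    out += k2 - len(nz)               # padded slots stay 0 / FIRST_CHAR
            return B2, index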
  • For example, matrix B1 may be as shown in FIG. 4 and matrix B2 as shown in FIG. 5. The row label in matrix B1 of each element of the first column of matrix B2 is taken, in order, as the first column of the index matrix, the row label in matrix B1 of each of the four elements of the second column of matrix B2 is taken, in order, as the second column of the index matrix, and so on, until the row label in matrix B1 of each element of the 16th column of matrix B2 is taken as the 16th column of the index matrix. The index matrix shown in FIG. 6 is obtained in this way. In this case, the index matrix is a matrix of 4 rows and 16 columns and indexes all the non-zero elements of the matrix B1.
  • Step 302 Generate n matrices A2 according to the index matrix and the matrix A1.
  • The n matrices A2 are all matrices of m rows and t columns. The n matrices A2 correspond one-to-one, in order, to the n columns of the index matrix, the t columns of each matrix A2 correspond one-to-one, in order, to the t elements of the corresponding column of the index matrix, and each column of each matrix A2 is the column of the matrix A1 indicated by the corresponding element of the index matrix. That is, for each of the t elements in each of the n columns of the index matrix, for example for the j-th element of the i-th column of the index matrix, the column of the matrix A1 indicated by that element is taken as the j-th column of the i-th matrix A2 of the n matrices A2, so that the n matrices A2 are obtained, where j is an integer greater than or equal to 1 and less than or equal to t.
  • In other words, the elements of t columns of the matrix A1 are read directly to form one matrix A2. The k columns of the matrix A1 may be stored at multiple addresses in the memory. Because the required data can be read from the memory in a single pass according to the index matrix, the memory access bandwidth required for reading the data is greatly reduced, and address conflicts that might otherwise occur when reading data from the memory are eliminated.
  • The column of the matrix A1 indicated by an element of the index matrix is the column of the matrix A1 whose column number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding column of the corresponding matrix A2 consists of m second characters.
  • That is, the column of the matrix A1 whose column number equals the j-th element of the i-th column of the index matrix is taken as the j-th column of the i-th matrix A2; when the j-th element of the i-th column of the index matrix is the first character, m second characters are taken as the j-th column of the i-th matrix A2.
  • The second character can be set in advance and, in a specific implementation, can be any value, for example 0, X, or any element of the matrix A1, which is not limited in the embodiments of the present application.
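  • As a minimal illustrative sketch (assuming NumPy, 0-based column numbers, the hypothetical FIRST_CHAR = -1 of the earlier sketch, and 0 chosen as the second character), the n matrices A2 could be gathered from the matrix A1 and the index matrix roughly as follows:

        import numpy as np

        FIRST_CHAR = -1    # hypothetical "first character" in the index matrix
        SECOND_CHAR = 0    # hypothetical "second character" used for padding columns

        def gather_A2(A1, index):
            # Column j of the i-th matrix A2 (m x t) is the column of A1 (m x k)
            # whose column number is index[j, i]; if that index element is the
            # first character, the column is filled with m second characters.
            m, k = A1.shape
            t, n = index.shape
            A2_list = []
            for i in range(n):
                A2 = np.full((m, t), SECOND_CHAR, dtype=A1.dtype)
                for j in range(t):
                    if index[j, i] != FIRST_CHAR:
                        A2[:, j] = A1[:, index[j, i]]   # one whole-column read of A1
                A2_list.append(A2)
            return A2_list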
  • For example, the index matrix may be as shown in FIG. 6 and the matrix A1 as shown in FIG. 7. The column of the matrix A1 indicated by the first element of the first column of the index matrix is taken as the first column of the first matrix A2, the column indicated by the second element of the first column of the index matrix as the second column of the first matrix A2, the column indicated by the third element as the third column, and the column indicated by the fourth element as the fourth column, so that the first matrix A2 is obtained; and so on, until the column of the matrix A1 indicated by the first element of the 16th column of the index matrix is taken as the first column of the 16th matrix A2, the column indicated by the second element of the 16th column as the second column of the 16th matrix A2, and so forth, until the 16th matrix A2 is obtained. The 16 matrices A2 are obtained in this way.
  • Step 303 Generate matrix C according to the n matrices A2 and matrix B2.
  • the matrix C is a matrix of m rows and n columns, and the matrix C is the product of the matrix A1 and the matrix B1.
  • The n columns of matrix C correspond one-to-one, in order, to the n matrices A2 and to the n columns of matrix B2, and each column of matrix C is the product of the corresponding matrix A2 and the corresponding column of matrix B2. That is, for each of the n columns of matrix C, for example for the i-th column, the product of the i-th matrix A2 and the i-th column of matrix B2 is taken as the i-th column of matrix C, so that matrix C is obtained.
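  • As a minimal illustrative sketch (continuing the assumptions and hypothetical names of the earlier sketches), step 303 amounts to n independent products, each of an m x t matrix A2 with one t-element column of B2, which is why the n products can run in parallel:

        import numpy as np

        def multiply_fig3(A2_list, B2):
            # Column i of C (m x n) is the product of the i-th matrix A2 (m x t)
            # and column i of B2 (length t); the n products are independent.
            t, n = B2.shape
            m = A2_list[0].shape[0]
            C = np.zeros((m, n), dtype=np.result_type(A2_list[0], B2))
            for i in range(n):
                C[:, i] = A2_list[i] @ B2[:, i]
            return C

  • Because the padded slots of a column of B2 are zero, the value chosen for the second character in the corresponding columns of A2 does not affect the result, which is consistent with the second character being allowed to take any value.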
  • In this way, the n matrices A2 can be read from the matrix A1 in a single pass according to the index matrix, and the n matrices A2 can then be multiplied by the n columns of the matrix B2 to obtain the matrix C. Because the matrix multiplication is completed with only one read of the data of the matrix A1, data reusability is maximized and processing resources are saved.
  • Moreover, when the n matrices A2 are multiplied by the n columns of the matrix B2 one by one, since the n matrices A2 have the same size, the n multiplications can be executed in parallel and completed in the same amount of time, which saves computation time and improves computational efficiency.
  • For example, the matrix B2 may be as shown in FIG. 5 and the 16 matrices A2 as shown in FIG. 8. The product of the first matrix A2 and the first column of the matrix B2 is taken as the first column of the matrix C, the product of the second matrix A2 and the second column of the matrix B2 as the second column of the matrix C, and so on, until the product of the 16th matrix A2 and the 16th column of the matrix B2 is taken as the 16th column of the matrix C, so that the matrix C shown in FIG. 9 is obtained; in this case, the matrix C is a matrix of 16 rows and 16 columns.
  • The embodiments of the present application introduce the concept of conditional sparseness into the matrix multiplication process and then perform the multiplication of matrix A1 and a matrix B1 that satisfies conditional sparseness in the above manner, which can greatly improve computational performance and efficiency. The performance improvement factor is the reciprocal of the conditional sparsity rate of the matrix B1; for example, if the conditional sparsity rate of the matrix B1 is 25%, the computational performance can be improved by a factor of four.
  • In the embodiments of the present application, the matrix A1, the matrix B2, and the index matrix are first obtained.
  • Then, n matrices A2 are generated. Because the required data can be read in a single pass, according to the index matrix, from the matrix A1 stored in the memory, the memory access bandwidth required for reading the data is greatly reduced and address conflicts that might occur when reading data from the memory are eliminated.
  • Since the n matrices A2 have the same size, when the n matrices A2 are multiplied by the n columns of matrix B2, the n multiplications can be executed in parallel and completed in the same amount of time, which saves computation time and improves computational efficiency.
  • The matrix multiplication is completed with only one read of the data of the matrix A1, so data reusability is maximized and processing resources are saved.
  • FIG. 10 is a flowchart of a method for performing matrix multiplication operation provided by an embodiment of the present application. Referring to FIG. 10, the method includes:
  • Step 1001 Obtain matrix B1, matrix A2 and index matrix.
  • matrix B1 is a matrix of k rows and n columns, and k and n are positive integers.
  • Matrix B1 can be the multiplier in the matrix multiplication of any operator (such as convolution or inner product) in a deep learning network, and matrix B1 can be a data matrix. Matrix B1 can also be the multiplier in a matrix-by-matrix computation in other applications, which is not limited in the embodiments of the present application.
  • matrix A2 is a matrix of m rows and t columns, m and t are both positive integers, and t is less than or equal to the k.
  • Matrix A2 contains all non-zero elements (valid data) in matrix A1.
  • Matrix A1 is a matrix with m rows and k columns.
  • Matrix A1 can be the multiplicand in the matrix multiplication of any operator in a deep learning network, and matrix A1 can be a parameter matrix. Matrix A1 can also be the multiplicand in a matrix-by-matrix computation in other applications, which is not limited in the embodiments of the present application.
  • the index matrix is a matrix of m rows and t columns.
  • The index matrix records the positions of the elements of the matrix A2 in the matrix A1; that is, it is an index into the matrix A1 that covers all non-zero elements of the matrix A1.
  • When acquiring the matrix B1, the matrix B1 may be read directly from the memory.
  • When acquiring the matrix A2 and the index matrix, they may be read directly from the memory; alternatively, the matrix A1 may be acquired first, and the matrix A2 and the index matrix may then be generated according to the matrix A1. Specifically, the matrix A2 and the index matrix can both be generated from the matrix A1, or the matrix A2 can be generated from the matrix A1 first and the index matrix then generated from the matrix A1 and the matrix A2. The specific generation algorithm is not limited, as long as the generated matrices meet the definition of each matrix.
  • When acquiring the matrix A1, the matrix A1 may be read directly from the memory.
  • The matrix A1 may be a matrix that satisfies conditional sparseness; that is, the elements of each of the m rows of the matrix A1 are divided, in order, into a preset number of groups, the number of elements in each group is k1, and the number of non-zero elements in each group is less than or equal to k2, where the preset number is k/k1 (also equal to t divided by k2), k is greater than or equal to k1 and k is divisible by k1, and k1 is greater than or equal to k2 and k1 is divisible by k2.
  • the matrix A1 can be obtained through neural network training.
  • the distribution of the parameters of operators such as convolution and inner product can be trained to meet the sparse distribution rule by controlling the training process of the deep learning network.
  • the matrix A1 can also be obtained in other ways, which is not limited in the embodiment of the present application.
  • The data types of the elements in matrix A1 and matrix B1 can be set in advance according to actual needs, for example integer, floating point, or any custom format, and the values of m, k, n, k1, k2, and the preset number can also be set in advance according to actual requirements, for example determined reasonably from the sparseness of the neural network and the computing power of the hardware, which is not limited in the embodiments of the present application.
  • For example, assume that matrix A1 is a matrix of 5 rows and 4 columns, the elements of each of the 5 rows of matrix A1 form a single group, the number of elements in each group is 4, and the number of non-zero elements in each group is less than or equal to 2. The matrix A1 may then be as shown in FIG. 11. In this case, every 4 consecutive elements in each of the 5 rows of matrix A1 are sparsified to no more than 2 non-zero elements, and the corresponding conditional sparsity rate is 50%.
  • The m rows of the matrix A2 correspond one-to-one, in order, to the m rows of the matrix A1, and the elements of each row of the matrix A2 include all the non-zero elements of the preset number of groups, in order, of the corresponding row of the matrix A1. That is, for each of the m rows of the matrix A1, for example for the i-th row, k2 elements containing all the non-zero elements are selected, in order, from each of the preset number of groups of the i-th row of the matrix A1 and used as the i-th row of the matrix A2, so that the matrix A2 is obtained, where i is an integer greater than or equal to 1 and less than or equal to m.
  • the preset value is 1, and the matrix A1 may be as shown in FIG. 11 at this time.
  • The m rows of the index matrix correspond one-to-one, in order, to the m rows of the matrix A2, and the elements of each row of the index matrix are the column labels, in the matrix A1, of the elements arranged in order in the corresponding row of the matrix A2. That is, for each of the m rows of the matrix A2, for example for the i-th row, the column label in the matrix A1 of each element of the i-th row of the matrix A2 is taken, in order, as the i-th row of the index matrix, so that the index matrix is obtained.
  • Conditional sparseness is introduced in the embodiments of the present application so that the number of non-zero elements in each of the m rows of matrix A1 is kept within a certain range, which effectively controls the range of the data index, effectively reduces the size of the index matrix, and ensures engineering feasibility.
  • In addition, the matrix multiplication can subsequently be completed using only this one index matrix, so fewer logic resources are consumed.
  • For any non-zero element in the matrix A2, its column label in the matrix A1 is the column number of the column to which that non-zero element belongs in the matrix A1.
  • For any element in the matrix A2 that is zero, its column label in the matrix A1 is the first character.
  • The introduction of the first character makes it possible to meet the requirement of matrix element alignment.
  • The first character can be set in advance and, in a specific implementation, can be any value, for example X or Y, which is not limited in the embodiments of the present application.
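  • As a minimal illustrative sketch (assuming NumPy, 0-based column labels, and the hypothetical placeholder FIRST_CHAR = -1 for the first character), the matrix A2 and the index matrix could be generated from a conditionally sparse matrix A1 roughly as follows; for the FIG. 11 example (k = 4, k1 = 4, k2 = 2), this yields a 5 x 2 matrix A2 and a 5 x 2 index matrix, consistent with the sizes described for FIG. 12 and FIG. 13.

        import numpy as np

        FIRST_CHAR = -1   # hypothetical "first character" marking padded (zero) slots

        def compress_rows(A1, k1, k2):
            # From every group of k1 consecutive elements of each row of A1 (m x k),
            # keep k2 slots containing all of the group's non-zeros; A2 is m x t and
            # the index matrix (m x t) records each kept element's column label in A1.
            m, k = A1.shape
            t = (k // k1) * k2
            A2 = np.zeros((m, t), dtype=A1.dtype)
            index = np.full((m, t), FIRST_CHAR, dtype=np.int64)
            for i in range(m):
                out = 0
                for g in range(k // k1):
                    cols = np.arange(g * k1, (g + 1) * k1)
                    nz = cols[A1[i, cols] != 0]       # column labels of the group's non-zeros
                    assert len(nz) <= k2              # conditional sparseness of A1
                    for c in nz:
                        A2[i, out] = A1[i, c]
                        index[i, out] = c
                        out += 1
                    out += k2 - len(nz)               # padded slots stay 0 / FIRST_CHAR
            return A2, index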
  • the matrix A1 may be as shown in FIG. 11 and the matrix A2 may be as shown in FIG. 12.
  • The column label in matrix A1 of each element of the first row of matrix A2 is taken, in order, as the first row of the index matrix, the column label in matrix A1 of each of the two elements of the second row of matrix A2 is taken, in order, as the second row of the index matrix, and so on, until the column label in matrix A1 of each element of the fifth row of matrix A2 is taken as the fifth row of the index matrix. The index matrix shown in FIG. 13 is obtained in this way. In this case, the index matrix is a matrix of 5 rows and 2 columns and indexes all the non-zero elements of the matrix A1.
  • Step 1002 Generate m matrices B2 according to the index matrix and the matrix B1.
  • The m matrices B2 are all matrices of t rows and n columns. The m matrices B2 correspond one-to-one, in order, to the m rows of the index matrix, the t rows of each matrix B2 correspond one-to-one, in order, to the t elements of the corresponding row of the index matrix, and each row of each matrix B2 is the row of the matrix B1 indicated by the corresponding element of the index matrix. That is, for each of the t elements in each of the m rows of the index matrix, for example for the j-th element of the i-th row of the index matrix, the row of the matrix B1 indicated by that element is taken as the j-th row of the i-th matrix B2 of the m matrices B2, so that the m matrices B2 are obtained, where j is an integer greater than or equal to 1 and less than or equal to t.
  • In other words, the elements of t rows of the matrix B1 are read directly to form one matrix B2. The k rows of the matrix B1 may be stored at multiple addresses in the memory. Because the required data can be read from the memory in a single pass according to the index matrix, the memory access bandwidth required for reading the data is greatly reduced, and address conflicts that might otherwise occur when reading data from the memory are eliminated.
  • The row of the matrix B1 indicated by an element of the index matrix is the row of the matrix B1 whose row number equals that element of the index matrix; when an element of the index matrix is the first character, the corresponding row of the corresponding matrix B2 consists of n second characters.
  • That is, the row of the matrix B1 whose row number equals the j-th element of the i-th row of the index matrix is taken as the j-th row of the i-th matrix B2; when the j-th element of the i-th row of the index matrix is the first character, n second characters are taken as the j-th row of the i-th matrix B2.
  • The second character can be set in advance and, in a specific implementation, can be any value, for example 0, X, or any element of the matrix B1, which is not limited in the embodiments of the present application.
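  • As a minimal illustrative sketch (assuming NumPy, 0-based row numbers, the hypothetical FIRST_CHAR = -1 of the earlier sketch, and 0 chosen as the second character), the m matrices B2 could be gathered from the matrix B1 and the index matrix roughly as follows:

        import numpy as np

        FIRST_CHAR = -1    # hypothetical "first character" in the index matrix
        SECOND_CHAR = 0    # hypothetical "second character" used for padding rows

        def gather_B2(B1, index):
            # Row j of the i-th matrix B2 (t x n) is the row of B1 (k x n)
            # whose row number is index[i, j]; if that index element is the
            # first character, the row is filled with n second characters.
            k, n = B1.shape
            m, t = index.shape
            B2_list = []
            for i in range(m):
                B2 = np.full((t, n), SECOND_CHAR, dtype=B1.dtype)
                for j in range(t):
                    if index[i, j] != FIRST_CHAR:
                        B2[j, :] = B1[index[i, j], :]   # one whole-row read of B1
                B2_list.append(B2)
            return B2_list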
  • the index matrix may be as shown in FIG. 13, and the matrix B1 may be as shown in FIG. 14.
  • The row of the matrix B1 indicated by the first element of the first row of the index matrix is taken as the first row of the first matrix B2, and the row indicated by the second element of the first row of the index matrix as the second row of the first matrix B2, so that the first matrix B2 is obtained; and so on, until the row of the matrix B1 indicated by the first element of the fifth row of the index matrix is taken as the first row of the fifth matrix B2 and the row indicated by the second element of the fifth row of the index matrix as the second row of the fifth matrix B2, so that the fifth matrix B2 is obtained. The five matrices B2 shown in FIG. 15, each of 2 rows and 3 columns, are obtained in this way.
  • Step 1003 Generate matrix C according to matrix A2 and m matrices B2.
  • the matrix C is a matrix of m rows and n columns, and the matrix C is the product of the matrix A1 and the matrix B1.
  • The m rows of matrix C correspond one-to-one, in order, to the m rows of matrix A2 and to the m matrices B2, and each row of matrix C is the product of the corresponding row of matrix A2 and the corresponding matrix B2. That is, for each of the m rows of matrix C, for example for the i-th row, the product of the i-th row of matrix A2 and the i-th matrix B2 is taken as the i-th row of matrix C, so that matrix C is obtained.
  • In this way, the m matrices B2 can be read from the matrix B1 in a single pass according to the index matrix, and the m rows of the matrix A2 can then be multiplied by the m matrices B2 to obtain the matrix C. Because the matrix multiplication is completed with only one read of the data of the matrix B1, data reusability is maximized and processing resources are saved.
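  • As a minimal illustrative sketch (continuing the assumptions and hypothetical names of the earlier sketches), step 1003 amounts to m independent products, each of one t-element row of A2 with a t x n matrix B2, which is why the m products can run in parallel:

        import numpy as np

        def multiply_fig10(A2, B2_list):
            # Row i of C (m x n) is the product of row i of A2 (length t) and
            # the i-th matrix B2 (t x n); the m products are independent.
            m, t = A2.shape
            n = B2_list[0].shape[1]
            C = np.zeros((m, n), dtype=np.result_type(A2, B2_list[0]))
            for i in range(m):
                C[i, :] = A2[i, :] @ B2_list[i]
            return C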
  • For example, the matrix A2 may be as shown in FIG. 12 and the five matrices B2 as shown in FIG. 15. The product of the first row of the matrix A2 and the first matrix B2 is taken as the first row of the matrix C, the product of the second row of the matrix A2 and the second matrix B2 as the second row of the matrix C, and so on, until the product of the fifth row of the matrix A2 and the fifth matrix B2 is taken as the fifth row of the matrix C, so that the matrix C shown in FIG. 16 is obtained; in this case, the matrix C is a matrix of 5 rows and 3 columns.
  • this embodiment of the present application introduces the concept of conditional sparsity into the matrix multiplication operation, and then performs the multiplication of the conditionally sparse matrix A1 and the matrix B1 in the above manner, which can greatly improve computational performance. The performance improvement factor is the reciprocal of the conditional sparsity rate of matrix A1; for example, if the conditional sparsity rate of matrix A1 is 50%, the computational performance can be improved by a factor of two. (A small sketch of this check is given below.)
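  • As an informal aid (not part of the original text), the following check verifies the row-wise conditional-sparsity constraint on A1 and computes the stated performance factor. It assumes NumPy and that the conditional sparsity rate equals k2/k1, as in the worked examples.

```python
import numpy as np

def conditional_sparsity_ok(a1: np.ndarray, k1: int, k2: int) -> bool:
    """True if every group of k1 consecutive elements in each row of A1
    contains at most k2 nonzero elements."""
    m, k = a1.shape
    assert k % k1 == 0 and k1 % k2 == 0
    groups = a1.reshape(m, k // k1, k1)   # split each row into k/k1 groups of k1 elements
    return bool(np.all(np.count_nonzero(groups, axis=2) <= k2))

def performance_factor(k1: int, k2: int) -> float:
    """The conditional sparsity rate is k2/k1; the stated gain is its
    reciprocal, e.g. k1=2, k2=1 gives a 2x improvement."""
    return k1 / k2
```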
  • in this embodiment of the present application, the matrix B1, the matrix A2, and the index matrix are obtained. Then, according to the index matrix and the matrix B1, m matrices B2 are generated. Since the required data can be read in one pass from the matrix B1 stored in the memory according to the index matrix, the memory access bandwidth required for reading the data can be greatly reduced, and the address conflicts that may occur when reading data from the memory can be eliminated.
  • finally, the matrix C is generated from the matrix A2 and the m matrices B2. Because the m matrices B2 have the same size, when the m rows of matrix A2 are multiplied by the m matrices B2, the m multiplication operations can be executed in parallel and completed within the same time, thereby saving operation time and improving operation efficiency.
  • in this embodiment, the matrix multiplication operation can be completed by reading data from matrix B1 only once, so data reuse is maximized and processing resources are saved.
  • FIG. 17 is a schematic structural diagram of a circuit for performing matrix multiplication operations provided by an embodiment of the present application.
  • the circuit for performing matrix multiplication can be implemented by a field-programmable gate array (Field-Programmable Gate Array, FPGA), an ASIC, or the like.
  • the circuit for performing matrix multiplication includes an acquisition circuit 1701, a data selection circuit 1702, and a calculation unit array 1703.
  • the process by which this circuit implements the matrix multiplication method provided in the embodiment of FIG. 3 may include the following steps (1)-(3):
  • (1) The acquisition circuit 1701 acquires the matrix A1, the matrix B2, and the index matrix.
  • referring to FIG. 18, the circuit for performing matrix multiplication may further include a first memory 1704, which is used to store the matrix A1, the matrix B2, and the index matrix. In this case, the acquisition circuit 1701 may read the matrix A1, the matrix B2, and the index matrix from the first memory 1704. Alternatively, the acquisition circuit 1701 may first acquire the matrix A1 and the matrix B1, and then generate the matrix B2 and the index matrix from the matrix B1 (a sketch of this generation step is given below).
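  • The following sketch (illustrative only, not part of the original text) shows one way to generate the matrix B2 and the index matrix from a conditionally sparse matrix B1 by compressing each column group-wise; NumPy, 0-based row numbers, and -1 as the first character are assumptions.

```python
import numpy as np

FIRST_CHAR = -1  # assumed placeholder for the "first character"

def compress_b1(b1: np.ndarray, k1: int, k2: int):
    """Column-wise compression of a conditionally sparse B1 (k x n).

    Each column of B1 is split, in order, into k/k1 groups of k1 elements;
    from each group, k2 elements containing all of its nonzeros are kept.
    Returns (b2, index_matrix), both t x n with t = (k // k1) * k2, where
    index_matrix holds each kept element's row label in B1.
    """
    k, n = b1.shape
    t = (k // k1) * k2
    b2 = np.zeros((t, n), dtype=b1.dtype)
    index_matrix = np.full((t, n), FIRST_CHAR, dtype=int)
    for col in range(n):
        out = 0
        for g in range(k // k1):
            base = g * k1
            rows = [base + r for r in range(k1) if b1[base + r, col] != 0]
            assert len(rows) <= k2, "B1 is not conditionally sparse"
            for slot in range(k2):  # padded slots keep a zero value and FIRST_CHAR label
                if slot < len(rows):
                    b2[out, col] = b1[rows[slot], col]
                    index_matrix[out, col] = rows[slot]
                out += 1
    return b2, index_matrix
```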
  • (2) The data selection circuit 1702 generates n matrices A2 based on the index matrix and the matrix A1.
  • (3) The calculation unit array 1703 generates the matrix C based on the n matrices A2 and the matrix B2.
  • as shown in FIG. 19, the calculation unit array 1703 includes a plurality of three-dimensional calculation units, which may be distributed over m rows and n columns. Each three-dimensional calculation unit includes a plurality of multiplication units and addition units; for example, a three-dimensional calculation unit may be a multiply-and-accumulate (MAC) unit. One three-dimensional calculation unit can be used to compute the product of one row of one matrix A2 and one column of matrix B2, and one column of three-dimensional calculation units (m units) can be used to compute the product of one matrix A2 and one column of matrix B2; that is, one column of three-dimensional calculation units computes the elements of one column of matrix C, so the n columns of three-dimensional calculation units compute the n columns of matrix C, and matrix C is obtained. (A software analogue is sketched below.)
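  • The sketch below is a rough software analogue of this arrangement for the FIG. 3 embodiment (it is not a hardware description and not part of the original text): each loop iteration stands for one column of MAC units producing one column of matrix C. NumPy is assumed.

```python
import numpy as np

def compute_c_columnwise(a2_list: list, b2: np.ndarray) -> np.ndarray:
    """Column i of C is the product of the i-th matrix A2 and column i of B2.

    a2_list: n matrices A2, each m x t.
    b2:      t x n matrix.
    """
    n = b2.shape[1]
    m = a2_list[0].shape[0]
    c = np.empty((m, n), dtype=b2.dtype)
    for i in range(n):
        # One column of m MAC units: an (m x t) by (t x 1) product.
        c[:, i] = a2_list[i] @ b2[:, i]
    return c
```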
  • in addition, after the calculation unit array 1703 obtains the matrix C, it may also save the matrix C in a register set, which may be included in the first memory 1704 or in another memory; this is not limited in the embodiment of the present application.
  • the process by which this circuit implements the matrix multiplication method provided in the embodiment of FIG. 10 may include the following steps (4)-(6):
  • (4) The acquisition circuit 1701 acquires the matrix B1, the matrix A2, and the index matrix.
  • referring to FIG. 18, the circuit for performing matrix multiplication may further include a first memory 1704, which is used to store the matrix B1, the matrix A2, and the index matrix. In this case, the acquisition circuit 1701 may read the matrix B1, the matrix A2, and the index matrix from the first memory 1704. Alternatively, the acquisition circuit 1701 may first acquire the matrix A1 and the matrix B1, and then generate the matrix A2 and the index matrix from the matrix A1 (a sketch of this row-wise generation step is given below).
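  • The sketch below (illustrative only, not part of the original text) is the row-wise counterpart of the earlier column-wise compression: it generates the matrix A2 and the index matrix from a conditionally sparse matrix A1. NumPy, 0-based column numbers, and -1 as the first character are assumptions.

```python
import numpy as np

FIRST_CHAR = -1  # assumed placeholder for the "first character"

def compress_a1(a1: np.ndarray, k1: int, k2: int):
    """Row-wise compression of a conditionally sparse A1 (m x k).

    Each row of A1 is split, in order, into k/k1 groups of k1 elements;
    from each group, k2 elements containing all of its nonzeros are kept.
    Returns (a2, index_matrix), both m x t with t = (k // k1) * k2, where
    index_matrix holds each kept element's column label in A1.
    """
    m, k = a1.shape
    t = (k // k1) * k2
    a2 = np.zeros((m, t), dtype=a1.dtype)
    index_matrix = np.full((m, t), FIRST_CHAR, dtype=int)
    for row in range(m):
        out = 0
        for g in range(k // k1):
            base = g * k1
            cols = [base + c for c in range(k1) if a1[row, base + c] != 0]
            assert len(cols) <= k2, "A1 is not conditionally sparse"
            for slot in range(k2):  # padded slots keep a zero value and FIRST_CHAR label
                if slot < len(cols):
                    a2[row, out] = a1[row, cols[slot]]
                    index_matrix[row, out] = cols[slot]
                out += 1
    return a2, index_matrix
```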
  • (5) The data selection circuit 1702 generates m matrices B2 based on the index matrix and the matrix B1.
  • (6) The calculation unit array 1703 generates the matrix C based on the matrix A2 and the m matrices B2.
  • as shown in FIG. 19, the calculation unit array 1703 includes a plurality of three-dimensional calculation units, which may be distributed over m rows and n columns. Each three-dimensional calculation unit includes a plurality of multiplication units and addition units; for example, a three-dimensional calculation unit may be a MAC unit. One three-dimensional calculation unit can be used to compute the product of one row of matrix A2 and one column of one matrix B2, and one row of three-dimensional calculation units (n units) can be used to compute the product of one row of matrix A2 and one matrix B2; that is, one row of three-dimensional calculation units computes the elements of one row of matrix C, so the m rows of three-dimensional calculation units compute the m rows of matrix C, and matrix C is obtained.
  • in addition, after the calculation unit array 1703 obtains the matrix C, it may also save the matrix C in a register set, which may be included in the first memory 1704 or in another memory; this is not limited in the embodiment of the present application.
  • An SOC provided by an embodiment of the present application may include the circuit for performing matrix multiplication operations described in the foregoing embodiments, and may include other components in addition to it.
  • FIG. 20 is a schematic structural diagram of an SOC provided by an embodiment of the present application.
  • referring to FIG. 20, the SOC includes a processor 2001 (in some applications also referred to as a processing core or CPU, for example a processing core based on the ARM architecture), a second memory 2002, an interconnect bus 2003, and a circuit 2004 for performing matrix multiplication.
  • the circuit 2004 for performing matrix multiplication may be the circuit for performing matrix multiplication described in the above embodiment.
  • the processor 2001 is used to control the circuit 2004 that performs matrix multiplication, for example, to send required data, or to receive the result of the operation performed by the circuit 2004 that performs matrix multiplication.
  • the data stored in the second memory 2002 is the same as the data stored in the first memory 1704; that is, it is used to store the matrix A1, the matrix B2, and the index matrix, or to store the matrix B1, the matrix A2, and the index matrix.
  • the first memory 1704 may be a RAM or the like, and the second memory 2002 may be a double data rate synchronous dynamic random access memory (Double Data Rate, DDR) or the like.
  • specifically, when the SOC performs a matrix multiplication operation, the processor 2001 starts the circuit 2004 for performing matrix multiplication through the interconnect bus 2003.
  • in one possible case, the circuit 2004 performs the matrix multiplication operation directly based on the data stored in the second memory 2002. Specifically, after the circuit 2004 is started, its acquisition circuit 1701 reads the data from the second memory 2002 through the interconnect bus 2003 (reading the matrix A1, the matrix B2, and the index matrix, or the matrix B1, the matrix A2, and the index matrix); after that, the data selection circuit 1702 and the calculation unit array 1703 in the circuit 2004 complete the matrix multiplication operation based on the data read by the acquisition circuit 1701, and return the operation result to the second memory 2002.
  • in another possible case, the circuit 2004 performs the matrix multiplication operation based on the data stored in the first memory 1704. Specifically, when the matrix A1, the matrix B2, and the index matrix (or the matrix B1, the matrix A2, and the index matrix) have not yet been stored in the first memory 1704, the circuit 2004, after being started, reads the data from the second memory 2002 through the interconnect bus 2003 and then stores the data read from the second memory 2002 into the first memory 1704.
  • after that, the acquisition circuit 1701 in the circuit 2004 reads the data from the first memory 1704 (the matrix A1, the matrix B2, and the index matrix, or the matrix B1, the matrix A2, and the index matrix), and then the data selection circuit 1702 and the calculation unit array 1703 in the circuit 2004 complete the matrix multiplication operation based on the data read by the acquisition circuit 1701 and return the operation result to the first memory 1704 and/or the second memory 2002. (The two data paths are sketched below.)
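  • The following sketch (a software analogy only, not part of the original text) contrasts the two data paths. The dict-based memories, the operand names, and the `circuit` callable standing in for the data selection circuit plus calculation unit array are all illustrative assumptions.

```python
from typing import Callable, Dict
import numpy as np

Matmul = Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]

def run_from_second_memory(second_memory: Dict[str, np.ndarray], circuit: Matmul) -> None:
    """Path 1: operands are read directly from the second memory (e.g. DDR)
    over the interconnect bus, and the result is written back to it."""
    c = circuit(second_memory["A2"], second_memory["index"], second_memory["B1"])
    second_memory["C"] = c

def run_via_first_memory(second_memory: Dict[str, np.ndarray],
                         first_memory: Dict[str, np.ndarray],
                         circuit: Matmul) -> None:
    """Path 2: operands are first staged from the second memory into the first
    memory (e.g. on-chip RAM); the result may go to either or both memories."""
    for key in ("A2", "index", "B1"):
        first_memory[key] = second_memory[key]
    c = circuit(first_memory["A2"], first_memory["index"], first_memory["B1"])
    first_memory["C"] = c
    second_memory["C"] = c
```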
  • all or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented fully or partially in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer instructions are loaded and executed on a computer, all or some of the processes or functions described in the embodiments of the present application are produced.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over a coaxial cable, an optical fiber, or a digital subscriber line (Digital Subscriber Line, DSL)) or in a wireless manner (for example, via infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the available media may be magnetic media (for example, a floppy disk, a hard disk, or a magnetic tape), optical media (for example, a Digital Versatile Disc (DVD)), semiconductor media (for example, a Solid State Disk (SSD)), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Stabilization Of Oscillater, Synchronisation, Frequency Synthesizers (AREA)

Abstract

This application discloses a method for performing a matrix multiplication operation, and belongs to the field of data processing. The method includes: obtaining a matrix A1, a matrix B2, and an index matrix; generating n matrices A2 according to the index matrix and the matrix A1, where the n matrices A2 correspond one-to-one, in order, to the n columns of the index matrix, the t columns of each matrix A2 correspond one-to-one, in order, to the t elements of the corresponding column of the index matrix, and each column of each matrix A2 is the column of the matrix A1 indicated by the corresponding element of the index matrix; and generating a matrix C according to the n matrices A2 and the matrix B2, where the n columns of the matrix C correspond one-to-one, in order, to the n matrices A2 and to the n columns of the matrix B2, and each column of the matrix C is the product of the corresponding matrix A2 and the corresponding column of the matrix B2. In this application, the matrix multiplication operation can be completed by reading data from the matrix A1 only once, thereby maximizing data reuse.

Description

执行矩阵乘法运算的方法、电路及SOC
本申请要求于2018年11月20日提交的申请号为201811384503.8、发明名称为“执行矩阵乘法运算的方法、电路及SOC”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理领域,特别涉及一种执行矩阵乘法运算的方法、电路及片上系统(System on Chip,SOC)。
背景技术
人工智能技术广泛应用于终端、边缘侧、云端等,用来实现图像识别、目标检测、语音翻译等功能,人工智能技术往往通过深度学习网络实现。对于深度学习网络,其中较影响性能、计算量较大的算子,如卷积、内积(inner product)等算子的计算量占比可达99%,这些算子均可以展开成矩阵乘矩阵的运算。当然,矩阵作为一种常见的数据表达形式,矩阵乘矩阵的运算也常常应用于其它领域中。
目前,在执行矩阵乘法运算时,往往是将矩阵乘矩阵的运算拆分成向量乘矩阵的运算。假设矩阵A是m行k列的矩阵,矩阵B是k行n列的矩阵,m、k和n均为正整数,在将矩阵A乘以矩阵B时,是将矩阵A的m行依次乘以矩阵B。其中,将矩阵A中的某一行乘以矩阵B时,n个处理元件(Process Element,PE)一一获取矩阵B的n列的索引模块(用来定位非零元素),然后该n个PE中的每个PE根据其获取到的矩阵B中的一列的索引模块从矩阵A中的这一行中读取数据,并将读取到的数据乘以矩阵B中的这一列。
上述运算方式中,由于是将矩阵A的m行依次乘以矩阵B,所以整个运算过程需要耗费较长时间,运算效率较低。其次,由于矩阵B的每一列中非零元素的分布没有规律,所以每个PE根据矩阵B中的一列的索引模块从矩阵A中的一行中读取数据时,是读取矩阵A中的这一行中分布较为离散的多个数据,而由于矩阵A往往是分散地存储到存储器中的多个地址上,所以如果PE是一次性从矩阵A中读取数据,则很有可能会因同时读取存储器中的多个地址上的数据而出现地址冲突,导致无法读取到数据。再者,由于在将矩阵A的m行中的每一行乘以矩阵B时均需读取一次矩阵B,整个运算过程共需读取m次矩阵B,所以导致数据复用性较低,消耗的处理资源较多。
发明内容
本申请提供了一种执行矩阵乘法运算的方法、电路及SOC,可以解决相关技术中矩阵乘法运算的运算效率较低、读取数据时地址冲突且数据复用性较低的问题。所述技术方案如下:
第一方面,提供了一种执行矩阵乘法运算的方法,所述方法包括:获取矩阵A1、矩阵B2和索引矩阵,所述矩阵A1是m行k列的矩阵,所述矩阵B2是t行n列的矩阵,所述索 引矩阵是t行n列的矩阵,所述m、所述k、所述t和所述n均为正整数,所述t小于或等于所述k;根据所述索引矩阵和所述矩阵A1,生成n个矩阵A2,所述n个矩阵A2均是m行t列的矩阵,所述n个矩阵A2与所述索引矩阵的n列按顺序一一对应,每个矩阵A2的t列与对应的所述索引矩阵中的一列的t个元素按顺序一一对应,所述每个矩阵A2的每一列为对应的所述索引矩阵中的一个元素在所述矩阵A1中指示的一列;根据所述n个矩阵A2和所述矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的n列与所述n个矩阵A2按顺序一一对应,所述矩阵C的n列与所述矩阵B2的n列按顺序一一对应,所述矩阵C的每一列为对应的一个矩阵A2与对应的所述矩阵B2中的一列的乘积。
需要说明的是,矩阵B2中包含矩阵B1中所有的非零元素(有效数据)。索引矩阵为矩阵B2中的元素在矩阵B1中的索引,即包括矩阵B1中所有的非零元素在矩阵B1中的索引。
在本申请实施例中,根据索引矩阵可以一次性从矩阵A1中读取出n个矩阵A2,继而可以将该n个矩阵A2一一乘以矩阵B2的n列,来得到矩阵C。由于只需从矩阵A1中读取一次数据就能够完成矩阵乘法运算,所以可以实现数据复用性的最大化,节省处理资源。并且,将该n个矩阵A2一一乘以矩阵B2的n列时,由于该n个矩阵A2的大小相同,所以该n个矩阵A2与矩阵B2的n列的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。
其中,所述获取矩阵B2和索引矩阵,包括:获取矩阵B1,所述矩阵B1是k行n列的矩阵,所述矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;根据所述矩阵B1,生成所述矩阵B2,所述矩阵B2的n列与所述矩阵B1的n列按顺序一一对应,所述矩阵B2的每一列的元素包括对应的所述矩阵B1中的一列中按顺序排列的预设数值个组中所有的非零元素;根据所述矩阵B1,生成所述索引矩阵,所述索引矩阵的n列与所述矩阵B2的n列按顺序一一对应,所述索引矩阵的每一列的元素为对应的所述矩阵B2中的一列中按顺序排列的所有元素在所述矩阵B1中的行标。
需要说明的是,矩阵B1为满足条件稀疏的矩阵,矩阵B1可以通过神经网络训练得到,例如,在深度学习场景下,可以通过控制深度学习网络的训练过程,将卷积、Inner product等算子的参数的分布训练成符合条件稀疏的分布规律,以得到满足条件稀疏的参数矩阵作为矩阵B1。
在本申请实施例中,将矩阵B1的n列中每一列的非零元素的数量均控制在一定的范围内,如此可以有效控制数据索引的范围,从而可以有效降低索引矩阵的规模,保证工程上的可实现性。并且,本申请实施例中仅使用这一个索引矩阵就可以完成矩阵乘法运算,因而消耗的逻辑资源较少。
其中,对于所述矩阵B2中的任意一个非零元素,所述非零元素在所述矩阵B1中的行标是所述非零元素在所述矩阵B1中所属的一行的行号;对于所述矩阵B2中的任意一个为零的元素,所述一个为零的元素在所述矩阵B1中的行标是第一字符。
其中,对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵A1中指示的一列是所述矩阵A1的所有列中列号为所述索引矩阵中的一个元素的一列;当所述索引矩阵中的一个元素是所述第一字符 时,所述索引矩阵中的一个元素对应的一个矩阵A2中的一列的元素是m个第二字符。
需要说明的是,引入第一字符和第二字符可以满足矩阵元素对齐的要求。
在本申请实施例中,是根据索引矩阵的每一列的t个元素,直接读取矩阵A1中t列的元素来构成一个矩阵A2。这种情况下,由于所要读取的数据在矩阵A1中的分布比较规则且集中,所以在将矩阵A1存储到存储器中时,可以将矩阵A1的k列分别存储到存储器中的多个地址上,这种情况下,根据索引矩阵可以一次性从存储器中读取到所需的数据,从而不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。
第二方面,提供了一种执行矩阵乘法运算的方法,所述方法包括:获取矩阵B1、矩阵A2和索引矩阵,所述矩阵B1是k行n列的矩阵,所述矩阵A2是m行t列的矩阵,所述索引矩阵是m行t列的矩阵,所述k、所述n、所述m和所述t均为正整数,所述t小于或等于所述k;根据所述索引矩阵和所述矩阵B1,生成m个矩阵B2,所述m个矩阵B2均是t行n列的矩阵,所述m个矩阵B2与所述索引矩阵的m行按顺序一一对应,每个矩阵B2的t行与对应的所述索引矩阵中的一行的t个元素按顺序一一对应,所述每个矩阵B2的每一行为对应的所述索引矩阵中的一个元素在所述矩阵B1中指示的一行;根据所述矩阵A2和所述m个矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的m行与所述矩阵A2的m行按顺序一一对应,所述矩阵C的m行与所述m个矩阵B2按顺序一一对应,所述矩阵C的每一行为对应的所述矩阵A2中的一行与对应的一个矩阵B2的乘积。
需要说明的是,矩阵A2中包含矩阵A1中所有的非零元素(有效数据)。索引矩阵为矩阵A2中的元素在矩阵A1中的索引,即包括矩阵A1中所有的非零元素在矩阵A1中的索引。
在本申请实施例中,根据索引矩阵可以一次性从矩阵B1中读取出m个矩阵B2,继而可以将矩阵A2的m行一一乘以该m个矩阵B2,来得到矩阵C。由于只需从矩阵B1中读取一次数据就能够完成矩阵乘法运算,所以可以实现数据复用性的最大化,节省处理资源。并且,将矩阵A2的m行一一乘以该m个矩阵B2时,由于该m个矩阵B2的大小相同,所以矩阵A2的m行与该m个矩阵B2的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。
其中,所述获取矩阵A2和索引矩阵,包括:获取矩阵A1,所述矩阵A1是m行k列的矩阵,所述矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;根据所述矩阵A1,生成所述矩阵A2,所述矩阵A2的m行与所述矩阵A1的m行按顺序一一对应,所述矩阵A2的每一行的元素包括对应的所述矩阵A1中的一行中按顺序排列的预设数值个组中所有的非零元素;根据所述矩阵A1,生成所述索引矩阵,所述索引矩阵的m行与所述矩阵A2的m行按顺序一一对应,所述索引矩阵的每一行的元素为对应的所述矩阵A2中的一行中按顺序排列的所有元素在所述矩阵A1中的列标。
需要说明的是,矩阵A1为满足条件稀疏的矩阵,矩阵A1可以通过神经网络训练得到,例如,在深度学习场景下,可以通过控制深度学习网络的训练过程,将卷积、Inner product等算子的参数的分布训练成符合条件稀疏的分布规律,以得到满足条件稀疏的参数矩阵作为 矩阵A1。
在本申请实施例中,将矩阵A1的m行中每一行的非零元素的数量均控制在一定的范围内,如此可以有效控制数据索引的范围,从而可以有效降低索引矩阵的规模,保证工程上的可实现性。并且,本申请实施例中仅使用这一个索引矩阵就可以完成矩阵乘法运算,因而消耗的逻辑资源较少。
其中,对于所述矩阵A2中的任意一个非零元素,所述非零元素在所述矩阵A1中的列标是所述非零元素在所述矩阵A1中所属的一列的列号;对于所述矩阵A2中的任意一个为零的元素,所述一个为零的元素元素在所述矩阵A1中的列标是第一字符。
其中,对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵B1中指示的一行是所述矩阵B1的所有行中行号为所述索引矩阵中的一个元素的一行;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵B2中的一行的元素是n个第二字符。
需要说明的是,引入第一字符和第二字符可以满足矩阵元素对齐的要求。
在本申请实施例中,根据索引矩阵的每一行的t个元素,直接读取矩阵B1中t行的元素来构成一个矩阵B2。这种情况下,由于所要读取的数据在矩阵B1中的分布比较规则且集中,所以在将矩阵B1存储到存储器中时,可以将矩阵B1的k行分别存储到存储器中的多个地址上,这种情况下,根据索引矩阵可以一次性从存储器中读取到所需的数据,从而不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。
第三方面,提供了一种执行矩阵乘法运算的电路,所述电路包括:
获取电路,用于获取矩阵A1、矩阵B2和索引矩阵,所述矩阵A1是m行k列的矩阵,所述矩阵B2是t行n列的矩阵,所述索引矩阵是t行n列的矩阵,所述m、所述k、所述t和所述n均为正整数,所述t小于或等于所述k;
数据选择电路,用于根据所述索引矩阵和所述矩阵A1,生成n个矩阵A2,所述n个矩阵A2均是m行t列的矩阵,所述n个矩阵A2与所述索引矩阵的n列按顺序一一对应,每个矩阵A2的t列与对应的所述索引矩阵中的一列的t个元素按顺序一一对应,所述每个矩阵A2的每一列为对应的所述索引矩阵中的一个元素在所述矩阵A1中指示的一列;
计算单元阵列,用于根据所述n个矩阵A2和所述矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的n列与所述n个矩阵A2按顺序一一对应,所述矩阵C的n列与所述矩阵B2的n列按顺序一一对应,所述矩阵C的每一列为对应的一个矩阵A2与对应的所述矩阵B2中的一列的乘积。
其中,所述获取电路用于获取所述矩阵B2和所述索引矩阵时,具体用于:
获取矩阵B1,所述矩阵B1是k行n列的矩阵,所述矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
根据所述矩阵B1,生成所述矩阵B2,所述矩阵B2的n列与所述矩阵B1的n列按顺序一一对应,所述矩阵B2的每一列的元素包括对应的所述矩阵B1中的一列中按顺序排列的预 设数值个组中所有的非零元素;
根据所述矩阵B1,生成所述索引矩阵,所述索引矩阵的n列与所述矩阵B2的n列按顺序一一对应,所述索引矩阵的每一列的元素为对应的所述矩阵B2中的一列中按顺序排列的所有元素在所述矩阵B1中的行标。
需要说明的是,对于所述矩阵B2中的任意一个非零元素,所述非零元素在所述矩阵B1中的行标是所述非零元素在所述矩阵B1中所属的一行的行号;对于所述矩阵B2中的任意一个为零的元素,所述一个为零的元素在所述矩阵B1中的行标是第一字符。
另外,对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵A1中指示的一列是所述矩阵A1的所有列中列号为所述索引矩阵中的一个元素的一列;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵A2中的一列的元素是m个第二字符。
再者,所述矩阵B1是通过神经网络训练得到。
进一步地,所述电路还包括第一存储器,所述第一存储器用于存储所述矩阵A1、所述矩阵B2和所述索引矩阵;相应地,所述获取电路用于:从所述第一存储器中读取所述矩阵A1、所述矩阵B2和所述索引矩阵。
第四方面,提供了一种执行矩阵乘法运算的电路,所述电路包括:
获取电路,用于获取矩阵B1、矩阵A2和索引矩阵,所述矩阵B1是k行n列的矩阵,所述矩阵A2是m行t列的矩阵,所述索引矩阵是m行t列的矩阵,所述k、所述n、所述m和所述t均为正整数,所述t小于或等于所述k;
数据选择电路,用于根据所述索引矩阵和所述矩阵B1,生成m个矩阵B2,所述m个矩阵B2均是t行n列的矩阵,所述m个矩阵B2与所述索引矩阵的m行按顺序一一对应,每个矩阵B2的t行与对应的所述索引矩阵中的一行的t个元素按顺序一一对应,所述每个矩阵B2的每一行为对应的所述索引矩阵中的一个元素在所述矩阵B1中指示的一行;
计算单元阵列,用于根据所述矩阵A2和所述m个矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的m行与所述矩阵A2的m行按顺序一一对应,所述矩阵C的m行与所述m个矩阵B2按顺序一一对应,所述矩阵C的每一行为对应的所述矩阵A2中的一行与对应的一个矩阵B2的乘积。
其中,所述获取电路用于获取所述矩阵A2和所述索引矩阵时,具体用于:
获取矩阵A1,所述矩阵A1是m行k列的矩阵,所述矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
根据所述矩阵A1,生成所述矩阵A2,所述矩阵A2的m行与所述矩阵A1的m行按顺序一一对应,所述矩阵A2的每一行的元素包括对应的所述矩阵A1中的一行中按顺序排列的预设数值个组中所有的非零元素;
根据所述矩阵A1,生成所述索引矩阵,所述索引矩阵的m行与所述矩阵A2的m行按顺序一一对应,所述索引矩阵的每一行的元素为对应的所述矩阵A2中的一行中按顺序排列的所有元素在所述矩阵A1中的列标。
需要说明的是,对于所述矩阵A2中的任意一个非零元素,所述非零元素在所述矩阵A1中的列标是所述非零元素在所述矩阵A1中所属的一列的列号;对于所述矩阵A2中的任意一个为零的元素,所述一个为零的元素元素在所述矩阵A1中的列标是第一字符。
另外,对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵B1中指示的一行是所述矩阵B1的所有行中行号为所述索引矩阵中的一个元素的一行;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵B2中的一行的元素是n个第二字符。
再者,所述矩阵A1是通过神经网络训练得到。
进一步地,所述电路还包括第一存储器,所述第一存储器用于存储所述矩阵B1、所述矩阵A2和所述索引矩阵;相应地,所述获取电路用于:从所述第一存储器中读取所述矩阵B1、所述矩阵A2和所述索引矩阵。
第五方面,提供了一种SOC,所述SOC包括上述第三方面所述的执行矩阵乘法运算的电路。SOC还包括处理核,用于控制所述执行矩阵乘法运算的电路。
进一步地,所述SOC还包括第二存储器,所述第二存储器用于存储所述矩阵A1、所述矩阵B2和所述索引矩阵;相应地,所述获取电路用于:从所述第二存储器中读取所述矩阵A1、所述矩阵B2和所述索引矩阵。
第六方面,提供了一种SOC,所述SOC包括上述第四方面所述的执行矩阵乘法运算的电路。SOC还包括处理核,用于控制所述执行矩阵乘法运算的电路。
进一步地,所述SOC还包括第二存储器,所述第二存储器用于存储所述矩阵B1、所述矩阵A2和所述索引矩阵;相应地,所述获取电路用于:从所述第二存储器中读取所述矩阵B1、所述矩阵A2和所述索引矩阵。
第七方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面所述的执行矩阵乘法运算的方法。
第八方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第二方面所述的执行矩阵乘法运算的方法。
第九方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的执行矩阵乘法运算的方法。
第十方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第二方面所述的执行矩阵乘法运算的方法。
上述第三方面、第五方面、第七方面和第九方面所获得的技术效果与上述第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
上述第四方面、第六方面、第八方面和第十方面所获得的技术效果与上述第二方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
本申请提供的技术方案至少可以带来以下有益效果:
获取矩阵B1、矩阵A2和索引矩阵。之后,根据索引矩阵和矩阵B1,生成m个矩阵B2,由于根据索引矩阵可以一次性从存储器中存储的矩阵B1中读取到所需的数据,所以不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。最后,根据矩阵A2和m个矩阵B2,生成矩阵C,由于m个矩阵B2的大小相同,所以将矩阵A2的m行一一乘以该m个矩阵B2时,矩阵A2的m行与该m个矩阵B2的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。本申请实施例中只需从矩阵B1中读取一次数据就能够完成矩阵乘法运算,因而可以实现数据复用性的最大化,节省处理资源。
附图说明
图1是本申请实施例提供的一种矩阵B1的示意图;
图2是本申请实施例提供的一种矩阵A1的示意图;
图3是本申请实施例提供的一种执行矩阵乘法运算的方法的流程图;
图4是本申请实施例提供的另一种矩阵B1的示意图;
图5是本申请实施例提供的一种矩阵B2的示意图;
图6是本申请实施例提供的一种索引矩阵的示意图;
图7是本申请实施例提供的另一种矩阵A1的示意图;
图8是本申请实施例提供的一种n个矩阵A2的示意图;
图9是本申请实施例提供的一种矩阵C的示意图;
图10是本申请实施例提供的另一种执行矩阵乘法运算的方法的流程图;
图11是本申请实施例提供的又一种矩阵A1的示意图;
图12是本申请实施例提供的一种矩阵A2的示意图;
图13是本申请实施例提供的另一种索引矩阵的示意图;
图14是本申请实施例提供的又一种矩阵B1的示意图;
图15是本申请实施例提供的一种m个矩阵B2的示意图;
图16是本申请实施例提供的另一种矩阵C的示意图;
图17是本申请实施例提供的一种执行矩阵乘法运算的电路的结构示意图;
图18是本申请实施例提供的另一种执行矩阵乘法运算的电路的结构示意图;
图19是本申请实施例提供的一种计算单元阵列的结构示意图;
图20是本申请实施例提供的一种SOC的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请的实施方式作进一步地详细描述。
在对本申请实施例进行详细地解释说明之前,对本申请实施例涉及的应用场景予以说明。
目前,人工智能技术广泛应用于终端、边缘侧、云端等,用来实现图像识别、目标检测、语音翻译等功能,人工智能技术往往通过深度学习网络实现。对于深度学习网络(如基于神经网络),其中较影响性能、计算量较大的算子,如卷积、inner product等算子的计算量占比可达99%,这些算子均可以展开成矩阵乘矩阵的运算。当然,矩阵作为一种常见的数据表达 形式,矩阵乘矩阵的运算也常常应用于其它领域中。本申请实施例提供的执行矩阵乘法运算的方法就应用于深度网络学习或其它领域中的矩阵乘矩阵的运算场景中。
接下来对本申请实施例提供的执行矩阵乘法运算的方法予以说明。
值得注意的是,在进行本申请实施例提供的执行矩阵乘法运算的方法之前,可以先获得满足条件稀疏的矩阵。例如,开发人员可以通过控制深度学习网络的训练过程,来得到满足条件稀疏的矩阵,当然,也可以通过其它方式得到满足条件稀疏的矩阵,本申请实施例对此不作限定。
对于矩阵A1乘以矩阵B1的情况,假设矩阵A1是m行k列的矩阵,矩阵B1是k行n列的矩阵,m、k和n均为正整数。
矩阵B1满足条件稀疏是指:矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且每组元素中非零元素的数量均小于或等于k2,预设数值为正整数,预设数值是k/k1,k大于或等于k1且k能够被k1整除,k1大于或等于k2且k1能够被k2整除。例如,k=16,n=16,k1=8,k2=2,预设数值为2,即矩阵B1为16行16列的矩阵,矩阵B1的16列中每一列的元素均按顺序划分到两个组,每组元素的数量为8,且每组元素中非零元素的数量均小于或等于2,此时矩阵B1可以如图1所示,这种情况下,矩阵B1的16列中每一列的每8个连续的元素均被稀疏成不超过2个非零元素,对应的条件稀疏率是25%。
矩阵A1满足条件稀疏是指:矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且每组元素中非零元素的数量均小于或等于k2,预设数值为正整数,预设数值是k/k1,k大于或等于k1且k能够被k1整除,k1大于或等于k2且k1能够被k2整除。例如,m=5,k=4,k1=2,k2=1,预设数值为2,即矩阵A1为5行4列的矩阵,矩阵A1的5行中每一行的元素均按顺序划分到两个组,每组元素的数量为2,且每组元素中非零元素的数量均小于或等于1,此时矩阵A1可以如图2所示,这种情况下,矩阵A1的5行中每一行的每2个连续的元素均被稀疏成不超过1个非零元素,对应的条件稀疏率是50%。
需要说明的是,对于矩阵A1乘以矩阵B1的情况,如果矩阵B1为满足条件稀疏的矩阵,则可以通过下文图3实施例提供的执行矩阵乘法运算的方法来确定矩阵A1与矩阵B1的乘积;如果矩阵A1为满足条件稀疏的矩阵,则可以通过下文图10实施例提供的执行矩阵乘法运算的方法来确定矩阵A1与矩阵B1的乘积。
图3是本申请实施例提供的一种执行矩阵乘法运算的方法的流程图。参见图3,该方法包括:
步骤301:获取矩阵A1、矩阵B2和索引矩阵。
需要说明的是,矩阵A1是m行k列的矩阵,m和k均为正整数。矩阵A1可以为深度学习网络中的任意算子(如卷积、Inner product等)展开后的矩阵乘矩阵的运算中的被乘数,且矩阵A1可以为数据矩阵,当然,矩阵A1也可以为其它应用中的矩阵乘矩阵的运算中的被乘数,本申请实施例对此不作限定。
另外,矩阵B2是t行n列的矩阵,t和n均为正整数,t小于或等于k。矩阵B2中包含矩阵B1中所有的非零元素(有效数据)。矩阵B1是k行n列的矩阵,矩阵B1可以为深度学 习网络中的任意算子展开后的矩阵乘矩阵的运算中的乘数,且矩阵B1可以为参数矩阵,当然,矩阵B1也可以为其它应用中的矩阵乘矩阵的运算中的乘数,本申请实施例对此不作限定。
再者,索引矩阵是t行n列的矩阵。索引矩阵为矩阵B2中的元素在矩阵B1中的索引,即包括矩阵B1中所有的非零元素在矩阵B1中的索引。
具体地,在获取矩阵A1时,可以直接从存储器中读取矩阵A1。在获取矩阵B2和索引矩阵时,可以直接从存储器中读取矩阵B2和索引矩阵;或者,可以先获取矩阵B1,再根据矩阵B1生成矩阵B2以及索引矩阵,具体的,根据矩阵B1生成矩阵B2以及索引矩阵时,可以根据矩阵B1生成矩阵B2以及根据矩阵B1生成索引矩阵,或者先根据B1生成矩阵B2,再根据矩阵B1以及矩阵B2生成索引矩阵。具体的生成算法并不限定,只要能够满足最终生成的矩阵满足各矩阵的定义要求即可。
其中,获取矩阵B1时,可以直接从存储器中读取矩阵B1。
需要说明的是,矩阵B1可以为满足条件稀疏的矩阵,即矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且每组元素中非零元素的数量均小于或等于k2,预设数值是k/k1(也等于t除以k2后得到的数值),k大于或等于k1且k能够被k1整除,k1大于或等于k2且k1能够被k2整除。
另外,矩阵B1可以通过神经网络训练得到,例如,在深度学习场景下,可以通过控制深度学习网络的训练过程,将卷积、Inner product等算子的参数的分布训练成符合条件稀疏的分布规律,以得到满足条件稀疏的参数矩阵作为矩阵B1;当然,矩阵B1也可以通过其它方式得到,本申请实施例对此不作限定。
再者,矩阵A1和矩阵B1中的元素的数据类型可以根据实际需求预先进行设置,如可以是整形、浮点或任意的自定义格式,且m、k、n、k1、k2和预设数值的取值也可以根据实际需求预先进行设置,如可以根据神经网络的稀疏程度及硬件的计算能力合理确定,本申请实施例对此不作限定。
例如,k=16,n=16,k1=16,k2=4,预设数值为1,即矩阵B1为16行16列的矩阵,矩阵B1的16列中每一列的元素均按顺序划分到一个组,每组元素的数量为16,且每组元素中非零元素的数量均小于或等于4,此时矩阵B1可以如图4所示。这种情况下,矩阵B1的16列中每一列的每16个连续的元素均被稀疏成不超过4个非零元素,对应的条件稀疏率是25%。
其中,矩阵B2的n列与矩阵B1的n列按顺序一一对应,矩阵B2的每一列的元素包括对应的矩阵B1中的一列中按顺序排列的预设数值个组中所有的非零元素。也即是,对于矩阵B1的n列中的每一列,如对于矩阵B1的n列中的第i列,是按顺序从矩阵B1的第i列的预设数值个组中的每组元素中选择包含有所有的非零元素的k2个元素作为矩阵B2的第i列,来得到矩阵B2,i为大于或等于1且小于或等于n的整数。
例如,k=16,n=16,k1=16,k2=4,预设数值为1,此时矩阵B1可以如图4所示。之后,可以按顺序从矩阵B1的第1列的一组元素中选择包含有这一组元素中所有的非零元素的4个元素作为矩阵B2的第1列;按顺序从矩阵B1的第2列的一组元素中选择包含有这一组元素中所有的非零元素的4个元素作为矩阵B2的第2列;以此类推,直至按顺序从矩阵B1的第16列的一组元素中选择包含有这一组元素中所有的非零元素的4个元素作为矩阵B2的第16列为止,如此可以得到如图5所示的矩阵B2,此时矩阵B2是4行16列的矩阵,且包含 有矩阵B1中所有的非零元素。
其中,索引矩阵的n列与矩阵B2的n列按顺序一一对应,索引矩阵的每一列的元素为对应的矩阵B2中的一列中按顺序排列的所有元素在矩阵B1中的行标。也即是,对于矩阵B2的n列中的每一列,如对于矩阵B2的n列中的第i列,是按顺序将矩阵B2的第i列的所有元素中每个元素在矩阵B1中的行标作为索引矩阵的第i列,来得到索引矩阵。
值得说明的是,本申请实施例中引入了条件稀疏的概念,使得矩阵B1的n列中每一列的非零元素的数量均可以控制在一定的范围内,如此可以有效控制数据索引的范围,从而可以有效降低索引矩阵的规模,保证工程上的可实现性。并且,本申请实施例中后续仅使用这一个索引矩阵就可以完成矩阵乘法运算,因而消耗的逻辑资源较少。
需要说明的是,对于矩阵B2中的任意一个非零元素,这一个非零元素在矩阵B1中的行标是这一个非零元素在矩阵B1中所属的一行的行号。对于矩阵B2中的任意一个为零的元素,这一个为零的元素在矩阵B1中的行标是第一字符。
另外,引入第一字符可以满足矩阵元素对齐的要求,第一字符可以预先进行设置,具体实现时第一字符可以为任意值,如第一字符可以为X、Y等,本申请实施例对此不作限定。
例如,k=16,n=16,k1=16,k2=4,预设数值为1,第一字符为X,此时矩阵B1可以如图4所示,矩阵B2可以如图5所示。之后,按顺序将矩阵B2的第1列的4个元素中每个元素在矩阵B1中的行标作为索引矩阵的第1列,按顺序将矩阵B2的第2列的4个元素中每个元素在矩阵B1中的行标作为索引矩阵的第2列,以此类推,直至按顺序将矩阵B2的第16列的4个元素中每个元素在矩阵B1中的行标作为索引矩阵的第16列为止,如此可以得到如图6所示的索引矩阵,此时索引矩阵是4行16列的矩阵,且包括矩阵B1中所有的非零元素在矩阵B1中的索引。
步骤302:根据索引矩阵和矩阵A1,生成n个矩阵A2。
需要说明的是,n个矩阵A2均是m行t列的矩阵,n个矩阵A2与索引矩阵的n列按顺序一一对应,每个矩阵A2的t列与对应的索引矩阵中的一列的t个元素按顺序一一对应,每个矩阵A2的每一列为对应的索引矩阵中的一个元素在矩阵A1中指示的一列。也即是,对于索引矩阵的n列中的每一列的t个元素中的每一个元素,如对于索引矩阵的第i列的第j个元素,是将索引矩阵的第i列的第j个元素在矩阵A1中指示的一列作为n个矩阵A2中的第i个矩阵A2的第j列,来得到n个矩阵A2,j为大于或等于1且小于或等于t的整数。
值得说明的是,本申请实施例中是根据索引矩阵的每一列的t个元素,直接读取矩阵A1中t列的元素来构成一个矩阵A2。这种情况下,由于所要读取的数据在矩阵A1中的分布比较规则且集中,所以在将矩阵A1存储到存储器中时,可以将矩阵A1的k列分别存储到存储器中的多个地址上,这种情况下,根据索引矩阵可以一次性从存储器中读取到所需的数据,从而不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。
需要说明的是,对于索引矩阵中的任意一个元素,当索引矩阵中的一个元素不是第一字符时,索引矩阵中的这一个元素在矩阵A1中指示的一列是矩阵A1的所有列中列号为索引矩阵中的这一个元素的一列;当索引矩阵中的一个元素是第一字符时,索引矩阵中的这一个元素对应的一个矩阵A2中的一列的元素是m个第二字符。也即是,当索引矩阵的第i列的第j个元素不是第一字符时,可以将矩阵A1的所有列中列号为该第j个元素的一列作为第i个矩 阵A2的第j列;当索引矩阵的第i列的第j个元素是第一字符时,将m个第二字符作为第i个矩阵A2的第j列。
另外,第二字符可以预先进行设置,具体实现时第二字符可以为任意值,如第二字符可以为0、X、矩阵A1中的任意元素等,本申请实施例对此不作限定。
例如,m=16,k=16,n=16,t=4,第一字符和第二字符均为X,此时索引矩阵可以如图6所示,矩阵A1可以如图7所示。之后,将索引矩阵的第1列的第1个元素在矩阵A1中指示的一列作为第1个矩阵A2的第1列,将索引矩阵的第1列的第2个元素在矩阵A1中指示的一列作为第1个矩阵A2的第2列,将索引矩阵的第1列的第3个元素在矩阵A1中指示的一列作为第1个矩阵A2的第3列,将索引矩阵的第1列的第4个元素在矩阵A1中指示的一列作为第1个矩阵A2的第4列,得到第1个矩阵A2,以此类推,直至将索引矩阵的第16列的第1个元素在矩阵A1中指示的一列作为第16个矩阵A2的第1列,将索引矩阵的第16列的第2个元素在矩阵A1中指示的一列作为第16个矩阵A2的第2列,将索引矩阵的第16列的第3个元素在矩阵A1中指示的一列作为第16个矩阵A2的第3列,将索引矩阵的第16列的第4个元素在矩阵A1中指示的一列作为第16个矩阵A2的第4列,得到第16个矩阵A2为止,如此可以得到如图8所示的16个矩阵A2,该16个矩阵A2均是16行4列的矩阵。
步骤303:根据n个矩阵A2和矩阵B2,生成矩阵C。
需要说明的是,矩阵C是m行n列的矩阵,矩阵C为矩阵A1与矩阵B1的乘积。矩阵C的n列与n个矩阵A2按顺序一一对应,矩阵C的n列与矩阵B2的n列按顺序一一对应,矩阵C的每一列为对应的一个矩阵A2与对应的矩阵B2中的一列的乘积。也即是,对于矩阵C的n列中的每一列,如对于矩阵C的第i列,是将第i个矩阵A2与矩阵B2的第i列的乘积作为矩阵C的第i列,来得到矩阵C。
值得说明的是,本申请实施例中根据索引矩阵可以一次性从矩阵A1中读取出n个矩阵A2,继而可以将该n个矩阵A2一一乘以矩阵B2的n列,来得到矩阵C。由于只需从矩阵A1中读取一次数据就能够完成矩阵乘法运算,所以可以实现数据复用性的最大化,节省处理资源。并且,将该n个矩阵A2一一乘以矩阵B2的n列时,由于该n个矩阵A2的大小相同,所以该n个矩阵A2与矩阵B2的n列的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。
例如,m=16,n=16,t=4,此时矩阵B2可以如图5所示,16个矩阵A2可以如图8所示。之后,可以将第1个矩阵A2与矩阵B2的第1列的乘积作为矩阵C的第1列,将第2个矩阵A2与矩阵B2的第2列的乘积作为矩阵C的第2列,以此类推,直至将第16个矩阵A2与矩阵B2的第16列的乘积作为矩阵C的第16列,如此可以得到如图9所示的矩阵C,此时矩阵C为16行16列的矩阵。
值得注意的是,本申请实施例在矩阵乘法运算过程中,引入了条件稀疏的概念,然后通过上述方式来进行矩阵A1与满足条件稀疏的矩阵B1的乘法运算,从而可以大大提升计算性能,计算性能的提升倍数为矩阵B1的条件稀疏率的倒数,例如,矩阵B1的条件稀疏率为25%,则计算性能可以提升4倍。
在本申请实施例中,获取矩阵A1、矩阵B2和索引矩阵。之后,根据索引矩阵和矩阵A1,生成n个矩阵A2,由于根据索引矩阵可以一次性从存储器中存储的矩阵A1读取到所需的数据,所以不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读 取数据时可能出现的地址冲突问题。最后,根据n个矩阵A2和矩阵B2,生成矩阵C,由于n个矩阵A2的大小相同,所以将该n个矩阵A2一一乘以矩阵B2的n列时,该n个矩阵A2与矩阵B2的n列的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。本申请实施例中只需从矩阵A1中读取一次数据就能够完成矩阵乘法运算,因而可以实现数据复用性的最大化,节省处理资源。
图10是本申请实施例提供的一种执行矩阵乘法运算的方法的流程图。参见图10,该方法包括:
步骤1001:获取矩阵B1、矩阵A2和索引矩阵。
需要说明的是,矩阵B1是k行n列的矩阵,k和n均为正整数。矩阵B1可以为深度学习网络中的任意算子(如卷积、Inner product等)展开后的矩阵乘矩阵的运算中的乘数,且矩阵B1可以为数据矩阵,当然,矩阵B1也可以为其它应用中的矩阵乘矩阵的运算中的乘数,本申请实施例对此不作限定。
另外,矩阵A2是m行t列的矩阵,m和t均为正整数,t小于或等于所述k。矩阵A2中包含矩阵A1中所有的非零元素(有效数据)。矩阵A1是m行k列的矩阵,矩阵A1可以为深度学习网络中的任意算子展开后的矩阵乘矩阵的运算中的被乘数,且矩阵A1可以为参数矩阵,当然,矩阵A1也可以为其它应用中的矩阵乘矩阵的运算中的被乘数,本申请实施例对此不作限定。
再者,索引矩阵是m行t列的矩阵。索引矩阵为矩阵A2中的元素在矩阵A1中的索引,即包括矩阵A1中所有的非零元素在矩阵A1中的索引。
具体地,在获取矩阵B1时,可以直接从存储器中读取矩阵B1。在获取矩阵A2和索引矩阵时,可以直接从存储器中读取矩阵A2和索引矩阵;或者,可以先获取矩阵A1,再根据矩阵A1生成矩阵A2以及索引矩阵,具体的,根据矩阵A1生成矩阵A2以及索引矩阵时,可以根据矩阵A1生成矩阵A2以及根据矩阵A1生成索引矩阵,或者先根据A1生成矩阵A2,再根据矩阵A1以及矩阵A2生成索引矩阵。具体的生成算法并不限定,只要能够满足最终生成的矩阵满足各矩阵的定义要求即可。
其中,获取矩阵A1时,可以直接从存储器中读取矩阵A1。
需要说明的是,矩阵A1可以为满足条件稀疏的矩阵,即矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且每组元素中非零元素的数量均小于或等于k2,预设数值是k/k1(也等于t除以k2后得到的数值),k大于或等于k1且k能够被k1整除,k1大于或等于k2且k1能够被k2整除。
另外,矩阵A1可以通过神经网络训练得到,例如,在深度学习场景下,可以通过控制深度学习网络的训练过程,将卷积、Inner product等算子的参数的分布训练成符合条件稀疏的分布规律,以得到满足条件稀疏的参数矩阵作为矩阵A1;当然,矩阵A1也可以通过其它方式得到,本申请实施例对此不作限定。
再者,矩阵A1和矩阵B1中的元素的数据类型可以根据实际需求预先进行设置,如可以是整形、浮点或任意的自定义格式,且m、k、n、k1、k2和预设数值的取值也可以根据实际需求预先进行设置,如可以根据神经网络的稀疏程度及硬件的计算能力合理确定,本申请实施例对此不作限定。
例如,m=5,k=4,k1=4,k2=2,预设数值为1,即矩阵A1为5行4列的矩阵,矩阵A1的5行中每一行的元素均按顺序划分到一个组,每组元素的数量为4,且每组元素中非零元素的数量均小于或等于2,此时矩阵A1可以如图11所示。这种情况下,矩阵A1的5行中每一行的每4个连续的元素均被稀疏成不超过2个非零元素,对应的条件稀疏率是50%。
其中,矩阵A2的m行与矩阵A1的m行按顺序一一对应,矩阵A2的每一行的元素包括对应的矩阵A1中的一行中按顺序排列的预设数值个组中所有的非零元素。也即是,对于矩阵A1的m行中的每一行,如对于矩阵A1的m行中的第i行,是按顺序从矩阵A1的第i行的预设数值个组中的每组元素中选择包含有所有的非零元素的k2个元素作为矩阵A2的第i行,来得到矩阵A2,i为大于或等于1且小于或等于m的整数。
例如,m=5,k=4,k1=4,k2=2,预设数值为1,此时矩阵A1可以如图11所示。之后,可以按顺序从矩阵A1的第1行的一组元素中选择包含有这一组元素中所有的非零元素的2个元素作为矩阵A2的第1行;按顺序从矩阵A1的第2行的一组元素中选择包含有这一组元素中所有的非零元素的2个元素作为矩阵A2的第2行,以此类推,直至按顺序从矩阵A1的第5行的一组元素中选择包含有这一组元素中所有的非零元素的2个元素作为矩阵A2的第5行为止,如此可以得到如图12所示的矩阵A2,此时矩阵A2是5行2列的矩阵,且包含有矩阵A1中所有的非零元素。
其中,索引矩阵的m行与矩阵A2的m行按顺序一一对应,索引矩阵的每一行的元素为对应的矩阵A2中的一行中按顺序排列的所有元素在矩阵A1中的列标。也即是,对于矩阵A2的m行中的每一行,如对于矩阵A2的m行中的第i行,是按顺序将矩阵A2的第i行的所有元素中每个元素在矩阵A1中的列标作为索引矩阵的第i行,来得到索引矩阵。
值得说明的是,本申请实施例中引入了条件稀疏的概念,使得矩阵A1的m行中每一行的非零元素的数量均可以控制在一定的范围内,如此可以有效控制数据索引的范围,从而可以有效降低索引矩阵的规模,保证工程上的可实现性。并且,本申请实施例中后续仅使用这一个索引矩阵就可以完成矩阵乘法运算,因而消耗的逻辑资源较少。
需要说明的是,对于矩阵A2中的任意一个非零元素,这一个非零元素在矩阵A1中的列标是这一个非零元素在矩阵A1中所属的一列的列号。对于矩阵A2中的任意一个为零的元素,这一个为零的元素在矩阵A1中的行标是第一字符。
另外,引入第一字符可以满足矩阵元素对齐的要求,第一字符可以预先进行设置,具体实现时第一字符可以为任意值,如第一字符可以为X、Y等,本申请实施例对此不作限定。
例如,m=5,k=4,k1=4,k2=2,预设数值为1,第一字符为X,此时矩阵A1可以如图11所示,矩阵A2可以如图12所示。之后,按顺序将矩阵A2的第1行的2个元素中每个元素在矩阵A1中的列标作为索引矩阵的第1行,按顺序将矩阵A2的第2行的2个元素中每个元素在矩阵A1中的列标作为索引矩阵的第2行,以此类推,直至按顺序将矩阵A2的第5行的2个元素中每个元素在矩阵A1中的列标作为索引矩阵的第5行为止,如此可以得到如图13所示的索引矩阵,此时索引矩阵是5行2列的矩阵,且包括矩阵A1中所有的非零元素在矩阵A1中的索引。
步骤1002:根据索引矩阵和矩阵B1,生成m个矩阵B2。
需要说明的是,m个矩阵B2均是t行n列的矩阵,m个矩阵B2与索引矩阵的m行按顺序一一对应,每个矩阵B2的t行与对应的索引矩阵中的一行的t个元素按顺序一一对应,每 个矩阵B2的每一行为对应的索引矩阵中的一个元素在矩阵B1中指示的一行。也即是,对于索引矩阵的m行中的每一行的t个元素中的每一个元素,如对于索引矩阵的第i行的第j个元素,是将索引矩阵的第i行的第j个元素在矩阵B1中指示的一行作为m个矩阵B2中的第i个矩阵B2的第j行,来得到m个矩阵B2,j为大于或等于1且小于或等于t的整数。
值得说明的是,本申请实施例中是根据索引矩阵的每一行的t个元素,直接读取矩阵B1中t行的元素来构成一个矩阵B2。这种情况下,由于所要读取的数据在矩阵B1中的分布比较规则且集中,所以在将矩阵B1存储到存储器中时,可以将矩阵B1的k行分别存储到存储器中的多个地址上,这种情况下,根据索引矩阵可以一次性从存储器中读取到所需的数据,从而不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。
需要说明的是,对于索引矩阵中的任意一个元素,当索引矩阵中的一个元素不是第一字符时,索引矩阵中的这一个元素在矩阵B1中指示的一行是矩阵B1的所有行中行号为索引矩阵中的这一个元素的一行;当索引矩阵中的一个元素是第一字符时,索引矩阵中的这一个元素对应的一个矩阵B2中的一行的元素是n个第二字符。也即是,当索引矩阵的第i行的第j个元素不是第一字符时,可以将矩阵B1的所有行中行号为该第j个元素的一行作为第i个矩阵B2的第j行;当索引矩阵的第i行的第j个元素是第一字符时,将n个第二字符作为第i个矩阵B2的第j行。
另外,第二字符可以预先进行设置,具体实现时第二字符可以为任意值,如第二字符可以为0、X、矩阵B1中的任意元素等,本申请实施例对此不作限定。
例如,m=5,k=4,n=3,t=2,第一字符和第二字符均为X,此时索引矩阵可以如图13所示,矩阵B1可以如图14所示。之后,将索引矩阵的第1行的第1个元素在矩阵B1中指示的一行作为第1个矩阵B2的第1行,将索引矩阵的第1行的第2个元素在矩阵B1中指示的一行作为第1个矩阵B2的第2行,得到第1个矩阵B2,以此类推,直至将索引矩阵的第5行的第1个元素在矩阵B1中指示的一行作为第5个矩阵B2的第1行,将索引矩阵的第5行的第2个元素在矩阵B1中指示的一行作为第5个矩阵B2的第2行,得到第5个矩阵B2为止,如此可以得到如图15所示的5个矩阵B2,该5个矩阵B2均是2行3列的矩阵。
步骤1003:根据矩阵A2和m个矩阵B2,生成矩阵C。
需要说明的是,矩阵C是m行n列的矩阵,矩阵C为矩阵A1与矩阵B1的乘积。矩阵C的m行与矩阵A2的m行按顺序一一对应,矩阵C的m行与m个矩阵B2按顺序一一对应,矩阵C的每一行为对应的矩阵A2中的一行与对应的一个矩阵B2的乘积。也即是,对于矩阵C的m行中的每一行,如对于矩阵C的第i行,是将矩阵A2的第i行与第i个矩阵B2的乘积作为矩阵C的第i行,来得到矩阵C。
值得说明的是,本申请实施例中根据索引矩阵可以一次性从矩阵B1中读取出m个矩阵B2,继而可以将矩阵A2的m行一一乘以该m个矩阵B2,来得到矩阵C。由于只需从矩阵B1中读取一次数据就能够完成矩阵乘法运算,所以可以实现数据复用性的最大化,节省处理资源。并且,将矩阵A2的m行一一乘以该m个矩阵B2时,由于该m个矩阵B2的大小相同,所以矩阵A2的m行与该m个矩阵B2的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。
例如,m=5,n=3,t=2,此时矩阵A2可以如图12所示,5个矩阵B2可以如图15所示。 之后,可以将矩阵A2的第1行与第1个矩阵B2的乘积作为矩阵C的第1行,将矩阵A2的第2行与第2个矩阵B2的乘积作为矩阵C的第2行,以此类推,直至将矩阵A2的第5行与第5个矩阵B2的乘积作为矩阵C的第5行为止,如此可以得到如图16所示的矩阵C,此时矩阵C为5行3列的矩阵。
值得注意的是,本申请实施例在矩阵乘法运算过程中,引入了条件稀疏的概念,然后通过上述方式来进行满足条件稀疏的矩阵A1与矩阵B1的乘法运算,从而可以大大提升计算性能,计算性能的提升倍数为矩阵A1的条件稀疏率的倒数,例如,矩阵A1的条件稀疏率为50%,则计算性能可以提升2倍。
在本申请实施例中,获取矩阵B1、矩阵A2和索引矩阵。之后,根据索引矩阵和矩阵B1,生成m个矩阵B2,由于根据索引矩阵可以一次性从存储器中存储的矩阵B1中读取到所需的数据,所以不仅可以大大降低读取数据时所需的存储器访问带宽,而且可以消除从存储器中读取数据时可能出现的地址冲突问题。最后,根据矩阵A2和m个矩阵B2,生成矩阵C,由于m个矩阵B2的大小相同,所以将矩阵A2的m行一一乘以该m个矩阵B2时,矩阵A2的m行与该m个矩阵B2的乘法运算可以并行执行,且可以在相同的时间内执行完成,从而可以节省运算时间,提高运算效率。本申请实施例中只需从矩阵B1中读取一次数据就能够完成矩阵乘法运算,因而可以实现数据复用性的最大化,节省处理资源。
接下来对本申请实施例提供的执行矩阵乘法运算的电路进行说明。
图17是本申请实施例提供的一种执行矩阵乘法运算的电路的结构示意图。该执行矩阵乘法运算的电路可以通过现场可编程门阵列((Field-Programmable Gate Array,FPGA)、ASIC等实现。参见图17,该执行矩阵乘法运算的电路包括:获取电路1701、数据选择电路1702和计算单元阵列1703。
下面结合图17所示的执行矩阵乘法运算的电路分别对上述图3和图10实施例提供的执行矩阵乘法运算的方法进行说明。
其中,该执行矩阵乘法运算的电路实现图3实施例提供的执行矩阵乘法运算的方法的过程可以包括如下步骤(1)-(3):
(1)获取电路1701获取矩阵A1、矩阵B2和索引矩阵。
需要说明的是,参见图18,该执行矩阵乘法运算的电路还可以包括第一存储器1704,第一存储器用于存储矩阵A1、矩阵B2和索引矩阵,这种情况下,获取电路1701可以从第一存储器1704中读取矩阵A1、矩阵B2和索引矩阵。或者,获取电路1701可以先获取矩阵A1和矩阵B1,再根据矩阵B1生成矩阵B2以及索引矩阵。
(2)数据选择电路1702根据索引矩阵和矩阵A1,生成n个矩阵A2。
(3)计算单元阵列1703根据n个矩阵A2和矩阵B2,生成矩阵C。
需要说明的是,如图19所示,计算单元阵列1703包括多个三维计算单元,该多个三维计算单元可以分布在m行n列上,每个三维计算单元中包括多个乘法单元和加法单元,如三维计算单元可以为乘累加单元(multiply and accumulate,mac)。1个三维计算单元可以用于计算1个矩阵A2的1行与矩阵B2的1列的乘积,1列三维计算单元(m个三维计算单元)可以用于计算1个矩阵A2与矩阵B2的1列的乘积,即1列三维计算单元可以计算出矩阵C的1列的元素,从而n列三维计算单元可以计算出矩阵C的n列的元素,如此可以得到矩阵 C。
另外,计算单元阵列1703得到矩阵C后,还可以将矩阵C保存到寄存器组中,该寄存器组可以包含于第一存储器1704中,也可以包含于其它存储器中,本申请实施例对此不作限定。
其中,该执行矩阵乘法运算的电路实现图10实施例提供的执行矩阵乘法运算的方法的过程可以包括如下步骤(4)-(6):
(4)获取电路1701获取获取矩阵B1、矩阵A2和索引矩阵。
需要说明的是,参见图18,该执行矩阵乘法运算的电路还可以包括第一存储器1704,第一存储器用于存储矩阵B1、矩阵A2和索引矩阵,这种情况下,获取电路1701可以从第一存储器1704中读取矩阵B1、矩阵A2和索引矩阵。或者,获取电路1701可以先获取矩阵A1和矩阵B1,再根据矩阵A1生成矩阵A2以及索引矩阵。
(5)数据选择电路1702根据索引矩阵和矩阵B1,生成m个矩阵B2。
(6)计算单元阵列1703根据矩阵A2和m个矩阵B2,生成矩阵C。
需要说明的是,如图19所示,计算单元阵列1703包括多个三维计算单元,该多个三维计算单元可以分布在m行n列上,每个三维计算单元中包括多个乘法单元和加法单元,如三维计算单元可以为mac。1个三维计算单元可以用于计算矩阵A2的1行与一个矩阵B2的1列的乘积,1行三维计算单元(n个三维计算单元)可以用于计算矩阵A2的1行与一个矩阵B2的乘积,即1行三维计算单元可以计算出矩阵C的1行的元素,从而m行三维计算单元可以计算出矩阵C的m行的元素,如此可以得到矩阵C。
另外,计算单元阵列1703得到矩阵C后,还可以将矩阵C保存到寄存器组中,该寄存器组可以包含于第一存储器1704中,也可以包含于其它存储器中,本申请实施例对此不作限定。
接下来对本申请实施例提供的SOC进行说明。
本申请实施例提供的一种SOC可以包括上述实施例中所述的执行矩阵乘法运算的电路,除此之外,还可以包括其它部件。
例如,图20是本申请实施例提供的一种SOC的结构示意图。参见图20,该SOC包括:处理器2001(在一些应用中,也称为处理核、CPU,如基于ARM架构的处理核)、第二存储器2002、互联总线2003和执行矩阵乘法运算的电路2004,执行矩阵乘法运算的电路2004可以为上述实施例中所述的执行矩阵乘法运算的电路。处理器2001用于对执行矩阵乘法运算的电路2004进行控制,例如,发送需要的数据,或者接收执行矩阵乘法运算的电路2004运算后的结果。
需要说明的是,第二存储器2002中存储的数据与上述第一存储器1704中存储的数据相同,即用于存储矩阵A1、矩阵B2和索引矩阵,或者用于存储矩阵B1、矩阵A2和索引矩阵。第一存储器1704可以为RAM等,第二存储器2002可以为双倍速率同步动态随机存储器(Double Data Rate,DDR)等。
具体地,该SOC执行矩阵乘法运算时,处理器2001通过互联总线2003控制执行矩阵乘法运算的电路2004启动。
一种可能的情况中,执行矩阵乘法运算的电路2004直接根据第二存储器2002中存储的 数据执行矩阵乘法运算。具体地,执行矩阵乘法运算的电路2004启动后其中的获取电路1701通过互联总线2003从第二存储器2002中读取数据(读取矩阵A1、矩阵B2和索引矩阵,或者读取矩阵B1、矩阵A2和索引矩阵),之后,执行矩阵乘法运算的电路2004中的数据选择电路1702和计算单元阵列1703根据获取电路1701从第二存储器2002中读取到的数据完成矩阵乘法运算,并将运算结果返回到第二存储器2002。
另一种可能的情况中,执行矩阵乘法运算的电路2004直接根据第一存储器1704中存储的数据执行矩阵乘法运算。具体地,当第一存储器1704中尚未存储矩阵A1、矩阵B2和索引矩阵,或者尚未存储矩阵B1、矩阵A2和索引矩阵时,执行矩阵乘法运算的电路2004启动后通过互联总线2003从第二存储器2002中读取数据(读取矩阵A1、矩阵B2和索引矩阵,或者读取矩阵B1、矩阵A2和索引矩阵),然后将从第二存储器2002中读取到的数据存储到第一存储器1704中。之后,执行矩阵乘法运算的电路2004中的获取电路1701从第一存储器1704中读取数据(读取矩阵A1、矩阵B2和索引矩阵,或者读取矩阵B1、矩阵A2和索引矩阵),然后执行矩阵乘法运算的电路2004中的数据选择电路1702和计算单元阵列1703根据获取电路1701从第一存储器1704中读取到的数据完成矩阵乘法运算,并将运算结果返回到第一存储器1704和/或第二存储器2002。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络或其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(Digital Subscriber Line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质,或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(Digital Versatile Disc,DVD))或半导体介质(例如:固态硬盘(Solid State Disk,SSD))等。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (24)

  1. 一种执行矩阵乘法运算的方法,其特征在于,所述方法包括:
    获取矩阵A1、矩阵B2和索引矩阵,所述矩阵A1是m行k列的矩阵,所述矩阵B2是t行n列的矩阵,所述索引矩阵是t行n列的矩阵,所述m、所述k、所述t和所述n均为正整数,所述t小于或等于所述k;
    根据所述索引矩阵和所述矩阵A1,生成n个矩阵A2,所述n个矩阵A2均是m行t列的矩阵,所述n个矩阵A2与所述索引矩阵的n列按顺序一一对应,每个矩阵A2的t列与对应的所述索引矩阵中的一列的t个元素按顺序一一对应,所述每个矩阵A2的每一列为对应的所述索引矩阵中的一个元素在所述矩阵A1中指示的一列;
    根据所述n个矩阵A2和所述矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的n列与所述n个矩阵A2按顺序一一对应,所述矩阵C的n列与所述矩阵B2的n列按顺序一一对应,所述矩阵C的每一列为对应的一个矩阵A2与对应的所述矩阵B2中的一列的乘积。
  2. 如权利要求1所述的方法,其特征在于,所述获取矩阵B2和索引矩阵,包括:
    获取矩阵B1,所述矩阵B1是k行n列的矩阵,所述矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
    根据所述矩阵B1,生成所述矩阵B2,所述矩阵B2的n列与所述矩阵B1的n列按顺序一一对应,所述矩阵B2的每一列的元素包括对应的所述矩阵B1中的一列中按顺序排列的预设数值个组中所有的非零元素;
    根据所述矩阵B1,生成所述索引矩阵,所述索引矩阵的n列与所述矩阵B2的n列按顺序一一对应,所述索引矩阵的每一列的元素为对应的所述矩阵B2中的一列中按顺序排列的所有元素在所述矩阵B1中的行标。
  3. 如权利要求2所述的方法,其特征在于,
    对于所述矩阵B2中的任意一个非零元素,所述非零元素在所述矩阵B1中的行标是所述非零元素在所述矩阵B1中所属的一行的行号;
    对于所述矩阵B2中的任意一个为零的元素,所述一个为零的元素在所述矩阵B1中的行标是第一字符。
  4. 如权利要求3所述的方法,其特征在于,
    对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵A1中指示的一列是所述矩阵A1的所有列中列号为所述索引矩阵中的一个元素的一列;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵A2中的一列的元素是m个第二字符。
  5. 如权利要求2-4任一所述的方法,其特征在于,所述矩阵B1是通过神经网络训练得到。
  6. 一种执行矩阵乘法运算的电路,其特征在于,所述电路包括:
    获取电路,用于获取矩阵A1、矩阵B2和索引矩阵,所述矩阵A1是m行k列的矩阵,所述矩阵B2是t行n列的矩阵,所述索引矩阵是t行n列的矩阵,所述m、所述k、所述t和所述n均为正整数,所述t小于或等于所述k;
    数据选择电路,用于根据所述索引矩阵和所述矩阵A1,生成n个矩阵A2,所述n个矩阵A2均是m行t列的矩阵,所述n个矩阵A2与所述索引矩阵的n列按顺序一一对应,每个矩阵A2的t列与对应的所述索引矩阵中的一列的t个元素按顺序一一对应,所述每个矩阵A2的每一列为对应的所述索引矩阵中的一个元素在所述矩阵A1中指示的一列;
    计算单元阵列,用于根据所述n个矩阵A2和所述矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的n列与所述n个矩阵A2按顺序一一对应,所述矩阵C的n列与所述矩阵B2的n列按顺序一一对应,所述矩阵C的每一列为对应的一个矩阵A2与对应的所述矩阵B2中的一列的乘积。
  7. 如权利要求6所述的电路,其特征在于,所述获取电路用于获取所述矩阵B2和所述索引矩阵时,具体用于:
    获取矩阵B1,所述矩阵B1是k行n列的矩阵,所述矩阵B1的n列中每一列的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
    根据所述矩阵B1,生成所述矩阵B2,所述矩阵B2的n列与所述矩阵B1的n列按顺序一一对应,所述矩阵B2的每一列的元素包括对应的所述矩阵B1中的一列中按顺序排列的预设数值个组中所有的非零元素;
    根据所述矩阵B1,生成所述索引矩阵,所述索引矩阵的n列与所述矩阵B2的n列按顺序一一对应,所述索引矩阵的每一列的元素为对应的所述矩阵B2中的一列中按顺序排列的所有元素在所述矩阵B1中的行标。
  8. 如权利要求7所述的电路,其特征在于,
    对于所述矩阵B2中的任意一个非零元素,所述非零元素在所述矩阵B1中的行标是所述非零元素在所述矩阵B1中所属的一行的行号;
    对于所述矩阵B2中的任意一个为零的元素,所述一个为零的元素在所述矩阵B1中的行标是第一字符。
  9. 如权利要求8所述的电路,其特征在于,
    对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵A1中指示的一列是所述矩阵A1的所有列中列号 为所述索引矩阵中的一个元素的一列;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵A2中的一列的元素是m个第二字符。
  10. 如权利要求7-9任一所述的电路,其特征在于,所述矩阵B1是通过神经网络训练得到。
  11. 如权利要求6所述的电路,其特征在于,所述电路还包括第一存储器,所述第一存储器用于存储所述矩阵A1、所述矩阵B2和所述索引矩阵;
    相应地,所述获取电路用于:从所述第一存储器中读取所述矩阵A1、所述矩阵B2和所述索引矩阵。
  12. 一种片上系统SOC,其特征在于,所述SOC包括上述权利要求6-11任一所述的执行矩阵乘法运算的电路以及处理核;所述处理核用于对所述执行矩阵乘法运算的电路进行控制。
  13. 一种执行矩阵乘法运算的方法,其特征在于,所述方法包括:
    获取矩阵B1、矩阵A2和索引矩阵,所述矩阵B1是k行n列的矩阵,所述矩阵A2是m行t列的矩阵,所述索引矩阵是m行t列的矩阵,所述k、所述n、所述m和所述t均为正整数,所述t小于或等于所述k;
    根据所述索引矩阵和所述矩阵B1,生成m个矩阵B2,所述m个矩阵B2均是t行n列的矩阵,所述m个矩阵B2与所述索引矩阵的m行按顺序一一对应,每个矩阵B2的t行与对应的所述索引矩阵中的一行的t个元素按顺序一一对应,所述每个矩阵B2的每一行为对应的所述索引矩阵中的一个元素在所述矩阵B1中指示的一行;
    根据所述矩阵A2和所述m个矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的m行与所述矩阵A2的m行按顺序一一对应,所述矩阵C的m行与所述m个矩阵B2按顺序一一对应,所述矩阵C的每一行为对应的所述矩阵A2中的一行与对应的一个矩阵B2的乘积。
  14. 如权利要求13所述的方法,其特征在于,所述获取矩阵A2和索引矩阵,包括:
    获取矩阵A1,所述矩阵A1是m行k列的矩阵,所述矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
    根据所述矩阵A1,生成所述矩阵A2,所述矩阵A2的m行与所述矩阵A1的m行按顺序一一对应,所述矩阵A2的每一行的元素包括对应的所述矩阵A1中的一行中按顺序排列的预设数值个组中所有的非零元素;
    根据所述矩阵A1,生成所述索引矩阵,所述索引矩阵的m行与所述矩阵A2的m行按顺序一一对应,所述索引矩阵的每一行的元素为对应的所述矩阵A2中的一行中按顺序排列的所有元素在所述矩阵A1中的列标。
  15. 如权利要求14所述的方法,其特征在于,
    对于所述矩阵A2中的任意一个非零元素,所述非零元素在所述矩阵A1中的列标是所述非零元素在所述矩阵A1中所属的一列的列号;
    对于所述矩阵A2中的任意一个为零的元素,所述一个为零的元素元素在所述矩阵A1中的列标是第一字符。
  16. 如权利要求15所述的方法,其特征在于,
    对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵B1中指示的一行是所述矩阵B1的所有行中行号为所述索引矩阵中的一个元素的一行;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵B2中的一行的元素是n个第二字符。
  17. 如权利要求14-16任一所述的方法,其特征在于,所述矩阵A1是通过神经网络训练得到。
  18. 一种执行矩阵乘法运算的电路,其特征在于,所述电路包括:
    获取电路,用于获取矩阵B1、矩阵A2和索引矩阵,所述矩阵B1是k行n列的矩阵,所述矩阵A2是m行t列的矩阵,所述索引矩阵是m行t列的矩阵,所述k、所述n、所述m和所述t均为正整数,所述t小于或等于所述k;
    数据选择电路,用于根据所述索引矩阵和所述矩阵B1,生成m个矩阵B2,所述m个矩阵B2均是t行n列的矩阵,所述m个矩阵B2与所述索引矩阵的m行按顺序一一对应,每个矩阵B2的t行与对应的所述索引矩阵中的一行的t个元素按顺序一一对应,所述每个矩阵B2的每一行为对应的所述索引矩阵中的一个元素在所述矩阵B1中指示的一行;
    计算单元阵列,用于根据所述矩阵A2和所述m个矩阵B2,生成矩阵C,所述矩阵C是m行n列的矩阵,所述矩阵C的m行与所述矩阵A2的m行按顺序一一对应,所述矩阵C的m行与所述m个矩阵B2按顺序一一对应,所述矩阵C的每一行为对应的所述矩阵A2中的一行与对应的一个矩阵B2的乘积。
  19. 如权利要求18所述的电路,其特征在于,所述获取电路用于获取所述矩阵A2和所述索引矩阵时,具体用于:
    获取矩阵A1,所述矩阵A1是m行k列的矩阵,所述矩阵A1的m行中每一行的元素均按顺序划分到预设数值个组,每组元素的数量均为k1,且所述每组元素中非零元素的数量均小于或等于k2,所述预设数值是k/k1,所述k大于或等于所述k1且所述k能够被所述k1整除,所述k1大于或等于所述k2且所述k1能够被所述k2整除;
    根据所述矩阵A1,生成所述矩阵A2,所述矩阵A2的m行与所述矩阵A1的m行按顺序一一对应,所述矩阵A2的每一行的元素包括对应的所述矩阵A1中的一行中按顺序排列的预设数值个组中所有的非零元素;
    根据所述矩阵A1,生成所述索引矩阵,所述索引矩阵的m行与所述矩阵A2的m行按 顺序一一对应,所述索引矩阵的每一行的元素为对应的所述矩阵A2中的一行中按顺序排列的所有元素在所述矩阵A1中的列标。
  20. 如权利要求19所述的电路,其特征在于,
    对于所述矩阵A2中的任意一个非零元素,所述非零元素在所述矩阵A1中的列标是所述非零元素在所述矩阵A1中所属的一列的列号;
    对于所述矩阵A2中的任意一个为零的元素,所述一个为零的元素元素在所述矩阵A1中的列标是第一字符。
  21. 如权利要求20所述的电路,其特征在于,
    对于所述索引矩阵中的任意一个元素,当所述索引矩阵中的一个元素不是所述第一字符时,所述索引矩阵中的一个元素在所述矩阵B1中指示的一行是所述矩阵B1的所有行中行号为所述索引矩阵中的一个元素的一行;当所述索引矩阵中的一个元素是所述第一字符时,所述索引矩阵中的一个元素对应的一个矩阵B2中的一行的元素是n个第二字符。
  22. 如权利要求19-21任一所述的电路,其特征在于,所述矩阵A1是通过神经网络训练得到。
  23. 如权利要求18所述的电路,其特征在于,所述电路还包括第一存储器,所述第一存储器用于存储所述矩阵B1、所述矩阵A2和所述索引矩阵;
    相应地,所述获取电路用于:从所述第一存储器中读取所述矩阵B1、所述矩阵A2和所述索引矩阵。
  24. 一种片上系统SOC,其特征在于,所述SOC包括上述权利要求18-23任一所述的执行矩阵乘法运算的电路以及处理核;所述处理核用于对所述执行矩阵乘法运算的电路进行控制。
PCT/CN2019/119794 2018-11-20 2019-11-20 执行矩阵乘法运算的方法、电路及soc WO2020103883A1 (zh)

Priority Applications (8)

Application Number Priority Date Filing Date Title
CN202111444099.0A CN114138231B (zh) 2018-11-20 2019-11-20 执行矩阵乘法运算的方法、电路及soc
EP22188228.5A EP4102354A1 (en) 2018-11-20 2019-11-20 Method, circuit, and soc for performing matrix multiplication operation
EP19888082.5A EP3876092B1 (en) 2018-11-20 2019-11-20 Method for executing matrix multiplication, circuit and soc
ES19888082T ES2943886T3 (es) 2018-11-20 2019-11-20 Método, Circuito y SOC para ejecutar multiplicación de matrices
CN201980076521.6A CN113168309A (zh) 2018-11-20 2019-11-20 执行矩阵乘法运算的方法、电路及soc
US17/324,533 US11263292B2 (en) 2018-11-20 2021-05-19 Method, circuit, and SOC for performing matrix multiplication operation
US17/568,538 US11397791B2 (en) 2018-11-20 2022-01-04 Method, circuit, and SOC for performing matrix multiplication operation
US17/841,162 US11860970B2 (en) 2018-11-20 2022-06-15 Method, circuit, and SOC for performing matrix multiplication operation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811384503.8A CN111198670B (zh) 2018-11-20 2018-11-20 执行矩阵乘法运算的方法、电路及soc
CN201811384503.8 2018-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/324,533 Continuation US11263292B2 (en) 2018-11-20 2021-05-19 Method, circuit, and SOC for performing matrix multiplication operation

Publications (1)

Publication Number Publication Date
WO2020103883A1 true WO2020103883A1 (zh) 2020-05-28

Family

ID=70744057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/119794 WO2020103883A1 (zh) 2018-11-20 2019-11-20 执行矩阵乘法运算的方法、电路及soc

Country Status (5)

Country Link
US (3) US11263292B2 (zh)
EP (2) EP3876092B1 (zh)
CN (3) CN111198670B (zh)
ES (1) ES2943886T3 (zh)
WO (1) WO2020103883A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182496B (zh) * 2020-09-24 2022-09-16 成都海光集成电路设计有限公司 用于矩阵乘法的数据处理方法及装置
EP4310700A1 (en) * 2021-03-31 2024-01-24 Huawei Technologies Co., Ltd. Matrix multiplier, matrix computing method, and related device
US20230342291A1 (en) * 2022-04-26 2023-10-26 Microsoft Technology Licensing, Llc Fetching non-zero data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040083253A1 (en) * 2002-10-29 2004-04-29 Yung-Hsiang Lee Method and apparatus for efficient matrix multiplication in a direct sequence CDMA system
CN102375721A (zh) * 2010-08-23 2012-03-14 联想(北京)有限公司 一种矩阵乘法运算方法、图形处理器和电子设备
CN102541814A (zh) * 2010-12-27 2012-07-04 北京国睿中数科技股份有限公司 用于数据通信处理器的矩阵计算装置和方法
CN103902507A (zh) * 2014-03-28 2014-07-02 中国科学院自动化研究所 一种面向可编程代数处理器的矩阵乘法计算装置及方法
CN106991077A (zh) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 一种矩阵计算装置

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4914615A (en) 1987-09-04 1990-04-03 At&T Bell Laboratories Calculator of matrix products
CN1553577A (zh) * 2003-12-19 2004-12-08 清华大学 一种时序逻辑数字电路的设计方法
CN100558024C (zh) * 2004-03-04 2009-11-04 智邦科技股份有限公司 多速率多载波多码分工系统的通信方法
CN101488761B (zh) * 2009-02-27 2011-01-19 北京交通大学 一种无短环无低码重码的ldpc码构造方法
WO2011156247A2 (en) * 2010-06-11 2011-12-15 Massachusetts Institute Of Technology Processor for large graph algorithm computations and matrix operations
CN102141976B (zh) * 2011-01-10 2013-08-14 中国科学院软件研究所 稀疏矩阵的对角线数据存储方法及基于该方法的SpMV实现方法
CN103235711A (zh) * 2013-04-19 2013-08-07 荣成市鼎通电子信息科技有限公司 基于查找表的准循环矩阵高速乘法器
US9367519B2 (en) * 2013-08-30 2016-06-14 Microsoft Technology Licensing, Llc Sparse matrix data structure
CN103984527B (zh) * 2014-04-01 2017-12-15 杭州电子科技大学 优化稀疏矩阵向量乘提升不可压缩管流模拟效率的方法
US9830302B1 (en) 2014-04-16 2017-11-28 Knowles Electronics, Llc Sparse matrix vector multiplication
RU2016147181A (ru) * 2014-06-17 2018-06-04 Клауд Инвент М.Л. Лтд Способ и система для определения конфигурации модели, содержащей совокупность элементов и удовлетворяющей набору ограничений
CN107563497B (zh) 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 用于稀疏人工神经网络的计算装置和运算方法
CN106126481B (zh) * 2016-06-29 2019-04-12 华为技术有限公司 一种计算系统和电子设备
CN107239823A (zh) 2016-08-12 2017-10-10 北京深鉴科技有限公司 一种用于实现稀疏神经网络的装置和方法
JP7148526B2 (ja) * 2017-02-23 2022-10-05 アーム・リミテッド データ処理装置におけるベクトルによる要素演算
CN108875956B (zh) * 2017-05-11 2019-09-10 广州异构智能科技有限公司 原生张量处理器
JP2019148969A (ja) * 2018-02-27 2019-09-05 富士通株式会社 行列演算装置、行列演算方法および行列演算プログラム
CN108805273A (zh) * 2018-05-20 2018-11-13 复旦大学 一种lstm中门控单元加速运算的硬件实现电路
CN108763163B (zh) * 2018-08-02 2023-10-20 北京知存科技有限公司 模拟向量-矩阵乘法运算电路
US10936311B1 (en) * 2019-07-09 2021-03-02 Xilinx, Inc. Sparse matrix processing circuitry
CN114341825A (zh) * 2019-08-29 2022-04-12 阿里巴巴集团控股有限公司 用于在神经网络中提供向量稀疏化的方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040083253A1 (en) * 2002-10-29 2004-04-29 Yung-Hsiang Lee Method and apparatus for efficient matrix multiplication in a direct sequence CDMA system
CN102375721A (zh) * 2010-08-23 2012-03-14 联想(北京)有限公司 一种矩阵乘法运算方法、图形处理器和电子设备
CN102541814A (zh) * 2010-12-27 2012-07-04 北京国睿中数科技股份有限公司 用于数据通信处理器的矩阵计算装置和方法
CN103902507A (zh) * 2014-03-28 2014-07-02 中国科学院自动化研究所 一种面向可编程代数处理器的矩阵乘法计算装置及方法
CN106991077A (zh) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 一种矩阵计算装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3876092A4

Also Published As

Publication number Publication date
US20210271736A1 (en) 2021-09-02
US11263292B2 (en) 2022-03-01
EP4102354A1 (en) 2022-12-14
CN111198670A (zh) 2020-05-26
EP3876092A4 (en) 2021-12-29
US20220129523A1 (en) 2022-04-28
US11860970B2 (en) 2024-01-02
CN111198670B (zh) 2021-01-29
CN113168309A (zh) 2021-07-23
EP3876092A1 (en) 2021-09-08
US11397791B2 (en) 2022-07-26
EP3876092B1 (en) 2023-04-12
CN114138231A (zh) 2022-03-04
US20220391471A1 (en) 2022-12-08
ES2943886T3 (es) 2023-06-16
CN114138231B (zh) 2022-07-22

Similar Documents

Publication Publication Date Title
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
US11397791B2 (en) Method, circuit, and SOC for performing matrix multiplication operation
US20170097884A1 (en) Pipelined convolutional operations for processing clusters
US20170060811A1 (en) Matrix operands for linear algebra operations
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN109416755B (zh) 人工智能并行处理方法、装置、可读存储介质、及终端
WO2024027039A1 (zh) 数据处理方法、装置、设备和可读存储介质
US11635904B2 (en) Matrix storage method, matrix access method, apparatus and electronic device
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
WO2021083101A1 (zh) 数据处理方法、装置及相关产品
US11481994B2 (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN112084023A (zh) 数据并行处理的方法、电子设备及计算机可读存储介质
CN109740730B (zh) 运算方法、装置及相关产品
CN116185937A (zh) 基于众核处理器多层互联架构的二元运算访存优化方法及装置
WO2021082723A1 (zh) 运算装置
CN112395009A (zh) 运算方法、装置、计算机设备和存储介质
CN112395008A (zh) 运算方法、装置、计算机设备和存储介质
CN111338974A (zh) 用于矩阵数学指令集的图块化算法
CN112395002B (zh) 运算方法、装置、计算机设备和存储介质
WO2022257980A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2021082724A1 (zh) 运算方法及相关产品
US20230297386A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2020073874A1 (zh) 机器学习运算的分配系统及方法
CN113704340A (zh) 数据处理方法、装置、服务器及存储介质
CN112785544A (zh) 一种三维线段模型的平面提取方法、系统及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19888082

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019888082

Country of ref document: EP

Effective date: 20210602