WO2018176882A1 - Matrix and vector multiplication operation method and apparatus - Google Patents

Matrix and vector multiplication operation method and apparatus

Info

Publication number
WO2018176882A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
read
vector
zero
indication information
Application number
PCT/CN2017/113422
Other languages
English (en)
French (fr)
Inventor
屠嘉晋
朱凡
林强
刘虎
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP17902691.9A (published as EP3584719A4)
Publication of WO2018176882A1
Priority to US16/586,164 (published as US20200026746A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the present application relates to a matrix and vector multiplication method and apparatus.
  • Due to the excellent performance of convolutional neural networks in image processing, image classification, audio recognition and other data-processing applications, convolutional neural networks have become one of the hot topics in academic research.
  • However, convolutional neural networks involve a large number of floating-point multiply-add operations, including matrix-and-vector multiplications, which are computationally intensive and time-consuming and make the hardware energy consumption of convolutional neural networks large. How to reduce the amount of floating-point computation in convolutional neural networks has therefore become one of the technical problems to be solved.
  • In the prior art, the positions of non-zero elements in the matrix are recorded by detecting the non-zero elements in real time, and non-zero elements and vector elements are then selected from the matrix for the multiply-add operation.
  • Such matrix and vector operations in the prior art need to determine in real time whether the value of each matrix element is zero and to record the positions of non-zero elements in real time; this real-time judgment and recording has high implementation complexity, complicated operation, low data-processing efficiency, and poor applicability.
  • the present application provides a matrix and vector multiplication operation method and device, which can reduce the complexity of data processing, reduce the power consumption of data processing, and improve data processing efficiency.
  • the first aspect provides a matrix and vector multiplication method, which may include:
  • reading, according to the first indication information, a matrix element value of a non-zero element from the preset matrix, and determining a first position code of the read matrix element value, where the first position code is the position mark of the matrix element value within a single read of matrix data;
  • The indication information of the matrix read pointer indicates the non-zero elements in the matrix to be processed, so that non-zero element values are read from the preset matrix and multiplied by the corresponding vector data values.
  • The application can read the vector data value corresponding to the position of the matrix element value from the input vector data according to the indication information of the vector read pointer, thereby eliminating the zero-value judgment of matrix element values during the multiplication operation, which reduces the complexity of data processing, reduces its power consumption, and improves data-processing efficiency.
  • Before the obtaining of the first indication information of the matrix elements, the method further includes:
  • generating the first indication information of the matrix elements according to the preset matrix and the pre-code of each non-zero element included therein.
  • The present invention can preprocess the matrix to be processed that participates in the multiplication operation, removing the zero elements to obtain a preset matrix stored in a specified storage space, and can then generate the indication information of the matrix read pointer according to the positional relationship of each non-zero element in the preset matrix. The indication information of the matrix read pointer can be used to schedule matrix elements in the matrix-and-vector multiplication, which improves the accuracy of matrix-element scheduling and the data-processing efficiency, and reduces the operational complexity of matrix-element reading.
  • the method further includes:
  • the position code of any one of the non-zero elements is smaller than the data size.
  • The application can mark the non-zero elements of the preset matrix according to the data size read in a single operation, so that the position code of any non-zero element is smaller than the data size of a single operation; this keeps the code bit width fixed and reduces data-processing complexity.
  • the first indication information includes a matrix read pointer, a matrix valid pointer, and a number of valid matrix elements;
  • the matrix read pointer is used to indicate a row of matrix elements to be read that participate in the current calculation in the preset matrix
  • the matrix valid pointer points to a position of the starting non-zero element participating in the current calculation in the row of matrix elements to be read;
  • the number of valid matrix elements is used to indicate the number M of non-zero elements to be read that participate in the current calculation, where M is an integer greater than or equal to 1;
  • the reading the matrix element values of the non-zero element from the preset matrix according to the first indication information includes:
  • Parameters such as the matrix read pointer, the matrix valid pointer, and the number of valid matrix elements indicate the reading position and the number of non-zero elements to read from the preset matrix, which improves the scheduling convenience of matrix elements and thereby the data-processing efficiency.
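The three fields above can be pictured as a small record. This is a minimal illustrative sketch; the class and field names are the author's glosses, not terminology from the application:

```python
from dataclasses import dataclass

@dataclass
class MatrixIndication:
    """Illustrative sketch of the first indication information."""
    read_pointer: int    # row of matrix elements to read in this calculation
    valid_pointer: int   # position of the starting non-zero element in that row
    num_valid: int       # M, the number of non-zero elements to read (M >= 1)

# Example: start at row 0, first non-zero element at position 0, read 3 elements.
info = MatrixIndication(read_pointer=0, valid_pointer=0, num_valid=3)
```

A scheduler driven by such a record needs no zero-testing of matrix values at run time, which is the complexity saving the application claims.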
  • the first indication information further includes a matrix read pointer increment
  • the initial value of the matrix read pointer increment is zero, indicating that the matrix element row to be read in the current calculation is the matrix element row indicated by the matrix read pointer;
  • the generating of the first indication information of the matrix elements according to the preset matrix and the pre-code of each non-zero element included therein includes:
  • the matrix read pointer increment is set to one, where a matrix read pointer increment of one indicates that the matrix element row to be read in the next calculation is two rows after the matrix element row indicated by the matrix read pointer;
  • the number of remaining non-zero elements is the number of non-zero elements, in the matrix element row to be read, that lie after the position pointed to by the matrix valid pointer.
  • The application can mark, by means of the matrix read pointer increment, the matrix element row tracked by the matrix read pointer, further guaranteeing the scheduling accuracy of matrix elements and improving data-processing efficiency.
  • the method further includes:
  • the matrix read pointer is updated in accordance with the matrix read pointer increment to obtain a matrix read pointer for the next calculation.
  • The matrix read pointer can be updated by the matrix read pointer increment, ensuring the accuracy of the matrix element row pointed to by the matrix read pointer; the accuracy of data scheduling is improved and the applicability is higher.
  • the vector data information to be read includes the vector data row to be read in the current calculation;
  • the method further includes:
  • the second indication information includes a vector data row to be read indicated by the vector read pointer, and a vector read pointer increment;
  • the vector read pointer increment indicates the number of rows between the vector data row to be read in the next calculation and the vector data row indicated by the vector read pointer.
  • the application can determine the indication information of the vector read pointer according to the number of non-zero elements of each matrix element row in the matrix to be processed, and the indication information of the vector read pointer indicates the vector data row of the vector data read from the input vector data during the multiplication operation. It can guarantee the accuracy of the multiplication of vector data and matrix element values, and improve the accuracy of data scheduling.
  • the second indication information that generates a vector element according to the number of non-zero elements included in each matrix element row includes:
  • the vector read pointer increment is set to H, and the H is a preset matrix data size and the current calculation read.
  • the vector read pointer increment is set to H+1.
  • The application can also set the vector read pointer increment according to the zero-element condition of each matrix element row in the matrix to be processed, thereby specifying the vector data row to be read in the multiplication operation; this setting of the increment skips all-zero matrix element rows, which saves data-scheduling signaling in the multiplication operation and improves data-processing efficiency.
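The skipping behaviour can be pictured with a small sketch. This is the author's gloss, not the patent's algorithm: the helper name and the exact increment bookkeeping are assumptions, and the sketch only illustrates that an all-zero matrix element row contributes an extra step to the next vector read pointer increment instead of triggering a read of its own:

```python
def vector_pointer_increments(matrix_rows, base=1):
    """Hypothetical sketch: one increment per row that is actually read;
    an all-zero row folds an extra step into the next read's increment."""
    increments = []
    pending = base
    for row in matrix_rows:
        if all(v == 0 for v in row):
            pending += base          # skip this row's vector data entirely
        else:
            increments.append(pending)
            pending = base
    return increments

# A middle all-zero row is skipped: the third row's read jumps by 2.
steps = vector_pointer_increments([[1, 0], [0, 0], [2, 3]])
```

Because the increments are precomputed from the known matrix, no run-time signaling is needed to step past the all-zero row.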
  • the reading from the input vector data according to the second indication information includes:
  • searching for the vector data row to be read in the input vector data, where the input vector data includes T*K elements and T is an integer greater than 1;
  • a vector element value of the second position code corresponding to the first position code is read from the vector data line.
  • The present application searches for the vector data row to be read in the input vector data by using the indication information of the vector read pointer, and reads, from the found vector data row, the vector element value corresponding to the read matrix element value. By inputting more vector data, the application ensures effective utilization of the operators in the accelerator and improves the applicability of the matrix-and-vector multiplication.
  • the method further includes:
  • the vector read pointer is updated according to the increment of the vector read pointer to obtain a vector read pointer for the next calculation.
  • The vector read pointer can be updated by the vector read pointer increment, ensuring the accuracy of the vector data row pointed to by the vector read pointer for each operation; the accuracy of data scheduling is improved and the applicability is higher.
  • a second aspect provides a matrix and vector multiplication device, which may include: a memory, a scheduling unit, and an operator;
  • the memory is configured to store a preset matrix and first indication information of a matrix element of the preset matrix, where the first indication information is used to indicate a non-zero element in the preset matrix;
  • the scheduling unit is configured to acquire the first indication information from the memory, and read a matrix element value of a non-zero element from the preset matrix according to the first indication information, and determine the read a first position code of the matrix element value, where the first position code is a position mark of the matrix element value in a single read matrix data;
  • the memory is further configured to store input vector data and second indication information of a vector element of the input vector data, where the second indication information is used to indicate vector data information to be read;
  • the scheduling unit is further configured to read the second indication information from the memory, and to read, according to the second indication information, from the input vector data the vector element value of the second position code corresponding to the first position code;
  • the operator is configured to calculate a multiplication value of the matrix element value read by the scheduling unit and the vector element value.
  • the multiplying device further includes:
  • a general-purpose processor configured to acquire a matrix to be processed, and to label the position of each matrix element in the to-be-processed matrix to obtain a pre-code of each matrix element, where each row in the to-be-processed matrix includes K elements and K is an integer greater than zero;
  • the general-purpose processor is further configured to select the non-zero elements in the to-be-processed matrix, and to generate a preset matrix according to the pre-codes of the non-zero elements in the to-be-processed matrix and store it in the memory, where each row in the preset matrix includes K non-zero elements;
  • the general-purpose processor is further configured to generate first indication information of the matrix element according to the preset matrix and a pre-code of each non-zero element included therein, and store the information to the memory.
  • the general-purpose processor is further configured to:
  • the position code of any one of the non-zero elements is smaller than the data size.
  • the first indication information includes a matrix read pointer, a matrix valid pointer, and a valid matrix element number
  • the matrix read pointer is used to indicate a row of matrix elements to be read that participate in the current calculation in the preset matrix
  • the matrix valid pointer points to a position of the starting non-zero element participating in the current calculation in the row of matrix elements to be read;
  • the number of valid matrix elements is used to indicate the number M of non-zero elements to be read that participate in the current calculation, where M is an integer greater than or equal to 1;
  • the scheduling unit is used to:
  • the first indication information further includes a matrix read pointer increment
  • the initial value of the matrix read pointer increment is zero, indicating that the matrix element row to be read in the current calculation is the matrix element row indicated by the matrix read pointer;
  • the general purpose processor is used to:
  • the matrix read pointer increment is set to one, where a matrix read pointer increment of one indicates that the matrix element row to be read in the next calculation is two rows after the matrix element row indicated by the matrix read pointer;
  • the number of remaining non-zero elements is the number of non-zero elements, in the matrix element row to be read, that lie after the position pointed to by the matrix valid pointer.
  • the general-purpose processor is further configured to:
  • the matrix read pointer is updated in accordance with the matrix read pointer increment to obtain a matrix read pointer for the next calculation.
  • the vector data information to be read includes the vector data row to be read in the current calculation;
  • the general purpose processor is also used to:
  • the second indication information includes a vector data row to be read indicated by the vector read pointer, and a vector read pointer increment;
  • the vector read pointer increment indicates the number of rows between the vector data row to be read in the next calculation and the vector data row indicated by the vector read pointer.
  • the general-purpose processor is used to:
  • the vector read pointer increment is set to H, and the H is a preset matrix data size and the current calculation read.
  • the vector read pointer increment is set to H+1.
  • the scheduling unit is configured to:
  • searching for the vector data row to be read in the input vector data, where the input vector data includes T*K elements and T is an integer greater than 1;
  • a vector element value of the second position code corresponding to the first position code is read from the vector data line.
  • the general-purpose processor is further configured to:
  • the vector read pointer is updated according to the increment of the vector read pointer to obtain a vector read pointer for the next calculation.
  • The present application indicates the non-zero elements in the matrix to be processed by using a matrix read pointer, a matrix valid pointer, a number of valid matrix elements, and a matrix read pointer increment, and reads non-zero element values and vector data values from the preset matrix for the multiplication operation. This improves the scheduling accuracy of matrix elements, removes the non-zero judgment of matrix elements before matrix element values are scheduled, and reduces the scheduling complexity of matrix elements.
  • The application can read the vector data value corresponding to the position of the matrix element value from the input vector data according to indication information such as the vector read pointer and the vector read pointer increment, thereby eliminating the zero-value judgment of matrix element values in the multiplication operation; this reduces the complexity of data processing, reduces its power consumption, and improves data-processing efficiency.
  • The application can also label the positions of the matrix elements in the preset matrix according to the data size of a single read, which keeps the label bit width fixed and reduces the operational complexity of data processing.
  • FIG. 1 is a schematic diagram of a multiplication operation of a matrix and a vector according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a matrix and vector multiplication device according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method for multiplying a matrix and a vector according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of sparse matrix preprocessing according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of position code acquisition of matrix elements according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of indication information of a matrix/vector read pointer according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a PE according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a matrix and vector multiplication device according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a multiplication operation of a matrix and a vector according to an embodiment of the present invention.
  • the to-be-processed matrix participating in the multiplication operation is an A*B matrix
  • the input vector data participating in the multiplication operation is a B*1 vector.
  • The above A*B matrix is multiplied by the above B*1 vector to obtain an A*1 vector. That is, the matrix to be processed is a matrix of A rows and B columns, and one or more zero elements are included in the to-be-processed matrix.
  • The input vector data is a column vector of B rows.
  • The matrix elements of each row of the matrix are paired with the vector elements, the two elements in each pair are multiplied, and the products are then added together; the final value is the result for that row. For example, the matrix elements of the first row of the matrix to be processed are paired with the vector elements, e.g. (1,1), (0,3), (0,5), (1,2); the two elements in each pair are multiplied to obtain the product of each pair, and the products are added together to obtain the result of the first row. Performing the same operation on the matrix elements of each row gives a 4*1 vector.
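As a minimal numerical illustration of the pairing just described, using the example pairs (1,1), (0,3), (0,5), (1,2) from the text:

```python
# First row of the matrix to be processed and the input vector,
# taken from the example pairs in the text.
row = [1, 0, 0, 1]
vec = [1, 3, 5, 2]

# Multiply each (matrix element, vector element) pair and sum the
# products to obtain the result for this row.
result = sum(m * v for m, v in zip(row, vec))   # 1*1 + 0*3 + 0*5 + 1*2 = 3
```

Note that the two zero matrix elements contribute nothing to the sum, which is exactly why the application removes them before the operation.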
  • The prior art densifies sparse matrices (i.e., the zero elements in the matrix are discarded, and a new matrix is regenerated from the remaining non-zero elements), which saves data storage space and reduces the number of matrix elements involved in the multiplication.
  • However, once the prior art densifies the matrix, the multiplication of the densified matrix and the vector becomes more complicated. For example, the densified matrix needs to record the label and value of each matrix element, where the label indicates the position of the matrix element in the matrix.
  • For example, the first element of the first row in FIG. 1 may have a label of 1, the second element of the first row a label of 2, and so on; the first element of the second row has a label of 4, and the last element of the last row a label of about 20.
  • In the multiplication operation, the corresponding vector elements may be selected according to the recorded non-zero elements; that is, the vector element that was paired with a non-zero element before the densification processing is the vector element corresponding to that non-zero element. If the span between vector elements is too large, the location of the next vector element in memory must be found.
  • Embodiments of the present invention provide a method and apparatus for multiplying a matrix and a vector.
  • The matrix can be treated as data with known characteristics, and a set of control signals for the matrix-and-vector multiplication is generated by software; the control signals are used to select the correct data from the matrix data and the vector data to perform the multiplication operation.
  • In the multiplication operation of the matrix and the vector, the multiplication device only needs to perform the corresponding operation according to the control signals and does not need to perform operations such as data judgment and real-time recording; the operation is simple and the data-processing efficiency is high.
  • The control signals of the multiplication operation need to be generated only once, after which all operations of the multiplication device can be triggered by the control signals without real-time judgment or data recording; this reduces the scheduling complexity of matrix elements and improves data-processing efficiency.
  • FIG. 2 is a schematic structural diagram of a matrix and vector multiplication device according to an embodiment of the present invention.
  • the multiplication device provided by the embodiment of the present invention may specifically be a multiplication accelerator.
  • the top-level architecture of the multiplication accelerator shown in FIG. 2 includes x process engines (PEs), a vector random access memory (RAM), a matrix information RAM, and a controller.
  • Each of the PEs further includes a matrix RAM for storing the matrix participating in the multiplication operation.
  • Each PE is a floating-point multiply-accumulate (FMAC) unit.
  • The matrix information RAM, the matrix RAM and the vector RAM each contain one write port and two read ports. In each multiplication operation, reading data from the matrix RAM and the vector RAM means simultaneously reading the two rows of data at rows L and L+1 in the matrix RAM and the vector RAM. Each row of the matrix RAM and the vector RAM has K elements (the position code of each element is also stored in the corresponding row of the matrix RAM).
  • The output width of the matrix RAM is K elements, that is, the output width of the matrix RAM indicates the number of elements (i.e., K) read from the matrix RAM in a single PE operation; the output width of the vector RAM is T*K elements.
  • T may be a predefined multiple, which may be determined according to the percentage of zero elements among all the elements of the matrix participating in the operation; it is not limited herein.
  • the matrix stored in the matrix RAM may be a matrix after densification
  • The matrix may not include zero elements; that is, the K elements in a row of data read from the matrix RAM in a single PE operation may be K non-zero elements remaining after the zero elements are eliminated, so the K matrix elements actually read may include matrix elements whose labels are greater than K.
  • the vector elements paired with the actual K non-zero matrix elements may therefore span more than K positions;
  • the vector elements are real-time input data
  • the data in the vector RAM has not been subjected to the densification processing;
  • the vector RAM should output more data than K. For example, suppose K is 8.
  • Suppose the number of zero elements in the 16 (i.e., 2K) data values included in the Lth-row and (L+1)th-row data of the matrix is four.
  • The Lth row includes the elements (2, 4, 3, 0, 5, 0, 1, 0);
  • the (L+1)th row includes the elements (7, 6, 9, 0, 8, 2, 1, 4).
  • the K non-zero elements read after the densification process may be (2, 4, 3, 5, 1, 7, 6, 9).
  • The K non-zero elements read after the densification processing thus include data from both the Lth row and the (L+1)th row.
  • Because the vector elements are input data and are not subjected to the densification processing, more than K vector elements should be read at this time, so that the vector elements of row L+1 can be paired with the matrix elements whose labels are greater than K, guaranteeing that the vector elements paired with all K non-zero matrix elements are read. That is, when there are many zero elements in the matrix, it is difficult to utilize all the arithmetic operators in the PE effectively if only K elements are taken. If T*K vector elements are taken (T>1), the vector elements cover a larger range of positions, which makes it easier to select more non-zero elements to participate in the FMAC operation and improves the utilization of the PE's internal multipliers.
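Using the numbers above (K = 8 and the two example rows), the densified read can be sketched as a minimal illustration; the list comprehension below stands in for the hardware preprocessing and is not the patent's implementation:

```python
K = 8
row_L  = [2, 4, 3, 0, 5, 0, 1, 0]   # Lth row of the matrix
row_L1 = [7, 6, 9, 0, 8, 2, 1, 4]   # (L+1)th row of the matrix

# Densification: drop the zero elements across the 2K values and keep
# the first K survivors, remembering each one's original position
# (its label/pre-code within the two-row read).
flat = row_L + row_L1
dense = [(pos, val) for pos, val in enumerate(flat) if val != 0][:K]
values = [val for pos, val in dense]      # -> [2, 4, 3, 5, 1, 7, 6, 9]
positions = [pos for pos, val in dense]   # some positions exceed K-1
```

Here the last three kept elements sit at positions 8, 9 and 10, beyond the first K positions, which is exactly why the paired vector elements must be drawn from more than K positions (T*K, T>1).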
  • the two operands in the FMAC operation are matrix elements and vector elements, respectively.
  • the matrix elements are pre-processed and stored in the matrix RAM in the PE.
  • the vector element can be stored in a large RAM at the far end and input into the PE as real-time input data to participate in the FMAC operation.
  • the vector RAM broadcasts T*K elements to the bus.
  • T can be customized.
  • The matrix information of each PE stored in the matrix information RAM is sent to each PE over the broadcast bus.
  • the execution body of the multiplication method of the matrix and the vector provided by the embodiment of the present invention may be the above-mentioned PE, or may be a function module in the PE, or may be the controller or the like, and is not limited herein.
  • FIG. 3 is a schematic flowchart diagram of a multiplication operation method of a matrix and a vector according to an embodiment of the present invention.
  • the method provided by the embodiment of the present invention may include the following steps:
  • a matrix to be processed participating in the FMAC operation may be acquired, and the matrix to be processed is preprocessed to obtain matrix initialization information.
  • each matrix element in the to-be-processed matrix may be positionally coded to obtain a pre-code of each matrix element.
  • the PE may code each matrix element according to the position of each matrix element in the matrix to be processed, and obtain the pre-code of each matrix element.
  • FIG. 4 is a schematic diagram of sparse matrix preprocessing.
  • the data to the left of the "Dense" arrow of Figure 4 is the matrix elements included in the matrix to be processed and their corresponding pre-codes.
  • the PE may select a non-zero element from the to-be-processed matrix, and generate a preset matrix according to the pre-labeled code of the non-zero element.
  • The preset matrix is the matrix obtained after the matrix to be processed is densified; no zero elements are included in the preset matrix, and each row in the preset matrix also includes K elements.
  • The data shown to the right of the "Dense" arrow in FIG. 4 are the non-zero elements in the matrix to be processed and their corresponding pre-codes. It should be noted that FIG. 4 shows the processing of only part of the data in the matrix shown in Table 1; the data not shown may be obtained in the same way, which is not limited herein.
  • The PE may process the pre-code of each non-zero element included in the preset matrix according to the data size (for example, 2K) read in a preset single operation (for example, the current calculation) to obtain the position code (i.e., the first position code) of each non-zero element. That is, a pre-code whose value is greater than K is taken modulo 2K to obtain the actual code of the matrix element (i.e., the first position code).
  • the processed matrix elements and their corresponding positional codes can be stored in the matrix RAM of the PE.
  • FIG. 5 is a schematic diagram of obtaining location code of a matrix element according to an embodiment of the present invention.
  • the pre-labeling code of each non-zero element in the preset matrix can be processed to obtain the position code of each non-zero element.
  • The pre-code of each matrix element is taken modulo 2K to obtain the position code of each non-zero matrix element, so that the position code of each non-zero matrix element is smaller than 2K; the position code of a non-zero matrix element thus has a fixed bit width, which saves storage space for position codes and improves the applicability of data processing.
  • The actual code obtained by taking a pre-code larger than K modulo 2K is the position mark of the matrix element within the matrix data of a single read. For example, a single read covers 16 matrix elements, i.e., the 16 matrix elements with pre-codes 0-15, and the matrix element with pre-code 15 is the data at the position marked 15 among those 16 matrix elements. If the 16 matrix elements read in a single operation are the 16 matrix elements with pre-codes 16-31, the actual codes of these 16 matrix elements are 0-15, and the matrix element with pre-code 31 is the data at the position marked 15 among the 16 matrix elements read this time.
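The modulo rule above can be sketched as follows, a minimal illustration assuming K = 8 so that a single read covers 2K = 16 elements:

```python
K = 8  # elements per matrix RAM row; a single read covers 2K elements

def position_code(pre_code, k=K):
    """Reduce a pre-code to its position mark within one 2K-element read."""
    return pre_code % (2 * k)

# Pre-code 15 is position 15 of the first 16-element read; pre-code 31
# is position 15 of the second read; pre-code 16 is its position 0.
codes = [position_code(p) for p in (15, 31, 16)]
```

Because every position code fits in the range 0 to 2K-1, its bit width is fixed regardless of how large the original matrix is.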
  • the first indication information of the matrix elements of the preset matrix may also be generated according to the pre-labeling codes of the respective non-zero elements in the preset matrix.
  • the first indication information may include a matrix read pointer, a matrix valid pointer, and a number of valid matrix elements.
  • FIG. 6 is a schematic diagram of indication information of a matrix/vector read pointer according to an embodiment of the present invention. The label is the pre-code described in the embodiment of the present invention. In the specific implementation, if K is 8, then the data read by a single operation is 16 (that is, 2K) data.
  • the matrix elements of the preset matrix may be divided into groups of 16 according to the pre-codes of the matrix elements in the preset matrix; for example, the matrix elements with pre-codes 0-15 are one group, those with pre-codes 16-31 another group, and so on.
  • the number of valid matrix elements may be determined according to the number of non-zero elements included in each set of matrix elements, for example, a set of matrix elements with a pre-labeled code of 0-15 has three non-zero elements, and thus the effective matrix elements The number is 3.
• the above matrix read pointer is used to indicate the row of matrix elements in the preset matrix to be read for the current calculation.
• the matrix read pointer corresponding to the matrix element group with pre-codes 0-15 is 0.
• in the first operation, the matrix read pointer causes the current row and the next row to be read (that is, two rows are read each time), for example the first row and the second row of the preset matrix, and so on.
• the matrix valid pointer points to the position, within the row of matrix elements to be read, of the starting non-zero element participating in the current calculation.
• the matrix valid pointer corresponding to the matrix element group with pre-codes 0-15 is 0, so the matrix elements to be read are read starting from the element whose actual code is 0 in the first row of the preset matrix.
• the number of valid matrix elements is used to indicate the number M of non-zero elements to be read for the current calculation, that is, the number of elements that can be multiplied; it also expresses the number of valid elements that can be read in the range [i*K, (i+2)*K].
• the above i is an integer greater than or equal to 0, and the data in the range [i*K, (i+2)*K] of the matrix to be processed corresponds to two rows of data.
• for the first operation, [i*K, (i+2)*K] refers to the two rows of data whose pre-codes are 0-15.
• the number of valid matrix elements indicates the number of valid elements in this range, for example three.
• the matrix elements read by the second operation form the group with pre-codes 16-31, and this group contains 2 non-zero elements (the 2 matrix elements with labels 23 and 31 shown in FIG. 6), so the number of valid matrix elements is 2.
• the matrix read pointer corresponding to the matrix element group with pre-codes 16-31 is still 0.
• the rows read via the matrix read pointer are the current row and the next row, for example the first row and the second row of the preset matrix, and so on. It should be noted that the first row of the preset matrix contains 8 non-zero matrix elements and the first operation read only 3 of them, so the matrix read pointer of the second operation is still 0, that is, reading still starts from the first row.
• the matrix valid pointer corresponding to the matrix element group with pre-codes 16-31 is 3, so the matrix elements to be read are read starting from the element whose actual code is 3 in the first row of the preset matrix, that is, from the fourth matrix element of the first row; two matrix elements are read this time.
• the number of valid matrix elements indicates that the number M of non-zero elements to be read for this calculation is 2.
• the matrix elements read by the fifth operation form the group with pre-codes 64-79, and this group contains 2 non-zero elements (the 2 matrix elements with labels 71 and 79 shown in FIG. 6), so the number of valid matrix elements is 2.
• the matrix read pointer increment corresponding to the matrix element group with pre-codes 64-79 is +1 (that is, the matrix read pointer is incremented by 1), so the rows read via the matrix read pointer are the row following the row previously pointed to and the row after that, for example the second and third rows of the preset matrix, and so on.
• the first row of the preset matrix contains 8 non-zero matrix elements, while the first four operations have already read 9 matrix elements, that is, 3+2+2+2.
• these 9 matrix elements comprise the 8 matrix elements of the first row of the preset matrix and the first matrix element of the second row. Therefore the matrix read pointer of the fifth operation is incremented by 1, that is, reading starts from the row after the first row.
• the matrix valid pointer corresponding to the matrix element group with pre-codes 64-79 is 1, so the matrix elements to be read are read starting from the element whose actual code is 1 in the second row of the preset matrix, that is, from the second matrix element of the second row; two matrix elements are read this time.
• the number of valid matrix elements indicates that the number M of non-zero elements to be read for this calculation is 2.
• in this way, the first indication information comprising the matrix read pointer, the matrix valid pointer, and the number of valid matrix elements corresponding to each matrix element group is generated.
• optionally, the first indication information of the matrix elements further includes a matrix read pointer increment.
• the initial value of the matrix read pointer increment is zero, indicating that the row of matrix elements to be read in the current calculation is the row indicated by the matrix read pointer (two rows are read each time, starting from the row the pointer points to). If the number of non-zero matrix elements to be read this time is greater than the number of non-zero matrix elements remaining in the row indicated by the matrix read pointer, the matrix read pointer increment of this operation is set to 1, which is used to update the matrix read pointer for the next operation.
• the matrix read pointer is then incremented by one.
• a matrix read pointer increment of 1 indicates that the rows of matrix elements to be read in the next calculation are the two rows following the row indicated by the matrix read pointer of this operation.
• the remaining non-zero elements are the non-zero elements, in the row indicated by the matrix read pointer of the current operation, that come after the position pointed to by the matrix valid pointer.
• for example, at the fourth operation, in the first row of the preset matrix, the number of non-zero elements remaining after the position pointed to by the matrix valid pointer (the non-zero element with code 7) is less than 2, so the matrix read pointer increment after the fourth operation is 1, meaning that the matrix read pointer points to the second row of the matrix during the fifth operation.
• the matrix read pointer can then be updated according to the matrix read pointer increment to obtain the matrix read pointer of the fifth operation.
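The bookkeeping across operations — the valid pointer advancing within a row of K non-zero elements, and the read pointer increment firing when a read exhausts that row — can be sketched as follows. This is a behavioral model of the FIG. 6 walk-through, not the hardware; the function name and flat-list representation are illustrative assumptions.

```python
def schedule(reads, K=8):
    """For each operation's count M of non-zero elements to read, emit
    (matrix_valid_pointer, matrix_read_pointer_increment) as in the
    FIG. 6 walk-through. `reads` lists the valid-element counts per beat."""
    consumed = 0  # total non-zero elements read so far
    out = []
    for M in reads:
        valid_ptr = consumed % K          # start position in the current row
        consumed += M
        # the increment is 1 exactly when this read crosses a row boundary
        inc = (consumed // K) - ((consumed - M) // K)
        out.append((valid_ptr, inc))
    return out

# The five operations of the example read 3, 2, 2, 2, 2 non-zero elements:
# op 4 produces increment 1, and op 5 starts at valid pointer 1.
print(schedule([3, 2, 2, 2, 2]))
```

Running this reproduces the values in the text: valid pointers 0, 3, 5, 7, 1, with the read pointer increment becoming 1 only after the fourth operation.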
• the first indication information of the matrix elements described in FIG. 6 may be stored in the matrix information RAM, and when the PE performs the FMAC operation it may be acquired from the broadcast bus.
• the first indication information may be the matrix indication information obtained after initialization of the to-be-processed matrix, stored in the matrix information RAM.
• the matrix indication information may be obtained from the broadcast bus, and the matrix read pointer, the matrix valid pointer, and the number of valid matrix elements contained in the matrix indication information are used to schedule the non-zero elements of the preset matrix for performing the FMAC operation.
• matrix data such as the to-be-processed matrix described in the embodiment of the present invention is known data, and known data does not change. Therefore, the initialization information obtained by preprocessing the to-be-processed matrix can guide the multiplier in scheduling and operating on the data. Here, one beat of data scheduling and operation corresponds to one processing cycle, which can improve the processing efficiency of the data operation and reduce the operational complexity of the matrix-vector multiplication.
• the PE may look up, in the preset matrix according to the first indication information, the specified matrix element row pointed to by the matrix read pointer, and read M matrix element values from that row starting from the specified position pointed to by the matrix valid pointer.
• for example, the matrix element values of the three non-zero elements can be read starting from the position of the first matrix element of the first row of the preset matrix according to the matrix read pointer.
• the position code (that is, the first position code) of each read matrix element value may also be determined, in order to read the paired vector element from the input vector data.
• using the position code of the matrix element value, the vector element value paired with it for the multiply-accumulate operation can be read from the vector data.
• the indication information of the vector data (that is, the second indication information) may be determined according to the number of non-zero elements contained in each matrix element row of the matrix to be processed.
• the embodiment of the present invention indicates the row of vector data to be read by a vector read pointer.
• the second indication information includes the vector data row to be read indicated by the vector read pointer, and a vector read pointer increment.
• the read vector data needs to be paired with the matrix data, so when the data size of a single read of matrix data is 2K (that is, two rows), the data size of a single read of vector data should also be 2K. The vector read pointer increment can therefore be set to the number of vector RAM rows between the vector elements output in two consecutive beats. That is, the vector read pointer increment indicates the number of rows between the vector data row to be read in the next calculation and the vector data row indicated by the vector read pointer in the current calculation, where the vector data row indicated by the vector read pointer is the vector data row read this time.
• the vector read pointer increment may be set to 2, that is, the ratio H of the data size (2K) of a single read to K, so H is 2. If the elements of a matrix element row of the matrix to be processed are all zero, that row can be skipped directly, that is, all-zero matrix element rows do not need to be multiplied. In this case the vector read pointer increment can be set to the number of rows to be skipped. For example, if the elements in the range [i*K, (i+2)*K] of the to-be-processed matrix are all zero, 2 rows can be skipped directly, and the vector read pointer increment can be set to 2 or 4 accordingly.
• more generally, if the elements in the range [i*K, (i+N)*K] of the matrix to be processed are all zero, N rows can be skipped directly, and the vector read pointer increment can be set to N. As shown in FIG. 6, according to the labels of the matrix elements in the matrix to be processed, the elements between code 127 and code 300 are zero, and these codes are 22 rows apart; therefore the vector read pointer increment can be set to 22. If the interval between code C and code D is less than 2K, the vector read pointer increment is set to 2. For details, refer to the example shown in FIG. 6; details are not described herein again.
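The zero-row skipping described above can be sketched functionally as follows. This is a simplified illustration under the assumption that the per-row non-zero counts of the to-be-processed matrix are known in advance (as they are after preprocessing); the function name and list representation are not from the patent.

```python
def vector_increments(nonzeros_per_row, H=2):
    """Sketch of choosing vector read pointer increments: advance H rows
    per beat, but widen the jump over runs of all-zero rows so those rows
    are never scheduled for multiplication."""
    incs = []
    row = 0
    n = len(nonzeros_per_row)
    while row < n:
        step = H
        # extend the step past any all-zero rows that would come next
        while row + step < n and nonzeros_per_row[row + step] == 0:
            step += 1
        incs.append(step)
        row += step
    return incs

# Rows 2-5 are all zero, so the first increment jumps straight to row 6.
print(vector_increments([3, 2, 0, 0, 0, 0, 1, 1]))
```

With no all-zero rows, every increment is simply H; runs of zero rows produce one enlarged increment, which is the signaling saving the text describes.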
• the indication information of the above vector may be obtained by preprocessing and stored in the vector RAM, and may be transmitted to the PE through the broadcast bus when the PE performs the FMAC operation.
• the vector read pointer increment can be used to update the vector read pointer after each read of data, to obtain the vector read pointer of the next calculation, thereby achieving accurate scheduling of the vector data.
• the vector data row indicated by the vector read pointer may be looked up in the input vector data according to the second indication information of the vector elements, and the vector element value of the second position code corresponding to the first position code is read from that vector data row.
• the second position code corresponding to the first position code is the position of the vector element value paired with the matrix element value at the first position code.
• the input vector data may be one output width of the vector RAM, specifically T*K elements, where T is an integer greater than 1. That is, if the matrix RAM output width is K non-zero elements, the vector RAM can output T*K elements to ensure that sufficient vector elements are available to pair with the matrix elements, thereby improving the accuracy of the matrix-vector multiplication.
• the matrix element values and the vector element values then undergo the multiply-accumulate operation, thereby obtaining the multiplication result of the matrix element values and the vector element values.
• FIG. 7 is a schematic diagram of a PE architecture; it roughly introduces how the PE performs data scheduling and multiply-accumulate operations according to the indication information of the matrix elements stored in the matrix information RAM and the indication information of the vector elements stored in the vector RAM.
• each PE essentially performs an FMAC operation.
• the PE is structured as a pipeline and can be divided into a 2+N stage pipeline, including a 2-stage data scheduling pipeline (comprising the read stage and the data stage) and an N-stage operation pipeline (that is, the operation stage), as shown by C0, C1, ..., C5.
• an adder uses the matrix read pointer returned from the matrix RAM and the matrix read pointer increment transmitted from the broadcast bus to update the matrix read pointer.
• the PE can maintain a matrix mask register, and use the matrix valid pointer and the number of valid matrix elements input from the matrix information RAM through the broadcast bus to generate a mask that filters out the matrix elements that have already been computed.
• the matrix elements that have already been operated on may be filtered out of the data read from the preset matrix stored in the matrix RAM by this mask of the matrix elements; that is, according to the matrix valid pointer and the number of valid matrix elements, the valid matrix elements participating in the current FMAC operation are selected from the matrix RAM output, and these valid matrix elements of the preset matrix can be input to the operation pipeline.
• the vector input (that is, the input vector data), the vector read pointer, and the vector read pointer increment can also be stored in the vector RAM beforehand; no limitation is imposed here.
• the input vector data may include 2K elements and may be divided into an upper-layer vector and a lower-layer vector.
• the PE can read the input vector data from the vector RAM, and, according to information such as the pre-codes of the matrix element values transmitted from the matrix RAM, input the corresponding vector element values from the input vector data to the operation pipeline through 32-to-1 selectors, where the matrix and vector multiply-accumulate operations are performed.
• matrix data can be read from the matrix RAM, and K or fewer valid matrix elements of the preset matrix are passed to the operation stage.
• the corresponding vector elements can be selected by a plurality of selectors (such as the 32-to-1 selectors shown) according to the pre-codes read from the matrix RAM, and passed to the operation stage.
• each of the plurality of selectors may select, from the 2K elements, the one vector element corresponding to the matrix element with the given pre-code.
• for unused pre-code positions, a 0 can be passed in, or an enable signal can be passed to make the corresponding multiplier idle, thereby saving computation in the multipliers.
• the accelerator multiplies and adds the incoming data, and in the last stage the operation result is added to the previously accumulated result.
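The data path just described — for each scheduled non-zero matrix element, a selector picks the vector element at the matching position, and the products are accumulated — can be sketched as a behavioral model. This is not the pipelined hardware; the variable names are assumptions for illustration.

```python
def fmac_beat(matrix_vals, position_codes, vector_window, acc=0.0):
    """One beat: each selector picks, from the 2K-element vector window,
    the element whose index equals the matrix element's position code;
    the products are accumulated onto the running result."""
    for val, code in zip(matrix_vals, position_codes):
        acc += val * vector_window[code]
    return acc

# Three valid matrix elements with position codes 0, 3 and 7
# against a 16-element (2K, K = 8) vector window of all ones.
print(fmac_beat([1.0, 2.0, 3.0], [0, 3, 7], [1.0] * 16))
```

Passing the previous beat's result back in as `acc` mirrors the last pipeline stage adding the new products onto the accumulated result.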
• the non-zero elements in the matrix to be processed are indicated by the matrix read pointer, the matrix valid pointer, the number of valid matrix elements, and the matrix read pointer increment, and the non-zero element values read from the preset matrix are multiplied with the vector data values. This can improve the scheduling accuracy of the matrix elements, remove the non-zero checks on matrix elements before the matrix element values are scheduled, and reduce the scheduling complexity of the matrix elements.
• the embodiment of the invention can also read, from the input vector data according to indication information such as the vector read pointer and the vector read pointer increment, the vector data values corresponding to the positions of the matrix element values, thereby saving the checks on matrix element values during the multiplication. This can reduce the complexity of data processing, reduce the power consumption of data processing, and improve data processing efficiency.
• the application can also position-code the matrix elements in the preset matrix according to the data size of a single read, which ensures a fixed code width and reduces the operational complexity of the data processing.
  • FIG. 8 is a schematic structural diagram of a matrix and vector multiplication device according to an embodiment of the present invention.
  • the multiplication device provided by the embodiment of the present invention may specifically be the PE described in the embodiment of the present invention.
• the multiplication device provided by the embodiment of the present invention may include: a memory 801, a scheduling unit 802, an operator 803, and a general purpose processor 804 (such as a central processing unit, CPU).
  • the memory 801 may be a matrix RAM, a matrix information RAM, a vector RAM, etc., which may be determined according to actual application requirements, and is not limited herein.
• the foregoing scheduling unit 802 may be a functional module such as a read pointer, a filter, or a selector in the PE, or may be a functional module in another form for scheduling the data stored in the memory 801, which is not limited herein.
• the above operator 803 may be a functional module such as an adder or an accelerator in the PE, which is not limited herein.
  • the general-purpose processor 804 may also be a data pre-processing module external to the PE, or a data initialization module, for performing operations such as pre-processing or initialization of the matrix data, and is not limited herein.
  • the foregoing memory 801 is configured to store a preset matrix and first indication information of a matrix element of the preset matrix, where the first indication information is used to indicate a non-zero element in the preset matrix.
  • the scheduling unit 802 is configured to acquire the first indication information from the memory 801, and read a matrix element value of a non-zero element from the preset matrix according to the first indication information, and determine to read a first position code of the matrix element value, the first position code being a position mark of the matrix element value in a single read matrix data;
  • the memory 801 is further configured to store input vector data and second indication information of a vector element of the input vector data, where the second indication information is used to indicate vector data information to be read.
• the scheduling unit 802 is further configured to read the second indication information from the memory 801, and read, from the input vector data according to the second indication information, the vector element value of the second position code corresponding to the first position code.
  • the operator 803 is configured to calculate a multiplication value of the matrix element value read by the scheduling unit and the vector element value.
  • the multiplying device further includes:
  • a general-purpose processor 804 configured to acquire a matrix to be processed, and perform position labeling on each matrix element in the to-be-processed matrix to obtain a pre-code of each matrix element, where each row in the to-be-processed matrix includes K Elements, K is an integer greater than zero.
  • the general-purpose processor 804 is further configured to: select a non-zero element in the to-be-processed matrix, and generate a preset matrix according to a pre-code of a non-zero element in the to-be-processed matrix and store the same to the memory, Each row in the preset matrix includes K non-zero elements.
  • the general-purpose processor 804 is further configured to generate first indication information of the matrix element according to the preset matrix and the pre-code of each non-zero element included therein, and store the information to the memory.
• the general purpose processor 804 is further configured to: process the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in the current calculation, to obtain a position code of each non-zero element, and add the position code of each non-zero element to the first indication information;
• where the position code of any one of the non-zero elements is smaller than the data size.
  • the foregoing first indication information includes a matrix read pointer, a matrix valid pointer, and a valid matrix element number
  • the matrix read pointer is used to indicate a row of matrix elements to be read that participate in the current calculation in the preset matrix
  • the matrix valid pointer points to a position of the starting non-zero element participating in the current calculation in the row of matrix elements to be read;
  • the number of the effective matrix elements is used to indicate the number M of non-zero elements to be read participating in the current calculation, and the M is an integer greater than or equal to 1;
• the scheduling unit is configured to: look up, in the preset matrix, the specified matrix element row pointed to by the matrix read pointer, and read M matrix element values from the specified matrix element row starting from the specified position pointed to by the matrix valid pointer.
  • the foregoing first indication information further includes a matrix read pointer increment
  • the initial value of the matrix read pointer increment is zero, indicating that the matrix element to be read in the current calculation acts as a matrix element row indicated by the matrix read pointer;
• the general purpose processor is configured to:
• if the M is greater than the number of remaining non-zero elements in the row of matrix elements to be read, add 1 to the matrix read pointer increment, where a matrix read pointer increment of 1 indicates that the rows of matrix elements to be read in the next calculation are the two rows after the matrix element row indicated by the matrix read pointer;
• where the remaining non-zero elements are the non-zero elements, in the row of matrix elements to be read, after the position pointed to by the matrix valid pointer.
  • the general purpose processor 804 is further configured to:
  • the matrix read pointer is updated in accordance with the matrix read pointer increment to obtain a matrix read pointer for the next calculation.
  • the vector data information to be read includes the vector data row to be read in the current calculation
• the general purpose processor 804 is further configured to: determine, according to the pre-codes of the non-zero elements in the matrix to be processed, the number of non-zero elements contained in each matrix element row of the matrix to be processed, and generate the second indication information of the vector elements according to the number of non-zero elements contained in each matrix element row;
• where the second indication information includes the vector data row to be read indicated by the vector read pointer, and a vector read pointer increment;
  • the vector read pointer increment indicates the number of intervals of the vector data line to be read calculated next time and the vector data line indicated by the vector read pointer.
  • the general purpose processor 804 is configured to:
• if the number of non-zero elements contained in each matrix element row is non-zero, set the vector read pointer increment to H, where H is the ratio of the preset size of matrix data read in the current calculation to the K;
• if the number H1 of matrix element rows whose contained non-zero elements number zero is greater than H, set the vector read pointer increment to H1.
• the scheduling unit 802 is configured to:
• look up the vector data row to be read from the input vector data according to the second indication information, where the input vector data includes T*K elements, and the T is an integer greater than 1;
• read, from the vector data row, the vector element value of the second position code corresponding to the first position code.
  • the general purpose processor 804 is further configured to:
  • the vector read pointer is updated according to the increment of the vector read pointer to obtain a vector read pointer for the next calculation.
  • the foregoing multiplication device may perform the implementation manners described in the foregoing embodiments by using the various functional units that are built in the foregoing, and details are not described herein again.
• the non-zero elements in the matrix to be processed are indicated by the matrix read pointer, the matrix valid pointer, the number of valid matrix elements, and the matrix read pointer increment, and the non-zero element values read from the preset matrix are multiplied with the vector data values. This improves the scheduling accuracy of the matrix elements, removes operations such as the non-zero checks on matrix elements before the matrix element values are scheduled, and reduces the scheduling complexity of the matrix elements.
• the embodiment of the invention can also read, from the input vector data according to indication information such as the vector read pointer and the vector read pointer increment, the vector data values corresponding to the positions of the matrix element values, thereby saving the checks on matrix element values during the multiplication. This can reduce the complexity of data processing, reduce the power consumption of data processing, and improve data processing efficiency.
• the application can also position-code the matrix elements in the preset matrix according to the data size of a single read, which ensures a fixed code width and reduces the operational complexity of the data processing.

Abstract

An embodiment of the present invention discloses a matrix and vector multiplication method and apparatus. The method includes: acquiring first indication information of matrix elements; reading matrix element values of non-zero elements from a preset matrix according to the first indication information, and determining a first position code of each read matrix element value; acquiring second indication information of vector elements; reading, from input vector data according to the second indication information, the vector element value of the second position code corresponding to the first position code; and acquiring the multiplication result of the matrix element value and the vector element value. The embodiments of the present invention can reduce the complexity of data processing, reduce the power consumption of data processing, and improve data processing efficiency.

Description

A matrix and vector multiplication method and apparatus — Technical Field
The present application relates to a matrix and vector multiplication method and apparatus.
Background
Owing to the excellent performance of convolutional neural networks in data processing applications such as image recognition, image classification, and audio recognition, convolutional neural networks have become one of the hot topics of academic research. However, convolutional neural networks involve a large number of floating-point multiply-accumulate operations, including matrix-vector multiplications; the computation is heavy and time-consuming, which makes the hardware energy consumption of convolutional neural networks large. Therefore, how to reduce the amount of floating-point computation in convolutional neural networks has become one of the technical problems urgently to be solved.
In the prior art, when performing matrix-vector operations in a convolutional neural network, the non-zero elements in the matrix are detected in real time, the positions of the non-zero elements in the matrix are recorded, and non-zero elements are selected from the matrix to be multiplied and accumulated with the vector elements. The matrix-vector operations in the prior art need to determine in real time whether the value of a matrix element is zero and to record the positions of non-zero elements in real time; real-time checking and recording are complex to implement, the operations are cumbersome, the data processing efficiency is low, and the applicability is poor.
Summary
The present application provides a matrix and vector multiplication method and apparatus, which can reduce the complexity of data processing, reduce the power consumption of data processing, and improve data processing efficiency.
A first aspect provides a matrix and vector multiplication method, which may include:
acquiring first indication information of matrix elements, the first indication information being used to indicate non-zero elements in a preset matrix;
reading matrix element values of non-zero elements from the preset matrix according to the first indication information, and determining a first position code of each read matrix element value, the first position code being the position mark of the matrix element value in the matrix data of a single read;
acquiring second indication information of vector elements, the second indication information being used to indicate vector data information to be read;
reading, from input vector data according to the second indication information, the vector element value of the second position code corresponding to the first position code;
acquiring the multiplication result of the matrix element value and the vector element value.
The present application indicates the non-zero elements in the matrix to be processed through the indication information of the matrix read pointer, and reads the non-zero element values from the preset matrix to be multiplied with the vector data values. The present application can read, from the input vector data according to the indication information of the vector read pointer, the vector data values corresponding to the positions of the matrix element values, saving the checks on matrix element values during the multiplication and thereby reducing the complexity of data processing, reducing the power consumption of data processing, and improving data processing efficiency.
With reference to the first aspect, in a first possible implementation, before the acquiring the first indication information of the matrix elements, the method further includes:
acquiring a matrix to be processed, and position-coding each matrix element in the matrix to be processed to obtain a pre-code of each matrix element, where each row of the matrix to be processed includes K elements, and K is an integer greater than zero;
selecting the non-zero elements in the matrix to be processed, and generating a preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, where each row of the preset matrix includes K non-zero elements;
generating the first indication information of the matrix elements according to the preset matrix and the pre-codes of the non-zero elements contained therein.
The present application can pre-process the matrix to be processed that participates in the multiplication, remove the zero elements from the matrix to be processed to obtain the preset matrix stored in a designated storage space, and then generate the indication information of the matrix read pointer according to the positional relationships of the non-zero elements in the preset matrix. The indication information of the matrix read pointer can schedule the matrix elements in the matrix-vector multiplication, which can improve the accuracy of matrix element scheduling and the data processing efficiency, and reduce the operational complexity of reading matrix elements.
With reference to the first possible implementation of the first aspect, in a second possible implementation, after the generating a preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, the method further includes:
processing the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in the current calculation, to obtain a position code of each non-zero element;
adding the position code of each non-zero element to the first indication information;
where the position code of any one of the non-zero elements is smaller than the data size.
The present application can code the non-zero elements of the preset matrix according to the data size read in a single operation; the position code of any non-zero element is smaller than the data size of a single read, so the code bit width is fixed, reducing the complexity of data processing.
With reference to the first or the second possible implementation of the first aspect, in a third possible implementation, the first indication information includes a matrix read pointer, a matrix valid pointer, and a number of valid matrix elements;
the matrix read pointer is used to indicate the row of matrix elements to be read in the preset matrix that participates in the current calculation;
the matrix valid pointer points to the position, in the row of matrix elements to be read, of the starting non-zero element participating in the current calculation;
the number of valid matrix elements is used to indicate the number M of non-zero elements to be read that participate in the current calculation, where M is an integer greater than or equal to 1;
the reading matrix element values of non-zero elements from the preset matrix according to the first indication information includes:
looking up, in the preset matrix, the specified matrix element row pointed to by the matrix read pointer, and reading M matrix element values from the specified matrix element row starting from the specified position pointed to by the matrix valid pointer.
The present application can indicate, through parameters such as the matrix read pointer, the matrix valid pointer, and the number of valid matrix elements, information such as the reading positions and counts of the non-zero elements of the preset matrix, which can improve the convenience of matrix element scheduling and thus the data processing efficiency.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, the first indication information further includes a matrix read pointer increment;
the initial value of the matrix read pointer increment is zero, indicating that the row of matrix elements to be read in the current calculation is the matrix element row indicated by the matrix read pointer;
the generating the first indication information of the matrix elements according to the preset matrix and the pre-codes of the non-zero elements contained therein includes:
if the M is greater than the number of remaining non-zero elements in the row of matrix elements to be read, adding 1 to the matrix read pointer increment, where a matrix read pointer increment of 1 indicates that the rows of matrix elements to be read in the next calculation are the two rows after the matrix element row indicated by the matrix read pointer;
where the remaining non-zero elements are the non-zero elements, in the row of matrix elements to be read, after the position pointed to by the matrix valid pointer.
The present application can mark, through the matrix read pointer increment, the matrix element row tracked by the matrix read pointer, further ensuring the scheduling accuracy of the matrix elements and improving the efficiency of data processing.
With reference to the third possible implementation of the first aspect, in a fifth possible implementation, the method further includes:
updating the matrix read pointer according to the matrix read pointer increment to obtain the matrix read pointer of the next calculation.
The present application can update the matrix read pointer through the matrix read pointer increment to ensure, for each operation, the accuracy of the matrix element row pointed to by the matrix read pointer, improving the accuracy of data scheduling with higher applicability.
With reference to any one of the first to the fifth possible implementations of the first aspect, in a sixth possible implementation, the vector data information to be read includes the vector data row to be read in the current calculation;
before the acquiring the second indication information of the vector elements, the method further includes:
determining, according to the pre-codes of the non-zero elements in the matrix to be processed, the number of non-zero elements contained in each matrix element row of the matrix to be processed;
generating the second indication information of the vector elements according to the number of non-zero elements contained in each matrix element row;
where the second indication information includes the vector data row to be read indicated by the vector read pointer, and a vector read pointer increment;
the vector read pointer increment indicates the number of rows between the vector data row to be read in the next calculation and the vector data row indicated by the vector read pointer.
The present application can determine the indication information of the vector read pointer according to the number of non-zero elements of each matrix element row in the matrix to be processed, and use it to indicate the vector data row from which vector data is read from the input vector data during the multiplication, which can ensure the accuracy of the multiplication of the vector data with the matrix element values and improve the accuracy of data scheduling.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation, the generating the second indication information of the vector elements according to the number of non-zero elements contained in each matrix element row includes:
if the number of non-zero elements contained in each matrix element row is non-zero, setting the vector read pointer increment to H, where H is the ratio of the preset size of matrix data read in the current calculation to the K;
if the number H1 of matrix element rows whose contained non-zero elements number zero is greater than H, setting the vector read pointer increment to H1.
The present application can also set the vector read pointer increment according to the zero elements contained in each matrix element row of the matrix to be processed, specify through the increment the vector data row to be read for the multiplication, and thus skip all-zero matrix element rows through the setting of the vector read pointer increment, saving the data scheduling signaling of the multiplication and improving data processing efficiency.
With reference to the sixth or the seventh possible implementation of the first aspect, in an eighth possible implementation, the reading, from the input vector data according to the second indication information, the vector element value of the second position code corresponding to the first position code includes:
looking up the vector data row to be read from the input vector data according to the second indication information, where the input vector data includes T*K elements, and T is an integer greater than 1;
reading, from the vector data row, the vector element value of the second position code corresponding to the first position code.
The present application looks up, through the indication information of the vector read pointer, the vector data row to be read from input vector data containing more elements than the matrix elements, and reads from the found vector data row the vector element values corresponding to the read matrix element values. By inputting more vector data, the present application guarantees the effective utilization of the operators in the accelerator and improves the applicability of the matrix-vector multiplication.
With reference to any one of the sixth to the eighth possible implementations of the first aspect, in a ninth possible implementation, the method further includes:
updating the vector read pointer according to the increment of the vector read pointer to obtain the vector read pointer of the next calculation.
The present application can update the vector read pointer through the vector read pointer increment to ensure, for each operation, the accuracy of the vector data row pointed to by the vector read pointer, improving the accuracy of data scheduling with higher applicability.
A second aspect provides a matrix and vector multiplication apparatus, which may include: a memory, a scheduling unit, and an operator;
the memory is configured to store a preset matrix and first indication information of matrix elements of the preset matrix, the first indication information being used to indicate non-zero elements in the preset matrix;
the scheduling unit is configured to acquire the first indication information from the memory, read matrix element values of non-zero elements from the preset matrix according to the first indication information, and determine a first position code of each read matrix element value, the first position code being the position mark of the matrix element value in the matrix data of a single read;
the memory is further configured to store input vector data and second indication information of vector elements of the input vector data, the second indication information being used to indicate vector data information to be read;
the scheduling unit is further configured to read the second indication information from the memory, and read, from the input vector data according to the second indication information, the vector element value of the second position code corresponding to the first position code;
the operator is configured to calculate the multiplication result of the matrix element value read by the scheduling unit and the vector element value.
With reference to the second aspect, in a first possible implementation, the multiplication apparatus further includes:
a general purpose processor, configured to acquire a matrix to be processed, and position-code each matrix element in the matrix to be processed to obtain a pre-code of each matrix element, where each row of the matrix to be processed includes K elements, and K is an integer greater than zero;
the general purpose processor is further configured to select the non-zero elements in the matrix to be processed, generate a preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, and store it to the memory, where each row of the preset matrix includes K non-zero elements;
the general purpose processor is further configured to generate the first indication information of the matrix elements according to the preset matrix and the pre-codes of the non-zero elements contained therein, and store it to the memory.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the general purpose processor is further configured to:
process the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in the current calculation to obtain a position code of each non-zero element, and add the position code of each non-zero element to the first indication information;
where the position code of any one of the non-zero elements is smaller than the data size.
With reference to the first or the second possible implementation of the second aspect, in a third possible implementation, the first indication information includes a matrix read pointer, a matrix valid pointer, and a number of valid matrix elements;
the matrix read pointer is used to indicate the row of matrix elements to be read in the preset matrix that participates in the current calculation;
the matrix valid pointer points to the position, in the row of matrix elements to be read, of the starting non-zero element participating in the current calculation;
the number of valid matrix elements is used to indicate the number M of non-zero elements to be read that participate in the current calculation, where M is an integer greater than or equal to 1;
the scheduling unit is configured to:
look up, in the preset matrix, the specified matrix element row pointed to by the matrix read pointer, and read M matrix element values from the specified matrix element row starting from the specified position pointed to by the matrix valid pointer.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, the first indication information further includes a matrix read pointer increment;
the initial value of the matrix read pointer increment is zero, indicating that the row of matrix elements to be read in the current calculation is the matrix element row indicated by the matrix read pointer;
the general purpose processor is configured to:
if the M is greater than the number of remaining non-zero elements in the row of matrix elements to be read, add 1 to the matrix read pointer increment, where a matrix read pointer increment of 1 indicates that the rows of matrix elements to be read in the next calculation are the two rows after the matrix element row indicated by the matrix read pointer;
where the remaining non-zero elements are the non-zero elements, in the row of matrix elements to be read, after the position pointed to by the matrix valid pointer.
With reference to the third possible implementation of the second aspect, in a fifth possible implementation, the general purpose processor is further configured to:
update the matrix read pointer according to the matrix read pointer increment to obtain the matrix read pointer of the next calculation.
With reference to any one of the first to the fifth possible implementations of the second aspect, in a sixth possible implementation, the vector data information to be read includes the vector data row to be read in the current calculation;
the general purpose processor is further configured to:
determine, according to the pre-codes of the non-zero elements in the matrix to be processed, the number of non-zero elements contained in each matrix element row of the matrix to be processed, and generate the second indication information of the vector elements according to the number of non-zero elements contained in each matrix element row;
where the second indication information includes the vector data row to be read indicated by the vector read pointer, and a vector read pointer increment;
the vector read pointer increment indicates the number of rows between the vector data row to be read in the next calculation and the vector data row indicated by the vector read pointer.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation, the general purpose processor is configured to:
if the number of non-zero elements contained in each matrix element row is non-zero, set the vector read pointer increment to H, where H is the ratio of the preset size of matrix data read in the current calculation to the K;
if the number H1 of matrix element rows whose contained non-zero elements number zero is greater than H, set the vector read pointer increment to H1.
With reference to the sixth or the seventh possible implementation of the second aspect, in an eighth possible implementation, the scheduling unit is configured to:
look up the vector data row to be read from the input vector data according to the second indication information, where the input vector data includes T*K elements, and T is an integer greater than 1;
read, from the vector data row, the vector element value of the second position code corresponding to the first position code.
结合第二方面第六种可能的实现方式至第二方面第八种可能的实现方式中任一种,在第九种可能的实现方式中,所述通用处理器还用于:
根据所述矢量读指针的增量更新所述矢量读指针以得到下次计算的矢量读指针。
本申请通过矩阵读指针、矩阵有效指针、有效矩阵元素个数以及矩阵读指针增量等信息指示待处理矩阵中的非零元素,从预置矩阵中读取非零元素值与矢量数据值进行乘法运算,可提高矩阵元素的调度准确性,减少矩阵元素值的调度之前矩阵元素的非零判断等操作,降低了矩阵元素的调度操作复杂度。本申请可根据矢量读指针以及矢量读指针增量等指示信息从输入矢量数据中读取与矩阵元素值的位置相对应的矢量数据值,可节省乘法运算过程中矩阵元素值的判断操作,进而可降低数据处理的复杂度,降低数据处理的功耗,提高数据处理效率。本申请还可根据单次读取的数据大小为预置矩阵中的矩阵元素进行位置标码,可保证标码宽度固定,降低数据处理的操作复杂度。
Brief Description of Drawings
FIG. 1 is a schematic diagram of a matrix and vector multiplication operation according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a matrix and vector multiplication operation apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a matrix and vector multiplication operation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of sparse matrix preprocessing;
FIG. 5 is a schematic diagram of obtaining position codes of matrix elements according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of indication information of matrix/vector read pointers according to an embodiment of the present invention;
FIG. 7 is a schematic architecture diagram of a PE according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a matrix and vector multiplication operation apparatus according to an embodiment of the present invention.
Description of Embodiments
The embodiments of the present invention are described below with reference to the accompanying drawings.
Referring to FIG. 1, which is a schematic diagram of a matrix and vector multiplication operation according to an embodiment of the present invention. As shown in FIG. 1, assume that the matrix to be processed participating in the multiplication is an A*B matrix and the input vector data is a B*1 vector. Multiplying the A*B matrix by the B*1 vector yields an A*1 vector. That is, the matrix to be processed has A rows and B columns and contains one or more zero elements, and the input vector data is a vector with B entries. In matrix-vector multiplication, the matrix elements of each matrix row are paired one-to-one with the vector elements, the two elements of each pair are multiplied, and the products are accumulated; the resulting value is the result for that row. For example, the matrix elements of the first row of the matrix to be processed are paired with the vector elements, e.g., (1,1), (0,3), (0,5), (1,2); the two elements of each pair are multiplied to obtain the product of that pair, and the products are then summed to obtain the result of the first row. Performing the same operation on the matrix elements of every row yields a 4*1 vector.
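The row-by-row pairing and accumulation described above can be sketched as follows. This is a minimal illustration of the plain (dense) computation, using a hypothetical 4x4 matrix whose first row and the vector reproduce the pairs (1,1), (0,3), (0,5), (1,2) from the example; the remaining rows are made up for illustration.

```python
def mat_vec_multiply(matrix, vector):
    result = []
    for row in matrix:
        acc = 0
        # pair each matrix element with its vector element, multiply, accumulate
        for m, v in zip(row, vector):
            acc += m * v
        result.append(acc)
    return result

# Hypothetical 4x4 sparse matrix and 4-entry vector, analogous to FIG. 1.
matrix = [
    [1, 0, 0, 1],
    [0, 2, 0, 0],
    [3, 0, 1, 0],
    [0, 0, 0, 4],
]
vector = [1, 3, 5, 2]
print(mat_vec_multiply(matrix, vector))  # -> [3, 6, 8, 8]
```

The first row pairs as (1,1), (0,3), (0,5), (1,2), whose products sum to 3; the zero elements contribute nothing, which is exactly the waste the densification scheme below avoids.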
To accelerate the multiplication of a sparse matrix (i.e., a matrix containing zero elements, such as the matrix in FIG. 1) with a vector, the prior art densifies the sparse matrix (i.e., discards the zero elements in the matrix and regenerates a matrix from the remaining non-zero elements), which saves data storage space and reduces the number of matrix elements participating in the multiplication. However, after densification, the multiplication of the densified matrix with a vector becomes more complicated. For example, the densified matrix must record the code and value of each matrix element, where the code denotes the position of the matrix element in the matrix. For instance, in FIG. 1 the code of the first element of the first row may be 1, the code of the second element of the first row is 2, and so on; the code of the first element of the second row is 4, and the code of the last element of the last row is 20. In the matrix-vector multiplication, elements must be read from the matrix and checked for zero: a zero element is discarded; otherwise the element value and its code are recorded.
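The densification bookkeeping described above can be sketched as follows. This is a hypothetical illustration only: codes are assigned row-major starting from 1, matching the FIG. 1 narration (the later embodiment sections use 0-based pre-codes instead).

```python
def densify(matrix):
    """Discard zero elements; record each survivor as a (code, value) pair."""
    pairs = []
    code = 0
    for row in matrix:
        for val in row:
            code += 1          # row-major position of the element in the matrix
            if val != 0:
                pairs.append((code, val))
    return pairs

sparse = [
    [1, 0, 0, 1],
    [0, 2, 0, 0],
]
print(densify(sparse))  # -> [(1, 1), (4, 1), (6, 2)]
```

The pairs keep enough information to find the matching vector element later, but, as the text notes, the zero check and the code/value recording now happen at run time, which is the cost the present invention moves into a one-time preprocessing step.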
In addition, in the prior art, during matrix-vector multiplication the process of reading matrix elements also requires checking whether enough arithmetic units are available; if not, it must be recorded up to which element the current operation has read, and the next operation must resume reading after that element. If all elements of the matrix element rows read in the current operation have been computed, reading moves to the next rows. In the prior art, the reading of matrix elements involves many data checks; the operations are cumbersome and the applicability is low. In the prior art, in matrix-vector multiplication, the vector element corresponding to each recorded non-zero element can be selected for multiplication, i.e., the vector element that was paired with the non-zero element before densification is the vector element corresponding to that non-zero element. If the stride between vector elements is too large, the position of the next vector element in memory must be located.
As can be seen from the above, real-time multiplication of a sparse matrix with a vector requires a large number of complex checks and requires saving the read-out codes and values; the operations are cumbersome and the applicability is low. Embodiments of the present invention provide a matrix and vector multiplication operation method and apparatus that exploit the fact that the matrix is known data: a set of control signals for the matrix-vector multiplication is generated in software, and these control signals are used to select the correct data from the matrix data and vector data to perform the multiplication. In the implementations provided by the embodiments of the present invention, during the matrix-vector multiplication the multiplication apparatus only needs to perform the corresponding operations according to the control signals, without data checks or real-time recording; the operations are simple and the data processing efficiency is high. In these implementations, the control signals for the multiplication need to be generated only once; thereafter, all operations of the multiplication apparatus are triggered by the control signals, without real-time checks or data recording, which reduces the scheduling complexity of matrix elements and improves data processing efficiency.
Referring to FIG. 2, which is a schematic structural diagram of a matrix and vector multiplication operation apparatus according to an embodiment of the present invention. The multiplication apparatus provided by the embodiment of the present invention may specifically be a multiplication accelerator. The top-level architecture of the multiplication accelerator shown in FIG. 2 includes x processing engines (process engine, PE), a vector random access memory (random access memory, RAM), a matrix information RAM, a controller, and the like. Each PE further includes a matrix RAM for storing the matrix participating in the multiplication. Each PE performs a floating-point multiply-accumulate (floating-point multiply-accumulate, FMAC) operation; the following description uses a single PE (any one PE) performing an FMAC as an example.
The matrix information RAM, the matrix RAM, and the vector RAM each contain one write port and two read ports. In each multiplication, reading data from the matrix RAM and the vector RAM reads two rows of data, rows L and L+1, of the matrix RAM and vector RAM simultaneously. Each row of the matrix RAM and the vector RAM stores K elements (the rows of the matrix RAM also store the position code corresponding to each element). The output width of the matrix RAM is K elements, i.e., the output width of the matrix RAM represents the number of elements (namely K) each PE reads from the matrix RAM in a single operation, while the output width of the vector RAM is T*K elements. T may be a predefined multiple, which may specifically be determined according to the percentage of zero elements among all elements contained in the matrix participating in the operation; no limitation is imposed here.
It should be noted that, because the matrix stored in the matrix RAM may be a densified matrix that may contain no zero elements, the K elements in a row of data read by a PE from the matrix RAM in a single operation may be the K non-zero elements remaining after zero elements have been removed; consequently, the K matrix elements actually read may include matrix elements with codes greater than K. In that case, more than K vector elements are needed to pair with the K actual non-zero matrix elements. The vector elements are real-time input data and the vector RAM holds data that has not been densified, so the vector RAM should output more than K elements. For example, assume K is 8. Before densification, 4 of the 16 (i.e., 2K) elements in rows L and L+1 of the matrix are zero, with row L containing (2,4,3,0,5,0,1,0) and row L+1 containing (7,6,9,0,8,2,1,4); the K non-zero elements read after densification may then be (2,4,3,5,1,7,6,9). In this case, the K non-zero elements read after densification include data from both rows L and L+1. Because before densification the matrix elements and vector elements participating in the multiplication are paired one-to-one, and the vector elements are input data that have not been densified, more than K vector elements should be read at this point; the vector elements of row L+1 pair with matrix elements whose codes are greater than K, which ensures that every vector element pairing with the K non-zero matrix elements is read. In other words, when the matrix contains many zero elements, taking only K vector elements makes it difficult to use all the arithmetic operators in the PE effectively. Taking T*K vector elements (T>1) gives the vector elements a larger selectable range, making it easier to select more non-zero elements to participate in the FMAC operation and improving the utilization of the multipliers inside the PE.
The two operands of the FMAC operation are a matrix element and a vector element. The matrix elements are preprocessed and stored in the matrix RAM in the PE. The vector elements may be stored in a larger RAM at the far end and input into the PE as real-time input data to participate in the FMAC operation. At the start of the operation, the vector RAM broadcasts T*K elements onto the bus. T is customizable; for ease of understanding, the embodiments of the present invention use T=2 as an example. Meanwhile, the matrix information of each PE stored in the matrix information RAM is delivered to each PE along the broadcast bus. After the vector elements and the matrix information enter the PE, the PE extracts the elements corresponding to the vector according to the matrix information and performs multiply-accumulate operations with them. The specific implementations of reading matrix elements and vector elements and of the multiply-accumulate operations provided by the embodiments of the present invention are described below with reference to FIG. 3. The execution body of the matrix and vector multiplication operation method provided by the embodiments of the present invention may be the PE, a functional module in the PE, or the controller, without limitation here. The following description uses the PE as the execution body.
Referring to FIG. 3, which is a schematic flowchart of the matrix and vector multiplication operation method according to an embodiment of the present invention. The method provided by the embodiment of the present invention may include the following steps.
In some feasible implementations, before the FMAC operation starts, the matrix to be processed participating in the FMAC operation may be obtained and preprocessed to obtain matrix initialization information. Specifically, each matrix element in the matrix to be processed may be assigned a position code to obtain a pre-code of each matrix element, where each row of the matrix to be processed includes K elements, K being an integer greater than zero. For example, assume the matrix to be processed is a 5*8 sparse matrix, i.e., K=8, as shown in Table 1 below:
Table 1
12 0 0 4 0 5 0 1
0 0 2 5 0 0 23 0
2 0 0 9 23 4 13 0
0 0 18 21 0 0 0 0
0 0 0 0 0 0 0 0
The PE may code each matrix element according to its position in the matrix to be processed, row-first then column, to obtain the pre-code of each matrix element, as shown in FIG. 4, a schematic diagram of sparse matrix preprocessing. The data to the left of the "densify" arrow in FIG. 4 are the matrix elements contained in the matrix to be processed and their corresponding pre-codes. Further, the PE may select the non-zero elements from the matrix to be processed and generate the preset matrix according to the pre-codes of the non-zero elements. The preset matrix is the matrix obtained after densifying the matrix to be processed; it contains no zero elements, and each of its rows also includes K elements. The data to the right of the "densify" arrow in FIG. 4 are the non-zero elements of the matrix to be processed and their corresponding pre-codes. It should be noted that the data shown in FIG. 4 are part of the data of the matrix shown in Table 1; the data not shown can be derived from the data shown, without limitation here.
Further, in some feasible implementations, the PE may process the pre-codes of the non-zero elements contained in the preset matrix according to the preset data size of a single operation (e.g., the current computation), for example 2K, to obtain the position code (i.e., the first position code) of each non-zero element. That is, for a pre-code value greater than K, the remainder of dividing it by 2K gives the actual code (i.e., the first position code) of that matrix element. The processed matrix elements and their corresponding position codes can then be stored in the matrix RAM of the PE. FIG. 5 is a schematic diagram of obtaining position codes of matrix elements according to an embodiment of the present invention. Taking PE0 as an example, the pre-codes of the non-zero elements in the preset matrix can be processed to obtain their position codes. The embodiment of the present invention takes the remainder of each matrix element's pre-code divided by 2K as the position code of the non-zero matrix element, so that the position code of every non-zero matrix element is no greater than 2K; this fixes the bit width of the position codes, saves storage space for the position codes, and improves the applicability of data processing.
It should be noted that, because the matrix data read in a single operation is 2K elements, the actual code obtained by taking the remainder of a pre-code greater than K divided by 2K marks the position of the matrix element within the matrix data of a single read. For example, 16 matrix elements are read at a time, i.e., the 16 matrix elements with pre-codes 0-15; the matrix element with pre-code 15 refers to the data at position 15 among the 16 matrix elements. If the 16 matrix elements of a single read in the first operation are those with pre-codes 16-31, the actual codes of these 16 matrix elements are 0-15, and the matrix element with pre-code 31 refers to the data at position 15 among the 16 matrix elements read this time.
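The modulo-2K position coding just described can be sketched in a few lines. This is a sketch under the stated assumption K=8 (so 2K=16), not the apparatus implementation itself:

```python
K = 8  # elements per matrix row, as in the embodiment

def position_code(pre_code, k=K):
    """Map a row-major pre-code to its position within one 2K-element read
    by taking the remainder modulo 2K, so the code width stays fixed."""
    return pre_code % (2 * k)

# Pre-codes 16-31 belong to the second 16-element read,
# so their position codes wrap back to the range 0-15.
assert position_code(15) == 15   # position 15 of the first read
assert position_code(16) == 0    # position 0 of the second read
assert position_code(31) == 15   # position 15 of the second read
```

Because every code is reduced below 2K, the stored code never needs more bits than log2(2K), which is the fixed-width property the text emphasizes.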
In some feasible implementations, after processing the matrix to be processed to obtain the preset matrix, the first indication information of the matrix elements of the preset matrix may further be generated according to the pre-codes of the non-zero elements in the preset matrix. The first indication information may include a matrix read pointer, a matrix valid pointer, a valid matrix element count, and the like, as shown in FIG. 6, a schematic diagram of the indication information of the matrix/vector read pointers according to an embodiment of the present invention, in which "code" refers to the pre-code described in this embodiment. In a specific implementation, assuming K is 8, the data read in a single operation is 16 (i.e., 2K) elements. When generating the first indication information of the matrix elements of the preset matrix, the matrix elements may be divided into groups of 16 according to their pre-codes, e.g., the matrix elements with pre-codes 0-15 form one group, those with pre-codes 16-31 form another group, and so on. Further, the valid matrix element count may be determined from the number of non-zero elements contained in each group of matrix elements; for example, the group with pre-codes 0-15 contains 3 non-zero elements, so the valid matrix element count is 3.
The matrix read pointer indicates the matrix element rows of the preset matrix to be read for the current computation. For example, a matrix read pointer of 0 for the matrix element group with pre-codes 0-15 indicates that in the first operation the rows read via the matrix read pointer are the current row and its next row (i.e., two rows are read each time), e.g., the first and second rows of the preset matrix.
The matrix valid pointer points to the position, in the matrix element rows to be read, of the starting non-zero element of the current computation. For example, a matrix valid pointer of 0 for the group with pre-codes 0-15 indicates that reading of the matrix elements starts from the element with actual code 0 in the first row of the preset matrix. The valid matrix element count indicates the number M of non-zero elements to be read for the current computation, i.e., the number of elements that can be multiplied; it also represents the valid elements that can be read out in the range [i*K, (i+2)*K], where i is an integer greater than or equal to 0 and the data in the range [i*K, (i+2)*K] of the matrix to be processed constitute two rows of data. For example, when i is 0 and K is 8, [i*K, (i+2)*K] refers to the two rows of data with pre-codes 0-15. The valid matrix element count represents the number of valid elements in that range, e.g., 3.
Assume the matrix elements read in the second operation are the group with pre-codes 16-31, which contains 2 non-zero elements (the 2 matrix elements with codes 23 and 31 shown in FIG. 6), so the valid matrix element count is 2. The matrix read pointer of 0 for the group with pre-codes 16-31 indicates that the rows read via the matrix read pointer are the current row and its next row, e.g., the first and second rows of the preset matrix. It should be noted that the first row of the preset matrix contains 8 non-zero matrix elements, of which the first operation read 3, so in the second operation the matrix read pointer is still 0, i.e., reading still starts from the first row. Here, the matrix valid pointer of 3 for the group with pre-codes 16-31 indicates that reading starts from the element with actual code 3 in the first row of the preset matrix, i.e., from the 4th matrix element of the first row, and the matrix elements read this time number 2. The valid matrix element count indicates that the number M of non-zero elements to be read for this computation is 2.
Assume the matrix elements read in the fifth operation are the group with pre-codes 64-79, which contains 2 non-zero elements (the 2 matrix elements with codes 71 and 79 shown in FIG. 6), so the valid matrix element count is 2. The matrix read pointer of +1 for the group with pre-codes 64-79 (i.e., a matrix read pointer increment of 1) indicates that the rows read via the matrix read pointer are the row after the rows to be read pointed to by the matrix read pointer and the row after that, e.g., the second and third rows of the preset matrix. It should be noted that the first row of the preset matrix contains 8 non-zero matrix elements, and the first four operations have read 9 matrix elements in total, i.e., 3+2+2+2. These 9 matrix elements include the 8 matrix elements of the first row of the preset matrix and the first matrix element of the second row. Therefore, in the fifth operation the matrix read pointer is +1, i.e., reading starts from the row after the first row. Here, the matrix valid pointer of 1 for the group with pre-codes 64-79 indicates that reading starts from the element with actual code 1 in the second row of the preset matrix, i.e., from the 2nd matrix element of the second row, and the matrix elements read this time number 2. The valid matrix element count indicates that the number M of non-zero elements to be read for this computation is 2.
In the manner described above, the first indication information of the matrix elements, such as the matrix read pointer, matrix valid pointer, and valid matrix element count corresponding to each matrix element group, is generated.
As shown in FIG. 6, in the embodiment of the present invention the first indication information of matrix elements further includes a matrix read pointer increment. The initial value of the matrix read pointer increment is zero, indicating that the matrix element rows to be read in the current computation are the matrix element rows indicated by the matrix read pointer (two rows are read each time, starting from the row pointed to by the matrix read pointer). If the number of non-zero matrix elements to be read in the current computation exceeds the number of non-zero matrix elements remaining in the matrix element rows indicated by the matrix read pointer, the matrix read pointer increment of the current operation is 1, which is used to update the matrix read pointer of the next operation. That is, if the number M of matrix elements read in the current operation exceeds the number of non-zero elements remaining in the matrix element rows pointed to by the matrix read pointer, the matrix read pointer increment is increased by 1. An increment of 1 indicates that the matrix element rows to be read in the next computation are the two rows following the matrix element rows indicated by the matrix read pointer of the current operation. Here, the remaining non-zero elements are the non-zero elements, contained in the matrix element rows indicated by the matrix read pointer of the current operation, located after the position pointed to by the matrix valid pointer. For example, in the fourth operation, in the first row of the preset matrix there are 0 elements after the non-zero element with position code 7 pointed to by the matrix valid pointer, fewer than 2, so the matrix read pointer increment generated after the fourth operation is 1, indicating that in the fifth operation the matrix read pointer points to the second row of the matrix. After the fourth operation, the matrix read pointer can be updated according to the matrix read pointer increment to obtain the matrix read pointer of the fifth operation.
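The bookkeeping above (one triple of read pointer increment, valid pointer, and valid element count per 2K group of pre-codes) can be sketched as follows. This is a hypothetical reconstruction, not the patented generator: it assumes K=8, densified rows of K non-zero elements each, and a made-up pre-code list chosen so the first and fifth groups match the worked example (counts 3, 2, 2, 2, 2; increment becomes 1 only at the fifth group).

```python
def build_matrix_indication(nonzero_precodes, k=8):
    """For each 2K group of pre-codes, derive
    (read-pointer increment, valid pointer, valid element count)."""
    group = 2 * k
    n_groups = max(nonzero_precodes) // group + 1
    info = []
    consumed = 0          # non-zero elements already scheduled
    row_ptr = 0           # densified row the previous read started in
    for g in range(n_groups):
        lo, hi = g * group, (g + 1) * group
        count = sum(lo <= p < hi for p in nonzero_precodes)  # M for this group
        new_row = consumed // k          # densified row the next read starts in
        increment = new_row - row_ptr    # read-pointer increment
        valid_ptr = consumed % k         # offset of the starting element in that row
        info.append((increment, valid_ptr, count))
        row_ptr = new_row
        consumed += count
    return info

# Hypothetical pre-codes giving group counts 3, 2, 2, 2, 2 as in the example.
info = build_matrix_indication([0, 4, 7, 23, 31, 33, 40, 50, 63, 71, 79])
print(info[0])  # -> (0, 0, 3)  first operation: valid pointer 0, count 3
print(info[1])  # -> (0, 3, 2)  second operation: valid pointer 3, count 2
print(info[4])  # -> (1, 1, 2)  fifth operation: increment 1, valid pointer 1
```

After nine elements are consumed (3+2+2+2), the next read spills into the second densified row, so the fifth triple carries increment 1 and valid pointer 1, matching the narrative.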
S301: Obtain first indication information of matrix elements.
In some feasible implementations, the first indication information of matrix elements described with reference to FIG. 6 may be stored in the matrix information RAM. When the PE performs the FMAC operation, it may obtain from the broadcast bus the first indication information delivered by the matrix information RAM, so as to read from the preset matrix, according to the first indication information, the non-zero elements (i.e., non-zero matrix elements) needed for the FMAC operation.
In a specific implementation, the first indication information may be the matrix indication information obtained after initialization of the matrix to be processed, stored in the matrix information RAM. When performing the FMAC operation, the PE may obtain the matrix indication information from the broadcast bus and schedule the non-zero elements in the preset matrix needed for the FMAC operation according to parameters included in the matrix indication information, such as the matrix read pointer, matrix valid pointer, and valid matrix element count.
The matrix data described in the embodiments of the present invention, such as the matrix to be processed, is known data, and known data does not change. Therefore, by preprocessing the matrix to be processed to obtain the initialization information of the matrix, the initialization information can guide the multiplier through the data scheduling and computation of each beat. The data scheduling and computation of one beat may be the data scheduling and computation of one processing cycle, which improves the processing efficiency of data computation and reduces the operational complexity of the matrix-vector multiplication.
S302: Read matrix element values of non-zero elements from the preset matrix according to the first indication information, and determine the first position code of each read matrix element value.
In some feasible implementations, the PE may look up, in the preset matrix according to the first indication information, the designated matrix element rows pointed to by the matrix read pointer, and read M matrix element values from the designated matrix element rows starting at the designated position pointed to by the matrix valid pointer. For example, in the first FMAC operation, 3 non-zero matrix element values may be read starting from the first matrix element position of the first row of the preset matrix according to the matrix read pointer. Further, the position code (i.e., the first position code) of each read matrix element value may be determined, so as to read the paired vector element from the input vector elements. For example, after reading the matrix element value of the first non-zero element of the preset matrix, its position code can be determined, and the first element value paired with it in the multiply-accumulate operation can then be read from the vector data.
S303: Obtain second indication information of vector elements.
In some feasible implementations, when the matrix to be processed is preprocessed to obtain its initialization information, the indication information of the vector data (i.e., the second indication information) may also be determined according to the number of non-zero elements contained in each matrix element row of the matrix to be processed. In a specific implementation, the embodiment of the present invention uses a vector read pointer to indicate the vector data rows to be read. The second indication information includes the vector data rows to be read as indicated by the vector read pointer, and also includes a vector read pointer increment. It should be noted that in matrix-vector multiplication the vector data read must be paired one-to-one with the matrix data; therefore, when the data size of a single read of matrix data is 2K (i.e., two rows), the data size of a single read of vector data should also be 2K, so the vector read pointer increment can be set to the number of vector RAM rows between the vector elements output in two consecutive beats. That is, the vector read pointer increment indicates the number of rows between the vector data rows to be read in the next computation and the vector data rows indicated by the vector read pointer in the current computation, where the vector data rows indicated by the vector read pointer are the vector data rows read this time. In a specific implementation, if the elements contained in the matrix element rows of the matrix to be processed are not all zero, the vector read pointer increment may be set to 2, i.e., the ratio H of the data size of this read (2K) to K is 2. If the elements contained in matrix element rows of the matrix to be processed are all zero, those rows can simply be skipped, i.e., all-zero matrix element rows need no multiplication. In this case, the vector read pointer increment may be set to the number of rows to be skipped; assuming the elements in the range [i*K, (i+2)*K] of the matrix to be processed are all zero, 2 rows can be skipped directly, and the vector read pointer increment may be set to 2 or 4, i.e., H1 is 2. If the elements in the consecutive range [i*K, (i+N)*K] of the matrix to be processed are all zero, N rows can be skipped directly, and the vector read pointer increment may be set to N. As shown in FIG. 6, the codes of the matrix elements of the matrix to be processed show that the elements between code 127 and code 300 are all zero, and codes 127 and 300 are 22 rows apart, so the vector read pointer increment may be set to 22. If the element spacing between code C and code D is less than 2K, the vector read pointer increment is set to 2; see the example shown in FIG. 6, which is not repeated here.
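The skip rule above can be sketched as follows. This is a hypothetical simplification: it assumes H=2 (a 2K read over K-element rows), takes per-row non-zero counts as input, and uses the larger of the all-zero run length and H as the increment; the patent's own rule for short all-zero runs ("2 or 4") is ambiguous, so this is one possible reading, not the definitive scheme.

```python
def vector_increments(row_nonzero_counts, H=2):
    """Emit one vector-read-pointer increment per read: advance by H rows
    normally; when whole matrix rows are all-zero, skip the entire run."""
    increments = []
    i = 0
    while i < len(row_nonzero_counts):
        if row_nonzero_counts[i] == 0:
            n = 0
            # count the run of consecutive all-zero rows, which need no multiply
            while i < len(row_nonzero_counts) and row_nonzero_counts[i] == 0:
                n += 1
                i += 1
            increments.append(max(n, H))
        else:
            increments.append(H)
            i += H
    return increments

# Rows 2-5 are all-zero, so one read skips 4 rows instead of 2.
print(vector_increments([3, 2, 0, 0, 0, 0, 1]))  # -> [2, 4, 2]
```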
It should be noted that the above vector indication information may be obtained by preprocessing and stored in the vector RAM, and then transmitted to the PE over the broadcast bus when the PE performs the FMAC operation. After each data read, the vector read pointer increment can be used to update the vector read pointer to obtain the vector read pointer of the next computation, thereby achieving accurate scheduling of the vector data.
S304: Read, from the input vector data according to the second indication information, the vector element value at the second position code corresponding to the first position code.
In some feasible implementations, after the PE reads matrix element values from the preset matrix and determines the first position codes of the read matrix element values, it may look up the vector data rows indicated by the vector read pointer in the input vector data according to the second indication information of vector elements, and read from the vector data rows the vector element values at the second position codes corresponding to the first position codes. The second position code corresponding to a first position code is the position of the vector element that is paired with the matrix element value at the first position code. The input vector data may be the output width of the vector RAM, specifically T*K elements, T being an integer greater than 1. That is, if the output width of the matrix RAM is K non-zero elements, the vector RAM may output T*K elements, so as to guarantee enough vector elements to pair with the matrix elements and improve the accuracy of the matrix-vector multiplication.
S305: Obtain the product of the matrix element value and the vector element value.
In some feasible implementations, after the PE obtains the matrix element value and the vector element value, it can perform the multiply-accumulate operation on the matrix element value and the vector element value to obtain their product.
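The per-beat multiply-accumulate of steps S302-S305 can be sketched as follows: each scheduled group of non-zero matrix element values is multiplied element-wise with the vector element values selected by their position codes, and the products are added onto a running accumulator. The groups and values here are hypothetical, following the counts (3 then 2) of the earlier worked example.

```python
def fmac(acc, m_vals, v_vals):
    """One FMAC beat: multiply each scheduled matrix element with its
    paired vector element and accumulate the products onto acc."""
    for m, v in zip(m_vals, v_vals):
        acc += m * v
    return acc

acc = 0.0
acc = fmac(acc, [2, 4, 3], [1, 1, 2])  # first scheduled group: 2+4+6
acc = fmac(acc, [5, 1], [3, 1])        # next group accumulates: +15+1
print(acc)  # -> 28.0
```

Because the accumulator carries over between beats, a row whose non-zero elements span several reads still produces a single accumulated result, which is what the accumulation register in FIG. 7 holds.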
The following gives an overview, with reference to FIG. 7, a schematic architecture diagram of the PE, of how the PE performs data scheduling and multiply-accumulate processing according to the indication information of matrix elements stored in the matrix information RAM and the indication information of vector elements stored in the vector RAM. As shown in FIG. 7, each PE in fact performs an FMAC operation. The structure of the PE is pipelined and can be divided into a 2+N stage pipeline, including 2 data-scheduling stages (a read stage and a data stage) and N computation stages (i.e., the computation stage), as shown by C0, C1, ..., C5.
In the read stage, an adder updates the matrix read pointer according to the matrix read pointer returned by the matrix RAM and the matrix read pointer increment transmitted over the broadcast bus. Meanwhile, the PE maintains a matrix mask register and, using indication information such as the matrix valid pointer and valid matrix element count input from the matrix information RAM over the broadcast bus, generates a mask that can filter out matrix elements that have already been computed. Further, the mask of matrix elements can filter already-computed matrix elements out of the data read from the preset matrix stored in the matrix RAM, i.e., the valid matrix elements participating in the current FMAC operation are selected from the matrix elements output by the matrix RAM according to the matrix valid pointer and valid matrix element count, and the valid matrix elements of the preset matrix can then be fed into the computation pipeline.
Meanwhile, within this processing cycle, the vector input (i.e., the input vector data) is also input from outside and stored in the vector RAM. The vector read pointer and vector read pointer increment may also be stored in the vector RAM before this, without limitation here. The input vector data may include 2K elements and may be divided into an upper vector and a lower vector. The PE may read the input vector data from the vector RAM and, through 32-1 selectors, select the corresponding vector element values from the input vector data according to information such as the pre-codes of the matrix element values transmitted from the matrix RAM, and feed them into the computation pipeline to perform the matrix-vector multiply-accumulate operation.
In the data stage, the matrix data is read out of the matrix RAM. Of the valid matrix elements obtained after filtering, K or fewer valid matrix elements of the preset matrix are passed into the computation stage. At the same time, multiple selectors (the 32-1 selectors shown in the figure) select the corresponding vector elements according to the pre-codes read out of the matrix RAM and pass them into the computation stage. Each of these selectors can select one vector element out of the 2K elements, that vector element corresponding to the matrix element of the given pre-code. When the preset matrix has fewer than K operands, 0 can be passed in for the data at the unused pre-code positions, or an enable signal can be passed in to idle the multiplier, thereby saving multiplier workload.
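The selector behaviour in the data stage can be sketched as follows. This is a hypothetical software model of the 32-1 selectors under the K=8, T=2 assumption: each lane picks, out of the 2K broadcast vector elements, the element at the position given by a matrix pre-code modulo 2K, and unused lanes are fed 0 so the corresponding multipliers do no useful work.

```python
def select_vector_operands(vector_2k, precodes, k=8):
    """Pick one vector element per matrix pre-code from the 2K broadcast
    elements; pad unused lanes with 0 when fewer than K operands exist."""
    lanes = [vector_2k[p % (2 * k)] for p in precodes]
    while len(lanes) < k:      # fewer than K operands: feed zeros
        lanes.append(0)
    return lanes

# 16 broadcast vector elements; pre-codes 3 and 17 select positions 3 and 1.
print(select_vector_operands(list(range(16)), [3, 17]))
# -> [3, 1, 0, 0, 0, 0, 0, 0]
```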
In the computation stage, the accelerator performs the multiply-accumulate operation on the incoming data, and in the last stage accumulates the operation result with the previous result into an accumulation register.
Since the computation units of this accelerator have no need for back-pressure, all pipelines can run in parallel, giving this architecture a throughput of K FMAC accumulate operations per beat.
In the embodiment of the present invention, non-zero elements of the matrix to be processed are indicated by information such as the matrix read pointer, matrix valid pointer, valid matrix element count, and matrix read pointer increment, and non-zero element values are read from the preset matrix to be multiplied with vector data values. This improves the scheduling accuracy of matrix elements, eliminates operations such as non-zero checks on matrix elements prior to scheduling, and reduces the complexity of matrix element scheduling. The embodiment of the present invention can also read, from the input vector data, the vector data values corresponding to the positions of the matrix element values according to indication information such as the vector read pointer and vector read pointer increment, which saves checks on matrix element values during the multiplication, thereby reducing data processing complexity, lowering power consumption, and improving data processing efficiency. This application can further assign position codes to the matrix elements of the preset matrix according to the data size of a single read, guaranteeing a fixed code width and reducing the operational complexity of data processing.
Referring to FIG. 8, which is a schematic structural diagram of a matrix and vector multiplication operation apparatus according to an embodiment of the present invention. The multiplication apparatus provided by the embodiment of the present invention may specifically be the PE described in the embodiments of the present invention. The multiplication apparatus provided by the embodiment of the present invention may include: a memory 801, a scheduling unit 802, an arithmetic unit 803, and a general-purpose processor 804 (e.g., a central processing unit, CPU). The memory 801 may be the matrix RAM, matrix information RAM, and vector RAM provided by the embodiments of the present invention, which may specifically be determined according to actual application requirements, without limitation here. The scheduling unit 802 may be a functional module in the PE such as a read pointer, a filter, or a selector, or a functional module in another form for scheduling the data stored in the memory 801, without limitation here. The arithmetic unit 803 may be a functional module in the PE such as an adder or an accelerator, without limitation here. The general-purpose processor 804 may also be a data preprocessing module or a data initialization module outside the PE, configured to perform operations such as preprocessing or initialization of the matrix data, without limitation here.
The memory 801 is configured to store a preset matrix and first indication information of matrix elements of the preset matrix, the first indication information being used to indicate non-zero elements in the preset matrix.
The scheduling unit 802 is configured to obtain the first indication information from the memory 801, read matrix element values of non-zero elements from the preset matrix according to the first indication information, and determine a first position code of each read matrix element value, the first position code being a mark of the position of the matrix element value within the matrix data of a single read.
The memory 801 is further configured to store input vector data and second indication information of vector elements of the input vector data, the second indication information being used to indicate vector data information to be read.
The scheduling unit 802 is further configured to read the second indication information from the memory 801 and, according to the second indication information, read from the input vector data the vector element value at a second position code corresponding to the first position code.
The arithmetic unit 803 is configured to compute the product of the matrix element value and the vector element value read by the scheduling unit.
In some feasible implementations, the multiplication apparatus further includes:
a general-purpose processor 804, configured to obtain a matrix to be processed and assign a position code to each matrix element in the matrix to be processed to obtain a pre-code of each matrix element, where each row of the matrix to be processed includes K elements, K being an integer greater than zero.
The general-purpose processor 804 is further configured to select the non-zero elements in the matrix to be processed, generate the preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, and store the preset matrix in the memory, each row of the preset matrix including K non-zero elements.
The general-purpose processor 804 is further configured to generate the first indication information of matrix elements according to the preset matrix and the pre-codes of the non-zero elements it contains, and store the first indication information in the memory.
In some feasible implementations, the general-purpose processor 804 is further configured to:
process the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in the current computation to obtain a position code of each non-zero element, and add the position codes of the non-zero elements to the first indication information;
where the position code of any of the non-zero elements is smaller than the data size.
In some feasible implementations, the first indication information includes a matrix read pointer, a matrix valid pointer, and a valid matrix element count;
the matrix read pointer is used to indicate the matrix element rows in the preset matrix to be read for the current computation;
the matrix valid pointer points to the position, within the matrix element rows to be read, of the starting non-zero element participating in the current computation;
the valid matrix element count is used to indicate the number M of non-zero elements to be read for the current computation, M being an integer greater than or equal to 1;
the scheduling unit is configured to:
look up, in the preset matrix, the designated matrix element rows pointed to by the matrix read pointer, and read M matrix element values from the designated matrix element rows starting at the designated position pointed to by the matrix valid pointer.
In some feasible implementations, the first indication information further includes a matrix read pointer increment;
the initial value of the matrix read pointer increment is zero, indicating that the matrix element rows to be read in the current computation are the matrix element rows indicated by the matrix read pointer;
the general-purpose processor is configured to:
increase the matrix read pointer increment by 1 if M exceeds the number of remaining non-zero elements in the matrix element rows to be read, an increment of 1 indicating that the matrix element rows to be read in the next computation are the two rows following the matrix element rows indicated by the matrix read pointer;
where the remaining non-zero elements are the non-zero elements, contained in the matrix element rows to be read, located after the position pointed to by the matrix valid pointer.
In some feasible implementations, the general-purpose processor 804 is further configured to:
update the matrix read pointer according to the matrix read pointer increment to obtain a matrix read pointer for the next computation.
In some feasible implementations, the vector data information to be read includes the vector data rows to be read in the current computation;
the general-purpose processor 804 is further configured to:
determine, according to the pre-codes of the non-zero elements in the matrix to be processed, the number of non-zero elements contained in each matrix element row of the matrix to be processed, and generate the second indication information of vector elements according to the number of non-zero elements contained in each matrix element row;
where the second indication information includes the vector data rows to be read as indicated by the vector read pointer, and a vector read pointer increment;
the vector read pointer increment indicates the number of rows between the vector data rows to be read in the next computation and the vector data rows indicated by the vector read pointer.
In some feasible implementations, the general-purpose processor 804 is configured to:
set the vector read pointer increment to H if the number of non-zero elements contained in each matrix element row is non-zero, H being the ratio of the preset size of matrix data read in the current computation to K;
set the vector read pointer increment to H1 if the number H1 of matrix element rows containing only zero elements is greater than H.
In some feasible implementations, the scheduling unit 802 is configured to:
look up the vector data rows to be read in the input vector data according to the second indication information, where the input vector data includes T*K elements, T being an integer greater than 1;
read, from the vector data rows, the vector element value at the second position code corresponding to the first position code.
In some feasible implementations, the general-purpose processor 804 is further configured to:
update the vector read pointer according to the vector read pointer increment to obtain a vector read pointer for the next computation.
In a specific implementation, the multiplication apparatus may execute, through its built-in functional units, the implementations described in the foregoing embodiments, which are not repeated here.
In the embodiment of the present invention, non-zero elements of the matrix to be processed are indicated by information such as the matrix read pointer, matrix valid pointer, valid matrix element count, and matrix read pointer increment, and non-zero element values are read from the preset matrix to be multiplied with vector data values. This improves the scheduling accuracy of matrix elements, eliminates operations such as non-zero checks on matrix elements prior to scheduling, and reduces the complexity of matrix element scheduling. The embodiment of the present invention can also read, from the input vector data, the vector data values corresponding to the positions of the matrix element values according to indication information such as the vector read pointer and vector read pointer increment, which saves checks on matrix element values during the multiplication, thereby reducing data processing complexity, lowering power consumption, and improving data processing efficiency. This application can further assign position codes to the matrix elements of the preset matrix according to the data size of a single read, guaranteeing a fixed code width and reducing the operational complexity of data processing.

Claims (20)

  1. A matrix and vector multiplication operation method, comprising:
    obtaining first indication information of matrix elements, the first indication information being used to indicate non-zero elements in a preset matrix;
    reading matrix element values of non-zero elements from the preset matrix according to the first indication information, and determining a first position code of each read matrix element value, the first position code being a mark of the position of the matrix element value within the matrix data of a single read;
    obtaining second indication information of vector elements, the second indication information being used to indicate vector data information to be read;
    reading, from input vector data according to the second indication information, a vector element value at a second position code corresponding to the first position code;
    obtaining a product of the matrix element value and the vector element value.
  2. The method according to claim 1, wherein before the obtaining first indication information of matrix elements, the method further comprises:
    obtaining a matrix to be processed, and assigning a position code to each matrix element in the matrix to be processed to obtain a pre-code of each matrix element, wherein each row of the matrix to be processed comprises K elements, K being an integer greater than zero;
    selecting non-zero elements in the matrix to be processed, and generating the preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, each row of the preset matrix comprising K non-zero elements;
    generating the first indication information of matrix elements according to the preset matrix and the pre-codes of the non-zero elements it contains.
  3. The method according to claim 2, wherein after the generating the preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, the method further comprises:
    processing the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in a current computation to obtain a position code of each non-zero element;
    adding the position codes of the non-zero elements to the first indication information;
    wherein the position code of any of the non-zero elements is smaller than the data size.
  4. The method according to claim 2 or 3, wherein the first indication information comprises a matrix read pointer, a matrix valid pointer, and a valid matrix element count;
    the matrix read pointer is used to indicate matrix element rows in the preset matrix to be read for the current computation;
    the matrix valid pointer points to a position, in the matrix element rows to be read, of a starting non-zero element participating in the current computation;
    the valid matrix element count is used to indicate a number M of non-zero elements to be read for the current computation, M being an integer greater than or equal to 1;
    the reading matrix element values of non-zero elements from the preset matrix according to the first indication information comprises:
    looking up, in the preset matrix, designated matrix element rows pointed to by the matrix read pointer, and reading M matrix element values from the designated matrix element rows starting at a designated position pointed to by the matrix valid pointer.
  5. The method according to claim 4, wherein the first indication information further comprises a matrix read pointer increment;
    an initial value of the matrix read pointer increment is zero, indicating that the matrix element rows to be read in the current computation are the matrix element rows indicated by the matrix read pointer;
    the generating the first indication information of matrix elements according to the preset matrix and the pre-codes of the non-zero elements it contains comprises:
    increasing the matrix read pointer increment by 1 if M exceeds the number of remaining non-zero elements in the matrix element rows to be read, an increment of 1 indicating that the matrix element rows to be read in a next computation are the two rows following the matrix element rows indicated by the matrix read pointer;
    wherein the remaining non-zero elements are the non-zero elements, contained in the matrix element rows to be read, located after the position pointed to by the matrix valid pointer.
  6. The method according to claim 4, further comprising:
    updating the matrix read pointer according to the matrix read pointer increment to obtain a matrix read pointer for a next computation.
  7. The method according to any one of claims 2-6, wherein the vector data information to be read comprises vector data rows to be read in the current computation;
    before the obtaining second indication information of vector elements, the method further comprises:
    determining, according to the pre-codes of the non-zero elements in the matrix to be processed, a number of non-zero elements contained in each matrix element row of the matrix to be processed;
    generating the second indication information of vector elements according to the number of non-zero elements contained in each matrix element row;
    wherein the second indication information comprises the vector data rows to be read as indicated by a vector read pointer, and a vector read pointer increment;
    the vector read pointer increment indicates a number of rows between the vector data rows to be read in a next computation and the vector data rows indicated by the vector read pointer.
  8. The method according to claim 7, wherein the generating the second indication information of vector elements according to the number of non-zero elements contained in each matrix element row comprises:
    setting the vector read pointer increment to H if the number of non-zero elements contained in each matrix element row is non-zero, H being a ratio of the preset size of matrix data read in the current computation to K;
    setting the vector read pointer increment to H1 if a number H1 of matrix element rows containing only zero elements is greater than H.
  9. The method according to claim 7 or 8, wherein the reading, from input vector data according to the second indication information, a vector element value at a second position code corresponding to the first position code comprises:
    looking up vector data rows to be read in the input vector data according to the second indication information, the input vector data comprising T*K elements, T being an integer greater than 1;
    reading, from the vector data rows, the vector element value at the second position code corresponding to the first position code.
  10. The method according to any one of claims 7-9, further comprising:
    updating the vector read pointer according to the vector read pointer increment to obtain a vector read pointer for a next computation.
  11. A matrix and vector multiplication operation apparatus, comprising: a memory, a scheduling unit, and an arithmetic unit;
    the memory is configured to store a preset matrix and first indication information of matrix elements of the preset matrix, the first indication information being used to indicate non-zero elements in the preset matrix;
    the scheduling unit is configured to obtain the first indication information from the memory, read matrix element values of non-zero elements from the preset matrix according to the first indication information, and determine a first position code of each read matrix element value, the first position code being a mark of the position of the matrix element value within the matrix data of a single read;
    the memory is further configured to store input vector data and second indication information of vector elements of the input vector data, the second indication information being used to indicate vector data information to be read;
    the scheduling unit is further configured to read the second indication information from the memory and, according to the second indication information, read from the input vector data a vector element value at a second position code corresponding to the first position code;
    the arithmetic unit is configured to compute a product of the matrix element value and the vector element value read by the scheduling unit.
  12. The multiplication apparatus according to claim 11, further comprising:
    a general-purpose processor, configured to obtain a matrix to be processed and assign a position code to each matrix element in the matrix to be processed to obtain a pre-code of each matrix element, wherein each row of the matrix to be processed comprises K elements, K being an integer greater than zero;
    the general-purpose processor is further configured to select non-zero elements in the matrix to be processed, generate the preset matrix according to the pre-codes of the non-zero elements in the matrix to be processed, and store the preset matrix in the memory, each row of the preset matrix comprising K non-zero elements;
    the general-purpose processor is further configured to generate the first indication information of matrix elements according to the preset matrix and the pre-codes of the non-zero elements it contains, and store the first indication information in the memory.
  13. The multiplication apparatus according to claim 12, wherein the general-purpose processor is further configured to:
    process the pre-codes of the non-zero elements contained in the preset matrix according to a preset size of matrix data read in the current computation to obtain a position code of each non-zero element, and add the position codes of the non-zero elements to the first indication information;
    wherein the position code of any of the non-zero elements is smaller than the data size.
  14. The multiplication apparatus according to claim 12 or 13, wherein the first indication information comprises a matrix read pointer, a matrix valid pointer, and a valid matrix element count;
    the matrix read pointer is used to indicate matrix element rows in the preset matrix to be read for the current computation;
    the matrix valid pointer points to a position, in the matrix element rows to be read, of a starting non-zero element participating in the current computation;
    the valid matrix element count is used to indicate a number M of non-zero elements to be read for the current computation, M being an integer greater than or equal to 1;
    the scheduling unit is configured to:
    look up, in the preset matrix, designated matrix element rows pointed to by the matrix read pointer, and read M matrix element values from the designated matrix element rows starting at a designated position pointed to by the matrix valid pointer.
  15. The multiplication apparatus according to claim 14, wherein the first indication information further comprises a matrix read pointer increment;
    an initial value of the matrix read pointer increment is zero, indicating that the matrix element rows to be read in the current computation are the matrix element rows indicated by the matrix read pointer;
    the general-purpose processor is configured to:
    increase the matrix read pointer increment by 1 if M exceeds the number of remaining non-zero elements in the matrix element rows to be read, an increment of 1 indicating that the matrix element rows to be read in a next computation are the two rows following the matrix element rows indicated by the matrix read pointer;
    wherein the remaining non-zero elements are the non-zero elements, contained in the matrix element rows to be read, located after the position pointed to by the matrix valid pointer.
  16. The multiplication apparatus according to claim 14, wherein the general-purpose processor is further configured to:
    update the matrix read pointer according to the matrix read pointer increment to obtain a matrix read pointer for a next computation.
  17. The multiplication apparatus according to any one of claims 12-16, wherein the vector data information to be read comprises vector data rows to be read in the current computation;
    the general-purpose processor is further configured to:
    determine, according to the pre-codes of the non-zero elements in the matrix to be processed, a number of non-zero elements contained in each matrix element row of the matrix to be processed, and generate the second indication information of vector elements according to the number of non-zero elements contained in each matrix element row;
    wherein the second indication information comprises the vector data rows to be read as indicated by a vector read pointer, and a vector read pointer increment;
    the vector read pointer increment indicates a number of rows between the vector data rows to be read in a next computation and the vector data rows indicated by the vector read pointer.
  18. The multiplication apparatus according to claim 17, wherein the general-purpose processor is configured to:
    set the vector read pointer increment to H if the number of non-zero elements contained in each matrix element row is non-zero, H being a ratio of the preset size of matrix data read in the current computation to K;
    set the vector read pointer increment to H1 if a number H1 of matrix element rows containing only zero elements is greater than H.
  19. The multiplication apparatus according to claim 17 or 18, wherein the scheduling unit is configured to:
    look up vector data rows to be read in the input vector data according to the second indication information, the input vector data comprising T*K elements, T being an integer greater than 1;
    read, from the vector data rows, the vector element value at the second position code corresponding to the first position code.
  20. The multiplication apparatus according to claim 11, wherein the general-purpose processor is further configured to:
    update the vector read pointer according to the vector read pointer increment to obtain a vector read pointer for a next computation.
PCT/CN2017/113422 2017-03-31 2017-11-28 Matrix and vector multiplication operation method and apparatus WO2018176882A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17902691.9A EP3584719A4 (en) 2017-03-31 2017-11-28 METHOD AND DEVICE FOR MULTIPLYING MATRICES WITH VECTORS
US16/586,164 US20200026746A1 (en) 2017-03-31 2019-09-27 Matrix and Vector Multiplication Operation Method and Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710211498.X 2017-03-31
CN201710211498.XA CN108664447B (zh) 2017-03-31 2017-03-31 Matrix and vector multiplication operation method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/586,164 Continuation US20200026746A1 (en) 2017-03-31 2019-09-27 Matrix and Vector Multiplication Operation Method and Apparatus

Publications (1)

Publication Number Publication Date
WO2018176882A1 true WO2018176882A1 (zh) 2018-10-04

Family

ID=63675242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/113422 WO2018176882A1 (zh) 2017-03-31 2017-11-28 Matrix and vector multiplication operation method and apparatus

Country Status (4)

Country Link
US (1) US20200026746A1 (zh)
EP (1) EP3584719A4 (zh)
CN (1) CN108664447B (zh)
WO (1) WO2018176882A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222835A (zh) * 2019-05-13 2019-09-10 Xi'an Jiaotong University Convolutional neural network hardware system based on zero-value detection, and operation method
US11379556B2 (en) * 2019-05-21 2022-07-05 Arm Limited Apparatus and method for matrix operations
TWI688871B (zh) 2019-08-27 2020-03-21 國立清華大學 矩陣乘法裝置及其操作方法
CN111798363A (zh) * 2020-07-06 2020-10-20 上海兆芯集成电路有限公司 图形处理器
US11429394B2 (en) * 2020-08-19 2022-08-30 Meta Platforms Technologies, Llc Efficient multiply-accumulation based on sparse matrix
US11244718B1 (en) * 2020-09-08 2022-02-08 Alibaba Group Holding Limited Control of NAND flash memory for al applications
CN115859011B (zh) * 2022-11-18 2024-03-15 Shanghai Tianshu Zhixin Semiconductor Co., Ltd. Matrix operation method, apparatus and unit, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1714552A (zh) * 2002-11-19 2005-12-28 Qualcomm Incorporated Reduced-complexity channel estimation for wireless communication systems
CN101630178A (zh) * 2008-07-16 2010-01-20 Institute of Semiconductors, Chinese Academy of Sciences Silicon-based integrated optical vector-matrix multiplier
CN105426344A (zh) * 2015-11-09 2016-03-23 Nanjing University Matrix computation method for Spark-based distributed large-scale matrix multiplication
US20160179750A1 (en) * 2014-12-22 2016-06-23 Palo Alto Research Center Incorporated Computer-Implemented System And Method For Efficient Sparse Matrix Representation And Processing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6904179B2 (en) * 2000-04-27 2005-06-07 Xerox Corporation Method for minimal-logic non-linear filter implementation
CN102541814B (zh) * 2010-12-27 2015-10-14 Beijing Guorui Zhongshu Technology Co., Ltd. Matrix computation apparatus and method for a data communication processor
CN104951442B (zh) * 2014-03-24 2018-09-07 Huawei Technologies Co., Ltd. Method and apparatus for determining a result vector
US9697176B2 (en) * 2014-11-14 2017-07-04 Advanced Micro Devices, Inc. Efficient sparse matrix-vector multiplication on parallel processors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1714552A (zh) * 2002-11-19 2005-12-28 Qualcomm Incorporated Reduced-complexity channel estimation for wireless communication systems
CN101630178A (zh) * 2008-07-16 2010-01-20 Institute of Semiconductors, Chinese Academy of Sciences Silicon-based integrated optical vector-matrix multiplier
US20160179750A1 (en) * 2014-12-22 2016-06-23 Palo Alto Research Center Incorporated Computer-Implemented System And Method For Efficient Sparse Matrix Representation And Processing
CN105426344A (zh) * 2015-11-09 2016-03-23 Nanjing University Matrix computation method for Spark-based distributed large-scale matrix multiplication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3584719A4 *

Also Published As

Publication number Publication date
EP3584719A1 (en) 2019-12-25
EP3584719A4 (en) 2020-03-04
CN108664447B (zh) 2022-05-17
CN108664447A (zh) 2018-10-16
US20200026746A1 (en) 2020-01-23

Similar Documents

Publication Publication Date Title
WO2018176882A1 (zh) Matrix and vector multiplication operation method and apparatus
US11698773B2 (en) Accelerated mathematical engine
CN109543832B (zh) 一种计算装置及板卡
EP3451162B1 (en) Device and method for use in executing matrix multiplication operations
US10379816B2 (en) Data accumulation apparatus and method, and digital signal processing device
CN109522052B (zh) 一种计算装置及板卡
CN104899182B (zh) 一种支持可变分块的矩阵乘加速方法
CN106855952B (zh) 基于神经网络的计算方法及装置
CN111651205B (zh) 一种用于执行向量内积运算的装置和方法
CN109885857B (zh) 指令发射控制方法、指令执行验证方法、系统及存储介质
US20190164254A1 (en) Processor and method for scaling image
CN114503126A (zh) 矩阵运算电路、装置以及方法
CN104536914B (zh) 基于寄存器访问标记的相关处理装置和方法
CN114138231B (zh) 执行矩阵乘法运算的方法、电路及soc
US20210248497A1 (en) Architecture to support tanh and sigmoid operations for inference acceleration in machine learning
CN108008665B (zh) 基于单片fpga的大规模圆阵实时波束形成器及波束形成计算方法
CN114117896A (zh) 面向超长simd管线的二值规约优化实现方法及系统
CN116382782A (zh) 向量运算方法、向量运算器、电子设备和存储介质
CN113890508A (zh) 一种批处理fir算法的硬件实现方法和硬件系统
CN112506853A (zh) 零缓冲流水的可重构处理单元阵列及零缓冲流水方法
Park et al. ShortcutFusion++: optimizing an end-to-end CNN accelerator for high PE utilization
TWI768497B (zh) 智慧處理器、資料處理方法及儲存介質
CN109522125A (zh) 一种矩阵乘积转置的加速方法、装置及处理器
CN109657192B (zh) 一种用于fft中旋转因子乘运算的操作数地址生成方法
CN100465878C (zh) 一种开方运算的方法及其装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17902691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017902691

Country of ref document: EP

Effective date: 20190920