WO2020029018A1 - Matrix processing method, device, and logic circuit - Google Patents

Info

Publication number
WO2020029018A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
processed
elements
distribution
distribution matrix
Application number
PCT/CN2018/098993
Other languages
English (en)
French (fr)
Inventor
董镇江
杨超然
刘虎
陈海
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202110395943.9A (CN113190791A)
Priority to CN201880015972.4A (CN111010883B)
Priority to PCT/CN2018/098993 (WO2020029018A1)
Priority to EP18929049.7A (EP3690679A4)
Publication of WO2020029018A1
Priority to US16/869,837 (US11250108B2)
Priority to US17/560,472 (US11734386B2)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/40: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using contact-making devices, e.g. electromagnetic relay
    • G06F7/42: Adding; Subtracting
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03K: PULSE TECHNIQUE
    • H03K19/00: Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
    • H03K19/20: Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits characterised by logic function, e.g. AND, OR, NOR, NOT circuits
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • The present application relates to the technical field of data processing, and in particular to a matrix processing method, device, and logic circuit.
  • Matrices are a computational tool commonly used in scientific computing and are widely applied in engineering.
  • A sparse matrix is a special case of a matrix that contains only a small number of non-zero elements. Because most of its elements have the value 0, storing it in the conventional way leads to a large number of unnecessary operations during matrix computation.
  • The mainstream method for processing sparse matrices is Compressed Row Storage (CSR).
  • The compressed matrix obtained by applying CSR to a sparse matrix stores the non-zero elements through three arrays: row offsets, element column numbers, and element values.
  • The element values and column numbers give each non-zero element and its column number in the matrix.
  • The row offset of a row gives the position, within the value array, of the first non-zero element of that row.
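The CSR layout described above can be illustrated with a minimal Python sketch. It shows the prior-art format, not code from this application. The function name `to_csr` is hypothetical. Appending a final offset equal to the total number of non-zero elements is a common CSR convention, assumed here so that the non-zeros of row `r` occupy `values[offsets[r]:offsets[r+1]]`.

```python
def to_csr(dense):
    """Encode a 2-D list as CSR arrays: (values, col_indices, row_offsets)."""
    values, col_indices, row_offsets = [], [], [0]
    for row in dense:
        for col, x in enumerate(row):
            if x != 0:               # store only non-zero elements
                values.append(x)
                col_indices.append(col)
        # position in `values` where the next row's first non-zero will start
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


A = [[0, 5, 0],
     [0, 0, 0],
     [7, 0, 9]]
values, cols, offsets = to_csr(A)
# values == [5, 7, 9], cols == [1, 0, 2], offsets == [0, 1, 1, 3]
```

An empty row simply repeats the previous offset, which is why the offset 1 appears twice for the all-zero middle row.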
  • Related operations can then be performed directly on the compressed matrix instead of the uncompressed sparse matrix. During a convolution of sparse matrices, this avoids the invalid calculations in which a zero element is multiplied by the element at the same position in the other matrix to produce 0.
  • However, before the operation it must be determined which non-zero elements in one CSR compressed matrix are to be convolved with which non-zero elements in another CSR compressed matrix.
  • The present application provides a matrix processing method, device, and logic circuit.
  • The non-zero elements of a matrix to be processed are determined, together with a distribution matrix that represents their positions. The number of non-zero elements, the non-zero elements, and the distribution matrix are then combined into a compression matrix. As a result, sparse-matrix operations such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract can be performed on the compression matrix instead of the sparse matrix to obtain the result. This avoids invalid calculations on zero elements and improves the efficiency of the matrix processing method.
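  • A first aspect of the present application provides a matrix processing method, including:
  • determining the number of non-zero elements in a matrix to be processed, where the matrix to be processed is a one-dimensional matrix;
  • determining a distribution matrix of the matrix to be processed, where the distribution matrix represents the positions of the non-zero elements in the matrix to be processed; and
  • arranging, in sequence, the number of non-zero elements, the value of each non-zero element in the matrix to be processed, and the distribution matrix to obtain a compression matrix of the matrix to be processed.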
  • The matrix processing method of this embodiment determines the non-zero elements of the matrix to be processed and a distribution matrix representing their positions. It then combines the number of non-zero elements, the non-zero elements arranged in sequence, and the distribution matrix into a compression matrix.
  • The matrix to be processed may be subjected to, for example, a matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract operation. In that case, the compression matrix obtained by this method can be used in its place, improving the processor's storage efficiency and computational efficiency for the matrix to be processed.
  • The distribution matrix is a one-dimensional matrix, and the element at each position in the matrix to be processed corresponds one-to-one to the element at the same position in the distribution matrix. Determining the distribution matrix of the matrix to be processed includes:
    • scanning the elements of the matrix to be processed in sequence;
    • when a scanned element is non-zero, setting the corresponding element of the distribution matrix to 1; and
    • when a scanned element is zero, setting the corresponding element of the distribution matrix to 0.
  • In other words, whether each element of the matrix to be processed is zero is recorded in a distribution matrix. This matrix has the same dimension as the matrix to be processed, and its elements correspond one-to-one by position. Specifically, the elements of the matrix to be processed are scanned. The constant 1 in the distribution matrix represents a non-zero element of the matrix to be processed, and the constant 0 represents a zero element. When the matrix is processed, the distribution of its zero and non-zero elements can therefore be read from the simpler distribution matrix.
  • Each constant 0 or 1 occupies only 1 bit, so the distribution of zero and non-zero elements can be identified without scanning the much larger matrix to be processed when it takes part in an operation. This saves bandwidth for reading data during matrix processing.
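Because each distribution-matrix entry needs only 1 bit, the flags can be packed eight to a byte. The sketch below assumes a most-significant-bit-first layout and uses hypothetical helper names (`pack_bits`, `unpack_bits`). The application does not specify a packing order.

```python
def pack_bits(distribution):
    """Pack a list of 0/1 flags into bytes, 8 flags per byte, MSB first (assumed layout)."""
    packed = bytearray((len(distribution) + 7) // 8)
    for i, flag in enumerate(distribution):
        if flag:
            packed[i // 8] |= 0x80 >> (i % 8)   # set bit i within its byte
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n flags from bytes produced by pack_bits."""
    return [(packed[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]


dist = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
packed = pack_bits(dist)          # 11 flags fit in 2 bytes: 0x48, 0x80
assert unpack_bits(packed, len(dist)) == dist
```

The length `n` must be carried separately because the last byte may be padded with zero bits.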
  • Let the number of elements in the matrix to be processed be N and the number of its non-zero elements be M.
  • Then the distribution matrix has N elements, of which M have the value 1, and the compression matrix has M + N + 1 elements. Here N is a positive integer, M is a non-negative integer, and M is less than or equal to N.
  • This method fixes the number of elements in the compression matrix more precisely. The compression matrix of a one-dimensional matrix contains only M + N + 1 elements: 1 element giving the number of non-zero elements, the M non-zero elements, and the N elements of the distribution matrix. The compression matrix is therefore equivalent to the uncompressed matrix for computation, so the result obtained from the compression matrix equals the result obtained from the matrix to be processed.
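The following is a minimal Python sketch of the compression layout. It assumes a plain list for the one-dimensional matrix and stores each distribution flag as an ordinary integer rather than a packed bit. The name `compress` is hypothetical. The assertion checks the M + N + 1 element count stated above.

```python
def compress(matrix):
    """Return [M, non-zero values in order..., N distribution flags...]."""
    nonzeros = [x for x in matrix if x != 0]             # non-zero values; their count is M
    distribution = [1 if x != 0 else 0 for x in matrix]  # 1 = non-zero, 0 = zero
    return [len(nonzeros)] + nonzeros + distribution     # combine in sequence


m = [0, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
c = compress(m)
M, N = c[0], len(m)
assert len(c) == M + N + 1   # 3 + 11 + 1 == 15
```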
  • The to-be-processed matrix includes a first to-be-processed matrix and a second to-be-processed matrix, which have the same number of elements.
  • Correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix.
  • The method further includes obtaining a target value from four inputs: the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix. The target value equals the sum of the products of the element at each position in the first to-be-processed matrix and the element at the same position in the second to-be-processed matrix.
  • The first compression matrix and the second compression matrix are obtained for the first and second to-be-processed matrices that need to be calculated. The distribution matrices and non-zero elements they contain are then used in place of the first and second to-be-processed matrices for the calculation. This improves the processor's storage efficiency and computational efficiency for the first and second to-be-processed matrices.
  • Obtaining the target value from the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix includes:
  • obtaining, in turn, each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element with the value 1 in the first distribution matrix;
  • taking the first effective elements among the non-zero elements of the first to-be-processed matrix as the elements of a first simplified matrix. A first effective element is a non-zero element whose order among the non-zero elements of the first to-be-processed matrix matches the order of a first target element with the value 1 in the first mask matrix;
  • obtaining, in turn, each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element with the value 1 in the second distribution matrix;
  • taking the second effective elements among the non-zero elements of the second to-be-processed matrix as the elements of a second simplified matrix. A second effective element is a non-zero element whose order among the non-zero elements of the second to-be-processed matrix matches the order of a second target element with the value 1 in the second mask matrix;
  • accumulating the products of the element at each position in the first simplified matrix and the element at the same position in the second simplified matrix to obtain the target value.
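The steps above can be sketched in Python as follows. This is a software illustration, not the hardware datapath of the application. It assumes plain lists, 0/1 distribution flags, and the hypothetical name `masked_dot`. The list comprehensions stand in for the sequential selection of target and effective elements.

```python
def masked_dot(nz1, d1, nz2, d2):
    """Target value of two equal-length 1-D matrices from their non-zero
    elements (nz1, nz2, in order) and 0/1 distribution matrices (d1, d2)."""
    # First mask matrix: elements of d2 at the positions where d1 is 1.
    mask1 = [b for a, b in zip(d1, d2) if a == 1]
    # Second mask matrix: elements of d1 at the positions where d2 is 1.
    mask2 = [a for a, b in zip(d1, d2) if b == 1]
    # Simplified matrices: keep the non-zeros whose mask entry is 1
    # (the first / second effective elements), preserving their order.
    s1 = [v for v, keep in zip(nz1, mask1) if keep == 1]
    s2 = [v for v, keep in zip(nz2, mask2) if keep == 1]
    # Accumulate the products of aligned elements.
    return sum(x * y for x, y in zip(s1, s2))


a = [0, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
b = [4, 5, 0, 0, 0, 6, 0, 0, 7, 0, 8]
d1 = [1 if x != 0 else 0 for x in a]
d2 = [1 if x != 0 else 0 for x in b]
nz1 = [x for x in a if x != 0]
nz2 = [x for x in b if x != 0]
target = masked_dot(nz1, d1, nz2, d2)
assert target == sum(x * y for x, y in zip(a, b))  # both equal 26
```

Only positions 1 and 8 are non-zero in both example matrices. The simplified matrices are therefore [1, 3] and [5, 7], and the target value is 1 × 5 + 3 × 7 = 26, matching the dense dot product.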
  • The first and second compression matrices can therefore be used instead of the first and second to-be-processed matrices in a convolution operation to obtain the target value of the first and second to-be-processed matrices.
  • The first and second mask matrices are determined from the first and second distribution matrices. The first and second simplified matrices are then determined from the two mask matrices. Finally, the target value is obtained by multiplying the aligned elements of the first and second simplified matrices and accumulating the products.
  • The matrix processing method of the first aspect determines the non-zero elements of the matrix to be processed and a distribution matrix representing their positions. It combines the number of non-zero elements, the non-zero elements, and the distribution matrix into a compression matrix. When a sparse matrix is subjected to a convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract operation, the compression matrix replaces the sparse matrix in the operation and yields the sparse matrix's result.
  • This improves the computational efficiency for sparse matrices and thus the efficiency of the matrix processing method.
  • A second aspect of the present application provides a logic circuit configured to obtain a first mask matrix and a second mask matrix from a first distribution matrix and a second distribution matrix, where:
    • the first distribution matrix represents the positions of the non-zero elements in a first to-be-processed matrix;
    • the second distribution matrix represents the positions of the non-zero elements in a second to-be-processed matrix;
    • the first mask matrix represents the first target elements in the second distribution matrix, and the position of each first target element in the second distribution matrix is the same as the position of an element with the value 1 in the first distribution matrix;
    • the second mask matrix represents the second target elements in the first distribution matrix, and the position of each second target element in the first distribution matrix is the same as the position of an element with the value 1 in the second distribution matrix.
  • The logic circuit includes first switching logic and second switching logic, where:
  • a first input terminal of the first switching logic sequentially receives the element at each position in the second distribution matrix. A second input terminal of the first switching logic sequentially receives the element of the first distribution matrix at the same position as the element just received from the second distribution matrix. An output terminal of the first switching logic outputs the first target elements to form the first mask matrix;
  • when the element received at the second input terminal of the first switching logic is 1, the first switching logic outputs, from its output terminal, the element received at its first input terminal;
  • a first input terminal of the second switching logic sequentially receives the element at each position in the first distribution matrix. A second input terminal of the second switching logic sequentially receives the element of the second distribution matrix at the same position as the element just received from the first distribution matrix. An output terminal of the second switching logic outputs the second target elements to form the second mask matrix;
  • when the element received at the second input terminal of the second switching logic is 1, the second switching logic outputs, from its output terminal, the element received at its first input terminal.
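The behaviour of the two switching logics can be modelled in software. Each clock, the data element on the first input terminal passes to the output only when the control element on the second input terminal is 1. The sketch below is a behavioural model with the hypothetical function name `switching_logic`. It ignores timing, latches, and the AND gate logic, so it is not a hardware description.

```python
def switching_logic(first_input, second_input):
    """Behavioural model: per clock, forward the element on the first input
    terminal to the output only when the element on the second input is 1."""
    output = []
    for data, control in zip(first_input, second_input):
        if control == 1:
            output.append(data)
    return output


d1 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # first distribution matrix
d2 = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]   # second distribution matrix
mask1 = switching_logic(d2, d1)   # first switching logic  -> [1, 0, 1]
mask2 = switching_logic(d1, d2)   # second switching logic -> [0, 1, 0, 1, 0]
```

The length of each mask equals the number of 1s in the distribution matrix used as the control input. Each mask therefore lines up with the non-zero elements of its own to-be-processed matrix.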
  • The logic circuit of this embodiment uses relatively simple switching logic to implement the step, described in the above embodiment, of obtaining the first and second mask matrices from the first and second distribution matrices.
  • The switching logic can receive the elements of the distribution matrices and output the elements of the mask matrices within one processor clock cycle, which keeps the subsequent array processor running smoothly.
  • The logic circuit further includes AND gate logic;
  • a first input terminal of the AND gate logic sequentially receives the element at each position in the first distribution matrix. A second input terminal of the AND gate logic sequentially receives the element of the second distribution matrix at the same position as the element just received from the first distribution matrix.
  • The output terminal of the AND gate logic outputs the AND of its two inputs to the second input terminal of the first switching logic and to the second input terminal of the second switching logic.
  • The AND gate logic acts as a buffer and gives the first and second switching logic time to close their switches. After the switches are closed, the AND gate logic outputs the AND result from its output terminal to the first and second switching logic. This ensures that the second input terminals of the first and second switching logic receive the correct input.
  • The logic circuit further includes a first latch and a second latch;
  • an input terminal of the first latch sequentially receives the element at each position in the second distribution matrix, and an output terminal of the first latch outputs the received element to the first switching logic after a first preset delay;
  • an input terminal of the second latch sequentially receives the element at each position in the first distribution matrix, and an output terminal of the second latch outputs the received element to the second switching logic after a second preset delay.
  • The first preset delay is the switch-on delay of the first switching logic, and the second preset delay is the switch-on delay of the second switching logic.
  • The logic circuit of this embodiment includes a first latch and a second latch that act as buffers.
  • After receiving an element of the second distribution matrix, the first latch gives the first switching logic time to close its switch. Once that switch is closed, the first latch outputs the received element to the first switching logic through its output terminal. Likewise, after receiving an element of the first distribution matrix, the second latch gives the second switching logic time to close its switch. Once that switch is closed, the second latch outputs the received element to the second switching logic through its output terminal.
  • The first preset delay is set to the switch-on delay of the first switching logic, and the second preset delay to the switch-on delay of the second switching logic. This ensures that the first and second switching logic receive the correct elements without error.
  • The logic circuit of the second aspect includes first switching logic and second switching logic, which obtain the first and second mask matrices from the first and second distribution matrices.
  • The first input terminal of the first switching logic sequentially receives the element at each position in the second distribution matrix. The second input terminal of the first switching logic sequentially receives the element of the first distribution matrix at the same position as the element just received from the second distribution matrix.
  • The output terminal of the first switching logic outputs the first target elements to form the first mask matrix.
  • When the element received at the second input terminal of the first switching logic is 1, the first switching logic outputs, from its output terminal, the element received at its first input terminal.
  • The first input terminal of the second switching logic sequentially receives the element at each position in the first distribution matrix. The second input terminal of the second switching logic sequentially receives the element of the second distribution matrix at the same position as the element just received from the first distribution matrix. The output terminal of the second switching logic outputs the second target elements to form the second mask matrix.
  • When the element received at the second input terminal of the second switching logic is 1, the second switching logic outputs, from its output terminal, the element received at its first input terminal.
  • The logic circuit provided by the present application has simple logic and low hardware cost. When applied in a processor, it obtains the first and second mask matrices from the first and second distribution matrices within one clock cycle, improving the processing efficiency of the logic circuit.
  • A third aspect of the present application provides a matrix processing device, including:
  • a first determining module, configured to determine the number of non-zero elements in a matrix to be processed, where the matrix to be processed is a one-dimensional matrix;
  • a second determining module, configured to determine a distribution matrix of the matrix to be processed, where the distribution matrix represents the positions of the non-zero elements in the matrix to be processed;
  • a processing module, configured to combine, in sequence, the number of non-zero elements, the value of each non-zero element in the matrix to be processed, and the distribution matrix to obtain a compression matrix of the matrix to be processed.
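  • The distribution matrix is a one-dimensional matrix, and the element at each position in the matrix to be processed corresponds one-to-one to the element at the same position in the distribution matrix;
  • the second determining module is specifically configured to:
    • scan the elements of the matrix to be processed in sequence;
    • set the corresponding element of the distribution matrix to 1 when a scanned element is non-zero; and
    • set it to 0 when a scanned element is zero.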
  • Let the number of elements in the matrix to be processed be N and the number of its non-zero elements be M.
  • Then the distribution matrix has N elements, of which M have the value 1, and the compression matrix has M + N + 1 elements. Here N is a positive integer, M is a non-negative integer, and M is less than or equal to N.
  • The to-be-processed matrix includes a first to-be-processed matrix and a second to-be-processed matrix, which have the same number of elements. Correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix;
  • the device further includes a calculation module, configured to obtain a target value from the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix.
  • The target value equals the sum of the products of the element at each position in the first to-be-processed matrix and the element at the same position in the second to-be-processed matrix.
  • The calculation module is specifically configured to:
  • obtain, in turn, each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element with the value 1 in the first distribution matrix;
  • take the first effective elements among the non-zero elements of the first to-be-processed matrix as the elements of a first simplified matrix. A first effective element is a non-zero element whose order among the non-zero elements of the first to-be-processed matrix matches the order of a first target element with the value 1 in the first mask matrix;
  • obtain, in turn, each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element with the value 1 in the second distribution matrix;
  • take the second effective elements among the non-zero elements of the second to-be-processed matrix as the elements of a second simplified matrix. A second effective element is a non-zero element whose order among the non-zero elements of the second to-be-processed matrix matches the order of a second target element with the value 1 in the second mask matrix;
  • accumulate the products of the element at each position in the first simplified matrix and the element at the same position in the second simplified matrix to obtain the target value.
  • In this device, the first determining module determines the non-zero elements of the matrix to be processed, and the second determining module determines the distribution matrix representing their positions.
  • The processing module then combines the number of non-zero elements, the non-zero elements arranged in sequence, and the distribution matrix into a compression matrix. The sparse matrix may be subjected to, for example, a matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract operation.
  • In that case, the compression matrix is used instead of the sparse matrix to perform the operation and obtain the sparse matrix's result. This improves the computational efficiency for sparse matrices and thus the efficiency of the matrix processing method.
  • An embodiment of the present application provides a matrix processing apparatus, including a processor and a memory. The memory stores a program, and the processor calls the program stored in the memory to perform the matrix processing method according to any implementation of the first aspect of the present application.
  • An embodiment of the present application provides a computer-readable storage medium that stores program code. When executed, the program code performs the matrix processing method according to any implementation of the first aspect of the present application.
  • FIG. 1 is a schematic flowchart of an embodiment of a matrix processing method of the present application.
  • FIG. 2 is a schematic flowchart of determining a compression matrix in a matrix processing method of the present application.
  • FIG. 3 is a schematic flowchart of determining a distribution matrix in a matrix processing method of the present application.
  • FIG. 4 is a schematic structural diagram of a compression matrix in a matrix processing method of the present application.
  • FIG. 5 is a schematic flowchart of determining a compression matrix in a matrix processing method of the present application.
  • FIG. 6 is a schematic flowchart of an embodiment of a matrix processing method of the present application.
  • FIG. 7 is a schematic flowchart of determining a mask matrix from a distribution matrix in a matrix processing method of the present application.
  • FIG. 8 is a schematic structural diagram of an embodiment of a logic circuit of the present application.
  • FIG. 9 is a schematic structural diagram of an embodiment of a logic circuit of the present application.
  • FIG. 10 is a schematic structural diagram of an embodiment of a logic circuit of the present application.
  • FIG. 11 is a schematic diagram of a processing structure in which the matrix processing method of the present application is applied to a systolic array processor.
  • FIG. 12A to FIG. 12E are schematic diagrams of the processing flow when the matrix processing method of the present application is applied to a systolic array processor.
  • FIG. 13 is a schematic diagram of a processing structure in which the matrix processing method of the present application is applied to an image convolution operation.
  • FIG. 14 is a schematic structural diagram of an embodiment of a matrix processing device of the present application.
  • FIG. 15 is a schematic structural diagram of an embodiment of a matrix processing apparatus of the present application.
  • FIG. 1 is a schematic flowchart of Embodiment 1 of a method for processing a matrix of this application. As shown in FIG. 1, the method for processing a matrix provided in this embodiment includes:
  • S101 Determine the number of non-zero elements in the matrix to be processed, where the matrix to be processed is a one-dimensional matrix;
  • S102 Determine a distribution matrix of the matrix to be processed, where the distribution matrix represents the positions of the non-zero elements in the matrix to be processed;
  • S103 Combine, in sequence, the number of non-zero elements, the value of each non-zero element in the matrix to be processed, and the distribution matrix to obtain a compression matrix of the matrix to be processed.
  • The execution subject of this embodiment may be a processor with a data processing function in an electronic device, such as a central processing unit (CPU) or a graphics processing unit (GPU).
  • The electronic device may be a mobile phone, tablet, desktop computer, or laptop.
  • When the processor needs to compress a matrix to be processed to obtain its compression matrix, it processes the matrix using the method described above.
  • The processor determines the non-zero elements of the matrix to be processed and the distribution matrix representing their positions. It then combines the number of non-zero elements, the non-zero elements arranged in sequence, and the distribution matrix into a compression matrix.
  • The matrix to be processed in this embodiment is a sparse matrix.
  • The processor compresses the sparse matrix to obtain a compression matrix, which improves its storage efficiency for the sparse matrix.
  • Sparse-matrix operations include matrix convolution, multiply-add, multiply-subtract, divide-add, and divide-subtract.
  • For such operations, the compression matrix is used instead of the sparse matrix and the sparse matrix's result is obtained, which improves the processor's computational efficiency for sparse matrices.
  • The matrix to be processed in this embodiment is a one-dimensional matrix; alternatively, it may be a multi-dimensional matrix.
  • This application uses a one-dimensional matrix as the example, and the same processing method and principle can also be applied to multi-dimensional matrices.
  • For a multi-dimensional matrix, a dimension-reduction operation may first be performed on the matrix to be processed.
  • For example, the elements of a two-dimensional matrix can be read row by row to obtain a one-dimensional matrix, and the matrix processing method of the embodiments is then applied to that one-dimensional matrix.
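Reading a two-dimensional matrix row by row is a row-major flatten. The sketch below assumes the matrix is a list of equal-length rows, and the helper name `flatten_rows` is hypothetical. The resulting one-dimensional list can then be compressed exactly like any other matrix to be processed.

```python
def flatten_rows(matrix_2d):
    """Dimension reduction: read a 2-D matrix row by row into a 1-D list."""
    return [x for row in matrix_2d for x in row]


image = [[0, 1, 0],
         [0, 0, 2]]
flat = flatten_rows(image)                          # [0, 1, 0, 0, 0, 2]
distribution = [1 if x != 0 else 0 for x in flat]   # [0, 1, 0, 0, 0, 1]
```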
  • an element at each position in the matrix to be processed corresponds to an element at the same position in the distribution matrix, and whether the corresponding element in the matrix to be processed is non-zero can be determined through the elements of the distribution matrix.
  • The distribution matrix includes elements of a first type and elements of a second type.
  • The positions of the first-type elements in the distribution matrix are the same as the positions of the non-zero elements in the matrix to be processed.
  • The positions of the second-type elements in the distribution matrix are the same as the positions of the zero elements in the matrix to be processed.
  • The first-type and second-type elements are two kinds of elements with different representations and clearly distinguishable characteristics.
  • For example, the first-type element is the constant 1 and the second-type element is the constant 0.
  • Alternatively, the first-type elements are odd numbers and the second-type elements are even numbers.
  • FIG. 2 is a schematic flowchart of determining a compression matrix in a matrix processing method of the present application.
  • the matrix to be processed in FIG. 2 is [0,1,0,0,2,0,0,0,3,0,0].
  • When the processor processes the matrix to be processed to obtain a compression matrix, it determines that the non-zero elements of the matrix, arranged in sequence, are [1,2,3], and that the distribution matrix of the matrix is [0,1,0,0,1,0,0,0,1,0,0].
  • The processor then combines the determined number of non-zero elements [3], the sequentially arranged non-zero elements [1,2,3], and the distribution matrix [0,1,0,0,1,0,0,0,1,0,0], finally obtaining the compression matrix [3,1,2,3,0,1,0,0,1,0,0,0,1,0,0].
  • Specifically, step S102 of determining the distribution matrix of the to-be-processed matrix is as follows: the processor scans the elements of the to-be-processed matrix in sequence; when a scanned element is non-zero, it determines that the element corresponding to the scanned element in the distribution matrix has the value 1; when a scanned element is zero, it determines that the corresponding element in the distribution matrix has the value 0.
  • FIG. 3 is a schematic flowchart of determining a distribution matrix in the matrix processing method of the present application.
  • In FIG. 3, the elements of the to-be-processed matrix and of its distribution matrix correspond one to one: the element of the distribution matrix at the same position as a non-zero element of the to-be-processed matrix has the value 1, and the element at the same position as a zero element has the value 0.
  • In S103, the combined compression matrix may be arranged in the following order: the number of non-zero elements, the non-zero elements arranged in sequence, and then the distribution matrix. For example, in the example of FIG. 2, the resulting compression matrix is [3,1,2,3,0,1,0,0,1,0,0,0,1,0,0].
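  • Steps S102 (building the distribution matrix) and S103 (concatenating the count, the non-zero elements, and the distribution matrix) can be sketched in Python as follows. This is a minimal illustration assuming plain integer lists; the function names, and the `decompress` helper used only to check that no information is lost, are hypothetical and not part of the described method.

```python
from typing import List


def distribution_matrix(matrix: List[int]) -> List[int]:
    """Step S102: scan in order; 1 for a non-zero element, 0 for a zero element."""
    return [1 if x != 0 else 0 for x in matrix]


def compress(matrix: List[int]) -> List[int]:
    """Step S103: [number of non-zeros] + [non-zeros in order] + [distribution matrix]."""
    nonzeros = [x for x in matrix if x != 0]
    return [len(nonzeros)] + nonzeros + distribution_matrix(matrix)


def decompress(compressed: List[int]) -> List[int]:
    """Illustrative inverse (not described in the text): refill each 1 with the next non-zero."""
    count = compressed[0]
    values = iter(compressed[1:1 + count])
    dist = compressed[1 + count:]
    return [next(values) if d == 1 else 0 for d in dist]


if __name__ == "__main__":
    m = [0, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]  # matrix to be processed in FIG. 2
    c = compress(m)
    print(c)  # [3, 1, 2, 3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    assert decompress(c) == m
```

  • Running it on the FIG. 2 matrix reproduces the compression matrix given in the text.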
  • FIG. 4 is a schematic structural diagram of a compression matrix in the matrix processing method of this application. It exemplarily shows one arrangement of the number of non-zero elements, the sequentially arranged non-zero elements, and the distribution matrix; other arrangements of the compression matrix also fall within the protection scope of this embodiment.
  • Each element of the distribution matrix occupies 1 bit, whereas each element of the matrix to be processed may occupy 8, 16, or 32 bits.
  • Therefore, although the distribution matrix has the same dimensions as the matrix to be processed, it requires less storage space. Compressing the matrix to be processed into a compression matrix thus saves storage space and improves the storage efficiency of the processor.
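  • To make the storage argument concrete, the sketch below packs a distribution matrix at 1 bit per element and compares dense and compressed sizes. It assumes, hypothetically, that the non-zero count is stored in a field of `count_bits` bits and that non-zero values keep the original element width `elem_bits`; the text does not fix these widths or a bit order.

```python
from typing import List, Tuple


def pack_bits(dist: List[int]) -> bytes:
    """Pack a 0/1 distribution matrix at 1 bit per element, most significant bit first."""
    out = bytearray((len(dist) + 7) // 8)
    for i, bit in enumerate(dist):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def storage_bits(matrix: List[int], elem_bits: int, count_bits: int) -> Tuple[int, int]:
    """Return (dense size, compressed size) in bits.

    Assumes every element of the matrix to be processed occupies elem_bits bits and
    the non-zero count field occupies count_bits bits (a hypothetical layout choice).
    """
    nonzeros = sum(1 for x in matrix if x != 0)
    dense = len(matrix) * elem_bits
    # count field + non-zero values + 1 bit per distribution-matrix element
    compressed = count_bits + nonzeros * elem_bits + len(matrix)
    return dense, compressed


if __name__ == "__main__":
    m = [0, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
    print(pack_bits([1 if x else 0 for x in m]).hex())  # 4880
    print(storage_bits(m, elem_bits=8, count_bits=8))    # (88, 43)
    print(storage_bits(m, elem_bits=32, count_bits=32))  # (352, 139)
```

  • Under this assumed layout the saving depends on sparsity: with k non-zero elements out of n, the compressed form is smaller only while count_bits + k * elem_bits + n < n * elem_bits.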
  • FIG. 5 is a schematic flow chart of determining a compression matrix in a matrix processing method of the present application.
  • FIG. 5 illustrates the application of the matrix processing method shown in FIG. 1 to a multi-dimensional matrix by taking the matrix to be processed as a multi-dimensional matrix as an example.
  • the matrix to be processed as shown in FIG. 5 is [0,4,0; 0,0,0; 0,0,5], and the dimensions are three rows and three columns.
  • The processor may scan the multi-dimensional matrix to be processed into a one-dimensional matrix and then process the one-dimensional matrix.
  • For example, the matrix to be processed [0,4,0; 0,0,0; 0,0,5] is scanned into the one-dimensional matrix [0,4,0,0,0,0,0,0,5].
  • The dimension information of the matrix to be processed can also be appended to the scanned one-dimensional matrix, for example [0,4,0,0,0,0,0,0,5,3,3], where the last two elements [3,3] indicate that the multi-dimensional matrix has three rows and three columns.
  • From the one-dimensional matrix obtained by scanning, the processor determines that the non-zero elements of the to-be-processed matrix, arranged in sequence, are [4,5], and that the distribution matrix is [0,1,0,0,0,0,0,0,1,3,3], whose last two elements also represent the dimensions of the matrix to be processed. Alternatively, when the processor does not need the dimensions of the matrix to be processed during calculation, or when they can be determined from other parameters, the dimensions need not be represented in the distribution matrix.
  • In that case, the number of non-zero elements [2], the sequentially arranged non-zero elements [4,5], and the distribution matrix [0,1,0,0,0,0,0,0,1] are combined to obtain the compression matrix [2,4,5,0,1,0,0,0,0,0,0,1] of the matrix to be processed.
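  • A sketch of the multi-dimensional case, assuming a 2-D matrix given as a list of rows: it reduces the matrix row by row and then compresses the result. The optional `keep_dims` flag (a hypothetical name) appends the row and column counts to the distribution matrix, as in the [..,3,3] variant described above.

```python
from typing import List


def flatten_rows(matrix: List[List[int]]) -> List[int]:
    """Dimension reduction: read the elements row by row."""
    return [x for row in matrix for x in row]


def compress_2d(matrix: List[List[int]], keep_dims: bool = False) -> List[int]:
    """Compress a 2-D matrix via its row-major 1-D form.

    keep_dims=True appends (rows, cols) to the distribution matrix; leave it False
    when the dimensions are known from other parameters.
    """
    flat = flatten_rows(matrix)
    nonzeros = [x for x in flat if x != 0]
    dist = [1 if x != 0 else 0 for x in flat]
    if keep_dims:
        dist += [len(matrix), len(matrix[0])]
    return [len(nonzeros)] + nonzeros + dist


if __name__ == "__main__":
    m = [[0, 4, 0], [0, 0, 0], [0, 0, 5]]  # matrix to be processed in FIG. 5
    print(flatten_rows(m))                 # [0, 4, 0, 0, 0, 0, 0, 0, 5]
    print(compress_2d(m))                  # [2, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 1]
    print(compress_2d(m, keep_dims=True))  # [2, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 1, 3, 3]
```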
  • the method of processing the multi-dimensional matrix in this embodiment and obtaining the compression matrix of the multi-dimensional matrix is only an example.
  • Alternatively, the compression matrix may be obtained by adding new rows or columns to the distribution matrix of the matrix to be processed and using the elements of the newly added rows or columns to represent the number and the values of the non-zero elements of the matrix to be processed; that is, processing a multi-dimensional matrix to be processed yields a multi-dimensional compression matrix.
  • The compression matrix can be expressed as [2,4,5,0,1,0,0,0,0,0,0,1]. When a single newly added row or column has too few elements to hold this information, multiple rows or columns may be added; when it has more elements than needed, zero elements may be padded to align the resulting multi-dimensional compression matrix.
  • Alternatively, the to-be-processed matrix [0,4,0; 0,0,0; 0,0,5] can first be reduced to the one-dimensional matrix [0,4,0,0,0,0,0,0,5], and the matrix processing shown in FIG. 1 is then performed on the reduced one-dimensional matrix.
  • In summary, the non-zero elements of the matrix to be processed and a distribution matrix representing their positions are determined, and the number of non-zero elements, the value of each non-zero element in the to-be-processed matrix, and the distribution matrix are combined to obtain a compression matrix of the to-be-processed matrix.
  • The compression matrix can be used instead of the sparse matrix to perform operations and obtain the operation result of the sparse matrix, improving the processor's operational efficiency for sparse matrices.
  • This further improves the processing efficiency of the matrix processing method.
  • Further, the matrix to be processed includes a first matrix to be processed and a second matrix to be processed, and the first matrix to be processed has the same number of elements as the second matrix to be processed.
  • Correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix.
  • The matrix processing method shown in FIG. 1 further includes: obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix.
  • The target value equals the sum, over all positions, of the product of the element at each position in the first to-be-processed matrix and the element at the same position in the second to-be-processed matrix.
  • The target value may be, for example, the result of a convolution operation on the first and second to-be-processed matrices. Performing the convolution directly on these two matrices requires accumulating the products of the elements at every position of the first to-be-processed matrix and the elements at the same positions of the second to-be-processed matrix. In this embodiment, the first distribution matrix and the non-zero elements in the first compression matrix, together with the second distribution matrix and the non-zero elements in the second compression matrix, are used for the convolution operation instead of the first and second to-be-processed matrices, and the resulting target value is the same as the result of convolving the first and second to-be-processed matrices.
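  • Writing a_i and b_i for the elements of the first and second matrices to be processed and d^(1)_i, d^(2)_i for the elements of the first and second distribution matrices (symbols introduced here only for illustration), the reason the target value can be obtained from non-zero elements alone is that every product involving a zero element vanishes:

```latex
T \;=\; \sum_{i=1}^{n} a_i\, b_i
  \;=\; \sum_{i \,:\, d^{(1)}_i = d^{(2)}_i = 1} a_i\, b_i ,
\qquad \text{since } a_i = 0 \text{ whenever } d^{(1)}_i = 0
\text{ and } b_i = 0 \text{ whenever } d^{(2)}_i = 0 .
```

  • The mask matrices select exactly the index set of the right-hand sum, so no zero element ever enters a multiplication.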
  • FIG. 6 is a schematic flowchart of an embodiment of a matrix processing method of the present application.
  • In FIG. 6, the first matrix to be processed is [1,0,2,0,3,4,0,5] and the second matrix to be processed is [0,2,0,0,1,0,0,-1]. Processing them as shown in FIG. 1 gives the first compression matrix [5,1,2,3,4,5,1,0,1,0,1,1,0,1], whose distribution matrix is [1,0,1,0,1,1,0,1], and the second compression matrix [3,2,1,-1,0,1,0,0,1,0,0,1], whose distribution matrix is [0,1,0,0,1,0,0,1].
  • Obtaining the target value specifically includes the following:
  • From the second distribution matrix [0,1,0,0,1,0,0,1], five first target elements 0,0,1,0,1 are obtained in sequence to form the first mask matrix [0,0,1,0,1]; the position of each first target element in the second distribution matrix is the same as the position of an element with value 1 in the first distribution matrix.
  • The first mask matrix is aligned with the non-zero elements [1,2,3,4,5] of the first matrix to be processed in the first compression matrix; whenever a first target element in the first mask matrix is 1, the corresponding non-zero element is taken as an element of the first simplified matrix, which gives the first simplified matrix [3,5]. Similarly, sampling the first distribution matrix at the positions of the 1-valued elements of the second distribution matrix gives the second mask matrix [0,1,1], and applying it to the non-zero elements [2,1,-1] of the second matrix to be processed gives the second simplified matrix [1,-1].
  • The first simplified matrix and the second simplified matrix are then used for the convolution operation instead of the first and second to-be-processed matrices. Specifically, accumulating the products of the elements at the same positions in the two simplified matrices, 3 * 1 + 5 * (-1), gives the target value -2 as the result of the convolution of the first and second matrices to be processed.
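  • The mask-and-simplify procedure of FIG. 6 can be sketched as below. It is an illustrative software model that assumes the compression-matrix layout [count, non-zero elements, distribution matrix] of FIG. 2; helper names such as `masks`, `simplify`, and `target_value` are hypothetical.

```python
from typing import List, Tuple


def compress(matrix: List[int]) -> List[int]:
    nonzeros = [x for x in matrix if x != 0]
    return [len(nonzeros)] + nonzeros + [1 if x != 0 else 0 for x in matrix]


def split(c: List[int]) -> Tuple[List[int], List[int]]:
    """Split a compression matrix into (non-zero elements, distribution matrix)."""
    n = c[0]
    return c[1:1 + n], c[1 + n:]


def masks(dist1: List[int], dist2: List[int]) -> Tuple[List[int], List[int]]:
    """First mask: elements of dist2 at the positions where dist1 is 1.
    Second mask: elements of dist1 at the positions where dist2 is 1."""
    mask1 = [d2 for d1, d2 in zip(dist1, dist2) if d1 == 1]
    mask2 = [d1 for d1, d2 in zip(dist1, dist2) if d2 == 1]
    return mask1, mask2


def simplify(nonzeros: List[int], mask: List[int]) -> List[int]:
    """Keep the non-zero elements whose mask entry is 1."""
    return [v for v, m in zip(nonzeros, mask) if m == 1]


def target_value(c1: List[int], c2: List[int]) -> int:
    vals1, dist1 = split(c1)
    vals2, dist2 = split(c2)
    mask1, mask2 = masks(dist1, dist2)
    s1, s2 = simplify(vals1, mask1), simplify(vals2, mask2)
    # Both simplified matrices list, in increasing position order, exactly the
    # positions where both inputs are non-zero, so they are already aligned.
    return sum(a * b for a, b in zip(s1, s2))


if __name__ == "__main__":
    a = [1, 0, 2, 0, 3, 4, 0, 5]   # first matrix to be processed (FIG. 6)
    b = [0, 2, 0, 0, 1, 0, 0, -1]  # second matrix to be processed (FIG. 6)
    c1, c2 = compress(a), compress(b)
    print(masks(split(c1)[1], split(c2)[1]))  # ([0, 0, 1, 0, 1], [0, 1, 1])
    print(target_value(c1, c2))               # -2
    assert target_value(c1, c2) == sum(x * y for x, y in zip(a, b))
```

  • Because each simplified matrix keeps only the positions where both inputs are non-zero, in the same order, the two can be multiplied element by element without inserting zeros.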
  • The target value equals the accumulated product of the element at each position in the first to-be-processed matrix and the element at the same position in the second to-be-processed matrix; that is, the result of the convolution operation on the first and second to-be-processed matrices equals the target value.
  • Therefore, the first compression matrix and the second compression matrix can be used for the convolution operation instead of the first and second to-be-processed matrices, and the resulting target value serves as the convolution result of the first and second to-be-processed matrices.
  • The first mask matrix and the second mask matrix are determined from the first distribution matrix and the second distribution matrix, and the first simplified matrix and the second simplified matrix are then determined from the first and second mask matrices; the target value is obtained by accumulating the products of the aligned elements of the two simplified matrices. Consequently, no zero elements need to be inserted for element alignment during the convolution operation, and only the elements of the first and second simplified matrices take part, so every operation performed is effective.
  • In this way, the effective elements of the first and second compression matrices are aligned, and invalid operations on zero elements are avoided during alignment, which further improves efficiency over existing matrix processing methods.
  • FIG. 7 is a schematic flowchart of determining a mask matrix by using a distribution matrix in a matrix processing method of the present application.
  • the present application further provides a logic circuit, which is configured to obtain the first mask matrix and the second mask matrix through the first distribution matrix and the second distribution matrix in the foregoing embodiment.
  • The logic circuit takes as input, in sequence, the element at each position in the first distribution matrix and the element at the same position in the second distribution matrix, and outputs, in sequence, first target elements and second target elements, which form the first mask matrix and the second mask matrix, respectively.
  • FIG. 8 is a schematic structural diagram of an embodiment of a logic circuit of the present application. The logic circuit shown in FIG. 8 includes a first switching logic and a second switching logic, where:
  • A first input of the first switching logic sequentially receives the element at each position in the second distribution matrix; a second input of the first switching logic sequentially receives the element of the first distribution matrix at the same position as the element just received from the second distribution matrix; and the output of the first switching logic outputs first target elements, which form the first mask matrix.
  • For example, the first input of the first switching logic receives the first element [0] of the second distribution matrix, and the second input receives the first element [1] of the first distribution matrix. Because the second input receives [1], the first input and the output of the first switching logic are connected, and the element [0] received at the first input is output as a first target element of the first mask matrix.
  • Next, the first input of the first switching logic receives the second element [1] of the second distribution matrix, and the second input receives the second element [0] of the first distribution matrix. Because the second input receives [0], the first input and the output of the first switching logic are disconnected, and nothing is output.
  • Next, the first input of the first switching logic receives the third element [0] of the second distribution matrix, and the second input receives the third element [1] of the first distribution matrix. Because the second input receives [1], the first input and the output are connected, and the element [0] received at the first input is output as a first target element of the first mask matrix.
  • The remaining positions are processed in the same way, until the second input receives the last element [1] of the first distribution matrix and the output outputs [1].
  • All the first target elements output by the first switching logic, arranged in order, form the first mask matrix [0,0,1,0,1].
  • A first input of the second switching logic sequentially receives the element at each position in the first distribution matrix; a second input of the second switching logic sequentially receives the element of the second distribution matrix at the same position as the element just received from the first distribution matrix; and the output of the second switching logic outputs second target elements, which form the second mask matrix.
  • For example, the first input of the second switching logic shown in FIG. 8 receives the first element [1] of the first distribution matrix, and the second input receives the first element [0] of the second distribution matrix. Because the second input receives [0], the first input and the output of the second switching logic are disconnected. Next, the first input of the second switching logic receives the second element [0] of the first distribution matrix,
  • and the second input receives the second element [1] of the second distribution matrix. Because the second input receives [1], the first input and the output of the second switching logic are connected, and the element [0] received at the first input is output as a second target element of the second mask matrix. Continuing in the same way over the remaining positions, the second target elements form the second mask matrix [0,1,1].
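  • A behavioural Python model of the FIG. 8 circuit, processing one pair of same-position elements per clock, may help check the walkthrough above. It ignores switch timing (the concern addressed by FIG. 9 and FIG. 10) and models a disconnected switch as producing no output (`None`); the function names are hypothetical, and this is not a gate-level description.

```python
from typing import List, Optional, Tuple


def switching_logic(first_input: int, second_input: int) -> Optional[int]:
    """One clock: the first input is connected to the output only when the
    second input is 1; otherwise nothing is output (None)."""
    return first_input if second_input == 1 else None


def fig8_circuit(dist1: List[int], dist2: List[int]) -> Tuple[List[int], List[int]]:
    mask1: List[int] = []
    mask2: List[int] = []
    for d1, d2 in zip(dist1, dist2):    # one pair of same-position elements per clock
        out1 = switching_logic(d2, d1)  # first switching logic: inputs (dist2, dist1)
        out2 = switching_logic(d1, d2)  # second switching logic: inputs (dist1, dist2)
        if out1 is not None:
            mask1.append(out1)
        if out2 is not None:
            mask2.append(out2)
    return mask1, mask2


if __name__ == "__main__":
    dist1 = [1, 0, 1, 0, 1, 1, 0, 1]  # first distribution matrix (FIG. 6)
    dist2 = [0, 1, 0, 0, 1, 0, 0, 1]  # second distribution matrix (FIG. 6)
    print(fig8_circuit(dist1, dist2))  # ([0, 0, 1, 0, 1], [0, 1, 1])
```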
  • Multiple logic circuits of this embodiment may also be arranged in parallel in the processor, each separately receiving and processing elements of the first distribution matrix and the second distribution matrix.
  • Each logic circuit receives elements of the first distribution matrix and the second distribution matrix at the same time, and outputs a first target element and a second target element, respectively, according to the received elements.
  • The first target elements output by all the logic circuits, arranged in order, form the first mask matrix,
  • and the second target elements output by all the logic circuits, arranged in order, form the second mask matrix.
  • The first switching logic and the second switching logic may receive their elements of the first and second distribution matrices in the same processor clock cycle.
  • The foregoing method of obtaining the first mask matrix and the second mask matrix from the first distribution matrix and the second distribution matrix can thus be implemented with relatively simple switching logic.
  • The switching logic can receive the elements of the distribution matrices and output the elements of the mask matrices within one processor clock cycle, which simplifies the logic circuit and further improves the efficiency of matrix processing.
  • FIG. 9 is a schematic structural diagram of an embodiment of a logic circuit of the present application.
  • the logic circuit provided in this embodiment may be used to replace the logic circuit shown in FIG. 8.
  • the logic circuit shown in FIG. 9 is based on the logic circuit shown in FIG. 8, and further includes: AND gate logic.
  • A first input of the AND gate logic sequentially receives the element at each position in the first distribution matrix, and a second input of the AND gate logic sequentially receives the element of the second distribution matrix at the same position as the element just received from the first distribution matrix.
  • The output of the AND gate logic sends the AND of its two inputs to the second input of the first switching logic and the second input of the second switching logic.
  • The switch of each of the first and second switching logic needs to close when the element received at its second input is [1]. If, within the delay before the switch closes, the element at the first input is lost or the input elements are refreshed out of synchronization, the elements output after the switch closes may be out of order. Therefore, this embodiment provides the AND gate logic, whose first and second inputs receive the element at each position in the first distribution matrix and the element at the same position in the second distribution matrix, respectively;
  • the two elements are ANDed, and the result is output to the second input of the first switching logic and the second input of the second switching logic.
  • The AND gate logic thus acts as a buffer that gives the first and second switching logic time to close their switches; after the switches close, it outputs the AND result to the first and second switching logic, ensuring that their second inputs receive the correct elements.
  • The first input, second input, and output of the first and second switching logic in the embodiment shown in FIG. 9 work on the same principle as in the embodiment of FIG. 8 and are not described again.
  • FIG. 10 is a schematic structural diagram of an embodiment of a logic circuit of the present application.
  • the logic circuit provided in this embodiment may be used to replace the logic circuit shown in FIG. 8.
  • the logic circuit shown in FIG. 10 is based on the logic circuit shown in FIG. 8, and further includes a first latch and a second latch.
  • The input of the first latch sequentially receives the element at each position in the second distribution matrix, and the output of the first latch outputs the element received at its input to the first switching logic after a first preset delay.
  • The input of the second latch sequentially receives the element at each position in the first distribution matrix, and the output of the second latch outputs the element received at its input to the second switching logic after a second preset delay.
  • the first latch and the second latch both function as a buffer.
  • After receiving an element of the second distribution matrix, the first latch gives the first switching logic time to close its switch; after the switch of the first switching logic closes, the first latch outputs the received element to the first switching logic through its output.
  • After receiving an element of the first distribution matrix, the second latch gives the second switching logic time to close its switch; after the switch of the second switching logic closes, the second latch outputs the received element to the second switching logic through its output.
  • For example, the first preset delay may be set to the switch-on delay of the first switching logic, and the second preset delay to the switch-on delay of the second switching logic.
  • The switch-on delay of the first switching logic is the same as that of the second switching logic.
  • the principle of the first input terminal, the second input terminal, and the output terminal of the first switching logic and the second switching logic in the embodiment shown in FIG. 10 is the same as that of the embodiment in FIG. 8 and will not be described again.
  • The matrix processing method of each of the foregoing embodiments may be applied to a processor with a systolic array (pulsation array) architecture, so that convolution operations are performed on matrices without changing the existing systolic array architecture.
  • FIG. 11 is a schematic diagram of the processing structure in which the matrix processing method of the present application is applied to a systolic array processor.
  • The first storage unit and the second storage unit each store four matrices.
  • The processor preloads the four matrices to be calculated in the first storage unit into calculation units 1 to 4, respectively.
  • The matrices in the second storage unit are loaded into calculation unit 1 in sequence; after a matrix has been calculated with the preloaded matrix, it is transferred to calculation unit 2. Calculation unit 2 receives, in sequence, the matrices already calculated in calculation unit 1, calculates each with its own preloaded matrix, and then transfers it to calculation unit 3; and so on.
  • An alignment unit may be added before each calculation unit in the systolic array processor.
  • The alignment unit aligns the compression matrices obtained by the matrix processing method described above, so that the calculation unit computes only with the first simplified matrix and the second simplified matrix and no invalid zero-valued calculations occur in the calculation unit.
  • The alignment unit and the calculation unit may be implemented by a software program in the processor; alternatively, the alignment unit may be implemented by a logic circuit in the processor, and the logic circuit used by each alignment unit may be any of the logic circuits shown in FIG. 8 to FIG. 10.
  • FIG. 12A to FIG. 12E are schematic diagrams of the processing flow of applying the matrix processing method of the present application to a systolic array processor.
  • The processing structure of the systolic array processor shown in FIG. 11 is described with the processing flow of FIG. 12.
  • The processing flow in FIG. 12 may be the processor performing convolution or fully connected calculation on matrices. For example, when the processor performs convolution or fully connected calculation in a deep learning network, the parameter matrix and the data matrix of the network need to be convolved.
  • The processor first processes the parameter matrix to be calculated by the method shown in FIG. 1 to obtain compression matrix 1, compression matrix 2, compression matrix 3, and compression matrix 4, and stores them in the first storage unit of the processor;
  • it also processes the data matrix to be calculated by the method shown in FIG. 1 to obtain compression matrix A, compression matrix B, compression matrix C, and compression matrix D, and stores them in the second storage unit of the processor.
  • the first storage unit and the second storage unit may be different storage units in the processor or different storage locations of the same storage unit, which are not limited herein.
  • the processor preloads compression matrix 1, compression matrix 2, compression matrix 3, and compression matrix 4 in the first storage unit into alignment unit 1, alignment unit 2, alignment unit 3, and alignment unit 4, respectively.
  • The matrix preloaded into an alignment unit may consist of the sequentially arranged non-zero elements and the distribution matrix of the compression matrix.
  • The processor loads compression matrix A from the second storage unit into alignment unit 1, so that alignment unit 1 determines, from the distribution matrix of compression matrix 1 and the distribution matrix of compression matrix A, simplified matrix 1 corresponding to compression matrix 1 and simplified matrix A corresponding to compression matrix A.
  • The processor outputs simplified matrix 1 and simplified matrix A obtained in the step of FIG. 12C to calculation unit 1, which accumulates the products of the elements at the same positions in simplified matrix 1 and simplified matrix A.
  • The processor also loads compression matrix B from the second storage unit into alignment unit 1, so that alignment unit 1 determines, from the distribution matrix of compression matrix 1 and the distribution matrix of compression matrix B, simplified matrix 1 corresponding to compression matrix 1 and simplified matrix B corresponding to compression matrix B.
  • The processor also loads compression matrix A, whose alignment in calculation unit 1 is complete, into alignment unit 2, so that alignment unit 2 determines, from the distribution matrix of compression matrix 2 and the distribution matrix of compression matrix A, simplified matrix 2 corresponding to compression matrix 2 and simplified matrix A corresponding to compression matrix A.
  • The processor outputs simplified matrix 1 and simplified matrix B obtained in the step of FIG. 12D to calculation unit 1, which accumulates the products of the elements at the same positions in simplified matrix 1 and simplified matrix B.
  • The processor outputs simplified matrix 2 and simplified matrix A obtained in the step of FIG. 12D to calculation unit 2, which accumulates the products of the elements at the same positions in simplified matrix 2 and simplified matrix A.
  • The processor also loads compression matrix C from the second storage unit into alignment unit 1, so that alignment unit 1 determines, from the distribution matrix of compression matrix 1 and the distribution matrix of compression matrix C, simplified matrix 1 corresponding to compression matrix 1 and simplified matrix C corresponding to compression matrix C.
  • The processor also loads compression matrix B, whose alignment in calculation unit 1 is complete, into alignment unit 2, so that alignment unit 2 determines, from the distribution matrix of compression matrix 2 and the distribution matrix of compression matrix B, simplified matrix 2 corresponding to compression matrix 2 and simplified matrix B corresponding to compression matrix B.
  • The processor also loads compression matrix A, whose alignment in calculation unit 2 is complete, into alignment unit 3, so that alignment unit 3 determines, from the distribution matrix of compression matrix 3 and the distribution matrix of compression matrix A, simplified matrix 3 corresponding to compression matrix 3 and simplified matrix A corresponding to compression matrix A.
  • By analogy, alignment unit 1 then loads the next compression matrix D to be processed from the second storage unit, and each alignment unit, after completing its alignment operation, passes the compression matrix on to the next alignment unit. Each alignment unit transmits the simplified matrices obtained from its two loaded compression matrices to the corresponding calculation unit for calculation, and the calculation unit outputs the calculation result.
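  • The dataflow through the alignment and calculation units can be simulated in software. The sketch below is a simplification under stated assumptions: it merges each alignment step with the following accumulation into a single step, indexes units from 0 in code (unit 0 is alignment/calculation unit 1 in the text), and uses made-up parameter and data matrices. It only illustrates the schedule in which data matrix A reaches unit 1, then unit 2, and so on; it is not a model of the actual systolic (pulsation) array hardware.

```python
from typing import Dict, List, Tuple


def compress(matrix: List[int]) -> List[int]:
    nonzeros = [x for x in matrix if x != 0]
    return [len(nonzeros)] + nonzeros + [1 if x != 0 else 0 for x in matrix]


def target_value(c1: List[int], c2: List[int]) -> int:
    """Align two compression matrices (masks -> simplified matrices) and accumulate products."""
    n1, n2 = c1[0], c2[0]
    vals1, dist1 = c1[1:1 + n1], c1[1 + n1:]
    vals2, dist2 = c2[1:1 + n2], c2[1 + n2:]
    mask1 = [d2 for d1, d2 in zip(dist1, dist2) if d1 == 1]
    mask2 = [d1 for d1, d2 in zip(dist1, dist2) if d2 == 1]
    s1 = [v for v, m in zip(vals1, mask1) if m == 1]
    s2 = [v for v, m in zip(vals2, mask2) if m == 1]
    return sum(a * b for a, b in zip(s1, s2))


def systolic_run(params: List[List[int]], data: List[List[int]]):
    """Unit j holds params[j]; data matrix i reaches unit j at step i + j.

    Returns ({(unit, data_index): result}, [(step, unit, data_index), ...]).
    """
    results: Dict[Tuple[int, int], int] = {}
    schedule: List[Tuple[int, int, int]] = []
    for step in range(len(data) + len(params) - 1):
        for unit, p in enumerate(params):
            i = step - unit
            if 0 <= i < len(data):
                results[(unit, i)] = target_value(p, data[i])
                schedule.append((step, unit, i))
    return results, schedule


if __name__ == "__main__":
    # Hypothetical parameter matrices (units 1-4) and data matrices (A-D).
    raw_params = [[1, 0, 0, 2], [0, 3, 0, 0], [0, 0, 4, 0], [5, 0, 0, 6]]
    raw_data = [[0, 1, 0, 1], [2, 0, 0, 0], [0, 0, 3, 3], [1, 1, 0, 0]]
    results, schedule = systolic_run([compress(p) for p in raw_params],
                                     [compress(d) for d in raw_data])
    for step in range(3):
        # (unit number as in the text, data matrix name) for each active pair
        print(step, [(u + 1, "ABCD"[i]) for s, u, i in schedule if s == step])
    # 0 [(1, 'A')]
    # 1 [(1, 'B'), (2, 'A')]
    # 2 [(1, 'C'), (2, 'B'), (3, 'A')]
```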
  • the method and principle of determining the simplified matrix by compressing the matrix reference may be made to the foregoing embodiments of the present application, and details are not described herein again.
  • the matrix processing method provided in this application when the processor performs convolution or full-link calculation in a deep learning network, after the parameter matrix and data matrix to be calculated are compressed, The compressed data matrix and parameters are calculated by the alignment unit and the calculation unit in the processor. Therefore, the calculation unit avoids invalid calculations containing zero elements during calculation, thereby improving the storage efficiency and operation efficiency of the processor.
  • the matrix processing method provided in the present application is compatible with existing processors that use a systolic array architecture, which facilitates the implementation and adoption of the method.
  • the matrix processing method provided in the present application may also be applied to a processor to perform a convolution operation of an image.
  • the image that the processor can process is a digital image, and the digital image is represented by an image matrix composed of gray values of each pixel of the image.
  • image convolution performed by the processor means sliding the convolution kernel (or convolution template) over the image matrix and, during the sliding, multiplying the elements at the corresponding positions of the image matrix by the elements of the convolution kernel and summing the products; this process, which finally yields the elements of an output matrix, is called image convolution.
  • FIG. 13 is a schematic diagram of a processing structure of applying the matrix processing method of the present application to an image convolution operation.
  • the matrix to be processed for the convolution operation is the input image matrix in the figure, and the dimension of the matrix is 6 rows and 6 columns. It is assumed that the dimension of the convolution kernel selected for the convolution operation is 3 rows and 3 columns.
  • the processor sequentially aligns the convolution kernel with each 3×3 intermediate matrix in the input image matrix and accumulates the products of the aligned elements of the convolution kernel and the intermediate matrix.
  • the matrix processing method in the embodiment shown in FIG. 5 of the present application may be used to compress the intermediate matrix and the convolution kernel, yielding the compressed matrix of the convolution kernel and the compressed matrix of the intermediate matrix; the two compressed matrices are then operated on by the matrix processing method of the embodiment shown in FIG. 6 to obtain the target value of the convolution kernel and the intermediate matrix.
  • the convolution kernel shown in FIG. 13 is [4,0,0; 0,0,0; 0,0,-4].
  • the 9 elements of the convolution kernel are first aligned with the 9 elements in rows 1 to 3 and columns 1 to 3 of the input image matrix, and the intermediate matrix to be calculated is [0,0,0; 0,1,1; 0,0,2].
  • the compressed matrix obtained by processing the convolution kernel is [2,4,-4,1,0,0,0,0,0,0,0,1].
  • the compressed matrix obtained by processing the intermediate matrix is [3,1,1,2,0,0,0,0,1,1,0,0,1].
  • from the distribution matrix of the convolution kernel [1,0,0,0,0,0,0,0,1] and the distribution matrix of the intermediate matrix [0,0,0,0,1,1,0,0,1], the mask matrix [0,1] of the convolution kernel and the mask matrix [0,0,1] of the intermediate matrix are determined.
  • the simplified matrix of the convolution kernel is determined as [-4] according to its mask matrix, the simplified matrix of the intermediate matrix is determined as [2] according to its mask matrix, and the target value (-4)×2 = -8 is obtained from these two simplified matrices and used as the element in row 2, column 2 of the output image matrix.
  • the convolution kernel is shifted to the right by one element and aligned with the 9 elements in the first to third rows and the second to fourth columns in the input image matrix.
  • the aligned intermediate matrix is [0,0,0; 1,1,0; 2,0,0]; the sum of the products of the corresponding elements of the convolution kernel and this intermediate matrix is again calculated with the above matrix processing method, and the result is used as the element in row 2, column 3 of the output image matrix.
  • in this way, all the elements in rows 2 to 5 and columns 2 to 5 of the output image matrix are finally obtained, and every calculation between an intermediate matrix and the convolution kernel can use the matrix processing method of the above example.
  • owing to the boundary problem of image convolution, the elements in the outermost row 1, row 6, column 1, and column 6 of the output image matrix may be handled by, for example, ignoring the boundary elements or retaining the original boundary elements. Since this does not involve the matrix processing itself, this embodiment does not specifically limit it.
  • the matrix processing method provided by the present application can thus be applied to a processor to perform a convolution operation on an image.
  • in the multiply-accumulate operation between the convolution kernel and the corresponding intermediate matrix of the image matrix, the compressed matrix of the convolution kernel and the compressed matrix of the intermediate matrix are operated on to obtain the target value.
  • because no zero elements need to be added for element alignment when operating on the compressed matrices, and only the elements of the first simplified matrix and the second simplified matrix take part in the computation, invalid calculations involving zero elements are avoided. Therefore, the image convolution operation can be accelerated, further improving the processor's efficiency in performing image convolution.
  • FIG. 14 is a schematic structural diagram of an embodiment of a processing device for a matrix of the present application.
  • the matrix processing apparatus provided in this embodiment includes a first determination module 1401, a second determination module 1402, and a processing module 1403.
  • the first determining module 1401 is configured to determine the number of non-zero elements in the matrix to be processed, and the matrix to be processed is a one-dimensional matrix;
  • the second determining module 1402 is configured to determine the distribution matrix of the matrix to be processed, where the distribution matrix represents the positions of the non-zero elements in the matrix to be processed;
  • the processing module 1403 is configured to combine the number of non-zero elements, the values of the non-zero elements of the to-be-processed matrix arranged in order, and the distribution matrix to obtain a compressed matrix of the to-be-processed matrix.
  • the apparatus for processing a matrix provided in this embodiment may be used to execute a method for processing a matrix as shown in FIG. 1.
  • the specific implementation manner is the same as the principle, and details are not described herein again.
  • the distribution matrix is a one-dimensional matrix, and the element at each position of the matrix to be processed corresponds one-to-one to the element at the same position of the distribution matrix; the second determining module 1402 is specifically configured to scan the elements of the matrix to be processed in order; when a scanned element is non-zero, determine that the value of the corresponding element of the distribution matrix is 1; and when a scanned element is zero, determine that the value of the corresponding element of the distribution matrix is 0.
  • the number of elements in the matrix to be processed is N
  • the number of non-zero elements in the matrix to be processed is M
  • the number of elements in the distribution matrix is N
  • the number of elements with a value of 1 in the distribution matrix is M
  • the number of elements in the compression matrix is M + N + 1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N.
  • the apparatus for processing a matrix provided in this embodiment may be used to execute the method for processing a matrix in the foregoing embodiment.
  • the specific implementation manner is the same as the principle, and details are not described herein again.
  • FIG. 15 is a schematic structural diagram of an embodiment of a processing apparatus for a matrix of the present application.
  • the matrix processing device provided in this embodiment further includes a calculation module 1501 on the basis of FIG. 14.
  • the to-be-processed matrix in the foregoing embodiment includes a first to-be-processed matrix and a second to-be-processed matrix.
  • the number of elements in the first to-be-processed matrix is the same as the number of elements in the second to-be-processed matrix; correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix.
  • the calculation module 1501 is configured to obtain a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix, where the target value equals the result of accumulating the products of the element at each position of the first to-be-processed matrix and the element at the same position of the second to-be-processed matrix.
  • the calculation module 1501 is specifically configured to sequentially obtain each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element with a value of 1 in the first distribution matrix;
  • when an obtained first target element has a value of 1, use the corresponding first effective element among the non-zero elements of the first to-be-processed matrix as an element of the first simplified matrix, where the order of the first effective element among the non-zero elements of the first to-be-processed matrix is the same as the order of the obtained first target element in the first mask matrix;
  • sequentially obtain each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element with a value of 1 in the second distribution matrix;
  • when an obtained second target element has a value of 1, use the corresponding second effective element among the non-zero elements of the second to-be-processed matrix as an element of the second simplified matrix, where the order of the second effective element among the non-zero elements of the second to-be-processed matrix is the same as the order of the obtained second target element in the second mask matrix;
  • the product of the elements at each position in the first simplified matrix and the elements at the same position in the second simplified matrix is accumulated to obtain a target value.
  • the apparatus for processing a matrix provided in this embodiment may be used to execute a method for processing a matrix as shown in FIG. 6.
  • the specific implementation manner is the same as the principle, and details are not described again.
  • the division of the modules in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner.
  • the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or software functional modules.
  • when the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical discs.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be from a website site, computer, server, or data center Transmission by wire (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (for example, infrared, wireless, microwave, etc.) to another website site, computer, server, or data center.
  • the computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores program code, and when the program code is executed, a method for processing a matrix as in any of the foregoing embodiments is performed.
  • the present application also provides a computer program product.
  • when the program code included in the computer program product is executed by a processor, the matrix processing method of any of the foregoing embodiments is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Electromagnetism (AREA)
  • Computer Hardware Design (AREA)
  • Complex Calculations (AREA)

Abstract

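This application provides a matrix processing method, apparatus, and logic circuit. The method includes: determining the number of non-zero elements in a to-be-processed matrix, where the to-be-processed matrix is a one-dimensional matrix; determining a distribution matrix of the to-be-processed matrix, where the distribution matrix indicates the positions of the non-zero elements in the to-be-processed matrix; and combining the number of non-zero elements, the values of the non-zero elements of the to-be-processed matrix arranged in order, and the distribution matrix to obtain a compressed matrix of the to-be-processed matrix. With the matrix processing method, apparatus, and logic circuit provided in this application, when a sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix can be used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix. This improves the efficiency of sparse-matrix operations and, in turn, the efficiency of the matrix processing method.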

Description

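Matrix processing method, apparatus, and logic circuit
Technical Field
This application relates to the field of data processing technologies, and in particular, to a matrix processing method, apparatus, and logic circuit.
Background
The matrix is a common tool in scientific computing and is widely used in engineering applications. A sparse matrix is a special case of a matrix that contains only a small number of non-zero elements. Because a sparse matrix contains a large number of elements whose value is 0, storing it in the conventional way leads to a large amount of unnecessary computation during matrix operations.
In the prior art, to improve the efficiency of sparse-matrix operations, a more efficient sparse-matrix processing scheme is often used to compress the sparse matrix. The current mainstream scheme is Compressed Row Storage (CSR): the compressed matrix obtained from CSR processing stores the non-zero elements of the sparse matrix using row offsets, element column indices, and element values. The value and column index of an element give the element and the column in which it lies, and the row offset gives the starting offset, within the value array, of the first element of a given row. As a result, when the sparse matrix is operated on, the compressed matrix can be used directly in place of the uncompressed sparse matrix, which reduces the invalid computations in a convolution where a 0 element of the matrix is multiplied by the element at the same position of the other sparse matrix to produce 0. However, when a convolution is performed on two CSR compressed matrices, the original matrices corresponding to the two CSR compressed matrices usually contain different numbers of non-zero elements, so it is not clear which non-zero elements of one CSR compressed matrix must be paired with which non-zero elements of the other. Therefore, before two CSR compressed matrices can be convolved, some of the 0 values of their original matrices must first be restored. Only after the non-zero elements of the two CSR compressed matrices have been padded with 0 values and aligned are two CSR compressed matrices of the same dimension obtained, so that the products of each pair of elements at the same position in the two equal-dimension matrices can be accumulated to obtain the convolution result of the two matrices.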
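To make the CSR layout described above concrete, the following minimal pure-Python sketch builds the three CSR arrays from a small dense matrix. The helper name `to_csr` and the 3×3 input are illustrative assumptions, not part of this application or of any library API.

```python
def to_csr(rows):
    """Build CSR arrays (values, column indices, row offsets) from a dense 2-D list."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)      # element value
                col_idx.append(j)     # column number of that element
        row_ptr.append(len(values))   # offset in `values` where the next row starts
    return values, col_idx, row_ptr


dense = [[0, 1, 0],
         [2, 0, 3],
         [0, 0, 0]]
print(to_csr(dense))  # ([1, 2, 3], [1, 0, 2], [0, 1, 3, 3])
```

When two CSR rows are multiplied element by element, their column-index lists generally differ in length, so the matching positions must first be found (or zeros restored) before the products can be accumulated; this is the alignment overhead the background refers to.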
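With the existing matrix processing method, the compressed sparse matrix still needs some 0 elements to be added for element alignment when a convolution is performed. Consequently, invalid operations on 0 elements are not fully avoided during the convolution of compressed sparse matrices, which makes the existing matrix processing method inefficient.
Summary
This application provides a matrix processing method, apparatus, and logic circuit. The non-zero elements of a to-be-processed matrix and a distribution matrix indicating the positions of those non-zero elements are determined, and the number of non-zero elements, the non-zero elements arranged in order, and the distribution matrix are combined into a compressed matrix. In this way, when a sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix can be used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix, which avoids invalid computations on zero elements and improves the efficiency of the matrix processing method.
A first aspect of this application provides a matrix processing method, including:
determining the number of non-zero elements in a to-be-processed matrix, where the to-be-processed matrix is a one-dimensional matrix;
determining a distribution matrix of the to-be-processed matrix, where the distribution matrix indicates the positions of the non-zero elements in the to-be-processed matrix; and
combining the number of non-zero elements, the values of the non-zero elements of the to-be-processed matrix arranged in order, and the distribution matrix, to obtain a compressed matrix of the to-be-processed matrix.
With the matrix processing method provided in this embodiment, the non-zero elements of the to-be-processed matrix and the distribution matrix indicating their positions are determined, and the number of non-zero elements, the non-zero elements arranged in order, and the distribution matrix are combined into a compressed matrix. When the to-be-processed matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix obtained by this method can be used in place of the to-be-processed matrix, improving the processor's storage efficiency and computational efficiency for the to-be-processed matrix.
In an embodiment of the first aspect of this application, the distribution matrix is a one-dimensional matrix, and the element at each position of the to-be-processed matrix corresponds one-to-one to the element at the same position of the distribution matrix. The determining a distribution matrix of the to-be-processed matrix includes:
scanning the elements of the to-be-processed matrix in order;
when a scanned element is a non-zero element, determining that the value of the element of the distribution matrix corresponding to the scanned element is 1; and
when a scanned element is zero, determining that the value of the element of the distribution matrix corresponding to the scanned element is 0.
In the matrix processing method provided in this embodiment, a distribution matrix that has the same dimension as the to-be-processed matrix, with the elements at the same positions in one-to-one correspondence, indicates whether each element of the to-be-processed matrix is zero. More specifically, by scanning the elements of the to-be-processed matrix, the constant 1 in the distribution matrix represents a non-zero element of the to-be-processed matrix and the constant 0 represents a zero element, so that during matrix processing the distribution of zero and non-zero elements can be determined from the simpler distribution matrix. In particular, when the elements of the to-be-processed matrix are large in data size, this embodiment identifies the distribution of zero and non-zero elements using the constants 0 and 1, each only 1 bit in size, so the large to-be-processed matrix does not need to be scanned when it is operated on, which saves the bandwidth used for reading data during matrix processing.
In an embodiment of the first aspect of this application, the number of elements in the to-be-processed matrix is N and the number of non-zero elements in the to-be-processed matrix is M; correspondingly, the number of elements in the distribution matrix is N, the number of elements whose value is 1 in the distribution matrix is M, and the number of elements in the compressed matrix is M+N+1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N.
The matrix processing method provided in this embodiment further specifies the number of elements in the compressed matrix, so that the compressed matrix of a one-dimensional matrix contains only M+N+1 elements: 1 element holding the number of non-zero elements, M non-zero elements, and N elements of the distribution matrix. As a result, when the compressed matrix of this embodiment is used in place of the to-be-processed matrix, the operation is fully equivalent to operating on the uncompressed to-be-processed matrix, which ensures that the result computed from the compressed matrix equals the result computed from the to-be-processed matrix.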
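Because the compressed matrix holds exactly M+N+1 entries, the original one-dimensional matrix can be rebuilt from it without loss. The following minimal Python sketch illustrates this, assuming the [count, non-zero values, distribution] order used in this application's examples; `decompress` is a hypothetical helper name, and the input is the first compressed matrix from the FIG. 6 example.

```python
def decompress(compressed):
    """Rebuild a 1-D matrix from [M, M non-zero values, N distribution bits]."""
    m = compressed[0]                       # number of non-zero elements
    values = iter(compressed[1:1 + m])      # non-zero values in scan order
    distribution = compressed[1 + m:]       # N = len(compressed) - M - 1 bits
    # Each 1 consumes the next non-zero value; each 0 restores a zero element.
    return [next(values) if bit == 1 else 0 for bit in distribution]


packed = [5, 1, 2, 3, 4, 5, 1, 0, 1, 0, 1, 1, 0, 1]
print(decompress(packed))  # [1, 0, 2, 0, 3, 4, 0, 5]
```

The length N is not stored explicitly; it follows from the total length, which is why the count field must come first in this particular arrangement.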
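In an embodiment of the first aspect of this application, the to-be-processed matrix includes a first to-be-processed matrix and a second to-be-processed matrix, and the number of elements in the first to-be-processed matrix is the same as the number of elements in the second to-be-processed matrix; correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix, and the method further includes:
obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix, where the target value equals the result of accumulating the products of the element at each position of the first to-be-processed matrix and the element at the same position of the second to-be-processed matrix.
In the matrix processing method provided in this embodiment, a first compressed matrix and a second compressed matrix are computed for the first and second to-be-processed matrices that are to be operated on, and the distribution matrices and non-zero elements in the first and second compressed matrices are used in place of the first and second to-be-processed matrices in the operation, yielding the same result as operating on the first and second to-be-processed matrices. This improves the processor's storage efficiency and computational efficiency for the first and second to-be-processed matrices.
In an embodiment of the first aspect of this application, the obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix includes:
obtaining, in order, each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element whose value is 1 in the first distribution matrix;
when an obtained first target element has the value 1, using a first effective element among the non-zero elements of the first to-be-processed matrix as an element of a first simplified matrix, where the order of the first effective element among the non-zero elements of the first to-be-processed matrix is the same as the order of the obtained first target element in the first mask matrix;
obtaining, in order, each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element whose value is 1 in the second distribution matrix;
when an obtained second target element has the value 1, using a second effective element among the non-zero elements of the second to-be-processed matrix as an element of a second simplified matrix, where the order of the second effective element among the non-zero elements of the second to-be-processed matrix is the same as the order of the obtained second target element in the second mask matrix; and
accumulating the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix, to obtain the target value.
In the matrix processing method provided in this embodiment, the first and second compressed matrices can be used in place of the first and second to-be-processed matrices for the convolution, and the target value is obtained as the convolution result of the first and second to-be-processed matrices. During the computation on the first and second compressed matrices, the first and second mask matrices are determined from the first and second distribution matrices, the first and second simplified matrices are then determined from the first and second mask matrices, and the target value is obtained by multiplying the aligned elements of the first and second simplified matrices and accumulating the products. Therefore, no zero elements need to be added for element alignment during the convolution; only the elements of the first and second simplified matrices take part, and every operation is effective. Thus, when the first and second compressed matrices replace the first and second to-be-processed matrices in the convolution, their effective elements can be aligned while invalid operations caused by zero elements are avoided during alignment, further improving efficiency over existing matrix processing methods.
In summary, in the matrix processing method provided in the first aspect of this application, the non-zero elements of the to-be-processed matrix and a distribution matrix indicating their positions are determined, and the number of non-zero elements, the non-zero elements arranged in order, and the distribution matrix are combined into a compressed matrix. When a sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix is used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix, which improves the efficiency of sparse-matrix operations and, in turn, the efficiency of the matrix processing method.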
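A second aspect of this application provides a logic circuit, configured to obtain a first mask matrix and a second mask matrix from a first distribution matrix and a second distribution matrix. The first distribution matrix indicates the positions of non-zero elements in a first to-be-processed matrix; the second distribution matrix indicates the positions of non-zero elements in a second to-be-processed matrix; the first mask matrix represents first target elements in the second distribution matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element whose value is 1 in the first distribution matrix; and the second mask matrix represents second target elements in the first distribution matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element whose value is 1 in the second distribution matrix.
The logic circuit includes a first switch logic and a second switch logic, where:
a first input of the first switch logic sequentially receives the element at each position of the second distribution matrix, a second input of the first switch logic sequentially receives the element of the first distribution matrix at the same position as the received element of the second distribution matrix, and an output of the first switch logic outputs the first target elements, which form the first mask matrix;
when the element received at the second input of the first switch logic has the value 1, the first switch logic outputs the element received at its first input from its output;
a first input of the second switch logic sequentially receives the element at each position of the first distribution matrix, a second input of the second switch logic sequentially receives the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and an output of the second switch logic outputs the second target elements, which form the second mask matrix; and
when the element received at the second input of the second switch logic has the value 1, the second switch logic outputs the element received at its first input from its output.
The logic circuit provided in this embodiment uses relatively simple switch logic to implement the method of the foregoing embodiment for obtaining the first and second mask matrices from the first and second distribution matrices. The switch logic can receive the distribution-matrix elements and output the mask-matrix elements within one processor clock cycle, which keeps the downstream array processor running smoothly.
In an embodiment of the second aspect of this application, the logic circuit further includes an AND gate logic.
A first input of the AND gate logic sequentially receives the element at each position of the first distribution matrix, a second input of the AND gate logic sequentially receives the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and an output of the AND gate logic outputs the AND of its first and second inputs to the second input of the first switch logic and the second input of the second switch logic.
In the logic circuit provided in this embodiment, the added AND gate logic acts as a buffer that gives the first and second switch logic time to close their switches; after the switches have closed, the AND gate logic outputs the AND result to the first and second switch logic through its output, ensuring that the second inputs of the first and second switch logic receive the correct elements.
In an embodiment of the second aspect of this application, the logic circuit further includes a first latch and a second latch.
An input of the first latch sequentially receives the element at each position of the second distribution matrix, and an output of the first latch outputs the element received at its input to the first switch logic after a first preset delay.
An input of the second latch sequentially receives the element at each position of the first distribution matrix, and an output of the second latch outputs the element received at its input to the second switch logic after a second preset delay.
In an embodiment of the second aspect of this application, the first preset delay is the switch turn-on delay of the first switch logic, and the second preset delay is the switch turn-on delay of the second switch logic.
In the logic circuit provided in this embodiment, the added first latch and second latch act as buffers. After receiving an element of the second distribution matrix, the first latch gives the first switch logic time to close its switch and outputs the received element to the first switch logic only after the switch has closed; after receiving an element of the first distribution matrix, the second latch gives the second switch logic time to close its switch and outputs the received element to the second switch logic only after the switch has closed. The first preset delay may be set to the switch turn-on delay of the first switch logic, and the second preset delay to the switch turn-on delay of the second switch logic, ensuring that the first and second switch logic receive the correct elements.
In summary, the logic circuit provided in the second aspect of this application includes a first switch logic and a second switch logic and obtains the first and second mask matrices from the first and second distribution matrices. The first input of the first switch logic sequentially receives the element at each position of the second distribution matrix, the second input of the first switch logic sequentially receives the element of the first distribution matrix at the same position as the received element of the second distribution matrix, and the output of the first switch logic outputs the first target elements, which form the first mask matrix; when the element received at the second input of the first switch logic has the value 1, the first switch logic outputs the element received at its first input. The first input of the second switch logic sequentially receives the element at each position of the first distribution matrix, the second input of the second switch logic sequentially receives the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and the output of the second switch logic outputs the second target elements, which form the second mask matrix; when the element received at the second input of the second switch logic has the value 1, the second switch logic outputs the element received at its first input. The logic circuit provided in this application has simple logic and low hardware cost; when implemented in a processor, it can obtain the first and second mask matrices from the first and second distribution matrices within one clock cycle, improving the processing efficiency of the logic circuit.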
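A third aspect of this application provides a matrix processing apparatus, including:
a first determining module, configured to determine the number of non-zero elements in a to-be-processed matrix, where the to-be-processed matrix is a one-dimensional matrix;
a second determining module, configured to determine a distribution matrix of the to-be-processed matrix, where the distribution matrix indicates the positions of the non-zero elements in the to-be-processed matrix; and
a processing module, configured to combine the number of non-zero elements, the values of the non-zero elements of the to-be-processed matrix arranged in order, and the distribution matrix, to obtain a compressed matrix of the to-be-processed matrix.
In an embodiment of the third aspect of this application, the distribution matrix is a one-dimensional matrix, and the element at each position of the to-be-processed matrix corresponds one-to-one to the element at the same position of the distribution matrix;
the second determining module is specifically configured to:
scan the elements of the to-be-processed matrix in order;
when a scanned element is a non-zero element, determine that the value of the element of the distribution matrix corresponding to the scanned element is 1; and
when a scanned element is zero, determine that the value of the element of the distribution matrix corresponding to the scanned element is 0.
In an embodiment of the third aspect of this application, the number of elements in the to-be-processed matrix is N and the number of non-zero elements in the to-be-processed matrix is M; correspondingly, the number of elements in the distribution matrix is N, the number of elements whose value is 1 in the distribution matrix is M, and the number of elements in the compressed matrix is M+N+1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N.
In an embodiment of the third aspect of this application, the to-be-processed matrix includes a first to-be-processed matrix and a second to-be-processed matrix, and the number of elements in the first to-be-processed matrix is the same as the number of elements in the second to-be-processed matrix; correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix;
the apparatus further includes a calculation module, configured to obtain a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix, where the target value equals the result of accumulating the products of the element at each position of the first to-be-processed matrix and the element at the same position of the second to-be-processed matrix.
In an embodiment of the third aspect of this application, the calculation module is specifically configured to:
obtain, in order, each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of an element whose value is 1 in the first distribution matrix;
when an obtained first target element has the value 1, use a first effective element among the non-zero elements of the first to-be-processed matrix as an element of a first simplified matrix, where the order of the first effective element among the non-zero elements of the first to-be-processed matrix is the same as the order of the obtained first target element in the first mask matrix;
obtain, in order, each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of an element whose value is 1 in the second distribution matrix;
when an obtained second target element has the value 1, use a second effective element among the non-zero elements of the second to-be-processed matrix as an element of a second simplified matrix, where the order of the second effective element among the non-zero elements of the second to-be-processed matrix is the same as the order of the obtained second target element in the second mask matrix; and
accumulate the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix, to obtain the target value.
In summary, in the matrix processing apparatus provided in the third aspect of this application, the first determining module determines the non-zero elements of the to-be-processed matrix, the second determining module determines the distribution matrix indicating the positions of the non-zero elements, and the processing module combines the number of non-zero elements, the non-zero elements arranged in order, and the distribution matrix into a compressed matrix. When a sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix is used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix, which improves the efficiency of sparse-matrix operations and, in turn, the efficiency of matrix processing.
In a fourth aspect, an embodiment of this application provides a matrix processing apparatus, including a processor and a memory, where the memory is configured to store a program, and the processor is configured to invoke the program stored in the memory to perform the matrix processing method according to any implementation of the first aspect of this application.
In a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing program code; when the program code is executed, the matrix processing method according to any implementation of the first aspect of this application is performed.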
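Brief Description of Drawings
FIG. 1 is a schematic flowchart of an embodiment of a matrix processing method according to this application;
FIG. 2 is a schematic flowchart of determining a compressed matrix in a matrix processing method according to this application;
FIG. 3 is a schematic flowchart of determining a distribution matrix in a matrix processing method according to this application;
FIG. 4 is a schematic structural diagram of a compressed matrix in a matrix processing method according to this application;
FIG. 5 is a schematic flowchart of determining a compressed matrix in a matrix processing method according to this application;
FIG. 6 is a schematic flowchart of an embodiment of a matrix processing method according to this application;
FIG. 7 is a schematic flowchart of determining mask matrices from distribution matrices in a matrix processing method according to this application;
FIG. 8 is a schematic structural diagram of an embodiment of a logic circuit according to this application;
FIG. 9 is a schematic structural diagram of an embodiment of a logic circuit according to this application;
FIG. 10 is a schematic structural diagram of an embodiment of a logic circuit according to this application;
FIG. 11 is a schematic diagram of a processing structure in which a matrix processing method according to this application is applied to a systolic array processor;
FIG. 12A to FIG. 12E are schematic diagrams of a processing flow in which a matrix processing method according to this application is applied to a systolic array processor;
FIG. 13 is a schematic diagram of a processing structure in which a matrix processing method according to this application is applied to image convolution;
FIG. 14 is a schematic structural diagram of an embodiment of a matrix processing apparatus according to this application; and
FIG. 15 is a schematic structural diagram of an embodiment of a matrix processing apparatus according to this application.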
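Detailed Description of Embodiments
The following describes the embodiments of this application with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of Embodiment 1 of a matrix processing method according to this application. As shown in FIG. 1, the matrix processing method provided in this embodiment includes:
S101: Determine the number of non-zero elements in a to-be-processed matrix, where the to-be-processed matrix is a one-dimensional matrix.
S102: Determine a distribution matrix of the to-be-processed matrix, where the distribution matrix indicates the positions of the non-zero elements in the to-be-processed matrix.
S103: Combine the number of non-zero elements, the values of the non-zero elements of the to-be-processed matrix arranged in order, and the distribution matrix, to obtain a compressed matrix of the to-be-processed matrix.
Specifically, the execution body of this embodiment may be a processor with data processing capability in an electronic device, such as a central processing unit (CPU) or a graphics processing unit (GPU), and the electronic device may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, or the like.
In this embodiment, when the processor needs to compress the to-be-processed matrix to obtain its compressed matrix, the processor processes the to-be-processed matrix with the foregoing matrix processing method: after determining the non-zero elements of the to-be-processed matrix and the distribution matrix indicating their positions, the processor combines the number of non-zero elements, the non-zero elements arranged in order, and the distribution matrix into the compressed matrix.
Optionally, the to-be-processed matrix in this embodiment is a sparse matrix, and the processor compresses the sparse matrix to obtain a compressed matrix, improving the processor's storage efficiency for sparse matrices. When the sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix is used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix, improving the processor's computational efficiency for sparse matrices.
Optionally, the to-be-processed matrix in this embodiment is a one-dimensional matrix; alternatively, it may be a multi-dimensional matrix. It should be noted that the embodiments of this application are mostly described using a one-dimensional to-be-processed matrix as an example, and the processing manner and principle for one-dimensional matrices in this application can also be applied to multi-dimensional matrices.
Optionally, when the to-be-processed matrix in this embodiment is a multi-dimensional matrix, a dimension-reduction operation may first be performed on it. For example, the elements of a two-dimensional matrix may be read row by row to obtain a one-dimensional matrix, and the matrix processing method of the embodiments of this application is then applied to the resulting one-dimensional matrix.
Optionally, in this embodiment, the element at each position of the to-be-processed matrix corresponds one-to-one to the element at the same position of the distribution matrix, and the elements of the distribution matrix indicate whether the corresponding elements of the to-be-processed matrix are non-zero. For example, the distribution matrix includes elements of a first type and elements of a second type: the positions of first-type elements in the distribution matrix are the same as the positions of the non-zero elements in the to-be-processed matrix, and the positions of second-type elements in the distribution matrix are the same as the positions of the zero elements in the to-be-processed matrix. The first type and the second type are two kinds of elements with different representations and clearly distinguishable features; for example, first-type elements are the constant 1 and second-type elements are the constant 0, or first-type elements are odd numbers and second-type elements are even numbers.
The following uses the flow shown in FIG. 2 as an example to describe the matrix processing method shown in FIG. 1. FIG. 2 is a schematic flowchart of determining a compressed matrix in a matrix processing method according to this application, in which the to-be-processed matrix is [0,1,0,0,2,0,0,0,3,0,0]. When processing the to-be-processed matrix to obtain the compressed matrix, the processor determines that the non-zero elements of the to-be-processed matrix, arranged in order, are [1,2,3], and that the distribution matrix of the to-be-processed matrix is [0,1,0,0,1,0,0,0,1,0,0]. The processor then combines the determined number of non-zero elements [3], the non-zero elements in order [1,2,3], and the distribution matrix [0,1,0,0,1,0,0,0,1,0,0], finally obtaining the compressed matrix [3,1,2,3,0,1,0,0,1,0,0,0,1,0,0].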
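Steps S101 to S103 can be sketched in Python as follows. This is a minimal illustration rather than a processor implementation; the function name `compress` is an assumption, and the output order [count, non-zeros, distribution] is the arrangement used in the examples of this application.

```python
def compress(matrix):
    """Compress a 1-D matrix into [count, non-zero values..., distribution bits...]."""
    nonzeros, distribution = [], []
    for x in matrix:                  # scan the elements in order
        if x != 0:
            nonzeros.append(x)
            distribution.append(1)    # non-zero element -> 1
        else:
            distribution.append(0)    # zero element -> 0
    # S103: combine the count, the ordered non-zeros, and the distribution matrix.
    return [len(nonzeros)] + nonzeros + distribution


print(compress([0, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]))
# [3, 1, 2, 3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```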
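Optionally, in the foregoing example, step S102 of determining the distribution matrix of the to-be-processed matrix is as follows: the processor scans the elements of the to-be-processed matrix in order; when a scanned element is non-zero, it determines that the value of the corresponding element of the distribution matrix is 1; when a scanned element is zero, it determines that the value of the corresponding element of the distribution matrix is 0. For example, FIG. 3 is a schematic flowchart of determining a distribution matrix in a matrix processing method according to this application. In FIG. 3, the to-be-processed matrix and the distribution matrix have the same dimension and their elements correspond one-to-one: the element of the distribution matrix at the same position as a non-zero element of the to-be-processed matrix has the value 1, and the element at the same position as a zero element has the value 0.
Optionally, in the foregoing embodiment, when the number of elements in the to-be-processed matrix is N and the number of non-zero elements is M, correspondingly, the number of elements in the distribution matrix is N, the number of elements whose value is 1 in the distribution matrix is M, and the number of elements in the compressed matrix is M+N+1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N. In addition, the compressed matrix combined in S103 may be arranged as: the number of non-zero elements, the non-zero elements in order, and the distribution matrix; for example, in the example of FIG. 2 above, the compressed matrix obtained is [3,1,2,3,0,1,0,0,1,0,0,0,1,0,0]. It should be noted that this arrangement is only an example, and this embodiment does not specifically limit the order of the three parts (the number of non-zero elements, the non-zero elements in order, and the distribution matrix). For example, FIG. 4 is a schematic structural diagram of a compressed matrix in a matrix processing method according to this application; the different arrangements of the compressed matrix, including those shown in FIG. 4, all fall within the protection scope of this embodiment, and the compressed matrices in the embodiments of this application are illustrated using the arrangement of the number of non-zero elements, followed by the non-zero elements in order, followed by the distribution matrix.
Optionally, in the foregoing embodiment, when the elements of the distribution matrix are represented by the constants 0 and 1, each element of the distribution matrix occupies 1 bit. When each element of the to-be-processed matrix occupies more than 1 bit, for example 8, 16, or 32 bits, the distribution matrix requires less storage space than the to-be-processed matrix even though the two have the same dimension. Therefore, compressing the to-be-processed matrix into a compressed matrix also saves storage space for the to-be-processed matrix and improves the storage efficiency of the processor.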
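A rough bit count makes the storage comparison above concrete. This sketch assumes each value uses `value_bits` bits, each distribution entry 1 bit, and the count field `count_bits` bits; the count-field width is a hypothetical parameter, since this application does not specify it. The FIG. 2 example (N = 11, M = 3) is used with 16-bit values.

```python
def dense_bits(n, value_bits):
    """Storage for the uncompressed matrix: N values."""
    return n * value_bits


def compressed_bits(n, m, value_bits, count_bits):
    """Storage for [count, M non-zero values, N one-bit distribution entries]."""
    return count_bits + m * value_bits + n * 1


print(dense_bits(11, 16), compressed_bits(11, 3, 16, count_bits=16))  # 176 75
```

Under these assumptions the compressed form is smaller only while count_bits + M·value_bits + N < N·value_bits, so the actual saving depends on how sparse the matrix is.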
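Further, besides one-dimensional matrices, the matrix processing method shown in FIG. 1 can also process multi-dimensional matrices. FIG. 5 is a schematic flowchart of determining a compressed matrix in a matrix processing method according to this application; it uses a multi-dimensional to-be-processed matrix as an example to describe the application of the method of FIG. 1 to multi-dimensional matrices. The to-be-processed matrix shown in FIG. 5 is [0,4,0;0,0,0;0,0,5], with three rows and three columns. Before processing the to-be-processed matrix to obtain the compressed matrix, the processor may scan the multi-dimensional to-be-processed matrix into a one-dimensional matrix and then process the one-dimensional matrix. For example, the to-be-processed matrix [0,4,0;0,0,0;0,0,5] is scanned into the one-dimensional matrix [0,4,0,0,0,0,0,0,5]; to record the dimensions of the to-be-processed matrix, the dimension information may also be appended to the scanned one-dimensional matrix, for example [0,4,0,0,0,0,0,0,5,3,3], where the last two elements [3,3] indicate that the to-be-processed matrix is a multi-dimensional matrix with 3 rows and 3 columns. The processor then determines from the scanned one-dimensional matrix that the non-zero elements of the to-be-processed matrix, in order, are [4,5], and that the distribution matrix is [0,1,0,0,0,0,0,0,1,3,3], whose last two elements likewise indicate the dimensions of the to-be-processed matrix. Alternatively, when the processor already knows the dimensions of the to-be-processed matrix at computation time, or can determine them from other parameters, the dimensions need not be represented in the distribution matrix. The determined number of non-zero elements [2], the non-zero elements in order [4,5], and the distribution matrix [0,1,0,0,0,0,0,0,1] are combined to finally obtain the compressed matrix [2,4,5,0,1,0,0,0,0,0,0,1] of the to-be-processed matrix. The manner of processing a multi-dimensional matrix to obtain its compressed matrix in this embodiment is only an example. The compressed matrix may also be formed by adding new rows or columns to the distribution matrix of the to-be-processed matrix and using the elements of the newly added rows or columns to represent the non-zero elements of the to-be-processed matrix together with their distribution and number; that is, processing a multi-dimensional to-be-processed matrix can yield a multi-dimensional compressed matrix. For example, the compressed matrix may be represented as [2,4,5,0,1,0,0,0,0,0,0,1]. When the newly added rows or columns contain few elements, multiple rows or columns may be added; when they contain many elements, zero elements may be padded to align the resulting multi-dimensional compressed matrix.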
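The row-by-row scan and the optional dimension suffix for FIG. 5 can be sketched as below. The helpers `flatten_rows` and `compress_2d` are hypothetical names; appending (rows, cols) after the distribution bits mirrors the [0,1,0,0,0,0,0,0,1,3,3] variant, and a decoder would then need to know that the last two entries hold the dimensions.

```python
def flatten_rows(matrix):
    """Read a 2-D matrix row by row into a 1-D list (dimension reduction)."""
    return [x for row in matrix for x in row]


def compress_2d(matrix, append_shape=False):
    """Compress a 2-D matrix as [count, non-zeros..., distribution...(, rows, cols)]."""
    flat = flatten_rows(matrix)
    nonzeros = [x for x in flat if x != 0]
    distribution = [1 if x != 0 else 0 for x in flat]
    packed = [len(nonzeros)] + nonzeros + distribution
    if append_shape:
        # Dimensions placed after the distribution bits, as in the FIG. 5 variant.
        packed += [len(matrix), len(matrix[0])]
    return packed


m = [[0, 4, 0],
     [0, 0, 0],
     [0, 0, 5]]
print(compress_2d(m))                     # [2, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(compress_2d(m, append_shape=True))  # [2, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 1, 3, 3]
```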
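In another feasible implementation, the to-be-processed matrix [0,4,0;0,0,0;0,0,5] may be reduced to the one-dimensional matrix [0,4,0,0,0,0,0,0,5], and the matrix processing shown in FIG. 1 is then performed on the resulting one-dimensional matrix.
In summary, in the matrix processing method provided in this application, the number of non-zero elements in the to-be-processed matrix is determined, a distribution matrix indicating the positions of the non-zero elements is determined, and the number of non-zero elements, the values of the non-zero elements in order, and the distribution matrix are combined to obtain the compressed matrix of the to-be-processed matrix. Thus, when a sparse matrix undergoes an operation such as matrix convolution, multiply-add, multiply-subtract, divide-add, or divide-subtract, the compressed matrix is used in place of the sparse matrix to perform the operation and obtain the result for the sparse matrix, improving the processor's storage efficiency and computational efficiency for sparse matrices and, in turn, the processing efficiency of the matrix processing method.
Further, in the foregoing embodiment, the to-be-processed matrix includes a first to-be-processed matrix and a second to-be-processed matrix, the number of elements in the first to-be-processed matrix is the same as the number of elements in the second to-be-processed matrix, and correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix. The matrix processing method shown in FIG. 1 then further includes: obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix, where the target value equals the result of accumulating the products of the element at each position of the first to-be-processed matrix and the element at the same position of the second to-be-processed matrix.
The target value may be, for example, the result of a convolution of the first and second to-be-processed matrices. If the convolution were performed directly on the first and second to-be-processed matrices, the products of the element at each position of the first to-be-processed matrix and the element at the same position of the second to-be-processed matrix would have to be accumulated. In this embodiment, the first distribution matrix and non-zero elements in the first compressed matrix, together with the second distribution matrix and non-zero elements in the second compressed matrix, can be used in place of the first and second to-be-processed matrices for the convolution, and the resulting target value is the same as the result of convolving the first and second to-be-processed matrices.
Specifically, the foregoing method is described using the flow shown in FIG. 6 as an example. FIG. 6 is a schematic flowchart of an embodiment of a matrix processing method according to this application. As shown in FIG. 6, the first to-be-processed matrix is [1,0,2,0,3,4,0,5] and the second to-be-processed matrix is [0,2,0,0,1,0,0,-1]. Processing the first to-be-processed matrix as shown in FIG. 1 yields the first compressed matrix [5,1,2,3,4,5,1,0,1,0,1,1,0,1], in which the first distribution matrix is [1,0,1,0,1,1,0,1] and the non-zero elements are [1,2,3,4,5]; processing the second to-be-processed matrix as shown in FIG. 1 yields the second compressed matrix [3,2,1,-1,0,1,0,0,1,0,0,1], in which the second distribution matrix is [0,1,0,0,1,0,0,1] and the non-zero elements are [2,1,-1]. Obtaining the target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first to-be-processed matrix, and the non-zero elements of the second to-be-processed matrix then specifically includes the following.
Five first target elements 0, 0, 1, 0, 1 are obtained in order from the second distribution matrix [0,1,0,0,1,0,0,1] to form the first mask matrix [0,0,1,0,1], where the position of each first target element in the second distribution matrix is the same as the position of an element whose value is 1 in the first distribution matrix. The first mask matrix is then compared with the non-zero elements of the first to-be-processed matrix in the first compressed matrix: when a first target element of the first mask matrix has the value 1, the corresponding first effective element among the non-zero elements of the first to-be-processed matrix is used as an element of the first simplified matrix. That is, comparing [0,0,1,0,1] with [1,2,3,4,5] gives two first effective elements, 3 and 5. Because the order of each first effective element among the non-zero elements of the first to-be-processed matrix is the same as the order of the corresponding first target element in the first mask matrix, the first simplified matrix [3,5] is obtained.
Three second target elements 0, 1, 1 are obtained in order from the first distribution matrix [1,0,1,0,1,1,0,1] to form the second mask matrix [0,1,1], where the position of each second target element in the first distribution matrix is the same as the position of an element whose value is 1 in the second distribution matrix. The second mask matrix is then compared with the non-zero elements of the second to-be-processed matrix in the second compressed matrix: when a second target element of the second mask matrix has the value 1, the corresponding second effective element among the non-zero elements of the second to-be-processed matrix is used as an element of the second simplified matrix. That is, comparing [0,1,1] with [2,1,-1] gives two second effective elements, 1 and -1. Because the order of each second effective element among the non-zero elements of the second to-be-processed matrix is the same as the order of the corresponding second target element in the second mask matrix, the second simplified matrix [1,-1] is obtained.
The first and second simplified matrices are then used in place of the first and second to-be-processed matrices for the convolution. Specifically, the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix are accumulated, 3×1+5×(-1), giving the target value -2 as the convolution result of the first and second to-be-processed matrices. The target value equals the result of accumulating the products of the elements at the same positions of the first and second to-be-processed matrices; that is, convolving the first and second to-be-processed matrices directly gives the same result as the target value.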
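The mask, simplification, and accumulation steps of FIG. 6 can be reproduced with the following minimal Python sketch. The function name `masks_and_target` is a hypothetical helper; its inputs are the non-zero elements and distribution matrices taken from the two compressed matrices in the example.

```python
def masks_and_target(nz1, dist1, nz2, dist2):
    """Return (first mask, second mask, first simplified, second simplified, target)."""
    # First mask: entries of dist2 at the positions where dist1 == 1.
    mask1 = [b for a, b in zip(dist1, dist2) if a == 1]
    # Second mask: entries of dist1 at the positions where dist2 == 1.
    mask2 = [a for a, b in zip(dist1, dist2) if b == 1]
    # Simplified matrices: keep the k-th non-zero only if the k-th mask bit is 1.
    simple1 = [v for v, bit in zip(nz1, mask1) if bit == 1]
    simple2 = [v for v, bit in zip(nz2, mask2) if bit == 1]
    target = sum(x * y for x, y in zip(simple1, simple2))
    return mask1, mask2, simple1, simple2, target


result = masks_and_target([1, 2, 3, 4, 5], [1, 0, 1, 0, 1, 1, 0, 1],
                          [2, 1, -1], [0, 1, 0, 0, 1, 0, 0, 1])
print(result)  # ([0, 0, 1, 0, 1], [0, 1, 1], [3, 5], [1, -1], -2)
```

Both masks keep their 1s in original position order, so simple1[k] and simple2[k] always come from the same position of the uncompressed matrices; this is why no zeros need to be restored for alignment.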
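In summary, in the matrix processing method provided in this embodiment, the first and second compressed matrices can be used in place of the first and second to-be-processed matrices for the convolution, and the target value is obtained as the convolution result of the first and second to-be-processed matrices. During the computation on the first and second compressed matrices, the first and second mask matrices are determined from the first and second distribution matrices, the first and second simplified matrices are then determined from the first and second mask matrices, and the target value is obtained by multiplying the aligned elements of the first and second simplified matrices and accumulating the products. Therefore, no zero elements need to be added for element alignment during the convolution; only the elements of the first and second simplified matrices take part, and every operation is effective. Thus, when the first and second compressed matrices replace the first and second to-be-processed matrices in the convolution, their effective elements can be aligned while invalid operations caused by zero elements are avoided during alignment, further improving efficiency over existing matrix processing methods.
FIG. 7 is a schematic flowchart of determining mask matrices from distribution matrices in a matrix processing method according to this application. As shown in FIG. 7, this application further provides a logic circuit for obtaining the first and second mask matrices from the first and second distribution matrices in the foregoing embodiment. The logic circuit takes as input, in order, the element at each position of the first distribution matrix and the element of the second distribution matrix at the same position, and outputs, in order, first target elements and second target elements, which form the first mask matrix and the second mask matrix respectively.
Specifically, FIG. 8 is a schematic structural diagram of an embodiment of a logic circuit according to this application. The logic circuit shown in FIG. 8 includes a first switch logic and a second switch logic, where:
the first input of the first switch logic sequentially receives the element at each position of the second distribution matrix, the second input of the first switch logic sequentially receives the element of the first distribution matrix at the same position as the received element of the second distribution matrix, and the output of the first switch logic outputs the first target elements, which form the first mask matrix. When the element received at the second input of the first switch logic has the value 1, the switch of the first switch logic closes so that the element received at the first input is output from the output; when the element received at the second input has the value 0, the switch opens and the element received at the first input is not output.
For example, as shown in FIG. 8, the first input of the first switch logic receives the first element [0] of the second distribution matrix, and the second input receives the first element [1] of the first distribution matrix. Because the element received at the second input is [1], the first input and the output of the first switch logic are connected, and the element [0] received at the first input is output as a first target element to the first mask matrix. Next, the first input receives the second element [1] of the second distribution matrix and the second input receives the second element [0] of the first distribution matrix. Because the element received at the second input is [0], the first input and the output are disconnected. Next, the first input receives the third element [0] of the second distribution matrix and the second input receives the third element [1] of the first distribution matrix. Because the element received at the second input is [1], the first input and the output are connected, and the element [0] received at the first input is output as a first target element to the first mask matrix. This continues until the first input receives the last element [1] of the second distribution matrix, the second input receives the last element [1] of the first distribution matrix, and [1] is output to the first mask matrix; all first target elements output by the first switch logic, arranged in order, then form the first mask matrix [0,0,1,0,1].
Meanwhile, the first input of the second switch logic sequentially receives the element at each position of the first distribution matrix, the second input of the second switch logic sequentially receives the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and the output of the second switch logic outputs the second target elements, which form the second mask matrix. When the element received at the second input of the second switch logic has the value 1, the switch of the second switch logic closes so that the element received at the first input is output from the output; when the element received at the second input has the value 0, the switch opens and the element received at the first input is not output.
For example, as shown in FIG. 8, the first input of the second switch logic receives the first element [1] of the first distribution matrix, and the second input receives the first element [0] of the second distribution matrix. Because the element received at the second input is [0], the first input and the output of the second switch logic are disconnected. Next, the first input receives the second element [0] of the first distribution matrix and the second input receives the second element [1] of the second distribution matrix. Because the element received at the second input is [1], the first input and the output of the second switch logic are connected, and the element [0] received at the first input is output as a second target element to the second mask matrix. This continues until the first input receives the last element [1] of the first distribution matrix, the second input receives the last element [1] of the second distribution matrix, and [1] is output to the second mask matrix; all second target elements output by the second switch logic, arranged in order, then form the second mask matrix [0,1,1].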
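The switch behaviour walked through above can be modelled cycle by cycle. This is a behavioural sketch, not a circuit description: `switch_logic` is a hypothetical function, each loop iteration stands for one clock, the first input carries the data bit and the second input the control bit, and the data bit reaches the output only when the control bit is 1.

```python
def switch_logic(data_bits, control_bits):
    """Behavioural model of one switch: pass the data bit to the output when control == 1."""
    output = []
    for data, control in zip(data_bits, control_bits):  # one pair of inputs per clock
        if control == 1:        # switch closed: first input connected to the output
            output.append(data)
        # control == 0: switch open, nothing is emitted this clock
    return output


dist1 = [1, 0, 1, 0, 1, 1, 0, 1]
dist2 = [0, 1, 0, 0, 1, 0, 0, 1]
mask1 = switch_logic(dist2, dist1)  # first switch: data = dist2, control = dist1
mask2 = switch_logic(dist1, dist2)  # second switch: data = dist1, control = dist2
print(mask1, mask2)  # [0, 0, 1, 0, 1] [0, 1, 1]
```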
可选地,处理器中还可以并列设置本实施例提供的多个逻辑电路,每个逻辑电路处理可分别接收第一分布矩阵和第二分布矩阵的元素。每个逻辑电路同时接收的第一分布矩阵和第二分布矩阵的元素,并分别根据所接收的元素输出第一目标元素和第二目标元素。最终,所有逻辑电路所输出的第一目标元素依次排列可组成第一掩码矩阵,所有逻辑电路所输出的第二目标元素依次排列可组成第二掩码矩阵。本实施例中第一开关逻辑依次接收第一分布矩阵和第二分布矩阵中的元素、第二开关逻辑依次接收第一分布矩阵和第二分布矩阵中的元素可以同时在处理器的同一个时钟内进行。
综上,本实施例提供的逻辑电路中,通过较为简单的开关逻辑即可实现了上述实施例中通过第一分布矩阵和第二分布矩阵得到第一掩码矩阵和第二掩码矩阵的方法,并且开关逻辑可以在一个处理器时钟内完成接收分布矩阵的元素以及输出掩码矩阵的元素,从而简化了逻辑电路并进一步提高了矩阵的处理效率。
图9为本申请逻辑电路一实施例的结构示意图,本实施例提供的逻辑电路可用于替换如图8所示的逻辑电路。具体地,如图9所示的逻辑电路在图8所示逻辑电路的基础上,还包括:与门逻辑。其中,与门逻辑的第一输入端用于依次接收第一分布矩阵中每个位置上的元素,与门逻辑的第二输入端用于依次接收第二分布矩阵中与被接收的第一分布矩阵上的元素位置相同的元素,与门逻辑的输出端用于将与门逻辑的第一输入端和与门逻辑的第二输入端的与运算结果输出至第一开关逻辑的第二输入端和第二开关逻辑的第二输入端。
具体地,由于如图8所示的基本逻辑在实现通过第一分布矩阵和第二分布矩阵得到第一掩码矩阵和第二掩码矩阵时,第一开关逻辑和第二开关逻辑在其第二输入端接收到的元素为[1]时需要闭合开关,如果在闭合开关的时延内第一输入端接收到的元素丢失或者未能同步导致输入元素的刷新,都有可能造成在开关闭合后,通过输出端所输出的元素错乱。因此,在本实施例中设置与门逻辑,与门逻辑的第一输入端和第二输入端分别依次接收第一分布矩阵中每个位置上的元素和第二分布矩阵中与被接收的第一分布矩阵上的元素位置相同的元素,并将二者进行相与运算后,从输出端输出至第一开关逻辑的第二输入端和第二开关逻辑的第二输入端。其中,与门逻辑在此起到了缓存的作用,为第一开关逻辑和第二开关逻辑提供开关闭合的时间,在开关闭合后再通过输出端向第一开关逻辑和第二开关逻辑输出与运算结果,从而保证了第一开关逻辑和第二开关逻辑的第二输入端准确无误地接收到正确的元素。图9所示的实施例中第一开关逻辑、第二开关逻辑的第一输入端、第二输入端和输出端的原理同图8中的实施例,不再赘述。
图10为本申请逻辑电路一实施例的结构示意图,本实施例提供的逻辑电路可用于替换如图8所示的逻辑电路。具体地,如图10所示的逻辑电路在图8所示逻辑电路的基础上,还包括:第一锁存器和第二锁存器。其中,第一锁存器的输入端用于依次接收第 二分布矩阵中每个位置上的元素,第一锁存器的输出端用于在第一预设时延后将输入端接收的元素输出至第一开关逻辑;第二锁存器的输入端用于依次接收第一分布矩阵中每个位置上的元素,第二锁存器的输出端用于在第二预设时延后将输入端接收的元素输出至第二开关逻辑。
具体地,如图10所示的实施例中,提供另一种保证第一开关逻辑和第二开关逻辑的第二输入端准确无误地接收到正确的元素的方法。其中,第一锁存器和第二锁存器均起到缓存的作用,第一锁存器在接收到第二分布矩阵的元素后,为第一开关逻辑提供开关闭合的时间,在第一开关逻辑的开关闭合后再通过输出端向第一开关逻辑输出接收到的元素;第二锁存器在接收到第一分部矩阵的元素后,为第二开关逻辑提供开关闭合的时间,在第二开关逻辑的开关闭合后再通过输出端向第二开关逻辑输出接收到的元素。因此,可选地,第一预设时延可以设置为第一开关逻辑的开关打开时延,第二预设时延可以设置为第二开关逻辑的开关打开时延。其中,第一开关逻辑的打开时延与第二开关逻辑的打开时延相同。图10所示的实施例中第一开关逻辑、第二开关逻辑的第一输入端、第二输入端和输出端的原理同图8中的实施例,不再赘述。
进一步地,上述各实施例中的矩阵的处理方法,可应用于脉动阵列构架的处理器之中,对矩阵进行卷积运算,并且不需要改变已有的脉动阵列架构。
例如:图11为本申请矩阵的处理方法应用于脉动阵列处理器的处理结构示意图,如图11所示,采用现有的脉动阵列处理器进行卷积或者全链接运算时,假设第一存储单元和第二存储单元中分别存储了四个矩阵,则处理器会将待计算的第一存储单元中的四个矩阵分别预加载到计算单元1-4之中。随后,依次将第二存储单元内的矩阵加载到计算单元1中,与预加载的矩阵完成计算后,将矩阵传输至计算单元2中;计算单元2依次接收将计算单元1中完成计算的矩阵,并与预加载的矩阵完成计算后,将矩阵传输至计算单元3中;依次类推。
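To implement the matrix processing method of this application in the embodiment shown in Figure 11, an alignment unit may be added in front of each compute unit of the systolic array processor. Before the matrix computation, the alignment unit aligns the compressed matrices produced by the method shown in Figure 1. Each compute unit then performs only the effective computation on the first and second simplified matrices, and no invalid computation on zero values. The alignment units and compute units may be implemented by software programs in the processor. Alternatively, the alignment units may be implemented by logic circuits in the processor, and each alignment unit may use any of the logic circuits shown in Figures 7 to 10.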
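Figures 12A to 12E are schematic diagrams of the processing flow when the matrix processing method of this application is applied to a systolic array processor. The processing structure of the systolic array processor shown in Figure 11 is described below using the flow of Figure 12. In this flow the processor may be performing a convolution or fully connected computation on matrices. For example, when the processor performs the convolution or fully connected computation of a deep learning network, it needs to perform convolution operations on the network's parameter matrices and data matrices.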
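As shown in Figure 12A, the processor first processes the parameter matrices to be computed with the method shown in Figure 1. This yields compressed matrices 1, 2, 3 and 4, which are stored in the first storage unit of the processor. It processes the data matrices to be computed with the same method, yielding compressed matrices A, B, C and D, which are stored in the second storage unit of the processor. The first storage unit and the second storage unit may be different storage units in the processor or different storage locations of the same storage unit; this is not limited here.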
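As shown in Figure 12B, the processor preloads compressed matrices 1, 2, 3 and 4 from the first storage unit into alignment units 1, 2, 3 and 4, respectively. More specifically, what is preloaded into an alignment unit may be the sequentially arranged non-zero elements and the distribution matrix of the compressed matrix.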
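As shown in Figure 12C, the processor loads compressed matrix A from the second storage unit into alignment unit 1. Alignment unit 1 then uses the distribution matrices in compressed matrix 1 and compressed matrix A to determine simplified matrix 1, corresponding to compressed matrix 1, and simplified matrix A, corresponding to compressed matrix A.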
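As shown in Figure 12D, the processor outputs simplified matrix 1 and simplified matrix A, obtained in the step of Figure 12C, to compute unit 1. Compute unit 1 accumulates the products of the elements at the same positions in simplified matrix 1 and simplified matrix A. The processor also loads compressed matrix B from the second storage unit into alignment unit 1. Alignment unit 1 then uses the distribution matrices in compressed matrix 1 and compressed matrix B to determine simplified matrix 1, corresponding to compressed matrix 1, and simplified matrix B, corresponding to compressed matrix B. The processor further loads compressed matrix A, whose alignment for compute unit 1 is complete, into alignment unit 2. Alignment unit 2 then uses the distribution matrices in compressed matrix 2 and compressed matrix A to determine simplified matrix 2, corresponding to compressed matrix 2, and simplified matrix A, corresponding to compressed matrix A.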
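As shown in Figure 12E, the processor outputs simplified matrix 1 and simplified matrix B, obtained in the step of Figure 12D, to compute unit 1, which accumulates the products of the elements at the same positions in the two matrices. It also outputs simplified matrix 2 and simplified matrix A, obtained in the step of Figure 12D, to compute unit 2, which accumulates the products of the elements at the same positions in the two matrices. The processor also loads compressed matrix C from the second storage unit into alignment unit 1. Alignment unit 1 uses the distribution matrices in compressed matrix 1 and compressed matrix C to determine simplified matrix 1 and simplified matrix C. The processor further loads compressed matrix B, whose alignment for compute unit 1 is complete, into alignment unit 2. Alignment unit 2 uses the distribution matrices in compressed matrix 2 and compressed matrix B to determine simplified matrix 2 and simplified matrix B. Finally, the processor loads compressed matrix A, whose alignment for compute unit 2 is complete, into alignment unit 3. Alignment unit 3 uses the distribution matrices in compressed matrix 3 and compressed matrix A to determine simplified matrix 3 and simplified matrix A.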
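After the processing shown in Figure 12E is complete, alignment unit 1 loads the next compressed matrix to be processed, D, from the second storage unit. Each alignment unit, once it has finished aligning, passes its compressed matrix on to the next alignment unit. Each alignment unit sends the simplified matrices obtained from its two loaded compressed matrices to the corresponding compute unit, which performs the computation and outputs the result. For the method and principle of determining simplified matrices from compressed matrices, refer to the foregoing embodiments of this application; the details are not repeated here.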
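The timing of Figures 12C to 12E can be tabulated with a small schedule model. Step 0 corresponds to Figure 12C, step 1 to Figure 12D and step 2 to Figure 12E. The model assumes one hand-off per step, and assumes each compute unit consumes its alignment unit's output one step later, as the figures describe. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, str]  # (1-based unit number, name of the streamed compressed matrix)


def pipeline_schedule(n_units: int, stream: List[str],
                      steps: int) -> List[Dict[str, List[Pair]]]:
    schedule: List[Dict[str, List[Pair]]] = []
    for t in range(steps):
        # Streamed matrix j reaches alignment unit k (0-based) after k hand-offs.
        align = [(k + 1, stream[t - k]) for k in range(n_units)
                 if 0 <= t - k < len(stream)]
        # Compute unit k works on what alignment unit k produced one step earlier.
        compute = [(k + 1, stream[t - 1 - k]) for k in range(n_units)
                   if 0 <= t - 1 - k < len(stream)]
        schedule.append({"align": align, "compute": compute})
    return schedule


for t, step in enumerate(pipeline_schedule(4, ["A", "B", "C", "D"], 4)):
    print(t, step)
```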
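In summary, the matrix processing method of this application can be applied to a systolic array processor performing the convolution or fully connected computation of a deep learning network. The parameter matrices and data matrices to be computed are compressed, and the alignment units and compute units in the processor then operate on the compressed data matrices and parameter matrices. This keeps the compute units from performing invalid computations involving zero elements, which improves the storage efficiency and computing efficiency of the processor. The matrix processing method of this application is also compatible with existing processors that use a systolic array architecture, which makes it easier to implement and adopt.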
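Optionally, the matrix processing method of this application can also be applied when a processor performs image convolution. The images a processor handles are digital images, each represented by an image matrix made up of the gray values of the image's pixels. In image convolution, the processor slides a convolution kernel (also called a convolution template) over the image matrix. At each position, the elements of the image matrix covered by the kernel are multiplied by the corresponding kernel elements and summed, yielding one element of the output matrix.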
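The sliding multiply-and-sum described above can be sketched directly. Two assumptions apply. First, the kernel is not flipped, so strictly this is cross-correlation, which is the usual convention in deep learning. Second, only interior ("valid") positions are computed, and boundary handling is left open as in the text. For a 6 × 6 image and a 3 × 3 kernel the result is 4 × 4, corresponding to rows 2 to 5 and columns 2 to 5 of the output image. The function name is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def convolve_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` and, at each interior position, sum the
    products of the overlapping elements (no kernel flip, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out: Matrix = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


img = [[4 * r + c for c in range(4)] for r in range(4)]
print(convolve_valid(img, [[1, 0], [0, -1]]))
```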
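Specifically, Figure 13 is a schematic diagram of the processing structure when the matrix processing method of this application is applied to image convolution. The matrix to be convolved is the input image matrix in the figure, which has 6 rows and 6 columns; assume the selected convolution kernel has 3 rows and 3 columns. To convolve the input image matrix, the processor aligns the kernel in turn with each 3 × 3 intermediate matrix of the input image matrix. It accumulates the products of the aligned elements of the kernel and the intermediate matrix, and uses the result as the element of the output image matrix at the position corresponding to that intermediate matrix. For each such kernel and intermediate matrix, the matrix processing method of the embodiment shown in Figure 5 of this application may be used to compress both, yielding a compressed matrix of the kernel and a compressed matrix of the intermediate matrix. The two compressed matrices are then processed with the matrix processing method of the embodiment shown in Figure 6 to obtain the result of convolving the kernel with the intermediate matrix.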
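For example, the convolution kernel shown in Figure 13 is [4,0,0;0,0,0;0,0,-4]. The 9 kernel elements are first aligned with the 9 elements in rows 1 to 3 and columns 1 to 3 of the input image matrix, giving the intermediate matrix [0,0,0;0,1,1;0,0,2]. Following the method shown in Figure 5, processing the kernel gives its compressed matrix [2,4,-4,1,0,0,0,0,0,0,0,1], and processing the intermediate matrix gives its compressed matrix [3,1,1,2,0,0,0,0,1,1,0,0,1]. Then, from the kernel's distribution matrix [1,0,0,0,0,0,0,0,1] and the intermediate matrix's distribution matrix [0,0,0,0,1,1,0,0,1], the kernel's mask matrix [0,1] and the intermediate matrix's mask matrix [0,0,1] are determined. The kernel's mask matrix gives the kernel's simplified matrix [-4], and the intermediate matrix's mask matrix gives its simplified matrix [2]. The two simplified matrices give the target value (-4) × 2 = -8, which becomes the element in row 2, column 2 of the output image matrix. Next, the kernel is shifted one element to the right and aligned with the 9 elements in rows 1 to 3 and columns 2 to 4 of the input image matrix, giving the aligned intermediate matrix [0,0,0;1,1,0;2,0,0]. The same matrix processing method computes the sum of products of the corresponding elements of the kernel and this intermediate matrix, and the result becomes the element in row 2, column 3 of the output image. Proceeding in this way yields all elements in rows 2 to 5 and columns 2 to 5 of the output image matrix; every computation between an intermediate matrix and the kernel can use the method of the example above. The outermost elements of the output image matrix (rows 1 and 6, columns 1 and 6) involve the image-convolution boundary problem. They can be handled, for example, by ignoring the boundary elements or keeping the original boundary elements; since this does not involve matrix processing, this embodiment does not limit it.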
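The arithmetic of this example can be checked with a short sketch. It flattens each 3 × 3 matrix row-major, builds the compressed matrix in the [M, non-zero values..., distribution matrix...] layout used above, derives the two mask matrices and simplified matrices, and accumulates the products. Row-major flattening matches the distribution matrices quoted in the text. The helper names are hypothetical.

```python
from typing import List, Tuple


def compress(m: List[List[int]]) -> List[int]:
    """Row-major flatten, then [M, non-zero values..., distribution matrix...]."""
    flat = [v for row in m for v in row]
    nonzeros = [v for v in flat if v != 0]
    return [len(nonzeros)] + nonzeros + [1 if v != 0 else 0 for v in flat]


def split(c: List[int]) -> Tuple[List[int], List[int]]:
    """Return (non-zero values, distribution matrix) of a compressed matrix."""
    m = c[0]
    return c[1:1 + m], c[1 + m:]


def simplify(c1: List[int], c2: List[int]) -> Tuple[List[int], List[int]]:
    nz1, d1 = split(c1)
    nz2, d2 = split(c2)
    mask1 = [b for a, b in zip(d1, d2) if a == 1]  # kernel mask matrix
    mask2 = [a for a, b in zip(d1, d2) if b == 1]  # intermediate mask matrix
    s1 = [v for v, keep in zip(nz1, mask1) if keep == 1]
    s2 = [v for v, keep in zip(nz2, mask2) if keep == 1]
    return s1, s2


kernel = [[4, 0, 0], [0, 0, 0], [0, 0, -4]]
window = [[0, 0, 0], [0, 1, 1], [0, 0, 2]]
ck, cw = compress(kernel), compress(window)
s1, s2 = simplify(ck, cw)
target = sum(x * y for x, y in zip(s1, s2))
print(ck, cw, s1, s2, target)  # target is -8
```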
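In summary, the matrix processing method of this application can be applied to image convolution by a processor. When the kernel is multiplied and accumulated with the corresponding intermediate matrix of the image matrix, the target value is obtained by operating on the compressed matrix of the kernel and the compressed matrix of the intermediate matrix. The compressed matrices need no extra zero elements for element alignment during the operation, and only the elements of the first and second simplified matrices take part. Every operation performed is therefore effective, and invalid computations involving zero elements are avoided. This increases the speed of image convolution and further improves the processor's efficiency for image convolution.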
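Figure 14 is a schematic structural diagram of an embodiment of the matrix processing apparatus of this application. As shown in Figure 14, the matrix processing apparatus of this embodiment includes a first determining module 1401, a second determining module 1402 and a processing module 1403. The first determining module 1401 is configured to determine the number of non-zero elements in a matrix to be processed, the matrix to be processed being a one-dimensional matrix. The second determining module 1402 is configured to determine a distribution matrix of the matrix to be processed, the distribution matrix indicating the positions of the non-zero elements in the matrix to be processed. The processing module 1403 is configured to combine the number of non-zero elements, the values of the non-zero elements of the matrix to be processed in order, and the distribution matrix, to obtain a compressed matrix of the matrix to be processed.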
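The matrix processing apparatus of this embodiment can be used to perform the matrix processing method shown in Figure 1; its implementation and principle are the same and are not repeated here.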
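Optionally, in the foregoing embodiment, the distribution matrix is a one-dimensional matrix, and the element at each position of the matrix to be processed corresponds one-to-one with the element at the same position of the distribution matrix. The second determining module 1402 is specifically configured to scan the elements of the matrix to be processed in order. When a scanned element is non-zero, it sets the corresponding element of the distribution matrix to 1; when a scanned element is zero, it sets the corresponding element of the distribution matrix to 0.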
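Optionally, in the foregoing embodiment, the matrix to be processed has N elements, of which M are non-zero. Correspondingly, the distribution matrix has N elements, M of which have the value 1, and the compressed matrix has M+N+1 elements, where N is a positive integer, M is a non-negative integer, and M ≤ N.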
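A minimal Python sketch of the compression performed by the three modules follows. It assumes integer elements and the [M, non-zero values..., distribution matrix...] layout, and the names are hypothetical. The `decompress` helper is not described in this application; it is included only to check the round trip and the M+N+1 size.

```python
from typing import List


def compress(vec: List[int]) -> List[int]:
    """Build the compressed matrix of a one-dimensional matrix."""
    dist = [1 if v != 0 else 0 for v in vec]   # distribution matrix (N elements)
    nonzeros = [v for v in vec if v != 0]      # non-zero values in scan order (M)
    return [len(nonzeros)] + nonzeros + dist   # 1 + M + N elements


def decompress(c: List[int]) -> List[int]:
    """Rebuild the original matrix from its compressed form."""
    m = c[0]
    values = iter(c[1:1 + m])
    return [next(values) if bit == 1 else 0 for bit in c[1 + m:]]


v = [0, 0, 0, 0, 1, 1, 0, 0, 2]
c = compress(v)
print(c, len(c) == c[0] + len(v) + 1, decompress(c) == v)
```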
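The matrix processing apparatus of this embodiment can be used to perform the matrix processing methods of the foregoing embodiments; its implementation and principle are the same and are not repeated here.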
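Figure 15 is a schematic structural diagram of an embodiment of the matrix processing apparatus of this application. As shown in Figure 15, the matrix processing apparatus of this embodiment builds on Figure 14 and further includes a calculation module 1501. Here the matrix to be processed includes a first matrix to be processed and a second matrix to be processed, which have the same number of elements. Correspondingly, the distribution matrix includes a first distribution matrix and a second distribution matrix. The calculation module 1501 is configured to obtain a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first matrix to be processed and the non-zero elements of the second matrix to be processed. The target value equals the sum of the products of the element at each position of the first matrix to be processed and the element at the same position of the second matrix to be processed.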
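Specifically, the calculation module 1501 is configured to: obtain, in order, each first target element in the second distribution matrix to form a first mask matrix, where the position of each first target element in the second distribution matrix is the same as the position of a respective element with value 1 in the first distribution matrix;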
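when an obtained first target element has the value 1, take a first effective element among the non-zero elements of the first matrix to be processed as an element of a first simplified matrix, where the ordinal position of the first effective element among the non-zero elements of the first matrix to be processed is the same as the ordinal position of the obtained first target element in the first mask matrix;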
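obtain, in order, each second target element in the first distribution matrix to form a second mask matrix, where the position of each second target element in the first distribution matrix is the same as the position of a respective element with value 1 in the second distribution matrix;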
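when an obtained second target element has the value 1, take a second effective element among the non-zero elements of the second matrix to be processed as an element of a second simplified matrix, where the ordinal position of the second effective element among the non-zero elements of the second matrix to be processed is the same as the ordinal position of the obtained second target element in the second mask matrix; and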
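accumulate the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix to obtain the target value.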
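The same five steps can be sketched with the distribution matrices packed into integer bit masks. The packing (element i in bit i) and the software `pext` helper are assumptions for illustration. The helper is loosely analogous to a parallel-bit-extract instruction such as x86 BMI2 `PEXT`; the application does not prescribe a bit-level representation. The internal assertion states why the result is exact. Both simplified matrices keep precisely the positions where both inputs are non-zero, in the same scan order, so their element-wise products are exactly the non-zero terms of the dense sum.

```python
from typing import List


def pack(dist: List[int]) -> int:
    """Pack a 0/1 distribution matrix into an int; element i goes to bit i."""
    return sum(bit << i for i, bit in enumerate(dist))


def pext(value: int, mask: int) -> int:
    """Software parallel bit extract: gather the bits of `value` at the set
    positions of `mask` into the low bits of the result, preserving order."""
    result, out_bit = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of mask
        if value & low:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1            # clear that bit
    return result


def target_value(d1: int, nz1: List[int], d2: int, nz2: List[int]) -> int:
    """nz1/nz2 are the non-zero values in scan order (len == popcount of d1/d2)."""
    mask1 = pext(d2, d1)  # first mask matrix: bits of d2 where d1 is 1
    mask2 = pext(d1, d2)  # second mask matrix: bits of d1 where d2 is 1
    s1 = [v for k, v in enumerate(nz1) if (mask1 >> k) & 1]  # first simplified
    s2 = [v for k, v in enumerate(nz2) if (mask2 >> k) & 1]  # second simplified
    assert len(s1) == len(s2) == bin(d1 & d2).count("1")
    return sum(x * y for x, y in zip(s1, s2))


print(target_value(pack([1, 0, 0, 0, 0, 0, 0, 0, 1]), [4, -4],
                   pack([0, 0, 0, 0, 1, 1, 0, 0, 1]), [1, 1, 2]))
```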
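The matrix processing apparatus of this embodiment can be used to perform the matrix processing method shown in Figure 6; its implementation and principle are the same and are not repeated here.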
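It should be noted that the division into modules in the embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. An integrated module may be implemented in hardware or as a software functional module. If it is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application may be embodied in the form of a software product. This covers the solutions in essence, the part contributing to the prior art, or all or part of the solutions. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (such as a personal computer, a server or a network device) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.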
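The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When software is used, they may be implemented wholly or partly as a computer program product, which includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, they may be transmitted from one website, computer, server or data center to another by wire (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)).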
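This application further provides a computer-readable storage medium storing program code; when the program code is executed, the matrix processing method of any of the foregoing embodiments is performed.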
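This application further provides a computer program product; when the program code it contains is executed by a processor, the matrix processing method of any of the foregoing embodiments is implemented.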
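Finally, it should be noted that the foregoing embodiments are intended only to describe, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand the following. The technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.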

Claims (16)

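  1. A matrix processing method, comprising:
    determining a number of non-zero elements in a matrix to be processed, the matrix to be processed being a one-dimensional matrix;
    determining a distribution matrix of the matrix to be processed, the distribution matrix indicating positions of the non-zero elements in the matrix to be processed; and
    combining the number of non-zero elements, the values of the non-zero elements of the matrix to be processed arranged in order, and the distribution matrix, to obtain a compressed matrix of the matrix to be processed.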
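  2. The method according to claim 1, wherein the distribution matrix is a one-dimensional matrix, the element at each position of the matrix to be processed corresponds one-to-one with the element at the same position of the distribution matrix, and the determining a distribution matrix of the matrix to be processed comprises:
    scanning the elements of the matrix to be processed in order;
    when a scanned element is a non-zero element, determining that the element of the distribution matrix corresponding to the scanned element has the value 1; and
    when a scanned element is zero, determining that the element of the distribution matrix corresponding to the scanned element has the value 0.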
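  3. The method according to claim 1 or 2, wherein the number of elements in the matrix to be processed is N and the number of non-zero elements in the matrix to be processed is M; correspondingly, the number of elements in the distribution matrix is N, the number of elements with value 1 in the distribution matrix is M, and the number of elements in the compressed matrix is M+N+1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N.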
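  4. The method according to any one of claims 1 to 3, wherein the matrix to be processed comprises a first matrix to be processed and a second matrix to be processed, the first matrix to be processed and the second matrix to be processed have the same number of elements, and correspondingly the distribution matrix comprises a first distribution matrix and a second distribution matrix; the method further comprises:
    obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first matrix to be processed and the non-zero elements of the second matrix to be processed, the target value being equal to the result of accumulating the products of the element at each position of the first matrix to be processed and the element at the same position of the second matrix to be processed.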
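  5. The method according to claim 4, wherein the obtaining a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first matrix to be processed and the non-zero elements of the second matrix to be processed comprises:
    obtaining, in order, each first target element in the second distribution matrix to form a first mask matrix, wherein the position of each first target element in the second distribution matrix is the same as the position of a respective element with value 1 in the first distribution matrix;
    when an obtained first target element has the value 1, taking a first effective element among the non-zero elements of the first matrix to be processed as an element of a first simplified matrix, wherein the ordinal position of the first effective element among the non-zero elements of the first matrix to be processed is the same as the ordinal position of the obtained first target element in the first mask matrix;
    obtaining, in order, each second target element in the first distribution matrix to form a second mask matrix, wherein the position of each second target element in the first distribution matrix is the same as the position of a respective element with value 1 in the second distribution matrix;
    when an obtained second target element has the value 1, taking a second effective element among the non-zero elements of the second matrix to be processed as an element of a second simplified matrix, wherein the ordinal position of the second effective element among the non-zero elements of the second matrix to be processed is the same as the ordinal position of the obtained second target element in the second mask matrix; and
    accumulating the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix to obtain the target value.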
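  6. A logic circuit, wherein the logic circuit is configured to obtain a first mask matrix and a second mask matrix from a first distribution matrix and a second distribution matrix; the first distribution matrix indicates positions of non-zero elements in a first matrix to be processed; the second distribution matrix indicates positions of non-zero elements in a second matrix to be processed; the first mask matrix represents first target elements in the second distribution matrix, the position of each first target element in the second distribution matrix being the same as the position of a respective element with value 1 in the first distribution matrix; and the second mask matrix represents second target elements in the first distribution matrix, the position of each second target element in the first distribution matrix being the same as the position of a respective element with value 1 in the second distribution matrix;
    the logic circuit comprises a first switch logic and a second switch logic, wherein
    a first input of the first switch logic is configured to sequentially receive the element at each position of the second distribution matrix, a second input of the first switch logic is configured to sequentially receive the element of the first distribution matrix at the same position as the received element of the second distribution matrix, and an output of the first switch logic is configured to output the first target elements to form the first mask matrix;
    when the element received at the second input of the first switch logic has the value 1, the first switch logic outputs the element received at its first input from its output;
    a first input of the second switch logic is configured to sequentially receive the element at each position of the first distribution matrix, a second input of the second switch logic is configured to sequentially receive the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and an output of the second switch logic is configured to output the second target elements to form the second mask matrix; and
    when the element received at the second input of the second switch logic has the value 1, the second switch logic outputs the element received at its first input from its output.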
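  7. The logic circuit according to claim 6, further comprising AND gate logic, wherein
    a first input of the AND gate logic is configured to sequentially receive the element at each position of the first distribution matrix, a second input of the AND gate logic is configured to sequentially receive the element of the second distribution matrix at the same position as the received element of the first distribution matrix, and an output of the AND gate logic is configured to output the AND of the first input and the second input of the AND gate logic to the second input of the first switch logic and the second input of the second switch logic.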
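  8. The logic circuit according to claim 6, further comprising a first latch and a second latch, wherein
    an input of the first latch is configured to sequentially receive the element at each position of the second distribution matrix, and an output of the first latch is configured to output the element received at its input to the first switch logic after a first preset delay; and
    an input of the second latch is configured to sequentially receive the element at each position of the first distribution matrix, and an output of the second latch is configured to output the element received at its input to the second switch logic after a second preset delay.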
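  9. The logic circuit according to claim 8, wherein
    the first preset delay is the switch turn-on delay of the first switch logic; and
    the second preset delay is the switch turn-on delay of the second switch logic.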
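  10. A matrix processing apparatus, comprising:
    a first determining module, configured to determine a number of non-zero elements in a matrix to be processed, the matrix to be processed being a one-dimensional matrix;
    a second determining module, configured to determine a distribution matrix of the matrix to be processed, the distribution matrix indicating positions of the non-zero elements in the matrix to be processed; and
    a processing module, configured to combine the number of non-zero elements, the values of the non-zero elements of the matrix to be processed arranged in order, and the distribution matrix, to obtain a compressed matrix of the matrix to be processed.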
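  11. The apparatus according to claim 10, wherein the distribution matrix is a one-dimensional matrix, and the element at each position of the matrix to be processed corresponds one-to-one with the element at the same position of the distribution matrix;
    the second determining module is specifically configured to:
    scan the elements of the matrix to be processed in order;
    when a scanned element is a non-zero element, determine that the element of the distribution matrix corresponding to the scanned element has the value 1; and
    when a scanned element is zero, determine that the element of the distribution matrix corresponding to the scanned element has the value 0.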
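  12. The apparatus according to claim 11, wherein the number of elements in the matrix to be processed is N and the number of non-zero elements in the matrix to be processed is M; correspondingly, the number of elements in the distribution matrix is N, the number of elements with value 1 in the distribution matrix is M, and the number of elements in the compressed matrix is M+N+1, where N is a positive integer, M is a non-negative integer, and M is less than or equal to N.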
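  13. The apparatus according to any one of claims 10 to 12, wherein the matrix to be processed comprises a first matrix to be processed and a second matrix to be processed, the first matrix to be processed and the second matrix to be processed have the same number of elements, and correspondingly the distribution matrix comprises a first distribution matrix and a second distribution matrix;
    the apparatus further comprises a calculation module, configured to obtain a target value based on the first distribution matrix, the second distribution matrix, the non-zero elements of the first matrix to be processed and the non-zero elements of the second matrix to be processed, the target value being equal to the result of accumulating the products of the element at each position of the first matrix to be processed and the element at the same position of the second matrix to be processed.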
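  14. The apparatus according to claim 13, wherein the calculation module is specifically configured to:
    obtain, in order, each first target element in the second distribution matrix to form a first mask matrix, wherein the position of each first target element in the second distribution matrix is the same as the position of a respective element with value 1 in the first distribution matrix;
    when an obtained first target element has the value 1, take a first effective element among the non-zero elements of the first matrix to be processed as an element of a first simplified matrix, wherein the ordinal position of the first effective element among the non-zero elements of the first matrix to be processed is the same as the ordinal position of the obtained first target element in the first mask matrix;
    obtain, in order, each second target element in the first distribution matrix to form a second mask matrix, wherein the position of each second target element in the first distribution matrix is the same as the position of a respective element with value 1 in the second distribution matrix;
    when an obtained second target element has the value 1, take a second effective element among the non-zero elements of the second matrix to be processed as an element of a second simplified matrix, wherein the ordinal position of the second effective element among the non-zero elements of the second matrix to be processed is the same as the ordinal position of the obtained second target element in the second mask matrix; and
    accumulate the products of the element at each position of the first simplified matrix and the element at the same position of the second simplified matrix to obtain the target value.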
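  15. A matrix processing apparatus, comprising:
    a processor and a memory, wherein
    the memory is configured to store a program; and
    the processor is configured to invoke the program stored in the memory to perform the matrix processing method according to any one of claims 1 to 5.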
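  16. A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and when the program code is executed, the matrix processing method according to any one of claims 1 to 5 is performed.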
PCT/CN2018/098993 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit WO2020029018A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
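CN202110395943.9A CN113190791A (zh) 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit
CN201880015972.4A CN111010883B (zh) 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit
PCT/CN2018/098993 WO2020029018A1 (zh) 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit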
EP18929049.7A EP3690679A4 (en) 2018-08-06 2018-08-06 MATRIX PROCESSING PROCESS AND APPARATUS, AND LOGIC CIRCUIT
US16/869,837 US11250108B2 (en) 2018-08-06 2020-05-08 Matrix processing method and apparatus, and logic circuit
US17/560,472 US11734386B2 (en) 2018-08-06 2021-12-23 Matrix processing method and apparatus, and logic circuit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/098993 WO2020029018A1 (zh) 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/869,837 Continuation US11250108B2 (en) 2018-08-06 2020-05-08 Matrix processing method and apparatus, and logic circuit

Publications (1)

Publication Number Publication Date
WO2020029018A1 true WO2020029018A1 (zh) 2020-02-13

Family

ID=69413876

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/098993 WO2020029018A1 (zh) 2018-08-06 2018-08-06 Matrix processing method and apparatus, and logic circuit

Country Status (4)

Country Link
US (2) US11250108B2 (zh)
EP (1) EP3690679A4 (zh)
CN (2) CN113190791A (zh)
WO (1) WO2020029018A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474458B2 (en) 2017-04-28 2019-11-12 Intel Corporation Instructions and logic to perform floating-point and integer operations for machine learning
US10719323B2 (en) * 2018-09-27 2020-07-21 Intel Corporation Systems and methods for performing matrix compress and decompress instructions
US20200210517A1 (en) * 2018-12-27 2020-07-02 Intel Corporation Systems and methods to accelerate multiplication of sparse matrices
PL3938894T3 (pl) 2019-03-15 2024-02-19 Intel Corporation Zarządzanie pamięcią wielokafelkową dla wykrywania dostępu krzyżowego między kafelkami, zapewnianie skalowanie wnioskowania dla wielu kafelków i zapewnianie optymalnej migracji stron
US11934342B2 (en) 2019-03-15 2024-03-19 Intel Corporation Assistance for hardware prefetch in cache access
EP3938888A1 (en) 2019-03-15 2022-01-19 INTEL Corporation Systolic disaggregation within a matrix accelerator architecture
US11663746B2 (en) * 2019-11-15 2023-05-30 Intel Corporation Systolic arithmetic on sparse data
US11144615B1 (en) * 2020-04-14 2021-10-12 Apple Inc. Circuit for performing pooling operation in neural processor
CN114168895A (zh) * 2020-09-11 2022-03-11 北京希姆计算科技有限公司 矩阵计算电路、方法、电子设备及计算机可读存储介质
CN115150614A (zh) * 2021-03-30 2022-10-04 中国电信股份有限公司 图像特征的传输方法、装置和系统
CN114527930B (zh) * 2021-05-27 2024-01-30 北京灵汐科技有限公司 权重矩阵数据存储方法、数据获取方法和装置、电子设备
CN113671009A (zh) * 2021-07-27 2021-11-19 浙江华才检测技术有限公司 基于人工智能算法搭建的矩阵式广谱性物质检测传感器
CN113496008B (zh) * 2021-09-06 2021-12-03 北京壁仞科技开发有限公司 用于执行矩阵计算的方法、计算设备和计算机存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100235415A1 (en) * 2009-03-12 2010-09-16 Oki Electric Industry Co., Ltd. Processing apparatus for calculating an approximate value to an analytical value with a tolerance maintained and a method therefor
CN102141976A (zh) * 2011-01-10 2011-08-03 中国科学院软件研究所 稀疏矩阵的对角线数据存储方法及基于该方法的SpMV实现方法
CN102436438A (zh) * 2011-12-13 2012-05-02 华中科技大学 基于gpu的稀疏矩阵数据存储方法
CN103336758A (zh) * 2013-06-29 2013-10-02 中国科学院软件研究所 一种稀疏矩阵的存储方法CSRL及基于该方法的SpMV实现方法
CN107689224A (zh) * 2016-08-22 2018-02-13 北京深鉴科技有限公司 合理使用掩码的深度神经网络压缩方法

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5905666A (en) 1995-01-03 1999-05-18 International Business Machines Corporation Processing system and method for performing sparse matrix multiplication by reordering vector blocks
US6230253B1 (en) 1998-03-31 2001-05-08 Intel Corporation Executing partial-width packed data instructions
US6041404A (en) 1998-03-31 2000-03-21 Intel Corporation Dual function system and method for shuffling packed data elements
US8775495B2 (en) * 2006-02-13 2014-07-08 Indiana University Research And Technology Compression system and method for accelerating sparse matrix computations
US20080126467A1 (en) * 2006-09-26 2008-05-29 Anwar Ghuloum Technique for transposing nonsymmetric sparse matrices
US8612723B2 (en) * 2008-05-06 2013-12-17 L-3 Communications Integrated Systems, L.P. System and method for storing a sparse matrix
US20100306300A1 (en) * 2009-05-29 2010-12-02 Microsoft Corporation Sparse Matrix Padding
US8364739B2 (en) * 2009-09-30 2013-01-29 International Business Machines Corporation Sparse matrix-vector multiplication on graphics processor units
WO2011156247A2 (en) * 2010-06-11 2011-12-15 Massachusetts Institute Of Technology Processor for large graph algorithm computations and matrix operations
CN102033854A (zh) * 2010-12-17 2011-04-27 中国科学院软件研究所 针对稀疏矩阵的数据存储方法及基于该方法的SpMV实现方法
US8862653B2 (en) 2011-04-26 2014-10-14 University Of South Carolina System and method for sparse matrix vector multiplication processing
CN102522983B (zh) * 2011-12-16 2014-06-04 宝鸡石油机械有限责任公司 一种矩阵式开关量驱动器
US9317482B2 (en) * 2012-10-14 2016-04-19 Microsoft Technology Licensing, Llc Universal FPGA/ASIC matrix-vector multiplication architecture
WO2014167730A1 (ja) * 2013-04-12 2014-10-16 富士通株式会社 圧縮装置、圧縮方法、および圧縮プログラム
US9367519B2 (en) * 2013-08-30 2016-06-14 Microsoft Technology Licensing, Llc Sparse matrix data structure
US9760538B2 (en) * 2014-12-22 2017-09-12 Palo Alto Research Center Incorporated Computer-implemented system and method for efficient sparse matrix representation and processing
CN104636273B (zh) * 2015-02-28 2017-07-25 中国科学技术大学 一种带多级Cache的SIMD众核处理器上的稀疏矩阵存储方法
US20160259826A1 (en) * 2015-03-02 2016-09-08 International Business Machines Corporation Parallelized Hybrid Sparse Matrix Representations for Performing Personalized Content Ranking
US10565207B2 (en) * 2016-04-12 2020-02-18 Hsilin Huang Method, system and program product for mask-based compression of a sparse matrix
CN107229967B (zh) * 2016-08-22 2021-06-15 赛灵思公司 一种基于fpga实现稀疏化gru神经网络的硬件加速器及方法
US10346507B2 (en) 2016-11-01 2019-07-09 Nvidia Corporation Symmetric block sparse matrix-vector multiplication
US11003985B2 (en) * 2016-11-07 2021-05-11 Electronics And Telecommunications Research Institute Convolutional neural network system and operation method thereof
US20180131946A1 (en) * 2016-11-07 2018-05-10 Electronics And Telecommunications Research Institute Convolution neural network system and method for compressing synapse data of convolution neural network
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
CN106846363A (zh) * 2016-12-29 2017-06-13 西安电子科技大学 一种改进稀疏矩阵的尺度自适应性压缩跟踪方法
US10146738B2 (en) * 2016-12-31 2018-12-04 Intel Corporation Hardware accelerator architecture for processing very-sparse and hyper-sparse matrix data
US10489480B2 (en) * 2017-01-22 2019-11-26 Gsi Technology Inc. Sparse matrix multiplication in associative memory device
CN107844322B (zh) * 2017-07-20 2020-08-04 上海寒武纪信息科技有限公司 用于执行人工神经网络正向运算的装置和方法
CN107562694A (zh) * 2017-08-23 2018-01-09 维沃移动通信有限公司 一种数据处理方法及移动终端
CN107977704B (zh) * 2017-11-10 2020-07-31 中国科学院计算技术研究所 权重数据存储方法和基于该方法的神经网络处理器
CN107944555B (zh) * 2017-12-07 2021-09-17 广州方硅信息技术有限公司 神经网络压缩和加速的方法、存储设备和终端
CN107909148B (zh) * 2017-12-12 2020-10-20 南京地平线机器人技术有限公司 用于执行卷积神经网络中的卷积运算的装置
JP2019148969A (ja) * 2018-02-27 2019-09-05 富士通株式会社 行列演算装置、行列演算方法および行列演算プログラム
GB2574060B (en) * 2018-05-25 2022-11-23 Myrtle Software Ltd Processing matrix vector multiplication
US20210065005A1 (en) * 2019-08-29 2021-03-04 Alibaba Group Holding Limited Systems and methods for providing vector-wise sparsity in a neural network

Also Published As

Publication number Publication date
CN111010883B (zh) 2022-07-12
CN111010883A (zh) 2020-04-14
US20200265108A1 (en) 2020-08-20
US20220114235A1 (en) 2022-04-14
CN113190791A (zh) 2021-07-30
US11734386B2 (en) 2023-08-22
EP3690679A4 (en) 2021-02-17
US11250108B2 (en) 2022-02-15
EP3690679A1 (en) 2020-08-05

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 18929049; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase
Ref document number: 2018929049; Country of ref document: EP; Effective date: 20200429
NENP Non-entry into the national phase
Ref country code: DE