WO2020062312A1 - Signal processing device and signal processing method - Google Patents

Signal processing device and signal processing method

Info

Publication number
WO2020062312A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
input
weight
buffer
signal processing
Prior art date
Application number
PCT/CN2018/109228
Other languages
English (en)
French (fr)
Inventor
郑明
邵琪
韩国伟
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201880094243.2A (CN112219210B)
Priority to PCT/CN2018/109228
Publication of WO2020062312A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to computer technology, and in particular, to a signal processing device, a signal processing method, and a computer-readable medium.
  • Convolutional Neural Network is a kind of multilayer neural network.
  • a processor performing a convolution operation usually converts a convolution of an input signal feature and a weight into a matrix multiplication operation between a signal matrix and a weight matrix.
  • the signal matrix and the weight matrix are divided into blocks to obtain multiple fractal signal matrices and fractal weight matrices, and then matrix multiplication and accumulation are performed on the multiple fractal signal matrices and fractal weight matrices.
  • the convolution operation can be converted into a matrix multiplication operation between the signal matrix (input matrix) and the weight matrix, that is, AxB ([MxK] x [KxN]), where A represents the signal matrix (input matrix) and B represents the weight matrix.
  • matrix A is the input matrix extracted from the input data according to the convolution kernel stride during convolution, that is, the input matrix obtained by converting the input signal features.
  • the input matrix and weight matrix are relatively large matrices.
  • the size of the matrix that the matrix multiplying circuit can handle at one time is smaller than the size of the input matrix and the weight matrix, so the large matrix multiplication needs to be split into a series of small matrix multiplications, from which the matrix multiplication results of different sizes are finally assembled. Even so, how to further improve calculation efficiency remains a problem.
  • the embodiments of the present application provide a signal processing device, a signal processing method, and a computer-readable medium, which can reduce the number of times that a matrix multiplier performs a matrix multiplication operation, thereby improving calculation efficiency.
  • an embodiment of the present application provides a signal processing apparatus.
  • the signal processing apparatus includes: a compression system for obtaining a compressed first matrix, and for compressing at least one input matrix to obtain a second matrix;
  • the first matrix and the second matrix satisfy the following definitions: the first matrix is obtained by removing at least one all-0 row from at least one weight matrix, and the second matrix is obtained by removing from the at least one input matrix at least one column corresponding to the at least one all-0 row; or the second matrix is obtained by removing at least one all-0 column from the at least one input matrix, and the first matrix is obtained by removing from the at least one weight matrix at least one row corresponding to the at least one all-0 column;
  • the input matrix includes a plurality of computer-processable signals, and the weight matrix includes a plurality of weight coefficients; and a matrix multiplier for obtaining the first matrix and the second matrix from the compression system and calculating a product of the second matrix and the first matrix.
  • the number of rows included in the second matrix is the same as the number of columns included in the first matrix.
  • the number of rows included in the first matrix is less than the number of rows included in the weight matrix
  • the number of columns included in the second matrix is less than the number of columns included in the input matrix.
  • a product of the input matrix and the weight matrix is equal to a product of the second matrix and the first matrix. Since the size of the second matrix is smaller than the size of the input matrix and the size of the first matrix is smaller than the size of the weight matrix, converting the multiplication of the input matrix and the weight matrix into the multiplication of the second matrix and the first matrix effectively reduces the amount of matrix computation.
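As a minimal illustration of this equivalence, a sketch in Python with NumPy follows (the matrices and sizes are invented for the example and are not taken from the application):

```python
import numpy as np

# Input matrix A (signals) and weight matrix B; row 2 of B is all 0.
A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2x3 input matrix
B = np.array([[7, 8],
              [0, 0],            # all-0 row
              [9, 1]])           # 3x2 weight matrix

keep = ~np.all(B == 0, axis=1)   # mask of non-all-0 rows of B
B_c = B[keep, :]                 # "first matrix":  2x2
A_c = A[:, keep]                 # "second matrix": 2x2 (corresponding columns removed)

# The product is unchanged, but the shared dimension shrinks from 3 to 2.
assert np.array_equal(A @ B, A_c @ B_c)
```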
  • the compression system is specifically configured to obtain the preset first matrix.
  • the compression system can directly obtain the first matrix without additional operations, and is simple to implement.
  • the first matrix may be preset in an external memory or other memories.
  • the first matrix may be preset in a compression system in a hardware form.
  • the compression system is specifically configured to compress the at least one weight matrix to obtain the first matrix.
  • the size of the matrix that the matrix multiplier needs to process or the number of times that the matrix multiplication needs to be performed can be reduced, thereby improving the calculation efficiency.
  • the compression system includes: a processor and a data compression unit, where the processor is configured to compress the at least one weight matrix to obtain the first matrix; and / or, The data compression unit is configured to compress the at least one input matrix to obtain the second matrix.
  • the weight matrix and the input matrix are compressed by the processor and the data compression unit, respectively, and the implementation is simple.
  • the processor is further configured to generate compression information, where the compression information is used to indicate the at least one all-0 row; and the data compression unit is further configured to compress the at least one input matrix according to the compression information to obtain the second matrix.
  • the data compression unit can accurately and quickly remove, according to the compression information, the columns corresponding to the all-0 rows from the at least one input matrix, so as to obtain the second matrix; the implementation is simple.
  • the signal processing device further includes: a direct memory access controller DMAC and a weight buffer, the DMAC is coupled to the weight buffer and an external memory; the processor is further configured to: Storing the first matrix and the compression information in the external memory; the DMAC is used to move the first matrix from the external memory to the weight buffer, and is used to store the compression Information is moved from the external memory to the data compression unit; the matrix multiplier is further configured to obtain the first matrix from the weight buffer.
  • the DMAC can move the first matrix to the weight buffer and the compression information to the data compression unit in time, so that the data compression unit compresses the input matrix and the matrix multiplier quickly obtains the first matrix.
  • the signal processing device further includes a raw data buffer and an input buffer; the DMAC is further configured to move the at least one input matrix from the external memory to the raw data buffer.
  • the data compression unit is further configured to obtain the at least one input matrix from the raw data buffer, compress the at least one input matrix to obtain the second matrix, and store the second matrix in the input buffer.
  • the matrix multiplier is further configured to obtain the second matrix from the input buffer. In this implementation, at least one input matrix can be quickly compressed to obtain a second matrix, and stored in the input buffer.
  • the compression system includes a processor and a data compression unit, and the processor is configured to compress the at least one input matrix to obtain the second matrix; and / or, The data compression unit is configured to compress the at least one weight matrix to obtain the first matrix.
  • the processor and the data compression unit compress the input matrix and the weight matrix, respectively, and the implementation is simple.
  • the processor is further configured to generate compression information, where the compression information is used to indicate the at least one all 0 column; and the data compression unit is further configured to be based on the compression information. Compressing the at least one weight matrix to obtain the first matrix.
  • the data compression unit may accurately and quickly remove, according to the compression information, the rows corresponding to the all-0 columns from the at least one weight matrix, so as to obtain the first matrix; the implementation is simple.
  • the signal processing device further includes: a direct memory access controller DMAC and an input buffer, the DMAC is coupled to the input buffer and an external memory; the processor is further configured to: Storing the second matrix and the compression information in the external memory; the DMAC is used to move the second matrix from the external memory to the input buffer, and is used to store the compression Information is moved from the external memory to the data compression unit; the matrix multiplier is further configured to obtain the second matrix from the input buffer.
  • the DMAC can move the second matrix to the input buffer and the compression information to the data compression unit in time, so that the data compression unit compresses the weight matrix and the matrix multiplier quickly obtains the second matrix.
  • the signal processing device further includes a raw data buffer and a weight buffer; the DMAC is further configured to move the at least one weight matrix from the external memory to the raw data buffer.
  • the data compression unit is further configured to obtain the at least one weight matrix from the raw data buffer, compress the at least one weight matrix to obtain the first matrix, and store the first matrix in the weight buffer.
  • the matrix multiplier is further configured to obtain the first matrix from the weight buffer.
  • the at least one weight matrix can be quickly compressed to obtain the first matrix, which is stored in the weight buffer.
  • the signal processing device further includes an accumulation unit, which is configured to add a product of the second matrix and the first matrix to obtain a processing result.
  • an accumulator is used to accumulate the product of the second matrix and the first matrix to obtain a processing result, and the implementation is simple.
  • the processor is further configured to perform at least one of the following: splitting the original weight matrix to obtain the at least one weight matrix; or splitting the original input matrix to obtain the At least one input matrix.
  • the original weight matrix and the original input matrix are split, so that the product of the original input matrix and the original weight matrix can be calculated using the weight matrices and input matrices obtained by the split.
  • the plurality of computer-processable signals include at least one of a voice signal, a text signal, or an image signal.
  • the processor is specifically configured to read the weight matrix from the external memory when no convolution operation task is being performed, splice the non-all-0 rows in the weight matrix to obtain the first matrix, and send the first matrix to the external memory.
  • the convolution operation task refers to a task that needs to perform a convolution operation.
  • the processor can compress the weight matrix while no convolution or FC operation is being performed, instead of compressing the weight matrix during the convolution or FC operation, which saves the time overhead of compressing the weight matrix and improves calculation efficiency.
  • an embodiment of the present application provides a signal processing method.
  • the method includes: obtaining a compressed first matrix, and compressing at least one input matrix to obtain a second matrix;
  • the first matrix and the second matrix satisfy the following definition: the first matrix is obtained by removing at least one all-0 row from at least one weight matrix, and the second matrix is obtained by removing from the at least one input matrix at least one column corresponding to the at least one all-0 row; or the second matrix is obtained by removing at least one all-0 column from the at least one input matrix, and the first matrix is obtained by removing from the at least one weight matrix at least one row corresponding to the at least one all-0 column; the input matrix includes multiple computer-processable signals, and the weight matrix includes multiple weight coefficients; and calculating a product of the second matrix and the first matrix.
  • by compressing the at least one weight matrix and the at least one input matrix, the signal processing device can reduce the number of times the matrix multiplier performs the matrix multiplication operation, thereby improving calculation efficiency.
  • the obtaining the compressed first matrix and the second matrix includes: obtaining a preset first matrix.
  • the obtaining the compressed first matrix and the second matrix includes: compressing the at least one weight matrix to obtain the first matrix.
  • the method further includes: generating compression information used to indicate the at least one all-0 row; and obtaining the second matrix includes: compressing the at least one input matrix according to the compression information to obtain the second matrix.
  • the method further includes: generating compression information used to indicate the at least one all-0 column; and obtaining the first matrix includes: compressing the at least one weight matrix according to the compression information to obtain the first matrix.
  • the method further includes: accumulating a product of the second matrix and the first matrix to obtain a processing result.
  • before the compressing at least one weight matrix to obtain a first matrix and compressing at least one input matrix to obtain a second matrix, the method further includes at least one of the following: splitting the original weight matrix to obtain the at least one weight matrix, or splitting the original input matrix to obtain the at least one input matrix.
  • the plurality of computer-processable signals include at least one of a voice signal, a text signal, or an image signal.
  • an embodiment of the present application provides another signal processing apparatus.
  • the signal processing apparatus includes: a compression unit, configured to obtain a compressed first matrix and to compress at least one input matrix to obtain a second matrix;
  • the first matrix and the second matrix satisfy the following definition: the first matrix is obtained by removing at least one all-0 row from the at least one weight matrix, and the second matrix is obtained by removing from the at least one input matrix at least one column corresponding to the at least one all-0 row; or the second matrix is obtained by removing at least one all-0 column from the at least one input matrix, and the first matrix is obtained by removing from the at least one weight matrix at least one row corresponding to the at least one all-0 column;
  • the input matrix includes a plurality of computer-processable signals, and the weight matrix includes a plurality of weight coefficients; and a calculation unit for calculating a product of the second matrix and the first matrix.
  • the compression unit is further configured to: obtain the preset first matrix.
  • the compression unit is further configured to: compress the at least one weight matrix to obtain the first matrix.
  • the compression unit is further configured to generate compression information used to indicate the at least one all-0 row, and to compress the at least one input matrix according to the compression information to obtain the second matrix.
  • the compression unit is further configured to generate compression information used to indicate the at least one all-0 column, and to compress the at least one weight matrix according to the compression information to obtain the first matrix.
  • the signal processing device further includes an accumulation unit, which is configured to add a product of the second matrix and the first matrix to obtain a processing result.
  • the signal processing device further includes a splitting unit, configured to perform at least one of the following: splitting the original weight matrix to obtain the at least one weight matrix, or splitting the original input matrix to obtain the at least one input matrix.
  • the plurality of computer-processable signals include at least one of a voice signal, a text signal, or an image signal.
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer storage medium stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method of the second aspect and any optional implementation manner thereof.
  • an embodiment of the present application provides a computer program product.
  • the computer program product includes program instructions that, when executed by a processor, cause the processor to perform the method of the second aspect and any one of the foregoing optional implementation manners.
  • an embodiment of the present application provides a device including a memory and a processor; the memory is configured to store program instructions, and the processor is configured to execute the program instructions to perform the method of the foregoing second aspect and any optional implementation manner.
  • FIG. 1 is a schematic diagram of a neural network according to an embodiment of the present application.
  • FIG. 2 is a specific implementation scenario of a neural network provided by an embodiment of this application.
  • FIG. 3 is another specific implementation scenario of a neural network provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a method for matrix division and multiplication provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a matrix splitting and multiplication architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a hardware architecture of a signal processing device according to an embodiment of the present application.
  • FIG. 7 is a flowchart of a signal processing method according to an embodiment of the present application.
  • FIG. 8A is a schematic diagram of compressing an original weight matrix according to an embodiment of the present application.
  • FIG. 8B is a schematic diagram of compressing an input matrix according to an embodiment of the present application.
  • FIG. 8C is a schematic diagram of compressing at least one input matrix according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a stitching sub-matrix according to an embodiment of the present application.
  • FIG. 10 is a flowchart of another signal processing method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a hardware architecture of another signal processing apparatus according to an embodiment of the present application.
  • FIG. 12 is a flowchart of a signal processing method according to another embodiment of the present application.
  • FIG. 13 is a schematic diagram of a compression weight matrix according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a sub-matrix multiplication provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a stitching sub-matrix multiplication provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of still another signal processing apparatus according to an embodiment of the present application.
  • the neural network 100 has N processing layers, where N ≥ 3 and N is a natural number.
  • the first layer of the neural network is the input layer 101, which is responsible for receiving input signals.
  • the last layer of the neural network is the output layer 103, which outputs the processing results of the neural network.
  • the other layers except the first layer and the last layer are the intermediate layers 104, and these intermediate layers together form the hidden layer 102.
  • each intermediate layer in the hidden layer can receive an input signal and can output a signal.
  • the hidden layer is responsible for the processing of input signals.
  • Each layer represents a logical level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
  • the processing function may be a rectified linear unit (ReLU), a hyperbolic tangent function (tanh), or a sigmoid function.
  • The relationship between the input signal and the output signal is shown in equation (2), where b_i is the offset value of the neural network processing function; the offset value adjusts the input of the neural network to obtain the ideal output result. For the third neuron, for example:
  • h_3 = f(W_31 x_1 + W_32 x_2 + W_33 x_3 + b_3)
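As a concrete sketch of this computation in plain Python (the weights, inputs, and the choice of ReLU as f are illustrative assumptions, not values from the application):

```python
# One output neuron: h_3 = f(W_31*x_1 + W_32*x_2 + W_33*x_3 + b_3)
def relu(z):
    return max(0.0, z)           # an example processing function f

W3 = [0.2, -0.5, 0.7]            # weights W_31, W_32, W_33 (made up)
x  = [1.0, 2.0, 3.0]             # input signals x_1, x_2, x_3 (made up)
b3 = 0.1                         # offset b_3 (made up)

h3 = relu(sum(w * xi for w, xi in zip(W3, x)) + b3)
print(h3)                        # -> approximately 1.4
```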
  • the input signal of the neural network may be various signals such as a voice signal, a text signal, an image signal, or a temperature signal.
  • the voice signal may be a voice signal recorded by a recording device, a mobile phone, or a fixed-line telephone.
  • the text signal can be a TXT, Word, or PDF text signal.
  • the image signal can be, for example, a landscape signal taken by a camera.
  • the input signals of this neural network also include various other computer-processable engineering signals, which are not enumerated here one by one.
  • the processing performed by the hidden layer 102 of the neural network may be processing such as removing noise signals mixed in the speech signal to enhance the speech signal, understanding specific content in the text signal, and recognizing the facial image signal of the human face.
  • the neural network 100 provided in the embodiment of the present application can be applied to various devices.
  • smart phones 202 and 204 have built-in devices related to the neural network 100.
  • the smartphone user 201 initiates a voice call to the smartphone user 205; the voice signal is sent via the smartphone 202 and transmitted to the smartphone 204 via the base station 203.
  • when the voice call is initiated, heavy rain with strong thunder and lightning occurs, which causes the input signal 206 to be severely weakened and to contain large noise.
  • the input signal can be a one-dimensional digital voice signal.
  • the smart phone 204 is equipped with a neural network 100.
  • the neural network can be implemented in a chip in the form of a dedicated circuit.
  • the input signal 206 is processed in the neural network in the smart phone 204.
  • the processing includes noise removal and effective signal enhancement to obtain an output signal 207.
  • the output signal completely retains the voice information transmitted by the calling user, avoiding interference to the signal from the harsh natural environment.
  • the embodiment of the present application provides another specific implementation scenario of the neural network 100.
  • a car 303 runs at a high speed, and a passerby 301 uses a digital camera 302 to take a picture of the license plate number of the car 303.
  • because the car 303 travels at a high speed v, motion blur occurs in the input signal 304 of the digital camera.
  • the input signal is a two-dimensional digital image signal.
  • the digital camera 302 is equipped with a neural network 100.
  • the neural network may be implemented in a chip in the form of a dedicated circuit, or as a software module running in an image signal processor.
  • the processing includes the estimation of the car's motion model and the removal of motion blur to obtain an output signal 305.
  • the clarity of the license plate information contained in the output signal is improved, and the license plate number can be accurately identified.
  • convolutional neural networks widely used in image recognition, audio recognition and other fields often need to perform a large number of matrix multiplication operations.
  • Performing matrix multiplication operations requires a very high memory bandwidth and a large amount of calculations.
  • the convolution operation and the fully connected (FC) operation in the convolutional neural network are converted into a matrix multiplication operation of AxB ([MxK] x [KxN]).
  • A and B each represent a matrix
  • M represents the number of rows of matrix A
  • K represents the number of columns of matrix A and the number of rows of matrix B
  • N represents the number of columns of matrix B
  • AxB represents the multiplication of matrix A and matrix B.
  • the input matrix and weight matrix are relatively large matrices.
  • the size of the matrix that can be processed by current hardware is usually smaller than the input matrix and weight matrix, so large matrix multiplication may need to be split into a series of small matrix multiplications, and matrix multiplication results of different sizes are finally obtained from the multiple small matrix multiplications.
  • FIG. 4 is a schematic diagram of a matrix division and multiplication method provided by an embodiment of the present application.
  • the leftmost matrix is an input matrix
  • the middle matrix is a weight matrix
  • the rightmost matrix is an output matrix.
  • the size of the input matrix is 3Hx3H
  • the size of the weight matrix is 3Hx2H
  • the size of the output matrix obtained by multiplying the two matrices is 3Hx2H.
  • the processing capability of the hardware is HxH matrix multiplication.
  • the input matrix and weight matrix need to be split into multiple HxH matrices, as shown in Figure 4.
  • the input matrix is split into A0 to A8, and the weight matrix is split into B0 to B5.
  • the product of two HxH matrices is calculated each time, sliding in units of H points in the horizontal and vertical directions. After multiplying and accumulating the matrices in this way, the complete output matrix is finally obtained.
  • A0xB0, A1xB2, and A2xB4 can be calculated in sequence, and the matrices obtained from the three calculations are added to obtain C0. C1 to C5 are calculated in the same way as C0, and then C0 to C5 are combined into the output matrix.
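The following is a short sketch of this tiling in Python with NumPy (the tile size H and the random matrices are illustrative assumptions; the k-loop mirrors the A0xB0 + A1xB2 + A2xB4 accumulation that produces C0):

```python
import numpy as np

H = 4                               # illustrative tile size (assumption)
A = np.random.rand(3 * H, 3 * H)    # input matrix, 3H x 3H
B = np.random.rand(3 * H, 2 * H)    # weight matrix, 3H x 2H

C = np.zeros((3 * H, 2 * H))
for i in range(3):                  # tile row of A / C
    for j in range(2):              # tile column of B / C
        for k in range(3):          # shared dimension, accumulated
            a = A[i*H:(i+1)*H, k*H:(k+1)*H]   # e.g. A0, A1, A2 for i == 0
            b = B[k*H:(k+1)*H, j*H:(j+1)*H]   # e.g. B0, B2, B4 for j == 0
            C[i*H:(i+1)*H, j*H:(j+1)*H] += a @ b

assert np.allclose(C, A @ B)        # tiled result matches the full product
```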
  • the leftmost matrix represents matrix A, such as A0 in FIG. 4
  • the middle matrix represents matrix B, such as B0 in FIG. 4
  • the rightmost matrix represents matrix C.
  • the calculation formula of each element included in matrix C is as follows: C(i,j) = Σ_k A(i,k) × B(k,j).
  • the first matrix from the left is matrix A
  • the second matrix from the left is matrix B
  • the third matrix from the left is matrix C
  • when the elements of the first row of matrix B are all 0, each data in the first column of matrix A is multiplied only by a value of 0.
  • when the elements of the second row of matrix B are all 0, each data in the second column of matrix A is multiplied only by a value of 0.
  • when the elements of the third row of matrix B are all 0, each data in the third column of matrix A is multiplied only by a value of 0.
  • the Mth row in the matrix B corresponds to the Mth column in the matrix A.
  • when the elements of the Mth row of matrix B are all 0, the elements of the Mth column of matrix A are multiplied only by the value 0; likewise, when the elements of the Mth column of matrix A are all 0, the elements of the Mth row of matrix B are multiplied only by the value 0.
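This follows directly from the definition of matrix multiplication (a standard identity, restated here for clarity):

$$(AB)_{ij} = \sum_{k=1}^{K} A_{ik}\,B_{kj}$$

If the Mth row of B is all 0, the term A_{iM} B_{Mj} is 0 for every i and j, so the Mth column of A contributes nothing to any element of the product; deleting the Mth row of B and the Mth column of A therefore leaves AB unchanged. The symmetric argument applies when the Mth column of A is all 0.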
  • as shown in formula (13), when there is an entire row of 0 data in matrix B, the [3x3] x [3x3] matrix multiplication can be converted into a [3x2] x [2x3] matrix multiplication.
  • formula (13) can be converted into the following formula:
  • the first matrix from the left is the compressed version of the first matrix from the left in formula (13), and the second matrix from the left is the compressed version of the second matrix from the left in formula (13). It can be understood that, when matrix A and matrix B are multiplied, if at least one entire row in matrix B or at least one entire column in matrix A is 0, matrix A and matrix B can be compressed to reduce the number of multiplication and addition operations, thereby reducing the power consumption and bandwidth overhead caused by matrix operations.
  • a matrix compression method is as follows: in the case that matrix B (the weight matrix) includes N all-0 rows, the non-all-0 rows of matrix B are sequentially stitched to obtain the compressed matrix B, and the target columns of matrix A (the input matrix) are stitched to obtain the compressed matrix A, where the target columns are the columns of matrix A other than the N columns corresponding to the above N all-0 rows; the Mth row of matrix B corresponds to the Mth column of matrix A, and N and M are both integers greater than 0.
  • symmetrically, in the case that matrix A (the input matrix) includes N all-0 columns, the non-all-0 columns of matrix A are sequentially stitched to obtain the compressed matrix A, and the target rows of matrix B are stitched to obtain the compressed matrix B, where the target rows are the rows in matrix B other than the N rows corresponding to the above N all-0 columns; the Mth column of matrix A corresponds to the Mth row of matrix B, and N and M are both integers greater than 0.
  • the second and fourth rows of matrix B are all 0 rows
  • the non-all 0 rows of matrix B are spliced in order to obtain compressed matrix B.
  • the columns in A except the second and fourth columns are spliced in order to obtain a compressed matrix A.
  • the second and fourth columns of matrix A are all-0 columns, and the non-all-0 columns of matrix A are stitched in order to obtain the compressed matrix A.
  • the rows in the matrix B other than the second row and the fourth row are sequentially stitched to obtain a compressed matrix B.
  • the above method can also be used to compress a submatrix obtained by splitting a large matrix.
  • the compressed result may be further split, which is not limited in this embodiment.
  • FIG. 5 is a schematic diagram of a matrix division and multiplication architecture according to an embodiment of the present application.
  • the matrix multiplier calculates the product of the input small matrix a_{i,k} and the small matrix b_{k,j}, and outputs the result to the accumulator.
  • the accumulator adds the small matrix c_{i,j} to the output of the matrix multiplier.
  • the matrix division and multiplication architecture in FIG. 5 mainly implements the following matrix multiplication formula: c'_{i,j} = c_{i,j} + a_{i,k} × b_{k,j} (accumulated over k).
  • Each sub-matrix (including at least two elements) in the matrix A can be understood as one element of the matrix A.
  • the input matrix is divided into 9 small matrices (A0 to A8), and A0 to A8 can be understood as the elements of the input matrix (the small matrices obtained by the split).
  • A4 is the small matrix in the second row and second column after the input matrix is split.
  • the small matrix a_{i,k} can be understood as the small matrix in the i-th row and k-th column after matrix A is split
  • the small matrix b_{k,j} can be understood as the small matrix in the k-th row and j-th column after matrix B is split
  • the small matrix c_{i,j} can be understood as the accumulation result for the i-th row and j-th column from the previous multiplications of matrix A and matrix B
  • c'_{i,j} is the current accumulation result for the i-th row and j-th column.
  • FIG. 6 is a schematic diagram of a hardware architecture of a signal processing apparatus according to an embodiment of the present application, which is used to implement a computing function of the neural network 100.
  • the signal processing device in the embodiment of the present application can be applied to various devices that can perform matrix multiplication operations, such as a mobile phone, a tablet computer, a server, and a wearable device.
  • the signal processing device may include at least one of a circuit board, a chip, or a chipset, or an associated software program. It includes: an external memory 602 for storing the original weight matrix.
  • the external memory 602 can also store the original input matrix and other data.
  • a central processing unit (CPU) 601 is configured to read the original weight matrix from the external memory 602, compress the original weight matrix to obtain a compression weight matrix, and send the compression weight matrix to the external memory 602, where the compression weight matrix is obtained by removing at least one all-0 row from the original weight matrix; the CPU is also used to generate compression information and send the compression information to the external memory 602, where the compression information is used to indicate the at least one all-0 row.
  • any larger matrix can be split during compression, before compression, or after compression.
  • the CPU 601 is configured to read the original weight matrix from the external memory 602, compress the original weight matrix to obtain a compressed weight matrix, and send the matrix obtained by splitting the compressed weight matrix to the external memory 602.
  • the CPU 601 is configured to read the original weight matrix from the external memory 602. First, the original weight matrix is divided to obtain a weight matrix, and the weight matrix is further compressed to obtain a matrix to be provided to the external memory 602.
  • the CPU 601 reads the original weight matrix temporarily stored in the external memory 602 to perform data compression.
  • during the compression process, all-0 row data is deleted, and the numbers of the deleted rows are recorded (the row numbers can be recorded in a K-table).
  • the compressed original weight matrix and the row numbers are written back to the external memory 602.
  • the compressed data of this part is written into the external memory 602 first.
  • the data compression of the original weight matrix may be performed by deleting the all-0 rows of data in the original weight matrix and then splicing the remaining rows, or by extracting the non-all-0 rows of data in the original weight matrix and stitching them.
  • the CPU 601 may be replaced with another type of processor, such as a microprocessor, a microcontroller, a neural network processor (Neural Network Processing Unit, NPU), or a digital signal processor (Digital Signal Processor, DSP).
  • the CPU 601 can also be replaced by dedicated hardware, such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components; this is not limited in the embodiments of the present application. Therefore, a processor for performing compression is a broad concept: it may be a processor executing a software program, a pure hardware logic circuit, or a combination of the two. A processor that executes a software program is a common implementation.
  • the CPU 601 is also used to control the direct memory access controller (DMAC) 610 to move the compressed original weight matrix, that is, the compression weight matrix, from the external memory 602 to the weight buffer 603, and to control the DMAC 610 to move the compression information (row numbers) from the external memory 602 to the data compression unit 606. It can also control the DMAC 610 to move the uncompressed original input matrix, or at least one input matrix obtained by splitting the original input matrix, from the external memory 602 or the result buffer 609 to the raw data buffer 607 for temporary storage.
  • DMAC is short for Direct Memory Access Controller.
  • the CPU 601 is further configured to read the original input matrix from the external memory 602 or the result memory 609 (the previous calculation result is used as the current input), and split the original input matrix to obtain at least one input matrix.
  • the obtained at least one input matrix is written into the external memory 602.
  • the CPU 601 is further configured to split the compression weight matrix, and control the DMAC 610 to move the first matrix obtained by splitting the compression weight matrix to the weight buffer 603.
  • the CPU 601 splits the compression weight matrix to obtain at least one weight matrix, and the DMAC 610 may sequentially move the weight matrices obtained by splitting the compression weight matrix to the weight buffer 603.
  • the split operation can be performed before compression.
  • the CPU 601 is further configured to instruct the raw data buffer 607 to split the uncompressed original input matrix and import at least one input matrix obtained by the split into the data compression unit 606, or to instruct the raw data buffer 607 to import the above-mentioned at least one input matrix stored therein into the data compression unit 606.
  • the CPU 601 is further configured to instruct the data compression unit 606 to compress the at least one input matrix.
  • the CPU 601 is further configured to determine whether to split the original input matrix or the original weight matrix.
  • the data compression unit 606 is configured to compress the at least one input matrix according to the compression information, and write the compressed at least one input matrix (the second matrix) to the input buffer 604.
  • a matrix multiplier 605 is configured to obtain the first matrix from the weight buffer 603, obtain the second matrix from the input buffer 604, and calculate a product of the second matrix and the first matrix.
  • the accumulator 608 is configured to accumulate a product of the second matrix and the first matrix to obtain a processing result, and store the processing result in a result memory 609.
  • the compression system in the embodiment of the present application may include a data compression unit 606 and a CPU 601.
  • the components 603 to 610 in the figure may be integrated in an integrated circuit or chip, or may be further integrated with the CPU 601. It can be understood that the components 603 to 610 in the figure may be components included in an arithmetic accelerator, and the arithmetic accelerator is mounted on the CPU 601 to improve the performance of the CPU 601 in certain aspects.
  • the external memory 602 may or may not be integrated with the components 603 to 610 in the figure, and may or may not be integrated with the CPU 601. Here, "external" is relative to the compression system. It is a more common solution for the external memory 602 to exist independently, not integrated with the components 601 or 603 to 610.
  • the external memory 602 may be a double data rate synchronous dynamic random access memory (DDR SDRAM) or a high bandwidth memory (HBM).
  • the external memory may be a memory dedicated to the hardware architecture of the signal processing device or a general-purpose memory, which is not limited in this embodiment.
  • the CPU 601 acts like a manager, responsible for controlling components 602 to 610. It can be understood that 602 to 610 in FIG. 6 work under the control of the CPU 601.
  • DMA is short for Direct Memory Access.
  • DMA allows hardware devices of different speeds to communicate without having to rely on a large number of interrupt loads from the CPU, and does not require the CPU to directly control transmission, which can greatly improve the efficiency of the CPU.
  • the DMAC is directly in charge of the bus. After the DMAC obtains the bus control right, the CPU immediately suspends or executes only internal operations, and the DMAC outputs read and write commands to directly control the memory and various I / O interfaces for DMA transfer. Under the control of the DMAC, data is directly transferred between the memory and the external device, and the central processor is not required to participate in the transfer process.
  • the matrix multiplier 605 includes multiple processing engines (PEs).
  • the matrix multiplier 605 is a general-purpose matrix processor. For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The matrix multiplier 605 takes the data corresponding to matrix B from the weight buffer 603 and buffers the data on each PE. It takes matrix A data from the input buffer 604, performs the matrix operation with matrix B, and then performs continuous addition operations in the accumulator 608. Partial or final results of the obtained matrix are stored in the result memory 609.
  • FIG. 7 is a flowchart of a signal processing method according to an embodiment of the present application.
  • the method may include: 701. The CPU 601 reads the original weight matrix from the external memory 602, compresses the original weight matrix to obtain a compression weight matrix and compression information, and stores the compression weight matrix and the compression information in the external memory 602.
  • the compression information is used to indicate all 0 rows in the original weight matrix.
  • the aforementioned original weight matrix contains N all 0 rows, where N is an integer greater than 0.
  • the compression information is a binary sequence, and each binary value in the binary sequence indicates whether a row in the original weight matrix is all 0 rows.
  • for example, the original weight matrix includes 12 rows, and only the 5th and 8th rows of the original weight matrix are all-0 rows.
  • the compression information obtained by the CPU 601 compressing the original weight matrix is 111101101111 (a binary sequence). In the binary sequence, from left to right, the binary values correspond to the first to twelfth rows of the original weight matrix; the rows corresponding to 1 are non-all-0 rows, and the rows corresponding to 0 are all-0 rows.
  • the way to compress the original weight matrix can be to delete the all-0 rows of data from the original weight matrix and then stitch the remaining rows, or to extract the non-all-0 rows of data from the original weight matrix and stitch them.
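A sketch of this compression step in Python with NumPy (the function name and the 12x4 shape are assumptions for illustration; the K-table encoding follows the 111101101111 example above):

```python
import numpy as np

def compress_weights(W):
    """Delete all-0 rows of W and record a K-table bitmask (1 = row kept)."""
    keep = ~np.all(W == 0, axis=1)        # True for non-all-0 rows
    k_table = keep.astype(int)            # e.g. [1 1 1 1 0 1 1 0 1 1 1 1]
    return W[keep, :], k_table

W = np.zeros((12, 4))
W[[0, 1, 2, 3, 5, 6, 8, 9, 10, 11], :] = 1.0   # rows 5 and 8 (1-based) stay all 0
W_c, k_table = compress_weights(W)
print(k_table)        # -> [1 1 1 1 0 1 1 0 1 1 1 1], matching 111101101111
print(W_c.shape)      # -> (10, 4)
```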
  • FIG. 8A is a schematic diagram of compressing an original weight matrix provided by an embodiment of the present application.
  • 800 represents the uncompressed original weight matrix
  • 810 represents the compressed original weight matrix; each small rectangular region corresponds to one element in the matrix, the black solid portions are non-zero data, and the white portions are 0-valued data
  • 801, 802, and 803 represent all 0 rows of data in the original weight matrix.
  • when the CPU 601 compresses the original weight matrix, it can delete the all-0 rows of data and then stitch the non-all-0 rows to obtain the compressed original weight matrix (the compression weight matrix).
  • the CPU 601 records the numbers of the deleted rows, as shown in the K-table (compression information) in FIG. 8A.
  • the row number (K-table) can be used as a guide when compressing the input matrix.
  • the row numbers comprise binary values that correspond in turn to the columns of the input matrix. Specifically, the binary value corresponding to the first row of the original weight matrix corresponds to the first column of the input matrix, the binary value corresponding to the last row of the original weight matrix corresponds to the last column of the input matrix, and so on.
  • a column corresponding to a binary value 0 in the row numbers is a column to be deleted from the input matrix, and a column corresponding to 1 is a column to be retained.
  • the compression ratio varies depending on the degree of sparseness of the original weight matrix.
  • the original weight matrix is a 128x64 matrix
  • the original weight matrix has 64 rows of all 0 values
  • the compressed original weight matrix is a 64x64 matrix.
  • the compression information may refer to the foregoing row numbers.
  • the binary value sequences in FIGS. 8A, 8B, and 8C refer to the compression information (K-table).
  • 702. The CPU 601 controls the DMAC 610 to move the first matrix obtained by splitting the compression weight matrix from the external memory 602 to the weight buffer 603, and controls the DMAC 610 to move the compression information from the external memory 602 to the data compression unit 606.
  • the CPU 601 sends at least one weight matrix obtained by splitting the compressed weight matrix to the external memory 602.
  • the CPU 601 instructs the external memory 602 to transmit the compression weight matrix to the weight buffer 603, that is, instructs the external memory 602 which part of the compression weight matrix (the first matrix) to transmit to the weight buffer 603.
  • 703. The CPU 601 controls the DMAC 610 to move the original input matrix or at least one input matrix from the external memory 602 or the result memory 609 to the raw data buffer 607.
  • the at least one input matrix may be obtained by splitting the original input matrix.
  • the CPU 601 reads the original input matrix from the external memory 602, splits the original input matrix to obtain the at least one input matrix, and writes the obtained at least one input matrix to the external memory 602.
  • alternatively, the CPU 601 reads the original input matrix from the result memory 609, splits the original input matrix to obtain the at least one input matrix, and writes the obtained at least one input matrix to the result memory 609.
  • the order in which the CPU 601 executes 702 and 703 is not limited.
  • 702 may be executed before 703, 702 and 703 may be executed simultaneously, or 703 may be executed before 702.
  • 704. The CPU 601 instructs the raw data buffer 607 to import at least one input matrix included in the original input matrix into the data compression unit 606, or instructs the raw data buffer 607 to import at least one input matrix stored therein into the data compression unit 606.
  • the size of any one of the at least one input matrix is less than or equal to the size of the largest matrix that can be processed by the matrix multiplier.
  • 705. The CPU 601 instructs the data compression unit 606 to compress the at least one input matrix.
  • the raw data buffer 607 may import a part of the original input matrix (at least one input matrix) into the data compression unit 606 each time according to an instruction of the CPU 601. It can be understood that the at least one input matrix is a sub-matrix of the original input matrix.
  • FIG. 8C is a schematic diagram of compressing at least one input matrix according to an embodiment of the present application. As shown in FIG. 8C, the original input matrix is a 12x12 matrix, the original input matrix is split into 16 3x3 matrices, and the K-table is the compression information corresponding to the original input matrix.
  • when two input matrices obtained by splitting the original input matrix correspond to different columns of the original input matrix, the two input matrices correspond to different parts of the compression information.
  • the K-table is divided into four parts from left to right, and each part corresponds to three columns.
  • the input matrix A and the input matrix E correspond to the first three columns of the original input matrix, and these two input matrices correspond to the first part of the K-table;
  • the input matrix B and the input matrix F correspond to the fourth to sixth columns of the original input matrix, and these two input matrices correspond to the second part of the K-table.
  • the data compression unit 606 may compress the input matrix A and the input matrix E according to the first part of the compression information (K-table); and may compress the input matrix B and the input matrix F according to the second part of the compression information, so as to obtain each input matrix after compression.
  • the data compression unit 606 deletes the second column of the input matrix A and stitches the first column and the third column of the input matrix A together to obtain the compressed input matrix A. In practical applications, the data compression unit 606 may sequentially compress the at least one input matrix obtained by splitting the original input matrix.
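A brief sketch of this per-block compression in Python with NumPy (the K-table values, the 3x3 block size, and the function name are invented for illustration):

```python
import numpy as np

k_table = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1])  # per-column keep flags

def compress_submatrix(sub, col_offset):
    """Compress one block using the K-table part covering its columns."""
    part = k_table[col_offset:col_offset + sub.shape[1]]   # K-table part for this block
    return sub[:, part == 1]                               # drop columns flagged 0

A_block = np.arange(9).reshape(3, 3)     # e.g. input matrix A, columns 0..2
A_compressed = compress_submatrix(A_block, col_offset=0)
print(A_compressed.shape)                # -> (3, 2): second column removed
```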
  • the ping-pong buffer includes a ping buffer and a pong buffer.
  • the storage spaces of the ping buffer and the pong buffer are the same size, and the size of an input matrix obtained by splitting the original input matrix may be the same as the size of the largest matrix that the ping buffer can store.
  • FIG. 9 is a schematic diagram of a stitched and compressed input matrix according to an embodiment of the present application.
  • the small matrix A, small matrix B, small matrix C, and small matrix D in FIG. 9 are the input matrix A, the input matrix B, the input matrix C, and the input matrix D obtained by splitting the original input matrix in FIG. 8C.
  • 706. The data compression unit 606 stitches the compressed input matrices into the ping-pong buffer. The process is as follows: write the compressed input matrix A into the ping buffer; fill up the remaining storage space of the ping buffer (that is, write the first column of the compressed input matrix B into the ping buffer), then write the second column (the remaining column) of the compressed input matrix B into the pong buffer; and output the matrix in the ping buffer to the input buffer.
  • next, fill up the pong buffer (that is, write the first two columns of the compressed input matrix C into the pong buffer), then write the third column (the remaining column) of the compressed input matrix C into the ping buffer, and write the matrix in the pong buffer to the input buffer. Then fill the ping buffer (that is, write the compressed input matrix D into the ping buffer) and write the matrix in the ping buffer to the input buffer.
  • the maximum matrix that the ping buffer or the pong buffer can store is a 3x3 matrix.
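A minimal sketch of this stitching logic in Python with NumPy, assuming 3x3 tiles and variable-width compressed blocks (the generator below models the alternating ping/pong roles with a single column accumulator; all names and sizes are illustrative):

```python
import numpy as np

J, K = 3, 3                       # tile size the matrix multiplier consumes

def stitch(blocks):
    """Yield J x K tiles assembled from variable-width compressed blocks."""
    buf = np.empty((J, 0))
    for blk in blocks:            # each blk has J rows and at most K columns
        buf = np.hstack([buf, blk])
        while buf.shape[1] >= K:  # a tile is full: emit it ("ping" is ready)
            yield buf[:, :K]
            buf = buf[:, K:]      # leftover columns start the next tile ("pong")
    if buf.shape[1]:              # flush a final, possibly narrower tile
        yield buf

# Compressed widths mirroring the A/B/C/D example above: 2, 2, 3, 2 columns.
blocks = [np.ones((3, 2)), np.ones((3, 2)), np.ones((3, 3)), np.ones((3, 2))]
for tile in stitch(blocks):
    print(tile.shape)             # -> (3, 3), (3, 3), (3, 3)
```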
  • 707. The data compression unit 606 determines whether the aforementioned ping-pong buffer has stored a JxK matrix. If yes, go to 708; if no, go back to 704. J and K are both integers greater than 0.
  • the JxK matrix can be the largest matrix that a ping buffer can store.
  • the data compression unit 606 determining whether the ping-pong buffer has stored a JxK matrix may be determining whether the storage space of the ping buffer or the storage space of the pong buffer is full.
  • 708. The data compression unit 606 writes the JxK matrix stored in the ping-pong buffer into the input buffer 604.
  • 709. The matrix multiplier 605 obtains matrices from the input buffer 604 and the weight buffer 603, respectively, and performs matrix multiplication.
  • the matrix obtained by the matrix multiplier 605 from the weight buffer 603 is the first matrix obtained by splitting the compression weight matrix described above, and the matrix obtained from the input buffer 604 is a matrix obtained by concatenating at least two compressed input matrices (the second matrix).
  • 710. The accumulator 608 accumulates the products of the matrix multiplications of the matrix multiplier 605 to obtain a processing result.
  • the signal processing device can reduce the number of matrix multiplication operations performed by the matrix multiplier by compressing at least one input matrix and at least one weight matrix, and improve calculation efficiency.
  • the matrix multiplier 605 calculates the product of an input matrix obtained by splitting the original input matrix and a weight matrix obtained by splitting the compressed original weight matrix. It can be understood that, in the method of FIG. 7, the compressed original weight matrix needs to be split and the original input matrix needs to be split. In the case where the size of the original weight matrix and the size of the original input matrix are both smaller than the largest matrix that the matrix multiplier can handle, the original weight matrix and the original input matrix may not be split; instead, the product of the compressed original input matrix and the compressed original weight matrix is calculated directly.
  • FIG. 10 is another signal processing method provided by an embodiment of the present application.
  • as shown in FIG. 10, the method may include: 1001. The CPU 601 reads the weight matrix from the external memory 602, compresses the weight matrix to obtain a first matrix and compression information, and stores the first matrix and the compression information in the external memory 602.
  • compressing the weight matrix to obtain the first matrix and the compression information may include detecting position information of the all-0 rows in the weight matrix, stitching the non-all-0 rows in the weight matrix to obtain the first matrix, and obtaining the compression information from the position information.
  • the compression information is used to indicate all 0 rows in the weight matrix.
  • 1002. The CPU 601 controls the DMAC 610 to move the first matrix from the external memory 602 to the weight buffer 603, and controls the DMAC 610 to move the compression information from the external memory 602 to the data compression unit 606.
  • 1003. The CPU 601 controls the DMAC 610 to move the input matrix from the external memory 602 or the result memory 609 to the raw data buffer 607.
  • 1004. The CPU 601 instructs the raw data buffer 607 to import the input matrix into the data compression unit 606.
  • 1005. The data compression unit 606 compresses the input matrix according to the compression information to obtain the second matrix.
  • the weight matrix includes N all 0 rows, where N is an integer greater than 0.
  • according to the compression information, the data compression unit 606 stitches the target columns of the input matrix to obtain the second matrix.
  • the target columns are columns other than the N columns corresponding to the N all 0 rows in the input matrix.
  • the F-th row of the weight matrix corresponds to the F-th column of the input matrix, and F is an integer greater than 0.
  • the data compression unit 606 removes, according to the compression information, the columns in the input matrix corresponding to the all-0 rows of the weight matrix to obtain the second matrix.
  • the data compression unit 606 imports the second matrix into the input buffer. 1007.
  • the CPU 601 instructs the matrix multiplier 605 to obtain the first matrix from the weight buffer 604 and the second matrix from the input buffer 603. 1008.
  • 1008: The CPU 601 instructs the matrix multiplier 605 to calculate the product of the second matrix and the first matrix. An end-to-end check of this equivalence is sketched below.
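  • An illustrative NumPy check that the compressed product equals the original product (shapes and contents are made up for the example); it relies only on the fact that all-0 rows of the weight matrix contribute nothing to the product:

```python
import numpy as np

def compress_weight_matrix(weight):
    k_table = (~np.all(weight == 0, axis=1)).astype(np.uint8)
    return weight[k_table == 1, :], k_table

def compress_input_matrix(inp, k_table):
    return inp[:, k_table == 1]

A = np.random.randn(4, 6)          # input matrix
B = np.random.randn(6, 5)          # weight matrix
B[[1, 4], :] = 0                   # rows 2 and 5 become all-0 rows

first_matrix, k_table = compress_weight_matrix(B)   # 4x5 after compression
second_matrix = compress_input_matrix(A, k_table)   # 4x4 after compression

assert np.allclose(A @ B, second_matrix @ first_matrix)
```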
  • the signal processing device can reduce the size of the matrix processed by the matrix multiplier by compressing the weight matrix and the input matrix, and improve the calculation efficiency.
  • FIG. 11 is a schematic diagram of a hardware architecture of another signal processing apparatus according to an embodiment of the present application.
  • the signal processing apparatus in the embodiment of the present application can be applied to a device that can perform matrix multiplication operations such as a mobile phone, a tablet computer, a server, and a wearable device.
  • the signal processing apparatus may include an external memory 1102 for storing the original input matrix.
  • the external memory 1102 may also store the original weight matrix and other data.
  • The CPU 1101 is configured to read the original input matrix from the external memory 1102, compress the original input matrix to obtain a compressed input matrix, and send the compressed input matrix to the external memory 1102. The compressed input matrix is obtained by removing at least one all-0 column from the original input matrix. The CPU 1101 is also configured to generate compression information indicating the at least one all-0 column and send the compression information to the external memory 1102.
  • The CPU 1101 is further configured to split the compressed input matrix and send at least one input matrix obtained by the split to the external memory 1102.
  • Optionally, the CPU 1101 reads the original input matrix temporarily stored in the external memory 1102 and compresses it: whenever an entire column of 0-value data is found, that column is deleted, and the numbers of the deleted columns are recorded during compression (the column numbers can be recorded in a k-table). After part of the original input matrix has been compressed, the compressed data of that part may be written into the external memory 1102 first.
  • The original input matrix can be compressed by deleting its all-0 columns and then stitching the remaining columns together, or by extracting its non-all-0 columns and stitching them together, similar to the description of the previous embodiment.
  • The CPU 1101 is also used to control the DMAC 1110 to move the compressed original input matrix (the compressed input matrix) from the external memory 1102 to the input buffer 1103, and to control the DMAC 1110 to move the compression information (the column numbers) from the external memory 1102 to the data compression unit 1106.
  • The DMAC 1110 is also controlled to move the uncompressed original weight matrix, or at least one weight matrix obtained by splitting the original weight matrix, from the external memory 1102 to the original data buffer 1107 (raw data buffer) for temporary storage.
  • the CPU 1101 is further configured to split the compressed input matrix, and control the DMAC 1110 to move the second matrix obtained by splitting the compressed input matrix to the input buffer 1103.
  • In practice, the CPU 1101 splits the compressed input matrix to obtain at least one input matrix, and the DMAC 1110 may sequentially move the input matrices obtained by the split to the input buffer 1103.
  • The CPU 1101 is further used to instruct the original data buffer 1107 to split the uncompressed original weight matrix and import at least one weight matrix obtained by the split into the data compression unit 1106, or to instruct the original data buffer 1107 to import the at least one weight matrix it stores into the data compression unit 1106.
  • the data compression unit 1106 obtains at least one weight matrix stored in the raw data buffer 1107.
  • the CPU 1101 is further configured to instruct the data compression unit 1106 to compress the at least one weight matrix.
  • the CPU 1101 is further configured to determine whether to split the original input matrix or the original weight matrix.
  • the data compression unit 1106 is configured to compress the at least one weight matrix according to the compression information, and write the compressed at least one weight matrix (first matrix) into the weight buffer 1104.
  • a matrix multiplier 1105 is configured to obtain the first matrix from the weight buffer 1104, obtain the second matrix from the input buffer 1103, and calculate a product of the second matrix and the first matrix.
  • the accumulator 1108 is configured to accumulate a product of the second matrix and the first matrix to obtain a processing result, and store the processed result in the result memory 1109.
  • the compression system in the embodiment of the present application may include a data compression unit 1106 and a CPU 1101.
  • 1103 to 1110 in the figure can be integrated in an integrated circuit or chip, or integrated with the CPU. It can be understood that 1103 to 1110 in the figure may be components included in an arithmetic accelerator, and the arithmetic accelerator is mounted on the CPU 1101 to improve the performance of the CPU 1101 in some aspects.
  • FIG. 12 shows a signal processing method provided in an embodiment of the present application. As shown in FIG. 12, the method may include the following steps. 1201: The CPU 1101 reads the original input matrix in the external memory 1102, compresses the original input matrix to obtain a compressed input matrix and compression information, and stores the compressed input matrix and the compression information to the external memory 1102.
  • the compression information is used to indicate all 0 columns in the original input matrix.
  • the original input matrix includes N all 0 columns, where N is an integer greater than 0.
  • Optionally, the compression information is a binary sequence, and each binary value in the sequence indicates whether the corresponding column of the original input matrix is an all-0 column.
  • For example, the original input matrix includes 12 columns, and only its 5th and 8th columns are all-0 columns. The compression information obtained by the CPU 1101 when compressing this original input matrix is 111101101111 (a binary sequence). From left to right, the binary values correspond to the 1st to 12th columns of the original input matrix; columns corresponding to 1 are non-all-0 columns, and columns corresponding to 0 are all-0 columns. This encoding is reproduced in the sketch below.
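  • A small NumPy sketch that reproduces this encoding (the matrix contents are made up for illustration):

```python
import numpy as np

x = np.ones((3, 12))
x[:, [4, 7]] = 0   # 5th and 8th columns all 0 (0-based indices 4 and 7)

bits = (~np.all(x == 0, axis=0)).astype(int)
print("".join(map(str, bits)))   # prints 111101101111
```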
  • The original input matrix can be compressed by deleting the all-0 columns and stitching the remaining columns together, or by extracting the non-all-0 columns and stitching them together.
  • 1202: The CPU 1101 controls the DMAC 1110 to move the second matrix obtained by splitting the compressed input matrix from the external memory 1102 to the input buffer 1103, and controls the DMAC 1110 to move the compression information from the external memory 1102 to the data compression unit 1106.
  • Optionally, the CPU 1101 sends at least one input matrix obtained by splitting the compressed input matrix to the external memory 1102.
  • 1203: The CPU 1101 controls the DMAC 1110 to move the original weight matrix, or at least one weight matrix obtained by splitting the original weight matrix, from the external memory 1102 to the original data buffer 1107.
  • Optionally, before 1203, the CPU 1101 reads the original weight matrix from the external memory 1102, splits it to obtain the at least one weight matrix, and writes the obtained at least one weight matrix to the external memory 1102.
  • Optionally, before 1203, the CPU 1101 reads the original weight matrix from the result memory 1109, splits it to obtain at least one weight matrix, and writes the obtained at least one weight matrix into the result memory 1109.
  • The order in which the CPU 1101 executes 1202 and 1203 is not limited: 1202 may be executed before 1203, 1202 and 1203 may be executed simultaneously, or 1203 may be executed before 1202.
  • 1204: The CPU 1101 instructs the original data buffer 1107 to import the at least one weight matrix included in the original weight matrix into the data compression unit 1106, or instructs the original data buffer 1107 to import the at least one weight matrix it stores into the data compression unit 1106.
  • the size of any weight matrix in the at least one weight matrix is less than or equal to the size of the largest matrix that can be processed by the matrix multiplier.
  • Optionally, after executing 1204, the CPU 1101 instructs the data compression unit 1106 to compress the at least one weight matrix. 1205: The data compression unit 1106 compresses the at least one weight matrix according to the compression information to obtain at least one compressed weight matrix; it may determine the reference rows of the at least one weight matrix according to the compression information and stitch them together, the reference rows being the rows of the weight matrix other than those corresponding to the all-0 columns of the original input matrix.
  • FIG. 13 is a schematic diagram of compressing a weight matrix according to an embodiment of the present application. As shown in FIG. 13, the original weight matrix is a 12x12 matrix split into 16 3x3 matrices, and the K-table is the compression information corresponding to the original weight matrix.
  • If two weight matrices obtained by splitting the original weight matrix correspond to different rows of the original weight matrix, they correspond to different parts of the compression information.
  • the K-table is divided into four sections from top to bottom, and each section corresponds to three rows.
  • For example, the weight matrix A and the weight matrix B both correspond to the first three rows of the original weight matrix, so these two weight matrices correspond to the first part of the K-table; the weight matrix E and the weight matrix F both correspond to rows 4 to 6 of the original weight matrix, so these two weight matrices correspond to the second part of the K-table.
  • The data compression unit 1106 may compress the weight matrix A and the weight matrix B according to the first part of the compression information (K-table), and compress the weight matrix E and the weight matrix F according to the second part, to obtain the compressed weight matrices.
  • In FIG. 13, the binary value corresponding to the second row of the weight matrix A in the compression information is 0, so the data compression unit 1106 deletes the second row of the weight matrix A and stitches its first and third rows together to obtain the compressed weight matrix A.
  • In practice, the data compression unit 1106 may sequentially compress the at least one weight matrix obtained by splitting the original weight matrix, as in the sketch below.
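  • A sketch of this per-sub-matrix compression (NumPy assumed; the row-offset bookkeeping is an illustrative reading of the K-table layout in FIG. 13, not the unit's actual addressing logic):

```python
import numpy as np

def compress_submatrix(sub: np.ndarray, k_table: np.ndarray, row0: int) -> np.ndarray:
    # Select the K-table section covering this sub-matrix's rows
    # (row0 is the sub-matrix's first row index in the original matrix),
    # then keep only the rows flagged 1.
    section = k_table[row0 : row0 + sub.shape[0]]
    return sub[section == 1, :]
```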
  • 1206: The data compression unit 1106 writes the at least one compressed weight matrix into a ping-pong buffer for stitching. The ping-pong buffer includes a ping buffer and a pang buffer.
  • The ping buffer and the pang buffer have the same storage space, and the size of each weight matrix obtained by splitting the original weight matrix can be the same as the size of the largest matrix the ping buffer can store.
  • the specific method of stitching and compressing the weight matrix is similar to the method in FIG. 9 and will not be described in detail here.
  • 1207: The data compression unit 1106 determines whether the ping-pong buffer has stored a JxK matrix. If yes, go to 1208; if no, go to 1204. J and K are both integers greater than 0.
  • the JxK matrix can be the largest matrix that a ping buffer can store.
  • The data compression unit 1106 may determine whether the ping-pong buffer has stored a JxK matrix by determining whether the storage space of the ping buffer or of the pang buffer is full, as sketched below.
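  • A software analogue of the ping-pong stitching in 1206 to 1208; as an illustrative simplification, a single pending list plays the role of the ping/pang pair, and tiles are flushed row-wise:

```python
import numpy as np

def stitch_into_tiles(compressed_blocks, J: int):
    """Yield one JxK tile for the weight buffer whenever enough rows exist."""
    pending = []                      # stands in for the ping/pang buffers
    for block in compressed_blocks:   # each block: rows kept after compression
        pending.extend(list(block))
        while len(pending) >= J:      # the "is a JxK matrix stored?" check
            yield np.stack(pending[:J])
            pending = pending[J:]
```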
  • 1208: The data compression unit 1106 writes the JxK matrix stored in the ping-pong buffer into the weight buffer 1104.
  • 1209: The matrix multiplier 1105 obtains matrices from the input buffer 1103 and the weight buffer 1104 and performs matrix multiplication.
  • The matrix that the matrix multiplier 1105 obtains from the weight buffer 1104 is a matrix obtained by stitching at least two compressed weight matrices (the first matrix), and the matrix obtained from the input buffer 1103 is the second matrix obtained by splitting the compressed input matrix described above.
  • 1210: The accumulator 1108 accumulates the products produced by the matrix multiplier 1105 to obtain a processing result. A one-line software analogue is sketched below.
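  • An illustrative sketch of the accumulation step (NumPy assumed), mirroring the running update used when a large product is assembled from tiles:

```python
import numpy as np

def accumulate(c_tile: np.ndarray, a_tile: np.ndarray, b_tile: np.ndarray) -> np.ndarray:
    # Add this tile product into the running partial result for block (i, j).
    return c_tile + a_tile @ b_tile
```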
  • 1211: The CPU 1101 determines whether the matrix multiplier has processed the last weight matrix obtained by splitting the original weight matrix. If yes, go to 1212; if no, go to 1204. 1212: Stop executing 1204.
  • In this embodiment, the signal processing device reduces the number of matrix multiplication operations performed by the matrix multiplier, and thus improves calculation efficiency, by stitching together the reference rows of the weight matrices obtained by splitting the original weight matrix and by compressing the original input matrix.
  • FIG. 14 is a flowchart of another signal processing method provided by an embodiment of the present application. As shown in FIG. 14, the method may include the following steps. 1401: The CPU 1101 reads the input matrix in the external memory 1102, compresses the input matrix to obtain a second matrix and compression information, and stores the second matrix and the compression information in the external memory 1102.
  • Compressing the input matrix to obtain the second matrix and the compression information may consist of detecting the position information of the all-0 columns in the input matrix, stitching the non-all-0 columns of the input matrix together to obtain the second matrix, and deriving the compression information from the position information.
  • the compression information is used to indicate all 0 columns in the input matrix.
  • 1402: The CPU 1101 controls the DMAC 1110 to move the second matrix from the external memory 1102 to the input buffer 1103, and controls the DMAC 1110 to move the compression information from the external memory 1102 to the data compression unit 1106.
  • 1403: The CPU 1101 controls the DMAC 1110 to move the weight matrix from the external memory 1102 to the original data buffer 1107.
  • 1404: The CPU 1101 instructs the original data buffer 1107 to import the weight matrix into the data compression unit 1106.
  • 1405: The data compression unit 1106 compresses the weight matrix according to the compression information to obtain a first matrix.
  • 1406: The data compression unit 1106 imports the first matrix into the weight buffer 1104.
  • 1407: The CPU 1101 instructs the matrix multiplier 1105 to obtain the second matrix from the input buffer 1103 and the first matrix from the weight buffer 1104.
  • 1408: The CPU 1101 instructs the matrix multiplier 1105 to calculate the product of the second matrix and the first matrix.
  • the signal processing device can reduce the size of the matrix processed by the matrix multiplier and improve the calculation efficiency by compressing the input matrix and the weight matrix.
  • The signal processing method in FIG. 7 converts the multiplication of the original input matrix and the original weight matrix into the multiplication of a weight matrix obtained by splitting the compressed original weight matrix and a matrix stitched from multiple input matrices obtained by splitting the original input matrix. The signal processing method in FIG. 12 converts it into the multiplication of an input matrix obtained by splitting the compressed original input matrix and a matrix stitched from multiple compressed weight matrices obtained by splitting the original weight matrix. The premise for both methods is that the product of the stitched sub-matrices equals the product of the un-stitched sub-matrices.
  • FIG. 15 is a schematic diagram of sub-matrix multiplication provided by an embodiment of the present application. As shown in FIG. 15, A0 and A1 are sub-matrices obtained by splitting the input matrix, and B0 and B2 are sub-matrices obtained by splitting the weight matrix; the third row of B0 is an all-0 row, and the first and second rows of B2 are all-0 rows.
  • FIG. 16 is a schematic diagram of a stitched sub-matrix multiplication provided by an embodiment of the present application.
  • A0, A1, B0, B2, and C0 in FIG. 16 are the same as A0, A1, B0, B2, and C0 in FIG. 15, respectively.
  • A'0 is the sub-matrix obtained by joining the first two columns of A0 with the third column of A1, and B'0 is the sub-matrix obtained by joining the first two rows of B0 with the third row of B2. Comparing FIG. 15 and FIG. 16 shows that each element of C0 in FIG. 15 is the same as the corresponding element of C0 in FIG. 16.
  • Therefore, the product of the stitched sub-matrices is the same as the product of the un-stitched sub-matrices; a numeric check is sketched below. It can be seen from the above embodiments that the all-0 rows or all-0 columns of multiple matrices can be removed to obtain a single compressed matrix, that is, multiple matrices are compressed into one. Alternatively, only one matrix may be compressed to obtain one compressed matrix, which is not limited in this embodiment.
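  • An illustrative NumPy check of the FIG. 15 / FIG. 16 equivalence (the concrete 3x3 shapes and random contents are assumptions for the example):

```python
import numpy as np

A0, A1 = np.random.randn(3, 3), np.random.randn(3, 3)
B0, B2 = np.random.randn(3, 3), np.random.randn(3, 3)
B0[2, :] = 0        # third row of B0 is all 0
B2[:2, :] = 0       # first and second rows of B2 are all 0

C0 = A0 @ B0 + A1 @ B2                      # un-stitched computation
A0p = np.hstack([A0[:, :2], A1[:, 2:3]])    # A'0: cols 1-2 of A0, col 3 of A1
B0p = np.vstack([B0[:2, :], B2[2:3, :]])    # B'0: rows 1-2 of B0, row 3 of B2

assert np.allclose(C0, A0p @ B0p)           # stitched product matches
```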
  • the signal processing device may adopt any one of the methods in FIG. 7, FIG. 10, FIG. 12, and FIG. 14 according to actual needs. If the signal processing device adopts the architecture in FIG. 6, the methods in FIG. 7 and FIG. 10 can be executed. If the signal processing device adopts the architecture in FIG. 11, the methods in FIG. 12 and FIG. 14 can be executed. The CPU in the signal processing device can determine whether to split the original weight matrix or the original input matrix. If the signal processing device adopts the architecture in FIG. 6 and the CPU determines to split the original weight matrix or the original input matrix, the method in FIG. 7 is adopted. If the signal processing device adopts the architecture in FIG. 6 and the CPU determines that the original weight matrix or the original input matrix is not split, the method in FIG. 10 is adopted.
  • Optionally, the CPU in the signal processing device is preset with the size of the largest matrix that the matrix multiplier can process. If the CPU determines that the sizes of the original weight matrix and the original input matrix are both smaller than this maximum size, it decides not to split them; if it determines that the size of the original weight matrix or the original input matrix is larger than the maximum size, it decides to split the original weight matrix and the original input matrix, as sketched below.
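  • A sketch of this decision rule; as an illustrative simplification, a single preset square limit h_max stands in for the multiplier's maximum matrix size:

```python
def needs_split(weight_shape, input_shape, h_max: int) -> bool:
    # Split whenever any dimension of either matrix exceeds the preset limit.
    return any(d > h_max for d in (*weight_shape, *input_shape))
```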
  • The solutions in the above embodiments mainly introduce methods for implementing matrix compression and signal processing and the corresponding devices. The related methods can also be implemented by hardware, software, or a combination of the two. If a related method is implemented in software, it can be considered to exist mainly as a software program or as a storage medium storing that software; such a software program can be considered a computer program product.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes software program instructions which, when executed by a processor, implement the following: compressing at least one weight matrix to obtain a first matrix, and compressing at least one input matrix to obtain a second matrix, where the input matrix includes multiple computer-processable signals and the weight matrix includes multiple weight coefficients; the compressed first matrix and second matrix satisfy the following constraint: the first matrix is obtained by removing at least one all-0 row from the at least one weight matrix and the second matrix is obtained by removing the corresponding at least one column from the at least one input matrix, or the second matrix is obtained by removing at least one all-0 column from the at least one input matrix and the first matrix is obtained by removing the corresponding at least one row from the at least one weight matrix; and calculating the product of the second matrix and the first matrix.
  • the plurality of computer-processable signals include at least one of a voice signal, a text signal, or an image signal.
  • Optionally, the method further includes generating compression information used to indicate the at least one all-0 row, and obtaining the second matrix includes compressing the at least one input matrix according to the compression information to obtain the second matrix.
  • Optionally, the method further includes generating compression information used to indicate the at least one all-0 column, and obtaining the first matrix includes compressing the at least one weight matrix according to the compression information to obtain the first matrix.
  • Optionally, after calculating the product of the second matrix and the first matrix, the method further includes accumulating the product of the second matrix and the first matrix to obtain a processing result.
  • Optionally, before the at least one weight matrix is compressed to obtain the first matrix and the at least one input matrix is compressed to obtain the second matrix, the method further includes at least one of the following: splitting the original weight matrix to obtain the at least one weight matrix, or splitting the original input matrix to obtain the at least one input matrix.
  • the embodiments of the present application provide a device, such as a mobile phone, a tablet computer, a server, a wearable device, and other devices that can perform matrix multiplication operations.
  • the device includes a memory and a processor.
  • the memory is used as a computer-readable storage medium for storing program instructions, and the processor is configured to execute the program instructions to implement the above-mentioned method flow.
  • FIG. 17 shows another signal processing device according to an embodiment of the present application, which may be placed in the above device. The signal processing device includes a compression unit 1701, configured to compress at least one weight matrix to obtain a first matrix and to compress at least one input matrix to obtain a second matrix.
  • the input matrix includes multiple computer-processable signals
  • the weight matrix includes multiple weight coefficients
  • the compressed first matrix and the second matrix meet the following restrictions:
  • the first matrix is obtained by removing at least one all 0 row in the at least one weight matrix
  • the second matrix is obtained by removing at least one column corresponding to the at least one all 0 row in the at least one input matrix; or,
  • the second matrix is obtained by removing at least one all 0 column in the at least one input matrix
  • the first matrix is obtained by removing at least one row corresponding to the at least one all 0 column in the at least one weight matrix;
  • and a calculation unit 1702, configured to calculate the product of the second matrix and the first matrix.
  • the plurality of computer-processable signals include at least one of a voice signal, a text signal, or an image signal.
  • Optionally, the compression unit 1701 is further configured to generate compression information used to indicate the at least one all-0 row, and to compress the at least one input matrix according to the compression information to obtain the second matrix.
  • Optionally, the compression unit 1701 is further configured to generate compression information used to indicate the at least one all-0 column, and to compress the at least one weight matrix according to the compression information to obtain the first matrix.
  • the signal processing device further includes an accumulation unit 1703, and the accumulation unit 1703 is configured to accumulate a product of the second matrix and the first matrix to obtain a processing result.
  • Further, the signal processing device includes a splitting unit 1704, configured to split the original weight matrix to obtain the at least one weight matrix and to split the original input matrix to obtain the at least one input matrix.
  • The compression unit 1701, the calculation unit 1702, the accumulation unit 1703, and the splitting unit 1704 in this embodiment may be implemented by software, hardware, or a combination of the two. The processing involved in the above device or apparatus embodiments can thus be implemented in whole or in part by software, hardware, firmware, or any combination of them; when software is used, the above embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product described above includes one or more computer instructions. When the computer program instructions are loaded or executed on the signal processing device, the above-mentioned processes or functions according to the embodiment of the present invention are wholly or partially generated.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, and the like, including one or more sets of available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive (SSD).
  • An implementation manner in the above embodiments is to directly compress at least one weight matrix to obtain a first matrix.
  • As an alternative implementation, the first matrix may be preset. Because the first matrix, as a weight parameter matrix, usually does not change, there is no need to recompute it every time; the first matrix and the compression information may therefore be preset in a device that needs to perform matrix multiplication operations.
  • the compression system directly obtains a preset first matrix from another device or a certain memory in the signal processing device, for example, the external memory 602 in FIG. 6.
  • the first matrix may be preset in a compression system in a hardware form. In this embodiment, a specific preset manner of the first matrix in the entire system or device is not limited.
  • In this way, the first matrix does not need to be recomputed on each execution; instead, the input matrix is compressed directly according to the preset first matrix and compression information to obtain the second matrix corresponding to the first matrix.
  • The first matrix may be a matrix obtained by the compression system compressing at least one weight matrix, or a matrix obtained by further splitting such a compressed weight matrix. That is, the compression system may obtain the first matrix by compressing a matrix directly, by first splitting a matrix and then compressing the resulting matrices, or by first compressing a matrix and then splitting the compressed matrix; this application does not limit the manner in which the first matrix is obtained, nor whether the related matrix is split one or more times before or after the compression operation.
  • the related splitting operation can make the obtained matrix size meet the preset specifications, which is beneficial for performing operations.
  • In addition, the compression system may compress the at least one weight matrix in an offline state (with no matrix multiplication task started) to obtain the first matrix. For example, the weight matrix can be compressed offline, before the device leaves the factory or during manufacturing and development, and the resulting first matrix preset in the device, for example inside a memory. During subsequent online operation, that is, when the user needs to run a task, the preset first matrix can be used directly, achieving the effect of this embodiment, as sketched below.
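  • A sketch of this offline/online split (the file names, NumPy persistence, and helper function are all illustrative assumptions, not the embodiment's storage format):

```python
import numpy as np

def compress_weight_matrix(weight):
    k_table = (~np.all(weight == 0, axis=1)).astype(np.uint8)
    return weight[k_table == 1, :], k_table

def offline_build(weight_path: str = "weights.npy") -> None:
    # Run once, e.g. during manufacturing: compress and persist the weights.
    weight = np.load(weight_path)
    first_matrix, k_table = compress_weight_matrix(weight)
    np.save("first_matrix.npy", first_matrix)
    np.save("k_table.npy", k_table)

def online_run(inp: np.ndarray) -> np.ndarray:
    # Run per task: the preset first matrix is loaded, never recomputed;
    # only the cheap column removal is applied to the input.
    first_matrix = np.load("first_matrix.npy")
    k_table = np.load("k_table.npy")
    return inp[:, k_table == 1] @ first_matrix
```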

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Processing (AREA)

Abstract

An embodiment of the present invention discloses a signal processing apparatus and a signal processing method. The signal processing apparatus includes: a compression system, configured to obtain a compressed first matrix and to compress at least one input matrix to obtain a second matrix; and a matrix multiplier, configured to obtain the first matrix and the second matrix from the compression system and to calculate the product of the second matrix and the first matrix. In the embodiments of the present application, the size of the matrices processed by the matrix multiplier, or the number of matrix multiplication operations performed by the matrix multiplier, is reduced, improving calculation efficiency.

Description

信号处理装置和信号处理方法 技术领域
本申请涉及计算机技术,尤其涉及一种信号处理装置、信号处理方法及计算机可读介质。
背景技术
在计算机技术中,卷积神经网络(Convolutional Neural Network,CNN)是一种多层的神经网络。目前,在卷积神经网络中,处理器进行卷积操作通常是将输入信号特征与权重的卷积,转换为信号矩阵与权重矩阵之间的矩阵乘运算。在具体矩阵乘运算时,对信号矩阵和权重矩阵进行分块处理,得到多个分形(Fractional)信号矩阵和分形权重矩阵,然后对多个分形信号矩阵和分形权重矩阵进行矩阵乘和累加运算。也就是说,卷积操作可以转换为信号矩阵(输入矩阵)与权重矩阵之间的矩阵相乘运算,即AxB([MxK]x[KxN]),其中,A表示信号矩阵(输入矩阵),B表示权重矩阵。通常A矩阵为卷积时根据卷积核步长(kernel stride)从输入数据提取出来的输入矩阵,即输入信号特征转换的输入矩阵。
一般情况下,输入矩阵和权重矩阵都是相对比较大的矩阵,出于节省硬件成本和功耗的考虑,矩阵乘法电路一次能处理的矩阵的尺寸会比输入矩阵和权重矩阵的尺寸小,因此需要把大矩阵乘法拆分成一系列的小矩阵乘法,再从多个小矩阵乘法最终得出不同尺寸的矩阵乘法结果。即便如此,如何进一步提高计算效率仍然是一个问题。
发明内容
本申请实施例提供了一种信号处理装置、信号处理方法及计算机可读介质,可以减少矩阵乘法器执行矩阵相乘运算的次数,以提高计算效率。
第一方面,本申请实施例提供了一种信号处理装置,该信号处理装置包括:压缩系统,用于获取压缩后的第一矩阵,和对至少一个输入矩阵做压缩得到第二矩阵;所述第一矩阵和所述第二矩阵满足如下限定:所述第一矩阵是去除至少一个权重矩阵中的至少一个全0行得到的,所述第二矩阵是去除所述至少一个输入矩阵中与所述至少一个全0行对应的至少一个列得到的;或者,所述第二矩阵是去除所述至少一个输入矩阵中的至少一个全0列得到的,所述第一矩阵是去除所述至少一个权重矩阵中与所述至少一个全0列对应的至少一个行得到的;所述输入矩阵包括多个计算机可处理的信号,所述权重矩阵包括多个权重系数;矩阵乘法器,用于从所述压缩系统获取所述第一矩阵和所述第二矩阵,计算所述第二矩阵和所述第一矩阵的乘积。
可选地,所述第二矩阵包括的行的数量与所述第一矩阵包括的列的数量相同。可选地,所述第一矩阵包括的行的数量小于所述权重矩阵包括的行的数量,所述第二矩阵包括的列的数量小于所述输入矩阵包括的列的数量。所述输入矩阵和所述权重矩阵的乘积等于所述第二矩阵和所述第一矩阵的乘积。由于第二矩阵的尺寸小于输入矩阵的尺寸,第一矩阵的 尺寸小于权重矩阵,本申请实施例中通过将输入矩阵和权重矩阵的相乘转换为第二矩阵和第一矩阵相乘,可以有效减少矩阵乘法的运算量。
在一个可选的实现方式中,所述压缩系统,具体用于获取预设的所述第一矩阵。在该实现方式中,压缩系统可以直接获取第一矩阵,不需要额外的操作,实现简单。可选地,所述第一矩阵可以预设在外部存储器或其他存储器中。可选地,所述第一矩阵可以以硬件形式预设在压缩系统中。
在一个可选的实现方式中,所述压缩系统,具体用于所述压缩系统,具体用于对所述至少一个权重矩阵做压缩得到所述第一矩阵。在该实现方式中,通过对权重矩阵和输入矩阵进行压缩,可以减少矩阵乘法器所需处理的矩阵的尺寸或所需执行矩阵乘法的次数,进而提升计算效率。
在一个可选的实现方式中,所述压缩系统包括:处理器和数据压缩单元,所述处理器,用于对所述至少一个权重矩阵做压缩得到所述第一矩阵;和/或,所述数据压缩单元,用于对所述至少一个输入矩阵做压缩得到所述第二矩阵。在本实现方式中,通过处理器和数据压缩单元分别压缩权重矩阵和输入矩阵,实现简单。
在一个可选的实现方式中,所述处理器,还用于生成压缩信息,所述压缩信息用于指示所述至少一个全0行;所述数据压缩单元,还用于根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。在该实现方式中,数据压缩单元可以根据压缩信息可以准确、快速地去除至少一个输入矩阵中的全0列,以便于得到第二矩阵,实现简单。
在一个可选的实现方式中,所述信号处理装置还包括:直接内存访问控制器DMAC与权重缓存器,所述DMAC耦合至所述权重缓存器和外部存储器;所述处理器,还用于将所述第一矩阵和所述压缩信息存入所述外部存储器;所述DMAC,用于将所述第一矩阵从所述外部存储器搬移到所述权重缓存器,以及用于将所述压缩信息从所述外部存储器搬移到所述数据压缩单元;所述矩阵乘法器还用于从所述权重缓存器获取所述第一矩阵。在该实现方式中,DMAC可以及时地将第一矩阵搬移到权重缓存器以及将压缩信息搬移到数据压缩单元,以便于数据压缩单元对输入矩阵做压缩以及矩阵乘法器快速地获取第一矩阵。
在一个可选的实现方式中,所述信号处理装置还包括原始数据缓存器和输入缓存器;所述DMAC还用于将所述至少一个输入矩阵从所述外部存储器搬移到所述原始数据缓存器;所述数据压缩单元还用于从所述原始数据缓存器获取所述至少一个输入矩阵,并在对所述至少一个输入矩阵做压缩得到所述第二矩阵后将所述第二矩阵存入所述输入缓存器;所述矩阵乘法器还用于从所述输入缓存器获取所述第二矩阵。在该实现方式中,可以快速地对至少一个输入矩阵做压缩得到第二矩阵,并存入输入缓存器。
在一个可选的实现方式中,所述压缩系统包括:处理器和数据压缩单元,所述处理器,用于对所述至少一个输入矩阵做压缩得到所述第二矩阵;和/或,所述数据压缩单元,用于对所述至少一个权重矩阵做压缩得到所述第一矩阵。在本实现方式中,通过处理器和数据压缩单元分别输入权重矩阵和权重矩阵,实现简单。
在一个可选的实现方式中,所述处理器,还用于生成压缩信息,所述压缩信息用于指示所述至少一个全0列;所述数据压缩单元,还用于根据所述压缩信息对所述至少一个权重矩阵做压缩得到所述第一矩阵。在该实现方式中,数据压缩单元可以根据压缩信息可以 准确、快速地去除至少一个输入矩阵中的全0列,以便于得到第二矩阵,实现简单。
在一个可选的实现方式中,所述信号处理装置还包括:直接内存访问控制器DMAC与输入缓存器,所述DMAC耦合至所述输入缓存器和外部存储器;所述处理器,还用于将所述第二矩阵和所述压缩信息存入所述外部存储器;所述DMAC,用于将所述第二矩阵从所述外部存储器搬移到所述输入缓存器,以及用于将所述压缩信息从所述外部存储器搬移到所述数据压缩单元;所述矩阵乘法器还用于从所述输入缓存器获取所述第二矩阵。在该实现方式中,DMAC可以及时地将第二矩阵搬移到输入缓存器以及将压缩信息搬移到数据压缩单元,以便于数据压缩单元对权重矩阵做压缩以及矩阵乘法器快速地获取第二矩阵。
在一个可选的实现方式中,所述信号处理装置还包括原始数据缓存器和权重缓存器;所述DMAC还用于将所述至少一个权重矩阵从所述外部存储器搬移到所述原始数据缓存器;所述数据压缩单元还用于从所述原始数据缓存器获取所述至少一个权重矩阵,并在对所述至少一个权重矩阵做压缩得到所述第一矩阵后将所述第一矩阵存入所述权重缓存器;所述矩阵乘法器还用于从所述权重缓存器获取所述第一矩阵。在该实现方式中,可以快速地对至少一个权重矩阵做压缩得到第二矩阵,并存入输入缓存器。
在一个可选的实现方式中,所述信号处理装置还包括累加单元,所述累加单元,用于对所述第二矩阵和所述第一矩阵的乘积做累加得到处理结果。在该实现方式中,利用累加器对第二矩阵和第一矩阵的乘积做累加得到处理结果,实现简单。
在一个可选的实现方式中,所述处理器,还用于执行以下至少一项:对原始权重矩阵做拆分得到所述至少一个权重矩阵;或,对原始输入矩阵做拆分得到所述至少一个输入矩阵。在该实现方式中,对原始权重矩阵和原始输入矩阵做拆分,以便于通过拆分得到的权重矩阵和输入矩阵计算该原始输入矩阵和该原始权重矩阵的乘积。
在一个可选的实现方式中,所述多个计算机可处理的信号包括:语音信号、文本信号或图像信号中的至少一项。
在一个可选的实现方式中,所述处理器,具体用于在未执行卷积运算任务的情况下,从所述外部存储器读取所述权重矩阵,将所述权重矩阵中的非全0行进行拼接得到所述第一矩阵,将所述第一矩阵发送至所述外部存储器。可选地,卷积运算任务是指需要执行卷积运算的任务。在该实现方式中,处理器可以在未执行卷积运算或FC运算的情况下,对权重矩阵进行压缩,而不是在执行卷积运算或FC运算的过程中对该权重矩阵进行压缩,可以节省压缩该权重矩阵的时间开销,提高计算效率。
第二方面,本申请实施例提供了一种信号处理方法,该方法包括:获取压缩后的第一矩阵,和对至少一个输入矩阵做压缩得到第二矩阵;所述第一矩阵和所述第二矩阵满足如下限定:所述第一矩阵是去除至少一个权重矩阵中的至少一个全0行得到的,所述第二矩阵是去除至少一个输入矩阵中与所述至少一个全0行对应的至少一个列得到的;或者,所述第二矩阵是去除所述至少一个输入矩阵中的至少一个全0列得到的,所述第一矩阵是去除所述至少一个权重矩阵中与所述至少一个全0列对应的至少一个行得到的;所述输入矩阵包括多个计算机可处理的信号,所述权重矩阵包括多个权重系数;计算所述第二矩阵和所述第一矩阵的乘积。本申请实施例中,信号处理装置通过对至少一个权重矩阵做压缩以及对输入矩阵做压缩,可以减少矩阵乘法器执行矩阵相乘运算的次数或相乘矩阵的尺寸, 以提高计算效率。
在一个可选的实现方式中,所述获取压缩后的第一矩阵和第二矩阵包括:获取预设的所述第一矩阵。
在一个可选的实现方式中,所述获取压缩后的第一矩阵和第二矩阵包括:对所述至少一个权重矩阵做压缩得到所述第一矩阵。
在一个可选的实现方式中,所述方法还包括:生成压缩信息,所述压缩信息用于指示所述至少一个全0行;所述获取第二矩阵包括:根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。
在一个可选的实现方式中,所述方法还包括:生成压缩信息,所述压缩信息用于指示所述至少一个全0列;所述获取第一矩阵包括:根据所述压缩信息对所述至少一个权重矩阵做压缩得到所述第一矩阵。
在一个可选的实现方式中,所述计算所述第二矩阵和所述第一矩阵的乘积之后,所述方法还包括:对所述第二矩阵和所述第一矩阵的乘积做累加得到处理结果。
在一个可选的实现方式中,所述对至少一个权重矩阵做压缩得到第一矩阵,并对至少一个输入矩阵做压缩得到第二矩阵之前,所述方法还包括如下至少一项:对原始权重矩阵做拆分得到所述至少一个权重矩阵或者对原始输入矩阵做拆分得到所述至少一个输入矩阵。
在一个可选的实现方式中,所述多个计算机可处理的信号包括:语音信号、文本信号或图像信号中的至少一项。
第三方面,本申请实施例提供了另一种信号处理装置,该信号处理装置包括:压缩单元,用于获取压缩后的第一矩阵,和对至少一个输入矩阵做压缩得到第二矩阵;所述第一矩阵和所述第二矩阵满足如下限定:所述第一矩阵是去除所述至少一个权重矩阵中的至少一个全0行得到的,所述第二矩阵是去除至少一个输入矩阵中与所述至少一个全0行对应的至少一个列得到的;或者,所述第二矩阵是去除所述至少一个输入矩阵中的至少一个全0列得到的,所述第一矩阵是去除所述至少一个权重矩阵中与所述至少一个全0列对应的至少一个行得到的;所述输入矩阵包括多个计算机可处理的信号,所述权重矩阵包括多个权重系数;计算单元,用于计算所述第二矩阵和所述第一矩阵的乘积。
在一个可选的实现方式中,所述压缩单元,还用于:获取预设的所述第一矩阵。
在一个可选的实现方式中,所述压缩单元,还用于:对所述至少一个权重矩阵做压缩得到所述第一矩阵。
在一个可选的实现方式中,所述压缩单元,还用于:生成压缩信息,所述压缩信息用于指示所述至少一个全0行;以及根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。
在一个可选的实现方式中,所述压缩单元,还用于:生成压缩信息,所述压缩信息用于指示所述至少一个全0列;以及根据所述压缩信息对所述至少一个权重矩阵做压缩得到所述第一矩阵。
在一个可选的实现方式中,所述信号处理装置还包括累加单元,所述累加单元,用于对所述第二矩阵和所述第一矩阵的乘积做累加得到处理结果。
在一个可选的实现方式中,所述信号处理装置还包括拆分单元,所述拆分单元用于执 行以下至少一项:对原始权重矩阵做拆分得到所述至少一个权重矩阵或者对原始输入矩阵做拆分得到所述至少一个输入矩阵。
在一个可选的实现方式中,所述多个计算机可处理的信号包括:语音信号、文本信号或图像信号中的至少一项。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行上述第二方面以及任一种可选实现方式的方法。
第五方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品包括程序指令,所述程序指令当被处理器执行时使所述信处理器执行上述第二方面以及任一种可选实现方式的方法。
第六方面,本申请实施例提供了一种设备,包括存储器和处理器;存储器用于保存程序指令,处理器用于执行所述程序指令以执行上述第二方面以及任一种可选实现方式的方法。
附图说明
图1为本申请实施例提供的一种神经网络的原理示意图;
图2为本申请实施例提供的一种神经网络具体的实施场景;
图3为本申请实施例提供的另一种神经网络具体的实施场景;
图4为本申请实施例提供的一种矩阵分拆相乘方法的示意图;
图5为本申请实施例提供的一种矩阵分拆相乘架构示意图;
图6为本申请实施例提供的一种信号处理装置的硬件架构示意图;
图7为本申请实施例提供的一种信号处理方法流程图;
图8A为本申请实施例提供的一种压缩原始权重矩阵的示意图;
图8B为本申请实施例提供的一种压缩输入矩阵的示意图;
图8C为本申请实施例提供的一种压缩输入矩阵的示意图;
图9为本申请实施例提供的一种拼接子矩阵的示意图;
图10为本申请实施例提供的另一种信号处理方法;
图11为本申请实施例提供的另一种信号处理装置的硬件架构示意图;
图12为本申请另一实施例提供的一种信号处理方法流程图;
图13为本申请实施例提供的一种压缩权重矩阵的示意图;
图14为本申请实施例提供的另一种信号处理方法流程图;
图15为本申请实施例提供的一种子矩阵相乘的示意图;
图16为本申请实施例提供的一种拼接的子矩阵相乘的示意图;
图17为本申请实施例提供的又一种信号处理装置的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请实施例方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。
本申请的说明书实施例和权利要求书及上述附图中的术语“第一”、“第二”、和“第三”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。“和/或”用于表示在其所连接的两个对象之间选择一个或全部。例如“A和/或B”表示A、B或A+B。
如图1所示,是一种神经网络的原理示意图,该神经网络100具有N个处理层,N≥3且N取自然数,该神经网络的第一层为输入层101,负责接收输入信号,该神经网络的最后一层为输出层103,输出神经网络的处理结果,除去第一层和最后一层的其他层为中间层104,这些中间层共同组成隐藏层102,隐藏层中的每一层中间层既可以接收输入信号,也可以输出信号,隐藏层负责输入信号的处理过程。每一层代表了信号处理的一个逻辑级别,通过多个层,数据信号可经过多级逻辑的处理。
为便于理解,下面对本申请实施例中神经网络的处理原理进行描述,神经网络的处理通常是非线性函数f(x i),如f(x i)=max(0,x i),在一些可行的实施例中,该处理函数可以是激活函数(Rectified Linear Units,ReLU)、双曲正切函数(tanh)或S型函数(sigmoid)等。假设(x 1,x 2,x 3)是一个一维输入信号矩阵,(h 1,h 2,h 3)是输出信号矩阵,W ij表示输入x j与输出h i之间的权重系数,权重系数构成的矩阵为权重矩阵,则该一维输入信号矩阵与输出信号矩阵对应的权重矩阵W如式(1)所示:
Figure PCTCN2018109228-appb-000001
输入信号与输出信号的关系如式(2)所示,其中b i为神经网络处理函数的偏置值,该偏置值对神经网络的输入进行调整从而得到理想的输出结果。
h 1=f(W 11x 1+W 12x 2+W 13x 3+b 1)
h 2=f(W 21x 1+W 22x 2+W 23x 3+b 2)         (2)
h 3=f(W 31x 1+W 32x 2+W 33x 3+b 3)
在一些可行的实施例中该神经网络的输入信号可以是语音信号、文本信号、图像信号、或温度信号等各种形式的信号,该语音信号可以是录音设备录制的语音信号、移动手机或固定电话在通话过程中接收的语音信号、以及收音机接收的电台发送的语音信号等,文本信号可以是TXT文本信号、Word文本信号、以及PDF文本信号等,图像信号可以是相机拍摄的风景信号、显监控设备捕捉的社区环境的图像信号以及门禁系统获取的人脸的面部信号等,该神经网络的输入信号包括其他各种计算机可处理的工程信号,在此不再一一列举。该神经网络的隐藏层102进行的处理可以是去除语音信号中混杂的噪音信号从而增强 语音信号、对文本信号中的特定内容进行理解、以及对人脸的面部图像信号进行识别等处理。
本申请实施例提供一种该神经网络100可以应用于各类设备中。在一种具体的实施场景,如图2所示,智能手机202和2054以内置该神经网络100相关的装置。移动智能手机客户201向移动智能手机客户205发起语音呼叫,语音信号经智能手机202发出,经基站203转送给智能手机204,由于发起语音呼叫时暴雨骤起且伴有强烈的电闪雷鸣,导致输入信号206被严重削弱且含有较大的噪声,该输入信号可以为一维数字语音信号,智能手机204中配备有神经网络100,该神经网络可以是以专用电路的形式在芯片中实现,也可以是运行在中央处理单元(Central Processing Unit,CPU)或其他处理器中的程序指令。输入信号206在智能手机204中的神经网络中经过处理,该处理包括噪声去除以及有效信号增强等,得到输出信号207,该输出信号完整的保留了主叫用户传送的语音信息,避免了恶劣自然环境对信号的干扰。
本申请实施例提供该神经网络100的另一种具体的实施场景,如图3所示,一轿车303在高速行驶,一路人301使用数码相机302拍下了该轿车303的车牌号,但是由于轿车303具有较高的车速v,数码相机的输入信号304发生了运动模糊现象,该输入信号为二维数字图像信号,该数码相机302中配备有神经网络100,该神经网络可以是以专用电路的形式在芯片中实现,也可以是运行在图像信号处理器中的软件模块。输入信号304在数码相机302中的神经网络中经过处理后,该处理包括轿车运动模型估计、运动模糊去除等,得到输出信号305,输出信号中包含的车牌号信息清晰度得以提高,可得到准确辨识。
如前所示,在图像识别、音频识别等领域广泛应用的卷积神经网络往往需要执行大量的矩阵乘法运算,执行矩阵乘法运算需要非常高的存储带宽且运算量很大。为了充分利用硬件的处理能力,卷积神经网络中的卷积运算和全连接层运算(Full Connect,FC)会转换为AxB([MxK]x[KxN])的矩阵相乘运算。其中,A和B均表示矩阵,M表示矩阵A的行数,K表示矩阵A的列数以及矩阵B的行数,N表示矩阵B的列,AxB表示矩阵A与矩阵B相乘。在实际应用中,输入矩阵和权重矩阵都是相对比较大的矩阵,当前的硬件(矩阵乘法器)一次能处理的矩阵的尺寸通常会比输入矩阵和权重矩阵小,因此可能需要把大矩阵乘法拆分成一系列的小矩阵乘法,再根据多个小矩阵乘法最终得出不同尺寸的矩阵乘法结果。
图4为本申请实施例提供的一种矩阵分拆相乘方法的示意图,如图4所示,最左边的矩阵为输入矩阵,中间的矩阵为权重矩阵,最右边的矩阵为输出矩阵,该输入矩阵的尺寸为3Hx3H,该权重矩阵的尺寸为3Hx2H,这两个矩阵相乘得到的输出矩阵的尺寸为3Hx2H。假设硬件(矩阵乘法器)的处理能力为HxH矩阵相乘,需要分别把输入矩阵和权重矩阵拆分为多个HxH矩阵,如图4所示,输入矩阵拆分得到A0至A8,权重矩阵拆分得到B0至B5,每次计算两个HxH矩阵的乘积,横向和纵向均以H点为单位滑动。这样多次矩阵相乘和相加后最终得出一个完整的输出矩阵。其中,C0=A0xB0+A1xB2+A2xB4,C1=A0xB1+A1xB3+A2xB5,C2=A3xB0+A4xB2+A5xB4,C3=A3xB1+A4xB3+A5xB5,C4=A6xB0+A7xB2+A8xB4,C5=A6xB1+A7xB3+A8xB5。在实际应用中,可以依次计算A0xB0、A1xB2以及A2xB4,再把这三次计算得到的矩阵进行相加,得到C0。同理,采用 与计算C0相同的方式,计算C1至C5,再将C0至C5组合成输出矩阵。
可以理解,把大矩阵拆小后需要处理多个矩阵乘法和加法,例如计算图4中输入矩阵和权重矩阵的乘积需要18次小矩阵相乘和12次小矩阵相加,每一个小矩阵的乘法都是按以下公式计算:矩阵C=矩阵Ax矩阵B。其中,矩阵C的计算公式如下:
Figure PCTCN2018109228-appb-000002
其中,最左边的矩阵表示矩阵A,例如图1中的A0,中间的矩阵表示矩阵B,例如图1中的B0,最右边的矩阵表示矩阵C。矩阵C包括的各元素的计算公式如下:
Figure PCTCN2018109228-appb-000003
其中,“*”表示乘号。上面左数第一个矩形框图包括的元素为矩阵A第一列的元素,左数第二个矩形框图包括的元素为矩阵B第一行的元素,左数第三个矩形框图包括的元素为矩阵A第二列的元素,依次类推。从上述公式可以推出,如果矩阵B中出现一整行的0数据,AxB的矩阵相乘过程中,矩阵A的一列数据会被乘以一个0值,参见如下计算公式:
Figure PCTCN2018109228-appb-000004
Figure PCTCN2018109228-appb-000005
其中,公式(13)中左数第一个矩阵为矩阵A,左数第二个矩阵为矩阵B,左数第三个矩阵为矩阵C,矩阵B第一行的元素均为0。从上面的公式可以看出,矩阵A第一列的每个数据均会被乘以一个0值。可以理解,当矩阵B第二行的元素均为0时,矩阵A第二列的每个数据均会被乘以一个0值;当矩阵B第三行的元素均为0时,矩阵A第三列的每 个数据均会被乘以一个0值。同理,当矩阵A第一列的元素均为0时,矩阵B第一行的每个数据均会被乘以一个0值。也就是说,矩阵B中的第M行与矩阵A中的第M列相对应。这样,当矩阵B第M行的元素均为0时,矩阵A第M列的元素均乘以一个0值;当矩阵A第M列的元素均为0时,矩阵B第M行的元素均乘以一个0值。
从公式(13)-(22)可以看出,当矩阵B中有一整列0数据时,[3x3]x[3x3]矩阵相乘可以转换为[3x2]x[2x3]的矩阵相乘。举例来说,公式(13)可以转换为如下公式:
Figure PCTCN2018109228-appb-000006
其中,左数第一个矩阵为公式(13)中的左数第一个矩阵压缩后的矩阵,左数第二个矩阵为公式(13)中的左数第二个矩阵压缩后的矩阵。可以理解,矩阵A和矩阵B相乘时,若矩阵B中的至少一整行或矩阵A中的至少一整列为0时,可以对矩阵A和矩阵B进行压缩,以便于减少矩阵相乘和相加计算次数,从而减少因矩阵运算带来的功耗和带宽开销。
本申请实施例提供了多种在计算矩阵A和矩阵B的乘积时,对矩阵进行压缩的方法。一种矩阵压缩方法如下:在矩阵B(权重矩阵)包括N个全0行的情况下,将该矩阵B的非全0行依次进行拼接得到压缩后的矩阵B,将矩阵A(输入矩阵)的目标列进行拼接得到压缩后的矩阵A,其中,目标列为该矩阵A中除上述N个全0行对应的N列之外的列,矩阵B的第M行与矩阵A的第M列相对应,N和M均为大于0的整数。另一种矩阵压缩方法如下:在矩阵A(输入矩阵)包括N个全0列的情况下,将该矩阵A的非全0列依次进行拼接得到压缩后的矩阵A,将矩阵B的目标行进行拼接得到压缩后的矩阵B,其中,目标行是该矩阵B中除上述N个全0列对应的N行之外的行,矩阵A的第M列与矩阵B的第M行相对应,N和M均为大于0的整数。举例来说,计算矩阵A和矩阵B的乘积时,矩阵B的第二行和第四行为全0行,将该矩阵B的非全0行依次进行拼接得到压缩后的矩阵B,将该矩阵A中除第二列和第四列之外的列依次进行拼接得到压缩后的矩阵A。又举例来说,计算矩阵A和矩阵B的乘积时,矩阵A的第二行和第四行为全0列,将该矩阵A的非全0列依次进行拼接得到压缩后的矩阵A,将该矩阵B中除第二行和第四行之外的行依次进行拼接得到压缩后的矩阵B。在实际应用中,也可以采用上述方法对一个大矩阵拆分得到的子矩阵做压缩。或者可以对压缩后的结果做进一步拆分,本实施例对此不做限定。
图5为本申请实施例提供的一种矩阵分拆相乘架构示意图。如图5所示,矩阵乘法器计算输入的小矩阵a i,k和小矩阵b k,j的乘积,并将计算结果输出至累加器,累加器计算小矩阵c i,j和矩阵乘法器当前输出结果的累加值。图5中的矩阵分拆相乘架构主要是实现以下的矩阵相乘公式:
Figure PCTCN2018109228-appb-000007
矩阵A中的每个子矩阵(至少包括两个元素)可以理解为该矩阵A的一个元素。如图4所示,输入矩阵拆分得到9个小矩阵(A0至A8),A0至A8均可以理解为该输入矩阵的元素(拆分得到的小矩阵),例如A4为该输入矩阵拆分后的第二行第二列的小矩阵。因此,小矩阵a i,k可以理解为A矩阵拆分后的第i行和第k列的小矩阵,小矩阵b k,j可以理解为B矩阵拆分后的第k行和第j列的小矩阵,小矩阵c i,j可以理解为之前A矩阵和B矩阵相乘后 的第i行和第j列的小矩阵累加结果,c` i,j为和当前结果累加后的第i行和第j列的小矩阵。
图6为本申请实施例提供的一种信号处理装置的硬件架构示意图,用于实现神经网络100的运算功能。本申请实施例中的信号处理装置可以应用到手机、平板电脑、服务器、可穿戴设备等可执行矩阵乘法运算的各类设备中。如图6所示,该信号处理装置可以包括电路本、芯片、或芯片组、或相关运行软件程序中的至少一个。包括:外部存储器602,用于存储原始权重矩阵。外部存储器602还可以存储原始输入矩阵以及其他数据。
在图6中,中央处理器(Central Processing Unit,CPU)601,用于从外部存储器602读取上述原始权重矩阵,对上述原始权重矩阵做压缩得到压缩权重矩阵,并将上述压缩权重矩阵发送至外部存储器602,其中,上述压缩权重矩阵是去除上述原始权重矩阵中的至少一个全0行得到的;还用于生成压缩信息,并将上述压缩信息发送至外部存储器602,上述压缩信息用于指示上述至少一个全0行。或者,在压缩过程中、压缩前或压缩后,任一较大的矩阵可以被拆分。例如,CPU601,用于从外部存储器602读取上述原始权重矩阵,对上述原始权重矩阵做压缩得到压缩权重矩阵,并将拆分上述压缩权重矩阵得到的矩阵发送至外部存储器602。再例如,CPU601,用于从外部存储器602读取上述原始权重矩阵,首先对上述原始权重矩阵做分拆得到权重矩阵,并进一步对权重矩阵做压缩得到一矩阵以提供至外部存储器602。
可选的,CPU601读取暂存在外部存储器602中的原始权重矩阵以进行数据压缩,当找到一整行的0值数据时便把该行数据删除,并在压缩的过程中记录被删除掉的行编号(该行编号可记录在一个k-table),当读取完原始权重矩阵后把压缩后的原始权重矩阵和行编号写回外部存储器602。可选的,在完成对上述原始权重矩阵的一部分的压缩后,先将这一部分压缩后的数据写入外部存储器602。对原始权重矩阵的数据压缩可以是将原始权重矩阵中全0行的数据进行删除,再将剩余的行进行拼接;也可以是提取出原始权重矩阵中的非全0行数据进行拼接。可选的,CPU601可以替换为其他类型处理器,如微处理器、微控制器、神经网络处理器(Neural Network Processing Unit,NPU)、或数字信号处理器(Digital Signal Processor,DSP)。可选地,CPU601还可以被专用硬件代替,如专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等以及其他处理器,本申请实施例不作限定。因此,用于执行压缩的处理器是广义上的概念,可以是执行软件程序的处理器、纯硬件逻辑计算电路或者二者的结合。用执行软件程序的处理器是一种常见的实现方式。
在以上实施例中,CPU601,还用于控制直接内存访问控制器(Direct Memory Access Controller,DMAC)610把压缩后的原始权重矩阵,即压缩权重矩阵从外部存储器602搬到权重缓存器603,控制DMAC610把压缩信息(行编号)从外部存储器602搬移到数据压缩单元606,也可以控制DMAC610把未经压缩的原始输入矩阵或从上述原始输入矩阵拆分得到的至少一个输入矩阵从外部存储器602或者结果存储器(result buffer)609搬到原始数据缓存器607(raw data buffer)暂存。可选的,CPU601,还用于从外部存储器602或结果存储器609(前一次计算结果作为本次输入)读取上述原始输入矩阵,对上述原始输入矩阵做拆分得到至少一个输入矩阵,并将得到的至少一个输入矩阵写入外部存储器602。
可选的,CPU601,还用于拆分上述压缩权重矩阵,并控制DMAC610将从上述压缩权重矩阵拆分得到的第一矩阵搬到权重缓存器603。在实际应用中,CPU610拆分压缩权重矩阵可以得到至少一个权重矩阵,DMAC610可以依次将上述压缩权重矩阵拆分得到的权重矩阵搬到权重缓存器604。或者,拆分操作也可在压缩之前进行。
可选的,CPU601,还用于指示原始数据缓存器607将未经压缩的原始输入矩阵进行拆分,并将拆分得到的至少一个输入矩阵导入数据压缩单元606,或者,指示原始数据缓存器607将其存储的上述至少一个输入矩阵导入数据压缩单元606。可选地,CPU601,还用于指示数据压缩单元606对上述至少一个输入矩阵做压缩。可选的,CPU601,还用于确定是否对上述原始输入矩阵或原始权重矩阵进行拆分。
在以上实施例中,数据压缩单元606,用于根据上述压缩信息对上述至少一个输入矩阵做压缩,并将压缩后的上述至少一个输入矩阵(第二矩阵)写入输入缓存器604。矩阵乘法器605,用于从权重缓存器604获取上述第一矩阵,从输入缓存器604获取上述第二矩阵,并计算上述第二矩阵和上述第一矩阵的乘积。累加器608,用于对上述第二矩阵和上述第一矩阵的乘积做累加得到处理结果,保存在结果存储器609。
本申请实施例中的压缩系统可以包括数据压缩单元606和CPU601。图中的部件603至610可以集成在一个集成电路或芯片中,也可以进一步与CPU601集成在一起。可以理解,图中的部件603至610可以为一个运算加速器包括的各部件,该运算加速器挂载在CPU601,以提高CPU601某方面的性能。外部存储器602可以不与图中的部件603至610集成在一起,也可以不与CPU601集成在一起。当然,外部存储器602也可以与图中的部件603至610集成在一起,还可以与CPU601集成在一起。这里的外部是相对压缩系统而言的外部。当然外部存储器602不与部件601或603至610集成,而是独立存在是一种更为常见的方案。
在以上实施例中,外部存储器602可以是双倍速率同步动态随机存储器(Double Data Rate,DDR)、或高带宽存储器(High Bandwidth Memory,HBM)等。外部存储器可以准用于该信号处理装置的硬件架构,或者是一个通用存储器,本实施例对此不限定。CPU601就像管理者,负责控制602至610。可以理解,图6中的602至610在CPU601的控制下进行工作。例如,直接内存存取(Direct Memory Access,DMA)是指一种高速的数据传输操作。DMA允许不同速度的硬件装置来沟通,而不需要依于CPU的大量中断负载,无需CPU直接控制传输,能使CPU的效率大为提高。在实现DMA传输时,是由DMAC直接掌管总线。DMAC获得总线控制权后,CPU即刻挂起或只执行内部操作,由DMAC输出读写命令,直接控制存储器与各类I/O接口进行DMA传输。在DMAC的控制下,在存储器和外部设备之间直接进行数据传送,在传送过程中不需要中央处理器的参与。
在一些实现中,矩阵乘法器605内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路605是通用的矩阵处理器。举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。矩阵乘法器605从权重缓存器603中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。矩阵乘法器从输入缓存器604中取矩阵A数据与矩阵B进行矩阵运算,然后在累加器(accumulator)608中进行连续的加法操作,得到的矩阵的部分结果或最终结果,保存在结果存储器609。
基于图6提供的信号处理装置的硬件架构,图7为本申请实施例提供的一种信号处理方法流程图,如图7所示,该方法可包括:701、CPU601读取外部存储器602中的原始权重矩阵,对上述原始权重矩阵进行压缩得到压缩权重矩阵和压缩信息,并将上述压缩权重矩阵和上述压缩信息存储至外部存储器602。
上述压缩信息用于指示上述原始权重矩阵中的全0行。上述原始权重矩阵包含N个全0行,N为大于0的整数。可选的,上述压缩信息为一个二进制序列,该二进制序列中的每一个二进制值指示原始权重矩阵中的一行是否为全0行。举例来说,原始权重矩阵包括12行,该原始权重矩阵中只有第5行和第8行是全0行,CPU601压缩该原始权重矩阵得到的压缩信息为111101101111(二进制序列),该二进制序列中的二进制数值从左到右依次对应该原始权重矩阵的第1行至第12行,1对应的行是非全0行,0对应的行是全0行。对原始权重矩阵进行压缩的方式可以是将该原始权重矩阵中的全0行数据进行删除,再将剩余的行进行拼接;也可以是提取出该原始权重矩阵中的非全0行数据进行拼接。
图8A为本申请实施例提供的一种原始压缩权重矩阵的示意图。如图8A所示,800表示未经压缩的原始权重矩阵,810表示压缩后的原始权重矩阵;其中,每个小矩形区域对应矩阵中的一个元素,黑色实心部分为非0值的数据,白色部分为0值数据,801、802以及803表示原始权重矩阵中的全0行数据。CPU601在压缩原始权重矩阵时,可以把全0行数据删除掉,再把非全0行进行拼接,得到压缩后的原始权重矩阵(压缩权重矩阵)。另外,CPU601把删除的行号记录下来,如图8A中的K-table(压缩信息)所示,已经删除的行以0标示,没有被删除的行以1标示。行编号(K-table)可以作为压缩输入矩阵时的指引。行编号包括的二进制数值依次与输入矩阵的列相对应。具体的,行编码中对应原始权重矩阵第一行的二进制数值对应输入矩阵中的第一列,行编码中对应原始权重矩阵最后一行的二进制数值对应输入矩阵中的最后一列,依次类推。如图8B所示,行编号包括的二进制数值1对应的列为输入矩阵中待删除的列,0对应的列不为输入矩阵中待删除的列。根据原始权重矩阵的稀疏程度不同,压缩比例会有所不同。例如图8A中,原始权重矩阵为128x64矩阵,该原始权重矩阵有64行全0值的数据,压缩后的原始权重矩阵为64x64矩阵。本申请实施例中,压缩信息可以是指上述行编号。图8A、8B、8C中的二进制数值序列是指压缩信息(K-table)。
进一步地,702、CPU601控制DMAC610将从上述压缩权重矩阵拆分得到的第一矩阵从外部存储器602搬移到权重缓存器603以及控制DMAC610将上述压缩信息从外部存储器602搬移到数据压缩单元606。可选的,CPU601对上述原始权重矩阵做压缩得到压缩权重矩阵后,将拆分上述压缩权重矩阵得到的至少一个权重矩阵发送至外部存储器602。可选的,CPU601指示外部存储器602向权重缓存器604传输上述压缩权重矩阵的方式,即指示外部存储器602每次向权重缓存器604传输上述压缩权重矩阵中的哪一部分(第一矩阵)。
进一步地,703、CPU601控制DMAC610将原始输入矩阵或至少一个输入矩阵从外部存储器602或结果存储器609搬移到原始数据缓存器607。上述至少一个输入矩阵可以是上述原始输入矩阵拆分得到的。可选的,CPU601在执行703之前,CPU601从外部存储器602读取上述原始输入矩阵,对上述原始输入矩阵做拆分得到上述至少一个输入矩阵,并将得到的上述至少一个输入矩阵写入外部存储器602。可选的,CPU601在执行703之前, CPU601从结果存储器609读取上述原始输入矩阵,对上述原始输入矩阵做拆分得到至少一个输入矩阵,并将得到的上述至少一个输入矩阵写入结果存储器609。CPU601执行702和703的顺序不作限定,可以先执行702后执行703,也可以同时执行702和703,还可以先执行703后执行702。
进一步地,704、CPU601指示原始数据缓存器607将上述原始输入矩阵包括的至少一个输入矩阵导入数据压缩单元606,或者,指示原始数据缓存器607将其存储的至少一个输入矩阵导入数据压缩单元606。上述至少一个输入矩阵中任一矩阵的尺寸小于或等于矩阵乘法器可处理的最大矩阵的尺寸。可选的,CPU601在执行704后,指示数据压缩单元606对上述至少一个输入矩阵做压缩。原始数据缓存器607可以根据CPU601的指示,每次向数据压缩单元606导入上述原始输入矩阵的一部分(至少一个输入矩阵)。可以理解,上述至少一个输入矩阵为上述原始输入矩阵的一个子矩阵。
进一步地,705、数据压缩单元606根据上述压缩信息对上述至少一个输入矩阵做压缩,得到至少一个压缩后的输入矩阵。数据压缩单元606根据上述压缩信息对上述至少一个输入矩阵做压缩可以是根据上述压缩信息确定上述至少一个输入矩阵的参考列,并将上述至少一个输入矩阵的参考列拼接起来,上述至少一个输入矩阵的参考列为与上述原始权重矩阵的全0行对应的列。图8C为本申请实施例提供的一种压缩至少一个输入矩阵的示意图。如图8C所示,原始输入矩阵为12x12矩阵,该原始输入矩阵拆分为16个3x3矩阵,K-table为该原始输入矩阵对应的压缩信息。若原始输入矩阵拆分得到的两个输入矩阵在上述原始输入矩阵对应的列不同,则这两个输入矩阵对应压缩信息的不同部分。从图8C可以看出,K-table从左至右分为4个部分,每个部分对应三列。例如,输入矩阵A、输入矩阵E均对应原始输入矩阵的前三列,这两个输入矩阵均对应K-table的第一部分;输入矩阵B、输入矩阵F均对应原始输入矩阵的第4列至第6列,这两个输入矩阵均对应K-table的第二部分。数据压缩单元606可以根据压缩信息(K-table)的第一部分压缩输入矩阵A和输入矩阵E;可以根据压缩信息的第二部分压缩输入矩阵B和输入矩阵F,以便于得到各输入矩阵压缩后的矩阵。图8C中,输入矩阵A的第二列在压缩信息中对应的二进制数值为0,数据压缩单元606删除该输入矩阵A的第二列,并将该输入矩阵A的第一列和第三列拼接起来,得到压缩后的输入矩阵A。在实际应用中,数据压缩单元606可以依次对原始输入矩阵拆分得到的至少一个输入矩阵做压缩。
进一步地,706、数据压缩单元606将上述至少一个压缩后的输入矩阵写入乒乓缓存器进行拼接。例如,乒乓缓存器(ping pang buffer)包括乒缓存器(ping buffer)和乓缓存器(pang buffer)。上述乒缓存器和上述乓缓存器的存储空间大小相同,上述原始输入矩阵拆分得到的输入矩阵的大小可以与上述乒缓存器可存储的最大矩阵的大小相同。图9为本申请实施例提供的一种拼接压缩后的输入矩阵的示意图。图9中的小矩阵A、小矩阵B、小矩阵C以及小矩阵D依次为图8C中原始输入矩阵拆分得到的输入矩阵A、输入矩阵B、输入矩阵C以及输入矩阵D。如图9所示,数据压缩单元606拼接压缩后的输入矩阵的过程如下:将压缩后的输入矩阵A写入乒缓存器;先将乒缓存器的存储空间填满(即将压缩后的输入矩阵B的第一列写入乒缓存器),再将压缩后的输入矩阵B的第二列(剩余的列)写入乓缓存器;输出乒缓存器中的矩阵,即将乒缓存器中的矩阵写入输入缓存器;先将乓 缓存器填满(即将压缩后的输入矩阵C的前两列写入乓缓存器),再将压缩后的输入矩阵C的第三列(剩余的列)写入乒缓存器;将乓缓存器中的矩阵写入输入缓存器;先将乒缓存器填满,即将压缩后的输入矩阵D写入乒缓存器;将乒缓存器中的矩阵写入输入缓存器。图9中,乒缓存器和乓缓存器可存储的最大矩阵为3x3矩阵,数据压缩单元606在将乒缓存器和乓缓存器中的一个的存储空间填满后,将存储空间填满的存储器中的矩阵写入输入缓存器604。乒缓存器和乓缓存器中的一个的存储空间填满表示乒缓存器或乓缓存器中的数据量已经满足矩阵乘法器的要求。也就是说,数据压缩单元拼接得到的输入矩阵满足矩阵乘法器的要求。上述过程可概括如下:
进一步地,707、数据压缩单元606判断上述乒乓缓存器是否已存储一个JxK的矩阵。若是,执行708;若否,执行704。J和K均为大于0的整数。JxK矩阵可以为乒缓存器可存储的最大矩阵。上述数据压缩单元606判断上述乒乓缓存器是否已存储一个JxK的矩阵可以是判断乒缓存器的存储空间或乓缓存器的存储空间是否已填满。
进一步地,708、数据压缩单元606将乒乓缓存器存储的JxK的矩阵写入输入缓存器604。709、矩阵乘法器605分别从输入缓存器604和权重缓存器603获取矩阵,并进行矩阵相乘。矩阵乘法器605从权重缓存器603获取的矩阵是上述压缩权重矩阵拆分得到的第一矩阵,从输入缓存器604获取的矩阵是对至少两个压缩后的输入矩阵做拼接得到的矩阵(第二矩阵)。710、累加器608对矩阵乘法器605的矩阵相乘的乘积做累加,得到处理结果。
进一步地,711、CPU601判断矩阵乘法器计算的是否为原始输入矩阵拆分得到的最后一个输入矩阵。若是,执行712,若否,执行704。712、停止执行704。本申请实施例中,信号处理装置通过对至少一个输入矩阵和至少一个权重矩阵做压缩,可以减少矩阵乘法器执行矩阵相乘运算的次数,提高计算效率。
图7的方法中,矩阵乘法器605计算的是由原始输入矩阵拆分得到的至少一个输入矩阵拼接的矩阵以及压缩后的原始权重矩阵拆分得到的权重矩阵的乘积。可以理解,在图7的方法中,需要对压缩后的原始权重矩阵进行拆分以及对原始输入矩阵进行拆分。在原始权重矩阵的尺寸和原始输入矩阵的尺寸均小于矩阵乘法器可处理的最大矩阵的情况下,可以不对原始权重矩阵和原始输入矩阵进行拆分,而是直接计算压缩后的原始输入矩阵和压缩后的原始权重矩阵的乘积。基于图6提供的信号处理装置的硬件架构,图10为本申请实施例提供的另一种信号处理方法,如图10所示,该方法可包括:1001、CPU601读取外部存储器602中的权重矩阵,对上述权重矩阵进行压缩得到第一矩阵和压缩信息,并将上述第一矩阵和上述压缩信息存储至外部存储器602。上述对上述权重矩阵进行压缩得到第一矩阵和压缩信息可以是检测上述权重矩阵中的全0行的位置信息,将上述权重矩阵中的非全0行拼接在一起得到上述第一矩阵,根据上述位置信息得到上述压缩信息。上述压缩信息用于指示上述权重矩阵中的全0行。
进一步地,1002、CPU601控制DMAC610将上述第一矩阵从外部存储器602搬移到权重缓存器603以及控制DMAC610将上述压缩信息从外部存储器602搬移到数据压缩单元606。1003、CPU601控制DMAC610将输入矩阵从外部存储器602或结果存储器609搬移到原始数据缓存器607。1004、CPU601指示原始数据缓存器607将输入矩阵导入数据压缩 单元606。1005、数据压缩单元606根据上述压缩信息对上述输入矩阵做压缩得到第二矩阵。
上述权重矩阵包含N个全0行,N为大于0的整数。可选的,数据压缩单元606根据上述压缩信息将上述输入矩阵的目标列进行拼接得到第二矩阵,上述目标列为上述输入矩阵中除上述N个全0行对应的N列之外的列,其中,上述权重矩阵的第F行对应上述输入矩阵的第F列,F为大于0的整数。可选的,数据压缩单元606根据上述压缩信息去除上述输入矩阵中与上述权重矩阵的全0行对应的列得到上述第二矩阵。
进一步地,1006、数据压缩单元606将上述第二矩阵导入输入缓存器。1007、CPU601指示矩阵乘法器605从上述权重缓存器604获取上述第一矩阵以及从上述输入缓存器603获取上述第二矩阵。1008、CPU601指示矩阵乘法器605计算上述第二矩阵和上述第一矩阵的乘积。本申请实施例中,信号处理装置可以通过对权重矩阵和输入矩阵做压缩,减少矩阵乘法器处理的矩阵的大小,提高计算效率。
图11为本申请实施例提供的另一种信号处理装置的硬件架构示意图。本申请实施例中的信号处理装置可以应用到手机、平板电脑、服务器、可穿戴设备等可执行矩阵乘法运算的设备中。如图11所示,该信号处理装置可以包括:外部存储器1102,用于存储原始输入矩阵。外部存储器1102还可以存储原始权重矩阵以及其他数据。
CPU1101,用于从外部存储器1102读取上述原始输入矩阵,对上述原始输入矩阵做压缩得到压缩输入矩阵,并将上述压缩输入矩阵发送至外部存储器1102,其中,上述压缩输入矩阵是去除上述原始输入矩阵中的至少一个全0列得到的;还用于生成压缩信息,并将上述压缩信息发送至外部存储器1102,上述压缩信息用于指示上述至少一个全0列。或者,CPU601,还用于拆分上述压缩输入矩阵,将拆分上述压缩输入矩阵得到的至少一个输入矩阵发送至外部存储器602。
可选的,CPU1101读取暂存在外部存储器1102中的原始输入矩阵以进行数据压缩,当找到一整列的0值数据时便把该列数据删除,并在压缩的过程中记录被删除掉的列编号(该列编号可记录在一个k-table),当读取完原始输入矩阵后把压缩后的原始输入矩阵和列编号写回外部存储器1102。可选的,在完成对原始输入矩阵的一部分的压缩后,先将这一部分压缩后的数据写入外部存储器1102。对原始输入矩阵的压缩可以是将原始输入矩阵中全0列的数据进行删除,再将剩余的列进行拼接;也可以是提取出原始输入矩阵中的非全0列数据进行拼接。与之前实施例的描述类似。
CPU1101,还用于控制DMAC1110把压缩后的原始输入矩阵(压缩输入矩阵)从外部存储器1102搬到输入缓存器1103,控制DMAC1110把压缩信息(列编号)从外部存储器1102搬移到数据压缩单元1106,同时控制DMAC1110把未经压缩的原始权重矩阵或原始权重矩阵拆分得到的至少一个权重矩阵从外部存储器1102搬到原始数据缓存器1107(raw data buffer)暂存。
可选的,CPU1101,还用于拆分上述压缩输入矩阵,并控制DMAC1110将从上述压缩输入矩阵拆分得到的第二矩阵搬到输入缓存器1103。在实际应用中,CPU1110拆分压缩输入矩阵可以得到至少一个输入矩阵,DMAC1110可以依次将上述压缩输入矩阵拆分得到的输入矩阵搬到输入缓存器1103。
可选的,CPU1101,还用于指示原始数据缓存器1107将未经压缩的原始权重矩阵进行拆分,并将拆分得到的至少一个权重矩阵导入数据压缩单元1106,或者,指示原始数据缓存器1107将其存储的至少一个权重矩阵导入数据压缩单元1106。可选的,数据压缩单元1106从原始数据缓存器1107获取其存储的至少一个权重矩阵。进一步地,CPU1101,还用于指示数据压缩单元1106对上述至少一个权重矩阵做压缩。可选的,CPU1101,还用于确定是否对上述原始输入矩阵或原始权重矩阵进行拆分。
在上述实施例中,数据压缩单元1106,用于根据上述压缩信息对上述至少一个权重矩阵做压缩,并将压缩后的上述至少一个权重矩阵(第一矩阵)写入权重缓存器1104。矩阵乘法器1105,用于从权重缓存器1104获取上述第一矩阵,从输入缓存器1103获取上述第二矩阵,并计算上述第二矩阵和上述第一矩阵的乘积。累加器1108,用于对上述第二矩阵和上述第一矩阵的乘积做累加得到处理结果,保存在结果存储器1109。
本申请实施例中的压缩系统可以包括数据压缩单元1106和CPU1101。图中的1103至1110可以集成在一个集成电路或芯片中,也可以与CPU集成在一起。可以理解,图中的1103至1110可以为一个运算加速器包括的各部件,该运算加速器挂载在CPU1101,以提高CPU1101某方面的性能。
图11中的信号处理装置与图6中的信号处理装置的不同之处主要包括以下几点:(1)、输入缓存器1103连接DMAC1110,权重缓存器1104连接数据压缩单元;(2)、CPU1101对原始输入矩阵进行压缩;(3)、数据压缩单元对至少一个权重矩阵做压缩。基于图11提供的信号处理装置的硬件架构,图12为本申请实施例提供的一种信号处理方法,如图12所示,该方法可包括:1201、CPU1101读取外部存储器1102中的原始输入矩阵,对上述原始输入矩阵做压缩得到压缩输入矩阵和压缩信息,并将上述压缩输入矩阵和上述压缩信息存储至外部存储器1102。
上述压缩信息用于指示上述原始输入矩阵中的全0列。上述原始输入矩阵包含N个全0列,N为大于0的整数。可选的,上述压缩信息为一个二进制序列,该二进制序列中的每一个二进制值指示原始输入矩阵中的一列是否为全0列。举例来说,原始输入矩阵包括12列,该原始输入矩阵中只有第5列和第8列是全0列,CPU1101压缩该原始输入矩阵得到的压缩信息为111101101111(二进制序列),该二进制序列中的二进制数值从左到右依次对应输入矩阵的第1列至第12列,1对应的列是非全0列,0对应的列时全0列。对原始输入矩阵进行压缩的方式可以是将该原始输入矩阵中的全0列数据进行删除,再将剩余的列进行拼接;也可以是提取出该原始输入矩阵中的非全0列数据进行拼接。
进一步地,1202、CPU1101控制DMAC1110将从上述压缩输入矩阵拆分得到的第二矩阵从外部存储器1102搬移到输入缓存器1103以及控制DMAC1110将上述压缩信息从外部存储器1102搬移到数据压缩单元1106。可选的,CPU1101对上述原始输入矩阵做压缩得到压缩输入矩阵后,将拆分上述压缩输入矩阵得到的至少一个输入矩阵发送至外部存储器602。
进一步地,1203、CPU1101控制DMAC1110将原始权重矩阵或原始权重矩阵拆分得到的至少一个权重矩阵从外部存储器1102搬移到原始数据缓存器1107。可选的,CPU1101在执行1203之前,CPU1101从外部存储器1102读取上述原始权重矩阵,对上述原始权重 矩阵做拆分得到上述至少一个权重矩阵,并将得到的上述至少一个权重矩阵写入外部存储器1102。可选的,CPU1101在执行1203之前,CPU1101从结果存储器1109读取上述原始权重矩阵,对上述原始权重矩阵做拆分得到至少一个权重矩阵,并将得到的上述至少一个权重矩阵写入结果存储器609。CPU执行1202和1203的顺序不作限定,可以先执行1202后执行1203,也可以同时执行1202和1203,还可以先执行1203后执行1202。
进一步地,1204、CPU1101指示原始数据缓存器1107将原始权重矩阵包括的至少一个权重矩阵导入数据压缩单元1106,或者,指示原始数据缓存器1107将其存储至少一个权重矩阵导入数据压缩单元1106。上述至少一个权重矩阵中任一权重矩阵的尺寸小于或等于矩阵乘法器可处理的最大矩阵的尺寸。可选的,CPU1101在执行1204后,指示数据压缩单元1106压缩上述至少一个权重矩阵。
进一步地,1205、数据压缩单元1106根据上述压缩信息对上述至少一个权重矩阵做压缩,得到至少一个压缩后的权重矩阵。数据压缩单元1106根据上述压缩信息对上述至少一个权重矩阵做压缩可以是根据上述压缩信息确定上述至少一个权重矩阵的参考行,并将上述至少一个权重矩阵的参考行拼接起来,上述至少一个权重矩阵的参考行是与上述原始输入矩阵的全0列对应的行。图13为本申请实施例提供的一种压缩权重矩阵的示意图。如图13所示,原始权重矩阵为12x12矩阵,该原始权重矩阵拆分为16个3x3矩阵,K-table为该原始权重矩阵对应的压缩信息。若原始权重矩阵拆分得到的两个权重矩阵在上述权重矩阵对应的行不同,则这两个权重矩阵对应压缩信息的不同部分。从图13可以看出,K-table从上至下分为4个部分,每个部分对应三行。例如,权重矩阵A、权重矩阵B均对应原始权重矩阵的前三行,这两个权重矩阵均对应K-table的第一部分;权重矩阵E、权重矩阵F均对应原始权重矩阵的第4行至第6行,这两个权重矩阵均对应K-table的第二部分。数据压缩单元1106可以根据压缩信息(K-table)的第一部分压缩权重矩阵A和权重矩阵B;可以根据压缩信息的第二部分压缩权重矩阵E和权重矩阵F,以得到压缩后的权重矩阵。图13中,权重矩阵A的第二行在压缩信息中对应的二进制数值为0,数据压缩单元1106删除该权重矩阵A的第二行,并将该权重矩阵A的第一行和第三行拼接起来,得到压缩后的权重矩阵A。在实际应用中,数据压缩单元1106可以依次对原始权重矩阵拆分得到的至少一个权重矩阵做压缩。
进一步地,1206、数据压缩单元1106将上述至少一个压缩后的权重矩阵写入乒乓缓存器进行拼接。例如,乒乓缓存器包括乒缓存器和乓缓存器。上述乒缓存器和上述乓缓存器的存储空间大小相同,上述原始权重矩阵拆分得到的各权重矩阵的大小可以与上述乒缓存器可存储的最大矩阵的大小相同。具体的拼接压缩后的权重矩阵的方法与图9中的方法类似,这里不再详述。
进一步地,1207、数据压缩单元1106判断上述乒乓缓存器是否已存储一个JxK的矩阵。若是,执行1208;若否,执行1204。J和K均为大于0的整数。JxK矩阵可以为乒缓存器可存储的最大矩阵。上述数据压缩单元1106判断上述乒乓缓存器是否已存储一个JxK的矩阵可以是判断乒缓存器的存储空间或乓缓存器的存储空间是否已填满。
进一步地,1208、数据压缩单元1106将乒乓缓存器存储的JxK的矩阵写入权重缓存器1104。1209、矩阵乘法器1105分别从输入缓存器1103和权重缓存器1104获取矩阵,并进 行矩阵相乘。矩阵乘法器1105从权重缓存器1104获取的矩阵是对至少两个压缩后的权重矩阵做拼接得到的矩阵(第一矩阵),从输入缓存器1103获取的矩阵是上述压缩输入矩阵拆分得到的第二矩阵。1210、累加器1108对矩阵乘法器1105的矩阵相乘的乘积做累加,得到处理结果。
进一步地,1211、CPU1101判断矩阵乘法器计算的是否为原始权重矩阵拆分得到的最后一个权重矩阵。若是,执行1212,若否,执行1204。1212、停止执行1204。本申请实施例中,信号处理装置通过对原始权重矩阵拆分得到的权重矩阵的参考行进行拼接以及对原始输入矩阵做压缩,可以减少矩阵乘法器执行矩阵相乘运算的次数,提高计算效率。
在权重矩阵的尺寸和输入矩阵的尺寸均小于矩阵乘法器可处理的最大矩阵的情况下,可以不对权重矩阵和输入矩阵进行拆分,而是直接计算压缩后的输入矩阵和压缩后的权重矩阵的乘积。基于图11提供的信号处理装置的硬件架构,图14为本申请实施例提供的另一种信号处理方法流程图,如图11所示,该方法可包括:1401、CPU1101读取外部存储器1102中的输入矩阵,对上述输入矩阵进行压缩得到第二矩阵和压缩信息,并将上述第二矩阵和上述压缩信息存储至外部存储器1102。上述对上述输入矩阵进行压缩得到第二矩阵和压缩信息可以是检测上述输入矩阵中的全0列的位置信息,将上述输入矩阵中的非全0列拼接在一起得到上述第二矩阵,根据上述位置信息得到上述压缩信息。上述压缩信息用于指示上述输入矩阵中的全0列。
进一步地,1402、CPU1101控制DMAC1110将上述第二矩阵从外部存储器1102搬移到输入缓存器1103以及控制DMAC1110将上述压缩信息从外部存储器1102搬移到数据压缩单元1106。1403、CPU1101控制DMAC1110将权重矩阵从外部存储器1102搬移到原始数据缓存器1107。1404、CPU1101指示原始数据缓存器1107将上述权重矩阵导入数据压缩单元1106。1405、数据压缩单元1106根据上述压缩信息对上述权重矩阵做压缩得到第一矩阵。1406、数据压缩单元1106将上述第一矩阵导入权重缓存器。1407、CPU1101指示矩阵乘法器1105从上述输入缓存器1103获取上述第二矩阵以及从上述权重缓存器1104获取上述第一矩阵。1408、CPU1101指示矩阵乘法器1105计算上述第二矩阵和上述第一矩阵的乘积。
本申请实施例中,信号处理装置通过对输入矩阵和权重矩阵做压缩,可以减少矩阵乘法器处理的矩阵的大小,提高计算效率。图7中的信号处理方法是将原始输入矩阵和原始权重矩阵的相乘转换为压缩后的原始权重矩阵拆分得到的权重矩阵和拆分原始输入矩阵得到的多个权重矩阵拼接的矩阵的相乘。图9中的信号处理方法是将原始输入矩阵和原始权重矩阵的相乘转换为压缩后的原始输入矩阵拆分得到的输入矩阵和拆分原始权重矩阵得到的多个权重矩阵拼接的矩阵的相乘。
可以理解,采用图7和图12中的方法能够计算输入矩阵和权重矩阵乘积的前提是拼接得到的子矩阵的乘积与未拼接的子矩阵的乘积相同。图15为本申请实施例提供的一种子矩阵相乘的示意图。如图15所示,A0、A1为输入矩阵拆分得到的子矩阵,B0、B2为权重矩阵拆分得到的子矩阵,B0的第三行为全0行,B2的第一行和第二行为全0行。图16为本申请实施例提供的一种拼接的子矩阵相乘的示意图。图16中的A0、A1、B0、B2以及C0分别与图15中的A0、A1、B0、B2以及C0相同。如图16所示,A’0为A0的前两列 和A1的第三列拼接得到的子矩阵,B’0为B0的前两行和B2的第三行拼接得到的子矩阵。对比图15和图16可以看出,图15中的C0的各元素与图16中的C0的各元素相同。因此,拼接得到的子矩阵的乘积与未拼接的子矩阵的乘积相同。从上述实施例中可以看出,可以将多个矩阵中的至少一个全0行或全0列去除以得到压缩后的结果,即多个矩阵压缩为一个矩阵。或者,也可以仅对一个矩阵做压缩得到压缩后的一个矩阵,本实施例对此不作限定。
在实际应用中,信号处理装置可以根据实际需要采用图7、图10、图12以及图14中的任一种方法。若信号处理装置采用图6中的架构,则可以执行图7和图10中的方法。若信号处理装置采用图11中的架构,则可以执行图12和图14中的方法。信号处理装置中的CPU可以确定是否对原始权重矩阵或原始输入矩阵进行拆分。若信号处理装置采用图6中的架构且CPU确定对原始权重矩阵或原始输入矩阵进行拆分,则采用图7中的方法。若信号处理装置采用图6中的架构且CPU确定不对原始权重矩阵或原始输入矩阵进行拆分,则采用图10中的方法。可选的,信号处理装置中的CPU预置有矩阵乘法器可处理的最大矩阵的尺寸,若该CPU确定原始权重矩阵和原始输入矩阵的尺寸均小于该最大矩阵的尺寸,则确定不对原始权重矩阵和原始输入矩阵进行拆分;若该CPU确定原始权重矩阵或原始输入矩阵的尺寸大于该最大矩阵的尺寸,则确定对原始权重矩阵和原始输入矩阵进行拆分。
以上实施例中的方案主要介绍了实现矩阵压缩和信号处理的方法和相应的装置,具体可参考之前实施例对应的装置和方法。实际上相关方法也可以通过硬件、软件或软硬件结合的方式来实现。如果相关方法以软件方式实现,其可以被认为主要以软件程序或存储该软件的存储介质的方式存在。软件程序可被视为是一种计算程序产品。
本申请实施例还提供了一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序包括软件程序指令,上述程序指令被处理器执行时实现:对至少一个权重矩阵做压缩得到第一矩阵,并对至少一个输入矩阵做压缩得到第二矩阵;其中,上述输入矩阵包括多个计算机可处理的信号,上述权重矩阵包括多个权重系数;压缩后的上述第一矩阵和上述第二矩阵满足如下限定:上述第一矩阵是去除上述至少一个权重矩阵中的至少一个全0行得到的,上述第二矩阵是去除上述至少一个输入矩阵中与上述至少一个全0行对应的至少一个列得到的;或者,上述第二矩阵是去除上述至少一个输入矩阵中的至少一个全0列得到的,上述第一矩阵是去除上述至少一个权重矩阵中与上述至少一个全0列对应的至少一个行得到的;计算上述第二矩阵和上述第一矩阵的乘积。该程序指令被处理器执行时实现的方法流程的细节可以参照之前实施例提到的方法流程。所述多个计算机可处理的信号包括:语音信号、文本信号或图像信号中的至少一项。在一个可选的实现方式中,所述方法还包括:生成压缩信息,所述压缩信息用于指示所述至少一个全0行;所述获取第二矩阵包括:根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。在一个可选的实现方式中,所述方法还包括:生成压缩信息,所述压缩信息用于指示所述至少一个全0列;所述获取第一矩阵包括:根据所述压缩信息对所述至少一个权重矩阵做压缩得到所述第一矩阵。
在一个可选的实现方式中,所述计算所述第二矩阵和所述第一矩阵的乘积之后,所述方法还包括:对所述第二矩阵和所述第一矩阵的乘积做累加得到处理结果。在一个可选的 实现方式中,所述对至少一个权重矩阵做压缩得到第一矩阵,并对至少一个输入矩阵做压缩得到第二矩阵之前,所述方法还包括如下至少一项:对原始权重矩阵做拆分得到所述至少一个权重矩阵或者对原始输入矩阵做拆分得到所述至少一个输入矩阵。
进一步地,本申请实施例提供了一种设备,例如手机、平板电脑、服务器、可穿戴设备等可执行矩阵乘法运算的各类设备。该设备包括存储器和处理器。存储器作为计算机可读存储介质,用于保存程序指令,处理器用于执行所述程序指令以实现以上提到的方法流程。
图17为本申请实施例提供的另一种信号处理装置,可置于所述设备中,该信号处理装置包括:压缩单元1701,用于对至少一个权重矩阵做压缩得到第一矩阵,并对至少一个输入矩阵做压缩得到第二矩阵;其中,上述输入矩阵包括多个计算机可处理的信号,上述权重矩阵包括多个权重系数;压缩后的上述第一矩阵和上述第二矩阵满足如下限定:上述第一矩阵是去除上述至少一个权重矩阵中的至少一个全0行得到的,上述第二矩阵是去除上述至少一个输入矩阵中与上述至少一个全0行对应的至少一个列得到的;或者,上述第二矩阵是去除上述至少一个输入矩阵中的至少一个全0列得到的,上述第一矩阵是去除上述至少一个权重矩阵中与上述至少一个全0列对应的至少一个行得到的;计算单元1702,用于计算上述第二矩阵和上述第一矩阵的乘积。进一步地,所述多个计算机可处理的信号包括:语音信号、文本信号或图像信号中的至少一项。
可选地,所述压缩单元1701,还用于:生成压缩信息,所述压缩信息用于指示所述至少一个全0行;以及根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。可选地,所述压缩单元1701,还用于:生成压缩信息,所述压缩信息用于指示所述至少一个全0列;以及根据所述压缩信息对所述至少一个权重矩阵做压缩得到所述第一矩阵。
进一步地,所述信号处理装置还包括累加单元1703,所述累加单元1703,用于对所述第二矩阵和所述第一矩阵的乘积做累加得到处理结果。进一步地,信号处理装置还包括拆分单元1704,所述拆分单元1704,用于对原始权重矩阵做拆分得到所述至少一个权重矩阵;以及对原始输入矩阵做拆分得到所述至少一个输入矩阵。
该实施例中的压缩单元1701、计算单元1702、累加单元1703以及拆分单元1704可以由软件、硬件或软硬件结合实现。由此可见,上述设备给或装置实施例中涉及的处理过程,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。上述计算机程序产品包括一个或多个计算机指令。在信号处理装置上加载或执行上述计算机程序指令时,全部或部分地产生按照本发明实施例上述的流程或功能。上述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。上述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state Drive,SSD)。
以上的实施例中的一种实现方式是直接压缩至少一个权重矩阵得到第一矩阵。但是作为一个可替换的实现方式,第一矩阵可以是预设的。因为第一矩阵作为一种权重参数矩阵通常不会改变,不需要每次都重新计算该第一矩阵。因此上述第一矩阵以及上述压缩信息可以是预设在需要执行矩阵乘法运算的设备中。例如,上述压缩系统从其他器件或者上述 信号处理装置中的某个存储器,例如图6中的外部存储器602中直接获取预设的第一矩阵。或者,可选地,所述第一矩阵可以以硬件形式预设在压缩系统中,本实施例对于第一矩阵在整个系统或装置内的具体预设的方式不做限制。这样,不需要在每次执行的时候都重新计算第一矩阵,而是根据预设的第一矩阵和压缩信息直接进行输入矩阵的压缩得到对应于第一矩阵的第二矩阵,对于具体的压缩过程可参照之前实施例的描述,此处不做赘述。
在以上实施例中,上述第一矩阵可以是上述压缩系统对至少一个权重矩阵做压缩得到的矩阵,还可以是上述压缩后的权重矩阵做进一步拆分得到的矩阵。可以理解,上述压缩系统可以通过压缩矩阵的方式直接得到上述第一矩阵,也可以通过先拆分矩阵再压缩拆分得到的矩阵的方式得到上述第一矩阵,还可以通过先压缩矩阵在拆分压缩后的矩阵的方式得到上述第一矩阵,本申请不作限定得到第一矩阵的方式。关于在压缩操作前后,相关矩阵是否经过一次或多次拆分,本实施例不做限定。相关拆分操作可以使得得到的矩阵大小满足预设的规格,有利于执行运算。另外,上述压缩系统可以在离线状态(未启动矩阵乘法运算任务)下对至少一个权重矩阵做压缩得到上述第一矩阵。例如,可以在设备出厂之前或制造和开发过程中,离线压缩权重矩阵以得到第一矩阵并预设在设备中,例如存储器内部。因此,在后续在线操作,即用户需要执行任务的时候,以上预设的第一矩阵可以被使用,以达到本实施例的效果。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (21)

  1. A signal processing apparatus, comprising:
    a compression system, configured to obtain a compressed first matrix and to compress at least one input matrix to obtain a second matrix, wherein the first matrix and the second matrix satisfy the following constraint: the first matrix is obtained by removing at least one all-0 row from at least one weight matrix, and the second matrix is obtained by removing, from the at least one input matrix, at least one column corresponding to the at least one all-0 row; or the second matrix is obtained by removing at least one all-0 column from the at least one input matrix, and the first matrix is obtained by removing, from the at least one weight matrix, at least one row corresponding to the at least one all-0 column; the input matrix comprises multiple computer-processable signals, and the weight matrix comprises multiple weight coefficients; and
    a matrix multiplier, configured to obtain the first matrix and the second matrix from the compression system and to calculate a product of the second matrix and the first matrix.
  2. 根据权利要求1所述的信号处理装置,其特征在于,所述压缩系统,具体用于获取预设的所述第一矩阵。
  3. 根据权利要求1所述的信号处理装置,其特征在于,所述压缩系统,具体用于对所述至少一个权重矩阵做压缩得到所述第一矩阵。
  4. 根据权利要求1至3中任一项所述的信号处理装置,所述压缩系统包括:处理器和数据压缩单元;
    所述处理器,用于对所述至少一个权重矩阵做压缩得到所述第一矩阵;和/或
    所述数据压缩单元,用于对所述至少一个输入矩阵做压缩得到所述第二矩阵。
  5. 根据权利要求4所述的信号处理装置,所述处理器,还用于生成压缩信息,所述压缩信息用于指示所述至少一个全0行;所述数据压缩单元,还用于根据所述压缩信息对所述至少一个输入矩阵做压缩得到所述第二矩阵。
  6. 根据权利要求5所述的信号处理装置,所述信号处理装置还包括:直接内存访问控制器DMAC与权重缓存器,所述DMAC耦合至所述权重缓存器和外部存储器;所述处理器,还用于将所述第一矩阵和所述压缩信息存入所述外部存储器;所述DMAC,用于将所述第一矩阵从所述外部存储器搬移到所述权重缓存器,以及用于将所述压缩信息从所述外部存储器搬移到所述数据压缩单元;所述矩阵乘法器还用于从所述权重缓存器获取所述第一矩阵。
  7. 根据权利要求6所述的信号处理装置,所述信号处理装置还包括原始数据缓存器和输入缓存器;所述DMAC还用于将所述至少一个输入矩阵从所述外部存储器搬移到所述原始数据缓存器;所述数据压缩单元还用于从所述原始数据缓存器获取所述至少一个输入矩阵,并在对所述至少一个输入矩阵做压缩得到所述第二矩阵后将所述第二矩阵存入所述输入缓存器;所述矩阵乘法器还用于从所述输入缓存器获取所述第二矩阵。
  8. The signal processing device according to any one of claims 1 to 3, wherein the compression system comprises: a processor and a data compression unit;
    the processor is configured to compress the at least one input matrix to obtain the second matrix; and/or
    the data compression unit is configured to compress the at least one weight matrix to obtain the first matrix.
  9. The signal processing device according to claim 8, wherein the processor is further configured to generate compression information, the compression information being used to indicate the at least one all-zero column; and the data compression unit is further configured to compress the at least one weight matrix according to the compression information to obtain the first matrix.
  10. The signal processing device according to claim 9, further comprising: a direct memory access controller (DMAC) and an input buffer, the DMAC being coupled to the input buffer and to an external memory; the processor is further configured to store the second matrix and the compression information in the external memory; the DMAC is configured to move the second matrix from the external memory to the input buffer, and to move the compression information from the external memory to the data compression unit; and the matrix multiplier is further configured to obtain the second matrix from the input buffer.
  11. The signal processing device according to claim 10, further comprising a raw data buffer and a weight buffer; the DMAC is further configured to move the at least one weight matrix from the external memory to the raw data buffer; the data compression unit is further configured to obtain the at least one weight matrix from the raw data buffer, and to store the first matrix in the weight buffer after compressing the at least one weight matrix to obtain the first matrix; and the matrix multiplier is further configured to obtain the first matrix from the weight buffer.
  12. The signal processing device according to any one of claims 1 to 11, further comprising an accumulation unit, configured to accumulate the products of the second matrix and the first matrix to obtain a processing result.
  13. The signal processing device according to any one of claims 4 to 10, wherein the processor is further configured to perform at least one of the following:
    splitting an original weight matrix to obtain the at least one weight matrix; or
    splitting an original input matrix to obtain the at least one input matrix.
  14. The signal processing device according to any one of claims 1 to 13, wherein the plurality of computer-processable signals include at least one of a speech signal, a text signal, or an image signal.
  15. A signal processing method, characterized by comprising:
    obtaining a compressed first matrix, and compressing at least one input matrix to obtain a second matrix; the first matrix and the second matrix satisfy the following constraints: the first matrix is obtained by removing at least one all-zero row from at least one weight matrix, and the second matrix is obtained by removing, from the at least one input matrix, at least one column corresponding to the at least one all-zero row; or, the second matrix is obtained by removing at least one all-zero column from the at least one input matrix, and the first matrix is obtained by removing, from the at least one weight matrix, at least one row corresponding to the at least one all-zero column; the input matrix comprises a plurality of computer-processable signals, and the weight matrix comprises a plurality of weight coefficients; and
    computing the product of the second matrix and the first matrix.
  16. The method according to claim 15, characterized in that the method further comprises: generating compression information, the compression information being used to indicate the at least one all-zero row; and
    the compressing of the at least one input matrix to obtain the second matrix comprises: compressing the at least one input matrix according to the compression information to obtain the second matrix.
  17. The method according to claim 15, characterized in that the method further comprises: generating compression information, the compression information being used to indicate the at least one all-zero column; and
    the obtaining of the first matrix comprises: compressing the at least one weight matrix according to the compression information to obtain the first matrix.
  18. A signal processing device, characterized by comprising:
    a compression unit, configured to obtain a compressed first matrix, and to compress at least one input matrix to obtain a second matrix; the first matrix and the second matrix satisfy the following constraints: the first matrix is obtained by removing at least one all-zero row from at least one weight matrix, and the second matrix is obtained by removing, from the at least one input matrix, at least one column corresponding to the at least one all-zero row; or, the second matrix is obtained by removing at least one all-zero column from the at least one input matrix, and the first matrix is obtained by removing, from the at least one weight matrix, at least one row corresponding to the at least one all-zero column; the input matrix comprises a plurality of computer-processable signals, and the weight matrix comprises a plurality of weight coefficients; and
    a computation unit, configured to compute the product of the second matrix and the first matrix.
  19. The signal processing device according to claim 18, characterized in that the compression unit is further configured to:
    generate compression information, the compression information being used to indicate the at least one all-zero row; and
    compress the at least one input matrix according to the compression information to obtain the second matrix.
  20. The signal processing device according to claim 18, characterized in that the compression unit is further configured to:
    generate compression information, the compression information being used to indicate the at least one all-zero column; and
    compress the at least one weight matrix according to the compression information to obtain the first matrix.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 15 to 17.
PCT/CN2018/109228 2018-09-30 2018-09-30 Signal processing device and signal processing method WO2020062312A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880094243.2A CN112219210B (zh) 2018-09-30 2018-09-30 Signal processing device and signal processing method
PCT/CN2018/109228 WO2020062312A1 (zh) 2018-09-30 2018-09-30 Signal processing device and signal processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109228 WO2020062312A1 (zh) 2018-09-30 2018-09-30 Signal processing device and signal processing method

Publications (1)

Publication Number Publication Date
WO2020062312A1 (zh)

Family

ID=69949558

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109228 WO2020062312A1 (zh) 2018-09-30 2018-09-30 Signal processing device and signal processing method

Country Status (2)

Country Link
CN (1) CN112219210B (zh)
WO (1) WO2020062312A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118590073A (zh) * 2023-03-02 2024-09-03 Huawei Technologies Co., Ltd. Data processing method and apparatus


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541814B (zh) * 2010-12-27 2015-10-14 Beijing Guorui Zhongshu Technology Co., Ltd. Matrix computation apparatus and method for a data communication processor
US9317482B2 (en) * 2012-10-14 2016-04-19 Microsoft Technology Licensing, Llc Universal FPGA/ASIC matrix-vector multiplication architecture
US9697176B2 (en) * 2014-11-14 2017-07-04 Advanced Micro Devices, Inc. Efficient sparse matrix-vector multiplication on parallel processors
JP2017130036A (ja) * 2016-01-20 2017-07-27 Fujitsu Limited Information processing apparatus, arithmetic method, and arithmetic program
CN107239823A (zh) * 2016-08-12 2017-10-10 Beijing DeePhi Technology Co., Ltd. Apparatus and method for implementing a sparse neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017129325A1 (en) * 2016-01-29 2017-08-03 Fotonation Limited A convolutional neural network
CN107239825A (zh) * 2016-08-22 2017-10-10 Beijing DeePhi Intelligent Technology Co., Ltd. Deep neural network compression method considering load balancing
CN108268947A (zh) * 2016-12-30 2018-07-10 Fujitsu Limited Apparatus and method for increasing the processing speed of a neural network, and application thereof
CN107590533A (zh) * 2017-08-29 2018-01-16 Institute of Computing Technology, Chinese Academy of Sciences Compression apparatus for a deep neural network
CN107944555A (zh) * 2017-12-07 2018-04-20 Guangzhou Huaduo Network Technology Co., Ltd. Neural network compression and acceleration method, storage device, and terminal

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562046B2 (en) * 2018-11-26 2023-01-24 Samsung Electronics Co., Ltd. Neural network processor using dyadic weight matrix and operation method thereof
US20200167637A1 (en) * 2018-11-26 2020-05-28 Samsung Electronics Co., Ltd. Neural network processor using dyadic weight matrix and operation method thereof
US11521953B1 (en) 2019-03-18 2022-12-06 Kepler Computing Inc. 3D stacked ferroelectric compute and memory
US11637090B2 (en) 2019-03-18 2023-04-25 Kepler Computing Inc. Method of forming a 3D stacked compute and memory
US11764190B1 (en) 2019-03-18 2023-09-19 Kepler Computing Inc. 3D stacked compute and memory with copper pillars
US11836102B1 (en) * 2019-03-20 2023-12-05 Kepler Computing Inc. Low latency and high bandwidth artificial intelligence processor
US12086410B1 (en) 2019-05-31 2024-09-10 Kepler Computing Inc. Ferroelectric memory chiplet in a multi-dimensional packaging with I/O switch embedded in a substrate or interposer
US11784164B2 (en) 2019-05-31 2023-10-10 Kepler Computing Inc. 3D stacked compute and memory with copper-to-copper hybrid bond
US11844223B1 (en) 2019-05-31 2023-12-12 Kepler Computing Inc. Ferroelectric memory chiplet as unified memory in a multi-dimensional packaging
US11829699B1 (en) 2021-08-06 2023-11-28 Kepler Computing Inc. Method to segregate logic and memory into separate dies for thermal management in a multi-dimensional packaging
US11841757B1 (en) 2021-08-06 2023-12-12 Kepler Computing Inc. Method and apparatus for cycle-by-cycle clock gating of ferroelectric or paraelectric logic and CMOS based logic
US11791233B1 (en) 2021-08-06 2023-10-17 Kepler Computing Inc. Ferroelectric or paraelectric memory and logic chiplet with thermal management in a multi-dimensional packaging
US11899613B1 (en) 2021-08-06 2024-02-13 Kepler Computing Inc. Method and apparatus to process an instruction for a distributed logic having tightly coupled accelerator core and processor core in a multi-dimensional packaging
US12001266B1 (en) 2021-08-06 2024-06-04 Kepler Computing Inc. Method and apparatus for managing power of ferroelectric or paraelectric logic and CMOS based logic
US12019492B1 (en) 2021-08-06 2024-06-25 Kepler Computing Inc. Method and apparatus for managing power in a multi-dimensional packaging
US12026034B1 (en) 2021-08-06 2024-07-02 Kepler Computing Inc. Method and apparatus for heuristic-based power gating of non-CMOS logic and CMOS based logic
US11694940B1 (en) 2021-08-06 2023-07-04 Kepler Computing Inc. 3D stack of accelerator die and multi-core processor die

Also Published As

Publication number Publication date
CN112219210B (zh) 2024-03-29
CN112219210A (zh) 2021-01-12

Similar Documents

Publication Publication Date Title
WO2020062312A1 (zh) Signal processing device and signal processing method
US11429852B2 (en) Convolution acceleration and computing processing method and apparatus, electronic device, and storage medium
US11151361B2 (en) Dynamic emotion recognition in unconstrained scenarios
CN110263909B (zh) Image recognition method and apparatus
WO2020073211A1 (zh) Operation accelerator, processing method, and related device
US20160358069A1 (en) Neural network suppression
US20160267111A1 (en) Two-stage vector reduction using two-dimensional and one-dimensional systolic arrays
CN110889416B (zh) Salient object detection method based on a cascaded improved network
JP6684951B2 (ja) Artificial intelligence inference arithmetic device
US20220156575A1 (en) Multi-dimensional tensor support extension in neural network processor
WO2019001323A1 (zh) Signal processing system and method
US20200218777A1 (en) Signal Processing Method and Apparatus
WO2020220797A1 (zh) Feature map enlargement method, apparatus, and device, and computer-readable storage medium
WO2021147276A1 (zh) Data processing method and apparatus, chip, electronic device, and storage medium
WO2022151779A1 (zh) Convolution operation implementation method, data processing method, and apparatus
JP2019185784A (ja) Deep learning image processing system using modularly connected CNN-based integrated circuits
CN107888970A (zh) Video processing method and apparatus, embedded device, and storage medium
WO2019095333A1 (zh) Data processing method and device
CN112784951B (zh) Winograd convolution operation method and related products
CN114600126A (zh) Convolution operation circuit and convolution operation method
Tang et al. Energy-efficient pedestrian detection system: Exploiting statistical error compensation for lossy memory data compression
CN112862095A (zh) Self-distillation learning method and device based on feature analysis, and readable storage medium
WO2021135572A1 (zh) Convolution implementation method for a neural network, convolution implementation apparatus, and terminal device
WO2023109086A1 (zh) Character recognition method, apparatus, device, and storage medium
WO2024045320A1 (zh) Face recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18935904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18935904

Country of ref document: EP

Kind code of ref document: A1