CN110929854B - Data processing method and device and hardware accelerator - Google Patents


Info

Publication number
CN110929854B
Authority
CN
China
Prior art keywords
matrix
hardware accelerator
zero elements
column
index information
Prior art date
Legal status
Active
Application number
CN201811100198.5A
Other languages
Chinese (zh)
Other versions
CN110929854A (en)
Inventor
刘保庆
郑淼
张一栋
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201811100198.5A
Publication of CN110929854A
Application granted
Publication of CN110929854B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The application provides a data processing method, a data processing device and a hardware accelerator. The data processing method comprises the following steps: the processor acquires a first matrix and a second matrix, and obtains at least one third matrix and at least one piece of index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator; the hardware accelerator acquires a fourth matrix from the corresponding l rows of the first matrix according to the index information corresponding to each third matrix, obtains a fifth matrix from the fourth matrix and the third matrix, and obtains a target result from at least one fifth matrix. The processor eliminates some or all of the zero elements in n columns of the second matrix to obtain a third matrix, so that the third matrix carries fewer zero elements into the operation than the n columns of the second matrix do, and the fourth matrix obtained from the index information contains fewer elements. Reducing the number of zero elements participating in the operation therefore reduces the total amount of computation of the hardware accelerator and improves the operation efficiency of the hardware accelerator.

Description

Data processing method and device and hardware accelerator
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data processing method, apparatus, and hardware accelerator.
Background
A neural network abstracts the human brain's neural network from the perspective of information processing, builds a simplified model, and forms different networks according to different connection modes. In recent years, neural networks have developed rapidly and are widely used in many fields such as image recognition, speech recognition, natural language processing, weather forecasting, gene expression and content recommendation.
The main operations involved in a neural network are the convolution layer (conv) and the fully connected layer (FC). The convolution layers and fully connected layers account for more than 90% of the computation and computation time of the whole network, so accelerating the calculation of the convolution layers and fully connected layers is key to improving the performance of the neural network. Most operations in the convolution layer and the fully connected layer reduce to matrix-vector or matrix-matrix multiplications involving a large number of parameters, but the specification of the hardware accelerator executing these operations is limited, so the hardware accelerator cannot operate directly on a data matrix and a parameter matrix; both must first be processed. Take fig. 1a and fig. 1b as an example. Fig. 1a shows a hardware accelerator that supports multiplying an 8 x 8 matrix by an 8 x 8 matrix. Fig. 1b shows the multiplication of a parameter matrix A (32 x 32) and a data matrix B (32 x 32). To perform the matrix operation of fig. 1b with the hardware accelerator of fig. 1a, the parameter matrix A and the data matrix B may be loaded into the hardware accelerator, each divided into 16 matrices of 8 x 8; every row of the parameter matrix A must then perform a multiply-add operation with every column of the data matrix B. The amount of calculation is large, and it grows as the parameter matrix A and the data matrix B grow, so the operation efficiency of the hardware accelerator is low.
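For orientation, the blocked computation described above can be written out as follows. This is a minimal numpy sketch composed for this description rather than taken from the patent; the tile size and the function name are illustrative.

    import numpy as np

    TILE = 8  # the accelerator natively multiplies an 8 x 8 tile by an 8 x 8 tile

    def tiled_matmul(A, B, tile=TILE):
        """Dense blocked product: every row strip of A meets every column strip of B."""
        L, M = A.shape
        M2, N = B.shape
        assert M == M2 and L % tile == 0 and M % tile == 0 and N % tile == 0
        C = np.zeros((L, N), dtype=A.dtype)
        for i in range(0, L, tile):
            for j in range(0, N, tile):
                for k in range(0, M, tile):  # 4 partial products per output tile here
                    C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
        return C

    A = np.random.randn(32, 32)
    B = np.random.randn(32, 32)
    assert np.allclose(tiled_matmul(A, B), A @ B)  # 64 dense 8 x 8 tile products in total

Every one of the 64 tile products runs regardless of how many of its operands are zero, which is exactly the inefficiency the scheme below removes.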
Disclosure of Invention
The application provides a data processing method, a data processing device and a hardware accelerator, which are used for improving the operation efficiency of the hardware accelerator.
In a first aspect, the present application provides a data processing method applicable to a data processing apparatus, where the data processing apparatus comprises a processor and a hardware accelerator. In the method, the processor acquires a first matrix and a second matrix, and obtains at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator. The hardware accelerator acquires the index information produced by the processor; for each third matrix in the at least one third matrix, the hardware accelerator acquires a fourth matrix from the corresponding l rows of the first matrix according to the index information corresponding to that third matrix, and obtains a fifth matrix according to the fourth matrix and the third matrix; a target result is then obtained according to the at least one fifth matrix, where the target result is an L x N matrix. The first matrix and the second matrix both comprise non-zero elements; the first matrix is an L x M matrix, the second matrix is an M x N matrix, and L, M and N are all positive integers. The specification of the hardware accelerator indicates that the hardware accelerator can process the product of an l x m matrix and an m x n matrix; the third matrix is a (t x m) x n matrix; each third matrix in the at least one third matrix comprises each non-zero element in a different set of n columns of the second matrix and excludes some or all of the zero elements in those n columns; the index information indicates the position information of the non-zero elements in the n columns of the second matrix. l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer; the fourth matrix is an l x m matrix.
Based on this scheme, the processor eliminates some or all of the zero elements in n columns of the second matrix to obtain a third matrix, so that the number of zero elements of the third matrix participating in the operation in the hardware accelerator is smaller than the number of zero elements in those n columns of the second matrix. The third matrix is obtained by deleting rows containing only zero elements from the n columns of the second matrix, and the fourth matrix obtained according to the index information contains fewer elements; since zero elements have no influence on the operation result, reducing the number of zero elements participating in the operation reduces the total amount of computation of the hardware accelerator without affecting the operation result, which in turn improves the operation efficiency of the hardware accelerator.
In a possible implementation, the processor converts the second matrix into a third matrix that the hardware accelerator can operate on; determining the third matrix can be implemented by determining the value of t, where t satisfies the condition (t-1)*m < p <= t*m, and p is the number of non-zero elements in the column of the second matrix with the most non-zero elements. In this way the third matrix can comprise all the non-zero elements of the second matrix; since non-zero elements strongly influence the operation result while zero elements have no influence, this improves the precision of the matrix operation performed by the hardware accelerator. Further, with the determined value of t, a third matrix can be obtained from a second matrix of any sparsity.
Further, in order to improve the operation performance of the hardware accelerator when performing sparse processing on the second matrix, p of the second matrix may be made to satisfy the condition p <= M - m.
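Taken together, the two conditions amount to computing t as the ceiling of p/m and checking p against M - m. The following sketch illustrates this under the assumption p >= 1; it is an illustration, not the patent's implementation.

    import math

    def third_matrix_rows(p, m):
        """p: non-zero count of the fullest of the n columns; m: accelerator row size."""
        t = math.ceil(p / m)               # equivalent to (t-1)*m < p <= t*m for p >= 1
        assert (t - 1) * m < p <= t * m
        return t, t * m                    # the third matrix has t*m rows

    print(third_matrix_rows(p=7, m=8))     # -> (1, 8): a single 8-row third matrix
    # the performance condition of this paragraph would additionally require
    # p <= M - m, e.g. 7 <= 32 - 8 for the 32-row blocks used later in the text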
In a possible implementation, the index information includes row addresses and column addresses, where one column address corresponds to m row addresses. The hardware accelerator may select m columns of elements from the corresponding l rows of the first matrix according to the m row addresses corresponding to one column address of the index information, to obtain an l x m fourth matrix, where the m row addresses correspond one-to-one with the m column addresses of the m columns of elements. The number of column addresses in the index information equals the number of fourth matrices obtained.
When determining the third matrix from the second matrix, the processor faces two cases. In the first case, t=1; in the second case, t >= 2.
For the first case, the hardware accelerator may directly multiply-add the fourth matrix and the third matrix to obtain a fifth matrix. For the second case, the hardware accelerator may first divide the third matrix into t matrices of m x n and then multiply-add the fourth matrix with each of the t m x n matrices to obtain a fifth matrix. In this way, the calculation process of the matrix can be simplified.
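A sketch of the second case follows, assuming numpy and assuming one gathered l x m fourth matrix per m-row slice of the third matrix; the slicing into t matrices is what the text above specifies, while the rest of the decomposition is illustrative.

    import numpy as np

    def fifth_from_slices(fourth_slices, third, m):
        """fourth_slices: t gathered (l x m) matrices, one per m-row slice of
        the (t*m x n) third matrix; the t partial products are accumulated."""
        t = third.shape[0] // m
        assert len(fourth_slices) == t
        fifth = np.zeros((fourth_slices[0].shape[0], third.shape[1]))
        for s in range(t):                 # t = 1 degenerates to a single multiply-add
            fifth += fourth_slices[s] @ third[s * m:(s + 1) * m, :]
        return fifth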
In a second aspect, the present application provides a data processing method applicable to a hardware accelerator. The method comprises: the hardware accelerator acquires a first matrix, a third matrix and index information, where the third matrix and the index information are obtained by a processor according to the non-zero elements in a second matrix and the specification of the hardware accelerator; the first matrix and the second matrix both comprise non-zero elements; the first matrix is an L x M matrix and the second matrix is an M x N matrix; the specification of the hardware accelerator indicates that the hardware accelerator can process the product of an l x m matrix and an m x n matrix; the index information indicates the position information of the non-zero elements in a set of n columns of the second matrix; the third matrix is a (t x m) x n matrix, and each third matrix in at least one third matrix comprises each non-zero element in a different set of n columns of the second matrix and excludes some or all of the zero elements in those n columns; L, M and N are positive integers, l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer. The hardware accelerator acquires a fourth matrix from the corresponding l rows of the first matrix according to the index information, where the fourth matrix is an l x m matrix; the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix; and the hardware accelerator obtains a target result according to at least one fifth matrix, where the target result is an L x N matrix.
In a third aspect, the present application provides an apparatus comprising a processor and a hardware accelerator. Optionally, the apparatus may further comprise a memory for storing instructions. The processor is configured to execute the instructions stored in the memory and to control the hardware accelerator to perform acceleration tasks; when the processor executes the instructions stored in the memory, the processor in the apparatus performs the method performed by the processor in the first aspect or any implementation of the first aspect to obtain at least one third matrix and the corresponding at least one piece of index information, and the hardware accelerator in the apparatus is then caused to perform the method performed by the hardware accelerator in the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides an apparatus for implementing the method of the first aspect or any implementation of the first aspect, including corresponding functional modules, each implementing a step of the above method. The functions may be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In a possible implementation, the apparatus includes a processing module and an operation module; these modules may perform the corresponding functions in the method examples of the first aspect. For details, refer to the descriptions in the method examples, which are not repeated here.
In a fifth aspect, the present application provides a hardware accelerator comprising an access circuit, a selection circuit and an arithmetic circuit; under the control of an external processor, the hardware accelerator is operable to perform, through the access circuit, the selection circuit and the arithmetic circuit, the method of the second aspect or any implementation of the second aspect.
In a sixth aspect, the present application provides a hardware accelerator for implementing the method of the second aspect or any implementation of the second aspect, including corresponding functional circuits, each implementing a step of the above method. The functions may be implemented by hardware, and the hardware includes one or more circuits corresponding to the functions described above.
In one possible design, the structure of the hardware accelerator includes an access circuit, an operation circuit and a storage circuit, which may perform the corresponding functions in the second aspect or any method example of the second aspect. For details, refer to the descriptions in the method examples, which are not repeated here.
In a seventh aspect, the present application provides a computer storage medium having instructions stored therein which, when run on a computer, cause a processor to perform the method performed by the processor in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect, resulting in at least one third matrix and the corresponding at least one piece of index information, and then cause the hardware accelerator to perform the method performed by the hardware accelerator in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect, according to the at least one third matrix and the corresponding at least one piece of index information.
In an eighth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect, or cause the computer to perform the method of the second aspect or any possible implementation of the second aspect.
In a ninth aspect, the present application provides a chip system including a processor and a hardware accelerator. Optionally, the chip system further comprises a memory; the memory is configured to store a computer program, and the processor is configured to call and run the computer program from the memory, so that the device on which the chip system is installed performs the processor-side method of the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect, and obtains at least one third matrix and the corresponding at least one piece of index information. The hardware accelerator is then caused to perform the method performed by the hardware accelerator in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect, according to the at least one third matrix and the corresponding at least one piece of index information.
Drawings
FIG. 1a is a schematic diagram of a hardware accelerator according to the prior art;
FIG. 1b is a schematic diagram of the product of a parameter matrix and a data matrix according to the prior art;
FIG. 2a is a schematic structural diagram of a data processing apparatus provided in the present application;
FIG. 2b is a schematic diagram of a software and hardware collaboration framework of a data processing apparatus provided in the present application;
FIG. 3 is a schematic flowchart of a data processing method provided in the present application;
FIG. 4a is a schematic structural diagram of a second matrix provided in the present application;
FIG. 4b is a schematic structural diagram of a first matrix provided in the present application;
FIG. 4c is a schematic flowchart of another data processing method provided in the present application;
FIG. 5a is a schematic structural diagram of a second matrix block provided in the present application;
FIG. 5b is a schematic structural diagram of a third matrix provided in the present application;
FIG. 5c is a schematic structural diagram of index information provided in the present application;
FIG. 6a is a schematic structural diagram of a first matrix block provided in the present application;
FIG. 6b is a schematic structural diagram of a fourth matrix provided in the present application;
FIG. 6c is a schematic diagram of a fourth matrix multiplied by a third matrix provided in the present application;
FIG. 7a is a schematic structural diagram of another second matrix block provided in the present application;
FIG. 7b is a schematic structural diagram of another third matrix provided in the present application;
FIG. 7c is a schematic structural diagram of another index information provided in the present application;
FIG. 8 is a schematic structural diagram of a device provided in the present application.
Detailed Description
Fig. 2a schematically shows the structure of a data processing device provided in the present application. As shown in fig. 2a, the apparatus includes a processing system 10, a hardware accelerator 20, a data bus 30, and a control bus 40. The processing system 10 and the hardware accelerator 20 may transmit data over the data bus 30 and control information over the control bus 40. The processing system 10 is the control component of the overall system and includes a processor 111; optionally, the processing system 10 further includes a memory 112. The processor 111 is used to run the software-side code and to load acceleration tasks (e.g., matrices or vectors) onto the hardware accelerator 20. Optionally, the hardware accelerator 20 may include one or more intellectual property (IP) cores, and the hardware accelerator 20 may control the operating state, data reading, and the like of the one or more IP cores. An IP core is a functional block commonly used in digital circuits but relatively complex, designed as a parameter-modifiable circuit, such as a synchronous dynamic random access memory (SDRAM) controller or a peripheral component interconnect (PCI) interface. The processor 111 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. The processor 111 may further comprise a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The memory 112 is used to store data, and the processor 111 may control reading data from and writing data to the memory 112. The memory 112 may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a random access memory (RAM), or a non-volatile memory such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 112 may also include a combination of the above types of memory.
The hardware accelerator 20 is the hardware acceleration component of the whole system. A dedicated acceleration IP core can be designed and implemented for a specific computing task, and acceleration of an algorithm can be realized through the IP core on the hardware accelerator; for example, a hardware accelerator applied to a neural network may accelerate the neural network algorithm. The hardware accelerator 20 and the processing system 10 may execute tasks independently and in parallel after their data interaction is completed. The hardware accelerator 20 may be a graphics processing unit (GPU), an FPGA, or an application-specific integrated circuit (ASIC).
The data bus 30 is used for data transfer between the processing system 10 and the hardware accelerator 20. The data bus 30 may employ the advanced extensible interface stream (AXI-Stream) protocol, a bus protocol that allows unrestricted burst transmission of data, or the high-performance PCI Express (PCI-E) bus protocol.
The control bus 40 is used for the transmission of control signals between the processing system 10 and the hardware accelerator 20. The control bus 40 may use the AXI-Lite protocol, a lightweight, address-mapped, single-transfer protocol suitable for the transmission of control signals of hardware operation circuits.
Based on the data processing apparatus framework shown in fig. 2a, fig. 2b schematically illustrates a software and hardware collaboration framework of a specific data processing apparatus provided in the present application. As shown in fig. 2b, the framework includes a processing system 10a, a hardware accelerator 20a, a data bus 30a, and a control bus 40a; the processing system 10a and the hardware accelerator 20a may transmit data via the data bus 30a and control information via the control bus 40a.
The processing system 10a may be the same as the processing system 10 described above in fig. 2a. That is, the processor 111a in the processing system 10a may be the same as the processor 111 in fig. 2a, and the memory 112a may be the same as the memory 112 in fig. 2a. The data bus 30a may be the same as the data bus 30 in fig. 2a, and the control bus 40a may be the same as the control bus 40 in fig. 2a; these are not repeated here.
The hardware accelerator 20a may be the same as the hardware accelerator 20 described above in fig. 2a. The hardware accelerator 20a may include the following IP cores: a memory circuit 211, an arithmetic circuit 212, an input buffer circuit 213, an output buffer circuit 214, a read circuit 215 (illustrated in fig. 2b as a direct memory access (DMA) engine), and a control interconnect circuit 216.
The memory circuit 211 includes a first data buffer 2111, a second data buffer 2112, an index information buffer 2113, and a selection circuit 2114 (illustrated as a selector in fig. 2b). The first data buffer 2111 is configured to buffer data loaded into the hardware accelerator from the memory 112a; for example, when the hardware accelerator 20a is applied to a neural network, the first data buffer 2111 may buffer the convolution kernel parameter matrix of each layer of the neural network model. The second data buffer 2112 is configured to buffer data loaded from the memory 112a into the hardware accelerator; for example, when the hardware accelerator 20a is applied in a neural network, the second data buffer 2112 may buffer the feature map data matrix. The index information buffer 2113 may be used to buffer the position information of non-zero elements in a parameter matrix or data matrix, such as the column addresses and row addresses of the non-zero elements and the relationships between them. The selector 2114 is configured to select the data at the corresponding positions according to the index information cached in the index information buffer 2113: if the index information is obtained from the parameter matrix, the selector selects the data at the corresponding positions from the data matrix according to the index information; if the index information is obtained from the data matrix, the selector selects the data at the corresponding positions from the parameter matrix.
The arithmetic circuit 212 is the core component of the hardware accelerator, and its various functional units are hardwired inside it. The arithmetic circuit 212 includes a multiplication unit 2121, an accumulation unit 2122, a vector calculation unit 2123, and an excitation function unit 2124. The multiplication unit 2121 is configured to multiply an input vector by the elements at the corresponding positions of a matrix to obtain intermediate values, or to multiply an input matrix by a matrix to obtain intermediate values; the multiplication unit may be a matrix-vector multiplication unit or a matrix-matrix multiplication unit. The accumulation unit 2122 is configured to accumulate the intermediate values calculated by the multiplication unit 2121 to obtain the respective gate vector values. The excitation function unit 2124 is configured to apply the excitation function to the gate vector values obtained by the accumulation unit, and the vector calculation unit 2123 is configured to operate on each gate vector to obtain the final result. The input buffer 213 and the output buffer 214 are used to temporarily store input and output data in the hardware accelerator; the arithmetic circuit 212 can place a result into the output buffer after completing its calculation, and the contents of the output buffer 214 can be written back to the memory 112a in one pass once the buffer is full.
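Read as a dataflow, the chain of units can be modelled roughly as follows. This is a conceptual sketch only, not the circuit; in particular the sigmoid stand-in for the excitation function is an assumption.

    import numpy as np

    def excitation(x):                     # stand-in for excitation function unit 2124
        return 1.0 / (1.0 + np.exp(-x))    # sigmoid chosen purely for illustration

    def gate_value(weight_tiles, input_tiles):
        partials = [w @ x for w, x in zip(weight_tiles, input_tiles)]  # unit 2121
        gate = np.sum(partials, axis=0)                                # unit 2122
        return excitation(gate)            # unit 2124; unit 2123 would then combine
                                           # the gate vectors into the final result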
The direct memory access (DMA) engine 215 is used for data transfer between the hardware accelerator 20a and the memory 112a; optionally, the physical addresses at which data is stored in the memory 112a may be contiguous, which facilitates data transfer by the DMA 215 in the hardware accelerator 20a. Each circuit of the hardware accelerator may be equipped with a DMA to enable parallel data reading. The control interconnect circuit 216 is used to interconnect the control signal lines.
Either of fig. 2a and fig. 2b may be applied to a neural network acceleration system. An alternative implementation in such a system is as follows: under the control of the processor in the processing system, the DMA is started through the control bus and the data bus; the DMA transfers the convolution kernel parameter data in the memory to the first data buffer 2111 in the storage circuit 211 of the hardware accelerator and the input feature map data in the memory to the second data buffer 2112 of the storage circuit; the target result is then obtained through the multiplication, accumulation, excitation-function and other processing in the operation circuit 212, and the obtained target result is returned to the memory through the DMA.
The data processing apparatus shown in fig. 2a or fig. 2b may be applied to a device, such as a mobile phone, a camera, or a cloud server. The data processing device in fig. 2a or 2b may also be part of a System on Chip (SoC).
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more.
Based on the architecture shown in fig. 2a or fig. 2b, the present application provides a data processing method. Fig. 3 is a schematic flowchart of the data processing method provided in the present application. The processor in this embodiment may be the processor 111 in fig. 2a or the processor 111a in fig. 2b, and the hardware accelerator may be the hardware accelerator 20 in fig. 2a or the hardware accelerator 20a in fig. 2b. As shown in fig. 3, the method comprises the following steps:
In step 301, the processor obtains a first matrix and a second matrix.
The first matrix and the second matrix both comprise non-zero elements; the first matrix is an L x M matrix, the second matrix is an M x N matrix, and L, M and N are all positive integers.
In one implementation, L and M of the first matrix are equal and M and N of the second matrix are equal.
In yet another implementation, L and M of the first matrix are equal and M and N of the second matrix are unequal.
In yet another implementation, L and M of the first matrix are not equal and M and N of the second matrix are equal.
In yet another implementation, L and M of the first matrix are not equal and M and N of the second matrix are not equal.
In step 302, the processor obtains at least one third matrix and at least one piece of index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator.
The specification of the hardware accelerator indicates that the hardware accelerator can process the product of an l x m matrix and an m x n matrix. A third matrix is a (t x m) x n matrix; each third matrix in the at least one third matrix comprises each non-zero element in a different set of n columns of the second matrix and excludes some or all of the zero elements in those n columns; the index information indicates the position information of the non-zero elements in the n columns of the second matrix, and one third matrix corresponds to one piece of index information. l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer.
For example, if 4 third matrices are obtained, each third matrix includes n columns of the second matrix, and the n columns included in any two of the 4 third matrices are different. For example, if the second matrix includes 32 columns and each third matrix includes 8 columns, the columns of the second matrix included in the 4 third matrices may be: the first third matrix includes columns 1 to 8 of the second matrix, the second third matrix includes columns 9 to 16, the third third matrix includes columns 17 to 24, and the fourth third matrix includes columns 25 to 32. Here, first, second, third and fourth do not indicate a sequential order and are used only to distinguish the third matrices.
Optionally, l, m and n in the specification of the hardware accelerator may all be equal, or any two of them may be equal, or all three may differ.
In a possible implementation, the value of t of the third matrix determined by the processor satisfies the condition (t-1)*m < p <= t*m, where p is the number of non-zero elements in the column with the most non-zero elements among the different n columns of the second matrix. It may also be understood that the processor determines the number of rows of the third matrix according to the value of t, and then forms the third matrix from the non-zero elements in the n columns of the second matrix block and the determined numbers of rows and columns. Because the number p of non-zero elements in the fullest column among the n columns determines the number of rows of the third matrix, all the non-zero elements in those n columns of the second matrix can be included in the third matrix; since non-zero elements strongly influence the operation result, a third matrix containing all the non-zero elements of its n columns allows the operation result to be obtained accurately.
For example, if the determined third matrix has 8 rows and a certain column among the n columns of the second matrix has 6 non-zero elements, that column may be supplemented with two zero elements; the two zero elements may be placed in the first two rows of the column, the last two rows, or any two positions in between, which is not limited in this application.
Optionally, the first matrix and the second matrix may both be sparse matrices, both be non-sparse matrices, or one of them may be sparse. A sparse matrix is a matrix in which the number of elements with value 0 far exceeds the number of non-zero elements (e.g., the ratio of the number of zero-valued elements to the total number of elements in the matrix is greater than a preset threshold) and the distribution of the non-zero elements is irregular; sparsity helps reduce the hardware input/output (I/O) bandwidth needed when data is loaded from the memory to the hardware accelerator. For example, take the second matrix as the sparse matrix. Zero elements in the sparse matrix have no influence on the operation result, but their participation in the operation adds to the consumption of the hardware accelerator and reduces its operation performance. Therefore, when performing sparse processing on the second matrix, in order to improve the operation performance of the hardware accelerator, the number p of non-zero elements in the column of the second matrix with the most non-zero elements can be made to satisfy the condition p <= M - m. When converting the second matrix into a plurality of third matrices, the processor can then delete the rows containing only zero elements from each different set of n columns of the second matrix to obtain a third matrix, which reduces the number of zero elements participating in the operation and improves the calculation efficiency of the hardware accelerator. Further, under the condition that t satisfies the relation above, any irregular sparse second matrix can be converted into regular third matrices, so that product operations involving a second matrix of any sparsity rate can be realized by the hardware accelerator.
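One way to realize the conversion, matching the per-column index layout used later in fig. 5c, is sketched below; the assumption that each column is compressed independently, and all names, are illustrative.

    import math
    import numpy as np

    def compress_block(block, m):
        """block: an M x n slice of the second matrix. Each column keeps only its
        non-zero elements, zero-padded to t*m entries; the row addresses of the
        kept elements form the index information for that column."""
        p = int((block != 0).sum(axis=0).max())   # non-zero count of the fullest column
        t = max(1, math.ceil(p / m))
        third = np.zeros((t * m, block.shape[1]), dtype=block.dtype)
        index = np.zeros((t * m, block.shape[1]), dtype=int)
        for c in range(block.shape[1]):
            rows = np.flatnonzero(block[:, c])    # row addresses of the non-zeros
            third[:rows.size, c] = block[rows, c]
            index[:rows.size, c] = rows           # padded slots keep address 0, which
        return third, index                       # is harmless: they multiply zeros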
As described above in connection with fig. 2b, after obtaining at least one third matrix and the corresponding at least one piece of index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, the processor 111a may store the obtained third matrix (or matrices) and the corresponding index information in the memory 112a.
Step 303, for each third matrix in the at least one third matrix, the hardware accelerator obtains a fourth matrix from the corresponding l rows in the first matrix according to the index information corresponding to the third matrix, where the fourth matrix is a matrix of l×m.
Optionally, the index information includes row addresses and column addresses, one column address corresponding to m row addresses. The index information may be in the form of a table, or in other forms such as text in extensible markup language (XML) format, which is not limited in this application.
Specifically, the hardware accelerator selects m columns of elements from the corresponding l rows of the first matrix according to the m row addresses corresponding to one column address of the index information to obtain a fourth matrix, where the m row addresses correspond one-to-one with the m column addresses of the m columns of elements. At least one fourth matrix may be obtained from one piece of index information.
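The selection itself is a plain gather. A sketch, assuming A_rows holds the corresponding l rows of the first matrix, so that row address r of the second matrix picks column r of the strip:

    import numpy as np

    def gather_fourth(A_rows, row_addrs):
        """A_rows: l x M strip of the first matrix; row_addrs: the m row addresses
        under one column address of the index information."""
        return A_rows[:, row_addrs]        # the l x m fourth matrix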
In a possible implementation, n columns of the second matrix correspond to l rows of the first matrix. For example, if the first matrix includes 32 rows, the second matrix includes 32 columns, and each third matrix includes 8 columns, then the first third matrix includes columns 1 to 8 of the second matrix, which may correspond to rows 1 to 8 of the first matrix; the second third matrix includes columns 9 to 16, corresponding to rows 9 to 16 of the first matrix; the third third matrix includes columns 17 to 24, corresponding to rows 17 to 24 of the first matrix; and the fourth third matrix includes columns 25 to 32, corresponding to rows 25 to 32 of the first matrix. Here, first, second, third and fourth do not represent a sequential order and are used only to distinguish the third matrices.
Taking fig. 2b as an example, in one possible implementation, under the control of the processor 111a the hardware accelerator 20a loads the first matrix in the memory 112a into the second data buffer 2112, loads the index information into the index information buffer 2113, and loads the third matrix into the first data buffer 2111 of the hardware accelerator 20a; the processor 111a then controls the selector 2114 to obtain the fourth matrix, according to the index information, from the first matrix buffered in the hardware accelerator 20a. In another possible implementation, the processor 111a divides the first matrix into one or more first matrix blocks (each first matrix block includes l rows), and under the control of the processor 111a the hardware accelerator 20a loads the first matrix blocks in the memory 112a into the second data buffer 2112 over multiple passes, loading one first matrix block or several first matrix blocks at a time. In this way, the loading bandwidth can be improved; the fourth matrix may then be obtained from the first matrix block loaded into the second data buffer 2112.
In one possible implementation, the first matrix may include at least one first matrix block A_i, where a first matrix block A_i includes l rows of the first matrix; these may be l adjacent rows or l spaced rows of the first matrix. The second matrix includes at least one second matrix block B_j, where a second matrix block B_j includes n columns of the second matrix; these may be n adjacent columns or n spaced columns of the second matrix, and i and j are integers. The present application provides three ways for the processor to determine the size of a first matrix block A_i and a second matrix block B_j, where the specification of the hardware accelerator indicates that the hardware accelerator can process the product of an l x m matrix and an m x n matrix.
In implementation one, the sparsity of the second matrix is determined, and the reciprocal of the sparsity of the second matrix is taken as the highest gear T1; according to the specification of the hardware accelerator and the highest gear T1, the size of a second matrix block B_j of the second matrix can be determined as (m*T1) rows and n columns. In this way, the size of the second matrix block can be determined quickly; in particular, when the non-zero elements of the second matrix are distributed relatively uniformly, the size of the second matrix block can be determined both quickly and accurately.
Correspondingly, since the number of columns of a first matrix block A_i must equal the number of rows of a second matrix block, the first matrix block A_i can be sized as l rows and (m*T1) columns.
In implementation two, the second matrix is divided into a plurality of small matrix blocks according to its non-zero elements (for example, into 9 small matrix blocks in a three-by-three grid), and the sparsity of each small matrix block is determined. If the sparsity of most of the small matrix blocks is smaller than a threshold while that of a few small matrix blocks is larger than the threshold, the reciprocal of the sparsity of the majority of the small matrix blocks is taken as the highest gear T2; according to the specification of the hardware accelerator and the highest gear T2, the size of a second matrix block B_j of the second matrix is determined as (m*T2) rows and n columns. For a second matrix whose non-zero elements are distributed non-uniformly, the size of the second matrix block can be determined accurately through implementation two.
For example, if the sparsity of the majority of the second matrix is less than 25% while the local sparsity is greater than 25%, the highest gear T2 may be set to 4.
Correspondingly, the size of the first matrix block of the first matrix may be determined as l rows and (m*T2) columns.
In implementation three, the highest gear T3 of the second matrix is determined according to historical empirical values; according to the specification of the hardware accelerator and the highest gear T3, the size of a second matrix block B_j of the second matrix can be determined as (m*T3) rows and n columns.
Correspondingly, the size of the first matrix block of the first matrix may be determined as l rows and (m*T3) columns.
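A sketch of the gear computation shared by the three implementations, reading "sparsity" as the proportion of non-zero elements; that reading, and the flooring of the reciprocal, are assumptions made for the example.

    import math
    import numpy as np

    def highest_gear(matrix_or_blocks):
        """Implementation one applies this to the whole second matrix, implementation
        two to the majority of its small blocks; implementation three would instead
        use a historical empirical value."""
        m = np.asarray(matrix_or_blocks)
        density = np.count_nonzero(m) / m.size
        assert density > 0
        return max(1, math.floor(1 / density))    # e.g. density 25% -> T = 4

    # a second matrix block is then (m * T) rows by n columns, and the matching
    # first matrix block l rows by (m * T) columns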
For example, take the first matrix as a 32 x 32 matrix, the second matrix as a 32 x 32 matrix, the specification of the hardware accelerator as supporting the product of an 8 x 8 matrix and an 8 x 8 matrix, and the highest gear of the second matrix as T=4. Fig. 4a provides a schematic structural diagram of the second matrix: as shown in fig. 4a, the second matrix comprises four second matrix blocks B_j, namely B_0, B_1, B_2 and B_3; each second matrix block includes 8 columns, the number of rows of a second matrix block is m*T = 8*4 = 32, and the number of columns is n = 8. Correspondingly, fig. 4b provides a schematic structural diagram of the first matrix: as shown in fig. 4b, the first matrix comprises four first matrix blocks A_i, namely A_0, A_1, A_2 and A_3; each first matrix block has l = 8 rows and m*T = 8*4 = 32 columns.
Step 304, for each third matrix in the at least one third matrix, the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix.
With reference to fig. 2b, the processor 111a controls the hardware accelerator 20a to input the fourth matrix and the third matrix into the operation circuit 212; the multiplication unit 2121 of the operation circuit 212 computes the product of the third matrix and the fourth matrix to obtain intermediate values, and the intermediate values are input into the accumulation unit 2122 of the operation circuit 212 for accumulation to obtain a fifth matrix.
Optionally, there are a plurality of fourth matrices; each fourth matrix is multiplied by one column of the third matrix, and the results of multiplying each input fourth matrix by the corresponding column of the third matrix are accumulated to obtain a fifth matrix.
In one possible implementation, if t is an integer greater than or equal to 2, the hardware accelerator divides the third matrix into t matrices of m x n, and performs multiply-add operations between the fourth matrix and each of the t m x n matrices to obtain a fifth matrix.
In step 305, the hardware accelerator obtains a target result according to the at least one fifth matrix, where the target result is an L x N matrix.
In combination with fig. 2b, in a possible implementation manner, the accumulating unit 2122 of the hardware accelerator 20a accumulates at least one fifth matrix to obtain a target result, where the target result is a result of a product operation of the first matrix and the second matrix.
Through steps 301 to 305, since the third matrix is obtained by converting the non-zero elements in a different set of n columns of the second matrix, the l x m fourth matrix is reduced compared with the L x M first matrix while still meeting the specification of the hardware accelerator, and the third matrix is generated according to that specification; the hardware accelerator can therefore obtain a fifth matrix by multiplying the l x m fourth matrix with the third matrix, and obtain the target result, an L x N matrix, from the fifth matrices. That is, the target result is the operation result of the first matrix and the second matrix, and the product of the first matrix and the second matrix is realized through the products of the third and fourth matrices and the operations on the fifth matrices. In this scheme, the processor eliminates some or all of the zero elements in each different set of n columns of the second matrix to obtain the third matrices, so that the number of zero elements of the third matrix participating in the operation in the hardware accelerator is reduced compared with the n columns of the second matrix. Since the third matrix is obtained by deleting rows containing only zero elements from the different n columns of the second matrix, the fourth matrix is obtained according to the index information, and zero elements have no influence on the operation result, reducing the number of zero elements participating in the operation reduces the total amount of computation of the hardware accelerator without affecting the operation result, thereby improving the operation efficiency of the hardware accelerator.
The data processing method flow shown in fig. 3 can be applied to data processing scenarios of neural network models. In such a scenario, the first matrix can be a data matrix or a parameter matrix: if the first matrix is the data matrix, the second matrix is the parameter matrix; if the first matrix is the parameter matrix, the second matrix is the data matrix. In neural network models, the parameter matrix is typically a sparse matrix.
The data processing method provided in the present application is described below by taking the first matrix as the matrix shown in fig. 4b and the second matrix as the matrix shown in fig. 4a as an example. In this method, the specification of the hardware accelerator is taken to support the product of an 8 x 8 matrix and an 8 x 8 matrix.
As shown in fig. 4c, a flowchart of another data processing method is provided for the present application. The processor in the method may be the processor in fig. 2a or fig. 2b, and the hardware accelerator may be the hardware accelerator in fig. 2a or fig. 2 b. The method comprises the following steps:
In step 401, the processor obtains a first matrix and a second matrix.
This step is the same as step 301 in the embodiment shown in fig. 3 and is referred to in the foregoing description, and will not be repeated here.
In step 402, the processor divides the first matrix into four first matrix blocks and the second matrix into four second matrix blocks.
The first matrix is divided into four first matrix blocks A_0, A_1, A_2 and A_3, and the second matrix is divided into four second matrix blocks B_0, B_1, B_2 and B_3. For the process of the processor partitioning the first matrix and the second matrix, refer to the related implementation of step 303 in the embodiment shown in fig. 3, which is not repeated here.
In the present application, the product operation of the first matrix and the second matrix requires each first matrix block of the first matrix to perform a product operation with each second matrix block of the second matrix, that is: A_0*B_0, A_0*B_1, A_0*B_2, A_0*B_3, A_1*B_0, A_1*B_1, A_1*B_2, A_1*B_3, A_2*B_0, A_2*B_1, A_2*B_2, A_2*B_3, A_3*B_0, A_3*B_1, A_3*B_2 and A_3*B_3. The product operation process is the same for every first matrix block A_i and second matrix block B_j; the number of non-zero elements in B_j determines how the operation A_i*B_j proceeds, and this embodiment is described separately for two cases.
The first case is illustrated with the second matrix block B_j shown in fig. 5a, which includes 32 rows and 8 columns. In this first case, after step 402 above, steps 403 to 410 are performed.
Step 403: the processor determines the number p of non-zero elements in the column of the second matrix block B_j with the most non-zero elements.
As shown in fig. 5a, the number of non-zero elements in each column of the second matrix block B_j is: p_0=6, p_1=4, p_2=7, p_3=6, p_4=0, p_5=4, p_6=5 and p_7=3; that is, the number of non-zero elements in the column of the second matrix block B_j with the most non-zero elements is p=7.
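The per-column count is a single reduction. A runnable numpy sketch on a synthetic block (the concrete values of fig. 5a are not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    Bj = np.where(rng.random((32, 8)) < 0.2, rng.standard_normal((32, 8)), 0.0)

    p_per_column = (Bj != 0).sum(axis=0)   # fig. 5a would give [6, 4, 7, 6, 0, 4, 5, 3]
    p = int(p_per_column.max())            # and, for fig. 5a, p = 7
    print(p_per_column, p)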
In step 404, the processor determines a third matrix according to the determined p and the specification of the hardware accelerator.
In a possible implementation, t of the third matrix satisfies the condition (t-1)*m < p <= t*m. Combining fig. 5a (the fullest column of the second matrix block B_j contains 7 non-zero elements) with the specification of the hardware accelerator, t satisfies (t-1)*8 < 7 <= t*8, so t=1 can be determined; that is, the third matrix is an 8 x 8 matrix, as shown in fig. 5b. The third matrix shown in fig. 5b comprises each non-zero element of the second matrix block B_j shown in fig. 5a, and each column is supplemented with an appropriate number of zero elements, identified by x. The non-zero elements of the third matrix need not be packed tightly; for example, the 6 of column 0 may be in the last row or in row 6. Equivalently, the supplemented zero elements may be in the last two rows of column 0, the first two rows, or any two rows of column 0; the position of the supplemented zero elements within a column is not limited in this application.
Step 405: the processor determines the index information of the second matrix block B_j.
This can also be understood as determining the position information of the non-zero elements in the second matrix block B_j, together with the position information of the zero padding required to form a matrix that meets the specification of the hardware accelerator.
For the second matrix block B_j shown in fig. 5a, the position information of the non-zero elements, shown in fig. 5c, is as follows: the row addresses of the non-zero elements in column 0 are 1/4/8/12/17/28; in column 1, 3/6/14/24; in column 2, 0/1/21/25/28/29/30; in column 3, 0/5/12/13/21/27; in column 5, 5/12/18/28; in column 6, 3/8/21/23/25; and in column 7, 11/16/30, while column 4 contains no non-zero elements. A column whose number p_i of non-zero elements is less than 8 (8 being the number of rows of the right-hand matrix of the hardware accelerator) may be padded with zero elements; as shown in fig. 5c, the padded zero elements may also be identified by x. For example, column 0 has 6 non-zero elements and thus needs 2 padded zero elements, while a column containing 8 non-zero elements would need no padding.
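Written out as data, the table of fig. 5c reads roughly as follows. This is a reconstruction: assigning the 0/5/12/13/21/27 run to column 3 and leaving column 4 empty follows the per-column counts p_0..p_7 above, and None marks a padded slot.

    # row addresses per column address, as read from fig. 5c
    index_info = {
        0: [1, 4, 8, 12, 17, 28, None, None],
        1: [3, 6, 14, 24, None, None, None, None],
        2: [0, 1, 21, 25, 28, 29, 30, None],
        3: [0, 5, 12, 13, 21, 27, None, None],
        4: [None] * 8,                     # column 4 has no non-zero elements
        5: [5, 12, 18, 28, None, None, None, None],
        6: [3, 8, 21, 23, 25, None, None, None],
        7: [11, 16, 30, None, None, None, None, None],
    }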
As described above in connection with fig. 2b, the processor 111a may control storing the determined third matrix and index information in the memory 112a. Since the third matrix contains fewer zero elements than the second matrix block B_j, the amount of data stored in the memory is reduced and the storage space of the memory is saved.
Step 404 and step 405 may be performed in any order: step 404 first and then step 405, step 405 first and then step 404, or both simultaneously.
At step 406, the hardware accelerator loads the third matrix, index information, and corresponding first matrix blocks.
In a possible implementation manner, the first matrix block a i As shown in fig. 6a, comprising 32 rows and 8 columns. The hardware accelerator is configured to load a first matrix block A i The third matrix and index information corresponding to the third matrix do not require loading all of the first and second matrices at once, thus helping to increase the bandwidth of data loading from memory to the hardware accelerator. It will also be appreciated that in order to improve the bandwidth of data loading from memory to the hardware accelerator, and the computational performance of the hardware accelerator, in an alternative implementation, the hardware accelerator may be loaded one first matrix block a at a time i A third matrix and index information corresponding to the third matrix. As described above in connection with FIG. 2b, in one possible implementation, the processor 111a will be the firstThe three matrices and index information are stored in the memory 112a, the first matrix also being in the first matrix block A i Is stored in memory 112 a. The hardware accelerator 20a loads the third matrix in the memory 112a into the first data buffer 2111, the first matrix block corresponding to the third matrix into the second data buffer 2122, and the index information into the index information buffer 2113 under the control of the processor 111 a.
Step 407, the hardware accelerator extracts the first matrix block A from the index information i M columns of elements are selected to obtain a fourth matrix.
One column address of the index information corresponds to m row addresses, and each row of the first matrix block A_i undergoes a product operation with each column of the second matrix block B_j, so one fourth matrix is obtained per column address of the index information; combined with the index information in fig. 5c, eight 8×8 fourth matrices are obtained, named a_i_1, a_i_2 … a_i_8 respectively. Taking the 8 row addresses corresponding to column address 0 in fig. 5c as an example, an 8×8 fourth matrix can be selected from the first matrix block A_i, as shown in fig. 6b. For the zero elements in the index information, the corresponding elements may be taken from the first matrix block A_i or zero elements may be supplied directly; in the fourth matrix shown in fig. 6b, the last two elements of each row correspond to such entries. Obtaining the fourth matrices from the first matrix block A_i through the index information may also be understood as follows: a matrix block of 32 rows and 8 columns generates eight 8×8 fourth matrices according to the index information, which reduces the specification of the matrices the hardware accelerator has to operate on, so that the 32-row, 8-column matrix block can be operated on in a hardware accelerator of this specification.
Taking fig. 2b as an example, the selector 2114 in the hardware accelerator 20a selects m columns of elements from the first matrix block A_i cached in the second data cache 2122 according to the index information cached in the index information cache 2113, to obtain a fourth matrix.
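As a software analogue of what the selector does, the sketch below (continuing the NumPy example above) gathers one l×m fourth matrix from a first matrix block, treating each row address in the index as a column address of the block, consistent with the one-to-one correspondence between row addresses and column addresses described later; for padded index entries it supplies zero columns directly. The l-rows-by-many-columns layout of the block is an assumption of the example.

```python
import numpy as np

def gather_fourth(a_block, index_col):
    """Select m columns from a first matrix block according to one index column.
    a_block: l x K array; index_col: m row addresses, -1 meaning a padded entry."""
    l = a_block.shape[0]
    cols = [np.zeros(l) if addr < 0 else a_block[:, addr] for addr in index_col]
    return np.stack(cols, axis=1)              # l x m fourth matrix
```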
In step 409, the hardware accelerator performs a multiply-add operation on the fourth matrix and the third matrix to obtain a fifth matrix.
In one possible implementation, shown in fig. 6c, the fourth matrix a_i_1 is multiplied by the first column of the third matrix, the fourth matrix a_i_2 is multiplied by the second column of the third matrix, and so on, until the fourth matrix a_i_8 is multiplied by the 8th column of the third matrix; the intermediate values obtained by the multiplications are accumulated to obtain an 8×8 fifth matrix. The product operation of a fourth matrix and a column of the third matrix can also be understood as the product of a matrix and a vector. With this scheme, the hardware accelerator can complete the product operation of the third matrix and the fourth matrices in one pass.
In a possible implementation in conjunction with fig. 2b, the third matrix and the eight fourth matrices are input into the operation circuit 212 at one time; the multiplication unit 2121 performs the product operations to obtain intermediate values, and the intermediate values are input into the accumulation unit 2122 for accumulation to obtain the fifth matrix.
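Numerically, step 409 for case one reduces to one matrix-vector product per column: column c of the fifth matrix is the fourth matrix for column c times column c of the third matrix. The following sketch reuses the helpers above and illustrates the arithmetic only, not the multiplication and accumulation units.

```python
import numpy as np

def fifth_case_one(a_block, third, index):
    """Case one (t = 1): third and index are both m x n arrays.
    Column c of the fifth matrix is the c-th fourth matrix times column c
    of the third matrix; padded entries contribute nothing, since both the
    gathered column and the stored value are zero."""
    n = third.shape[1]
    fifth = np.zeros((a_block.shape[0], n))
    for c in range(n):
        fourth = gather_fourth(a_block, index[:, c])   # l x m
        fifth[:, c] = fourth @ third[:, c]             # matrix-vector multiply-add
    return fifth
```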
In step 410, the hardware accelerator accumulates the obtained at least one fifth matrix to obtain a target result.
As described with reference to fig. 4a, fig. 4b and fig. 2b, the first matrix shown in fig. 4b includes 4 first matrix blocks and the second matrix shown in fig. 4a includes 4 second matrix blocks, so the hardware accelerator 20a may obtain 4 × 4 = 16 fifth matrices, and the accumulation unit 2122 of the hardware accelerator 20a may accumulate the 16 fifth matrices to obtain the target result.
Through steps 403 to 410, a hardware accelerator whose specification supports the product of an 8×8 matrix and an 8×8 matrix implements, in a single pass of the above flow, the product operation of an 8×32 matrix and a 32×8 matrix.
The second case is illustrated with the second matrix block B_j shown in fig. 7a, which includes 32 rows and 8 columns. The following details steps 403 to 410, which continue after step 402 described above, in this second case.
In step 403, the processor determines that the numbers of non-zero elements in the columns of the second matrix block B_j of fig. 7a are: p_0 = 8, p_1 = 8, p_2 = 11, p_3 = 6, p_4 = 1, p_5 = 6, p_6 = 5, p_7 = 3; the number of non-zero elements in the column of the second matrix block B_j with the largest number of non-zero elements is therefore p = 11.
In step 404, according to the condition that t satisfies in the third matrix, (t - 1) × m < 11 ≤ t × m, the processor can determine t = 2; that is, the third matrix is a (2 × 8) × 8 matrix, as shown in fig. 7b. The third matrix shown in fig. 7b includes each non-zero element of the second matrix block B_j shown in fig. 7a, and each column is supplemented with an appropriate number of zero elements, identified by x; for example, column 0 is supplemented with 8 zero elements, illustrated in fig. 7b in the last 8 rows of column 0. Alternatively, the non-zero elements of the third matrix need not be closely arranged: the supplemented zero elements may be in the last 8 rows of column 0, the first 8 rows of column 0, or any 8 rows of column 0; the position of the supplemented zero elements within any column is not limited here.
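The determination of t and the construction of the third matrix can be sketched in the same style. Here t = ⌈p/m⌉ is the smallest integer satisfying (t - 1) × m < p ≤ t × m, and padding each column's values at the tail is one of the arbitrary placements the text allows; the helper name build_third is again an assumption of the example.

```python
import math
import numpy as np

def build_third(block, m):
    """Column-wise compression of a block into a (t*m) x n third matrix."""
    p = max(np.count_nonzero(block[:, c]) for c in range(block.shape[1]))
    t = max(1, math.ceil(p / m))               # smallest t with (t-1)*m < p <= t*m
    third = np.zeros((t * m, block.shape[1]))
    for c in range(block.shape[1]):
        vals = block[block[:, c] != 0, c]      # the column's non-zero values, in row order
        third[:len(vals), c] = vals            # tail rows keep the padded zeros
    return third, t

# For the fig. 7a block, p = 11 and m = 8 give t = 2, i.e. a 16 x 8 third matrix.
```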
In step 405 above, the processor determines the index information of the second matrix block B_j shown in fig. 7a, as shown in fig. 7c. The index information is divided into two parts. The first part is the position information of the non-zero elements of the first 8 index columns: the row addresses of the non-zero elements of column 0 are 1/4/8/9/12/13/17/28 respectively, the row addresses of the non-zero elements of column 1 are 3/4/6/11/14/20/24/26 respectively, the row addresses of the non-zero elements of column 2 are 0/1/3/16/18/20/21/25 respectively, the row addresses of the non-zero elements of column 3 are 0/5/12/13/21/27 respectively, the row address of the non-zero element of column 4 is 15, the row addresses of the non-zero elements of column 5 are 5/11/12/18/24/28 respectively, the row addresses of the non-zero elements of column 6 are 3/8/21/23/25 respectively, and the row addresses of the non-zero elements of column 7 are 11/16/30 respectively. A column whose number of non-zero elements p_i is less than m = 8 may be padded with zero elements, which may also be identified by x. For example, column 0 has 8 non-zero elements and needs no padding; column 3 has 6 non-zero elements and needs 2 padded zero elements, whose position within column 3 may be arbitrary, illustrated in fig. 7c in the last two rows. The second part is the position information of the non-zero elements of the last 8 index columns: the row address of the non-zero element of column 0 is 29, that is, column 0 has one non-zero element and the remaining 7 entries are identified with zero elements; column 1 has no non-zero elements and can be identified entirely with zero elements; the row addresses of the non-zero elements of column 2 are 28/29/30 respectively, with the remaining 5 entries identified with zero elements; and columns 3/4/5/6/7 have no non-zero elements and, like column 1, are identified with zero elements.
In step 406 described above, the hardware accelerator loads the third matrix shown in fig. 7b, the index information shown in fig. 7c, and the corresponding first matrix block. In this second case, the first matrix block may be the same as the first matrix block shown in fig. 6a, and the manner in which the hardware accelerator loads the third matrix, the index information, and the corresponding first matrix block may be the same as that described in step 406, which is not described in detail herein.
In step 407, according to the index information shown in fig. 7c, the hardware accelerator performs data selection from the first matrix block A_i twice in a loop: the first time, 8 fourth matrices are obtained from the first matrix block according to the row addresses corresponding to each of the first 8 columns (the first column 0 to column 7) of the index information shown in fig. 7c, named a_i_1_1, a_i_1_2 … a_i_1_8 respectively; the second time, 8 fourth matrices are obtained from the first matrix block according to the row addresses corresponding to the last 8 columns (the following column 0 to column 7) of the index information shown in fig. 7c, named a_i_2_1, a_i_2_2 … a_i_2_8 respectively. The manner of obtaining the fourth matrices is the same both times as the process of fig. 6b described above and is not repeated here.
Taking fig. 2b described above as an example, the selector 2114 in the hardware accelerator 20a may, according to the index information cached in the index information cache 2113, select m columns of elements from the first matrix block A_i cached in the second data cache 2122 in two passes, to obtain the fourth matrices.
After step 407 and before step 409, the second case further includes step 408, in which the hardware accelerator divides the third matrix into 2 matrix blocks of 8×8.
In the third matrix of fig. 7b, the part above the dashed line is one 8×8 matrix block and the part below the dashed line is another 8×8 matrix block. The index information of the first 8 columns in fig. 7c is the position information of the non-zero elements in the matrix block above the dashed line in fig. 7b, and the index information of the last 8 columns in fig. 7c is the position information of the non-zero elements in the matrix block below the dashed line in fig. 7b.
In step 409, the hardware accelerator performs a multiply-add operation on the 16 fourth matrices obtained as in fig. 6b and the third matrix shown in fig. 7b, to obtain a fifth matrix.
With reference to fig. 2b, the 16 fourth matrices are input into the operation circuit 212 of the hardware accelerator 20a in two passes. In the first pass, the third matrix is input together with the fourth matrices a_i_1_1, a_i_1_2 … a_i_1_8, and in the multiplication unit 2121 of the operation circuit 212 each fourth matrix is multiplied by one column of the third matrix; that is, a_i_1_1 is multiplied by the first column of the third matrix, a_i_1_2 by the second column, and so on, up to a_i_1_8 and the 8th column of the third matrix. In the second pass, a_i_2_1, a_i_2_2 … a_i_2_8 are likewise input into the multiplication unit 2121 of the operation circuit 212 for product operations: a_i_2_1 is multiplied by the first column of the third matrix, a_i_2_2 by the second column, and so on, up to a_i_2_8 and the 8th column. The multiplication process of the two passes may refer to the process of fig. 6c and is not repeated here; the intermediate values obtained from the two passes are input into the accumulation unit 2122 for accumulation, obtaining the fifth matrix. The process in case two can also be understood as a loop operation performed t times (t = 2 in case two).
Further, the hardware accelerator accumulates the at least one fifth matrix obtained in case two to obtain the target result.
In case two, the hardware accelerator can complete the product operation of the 8×32 matrix and the 32×8 matrix by performing the product operation of the 8×8 matrix and the 8×8 matrix twice.
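Putting the pieces together, case two is the t-pass loop sketched below: the third matrix and the index (with t·m rows each) are processed in row slices of m, each pass gathers fresh fourth matrices, and the partial products are accumulated into the fifth matrix, as in fig. 6c performed twice. The closing check against the ordinary dense product uses hypothetical 8×32 and 32×8 operands and reuses build_index, gather_fourth and build_third from the sketches above.

```python
import numpy as np

def fifth_general(a_block, third, index, m):
    """General case: third and index have t*m rows; accumulate t passes."""
    t = third.shape[0] // m
    fifth = np.zeros((a_block.shape[0], third.shape[1]))
    for k in range(t):                                    # t = 2 in case two
        rows = slice(k * m, (k + 1) * m)
        for c in range(third.shape[1]):
            fourth = gather_fourth(a_block, index[rows, c])
            fifth[:, c] += fourth @ third[rows, c]        # accumulate the partial product
    return fifth

# Check against the ordinary dense product on hypothetical 8x32 / 32x8 operands.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))
B = rng.standard_normal((32, 8)) * (rng.random((32, 8)) < 0.3)
third, t = build_third(B, m=8)
index = build_index(B, depth=t * 8)
assert np.allclose(fifth_general(A, third, index, m=8), A @ B)
```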
The foregoing covers the two cases in which the processor obtains at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator. For convenience of description, the second matrix shown in fig. 4a and the first matrix shown in fig. 4b were taken as the example, with a hardware accelerator whose specification supports the product operation of an 8×8 matrix and an 8×8 matrix.
In this application, if the second matrix is the left matrix, the third matrix is an m×(t×n) matrix, and the data processing method can refer to the contents of fig. 3 to fig. 7c, which are not repeated here.
Based on the above and the same conception, the present application provides an apparatus, as shown in fig. 2a, for performing the data processing method described above. The apparatus in this example may perform the scheme executed by the data processing apparatus in fig. 3, or the scheme executed by the data processing apparatus in fig. 4c.
Optionally, the memory 112 may also be used for storing program instructions, and the processor 111 may execute one or more steps of the embodiments shown in the above-mentioned solutions, or an optional implementation thereof, by calling the program instructions stored in the memory 112, so that the apparatus implements the functions of the data processing apparatus in the above-mentioned method.
The processor 111 is configured to execute the instructions stored in the memory and to control the hardware accelerator 20 to perform an acceleration task. When the processor 111 executes the instructions stored in the memory 112, the processor 111 in the apparatus is configured to: obtain a first matrix and a second matrix, wherein the first matrix and the second matrix each include non-zero elements, the first matrix is an L×M matrix, the second matrix is an M×N matrix, and L, M and N are all positive integers; and obtain at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, wherein the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l×m matrix and an m×n matrix, the third matrix is a (t×m)×n matrix, each third matrix in the at least one third matrix respectively includes each non-zero element in different n columns of the second matrix and does not include part or all of the zero elements in the n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the different n columns of the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
The hardware accelerator 20 is configured to: for each third matrix in the at least one third matrix, obtain a fourth matrix from the corresponding l rows in the first matrix according to the index information corresponding to the third matrix, wherein the fourth matrix is an l×m matrix; obtain a fifth matrix according to the fourth matrix and the third matrix; and obtain a target result according to at least one fifth matrix, wherein the target result is an L×N matrix.
In an alternative embodiment, t satisfies the following condition: (t - 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the largest number of non-zero elements among the different n columns in the second matrix.
In an alternative embodiment, p satisfies the following condition: p ≤ M - m.
In an alternative embodiment, the index information includes row addresses and column addresses, one column address corresponding to m row addresses; the hardware accelerator 20 is configured to select m column elements from the corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information, so as to obtain a fourth matrix, where the m row addresses are in one-to-one correspondence with m column addresses of the m column elements.
In an alternative embodiment, t is an integer greater than or equal to 2; a hardware accelerator 20 for dividing the third matrix into t matrices of m x n; and respectively carrying out multiply-add operation on the fourth matrix and t m-n matrices to obtain a fifth matrix.
Based on the foregoing and the same conception, the present application provides an apparatus 800 for performing the data processing method described above. Fig. 8 schematically illustrates a structural diagram of the apparatus provided in the present application; as shown in fig. 8, the apparatus 800 includes a processing module 801, a selecting module 802, and an operation module 803. The selecting module 802 may be the selector 2114 in fig. 2b, and the operation module 803 may be the operation circuit 212 in fig. 2b. The apparatus 800 in this example may be the data processing apparatus described above, and may execute the scheme corresponding to the data processing apparatus in fig. 3, or the scheme corresponding to the data processing apparatus in fig. 4c.
A processing module 801, configured to: obtain a first matrix and a second matrix, wherein the first matrix and the second matrix each include non-zero elements, the first matrix is an L×M matrix, the second matrix is an M×N matrix, and L, M and N are all positive integers; and obtain at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, wherein the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l×m matrix and an m×n matrix, the third matrix is a (t×m)×n matrix, each third matrix in the at least one third matrix respectively includes each non-zero element in different n columns of the second matrix and does not include part or all of the zero elements in the n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the different n columns of the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
A selecting module 802, configured to obtain, for each third matrix in the at least one third matrix, a fourth matrix from the corresponding l rows in the first matrix according to index information corresponding to the third matrix, where the fourth matrix is a matrix of l×m;
an operation module 803, configured to obtain a fifth matrix according to the fourth matrix and the third matrix, and to obtain a target result according to at least one fifth matrix, wherein the target result is an L×N matrix.
It should be understood that the above division of the apparatus 800 into units is merely a division by logical function; in actual implementation, the units may be fully or partially integrated into one physical entity or physically separated. In this application, the processing module 801 of fig. 8 may be implemented by the processor 111 of fig. 2a, and the selecting module 802 and the operation module 803 may be implemented by the hardware accelerator 20 of fig. 2a. That is, the processing module 801 may execute the scheme executed by the processor 111 of fig. 2a, the selecting module 802 and the operation module 803 may execute the scheme executed by the hardware accelerator 20 of fig. 2a, and for the remaining contents reference may be made to the description above, which is not repeated here.
The above-described embodiments may be implemented in whole or in part by software, hardware, or a combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The instructions may be stored in a computer storage medium or transmitted from one computer storage medium to another; for example, the instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, twisted pair) or wireless (e.g., infrared, radio, microwave) means. The computer storage medium may be any medium accessible by a computer, or a data storage device, such as a server or data center, integrating one or more media. The medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape, or a magneto-optical disk (MO)), an optical medium (e.g., an optical disc), or a semiconductor medium (e.g., a ROM, an EPROM, an EEPROM, or a solid state disk (SSD)).
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by instructions. These instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (20)

1. A data processing method, characterized by being applied to a data processing apparatus including a processor and a hardware accelerator, the method comprising:
the processor acquires a first matrix and a second matrix, wherein the first matrix and the second matrix both comprise non-zero elements, the first matrix is an L×M matrix, the second matrix is an M×N matrix, and L, M and N are all positive integers;
the processor obtains at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of the hardware accelerator, wherein the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l×m matrix and an m×n matrix, the third matrix is a (t×m)×n matrix, each third matrix in the at least one third matrix respectively comprises the non-zero elements in different n columns of the second matrix and does not comprise part or all of the zero elements in the n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the n columns of the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
for each third matrix in the at least one third matrix, the hardware accelerator acquires a fourth matrix from the corresponding l rows in the first matrix according to the index information corresponding to the third matrix, wherein the fourth matrix is an l×m matrix; obtains a fifth matrix according to the fourth matrix and the third matrix; and obtains a target result according to at least one fifth matrix, wherein the target result is an L×N matrix.
2. The method of claim 1, wherein t satisfies the following condition:
(t - 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the largest number of non-zero elements in the n columns of the second matrix.
3. The method of claim 2, wherein p satisfies the following condition:
p ≤ M - m.
4. A method as claimed in any one of claims 1 to 3, wherein the index information comprises row addresses and column addresses, one column address corresponding to m row addresses;
the hardware accelerator obtains a fourth matrix from the corresponding l rows in the first matrix according to the index information, including:
and the hardware accelerator selects m column elements from the corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information to obtain the fourth matrix, wherein the m row addresses are in one-to-one correspondence with m column addresses of the m column elements.
5. The method of any one of claims 1 to 4, wherein t is an integer greater than or equal to 2;
the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix, and the method comprises the following steps:
the hardware accelerator divides the third matrix into t matrices of m x n;
and the hardware accelerator performs multiplication and addition operation on the fourth matrix and the t m-by-n matrices respectively to obtain the fifth matrix.
6. A data processing method, applied to a hardware accelerator, the method comprising:
the hardware accelerator obtains a first matrix, a third matrix and index information, wherein the third matrix and the index information are obtained by a processor according to non-zero elements in a second matrix and the specification of the hardware accelerator, the first matrix and the second matrix both comprise non-zero elements, the first matrix is an L×M matrix, the second matrix is an M×N matrix, the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l×m matrix and an m×n matrix, the index information is used for indicating the position information of the non-zero elements in different n columns of the second matrix, the third matrix is a (t×m)×n matrix, and the third matrix comprises all the non-zero elements in the n columns of the second matrix and does not comprise part or all of the zero elements in the n columns of the second matrix, wherein L, M and N are positive integers, l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
The hardware accelerator acquires a fourth matrix from the corresponding l rows in the first matrix according to the index information, wherein the fourth matrix is a matrix of l x m;
the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix;
and the hardware accelerator obtains a target result according to at least one fifth matrix, wherein the target result is a matrix of L x N.
7. The method of claim 6, wherein t satisfies the following condition:
(t - 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the largest number of non-zero elements in the n columns of the second matrix.
8. The method of claim 7, wherein p satisfies the following condition:
p ≤ M - m.
9. A method as claimed in any one of claims 6 to 8, wherein the index information comprises row addresses and column addresses, one column address corresponding to m row addresses;
the hardware accelerator obtains a fourth matrix from the corresponding l rows in the first matrix according to the index information, including:
and the hardware accelerator selects m column elements from the corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information to obtain the fourth matrix, wherein the m row addresses are in one-to-one correspondence with m column addresses of the m column elements.
10. The method of any one of claims 6 to 9, wherein t is an integer greater than or equal to 2;
the hardware accelerator obtains a fifth matrix according to the fourth matrix and the third matrix, and the method comprises the following steps:
the hardware accelerator divides the third matrix into t matrices of m x n;
and the hardware accelerator performs multiplication and addition operation on the fourth matrix and the t m-by-n matrices respectively to obtain the fifth matrix.
11. An apparatus, comprising:
a processor, configured to: obtain a first matrix and a second matrix, wherein the first matrix and the second matrix each comprise non-zero elements, the first matrix is an L×M matrix, the second matrix is an M×N matrix, and L, M and N are all positive integers; and obtain at least one third matrix and at least one piece of corresponding index information according to the non-zero elements in the second matrix and the specification of a hardware accelerator, wherein the specification of the hardware accelerator is used for indicating that the hardware accelerator processes the product operation of an l×m matrix and an m×n matrix, the third matrix is a (t×m)×n matrix, each third matrix in the at least one third matrix respectively comprises each non-zero element in different n columns of the second matrix, and the index information is used for indicating the position information of the non-zero elements in the n columns of the second matrix; l is a positive integer not greater than L, m is a positive integer not greater than M, n is a positive integer not greater than N, and t is a positive integer;
a hardware accelerator, configured to: for each third matrix in the at least one third matrix, obtain a fourth matrix from the corresponding l rows in the first matrix according to the index information corresponding to the third matrix, wherein the fourth matrix is an l×m matrix; obtain a fifth matrix according to the fourth matrix and the third matrix; and obtain a target result according to at least one fifth matrix, wherein the target result is an L×N matrix.
12. The apparatus of claim 11, wherein t satisfies the following condition:
(t - 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the largest number of non-zero elements in the n columns of the second matrix.
13. The apparatus of claim 12, wherein p satisfies the following condition:
p ≤ M - m.
14. the apparatus according to any one of claims 11 to 13, wherein the index information includes row addresses and column addresses, one column address corresponding to m row addresses;
the hardware accelerator is specifically configured to:
and selecting m column elements from the corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information to obtain the fourth matrix, wherein the m row addresses are in one-to-one correspondence with m column addresses of the m column elements.
15. The apparatus of any one of claims 11 to 14, wherein t is an integer greater than or equal to 2;
the hardware accelerator is specifically configured to:
dividing the third matrix into t m x n matrices; and multiplying and adding the fourth matrix and the t m-n matrices respectively to obtain the fifth matrix.
16. A hardware accelerator, comprising:
the access circuit is used for acquiring a first matrix, a third matrix and index information, wherein the third matrix and the index information are obtained by a processor according to non-zero elements in a second matrix and specifications of the hardware accelerator, the first matrix and the second matrix both comprise non-zero elements, the first matrix is an L x M matrix, the second matrix is an M x N matrix, the specifications of the hardware accelerator are used for indicating the hardware accelerator to process the product operation of the L x M matrix and the M x N matrix, the index information is used for indicating the position information of the non-zero elements in different N columns in the second matrix, the third matrix is a (t x M) x N matrix, and each third matrix in at least one third matrix respectively comprises all non-zero elements in different N columns in the second matrix and does not comprise part or zero elements in the N columns of the second matrix, wherein M and N are integers which are not larger than L and not larger than N, and are integers which are not larger than N, and are not larger than N;
The selection circuit is used for acquiring a fourth matrix from the corresponding l rows in the first matrix according to the index information, wherein the fourth matrix is a matrix of l x m;
the operation circuit is used for obtaining a fifth matrix according to the fourth matrix and the third matrix, and obtaining a target result according to at least one fifth matrix, wherein the target result is an L×N matrix.
17. The hardware accelerator of claim 16 wherein t satisfies the following condition:
(t - 1) × m < p ≤ t × m, where p is the number of non-zero elements in the column with the largest number of non-zero elements in the n columns of the second matrix.
18. The hardware accelerator of claim 17 wherein p satisfies the following condition:
p ≤ M - m.
19. the hardware accelerator of any of claims 16 to 18, wherein the index information comprises row addresses and column addresses, one column address corresponding to m row addresses;
the selection circuit is specifically configured to:
and selecting m column elements from the corresponding l rows in the first matrix according to m row addresses corresponding to one column address of the index information to obtain the fourth matrix, wherein the m row addresses are in one-to-one correspondence with m column addresses of the m column elements.
20. The hardware accelerator of any of claims 16 to 19 wherein t is an integer greater than or equal to 2;
the operation circuit is specifically configured to:
dividing the third matrix into t m x n matrices; and multiplying and adding the fourth matrix and the t m-n matrices respectively to obtain the fifth matrix.