CN116136894A

CN116136894A - Data processing method based on matrix processor and readable storage medium

Info

Publication number: CN116136894A
Application number: CN202111372106.0A
Authority: CN
Inventors: 王学东; 潘卫星
Original assignee: Beijing Simm Computing Technology Co ltd
Current assignee: Beijing Simm Computing Technology Co ltd
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2023-05-19

Abstract

The application provides a data processing method based on a matrix processor and a readable storage medium, wherein the method comprises the following steps: reading W first elements in a first matrix, and sending the W first elements to a calculation unit of a matrix processor for calculation; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor; repeating the steps until the number of the residual elements in the first matrix is smaller than W; and responding to the fact that the number of the residual elements in the first matrix is not zero, and sending the residual elements to the computing unit for computing. The method can improve the utilization rate of the computing unit of the matrix processor, reduce the cycle number used for computation, shorten the computation time and fully utilize the computing unit of the matrix processor.

Description

Data processing method based on matrix processor and readable storage medium

Technical Field

The application belongs to the technical field of processors, and particularly relates to a data processing method based on a matrix processor and a readable storage medium.

Background

A matrix processor usually has multiple computing units, and using them efficiently is important. However, the width of the matrix to be computed is sometimes smaller than the width of the array of computing units. In the prior art, one row of the matrix is generally read at a time and the data is computed row by row; when a small matrix is computed, not all computing units can be used, so more calculation cycles are required and the utilization rate of the computing units is low.

Disclosure of Invention

The application aims to provide a data processing method based on a matrix processor and a readable storage medium.

According to an aspect of the present application, there is provided a data processing method based on a matrix processor, the method comprising: reading W first elements in a first matrix, and sending the W first elements to a calculation unit of a matrix processor for calculation; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor; repeating the steps until the number of the residual elements in the first matrix is smaller than W; and responding to the fact that the number of the residual elements in the first matrix is not zero, and sending the residual elements to the computing unit for computing.

Optionally, before the reading of the W first elements in the first matrix, the method further includes: receiving an operation instruction, and analyzing the received operation instruction to determine the operation type indicated by the operation instruction; and determining the value of W according to the determined operation type, the width of the first matrix and the number of calculation units.

Optionally, W first elements in the first matrix are read to the first register unit of the matrix processor, and W=K is set.

Optionally, sending the W first elements to the calculation unit of the matrix processor for calculation includes: in response to W=K being set, clipping the W first elements read from the first matrix in data order to obtain W₁ elements, wherein W₁ = β×N ≤ W, β > 1, and β is an integer; and sending the W₁ elements to the calculation unit of the matrix processor for calculation.

Optionally, before reading the W first elements in the first matrix, the method further includes: determining a read address of the first matrix, wherein the read address is addr + β×N×(C-1), addr is the address of the first element of the first matrix, and C is the number of the current read of the first matrix (C=1 for the first read).

Optionally, the reading W first elements in the first matrix and sending the W first elements to a calculation unit of the matrix processor for calculation includes: reading W first elements in the first matrix to a first register unit of the matrix processor; reading out W first elements from the first register unit and feeding them into the calculation unit of the matrix processor, and reading out W calculation results from the result register unit of the matrix processor and feeding them into the calculation unit of the matrix processor, so that the calculation unit performs calculation and updates the calculation results to the result register unit; the result register unit is used for caching the calculation result calculated by the calculation unit each time.

Optionally, after W first elements are read from the first register unit and sent to the calculation unit of the matrix processor for calculation, and the calculation result is output to the result register unit of the matrix processor, the method further includes: in response to the number of remaining elements in the first matrix being zero, determining that the result register contains u×N elements; setting L equal to the integer part of u/2; if u is an even number, dividing the elements in the result register into two groups in storage order, each group containing L×N elements, sending the two groups of elements to the calculation unit for calculation, and outputting the L×N elements of the calculation result to the result register unit; if u is an odd number, dividing the elements in the result register into three groups in storage order, wherein the first group and the second group each contain L×N elements and the third group contains N elements, sending the first group and the second group to the calculation unit for calculation, and outputting the L×N elements of the calculation result to the result register unit, so that the result register unit contains the L×N elements of the calculation result and the N elements of the third group; and repeating the above steps until u=1.
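The halving procedure above can be illustrated with a minimal sketch. It assumes the calculation is element-wise addition (as in column accumulation) and models the result register as a plain Python list; the function name `fold_result_register` and the `op` parameter are hypothetical and do not appear in the application.

```python
def fold_result_register(values, n, op=lambda a, b: a + b):
    """Fold u*N register elements down to N elements by repeated halving.

    values: contents of the result register, in storage order (u rows of N).
    n:      width N of the first matrix.
    op:     element-wise operation; addition is an assumption for illustration.
    """
    values = list(values)
    if n <= 0 or len(values) % n != 0:
        raise ValueError("register must hold a whole number of rows of width N")
    u = len(values) // n
    while u > 1:
        l = u // 2                              # L = integer part of u/2
        first = values[: l * n]                 # first group, L*N elements
        second = values[l * n : 2 * l * n]      # second group, L*N elements
        tail = values[2 * l * n :]              # empty if u is even, N elements if odd
        values = [op(a, b) for a, b in zip(first, second)] + tail
        u = len(values) // n                    # L if u was even, L + 1 if odd
    return values


# Three rows of width 2 held in the register: [1, 2], [3, 4], [5, 6].
print(fold_result_register([1, 2, 3, 4, 5, 6], n=2))  # [9, 12]
```

With u=3 the first pass adds rows one and two and carries the third row over unchanged; the second pass adds the two remaining rows, leaving one row of column sums.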

According to a second aspect of the present application, there is provided a data processing method based on a matrix processor, including: reading W first elements in a first matrix, and acquiring W second elements corresponding to the W first elements in a second matrix; wherein W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor; the W first elements and the W second elements are sent to a calculation unit of the matrix processor for calculation; repeating the steps until the number of the residual elements in the first matrix is smaller than W; and, in response to the number of the residual elements in the first matrix being not zero, sending the residual elements to the computing unit for computing.

Optionally, Y second elements in the second matrix are read into the second register unit of the matrix processor, and if the width and the height of the first matrix are respectively equal to the width and the height of the second matrix, Y=W is set.

Optionally, the first matrix is a matrix with a plurality of rows and a plurality of columns, the second matrix is a matrix with a plurality of rows and a plurality of columns, and the width and the height of the first matrix are respectively equal to the width and the height of the second matrix, so that W is equal to K and Y is equal to K.

Optionally, the obtaining W second elements corresponding to the W first elements in the second matrix includes:

reading N second elements in the second matrix to a second register unit of the matrix processor in response to W=α×N≤K being set and the width of the first matrix being equal to the width of the second matrix, wherein α>1 and α is an integer;

copying the N second elements into α copies in the second register unit to expand them into W second elements; or

reading α second elements in the second matrix to the second register unit of the matrix processor in response to W=α×N≤K being set and the height of the first matrix being equal to the height of the second matrix;

copying each of the α second elements N times in the second register unit to expand them into W second elements; or

reading 1 second element in the second matrix to the second register unit of the matrix processor in response to W=K being set and the width and height of the second matrix both being 1;

copying the 1 second element W times in the second register unit to expand it into W second elements.

Optionally, before the reading of the W first elements in the first matrix, the method further includes: performing register parameter configuration according to a register instruction, wherein the register instruction at least comprises the width, the height and the row interval number of the first matrix; in response to the width and the row spacing number of the first matrix being equal, confirming that the source addresses of the first matrix are consecutive.

According to a third aspect of the present application, there is provided a data processing apparatus based on a matrix processor, comprising: the first reading module is used for reading W first elements in the first matrix and sending the W first elements to the calculating unit of the matrix processor for calculation; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor; repeating the steps until the number of the residual elements in the first matrix is smaller than W;

and the second reading module is used for responding to the fact that the number of the residual elements in the first matrix is not zero, and sending the residual elements to the calculation unit for calculation.

According to a fourth aspect of the present application, there is provided a matrix processor based data processing apparatus comprising: the first reading module is used for reading W first elements in the first matrix and acquiring W second elements corresponding to the W first elements in the second matrix; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor; the W first elements and the W second elements are sent to a calculation unit of the matrix processor for calculation; and the second reading module is used for responding to the fact that the number of the residual elements in the first matrix is not zero, and sending the residual elements to the calculation unit for calculation.

According to a fifth aspect of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the methods of the first and second aspects of the embodiments of the present application.

According to a sixth aspect of the present application, there is provided a computer program product for, when run on a computer, causing the computer to perform the method according to the first and second aspects of the embodiments of the present application.

According to a seventh aspect of the present application, there is provided a chip, wherein the chip includes a matrix processor and the apparatus of the third aspect and the fourth aspect of the embodiments of the present application.

The beneficial effects of this application include:

According to the data processing method based on the matrix processor, the dispatch of matrix data to the calculation units of the matrix processor in each calculation cycle is optimized, which improves the utilization rate of the calculation units, reduces the number of cycles used for calculation, and can greatly shorten the calculation time.

Drawings

FIG. 1 is a flow chart of a data processing method based on a matrix processor according to an embodiment of the present application;

FIG. 2 is a flow chart of a data processing method based on a matrix processor according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a matrix performing a row-column accumulation operation;

FIG. 4 is a schematic diagram of an intermediate result of a column accumulation operation performed by a matrix of a matrix processor-based data processing method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a double matrix dot product operation;

FIG. 6 (a) -FIG. 6 (b) are schematic diagrams of two matrices before and after operation optimization in a matrix processor-based data processing method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a matrix adding a single number;

FIG. 8 is a schematic diagram of a matrix and row vector addition operation;

FIG. 9 (a) -FIG. 9 (c) are schematic diagrams of a matrix processor-based data processing method before and after performing operation optimization on a multi-row multi-column matrix and a single-row matrix according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a matrix-column vector addition operation;

FIG. 11 is a schematic diagram illustrating an intermediate process of adding a multi-row multi-column matrix and a single-column matrix in a data processing method based on a matrix processor according to an embodiment of the present application.

Detailed description of the preferred embodiments

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present application. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present application.

Unless otherwise defined, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items.

Fig. 1 is a schematic flow chart of a data processing method based on a matrix processor according to an embodiment of the present application. As shown in fig. 1, the data processing method based on the matrix processor in this method embodiment mainly includes:

Step S101: W first elements in the first matrix are read and sent to a calculation unit of the matrix processor for calculation, wherein W is greater than N and less than or equal to K. The first matrix is the source matrix to be calculated. Optionally, the data of the source matrix to be calculated is read from memory according to the data address of the matrix, and the read data is then calculated.

In one embodiment, W first elements in a first matrix are read into a first register unit of a matrix processor. In a matrix processor, before performing an operation, data needs to be read into a register from the outside, and then the data in the register is sent to a calculation unit for calculation. The source of the data may be a memory or other device that is directly input to or output from the processor.

It should be noted that W is greater than the number N of elements in each row of the first matrix (i.e., the width of the first matrix); W>N is set so that the calculation units of the matrix processor are fully used in the subsequent calculation. W is also less than or equal to the number K of calculation units of the matrix processor. In one embodiment, W=K may be set in the step of reading W first elements in the first matrix into the first register unit of the matrix processor. When W=K, all calculation units of the matrix processor can be used in the subsequent calculation, giving a utilization rate of 100%.

Step S102: step S101 is repeated until the number of remaining elements in the first matrix is smaller than W. In this step, the W first elements read from the first matrix each time are written into the calculation unit for calculation. For the first calculation, W first elements of the first matrix are read and sent to the calculation unit of the matrix processor, then another W first elements are read and sent to the calculation unit; the calculation unit performs the calculation and outputs the result to the result register. For each subsequent calculation, W first elements of the first matrix are read and sent to the calculation unit, W calculation results are read from the result register and sent to the calculation unit, the calculation unit performs the calculation, and the result register is updated with the result. The result register therefore always holds the latest calculation result.

Step S103: in response to the number of remaining elements in the first matrix being not zero, the remaining elements in the first matrix are sent to the calculation unit for calculation. If the number of remaining elements in the first matrix is zero, the calculation is complete and step S103 need not be performed. All elements of the first matrix are traversed by repeating steps S101 and S102 until every element of the first matrix has been read, that is, until the number of remaining elements in the first matrix is zero.

If the number of remaining elements in the first matrix is less than W after steps S101 and S102 have been repeated several times, W first elements can no longer be read at once; the remaining elements of the first matrix can then be calculated in the calculation unit in a single pass, which completes the calculation. Optionally, the calculation result may be output to the result register unit of the matrix processor each time the calculation unit completes a calculation.

Compared with the prior art, W elements are calculated in each calculation cycle, and W is greater than the width N of the first matrix, so the calculation units of the matrix processor are more fully used and their utilization rate is improved. Illustratively, with W=2N, 2N elements are calculated in parallel in the calculation unit, and the calculation efficiency is 2 times that of the prior art.
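The saving in calculation cycles can be estimated with a short sketch. It assumes one cycle per row in the row-by-row scheme and one cycle per group of W elements (plus one cycle for any remainder) in the scheme of steps S101 to S103; the function names are hypothetical and the model ignores memory latency.

```python
def cycles_row_by_row(height: int) -> int:
    """Row-by-row scheme: one matrix row is issued per calculation cycle."""
    return height


def cycles_chunked(height: int, width: int, w: int) -> int:
    """Chunked scheme: W elements per cycle, plus one cycle for a non-zero remainder."""
    total = height * width
    full, remainder = divmod(total, w)
    return full + (1 if remainder else 0)


# A 7 x 5 first matrix on a processor with K = 16 calculation units.
M, N, K = 7, 5, 16
W = 3 * N                          # 15, which satisfies N < W <= K
print(cycles_row_by_row(M))        # 7
print(cycles_chunked(M, N, W))     # 3 (two full groups of 15, then 5 remaining)
```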

Fig. 2 is a flow chart of a data processing method based on a matrix processor according to an embodiment of the present application. As shown in fig. 2, the data processing method based on the matrix processor in this method embodiment mainly includes:

Step S201: reading W first elements in the first matrix, and acquiring W second elements corresponding to the W first elements in the second matrix, wherein W is greater than the width N of the first matrix and less than or equal to the number K of the computing units of the matrix processor. In this embodiment, calculation between the first matrix and the second matrix can be implemented. The first matrix is a matrix with a plurality of rows and a plurality of columns. The second matrix may be a matrix with a plurality of rows and a plurality of columns, a single row matrix with a height of 1, a single column matrix with a width of 1, or a single number with both a height and a width of 1.

Optionally, in obtaining the W second elements corresponding to the W first elements in the second matrix, Y second elements in the second matrix may be read first and, in response to Y being smaller than W, expanded into W second elements. When the first matrix and the second matrix are both matrices with a plurality of rows and a plurality of columns, W is equal to K and Y is equal to K. When the first matrix is a matrix with a plurality of rows and columns and the second matrix is a single number whose height and width are both 1, Y is set to 1 and W is equal to K. Specifically: in response to W=α×N≤K being set and the width of the first matrix being equal to the width of the second matrix, N second elements in the second matrix are read into a second register unit of the matrix processor, wherein α>1 and α is an integer; in the second register unit, the N second elements are copied into α copies to expand them into W second elements. Alternatively, in response to W=α×N≤K being set and the height of the first matrix being equal to the height of the second matrix, α second elements in the second matrix are read into the second register unit of the matrix processor; in the second register unit, each of the α second elements is copied N times to expand them into W second elements. Alternatively, in response to W=K being set and the width and height of the second matrix both being 1, 1 second element in the second matrix is read into the second register unit of the matrix processor; in the second register unit, the 1 second element is copied W times to expand it into W second elements.
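The three expansion cases can be sketched as follows. The second register unit is modelled as a Python list and the operation is assumed to be element-wise addition; all function names are hypothetical and only illustrate how Y second elements become W second elements aligned with W first elements stored row by row.

```python
def expand_row_vector(row, alpha):
    """Single-row second matrix: N elements copied into alpha copies (W = alpha * N)."""
    return list(row) * alpha


def expand_column_slice(column, n):
    """Single-column second matrix: each of alpha elements copied N times (W = alpha * N)."""
    return [x for x in column for _ in range(n)]


def expand_scalar(value, w):
    """Single-number second matrix: the one element copied W times."""
    return [value] * w


# Two rows (alpha = 2) of a first matrix of width N = 3, stored row by row.
first = [1, 2, 3, 4, 5, 6]

row_operand = expand_row_vector([10, 20, 30], alpha=2)
print([a + b for a, b in zip(first, row_operand)])   # [11, 22, 33, 14, 25, 36]

col_operand = expand_column_slice([100, 200], n=3)
print([a + b for a, b in zip(first, col_operand)])   # [101, 102, 103, 204, 205, 206]

print(expand_scalar(7, w=4))                          # [7, 7, 7, 7]
```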

Optionally, before the data is read, the register parameter configuration may be further performed according to a register instruction, where the register instruction includes at least a width, a height, and a row interval number of the first matrix. In response to the width of the first matrix and the row interval number being equal, the source address of the first matrix is confirmed to be continuous, and then a plurality of data of the first matrix can be read at one time.

Step S202: and feeding the W first elements and the W second elements into a calculation unit of the matrix processor for calculation.

Step S203: repeating the steps until the number of the residual elements in the first matrix is smaller than W.

Step S204: and sending the residual elements to a calculation unit for calculation in response to the fact that the number of the residual elements in the first matrix is not zero.

According to this embodiment, W elements are calculated in each calculation cycle, so the calculation units of the matrix processor are more fully used and their utilization rate is improved. Illustratively, with W=2N, 2N elements are calculated in parallel in the calculation unit, and the calculation efficiency is 2 times that of the prior art.

An embodiment of the application discloses a data processing method based on a matrix processor, which includes:

Step S301: and receiving the operation instruction, and analyzing the received operation instruction to determine the operation type indicated by the operation instruction. In one embodiment, the matrix processor obtains an operation instruction, where the operation instruction includes at least an operation type of an operation to be performed, and the operation type may include a single matrix operation (i.e. performing computation on elements in a matrix, such as column element accumulation or row element accumulation), a double matrix operation (i.e. an inter-matrix operation), an operation of a matrix and a single number (i.e. a constant), an operation of a matrix and a single row matrix, an operation of a matrix and a single column matrix, and so on.

The operation instruction is a RISC (Reduced Instruction Set Computer) instruction, and the instruction format is shown in Table 1:

Table 1 Format example of matrix operation instructions

Vector instruction	31:26	25	24:20	19:15	14	13:12	11:7	6:0
veadd.mm	000001	0	rs2	rs1	0	00	rd	1111011

The instruction name of the operation instruction illustrated in Table 1 is veadd.mm, which indicates that matrix addition is performed on two matrices. As shown in the table, bits 6:0 of the instruction indicate that the instruction is a custom RISC instruction; rs1 and rs2 designate a first register and a second register, and rd designates a result register. The first register and the second register are the two source registers of the matrices to be calculated and are used for storing the data to be calculated of the source matrices; the result register is the destination register for storing the calculation result data. The first register, the second register and the result register are read to obtain the start addresses of the first matrix, the second matrix and the third matrix. Bits 31:26 indicate that the operation indicated by the instruction is an add operation, i.e., an addition is performed on the elements of the two matrices. Bits 25, 14 and 13:12 together determine that the operation type of the current instruction is a double matrix operation (denoted .mm). The instruction types also include .m, .mf and .mv, where .m represents a single matrix operation, .mm represents an operation between two matrices, .mf represents an operation between a matrix and a single number, and .mv represents an operation between a matrix and a vector; an operation with a single row matrix (row vector) is denoted .mv dim0, and an operation with a single column matrix (column vector) is denoted .mv dim1.
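The field layout of Table 1 can be made concrete with a minimal sketch that packs and unpacks a 32-bit instruction word. The field names `funct6`, `bit25`, `bit14`, `bits13_12` and `opcode`, the helper names, and the register numbers used in the example are all hypothetical; only the bit positions and the constant field values come from Table 1.

```python
def encode_veadd_mm(rs1: int, rs2: int, rd: int) -> int:
    """Pack a veadd.mm word using the bit positions of Table 1."""
    return (
        (0b000001 << 26)    # bits 31:26, add operation
        | (0 << 25)         # bit 25
        | (rs2 << 20)       # bits 24:20, second source register
        | (rs1 << 15)       # bits 19:15, first source register
        | (0 << 14)         # bit 14
        | (0b00 << 12)      # bits 13:12
        | (rd << 7)         # bits 11:7, result register
        | 0b1111011         # bits 6:0, custom instruction marker
    )


def decode(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of Table 1."""
    return {
        "funct6": (word >> 26) & 0x3F,
        "bit25": (word >> 25) & 0x1,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "bit14": (word >> 14) & 0x1,
        "bits13_12": (word >> 12) & 0x3,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


# Register numbers 10, 11 and 12 are arbitrary values chosen for illustration.
word = encode_veadd_mm(rs1=10, rs2=11, rd=12)
fields = decode(word)
print(fields["funct6"], fields["rs1"], fields["rs2"], fields["rd"], bin(fields["opcode"]))
```

Bits 25, 14 and 13:12 are all zero for veadd.mm, which is the combination the description associates with the .mm operation type.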

Step S302: determining the number W of elements read per cycle according to the analysis result. Specifically, W may be preconfigured, in which case its value is fixed and does not need to be determined in real time; alternatively, W may be determined in real time according to the operation instruction received, the operation type of the instruction, the width of the first matrix and the number of calculation units.

Because different operation types have different operational characteristics, the number of elements read and calculated differs between them. For example, for column element accumulation of type .m, if the number of elements read each time is a whole number of rows (1 row, 2 rows, 3 rows, etc.), the column positions correspond once the read data is written into the calculation units; if the number of elements read each time is not a whole number of rows, the column information of each element read must also be recorded, so that the elements acquired by each calculation unit have the same column information (i.e., are located in the same column) when they are written into the calculation unit. In determining the value of W, the utilization of the calculation units is also considered. The number of elements read at a time therefore differs for different operation types and different numbers of calculation units.

Therefore, in step S302, in response to the operation type being the first type, the number W of elements read per cycle is determined as the number K of calculation units, wherein the first type includes the operation types .mm and .mf. In response to the operation type being a second type, the number W of elements read per cycle is determined to be an integer multiple of the width of the first matrix, the second type comprising the operation type.

Illustratively, when the number of calculation units of the matrix processor is 16 and the number of elements per row of the source matrix is 7: if the received instruction is of type .mm, 16 (i.e., W=K) elements of the source matrix can be read at a time; if the received instruction is of type .m, 14 (i.e., W=2N) elements of the source matrix can be read at a time, so that the source matrix is read and calculated in whole rows. When the number of calculation units of the matrix processor is 32 and the number of elements per row of the source matrix is 7: if the received instruction is of type .mm, 32 (i.e., W=K) elements of the source matrix can be read at a time; if the received instruction is of type .m, 28 (i.e., W=4N) elements of the source matrix can be read at a time, so that the source matrix is read and calculated in whole rows.
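The selection of W in these examples can be reproduced with a minimal sketch. It assumes that, for type .m, W is the largest integer multiple of N that does not exceed K, which matches both examples but is not stated as a rule in the text; the function name `choose_w` and the string form of the operation types are hypothetical.

```python
def choose_w(op_type: str, n: int, k: int) -> int:
    """Pick the number W of elements read per cycle.

    op_type: ".mm", ".mf" or ".m" (string labels are an illustrative choice).
    n:       width N of the first matrix.
    k:       number K of calculation units.
    """
    if op_type in (".mm", ".mf"):
        return k                      # first type: use every calculation unit
    if op_type == ".m":
        # Assumption: largest whole number of rows that fits in K units.
        # The method requires W > N, so this presumes K >= 2 * N.
        return (k // n) * n
    raise ValueError(f"unsupported operation type: {op_type}")


print(choose_w(".mm", n=7, k=16))   # 16
print(choose_w(".m", n=7, k=16))    # 14
print(choose_w(".mm", n=7, k=32))   # 32
print(choose_w(".m", n=7, k=32))    # 28
```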

In one embodiment, a register instruction may be received and register parameters configured according to it, the register instruction including at least the width, the height and the row interval number of the first matrix. The width of the first matrix is the number N of elements in each row, and the row interval number is the address difference between elements in the same column of two adjacent rows. In response to the width and the row interval number of the first matrix being equal, it is confirmed that the source addresses of the first matrix are consecutive. When the source addresses are consecutive, all W data can be read in a single memory read.
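The contiguity condition can be checked with a small sketch. It assumes addresses are counted in units of one element and that the matrix is stored row by row; the function names and the example base address are hypothetical.

```python
def element_address(base: int, row: int, col: int, row_interval: int) -> int:
    """Address of element (row, col), counted in elements from the base address."""
    return base + row * row_interval + col


def is_contiguous(width: int, row_interval: int) -> bool:
    """Source addresses are consecutive when the width equals the row interval number."""
    return width == row_interval


BASE = 1000  # arbitrary example address

# Width 5, row interval 5: the last element of row 0 is followed directly by row 1.
print(is_contiguous(5, 5))                                            # True
print(element_address(BASE, 0, 4, 5), element_address(BASE, 1, 0, 5))  # 1004 1005

# Width 5, row interval 8: there is a gap of 3 addresses between rows.
print(is_contiguous(5, 8))                                            # False
print(element_address(BASE, 0, 4, 8), element_address(BASE, 1, 0, 8))  # 1004 1008
```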

Step S303: and periodically reading W first elements in the first matrix, and writing the W first elements into a calculation unit of the matrix processor for calculation until the elements in the first matrix are read out. In this step, W elements of the source matrix are read from the memory according to the start address of the first matrix, and written into the calculation unit of the matrix processor for calculation.

Before the W first elements in the first matrix are periodically read, it is confirmed, in response to the width and the row interval number of the first matrix being equal, that the source addresses of the first matrix are consecutive, so that all W data can be read in a single memory read.

In one embodiment, when the W first elements are sent to the calculation unit of the matrix processor for calculation, in response to W=K being set, the W first elements read from the first matrix are clipped in data order to obtain W₁ elements, i.e., the W₁ elements are the first W₁ of the W first elements, wherein W₁ = β×N ≤ W, β > 1, and β is an integer; the W₁ elements are then sent to the calculation unit of the matrix processor for calculation. For example, if W is preconfigured and fixed at the number of calculation units regardless of the instruction type and the width of the source matrix, then, because the position correspondence of the data to be calculated must be considered, the W first elements cannot all be calculated at once; the number W₁ of elements calculated each time is determined when the data is written into the calculation units, where W₁ is an integer multiple of the width of the source matrix.

For example, the number of calculation units of the matrix processor is 16 and the first matrix has 7 rows and 5 columns, i.e., K=16, M=7, N=5. W=16 is set before the W first elements in the first matrix are read into the first register unit of the matrix processor, so that 16 first elements are read into the first register. Before the W first elements are sent to the calculation unit of the matrix processor for calculation, W₁ = β×N is set; for example, with β=3 and N=5, W₁=15, so the first 15 of the 16 elements read are clipped out and sent to the calculation unit for calculation. For some operation types, such as .m, making the number of elements calculated each time an integer multiple of N facilitates output of the final result.
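The clipping step of this example can be written as a short sketch. The first register is modelled as a Python list, and the function name `clip_to_whole_rows` is hypothetical.

```python
def clip_to_whole_rows(read_elements, n: int, beta: int):
    """Keep the first W1 = beta * N of the W elements read, in data order."""
    w1 = beta * n
    if beta <= 1:
        raise ValueError("beta must be an integer greater than 1")
    if w1 > len(read_elements):
        raise ValueError("W1 = beta * N must not exceed the W elements read")
    return list(read_elements[:w1])


# K = 16 calculation units, first matrix of width N = 5, beta = 3.
register = list(range(16))            # the W = 16 elements read into the first register
to_compute = clip_to_whole_rows(register, n=5, beta=3)
print(len(to_compute))                # 15
print(to_compute[-1])                 # 14, so the 16th element read is not used this cycle
```

Because W₁ is a whole number of rows, element i of each group always falls in column i mod N, which is what keeps the columns aligned between cycles.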

In one embodiment, a read address of the first matrix is determined, where the read address is addr+β*N*(C-1), addr is the address of the first element of the first matrix (i.e., the base address of the source matrix), and C is the ordinal number of the current read of the first matrix; the data of the first matrix can then be read according to the read address. The first-element address addr of the first matrix can be preconfigured through register configuration before calculation, so that it can be determined directly during data reading.

For example, with β=3 and N=5, the first read of the first matrix has C=1, so the read address is addr; the second read has C=2, so the read address is addr+15.

In one embodiment, on the last read of the first matrix, if the number of unread first elements in the first matrix is less than β*N, the unread first elements are read into the result register.

For example, for a column accumulation calculation, the number of computing units of the matrix processor is 16, and the first matrix has 7 rows and 5 columns, i.e., K=16, M=7, N=5, with β=3. On the third read of the first matrix, the number of unread first elements is 5, which is less than 15, so these 5 first elements are read into the result register. Since the 5 elements cannot be aligned with 15 elements, they are read into the result register to participate in subsequent operations.
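The clipping rule and the read-address rule above can be sketched together as a read schedule. This is only an illustrative sketch: the function name `read_schedule` is hypothetical, and choosing β as the largest number of whole rows that fit into the K computing units is an assumption (the text only requires β>1 and β*N≤W).

```python
def read_schedule(addr, m, n, k):
    """Return (beta, w1, [(read_address, element_count), ...]) for an m x n matrix.

    Assumes the matrix is stored contiguously starting at `addr`.
    """
    beta = k // n              # assumption: as many whole rows as fit in k units
    if beta < 2:
        raise ValueError("the method requires beta > 1, i.e. k >= 2 * n")
    w1 = beta * n              # W1 = beta * N elements kept after clipping
    total = m * n
    schedule = []
    c = 1                      # C: 1-based read counter
    while total - (c - 1) * w1 >= w1:
        schedule.append((addr + w1 * (c - 1), w1))   # addr + beta*N*(C-1)
        c += 1
    remaining = total - (c - 1) * w1
    if remaining:
        # Tail shorter than beta*N: read as-is (into the result register
        # in the column accumulation case described above).
        schedule.append((addr + w1 * (c - 1), remaining))
    return beta, w1, schedule


print(read_schedule(addr=0, m=7, n=5, k=16))
# (3, 15, [(0, 15), (15, 15), (30, 5)])
```

With K=16, M=7, N=5 the sketch reproduces the reads at addr, addr+15 and addr+30 from the examples, the last one covering the 5 remaining elements.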

Fig. 3 is a schematic diagram of a matrix performing a column accumulation operation, and Fig. 4 is a schematic diagram of an intermediate result of a column accumulation operation in the matrix processor-based data processing method according to an embodiment of the present application. For convenience of description, in this embodiment the operation instruction acc.mdim 0 represents the column accumulation operation of a matrix, where the first matrix is M1 and the resulting destination matrix is V.

For the column accumulation of a matrix, the elements of each column must be accumulated over all rows. When the operation instruction acc.mdim 0 is implemented according to the embodiment of the application, several rows of data are read and calculated at a time, which involves folding the intermediate results: intermediate results wider than the matrix are split into several groups of data, each group containing as many intermediate results as the matrix is wide. As shown in Fig. 4, in the last or second-to-last calculation, assuming the result register unit stores the intermediate results c11, c12 … c1N, c21, c22 … c2N, 2N elements in total, these are split into c11, c12 … c1N and c21, c22 … c2N, and the two groups of data are written into N computing units to calculate the destination matrix V.

Specifically: the result register contains u*N elements, and L is set to the integer part of u/2. If u is even, the elements in the result register are divided into two groups in storage order, each containing L*N elements; the two groups are sent to the computing units for calculation, and the L*N elements of the calculation result are output to the result register unit. If u is odd, the elements are divided into three groups in storage order, where the first and second groups each contain L*N elements and the third group contains N elements; the first and second groups are sent to the computing units for calculation, the L*N elements of the calculation result are output to the result register unit, and the result register unit then holds the L*N result elements and the N elements of the third group. If u is even, u is set to L; if u is odd, u is set to L+1. The above steps are repeated until u=1.

For example, the acc.mdim 0 calculation is performed on the first matrix, the number of computing units of the matrix processor is 16, and the first matrix has 7 rows and 5 columns, i.e., K=16, M=7, N=5, with β=3. After the second read of first-matrix elements and the calculation by the computing units, there are 15 elements in the result register; the third read then brings 5 elements from the first matrix into the result register. The result register now holds 20 elements, i.e., 4*5 elements, so u=4 and L=2. Since u is even, the elements in the result register are divided into two groups in storage order, the first 10 elements and the last 10 elements; the two groups are sent to the computing units for addition, after which the result register holds 10 elements. Since u is even and L=2, u=2 is set. The steps are repeated: the 10 elements are divided into two groups of 5 elements each, with L=1; the two groups are sent to the computing units for calculation, and the 5 elements of the calculation result are output to the result register unit. Since u is even and L=1, u=1 is set and the calculation is finished. The 5 elements in the result register unit are the result of the acc.mdim 0 calculation on the first matrix.

For example, the acc.mdim 0 calculation is performed on the first matrix, the number of computing units of the matrix processor is 16, and the first matrix has 8 rows and 5 columns, i.e., K=16, M=8, N=5, with β=3. After the second read of first-matrix elements and the calculation by the computing units, there are 15 elements in the result register; the third read then brings 10 elements from the first matrix into the result register. The result register now holds 25 elements, i.e., 5*5 elements, so u=5 and L=2. Since u is odd, the elements in the result register are divided into three groups in storage order, where the first and second groups each contain 10 elements and the third group contains 5 elements; the first and second groups are sent to the computing units for calculation, the 10 elements of the calculation result are output to the result register unit, and the result register unit then holds the 10 result elements and the 5 elements of the third group. Since u=5 is odd, u is set to 3, and the above steps are repeated until u=1.
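The folding procedure and the two examples above can be traced with a short sketch. The function name `fold` and the returned trace of u values are illustrative assumptions; the grouping follows the u/L rule as described, with addition as the operation.

```python
def fold(result_register, n):
    """Fold u*n intermediate results down to n column sums.

    Returns the final n elements and the sequence of u values visited.
    """
    regs = list(result_register)
    if n <= 0 or len(regs) % n != 0:
        raise ValueError("register must hold a whole number of rows")
    u = len(regs) // n
    trace = [u]
    while u > 1:
        half = u // 2                          # L: integer part of u / 2
        first = regs[:half * n]                # first group, L*N elements
        second = regs[half * n:2 * half * n]   # second group, L*N elements
        rest = regs[2 * half * n:]             # third group (N elements) only when u is odd
        regs = [a + b for a, b in zip(first, second)] + rest
        u = half if u % 2 == 0 else half + 1
        trace.append(u)
    return regs, trace


print(fold(range(20), 5))   # u goes 4 -> 2 -> 1
print(fold(range(25), 5))   # u goes 5 -> 3 -> 2 -> 1
```

Because every group boundary is a multiple of N, each addition pairs elements of the same column, which is why the final N elements are the column sums of the rows held in the register.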

The implementation of the acc.mdim 0 operation instruction specifically includes:

First step: from the configured register parameters, the first matrix M1 and the destination matrix V are each determined to be address-continuous. Assuming the number of computing units is 16 and M1 has width 5 and height 7, 16 data (i.e., first elements) are read starting at the base address of M1.

Second step: in the implementation of the acc.mdim 0 operation instruction, each calculation by the 16 computing units of the matrix processor operates on whole rows of the first matrix M1. The 16 data read from M1 are therefore clipped so that the width of the clipped data is 3 times the matrix width, which is the multiple that makes maximal use of the computing units of the matrix processor. Since this is the first read and there is no other data to add, the 15 data can be placed temporarily in the first register. Alternatively, the 15 clipped data can be written temporarily into the result register as an intermediate result. Then, in each subsequent cycle, 16 data of the first matrix M1 are read and sent to the computing units, the 15 data are read from the result register and sent to the computing units, the computing units add the data they receive, and the 15 calculation results are written back into the result register in order, so that the next cycle can calculate with these 15 results and the newly read data of M1.

Third step: the base address of M1 plus 3 times the matrix width is calculated to determine the start address of the second read (i.e., the address of the first-column data of the fourth row of M1), and 16 data of M1 are read from that start address.

Fourth step: the 16 data read from M1 are clipped so that the width of the clipped data is 3 times the matrix width. The two sets of clipped data are written into 15 computing units, each computing unit adds the two data written to it, and the intermediate result is written into the result register.

Fifth step: the base address of M1 plus 2*3 times the matrix width is calculated to determine the start address of the third read (i.e., the address of the first-column data of the seventh row of M1), and the 5 remaining data of M1 are read from that start address.

Sixth step: the intermediate result in the result register is folded into 10 data and 5 data; the 5 folded-off data and the 5 data from the third read together form another 10 data; the two groups of 10 data are written into 10 computing units for calculation, and the intermediate result is written into the result register.

Seventh step: the intermediate result in the result register is folded again into 5 data and 5 data, which are written into 5 computing units for calculation; the calculation result is written to the destination address, completing the calculation.

Similarly to the calculation of acc.mdim 0, when the column maximum max.mdim 0 and the column minimum min.mdim 0 of the elements of the first matrix are calculated, W first elements are likewise sent to the computing units of the matrix processor. In each calculation, the result of the previous comparison, i.e., the intermediate result, is compared with the data read this time, and the new intermediate result is stored in the result register.

Illustratively, the column maximum max.mdim 0 of the elements of the first matrix means extracting the maximum of each column of elements to form the vector V1, V2 … VN. Taking 2N elements per pass as an example, W is set to 2N. First, 2N first elements of the first matrix are read into the first register; the data from this first read is copied directly into the third memory as the intermediate result, without any maximum calculation, and is denoted, for example, c11, c12 … c1N, c21, c22 … c2N. On the second pass, 2N first elements of the first matrix are read into the first register, a maximum calculation is performed against the intermediate results c11, c12 … c1N, c21, c22 … c2N stored in the result register, and the results are assigned back to c11, c12 … c1N and c21, c22 … c2N in the result register; this continues until all elements of the first matrix have been read. In the last step, c11, c12 … c1N and c21, c22 … c2N are folded and a maximum calculation is performed on them to obtain the vector V1, V2 … VN. The column minimum min.mdim 0 of the elements of the first matrix is calculated in the same way.
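Since accumulation, maximum and minimum differ only in the element-wise operation, the whole flow (read several whole rows, combine with the intermediate result, append a short tail, fold) can be sketched once with the operation as a parameter. The name `column_reduce` is hypothetical, and reading as many whole rows as fit into K units is an assumption of this sketch rather than a requirement of the text.

```python
import operator


def column_reduce(matrix, k, op):
    """Reduce each column of `matrix` with `op`, several whole rows per cycle."""
    n = len(matrix[0])
    w1 = (k // n) * n                     # assumption: whole rows that fit in k units
    if w1 == 0:
        raise ValueError("need at least n computing units")
    flat = [x for row in matrix for x in row]   # contiguous storage
    result = flat[:w1]                    # first read is copied as the intermediate result
    pos = w1
    while len(flat) - pos >= w1:          # full reads: combine element-wise
        chunk = flat[pos:pos + w1]
        result = [op(a, b) for a, b in zip(result, chunk)]
        pos += w1
    result += flat[pos:]                  # tail shorter than w1 joins the result register
    while len(result) > n:                # fold until one row is left
        half = (len(result) // n) // 2 * n
        folded = [op(a, b) for a, b in zip(result[:half], result[half:2 * half])]
        result = folded + result[2 * half:]
    return result


m1 = [[(r * 7 + c * 3) % 11 for c in range(5)] for r in range(7)]
print(column_reduce(m1, 16, operator.add))   # column sums
print(column_reduce(m1, 16, max))            # column maxima
print(column_reduce(m1, 16, min))            # column minima
```

The sketch relies on the operation being associative and commutative, which holds for addition on integers, max and min.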

FIG. 5 is a schematic illustration of a double-matrix dot product operation, i.e., a_ij is multiplied by b_ij to obtain c_ij. As shown in Fig. 5, in this embodiment the operation instruction mul.mm represents the double-matrix dot product operation, where the first matrix is M1, the second matrix is M2, and the destination matrix is M3.

The implementation of the mul.mm operation instruction specifically includes:

First step: a mul.mm instruction is received. The instruction comprises 32 bits, in the following format:

Bits | 31:26 | 25 | 24:20 | 19:15 | 14 | 13:12 | 11:7 | 6:0
Value | 000001 | 6 | rs2 | rs1 | 0 | 0 | rd | 1111011

wherein 0:6 bits represent that the instruction is a custom instruction for RISC-V;

rd is the base address of the destination matrix M3;

rs1 is the base address of the first matrix M1;

rs2 is the base address of the second matrix M2;

bits 26:31 indicate that the instruction is a mul instruction, i.e., each compute unit performs a multiplication operation;

Bits 25, 14, and 13:12 together determine that the current instruction is of type mm, i.e., the instruction type is a double-matrix operation.
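The field layout described above can be illustrated with a small pack/unpack sketch. The helper names `encode` and `decode` and the dictionary keys are hypothetical; the type-selector bits 25, 14 and 13:12 are kept as opaque parameters because the sketch does not assume their encoding, and rd, rs1 and rs2 are treated simply as 5-bit fields.

```python
OPCODE_CUSTOM = 0b1111011   # bits 6:0, as given in the instruction format
FUNCT_MUL = 0b000001        # bits 31:26, as given in the instruction format


def encode(rd, rs1, rs2, bit25=0, bit14=0, bits13_12=0, funct=FUNCT_MUL):
    """Pack the fields into a 32-bit word (field widths 6/1/5/5/1/2/5/7)."""
    return ((funct & 0x3F) << 26) | ((bit25 & 0x1) << 25) \
        | ((rs2 & 0x1F) << 20) | ((rs1 & 0x1F) << 15) \
        | ((bit14 & 0x1) << 14) | ((bits13_12 & 0x3) << 12) \
        | ((rd & 0x1F) << 7) | OPCODE_CUSTOM


def decode(word):
    """Split a 32-bit word back into its fields."""
    return {
        "funct": (word >> 26) & 0x3F,
        "bit25": (word >> 25) & 0x1,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "bit14": (word >> 14) & 0x1,
        "bits13_12": (word >> 12) & 0x3,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


word = encode(rd=1, rs1=2, rs2=3)
print(f"{word:032b}")
print(decode(word))
```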

Second step: since the height, width, and row interval number of the M1 and M2 matrices and the row interval number of the M3 matrix have been set by csr instructions (register instructions), the addresses of M1, M2, and M3 can be determined to be continuous from the configured parameters. Assume the number of computing units is 16 and M1 and M2 each have width 5 and height 7. According to the received mul.mm instruction, 16 data of M1 are read starting at the base address of M1 (the data is not read in whole rows, i.e., the amount read is not an integer multiple of the row width), 16 data of M2 are read starting at the base address of M2 (likewise not in whole rows), the 16 data of M1 and the 16 data of M2 are written into the 16 computing units for calculation, and the calculation result is written to the destination address. In this embodiment, Y=W and Y=K are set, so that the utilization of the computing units of the matrix processor is 100%.

Third step: the base address of M1 and of M2 plus the number of computing units is calculated to determine the start address of the second read; 16 data of M1 and of M2 are read from that start address and written into the computing units for calculation, and the calculation result is written to the destination address.

Fourth step: the base address of M1 and of M2 plus 2 times the number of computing units is calculated to determine the start address of the third read; the 3 remaining data of M1 and of M2 are read from that start address and written into the computing units for calculation, and the calculation result is written to the destination address, completing the calculation.

Since mm is a dot product of two matrices of the same shape, reading data in non-whole rows does not affect the operation. The data may nevertheless be read in whole rows, e.g., with 16 computing units, 15 data (3 rows of data) of M1 and of M2 are read each time. Specifically: first, 15 data of M1 and of M2 are read starting at the base addresses of M1 and M2. Then, the base address of M1 and of M2 plus 3 times the matrix width is calculated to determine the start address of the second read (i.e., the address of the first-column data of the fourth row of M1 and M2), and 15 data of M1 and of M2 are read from that start address. Finally, the base address of M1 and of M2 plus 2*3 times the matrix width is calculated to determine the start address of the third read (i.e., the address of the first-column data of the seventh row of M1 and M2), and the 5 remaining data of M1 and of M2 are read from that start address.
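The steps above amount to an element-wise product over the flattened matrices, taken in chunks of a fixed size. The following is a minimal sketch under that reading; `mul_mm` is a hypothetical name, and `k` is the number of elements processed per cycle (16 for the non-whole-row variant, 15 for the whole-row variant).

```python
def mul_mm(m1, m2, k):
    """Element-wise product of two same-shaped matrices, k elements per cycle.

    Returns the destination matrix and the number of cycles used.
    """
    n = len(m1[0])
    a = [x for row in m1 for x in row]    # contiguous storage of M1
    b = [x for row in m2 for x in row]    # contiguous storage of M2
    out, cycles = [], 0
    for start in range(0, len(a), k):     # read offset = k * (read index)
        out.extend(x * y for x, y in zip(a[start:start + k], b[start:start + k]))
        cycles += 1
    m3 = [out[i:i + n] for i in range(0, len(out), n)]
    return m3, cycles


m1 = [[r * 5 + c for c in range(5)] for r in range(7)]
m2 = [[2] * 5 for _ in range(7)]
m3, cycles = mul_mm(m1, m2, 16)
print(cycles)   # reads of 16, 16 and 3 elements
```

For the 7-by-5 example both chunk sizes need three cycles; the whole-row variant simply leaves one computing unit idle in the first two.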

FIG. 6 (a) and FIG. 6 (b) are schematic diagrams of a two-matrix operation before and after optimization in a matrix processor-based data processing method according to an embodiment of the present application, where FIG. 6 (a) is the schematic diagram before optimization and FIG. 6 (b) is the schematic diagram after optimization.

Referring to Fig. 6 (a), there are 16 computing units (EU 0-EU 15) in the matrix processor, a first matrix of size 3*6, and a second matrix. In the prior art, only 3 computing units can be used per cycle, and a total of 6 cycles are required to perform all the calculations.

In an alternative embodiment, the first matrix is a multi-row, multi-column matrix, the second matrix is a multi-row, multi-column matrix, W is equal to K, and Y is equal to K. Referring to Fig. 6 (a), where K is 16, both W and Y take the value 16.

Referring to Fig. 6 (b), in the first cycle the first 16 data of the first matrix and of the second matrix are read into the first register and the second register respectively, and are sent to the 16 computing units for one calculation. In the second cycle, the remaining two data are calculated. The matrix operation can thus be completed in 2 cycles, which is 2/3 shorter than the original 6 cycles. The operation on the two matrices in this embodiment may be, for example, an addition or a Hadamard product.

For operations on more than one matrix, the first matrix and the second matrix may have different shapes. In that case, for example when the second matrix is a constant, the second matrix can be expanded: during data acquisition, Y second elements in the second matrix are read into the second register unit of the matrix processor, and in response to Y being smaller than W, the Y second elements are expanded into W second elements in the second register unit. The W first elements and the W second elements are then sent to the K computing units of the matrix processor for calculation, and the calculation results are output to the result register unit of the matrix processor.

FIG. 7 is a schematic diagram of the addition of a matrix and a single number, i.e., a_ij is added to f to obtain c_ij. As shown in Fig. 7, in this embodiment the operation instruction acc.mf represents the addition of a matrix and a single number, where the first matrix is M1, the second matrix is f, and the resulting destination matrix is M3.

The implementation of the acc.mf operation instruction specifically includes:

First step: an acc.mf operation instruction is received.

Second step: since the height, width, and row interval number of the M1 matrix and the row interval number of the M3 matrix have been set by csr instructions (register instructions), the addresses of M1 and M3 can be determined to be continuous from the configured parameters. Assuming the number of computing units is 16 and M1 has width 5 and height 7, 16 data (i.e., first elements) of M1 are read starting at the base address of M1.

Third step: f (i.e., the second element) is read only once and copied multiple times so that the width of the copied data equals the number K of computing units (in this embodiment, K equals the number of first elements read), and the copied data is stored in the second register. Because the copied data is held in the second register, f needs to be read only once and does not have to be read from outside in every calculation cycle, which improves the calculation speed.

Fourth step: the 16 data read from M1 and the 16 copies of f are written into the 16 computing units; each computing unit adds the data it receives and writes the calculation result to the destination address.

Fifth step: the base address of M1 plus the number of computing units is calculated to determine the start address of the second read; 16 data of M1 are read from that start address and written, together with the 16 copies of f, into the computing units for calculation, and the calculation result is written to the destination address.

Sixth step: the base address of M1 plus 2 times the number of computing units is calculated to determine the start address of the third read; the 3 remaining data of M1 are read from that start address and written, together with 3 copies of f, into the computing units for calculation, and the calculation result is written to the destination address. The destination matrix consists of all the calculation results written to the destination address, which completes the calculation.
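The acc.mf steps can be sketched as follows: the single number is read once, copied K times into a list standing in for the second register, and reused for every chunk of the first matrix. The name `acc_mf` and the list-based register are illustrative assumptions.

```python
def acc_mf(m1, f, k):
    """Add the single number f to every element of m1, k elements per cycle."""
    n = len(m1[0])
    flat = [x for row in m1 for x in row]     # contiguous storage of M1
    second_register = [f] * k                 # f is read once and copied k times
    out = []
    for start in range(0, len(flat), k):      # read offset = k * (read index)
        chunk = flat[start:start + k]
        # zip stops at the shorter input, so the last chunk (3 elements in
        # the 7x5 example) uses only as many copies of f as it needs.
        out.extend(x + y for x, y in zip(chunk, second_register))
    return [out[i:i + n] for i in range(0, len(out), n)]


m1 = [[r * 5 + c for c in range(5)] for r in range(7)]
print(acc_mf(m1, 100, 16)[0])   # first row with 100 added to each element
```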

In this embodiment, the addition of the first matrix and a single number is completed in a number of calculation cycles (calculations by the computing units) equal to the total number of first-matrix elements divided by W, i.e., (M*N)/W, rounded up when the division is not exact. Since W is greater than N, the number of calculation cycles is smaller than M, so the calculation efficiency is improved compared with the prior art. Illustratively, if W is 4 times N, the number of calculation cycles is M/4.
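As a quick arithmetic check of the cycle counts quoted in this description, the count can be computed as the element total divided by W, rounded up. The helper name `cycles` is hypothetical, and the round-up is inferred from the worked examples (for instance 35 elements with W=16 taking three reads).

```python
def cycles(m, n, w):
    """Number of calculation cycles for an m x n matrix with w elements per cycle."""
    return (m * n + w - 1) // w   # integer ceiling of (m * n) / w


print(cycles(7, 5, 16))   # 7x5 matrix, 16 elements per cycle
print(cycles(7, 5, 5))    # row-by-row baseline: one row per cycle
print(cycles(8, 5, 20))   # W = 4 * N gives M / 4 cycles
```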

FIG. 8 is a schematic diagram of a matrix and row vector addition operation, i.e., a_ij is added to V_j to obtain c_ij. As shown in Fig. 8, calculating the first matrix M1 with the vector V in the row direction means that the first row of M1 is operated on with the first row of V (V_1 to V_N), the second row of M1 is also operated on with the first row of V, and so on, until all rows of the first matrix M1 have been added to V. In this embodiment, the operation instruction acc.mv dim0 represents the addition of a matrix and a row vector, the row vector being a single-row matrix, where the first matrix is M1, the second matrix is V, and the resulting destination matrix is M3.

The implementation of the acc.mv dim0 operation instruction specifically includes:

First step: an acc.mv dim0 operation instruction is received.

Second step: the addresses of M1 and M3 are determined to be continuous. Assuming the number of computing units is 16, M1 has width 5 and height 7, and V has width 5 and height 1, 16 data of M1 are read starting at the base address of M1.

Third step: the 16 data read from M1 are clipped so that the width of the clipped data is 3 times the matrix width (i.e., 3 whole rows of data are calculated). If the number W of elements read is set to 15, i.e., W is an integer multiple of the width of the source matrix, the read data need not be clipped.

Fourth step: 16 data are read starting at the base address of V and clipped to obtain the 5 data of V; if the 5 data of V are read directly, no clipping is needed. The clipped data is copied so that the whole row V appears 3 times, making the width of the copied data 3 times the matrix width. The processed data is stored in the second register, so the data of V need not be read from memory again when later rows are calculated.

Fifth step: the 15 clipped data of M1 are written into the computing units (one computing unit is idle at this point), the 15 copied data of V are written into the computing units for addition, and the calculation result is written to the destination address.

Sixth step: the base address of M1 plus 3 times the matrix width is calculated to determine the start address of the second read (i.e., the address of the first-column data of the fourth row of M1), and 16 data of M1 are read from that start address.

Seventh step: the 16 data read from M1 are clipped again so that the width of the clipped data is 3 times the matrix width.

Eighth step: the 15 newly clipped data of M1 and the 15 copied data of V held in the register are written into 15 computing units for calculation, and the calculation result is written to the destination address.

Ninth step: the base address of M1 plus 2*3 times the matrix width is calculated to determine the start address of the third read (i.e., the address of the first-column data of the seventh row of M1), and the 5 remaining data of M1 are read from that start address.

Tenth step: the 5 data from the third read and the 5 data of V are written into the computing units for calculation, and the calculation result is written to the destination address, completing the calculation.
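The row-vector case can be sketched in the same style: V is copied so that it covers several whole rows, and each clipped chunk of M1 is added to that copy. The name `acc_mv_dim0` is hypothetical, and taking β as the largest number of whole rows that fit into K units is an assumption of the sketch.

```python
def acc_mv_dim0(m1, v, k):
    """Add the row vector v to every row of m1, several whole rows per cycle."""
    n = len(v)
    beta = k // n                          # assumption: whole rows per cycle
    if beta < 1:
        raise ValueError("need at least n computing units")
    w1 = beta * n                          # clipped width, a multiple of the row width
    flat = [x for row in m1 for x in row]  # contiguous storage of M1
    second_register = list(v) * beta       # V copied beta times, read from memory once
    out = []
    for start in range(0, len(flat), w1):  # read offset = beta * N * (read index)
        chunk = flat[start:start + w1]
        # The final chunk may hold fewer rows; zip then uses only the
        # leading copies of V, which stay aligned with the columns.
        out.extend(x + y for x, y in zip(chunk, second_register))
    return [out[i:i + n] for i in range(0, len(out), n)]


m1 = [[r * 5 + c for c in range(5)] for r in range(7)]
v = [10, 20, 30, 40, 50]
print(acc_mv_dim0(m1, v, 16)[6])   # last row of M1 plus V
```

Keeping the chunk width a multiple of the row width is what lets one fixed copy of V serve every cycle; with a non-aligned width such as 16, the copy would have to be rotated between cycles.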

FIG. 9 (a) schematically shows the operation of a multi-row, multi-column matrix with a single-row matrix in the related art, in which the first matrix is operated on with the second matrix row by row. The matrix processor has 16 computing units, the first matrix has size 3*6, and the second matrix has size 1*3. In the prior art, only 3 computing units can be used in each calculation cycle, and a total of 6 cycles are required to perform all the calculations.

FIG. 9 (b) schematically shows the operation of a multi-row, multi-column matrix with a single-row matrix after optimization. In one embodiment, the first matrix is a multi-row, multi-column matrix and the second matrix is a single-row matrix, and Y is the number of elements of the second matrix, here 3. When the elements of the second matrix are read, the Y second elements are read from outside once, on the first read. When a calculation-type instruction is implemented, W is set to an integer multiple of Y, which makes the calculation more convenient and facilitates the output of the final result. Expanding the Y second elements into W second elements means copying the Y second elements into W second elements in the second register unit. Referring to Fig. 9 (a) and Fig. 9 (b), Y=3 and W=6 here. In each cycle, 6 data of the first matrix, i.e., two rows, are operated on with the second elements of the second matrix, so that 6 computing units are used per cycle and the calculation is completed in 3 cycles, which is 1/2 shorter than the original 6 cycles.

It should be noted that W may be any integer multiple of Y, provided that W does not exceed the number of computing units of the matrix processor. For example, when W=9, the operation is completed in only 2 calculation cycles, which is 2/3 shorter than the original 6 cycles.

Fig. 9 (c) schematically shows the intermediate operation process after optimization of the multi-row, multi-column matrix and the single-row matrix. Here W is twice Y, so V1, V2 … VN is duplicated once. W is also twice the number N of columns of the first matrix, i.e., twice the number of elements in each row, and the W first elements are the first-row and second-row elements of the first matrix, i.e., a11, a12 … a1N and a21, a22 … a2N. Since W is twice N, the number of cycles needed to complete the operation is 1/2 of the original.

FIG. 10 is a schematic diagram of a matrix and column vector addition operation, i.e., a_ij is added to V_i to obtain c_ij. As shown in Fig. 10, adding the first matrix M1 and the vector V in the column direction means that all data in the first row of M1 are operated on with V_1, all data in the second row of M1 are operated on with V_2, and so on, until every row of the first matrix M1 has been added to its corresponding element of V. In this embodiment, the operation instruction acc.mv dim1 represents the addition of a matrix and a column vector, the column vector being a single-column matrix, where the first matrix is M1, the second matrix is V, and the resulting destination matrix is M3.

The implementation of the acc.mv dim1 operation instruction specifically includes:

First step: an acc.mv dim1 operation instruction is received.

Second step: the addresses of M1, V, and M3 are determined to be continuous. Assuming the number of computing units is 16, M1 has width 5 and height 7, and V has width 1 and height 7, 16 data of M1 are read starting at the start address of M1.

Third step: the 16 data read from M1 are clipped so that the width of the clipped data is 3 times the matrix width (i.e., 3 whole rows of data are calculated).

Fourth step: 16 data are read starting at the base address of V and clipped to obtain the 7 data of V; if the 7 data of V are read directly, no clipping is needed. The clipped data is copied (each of v1 to v7 is copied in turn into 5 copies). The processed data is stored in a register, so the data need not be read from memory again when later rows are calculated.

Fifth step: the 15 clipped data of M1 and the 15 copied data of v1, v2, and v3 are written into 15 computing units for calculation, and the calculation result is written to the destination address.

Seventh step: the 15 newly clipped data of M1 and the 15 copied data of v4, v5, and v6 held in the register are written into 15 computing units for calculation, and the calculation result is written to the destination address.

Eighth step: the base address of M1 plus 2*3 times the matrix width is calculated to determine the start address of the third read (i.e., the address of the first data of the seventh row of M1), and the 5 remaining data of M1 are read from that start address.

Ninth step: the 5 data from the third read and the 5 copied data of v7 are written into the computing units for calculation, and the calculation result is written to the destination address, completing the calculation.

In one embodiment, the first matrix is a multi-row, multi-column matrix and the second matrix is a single-column matrix; Y is the number of elements of the second matrix, and W is R times N, where R is an integer. Expanding the Y second elements into W second elements means that, in the second register unit, each of the Y second elements is duplicated N-1 times to form an intermediate matrix with Y*N elements, and R rows of second elements are taken from the intermediate matrix to form the W second elements. Referring to Fig. 10, here R=2, and V1 and V2 are each copied N-1 times in the second register, so that the second register holds N copies of V1 and N copies of V2. Two rows of second elements, i.e., N copies of V1 and N copies of V2, are taken from the intermediate matrix and operated on with the first 2 rows of elements a11, a12 … a1N, a21, a22 … a2N of the first matrix. Illustratively, an addition operation may be performed. Since W is twice N, the number of cycles needed to complete the operation is 1/2 of the original.
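The column-vector expansion can be sketched as building the Y*N intermediate matrix once and then taking R rows of it per cycle. The name `acc_mv_dim1` is hypothetical, and choosing R as the largest number of whole rows that fit into K units is an assumption of the sketch.

```python
def acc_mv_dim1(m1, v, k):
    """Add v[i] to every element of row i of m1, R whole rows per cycle."""
    n = len(m1[0])
    r = k // n                               # assumption: R whole rows per cycle
    if r < 1:
        raise ValueError("need at least n computing units")
    w = r * n                                # W = R * N
    flat = [x for row in m1 for x in row]    # contiguous storage of M1
    # Intermediate matrix: each v[i] appears n times (duplicated n - 1 times),
    # giving Y * N elements laid out row by row.
    intermediate = [vi for vi in v for _ in range(n)]
    out = []
    for start in range(0, len(flat), w):
        first = flat[start:start + w]            # R rows of the first matrix
        second = intermediate[start:start + w]   # the matching R rows of copies
        out.extend(x + y for x, y in zip(first, second))
    return [out[i:i + n] for i in range(0, len(out), n)]


m1 = [[r * 5 + c for c in range(5)] for r in range(7)]
v = [100, 200, 300, 400, 500, 600, 700]
print(acc_mv_dim1(m1, v, 16)[1])   # second row of M1 plus v[1]
```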

An embodiment of the present application provides a data processing device based on a matrix processor, including: a first reading module, configured to read W first elements in a first matrix and send the W first elements to the computing units of the matrix processor for calculation, where W is greater than the width N of the first matrix and less than or equal to the number K of computing units of the matrix processor, and to repeat these steps until the number of remaining elements in the first matrix is smaller than W; and a second reading module, configured to send the remaining elements to the computing units for calculation in response to the number of remaining elements in the first matrix being not zero.

An embodiment of the present application provides a data processing device based on a matrix processor, which comprises a first reading module, a second reading module and a data processing module, wherein the first reading module is configured to read W first elements in a first matrix, acquire W second elements corresponding to the W first elements in a second matrix, and send the W first elements and the W second elements to a computing unit of the matrix processor for calculation, W being greater than the width N of the first matrix and less than or equal to the number K of computing units of the matrix processor; and the second reading module is configured to send the remaining elements to the computing unit for calculation in response to the number of remaining elements in the first matrix being non-zero.

An embodiment of the present application provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements a matrix processor-based data processing method according to an embodiment of the present application.

An embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to perform a matrix processor-based data processing method according to the embodiments of the present application.

An embodiment of the present application provides a chip, where the chip includes a matrix processor and a data processing apparatus based on the matrix processor according to an embodiment of the present application.

In the embodiment of the application, the number of the computing units of the matrix processor is determined by hardware, and if the memory addresses of the matrix outside the processor are continuous, the data processing method in the application can be adopted, so that the utilization rate of the computing units of the matrix processor is improved, the cycle number adopted in operation is reduced, and the computing efficiency is improved.
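The effect on the cycle count can be illustrated with a small Python sketch. It models a matrix stored at contiguous addresses as a flat list and counts one cycle per pass through the computing units; the function name `process_in_chunks`, the `compute` callback, and the sizes used in the example are assumptions made for illustration, not part of the application.

```python
def process_in_chunks(elements, w, compute):
    """Feed a contiguously stored matrix to the computing units W elements at a time."""
    results = []
    cycles = 0
    pos = 0
    # Repeat full-width reads while at least W elements remain.
    while len(elements) - pos >= w:
        results.extend(compute(elements[pos:pos + w]))
        pos += w
        cycles += 1
    # If the remainder is not empty, send it in one final pass.
    if pos < len(elements):
        results.extend(compute(elements[pos:]))
        cycles += 1
    return results, cycles


def double(chunk):
    # Stand-in for whatever element-wise operation the hardware performs.
    return [2 * x for x in chunk]


# A 7 x 5 matrix stored contiguously, with K = 16 computing units assumed.
matrix = list(range(35))
_, row_cycles = process_in_chunks(matrix, 5, double)       # one row (N = 5) per cycle
out, wide_cycles = process_in_chunks(matrix, 15, double)   # W = 15 <= K, three rows per cycle
```

Under these assumptions the row-by-row schedule takes 7 passes, while reading W = 15 elements per pass takes 3 (two full reads and one remainder of 5 elements).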

It is to be understood that the above-described embodiments of the present application merely illustrate or explain the principles of the present application and in no way limit it. Accordingly, any modifications, equivalent substitutions, improvements, etc. made without departing from the spirit and scope of the present application are intended to be included within the scope of the present application. Furthermore, the appended claims are intended to cover all such changes and modifications that fall within the scope and boundary of the appended claims, or equivalents of such scope and boundary.

Claims

1. A method for matrix processor-based data processing, comprising:

reading W first elements in a first matrix, and sending the W first elements to a calculation unit of a matrix processor for calculation; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor;

repeating the above step until the number of remaining elements in the first matrix is smaller than W;

and in response to the number of remaining elements in the first matrix being non-zero, sending the remaining elements to the computing unit for calculation.

2. The method of claim 1, wherein said feeding the W first elements into the computation unit of the matrix processor for computation comprises:

in response to W being set equal to K, truncating the W first elements read from the first matrix in data order to obtain W₁ elements, wherein W₁ = β×N ≤ W, β > 1, and β is an integer;

and feeding the W₁ elements into the computation unit of the matrix processor for computation.

3. The method of claim 2, further comprising, prior to reading the W first elements in the first matrix:

determining a read address of the first matrix, wherein the read address is addr + β×N×(C-1), addr is the address of the first element of the first matrix, and C is the number of times the first matrix has been read.
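The truncation in claim 2 and the read address in claim 3 can be illustrated together with a short Python sketch. Several points are assumptions rather than statements from the application: the address expression is read as addr + β×N×(C-1), addresses are counted in elements, C starts at 1, and β is chosen as the largest integer satisfying β×N ≤ K (the claim only requires some integer β > 1). The function names are hypothetical.

```python
def truncated_width(k, n):
    """Return (beta, W1), where W1 = beta * N is the largest multiple of N not exceeding W = K."""
    beta = k // n
    if beta <= 1:
        raise ValueError("need K >= 2N so that beta > 1")
    return beta, beta * n


def read_address(addr, beta, n, c):
    """Address of the C-th read (C counted from 1), in element units (assumed)."""
    return addr + beta * n * (c - 1)


# K = 16 computing units, matrix width N = 5, first element at address 100.
beta, w1 = truncated_width(k=16, n=5)
addresses = [read_address(addr=100, beta=beta, n=5, c=c) for c in (1, 2, 3)]
```

With these values, each read covers three whole rows (W₁ = 15), so successive reads start 15 elements apart and every read begins on a row boundary.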

4. A method according to any one of claims 1-3, wherein the reading W first elements in the first matrix and feeding the W first elements into a computation unit of the matrix processor for computation comprises:

reading W first elements in the first matrix to a first register unit of the matrix processor;

reading out W first elements from the first register unit and feeding them into the calculation unit of the matrix processor, and reading out W calculation results from the result register unit of the matrix processor and feeding them into the calculation unit of the matrix processor, so that the calculation unit performs calculation and updates the calculation results to the result register unit; the result register unit is used for caching the calculation result calculated by the calculation unit each time.

5. The method of claim 4, wherein the reading W first elements from the first register unit and feeding them to the calculation unit of the matrix processor for calculation, and outputting the result of the calculation to the result register unit of the matrix processor, further comprises:

determining that u×N elements are contained in the result register unit in response to the number of remaining elements in the first matrix being zero;

setting L equal to the integer part of u/2;

if u is an even number, dividing the elements in the result register unit into two groups according to the storage order, wherein each group contains L×N elements, sending the two groups of elements into the calculation unit for calculation, and outputting the L×N elements of the calculation result to the result register unit;

if u is an odd number, dividing the elements in the result register unit into three groups according to the storage order, wherein the first group and the second group each comprise L×N elements and the third group comprises N elements, sending the first group and the second group of elements to the calculation unit for calculation, and outputting the L×N elements of the calculation result to the result register unit, so that the result register unit comprises the L×N elements of the calculation result and the N elements of the third group;

the above steps are repeated until u=1.
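The halving procedure of claim 5 can be sketched in Python as follows. The sketch assumes that the operation applied to the two groups is an element-wise accumulation such as addition, which the claim does not specify; the function name `fold_result_register` and the list standing in for the result register unit are likewise hypothetical.

```python
import operator


def fold_result_register(reg, n, op=operator.add):
    """Fold u*N cached results down to N elements by repeated halving."""
    reg = list(reg)
    u = len(reg) // n
    passes = 0
    while u > 1:
        half = u // 2                          # L = integer part of u/2
        first = reg[:half * n]                 # first group, L*N elements
        second = reg[half * n:2 * half * n]    # second group, L*N elements
        tail = reg[2 * half * n:]              # third group: N elements if u is odd, else empty
        # Combine the two groups element-wise and keep the untouched third group.
        reg = [op(a, b) for a, b in zip(first, second)] + tail
        u = len(reg) // n
        passes += 1
    return reg, passes


# u = 5 partial results of width N = 2, stored in order.
folded, passes = fold_result_register([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], n=2)
```

In this example u goes 5, 3, 2, 1, so three passes through the calculation unit leave the column-wise totals in the first N positions.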

6. A method for matrix processor-based data processing, comprising:

reading W first elements in a first matrix, and acquiring W second elements corresponding to the W first elements in a second matrix; wherein, W is greater than the width N of the first matrix and is less than or equal to the number K of the computing units of the matrix processor;

sending the W first elements and the W second elements to a calculation unit of the matrix processor for calculation;

7. The method of claim 6, wherein the obtaining W second elements of the second matrix corresponding to the W first elements comprises:

8. The method according to claim 6 or 7, further comprising, prior to said reading W first elements in the first matrix:

receiving an operation instruction, and analyzing the received operation instruction to determine the operation type indicated by the operation instruction;

and determining the value of W according to the determined operation type, the width of the first matrix and the number of calculation units.

9. The method according to claim 6 or 7, further comprising, prior to said reading W first elements in the first matrix:

performing register parameter configuration according to a register instruction, wherein the register instruction at least comprises the width, the height and the row interval number of the first matrix;

in response to the width and the row spacing number of the first matrix being equal, confirming that the source addresses of the first matrix are consecutive.
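The contiguity condition of claim 9 can be illustrated with a minimal Python sketch. It interprets the "row interval number" as the distance, in elements, between the starts of consecutive rows; that interpretation, the element-unit addressing, and the function names are assumptions made for the example.

```python
def row_start(addr, row, row_stride):
    """Start address of a row, given the address of the first element (element units assumed)."""
    return addr + row * row_stride


def source_is_contiguous(width, row_stride):
    """Rows are back to back exactly when the row interval equals the row width."""
    return width == row_stride


# Width 5 with row interval 5: row 1 starts right after row 0 ends.
end_of_row0 = row_start(0, 0, 5) + 5
packed = source_is_contiguous(width=5, row_stride=5)
# Width 5 with row interval 8: three unused elements separate the rows.
padded = source_is_contiguous(width=5, row_stride=8)
```

Only in the first case can more than one row be fetched with a single read of consecutive addresses.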

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 5 or any one of claims 6 to 9.