CN114691087A - Data operation device and method, processing core and electronic equipment

Info

Publication number: CN114691087A
Application number: CN202011624174.7A
Authority: CN
Inventors: Not disclosed (the inventor requested not to be named)
Original assignee: Beijing Simm Computing Technology Co ltd
Current assignee: Beijing Simm Computing Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2022-07-01

Abstract

The invention discloses a data operation device and method, a processing core and an electronic device. The device comprises: a data reading module configured to receive an instruction and read a first matrix and a second matrix based on the instruction, wherein the instruction comprises a grouping parameter of a computing unit array, the grouping parameter is a parameter used for dividing the computing unit array into sub-computing arrays, and the grouping parameter is related to the rows of the first matrix or the columns of the second matrix; and the sub-computing arrays, which read data of the first matrix and the second matrix and perform the operation on the first matrix and the second matrix. The data operation device divides the computing unit array into a plurality of sub-computing arrays according to the grouping parameter in the instruction, so that the computing units of the array can be combined flexibly and used effectively, which improves the computing power of the data operation device.

Description

Data operation device and method, processing core and electronic equipment

Technical Field

The present invention relates to the field of processing core technologies, and in particular, to a data operation device and method, a processing core, and an electronic device.

Background

With the development of science and technology, the human society is rapidly entering the intelligent era. The important characteristics of the intelligent era are that people obtain more and more data, the quantity of the obtained data is larger and larger, and the requirement on the speed of processing the data is higher and higher.

Chips are the cornerstone of data processing, which fundamentally determines the ability of people to process data. From the application field, the chip mainly has two routes: one is a generic chip path, such as a CPU or the like, which offers great flexibility but is less computationally efficient in processing domain-specific algorithms; the other is a special chip route, such as TPU and the like, which can exert higher effective computing power in certain specific fields, but have poorer or even no processing capability in the more flexible and changeable and more general fields.

Because the data of the intelligent era is diverse and huge in quantity, a chip is required to be extremely flexible, so that it can process algorithms that differ across fields and change from day to day, and to have extremely high processing capacity, so that it can rapidly process extremely large and sharply increasing volumes of data.

Disclosure of Invention

(I) Objects of the invention

The invention aims to provide a data arithmetic device and a method, a processing core and electronic equipment, wherein the data arithmetic device divides a computing unit array into a plurality of sub-computing arrays according to grouping parameters in an instruction, so that the computing units of the computing unit array can be flexibly combined, and the computing units can be effectively utilized to improve the computing power of the data arithmetic device.

(II) technical scheme

To solve the above problem, a first aspect of the present invention provides a data operation device, comprising: a data reading module configured to receive an instruction and read a first matrix and a second matrix based on the instruction, wherein the instruction comprises a grouping parameter of a computing unit array, the grouping parameter is a parameter used for dividing the computing unit array into sub-computing arrays, and the grouping parameter is related to the rows of the first matrix or the columns of the second matrix; and the sub-computing arrays, which read data of the first matrix and the second matrix and perform the operation on the first matrix and the second matrix.

The data arithmetic device provided by the embodiment of the invention can divide the computing unit array into a plurality of sub-computing arrays according to the grouping parameters in the instruction, can realize flexible combination of the computing units of the computing unit array, and can effectively utilize the computing units to improve the computing power of the data arithmetic device.

Optionally, the grouping parameter is a parameter of a sub-computation array that divides the computation unit array based on rows of the first matrix, and the number of rows of the sub-computation array is the same as the number of rows of the first matrix; or the grouping parameter is a parameter of a sub calculation array into which the calculation unit array is divided based on columns of the second matrix, and the number of columns of the sub calculation array is the same as the number of columns of the second matrix.

Optionally, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to the rows of the first matrix; the sub-calculation array reads the first matrix column by column, and each column of calculation units of the sub-calculation array correspondingly reads a column of data of the first matrix; the sub-calculation array divides the second matrix into a plurality of second sub-matrixes by taking the column dimension of the sub-calculation array as a unit, and reads the corresponding second sub-matrixes row by row so that each row of calculation units of the sub-calculation array reads one row of data of the corresponding second sub-matrixes.

Optionally, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix; the sub-calculation array divides the first matrix into a plurality of first sub-matrixes by taking the row dimension of the sub-calculation array as a unit, and reads the corresponding first sub-matrixes column by column so that each column of calculation units of the sub-calculation array reads one column of data of the corresponding first sub-matrixes; and reading the second matrix by the sub-calculation array line by line, so that each line of calculation units of the sub-calculation array reads one line of data of the second matrix.

Optionally, each computing unit in the sub-computing array is configured to accumulate results of each operation to obtain an output matrix.

Optionally, the data reading module includes a plurality of storage areas; when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to the rows of the first matrix, the data reading module is configured to store the elements of the first matrix into one storage area and store the elements of the second matrix into a plurality of storage areas in a grouping manner based on the grouping parameter; or, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix, the data reading module is configured to store the elements of the second matrix into one storage area and store the elements of the first matrix into a plurality of storage areas in a grouping manner based on the grouping parameter.

Optionally, the data reading module is further configured to switch on the switch from the corresponding storage area to the sub-calculation unit arrays based on the grouping parameter, so that each sub-calculation unit array can extract the elements of the group corresponding to the current operation.

Optionally, the data reading module includes: a first data reading module comprising: the first control unit is used for receiving an instruction, extracting the first matrix according to the instruction and generating a first control signal based on the instruction; the first switch array is used for switching on the switches from the storage area corresponding to the first matrix to the sub-computing unit arrays based on the first control signal, so that each sub-computing unit array can extract the elements of the group corresponding to the current operation; a second data reading module comprising: the second control unit is used for receiving an instruction, extracting a second matrix according to the instruction and generating a second control signal based on the instruction; and the second switch array is used for switching on the switches from the storage area corresponding to the second matrix to the sub-computing unit arrays based on the second control signal, so that each sub-computing unit array can extract the elements of the group corresponding to the current operation.

Optionally, the instructions further comprise: the storage head address of the first matrix and the storage head address of the second matrix; the first control unit includes: a first storage unit; the first address generation unit generates an access address of a first matrix based on a head address of the first matrix, extracts the first matrix based on the access address of the first matrix, and stores the first matrix to the first storage unit according to the grouping parameter; the second control unit includes: a second storage unit; and the second address generating unit is used for generating an access address of the second matrix based on the first address of the second matrix, extracting the second matrix based on the access address of the second matrix and storing the second matrix to the second storage unit according to the grouping parameter.

According to a second aspect of the present invention, there is provided a processing core comprising one or more data operation devices as in the first aspect.

According to a third aspect of the invention, there is provided an electronic device comprising the processing core of the second aspect.

According to a fourth aspect of the invention, there is provided a chip comprising one or more processing cores as provided in the second aspect.

According to a fifth aspect of the present invention, there is provided a card including one or more chips as provided in the fourth aspect.

According to a sixth aspect of the invention, there is provided an electronic device comprising one or more chips as provided by the fifth aspect.

According to a seventh aspect of the present invention, there is provided a data operation method comprising: receiving an instruction; reading a first matrix and a second matrix based on the instruction, the instruction including grouping parameters of the computing unit array, the grouping parameters being parameters for dividing the computing unit array into sub-computing arrays, the grouping parameters being related to rows of the first matrix or columns of the second matrix; and the sub-calculation array reads the data of the first matrix and the second matrix and executes the operation of the first matrix and the second matrix.

According to an eighth aspect of the present invention, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the data operation method of the seventh aspect.

According to a ninth aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data operation method of the seventh aspect when executing the program.

According to a tenth aspect of the present invention, there is provided a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the data operation method of the seventh aspect.

(III) advantageous effects

The technical scheme of the invention has the following beneficial technical effects:

Drawings

FIG. 1(a) is a schematic diagram of a matrix operation;

FIG. 1(b) is a schematic diagram of a data operation device;

FIG. 2(a) is a schematic diagram of the data operation device shown in FIG. 1(b) executing a matrix operation;

FIG. 2(b) is a schematic diagram of a first step of performing a matrix operation by the data operation apparatus shown in FIG. 1 (b);

FIG. 2(c) is a diagram illustrating a second step of performing a matrix operation by the data operation apparatus shown in FIG. 1 (b);

FIG. 3 is a schematic structural diagram of a data operation device according to an embodiment of the present invention;

FIG. 4(a) is a schematic structural diagram of a first data reading module in the data operation device according to the embodiment of the present invention;

FIG. 4(b) is a schematic structural diagram of a second data reading module in the data operation device according to the embodiment of the present invention;

FIG. 5(a) is a schematic diagram of a matrix operation provided by an embodiment of the present invention;

FIG. 5(b) is a schematic diagram of a data operation device according to an embodiment of the present invention executing a matrix operation;

FIG. 5(c) is a schematic diagram of a data operation device according to an embodiment of the present invention executing a matrix operation;

FIG. 5(d) is a diagram illustrating a first step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention;

FIG. 5(e) is a diagram illustrating a second step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention;

FIG. 5(f) is a schematic diagram of a third step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention;

FIG. 5(g) is a diagram illustrating a fourth step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention;

FIG. 6 is a flowchart of a data operation method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

In neural network operations, matrix operations (including convolution operations, since convolution operations can be converted into matrix operations) account for a significant portion of the total amount of operations. The key point of improving the throughput of the neural network task, reducing the time delay and improving the effective computing power of a chip is to improve the speed of matrix operation.

In order to increase the speed of matrix operation, the computing unit array is usually used to perform matrix operation, so as to achieve high data multiplexing rate and increase the operation efficiency.

FIG. 1(a) is a schematic diagram of a matrix operation.

As shown in fig. 1(a), the first matrix M1 is a matrix with M rows and K columns, the second matrix M2 is a matrix with K rows and N columns, and the multiplication between M1 and M2 outputs an output matrix M with M rows and N columns.

The element C_i,n in row i and column n of the output matrix M is the sum of the products obtained by multiplying the elements of row i of M1 one by one with the corresponding elements of column n of M2.
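As a minimal sketch of this definition, the following Python fragment computes each output element as a sum of products. The function name `matmul` and the list-of-rows representation are illustrative assumptions and are not part of the disclosure.

```python
def matmul(m1, m2):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    rows, inner, cols = len(m1), len(m2), len(m2[0])
    if any(len(row) != inner for row in m1):
        raise ValueError("columns of M1 must equal rows of M2")
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for n in range(cols):
            # Element (i, n): row i of M1 times column n of M2, summed over k.
            out[i][n] = sum(m1[i][k] * m2[k][n] for k in range(inner))
    return out


m1 = [[1, 2], [3, 4]]
m2 = [[5, 6], [7, 8]]
result = matmul(m1, m2)
```

Here `result[0][0]` is 1*5 + 2*7, the product of row 0 of M1 with column 0 of M2.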

FIG. 1(b) is a schematic diagram of a data operation device.

As shown in FIG. 1(b), the device includes a computing unit array PU with M rows and N columns, made up of computing units PU_1,1 to PU_M,N.

M1, M2, and M are the register data buffers for the two input matrices and the output matrix, respectively. The computing unit array can make full use of the data: one element of M1 can be reused simultaneously by the N computing units in the same row, and one element of M2 can be reused simultaneously by the M computing units in the same column. That is, each computing unit handles the computation of one row of M1 with one column of M2 at a time. For example, the computing unit in the first row and first column performs the multiply-accumulate of the elements of the first row of M1 with the elements of the first column of M2.

FIG. 2(a) is a schematic diagram of the data operation device shown in FIG. 1(b) executing a matrix operation.

In the example shown in fig. 2(a), M1 is a 2x4 matrix, M2 is a 4x8 matrix, and the array of computing elements is a 4x4 array. M1 and M2 perform multiplication, which results in a 2x8 output matrix M.

FIG. 2(b) is a diagram illustrating a first step of performing a matrix operation by the data operation apparatus shown in FIG. 1 (b).

As shown in FIG. 2(b), since each row of the computing unit array has only 4 computing units while M2 is 4x8 and therefore has 8 columns, the whole computation is completed in two steps.

In the first step, the whole of M1 and the first four columns of M2 are taken and computed to obtain the first half of the output matrix M (the first four columns of both rows). Because the computing unit array has only 4 columns, only four columns of M2 can be handled in this step, and because M1 has only 2 rows, the computing units in the last 2 rows of the array perform no operation. Specifically, for example, computing unit PU_0,0 calculates the sum of the products of the four elements of row 0 of M1 multiplied one-to-one with the four elements of column 0 of M2, giving the element in row 0, column 0 of the output matrix M; computing unit PU_0,1 calculates the sum of the products of the four elements of row 0 of M1 multiplied one-to-one with the four elements of column 1 of M2, giving the element in row 0, column 1 of the output matrix M.

FIG. 2(c) is a diagram illustrating a second step of performing a matrix operation by the data operation apparatus shown in FIG. 1(b).

In the second step, the whole of M1 and the last four columns of M2 are computed to obtain the second half of the output matrix M (the last four columns of both rows). In this step, the first two rows of the computing unit array still compute, while the last two rows do not. For example, computing unit PU_0,0 calculates the sum of the products of the four elements of row 0 of M1 multiplied one-to-one with the four elements of column 4 of M2, giving the element in row 0, column 4 of the output matrix M; computing unit PU_0,1 calculates the sum of the products of the four elements of row 0 of M1 multiplied one-to-one with the four elements of column 5 of M2, giving the element in row 0, column 5 of the output matrix M.
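The cost of this two-step schedule can be estimated with a short calculation. The sketch below is illustrative only: the function name `conventional_schedule` and the utilization measure (busy unit-steps divided by available unit-steps) are assumptions introduced here, and it assumes that M1 has no more rows than the array, as in the example.

```python
def conventional_schedule(m1_rows, m2_cols, array_rows, array_cols):
    """Return (steps, utilization) for the ungrouped array of FIG. 1(b).

    Assumes m1_rows <= array_rows, so M1 never needs to be split by rows.
    """
    if m1_rows > array_rows:
        raise ValueError("this sketch only covers m1_rows <= array_rows")
    steps = -(-m2_cols // array_cols)        # ceiling division over M2 columns
    busy = m1_rows * m2_cols                 # one busy unit-step per output element
    available = steps * array_rows * array_cols
    return steps, busy / available


# The 2x4 by 4x8 example on a 4x4 array.
steps, utilization = conventional_schedule(2, 8, 4, 4)
```

For the example above this gives two steps with half of the computing units idle in each step.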

The data arithmetic device has the following defects:

(1) Once the circuit of the data operation device is designed, the size of the computing unit array is fixed. For matrix operations of certain sizes, for example when the number of rows of the computing unit array is 2 or more times the number of rows of the first matrix, or the number of columns of the computing unit array is 2 or more times the number of columns of the second matrix, the effective computing power of the computing unit array cannot be fully exerted, and the time spent on the matrix computation increases.

(2) Some data need to be taken out for many times, so that the power consumption is increased.

Fig. 3 is a schematic structural diagram of a data operation device according to an embodiment of the present invention.

As shown in fig. 3, the data operation unit EU includes: a data reading module, configured to receive an instruction, and read a first matrix and a second matrix based on the instruction, where the instruction includes a grouping parameter of a computing unit array, the grouping parameter is a parameter for dividing the computing unit array PUA into sub-computing arrays, and the grouping parameter is related to a row of the first matrix or a column of the second matrix;

the sub-computation array reads data of the first matrix and the second matrix and performs operations of the first matrix and the second matrix.

In some embodiments, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided with reference to a row of the first matrix or a column of the second matrix.

In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, and the number of rows of each sub-computing array is the same as the number of rows of the first matrix. For example, if the computing unit array is divided according to the number of rows of the first matrix, the information carried by the grouping parameter includes that the array is divided according to the rows of the first matrix and that it is divided into N groups. The specific number of groups may be the quotient of the number of rows of the computing unit array and the number of rows of the first matrix, taken as a positive integer.

In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, and the number of columns of each sub-computing array is the same as the number of columns of the second matrix. For example, if the computing unit array is divided according to the number of columns of the second matrix, the information carried by the grouping parameter includes that the array is divided according to the columns of the second matrix and that it is divided into M groups. The specific number of groups may be the quotient of the number of columns of the computing unit array and the number of columns of the second matrix, taken as a positive integer.
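The group count described above can be sketched as follows. This is a hedged illustration: the names `group_count` and `split_array` are hypothetical, and taking the quotient "as a positive integer" is interpreted here as floor division, which the text does not state explicitly.

```python
def group_count(array_dim, matrix_dim):
    """Number of sub-computing arrays along one dimension (floor of the quotient)."""
    if matrix_dim <= 0 or matrix_dim > array_dim:
        raise ValueError("matrix dimension must be in 1..array_dim")
    return array_dim // matrix_dim


def split_array(array_dim, matrix_dim):
    """Index ranges of the computing-unit rows (or columns) in each sub-array."""
    groups = group_count(array_dim, matrix_dim)
    return [range(g * matrix_dim, (g + 1) * matrix_dim) for g in range(groups)]


# A 4-row array divided by the rows of a 2-row first matrix.
row_groups = [list(r) for r in split_array(4, 2)]
```

The same helper applies to the column case by passing the number of array columns and the number of columns of the second matrix.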

In some embodiments, the computing unit array is divided into a plurality of sub-computing arrays according to the rows of the first matrix. In this case the first matrix is called the referenced matrix and "row" is called the referenced dimension, while the second matrix is called the non-referenced matrix and, for it, "column" is called the non-referenced dimension. Likewise, if the computing unit array is divided into a plurality of sub-computing arrays according to the columns of the second matrix, the second matrix is the referenced matrix and "column" is the referenced dimension, while for the first matrix "row" is the non-referenced dimension.

In this embodiment, the sub-calculation array successively reads data of a matrix that is not referenced according to the row dimension or the column dimension of the referenced matrix in the row of the first matrix or the column of the second matrix; the sub-calculation array divides the non-referenced matrix into a plurality of sub-matrices by taking the non-referenced dimension of the sub-calculation array as a unit, and the calculation unit array successively reads the elements of the corresponding non-referenced matrix according to the referenced dimension.

Specifically, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to the rows of the first matrix; the sub-calculation array reads the data of the first matrix column by column, and each column of calculation units of the sub-calculation array correspondingly reads a column of data of the first matrix; the sub-calculation array divides the second matrix into a plurality of second sub-matrixes by taking the column dimension of the sub-calculation array as a unit, and reads the corresponding second sub-matrixes row by row so that each row of calculation units of the sub-calculation array reads one row of data of the corresponding second sub-matrixes.

It is understood that, in the present embodiment, the number of rows of each sub-calculation array is the same as the number of rows of the first matrix, and the number of columns is the same as the number of columns of the non-divided calculation unit array.

In the present embodiment, when performing the multiplication of M1 and M2, each sub-computing array reads the elements of M1 column by column, so that in one computation every column of computing units of the sub-computing array reads the elements of the same column of M1; that is, each row of computing units of the sub-computing array reads the element of the corresponding row of M1. For example, when the first column of M1 is read, every column of computing units of the sub-computing array reads the first column of M1: the first row of computing units of each sub-computing array reads the first-row element of that column, and the last row of computing units of each sub-computing array reads the last-row element of that column.

In addition, the second matrix is divided into a plurality of second sub-matrices in units of the column dimension of each sub-computing array, and each sub-computing array reads the elements of its corresponding M2 sub-matrix row by row, with each row of computing units of the sub-computing array reading the elements of the current row of that second sub-matrix. That is, each column of the first sub-computing array reads the element of the corresponding column in the current row of M2, and the number of columns of M2 row elements read by the first sub-computing array each time equals the number of columns of computing units of the first sub-computing array. The other sub-computing arrays read their second sub-matrices in the same manner as the first sub-computing array, following the order of the second sub-matrices.

For example, the computing unit array is divided into 2 sub-computing arrays by the row number of M1, the number of columns of each sub-computing array is 5 columns which is the same as the number of columns of the computing unit array, the number of columns of M2 is 10 columns, then M2 is divided into a first sub-matrix and a second sub-matrix, the first sub-computing array extracts 5 columns of data of the first sub-matrix by rows each time, and the second sub-computing array extracts 5 columns of data of the second sub-matrix by rows each time.
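The column-wise partition of M2 in this example can be sketched as below. The helper name `split_columns` and the sample data are assumptions for illustration; the 3-row height of M2 is arbitrary, since the example only fixes the number of columns.

```python
def split_columns(m2, block_cols):
    """Split a matrix (list of rows) into sub-matrices of block_cols columns each."""
    n_cols = len(m2[0])
    return [
        [row[c:c + block_cols] for row in m2]
        for c in range(0, n_cols, block_cols)
    ]


# M2 with 10 columns; each sub-computing array has 5 columns.
m2 = [[10 * r + c for c in range(10)] for r in range(3)]
first_sub, second_sub = split_columns(m2, 5)
```

The first sub-computing array would then read `first_sub` one row at a time, and the second sub-computing array would read `second_sub` in the same way.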

In some embodiments, the grouping parameter is a parameter of a sub-calculation array that divides the calculation cell array according to columns of the second matrix;

the sub-calculation array divides the first matrix into a plurality of first sub-matrixes by taking the row dimension of the sub-calculation array as a unit, and reads the corresponding first sub-matrixes column by column so that each column of calculation units of the sub-calculation array reads one column of data of the corresponding first sub-matrixes; and reading the second matrix row by row, so that each row of calculation units of the sub-calculation array reads one row of data of the second matrix.

In some embodiments, the operations performed by M1 and M2 are multiply operations, and each compute unit in the sub-compute array is configured to accumulate the results of each operation to obtain an output matrix.
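A software sketch of this accumulation, for the case where the array is divided by the rows of M1, is given below. It is an illustration under stated assumptions, not the circuit itself: the function name `grouped_matmul` and the `passes` counter are hypothetical, the nested loops stand in for computing units that would operate in parallel, and M1 is assumed to have no more rows than the array.

```python
def grouped_matmul(m1, m2, array_rows, array_cols):
    """Multiply m1 by m2 on an array split into row groups; return (output, passes)."""
    m1_rows, inner, n_cols = len(m1), len(m2), len(m2[0])
    groups = array_rows // m1_rows           # number of sub-computing arrays
    if groups < 1:
        raise ValueError("this sketch assumes m1_rows <= array_rows")
    out = [[0] * n_cols for _ in range(m1_rows)]
    passes = 0
    # Each pass covers groups * array_cols columns of M2.
    for base in range(0, n_cols, groups * array_cols):
        passes += 1
        for k in range(inner):               # column k of M1, row k of each sub-matrix
            for g in range(groups):          # sub-array g works on its own M2 block
                start = base + g * array_cols
                stop = min(start + array_cols, n_cols)
                for i in range(m1_rows):
                    for j in range(start, stop):
                        # Each computing unit accumulates one partial product.
                        out[i][j] += m1[i][k] * m2[k][j]
    return out, passes


m1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
m2 = [[8 * r + c for c in range(8)] for r in range(4)]
result, passes = grouped_matmul(m1, m2, array_rows=4, array_cols=4)
```

With a 4x4 array and a 2x4 first matrix, the two sub-computing arrays cover all 8 columns of M2 in a single pass, compared with the two steps of the ungrouped array.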

In some embodiments, a data reading module includes a plurality of memory areas; when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to the rows of the first matrix, the data reading module is configured to store the elements of the first matrix into one storage area and store the elements of the second matrix into a plurality of storage areas in a grouping manner based on the grouping parameter; or, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix, the data reading module is configured to store the elements of the second matrix into one storage area and store the elements of the first matrix into a plurality of storage areas in a grouping manner based on the grouping parameter.

In some optional embodiments, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to a row of the first matrix, the data reading module divides the second matrix into a plurality of second sub-matrices according to columns of the calculation unit array, and stores the plurality of second sub-matrices into different storage areas respectively. Alternatively, a plurality of second sub-matrices may be stored in a continuous plurality of storage areas in the order of the columns of the second sub-matrices. When the grouping parameter is a parameter of a sub-calculation array which divides the calculation unit array according to the columns of the second matrix, the data reading module divides the first matrix into a plurality of first sub-matrixes according to the columns of the calculation unit array, and respectively stores the plurality of first sub-matrixes into different storage areas. Alternatively, the plurality of first sub-matrices may be stored in a continuous plurality of storage areas in an order of rows of the first sub-matrices.

In some embodiments, the data reading module is further configured to turn on switches from the corresponding storage areas to the sub-calculation unit arrays based on the grouping parameter, so that each sub-calculation unit array can extract elements of an array corresponding to the current operation.

In some embodiments, the data reading module includes a first data reading module LD_M1 and a second data reading module LD_M2. The first data reading module is used for reading the data of M1 from the external storage module according to the received instruction and storing the data of M1. The second data reading module is used for reading the data of M2 from the external storage module according to the received instruction and storing the data of M2.

The following describes the operation of the data operation device in detail, taking as an example the division of the computing unit array into a plurality of sub-computing arrays according to the number of rows of M1:

First, an instruction decoding unit ID outside the EU receives an instruction I, decodes it, and sends the decoded instruction to the LD_M1 module, the LD_M2 module and the PUA module in the EU, respectively. Specifically, the decoded instruction I includes a control signal and parameters. More specifically, the EU sends the control signals to the PUA and the parameters to LD_M1 and LD_M2, respectively. The parameters include the storage head address of M1, the size of the storage area occupied by M1, the storage head address of M2, the size of the storage area occupied by M2, the grouping parameter of the PUA, and the like.
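The routing of the decoded fields can be pictured with a small data structure. Everything named here is hypothetical: the text does not specify an instruction encoding, so the class `DecodedInstruction`, its field names, the string control signal and the `dispatch` function are assumptions made only to illustrate which module receives which field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedInstruction:
    """Hypothetical container for the fields listed in the description."""
    m1_head_address: int
    m1_size: int
    m2_head_address: int
    m2_size: int
    grouping_parameter: int
    control_signal: str


def dispatch(instr):
    """Route parameters to LD_M1 and LD_M2 and the control signal to the PUA."""
    return {
        "LD_M1": {
            "head_address": instr.m1_head_address,
            "size": instr.m1_size,
            "grouping": instr.grouping_parameter,
        },
        "LD_M2": {
            "head_address": instr.m2_head_address,
            "size": instr.m2_size,
            "grouping": instr.grouping_parameter,
        },
        "PUA": {"control": instr.control_signal},
    }


routed = dispatch(DecodedInstruction(0x1000, 8, 0x2000, 32, 2, "matmul"))
```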

Then, LD_M1 fetches the elements of matrix M1 from the external storage module and stores them in the first storage module DB according to parameters such as the storage head address of M1 and the size of the storage area occupied by M1. The first storage module DB consists of a plurality of storage areas. The number of storage areas of the first storage module is a positive integer X and the number of rows of the calculation unit array is a positive integer M, where M ≥ X ≥ 1. When X = 1, the calculation unit array cannot be divided according to columns; when X = M, the calculation units of the whole array can be combined in various ways and, in the most extreme case, divided into M groups. X, the number of storage areas of the first storage module, is the maximum number of row groups (Row Group) into which all rows of the calculation unit array can be equally divided; if X is fixed for a chip, the maximum number of row groups of the calculation unit array is also fixed. A Row Group is a group obtained by dividing the calculation unit array by rows, and each Row Group includes a plurality of rows of calculation units. If X is fixed, when the calculation unit array is recombined, the number of sub-calculation arrays obtained is M/X and each sub-calculation array includes X rows of calculation units; if the number of storage areas of the first storage module equals the number of rows of the calculation unit array, then by M/X the calculation unit array has 1 sub-calculation array. When the number of row groups is greater than 1, all row groups must contain the same number of rows of calculation units.
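The requirement that all row groups contain the same number of rows can be illustrated with a short Python sketch. The function row_groups and its arguments are hypothetical; it takes the desired number of row groups directly as an input and only checks the equal-size constraint.

```python
def row_groups(total_rows, num_groups):
    """Partition row indices 0..total_rows-1 into num_groups equal row groups."""
    if num_groups < 1 or num_groups > total_rows:
        raise ValueError("number of row groups must be between 1 and total_rows")
    if total_rows % num_groups != 0:
        raise ValueError("all row groups must contain the same number of rows")
    size = total_rows // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


# A 4-row calculation unit array split into two row groups of 2 rows each.
groups = row_groups(total_rows=4, num_groups=2)
```

An uneven request such as three row groups for four rows is rejected, since the groups could not all have the same number of rows.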

LD_M1 turns on the corresponding switches in switch array SM1 in units of row groups according to the parameters in the received instruction, so that each row of calculation units of the PUA reads the data of its corresponding row from the DB of LD_M1, and each data item is shared by all calculation units of the corresponding row in all row groups.

LD_M2 fetches the elements of matrix M2 from the external storage module and stores them in the second storage module DB according to parameters such as the storage head address of M2 and the size of the storage area occupied by M2. The second storage module DB also consists of a plurality of storage areas, X in number, and different storage areas can be accessed by different row groups.

LD_M2 turns on the corresponding switches in switch array SM2 according to the parameters in the received instruction, so that each column of calculation units of the PUA within the same row group reads the data of its corresponding column from the DB of LD_M2, and each data item is shared by all calculation units of that column in the row group. The columns of different row groups access data in different storage areas of the second storage module.

The PUA reads column data from the DB of LD_M1 and row data from the DB of LD_M2, and performs the calculation.

Fig. 4(a) is a schematic structural diagram of a first data reading module in the data computing device according to the embodiment of the present invention.

As shown in fig. 4(a), the first data reading module LD_M1 includes: a first control unit Ctrl, configured to receive an instruction, extract the first matrix M1 according to the instruction, and generate a first control signal based on the instruction; and a first switch array SM, configured to turn on the switches from the storage area corresponding to the first matrix to the sub-calculation unit arrays based on the first control signal, so that each sub-calculation unit array can read the elements of the group corresponding to the current operation.

In some embodiments, the instructions further comprise: a storage head address of the first matrix; the first control unit Ctrl includes: a first storage unit DB including a plurality of storage areas (2 storage areas, DB1 and DB2, respectively, in the illustration of fig. 4 (a)); and the first address generation unit AG generates an access address Addr1 of the first matrix based on the head address of the first matrix, extracts the first matrix based on Addr1, and stores the first matrix to the DB according to the grouping parameters.

In the present embodiment, LD_M1 works as follows:

Ctrl receives the decoded instruction I_D and sends the parameters for LD_M1 in the instruction to the modules inside LD_M1. For example, it sends parameters such as the storage head address of matrix M1, the size of the occupied storage area, and the access mode, which are needed for each calculation, to AG1, and sends the grouping parameters of the calculation unit array to CL1.

The address generation module AG1 generates the fetch address Addr1 of M1 and fetches all or part of the data of M1; the fetched data is temporarily stored in the buffer DB according to its storage address in the DB of LD_M1.

CL1 generates control signals for SM1 according to the grouping parameters of the calculation unit array and opens the channels from the DB to each group of the calculation unit array, so that the calculation unit array PUA directly obtains the M1 data of the current operation, namely the first data DO1_G1 read by the first sub-calculation array and the first data DO1_G2 read by the second sub-calculation array. The calculation units in the same row of each sub-calculation unit array read the same data, which reduces the number of times the same data of M1 is fetched.

Fig. 4(b) is a schematic structural diagram of a second data reading module in the data computing device according to the embodiment of the present invention.

As shown in fig. 4(b), the second data reading module LD_M2 includes: a second control unit Ctrl2, configured to receive an instruction, extract the second matrix according to the instruction, and generate a second control signal based on the instruction; and a second switch array, configured to turn on the switches from the storage area corresponding to the second matrix to the sub-calculation unit arrays based on the second control signal, so that each sub-calculation unit array can extract the elements of the group corresponding to the current operation.

In one embodiment, the instruction further comprises: a storage head address of the second matrix; the second control unit Ctrl2 includes: a second storage unit DB; and a second address generation unit AG2, configured to generate an access address Addr2 of the second matrix based on the head address of the second matrix, extract the second matrix based on the access address of the second matrix, and store the second matrix in the second storage unit according to the grouping parameters.

In this embodiment, LD_M2 works as follows:

Ctrl2 receives the decoded instruction I_D and sends the parameters for LD_M2 in the instruction to the modules inside LD_M2. For example, it sends parameters such as the storage head address of matrix M2, the size of the occupied storage area, and the access mode, which are needed for each calculation, to AG2, and sends the grouping parameters of the calculation unit array to CL2.

The address generation module AG2 generates the fetch address Addr2 of M2 and fetches all or part of the data of M2; the fetched data is temporarily stored in the buffer DB according to its storage address in the DB of LD_M2.

CL2 generates control signals for the switch array SM2 according to the grouping parameters of the calculation unit array and opens the channels from the DB of LD_M2 to each group of calculation units, so that the calculation unit array PUA directly obtains the correct data during operation. That is, the columns of calculation units in different row groups use different data as the second input data: for example, the second data DO2_G1 read by the first sub-calculation array uses the data in DB1 of LD_M2, and the second data DO2_G2 read by the second sub-calculation array uses the data in DB2. Specifically, the calculation units belonging to the same column within each sub-calculation array read the same data, while the calculation units of different sub-calculation arrays, which do not belong to the same column, read the data in different DBx, which reduces the number of times the same data of M2 is fetched.

It is understood that the number of storage areas of the storage modules of LD_M1 and LD_M2 may be the same or different, and the embodiment is not limited thereto.

In some embodiments, the grouping parameters may include two parameters, K and X. K indicates that each sub-calculation array imports K columns of the original input matrix M1, and correspondingly K rows of M2, at each import; X indicates the number of groups into which the calculation unit array is divided by rows. For example, X = 2 means that the calculation unit array is divided by rows into 2 sub-calculation arrays during the calculation.

In some embodiments, the grouping parameters may instead include two parameters, K and Y. K indicates that each sub-calculation array imports K columns of the original input matrix M1, and correspondingly K rows of M2, at each import; Y indicates the number of groups into which the calculation unit array is divided by columns. For example, Y = 2 means that the calculation unit array is divided by columns into 2 sub-calculation arrays during the calculation.
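One possible reading of the parameter K is that the shared dimension of the two matrices (the columns of M1, which are also the rows of M2) is imported K entries at a time. The Python sketch below lists the resulting passes under that assumption; the function import_passes and its argument names are hypothetical.

```python
def import_passes(shared_dim, k):
    """List the (start, stop) slices of the shared dimension imported per pass.

    Assumption: each pass covers K columns of M1 and the matching K rows of M2.
    """
    if k < 1:
        raise ValueError("K must be positive")
    return [(s, min(s + k, shared_dim)) for s in range(0, shared_dim, k)]


# A shared dimension of 4 imported 2 entries at a time needs two passes.
passes = import_passes(shared_dim=4, k=2)
```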

The data operation device provided in the above embodiment of the present invention is discussed in detail below with reference to a specific embodiment. In this embodiment, a 4x4 PUA is taken as an example to realize matrix multiplication of a 2x4 input matrix M1 and a 4x8 input matrix M2 to obtain a 2x8 output matrix.

Fig. 5(a) is a schematic diagram of a matrix operation according to an embodiment of the present invention.

As shown in fig. 5(a), M1 is a 2 × 4 matrix, and M2 is a 4 × 8 matrix, which are multiplied by each other to obtain a 2 × 8 output matrix M.
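As a plain reference for this operation, the Python sketch below multiplies a 2x4 matrix by a 4x8 matrix with a triple loop. The M1 values are the column entries quoted in the step-by-step description of figs. 5(d) to 5(g); the M2 values are placeholders, since only the shape of M2 is taken from the text, and the function name matmul is an assumption.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two row-major lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


# M1 columns as quoted in the step descriptions: (1,0), (0,2), (3,0), (0,4).
M1 = [[1, 0, 3, 0],
      [0, 2, 0, 4]]
# Placeholder M2: only the 4x8 shape comes from the text.
M2 = [[r + c for c in range(8)] for r in range(4)]
M = matmul(M1, M2)
```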

Fig. 5(b) is a schematic diagram of the data operation device according to the embodiment of the present invention performing a matrix operation, and fig. 5(c) is a schematic diagram of the data operation device according to the embodiment of the present invention performing a matrix operation.

Referring to fig. 5(b), the grouping parameter of the calculation unit array is obtained by dividing the calculation unit array according to the number of rows of M1, and includes the number of rows and the number of columns of each sub-calculation unit array. In this embodiment the number of rows in the grouping parameter is 2 and the number of columns is 4, i.e. the grouping parameter indicates that the calculation unit array is divided by rows into 2 sub-calculation arrays, each of which is a 2 × 4 array. Both sub-calculation arrays use all 2 rows and 4 columns of the original input matrix M1; the first sub-calculation array reads columns 1 to 4 of the 4 rows of the input matrix M2, and the second sub-calculation array reads columns 5 to 8 of the 4 rows of the input matrix M2.

LD_M1 reads M1 according to the instruction and stores M1 as one group of data in DB1 of LD_M1. LD_M2 reads M2 according to the instruction, divides M2 evenly into 2 groups of data (two second sub-matrices) along the column dimension of the calculation unit array according to the grouping parameter, and stores the two groups in DB1 and DB2 of LD_M2, respectively.

Specifically, each sub-calculation array needs to read the 4 columns of data of matrix M1 from DB1 of LD_M1, while the two sub-calculation arrays need to read 4 rows of data of M2 from DB1 and DB2 of LD_M2, respectively. Therefore LD_M1 stores the 4 columns of M1 as one group in DB1, and LD_M2 stores the first 4 columns of M2 in DB1 and the last 4 columns in DB2.

The switch array SM1 of LD_M1 connects the inputs of the 2 sub-calculation arrays to DB1; the switch array SM2 of LD_M2 connects the input of the first sub-calculation array Row Group1 to DB1 and the input of the second sub-calculation array Row Group2 to DB2.
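The switch connections in this example can be summarized as a mapping from row group to storage area. The Python sketch below is a hypothetical model: the function switch_settings and the zero-based indices (index 0 standing for DB1 and Row Group1, index 1 for DB2 and Row Group2) are assumptions made for illustration.

```python
def switch_settings(num_row_groups):
    """Return (sm1, sm2): for each row group index, the storage area it connects to.

    Row-division case as in the example: M1 sits in one storage area shared by
    all row groups, while M2 is split so that row group g uses storage area g.
    """
    sm1 = {g: 0 for g in range(num_row_groups)}
    sm2 = {g: g for g in range(num_row_groups)}
    return sm1, sm2


# Index 0 stands for DB1 / Row Group1, index 1 for DB2 / Row Group2.
sm1, sm2 = switch_settings(2)
```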

At this time, the original 4 × 4 calculation unit array is recombined into a 2 × 8 calculation unit array, see fig. 5(c).
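The recombination can be read as a coordinate mapping between the logical 2x8 array and the physical 4x4 array. The Python sketch below uses zero-based indices and a hypothetical function to_physical; it follows the correspondence given in the text, where row 0 of the recombined array covers rows 0 and 2 of the original array and column 4 of the recombined array falls in the second half of original column 0.

```python
def to_physical(r, c, group_rows=2, array_cols=4):
    """Map unit (r, c) of the recombined array to (row, col) of the original array."""
    group = c // array_cols              # which row group the logical column falls in
    return group * group_rows + r, c % array_cols


# Every unit of the logical 2x8 array and its place in the physical 4x4 array.
mapping = {(r, c): to_physical(r, c) for r in range(2) for c in range(8)}
```

Each physical calculation unit is used exactly once, so the recombined array keeps all 16 units busy for a 2x8 output.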

Fig. 5(d) is a schematic diagram of the data operation device according to the embodiment of the present invention executing the first step of the matrix operation.

As shown in fig. 5(d), the first step of the matrix operation performed by the data operation device is as follows. LD_M1 gates the corresponding switches according to the grouping parameters for the first operation, so that DB1 of LD_M1 is connected to the first input data paths of Row Group1 and Row Group2. Both Row Group1 and Row Group2 of the PUA read the column 1 data in DB1 from the DB of LD_M1 as the first inputs of the two sub-calculation arrays. The specific allocation is as follows: the 1st data "1" of column 1 is sent to all row 0 calculation units as the first input, and the 2nd data "0" of column 1 is sent to all row 1 calculation units as the first input. Here, row 0 refers to row 0 of the recombined 2x8 calculation unit array, which corresponds to rows 0 and 2 of the original 4x4 calculation unit array; similarly, row 1 refers to row 1 of the recombined 2x8 calculation unit array, which corresponds to rows 1 and 3 of the original 4x4 calculation unit array.

LD_M2 gates the corresponding switches according to the grouping parameters for the first operation, so that DB1 of LD_M2 is connected to the second data path of Row Group1 of the PUA, and DB2 is connected to the second data path of Row Group2. Row Group1 reads the row 1 data in DB1 from the DB of LD_M2 as its second input; Row Group2 reads the row 1 data in DB2 from the DB of LD_M2 as its second input. The specific allocation is as follows: the 1st data "1" of row 1 of DB1 is allocated to all column 0 calculation units of Row Group1 as the second input (here column 0 refers to column 0 of the recombined 2x8 calculation unit array, which corresponds to the first half of column 0 of the original 4x4 calculation unit array and includes only its 1st and 2nd calculation units; the same applies hereinafter), and the other data of DB1 are allocated in the same way. The 1st data "1" of row 1 of DB2 is allocated to all column 0 calculation units of Row Group2 as the second input (here column 0 refers to column 4 of the recombined 2x8 calculation unit array, which corresponds to the second half of column 0 of the original 4x4 calculation unit array and includes only its 3rd and 4th calculation units; the same applies hereinafter), and the other data of DB2 are allocated in the same way.

Each calculation unit in Row Group1 and Row Group2 multiplies its first input data by its second input data to obtain its result for this step; the results output by all calculation units of Row Group1 and Row Group2 form the intermediate result matrix M_temp of the first operation.
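The first step amounts to an outer product of one column of M1 with one row of M2, where the row of M2 is assembled from the two storage areas. The Python sketch below illustrates this; the function first_step is hypothetical, the M1 column (1, 0) and the leading 1 of each M2 row are the values quoted in the text, and the remaining M2 entries are placeholders.

```python
def first_step(m1_column, db_rows):
    """Multiply one M1 column with one M2 row per storage area.

    m1_column: the column of M1 shared by all row groups.
    db_rows:   one row per storage area of LD_M2 (one per row group).
    Returns the intermediate matrix of the recombined array.
    """
    m2_row = [x for row in db_rows for x in row]   # row groups placed side by side
    return [[a * b for b in m2_row] for a in m1_column]


# Column 1 of M1 as quoted in the text; M2 entries other than the leading 1s
# are placeholders.
m_temp = first_step([1, 0], [[1, 2, 3, 4], [1, 5, 6, 7]])
```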

Fig. 5(e) is a schematic diagram of a second step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention.

In the second operation, as shown in fig. 5(e), both Row Group1 and Row Group2 of the PUA read the column 2 data in DB1 from the DB of LD_M1 as their first inputs. The specific allocation is as follows: the 1st data "0" of column 2 is sent to all row 0 calculation units as the first input, and the 2nd data "2" is sent to all row 1 calculation units as the first input.

Row Group1 of the PUA reads the row 2 data in DB1 from the DB of LD_M2 as the second input of Row Group1; Row Group2 reads the row 2 data in DB2 from the DB of LD_M2 as the second input of Row Group2.

Each calculation unit of Row Group1 and Row Group2 multiplies the two input data of the current step and accumulates the product with the intermediate result of the previous step to obtain the current result; the results output by all calculation units of Row Group1 and Row Group2 form the intermediate result matrix M_temp of the second operation.

Fig. 5(f) is a schematic diagram of a third step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention.

As shown in fig. 5(f), both Row Group1 and Row Group2 of the PUA read the column 3 data in DB1 from the DB of LD_M1 as the first inputs of the two row groups. The specific allocation is as follows: the 1st data "3" of column 3 is sent to all row 0 calculation units as the first input, and the 2nd data "0" is sent to all row 1 calculation units as the first input. Row Group1 of the PUA reads the row 3 data in DB1 from the DB of LD_M2 as the second input of Row Group1; Row Group2 reads the row 3 data in DB2 from the DB of LD_M2 as the second input of Row Group2.

Each calculation unit of Row Group1 and Row Group2 multiplies the two input data of the current step and accumulates the product with the intermediate result of the previous step to obtain the current result; the results output by all calculation units of Row Group1 and Row Group2 form the intermediate result matrix M_temp of the third operation.

Fig. 5(g) is a diagram illustrating a fourth step of performing a matrix operation by the data operation apparatus according to the embodiment of the present invention.

As shown in fig. 5(g), both Row Group1 and Row Group2 of the PUA read the column 4 data in DB1 from the DB of LD_M1 as the first inputs of the two row groups. The specific allocation is as follows: the 1st data "0" of column 4 is sent to all row 0 calculation units as the first input, and the 2nd data "4" is sent to all row 1 calculation units as the first input.

Row Group1 of the PUA reads the row 4 data in DB1 from the DB of LD_M2 as the second input of Row Group1; Row Group2 reads the row 4 data in DB2 from the DB of LD_M2 as the second input of Row Group2.

Each calculation unit of Row Group1 and Row Group2 multiplies the two input data of the current step and accumulates the product with the intermediate result of the previous step to obtain the current result; the results output by all calculation units of Row Group1 and Row Group2 form the result matrix M of M1 and M2. Finally, the result matrix M is output.
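The four steps can be simulated in Python to check that accumulating one product per step in every calculation unit yields the ordinary matrix product. This is a software sketch under assumptions, not the device itself: the function grouped_matmul is hypothetical, the M1 values are those quoted in the text, and the M2 values are placeholders.

```python
def grouped_matmul(m1, m2, num_groups):
    """Row-division scheme: accumulate one product per step in each row group."""
    rows, inner, cols = len(m1), len(m2), len(m2[0])
    if cols % num_groups != 0:
        raise ValueError("columns of M2 must divide evenly among the row groups")
    width = cols // num_groups
    # Storage areas of LD_M2: equal column blocks of M2, one per row group.
    areas = [[row[g * width:(g + 1) * width] for row in m2]
             for g in range(num_groups)]
    # One accumulator per calculation unit, per row group.
    acc = [[[0] * width for _ in range(rows)] for _ in range(num_groups)]
    for k in range(inner):          # step k: column k of M1, row k of each area
        for g in range(num_groups):
            for r in range(rows):
                for c in range(width):
                    acc[g][r][c] += m1[r][k] * areas[g][k][c]
    # Place the row groups side by side to form the output matrix.
    return [[x for g in range(num_groups) for x in acc[g][r]]
            for r in range(rows)]


M1 = [[1, 0, 3, 0], [0, 2, 0, 4]]                    # values quoted in the text
M2 = [[r + c for c in range(8)] for r in range(4)]   # placeholder values
M = grouped_matmul(M1, M2, num_groups=2)
reference = [[sum(M1[i][k] * M2[k][j] for k in range(4)) for j in range(8)]
             for i in range(2)]
```

The loop over k corresponds to the four steps of figs. 5(d) to 5(g); the comparison value reference is computed directly from the definition of matrix multiplication.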

According to the data operation device provided by the embodiment of the invention, on the one hand, the calculation unit array can be flexibly recombined, so that the calculation units are used effectively according to the dimensional characteristics of the data matrices, which helps bring the computing power of the chip into play; on the other hand, the data operation device can reuse more data, which improves data utilization and reduces the power consumption caused by data movement.

According to another embodiment of the invention, a processing core is provided, which comprises one or more of the data operation devices provided in the above embodiments.

In some embodiments, the processing core further comprises a decode unit to decode the received instruction and send the decoded instruction to the data operation device.

According to another embodiment of the invention, an electronic device is provided, which includes one or more processing cores provided in the above embodiments.

According to another embodiment of the invention, a chip is provided, which comprises one or more processing cores provided in the above embodiments.

According to another embodiment of the present invention, a card is provided that includes one or more of the chips provided in the above embodiments.

According to another embodiment of the invention, an electronic device is provided, which comprises one or more chips provided by the above embodiments.

As shown in fig. 6, a data operation method according to an embodiment of the present invention includes:

step S101, receiving an instruction;

step S102, reading a first matrix and a second matrix based on the instruction, wherein the instruction comprises grouping parameters of a computing unit array, the grouping parameters are parameters for dividing the computing unit array into sub-computing arrays, and the grouping parameters are related to rows of the first matrix or columns of the second matrix;

step S103, the sub-calculation array reads the data of the first matrix and the second matrix and executes the operation of the first matrix and the second matrix.
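Steps S101 to S103 can be strung together in a small Python sketch. Everything here is a hypothetical software model: the dict-based instruction, the memory dict keyed by address, the key names and the sample matrices are assumptions, and the grouping parameter is taken to be a number of row groups.

```python
def run(instruction, memory):
    """Model of S101-S103 for the row-division case."""
    # S101: receive the instruction (here a plain dict, a hypothetical encoding).
    groups = instruction["row_groups"]
    # S102: read both matrices using the addresses carried by the instruction.
    m1 = memory[instruction["m1_addr"]]
    m2 = memory[instruction["m2_addr"]]
    width = len(m2[0]) // groups
    # S103: each sub-calculation array computes its own block of columns.
    out = [[] for _ in m1]
    for g in range(groups):
        block = [row[g * width:(g + 1) * width] for row in m2]
        for i, m1_row in enumerate(m1):
            for j in range(width):
                out[i].append(sum(a * block[k][j] for k, a in enumerate(m1_row)))
    return out


# Illustrative operands stored at made-up addresses.
memory = {0x10: [[1, 2], [3, 4]], 0x20: [[1, 0, 2, 0], [0, 1, 0, 2]]}
result = run({"row_groups": 2, "m1_addr": 0x10, "m2_addr": 0x20}, memory)
```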

In some embodiments, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided based on rows of the first matrix, and the number of rows of the sub-calculation array is the same as the number of rows of the first matrix; or, the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided based on columns of the second matrix, and the number of columns of the sub-calculation array is the same as the number of columns of the second matrix.

In some embodiments, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to rows of the first matrix, the sub-calculation array reads the first matrix column by column, and each column of calculation units of the sub-calculation array correspondingly reads a column of data of the first matrix; the sub-calculation array divides the second matrix into a plurality of second sub-matrices by taking the column dimension of the sub-calculation array as a unit, and reads the corresponding second sub-matrix row by row, so that each row of calculation units of the sub-calculation array reads one row of data of the corresponding second sub-matrix.

In some embodiments, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix, the sub-calculation array divides the first matrix into a plurality of first sub-matrices by taking the row dimension of the sub-calculation array as a unit, and reads the corresponding first sub-matrix column by column, so that each column of calculation units of the sub-calculation array reads one column of data of the corresponding first sub-matrix; and the sub-calculation array reads the second matrix row by row, so that each row of calculation units of the sub-calculation array reads one row of data of the second matrix.
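The column-division case mirrors the row-division case: the first matrix is split into row blocks and the second matrix is shared by all groups. The Python sketch below illustrates this under assumptions; the function column_grouped_matmul and the 8x4 and 4x2 operands are hypothetical and chosen only so that the first matrix splits into two equal row blocks.

```python
def column_grouped_matmul(m1, m2, num_groups):
    """Column-division scheme: M1 is split into row blocks, M2 is shared."""
    if len(m1) % num_groups != 0:
        raise ValueError("rows of M1 must divide evenly among the column groups")
    height = len(m1) // num_groups
    inner, cols = len(m2), len(m2[0])
    out = []
    for g in range(num_groups):
        block = m1[g * height:(g + 1) * height]   # first sub-matrix of group g
        acc = [[0] * cols for _ in range(height)]
        for k in range(inner):                    # column k of the block, row k of M2
            for r in range(height):
                for c in range(cols):
                    acc[r][c] += block[r][k] * m2[k][c]
        out.extend(acc)                           # row blocks are stacked vertically
    return out


# Placeholder operands: an 8x4 first matrix and a 4x2 second matrix.
A = [[r + k for k in range(4)] for r in range(8)]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]
C = column_grouped_matmul(A, B, num_groups=2)
```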

In some embodiments, each calculation unit in the sub-calculation array is configured to accumulate results of each operation to obtain an output matrix.

In some embodiments, the above method further comprises: when the grouping parameter is a parameter of a sub-calculation array which divides the calculation unit array according to the row of the first matrix, the data reading module stores the elements of the first matrix into one storage area and stores the elements of the second matrix into a plurality of storage areas in a grouping manner based on the grouping parameter; or, when the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix, the data reading module is configured to store the elements of the second matrix into one storage area and store the elements of the first matrix into a plurality of storage areas in a grouping manner based on the grouping parameter.
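The storage rule in this paragraph can be condensed into a small Python lookup. The function storage_layout and the mode labels "rows_of_m1" and "cols_of_m2" are hypothetical names; the sketch only records how many storage areas each matrix occupies in the two cases.

```python
def storage_layout(mode, num_groups):
    """Return how many storage areas each matrix occupies for a grouping mode.

    mode is a hypothetical label: "rows_of_m1" or "cols_of_m2".
    """
    if mode == "rows_of_m1":
        return {"M1": 1, "M2": num_groups}   # M1 shared, M2 stored in groups
    if mode == "cols_of_m2":
        return {"M1": num_groups, "M2": 1}   # M2 shared, M1 stored in groups
    raise ValueError("unknown grouping mode: " + mode)


layout = storage_layout("rows_of_m1", 2)
```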

In some embodiments, the method further comprises: and the data reading module switches on the corresponding storage area to the switch of the sub-computing unit array based on the grouping parameter, so that each sub-computing unit array can extract the elements of the array corresponding to the current operation.

According to some embodiments of the present invention, there is provided a computer storage medium having a computer program stored thereon, the computer program being executed by a processor to perform the data operation method of the above embodiments.

According to some embodiments of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the data operation method according to the above embodiments.

According to some embodiments of the present invention, there is provided a computer program product, which comprises computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data operation method of the above embodiments.

It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims

1. A data operation device, comprising:

a data reading module, configured to receive an instruction and read the first matrix and the second matrix based on the instruction, wherein the instruction comprises a grouping parameter of the calculation unit array, the grouping parameter is a parameter used for dividing the calculation unit array into sub-calculation arrays, and the grouping parameter is related to a row of the first matrix or a column of the second matrix; and

the sub-calculation array, configured to read data of the first matrix and the second matrix and perform the operation of the first matrix and the second matrix.

2. The data operation device according to claim 1,

the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided based on rows of the first matrix, and the number of rows of the sub-calculation array is the same as the number of rows of the first matrix; or,

the grouping parameter is a parameter of a sub calculation array into which the calculation unit array is divided based on columns of the second matrix, and the number of columns of the sub calculation array is the same as the number of columns of the second matrix.

3. The data operation device according to claim 1 or 2,

the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to rows of the first matrix;

the sub-computation array reads the first matrix column by column, and each column of computation units of the sub-computation array correspondingly reads a column of data of the first matrix;

the sub-calculation array divides the second matrix into a plurality of second sub-matrixes by taking the column dimension of the sub-calculation array as a unit, and reads the corresponding second sub-matrixes row by row so that each row of calculation units of the sub-calculation array reads one row of data of the corresponding second sub-matrixes.

4. The data operation device according to any one of claims 1 to 3,

the grouping parameter is a parameter of a sub-calculation array into which the calculation unit array is divided according to columns of the second matrix;

the sub-calculation array divides the first matrix into a plurality of first sub-matrixes by taking the row dimension of the sub-calculation array as a unit, and reads the corresponding first sub-matrixes column by column so that each column of calculation units of the sub-calculation array reads one column of data of the corresponding first sub-matrixes;

and reading the second matrix by the sub-calculation array line by line, so that each line of calculation units of the sub-calculation array reads one line of data of the second matrix.

5. The data operation device according to any one of claims 1 to 4,

and each computing unit in the sub-computing array is used for accumulating the results of each operation to obtain an output matrix.

6. The data operation device according to any one of claims 1 to 5,

the data reading module comprises a plurality of storage areas;

when the grouping parameter is a parameter of a sub-calculation array into which the calculation cell array is divided according to the rows of the first matrix,

the data reading module is used for storing the elements of the first matrix into a storage area and storing the elements of the second matrix into a plurality of storage areas in a grouping manner based on the grouping parameters; or,

when the grouping parameter is a parameter of a sub calculation array into which the calculation unit array is divided according to columns of the second matrix,

the data reading module is used for storing the elements of the second matrix into a storage area and storing the elements of the first matrix into a plurality of storage areas in a grouping manner based on the grouping parameters.

7. The data operation device according to claim 6, wherein the data reading module is further configured to turn on switches from the corresponding storage areas to the sub-calculation unit arrays based on the grouping parameter, so that each of the sub-calculation unit arrays can extract elements of a group corresponding to the current operation.

8. A processing core comprising one or more data operation devices according to any one of claims 1 to 7.

9. An electronic device comprising the processing core of claim 8.

10. A data operation method, comprising:

receiving an instruction;

reading a first matrix and a second matrix based on the instruction, the instruction including grouping parameters of the computing unit array, the grouping parameters being parameters for dividing the computing unit array into sub-computing arrays, the grouping parameters being related to rows of the first matrix or columns of the second matrix;

and the sub-calculation array reads the data of the first matrix and the second matrix and executes the operation of the first matrix and the second matrix.