CN114168895A - Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium - Google Patents


Info

Publication number
CN114168895A
Authority
CN
China
Prior art keywords
data
matrix
circuit
reading
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010955659.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simm Computing Technology Co ltd
Original Assignee
Beijing Simm Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simm Computing Technology Co ltd filed Critical Beijing Simm Computing Technology Co ltd
Priority to CN202010955659.8A priority Critical patent/CN114168895A/en
Priority to PCT/CN2021/117841 priority patent/WO2022053032A1/en
Publication of CN114168895A publication Critical patent/CN114168895A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 - Multiplying; Dividing
    • G06F7/523 - Multiplying only
    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03K - PULSE TECHNIQUE
    • H03K19/00 - Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
    • H03K19/20 - Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits characterised by logic function, e.g. AND, OR, NOR, NOT circuits

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of the present disclosure provide a matrix calculation circuit, a matrix calculation method, an electronic device, and a computer-readable storage medium. The matrix calculation circuit includes: a first data reading circuit for reading and buffering first data in a first matrix and position information of the first data, the first matrix being a compression matrix of a data matrix, and for generating a second data output control signal according to the position information of the first data; a second data reading circuit for reading and buffering second data in a second matrix, and for outputting the second data under control of the second data output control signal; and a calculation circuit for calculating third data according to the first data and the second data. By using the read position information of a plurality of first data to control the output of a plurality of second data, the matrix calculation circuit solves two technical problems of the prior art: matrix calculation limited to a single datum at a time, and complex access-address computation.

Description

Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of processors, and in particular, to a matrix calculation circuit, a matrix calculation method, an electronic device, and a computer-readable storage medium.
Background
With the development of science and technology, human society is rapidly entering the intelligent era. Important characteristics of this era are that people obtain ever more data, the quantity of data obtained keeps growing, and the requirements on the speed of processing that data keep rising. Chips are the cornerstone of data processing; they fundamentally determine people's ability to process data. In terms of application fields, chips follow two main routes: one is the general-purpose chip route, e.g. the CPU (Central Processing Unit), which provides great flexibility but low effective computing power when processing domain-specific algorithms; the other is the special-purpose chip route, e.g. the TPU (Tensor Processing Unit), which achieves high effective computing power in certain specific fields but offers poor or even no processing capability in flexible, general-purpose fields. Because the data of the intelligent era is varied and enormous in quantity, chips are required to be highly flexible, able to process algorithms of different fields and different forms, and at the same time to have very high processing capacity, rapidly handling extremely large and sharply growing volumes of data.
In neural network computation, convolution accounts for most of the total operations, and convolution can be converted into matrix multiplication. Increasing the speed of matrix multiplication therefore raises throughput in neural network tasks, reduces latency, and improves the effective computing power of a chip.
In many neural networks, the matrix formed by the data (including both the parameter data and the input data of the network) is a sparse matrix, that is, a matrix containing a large number of zero-valued elements. To reduce the storage capacity and bandwidth occupied by the data during neural network computation, the sparse matrix is compressed for storage; to increase the speed of matrix operations, operations on the sparse matrix are optimized.
FIG. 1a is a schematic diagram of a matrix multiplication in a neural network. As shown in FIG. 1a, M1 is a data matrix, M2 is a parameter matrix, and M is the output matrix. Each datum of a row of M1 is multiplied by the corresponding parameter of a column of M2 and the products are accumulated to obtain one element of M. In FIG. 1a, either one or both of M1 and M2 may be sparse matrices.
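As an illustration (not part of the patent text), the FIG. 1a computation can be sketched in plain Python; the small matrices here are hypothetical examples:

```python
def matmul(m1, m2):
    """Each output element is the multiply-add of one row of m1
    with one column of m2, as in the FIG. 1a scheme."""
    rows, inner, cols = len(m1), len(m2), len(m2[0])
    return [[sum(m1[i][k] * m2[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

M1 = [[1, 0, 2], [0, 3, 0]]    # data matrix (sparse: many zeros)
M2 = [[4, 0], [0, 5], [6, 0]]  # parameter matrix
M = matmul(M1, M2)             # output matrix: [[16, 0], [0, 15]]
```

Note how the zero entries of M1 still cost multiplications in this dense scheme; that waste is what the compressed representation below avoids.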
Fig. 1b shows the compression of a matrix schematically. For sparse-matrix storage, a general compression method can be employed: only the non-zero elements are stored. Along with the value of each non-zero element, its position in the matrix is stored, i.e., the relative coordinates X and Y of the element, where X is the row index and Y is the column index. In this method, a value and its coordinates are stored together as one data structure, and that data structure is the unit of storage. As shown in fig. 1b, taking an MxN matrix as an example, the MxN matrix on the left is compressed into the compression matrix on the right; each data structure in the compression matrix holds one non-zero datum of the left matrix and the coordinates of that datum in the matrix.
In a sparse matrix, most element values are 0, and the zero elements need not be stored, so this compression method effectively reduces the storage capacity required by the matrix. Fig. 1c illustrates an example of compressing a matrix with this method: in a 16x16 sparse matrix whose only non-zero elements are a, b, c and d, only the values and coordinates of those four elements need to be stored after compression, saving storage space.
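A minimal sketch of this coordinate-based compression, pairing each non-zero value with its (X, Y) coordinates; the 4x4 matrix is a made-up example standing in for the 16x16 one of Fig. 1c:

```python
def compress(matrix):
    """Store only non-zero elements together with their (X, Y)
    coordinates, where X is the row index and Y is the column index
    (the Fig. 1b/1c scheme)."""
    return [(v, x, y)
            for x, row in enumerate(matrix)
            for y, v in enumerate(row)
            if v != 0]

sparse = [[0, 0, 0, 0],
          [0, 7, 0, 0],
          [0, 0, 0, 9],
          [0, 0, 0, 0]]
compressed = compress(sparse)  # [(7, 1, 1), (9, 2, 3)]
```

Only two (value, X, Y) structures are kept for the sixteen elements, which is the storage saving the passage describes.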
When the matrix operation M1xM2 is performed, the compressed matrix is the one actually accessed. This technical solution, however, has the following disadvantages: 1. during the matrix operation, data utilization is low, and usually each independent operation unit can compute only a single datum; 2. computing access addresses from the data coordinates of the compression matrix is complex, which degrades performance.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In order to solve the above technical problems in the prior art, the embodiment of the present disclosure provides the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a matrix calculation circuit, including:
the device comprises a first data reading circuit, a second data reading circuit and a first data processing circuit, wherein the first data reading circuit is used for reading and caching first data in a first matrix and position information of the first data, and the first matrix is a compression matrix of a data matrix; generating a second data output control signal according to the position information of the first data;
the second data reading circuit is used for reading and caching second data in the second matrix; controlling to output the second data according to the second data output control signal;
and the calculation circuit is used for calculating third data according to the first data and the second data.
Further, the first data reading circuit further includes:
the device comprises a first data cache circuit, a first data sorting circuit and a first control circuit;
the first control circuit is used for generating a first data reading address according to a first address of the first matrix;
the first data cache circuit is used for caching first data read out according to the first data reading address and position information of the first data;
the first data sorting circuit is configured to reorder, according to the position information of the first data in the first data cache circuit, both the position information and the first data while keeping them in one-to-one correspondence, such that data belonging to the same row of the data matrix are still in the same row after reordering.
Further, the second data reading circuit further includes:
the second data cache circuit, the data selection circuit and the second control circuit;
the second control circuit is used for generating a second data reading address according to the first address of the second matrix;
the second data cache circuit is used for caching second data read out according to the second data reading address;
the data selection circuit is configured to select and output the second data from the second data buffer circuit according to the second data output control signal.
Further, the generating a second data output control signal according to the position information of the first data includes:
the first data sorting circuit is configured to generate the second data output control signal according to column information in the first data location information.
Further, the data selecting circuit is configured to select and output the second data from the second data buffer circuit according to the second data output control signal, and includes:
the data selection circuit is used for selecting second data corresponding to the column information from the second data buffer circuit according to the column information in the second data output control signal and outputting the second data.
Further, the first data are K columns of first data in the data matrix, and the second data are the K rows of second data in the second matrix that correspond to those K columns in the matrix calculation.
Further, the computation circuit includes:
a computing unit array, wherein the computing unit array comprises a plurality of computing units;
a row of the computing units in the computing unit array receives a row of the second data;
a row of compute units in the array of compute units receives one of the first data.
Further, the calculating circuit is configured to calculate third data according to the first data and the second data, and includes:
the computing circuit receives a column of first data output by the first data sorting circuit; receiving at least one row of second data selectively output by the data selection circuit; and calculating to obtain third data according to the column of first data and the at least one row of second data.
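As an illustrative behavioral model (an assumption about the dataflow, not the actual hardware), one row of compute units can be pictured as broadcasting a single first datum against one row of second data and accumulating the partial products:

```python
def compute_unit_row(first_datum, second_row, accum):
    """One row of compute units: every unit multiplies the same
    broadcast first datum by one element of a row of second data and
    adds the partial product to its accumulator (hypothetical model)."""
    return [a + first_datum * s for a, s in zip(accum, second_row)]

acc = [0, 0, 0]                            # one accumulator per unit
acc = compute_unit_row(2, [4, 5, 6], acc)  # partial products 8, 10, 12
```

Repeating this for each (first datum, selected second row) pair accumulates one row of the third data.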
Further, the location information of the first data includes: a row coordinate and a column coordinate of the first data in the data matrix.
In a second aspect, an embodiment of the present disclosure provides a matrix calculation method, including:
reading and caching first data in a first matrix and position information of the first data, wherein the first matrix is a compression matrix of a data matrix;
generating a second data output control signal according to the position information of the first data;
reading and caching second data in the second matrix;
controlling to output the second data according to the second data output control signal;
and calculating to obtain third data according to the first data and the second data.
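The five steps above can be sketched as a software model (a hypothetical reference implementation, not the claimed circuit): the column coordinate stored with each non-zero first datum directly selects the matching row of the second matrix, so no separate access-address computation is needed:

```python
def sparse_matmul(compressed_m1, m2, out_rows):
    """Behavioral sketch of the method: for each first datum (value, x, y)
    of the compressed first matrix, the column coordinate y acts as the
    second data output control signal selecting row y of the second
    matrix; partial products accumulate into the third data."""
    out_cols = len(m2[0])
    m = [[0] * out_cols for _ in range(out_rows)]
    for value, x, y in compressed_m1:
        row_of_m2 = m2[y]                    # selected by position info
        for j in range(out_cols):
            m[x][j] += value * row_of_m2[j]  # third data accumulation
    return m

compressed = [(1, 0, 0), (2, 0, 2), (3, 1, 1)]  # (value, X, Y) entries
M2 = [[4, 0], [0, 5], [6, 0]]
result = sparse_matmul(compressed, M2, out_rows=2)
```

Only the non-zero first data are visited, which is why the utilization and addressing problems of the dense scheme do not arise.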
In a third aspect, an embodiment of the present disclosure provides a processing core, including the matrix calculation circuit described in any one of the first aspects, a decoding unit, and a storage device.
In a fourth aspect, an embodiment of the present disclosure further provides a chip, where the chip includes at least one processing core in the third aspect.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for executing the computer-readable instructions, such that, when the instructions are executed, the processors implement the matrix calculation method of any of the preceding first aspects.
In a sixth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the matrix computation method of any of the preceding first aspects.
In a seventh aspect, the present disclosure provides a computer program product, which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device may execute the matrix calculation method in any one of the foregoing first aspects.
In an eighth aspect, the embodiments of the present disclosure provide a computing device, including one or more chips described in the fourth aspect.
Embodiments of the present disclosure provide a matrix calculation circuit, a matrix calculation method, an electronic device, and a computer-readable storage medium. The matrix calculation circuit includes: a first data reading circuit for reading and buffering first data in a first matrix and position information of the first data, the first matrix being a compression matrix of a data matrix, and for generating a second data output control signal according to the position information of the first data; a second data reading circuit for reading and buffering second data in a second matrix, and for outputting the second data under control of the second data output control signal; and a calculation circuit for calculating third data according to the first data and the second data. By using the read position information of a plurality of first data to control the output of a plurality of second data, the matrix calculation circuit solves two technical problems of the prior art: matrix calculation limited to a single datum at a time, and complex access-address computation.
The foregoing is a summary of the present disclosure; to make its technical means clearer and implementable in accordance with the specification, specific embodiments are set forth in the detailed description below.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIGS. 1a-1c are schematic diagrams of the prior art;
fig. 2 is a schematic structural diagram of a matrix calculation circuit provided in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a first data reading circuit according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram illustrating an example of reordering of a first data read circuit according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a second data reading circuit according to an embodiment of the disclosure;
FIGS. 6a-6e are schematic diagrams of an application example of an embodiment of the present disclosure;
fig. 7 is a flowchart of a matrix calculation method according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 2 is a schematic diagram of a matrix calculation circuit provided in an embodiment of the present disclosure. The matrix calculation circuit (EU) 200 provided in this embodiment includes:
a first data reading circuit (LD_M1) 201 for reading and buffering first data in a first matrix and position information of the first data, where the first matrix is a compression matrix of a data matrix, and for generating a second data output control signal according to the position information of the first data;
a second data reading circuit (LD_M2) 202 for reading and buffering second data in a second matrix, and for outputting the second data under control of the second data output control signal;
a calculation circuit 203 for calculating third data according to the first data and the second data.
Illustratively, the first data reading circuit reads and buffers first data in the first matrix according to a read address of the first data, and the read address of the first data is generated according to a storage head address of the first matrix; and the second data reading circuit reads and buffers the second data in the second matrix according to the reading address of the second data, and the reading address of the second data is generated according to the storage head address of the second matrix. The storage head address of the first matrix and the storage head address of the second matrix are obtained through an instruction decoding circuit ID (instruction decoder), and the instruction decoding circuit is used for decoding a matrix calculation instruction to obtain the storage head address of the first matrix, the storage head address of the second matrix, the size of the first matrix and the size of the second matrix and other parameters.
Illustratively, the matrix calculation instruction includes an instruction type, a first storage address of the first matrix, a first storage address of the second matrix, and size parameters of the first matrix and the second matrix. In one embodiment, the instruction type is a multiplication instruction of a matrix, the first matrix is a compression matrix of a data matrix in the neural network convolution calculation, and the second matrix is a parameter matrix in the neural network convolution calculation; wherein the data matrix and/or the second matrix is a sparse matrix having a large number of elements with values of 0. It is understood that the memory head address of the matrix and the size parameter of the matrix (such as the number of rows and columns of the matrix) in the matrix calculation instruction may be represented in the form of register addresses, and the instruction decoding circuit acquires corresponding data from the corresponding register addresses.
In the embodiment of the present disclosure, the first data reading circuit 201 receives the first address of the first matrix decoded by the instruction decoding circuit, and generates a reading address of the first data according to the first address; optionally, the plurality of first data in the first matrix are read at one time according to the read address of the first data. For example, the maximum number of the first data read at a time is preset to be K columns, where the K columns are K columns in the data matrix, the first data reading circuit generates a reading address of the first data according to a head address of the first matrix and K, and reads and buffers the plurality of first data representing the K columns and position information of the plurality of first data from the first matrix at a time. After obtaining the position information of the plurality of first data, the first data reading circuit generates a control signal of second data according to the position information of the plurality of first data, so as to control the output of the plurality of second data buffered by the second data reading circuit.
In the embodiment of the present disclosure, the second data reading circuit 202 receives the first address of the second matrix decoded by the instruction decoding circuit, and generates a reading address of the second data according to the first address; and reading a plurality of second data in the second matrix at one time according to the reading address of the second data. For example, the maximum number of the second data read at one time is preset to be K rows, and for example, if the second matrix is not a compression matrix, the K rows are K rows in the second matrix; and the second data reading circuit generates a reading address of second data according to the first address of the second matrix and K, and reads and buffers K rows of second data from the second matrix at one time. And then, according to the received control signal of the second data, controlling the output of the plurality of second data to output all or part of the plurality of second data.
In the embodiment of the present disclosure, the calculation circuit receives a plurality of first data transmitted from the first data reading circuit and a plurality of second data transmitted from the second data reading circuit, and calculates to obtain third data, where the third data is one or more.
As shown in fig. 3, in order to implement the function of the first data reading circuit, optionally, the first data reading circuit further includes:
a first data buffer circuit 301, a first data sorting circuit 302, and a first control circuit 303;
the first control circuit 303 is configured to generate a first data read address according to a first address of the first matrix;
the first data buffer circuit 301 is configured to buffer first data read according to the first data read address and location information of the first data;
the first data sorting circuit 302 is configured to reorder, according to the position information of the first data in the first data cache circuit, both the position information and the first data while keeping them in one-to-one correspondence, such that data belonging to the same row of the data matrix are still in the same row after reordering.
Optionally, the first control circuit 303 receives the first address of the first matrix obtained by decoding by the instruction decoding circuit, a preset parameter K, and the size parameter of the first matrix; for example, the first matrix contains the data of N columns of the data matrix. Optionally, the first control circuit includes a first read control circuit CL1 and a first address generating circuit AG1, where CL1 receives the first address of the first matrix decoded by the instruction decoding circuit, the preset parameter K, the size parameter of the first matrix, and the like, and controls AG1 to generate a first data read address Addr1, so that the first data reading circuit can read K columns of first data in the first matrix at a time according to Addr1.
Optionally, the first data buffer circuit 301 further includes a first memory or first storage area DB11 for buffering the plurality of first data, and a second memory or second storage area DB10 for buffering the position information of the plurality of first data; after the first data and their position information are read out from the first matrix, the first data are buffered in DB11 and the position information in DB10.
Optionally, the first data sorting circuit 302 further includes a reordering position information buffer circuit IRDB and a reordering first data buffer circuit DRDB, where the IRDB caches the position information of the reordered first data and the DRDB caches the reordered first data. Optionally, the position information of the first data includes row coordinates and column coordinates of the first data in the data matrix, the row coordinate being the X coordinate and the column coordinate the Y coordinate. Illustratively, the reordering is column-first, row-second: entries are arranged from small to large by Y coordinate and then from small to large by X coordinate, which guarantees that first data in the same row of the data matrix remain in the same row and first data in different rows remain in different rows. The reordered XY coordinates are cached in the IRDB and the reordered first data in the DRDB. Fig. 4 is a schematic diagram of an example of reordering. As shown in fig. 4, the data matrix M1_O is a sparse matrix, and the first matrix is its compressed matrix M1, which holds the first data of the data matrix together with their position information (X, Y). The first data reading circuit reads 3 columns of data from M1; the position information is first arranged from small to large by Y coordinate and then rearranged from small to large by X coordinate, so that entries with the same X coordinate land in the same row and entries with different X coordinates in different rows. As shown in fig. 4, (0,0) and (0,1) are located in row 0 and (1,2) in row 1; the first data are reordered to the positions corresponding to their position information, so first data 1 and 2 are located in row 0 and first datum 3 in row 1. Since the number of non-zero data may differ from row to row of the data matrix, the reordered rows may have different lengths: row 0 of fig. 4 has two data while row 1 has only one.
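The Fig. 4 reordering can be sketched as two stable sorts (a software analogy for the sorting circuit, under the assumption that entries are (value, X, Y) tuples): first by column coordinate Y, then by row coordinate X, which groups entries of the same data-matrix row together:

```python
def reorder(entries):
    """Column-first, row-second reordering of (value, X, Y) entries:
    sort by Y, then stably by X, so entries with the same X end up
    grouped in the same output row, in ascending Y order."""
    by_column = sorted(entries, key=lambda e: e[2])  # Y ascending
    by_row = sorted(by_column, key=lambda e: e[1])   # X ascending, stable
    rows = {}
    for value, x, y in by_row:
        rows.setdefault(x, []).append((value, x, y))
    return rows

entries = [(2, 0, 1), (3, 1, 2), (1, 0, 0)]  # as in the Fig. 4 example
grouped = reorder(entries)
# row 0 -> [(1, 0, 0), (2, 0, 1)], row 1 -> [(3, 1, 2)]
```

Python's `sorted` is stable, so the second sort by X preserves the ascending-Y order established by the first sort; rows may end up with different lengths, matching the remark above.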
After reordering, the first data reading circuit outputs the position information DO0 and the first data DO1, where DO1 is some or all of the plurality of first data and DO0 is the position information corresponding to DO1.
In one embodiment, the first data reading circuit and the second data reading circuit read and buffer, according to the configuration, all of the first data in the first matrix and all of the second data in the second matrix at one time; in this case the position information DO0 may be used directly as the control information of the second data. Optionally, the first data reading circuit and the second data reading circuit read and buffer a portion of the first data and of the second data at a time according to the configuration; in this case the control information of the second data may be generated from the position information DO0, for example by deriving the relative column information within the buffer from DO0 and using it as the control information of the second data.
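The two cases above can be sketched as follows; this is an assumed software model (names and data layout are illustrative), showing the absolute Y coordinate used directly on a full read, versus the relative column index within the currently buffered K-column block on a partial read.

```python
K = 4  # preset parameter: columns of first data / rows of second data per read

def control_info(do0, block_index, full_read=True):
    """do0: (X, Y) position information of the current first-data column.
    On a full read the absolute Y coordinate indexes the second data
    buffer directly; on a partial read it is converted to a column
    index relative to the K-column block currently buffered."""
    ys = [y for _, y in do0]
    if full_read:
        return ys                        # absolute column coordinates
    base = block_index * K               # first column of the current block
    return [y - base for y in ys]        # relative column information

# Column coordinates 5 and 6 fall inside block 1 (columns 4..7):
print(control_info([(0, 5), (1, 6)], 1, full_read=False))   # [1, 2]
```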
As shown in fig. 5, in order to implement the function of the second data reading circuit, optionally, the second data reading circuit further includes:
a second data buffer circuit 501, a data selection circuit 502, and a second control circuit 503;
the second control circuit 503 is configured to generate a second data read address according to a first address of the second matrix;
the second data buffer circuit 501 is configured to buffer the second data read according to the second data read address;
the data selecting circuit 502 is configured to select and output the second data from the second data buffer circuit according to the second data output control signal.
Optionally, the second control circuit 503 receives the first address of the second matrix decoded by the instruction decoding circuit, a preset parameter K, and a size parameter of the second matrix (for example, that the second matrix includes N rows of second data). Optionally, the second control circuit includes a second read control circuit CL2 and a second address generating circuit AG2, where CL2 receives the first address of the second matrix decoded by the instruction decoding circuit, the preset parameter K, the size parameter of the second matrix, and the like, and controls AG2 to generate a second data reading address Addr2, so that the second data reading circuit can read K rows of second data in the second matrix at a time according to Addr2.
Optionally, the second data buffer circuit 501 includes a second data memory or a second data storage area, the size of the second data memory or the second data storage area is the size of K rows of second data, and the plurality of read second data are buffered in the second data buffer circuit row by row according to the positions of the plurality of second data in the second matrix.
Optionally, the data selecting circuit 502 includes a switch signal generating circuit DEC and a gate circuit SW, wherein the switch signal generating circuit is configured to receive the second data output control signal to generate a switch signal of the gate circuit, and the gate circuit SW controls a switch corresponding to the switch signal to be opened after receiving the switch signal to output the corresponding second data.
Optionally, the second data output control signal includes the column information in the position information of the plurality of first data, and the data selection circuit selects, from the second data buffer circuit, and outputs the second data corresponding to that column information. Specifically, after receiving the second data output control signal, the switch signal generating circuit DEC extracts the column information from it, generates the row switch information corresponding to the column information, and turns on the gate circuit, which then outputs the corresponding row of second data among the plurality of buffered second data.
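A minimal software model of this selection behavior is sketched below, standing in for the DEC and SW circuits; the data-structure choices are assumptions for illustration only.

```python
def select_rows(second_buf, column_info):
    """second_buf: the K rows of second data buffered row by row.
    column_info: column coordinates carried by the second data output
    control signal. Each coordinate opens the switch of the matching
    row, gating that row of second data to the output."""
    return [second_buf[c] for c in column_info]

buf = [[1, 2], [3, 4], [5, 6], [7, 8]]     # K = 4 buffered rows of M2
print(select_rows(buf, [0, 2]))            # [[1, 2], [5, 6]]
```

Only the rows named by the control signal are output, so rows of the second matrix that would multiply zero elements of the data matrix are never driven into the computing array.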
As shown in fig. 2, the calculation circuit 203 includes:
a computing unit array PUA, the computing unit array including a plurality of computing units PU(1,1), PU(1,2), ..., PU(M,N);
A row of the computing units in the computing unit array receives a row of the second data;
a row of compute units in the array of compute units receives one of the first data.
Optionally, the calculating circuit 203 receives a reordered column of first data output by the first data sorting circuit; receiving at least one row of second data selectively output by the data selection circuit; and calculating third data according to the reordered first data of the column and the second data of the at least one row.
Specifically, each first datum in a column of first data output by the first data sorting circuit is output to one row of computing units in the calculation circuit; if the column contains two first data, the first datum of the 0th row is output to every computing unit in the 0th row of computing units, and the first datum of the 1st row to every computing unit in the 1st row. The data selection circuit selects the one or more rows of second data corresponding to the column of first data output by the first data sorting circuit; if the column contains one first datum, the data selection circuit selects and outputs one row of second data. In this way, every computing unit participating in the calculation receives two data inputs, a first datum and a second datum, computes their result according to the calculation type specified by the type of the calculation instruction to obtain a third datum, and the plurality of computing units obtain and output a plurality of third data. The calculation process is repeated, each computing unit accumulating its results, until all of the first data and second data have been read; the result is an output matrix in which the value of each element is the accumulated result of one computing unit participating in the calculation.
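One pass of the computing unit array described above can be sketched as follows; this is a software model, not the patent's circuit, with illustrative names: each first datum is broadcast to every PU in its row, the selected rows of second data are spread across the columns, and each PU multiply-accumulates into its own partial sum.

```python
def pu_array_step(acc, first_col, second_rows):
    """acc: M x N partial sums, one per computing unit.
    first_col: one reordered column of first data (one per PU row).
    second_rows: the selected rows of second data (one per PU row)."""
    for i, (a, row) in enumerate(zip(first_col, second_rows)):
        for j, b in enumerate(row):
            acc[i][j] += a * b            # multiply-accumulate in PU(i, j)
    return acc

acc = [[0, 0], [0, 0]]                    # 2 x 2 array of accumulators
pu_array_step(acc, [1, 2], [[1, 2], [1, 2]])
print(acc)    # [[1, 2], [2, 4]] -- the intermediate data after one pass
```

Note that if a reordered column carries fewer first data than there are PU rows, the rows without a first datum simply do not participate in that pass.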
Figs. 6a to 6e show an example of the calculation process of the matrix calculation circuit of the above embodiment. As shown in fig. 6a, the matrix calculation circuit is required to perform a matrix multiplication in which M1_O is the data matrix, M2 is the second matrix, and M is the third matrix obtained by multiplying M1_O by M2.
M1_O is stored in compressed form: as shown in fig. 6b, M1_O is compressed to generate the first matrix M1, which is stored. Let K = 4; that is, during the calculation, 4 columns of first data of the data matrix M1_O are read at a time and 4 rows of second data of the second matrix are read at a time, and in this example all of the data in M1 and M2 are read and buffered at once. The first data reading circuit of the matrix calculation circuit reads all of the first data of the first matrix M1 into the data buffer circuit at one time, and the first data sorting circuit reorders them, producing the storage order in the IRDB and the DRDB shown in fig. 6b.
Fig. 6c is an overall schematic diagram of the matrix calculation performed by the matrix calculation circuit. First data of M1 are read in units of K = 4 columns, i.e. the columns numbered 0-3 of the data matrix; since the total number of columns of the data matrix M1_O is 4 in this example, the whole of M1 is read at one time and buffered in the first data reading circuit LD_M1. After reading, the data are reordered: the position information is stored in the IRDB of LD_M1 and the first data in the DRDB of LD_M1. The data of M2 are read in units of K = 4 rows and buffered in the second data reading circuit LD_M2; since the total number of rows of M2 is 4 in this example, the whole of M2 is read and buffered in LD_M2 at one time. The 4 computing units of the computing array then compute and output a 2 x 2 output matrix M, each element of M corresponding to the output data of one computing unit.
Fig. 6d is a schematic diagram of the first calculation. The calculation circuit obtains the first column of first data from the DRDB of LD_M1; this column comprises the 1 of row 0 and the 2 of row 1. The 1 of row 0 is input to the row-0 computing units PU(0,0) and PU(0,1) of the calculation circuit, and the 2 of row 1 is input to the row-1 computing units PU(1,0) and PU(1,1). LD_M1 sends the column coordinates 0 and 1 of this first column of first data, buffered in the IRDB, to LD_M2, and the data selection circuit of LD_M2 selects and outputs, according to the column coordinates 0 and 1, the 0th and 1st rows of second data buffered in LD_M2 that correspond to those coordinates. The row-0 second data, comprising 1 and 2, are input to the row-0 computing units: second datum 1 to PU(0,0) and second datum 2 to PU(0,1). The row-1 second data, likewise comprising 1 and 2, are input to the row-1 computing units: second datum 1 to PU(1,0) and second datum 2 to PU(1,1). Each computing unit then independently performs a multiply-accumulate, giving the results 1 for PU(0,0), 2 for PU(0,1), 2 for PU(1,0), and 4 for PU(1,1). Since not all of the first data and second data have yet been calculated, the resulting third data are the intermediate data M_temp.
Fig. 6e is a schematic diagram of the second calculation. The calculation circuit obtains the second column of first data from the DRDB of LD_M1; this column comprises the 3 of row 0 and the 4 of row 1. The 3 of row 0 is input to the row-0 computing units PU(0,0) and PU(0,1) of the calculation circuit, and the 4 of row 1 is input to the row-1 computing units PU(1,0) and PU(1,1). LD_M1 sends the column coordinates 2 and 3 of this second column of first data, buffered in the IRDB, to LD_M2, and the data selection circuit of LD_M2 selects and outputs, according to the column coordinates 2 and 3, the 2nd and 3rd rows of second data buffered in LD_M2 that correspond to those coordinates. The row-2 second data, comprising 1 and 2, are input to the row-0 computing units: second datum 1 to PU(0,0) and second datum 2 to PU(0,1). The row-3 second data, likewise comprising 1 and 2, are input to the row-1 computing units: second datum 1 to PU(1,0) and second datum 2 to PU(1,1). Each computing unit then independently performs a multiply-accumulate, giving the accumulated results 4 for PU(0,0), 8 for PU(0,1), 6 for PU(1,0), and 12 for PU(1,1). Since all of the first data and second data have now been calculated, the third data obtained are the values of the elements of the output matrix M.
It can be seen from the above example that, using the matrix calculation circuit of the present disclosure to perform matrix multiplication, the multiplication of a 2 x 4 matrix by a 4 x 2 matrix is completed in only two calculations, greatly increasing the calculation speed and saving calculation time.
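The two-pass example of Figs. 6a-6e can be replayed end to end in software as below; the matrix values are reconstructed from the figures as described in the text above, so they should be read as an assumption, and the compressed layout is illustrative.

```python
M1_O = [[1, 0, 3, 0],
        [0, 2, 0, 4]]                        # 2 x 4 sparse data matrix
M2 = [[1, 2], [1, 2], [1, 2], [1, 2]]        # 4 x 2 second matrix

# Compressed, reordered first matrix: per-row (value, column) pairs,
# i.e. the contents of the DRDB paired with the Y coordinates in the IRDB.
first = [[(1, 0), (3, 2)],                   # row 0 of the data matrix
         [(2, 1), (4, 3)]]                   # row 1

acc = [[0, 0], [0, 0]]                       # one accumulator per PU
for step in range(2):                        # the two calculations of 6d/6e
    for i, row in enumerate(first):
        a, col = row[step]                   # first datum + column coordinate
        for j in range(2):                   # broadcast across PU row i
            acc[i][j] += a * M2[col][j]      # multiply-accumulate

print(acc)   # [[4, 8], [6, 12]] -- equals M = M1_O x M2
```

After the first iteration the accumulators hold the intermediate data 1, 2, 2, 4 of Fig. 6d; after the second they hold 4, 8, 6, 12, matching the output matrix M of Fig. 6e.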
With the above technical scheme, the compressed sparse matrix is calculated directly, which effectively saves storage space and data bandwidth; by using the computing unit array, all computing units process data synchronously, greatly improving data utilization, and multiple computing units can share the same data; and because the compressed sparse matrix is calculated directly, the calculation of zero elements is skipped, which increases the operation speed and improves the effective computing capacity of the chip.
Fig. 7 is a flowchart of a matrix calculation method according to an embodiment of the present disclosure. As shown in fig. 7, the method includes the steps of:
step S701, reading and caching first data in a first matrix and position information of the first data, wherein the first matrix is a compression matrix of a data matrix;
step S702, generating a second data output control signal according to the position information of the first data;
step S703, reading and caching second data in the second matrix;
step S704, controlling to output the second data according to the second data output control signal;
step S705, calculating third data according to the first data and the second data.
Further, the reading and caching the first data in the first matrix and the location information of the first data includes:
generating a first data reading address according to the first address of the first matrix;
caching first data read out according to the first data reading address and position information of the first data;
and respectively reordering the first data position information and the first data in a position one-to-one corresponding mode according to the position information of the first data, wherein the reordering result is that the data in the same row in the data matrix are still in the same row.
Further, the reading and buffering the second data in the second matrix includes:
generating a second data reading address according to the first address of the second matrix;
and caching the second data read according to the second data reading address.
Further, the generating a second data output control signal according to the first data position information includes:
and generating the second data output control signal according to the column information in the first data position information.
Further, the controlling the output of the second data according to the second data output control signal includes:
and selecting second data corresponding to the column information from the second data according to the column information in the second data output control signal and outputting the second data.
Further, the first data is K columns of first data in the data matrix, and the second data is K rows of second data in the second matrix corresponding to the K columns of first data in the matrix calculation.
Further, the calculating third data according to the first data and the second data includes:
receiving a column of first data; receiving at least one row of second data; and calculating to obtain third data according to the column of first data and the at least one row of second data.
Further, the location information of the first data includes: a row coordinate and a column coordinate of the first data in the data matrix.
In the above, although the steps in the above method embodiments are described in the above sequence, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in the above sequence, and may also be performed in other sequences such as reverse, parallel, and cross, and further, on the basis of the above steps, other steps may also be added by those skilled in the art, and these obvious modifications or equivalents should also be included in the protection scope of the present disclosure, and are not described herein again.
The embodiment of the present disclosure further provides a processing core, where the processing core includes at least one matrix calculation circuit, a decoding unit, and a storage device in the above embodiments.
The embodiment of the present disclosure further provides a chip, where the chip includes at least one processing core in the above embodiments.
An embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for executing the computer-readable instructions, such that the one or more processors, when executing the instructions, perform the matrix calculation method of any of the above embodiments.
The disclosed embodiments also provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the matrix calculation method described in any one of the foregoing embodiments.
The embodiment of the present disclosure further provides a computer program product, comprising computer instructions which, when executed by a computing device, cause the computing device to perform the matrix calculation method of any of the preceding embodiments.
The embodiment of the present disclosure further provides a computing device, which includes the chip in any one of the embodiments.
The flowchart and block diagrams in the figures of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (10)

1. A matrix computation circuit, comprising:
the device comprises a first data reading circuit, a second data reading circuit and a first data processing circuit, wherein the first data reading circuit is used for reading and caching first data in a first matrix and position information of the first data, and the first matrix is a compression matrix of a data matrix; generating a second data output control signal according to the position information of the first data;
the second data reading circuit is used for reading and caching second data in the second matrix; controlling to output the second data according to the second data output control signal;
and the calculation circuit is used for calculating third data according to the first data and the second data.
2. The matrix calculation circuit of claim 1 wherein the first data reading circuit further comprises:
the device comprises a first data cache circuit, a first data sorting circuit and a first control circuit;
the first control circuit is used for generating a first data reading address according to a first address of the first matrix;
the first data cache circuit is used for caching first data read out according to the first data reading address and position information of the first data;
the first data sorting circuit is configured to reorder, according to the first data location information in the first data cache circuit, the location information of the first data and the first data in a manner of location one-to-one correspondence, respectively, where the result of reordering is that data in the same row in the data matrix is still in the same row after being reordered.
3. The matrix calculation circuit according to claim 1 or 2, wherein the second data reading circuit further comprises:
the second data cache circuit, the data selection circuit and the second control circuit;
the second control circuit is used for generating a second data reading address according to the first address of the second matrix;
the second data cache circuit is used for caching second data read out according to the second data reading address;
the data selection circuit is configured to select and output the second data from the second data buffer circuit according to the second data output control signal.
4. The matrix calculation circuit according to claim 3, wherein the generating of the second data output control signal according to the position information of the first data comprises:
the first data sorting circuit is configured to generate the second data output control signal according to column information in the first data location information.
5. The matrix calculation circuit according to claim 4, wherein the data selection circuit for selecting and outputting the second data from the second data buffer circuit in accordance with the second data output control signal comprises:
the data selection circuit is used for selecting second data corresponding to the column information from the second data buffer circuit according to the column information in the second data output control signal and outputting the second data.
6. The matrix computation circuit of any of claims 1-5, wherein the computation circuit comprises:
a computing unit array, wherein the computing unit array comprises a plurality of computing units;
a row of the computing units in the computing unit array receives a row of the second data;
a row of compute units in the array of compute units receives one of the first data.
7. The matrix computation circuit of claim 3, wherein the computation circuit to compute third data from the first data and the second data comprises:
the calculation circuit receives the reordered column of first data output by the first data sorting circuit; receiving at least one row of second data selectively output by the data selection circuit; and calculating third data according to the reordered first data of the column and the second data of the at least one row.
8. The matrix computation circuit of any of claims 1-7, wherein the location information of the first data comprises: a row coordinate and a column coordinate of the first data in the data matrix.
9. A matrix calculation method, comprising:
reading and caching first data in a first matrix and position information of the first data, wherein the first matrix is a compression matrix of a data matrix;
generating a second data output control signal according to the position information of the first data;
reading and caching second data in the second matrix;
controlling to output the second data according to the second data output control signal;
and calculating to obtain third data according to the first data and the second data.
10. A processing core comprising the matrix computation circuit of any of claims 1-8.
CN202010955659.8A 2020-09-11 2020-09-11 Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium Pending CN114168895A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010955659.8A CN114168895A (en) 2020-09-11 2020-09-11 Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
PCT/CN2021/117841 WO2022053032A1 (en) 2020-09-11 2021-09-10 Matrix calculation circuit, method, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010955659.8A CN114168895A (en) 2020-09-11 2020-09-11 Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114168895A true CN114168895A (en) 2022-03-11

Family

ID=80475408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010955659.8A Pending CN114168895A (en) 2020-09-11 2020-09-11 Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114168895A (en)
WO (1) WO2022053032A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10403003B2 (en) * 2017-04-24 2019-09-03 Intel Corporation Compression mechanism
GB2563878B (en) * 2017-06-28 2019-11-20 Advanced Risc Mach Ltd Register-based matrix multiplication
CN109213962B (en) * 2017-07-07 2020-10-09 华为技术有限公司 Operation accelerator
CN113190791A (en) * 2018-08-06 2021-07-30 华为技术有限公司 Matrix processing method and device and logic circuit
US11983616B2 (en) * 2018-10-01 2024-05-14 Expedera, Inc. Methods and apparatus for constructing digital circuits for performing matrix operations
CN109740116A (en) * 2019-01-08 2019-05-10 郑州云海信息技术有限公司 A kind of circuit that realizing sparse matrix multiplication operation and FPGA plate
CN110163338B (en) * 2019-01-31 2024-02-02 腾讯科技(深圳)有限公司 Chip operation method and device with operation array, terminal and chip
CN110851779B (en) * 2019-10-16 2021-09-14 北京航空航天大学 Systolic array architecture for sparse matrix operations

Also Published As

Publication number Publication date
WO2022053032A1 (en) 2022-03-17

Similar Documents

Publication Publication Date Title
CN108133270B (en) Convolutional neural network acceleration method and device
CN110390383B (en) Deep neural network hardware accelerator based on power exponent quantization
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN113743599B (en) Computing device and server of convolutional neural network
CN109840585B (en) Sparse two-dimensional convolution-oriented operation method and system
US20180137084A1 (en) Convolution operation device and method
CN112784973A (en) Convolution operation circuit, device and method
CN110851779A (en) Systolic array architecture for sparse matrix operations
CN111931925A (en) FPGA-based binary neural network acceleration system
CN109726798B (en) Data processing method and device
CN110796229B (en) Device and method for realizing convolution operation
CN114168895A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN113254391A (en) Neural network accelerator convolution calculation and data loading parallel method and device
EP3842954A1 (en) System and method for configurable systolic array with partial read/write
CN114168894A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN114168897A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN115828044B (en) Dual sparsity matrix multiplication circuit, method and device based on neural network
CN117454946A (en) Tensor core architecture system supporting unstructured sparse matrix computation
CN114168896A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN108108189A (en) A kind of computational methods and Related product
JP2006333496A (en) Programmable logic circuit device and information processing system
JP4265345B2 (en) Mobile phone, interleave parameter calculation device, method and program
CN101399978A (en) Reference frame data reading method in hardware decoder and apparatus thereof
CN114077718A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
US11531869B1 (en) Neural-network pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination