CN114168897A

CN114168897A - Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium

Info

Publication number: CN114168897A
Application number: CN202010956493.1A
Authority: CN
Inventors: Not disclosed (inventor name withheld from publication)
Original assignee: Beijing Simm Computing Technology Co ltd
Current assignee: Beijing Simm Computing Technology Co ltd
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2022-03-11

Abstract

Embodiments of the present disclosure disclose a matrix calculation circuit, a matrix calculation method, an electronic device, and a computer-readable storage medium. The matrix calculation circuit includes: a first data reading circuit for reading and caching first data in a first matrix and bitmap data in a bitmap matrix, and for outputting at least one of the first data and the position information indicated by the bitmap data corresponding to the first data; a second data reading circuit for reading and caching second data in a second matrix, and for outputting at least one of the second data according to the position information; and a calculation circuit for performing calculation on the first data and the second data to obtain third data. By using the position information of the plurality of first data that have been read to control the output of the plurality of second data, the matrix calculation circuit solves the technical problems in the prior art that only a single datum can be calculated at a time and that access address calculation is complex during matrix calculation.

Description

Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium

Technical Field

The present disclosure relates to the field of processors, and in particular, to a matrix calculation circuit, a matrix calculation method, an electronic device, and a computer-readable storage medium.

Background

With the development of science and technology, human society is rapidly entering the intelligent era. An important characteristic of the intelligent era is that people obtain more and more kinds of data, the volume of data obtained keeps growing, and the demands on data processing speed keep rising. Chips are the cornerstone of data processing and fundamentally determine people's ability to process data. In terms of application fields, chips follow two main routes: one is the general-purpose chip route, such as the CPU (central processing unit), which provides great flexibility but has lower effective computing power when processing domain-specific algorithms; the other is the special-purpose chip route, such as the TPU (tensor processing unit), which delivers higher effective computing power in certain specific fields but performs poorly, or cannot be used at all, in more general fields that are flexible and varied. Because data in the intelligent era is diverse and enormous in volume, chips are required to be extremely flexible, able to process algorithms from different fields that change from day to day, and to have extremely strong processing capability, able to rapidly process extremely large and sharply growing volumes of data.

In neural network computation, convolution accounts for most of the total operation count, and convolution can be converted into matrix multiplication. Increasing the speed of matrix multiplication therefore increases throughput in neural network tasks, reduces latency, and improves the effective computing power of the chip.

The matrix formed by the data in many neural networks (the data includes the parameter data and the input data in the neural networks) is a sparse matrix, that is, the matrix has a large number of elements with 0 values. In order to reduce the storage capacity and bandwidth occupation of data in the neural network calculation, a sparse matrix is compressed for storage; in order to improve the matrix operation speed, the sparse matrix operation is optimized.

FIG. 1a is a schematic diagram of a matrix multiplication computation in a neural network. As shown in FIG. 1a, M1 is a data matrix, M2 is a parameter matrix, and M is an output matrix. Each datum in a row of M1 is multiplied by the corresponding parameter in a column of M2, and the products are added to obtain one datum in M. In FIG. 1a, either one or both of the matrices M1 and M2 may be sparse.

FIG. 1b shows a schematic diagram of matrix compression. To store a sparse matrix, a general compression method can be employed in which only the non-0 elements are stored. Along with the value of each non-0 element, its position information in the matrix is stored, i.e., the relative coordinates X and Y of the element in the matrix, where X represents the row number and Y represents the column number. In this method, the data and coordinates are stored together as one data structure, which serves as the unit of storage. As shown in FIG. 1b, taking an MxN matrix as an example, the MxN matrix on the left is compressed into the compression matrix on the right, and each data structure in the compression matrix represents one non-0 datum of the left matrix together with its coordinates in that matrix.

In a sparse matrix, most of the elements are 0, and the 0 elements do not need to be stored, so the above compression method effectively reduces the storage required for the matrix. FIG. 1c is a schematic diagram of an example of compressing a matrix with this method. For a 16x16 sparse matrix in which only a, b, c and d are non-0 elements, only the values and coordinates of these four elements need to be stored after compression, thereby saving storage space.
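The coordinate-based compression described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function names `compress_coordinates` and `decompress_coordinates`, and the positions and values of the four non-0 elements, are hypothetical and do not come from the figures.

```python
from typing import List, Tuple

# One compressed entry: (X = row number, Y = column number, value).
Entry = Tuple[int, int, int]


def compress_coordinates(matrix: List[List[int]]) -> List[Entry]:
    """Keep only the non-0 elements, each paired with its coordinates."""
    return [
        (x, y, value)
        for x, row in enumerate(matrix)
        for y, value in enumerate(row)
        if value != 0
    ]


def decompress_coordinates(entries: List[Entry], rows: int, cols: int) -> List[List[int]]:
    """Rebuild the dense matrix from the compressed entries."""
    dense = [[0] * cols for _ in range(rows)]
    for x, y, value in entries:
        dense[x][y] = value
    return dense


# Hypothetical 16x16 sparse matrix with four non-0 elements
# (positions and values are made up for illustration).
dense = [[0] * 16 for _ in range(16)]
for (x, y), value in {(0, 3): 7, (5, 5): 2, (9, 1): 4, (15, 12): 9}.items():
    dense[x][y] = value

entries = compress_coordinates(dense)
restored = decompress_coordinates(entries, 16, 16)
print(len(entries), restored == dense)
```

Note that every access to an element in this format has to go through its stored X and Y coordinates, which is the source of the address-calculation overhead discussed next.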

When performing the matrix operation of M1xM2, the matrix after compression is used as the matrix used for actual access. However, the above technical solutions have the following disadvantages: 1. when matrix operation is carried out, the utilization rate of data is low, and usually only independent operation units can be used for calculating single data; 2. according to the data coordinates of the compression matrix, the calculation of the access address is complex, and the performance is influenced.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In order to solve the above technical problems in the prior art, the embodiment of the present disclosure provides the following technical solutions:

in a first aspect, an embodiment of the present disclosure provides a matrix calculation circuit, including:

the first data reading circuit is used for reading and caching first data in the first matrix and bitmap data in the bitmap matrix; outputting at least one of the first data and position information indicated by bitmap data corresponding to the first data; the first matrix is a matrix formed by non-0 data in a data matrix, and the bitmap data in the bitmap matrix and the data in the data matrix are in one-to-one correspondence in position;

the second data reading circuit is used for reading and caching second data in the second matrix; outputting at least one of the second data according to the location information;

and the calculation circuit is used for performing calculation on the first data and the second data to obtain third data.

Further, the first data reading circuit further includes:

the device comprises a first data cache circuit, a bitmap matrix cache circuit, a first data sorting circuit and a first control circuit;

the first control circuit is used for generating a first data reading address according to a first address of the first matrix; generating a position information reading address according to the first address of the bitmap matrix;

the first data cache circuit is used for caching a plurality of first data read out according to the first data reading address;

the bitmap matrix cache circuit is used for caching the bitmap data read out according to the position information reading address;

and the first data sorting circuit is used for re-sorting the first data according to the bitmap data in a column-by-column manner in a position one-to-one correspondence manner, wherein the re-sorting result indicates that the data in the same row in the data matrix is still in the same row and the data in different rows are still not in the same row, and if two adjacent first data are in the same row in the sorting process, 0 is supplemented to other rows in the previous column.

Further, the bitmap matrix buffer circuit is further configured to:

transmitting position information indicated by the bitmap data corresponding to the first data to the second data reading circuit.

Further, the second data reading circuit further includes:

a second data buffer circuit and a second control circuit;

the second control circuit is used for generating a second data reading address according to the first address of the second matrix;

the second data buffer circuit is used for buffering second data read out according to the second data reading address.

Further, the second data reading circuit further includes:

and a switch circuit for controlling output of the second data in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data.

Further, the switch circuit is configured to control a plurality of second data outputs in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data, and includes:

the switch circuit controls to output at least one row of the second data corresponding to the column information in the second data buffer circuit according to the column information of the position information indicated by the bitmap data corresponding to the first data.

Further, the computation circuit includes:

a computing unit array, wherein the computing unit array comprises a plurality of computing units;

a row of the computing units in the computing unit array receives a row of the second data;

a column of compute units in the array of compute units receives a column of first data in the first data.

Further, the calculating circuit is configured to calculate third data according to the first data and the second data, and includes:

the calculation circuit receives the reordered column of first data output by the first data sorting circuit; receiving at least one row of second data corresponding to the column of first data output by the switch circuit; and calculating to obtain third data according to the column of first data and the at least one row of second data.

In a second aspect, an embodiment of the present disclosure provides a matrix calculation method, including:

reading and caching first data in a first matrix and bitmap data in a bitmap matrix, wherein the first matrix is a matrix formed by non-0 data in a data matrix, and the bitmap data in the bitmap matrix and the data in the data matrix are in one-to-one correspondence in position;

outputting at least one of the first data and position information indicated by bitmap data corresponding to the first data;

reading and caching second data in the second matrix;

outputting at least one of the second data according to the location information;

and performing calculation on the first data and the second data to obtain third data.

Further, the reading and buffering the first data in the first matrix and the bitmap data in the bitmap matrix includes:

generating a first data reading address according to the first address of the first matrix;

generating a position information reading address according to the first address of the bitmap matrix;

caching a plurality of first data read out according to the first data reading address;

caching the bitmap data read out according to the position information reading address;

and reordering the first data according to the rows in a position one-to-one correspondence mode according to the bitmap data, wherein the reordering result is that the data in the same row in the data matrix are still in the same row, the data in different rows are still not in the same row, and if two adjacent first data are in the same row in the ordering process, 0 is supplemented to other rows in the previous column.

Further, the method further comprises:

Further, the reading and buffering the second data in the second matrix includes:

generating a second data reading address according to the first address of the second matrix;

and caching the second data read according to the second data reading address.

Further, the outputting at least one piece of the second data according to the position information includes:

output of second data in the second data buffer circuit is controlled in accordance with position information indicated by the bitmap data corresponding to the first data.

Further, the controlling of the plurality of second data outputs in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data includes:

and controlling to output at least one row of the second data corresponding to the column information in the second data buffer circuit according to the column information of the position information indicated by the bitmap data corresponding to the first data.

Further, the performing the calculation on the first data and the second data to obtain third data includes:

receiving the reordered column of first data; receiving at least one row of second data corresponding to the column of first data; and calculating to obtain third data according to the column of first data and the at least one row of second data.

In a third aspect, an embodiment of the present disclosure further provides a processing core, where the processing core includes at least one matrix calculation circuit in the first aspect, a decoding unit, and a storage device.

In a fourth aspect, an embodiment of the present disclosure further provides a chip, where the chip includes at least one processing core in the third aspect.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors configured to execute the computer-readable instructions, such that the processors, when executing the instructions, implement the matrix calculation method of any implementation of the foregoing second aspect.

In a sixth aspect, the present disclosure provides a non-transitory computer-readable storage medium, which stores computer instructions for causing a computer to execute the matrix calculation method according to any implementation of the foregoing second aspect.

In a seventh aspect, an embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, may perform the matrix calculation method of any implementation of the foregoing second aspect.

In an eighth aspect, the embodiments of the present disclosure provide a computing device, which includes one or more chips described in the fourth aspect.

The foregoing is a summary of the present disclosure, and for the purposes of promoting a clear understanding of the technical means of the present disclosure, the present disclosure may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

FIGS. 1a-1c are schematic diagrams of the prior art of the present disclosure;

FIG. 2 is a schematic structural diagram of a matrix calculation circuit provided in an embodiment of the present disclosure;

FIGS. 3a-3b are schematic diagrams of the generation of the first matrix and the bitmap matrix;

FIG. 4 is a schematic structural diagram of a first data reading circuit according to an embodiment of the disclosure;

FIGS. 5a-5b are schematic diagrams of an example of reordering of a first data read circuit according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a second data reading circuit according to an embodiment of the disclosure;

FIGS. 7a-7e are schematic diagrams of an example application of an embodiment of the present disclosure;

FIG. 8 is a flowchart of a matrix calculation method provided in the embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

FIG. 2 is a schematic diagram of a matrix calculation circuit provided in an embodiment of the present disclosure. The matrix calculation circuit (EU) 200 provided in the present embodiment includes:

a first data reading circuit (LD_M1) 201 for reading and buffering first data in the first matrix and bitmap data in the bitmap matrix; outputting at least one of the first data and position information indicated by bitmap data corresponding to the first data; the first matrix is a matrix formed by non-0 data in a data matrix, and the bitmap data in the bitmap matrix and the data in the data matrix are in one-to-one correspondence in position;

a second data reading circuit (LD_M2) 202 for reading and buffering second data in a second matrix; outputting at least one of the second data according to the position information;

a calculation circuit 203 for performing a calculation on the first data and the second data to obtain third data.

Illustratively, the first data reading circuit reads and buffers first data in the first matrix according to a read address of the first data, and the read address of the first data is generated according to a storage head address of the first matrix; and the second data reading circuit reads and buffers the second data in the second matrix according to the reading address of the second data, and the reading address of the second data is generated according to the storage head address of the second matrix. The storage head address of the first matrix and the storage head address of the second matrix are obtained through an instruction decoding circuit ID (instruction decoder), and the instruction decoding circuit is used for decoding a matrix calculation instruction to obtain the storage head address of the first matrix, the storage head address of the second matrix, the storage head address of the bitmap matrix, the size of the first matrix and the size of the second matrix.

Illustratively, the matrix calculation instruction includes an instruction type, a storage head address of the first matrix, a storage head address of the second matrix, a storage head address of the bitmap matrix, and parameters such as a size of the first matrix, a size of the second matrix, and a size of the bitmap matrix. In one embodiment, the instruction type is a multiplication instruction of a matrix, the first matrix is a compression matrix of a data matrix in the neural network convolution calculation, and the second matrix is a parameter matrix in the neural network convolution calculation; wherein the data matrix and/or the second matrix is a sparse matrix having a large number of elements with values of 0. It is understood that the memory head address of the matrix and the size parameter of the matrix (such as the number of rows and columns of the matrix) in the matrix calculation instruction may be represented in the form of register addresses, and the instruction decoding circuit acquires corresponding data from the corresponding register addresses.

In the embodiment of the present disclosure, the first data reading circuit 201 receives the first address of the first matrix decoded by the instruction decoding circuit, and generates a reading address of the first data according to the first address; optionally, a plurality of first data in the first matrix are read at one time according to the read address of the first data; the first data reading circuit 201 receives the first address of the bitmap matrix decoded by the instruction decoding circuit, and generates a reading address of bitmap data according to the first address; optionally, the bitmap data in the bitmap matrix is read at one time according to the read address of the bitmap data.

For example, the maximum amount of first data read at one time is preset to be the first data corresponding to K columns of the data matrix. The first data reading circuit generates a read address of the first data according to the head address of the first matrix and K, and reads and caches, at one time, a plurality of first data corresponding to the K columns of data from the first matrix. Similarly, the first data reading circuit generates a read address of the position information according to the head address of the bitmap matrix and K, and reads and caches, at one time, a plurality of bitmap data corresponding to the K columns of data from the bitmap matrix. After obtaining the first data and the bitmap data, the first data reading circuit outputs at least one of the first data and the position information indicated by the bitmap data corresponding to the first data.

FIG. 3a is a schematic diagram of the data matrix M1_O, the first matrix M1, and the bitmap matrix M1_map according to an embodiment of the disclosure. As shown in FIG. 3a, the data matrix M1_O is an M × K sparse matrix, and M1 is the compressed matrix of M1_O, which stores only the non-0 data in M1_O, either in column-first or in row-first order. The following description uses column-first storage. The bitmap matrix M1_map is the same size as M1_O, i.e., both have the same number of rows and columns; however, each datum in the bitmap matrix is only 1 bit, and each datum in the bitmap matrix corresponds one-to-one in matrix position with a datum in M1_O: if the datum in M1_O is 0, the corresponding datum in M1_map is 0, and if the datum in M1_O is not 0, the corresponding datum in M1_map is 1. The bitmap matrix and the first matrix may be generated before the matrix operation is performed, and their generation is not described further here.

FIG. 3b is an example of the data matrix M1_O, the first matrix M1, and the bitmap matrix M1_map in an embodiment of the disclosure. As shown in FIG. 3b, the data matrix M1_O is a 2 × 4 sparse matrix, and the compressed matrix M1 contains only the non-0 data in M1_O, stored in column-first order of the data matrix, so the first datum in M1 is the non-0 data 1 in the first column, the second is the non-0 data 2 in the second column, the third is the non-0 data 3 in the third column, and the fourth is the non-0 data 4 in the fourth column. The bitmap matrix M1_map of M1_O has the same size as M1_O, but each datum is only 1 bit, represented by 0 or 1: the data in M1_map corresponding to data 1, 2, 3 and 4 in M1_O are 1, and all other data are 0. Thus, if storing one datum of the data matrix M1_O originally requires 1 byte, i.e., 8 bits, then the 2 × 4 matrix M1_O requires 64 bits. When M1_O is instead expressed by M1 and M1_map, the 4 data in M1 require 4 × 8 bits = 32 bits, and M1_map, at 1 bit per datum, requires 8 bits, for a total of only 40 bits. In practice, the data matrix is typically much larger than in this example, so the saving in storage space can be significant.
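The bitmap compression and the storage arithmetic of this example can be checked with a short Python sketch. The helper names `compress_bitmap` and `storage_bits` are hypothetical, and the row in which each non-0 value sits is an assumption (the text only states which column each value is in).

```python
from typing import List, Tuple


def compress_bitmap(data_matrix: List[List[int]]) -> Tuple[List[int], List[List[int]]]:
    """Split a dense matrix into (M1, M1_map).

    M1 holds the non-0 values in column-first order; M1_map has the same
    shape as the input and holds 1 where the input is non-0, else 0.
    """
    rows, cols = len(data_matrix), len(data_matrix[0])
    m1_map = [[1 if value != 0 else 0 for value in row] for row in data_matrix]
    m1 = [
        data_matrix[r][c]
        for c in range(cols)          # column-first traversal
        for r in range(rows)
        if data_matrix[r][c] != 0
    ]
    return m1, m1_map


def storage_bits(m1: List[int], m1_map: List[List[int]], bits_per_value: int = 8) -> int:
    """Bits needed for the compressed form: values plus 1 bit per bitmap entry."""
    bitmap_bits = sum(len(row) for row in m1_map)
    return len(m1) * bits_per_value + bitmap_bits


# 2 x 4 data matrix with non-0 values 1, 2, 3, 4 in columns 0..3.
# The row of each value is assumed for illustration.
m1_o = [[1, 0, 3, 0],
        [0, 2, 0, 4]]

m1, m1_map = compress_bitmap(m1_o)
dense_bits = len(m1_o) * len(m1_o[0]) * 8
compressed_bits = storage_bits(m1, m1_map)
print(m1, m1_map, dense_bits, compressed_bits)
```

With 8-bit values this gives 64 bits for the dense matrix and 32 + 8 = 40 bits for M1 plus M1_map, matching the figures in the text.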

In the embodiment of the present disclosure, the second data reading circuit 202 receives the first address of the second matrix decoded by the instruction decoding circuit, generates a read address of the second data according to the first address, and reads a plurality of second data in the second matrix at one time according to that read address. For example, the maximum amount of second data read at one time is preset to be K rows; if the second matrix is not a compression matrix, these are K rows of the second matrix. The second data reading circuit generates the read address of the second data according to the first address of the second matrix and K, and reads and buffers K rows of second data from the second matrix at one time. The output of the second data is then controlled according to the received position information of the first data, so that all or part of the second data is output.

In the embodiment of the present disclosure, the calculation circuit receives the first data transmitted from the first data reading circuit and the second data transmitted from the second data reading circuit, and calculates to obtain third data, where the third data is one or more.
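As a purely functional illustration of how the three circuits cooperate, the following Python sketch multiplies a bitmap-compressed first matrix by a dense second matrix. It models only the arithmetic, not the hardware timing or the computing unit array; the function name `sparse_times_dense` and the example values of M2 are assumptions made for this sketch.

```python
from typing import List


def sparse_times_dense(m1: List[int],
                       m1_map: List[List[int]],
                       m2: List[List[int]]) -> List[List[int]]:
    """Compute M1_O x M2 using only the compressed values and the bitmap.

    m1     : non-0 values of the data matrix in column-first order
    m1_map : bitmap with the shape of the data matrix (1 = non-0)
    m2     : dense second matrix with one row per data-matrix column
    """
    rows, k = len(m1_map), len(m1_map[0])
    n = len(m2[0])
    out = [[0] * n for _ in range(rows)]
    next_value = 0                        # next unread entry of m1
    for col in range(k):                  # same column-first order as m1
        for row in range(rows):
            if not m1_map[row][col]:
                continue                  # 0 element: nothing read, nothing computed
            value = m1[next_value]
            next_value += 1
            selected_row = m2[col]        # column position selects the row of M2
            for j in range(n):
                out[row][j] += value * selected_row[j]
    return out


# Hypothetical inputs: the data matrix [[1, 0, 3, 0], [0, 2, 0, 4]] in
# compressed form, and an arbitrary 4 x 2 second matrix.
m1 = [1, 2, 3, 4]
m1_map = [[1, 0, 1, 0],
          [0, 1, 0, 1]]
m2 = [[1, 2],
      [3, 4],
      [5, 6],
      [7, 8]]

third_data = sparse_times_dense(m1, m1_map, m2)
print(third_data)
```

Only the rows of M2 that pair with a non-0 first datum are ever touched, which is the role the position information plays in controlling the output of the second data.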

As shown in fig. 4, in order to implement the function of the first data reading circuit, optionally, the first data reading circuit further includes:

a first data buffer circuit 401, a bitmap matrix buffer circuit 402, a first data sorting circuit 403, and a first control circuit 404;

the first control circuit 404 is configured to generate a first data read address according to a first address of the first matrix; generating a position information reading address according to the first address of the bitmap matrix;

the first data buffer circuit 401 is configured to buffer a plurality of first data read according to the first data read address;

the bitmap matrix buffer circuit 402 is configured to buffer the bitmap data read out according to the position information read address;

the first data sorting circuit 403 is configured to reorder the first data by rows according to the bitmap data, with positions in one-to-one correspondence, where the result of the reordering is that data in the same row of the data matrix are still in the same row and data in different rows are still in different rows; if two adjacent first data are in the same row during sorting, the other rows of the previous column are filled with 0.

Optionally, the first control circuit 404 receives the first address of the bitmap matrix decoded by the instruction decoding circuit and a preset parameter K. Optionally, the first control circuit includes a first read control circuit CL1 and a first address generating circuit AG1, where the first read control circuit CL1 receives the first address of the bitmap matrix decoded by the instruction decoding circuit and the preset parameter K, and controls AG1 to generate a position information read address Addr0, so that the first data reading circuit can read, at one time according to Addr0, the bitmap data in the bitmap matrix that indicate the positions of data in the data matrix.

Optionally, the first control circuit 404 receives the first address of the first matrix decoded by the instruction decoding circuit and the number of non-0 bitmap data in the K columns of bitmap data. Optionally, the first control circuit includes a first read control circuit CL1 and a first address generating circuit AG1, where CL1 receives the first address of the first matrix decoded by the instruction decoding circuit and the number of non-0 bitmap data in the K columns of bitmap data, and controls AG1 to generate a first data read address Addr1, so that the first data reading circuit can read, at one time according to Addr1, the first data corresponding to the K columns in the first matrix. Optionally, the first control circuit 404 may also read the first data at the same time as the K columns of bitmap data, in which case a preset number of first data is read. Further, to prevent the number of first data read from being smaller than the number of non-0 data in the K columns of bitmap data, M × K first data may be read at one time, where M is the number of rows of the data matrix, so that M × K is not smaller than the number of non-0 data in the K columns of bitmap data. In this case, the first read control circuit CL1 receives the first address of the first matrix decoded by the instruction decoding circuit and the set parameter M × K, and controls AG1 to generate the first data read address Addr1, so that the first data reading circuit can read M × K first data in the first matrix at one time according to Addr1.
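The first variant above, in which the number of first data to read is derived from the count of non-0 bitmap data in K columns, can be sketched as follows. This is a software model under stated assumptions, not the circuit itself: addresses are modelled as element offsets into a column-first list, and the names `count_nonzero_in_columns` and `read_first_data_batches` are hypothetical.

```python
from typing import List, Tuple


def count_nonzero_in_columns(m1_map: List[List[int]], first_col: int, k: int) -> int:
    """Number of 1s in bitmap columns first_col .. first_col + k - 1."""
    last_col = min(first_col + k, len(m1_map[0]))
    return sum(row[c] for row in m1_map for c in range(first_col, last_col))


def read_first_data_batches(m1: List[int],
                            m1_map: List[List[int]],
                            k: int) -> List[Tuple[int, List[int]]]:
    """Split M1 into batches, one per group of K bitmap columns.

    Because M1 is stored column-first, the first data belonging to K
    consecutive columns are contiguous, so each batch is one slice whose
    length equals the number of 1s in those K bitmap columns.
    """
    batches = []
    offset = 0                                   # element offset of the next unread first datum
    for first_col in range(0, len(m1_map[0]), k):
        count = count_nonzero_in_columns(m1_map, first_col, k)
        batches.append((first_col, m1[offset:offset + count]))
        offset += count
    return batches


# Hypothetical data: a 2 x 4 bitmap with four 1s, read K = 2 columns at a time.
m1 = [1, 2, 3, 4]
m1_map = [[1, 0, 1, 1],
          [0, 0, 0, 1]]

batches = read_first_data_batches(m1, m1_map, k=2)
print(batches)
```

The second variant in the text avoids the dependency on the bitmap count by always reading M × K first data, which is an upper bound on the number of non-0 data in K columns.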

Optionally, the first data buffer circuit 401 further includes a first memory or a first storage area DB11 for buffering first data, which is buffered in the DB11 after being read out from the first matrix.

Optionally, the bitmap matrix buffer circuit 402 further includes a second memory or a second storage area DB10 for buffering bitmap data, and the bitmap data is buffered in the DB10 after being read out from the bitmap matrix.

Optionally, the first data sorting circuit 403 further includes a reordered first data buffer circuit DRDB. The DRDB is used for caching the reordered first data. Optionally, for example, the reordering is performed in column-first order, that is, in order of increasing Y coordinate and then increasing X coordinate, while ensuring that first data in the same row of the data matrix remain in the same row and first data in different rows remain in different rows. The reordered first data are buffered in the DRDB; after sorting, some rows in some columns may lack data, and these positions are filled with 0. FIG. 5a is a schematic diagram of an example of reordering. As shown in FIG. 5a, the data matrix M1_O is a sparse matrix, the first matrix is the compressed matrix M1 of the data matrix, M1 stores the non-0 data 1, 2, 3 and 4 of M1_O in column-first order, and M1_map is the bitmap matrix of the data matrix. The first data reading circuit reads out 4 columns of bitmap data at one time and stores them in DB10; since there are 4 "1"s in DB10 in total, the first data reading circuit reads the 4 first data 1, 2, 3 and 4 in the first matrix at one time. The 4 first data are then reordered by traversing the non-0 bitmap data in column-first order: the first 1 is in row 0, so the corresponding first data 1 is placed at position (0,0) in the DRDB; the second 1 is in row 1, so the corresponding first data 2 is placed at position (1,0); the third 1 is in row 0, so the corresponding first data 3 is placed at position (0,1); and the fourth 1 is in row 1, so the corresponding first data 4 is placed at position (1,1). As a result, 1 and 3, which are in the same row of the data matrix, are still in the same row after reordering, 2 and 4, which are in the same row, are still in the same row, and 1 and 3 are in a different row from 2 and 4.

FIG. 5b is a schematic diagram of another example of reordering. As shown in FIG. 5b, column 0 is processed first. Row 0 of column 0 of M1_map is 1, so the 1 in M1 is stored at position (0,0) in the DRDB; row 1 of column 0 of M1_map is 0, so position (1,0) of the DRDB temporarily stores no data. Column 1 of M1_map is all 0 and is skipped. Row 0 of column 2 of M1_map is 1, so the 2 in M1 is stored at the next position of row 0 in the DRDB, namely (0,1); because row 0 now holds two consecutive first data 1 and 2 from the same row, row 1 of column 0 is filled with 0. Row 1 of column 2 of M1_map is 0, so position (1,1) of the DRDB temporarily stores no data. Row 0 of column 3 of M1_map is 1, so the 3 in M1 is stored at position (0,2) in the DRDB; because row 0 now holds two consecutive first data 2 and 3 from the same row, row 1 of column 1 is filled with 0. Row 1 of column 3 of M1_map is 1, so the next first data in M1 is stored at position (1,2) in the DRDB, which completes the reordered first data in the DRDB.
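
As a non-limiting illustration, the reordering walked through for FIG. 5a and FIG. 5b can be sketched in Python as follows. The sketch assumes one possible rule that is consistent with both examples: the bitmap is traversed in column-first order, each first data is placed in its own row of the current buffer column, and a new buffer column is started (with empty slots of the previous one filled with 0) whenever the row is already occupied. The function name, the returned coordinate table, the bitmaps, and the fourth value used for the FIG. 5b layout are assumptions made for illustration, since the figures themselves are not reproduced here.

```python
def reorder_first_data(first_data, bitmap):
    """Reorder compressed first data into a DRDB-like buffer.

    Returns (drdb, coords). drdb[y][c] is the value in buffer row y and
    buffer column c (0 where padding was inserted); coords[y][c] is the
    bitmap column that value came from, or None for padding.
    """
    num_rows, num_cols = len(bitmap), len(bitmap[0])
    columns = []                      # finished buffer columns
    current = [None] * num_rows       # slots hold (value, bitmap_column)
    values = iter(first_data)
    for x in range(num_cols):         # column-first traversal of the bitmap
        for y in range(num_rows):
            if not bitmap[y][x]:
                continue
            if current[y] is not None:
                # Row y already holds data in this buffer column: close the
                # column (empty slots become 0 later) and start a new one.
                columns.append(current)
                current = [None] * num_rows
            current[y] = (next(values), x)
    if any(slot is not None for slot in current):
        columns.append(current)
    drdb = [[col[y][0] if col[y] else 0 for col in columns]
            for y in range(num_rows)]
    coords = [[col[y][1] if col[y] else None for col in columns]
              for y in range(num_rows)]
    return drdb, coords


# Layout consistent with the FIG. 5a walkthrough (assumed bitmap).
drdb_a, coords_a = reorder_first_data([1, 2, 3, 4], [[1, 0, 1, 0],
                                                     [0, 1, 0, 1]])
# Layout following the FIG. 5b walkthrough (fourth value 4 is assumed).
drdb_b, coords_b = reorder_first_data([1, 2, 3, 4], [[1, 0, 1, 1],
                                                     [0, 0, 0, 1]])
```

The coordinate table returned alongside the buffer plays the role of the position information: for every buffered first data it records the bitmap column from which the data originated.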

After reordering, the first data reading circuit outputs the bitmap data DO0 and the first data DO1, where DO1 is some or all of the plurality of first data and DO0 is the bitmap data corresponding to DO1. Optionally, the bitmap matrix buffer circuit is further configured to transmit the position information indicated by the bitmap data corresponding to the first data to the second data reading circuit. Optionally, the position information indicated by the bitmap data is the column coordinate of the bitmap data in the bitmap matrix.

As shown in fig. 6, in order to implement the function of the second data reading circuit, optionally, the second data reading circuit further includes:

a second data buffer circuit 601 and a second control circuit 602;

the second control circuit 602 is configured to generate a second data read address according to a first address of the second matrix;

the second data buffer circuit 601 is configured to buffer the second data read according to the second data read address.

Optionally, the second control circuit 602 receives the first address of the second matrix decoded by the instruction decoding circuit, a preset parameter K, and a size parameter of the second matrix, for example that the second matrix includes N rows of second data. Optionally, the second control circuit includes a second read control circuit CL2 and a second address generating circuit AG2. CL2 receives the first address of the second matrix decoded by the instruction decoding circuit, the preset parameter K, the size parameter of the second matrix, and the like, and controls AG2 to generate a second data read address Addr2, so that the second data reading circuit can read K rows of second data from the second matrix at a time according to Addr2.
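
As a non-limiting illustration of how a read address for K rows of second data might be derived from the first address and the size parameters, the following Python sketch assumes that the second matrix is stored row by row in contiguous memory. The storage layout, the element size, and all names are assumptions made for this illustration; the disclosure does not specify them.

```python
def second_data_row_addresses(base_addr, block_index, k, row_len,
                              elem_size, total_rows):
    """Start address of each row in the block_index-th group of K rows.

    Assumes a row-major, contiguous layout (an illustrative assumption).
    """
    first_row = block_index * k
    last_row = min(first_row + k, total_rows)   # the last group may be short
    return [base_addr + r * row_len * elem_size
            for r in range(first_row, last_row)]


# Hypothetical values: 4 x 2 second matrix, 2-byte elements, K = 4.
addrs = second_data_row_addresses(0x1000, 0, 4, 2, 2, 4)
```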

Optionally, the second data buffer circuit 601 includes a second data memory or a second data storage area, the size of which is the size of K rows of second data, and the read second data is buffered in the second data memory or the second data storage area row by row according to the position of the second data in the second matrix.

The second data reading circuit outputs all or part of the read second data according to the position information indicated by the bitmap data. Optionally, the second data reading circuit further includes:

a switch circuit 603 for controlling output of the second data in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data.

Optionally, the switch circuit 603 includes a switch control circuit SC and a switch array SW. The switch control circuit SC is configured to receive the position information indicated by the bitmap data and to generate a switch signal for the switch array. After receiving the switch signal, the switch array SW turns on the switch corresponding to the switch signal so as to output the corresponding second data.

Optionally, the position information includes column information, which represents the column coordinate, in the data matrix, of the data corresponding to the first data. The switch circuit controls, according to the column information in the position information indicated by the bitmap data corresponding to the first data, the output of at least one row of second data in the second data buffer circuit corresponding to the column information. Specifically, after receiving the position information, the switch control circuit SC obtains the column information in it, generates row switch information corresponding to the column information, and turns on the corresponding switches of the switch array, thereby outputting the second data corresponding to the row switch information. The output second data are one or more rows of the second data buffered in the second data reading circuit.
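
As a non-limiting illustration, the selection performed by the switch circuit can be modelled in Python as follows. The sketch assumes that a column coordinate c of a first data selects row c of the buffered second data, which is how the two operands of a matrix product pair up. The function names, the use of None for a padded slot, and the buffer contents are assumptions made for illustration.

```python
def switch_signals(column_info, k):
    """One on/off signal per buffered row: True if that row must be output."""
    wanted = {c for c in column_info if c is not None}
    return [row in wanted for row in range(k)]


def switch_output(second_buffer, column_info):
    """Give each row of computing units the buffered row its column
    coordinate selects; padded slots (None) receive no second data."""
    return [second_buffer[c] if c is not None else None for c in column_info]


# Hypothetical buffer holding K = 4 rows of second data.
second_buffer = [[10, 11], [20, 21], [30, 31], [40, 41]]
signals = switch_signals([2, None], 4)
selected = switch_output(second_buffer, [2, None])
```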

As shown in fig. 2, the calculation circuit 203 includes:

a computing unit array PUA including a plurality of computing units PU_1,1, PU_1,2, ..., PU_M,N.

Optionally, the calculating circuit 203 receives a reordered column of first data output by the first data sorting circuit; receiving at least one row of second data corresponding to the column of first data output by the switch circuit; and calculating to obtain third data according to the column of first data and the at least one row of second data.

Specifically, each first data in the reordered column output by the first data sorting circuit is sent to one row of computing units in the calculation circuit. For example, if the column includes two first data, the first data in row 0 is output to every computing unit in row 0 of the array, and the first data in row 1 is output to every computing unit in row 1. The switch circuit outputs one or more rows of second data corresponding to the column of first data output by the first data sorting circuit. If the column includes only 1 first data, the switch circuit selects and outputs 1 row of second data. The number of rows of second data output by the switch circuit depends on the column information of the first data output by the first data sorting circuit: the switch circuit outputs as many rows of second data as there are distinct column coordinates. Each computing unit participating in the calculation therefore receives two inputs, one first data and one second data, and performs the operation specified by the type of the calculation instruction to obtain a result as third data; the plurality of computing units thus obtain and output a plurality of third data. This process is repeated, with each computing unit accumulating its results, until all first data and second data have been read. The result is the output matrix, in which the value of each element is the accumulated result of the corresponding computing unit.
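
As a non-limiting illustration, one pass of the computing unit array can be sketched in Python as follows, assuming that the calculation type is multiply-accumulate. The function name, the use of None to mark a row that receives no second data, and the numeric values are assumptions made for illustration.

```python
def pu_array_step(acc, first_column, second_rows):
    """One multiply-accumulate pass; acc[r][c] is the running sum of PU_r,c.

    first_column[r] is broadcast to every computing unit in row r;
    second_rows[r][c] is the second data delivered to PU_r,c.
    """
    for r, a in enumerate(first_column):
        row = second_rows[r]
        if row is None:               # padded slot: this row of units is idle
            continue
        for c, b in enumerate(row):
            acc[r][c] += a * b
    return acc


# Hypothetical 2 x 2 array accumulating over two passes.
acc = [[0, 0], [0, 0]]
pu_array_step(acc, [2, 0], [[5, 6], None])      # only row 0 is active
pu_array_step(acc, [3, 4], [[1, 1], [7, 8]])    # both rows are active
```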

Figs. 7a to 7e show an example of the calculation process of the matrix calculation circuit in the above embodiment. Fig. 7a shows the matrix multiplication to be performed by the matrix calculation circuit, where M1_O is the data matrix, M2 is the second matrix, and M is the third matrix obtained by multiplying M1_O by M2.

M1_O is stored in the form of a compressed matrix: as shown in Fig. 7b, M1_O is compressed to generate the first matrix M1, which is stored. Let K = 4; that is, during the calculation, K columns of bitmap data of the bitmap matrix M1_map are read each time, the first data in M1 corresponding to 4 columns of the data matrix M1_O are read each time, and 4 rows of second data in the second matrix are read each time. In this example, all data in M1 and M2 are therefore read and buffered at once. As shown in Fig. 7b, the first data reading circuit reads the first data of the entire first matrix M1 into the first data buffer circuit at one time and reads the entire bitmap matrix into the bitmap matrix buffer circuit at one time; reordering by the first data sorting circuit yields the storage order in the DRDB shown in Fig. 7b.
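
As a non-limiting illustration, the compression of a data matrix into a first matrix and a bitmap matrix can be sketched in Python as follows. The function name is an assumption, and the values of M1_O used here are inferred from the first data and column coordinates given in the description of Fig. 7d and Fig. 7e rather than taken from the figure itself.

```python
def compress_column_first(data_matrix):
    """Return (first_matrix, bitmap) for a sparse data matrix.

    first_matrix lists the non-0 data in column-first order; bitmap has a 1
    wherever the data matrix holds a non-0 value and a 0 elsewhere.
    """
    num_rows, num_cols = len(data_matrix), len(data_matrix[0])
    bitmap = [[1 if data_matrix[y][x] != 0 else 0 for x in range(num_cols)]
              for y in range(num_rows)]
    first_matrix = [data_matrix[y][x]
                    for x in range(num_cols)      # columns first
                    for y in range(num_rows)
                    if data_matrix[y][x] != 0]
    return first_matrix, bitmap


# Data matrix inferred from the worked example (an assumption).
m1_o = [[1, 0, 3, 0],
        [0, 2, 0, 4]]
m1, m1_map = compress_column_first(m1_o)
```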

Fig. 7c is an overall schematic diagram of matrix calculation using the matrix calculation circuit. Bitmap data of M1_map are read in units of K = 4 columns, together with the first data in M1 corresponding to 4 columns of M1_O; since the data matrix M1_O has 4 columns in total, the entire M1_map and the entire M1 are read and buffered into the first data reading circuit LD_M1 at one time. After being read, the data are reordered, and the reordered first data are stored in the DRDB of LD_M1. The data of M2 are read in units of K = 4 rows and buffered in the second data reading circuit LD_M2; since M2 has 4 rows in total in this example, the entire M2 is read and buffered in LD_M2 at one time. The 4 computing units of the computing unit array then calculate and output the 2 × 2 output matrix M, each element of which corresponds to the accumulated value of the output data of one computing unit.

Fig. 7d is a schematic diagram of the first calculation. The calculation circuit obtains the first column of first data from the DRDB of LD_M1; this column comprises the 1 of row 0 and the 2 of row 1. The 1 of row 0 is input to the row 0 computing units PU_0,0 and PU_0,1 of the calculation circuit, and the 2 of row 1 is input to the row 1 computing units PU_1,0 and PU_1,1. LD_M1 sends the column coordinates 0 and 1 of the bitmap data, buffered in DB10, that correspond to the first column of first data to LD_M2. According to the column coordinates 0 and 1, the switch circuit of LD_M2 selects and outputs row 0 and row 1 of the second data buffered in LD_M2. The second data of row 0, which comprise 1 and 2, are input to the row 0 computing units: second data 1 to PU_0,0 and second data 2 to PU_0,1. The second data of row 1, which also comprise 1 and 2, are input to the row 1 computing units: second data 1 to PU_1,0 and second data 2 to PU_1,1. Each computing unit then independently performs a multiply-accumulate calculation, giving the results 1 for PU_0,0, 2 for PU_0,1, 2 for PU_1,0 and 4 for PU_1,1. Since not all of the first data and second data have been processed yet, the resulting third data are the intermediate data M_temp.

Fig. 7e is a schematic diagram of the second calculation. The calculation circuit obtains the second column of first data from the DRDB of LD_M1; this column comprises the 3 of row 0 and the 4 of row 1. The 3 of row 0 is input to the row 0 computing units PU_0,0 and PU_0,1 of the calculation circuit, and the 4 of row 1 is input to the row 1 computing units PU_1,0 and PU_1,1. LD_M1 sends the column coordinates 2 and 3 of the bitmap data, buffered in DB10, that correspond to the second column of first data to LD_M2. According to the column coordinates 2 and 3, the switch circuit of LD_M2 selects and outputs row 2 and row 3 of the second data buffered in LD_M2. The second data of row 2, which comprise 1 and 2, are input to the row 0 computing units: second data 1 to PU_0,0 and second data 2 to PU_0,1. The second data of row 3, which also comprise 1 and 2, are input to the row 1 computing units: second data 1 to PU_1,0 and second data 2 to PU_1,1. Each computing unit then independently performs a multiply-accumulate calculation, giving the accumulated results 4 for PU_0,0, 8 for PU_0,1, 6 for PU_1,0 and 12 for PU_1,1. Since all of the first data and second data have now been processed, the resulting third data are the values of the elements of the output matrix M.
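
As a non-limiting illustration, the two calculations of Fig. 7d and Fig. 7e can be reproduced with the following Python sketch. The buffer contents are taken from the values stated in the description (first data 1, 2, 3, 4 with column coordinates 0, 1, 2, 3, and every row of M2 equal to 1, 2); the variable names and the list representation are assumptions made for illustration.

```python
# Reordered first data in the DRDB and the bitmap column of each entry.
drdb = [[1, 3],
        [2, 4]]
coords = [[0, 2],
          [1, 3]]
# Second matrix M2 as buffered in LD_M2 (4 rows, each holding 1 and 2).
m2 = [[1, 2], [1, 2], [1, 2], [1, 2]]

acc = [[0, 0], [0, 0]]            # one accumulator per computing unit
history = []
for step in range(2):             # one pass per DRDB column
    for r in range(2):            # row r of computing units
        a = drdb[r][step]                 # first data broadcast along row r
        row = m2[coords[r][step]]         # row selected by the switch circuit
        for c in range(2):
            acc[r][c] += a * row[c]       # multiply-accumulate in PU_r,c
    history.append([list(values) for values in acc])

m_temp, m = history               # intermediate data and output matrix
```

After the first pass the accumulators hold the intermediate data M_temp, and after the second pass they hold the output matrix M, in agreement with the results stated for the four computing units.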

It can be seen from the above example that, when the matrix calculation circuit of the present disclosure performs matrix multiplication, only two calculations are needed to multiply a 2 × 4 matrix by a 4 × 2 matrix, which increases the calculation speed and saves calculation time.

By the technical scheme, the compressed sparse matrix is directly calculated, so that the storage space is effectively saved, and the data bandwidth is saved; by using the computing unit array, all the computing units synchronously process data, the data utilization rate is greatly improved, and a plurality of computing units can share the same data; the compressed sparse matrix is directly calculated, and calculation of some 0 elements is skipped, so that the operation speed is increased, and the effective calculation capacity of the chip is improved.

Fig. 8 is a flowchart of a matrix calculation method provided in the embodiment of the present disclosure. As shown in fig. 8, the method includes the steps of:

step S801, reading and caching first data in a first matrix and bitmap data in a bitmap matrix, wherein the first matrix is a matrix formed by non-0 data in a data matrix, and the bitmap data in the bitmap matrix corresponds to the data in the data matrix in position one to one;

step S802, outputting at least one first data and position information indicated by bitmap data corresponding to the first data;

step S803, reading and caching the second data in the second matrix;

step S804, outputting at least one second data according to the position information;

step S805, performing calculation on the first data and the second data to obtain third data.

Further, the method further comprises:

and caching the second data read according to the second data reading address.
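
As a non-limiting illustration, steps S801 to S805 can be combined into a single Python sketch and checked against an ordinary dense matrix product. The grouping rule used to form columns of first data, the multiply-accumulate calculation type, and all names and numeric values are assumptions made for illustration and are not the only way to carry out the method.

```python
def matrix_calculation(first_matrix, bitmap, second_matrix):
    """Sketch of steps S801 to S805 for a multiply-accumulate calculation."""
    num_rows, num_cols = len(bitmap), len(bitmap[0])
    # S801: read and buffer the first data and the bitmap data.
    pending = iter(first_matrix)
    # S803: read and buffer the second data.
    second_buffer = [list(row) for row in second_matrix]
    third = [[0] * len(second_buffer[0]) for _ in range(num_rows)]

    # Group first data into columns with at most one entry per row (assumed
    # rule); each entry keeps its bitmap column as position information.
    groups, current = [], {}
    for x in range(num_cols):
        for y in range(num_rows):
            if bitmap[y][x]:
                if y in current:
                    groups.append(current)
                    current = {}
                current[y] = (next(pending), x)
    if current:
        groups.append(current)

    for group in groups:
        for y, (value, x) in group.items():   # S802: first data and position
            row = second_buffer[x]            # S804: second data by position
            for c, b in enumerate(row):       # S805: calculate third data
                third[y][c] += value * b
    return third


def dense_reference(a, b):
    """Plain matrix product, used only to check the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical 3 x 4 sparse data matrix and 4 x 3 second matrix.
data = [[0, 2, 0, 1], [3, 0, 0, 0], [0, 0, 4, 5]]
first = [3, 2, 4, 1, 5]                       # non-0 data, column-first
bits = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
second = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
result = matrix_calculation(first, bits, second)
```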

In the above, although the steps in the above method embodiments are described in the above sequence, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in the above sequence, and may also be performed in other sequences such as reverse, parallel, and cross, and further, on the basis of the above steps, other steps may also be added by those skilled in the art, and these obvious modifications or equivalents should also be included in the protection scope of the present disclosure, and are not described herein again.

The embodiments of the present disclosure further provide a processing core, where the processing core includes at least one matrix calculation circuit in any of the above embodiments, a decoding unit, and a storage device.

The embodiment of the present disclosure further provides a chip, where the chip includes at least one processing core in any one of the above embodiments.

An embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors configured to execute the computer-readable instructions, such that the processors, when executing the instructions, perform the matrix calculation method of any of the above embodiments.

The disclosed embodiments also provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the matrix calculation method described in any one of the foregoing embodiments.

The embodiment of the present disclosure further provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the matrix calculation method of any of the preceding embodiments.

The embodiment of the present disclosure further provides a computing device, which includes the chip in any one of the embodiments.

The flowchart and block diagrams in the figures of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims

1. A matrix calculation circuit, characterized by comprising:

a first data reading circuit, configured to read and cache first data in a first matrix and bitmap data in a bitmap matrix, and to output at least one of the first data and position information indicated by the bitmap data corresponding to the first data; wherein the first matrix is a matrix composed of non-0 data in a data matrix, and the bitmap data in the bitmap matrix corresponds one-to-one in position to the data in the data matrix;

a second data reading circuit, configured to read and buffer second data in a second matrix, and to output at least one of the second data according to the position information;

a calculation circuit, configured to perform calculation on the first data and the second data to obtain third data.

2. The matrix calculation circuit according to claim 1, wherein the first data reading circuit further comprises:

a first data buffer circuit, a bitmap matrix buffer circuit, a first data sorting circuit and a first control circuit;

wherein the first control circuit is configured to generate a first data read address according to the first address of the first matrix, and to generate a position information read address according to the first address of the bitmap matrix;

the first data buffer circuit is configured to buffer a plurality of first data read out according to the first data read address;

the bitmap matrix buffer circuit is configured to buffer the bitmap data read out according to the position information read address;

the first data sorting circuit is configured to reorder the first data by column according to the bitmap data in one-to-one correspondence with the first data, wherein, in the reordering result, data in the same row of the data matrix are still in the same row and data in different rows are still not in the same row; if two adjacent first data are in the same row during the sorting process, 0 is added to the other rows of the previous column.

3. The matrix calculation circuit according to claim 2, wherein the bitmap matrix buffer circuit is further configured to:

send the position information indicated by the bitmap data corresponding to the first data to the second data reading circuit.

4. The matrix calculation circuit according to any one of claims 1-3, wherein the second data reading circuit further comprises:

a second data buffer circuit and a second control circuit;

wherein the second control circuit is configured to generate a second data read address according to the first address of the second matrix;

the second data buffer circuit is configured to buffer the second data read out according to the second data read address.

5. The matrix calculation circuit of claim 4, wherein the second data reading circuit further comprises:

a switch circuit, configured to control the output of the second data in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data.

6. The matrix calculation circuit according to claim 5, wherein the switch circuit being configured to control the output of a plurality of second data in the second data buffer circuit according to the position information indicated by the bitmap data corresponding to the first data comprises:

the switch circuit controls, according to the column information in the position information indicated by the bitmap data corresponding to the first data, the output of at least one row of second data in the second data buffer circuit corresponding to the column information.

7. The matrix calculation circuit of any one of claims 1-6, wherein the calculation circuit comprises:

a computing unit array, wherein the computing unit array includes a plurality of computing units;

a row of computing units in the computing unit array receives a row of second data in the second data;

a column of computing units in the computing unit array receives a column of first data in the first data.

8. The matrix calculation circuit according to claim 4, wherein the calculation circuit being configured to calculate and obtain the third data according to the first data and the second data comprises:

the calculation circuit receives a reordered column of first data output by the first data sorting circuit; receives at least one row of second data, corresponding to the column of first data, output by the switch circuit; and calculates third data according to the column of first data and the at least one row of second data.

9. A matrix calculation method, characterized by comprising:

reading and buffering first data in a first matrix and bitmap data in a bitmap matrix, the first matrix being a matrix composed of non-0 data in a data matrix, and the bitmap data in the bitmap matrix corresponding one-to-one in position to the data in the data matrix;

outputting at least one of the first data and the position information indicated by the bitmap data corresponding to the first data;

reading and buffering the second data in the second matrix;

performing calculation on the first data and the second data to obtain third data.

10. A processing core comprising the matrix calculation circuit of any of claims 1-8.