WO2021072732A1 - Matrix computing circuit, apparatus and method


Info

Publication number
WO2021072732A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
input
data
instruction
input matrix
Prior art date
Application number
PCT/CN2019/111878
Other languages
French (fr)
Chinese (zh)
Inventor
罗飞 (Luo Fei)
王维伟 (Wang Weiwei)
Original Assignee
北京希姆计算科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Priority to CN201980101046.3A (published as CN114503126A)
Priority to PCT/CN2019/111878
Publication of WO2021072732A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a matrix operation circuit, apparatus, and method.
  • Chips are the cornerstone of data processing; they fundamentally determine the ability to process data.
  • There are two main chip routes. One is the general-purpose chip route, such as the CPU (Central Processing Unit), which provides great flexibility but whose efficiency when processing algorithms in specific fields is relatively low.
  • The other is the dedicated chip route, such as the TPU (Tensor Processing Unit).
  • CPU scheme: with a single-core CPU, the matrix is broken down into scalars, and the convolution operation is realized by combining scalar instructions; with a multi-core CPU, multiple cores may execute scalar instructions in parallel and combine them to implement the convolution operation.
  • GPU (Graphics Processing Unit) scheme: the GPU breaks the convolution operation down into multiple instructions, mainly vector instructions, and realizes the convolution by combining and executing them.
  • This solution has the following disadvantages: the underlying program is complex, generally requiring multiple nested loops to implement a convolution; realizing the convolution through many combined vector instructions is inefficient; the GPU must access data many times, which increases both the computation time and the power consumption of the convolution operation; and the GPU cache is limited, so a relatively large convolution requires multiple transfers from off-chip memory, which further reduces efficiency.
  • an embodiment of the present disclosure provides a matrix operation circuit, including:
  • an operation unit array comprising a plurality of operation units, each operation unit including a first input register, a second input register, and an output register;
  • the first input register is used to receive data of a first input matrix;
  • the second input register is used to receive data of a second input matrix;
  • a control circuit configured to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of operation units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction;
  • the output register is used to store the result of the arithmetic operation.
  • the instruction includes the instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • the matrix operation instruction is a matrix multiplication instruction.
  • the operation unit includes an arithmetic unit comprising at least a multiplier and an adder; the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the accumulated value is stored in the output register.
  • the matrix multiplication instruction can be used to implement a matrix convolution operation, wherein:
  • the data of the first input matrix is row vector data of the first input matrix
  • the data of the second input matrix is row vector data of the second input matrix.
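The multiply-accumulate behavior of a single operation unit described above can be sketched in software. This is a toy model for illustration only; the class and method names are invented, not from the disclosure:

```python
class ProcessingUnit:
    """Toy software model of one operation unit (PU): two input
    registers and an accumulating output register."""

    def __init__(self):
        self.rin1 = 0  # receives data of the first input matrix
        self.rin2 = 0  # receives data of the second input matrix
        self.rout = 0  # stores the accumulated operation result

    def load(self, a, b):
        self.rin1, self.rin2 = a, b

    def step(self):
        # one clock cycle: multiply the two inputs and accumulate
        self.rout += self.rin1 * self.rin2
        return self.rout


# Dot product of one row of A with one column of B, one element per "cycle"
row_a = [1, 2, 3]
col_b = [4, 5, 6]
pu = ProcessingUnit()
for a, b in zip(row_a, col_b):
    pu.load(a, b)
    pu.step()
# pu.rout == 1*4 + 2*5 + 3*6 == 32
```

Each call to `step` plays the role of one clock cycle; the output register keeps the running sum until a full row-by-column product is accumulated.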
  • a matrix calculation device including:
  • a memory for storing matrix operation instructions, a first input matrix, a second input matrix, and an output matrix
  • An instruction fetching module connected to the memory, and configured to obtain the matrix operation instruction from the memory
  • a decoding module connected to the instruction fetching module, and configured to decode the matrix operation instructions acquired by the instruction fetching module
  • a register for storing attribute data of the first input matrix, the second input matrix, and the output matrix
  • an execution module connected to the decoding module, the memory, and the register, and including the matrix operation circuit according to claims 1-6, which is used to execute the decoded matrix operation instruction.
  • the execution module obtains the decoded matrix operation instruction from the decoding module; it obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; according to the attribute data of the two input matrices, it obtains the data of the first input matrix and of the second input matrix from the memory for calculation; it then operates on these data according to the decoded matrix operation instruction to obtain the data of the output matrix; finally, it stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • the attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix;
  • the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix;
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix.
  • the execution module acquiring the data of the first input matrix and of the second input matrix from the memory according to their attribute data includes:
  • the execution module reads the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix;
  • the execution module reads the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  • the first reading mode is reading by row or by column; the second reading mode is reading by row or by column.
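A software sketch may clarify the reading modes. The function below is hypothetical (the names `read_matrix`, `row_interval`, and `by_row` are invented for illustration); it gathers a matrix from flat memory given the attribute data, reading either by row or by column:

```python
def read_matrix(memory, base, rows, cols, row_interval, by_row=True):
    """Sketch: gather a rows x cols matrix from flat 'memory'.

    'row_interval' is the address difference between the heads of two
    adjacent rows; it may exceed 'cols' when rows are stored with
    padding between them.
    """
    mat = [[memory[base + r * row_interval + c] for c in range(cols)]
           for r in range(rows)]
    if by_row:
        return mat                               # row reading
    return [list(col) for col in zip(*mat)]      # column reading


# A 2x3 matrix stored with a row interval of 5 (2 padding slots per row)
mem = [1, 2, 3, -1, -1,
       4, 5, 6, -1, -1]
assert read_matrix(mem, 0, 2, 3, 5) == [[1, 2, 3], [4, 5, 6]]
assert read_matrix(mem, 0, 2, 3, 5, by_row=False) == [[1, 4], [2, 5], [3, 6]]
```

Note how the row interval (5) larger than the column count (3) simply skips the padding, which is the point of storing the interval as separate attribute data.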
  • embodiments of the present disclosure provide a matrix operation method based on the matrix operation circuit of any one of the foregoing first aspects, including:
  • based on the decoded matrix operation instruction, the matrix operation circuit obtains the data of the first input matrix and of the second input matrix from the memory, performs the operation, and stores the operation result in the memory after the operation is completed.
  • an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to execute the computer-readable instructions so that, when they run, the processor implements any one of the matrix operation methods of the third aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions which, when executed, cause a computer to perform any of the matrix operation methods of the foregoing third aspect.
  • embodiments of the present disclosure provide a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform any of the matrix operation methods of the foregoing third aspect.
  • an embodiment of the present disclosure provides a chip comprising the matrix operation circuit described in any one of the first aspects.
  • an embodiment of the present disclosure provides a computing device including the chip described in the seventh aspect.
  • the embodiments of the present disclosure thus disclose a matrix operation circuit, apparatus, and method.
  • the matrix operation circuit includes: a control circuit; an operation unit array comprising a plurality of operation units, each with a first input register, a second input register, and an output register; the first input register receives the data of the first input matrix, and the second input register receives the data of the second input matrix; the control circuit receives a matrix operation instruction and, in response, controls at least one of the operation units to perform an arithmetic operation on the first and second input matrices as directed by the instruction, wherein the instruction is a single instruction; the output register stores the result of the arithmetic operation.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure.
  • FIG. 2 is a schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • FIG. 3 is a further schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the present disclosure.
  • FIGS. 5-8 illustrate the calculation process of partial convolution provided by embodiments of the disclosure.
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • FIGS. 10a-10d are schematic diagrams of the storage order and format of the matrices in the disclosure.
  • FIGS. 11-15 are schematic diagrams of data overlap in convolution operations.
  • FIG. 16 is a schematic diagram of a specific example of the convolution operation in an embodiment of the disclosure.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure.
  • the matrix operation circuit 100 includes a control circuit 101 and an arithmetic unit array 102.
  • the operation unit array includes a plurality of operation units (PUs) 103.
  • each operation unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105, and an output register (Rout) 106, wherein the first input register 104 is used to receive the data of the first input matrix and the second input register 105 is used to receive the data of the second input matrix;
  • the control circuit 101 is configured to receive a matrix operation instruction and, in response to the instruction, control at least one operation unit 103 of the plurality of operation units 103 to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; the output register 106 is used to store the result of the arithmetic operation.
  • the single instruction includes the instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • the following table shows an exemplary instruction format:
  • instruction name | first address of first input matrix (RS1) | first address of second input matrix (RS2) | first address of output matrix
  • the instruction name determines the meaning, format, and operation of the instruction.
  • the first address of the first input matrix and the first address of the second input matrix define the read addresses of the two source operands of the instruction, and the first address of the output matrix defines the storage address of the destination operand of the instruction.
  • the above exemplary instruction is a matrix multiplication instruction, which implements the multiplication of two matrices, specifically:
  • FIG. 3 is a further schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • in addition to the first input register, the second input register, and the output register, the operation unit also includes an arithmetic unit.
  • the arithmetic unit includes at least a multiplier 301 and an adder 302, and is used to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the data of the first input matrix in the first input register is read sequentially into the first input register starting from the first address of the first input matrix; likewise, the data of the second input matrix in the second input register is read sequentially into the second input register starting from the first address of the second input matrix.
  • the multiplier is used to calculate the product of the data in the first input register and the data in the second input register.
  • the multiplier in the arithmetic unit calculates the product a11*b11 and sends it to the adder for accumulation, so the accumulation result after this cycle is a11*b11.
  • in the next clock cycle the calculation continues: the first input register receives a12 and the second input register receives b21; the multiplier calculates the product a12*b21 and sends it to the adder for accumulation.
  • since the accumulated value from the previous clock cycle is a11*b11, this cycle yields a11*b11 + a12*b21; the operation continues in this way until one row of the first input matrix and one column of the second input matrix have been fully processed, producing the final accumulated value, which is then stored in the output register.
  • the control circuit stores the accumulated value in the output register into the system memory according to the first address of the output matrix in the instruction.
  • the matrix multiplication instruction can implement matrix convolution operations.
  • the convolution of two matrices is the accumulated sum of the products of their point-to-point multiplications.
  • An exemplary convolution operation is as follows:
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the disclosure.
  • Figure 4a is the overall schematic diagram of the convolution operation process, with the notation explained as follows:
  • Cin: the number of channels of the input feature map, also referred to below as the depth of the input feature map;
  • Kw: the width of the convolution kernel;
  • Wout: the width of the output feature map;
  • Hout: the height of the output feature map.
  • the feature points on the input feature map constitute the first input matrix; the points on the convolution kernel constitute the second input matrix; the output feature map is the output matrix, and each feature point on the output feature map is one datum of the output matrix.
  • the convolution kernel slides over the input feature map; at each sliding position it multiplies and accumulates its data with the corresponding data of the input feature map to extract one output feature point, that is, one datum of the output matrix.
  • Figure 4b is a schematic diagram of the calculation of an output feature point with depth.
  • when the convolution kernel stops at a position while sliding over the input feature map, it multiplies and accumulates the corresponding data with the feature points of the input feature map at that position to obtain the output feature point for that position; since there are Cout convolution kernels, each kernel multiplies and accumulates with the feature points of the input feature map at the same position, yielding Cout output feature points in the depth direction; these Cout output feature points form one feature point with depth Cout on the output feature map; as the kernel slides over the entire input feature map, the entire output feature map is obtained.
  • Dout is a point with depth in the output feature map, and its superscript l corresponds to the output depth; Din is the data of the input feature map covered by the convolution kernel, its superscript i corresponds to the depth of the input feature map, and j and k correspond to the width and height of the convolution kernel; w is the convolution kernel, its superscripts l and i correspond to the depth of the output feature map and the depth of the input feature map, and j and k correspond to the width and height of the kernel. In this notation, one output point is computed as Dout^l = sum over i, j, k of w^(l,i)_(j,k) * Din^i_(j,k).
  • a convolution kernel of size Kh*Kw*Cin can be divided into Kh partial kernels of size Kw*Cin, each performing partial feature extraction; each partial kernel extracts 1/Kh of the complete feature, namely the part corresponding to one Kw*Cin slice, yielding Kh partial results; finally, these Kh partial results are added to obtain the final result.
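The equivalence between the full multiply-accumulate over a Kh*Kw*Cin window and the sum of the Kh partial results can be checked with a small sketch (the data and loop structure are invented for illustration):

```python
# One output point: full multiply-accumulate over a Kh x Kw x Cin window
# versus the sum of Kh partial (Kw x Cin) convolutions. Toy data only.
Kh, Kw, Cin = 2, 2, 3
window = [[[(h * Kw + w) * Cin + c for c in range(Cin)]
           for w in range(Kw)] for h in range(Kh)]
kernel = [[[1 for _ in range(Cin)] for _ in range(Kw)] for _ in range(Kh)]

# full convolution of the window: one sum over all Kh*Kw*Cin products
full = sum(window[h][w][c] * kernel[h][w][c]
           for h in range(Kh) for w in range(Kw) for c in range(Cin))

# Kh partial results, one per Kw*Cin slice of the kernel
partials = [sum(window[h][w][c] * kernel[h][w][c]
                for w in range(Kw) for c in range(Cin))
            for h in range(Kh)]

assert full == sum(partials)   # adding the partial results gives the full result
```

With these toy values the window holds 0..11, so both the full result and the sum of the two partial results equal 66.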
  • one row of the input data matrix can then be multiplied by the Cout-column convolution kernel matrix (the weight matrix) composed of the Cout convolution kernels.
  • this produces a feature point with depth: the feature point is a vector whose length is the depth Cout of the output feature point; its realization is shown in Figure 7.
  • since the process of convolution or partial convolution in a neural network is the sliding of the (partial) convolution kernel over the input feature map, it can be viewed as the input feature map data changing with the sliding while the weights remain unchanged; in this way, the convolution performed by the neural network becomes the multiplication of a Wout-row input data matrix with a Cout-column weight matrix, yielding a Wout-row output data matrix; the implementation is shown in Figure 8.
  • the above convolution therefore needs only a single instruction, namely the matrix multiplication instruction above, to complete the entire convolution process; the order of data reading merely has to be set in advance by the upper-level program.
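The single-instruction formulation amounts to one matrix multiplication of unrolled input rows against the weight matrix. The sketch below is illustrative only: `matmul` stands in for the hardware matrix multiplication instruction, and the shapes and values are invented:

```python
def matmul(A, B):
    """Plain matrix multiply; stands in for the single matrix
    multiplication instruction described above."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Wout rows of unrolled input data (each row is one Kw*Cin window)
# times a (Kw*Cin) x Cout weight matrix gives Wout output points,
# each of depth Cout. Here Wout=2, Kw*Cin=4, Cout=3.
inputs  = [[1, 2, 3, 4],
           [2, 3, 4, 5]]           # Wout x (Kw*Cin)
weights = [[1, 0, 1],
           [0, 1, 1],
           [1, 0, 1],
           [0, 1, 1]]              # (Kw*Cin) x Cout
out = matmul(inputs, weights)      # Wout x Cout
# out == [[4, 6, 10], [6, 8, 14]]
```

Each output row is one output feature point with depth Cout, matching the Figure 8 description of a Wout-row output data matrix.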
  • the matrix multiplication instruction is used to implement a matrix convolution operation, wherein the data of the first input matrix is the row vector data of the first input matrix, and the data of the second input matrix is the row vector data of the second input matrix.
  • that is, the data in the first input register and in the second input register must both be row vector data of their matrices, so that the computed result is the convolution result.
  • Figure 16 is a schematic diagram of the above convolution calculation.
  • the step size is 1.
  • the output matrix is a 2*2*2 matrix.
  • the partial convolution method first calculates an intermediate value for each point of the output matrix. As shown in Figure 16, during the partial convolution one row of the second input matrix slides over the first input matrix; the data read from the first input matrix is shown at 1601, where each row corresponds to the data of the first input matrix at one position of the second input matrix 1602 and contains 3 numbers with depth, 6 data in total; one column of 1602 is one row of the second input matrix, likewise 3 numbers and 6 data in total.
  • one row of data from 1601 and one column of data from 1602 are multiplied and accumulated to obtain one point of 1603; each point of 1603 is a partial convolution result.
  • a partial convolution result is part of the value of one point of the output matrix; finally, the results obtained as the three rows of the second input matrix slide in turn over the first input matrix are accumulated to obtain the value of one point of the output matrix (one of the two values of that point, which has depth 2).
  • in the first clock cycle, the value 1 of data 1601 of the first input matrix is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.1 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 1.9 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22; PU11, PU12, PU21, and PU22 perform the data multiplication and save the result in their output registers.
  • in the next clock cycle, the value 2 of data 1601 of the first input matrix is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.2 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 2.0 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22; PU11, PU12, PU21, and PU22 again perform the data multiplication, and the result sent to each output register is accumulated with the previously saved result.
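The two-cycle broadcast schedule above can be re-enacted in software. The loop below is a toy model: the 1602 values (0.1, 1.9, 0.2, 2.0) and the first row values (1, 2) come from the text, while the values fed to the second PU row (3, 4) are invented, since the text elides them:

```python
# Row data of the first matrix feeds a row of PUs; column data of the
# second matrix feeds a column of PUs; each PU multiply-accumulates once
# per cycle into its output register.
A = [[1, 2],        # first-matrix data streamed to PU rows over 2 cycles
     [3, 4]]        # second row's values are assumed, not from the text
B = [[0.1, 1.9],    # second-matrix data streamed to PU columns, cycle 1
     [0.2, 2.0]]    # cycle 2

rout = [[0.0, 0.0], [0.0, 0.0]]     # output registers of PU11..PU22
for cycle in range(2):
    for i in range(2):              # PU row index
        for j in range(2):          # PU column index
            # PU(i+1)(j+1): Rin1 gets A[i][cycle], Rin2 gets B[cycle][j]
            rout[i][j] += A[i][cycle] * B[cycle][j]
# rout now holds the matrix product of A and B
```

This shows why the broadcast wiring computes a matrix product: PU(i,j) accumulates exactly the dot product of row i of the first matrix with column j of the second.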
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • the matrix calculation device 900 includes: a memory 901 for storing matrix operation instructions, the first input matrix, the second input matrix, and the output matrix; an instruction fetching module 902, connected to the memory 901, for obtaining the matrix operation instruction from the memory 901; a decoding module 903, connected to the instruction fetching module 902, for decoding the matrix operation instruction obtained by it; a register 904 for storing the attribute data of the first input matrix, the second input matrix, and the output matrix; and an execution module 905, connected to the decoding module 903, the memory 901, and the register 904, which includes the matrix operation circuit of the above embodiments and is used to execute the decoded matrix operation instruction.
  • the execution module obtains the decoded matrix operation instruction from the decoding module; it obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; according to the attribute data of the two input matrices, it obtains the data of the first input matrix and of the second input matrix from the memory for calculation; it operates on these data according to the decoded matrix operation instruction to obtain the data of the output matrix; and it stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • the attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix;
  • the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix;
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix.
  • the number of rows and the number of columns define the size of the matrix;
  • the row vector interval defines the difference between the storage addresses of two adjacent rows of the matrix. For example, if each row of the matrix has 10 int8 elements and the rows are stored contiguously, the row vector interval is 10 bytes; if adjacent rows are stored at a larger interval, say 20 bytes, then 10 of those bytes hold matrix elements and the other 10 bytes do not belong to the matrix; they may be invalid data or data used for other purposes.
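The address arithmetic implied by the row vector interval can be written out directly. The function name and the `elem_size` parameter below are illustrative assumptions; the int8 example values mirror the text:

```python
def element_addr(base, row, col, row_interval_bytes, elem_size=1):
    """Byte address of matrix[row][col], where row_interval_bytes is the
    address difference between the heads of two adjacent rows."""
    return base + row * row_interval_bytes + col * elem_size

# 10 int8 elements per row, stored contiguously: interval = 10 bytes
assert element_addr(0x1000, 2, 3, 10) == 0x1000 + 23
# Same matrix with 10 bytes of non-matrix data per row: interval = 20 bytes
assert element_addr(0x1000, 2, 3, 20) == 0x1000 + 43
```

Only the row term changes between the two layouts; the column offset is unaffected by the padding, which is why a single interval attribute suffices to describe both.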
  • the execution module obtaining the data of the first input matrix and of the second input matrix from the memory according to their attribute data includes: the execution module reads the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and the execution module reads the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  • the first reading mode is reading by row or by column; the second reading mode is reading by row or by column.
  • for example, if the attribute data defines the first input matrix as having 5 rows, 5 columns, and a row interval of 5 bytes, and the preset reading mode is reading by row, then the first row of the first input matrix is read starting from the first address given in the instruction; the number of columns indicates that the row has 5 elements; the first address plus the row interval then serves as the new first address for reading the second row, also 5 elements; reading 5 times in this way, the execution module obtains all the data of the first input matrix.
  • the execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module stores the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix.
  • the preset storage mode is storing by row or by column; the specific storage process mirrors the reading process in the opposite direction, so it is not repeated here.
  • Fig. 10a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure, using the first input matrix of the above embodiment as an example.
  • when it is stored in the memory, the depth Cin is stored first, then the width Win, and finally the height Hin.
  • starting from the first point, every point is stored in this way until all points are stored.
  • An example of the storage order and format of the first input matrix is shown in Figure 10b.
  • Fig. 10c is a schematic diagram of the storage order and format of the second input matrix of the present disclosure, using the second input matrix of the above embodiment as an example.
  • the kernels are stored by rows, with each column holding one convolution kernel; in the column direction, the depth Cin of the convolution kernel is stored first, then the width Kw, and then the height Kh.
  • RS1, as shown in Figure 10a, is the first address of the first input matrix.
  • the matrix data that is read in can be controlled through the storage format and the reading mode; for example, with the row-order storage of the above example, together with reading by row and the row interval, the row vector data of the first input matrix can be read.
  • RS2 is the first address of the second input matrix.
  • likewise, the read data can be controlled through the storage format and the reading mode, for example storing in the order of the above example together with reading by row and the row interval, so that the row vector data of the second input matrix can be read. After the data of the first and second input matrices have been read out, the single-instruction convolution operation is completed using the matrix multiplication instruction.
  • the attribute data of the first input matrix, the second input matrix, and the output matrix may be set.
  • multiple registers are defined to store the attribute data of each matrix.
  • An example configuration of the registers is shown in the following table:
  • Shape1: bits 31:16 = number of columns of the first input matrix (matrix width); bits 15:0 = number of rows of the first input matrix (matrix length)
  • Shape2: bits 31:16 = number of columns of the second input matrix (matrix width); bits 15:0 = number of rows of the second input matrix (matrix length)
  • Stride1: bits 15:0 = row interval of the output matrix, that is, the number of data between the head of one row and the head of the next row (the same meaning applies below)
  • Stride2: bits 31:16 = row interval of the second input matrix; bits 15:0 = row interval of the first input matrix
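The bit layout of these configuration registers can be illustrated with a small pack/unpack sketch (function names are invented; the 31:16 / 15:0 split follows the table above):

```python
def pack_shape(rows, cols):
    """Pack a matrix shape as in the table above:
    columns in bits [31:16], rows in bits [15:0]."""
    return ((cols & 0xFFFF) << 16) | (rows & 0xFFFF)

def unpack_shape(reg):
    """Recover (rows, cols) from a packed 32-bit shape register."""
    return reg & 0xFFFF, (reg >> 16) & 0xFFFF

shape1 = pack_shape(5, 5)              # 5x5 first input matrix
assert unpack_shape(shape1) == (5, 5)
assert pack_shape(3, 7) == (7 << 16) | 3
```

Packing two 16-bit fields per 32-bit register halves the number of configuration registers needed for the three matrices.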
  • Figures 11-15 are schematic diagrams of data overlap in convolution operations.
  • in a convolution operation, the data used to calculate one output feature point often partially overlaps the data used to calculate the next output feature point.
  • in Figure 11, taking the 3*3 convolution kernel with its stride (the distance the kernel slides each time) as an example, the second point of the second row is calculated; after that, as shown in Figure 12, the kernel slides one point to the right to calculate the third point of the second row. Viewed from the sliding of the input data, the input data of the two consecutive calculations partially overlaps; the overlapping part is the gray region in Figure 13.
  • the gray part is the data shared by the two calculations; that is, as the convolution kernel slides, two thirds of the input matrix data is reused in the convolution calculations of two adjacent positions.
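The two-thirds figure can be checked directly: for a kernel of width 3 sliding by one point, consecutive windows share two of their three input columns.

```python
# Input columns touched by a 3-wide kernel at two consecutive positions
kernel_w, stride = 3, 1
win_a = set(range(0, kernel_w))                 # window at position 0
win_b = set(range(stride, stride + kernel_w))   # window one slide later
overlap = len(win_a & win_b) / kernel_w
assert overlap == 2 / 3                         # two thirds of the data is shared
```

The same calculation generalizes: the shared fraction is (kernel_w - stride) / kernel_w whenever the stride is smaller than the kernel width.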
  • the data in matrix one is denoted by the coordinates x, y, and z on the coordinate system of Figure 14.
  • the weights of the partial convolution form matrix two; since w has the additional dimension Cout, one more dimension is added to its superscript; the data of the output matrix is denoted accordingly.
  • the calculation process of the matrices is shown in Figure 15.
  • The depth of the two points is 8. The second row of the output matrix in Figure 15 represents a point of the output feature map; the depth of this point is Cout, and the data along the depth correspond in sequence to the data of the output matrix. When calculating the third point in the first row of the output feature map, the third row of matrix one is multiplied and accumulated with each column of matrix two, yielding the third point, with its depth, on the output feature map.
  • The third point is a point with a depth of 8.
  • The 8 data in the gray part are overlapped. If they were stored again in the form of matrix one, considerable memory would be wasted.
  • The problems described above can be solved by setting the above-mentioned row interval.
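One way to picture the fix (a sketch under our own assumptions, not the patent's exact addressing scheme): the feature-map data is stored once, and the row-interval register makes consecutive rows of the "unrolled" input matrix start only a few elements apart in the same buffer, so the overlapped data is read twice but never stored twice.

```python
def virtual_row(buffer, row, row_interval, row_len):
    """Row `row` of the unrolled input matrix, read from a flat buffer
    in which consecutive row heads are `row_interval` elements apart."""
    start = row * row_interval
    return buffer[start:start + row_len]

buf = list(range(12))            # feature-map fragment, stored once
r0 = virtual_row(buf, 0, 3, 9)   # elements 0..8
r1 = virtual_row(buf, 1, 3, 9)   # elements 3..11
assert r0[3:] == r1[:6]          # 6 of the 9 elements are shared: 2/3 reuse
```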
  • An embodiment of the present disclosure also provides a matrix operation method based on any of the foregoing matrix operation circuits, including: fetching a matrix operation instruction from a memory; decoding the matrix operation instruction and sending the decoded matrix operation instruction to the matrix operation circuit; and, based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the second input matrix from the memory, performing the operation, and storing the operation result into the memory after the operation is completed.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that the processor, when running, implements any of the matrix operation methods of the foregoing embodiments.
  • Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to make a computer execute any of the matrix operation methods of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute any of the matrix operation methods of the foregoing embodiments.
  • An embodiment of the present disclosure provides a chip including the matrix operation circuit described in any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device including the chip described in any of the foregoing embodiments.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or part of code that contains one or more executable instructions for realizing the specified logical function.
  • The functions marked in the blocks may also occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure can be implemented in software or hardware. In some circumstances, the name of a unit does not constitute a limitation on the unit itself.
  • Exemplary types of hardware logic components include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Products (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • A machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed are a matrix computing circuit, apparatus, and method. The matrix computing circuit comprises a control circuit and a computing unit array; the computing unit array comprises a plurality of computing units, each comprising a first input register, a second input register, and an output register. The first input register is used to receive data of a first input matrix, and the second input register is used to receive data of a second input matrix. The control circuit is used to receive a matrix computing instruction and, in response to the instruction, control at least one of the plurality of computing units to perform a computing operation on the first input matrix and the second input matrix according to the instruction's indication, the instruction being a single instruction. The output register is used to store the computing result of the operation. The method resolves the prior-art problems of low computing efficiency and high power consumption during convolution computing.

Description

Matrix operation circuit, device and method
Technical field
The present disclosure relates to the field of neural network computing, and in particular to a matrix operation circuit, device, and method.
Background
With the development of science and technology, human society is rapidly entering the age of intelligence. An important feature of this age is that people obtain ever more kinds of data in ever greater quantities, while demanding ever higher data-processing speed. Chips are the cornerstone of data processing: they fundamentally determine people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, for example the CPU (Central Processing Unit); such chips provide great flexibility but relatively low effective computing power on domain-specific algorithms. The other is the dedicated route, for example the TPU (Tensor Processing Unit); such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing ability is poor, or the workloads cannot be handled at all. Because the data of the intelligent age are both varied and enormous, chips are required to be extremely flexible, able to handle the rapidly changing algorithms of different fields, and also extremely powerful, able to process quickly the huge and sharply growing volume of data.
Convolution calculations are frequently required in artificial-intelligence computing. Existing schemes for implementing them generally fall into two categories:
(1) CPU scheme: with a single-core CPU, the matrix is decomposed into scalars, and the convolution operation is implemented by combining scalar instructions; with a multi-core CPU, multiple cores may execute their own scalar instructions in parallel and combine the results into the convolution. This scheme has the following disadvantages: the underlying program is complex, generally requiring multi-level loops to implement the convolution; implementing the convolution with general-purpose computing instructions is inefficient and requires many branch jumps; the CPU cache is limited, so relatively large convolutions require moving data from off-chip many times, hurting efficiency; the CPU must access data many times, which increases both the computation time and the power consumption of the convolution; and with multi-core parallel computing, inter-core communication is complex and communication performance may become a bottleneck.
(2) GPU (Graphics Processing Unit) scheme: the GPU decomposes the convolution operation into multiple instructions, mainly vector instructions, and implements it by combining and executing them. This scheme has the following disadvantages: the underlying program is complex, generally requiring multi-level loops; implementing the convolution by repeatedly combining vector instructions is inefficient; the GPU must access data many times, which increases both the computation time and the power consumption of the convolution; and the GPU cache is limited, so relatively large convolutions require moving data from off-chip many times, hurting efficiency.
Summary of the invention
This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the detailed description that follows. It is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, an embodiment of the present disclosure provides a matrix operation circuit, including:
a control circuit;
an arithmetic unit array, the arithmetic unit array including a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register;
the first input register being used to receive data of a first input matrix, and the second input register being used to receive data of a second input matrix;
the control circuit being used to receive a matrix operation instruction and, in response to the instruction, to control at least one of the plurality of arithmetic units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; and
the output register being used to store the result of the arithmetic operation.
Further, the instruction includes an instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
Further, the matrix operation instruction is a matrix multiplication instruction.
Further, the arithmetic unit includes an operator that includes at least a multiplier and an adder; the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
Further, in response to the matrix multiplication instruction, each arithmetic unit executing the matrix multiplication instruction:
reads the data of the first input matrix from the first input register;
reads the data of the second input matrix from the second input register;
calculates the product of the data of the first input matrix and the data of the second input matrix with the multiplier;
calculates the accumulated value of the products with the adder; and
stores the accumulated value in the output register.
Further, the matrix multiplication instruction is used to implement a matrix convolution operation, wherein:
the data of the first input matrix are row-vector data of the first input matrix; and
the data of the second input matrix are row-vector data of the second input matrix.
In a second aspect, an embodiment of the present disclosure provides a matrix calculation device, including:
a memory for storing the matrix operation instruction, the first input matrix, the second input matrix, and the output matrix;
an instruction-fetch module, connected to the memory, for obtaining the matrix operation instruction from the memory;
a decoding module, connected to the instruction-fetch module, for decoding the matrix operation instruction obtained by the instruction-fetch module;
a register for storing attribute data of the first input matrix, the second input matrix, and the output matrix; and
an execution module, connected to the decoding module, the memory, and the register, including the matrix operation circuit according to claims 1-6, for executing the decoded matrix operation instruction.
Further, the execution module obtains the decoded matrix operation instruction from the decoding module; the execution module obtains the attribute data of the first input matrix, the second input matrix, and the output matrix from the register; the execution module obtains the data of the first input matrix and the second input matrix used for calculation from the memory according to the attribute data of the first and second input matrices; the execution module calculates the data of the output matrix from the data of the first and second input matrices according to the decoded matrix operation instruction; and the execution module stores the data of the output matrix into the memory according to the attribute data of the output matrix.
Further, the attribute data of the first input matrix include the number of rows, the number of columns, and the row-vector interval of the first input matrix; the attribute data of the second input matrix include the number of rows, the number of columns, and the row-vector interval of the second input matrix; and the attribute data of the output matrix include the number of rows, the number of columns, and the row-vector interval of the output matrix.
Further, the execution module obtaining the data of the first input matrix and the second input matrix used for calculation from the memory according to their attribute data includes:
the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and
the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
Further, the first reading mode is reading by row or reading by column; the second reading mode is reading by row or reading by column.
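The two reading modes can be sketched as follows. This is a hypothetical behavioral model (the memory here is a flat Python list and the function name is ours): the attribute data — rows, columns, and row interval — suffice to address any element, and the preset mode only decides the iteration order.

```python
def read_matrix(mem, base, rows, cols, row_interval, by_row=True):
    """Fetch a matrix from flat memory using its attribute data.
    `row_interval` is the number of elements between consecutive row heads."""
    elem = lambda r, c: mem[base + r * row_interval + c]
    if by_row:                    # first/second reading mode: by row
        return [[elem(r, c) for c in range(cols)] for r in range(rows)]
    # reading mode: by column (same elements, transposed iteration order)
    return [[elem(r, c) for r in range(rows)] for c in range(cols)]

mem = list(range(20))
assert read_matrix(mem, 2, 2, 3, 5, by_row=True) == [[2, 3, 4], [7, 8, 9]]
assert read_matrix(mem, 2, 2, 3, 5, by_row=False) == [[2, 7], [3, 8], [4, 9]]
```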
In a third aspect, an embodiment of the present disclosure provides a matrix operation method based on the matrix operation circuit of any one of the foregoing first aspect, including:
fetching a matrix operation instruction from a memory;
decoding the matrix operation instruction, and sending the decoded matrix operation instruction to the matrix operation circuit; and
based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the second input matrix from the memory, performing the operation, and storing the operation result into the memory after the operation is completed.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that the processor, when running, implements any of the matrix operation methods of the third aspect.
In a fifth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to make a computer execute any of the matrix operation methods of the third aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute any of the matrix operation methods of the third aspect.
In a seventh aspect, an embodiment of the present disclosure provides a chip including the matrix operation circuit of any one of the first aspect.
In an eighth aspect, an embodiment of the present disclosure provides a computing device including the chip of the seventh aspect.
The embodiments of the present disclosure disclose a matrix operation circuit, device, and method. The matrix operation circuit includes: a control circuit; and an arithmetic unit array including a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register. The first input register is used to receive data of a first input matrix, and the second input register is used to receive data of a second input matrix. The control circuit is used to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of arithmetic units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, the instruction being a single instruction. The output register is used to store the result of the arithmetic operation. This method solves the prior-art problems of low computing efficiency and high power consumption when performing convolution calculations.
The above description is only an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented in accordance with the contents of the specification, and to make the above and other objectives, features, and advantages of the present disclosure more apparent and comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Description of the drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference signs denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Figure 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure;
Figure 2 is a schematic structural diagram of an arithmetic unit provided by an embodiment of the disclosure;
Figure 3 is a further schematic structural diagram of an arithmetic unit provided by an embodiment of the disclosure;
Figures 4a-4b are schematic diagrams of the convolution operation in an embodiment of the disclosure;
Figures 5-8 show the calculation process of a partial convolution provided by an embodiment of the disclosure;
Figure 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure;
Figures 10a-10d are schematic diagrams of the storage order and format of matrices in the disclosure;
Figures 11-15 are schematic diagrams of data overlap in convolution operations;
Figure 16 is a schematic diagram of a specific example of the convolution operation in an embodiment of the disclosure.
Detailed description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the following description.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
Figure 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure. As shown in Figure 1, the matrix operation circuit 100 includes a control circuit 101 and an arithmetic unit array 102, where the arithmetic unit array includes a plurality of arithmetic units (PU) 103. As shown in Figure 2, an arithmetic unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105, and an output register (Rout) 106. The first input register 104 is used to receive data of a first input matrix, and the second input register 105 is used to receive data of a second input matrix. The control circuit 101 is used to receive a matrix operation instruction and, in response to the instruction, control at least one arithmetic unit 103 of the plurality of arithmetic units 103 to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, where the instruction is a single instruction. The output register 106 is used to store the result of the arithmetic operation.
In the present disclosure, the single instruction includes an instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix. The following table shows an exemplary instruction format:
Figure PCTCN2019111878-appb-000001
The instruction name corresponds to the meaning, format, and operation of the instruction. The first address of the first input matrix and the first address of the second input matrix define the read addresses of the two source operands of the instruction, and the first address of the output matrix defines the storage address of the destination operand of the instruction. The above exemplary instruction is a matrix multiplication instruction, which implements the multiplication of two matrices; specifically:
With C = A × B, where A is the first input matrix and B is the second input matrix, each element of the output matrix is

c_in = ∑_j (a_ij × b_jn)
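The element-wise formula above can be illustrated with a short sketch (illustrative Python, not part of the disclosed hardware): each output element is a row-by-column multiply-accumulate.

```python
def matmul(a, b):
    """Multiply matrix a (m x p) by matrix b (p x n), element by element.

    Each output element c[i][n] is the multiply-accumulate of row i of a
    with column n of b, mirroring the formula c_in = sum_j(a_ij * b_jn).
    """
    m, p, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for col in range(n):
            c[i][col] = sum(a[i][j] * b[j][col] for j in range(p))
    return c
```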
FIG. 3 is a further schematic structural diagram of an arithmetic unit provided by an embodiment of the present disclosure. As shown in FIG. 3, in addition to the first input register, the second input register, and the output register, the arithmetic unit further includes an operator, which includes at least a multiplier 301 and an adder 302. The arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction so as to perform a matrix multiplication operation.
Specifically, in response to the matrix multiplication instruction, each arithmetic unit executing the instruction: reads the data of the first input matrix from the first input register; reads the data of the second input matrix from the second input register; calculates the product of the data of the first input matrix and the data of the second input matrix with the multiplier; calculates the accumulated value of the products with the adder; and stores the accumulated value in the output register. The data of the first input matrix in the first input register is the data read sequentially into the first input register according to the first address of the first input matrix; the data of the second input matrix in the second input register is the data read sequentially into the second input register according to the first address of the second input matrix.
For example, after the data is read in, the multiplier calculates the product of the data in the first and second input registers. Continuing the example above, if in one clock cycle the data received in the first input register is a_11 and the data received in the second input register is b_11, the multiplier in the arithmetic unit calculates the product a_11*b_11 and sends it to the adder for accumulation. Since the accumulator has no input from the previous clock cycle, its accumulation result is a_11*b_11. In the next clock cycle the calculation continues: the data received in the first input register is a_12 and the data received in the second input register is b_21, so the multiplier calculates the product a_12*b_21 and sends it to the adder; the accumulator's input from the previous clock cycle is a_11*b_11, so in this clock cycle it calculates a_11*b_11 + a_12*b_21. This operation continues until one row of the first input matrix and one column of the second input matrix have been processed, giving the final accumulated value, which is then stored in the output register. The control circuit stores the accumulated value in the output register into the system memory according to the first address of the output matrix in the instruction.
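The per-clock-cycle behavior described above can be modeled in software as follows (an illustrative toy model; register names follow the figures, but the class itself is not part of the disclosure):

```python
class ProcessingUnit:
    """Toy model of one PU: a multiplier feeding an accumulating adder."""

    def __init__(self):
        self.rout = 0  # output register Rout, holds the running accumulation

    def cycle(self, rin1, rin2):
        # one clock cycle: multiply the values in Rin1 and Rin2,
        # then add the product to the previous contents of Rout
        self.rout += rin1 * rin2
        return self.rout


# feed one row of the first matrix and one column of the second,
# one element pair per clock cycle
pu = ProcessingUnit()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    pu.cycle(a, b)
```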
In the present disclosure, the matrix multiplication instruction can implement a matrix convolution operation. The convolution of two matrices is the accumulated sum of the products of their point-to-point multiplication.
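The point-to-point multiply-and-accumulate just described can be sketched as (illustrative Python):

```python
def conv_point(a, b):
    """Point-to-point product of two equal-shaped matrices, accumulated into one value."""
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```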
An exemplary convolution operation is as follows:
FIGS. 4a-4b are schematic diagrams of the convolution operation in an embodiment of the present disclosure. FIG. 4a is an overall schematic diagram of the convolution process, with the notation explained as follows:
Win: width of the input feature map;
Hin: height of the input feature map;
Cin: number of channels of the input feature map, hereinafter referred to as the depth of the input feature map;
Kw: width of the convolution kernel;
Kh: height of the convolution kernel;
Wout: width of the output feature map;
Hout: height of the output feature map;
Cout: number of channels of the output feature map, hereinafter referred to as the depth of the output feature map.
The feature points on the input feature map constitute the first input matrix; the points on the convolution kernels constitute the second input matrix; the output feature map is the output matrix, and one feature point on the output feature map is one data item of the output matrix. During the convolution operation, the convolution kernel slides over the input feature map; at each sliding position it performs a multiply-accumulate with the corresponding data in the input feature map, extracting one output feature point, that is, one data item of the output matrix.
FIG. 4b is a schematic diagram of the calculation of one output feature point with depth. As shown in FIG. 4b, a convolution kernel slides over the input feature map; when it stops at a position, it multiplies and accumulates the corresponding data with the feature points of the input feature map at that position to obtain the output feature point corresponding to that position. There are Cout convolution kernels in total, and each kernel performs a multiply-accumulate with the feature points of the input feature map at the same position, yielding Cout output feature points in the depth direction. The Cout output feature points form one feature point with depth on the output feature map, the depth of this point being Cout. The kernels slide over the entire input feature map to obtain the entire output feature map.
For a convolution kernel at depth l (1 <= l <= Cout), its feature extraction formula is as follows:
Dout^l = ∑_{i=1..Cin} ∑_{j=1..Kw} ∑_{k=1..Kh} Din^i_{j,k} × w^{l,i}_{j,k}
Here Dout is a point with depth in the output feature map, with its superscript l corresponding to the output depth; Din is the data in the input feature map covered by the convolution kernel, with its superscript i corresponding to the depth of the input feature map, and j and k corresponding to the width and height positions within the kernel; w is the convolution kernel, with its superscripts l and i corresponding respectively to the depth of the output feature map and the depth of the input feature map, and j and k corresponding to the kernel's width and height.
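The feature extraction formula above can be written out directly as a triple sum (illustrative Python; the index layout din[i][k][j] is an assumption made for the sketch):

```python
def extract_point(din, w):
    """Feature extraction for one kernel at one position.

    din[i][k][j]: input patch (depth i, height k, width j) covered by the kernel;
    w[i][k][j]:   one convolution kernel, indexed the same way.
    Returns the scalar Dout for this kernel and position.
    """
    return sum(
        din[i][k][j] * w[i][k][j]
        for i in range(len(w))
        for k in range(len(w[0]))
        for j in range(len(w[0][0]))
    )
```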
For a convolution kernel of size Kh*Kw*Cin, the kernel can be divided into Kh partial kernels of size Kw*Cin for partial feature extraction. Each partial kernel implements 1/Kh of the whole feature extraction, namely the features corresponding to one Kw*Cin partial kernel, giving the partial result

Dout^l_k = ∑_{i=1..Cin} ∑_{j=1..Kw} Din^i_{j,k} × w^{l,i}_{j,k}

Finally, adding these Kh partial results gives the final result

Dout^l = ∑_{k=1..Kh} Dout^l_k

Each partial result Dout^l_k can in turn be divided into Kw steps, where each step implements

∑_{i=1..Cin} Din^i_{j,k} × w^{l,i}_{j,k}

and adding the Kw partial results then gives Dout^l_k. The implementation of one such step is the multiplication of a one-row input data matrix (i.e., part of the first input matrix) by a one-column weight matrix (i.e., part of the convolution kernel), as shown in FIG. 5. The implementation of Dout^l_k is likewise the multiplication of a one-row input data matrix by a one-column weight matrix, except that the number of data items in the row and in the column is Kw times that of a single step, as shown in FIG. 6. Since the number of convolution kernels is Cout, the depth of an output feature point is Cout: a one-row input data matrix can be multiplied by the Cout-column kernel matrix, i.e., the weight matrix, formed from the Cout convolution kernels, yielding one feature point with depth. This feature point is a vector whose length is the output depth Cout, as shown in FIG. 7.
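The row-times-weight-matrix step of FIG. 7 can be sketched as follows (illustrative Python; the row holds the Kw*Cin input values at one kernel position, and each weight column holds one partial kernel):

```python
def feature_point(input_row, weight_cols):
    """Multiply one input row by a Cout-column weight matrix.

    input_row:   flat list of Kw*Cin input values at one kernel position
    weight_cols: weight_cols[c] is the matching flat column for output depth c
    Returns a vector of length Cout -- one output feature point with depth.
    """
    return [sum(x * w for x, w in zip(input_row, col)) for col in weight_cols]
```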
Furthermore, since the process by which a neural network implements convolution or partial convolution is the sliding of the kernel or partial kernel over the input feature map, it can be viewed as a process in which the input feature map data changes with the sliding while the weights remain unchanged. The convolution process of the neural network thus becomes the multiplication of a Wout-row input data matrix by a Cout-column weight matrix, yielding a Wout-row output data matrix, as shown in FIG. 8.
In the above process, Kw of the Kh*Kw points of the whole convolution kernel have been computed. Since the whole kernel is divided into Kh parts of Kw points each, the final result is obtained by adding the Kh partial results, giving the true convolution result.
In the present disclosure, the above convolution operation needs only a single instruction, namely the above matrix multiplication instruction, to complete the entire convolution process; the order in which data is read in simply needs to be set in advance by the upper-level program. Specifically, the matrix multiplication instruction is used to implement the matrix convolution operation, where the data of the first input matrix is the row vector data of the first input matrix and the data of the second input matrix is the row vector data of the second input matrix. In other words, the data in the first input register and the data in the second input register must both be row vector data of their matrices, so that the computed result is the convolution result. FIG. 16 is a schematic diagram of the above convolution calculation, taking as an example a first input matrix of size 4*4*2 and a second input matrix of two 3*3*2 convolution kernels with a stride of 1; the output matrix is a 2*2*2 matrix. In the calculation, the partial convolution method is used to first compute the intermediate values of each point of the output matrix. As shown in FIG. 16, when partial convolution is performed, one row of the second input matrix slides over the first input matrix. The data read from the first input matrix is shown at 1601; each row is the data of the first input matrix corresponding to one position of the second input matrix 1602, comprising 3 numbers with depth, 6 data items in total. One column in 1602 is one row of the second input matrix including depth, also 3 numbers and 6 data items in total. One row of data in 1601 and one column of data in 1602 are multiplied and accumulated to obtain one point in 1603; a point in 1603 is the result of a partial convolution. In this example, the result of a partial convolution is a part of the value of one point in the output matrix. Finally, the results computed by sliding each of the three rows of the second input matrix over the first input matrix are accumulated to obtain the value of one point in the output matrix (one of the two values of a point with depth 2).
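The dimensions of this example can be checked against a direct reference computation (illustrative Python, not part of the disclosed hardware; a 4*4*2 input of ones and two 3*3*2 kernels of ones, stride 1):

```python
def conv(inp, kernels, stride=1):
    """Direct convolution. inp[h][w][c]; kernels[l][kh][kw][c]; returns out[h][w][l]."""
    hin, win, cin = len(inp), len(inp[0]), len(inp[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    hout = (hin - kh) // stride + 1
    wout = (win - kw) // stride + 1
    out = [[[0.0] * len(kernels) for _ in range(wout)] for _ in range(hout)]
    for y in range(hout):
        for x in range(wout):
            for l, ker in enumerate(kernels):
                # multiply-accumulate over the kernel window, including depth
                out[y][x][l] = sum(
                    inp[y * stride + i][x * stride + j][c] * ker[i][j][c]
                    for i in range(kh) for j in range(kw) for c in range(cin)
                )
    return out


# 4*4*2 input of ones, two 3*3*2 kernels of ones, stride 1 -> 2*2*2 output
inp = [[[1, 1] for _ in range(4)] for _ in range(4)]
kernels = [[[[1, 1] for _ in range(3)] for _ in range(3)] for _ in range(2)]
out = conv(inp, kernels)
```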
With reference to the PU array shown in FIGS. 1-3, the following briefly explains how the PU array implements the convolution calculation.
The value 1 of the first input matrix data 1601 in FIG. 16 is sent to Rin1 of PU11 and Rin1 of PU12, and the value 3 of the data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.1 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 1.9 of 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform data multiplication and store the results in their output registers. In the next clock cycle, the value 2 of the data 1601 is sent to Rin1 of PU11 and Rin1 of PU12, and the value 4 of the data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.2 of 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 2.0 of 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform this data multiplication and accumulate the results with the previously saved results in their output registers.
Continuing in this way, the multiply-accumulate result of the partial convolution is finally obtained.
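The two clock cycles walked through above can be traced numerically (an illustrative sketch; the PU naming follows the text, the dictionary model is an assumption):

```python
# accumulators for the four PUs of a 2x2 array
pu = {"PU11": 0.0, "PU12": 0.0, "PU21": 0.0, "PU22": 0.0}

def step(rin1_row1, rin1_row2, rin2_col1, rin2_col2):
    # one clock cycle: each PU multiplies its Rin1 and Rin2 values
    # and accumulates the product into its output register
    pu["PU11"] += rin1_row1 * rin2_col1
    pu["PU12"] += rin1_row1 * rin2_col2
    pu["PU21"] += rin1_row2 * rin2_col1
    pu["PU22"] += rin1_row2 * rin2_col2

step(1, 3, 0.1, 1.9)  # cycle 1: values 1 and 3 from 1601, 0.1 and 1.9 from 1602
step(2, 4, 0.2, 2.0)  # cycle 2: values 2 and 4 from 1601, 0.2 and 2.0 from 1602
```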
FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the present disclosure. As shown in FIG. 9, the matrix calculation device 900 includes: a memory 901 for storing matrix operation instructions, the first input matrix, the second input matrix, and the output matrix; an instruction fetch module 902, connected to the memory 901, for obtaining the matrix operation instruction from the memory 901; a decoding module 903, connected to the instruction fetch module 902, for decoding the matrix operation instruction obtained by the instruction fetch module 902; a register 904 for storing attribute data of the first input matrix, the second input matrix, and the output matrix; and an execution module 905, connected to the decoding module 903, the memory 901, and the register 904, and including the matrix operation circuit of the above embodiments, for executing the decoded matrix operation instruction.
In one embodiment, the execution module obtains the decoded matrix operation instruction from the decoding module; the execution module obtains the attribute data of the first input matrix, the attribute data of the second input matrix, and the attribute data of the output matrix from the register; the execution module obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first and second input matrices; the execution module calculates the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction to obtain the data of the output matrix; and the execution module stores the data of the output matrix in the memory according to the attribute data of the output matrix. The attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix; the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix; and the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix. The numbers of rows and columns define the size of the matrix, and the row vector interval defines the storage address difference between two adjacent rows of the matrix. For example, if each row of a matrix has 10 int8 matrix elements and the rows are stored contiguously, the row vector interval is 10 bytes; if two rows are stored at a larger interval, for example 20 bytes, then 10 bytes hold matrix elements and the other 10 bytes do not belong to this matrix, being either invalid data or data used for other purposes.
Optionally, the execution module obtaining the data of the first input matrix and the data of the second input matrix for calculation from the memory according to the attribute data of the first input matrix and the second input matrix includes: the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix. The first reading mode is reading by row or reading by column, and the second reading mode is reading by row or reading by column.
For example, if the attribute data defines the first input matrix as having 5 rows and 5 columns with a row interval of 5 bytes, and the preset reading mode is reading by row, then the first row of the first input matrix is read according to the first address of the first input matrix in the instruction and the row interval; from the number of columns it is known that this row has 5 matrix elements. The first address plus the row interval is then taken as the new first address to read the second row, again 5 matrix elements. Reading 5 times in this way, the execution module obtains all the data of the first input matrix.
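The row-by-row read procedure can be sketched as follows (illustrative; addresses are element offsets into a flat memory array):

```python
def read_matrix(mem, base, rows, cols, row_interval):
    """Read a matrix row by row from flat memory.

    Each row starts row_interval elements after the previous row's start,
    so the first address is advanced by row_interval between reads.
    """
    return [mem[base + r * row_interval: base + r * row_interval + cols]
            for r in range(rows)]


# 2 rows of 3 elements, rows spaced 5 elements apart, starting at address 0
mem = list(range(30))
m = read_matrix(mem, 0, 2, 3, 5)
```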
Similarly, the data of the second input matrix is obtained in the same way. Optionally, the execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module storing the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix. The preset storage mode is storing by row or storing by column; the specific storage process is similar to reading, only in the opposite direction, and will not be repeated here.
FIG. 10a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure. As shown in FIG. 10a, for the first input matrix of the above embodiment, when stored in the memory it is stored with the depth Cin first, then the width Win, and finally the height Hin. Taking Cin=2, Win=3, Hin=3 as an example, the first point of the first of the Hin=3 rows is stored first; because of the depth, this one point contains 2 data items. The second point, also containing 2 data items, is stored next, and so on through the third point, so that one row contains 3*2=6 data items covering the 3 points in the Win direction. After that, the first point of the second of the Hin=3 rows is stored, and storage continues in this way until all points have been stored. An example of the storage order and format of the first input matrix is shown in FIG. 10b.
FIG. 10c is a schematic diagram of the storage order and format of the second input matrix of the present disclosure. As shown in FIG. 10c, for the second input matrix of the above embodiment, when stored in the memory it is stored row by row with Cout (the number of convolution kernels) first: each column stores one convolution kernel, and in the column direction the storage is ordered by kernel depth Cin first, then width Kw, then height Kh. Taking Cout=2, Cin=2, Kw=2, Kh=2 as an example, a total of Cin*Kw*Kh=8 rows of data are stored, with 2 data items per row. First, the first data item of the depth-2 point in the first row and first column of the first kernel is stored, then the first data item of the depth-2 point in the first row and first column of the second kernel, completing the storage of the first row of the second input matrix. Next, the second data item of the depth-2 point in the first row and first column of the first kernel is stored, then the second data item of the corresponding point of the second kernel, completing the storage of the second row of the second input matrix; storage continues in this way until all points have been stored. An example of the storage order and format of the second input matrix is shown in FIG. 10d.
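The depth-first (Cin, then Win, then Hin) storage order of the first input matrix can be expressed as a simple flattening rule (illustrative):

```python
def flatten_feature_map(fm):
    """Flatten fm[h][w][c] with depth varying fastest, then width, then height."""
    return [v for row in fm for point in row for v in point]


# Cin=2, Win=3, Hin=2 example: each point stores its 2 depth values contiguously
fm = [[[1, 2], [3, 4], [5, 6]],
      [[7, 8], [9, 10], [11, 12]]]
flat = flatten_feature_map(fm)
```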
RS1 shown in FIG. 10a is the first address of the first input matrix. The matrix data that is read in can be controlled by setting the storage format and the reading mode: with storage in row order as in the above example, together with reading by row and the row interval, the row vector data of the first input matrix can be read out. RS2 shown in FIG. 10c is the first address of the second input matrix; likewise, with storage in the order of the above example, together with reading by row and the row interval, the row vector data of the second input matrix can be read out. After the data of the first and second input matrices is read out, the single-instruction convolution operation is completed using the above matrix multiplication instruction.
Before the above matrix multiplication instruction is called, the attribute data of the first input matrix, the second input matrix, and the output matrix may be set. In the present disclosure, multiple registers are defined to store the attribute data of each matrix. An example register configuration is shown in the following table:
Register name | Register function description (31:0)
Shape1 | 31:16 (number of columns of the first input matrix, i.e., matrix width); 15:0 (number of rows of the first input matrix, i.e., matrix length)
Shape2 | 31:16 (number of columns of the second input matrix, i.e., matrix width); 15:0 (number of rows of the second input matrix, i.e., matrix length)
Stride1 | 15:0 (row interval of the output matrix, i.e., the number of data items between the head of one row and the head of the next row; the same below)
Stride2 | 31:16 (row interval of the second input matrix); 15:0 (row interval of the first input matrix)
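The 31:16 / 15:0 field layout above can be packed and unpacked with simple shifts and masks (an illustrative sketch; the function names are hypothetical):

```python
def pack_fields(high, low):
    """Pack two 16-bit values into one 32-bit register: bits 31:16 and 15:0."""
    assert 0 <= high < (1 << 16) and 0 <= low < (1 << 16)
    return (high << 16) | low

def unpack_fields(reg):
    """Return (bits 31:16, bits 15:0) of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF, reg & 0xFFFF


# Shape1 example: 5 columns (width) in bits 31:16, 7 rows (length) in bits 15:0
shape1 = pack_fields(5, 7)
```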
FIGS. 11-15 are schematic diagrams of data overlap in the convolution operation. As shown in FIG. 11, in a convolution operation the data used to compute one output feature point often partially overlaps the data used to compute the next output feature point. In FIG. 11, taking convolution with a 3*3 kernel and a stride (the distance the kernel slides each time) of 1 as an example, the second point of the second row is computed; after that, as shown in FIG. 12, the kernel slides one point to the right to compute the third point of the second row. From the perspective of the sliding over the input data, the input data used for the two points partially overlaps, the overlapping part being shown in gray in FIG. 13. When only partial convolution is considered, as shown in FIG. 14, the gray part is the data that overlaps between the two successive computations. This means that, as the kernel slides, two-thirds of the input matrix data is repeated between the convolution computations at two adjacent positions.
Assume the partial-convolution input data is matrix one, with the data in matrix one denoted Din^z_{x,y}, where x, y, and z are coordinates in the coordinate system of FIG. 14; assume the weights of the partial convolution form matrix two, with the data in matrix two denoted w^{l,z}_{x,y}, where w has one additional dimension, Cout, so one more index is added to its superscript; and let the data in the output matrix be denoted Dout^l. The matrix calculation process is shown in FIG. 15. When computing the second point of the second row of the output feature map, the data of the second row of matrix one is multiplied and accumulated with each column of matrix two, giving a second point with depth on the output feature map; the depth of this point is 8, represented in FIG. 15 as the second row of the output matrix. This is one point of the output feature map whose depth is Cout, with each data item along the depth corresponding in order to one data item of the output matrix. When computing the third point of the second row of the output feature map, the data of the third row of matrix one is multiplied and accumulated with each column of matrix two, giving a third point with a depth of 8. As can be seen from FIG. 15, when computing the second point with depth and the third point with depth of the second row, the 8 data items in the gray part overlap; if the data were stored in the form of matrix one, a large amount of memory would be wasted.
The problem in the above computation can be solved by setting the row interval described above. Taking the above embodiment as an example, the number of data items in one row of the first input matrix is 4*3=12 (4 being the depth of the first input matrix and 3 the number of columns of the second data matrix). If, before the matrix multiplication instruction is used, the row interval represented by bits 15:0 of the register Stride2 is set to 4, then between two rows 12-4=8 data items are repeatedly fed into the matrix calculation device. In this way, one instruction implements the whole partial convolution process, and the repeated data does not need to be stored repeatedly, saving memory space. At the same time, the output matrix register is set so that the output data is also stored sequentially, in the same order as the input data; if a further convolution or other operation follows, the data shape does not need to be adjusted and the data can be used directly, saving the time and power consumption that data adjustment would require.
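The overlap achieved by setting the row interval smaller than the row length can be checked numerically (illustrative; addresses are at element granularity):

```python
def read_overlapping_rows(mem, base, n_rows, row_len, row_interval):
    # consecutive rows start row_interval elements apart; when
    # row_interval < row_len, adjacent rows share row_len - row_interval elements
    return [mem[base + r * row_interval: base + r * row_interval + row_len]
            for r in range(n_rows)]


# rows of 12 elements with a row interval of 4: adjacent rows repeat 12-4=8 elements
mem = list(range(40))
rows = read_overlapping_rows(mem, 0, 2, 12, 4)
```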
An embodiment of the present disclosure further provides a matrix operation method based on any of the foregoing matrix operation circuits, comprising: fetching a matrix operation instruction from a memory; decoding the matrix operation instruction and sending the decoded instruction to the matrix operation circuit; and, based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the data of the second input matrix from the memory, performing the operation, and storing the operation result in the memory after the operation is completed.
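The fetch-decode-execute flow of this method can be sketched as a minimal software model (all class, field, and address names below are illustrative assumptions, not taken from the disclosure):

```python
# Minimal software model of the fetch-decode-execute flow described
# above: fetch an instruction from memory, decode it, then read both
# input matrices, compute, and write the result back to memory.

class MatrixUnitModel:
    def __init__(self, memory):
        self.memory = memory  # dict: address -> instruction or matrix

    def fetch(self, pc):
        return self.memory[pc]  # fetch the matrix operation instruction

    def decode(self, instr):
        # the instruction carries the operation name and the head
        # addresses of the two inputs and the output (hypothetical format)
        op, a_addr, b_addr, out_addr = instr
        return op, a_addr, b_addr, out_addr

    def execute(self, decoded):
        op, a_addr, b_addr, out_addr = decoded
        A, B = self.memory[a_addr], self.memory[b_addr]
        if op == "matmul":
            inner, cols = len(B), len(B[0])
            C = [[sum(A[i][k] * B[k][j] for k in range(inner))
                  for j in range(cols)] for i in range(len(A))]
            self.memory[out_addr] = C  # store the result back to memory

mem = {0: ("matmul", 100, 200, 300),
       100: [[1, 2], [3, 4]],
       200: [[5, 6], [7, 8]]}
unit = MatrixUnitModel(mem)
unit.execute(unit.decode(unit.fetch(0)))
print(mem[300])  # [[19, 22], [43, 50]]
```

The three method steps map directly to the three calls in the last line.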
An embodiment of the present disclosure further provides an electronic device, comprising: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that, when running, the processors implement the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product comprising computer instructions that, when executed by a computing device, cause the computing device to execute the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a chip comprising the matrix operation circuit of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computing apparatus comprising the chip of any of the foregoing embodiments.
The flowcharts and block diagrams in the drawings of the present disclosure illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (12)

  1. A matrix operation circuit, comprising:
    a control circuit; and
    an arithmetic unit array comprising a plurality of arithmetic units, each arithmetic unit comprising a first input register, a second input register, and an output register;
    wherein the first input register is configured to receive data of a first input matrix, and the second input register is configured to receive data of a second input matrix;
    the control circuit is configured to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of arithmetic units to perform an operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; and
    the output register is configured to store a result of the operation.
  2. The matrix operation circuit of claim 1, wherein the instruction comprises an instruction name, a head address of the first input matrix, a head address of the second input matrix, and a head address of an output matrix.
  3. The matrix operation circuit of claim 1 or 2, wherein:
    the matrix operation instruction is a matrix multiplication instruction.
  4. The matrix operation circuit of claim 3, wherein:
    the arithmetic unit comprises an arithmetic element including at least a multiplier and an adder; and
    the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  5. The matrix operation circuit of claim 4, wherein, in response to the matrix multiplication instruction, at least one arithmetic unit executing the matrix multiplication instruction:
    reads the data of the first input matrix from the first input register;
    reads the data of the second input matrix from the second input register;
    computes, by the multiplier, a product of the data of the first input matrix and the data of the second input matrix;
    computes, by the adder, an accumulated value of the products; and
    stores the accumulated value in the output register.
  6. The matrix operation circuit of claim 5, wherein the matrix multiplication instruction is used to implement a matrix convolution operation, wherein:
    the data of the first input matrix is row-vector data of the first input matrix; and
    the data of the second input matrix is row-vector data of the second input matrix.
  7. A matrix computing apparatus, comprising:
    a memory configured to store a matrix operation instruction, a first input matrix, a second input matrix, and an output matrix;
    an instruction-fetch module connected to the memory and configured to obtain the matrix operation instruction from the memory;
    a decoding module connected to the instruction-fetch module and configured to decode the matrix operation instruction obtained by the instruction-fetch module;
    a register configured to store attribute data of the first input matrix, the second input matrix, and the output matrix; and
    an execution module connected to the decoding module, the memory, and the register, comprising the matrix operation circuit of any one of claims 1-6, and configured to execute the decoded matrix operation instruction.
  8. The matrix computing apparatus of claim 7, wherein:
    the execution module obtains the decoded matrix operation instruction from the decoding module;
    the execution module obtains the attribute data of the first input matrix, the attribute data of the second input matrix, and the attribute data of the output matrix from the register;
    the execution module obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first input matrix and the second input matrix;
    the execution module calculates the data of the output matrix from the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction; and
    the execution module stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  9. The matrix computing apparatus of claim 8, wherein:
    the attribute data of the first input matrix includes the number of rows, the number of columns, and the row-vector interval of the first input matrix;
    the attribute data of the second input matrix includes the number of rows, the number of columns, and the row-vector interval of the second input matrix; and
    the attribute data of the output matrix includes the number of rows, the number of columns, and the row-vector interval of the output matrix.
  10. The matrix computing apparatus of claim 8, wherein obtaining, by the execution module, the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first input matrix and the second input matrix comprises:
    the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and
    the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  11. The matrix computing apparatus of claim 10, wherein:
    the first reading mode is reading by row or reading by column; and
    the second reading mode is reading by row or reading by column.
  12. A matrix operation method based on the matrix operation circuit of any one of claims 1 to 6, comprising:
    fetching a matrix operation instruction from a memory;
    decoding the matrix operation instruction and sending the decoded matrix operation instruction to the matrix operation circuit; and
    based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the data of the second input matrix from the memory, performing the operation, and storing an operation result in the memory after the operation is completed.
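The multiply-accumulate sequence recited in claim 5 above (read both input registers, multiply, accumulate, store in the output register) can be sketched as follows; the register and class names are illustrative only:

```python
# Sketch of the per-unit multiply-accumulate sequence of claim 5:
# read both input registers, multiply, accumulate, store the running
# sum in the output register.

class MACUnit:
    def __init__(self):
        self.in_reg1 = 0   # first input register (first input matrix data)
        self.in_reg2 = 0   # second input register (second input matrix data)
        self.out_reg = 0   # output register (accumulator)

    def step(self, a, b):
        self.in_reg1, self.in_reg2 = a, b        # receive matrix data
        product = self.in_reg1 * self.in_reg2    # multiplier
        self.out_reg += product                  # adder accumulates
        return self.out_reg

mac = MACUnit()
for a, b in zip([1, 2, 3], [4, 5, 6]):  # one row against one column
    result = mac.step(a, b)
print(result)  # 1*4 + 2*5 + 3*6 = 32
```

Repeating this step across a full row of the first matrix and a full column of the second matrix produces one element of the output matrix.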
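The two preset reading modes of claims 10 and 11 above (reading by row or reading by column) can be sketched as two access patterns over the same stored matrix; the function names are illustrative:

```python
# Sketch of the two preset reading modes of claims 10-11: reading a
# stored matrix either row by row or column by column.

def read_by_row(m):
    # emit the matrix exactly as stored, one row at a time
    return [list(row) for row in m]

def read_by_col(m):
    # emit one column at a time by transposing the access pattern
    return [list(col) for col in zip(*m)]

M = [[1, 2, 3],
     [4, 5, 6]]
print(read_by_row(M))  # [[1, 2, 3], [4, 5, 6]]
print(read_by_col(M))  # [[1, 4], [2, 5], [3, 6]]
```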

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980101046.3A CN114503126A (en) 2019-10-18 2019-10-18 Matrix operation circuit, device and method
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Publications (1)

Publication Number Publication Date
WO2021072732A1 true WO2021072732A1 (en) 2021-04-22

Family

ID=75537332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Country Status (2)

Country Link
CN (1) CN114503126A (en)
WO (1) WO2021072732A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784973A (en) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 Convolution operation circuit, device and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630242A (en) * 2009-07-28 2010-01-20 苏州国芯科技有限公司 Contribution module for rapidly computing self-adaptive code book by G723.1 coder
CN102065309A (en) * 2010-12-07 2011-05-18 青岛海信信芯科技有限公司 DCT (Discrete Cosine Transform) realizing method and circuit
CN109416756A (en) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 Acoustic convolver and its applied artificial intelligence process device
US20190179776A1 (en) * 2017-09-15 2019-06-13 Mythic, Inc. System and methods for mixed-signal computing
CN110197274A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792868A (en) * 2021-09-14 2021-12-14 绍兴埃瓦科技有限公司 Neural network computing module, method and communication device
CN113807509A (en) * 2021-09-14 2021-12-17 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113807509B (en) * 2021-09-14 2024-03-22 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113792868B (en) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 Neural network computing module, method and communication equipment
CN114723034A (en) * 2022-06-10 2022-07-08 之江实验室 Separable image processing neural network accelerator and acceleration method
CN114723034B (en) * 2022-06-10 2022-10-04 之江实验室 Separable image processing neural network accelerator and acceleration method

Also Published As

Publication number Publication date
CN114503126A (en) 2022-05-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1