WO2021072732A1 - Matrix computing circuit, apparatus and method


Info

Publication number
WO2021072732A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
input
data
instruction
input matrix
Prior art date
Application number
PCT/CN2019/111878
Other languages
French (fr)
Chinese (zh)
Inventor
罗飞 (Luo Fei)
王维伟 (Wang Weiwei)
Original Assignee
北京希姆计算科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Priority to CN201980101046.3A (published as CN114503126A)
Priority to PCT/CN2019/111878
Publication of WO2021072732A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a matrix operation circuit, apparatus, and method.
  • Chips are the cornerstone of data processing; they fundamentally determine the ability to process data.
  • There are two main chip routes. One is the general-purpose chip route, such as the CPU (Central Processing Unit), which provides great flexibility but whose efficiency when processing algorithms in specific fields is relatively low.
  • The other is the dedicated chip route, such as the TPU (Tensor Processing Unit).
  • CPU scheme: with a single-core CPU, the matrix is broken down into scalars, and the convolution operation is realized by combining scalar instructions; with a multi-core CPU, multiple cores may execute scalar instructions in parallel and combine them to implement the convolution operation.
  • GPU (Graphics Processing Unit) scheme: the GPU breaks the convolution operation down into multiple instructions, mainly vector instructions, and realizes the convolution by combining and executing them.
  • This solution has the following disadvantages: the underlying program is complex, generally requiring multiple nested loops to implement a convolution; realizing the convolution through many combined vector instructions is inefficient; the GPU must access data many times, which increases both the computation time and the power consumption of the convolution operation; and the GPU cache is limited, so a relatively large convolution requires multiple transfers from off-chip memory, which further reduces efficiency.
  • an embodiment of the present disclosure provides a matrix operation circuit, including:
  • an operation unit array comprising a plurality of operation units, each operation unit including a first input register, a second input register, and an output register;
  • the first input register is used to receive data of a first input matrix;
  • the second input register is used to receive data of a second input matrix;
  • a control circuit configured to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of operation units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction;
  • the output register is used to store the result of the arithmetic operation.
  • the instruction includes the instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • the matrix operation instruction is a matrix multiplication instruction.
  • the operation unit includes an arithmetic unit comprising at least a multiplier and an adder; the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the accumulated value is stored in the output register.
  • the matrix multiplication instruction can be used to implement a matrix convolution operation, wherein:
  • the data of the first input matrix is row vector data of the first input matrix
  • the data of the second input matrix is row vector data of the second input matrix.
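The multiply-accumulate behavior of a single operation unit described above can be sketched in software. This is a toy model for illustration only; the class and method names are invented, not from the disclosure:

```python
class ProcessingUnit:
    """Toy software model of one operation unit (PU): two input
    registers and an accumulating output register."""

    def __init__(self):
        self.rin1 = 0  # receives data of the first input matrix
        self.rin2 = 0  # receives data of the second input matrix
        self.rout = 0  # stores the accumulated operation result

    def load(self, a, b):
        self.rin1, self.rin2 = a, b

    def step(self):
        # one clock cycle: multiply the two inputs and accumulate
        self.rout += self.rin1 * self.rin2
        return self.rout


# Dot product of one row of A with one column of B, one element per "cycle"
row_a = [1, 2, 3]
col_b = [4, 5, 6]
pu = ProcessingUnit()
for a, b in zip(row_a, col_b):
    pu.load(a, b)
    pu.step()
# pu.rout == 1*4 + 2*5 + 3*6 == 32
```

Each call to `step` plays the role of one clock cycle; the output register keeps the running sum until a full row-by-column product is accumulated.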
  • a matrix calculation device including:
  • a memory for storing matrix operation instructions, a first input matrix, a second input matrix, and an output matrix
  • An instruction fetching module connected to the memory, and configured to obtain the matrix operation instruction from the memory
  • a decoding module connected to the instruction fetching module, and configured to decode the matrix operation instructions acquired by the instruction fetching module
  • a register for storing attribute data of the first input matrix, the second input matrix, and the output matrix
  • an execution module connected to the decoding module, the memory, and the register, and including the matrix operation circuit according to claims 1-6, which is used to execute the decoded matrix operation instruction.
  • the execution module obtains the decoded matrix operation instruction from the decoding module; it obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; according to the attribute data of the two input matrices, it obtains the data of the first input matrix and of the second input matrix from the memory for calculation; it then operates on these data according to the decoded matrix operation instruction to obtain the data of the output matrix; finally, it stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • the attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix;
  • the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix;
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix.
  • the execution module acquiring the data of the first input matrix and of the second input matrix from the memory according to their attribute data includes:
  • the execution module reads the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix;
  • the execution module reads the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  • the first reading mode is reading by row or by column; the second reading mode is reading by row or by column.
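A software sketch may clarify the reading modes. The function below is hypothetical (the names `read_matrix`, `row_interval`, and `by_row` are invented for illustration); it gathers a matrix from flat memory given the attribute data, reading either by row or by column:

```python
def read_matrix(memory, base, rows, cols, row_interval, by_row=True):
    """Sketch: gather a rows x cols matrix from flat 'memory'.

    'row_interval' is the address difference between the heads of two
    adjacent rows; it may exceed 'cols' when rows are stored with
    padding between them.
    """
    mat = [[memory[base + r * row_interval + c] for c in range(cols)]
           for r in range(rows)]
    if by_row:
        return mat                               # row reading
    return [list(col) for col in zip(*mat)]      # column reading


# A 2x3 matrix stored with a row interval of 5 (2 padding slots per row)
mem = [1, 2, 3, -1, -1,
       4, 5, 6, -1, -1]
assert read_matrix(mem, 0, 2, 3, 5) == [[1, 2, 3], [4, 5, 6]]
assert read_matrix(mem, 0, 2, 3, 5, by_row=False) == [[1, 4], [2, 5], [3, 6]]
```

Note how the row interval (5) larger than the column count (3) simply skips the padding, which is the point of storing the interval as separate attribute data.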
  • embodiments of the present disclosure provide a matrix operation method based on the matrix operation circuit of any one of the foregoing first aspects, including:
  • based on the decoded matrix operation instruction, the matrix operation circuit obtains the data of the first input matrix and of the second input matrix from the memory, performs the operation, and stores the operation result in the memory after the operation is completed.
  • an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to execute the computer-readable instructions so that, when they run, the processor implements any one of the matrix operation methods of the third aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions which, when executed, cause a computer to perform any of the matrix operation methods of the foregoing third aspect.
  • embodiments of the present disclosure provide a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform any of the matrix operation methods of the foregoing third aspect.
  • an embodiment of the present disclosure provides a chip comprising the matrix operation circuit described in any one of the first aspects.
  • an embodiment of the present disclosure provides a computing device including the chip described in the seventh aspect.
  • the embodiments of the present disclosure thus disclose a matrix operation circuit, apparatus, and method.
  • the matrix operation circuit includes: a control circuit; an operation unit array comprising a plurality of operation units, each with a first input register, a second input register, and an output register; the first input register receives the data of the first input matrix, and the second input register receives the data of the second input matrix; the control circuit receives a matrix operation instruction and, in response, controls at least one of the operation units to perform an arithmetic operation on the first and second input matrices as directed by the instruction, wherein the instruction is a single instruction; the output register stores the result of the arithmetic operation.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure.
  • FIG. 2 is a schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • FIG. 3 is a further schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the present disclosure.
  • FIGS. 5-8 illustrate the calculation process of partial convolution provided by embodiments of the disclosure.
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • FIGS. 10a-10d are schematic diagrams of the storage order and format of the matrices in the disclosure.
  • FIGS. 11-15 are schematic diagrams of data overlap in convolution operations.
  • FIG. 16 is a schematic diagram of a specific example of the convolution operation in an embodiment of the disclosure.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure.
  • the matrix operation circuit 100 includes a control circuit 101 and an arithmetic unit array 102.
  • the operation unit array includes a plurality of operation units (PUs) 103.
  • each operation unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105, and an output register (Rout) 106, wherein the first input register 104 is used to receive the data of the first input matrix and the second input register 105 is used to receive the data of the second input matrix;
  • the control circuit 101 is configured to receive a matrix operation instruction and, in response to the instruction, control at least one operation unit 103 of the plurality of operation units 103 to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; the output register 106 is used to store the result of the arithmetic operation.
  • the single instruction includes the instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • the following table shows an exemplary instruction format:
  • instruction name | first address of first input matrix (RS1) | first address of second input matrix (RS2) | first address of output matrix
  • the instruction name determines the meaning, format, and operation of the instruction.
  • the first address of the first input matrix and the first address of the second input matrix define the read addresses of the two source operands of the instruction, and the first address of the output matrix defines the storage address of the destination operand of the instruction.
  • the above exemplary instruction is a matrix multiplication instruction, which implements the multiplication of two matrices, specifically:
  • FIG. 3 is a further schematic structural diagram of an operation unit provided by an embodiment of the disclosure.
  • in addition to the first input register, the second input register, and the output register, the operation unit also includes an arithmetic unit.
  • the arithmetic unit includes at least a multiplier 301 and an adder 302, and is used to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the data of the first input matrix in the first input register is read sequentially into the first input register starting from the first address of the first input matrix; likewise, the data of the second input matrix in the second input register is read sequentially into the second input register starting from the first address of the second input matrix.
  • the multiplier is used to calculate the product of the data in the first input register and the data in the second input register.
  • the multiplier in the arithmetic unit calculates the product a11*b11 and sends it to the adder for accumulation, so the accumulation result after this cycle is a11*b11.
  • in the next clock cycle the calculation continues: the first input register receives a12 and the second input register receives b21; the multiplier calculates the product a12*b21 and sends it to the adder for accumulation.
  • since the accumulated value from the previous clock cycle is a11*b11, this cycle yields a11*b11 + a12*b21; the operation continues in this way until one row of the first input matrix and one column of the second input matrix have been fully processed, producing the final accumulated value, which is then stored in the output register.
  • the control circuit stores the accumulated value in the output register into the system memory according to the first address of the output matrix in the instruction.
  • the matrix multiplication instruction can implement matrix convolution operations.
  • the convolution of two matrices is the accumulated sum of the products of their point-to-point multiplications.
  • An exemplary convolution operation is as follows:
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the disclosure.
  • Figure 4a is the overall schematic diagram of the convolution operation process, with the notation explained as follows:
  • Cin: the number of channels of the input feature map, also referred to below as the depth of the input feature map;
  • Kw: the width of the convolution kernel;
  • Wout: the width of the output feature map;
  • Hout: the height of the output feature map.
  • the feature points on the input feature map constitute the first input matrix; the points on the convolution kernel constitute the second input matrix; the output feature map is the output matrix, and each feature point on the output feature map is one datum of the output matrix.
  • the convolution kernel slides over the input feature map; at each sliding position it multiplies and accumulates its data with the corresponding data of the input feature map to extract one output feature point, that is, one datum of the output matrix.
  • Figure 4b is a schematic diagram of the calculation of an output feature point with depth.
  • when the convolution kernel stops at a position while sliding over the input feature map, it multiplies and accumulates the corresponding data with the feature points of the input feature map at that position to obtain the output feature point for that position; since there are Cout convolution kernels, each kernel multiplies and accumulates with the feature points of the input feature map at the same position, yielding Cout output feature points in the depth direction; these Cout output feature points form one feature point with depth Cout on the output feature map; as the kernel slides over the entire input feature map, the entire output feature map is obtained.
  • Dout is a point with depth in the output feature map, and its superscript l corresponds to the output depth; Din is the data of the input feature map covered by the convolution kernel, its superscript i corresponds to the depth of the input feature map, and j and k correspond to the width and height of the convolution kernel; w is the convolution kernel, its superscripts l and i correspond to the depth of the output feature map and the depth of the input feature map, and j and k correspond to the width and height of the kernel. In this notation, one output point is computed as Dout^l = sum over i, j, k of w^(l,i)_(j,k) * Din^i_(j,k).
  • a convolution kernel of size Kh*Kw*Cin can be divided into Kh partial kernels of size Kw*Cin, each performing partial feature extraction; each partial kernel extracts 1/Kh of the complete feature, namely the part corresponding to one Kw*Cin slice, yielding Kh partial results; finally, these Kh partial results are added to obtain the final result.
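The equivalence between the full multiply-accumulate over a Kh*Kw*Cin window and the sum of the Kh partial results can be checked with a small sketch (the data and loop structure are invented for illustration):

```python
# One output point: full multiply-accumulate over a Kh x Kw x Cin window
# versus the sum of Kh partial (Kw x Cin) convolutions. Toy data only.
Kh, Kw, Cin = 2, 2, 3
window = [[[(h * Kw + w) * Cin + c for c in range(Cin)]
           for w in range(Kw)] for h in range(Kh)]
kernel = [[[1 for _ in range(Cin)] for _ in range(Kw)] for _ in range(Kh)]

# full convolution of the window: one sum over all Kh*Kw*Cin products
full = sum(window[h][w][c] * kernel[h][w][c]
           for h in range(Kh) for w in range(Kw) for c in range(Cin))

# Kh partial results, one per Kw*Cin slice of the kernel
partials = [sum(window[h][w][c] * kernel[h][w][c]
                for w in range(Kw) for c in range(Cin))
            for h in range(Kh)]

assert full == sum(partials)   # adding the partial results gives the full result
```

With these toy values the window holds 0..11, so both the full result and the sum of the two partial results equal 66.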
  • one row of the input data matrix can then be multiplied by the Cout-column convolution kernel matrix (the weight matrix) composed of the Cout convolution kernels.
  • this produces a feature point with depth: the feature point is a vector whose length is the depth Cout of the output feature point; its realization is shown in Figure 7.
  • since the process of convolution or partial convolution in a neural network is the sliding of the (partial) convolution kernel over the input feature map, it can be viewed as the input feature map data changing with the sliding while the weights remain unchanged; in this way, the convolution performed by the neural network becomes the multiplication of a Wout-row input data matrix with a Cout-column weight matrix, yielding a Wout-row output data matrix; the implementation is shown in Figure 8.
  • the above convolution therefore needs only a single instruction, namely the matrix multiplication instruction above, to complete the entire convolution process; the order of data reading merely has to be set in advance by the upper-level program.
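The single-instruction formulation amounts to one matrix multiplication of unrolled input rows against the weight matrix. The sketch below is illustrative only: `matmul` stands in for the hardware matrix multiplication instruction, and the shapes and values are invented:

```python
def matmul(A, B):
    """Plain matrix multiply; stands in for the single matrix
    multiplication instruction described above."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Wout rows of unrolled input data (each row is one Kw*Cin window)
# times a (Kw*Cin) x Cout weight matrix gives Wout output points,
# each of depth Cout. Here Wout=2, Kw*Cin=4, Cout=3.
inputs  = [[1, 2, 3, 4],
           [2, 3, 4, 5]]           # Wout x (Kw*Cin)
weights = [[1, 0, 1],
           [0, 1, 1],
           [1, 0, 1],
           [0, 1, 1]]              # (Kw*Cin) x Cout
out = matmul(inputs, weights)      # Wout x Cout
# out == [[4, 6, 10], [6, 8, 14]]
```

Each output row is one output feature point with depth Cout, matching the Figure 8 description of a Wout-row output data matrix.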
  • the matrix multiplication instruction is used to implement a matrix convolution operation, wherein the data of the first input matrix is the row vector data of the first input matrix, and the data of the second input matrix is the row vector data of the second input matrix.
  • that is, the data in the first input register and in the second input register must both be row vector data of their matrices, so that the computed result is the convolution result.
  • Figure 16 is a schematic diagram of the above convolution calculation.
  • the step size is 1.
  • the output matrix is a 2*2*2 matrix.
  • the partial convolution method first calculates an intermediate value for each point of the output matrix. As shown in Figure 16, during the partial convolution one row of the second input matrix slides over the first input matrix; the data read from the first input matrix is shown at 1601, where each row corresponds to the data of the first input matrix at one position of the second input matrix 1602 and contains 3 numbers with depth, 6 data in total; one column of 1602 is one row of the second input matrix, likewise 3 numbers and 6 data in total.
  • one row of data from 1601 and one column of data from 1602 are multiplied and accumulated to obtain one point of 1603; each point of 1603 is a partial convolution result.
  • a partial convolution result is part of the value of one point of the output matrix; finally, the results obtained as the three rows of the second input matrix slide in turn over the first input matrix are accumulated to obtain the value of one point of the output matrix (one of the two values of that point, which has depth 2).
  • in the first clock cycle, the value 1 of data 1601 of the first input matrix is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.1 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 1.9 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22; PU11, PU12, PU21, and PU22 perform the data multiplication and save the result in their output registers.
  • in the next clock cycle, the value 2 of data 1601 of the first input matrix is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.2 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 2.0 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22; PU11, PU12, PU21, and PU22 again perform the data multiplication, and the result sent to each output register is accumulated with the previously saved result.
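The two-cycle broadcast schedule above can be re-enacted in software. The loop below is a toy model: the 1602 values (0.1, 1.9, 0.2, 2.0) and the first row values (1, 2) come from the text, while the values fed to the second PU row (3, 4) are invented, since the text elides them:

```python
# Row data of the first matrix feeds a row of PUs; column data of the
# second matrix feeds a column of PUs; each PU multiply-accumulates once
# per cycle into its output register.
A = [[1, 2],        # first-matrix data streamed to PU rows over 2 cycles
     [3, 4]]        # second row's values are assumed, not from the text
B = [[0.1, 1.9],    # second-matrix data streamed to PU columns, cycle 1
     [0.2, 2.0]]    # cycle 2

rout = [[0.0, 0.0], [0.0, 0.0]]     # output registers of PU11..PU22
for cycle in range(2):
    for i in range(2):              # PU row index
        for j in range(2):          # PU column index
            # PU(i+1)(j+1): Rin1 gets A[i][cycle], Rin2 gets B[cycle][j]
            rout[i][j] += A[i][cycle] * B[cycle][j]
# rout now holds the matrix product of A and B
```

This shows why the broadcast wiring computes a matrix product: PU(i,j) accumulates exactly the dot product of row i of the first matrix with column j of the second.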
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • the matrix calculation device 900 includes: a memory 901 for storing matrix operation instructions, the first input matrix, the second input matrix, and the output matrix; an instruction fetching module 902, connected to the memory 901, for obtaining the matrix operation instruction from the memory 901; a decoding module 903, connected to the instruction fetching module 902, for decoding the matrix operation instruction obtained by it; a register 904 for storing the attribute data of the first input matrix, the second input matrix, and the output matrix; and an execution module 905, connected to the decoding module 903, the memory 901, and the register 904, which includes the matrix operation circuit of the above embodiments and is used to execute the decoded matrix operation instruction.
  • the execution module obtains the decoded matrix operation instruction from the decoding module; it obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; according to the attribute data of the two input matrices, it obtains the data of the first input matrix and of the second input matrix from the memory for calculation; it operates on these data according to the decoded matrix operation instruction to obtain the data of the output matrix; and it stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • the attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix;
  • the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix;
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix.
  • the number of rows and the number of columns define the size of the matrix;
  • the row vector interval defines the difference between the storage addresses of two adjacent rows of the matrix. For example, if each row of the matrix has 10 int8 elements and the rows are stored contiguously, the row vector interval is 10 bytes; if adjacent rows are stored at a larger interval, say 20 bytes, then 10 of those bytes hold matrix elements and the other 10 bytes do not belong to the matrix; they may be invalid data or data used for other purposes.
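The address arithmetic implied by the row vector interval can be written out directly. The function name and the `elem_size` parameter below are illustrative assumptions; the int8 example values mirror the text:

```python
def element_addr(base, row, col, row_interval_bytes, elem_size=1):
    """Byte address of matrix[row][col], where row_interval_bytes is the
    address difference between the heads of two adjacent rows."""
    return base + row * row_interval_bytes + col * elem_size

# 10 int8 elements per row, stored contiguously: interval = 10 bytes
assert element_addr(0x1000, 2, 3, 10) == 0x1000 + 23
# Same matrix with 10 bytes of non-matrix data per row: interval = 20 bytes
assert element_addr(0x1000, 2, 3, 20) == 0x1000 + 43
```

Only the row term changes between the two layouts; the column offset is unaffected by the padding, which is why a single interval attribute suffices to describe both.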
  • the execution module obtaining the data of the first input matrix and of the second input matrix from the memory according to their attribute data includes: the execution module reads the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and the execution module reads the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  • the first reading mode is reading by row or by column; the second reading mode is reading by row or by column.
  • for example, if the attribute data defines the first input matrix as having 5 rows, 5 columns, and a row interval of 5 bytes, and the preset reading mode is reading by row, then the first row of the first input matrix is read starting from the first address given in the instruction; the number of columns indicates that the row has 5 elements; the first address plus the row interval then serves as the new first address for reading the second row, also 5 elements; reading 5 times in this way, the execution module obtains all the data of the first input matrix.
  • the execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module stores the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix.
  • the preset storage mode is storing by row or by column; the specific storage process mirrors the reading process in the opposite direction, so it is not repeated here.
  • Fig. 10a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure, using the first input matrix of the above embodiment as an example.
  • when it is stored in the memory, the depth Cin is stored first, then the width Win, and finally the height Hin.
  • starting from the first point, every point is stored in this way until all points are stored.
  • An example of the storage order and format of the first input matrix is shown in Figure 10b.
  • Fig. 10c is a schematic diagram of the storage order and format of the second input matrix of the present disclosure, using the second input matrix of the above embodiment as an example.
  • the kernels are stored by rows, with each column holding one convolution kernel; in the column direction, the depth Cin of the convolution kernel is stored first, then the width Kw, and then the height Kh.
  • RS1, as shown in Figure 10a, is the first address of the first input matrix.
  • the matrix data that is read in can be controlled through the storage format and the reading mode; for example, with the row-order storage of the above example, together with reading by row and the row interval, the row vector data of the first input matrix can be read.
  • RS2 is the first address of the second input matrix.
  • likewise, the read data can be controlled through the storage format and the reading mode, for example storing in the order of the above example together with reading by row and the row interval, so that the row vector data of the second input matrix can be read. After the data of the first and second input matrices have been read out, the single-instruction convolution operation is completed using the matrix multiplication instruction.
  • the attribute data of the first input matrix, the second input matrix, and the output matrix may be set.
  • multiple registers are defined to store the attribute data of each matrix.
  • An example configuration of the registers is shown in the following table:
  • Shape1: bits 31:16 = number of columns of the first input matrix (matrix width); bits 15:0 = number of rows of the first input matrix (matrix length)
  • Shape2: bits 31:16 = number of columns of the second input matrix (matrix width); bits 15:0 = number of rows of the second input matrix (matrix length)
  • Stride1: bits 15:0 = row interval of the output matrix, that is, the number of data between the head of one row and the head of the next row (the same meaning applies below)
  • Stride2: bits 31:16 = row interval of the second input matrix; bits 15:0 = row interval of the first input matrix
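The bit layout of these configuration registers can be illustrated with a small pack/unpack sketch (function names are invented; the 31:16 / 15:0 split follows the table above):

```python
def pack_shape(rows, cols):
    """Pack a matrix shape as in the table above:
    columns in bits [31:16], rows in bits [15:0]."""
    return ((cols & 0xFFFF) << 16) | (rows & 0xFFFF)

def unpack_shape(reg):
    """Recover (rows, cols) from a packed 32-bit shape register."""
    return reg & 0xFFFF, (reg >> 16) & 0xFFFF

shape1 = pack_shape(5, 5)              # 5x5 first input matrix
assert unpack_shape(shape1) == (5, 5)
assert pack_shape(3, 7) == (7 << 16) | 3
```

Packing two 16-bit fields per 32-bit register halves the number of configuration registers needed for the three matrices.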
  • Figures 11-15 are schematic diagrams of data overlap in convolution operations.
  • in a convolution operation, the data used to calculate one output feature point often partially overlaps the data used to calculate the next output feature point.
  • in Figure 11, taking the 3*3 convolution kernel with its stride (the distance the kernel slides each time) as an example, the second point of the second row is calculated; after that, as shown in Figure 12, the kernel slides one point to the right to calculate the third point of the second row. Viewed from the sliding of the input data, the input data of the two consecutive calculations partially overlaps; the overlapping part is the gray region in Figure 13.
  • the gray part is the data shared by the two calculations; that is, as the convolution kernel slides, two thirds of the input matrix data is reused in the convolution calculations of two adjacent positions.
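The two-thirds figure can be checked directly: for a kernel of width 3 sliding by one point, consecutive windows share two of their three input columns.

```python
# Input columns touched by a 3-wide kernel at two consecutive positions
kernel_w, stride = 3, 1
win_a = set(range(0, kernel_w))                 # window at position 0
win_b = set(range(stride, stride + kernel_w))   # window one slide later
overlap = len(win_a & win_b) / kernel_w
assert overlap == 2 / 3                         # two thirds of the data is shared
```

The same calculation generalizes: the shared fraction is (kernel_w - stride) / kernel_w whenever the stride is smaller than the kernel width.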
  • the data in matrix one is denoted by the coordinates x, y, and z on the coordinate system of Figure 14.
  • the weights of the partial convolution form matrix two; since w has the additional dimension Cout, one more dimension is added to its superscript; the data of the output matrix is denoted accordingly.
  • the calculation process of the matrices is shown in Figure 15.
  • The depth of the two points is 8. The second row of the output matrix in Figure 15 represents a point of the output feature map; the depth of this point is Cout, and the data along the depth correspond in sequence to the data of the output matrix. When calculating the third point in the first row of the output feature map, the third row of matrix one is multiplied and accumulated with each column of matrix two, yielding the third point, with its depth, on the output feature map.
  • The third point is a point with a depth of 8.
  • The 8 data in the gray part are overlapped. If they were stored again in the form of matrix one, considerable memory would be wasted.
  • The problems described above can be solved by setting the above-mentioned row interval.
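One way to picture the fix (a sketch under our own assumptions, not the patent's exact addressing scheme): the feature-map data is stored once, and the row-interval register makes consecutive rows of the "unrolled" input matrix start only a few elements apart in the same buffer, so the overlapped data is read twice but never stored twice.

```python
def virtual_row(buffer, row, row_interval, row_len):
    """Row `row` of the unrolled input matrix, read from a flat buffer
    in which consecutive row heads are `row_interval` elements apart."""
    start = row * row_interval
    return buffer[start:start + row_len]

buf = list(range(12))            # feature-map fragment, stored once
r0 = virtual_row(buf, 0, 3, 9)   # elements 0..8
r1 = virtual_row(buf, 1, 3, 9)   # elements 3..11
assert r0[3:] == r1[:6]          # 6 of the 9 elements are shared: 2/3 reuse
```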
  • An embodiment of the present disclosure also provides a matrix operation method based on any of the foregoing matrix operation circuits, including: fetching a matrix operation instruction from a memory; decoding the matrix operation instruction and sending the decoded matrix operation instruction to the matrix operation circuit; and, based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the second input matrix from the memory, performing the operation, and storing the operation result into the memory after the operation is completed.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that the processor, when running, implements any of the matrix operation methods of the foregoing embodiments.
  • Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to make a computer execute any of the matrix operation methods of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute any of the matrix operation methods of the foregoing embodiments.
  • An embodiment of the present disclosure provides a chip including the matrix operation circuit described in any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device including the chip described in any of the foregoing embodiments.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or part of code that contains one or more executable instructions for realizing the specified logical function.
  • The functions marked in the blocks may also occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure can be implemented in software or hardware. In some circumstances, the name of a unit does not constitute a limitation on the unit itself.
  • Exemplary types of hardware logic components include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Products (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • A machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed are a matrix computing circuit, apparatus, and method. The matrix computing circuit comprises a control circuit and a computing unit array; the computing unit array comprises a plurality of computing units, each comprising a first input register, a second input register, and an output register. The first input register is used to receive data of a first input matrix, and the second input register is used to receive data of a second input matrix. The control circuit is used to receive a matrix computing instruction and, in response to the instruction, control at least one of the plurality of computing units to perform a computing operation on the first input matrix and the second input matrix according to the instruction's indication, the instruction being a single instruction. The output register is used to store the computing result of the operation. The method resolves the prior-art problems of low computing efficiency and high power consumption during convolution computing.

Description

Matrix operation circuit, device and method
Technical field
The present disclosure relates to the field of neural network computing, and in particular to a matrix operation circuit, device, and method.
Background
With the development of science and technology, human society is rapidly entering the age of intelligence. An important feature of this age is that people obtain ever more kinds of data in ever greater quantities, while demanding ever higher data-processing speed. Chips are the cornerstone of data processing: they fundamentally determine people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, for example the CPU (Central Processing Unit); such chips provide great flexibility but relatively low effective computing power on domain-specific algorithms. The other is the dedicated route, for example the TPU (Tensor Processing Unit); such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing ability is poor, or the workloads cannot be handled at all. Because the data of the intelligent age are both varied and enormous, chips are required to be extremely flexible, able to handle the rapidly changing algorithms of different fields, and also extremely powerful, able to process quickly the huge and sharply growing volume of data.
Convolution calculations are frequently required in artificial-intelligence computing. Existing schemes for implementing them generally fall into two categories:
(1) CPU scheme: with a single-core CPU, the matrix is decomposed into scalars, and the convolution operation is implemented by combining scalar instructions; with a multi-core CPU, multiple cores may execute their own scalar instructions in parallel and combine the results into the convolution. This scheme has the following disadvantages: the underlying program is complex, generally requiring multi-level loops to implement the convolution; implementing the convolution with general-purpose computing instructions is inefficient and requires many branch jumps; the CPU cache is limited, so relatively large convolutions require moving data from off-chip many times, hurting efficiency; the CPU must access data many times, which increases both the computation time and the power consumption of the convolution; and with multi-core parallel computing, inter-core communication is complex and communication performance may become a bottleneck.
(2) GPU (Graphics Processing Unit) scheme: the GPU decomposes the convolution operation into multiple instructions, mainly vector instructions, and implements it by combining and executing them. This scheme has the following disadvantages: the underlying program is complex, generally requiring multi-level loops; implementing the convolution by repeatedly combining vector instructions is inefficient; the GPU must access data many times, which increases both the computation time and the power consumption of the convolution; and the GPU cache is limited, so relatively large convolutions require moving data from off-chip many times, hurting efficiency.
Summary of the invention
This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the detailed description that follows. It is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, an embodiment of the present disclosure provides a matrix operation circuit, including:
a control circuit;
an arithmetic unit array, the arithmetic unit array including a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register;
the first input register being used to receive data of a first input matrix, and the second input register being used to receive data of a second input matrix;
the control circuit being used to receive a matrix operation instruction and, in response to the instruction, to control at least one of the plurality of arithmetic units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; and
the output register being used to store the result of the arithmetic operation.
Further, the instruction includes an instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
Further, the matrix operation instruction is a matrix multiplication instruction.
Further, the arithmetic unit includes an operator that includes at least a multiplier and an adder; the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
Further, in response to the matrix multiplication instruction, each arithmetic unit executing the matrix multiplication instruction:
reads the data of the first input matrix from the first input register;
reads the data of the second input matrix from the second input register;
calculates the product of the data of the first input matrix and the data of the second input matrix with the multiplier;
calculates the accumulated value of the products with the adder; and
stores the accumulated value in the output register.
Further, the matrix multiplication instruction is used to implement a matrix convolution operation, wherein:
the data of the first input matrix are row-vector data of the first input matrix; and
the data of the second input matrix are row-vector data of the second input matrix.
In a second aspect, an embodiment of the present disclosure provides a matrix calculation device, including:
a memory for storing the matrix operation instruction, the first input matrix, the second input matrix, and the output matrix;
an instruction-fetch module, connected to the memory, for obtaining the matrix operation instruction from the memory;
a decoding module, connected to the instruction-fetch module, for decoding the matrix operation instruction obtained by the instruction-fetch module;
a register for storing attribute data of the first input matrix, the second input matrix, and the output matrix; and
an execution module, connected to the decoding module, the memory, and the register, including the matrix operation circuit according to claims 1-6, for executing the decoded matrix operation instruction.
Further, the execution module obtains the decoded matrix operation instruction from the decoding module; the execution module obtains the attribute data of the first input matrix, the second input matrix, and the output matrix from the register; the execution module obtains the data of the first input matrix and the second input matrix used for calculation from the memory according to the attribute data of the first and second input matrices; the execution module calculates the data of the output matrix from the data of the first and second input matrices according to the decoded matrix operation instruction; and the execution module stores the data of the output matrix into the memory according to the attribute data of the output matrix.
Further, the attribute data of the first input matrix include the number of rows, the number of columns, and the row-vector interval of the first input matrix; the attribute data of the second input matrix include the number of rows, the number of columns, and the row-vector interval of the second input matrix; and the attribute data of the output matrix include the number of rows, the number of columns, and the row-vector interval of the output matrix.
Further, the execution module obtaining the data of the first input matrix and the second input matrix used for calculation from the memory according to their attribute data includes:
the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and
the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
Further, the first reading mode is reading by row or reading by column; the second reading mode is reading by row or reading by column.
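The two reading modes can be sketched as follows. This is a hypothetical behavioral model (the memory here is a flat Python list and the function name is ours): the attribute data — rows, columns, and row interval — suffice to address any element, and the preset mode only decides the iteration order.

```python
def read_matrix(mem, base, rows, cols, row_interval, by_row=True):
    """Fetch a matrix from flat memory using its attribute data.
    `row_interval` is the number of elements between consecutive row heads."""
    elem = lambda r, c: mem[base + r * row_interval + c]
    if by_row:                    # first/second reading mode: by row
        return [[elem(r, c) for c in range(cols)] for r in range(rows)]
    # reading mode: by column (same elements, transposed iteration order)
    return [[elem(r, c) for r in range(rows)] for c in range(cols)]

mem = list(range(20))
assert read_matrix(mem, 2, 2, 3, 5, by_row=True) == [[2, 3, 4], [7, 8, 9]]
assert read_matrix(mem, 2, 2, 3, 5, by_row=False) == [[2, 7], [3, 8], [4, 9]]
```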
In a third aspect, an embodiment of the present disclosure provides a matrix operation method based on the matrix operation circuit of any one of the foregoing first aspect, including:
fetching a matrix operation instruction from a memory;
decoding the matrix operation instruction, and sending the decoded matrix operation instruction to the matrix operation circuit; and
based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the second input matrix from the memory, performing the operation, and storing the operation result into the memory after the operation is completed.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that the processor, when running, implements any of the matrix operation methods of the third aspect.
In a fifth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to make a computer execute any of the matrix operation methods of the third aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute any of the matrix operation methods of the third aspect.
In a seventh aspect, an embodiment of the present disclosure provides a chip including the matrix operation circuit of any one of the first aspect.
In an eighth aspect, an embodiment of the present disclosure provides a computing device including the chip of the seventh aspect.
The embodiments of the present disclosure disclose a matrix operation circuit, device, and method. The matrix operation circuit includes: a control circuit; and an arithmetic unit array including a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register. The first input register is used to receive data of a first input matrix, and the second input register is used to receive data of a second input matrix. The control circuit is used to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of arithmetic units to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, the instruction being a single instruction. The output register is used to store the result of the arithmetic operation. This method solves the prior-art problems of low computing efficiency and high power consumption when performing convolution calculations.
The above description is only an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented in accordance with the contents of the specification, and to make the above and other objectives, features, and advantages of the present disclosure more apparent and comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Description of the drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference signs denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Figure 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure;
Figure 2 is a schematic structural diagram of an arithmetic unit provided by an embodiment of the disclosure;
Figure 3 is a further schematic structural diagram of an arithmetic unit provided by an embodiment of the disclosure;
Figures 4a-4b are schematic diagrams of the convolution operation in an embodiment of the disclosure;
Figures 5-8 show the calculation process of a partial convolution provided by an embodiment of the disclosure;
Figure 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure;
Figures 10a-10d are schematic diagrams of the storage order and format of matrices in the disclosure;
Figures 11-15 are schematic diagrams of data overlap in convolution operations;
Figure 16 is a schematic diagram of a specific example of the convolution operation in an embodiment of the disclosure.
Detailed description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the following description.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
Figure 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure. As shown in Figure 1, the matrix operation circuit 100 includes a control circuit 101 and an arithmetic unit array 102, where the arithmetic unit array includes a plurality of arithmetic units (PU) 103. As shown in Figure 2, an arithmetic unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105, and an output register (Rout) 106. The first input register 104 is used to receive data of a first input matrix, and the second input register 105 is used to receive data of a second input matrix. The control circuit 101 is used to receive a matrix operation instruction and, in response to the instruction, control at least one arithmetic unit 103 of the plurality of arithmetic units 103 to perform an arithmetic operation on the first input matrix and the second input matrix as directed by the instruction, where the instruction is a single instruction. The output register 106 is used to store the result of the arithmetic operation.
In the present disclosure, the single instruction includes an instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix. The following table shows an exemplary instruction format:
Figure PCTCN2019111878-appb-000001
The instruction name corresponds to the meaning, format, and operation of the instruction. The first address of the first input matrix and the first address of the second input matrix define the read addresses of the two source operands of the instruction, and the first address of the output matrix defines the storage address of the destination operand of the instruction. The above exemplary instruction is a matrix multiplication instruction, which implements the multiplication of two matrices; specifically:
With C = A × B, where A is the first input matrix and B is the second input matrix, each element of the output matrix is

c_in = ∑_j (a_ij × b_jn)
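The element-wise formula above can be illustrated with a short sketch (illustrative Python, not part of the disclosed hardware): each output element is a row-by-column multiply-accumulate.

```python
def matmul(a, b):
    """Multiply matrix a (m x p) by matrix b (p x n), element by element.

    Each output element c[i][n] is the multiply-accumulate of row i of a
    with column n of b, mirroring the formula c_in = sum_j(a_ij * b_jn).
    """
    m, p, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for col in range(n):
            c[i][col] = sum(a[i][j] * b[j][col] for j in range(p))
    return c
```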
FIG. 3 is a further schematic structural diagram of an arithmetic unit provided by an embodiment of the present disclosure. As shown in FIG. 3, in addition to the first input register, the second input register, and the output register, the arithmetic unit further includes an operator, which includes at least a multiplier 301 and an adder 302. The arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction so as to perform a matrix multiplication operation.
Specifically, in response to the matrix multiplication instruction, each arithmetic unit executing the instruction: reads the data of the first input matrix from the first input register; reads the data of the second input matrix from the second input register; calculates the product of the data of the first input matrix and the data of the second input matrix with the multiplier; calculates the accumulated value of the products with the adder; and stores the accumulated value in the output register. The data of the first input matrix in the first input register is the data read sequentially into the first input register according to the first address of the first input matrix; the data of the second input matrix in the second input register is the data read sequentially into the second input register according to the first address of the second input matrix.
For example, after the data is read in, the multiplier calculates the product of the data in the first and second input registers. Continuing the example above, if in one clock cycle the data received in the first input register is a_11 and the data received in the second input register is b_11, the multiplier in the arithmetic unit calculates the product a_11*b_11 and sends it to the adder for accumulation. Since the accumulator has no input from the previous clock cycle, its accumulation result is a_11*b_11. In the next clock cycle the calculation continues: the data received in the first input register is a_12 and the data received in the second input register is b_21, so the multiplier calculates the product a_12*b_21 and sends it to the adder; the accumulator's input from the previous clock cycle is a_11*b_11, so in this clock cycle it calculates a_11*b_11 + a_12*b_21. This operation continues until one row of the first input matrix and one column of the second input matrix have been processed, giving the final accumulated value, which is then stored in the output register. The control circuit stores the accumulated value in the output register into the system memory according to the first address of the output matrix in the instruction.
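The per-clock-cycle behavior described above can be modeled in software as follows (an illustrative toy model; register names follow the figures, but the class itself is not part of the disclosure):

```python
class ProcessingUnit:
    """Toy model of one PU: a multiplier feeding an accumulating adder."""

    def __init__(self):
        self.rout = 0  # output register Rout, holds the running accumulation

    def cycle(self, rin1, rin2):
        # one clock cycle: multiply the values in Rin1 and Rin2,
        # then add the product to the previous contents of Rout
        self.rout += rin1 * rin2
        return self.rout


# feed one row of the first matrix and one column of the second,
# one element pair per clock cycle
pu = ProcessingUnit()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    pu.cycle(a, b)
```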
In the present disclosure, the matrix multiplication instruction can implement a matrix convolution operation. The convolution of two matrices is the accumulated sum of the products of their point-to-point multiplication.
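The point-to-point multiply-and-accumulate just described can be sketched as (illustrative Python):

```python
def conv_point(a, b):
    """Point-to-point product of two equal-shaped matrices, accumulated into one value."""
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```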
An exemplary convolution operation is as follows:
FIGS. 4a-4b are schematic diagrams of the convolution operation in an embodiment of the present disclosure. FIG. 4a is an overall schematic diagram of the convolution process, with the notation explained as follows:
Win: width of the input feature map;
Hin: height of the input feature map;
Cin: number of channels of the input feature map, hereinafter referred to as the depth of the input feature map;
Kw: width of the convolution kernel;
Kh: height of the convolution kernel;
Wout: width of the output feature map;
Hout: height of the output feature map;
Cout: number of channels of the output feature map, hereinafter referred to as the depth of the output feature map.
The feature points on the input feature map constitute the first input matrix; the points on the convolution kernels constitute the second input matrix; the output feature map is the output matrix, and one feature point on the output feature map is one data item of the output matrix. During the convolution operation, the convolution kernel slides over the input feature map; at each sliding position it performs a multiply-accumulate with the corresponding data in the input feature map, extracting one output feature point, that is, one data item of the output matrix.
FIG. 4b is a schematic diagram of the calculation of one output feature point with depth. As shown in FIG. 4b, a convolution kernel slides over the input feature map; when it stops at a position, it multiplies and accumulates the corresponding data with the feature points of the input feature map at that position to obtain the output feature point corresponding to that position. There are Cout convolution kernels in total, and each kernel performs a multiply-accumulate with the feature points of the input feature map at the same position, yielding Cout output feature points in the depth direction. The Cout output feature points form one feature point with depth on the output feature map, the depth of this point being Cout. The kernels slide over the entire input feature map to obtain the entire output feature map.
For a convolution kernel at depth l (1 <= l <= Cout), its feature extraction formula is as follows:
Dout^l = ∑_{i=1..Cin} ∑_{j=1..Kw} ∑_{k=1..Kh} Din^i_{j,k} × w^{l,i}_{j,k}
Here Dout is a point with depth in the output feature map, with its superscript l corresponding to the output depth; Din is the data in the input feature map covered by the convolution kernel, with its superscript i corresponding to the depth of the input feature map, and j and k corresponding to the width and height positions within the kernel; w is the convolution kernel, with its superscripts l and i corresponding respectively to the depth of the output feature map and the depth of the input feature map, and j and k corresponding to the kernel's width and height.
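The feature extraction formula above can be written out directly as a triple sum (illustrative Python; the index layout din[i][k][j] is an assumption made for the sketch):

```python
def extract_point(din, w):
    """Feature extraction for one kernel at one position.

    din[i][k][j]: input patch (depth i, height k, width j) covered by the kernel;
    w[i][k][j]:   one convolution kernel, indexed the same way.
    Returns the scalar Dout for this kernel and position.
    """
    return sum(
        din[i][k][j] * w[i][k][j]
        for i in range(len(w))
        for k in range(len(w[0]))
        for j in range(len(w[0][0]))
    )
```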
For a convolution kernel of size Kh*Kw*Cin, the kernel can be divided into Kh partial kernels of size Kw*Cin for partial feature extraction. Each partial kernel implements 1/Kh of the whole feature extraction, namely the features corresponding to one Kw*Cin partial kernel, giving the partial result

Dout^l_k = ∑_{i=1..Cin} ∑_{j=1..Kw} Din^i_{j,k} × w^{l,i}_{j,k}

Finally, adding these Kh partial results gives the final result

Dout^l = ∑_{k=1..Kh} Dout^l_k

Each partial result Dout^l_k can in turn be divided into Kw steps, where each step implements

∑_{i=1..Cin} Din^i_{j,k} × w^{l,i}_{j,k}

and adding the Kw partial results then gives Dout^l_k. The implementation of one such step is the multiplication of a one-row input data matrix (i.e., part of the first input matrix) by a one-column weight matrix (i.e., part of the convolution kernel), as shown in FIG. 5. The implementation of Dout^l_k is likewise the multiplication of a one-row input data matrix by a one-column weight matrix, except that the number of data items in the row and in the column is Kw times that of a single step, as shown in FIG. 6. Since the number of convolution kernels is Cout, the depth of an output feature point is Cout: a one-row input data matrix can be multiplied by the Cout-column kernel matrix, i.e., the weight matrix, formed from the Cout convolution kernels, yielding one feature point with depth. This feature point is a vector whose length is the output depth Cout, as shown in FIG. 7.
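The row-times-weight-matrix step of FIG. 7 can be sketched as follows (illustrative Python; the row holds the Kw*Cin input values at one kernel position, and each weight column holds one partial kernel):

```python
def feature_point(input_row, weight_cols):
    """Multiply one input row by a Cout-column weight matrix.

    input_row:   flat list of Kw*Cin input values at one kernel position
    weight_cols: weight_cols[c] is the matching flat column for output depth c
    Returns a vector of length Cout -- one output feature point with depth.
    """
    return [sum(x * w for x, w in zip(input_row, col)) for col in weight_cols]
```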
Furthermore, since the process by which a neural network implements convolution or partial convolution is the sliding of the kernel or partial kernel over the input feature map, it can be viewed as a process in which the input feature map data changes with the sliding while the weights remain unchanged. The convolution process of the neural network thus becomes the multiplication of a Wout-row input data matrix by a Cout-column weight matrix, yielding a Wout-row output data matrix, as shown in FIG. 8.
In the above process, Kw of the Kh*Kw points of the whole convolution kernel have been computed. Since the whole kernel is divided into Kh parts of Kw points each, the final result is obtained by adding the Kh partial results, giving the true convolution result.
In the present disclosure, the above convolution operation needs only a single instruction, namely the above matrix multiplication instruction, to complete the entire convolution process; the order in which data is read in simply needs to be set in advance by the upper-level program. Specifically, the matrix multiplication instruction is used to implement the matrix convolution operation, where the data of the first input matrix is the row vector data of the first input matrix and the data of the second input matrix is the row vector data of the second input matrix. In other words, the data in the first input register and the data in the second input register must both be row vector data of their matrices, so that the computed result is the convolution result. FIG. 16 is a schematic diagram of the above convolution calculation, taking as an example a first input matrix of size 4*4*2 and a second input matrix of two 3*3*2 convolution kernels with a stride of 1; the output matrix is a 2*2*2 matrix. In the calculation, the partial convolution method is used to first compute the intermediate values of each point of the output matrix. As shown in FIG. 16, when partial convolution is performed, one row of the second input matrix slides over the first input matrix. The data read from the first input matrix is shown at 1601; each row is the data of the first input matrix corresponding to one position of the second input matrix 1602, comprising 3 numbers with depth, 6 data items in total. One column in 1602 is one row of the second input matrix including depth, also 3 numbers and 6 data items in total. One row of data in 1601 and one column of data in 1602 are multiplied and accumulated to obtain one point in 1603; a point in 1603 is the result of a partial convolution. In this example, the result of a partial convolution is a part of the value of one point in the output matrix. Finally, the results computed by sliding each of the three rows of the second input matrix over the first input matrix are accumulated to obtain the value of one point in the output matrix (one of the two values of a point with depth 2).
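The dimensions of this example can be checked against a direct reference computation (illustrative Python, not part of the disclosed hardware; a 4*4*2 input of ones and two 3*3*2 kernels of ones, stride 1):

```python
def conv(inp, kernels, stride=1):
    """Direct convolution. inp[h][w][c]; kernels[l][kh][kw][c]; returns out[h][w][l]."""
    hin, win, cin = len(inp), len(inp[0]), len(inp[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    hout = (hin - kh) // stride + 1
    wout = (win - kw) // stride + 1
    out = [[[0.0] * len(kernels) for _ in range(wout)] for _ in range(hout)]
    for y in range(hout):
        for x in range(wout):
            for l, ker in enumerate(kernels):
                # multiply-accumulate over the kernel window, including depth
                out[y][x][l] = sum(
                    inp[y * stride + i][x * stride + j][c] * ker[i][j][c]
                    for i in range(kh) for j in range(kw) for c in range(cin)
                )
    return out


# 4*4*2 input of ones, two 3*3*2 kernels of ones, stride 1 -> 2*2*2 output
inp = [[[1, 1] for _ in range(4)] for _ in range(4)]
kernels = [[[[1, 1] for _ in range(3)] for _ in range(3)] for _ in range(2)]
out = conv(inp, kernels)
```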
With reference to the PU array shown in FIGS. 1-3, the following briefly explains how the PU array implements the convolution calculation.
The value 1 of the first input matrix data 1601 in FIG. 16 is sent to Rin1 of PU11 and Rin1 of PU12, and the value 3 of the data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.1 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 1.9 of 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform data multiplication and store the results in their output registers. In the next clock cycle, the value 2 of the data 1601 is sent to Rin1 of PU11 and Rin1 of PU12, and the value 4 of the data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.2 of 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 2.0 of 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform this data multiplication and accumulate the results with the previously saved results in their output registers.
Continuing in this way, the multiply-accumulate result of the partial convolution is finally obtained.
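The two clock cycles walked through above can be traced numerically (an illustrative sketch; the PU naming follows the text, the dictionary model is an assumption):

```python
# accumulators for the four PUs of a 2x2 array
pu = {"PU11": 0.0, "PU12": 0.0, "PU21": 0.0, "PU22": 0.0}

def step(rin1_row1, rin1_row2, rin2_col1, rin2_col2):
    # one clock cycle: each PU multiplies its Rin1 and Rin2 values
    # and accumulates the product into its output register
    pu["PU11"] += rin1_row1 * rin2_col1
    pu["PU12"] += rin1_row1 * rin2_col2
    pu["PU21"] += rin1_row2 * rin2_col1
    pu["PU22"] += rin1_row2 * rin2_col2

step(1, 3, 0.1, 1.9)  # cycle 1: values 1 and 3 from 1601, 0.1 and 1.9 from 1602
step(2, 4, 0.2, 2.0)  # cycle 2: values 2 and 4 from 1601, 0.2 and 2.0 from 1602
```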
FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the present disclosure. As shown in FIG. 9, the matrix calculation device 900 includes: a memory 901 for storing matrix operation instructions, the first input matrix, the second input matrix, and the output matrix; an instruction fetch module 902, connected to the memory 901, for obtaining the matrix operation instruction from the memory 901; a decoding module 903, connected to the instruction fetch module 902, for decoding the matrix operation instruction obtained by the instruction fetch module 902; a register 904 for storing attribute data of the first input matrix, the second input matrix, and the output matrix; and an execution module 905, connected to the decoding module 903, the memory 901, and the register 904, and including the matrix operation circuit of the above embodiments, for executing the decoded matrix operation instruction.
In one embodiment, the execution module obtains the decoded matrix operation instruction from the decoding module; the execution module obtains the attribute data of the first input matrix, the attribute data of the second input matrix, and the attribute data of the output matrix from the register; the execution module obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first and second input matrices; the execution module calculates the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction to obtain the data of the output matrix; and the execution module stores the data of the output matrix in the memory according to the attribute data of the output matrix. The attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix; the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix; and the attribute data of the output matrix includes the number of rows, the number of columns, and the row vector interval of the output matrix. The numbers of rows and columns define the size of the matrix, and the row vector interval defines the storage address difference between two adjacent rows of the matrix. For example, if each row of a matrix has 10 int8 matrix elements and the rows are stored contiguously, the row vector interval is 10 bytes; if two rows are stored at a larger interval, for example 20 bytes, then 10 bytes hold matrix elements and the other 10 bytes do not belong to this matrix, being either invalid data or data used for other purposes.
Optionally, the execution module obtaining the data of the first input matrix and the data of the second input matrix for calculation from the memory according to the attribute data of the first input matrix and the second input matrix includes: the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix. The first reading mode is reading by row or reading by column, and the second reading mode is reading by row or reading by column.
For example, if the attribute data defines the first input matrix as having 5 rows and 5 columns with a row interval of 5 bytes, and the preset reading mode is reading by row, then the first row of the first input matrix is read according to the first address of the first input matrix in the instruction and the row interval; from the number of columns it is known that this row has 5 matrix elements. The first address plus the row interval is then taken as the new first address to read the second row, again 5 matrix elements. Reading 5 times in this way, the execution module obtains all the data of the first input matrix.
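The row-by-row read procedure can be sketched as follows (illustrative; addresses are element offsets into a flat memory array):

```python
def read_matrix(mem, base, rows, cols, row_interval):
    """Read a matrix row by row from flat memory.

    Each row starts row_interval elements after the previous row's start,
    so the first address is advanced by row_interval between reads.
    """
    return [mem[base + r * row_interval: base + r * row_interval + cols]
            for r in range(rows)]


# 2 rows of 3 elements, rows spaced 5 elements apart, starting at address 0
mem = list(range(30))
m = read_matrix(mem, 0, 2, 3, 5)
```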
Similarly, the data of the second input matrix is obtained in the same way. Optionally, the execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module storing the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix. The preset storage mode is storing by row or storing by column; the specific storage process is similar to reading, only in the opposite direction, and will not be repeated here.
FIG. 10a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure. As shown in FIG. 10a, for the first input matrix of the above embodiment, when stored in the memory it is stored with the depth Cin first, then the width Win, and finally the height Hin. Taking Cin=2, Win=3, Hin=3 as an example, the first point of the first of the Hin=3 rows is stored first; because of the depth, this one point contains 2 data items. The second point, also containing 2 data items, is stored next, and so on through the third point, so that one row contains 3*2=6 data items covering the 3 points in the Win direction. After that, the first point of the second of the Hin=3 rows is stored, and storage continues in this way until all points have been stored. An example of the storage order and format of the first input matrix is shown in FIG. 10b.
FIG. 10c is a schematic diagram of the storage order and format of the second input matrix of the present disclosure. As shown in FIG. 10c, for the second input matrix of the above embodiment, when stored in the memory it is stored row by row with Cout (the number of convolution kernels) first: each column stores one convolution kernel, and in the column direction the storage is ordered by kernel depth Cin first, then width Kw, then height Kh. Taking Cout=2, Cin=2, Kw=2, Kh=2 as an example, a total of Cin*Kw*Kh=8 rows of data are stored, with 2 data items per row. First, the first data item of the depth-2 point in the first row and first column of the first kernel is stored, then the first data item of the depth-2 point in the first row and first column of the second kernel, completing the storage of the first row of the second input matrix. Next, the second data item of the depth-2 point in the first row and first column of the first kernel is stored, then the second data item of the corresponding point of the second kernel, completing the storage of the second row of the second input matrix; storage continues in this way until all points have been stored. An example of the storage order and format of the second input matrix is shown in FIG. 10d.
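The depth-first (Cin, then Win, then Hin) storage order of the first input matrix can be expressed as a simple flattening rule (illustrative):

```python
def flatten_feature_map(fm):
    """Flatten fm[h][w][c] with depth varying fastest, then width, then height."""
    return [v for row in fm for point in row for v in point]


# Cin=2, Win=3, Hin=2 example: each point stores its 2 depth values contiguously
fm = [[[1, 2], [3, 4], [5, 6]],
      [[7, 8], [9, 10], [11, 12]]]
flat = flatten_feature_map(fm)
```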
RS1 shown in FIG. 10a is the first address of the first input matrix. The matrix data that is read in can be controlled by setting the storage format and the reading mode: with storage in row order as in the above example, together with reading by row and the row interval, the row vector data of the first input matrix can be read out. RS2 shown in FIG. 10c is the first address of the second input matrix; likewise, with storage in the order of the above example, together with reading by row and the row interval, the row vector data of the second input matrix can be read out. After the data of the first and second input matrices is read out, the single-instruction convolution operation is completed using the above matrix multiplication instruction.
Before the above matrix multiplication instruction is called, the attribute data of the first input matrix, the second input matrix, and the output matrix may be set. In the present disclosure, multiple registers are defined to store the attribute data of each matrix. An example register configuration is shown in the following table:
Register name | Register function description (31:0)
Shape1 | 31:16 (number of columns of the first input matrix, i.e., matrix width); 15:0 (number of rows of the first input matrix, i.e., matrix length)
Shape2 | 31:16 (number of columns of the second input matrix, i.e., matrix width); 15:0 (number of rows of the second input matrix, i.e., matrix length)
Stride1 | 15:0 (row interval of the output matrix, i.e., the number of data items between the head of one row and the head of the next row; the same below)
Stride2 | 31:16 (row interval of the second input matrix); 15:0 (row interval of the first input matrix)
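The 31:16 / 15:0 field layout above can be packed and unpacked with simple shifts and masks (an illustrative sketch; the function names are hypothetical):

```python
def pack_fields(high, low):
    """Pack two 16-bit values into one 32-bit register: bits 31:16 and 15:0."""
    assert 0 <= high < (1 << 16) and 0 <= low < (1 << 16)
    return (high << 16) | low

def unpack_fields(reg):
    """Return (bits 31:16, bits 15:0) of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF, reg & 0xFFFF


# Shape1 example: 5 columns (width) in bits 31:16, 7 rows (length) in bits 15:0
shape1 = pack_fields(5, 7)
```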
FIGS. 11-15 are schematic diagrams of data overlap in the convolution operation. As shown in FIG. 11, in a convolution operation the data used to compute one output feature point often partially overlaps the data used to compute the next output feature point. In FIG. 11, taking convolution with a 3*3 kernel and a stride (the distance the kernel slides each time) of 1 as an example, the second point of the second row is computed; after that, as shown in FIG. 12, the kernel slides one point to the right to compute the third point of the second row. From the perspective of the sliding over the input data, the input data used for the two points partially overlaps, the overlapping part being shown in gray in FIG. 13. When only partial convolution is considered, as shown in FIG. 14, the gray part is the data that overlaps between the two successive computations. This means that, as the kernel slides, two-thirds of the input matrix data is repeated between the convolution computations at two adjacent positions.
Assume the partial-convolution input data is matrix one, with the data in matrix one denoted Din^z_{x,y}, where x, y, and z are coordinates in the coordinate system of FIG. 14; assume the weights of the partial convolution form matrix two, with the data in matrix two denoted w^{l,z}_{x,y}, where w has one additional dimension, Cout, so one more index is added to its superscript; and let the data in the output matrix be denoted Dout^l. The matrix calculation process is shown in FIG. 15. When computing the second point of the second row of the output feature map, the data of the second row of matrix one is multiplied and accumulated with each column of matrix two, giving a second point with depth on the output feature map; the depth of this point is 8, represented in FIG. 15 as the second row of the output matrix. This is one point of the output feature map whose depth is Cout, with each data item along the depth corresponding in order to one data item of the output matrix. When computing the third point of the second row of the output feature map, the data of the third row of matrix one is multiplied and accumulated with each column of matrix two, giving a third point with a depth of 8. As can be seen from FIG. 15, when computing the second point with depth and the third point with depth of the second row, the 8 data items in the gray part overlap; if the data were stored in the form of matrix one, a large amount of memory would be wasted.
The problem in the above computation can be solved by setting the row interval described above. Taking the above embodiment as an example, the number of data items in one row of the first input matrix is 4*3=12 (4 being the depth of the first input matrix and 3 the number of columns of the second data matrix). If, before the matrix multiplication instruction is used, the row interval represented by bits 15:0 of the register Stride2 is set to 4, then between two rows 12-4=8 data items are repeatedly fed into the matrix calculation device. In this way, one instruction implements the whole partial convolution process, and the repeated data does not need to be stored repeatedly, saving memory space. At the same time, the output matrix register is set so that the output data is also stored sequentially, in the same order as the input data; if a further convolution or other operation follows, the data shape does not need to be adjusted and the data can be used directly, saving the time and power consumption that data adjustment would require.
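The overlap achieved by setting the row interval smaller than the row length can be checked numerically (illustrative; addresses are at element granularity):

```python
def read_overlapping_rows(mem, base, n_rows, row_len, row_interval):
    # consecutive rows start row_interval elements apart; when
    # row_interval < row_len, adjacent rows share row_len - row_interval elements
    return [mem[base + r * row_interval: base + r * row_interval + row_len]
            for r in range(n_rows)]


# rows of 12 elements with a row interval of 4: adjacent rows repeat 12-4=8 elements
mem = list(range(40))
rows = read_overlapping_rows(mem, 0, 2, 12, 4)
```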
An embodiment of the present disclosure further provides a matrix operation method based on any of the foregoing matrix operation circuits, comprising: fetching a matrix operation instruction from a memory; decoding the matrix operation instruction and sending the decoded instruction to the matrix operation circuit; and, based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the data of the second input matrix from the memory, performing the operation, and storing the operation result in the memory after the operation is completed.
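The fetch-decode-execute flow of this method can be sketched as a minimal software model (all class, field, and address names below are illustrative assumptions, not taken from the disclosure):

```python
# Minimal software model of the fetch-decode-execute flow described
# above: fetch an instruction from memory, decode it, then read both
# input matrices, compute, and write the result back to memory.

class MatrixUnitModel:
    def __init__(self, memory):
        self.memory = memory  # dict: address -> instruction or matrix

    def fetch(self, pc):
        return self.memory[pc]  # fetch the matrix operation instruction

    def decode(self, instr):
        # the instruction carries the operation name and the head
        # addresses of the two inputs and the output (hypothetical format)
        op, a_addr, b_addr, out_addr = instr
        return op, a_addr, b_addr, out_addr

    def execute(self, decoded):
        op, a_addr, b_addr, out_addr = decoded
        A, B = self.memory[a_addr], self.memory[b_addr]
        if op == "matmul":
            inner, cols = len(B), len(B[0])
            C = [[sum(A[i][k] * B[k][j] for k in range(inner))
                  for j in range(cols)] for i in range(len(A))]
            self.memory[out_addr] = C  # store the result back to memory

mem = {0: ("matmul", 100, 200, 300),
       100: [[1, 2], [3, 4]],
       200: [[5, 6], [7, 8]]}
unit = MatrixUnitModel(mem)
unit.execute(unit.decode(unit.fetch(0)))
print(mem[300])  # [[19, 22], [43, 50]]
```

The three method steps map directly to the three calls in the last line.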
An embodiment of the present disclosure further provides an electronic device, comprising: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that, when running, the processors implement the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product comprising computer instructions that, when executed by a computing device, cause the computing device to execute the matrix operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a chip comprising the matrix operation circuit of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computing apparatus comprising the chip of any of the foregoing embodiments.
The flowcharts and block diagrams in the drawings of the present disclosure illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (12)

  1. A matrix operation circuit, comprising:
    a control circuit; and
    an arithmetic unit array comprising a plurality of arithmetic units, each arithmetic unit comprising a first input register, a second input register, and an output register;
    wherein the first input register is configured to receive data of a first input matrix, and the second input register is configured to receive data of a second input matrix;
    the control circuit is configured to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of arithmetic units to perform an operation on the first input matrix and the second input matrix as directed by the instruction, wherein the instruction is a single instruction; and
    the output register is configured to store a result of the operation.
  2. The matrix operation circuit of claim 1, wherein the instruction comprises an instruction name, a head address of the first input matrix, a head address of the second input matrix, and a head address of an output matrix.
  3. The matrix operation circuit of claim 1 or 2, wherein:
    the matrix operation instruction is a matrix multiplication instruction.
  4. The matrix operation circuit of claim 3, wherein:
    the arithmetic unit comprises an arithmetic element including at least a multiplier and an adder; and
    the arithmetic unit is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  5. The matrix operation circuit of claim 4, wherein, in response to the matrix multiplication instruction, at least one arithmetic unit executing the matrix multiplication instruction:
    reads the data of the first input matrix from the first input register;
    reads the data of the second input matrix from the second input register;
    computes, by the multiplier, a product of the data of the first input matrix and the data of the second input matrix;
    computes, by the adder, an accumulated value of the products; and
    stores the accumulated value in the output register.
  6. The matrix operation circuit of claim 5, wherein the matrix multiplication instruction is used to implement a matrix convolution operation, wherein:
    the data of the first input matrix is row-vector data of the first input matrix; and
    the data of the second input matrix is row-vector data of the second input matrix.
  7. A matrix computing apparatus, comprising:
    a memory configured to store a matrix operation instruction, a first input matrix, a second input matrix, and an output matrix;
    an instruction-fetch module connected to the memory and configured to obtain the matrix operation instruction from the memory;
    a decoding module connected to the instruction-fetch module and configured to decode the matrix operation instruction obtained by the instruction-fetch module;
    a register configured to store attribute data of the first input matrix, the second input matrix, and the output matrix; and
    an execution module connected to the decoding module, the memory, and the register, comprising the matrix operation circuit of any one of claims 1-6, and configured to execute the decoded matrix operation instruction.
  8. The matrix computing apparatus of claim 7, wherein:
    the execution module obtains the decoded matrix operation instruction from the decoding module;
    the execution module obtains the attribute data of the first input matrix, the attribute data of the second input matrix, and the attribute data of the output matrix from the register;
    the execution module obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first input matrix and the second input matrix;
    the execution module calculates the data of the output matrix from the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction; and
    the execution module stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  9. The matrix computing apparatus of claim 8, wherein:
    the attribute data of the first input matrix includes the number of rows, the number of columns, and the row-vector interval of the first input matrix;
    the attribute data of the second input matrix includes the number of rows, the number of columns, and the row-vector interval of the second input matrix; and
    the attribute data of the output matrix includes the number of rows, the number of columns, and the row-vector interval of the output matrix.
  10. The matrix computing apparatus of claim 8, wherein obtaining, by the execution module, the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the first input matrix and the second input matrix comprises:
    the execution module reading the data of the first input matrix according to a preset first reading mode and the attribute data of the first input matrix; and
    the execution module reading the data of the second input matrix according to a preset second reading mode and the attribute data of the second input matrix.
  11. The matrix computing apparatus of claim 10, wherein:
    the first reading mode is reading by row or reading by column; and
    the second reading mode is reading by row or reading by column.
  12. A matrix operation method based on the matrix operation circuit of any one of claims 1 to 6, comprising:
    fetching a matrix operation instruction from a memory;
    decoding the matrix operation instruction and sending the decoded matrix operation instruction to the matrix operation circuit; and
    based on the decoded matrix operation instruction, the matrix operation circuit obtaining the data of the first input matrix and the data of the second input matrix from the memory, performing the operation, and storing an operation result in the memory after the operation is completed.
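The multiply-accumulate sequence recited in claim 5 above (read both input registers, multiply, accumulate, store in the output register) can be sketched as follows; the register and class names are illustrative only:

```python
# Sketch of the per-unit multiply-accumulate sequence of claim 5:
# read both input registers, multiply, accumulate, store the running
# sum in the output register.

class MACUnit:
    def __init__(self):
        self.in_reg1 = 0   # first input register (first input matrix data)
        self.in_reg2 = 0   # second input register (second input matrix data)
        self.out_reg = 0   # output register (accumulator)

    def step(self, a, b):
        self.in_reg1, self.in_reg2 = a, b        # receive matrix data
        product = self.in_reg1 * self.in_reg2    # multiplier
        self.out_reg += product                  # adder accumulates
        return self.out_reg

mac = MACUnit()
for a, b in zip([1, 2, 3], [4, 5, 6]):  # one row against one column
    result = mac.step(a, b)
print(result)  # 1*4 + 2*5 + 3*6 = 32
```

Repeating this step across a full row of the first matrix and a full column of the second matrix produces one element of the output matrix.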
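The two preset reading modes of claims 10 and 11 above (reading by row or reading by column) can be sketched as two access patterns over the same stored matrix; the function names are illustrative:

```python
# Sketch of the two preset reading modes of claims 10-11: reading a
# stored matrix either row by row or column by column.

def read_by_row(m):
    # emit the matrix exactly as stored, one row at a time
    return [list(row) for row in m]

def read_by_col(m):
    # emit one column at a time by transposing the access pattern
    return [list(col) for col in zip(*m)]

M = [[1, 2, 3],
     [4, 5, 6]]
print(read_by_row(M))  # [[1, 2, 3], [4, 5, 6]]
print(read_by_col(M))  # [[1, 4], [2, 5], [3, 6]]
```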

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980101046.3A CN114503126A (en) 2019-10-18 2019-10-18 Matrix operation circuit, device and method
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Publications (1)

Publication Number Publication Date
WO2021072732A1 true WO2021072732A1 (en) 2021-04-22

Family

ID=75537332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111878 WO2021072732A1 (en) 2019-10-18 2019-10-18 Matrix computing circuit, apparatus and method

Country Status (2)

Country Link
CN (1) CN114503126A (en)
WO (1) WO2021072732A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784973A (en) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 Convolution operation circuit, device and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630242A (en) * 2009-07-28 2010-01-20 苏州国芯科技有限公司 Contribution module for rapidly computing self-adaptive code book by G723.1 coder
CN102065309A (en) * 2010-12-07 2011-05-18 青岛海信信芯科技有限公司 DCT (Discrete Cosine Transform) realizing method and circuit
CN109416756A (en) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 Acoustic convolver and its applied artificial intelligence process device
US20190179776A1 (en) * 2017-09-15 2019-06-13 Mythic, Inc. System and methods for mixed-signal computing
CN110197274A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792868A (en) * 2021-09-14 2021-12-14 绍兴埃瓦科技有限公司 Neural network computing module, method and communication device
CN113807509A (en) * 2021-09-14 2021-12-17 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113807509B (en) * 2021-09-14 2024-03-22 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113792868B (en) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 Neural network computing module, method and communication equipment
CN114723034A (en) * 2022-06-10 2022-07-08 之江实验室 Separable image processing neural network accelerator and acceleration method
CN114723034B (en) * 2022-06-10 2022-10-04 之江实验室 Separable image processing neural network accelerator and acceleration method

Also Published As

Publication number Publication date
CN114503126A (en) 2022-05-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1