WO2021072732A1 - Matrix operation circuit, apparatus and method - Google Patents

Matrix operation circuit, apparatus and method

Info

Publication number
WO2021072732A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
input
data
instruction
input matrix
Prior art date
Application number
PCT/CN2019/111878
Other languages
English (en)
Chinese (zh)
Inventor
罗飞
王维伟
Original Assignee
北京希姆计算科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Priority to PCT/CN2019/111878 (WO2021072732A1)
Priority to CN201980101046.3A (CN114503126A)
Publication of WO2021072732A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a matrix operation circuit, device and method.
  • Chips are the cornerstone of data processing; they fundamentally determine people's ability to process data.
  • One is the general-purpose chip route, such as the CPU (Central Processing Unit): such chips provide great flexibility, but their efficiency is relatively low when processing algorithms in specific fields. The other is the dedicated chip route, such as the TPU (Tensor Processing Unit): such chips can be highly efficient in specific fields.
  • CPU scheme: with a single-core CPU, the matrix is disassembled into scalars for operation, and the convolution operation is realized by combining scalar instructions; with a multi-core CPU, multiple cores may each execute scalar instructions in parallel and combine them to implement the convolution operation.
  • GPU (Graphics Processing Unit) scheme: the GPU disassembles the convolution operation into multiple instruction operations, mainly vector instructions, and realizes the convolution operation by combining and executing vector instructions.
  • This solution has the following disadvantages: the underlying program is complex, generally requiring multiple layers of loops to implement a convolution; realizing the convolution through many combinations of vector instructions is inefficient; the GPU needs to access data multiple times, which increases both the computation time and the power consumption of the convolution operation; and the GPU cache is limited, so a relatively large convolution requires multiple transfers from outside the chip, reducing effectiveness.
  • an embodiment of the present disclosure provides a matrix operation circuit, including:
  • an arithmetic unit array, which includes a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register;
  • the first input register is used to receive data of a first input matrix
  • the second input register is used to receive data of a second input matrix
  • the control circuit is configured to receive a matrix operation instruction and, in response, control at least one operation unit of the plurality of operation units to perform an arithmetic operation on the first input matrix and the second input matrix according to the instruction, wherein the instruction is a single instruction;
  • the output register is used to store the operation result of the operation operation.
  • the instruction includes the instruction name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • the matrix operation instruction is a matrix multiplication instruction.
  • the arithmetic unit further includes an arithmetic element that contains at least a multiplier and an adder; the arithmetic element is configured to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the accumulated value is stored in the output register.
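  • As a rough illustration of this multiply-accumulate behavior, the following sketch models one arithmetic unit with its two input registers and an accumulating output register (the class and method names are illustrative, not from the disclosure):

```python
class ArithmeticUnit:
    """Toy model of one arithmetic unit (PU) with Rin1, Rin2 and Rout."""

    def __init__(self):
        self.rout = 0.0  # output register holds the running accumulation

    def clock(self, a, b):
        # One cycle: latch the inputs, multiply them, accumulate into Rout.
        self.rout += a * b
        return self.rout

pu = ArithmeticUnit()
for a, b in [(1, 0.1), (2, 0.2)]:  # two clock cycles of input data
    pu.clock(a, b)
print(pu.rout)  # 1*0.1 + 2*0.2, i.e. about 0.5
```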
  • matrix multiplication instruction is used to implement matrix convolution operation, wherein:
  • the data of the first input matrix is row vector data of the first input matrix
  • the data of the second input matrix is row vector data of the second input matrix.
  • a matrix calculation device including:
  • a memory for storing matrix operation instructions, a first input matrix, a second input matrix, and an output matrix
  • An instruction fetching module connected to the memory, and configured to obtain the matrix operation instruction from the memory
  • a decoding module connected to the instruction fetching module, and configured to decode the matrix operation instructions acquired by the instruction fetching module
  • a register for storing attribute data of the first input matrix, the second input matrix, and the output matrix
  • the execution module is connected to the decoding module, the memory and the register, and includes the matrix operation circuit according to claims 1-6, which is used to execute the decoded matrix operation instruction.
  • The execution module obtains the decoded matrix operation instruction from the decoding module; obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the two input matrices; calculates the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction to obtain the data of the output matrix; and stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • the attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix
  • the attribute data of the second input matrix includes the number of rows of the second input matrix, The number of columns and the interval of row vectors
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the interval of row vectors of the output matrix.
  • the execution module acquiring data of the first input matrix and data of the second input matrix for calculation from the memory according to the attribute data of the first input matrix and the second input matrix includes:
  • the execution module reads the data of the first input matrix according to the preset first reading method and the attribute data of the first input matrix
  • the execution module reads the data of the second input matrix according to the preset second reading method and the attribute data of the second input matrix.
  • the first reading mode is row reading or column reading; the second reading mode is row reading or column reading.
  • embodiments of the present disclosure provide a matrix operation method, which is based on the matrix operation circuit of any one of the foregoing first aspects, and is characterized in that it includes:
  • Based on the decoded matrix operation instruction, the matrix operation circuit obtains the data of the first input matrix and the data of the second input matrix from the memory, performs the operation, and stores the operation result in the memory after the operation is completed.
  • an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to execute the computer-readable instructions, such that when run, the processor realizes any one of the matrix operation methods of the third aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium, characterized in that it stores computer instructions used to make a computer execute any one of the matrix operation methods described in the foregoing third aspect.
  • embodiments of the present disclosure provide a computer program product, characterized in that it includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute any one of the matrix operation methods described in the foregoing third aspect.
  • an embodiment of the present disclosure provides a chip, which is characterized by comprising the matrix operation circuit described in any one of the first aspect.
  • an embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in any one of the seventh aspects.
  • the embodiments of the present disclosure disclose a matrix operation circuit, device and method.
  • The matrix operation circuit includes: a control circuit; and an arithmetic unit array comprising a plurality of arithmetic units, each arithmetic unit including a first input register, a second input register, and an output register. The first input register is used to receive the data of the first input matrix, and the second input register is used to receive the data of the second input matrix; the control circuit is used to receive a matrix operation instruction and, in response, control at least one of the arithmetic units to perform an arithmetic operation on the first input matrix and the second input matrix according to the instruction, wherein the instruction is a single instruction; the output register is used to store the result of the arithmetic operation.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure
  • FIG. 2 is a schematic structural diagram of an arithmetic unit provided by an embodiment of the disclosure.
  • FIG. 3 is a further structural schematic diagram of an arithmetic unit provided by an embodiment of the disclosure.
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the present disclosure.
  • Figures 5 to 8 are calculation processes of partial convolution provided by embodiments of the disclosure.
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • FIGS. 10a-10d are schematic diagrams of the storage order and format of the matrices in the disclosure.
  • FIGS. 11-15 are schematic diagrams of data overlap in convolution operations.
  • FIG. 16 is a schematic diagram of a specific example of the convolution operation in the embodiment of the disclosure.
  • FIG. 1 is a schematic structural diagram of a matrix operation circuit provided by an embodiment of the disclosure.
  • the matrix operation circuit 100 includes a control circuit 101 and an arithmetic unit array 102.
  • the arithmetic unit array includes a plurality of arithmetic units (PU) 103.
  • the arithmetic unit 103 includes a first input register (Rin1) 104, a second input register (Rin2) 105, and an output register (Rout) 106, wherein the first input register 104 is used to receive the data of the first input matrix, and the second input register 105 is used to receive the data of the second input matrix;
  • the control circuit 101 is configured to receive a matrix operation instruction and, in response, control at least one operation unit 103 of the plurality of operation units 103 to perform arithmetic operations on the first input matrix and the second input matrix according to the instruction, wherein the instruction is a single instruction; the output register 106 is used to store the result of the arithmetic operation.
  • the single command includes the command name, the first address of the first input matrix, the first address of the second input matrix, and the first address of the output matrix.
  • An exemplary instruction format: instruction name | first address of the first input matrix | first address of the second input matrix | first address of the output matrix.
  • the instruction name corresponds to the meaning, format and operation of the instruction.
  • the first address of the first input matrix and the first address of the second input matrix respectively define the read addresses of the two source operands of the instruction, and the first address of the output matrix is defined Indicates the storage address of the destination operand of the instruction.
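  • For concreteness, the instruction's four fields can be modeled as a simple record (the field names rs1/rs2/rd are assumptions for illustration; the disclosure specifies only the instruction name and the three first addresses):

```python
from collections import namedtuple

# name: determines the meaning/format/operation of the instruction;
# rs1/rs2: read addresses of the two source operands;
# rd: storage address of the destination operand.
MatrixInstr = namedtuple("MatrixInstr", ["name", "rs1", "rs2", "rd"])

instr = MatrixInstr(name="matrix-multiply", rs1=0x1000, rs2=0x2000, rd=0x3000)
print(instr.name, hex(instr.rs1), hex(instr.rs2), hex(instr.rd))
```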
  • the above exemplary instruction is a matrix multiplication instruction, which implements the multiplication of two matrices, specifically:
  • FIG. 3 is a further structural schematic diagram of an arithmetic unit provided by an embodiment of the disclosure.
  • In addition to the first input register, the second input register, and the output register, the arithmetic unit also includes an arithmetic element.
  • The arithmetic element includes at least a multiplier 301 and an adder 302, and is used to combine the multiplier and the adder according to the matrix multiplication instruction to perform a matrix multiplication operation.
  • the data of the first input matrix in the first input register is the data sequentially read into the first input register according to the first address of the first input matrix; the data in the second input register The data of the second input matrix is data sequentially read into the second input register according to the first address of the second input matrix.
  • the multiplier is used to calculate the product of the data in the first input register and the second input register.
  • The multiplier in the arithmetic unit calculates the product a11*b11 and sends it to the adder for accumulation.
  • The accumulation result of the adder is then a11*b11. In the next clock cycle the calculation continues: the data received in the first input register is a12 and the data received in the second input register is b21, so the multiplier calculates the product a12*b21 and sends it to the adder for accumulation.
  • Since the accumulated value from the previous clock cycle is a11*b11, this clock cycle computes a11*b11 + a12*b21. The above operations continue until one row of the first input matrix and one column of the second input matrix have been fully processed, yielding the final accumulated value, which is then stored in the output register.
  • the control circuit stores the accumulated value in the output register in the system memory according to the first address of the output matrix in the instruction.
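  • The per-cycle behavior across the whole array can be sketched as follows: each PU owns one output element and performs one multiply-accumulate per clock cycle (this scheduling is an assumption for illustration; the disclosure does not fix cycle-level timing):

```python
def matmul_pu_array(A, B):
    """Toy cycle-by-cycle model of the PU array computing A @ B."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    rout = [[0] * m for _ in range(n)]   # one output register per PU
    for cycle in range(k):               # one multiply-accumulate per cycle
        for i in range(n):
            for j in range(m):
                rout[i][j] += A[i][cycle] * B[cycle][j]
    return rout

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_pu_array(A, B))  # [[19, 22], [43, 50]]
```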
  • the matrix multiplication instructions can implement matrix convolution operations.
  • the convolution calculation of the matrix is the cumulative sum of the product of the data point-to-point multiplication of the two matrices.
  • An exemplary convolution operation is as follows:
  • FIGS. 4a-4b are schematic diagrams of convolution operations in an embodiment of the disclosure.
  • Figure 4a is the overall schematic diagram of the convolution operation process, and the diagram is explained as follows:
  • Cin: the number of channels of the input feature map, later collectively referred to as the depth of the input feature map;
  • Kw: the width of the convolution kernel;
  • Wout: the width of the output feature map;
  • Hout: the height of the output feature map.
  • the feature points on the input feature map constitute the first input matrix; the points on the convolution kernel constitute the second input matrix; the output feature map is the output matrix, and a feature point on the output feature map is a data on the output matrix.
  • The convolution kernel slides over the input feature map. At each sliding position, it multiplies and accumulates its data with the corresponding data in the input feature map to extract one output feature point, that is, one data element of the output matrix.
  • Figure 4b is a schematic diagram of the calculation of an input feature point with depth.
  • The convolution kernel slides over the input feature map. When it stops at a position, it multiplies and accumulates the corresponding data with the feature points of the input feature map at that position to obtain the output feature point corresponding to the position. There are Cout convolution kernels, and each kernel multiplies and accumulates its data with the feature points of the input feature map at the same position, yielding Cout output feature points in the depth direction. These Cout output feature points compose one feature point with depth on the output feature map, whose depth is Cout. The convolution kernel slides over the entire input feature map to obtain the entire output feature map.
  • Dout is a point with depth in the output feature map, and its superscript l corresponds to the output depth; Din refers to the data in the input feature map corresponding to the convolution kernel, with superscript i corresponding to the depth of the input feature map and j and k corresponding to the width and height of the convolution kernel, respectively; w is the convolution kernel, with superscripts l and i corresponding to the depth of the output feature map and of the input feature map, respectively, and j and k corresponding to the width and height of the convolution kernel.
  • A convolution kernel of size Kh*Kw*Cin can be divided into Kh partial convolution kernels of size Kw*Cin to perform partial feature extraction, each achieving 1/Kh of the full feature extraction, i.e. the part of the features corresponding to one Kw*Cin partial kernel. Finally, the Kh partial results are added together to obtain the final result.
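  • A quick numerical check of this decomposition (NumPy is used for brevity; the shapes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Kh, Kw, Cin = 3, 3, 4
patch  = rng.standard_normal((Kh, Kw, Cin))   # input window under the kernel
kernel = rng.standard_normal((Kh, Kw, Cin))   # full Kh*Kw*Cin kernel

# Full convolution at one position: point-to-point multiply, then sum.
full = float(np.sum(patch * kernel))

# Kh partial convolutions with Kw*Cin slices, then add the partial results.
partial = sum(float(np.sum(patch[h] * kernel[h])) for h in range(Kh))

print(abs(full - partial) < 1e-9)  # the Kh partial results sum to the full result
```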
  • One row of the input data matrix can then be multiplied by the Cout-column convolution kernel matrix composed of the Cout convolution kernels, that is, the weight matrix.
  • This yields one feature point with depth; the feature point is a vector whose length is the output depth Cout. Its realization is shown in Figure 7.
  • Since the convolution (or partial convolution) of the neural network is the sliding of the (partial) convolution kernel over the input feature map, it can be regarded as the input feature map data changing with the sliding while the weights remain unchanged. In this way, the neural network convolution becomes the multiplication of a Wout-row input data matrix by a Cout-column weight matrix, yielding a Wout-row output data matrix.
  • the implementation is shown in Figure 8.
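  • The Wout-row by Cout-column formulation can be checked with a small sketch (a 1-D sliding example with an assumed data layout; NumPy for brevity):

```python
import numpy as np

def conv_as_matmul(feature, kernels, stride=1):
    """feature: (Win, Cin) slice; kernels: (Cout, Kw, Cin)."""
    Cout, Kw, Cin = kernels.shape
    Wout = (feature.shape[0] - Kw) // stride + 1
    # Each sliding position becomes one row of the input data matrix...
    rows = np.stack([feature[i*stride:i*stride+Kw].ravel() for i in range(Wout)])
    # ...and each kernel becomes one column of the weight matrix.
    cols = kernels.reshape(Cout, Kw * Cin).T
    return rows @ cols  # (Wout, Cout) output data matrix

rng = np.random.default_rng(1)
f = rng.standard_normal((6, 2))     # Win=6, Cin=2
k = rng.standard_normal((3, 2, 2))  # Cout=3, Kw=2, Cin=2
out = conv_as_matmul(f, k)
print(out.shape)  # (5, 3): Wout rows, Cout columns
```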
  • The above convolution therefore needs only a single instruction, namely the matrix multiplication instruction above, to complete the entire convolution process; only the order of data reading needs to be set in advance in the upper-level program.
  • the matrix multiplication instruction is used to implement a matrix convolution operation, wherein: the data of the first input matrix is the row vector data of the first input matrix; the data of the second input matrix is the second input matrix The row vector data.
  • The data in the first input register and the data in the second input register both need to be row vector data of their matrices, so that the computed result is the convolution result.
  • Figure 16 is a schematic diagram of the above convolution calculation.
  • the step size is 1.
  • the output matrix is a 2*2*2 matrix.
  • The partial convolution method is used to first calculate intermediate values for each point of the output matrix. As shown in Figure 16, during the partial convolution one row of the second input matrix slides over the first input matrix. The data read from the first input matrix is shown in 1601: each row corresponds to the data of the first input matrix at one position of the second input matrix 1602, comprising 3 numbers with depth, 6 data in total. One column of 1602 is one row of the second input matrix, likewise 3 numbers and 6 data in total. One row of data in 1601 is multiplied and accumulated with one column of data in 1602 to obtain one point in 1603; that point is the result of the partial convolution.
  • The result of the partial convolution is part of the value of a point in the output matrix; finally, the results calculated by sliding each of the three rows of the second input matrix over the first input matrix are accumulated to obtain the value of a point in the output matrix (one of the two values of the point, whose depth is 2).
  • The value 1 of data 1601 of the first input matrix is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.1 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 1.9 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform the data multiplication and send the results into their output registers for saving. In the next clock cycle, the value 2 of data 1601 is sent to Rin1 of PU11 and Rin1 of PU12, and the corresponding value of data 1601 is sent to Rin1 of PU21 and Rin1 of PU22; the value 0.2 of the second input matrix 1602 is sent to Rin2 of PU11 and Rin2 of PU21, and the value 2.0 of the second input matrix 1602 is sent to Rin2 of PU12 and Rin2 of PU22. PU11, PU12, PU21, and PU22 perform this data multiplication, send the results to their output registers, and accumulate them with the previously saved results.
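  • Replaying just the values the text gives for PU11 and PU12 (the first-row inputs of PU21 and PU22 are not stated above, so only these two PUs are modeled in this sketch):

```python
rout = {"PU11": 0.0, "PU12": 0.0}          # output registers
schedule = [
    # (Rin1 value, {PU: Rin2 value}) per clock cycle, as in the text
    (1, {"PU11": 0.1, "PU12": 1.9}),
    (2, {"PU11": 0.2, "PU12": 2.0}),
]
for rin1, rin2s in schedule:
    for pu, rin2 in rin2s.items():
        rout[pu] += rin1 * rin2            # multiply, accumulate into Rout
print(round(rout["PU11"], 6), round(rout["PU12"], 6))  # 0.5 and 5.9
```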
  • FIG. 9 is a schematic structural diagram of a matrix calculation device provided by an embodiment of the disclosure.
  • The matrix calculation device 900 includes: a memory 901 for storing the matrix operation instruction, the first input matrix, the second input matrix, and the output matrix; an instruction fetching module 902, connected to the memory 901, for obtaining the matrix operation instruction from the memory 901; a decoding module 903, connected to the instruction fetching module 902, for decoding the matrix operation instruction obtained by the instruction fetching module 902; a register 904 for storing the attribute data of the first input matrix, the second input matrix, and the output matrix; and an execution module 905, connected to the decoding module 903, the memory 901, and the register 904, which includes the matrix operation circuit of the above embodiment and is used to execute the decoded matrix operation instruction.
  • The execution module obtains the decoded matrix operation instruction from the decoding module; obtains the attribute data of the first input matrix, of the second input matrix, and of the output matrix from the register; obtains the data of the first input matrix and the data of the second input matrix used for calculation from the memory according to the attribute data of the two input matrices; calculates the data of the first input matrix and the data of the second input matrix according to the decoded matrix operation instruction to obtain the data of the output matrix; and stores the data of the output matrix in the memory according to the attribute data of the output matrix.
  • The attribute data of the first input matrix includes the number of rows, the number of columns, and the row vector interval of the first input matrix;
  • the attribute data of the second input matrix includes the number of rows, the number of columns, and the row vector interval of the second input matrix;
  • the attribute data of the output matrix includes the number of rows, the number of columns, and the interval of row vectors of the output matrix.
  • the number of rows and the number of columns defines the size of the matrix
  • The row vector interval defines the difference between the storage addresses of two adjacent rows of the matrix. For example, if each row of the matrix has 10 int8 matrix elements and the rows are stored contiguously, the row vector interval is 10 bytes. If the rows are stored at an interval of, say, 20 bytes, then 10 bytes hold matrix elements and the other 10 bytes do not belong to the matrix; they may be invalid data or data for other uses.
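  • The resulting addressing rule can be sketched as a small helper (the function name and signature are illustrative assumptions):

```python
def element_address(base, row, col, row_interval_bytes, elem_bytes=1):
    """Byte address of element (row, col): rows need not be contiguous,
    so the row interval may exceed the row's payload, skipping padding
    or foreign data between rows. int8 elements => elem_bytes = 1."""
    return base + row * row_interval_bytes + col * elem_bytes

# 10 int8 elements per row, stored with a 20-byte row vector interval:
print(element_address(0x1000, 0, 0, 20))  # start of row 0
print(element_address(0x1000, 1, 0, 20))  # row 1 starts 20 bytes later
```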
  • The execution module obtaining, from the memory, the data of the first input matrix and the data of the second input matrix used for calculation according to their attribute data includes: the execution module reads the data of the first input matrix according to a preset first reading method and the attribute data of the first input matrix; and the execution module reads the data of the second input matrix according to a preset second reading method and the attribute data of the second input matrix.
  • the first reading mode is row reading or column reading; the second reading mode is row reading or column reading.
  • Suppose the attribute data defines that the first input matrix has 5 rows and 5 columns with a row interval of 5 bytes, and the preset reading method is reading by row. Then, starting from the first address of the first input matrix in the instruction, the first row is read; the number of columns indicates that the row has 5 matrix elements. The first address plus the row interval is then used as the new first address to read the second row, again 5 matrix elements. Reading 5 times in sequence in this way, the execution module obtains all the data of the first input matrix.
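  • The row-by-row read loop just described can be sketched as follows (a toy byte-addressed memory; names are illustrative):

```python
def read_matrix_rows(mem, base, rows, cols, row_interval):
    """Read `rows` rows of `cols` elements, stepping the first address
    by the row interval after each row, as described above."""
    out, addr = [], base
    for _ in range(rows):
        out.append(mem[addr:addr + cols])  # one row of `cols` elements
        addr += row_interval               # first address + row interval
    return out

mem = list(range(100))                         # toy byte-addressed memory
print(read_matrix_rows(mem, 0, 5, 5, 5)[0])    # contiguous rows: [0, 1, 2, 3, 4]
print(read_matrix_rows(mem, 0, 5, 5, 20)[1])   # 20-byte interval: [20, 21, 22, 23, 24]
```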
  • The execution module storing the data of the output matrix in the memory according to the attribute data of the output matrix includes: the execution module stores the data of the output matrix in the memory according to a preset storage mode and the attribute data of the output matrix.
  • the predetermined storage mode is row storage or column storage, and the specific storage mode is similar to reading, but the direction is opposite, so it will not be repeated here.
  • Fig. 10a is a schematic diagram of the storage order and format of the first input matrix in the present disclosure.
  • As shown in FIG. 10a, it is an example of the first input matrix in the above embodiment.
  • When stored in the memory, it is stored with depth Cin first, then width Win, and finally height Hin.
  • Starting from the first point, each point is stored in this way until all points are stored.
  • An example of the storage order and format of the first input matrix is shown in Figure 10b.
  • Fig. 10c is a schematic diagram of the storage order and format of the second input matrix of the present disclosure. As shown in Fig. 10c, it is an example of the second input matrix in the above-mentioned embodiment.
  • The Cout convolution kernels are laid out across the columns, one convolution kernel per column; in the column direction, the depth Cin of the convolution kernel takes priority, then the width Kw, and then the height Kh.
  • RS1 as shown in Figure 10a is the first address of the first input matrix.
  • The data of the matrix that is read in can be controlled through the storage format and the reading method settings. For example, with the row-order storage of the above example, combined with reading by row and the row interval, the row vector data of the first input matrix can be read.
  • RS2 is the first address of the second input matrix.
  • The data of the read matrix can likewise be controlled through the storage format and reading mode settings; for example, storing in the order of the above example, combined with reading by row and the row interval, the row vector data of the second input matrix can be read. After the data of the first input matrix and the second input matrix are read out, the single-instruction convolution operation is completed using the matrix multiplication instruction.
  • the attribute data of the first input matrix, the second input matrix, and the output matrix may be set.
  • multiple registers are defined to store the attribute data of each matrix.
  • An example configuration of the registers is shown in the following table:
  • Shape1: bits 31:16 = number of columns of the first input matrix (matrix width); bits 15:0 = number of rows of the first input matrix (matrix length)
  • Shape2: bits 31:16 = number of columns of the second input matrix (matrix width); bits 15:0 = number of rows of the second input matrix (matrix length)
  • Stride1: bits 15:0 = row interval of the output matrix, that is, the number of data elements between the head of one row and the head of the next row (likewise below)
  • Stride2: bits 31:16 = row interval of the second input matrix; bits 15:0 = row interval of the first input matrix
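  • The bit-field packing of these attribute registers (31:16 for one field, 15:0 for the other) can be sketched as follows, using Shape1 as the example (helper names are illustrative):

```python
def pack_shape(cols, rows):
    """Pack columns into bits 31:16 and rows into bits 15:0."""
    return ((cols & 0xFFFF) << 16) | (rows & 0xFFFF)

def unpack_shape(reg):
    """Return (columns, rows) from a packed 32-bit attribute register."""
    return reg >> 16, reg & 0xFFFF

shape1 = pack_shape(cols=5, rows=5)
print(hex(shape1), unpack_shape(shape1))  # 0x50005 (5, 5)
```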
  • Figures 11-15 are schematic diagrams of data overlap in convolution operations.
  • the convolution operation there is often a partial overlap between the data used to calculate the previous output feature point and the data used to calculate the next output feature point.
  • As shown in Figure 11, take a 3*3 convolution kernel with stride 1 (the stride is the distance the convolution kernel slides each time) as an example, calculating the second point in the second row. After the second point in the second row has been calculated, as shown in Figure 12, the convolution kernel slides one point to the right to calculate the third point in the second row. Viewed as a sliding process over the input data, the input data of the two calculations partially overlaps; the overlapping part is shown as the gray part in Figure 13.
  • The gray part is the data overlapped between the two preceding calculations. This means that, as the convolution kernel slides, two-thirds of the input matrix data is repeated between the convolution calculations at two adjacent positions.
  • The data in matrix one is recorded with coordinates x, y, and z on the coordinate system in Figure 14.
  • The weight of the partial convolution is matrix two; since w has one more dimension, Cout, a dimension is added to its superscript, and the data in the output matrix is recorded accordingly.
  • the calculation process of the matrix is shown in Figure 15.
  • the depth of the two points is 8, which is represented as the second row of the output matrix in Figure 15 It is a point of the output feature map, the depth of this point is Cout, and each data sequence in the depth corresponds to each data of the output matrix; when calculating the third point in the first row of the output feature map, use the third of matrix one The row data is multiplied and accumulated with each column of matrix two, and the third point with depth on the output feature map is obtained.
  • the third point is a point with a depth of 8.
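The row-times-columns multiply-accumulate described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the patent's circuit; here K is the row length of matrix one and Cout is the column count of matrix two:

```python
def output_point(row, matrix_two):
    """Multiply-accumulate one row of matrix one (length K) with each of
    the Cout columns of matrix two, producing one output feature-map
    point of depth Cout."""
    cout = len(matrix_two[0])
    return [sum(a * w[c] for a, w in zip(row, matrix_two))
            for c in range(cout)]

# Tiny example with K = 3 and Cout = 2
point = output_point([1, 2, 3], [[1, 0], [0, 1], [1, 1]])  # [4, 5]
```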
  • The 8 data values in the gray part overlap between the two calculations. If the input were stored directly in the form of matrix one, this duplication would waste a large amount of memory.
  • The storage waste described above can be avoided by setting the row interval described earlier.
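The effect of the row interval can be illustrated with a small sketch: instead of materializing the overlapping rows of matrix one, each logical row is read from one shared flat buffer, starting stride*Cin elements after the previous row head. The dimensions W = 5, Cin = 8, kw = 3 and the helper name are illustrative assumptions, not values from the patent:

```python
# Reading matrix one through a row interval over one shared buffer,
# so the overlapping data are stored only once.
W, Cin, kw, stride = 5, 8, 3, 1      # illustrative dimensions
buf = list(range(W * Cin))           # flattened input row (40 values)
row_len = kw * Cin                   # 24 values per logical matrix-one row
row_interval = stride * Cin          # row heads are only 8 values apart

def logical_row(r):
    """Row r of matrix one, read in place -- no copy of the overlap."""
    start = r * row_interval
    return buf[start:start + row_len]

n_rows = (W - kw) // stride + 1      # 3 output points per image row
shared = set(logical_row(0)) & set(logical_row(1))
# 16 of the 24 values of adjacent rows are shared (two-thirds of the row)
```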
  • An embodiment of the present disclosure also provides a matrix operation method based on any of the foregoing matrix operation circuits, comprising: fetching a matrix operation instruction from a memory; decoding the matrix operation instruction and sending the decoded matrix operation instruction to the matrix operation circuit; and, based on the decoded matrix operation instruction, having the matrix operation circuit obtain the data of the first input matrix and the data of the second input matrix from the memory, perform the operation, and store the operation result in the memory after the operation is completed.
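The steps of the method above (fetch, decode, execute and store) can be modelled with a minimal sketch. Every class, name, and instruction encoding here is a hypothetical stand-in, not the patent's instruction format:

```python
from types import SimpleNamespace

class Memory:
    """Toy memory holding an instruction stream and matrix data."""
    def __init__(self, program, data):
        self.program, self.data = list(program), dict(data)
    def fetch_instruction(self):
        return self.program.pop(0)          # step 1: fetch from memory

def decode(instr):
    """Step 2: decode -- instr is ("MATMUL", src1, src2, dst), an
    illustrative encoding."""
    op, src1, src2, dst = instr
    return SimpleNamespace(op=op, src1=src1, src2=src2, dst=dst)

def matmul(a, b):
    """Stand-in for the operation unit array."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def step(mem):
    """Step 3: load both input matrices, perform the operation, and
    store the result back to memory."""
    op = decode(mem.fetch_instruction())
    a, b = mem.data[op.src1], mem.data[op.src2]
    mem.data[op.dst] = matmul(a, b)

mem = Memory([("MATMUL", "A", "B", "C")],
             {"A": [[1, 2], [3, 4]], "B": [[1, 0], [0, 1]]})
step(mem)   # mem.data["C"] == [[1, 2], [3, 4]]
```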
  • An embodiment of the present disclosure also provides an electronic device, comprising: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions such that the processor implements any one of the matrix operation methods in the foregoing embodiments.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute any one of the matrix operation methods in the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product comprising computer instructions, wherein, when the computer instructions are executed by a computing device, the computing device can execute any one of the matrix operation methods in the foregoing embodiments.
  • An embodiment of the present disclosure provides a chip, which is characterized by including the matrix operation circuit described in any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in any of the foregoing embodiments.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • It should also be noted that the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure can be implemented in software or in hardware, and in some circumstances the name of a unit does not constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System on Chip (SOC), Complex Programmable Logical device (CPLD) and so on.
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A matrix operation circuit, device and method are disclosed. The matrix operation circuit comprises: a control circuit; and an operation unit array comprising a plurality of operation units. Each operation unit comprises a first input register, a second input register and an output register. The first input register is used to receive data of a first input matrix. The second input register is used to receive data of a second input matrix. The control circuit is used to receive a matrix operation instruction and, in response to the instruction, control at least one of the plurality of operation units to perform, as indicated by the instruction, an operation on the first input matrix and the second input matrix, the instruction being a single instruction. The output register is used to store the operation result of the operation. This approach solves the prior-art technical problems of low operation efficiency and high power consumption during convolution computation.
PCT/CN2019/111878 2019-10-18 2019-10-18 Circuit, appareil et procédé de calcul matriciel WO2021072732A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/111878 WO2021072732A1 (fr) 2019-10-18 2019-10-18 Circuit, appareil et procédé de calcul matriciel
CN201980101046.3A CN114503126A (zh) 2019-10-18 2019-10-18 矩阵运算电路、装置以及方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/111878 WO2021072732A1 (fr) 2019-10-18 2019-10-18 Circuit, appareil et procédé de calcul matriciel

Publications (1)

Publication Number Publication Date
WO2021072732A1 true WO2021072732A1 (fr) 2021-04-22

Family

ID=75537332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111878 WO2021072732A1 (fr) 2019-10-18 2019-10-18 Circuit, appareil et procédé de calcul matriciel

Country Status (2)

Country Link
CN (1) CN114503126A (fr)
WO (1) WO2021072732A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784973A (zh) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 卷积运算电路、装置以及方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630242A (zh) * 2009-07-28 2010-01-20 苏州国芯科技有限公司 G.723.1编码器快速计算自适应码书的贡献模块
CN102065309A (zh) * 2010-12-07 2011-05-18 青岛海信信芯科技有限公司 一种dct实现方法及dct实现电路
CN109416756A (zh) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 卷积器及其所应用的人工智能处理装置
US20190179776A1 (en) * 2017-09-15 2019-06-13 Mythic, Inc. System and methods for mixed-signal computing
CN110197274A (zh) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 集成电路芯片装置及相关产品

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792868A (zh) * 2021-09-14 2021-12-14 绍兴埃瓦科技有限公司 神经网络计算模块、方法和通信设备
CN113807509A (zh) * 2021-09-14 2021-12-17 绍兴埃瓦科技有限公司 神经网络加速装置、方法和通信设备
CN113807509B (zh) * 2021-09-14 2024-03-22 绍兴埃瓦科技有限公司 神经网络加速装置、方法和通信设备
CN113792868B (zh) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 神经网络计算模块、方法和通信设备
CN114723034A (zh) * 2022-06-10 2022-07-08 之江实验室 一种可分离的图像处理神经网络加速器及加速方法
CN114723034B (zh) * 2022-06-10 2022-10-04 之江实验室 一种可分离的图像处理神经网络加速器及加速方法

Also Published As

Publication number Publication date
CN114503126A (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2021088563A1 (fr) Circuit, appareil et procédé d'opération de convolution
WO2021072732A1 (fr) Circuit, appareil et procédé de calcul matriciel
JP6977239B2 (ja) 行列乗算器
JP6796177B2 (ja) コンパクトな演算処理要素を用いたプロセッシング
Fang et al. swdnn: A library for accelerating deep learning applications on sunway taihulight
CN111580865B (zh) 一种向量运算装置及运算方法
WO2017124647A1 (fr) Appareil de calcul de matrice
CN108388537B (zh) 一种卷积神经网络加速装置和方法
EP3326060B1 (fr) Opérations instruction unique, données multiples de largeurs mixtes comprenant des opérations d'éléments pairs et d'éléments impairs mettant en oeuvre une paire de registres pour de larges éléments de données
WO2017185392A1 (fr) Dispositif et procédé permettant d'effectuer quatre opérations fondamentales de calcul de vecteurs
US20220206796A1 (en) Multi-functional execution lane for image processor
CN112348182B (zh) 一种神经网络maxout层计算装置
WO2021147567A1 (fr) Procédé d'opération de convolution et puce
US11726757B2 (en) Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction
KR20210084220A (ko) 부분 판독/기입을 갖는 재구성 가능한 시스톨릭 어레이를 위한 시스템 및 방법
CN113222099A (zh) 卷积运算方法及芯片
JP7136343B2 (ja) データ処理システム、方法、およびプログラム
Hosseini et al. Fast implementation of dense stereo vision algorithms on a highly parallel SIMD architecture
WO2021057112A1 (fr) Circuit d'opération matricielle, dispositif d'opération matricielle et procédé d'opération matricielle
WO2021057111A1 (fr) Dispositif et procédé informatique, puce, dispositif électronique, support d'informations et programme
JP5025521B2 (ja) 半導体装置
Murakami FPGA implementation of a SIMD-based array processor with torus interconnect
CN115617717A (zh) 一种基于忆阻器的协处理器设计方法
TW202416185A (zh) 核心執行的深度融合

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19948937

Country of ref document: EP

Kind code of ref document: A1