WO2019205617A1 - Calculation method and apparatus for matrix multiplication - Google Patents

Calculation method and apparatus for matrix multiplication

Info

Publication number
WO2019205617A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
repository
multiplication
calculation
column
Prior art date
Application number
PCT/CN2018/117559
Other languages
French (fr)
Chinese (zh)
Inventor
方民权
吴小蓉
程剑
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2019205617A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • The present invention relates to the field of graphics technologies, and in particular to matrix multiplication calculation.
  • A graphics processing unit (GPU) is a microprocessor that performs image computing operations on devices such as host computers.
  • The streaming multiprocessor (SM) is a basic computing unit. It adopts a single-instruction, multiple-thread execution mode and can execute multiple threads simultaneously.
  • An SM includes an instruction buffer, a warp scheduler, a dispatch unit, streaming processors (SP), double-precision floating-point units (DP), and other units.
  • Matrix multiplication is one of the most important operations in the data calculation that the GPU performs during image processing, and it has many applications.
  • Convolutional neural networks can give good results in image and speech recognition and perform well on large-scale image processing.
  • A convolution calculation can be converted into a matrix multiplication: the convolution kernel matrix and the input image matrix are transformed into two large matrices A and B, and A and B are then multiplied to obtain a result matrix D. Each row of the result matrix D represents one output image, so the number of output images equals the number of rows of D.
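The conversion described above can be illustrated with a small sketch. It is not taken from the application: it assumes a single-channel image, stride 1, no padding, and the unflipped (cross-correlation) form of convolution commonly used in neural networks. The names `im2col` and `conv_as_matmul` are hypothetical.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one column (stride 1, no padding)."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    patches = [
        [image[r + dr][c + dc] for dr in range(kh) for dc in range(kw)]
        for r in range(out_h)
        for c in range(out_w)
    ]
    # Transpose so that each patch becomes one column of B.
    return [list(col) for col in zip(*patches)]


def conv_as_matmul(kernels, image):
    """Return D = A * B, where row i of D is the flattened output image of kernel i."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    # A: one flattened kernel per row (assumed layout).
    a = [[v for row in kernel for v in row] for kernel in kernels]
    b = im2col(image, kh, kw)
    return [
        [sum(x * y for x, y in zip(a_row, b_col)) for b_col in zip(*b)]
        for a_row in a
    ]
```

With two kernels, D has two rows, one flattened output image per kernel, which matches the statement that the number of output images equals the number of rows of D.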
  • an M*N matrix is a rectangular array of elements arranged in M rows and N columns.
  • The calculation rule of matrix multiplication is as follows: each element of the first row of the first matrix is multiplied by the corresponding element of the first column of the second matrix, and the products are added together to give the element in the first row and first column of the result matrix.
  • In general, the element in the Jth row and Kth column of the result matrix equals the sum of the products of each element of the Jth row of the first matrix and the corresponding element of the Kth column of the second matrix.
  • The calculation rule of matrix addition is simpler: the elements at the same position in the two matrices to be added are summed to give the element at that position in the result matrix.
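The two rules above can be written as a minimal reference sketch. The function names `mat_mul` and `mat_add` are illustrative and do not appear in the application; matrices are assumed to be lists of rows.

```python
def mat_mul(first, second):
    """Element (j, k) of the result is the sum over m of first[j][m] * second[m][k]."""
    inner = len(second)
    if any(len(row) != inner for row in first):
        raise ValueError("columns of the first matrix must equal rows of the second")
    return [
        [sum(first[j][m] * second[m][k] for m in range(inner))
         for k in range(len(second[0]))]
        for j in range(len(first))
    ]


def mat_add(x, y):
    """Add the elements at the same position of two equally sized matrices."""
    return [[p + q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]
```

For example, `mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` gives `[[19, 22], [43, 50]]`, since 1*5 + 2*7 = 19 and so on.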
  • The matrix multiplier is an important component of the GPU, which relies on various algorithms to perform matrix multiplication. At present, matrix multiplication in the GPU is performed by the SM, which requires a large amount of chip area and a large number of memory accesses, so matrix multiplication on the SM is inefficient.
  • Embodiments of the present application provide a matrix multiplier that can improve the efficiency of matrix multiplication calculations.
  • the present application provides a matrix multiplier comprising N*N computing units, the N*N computing units form a matrix of N*N, and N is a positive integer greater than or equal to 2.
  • The matrix multiplier further includes two repository sets, each of which includes N repositories. The first repository set is used to store a first multiplication matrix in the input matrix, and the second repository set is used to store a second multiplication matrix in the input matrix. The N repositories in the first repository set are connected to the N*N matrix by row connection: the Mth repository in the first repository set is connected to each computing unit of the Mth row in the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by column connection: the Mth repository in the second repository set is connected to each computing unit of the Mth column in the N*N matrix, where M is a variable with 1 ≤ M ≤ N.
  • In each clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set connected to it, each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set connected to it, and each computing unit in the N*N matrix performs a multiplication on the received first input data and second input data. After the Nth clock cycle ends, the matrix multiplier has completed the multiplication of the first multiplication matrix and the second multiplication matrix.
  • When performing matrix multiplication, the matrix multiplier exploits the fact that different repository sets can be accessed simultaneously: each time, row elements of the matrix serving as the multiplicand and column elements of the matrix serving as the multiplier are loaded into the corresponding computing units, and the calculations are performed at the same time. This reduces the number of steps and the number of memory accesses needed to complete the matrix multiplication, and thereby improves the efficiency of matrix multiplication.
  • In a possible implementation, in each clock cycle, all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data.
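The broadcast scheme can be simulated in software as a rough sketch. It assumes, in line with the later description of the embodiment, that the Jth repository of the first set holds row J of the first matrix and the Kth repository of the second set holds column K of the second matrix. The name `broadcast_matmul` and the list-based model are illustrative only and say nothing about the hardware implementation.

```python
def broadcast_matmul(a, b):
    """Simulate N clock cycles of an N x N array of multiply-accumulate units."""
    n = len(a)
    # First repository set: bank j holds row j of A (row connection).
    first_set = [list(row) for row in a]
    # Second repository set: bank k holds column k of B (column connection).
    second_set = [[b[m][k] for m in range(n)] for k in range(n)]
    acc = [[0] * n for _ in range(n)]  # internal register of each computing unit
    for cycle in range(n):
        # Each bank broadcasts one value; a whole row (or column) of units sees it.
        row_data = [first_set[j][cycle] for j in range(n)]
        col_data = [second_set[k][cycle] for k in range(n)]
        for j in range(n):
            for k in range(n):
                acc[j][k] += row_data[j] * col_data[k]
    return acc
```

In each simulated cycle only 2N values leave the repositories, while N*N products are formed, which is the source of the reduced memory traffic claimed above.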
  • The matrix multiplier further includes a third repository set, which is used to store a result matrix. The N repositories in the third repository set are connected to the N*N matrix by column connection, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix.
  • The matrix multiplier further includes a fourth repository set, which is used to store an addition matrix in the input matrix. The N repositories in the fourth repository set are connected to the N*N matrix by row connection, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix.
  • In the first clock cycle, each computing unit of the first column in the N*N matrix receives a first group of data input by the repository in the fourth repository set connected to it, the first group of data being the first column of the addition matrix. In the second clock cycle, each computing unit of the second column in the N*N matrix receives a second group of data input by the repository in the fourth repository set connected to it, the second group of data being the second column of the addition matrix. By analogy, in the Nth clock cycle, each computing unit of the Nth column in the N*N matrix receives the Nth group of data input by the repository in the fourth repository set connected to it. In the (N+1)th clock cycle, each computing unit in the N*N matrix adds the received data of the addition matrix to the multiplication result of the first multiplication matrix and the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
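The timing of the addition matrix can be sketched as follows. This is a software model under the assumption that the fourth repository set is row-connected, so that in cycle m the Jth repository delivers the element in row J, column m of the addition matrix. The function name `matmul_add_cycles` is hypothetical.

```python
def matmul_add_cycles(a, b, c):
    """Model N multiply cycles plus one final addition cycle: returns A*B + C."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]        # running multiply-add result per unit
    addend = [[0] * n for _ in range(n)]     # element of C held by each unit
    for cycle in range(n):
        for j in range(n):
            # Fourth repository set: bank j feeds the unit in column `cycle` of row j.
            addend[j][cycle] = c[j][cycle]
            for k in range(n):
                acc[j][k] += a[j][cycle] * b[cycle][k]
    # Cycle N+1: every unit adds its element of C to its accumulated product.
    return [[acc[j][k] + addend[j][k] for k in range(n)] for j in range(n)]
```

Because column m of C arrives in cycle m, the whole addition matrix is in place after N cycles, and the final addition costs one extra cycle.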
  • matrix multipliers can also be used for matrix
  • The matrix multiplier further includes a scheduler, configured to obtain a first multiplication matrix and a second multiplication matrix in the form of N*N matrices, and to store the first multiplication matrix and the second multiplication matrix in the first repository set and the second repository set, respectively.
  • the present application provides a graphics processor comprising the matrix multiplier as described in the first aspect.
  • the application provides a system on a chip, the system on a chip comprising the matrix multiplier as described in the first aspect.
  • The present application provides a calculation method performed by a matrix multiplier. The matrix multiplier comprises N*N computing units that form an N*N matrix, where N is a positive integer greater than or equal to 2, and two repository sets, each of which includes N repositories. The first repository set is used to store a first multiplication matrix in the input matrix, and the second repository set is used to store a second multiplication matrix in the input matrix. The N repositories in the first repository set are connected to the N*N matrix by row connection, with the Mth repository in the first repository set connected to each computing unit of the Mth row in the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by column connection, with the Mth repository in the second repository set connected to each computing unit of the Mth column in the N*N matrix, where M is a variable with 1 ≤ M ≤ N.
  • The calculation method includes the following. In the first clock cycle, each computing unit of each row in the N*N matrix receives first input data broadcast by the repository in the first repository set connected to it, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set connected to it. Each computing unit in the N*N matrix multiplies the first input data by the second input data to obtain a first multiplication result, and adds the first multiplication result to the initial value in its internal register to obtain a first multiply-add result.
  • In the second clock cycle, each computing unit of each row in the N*N matrix again receives first input data broadcast by the repository in the first repository set connected to it, and each computing unit of each column in the N*N matrix receives second input data broadcast by the repository in the second repository set connected to it. Each computing unit in the N*N matrix multiplies the first input data by the second input data to obtain a second multiplication result, adds the second multiplication result to the first multiply-add result to obtain a second multiply-add result, and saves the second multiply-add result in its internal register. The calculation continues by analogy in subsequent clock cycles until, after the Nth clock cycle, the matrix multiplier has completed the multiplication of the first multiplication matrix and the second multiplication matrix.
  • In a possible implementation, in each clock cycle, all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data.
  • The matrix multiplier further includes a third repository set, which is used to store a result matrix. The N repositories in the third repository set are connected to the computing units in the N*N matrix by column connection, and the Mth repository in the third repository set is connected to each computing unit of the Mth column in the N*N matrix.
  • the calculation method further includes: each of the calculation units in the N*N matrix outputs the calculated Nth multiplication and addition calculation result to a repository in the third repository set connected to itself.
  • The matrix multiplier further includes a fourth repository set, which is used to store an addition matrix in the input matrix. The N repositories in the fourth repository set are connected to the N*N matrix by row connection, and the Mth repository in the fourth repository set is connected to each computing unit of the Mth row in the N*N matrix.
  • The calculation method further includes: in the first clock cycle, each computing unit of the first column in the N*N matrix receives a first group of data input by the repository in the fourth repository set connected to it, the first group of data being the first column of the addition matrix. In the second clock cycle, each computing unit of the second column in the N*N matrix receives a second group of data input by the repository in the fourth repository set connected to it, the second group of data being the second column of the addition matrix. By analogy, in the Nth clock cycle, each computing unit of the Nth column in the N*N matrix receives the Nth group of data input by the repository in the fourth repository set connected to it. In the (N+1)th clock cycle, each computing unit in the N*N matrix adds the received data of the addition matrix to the multiplication result of the first multiplication matrix and the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
  • FIG. 1 is a schematic diagram of the structure of a repository set in the prior art.
  • FIG. 2 is a schematic diagram showing the structure of a matrix multiplier in the prior art.
  • FIG. 3 is a schematic diagram of the structure of a computing unit in the prior art.
  • FIG. 4 is a schematic diagram showing the structure of a matrix multiplier provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a structure of a computing unit provided by an embodiment of the present application.
  • FIG. 6 is a schematic flow chart of an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an initial state of an embodiment of the present application.
  • FIG. 8 is a schematic diagram showing the state of the first clock cycle of the embodiment of the present application.
  • FIG. 9 is a schematic diagram of a state of a second clock cycle of an embodiment of the present application.
  • FIG. 10 is a schematic diagram showing the state of the third clock cycle of the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a fourth clock cycle state of an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a fifth clock cycle state of an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a graphics processor provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a system on chip provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a matrix multiplier block provided by an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of the structure of a repository set.
  • A repository set is composed of a plurality of column storage blocks, and each column storage block is a repository; each storage block is 32 bits or 64 bits in size.
  • By default, the repository set is row-contiguous: when values are assigned to the repository set, consecutive elements are stored contiguously along rows.
  • The load/store unit (LD/ST) loads data from the video memory into the repository, and the SP reads data from the repository when executing a specific calculation instruction.
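One common way to read this layout, offered here only as an assumption, is that consecutive elements go to consecutive repositories and wrap to the next row once every repository has received one element. The helper `bank_location` is hypothetical and not part of the application.

```python
def bank_location(index, num_banks):
    """Map a linear element index to (row, bank) under an assumed row-contiguous layout."""
    if index < 0 or num_banks <= 0:
        raise ValueError("index must be non-negative and num_banks positive")
    # divmod gives the row within the banks and the bank (column) number.
    row, bank = divmod(index, num_banks)
    return row, bank
```

With four repositories, elements 0 to 3 fill row 0 of repositories 0 to 3, and element 5 lands in row 1 of repository 1.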
  • each SP may need to access data in any set of repositories.
  • the repository and the SP are connected to each other through a fully connected network to form a matrix multiplier.
  • the SP, the DP, and the repository in FIG. 2 are all connected to the fully connected network. In this way, mutual access between the SP and all the repositories is realized.
  • the SP mainly includes a calculation unit for performing the basic steps of matrix multiplication.
  • FIG. 3 shows the structure of a typical computing unit. As shown in FIG. 3, the computing unit mainly includes four registers (register 301, register 302, register 303, and register 304), a multiplication unit 305, and an addition unit 306.
  • The multiplicand and the multiplier are placed into register 301 and register 302. The multiplication unit 305 multiplies the two numbers, and the addition unit 306 adds the product to the number placed in register 303 (if no addition is required, 0 can be placed in register 303). The result of the addition is stored in register 304, which completes one multiply-add calculation.
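The data path of this computing unit amounts to a single multiply-add. A minimal sketch follows, with variable names chosen to mirror the reference numerals of FIG. 3; the function `multiply_add` itself is illustrative.

```python
def multiply_add(reg301, reg302, reg303=0):
    """One multiply-add step: (reg301 * reg302) + reg303, returned as register 304."""
    product = reg301 * reg302   # multiplication unit 305
    reg304 = product + reg303   # addition unit 306; reg303 is 0 when no addition is needed
    return reg304
```

Calling `multiply_add(2, 3)` models a plain multiplication, and `multiply_add(2, 3, 4)` models a multiplication followed by an addition.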
  • The element in the Jth row and Kth column of the result matrix D equals the sum of the products of the elements at corresponding positions in the Jth row of the first matrix and the Kth column of the second matrix.
  • For example, if matrix A and matrix B are both matrices of 4 rows and 4 columns, the element in the first row and first column of matrix D is obtained by multiplying each element of the first row of matrix A by the corresponding element of the first column of matrix B and adding the 4 products.
  • The matrix is divided according to the size supported by the multiplier in the SM to form sub-matrices that match that size, and the load/store unit then loads the sub-matrices from the video memory into the repository.
  • If the matrix to be divided is smaller than the size supported by the multiplier, the corresponding positions are filled with 0 to form a sub-matrix of the required size.
  • One datum at a time is read from the repository holding the sub-matrix to be multiplied into the corresponding SP. Because all SPs and repositories are connected to each other through the fully connected network, the SP can read the required elements of matrix A and matrix B into the computing unit according to the calculation rule of matrix multiplication.
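Zero filling can be sketched as below. The application does not say where the zeros go, so padding on the right and at the bottom is an assumption, and `pad_to_block` is a hypothetical helper.

```python
def pad_to_block(matrix, n):
    """Pad a matrix with zeros (right and bottom, by assumption) to an n x n block."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if rows > n or cols > n:
        raise ValueError("matrix is larger than the block size and must be divided first")
    return [
        [matrix[r][c] if r < rows and c < cols else 0 for c in range(n)]
        for r in range(n)
    ]
```

Zero padding does not change the product of the original entries, because every added term contributes 0 to each sum of products.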
  • matrix A and matrix B are sub-matrices with a size of 4 rows and 4 columns (which can be expressed as 4*4).
  • To calculate d00, a00*b00+a01*b10+a02*b20+a03*b30 is required. Therefore, a00 and b00 are first taken out of the repositories holding matrix A and matrix B, respectively, and placed in register 301 and register 302 of the computing unit in the corresponding SP.
  • The SP uses the computing unit to perform the multiply-add operation. Note that each time a multiply-add operation is performed, the result is stored in a storage space prepared in advance. For the next multiply-add operation, the previous result is taken out of the storage space and placed in register 303 for the addition. For example, after a00*b00 has been calculated, the result is moved from register 304 into the storage space prepared in advance. When a01*b10 is calculated, the result of a00*b00 is put from the storage space into register 303, the addition unit 306 adds the product a01*b10 to the result of a00*b00, and the sum is first placed in register 304.
  • Finally, the SP uses the multiplier to complete the calculation of a00*b00+a01*b10+a02*b20+a03*b30 and writes the result to the corresponding position in the repository through the fully connected network as the value of d00.
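The store-and-reload pattern of this prior-art flow can be modelled with a toy access counter. The counting rule (two operand reads per product, one write-back per step, one reload per step after the first) is an assumption made for illustration, not a figure from the application, and `prior_art_dot` is a hypothetical name.

```python
def prior_art_dot(a_row, b_col):
    """Compute one result element the prior-art way and count storage accesses."""
    storage = None   # storage space prepared in advance for the partial result
    accesses = 0     # toy count of repository/storage accesses
    for a, b in zip(a_row, b_col):
        reg301, reg302 = a, b
        accesses += 2                       # fetch one element of A and one of B
        if storage is None:
            reg303 = 0                      # nothing to accumulate yet
        else:
            reg303 = storage
            accesses += 1                   # reload the previous partial result
        reg304 = reg301 * reg302 + reg303   # multiplication unit 305, addition unit 306
        storage = reg304
        accesses += 1                       # write the partial result back
    return storage, accesses
```

Under this counting rule a single element of a 4*4 product already costs 15 accesses, which gives a feel for why the text describes the approach as memory-intensive.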
  • embodiments of the present application provide a new matrix multiplier for use in a GPU.
  • In the embodiments of the present application, when the product of matrix A and matrix B of size N*N is calculated, elements are no longer taken from each matrix one at a time. Instead, the fact that different repository sets can be accessed simultaneously is exploited: each time, a row of elements of matrix A and a column of elements of matrix B are loaded into the corresponding computing units, and the calculations are performed at the same time. This reduces the number of steps needed to multiply matrix A by matrix B and the number of memory accesses required, and thereby improves the efficiency of matrix multiplication on the SM.
  • matrix multiplier 400 includes a scheduler, a repository, and a computing unit.
  • the scheduler is configured to obtain a matrix of a corresponding specification for calculation, and save the matrix in a corresponding repository set.
  • The scheduler specifically includes a matrix multiplication scheduling unit 401, an instruction distribution unit 402, and an instruction distribution unit 403 (two instruction distribution units are shown in the figure; there may actually be one or more). The matrix multiplication scheduling unit 401 serves as the instruction scheduling unit of the matrix multiplier 400 and is mainly responsible for instruction ordering and scheduling. Through its instructions, processes such as input, loading, calculation, storage, and output can be combined.
  • The instruction distribution units are connected to the repositories and the computing units through control connections (not shown), and send the scheduling instructions determined by the matrix multiplication scheduling unit 401 to the repositories and the computing units, so that the repositories and the computing units process the data according to the instructions.
  • the number of instruction distribution units included in the matrix multiplier 400 may be two, so that instruction dual transmission can be realized.
  • In the matrix multiplier of the present application, the computing units and the repositories are no longer connected through a fully connected network. As shown in FIG. 4, each matrix multiplier includes N*N computing units that form an N*N matrix (illustrated as a 4*4 matrix, in which the computing units are named, from left to right and top to bottom, computing unit 430 to computing unit 445). Each matrix multiplier also includes at least two repository sets, each of which includes N repositories (illustrated as four repository sets with a total of 16 repositories, named repository 410 to repository 425). The first repository set is used to store a first multiplication matrix in the input matrix, and the second repository set is used to store a second multiplication matrix in the input matrix.
  • the matrix multiplier may further include a third repository set and a fourth repository set, wherein the third repository set is used to store the result matrix, and the fourth repository set is used to store the addition matrix in the input matrix.
  • the matrix multiplier of the present application can perform N*N multiplication calculations in one calculation cycle (in the computer field, also called clock cycle or beat), thereby improving computational efficiency.
  • The first repository set is connected to the computing units by row connection: the N repositories in the first repository set are directly connected to the computing units of the respective rows of the N*N computing unit matrix. That is, the first repository of the first repository set is connected to each computing unit of the first row of the N*N computing unit matrix, the second repository of the first repository set is connected to each computing unit of the second row, and the Nth repository of the first repository set is connected to each computing unit of the Nth row. The second repository set is connected to the computing units by column connection: the N repositories in the second repository set are directly connected to the computing units of the respective columns of the N*N computing unit matrix, with the first repository of the second repository set connected to each computing unit of the first column, and so on.
  • the first repository set includes a repository 410, a repository 411, a repository 412, and a repository 413.
  • The first repository set and the computing unit matrix keep a row connection: repository 410 is connected to each computing unit of the first row in the matrix, repository 411 to each computing unit of the second row, repository 412 to each computing unit of the third row, and repository 413 to each computing unit of the fourth row.
  • The second repository set includes repository 414, repository 415, repository 416, and repository 417. The second repository set and the computing unit matrix keep a column connection: repository 414 is connected to each computing unit of the first column in the matrix, repository 415 to each computing unit of the second column, repository 416 to each computing unit of the third column, and repository 417 to each computing unit of the fourth column.
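For the 4*4 example, the wiring can be summarised as a lookup from a computing unit to the two repositories that feed it. The sketch relies on the naming given in the text (computing units 430 to 445 numbered left to right and top to bottom, repositories 410 to 413 on rows, repositories 414 to 417 on columns); the function `operand_repositories` is illustrative.

```python
def operand_repositories(unit_id, n=4, first_unit=430, first_row_repo=410, first_col_repo=414):
    """Return (row repository, column repository) feeding a given computing unit."""
    offset = unit_id - first_unit
    if not 0 <= offset < n * n:
        raise ValueError("unit_id is outside the computing unit matrix")
    row, col = divmod(offset, n)  # units are numbered left to right, top to bottom
    return first_row_repo + row, first_col_repo + col
```

Each computing unit therefore has exactly two operand connections, compared with the prior-art arrangement in which every SP reaches every repository through the fully connected network.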
  • The first repository set may broadcast N data to the N*N computing units in the first clock cycle, and the second repository set may also broadcast N data to the N*N computing units in the first clock cycle.
  • each calculation unit can perform a multiplication calculation in the first clock cycle, and after N clock cycles, all multiplication calculations can be completed.
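The cycle count follows from simple counting, shown here as a short worked calculation for the illustrated case N = 4. Each of the N*N result elements needs N products, and the array performs N*N products per clock cycle.

```latex
\[
\underbrace{N^{2}}_{\text{elements of } D} \times \underbrace{N}_{\text{products per element}} = N^{3}
\quad\Longrightarrow\quad
\frac{N^{3}\ \text{multiplications}}{N^{2}\ \text{multiplications per cycle}} = N\ \text{cycles},
\qquad
N = 4:\ \frac{64}{16} = 4 .
\]
```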
  • The fourth repository set in the matrix multiplier is used to load the addition matrix C in the input matrix. The fourth repository set may keep a row connection with the computing unit matrix, or may keep a column connection with it. If the fourth repository set is row-connected to the computing unit matrix, each repository in the fourth repository set is connected to each computing unit of one row of the computing unit matrix.
  • For example, repository 418 is connected to each computing unit of the first row in the computing unit matrix, repository 419 to each computing unit of the second row, repository 420 to each computing unit of the third row, and repository 421 to each computing unit of the fourth row.
  • The fourth repository set can load data to N computing units in every clock cycle. After N clock cycles, the whole addition matrix C stored in the fourth repository set has been input to the corresponding computing units, and the corresponding addition can then be performed in the (N+1)th clock cycle. Further, the third repository set in the matrix multiplier is used to store the result matrix D. The third repository set may keep a row connection with the computing unit matrix, or may keep a column connection with it. If the third repository set is column-connected to the computing unit matrix, each repository in the third repository set is connected to each computing unit of one column of the computing unit matrix.
  • For example, repository 425 is connected to each computing unit of the first column in the computing unit matrix, repository 424 to each computing unit of the second column, repository 423 to each computing unit of the third column, and repository 422 to each computing unit of the fourth column.
  • FIG. 5 shows a computing unit provided by an embodiment of the present application, adapted to the matrix multiplication operations supported by the matrix multiplier 400.
  • The computing unit 500 may be any one of the computing units in the matrix multiplier 400 described above. It includes five registers (register 501 to register 505), a multiplication unit 506, an addition unit 507, and an addition unit 508.
  • In the second clock cycle, a01 and b10 can be placed in register 501 and register 502, respectively, and the multiplication unit 506 calculates a01*b10. In the third clock cycle, a02 and b20 can be placed in register 501 and register 502, respectively, and the multiplication unit 506 calculates a02*b20. In the fourth clock cycle, a03 and b30 can be placed in register 501 and register 502, respectively, and the multiplication unit 506 calculates a03*b30.
  • the value in register 503 is a00*b00+a01*b10+a02*b20+a03*b30.
  • the value in the register 505 is the value in the register 503, and the value is taken out and stored in the storage space prepared in advance.
  • The corresponding element of the addition matrix C is stored in register 504. The value in register 503 is added to the value in register 504, and the sum is used as the value of element d00 in the first row and first column of the result matrix.
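The behaviour of computing unit 500 can be sketched as a small class. Which of the two addition units performs the running accumulation and which performs the final addition is not stated in this passage, so assigning unit 507 to the accumulation and unit 508 to the final addition is an assumption; the class and method names are hypothetical.

```python
class ComputeUnit500:
    """Toy model of the computing unit of FIG. 5 (registers 501 to 505)."""

    def __init__(self):
        self.reg501 = 0  # element from the first multiplication matrix
        self.reg502 = 0  # element from the second multiplication matrix
        self.reg503 = 0  # running multiply-add result, initial value 0
        self.reg504 = 0  # element of the addition matrix C
        self.reg505 = 0  # output value

    def clock(self, a, b):
        """One clock cycle: multiply the broadcast operands and accumulate."""
        self.reg501, self.reg502 = a, b
        product = self.reg501 * self.reg502   # multiplication unit 506
        self.reg503 = self.reg503 + product   # accumulation (assumed: addition unit 507)

    def finish(self, c=0):
        """Final step: add the element of C, or pass register 503 through if c is 0."""
        self.reg504 = c
        self.reg505 = self.reg503 + self.reg504   # final add (assumed: addition unit 508)
        return self.reg505
```

Unlike the prior-art unit, the partial sum stays in register 503 between cycles, so no intermediate result has to be written to and reloaded from the repository.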
  • FIG. 6 is a schematic flow chart of an embodiment of the present application.
  • A, B, C, and D are all matrices of 4 rows and 4 columns.
  • The elements of the matrices are denoted aij, bij, cij, and dij, where i is the row number of the element in its matrix minus 1 and j is the column number of the element minus 1; i and j are integers greater than or equal to 0 and less than or equal to 3.
  • The elements of matrix A are placed in repository 410 to repository 413 according to their row number, the elements of matrix B are placed in repository 414 to repository 417 according to their column number, and the elements of matrix C are placed in repository 418 to repository 421 according to their row number.
  • According to the received instruction, each of repository 410 to repository 417 broadcasts one datum at a time, in order, to all the computing units connected to it. Each computing unit receives an element from matrix A and an element from matrix B, places them in register 501 and register 502 of the computing unit, respectively, and calculates their product as described above. The product is then added to the previous multiply-add result to obtain the current multiply-add result. For the first calculation, the previous result, that is, the initial value of the register, is 0.
  • repository 410 to repository 413 broadcast the elements of the Mth column of matrix A to the set of computing units, where the computing units in the Jth row receive the element in the Jth row and Mth column of matrix A; repository 414 to repository 417 broadcast the elements of the Mth row of matrix B to the set of computing units, where the computing units in the Kth column receive the element in the Mth row and Kth column of matrix B. J, K, and M are positive integers less than or equal to 4.
  • the computing unit multiplies the received elements from matrix A and matrix B to obtain the Mth multiplication result, and adds the Mth multiplication result to the (M-1)th multiply-add result to obtain the Mth multiply-add result, where the 0th result, that is, the initial value of the internal register, is set to 0.
  • the repository 410 places a00 into computing unit 430 to computing unit 433 in the first row, and the repository 414 places b00 into the computing units in the first column.
  • the other repositories among repository 410 to repository 417 perform corresponding operations.
  • the second time period as shown in FIG.
  • repository 410 and repository 414 place a01 and b10, respectively, into the corresponding computing units; the computing unit 430 calculates the product of a01 and b10, adds it to the a00*b00 stored in register 503, and places the sum in register 503, so that the value in register 503 at this time is a00*b00+a01*b10.
  • operations in the subsequent time periods proceed by analogy; see the state of the third clock cycle shown in FIG. 10 and the state of the fourth clock cycle shown in FIG. 11.
  • according to the received instruction, each of repository 418 to repository 421 places one of its stored elements of matrix C, in a predetermined order, into register 504 of the corresponding computing unit.
  • each of repository 418 to repository 421 sends its element in the Mth column of matrix C to the computing units of the Mth column, where the computing unit in the Lth row and Mth column receives the element in the Lth row and Mth column of matrix C, and L is a positive integer less than or equal to 4.
  • repository 418, repository 419, repository 420, and repository 421 place c00, c10, c20, and c30 into register 504 of computing unit 430, computing unit 434, computing unit 438, and computing unit 442, respectively.
  • the storage library 418, the storage library 419, the storage library 420, and the storage library 421 put c01, c11, c21, and c31 into the computing unit 431 and the computing unit 435, respectively.
  • Step S602 is repeated.
  • the repository 410 to the repository 417 have placed the elements of the matrix A and matrix B they store into the corresponding computational units and complete the corresponding multiply-and-accumulate calculations.
  • the calculation unit 430 completes a00*b00+a01*b10+a02*b20+a03*b30 and places the result in its own register 503.
  • the repository 418 to the repository 421 have placed the elements of the matrix C into the registers 504 of the corresponding computational unit.
  • the values in the register 503 and the register 504 are added by the adder 508 in each calculation unit, and the obtained result is used as an element of the result matrix D, and is put into each calculation.
  • S605 Move the elements of the result matrix D stored in the repository 422 to the repository 425 to the specified storage space.
  • the algorithm proposed in the embodiment of the present application can reduce the steps of completing the multiplication operation of the matrix A and the matrix B, thereby increasing the efficiency of the GPU for matrix multiplication.
  • when calculating A*B+C, because the elements of matrix A and matrix B are broadcast to one row or one column of computing units, only 3*N*N read operations and N*N write operations are needed, which greatly reduces the number of read and write operations compared with the prior art.
  • the size of the chip space can be reduced.
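The access counts stated above can be checked with a few lines of arithmetic. The function `broadcast_access_counts` simply restates the 3*N*N reads and N*N writes from the text. The function `per_unit_fetch_reads` is a hypothetical baseline that does not come from the source: it assumes every computing unit fetches all of its own operands with no sharing, and is included only to give a sense of scale.

```python
def broadcast_access_counts(n):
    """Reads and writes for D = A*B + C on an n x n unit array, per the text."""
    reads = 3 * n * n   # each element of A, B and C is read from its bank once
    writes = n * n      # each element of D is written once
    return reads, writes


def per_unit_fetch_reads(n):
    """Hypothetical baseline (an assumption, not from the source): each unit
    fetches its own n operands of A, n operands of B and one element of C."""
    return n * n * (2 * n + 1)


reads, writes = broadcast_access_counts(4)
baseline = per_unit_fetch_reads(4)
```

For the 4*4 example used in the embodiment, this gives 48 reads and 16 writes, against 144 reads for the assumed unshared baseline.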
  • the present application designs two sets of instructions for external calling and internal control of the matrix multiplier, respectively.
  • the present application designs three instructions.
  • pA is a pointer to matrix A outside the matrix multiplier, mA is a pointer to matrix A inside the matrix multiplier, m is the number of rows of matrix A (or the number of column elements), and n is the number of columns of matrix A (or the number of row elements).
  • the effect of the instruction is to load A from outside the matrix multiplier into the matrix multiplier.
  • the second is used to perform multiplication and addition calculation of the matrix.
  • mD matrix_mul_mmp(mA, mB, mC, m, n, k), where mA, mB, mC, and mD are pointers to matrices A, B, C, and D inside the matrix multiplier, m is the number of column elements of matrices A, C, and D, n is the number of row elements of matrix A and also the number of column elements of matrix B, and k is the number of row elements of matrices B, C, and D.
  • the third is used to copy the result matrix into the memory outside the matrix multiplier.
  • store_matrix_mmp(pD,mD,m,n), where pD is a pointer to the matrix outside the matrix multiplier, mD is a pointer to the matrix inside the matrix multiplier, m is the number of column elements of the matrix, and n is the number of row elements of the matrix.
  • the effect of this instruction is to copy the matrix D to the space pointed by the pointer outside the matrix multiplier, where the size of the matrix D is m*n.
  • mA load_matrix_mmp(pA,4,4)
  • mB load_matrix_mmp(pB,4,4)
  • mC load_matrix_mmp(pC,4,4)
  • mD matrix_mul_mmp(mA,mB,mC,4,4,4)
  • the present application designs two instructions.
  • the first is to load the elements of the matrix into a specific register of the calculation unit and multiply and accumulate the loaded elements.
  • Load_line_mmp(mA, mB, mC, n) where mA, mB, mC are pointers to matrix A, matrix B, matrix C, respectively, and n represents the number of the loaded row or column.
  • the effect of the instruction is to load the nth column of matrix A and the nth row of matrix B into specific registers of the computing units by broadcast, load the nth column of matrix C into specific registers of the computing units, and multiply and accumulate the loaded matrix elements according to preset rules.
  • the second is used to perform matrix addition calculations and store the calculation results line by line to the specified storage space.
  • matrix_add_mmp(mD) mD is a pointer to matrix D.
  • the effect of the instruction is that the multiplication and accumulation result of the matrix A and the matrix B are added to the matrix C, and the calculation result is stored as a result matrix D row by row to the storage space pointed to by the mD.
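The two internal instructions can be emulated in software to see how they might combine into one multiply-add. In this sketch the method names mirror the instruction names from the text, but everything else is an assumption: matrices are plain nested lists rather than pointers, line numbers are 0-based, only square n*n inputs are handled, and `run_matrix_mul` is a hypothetical driver standing in for the external multiply-add instruction.

```python
class MatrixMultiplierModel:
    """Toy n x n array of units driven by the two internal instructions."""

    def __init__(self, n):
        self.n = n
        self.acc = [[0] * n for _ in range(n)]    # register 503 of each unit
        self.c_reg = [[0] * n for _ in range(n)]  # register 504 of each unit

    def load_line_mmp(self, mA, mB, mC, line):
        """Broadcast column `line` of A along rows and row `line` of B along
        columns, load column `line` of C, then multiply-accumulate."""
        for j in range(self.n):
            for k in range(self.n):
                self.acc[j][k] += mA[j][line] * mB[line][k]
            self.c_reg[j][line] = mC[j][line]

    def matrix_add_mmp(self, mD):
        """Add the accumulated product to C and store it row by row in mD."""
        for j in range(self.n):
            mD[j] = [self.acc[j][k] + self.c_reg[j][k] for k in range(self.n)]


def run_matrix_mul(mA, mB, mC, n):
    """Hypothetical driver: n line loads followed by one matrix addition."""
    model = MatrixMultiplierModel(n)
    mD = [[0] * n for _ in range(n)]
    for line in range(n):
        model.load_line_mmp(mA, mB, mC, line)
    model.matrix_add_mmp(mD)
    return mD


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
D = run_matrix_mul(A, B, C, 2)
```

Here `D` equals A*B+C for the made-up 2*2 inputs, obtained with two line loads and one addition step.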
  • the matrix multiplier provided by the embodiment of the present application can be embedded in the GPU to efficiently implement matrix multiplication operations.
  • the GPU includes a memory controller (abbreviation: MMC), a Peripheral Component Interconnect Express (PCI-E) interface, a thread engine, an L2 cache, and several SMs (the L2 cache connects the SMs and the memory controller; the connection is not shown).
  • SM is the core computing component in the GPU, providing computing power for the entire GPU. It should be noted that the number of SMs included in the GPU is not fixed, but can be adjusted as needed. The number of SMs shown in FIG.
  • the matrix multiplier provided in the present application is located in the SM, which can reduce the occupied chip space and reduce the number of times of reading and writing data during matrix multiplication, thereby improving the matrix multiplication performance and energy efficiency ratio of the GPU.
  • the matrix multiplier provided by the embodiment of the present application can also be combined with a CPU core to construct a system on a chip (SoC), which can quickly process the matrix multiply-add operations in an application.
  • Figure 14 is a system on a chip including a matrix multiplier.
  • the system on chip includes a processor, a digital signal processing unit (DSP), a codec (CODEC), and a matrix multiplier block (Matrix Multiplication Block, MMB); these components are connected through a level-2 cache.
  • the processor can be an Advanced RISC Machine (ARM) processor.
  • the MMP is composed of a number of matrix multipliers that are connected by a level one cache (English: L1 Cache). Based on the system on chip, it is only necessary to extract the matrix multiplication and addition operations in the application, and the calculation is performed by using the matrix multiplier provided by the embodiment of the present application, thereby improving the efficiency of application operation.

Abstract

A matrix multiplier (400). The fully connected network included in an existing matrix multiplier occupies a large chip space, and a great amount of memory access is required during the calculation of the matrix multiplier, thus causing low efficiency of matrix multiplication calculation performed by a streaming multiprocessor. On the basis of the purpose of improving the efficiency of matrix multiplication calculation performed by a graphics processing unit, during matrix multiplication, according to the characteristic that different groups of repositories can be accessed simultaneously, the matrix multiplier (400) each time loads a row of elements of a matrix as a multiplicand and a column of elements of the matrix as a multiplier to a corresponding calculation unit, and performs calculation. By use of the matrix multiplier (400), the steps required by implementation of the matrix multiplication calculation can be decreased, and the number of times of memory access required is reduced, thereby improving the efficiency of matrix multiplication calculation performed by the graphics processing unit.

Description

Method and device for calculating matrix multiplication
Technical field
The present invention relates to the field of graphics technologies, and in particular, to the technical field of matrix multiplication calculation.
Background
A graphics processing unit (GPU) is a microprocessor used for image computing on devices such as a host computer. In the GPU, the streaming multiprocessor (SM) is the basic computing unit; it uses a single-instruction, multiple-thread execution mode and can ensure that multiple threads execute simultaneously. Broadly, an SM includes an instruction buffer, a warp scheduler, a dispatch unit, streaming processors (SP), double-precision floating-point units (DP), and other units.
When a GPU performs image processing, matrix multiplication is one of the most important operations in its data calculation and has many applications. For example, in deep learning, convolutional neural networks give better results in image and speech recognition and perform very well on large-scale image processing. In some concrete implementations of convolutional neural networks, the convolution calculation can be converted into a matrix multiplication: the convolution kernel matrix and the input image matrix are transformed into two large matrices A and B, and A and B are then multiplied to obtain a result matrix D. Each row of the result matrix D represents one output image, and the number of output images equals the number of rows of D.
The matrix is an important basic concept in mathematics. An M*N matrix is a rectangular array of elements arranged in M rows and N columns. Matrix multiplication is possible only when the number of columns of the first matrix (the multiplicand) equals the number of rows of the second matrix (the multiplier). The rule is as follows: each element of the first row of the first matrix is multiplied by the element at the corresponding position in the first column of the second matrix, and the products are summed to give the element in the first row and first column of the result matrix. By analogy, the element in the Jth row and Kth column of the result matrix equals the sum of the products of the corresponding elements of the Jth row of the first matrix and the Kth column of the second matrix. The rule for matrix addition is simpler: the elements at the same position in the two matrices are added to give the element at that position in the result matrix.
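The two rules just described can be written as a short reference implementation. This is a plain software sketch for checking results, not the hardware method of the application; the function names and the sample values are assumptions for illustration.

```python
def mat_mul(a, b):
    """Reference product: d[j][k] = sum over m of a[j][m] * b[m][k]."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    return [[sum(a[j][m] * b[m][k] for m in range(len(b)))
             for k in range(len(b[0]))]
            for j in range(len(a))]


def mat_add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


# A 2*3 matrix times a 3*2 matrix gives a 2*2 result.
product = mat_mul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
total = mat_add(product, [[1, 1], [1, 1]])
```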
Correspondingly, for the SM in a GPU, the matrix multiplier is an important component: it is what the GPU relies on to perform matrix multiplication with various algorithms. At present, matrix multiplication performed by the SM in a GPU occupies a large amount of chip space and requires a large number of memory accesses, so the SM computes matrix multiplication inefficiently.
Summary of the invention
Embodiments of the present application provide a matrix multiplier that can improve the efficiency of matrix multiplication calculation.
In a first aspect, the present application provides a matrix multiplier. The matrix multiplier includes N*N computing units that form an N*N matrix, where N is a positive integer greater than or equal to 2. The matrix multiplier further includes two repository sets, each including N repositories. The first repository set is used to store the first multiplication matrix of the input matrices, and the second repository set is used to store the second multiplication matrix of the input matrices. The N repositories in the first repository set are connected to the N*N matrix by rows: the Mth repository in the first repository set is connected to every computing unit in the Mth row of the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by columns: the Mth repository in the second repository set is connected to every computing unit in the Mth column of the N*N matrix. Here M is a variable with 1 ≤ M ≤ N. In each clock cycle, every computing unit in each row of the N*N matrix receives the first input data broadcast by the repository in the first repository set connected to it, and every computing unit in each column of the N*N matrix receives the second input data broadcast by the repository in the second repository set connected to it. In each clock cycle, every computing unit in the N*N matrix performs a multiplication on the received first input data and second input data. After the Nth clock cycle ends, the matrix multiplier has completed the multiplication of the first multiplication matrix and the second multiplication matrix.
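The cycle-by-cycle behaviour described in the first aspect can be summarized as a recurrence. The notation is an assumption introduced here: a_{jm} and b_{mk} are the elements of the first and second multiplication matrices, and D_{jk}^{(m)} is the value held by the computing unit in row j and column k after clock cycle m.

```latex
D_{jk}^{(0)} = 0, \qquad
D_{jk}^{(m)} = D_{jk}^{(m-1)} + a_{jm}\, b_{mk}, \quad m = 1, \dots, N,
\qquad\Longrightarrow\qquad
D_{jk}^{(N)} = \sum_{m=1}^{N} a_{jm}\, b_{mk} = (AB)_{jk}.
```

In cycle m, the repository of row j broadcasts a_{jm} to all units in that row and the repository of column k broadcasts b_{mk} to all units in that column, so all N*N units advance by one term at once and the product is complete after N cycles.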
In the above solution, when performing matrix multiplication the matrix multiplier exploits the fact that different groups of repositories can be accessed simultaneously: each time, it loads one row of elements of the multiplicand matrix and one column of elements of the multiplier matrix into the corresponding computing units and computes on them at the same time. This reduces the steps needed to complete the matrix multiplication and the number of memory accesses required, thereby improving the efficiency with which the graphics processor performs matrix multiplication.
For the first aspect, in one possible implementation, in each clock cycle all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data. This improves the efficiency with which the graphics processor performs matrix multiplication.
For the first aspect, in another possible implementation, the matrix multiplier further includes a third repository set used to store the result matrix. The N repositories in the third repository set are connected to the N*N matrix by columns: the Mth repository in the third repository set is connected to every computing unit in the Mth column of the N*N matrix. This improves the efficiency of outputting the result matrix of the matrix multiplication.
For the first aspect, in another possible implementation, the matrix multiplier further includes a fourth repository set used to store the addition matrix of the input matrices. The N repositories in the fourth repository set are connected to the N*N matrix by rows: the Mth repository in the fourth repository set is connected to every computing unit in the Mth row of the N*N matrix. In the first clock cycle, every computing unit in the first column of the N*N matrix receives the first group of data input by the repository in the fourth repository set connected to it, the first group of data being the first column of the addition matrix. In the second clock cycle, every computing unit in the second column of the N*N matrix receives the second group of data input by the repository in the fourth repository set connected to it, the second group of data being the second column of the addition matrix. By analogy, in the Nth clock cycle, every computing unit in the Nth column of the N*N matrix receives the Nth group of data input by the repository in the fourth repository set connected to it. In the (N+1)th clock cycle, every computing unit in the N*N matrix further performs an addition on the received input data of the addition matrix and the multiplication result of the first multiplication matrix and the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix. In this way, the matrix multiplier can also be used for matrix addition.
For the first aspect, in another possible implementation, the matrix multiplier further includes a scheduler. The scheduler is used to obtain a first multiplication matrix and a second multiplication matrix in N*N form and to save them in the first repository set and the second repository set, respectively. In this way, a matrix that requires matrix multiplication can be split into multiplication matrices in the N*N form suited to the matrix multiplier, thereby improving the efficiency of the matrix multiplier.
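One way a scheduler could cut a larger matrix into N*N pieces is sketched below. The text does not say how the scheduler splits matrices or how it treats sizes that are not multiples of N, so the function name `split_into_tiles`, the dictionary layout, and the zero-padding of ragged edges are all assumptions for illustration.

```python
def split_into_tiles(matrix, n):
    """Cut a matrix into n x n tiles, zero-padding the ragged edges.

    Returns a dict mapping (tile_row, tile_col) to an n x n nested list.
    """
    rows, cols = len(matrix), len(matrix[0])
    tile_rows = -(-rows // n)   # ceiling division
    tile_cols = -(-cols // n)
    tiles = {}
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            tiles[(tr, tc)] = [
                [matrix[r][c] if r < rows and c < cols else 0
                 for c in range(tc * n, tc * n + n)]
                for r in range(tr * n, tr * n + n)
            ]
    return tiles


# A 3*3 matrix split for a multiplier with N = 2 gives four 2*2 tiles.
tiles = split_into_tiles([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2)
```

Zero-padding leaves the products of the real elements unchanged, which is why it is a common choice for block matrix multiplication; whether the application does this is not stated.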
In a second aspect, the present application provides a graphics processor that includes the matrix multiplier described in the first aspect.
In a third aspect, the present application provides a system on a chip that includes the matrix multiplier described in the first aspect.
In a fourth aspect, the present application provides a calculation method for a matrix multiplier. The matrix multiplier includes N*N computing units that form an N*N matrix, where N is a positive integer greater than or equal to 2, and two repository sets, each including N repositories. The first repository set is used to store the first multiplication matrix of the input matrices, and the second repository set is used to store the second multiplication matrix of the input matrices. The N repositories in the first repository set are connected to the N*N matrix by rows: the Mth repository in the first repository set is connected to every computing unit in the Mth row of the N*N matrix. The N repositories in the second repository set are connected to the N*N matrix by columns: the Mth repository in the second repository set is connected to every computing unit in the Mth column of the N*N matrix. Here M is a variable with 1 ≤ M ≤ N. The calculation method includes the following. In the first clock cycle, every computing unit in each row of the N*N matrix receives the first input data broadcast by the repository in the first repository set connected to it, and every computing unit in each column receives the second input data broadcast by the repository in the second repository set connected to it; every computing unit multiplies the first input data by the second input data to obtain a first multiplication result, adds the first multiplication result to the initial value in its internal register to obtain a first multiply-add result, and saves the first multiply-add result in its internal register. In the second clock cycle, every computing unit in each row again receives the first input data broadcast by the repository in the first repository set connected to it, and every computing unit in each column receives the second input data broadcast by the repository in the second repository set connected to it; every computing unit multiplies the first input data by the second input data to obtain a second multiplication result, adds the second multiplication result to the first multiply-add result to obtain a second multiply-add result, and saves the second multiply-add result in its internal register. In subsequent clock cycles the calculation continues by analogy until, after the Nth clock cycle, the matrix multiplier has completed the multiplication of the first multiplication matrix and the second multiplication matrix.
For the fourth aspect, in one possible implementation, in each clock cycle all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data.
For the fourth aspect, in another possible implementation, the matrix multiplier further includes a third repository set used to store the result matrix. The N repositories in the third repository set are connected to the computing units of the N*N matrix by columns: the Mth repository in the third repository set is connected to every computing unit in the Mth column of the N*N matrix. The calculation method further includes: every computing unit in the N*N matrix outputs its Nth multiply-add result to the repository in the third repository set connected to it.
For the fourth aspect, in another possible implementation, the matrix multiplier further includes a fourth repository set used to store the addition matrix of the input matrices. The N repositories in the fourth repository set are connected to the N*N matrix by rows: the Mth repository in the fourth repository set is connected to every computing unit in the Mth row of the N*N matrix. The calculation method further includes the following. In the first clock cycle, every computing unit in the first column of the N*N matrix receives the first group of data input by the repository in the fourth repository set connected to it, the first group of data being the first column of the addition matrix. In the second clock cycle, every computing unit in the second column of the N*N matrix receives the second group of data input by the repository in the fourth repository set connected to it, the second group of data being the second column of the addition matrix. By analogy, in the Nth clock cycle, every computing unit in the Nth column of the N*N matrix receives the Nth group of data input by the repository in the fourth repository set connected to it. In the (N+1)th clock cycle, every computing unit in the N*N matrix further performs an addition on the received input data of the addition matrix and the multiplication result of the first multiplication matrix and the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
Drawings
FIG. 1 is a schematic diagram of the structure of a repository set in the prior art.
FIG. 2 is a schematic diagram of the structure of a matrix multiplier in the prior art.
FIG. 3 is a schematic diagram of the structure of a computing unit in the prior art.
FIG. 4 is a schematic diagram of the structure of a matrix multiplier provided by an embodiment of the present application.
FIG. 5 is a schematic diagram of the structure of a computing unit provided by an embodiment of the present application.
FIG. 6 is a schematic flow chart of an embodiment of the present application.
FIG. 7 is a schematic diagram of the initial state of an embodiment of the present application.
FIG. 8 is a schematic diagram of the state in the first clock cycle of an embodiment of the present application.
FIG. 9 is a schematic diagram of the state in the second clock cycle of an embodiment of the present application.
FIG. 10 is a schematic diagram of the state in the third clock cycle of an embodiment of the present application.
FIG. 11 is a schematic diagram of the state in the fourth clock cycle of an embodiment of the present application.
FIG. 12 is a schematic diagram of the state in the fifth clock cycle of an embodiment of the present application.
FIG. 13 is a schematic structural diagram of a graphics processor provided by an embodiment of the present application.
FIG. 14 is a schematic structural diagram of a system on chip provided by an embodiment of the present application.
FIG. 15 is a schematic structural diagram of a matrix multiplier block provided by an embodiment of the present application.
具体实施方式detailed description
In a GPU, data is usually stored in banks. FIG. 1 is a schematic diagram of the structure of a bank set. As shown in FIG. 1, a bank set consists of several columns of storage blocks, each column of storage blocks being one bank, and each storage block being 32 or 64 bits in size. A bank set is row-contiguous by default: when values are assigned to a bank set, consecutive elements are stored consecutively along a row. When an instruction executes in the SM, the load/store units (LD/ST) load data from video memory into the banks, and an SP reads data from the banks when it executes a specific compute instruction. An SM therefore contains a large number of SPs and banks (usually, the number of SPs in an SM equals the number of bank groups), and each SP may need to access data in any group of banks. As shown in FIG. 2, in the prior art the banks and SPs are interconnected through a fully connected network to form a matrix multiplier. The SPs, DPs, and banks in FIG. 2 are all attached to the fully connected network, which gives every SP access to every bank.
An SP mainly consists of a computing unit, which performs the basic step of matrix multiplication. FIG. 3 shows the structure of a typical computing unit. As shown in FIG. 3, the computing unit mainly includes four registers (register 301, register 302, register 303, and register 304), a multiplication unit 305, and an addition unit 306. Register 301 and register 302 hold the multiplicand and the multiplier. The multiplication unit 305 multiplies these two numbers, the addition unit 306 adds the product to the number held in register 303 (if no addition is needed, 0 can be placed in register 303), and the sum is stored in register 304, which completes one multiply-add operation.
In the matrix multiplication A*B, the element in row J, column K of the result matrix D equals the sum of the products of corresponding elements of row J of the first matrix and column K of the second matrix. For example, if matrix A and matrix B are both matrices of 4 rows and 4 columns, the element in row 1, column 1 of matrix D is obtained by multiplying each element of row 1 of matrix A by the corresponding element of column 1 of matrix B and summing the 4 products.
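The definition above can be written as a short reference routine. This is a minimal Python sketch for illustration only; the function name `matmul_reference` is an assumption and does not come from the application.

```python
def matmul_reference(A, B):
    """Return D = A*B for matrices given as lists of rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    # Every row of A must have as many entries as B has rows.
    assert all(len(row) == inner for row in A)
    D = [[0] * cols for _ in range(rows)]
    for j in range(rows):        # row J of the result
        for k in range(cols):    # column K of the result
            # Sum of products of row j of A with column k of B.
            D[j][k] = sum(A[j][m] * B[m][k] for m in range(inner))
    return D
```

For two 4*4 inputs, `matmul_reference(A, B)[0][0]` is exactly the four-term sum described for the element in row 1, column 1.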
Based on the SM structure described above, the matrix multiplication A*B is currently implemented as follows:
First, the matrix is partitioned according to the size of the multiplier in the SM into submatrices that match the multiplier's size, and the load/store unit then loads the partitioned submatrices from video memory into the banks. In particular, when the matrix to be partitioned is smaller than the multiplier's size, the missing positions are filled with 0 to form a submatrix of the required size.
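The partitioning step with zero filling can be sketched as follows. This is a hedged Python illustration, not the application's implementation; the function name `split_into_tiles` and the dictionary keyed by tile position are assumptions made for the example.

```python
def split_into_tiles(matrix, n):
    """Cut `matrix` (a list of rows) into n*n tiles, zero-filling the edges."""
    rows, cols = len(matrix), len(matrix[0])
    tile_rows = -(-rows // n)   # ceiling division
    tile_cols = -(-cols // n)
    tiles = {}
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            # Start from an all-zero tile so missing positions stay 0.
            tile = [[0] * n for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    r, c = tr * n + i, tc * n + j
                    if r < rows and c < cols:
                        tile[i][j] = matrix[r][c]
            tiles[(tr, tc)] = tile
    return tiles
```

A 2*3 matrix with `n = 4` yields a single 4*4 tile whose unused positions are 0, which leaves any product computed from it unchanged.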
Second, one datum is read into the corresponding SP from each of the banks that hold the partitioned submatrices to be multiplied. Because all SPs and banks are interconnected through the fully connected network, an SP can read the required elements of matrix A and matrix B into its computing unit according to the rules of matrix multiplication. For example, suppose matrix A and matrix B are partitioned submatrices of 4 rows and 4 columns (written 4*4). Computing the value of d00 in the result matrix D requires a00*b00+a01*b10+a02*b20+a03*b30. Therefore a00 and b00 are fetched from the banks holding matrix A and matrix B and placed in register 301 and register 302 of the computing unit in the corresponding SP.
Finally, the SP performs the multiply-add operation with its computing unit. Note that after each multiply-add operation, the result must be stored in a storage space (bank) prepared in advance. When the next multiplication has completed, that result is fetched from the storage space and placed in register 303 for the addition. For example, after a00*b00 has been computed, the result is moved from register 304 into the prepared storage space. After a01*b10 has been computed, the a00*b00 result is moved from the storage space into register 303, the adder 306 adds it to the a01*b10 product, and the sum is placed in register 304 and then written back to the storage space. Proceeding in this way, the SP completes the calculation of a00*b00+a01*b10+a02*b20+a03*b30 and writes the result through the fully connected network to the corresponding position in the banks as the value of d00.
When the above algorithm and apparatus are used to compute A*B for matrices of size N*N, only one element is fetched from each matrix at a time, and each intermediate result is first stored in the prepared storage space and fetched again for the next calculation. Completing the multiplication of matrix A and matrix B therefore takes N*N*N multiply-add operations, 3*N*N*N read operations, and N*N*N write operations. In addition, all SPs and banks are connected through a fully connected network. This approach is inefficient and occupies considerable storage space.
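The operation counts above can be reproduced by simulating the prior-art flow and counting accesses. The following Python sketch is illustrative only; the function name `prior_art_multiply` is hypothetical, and it assumes that fetching the partial sum into register 303 counts as a read on every step, including the first, which is what the 3*N*N*N figure implies.

```python
def prior_art_multiply(A, B):
    """Multiply n*n matrices one element pair at a time, counting accesses."""
    n = len(A)
    scratch = {}   # stands in for the storage space prepared in advance
    counts = {"mac": 0, "read": 0, "write": 0}
    for j in range(n):
        for k in range(n):
            for m in range(n):
                a = A[j][m]                        # read into register 301
                b = B[m][k]                        # read into register 302
                partial = scratch.get((j, k), 0)   # read into register 303
                counts["read"] += 3
                result = a * b + partial           # units 305 and 306, register 304
                counts["mac"] += 1
                scratch[(j, k)] = result           # write register 304 back
                counts["write"] += 1
    D = [[scratch[(j, k)] for k in range(n)] for j in range(n)]
    return D, counts
```

With N=4 the counters come out at 64 multiply-adds, 192 reads, and 64 writes.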
To improve the efficiency of matrix multiplication in the SM of a GPU, the embodiments of this application provide a new matrix multiplier for use in a GPU. In the embodiments of this application, when matrices A and B of size N*N are multiplied, elements are no longer fetched from each matrix one at a time. Instead, the fact that different groups of banks can be accessed simultaneously is exploited: each time, one row of elements of matrix A and one column of elements of matrix B are loaded into the corresponding computing units and processed at the same time. This reduces the number of steps needed to multiply matrix A by matrix B and the number of memory accesses required, which improves the efficiency of matrix multiplication in the SM.
FIG. 4 shows the structure of the matrix multiplier provided by an embodiment of this application. As shown in FIG. 4, the matrix multiplier 400 includes a scheduler, banks, and computing units. The scheduler obtains matrices of the required size for calculation and saves them in the corresponding bank sets. The scheduler specifically includes a matrix multiplication scheduling unit 401, an instruction distribution unit 402, and an instruction distribution unit 403 (two are shown in the figure; in practice there may be one or more). The matrix multiplication scheduling unit 401 is the instruction scheduling unit of the matrix multiplier 400 and is mainly responsible for ordering and scheduling instructions; through the instructions it issues, the input, load, compute, store, and output stages are coordinated. Each instruction distribution unit is connected to the banks and computing units through control lines (not shown in the figure) and sends the scheduling instructions determined by the matrix multiplication scheduling unit 401 to the banks and computing units, so that they process data as instructed. In the embodiments of this application, the matrix multiplier 400 may contain two instruction distribution units, which allows dual instruction issue.
In the matrix multiplier of this application, the computing units are no longer connected to the banks through a fully connected network. As shown in FIG. 4, each matrix multiplier includes N*N computing units arranged as an N*N matrix (a 4*4 matrix in the figure, with the computing units named computing unit 430 to computing unit 445 from left to right and top to bottom). Each matrix multiplier also includes at least two bank sets, each containing N banks (the figure shows four bank sets with 16 banks in total, named bank 410 to bank 425). The first bank set stores the first multiplication matrix of the input matrices, and the second bank set stores the second multiplication matrix of the input matrices. Optionally, the matrix multiplier may further include a third bank set, which stores the result matrix, and a fourth bank set, which stores the addition matrix of the input matrices. The matrix multiplier of this application can complete N*N multiplications in one calculation cycle (in the computer field also called a clock cycle or beat), which improves computational efficiency.
To this end, the first bank set is row-connected to the computing units: the N banks of the first bank set are directly connected to the computing units of the respective rows of the N*N computing-unit matrix. The first bank of the first bank set is connected to each computing unit in the first row, the second bank to each computing unit in the second row, and so on up to the Nth bank, which is connected to each computing unit in the Nth row. The second bank set is column-connected to the computing units: the N banks of the second bank set are directly connected to the computing units of the respective columns of the N*N computing-unit matrix. The first bank of the second bank set is connected to each computing unit in the first column, the second bank to each computing unit in the second column, and so on up to the Nth bank, which is connected to each computing unit in the Nth column.
For example, as shown in FIG. 4, the first bank set contains bank 410, bank 411, bank 412, and bank 413 and is row-connected to the computing-unit matrix: bank 410 is connected to each computing unit in the first row, bank 411 to each computing unit in the second row, bank 412 to each computing unit in the third row, and bank 413 to each computing unit in the fourth row. The second bank set contains bank 414, bank 415, bank 416, and bank 417 and is column-connected to the computing-unit matrix: bank 414 is connected to each computing unit in the first column, bank 415 to each computing unit in the second column, bank 416 to each computing unit in the third column, and bank 417 to each computing unit in the fourth column. With this connection scheme, in the first clock cycle the first bank set can broadcast N data to the N*N computing units, and the second bank set can likewise broadcast N data to the N*N computing units, so each computing unit can perform one multiplication in that clock cycle. After N clock cycles, all of the multiplications are complete.
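The row and column connections of FIG. 4 can be expressed as a small lookup. The sketch below is a Python illustration under the numbering given in the text (computing units 430 to 445 in row-major order, banks 410 to 413 row-connected, banks 414 to 417 column-connected); the helper names `unit_id` and `units_fed_by` are assumptions.

```python
N = 4
FIRST_UNIT = 430       # computing units 430..445, left to right, top to bottom
FIRST_ROW_BANK = 410   # first bank set: banks 410..413, row-connected
FIRST_COL_BANK = 414   # second bank set: banks 414..417, column-connected

def unit_id(row, col):
    """Number of the computing unit at 0-based (row, col)."""
    return FIRST_UNIT + row * N + col

def units_fed_by(bank):
    """Return the computing units that a bank broadcasts to."""
    if FIRST_ROW_BANK <= bank < FIRST_ROW_BANK + N:
        row = bank - FIRST_ROW_BANK
        return [unit_id(row, c) for c in range(N)]
    if FIRST_COL_BANK <= bank < FIRST_COL_BANK + N:
        col = bank - FIRST_COL_BANK
        return [unit_id(r, col) for r in range(N)]
    raise ValueError("bank is not in the first or second bank set")
```

Each bank reaches only N computing units, and each computing unit is fed by exactly one bank of each set, so no fully connected network is needed for the operands.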
Further, the fourth bank set in the matrix multiplier is used to load the addition matrix C of the input matrices. The fourth bank set may be either row-connected or column-connected to the computing-unit matrix. If it is row-connected, each bank in the fourth bank set is connected to the computing units of one row of the computing-unit matrix. In FIG. 4, for example, bank 418 is connected to each computing unit in the first row, bank 419 to each computing unit in the second row, bank 420 to each computing unit in the third row, and bank 421 to each computing unit in the fourth row. The fourth bank set can load data into N computing units in each clock cycle. After N clock cycles, the whole addition matrix C stored in the fourth bank set has been input to the corresponding computing units, so the addition can be performed in the (N+1)th clock cycle.
Further, the third bank set in the matrix multiplier is used to store the result matrix D. The third bank set may be either row-connected or column-connected to the computing-unit matrix. If it is column-connected, each bank in the third bank set is connected to the computing units of one column of the computing-unit matrix. In FIG. 4, for example, bank 425 is connected to each computing unit in the first column, bank 424 to each computing unit in the second column, bank 423 to each computing unit in the third column, and bank 422 to each computing unit in the fourth column.
FIG. 5 shows a computing unit provided by an embodiment of this application, adapted to the matrix multiplication supported by the matrix multiplier 400. As shown in FIG. 5, the computing unit 500 may be any computing unit in the matrix multiplier 400 and includes five registers (register 501 to register 505), a multiplication unit 506, an addition unit 507, and an addition unit 508. When matrix A is multiplied by matrix B, the element in the first row and first column of the result matrix D is d00=a00*b00+a01*b10+a02*b20+a03*b30. In the first clock cycle, a00 and b00 are placed in register 501 and register 502, the multiplication unit 506 computes a00*b00, the addition unit 507 adds the product to the value stored in register 503, and the sum is written back to register 503, replacing the previous value. In the initial state register 503 holds 0, so after this calculation register 503 holds a00*b00. In the second clock cycle, a01 and b10 are placed in register 501 and register 502 and the multiplication unit 506 computes a01*b10; in the third clock cycle, a02 and b20 are placed in register 501 and register 502 and the multiplication unit 506 computes a02*b20; in the fourth clock cycle, a03 and b30 are placed in register 501 and register 502 and the multiplication unit 506 computes a03*b30. In each of these cycles the addition unit 507 accumulates the product into register 503 in the same way, so after the fourth clock cycle register 503 holds a00*b00+a01*b10+a02*b20+a03*b30. When register 504 holds 0, the value in register 505 equals the value in register 503; this value is taken out and stored in the storage space prepared in advance. When register 504 holds a value of the addition matrix C, the values in register 503 and register 504 are added, and the sum is the value of element d00 in the first row and first column of the result matrix. Thus, when D=A*B+C is computed, d00=a00*b00+a01*b10+a02*b20+a03*b30+c00, and the computing unit 500 obtains d00 by placing c00 in register 504 and using the addition unit 508 to add it to the result of a00*b00+a01*b10+a02*b20+a03*b30.
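The behavior of the computing unit in FIG. 5 can be modeled in a few lines. This is a minimal Python sketch, assuming one multiply-accumulate per clock cycle; the class name `ComputingUnit`, its method names, and the attribute names `r501` to `r505` are illustrative labels for registers 501 to 505, not identifiers from the application.

```python
class ComputingUnit:
    """Software model of computing unit 500 (registers 501 to 505)."""

    def __init__(self):
        self.r501 = 0   # operand from the first multiplication matrix
        self.r502 = 0   # operand from the second multiplication matrix
        self.r503 = 0   # running sum of products, initially 0
        self.r504 = 0   # element of the addition matrix, 0 if none
        self.r505 = 0   # final result

    def multiply_accumulate(self, a, b):
        """One clock cycle: multiply the operands and accumulate."""
        self.r501, self.r502 = a, b
        product = self.r501 * self.r502   # multiplication unit 506
        self.r503 = self.r503 + product   # addition unit 507

    def load_addend(self, c):
        """Place an element of the addition matrix in register 504."""
        self.r504 = c

    def finish(self):
        """Final cycle: add registers 503 and 504 (addition unit 508)."""
        self.r505 = self.r503 + self.r504
        return self.r505
```

Feeding the pairs (a00, b00), (a01, b10), (a02, b20), (a03, b30) in four calls and then calling `finish()` yields d00; if `load_addend` was never called, register 504 is 0 and the result is the plain product sum.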
FIG. 6 is a schematic flowchart of an embodiment of this application.
S601: Before calculation in the matrix multiplier, the matrices to be operated on are partitioned into N*N submatrices, the size specified by the matrix multiplier, and the submatrices are stored in the corresponding banks. If a matrix to be partitioned is smaller than N*N, the missing positions are filled with 0 to form an N*N submatrix; this does not affect the calculation result. Continuing with N=4 as an example, A, B, C, and D are all matrices of 4 rows and 4 columns whose elements are denoted aij, bij, cij, and dij, where i is the element's row number minus 1, j is its column number minus 1, and i and j are integers from 0 to 3. Referring to the initial state shown in FIG. 7, the elements of matrix A are placed in banks 410 to 413 by row, the elements of matrix B are placed in banks 414 to 417 by column, and the elements of matrix C are placed in banks 418 to 421 by row.
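The initial placement described in S601 can be sketched as a mapping from bank number to contents. This is a Python illustration for the N=4 case with the bank numbers of FIG. 4; the function name `initial_bank_layout` is an assumption, and a plain list stands in for each bank.

```python
def initial_bank_layout(A, B, C):
    """Distribute 4*4 matrices A, B, C over banks 410..421 as in S601."""
    n = len(A)
    banks = {}
    for i in range(n):
        banks[410 + i] = list(A[i])                    # row i of A
        banks[414 + i] = [B[r][i] for r in range(n)]   # column i of B
        banks[418 + i] = list(C[i])                    # row i of C
    return banks
```

With this layout, reading position M-1 of every bank in clock cycle M delivers column M of A, row M of B, and column M of C, which is the order used in the following steps.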
S602: In each clock cycle, each of banks 410 to 417, according to the instruction it receives and in a predetermined order, broadcasts one datum to all of the computing units connected to it. Each computing unit thus receives one element of matrix A and one element of matrix B, places them in its register 501 and register 502, multiplies them as described above, and adds the product to the previous multiply-add result to obtain the current multiply-add result. In the first clock cycle the previous result, that is, the initial value of the register, is 0.
Specifically, in the Mth clock cycle, banks 410 to 413 broadcast the elements of column M of matrix A to the computing units, with the computing units in row J receiving the element in row J, column M of matrix A; banks 414 to 417 broadcast the elements of row M of matrix B to the computing units, with the computing units in column K receiving the element in row M, column K of matrix B. J, K, and M are positive integers less than or equal to 4. Each computing unit multiplies the elements it has received from matrix A and matrix B to obtain the Mth product and adds it to the (M-1)th multiply-add result to obtain the Mth multiply-add result, where the 0th result, that is, the initial value of the internal register, is set to 0.
For example, referring to FIG. 8, in the first clock cycle bank 410 places a00 into computing units 430 to 433 in the first row, and bank 414 places b00 into computing units 430, 434, 438, and 442 in the first column. In computing unit 430, a00 and b00 are placed in register 501 and register 502 and multiplied, and the result a00*b00 is placed in register 503. The other banks among banks 410 to 417 perform the corresponding operations. In the second clock cycle, referring to FIG. 9, bank 410 and bank 414 place a01 and b10 into the corresponding computing units; computing unit 430 computes the product of a01 and b10, adds it to the a00*b00 held in register 503, and writes the sum back to register 503, which then holds a00*b00+a01*b10. The operations in each subsequent clock cycle proceed in the same way; see the state of the third clock cycle in FIG. 10 and the state of the fourth clock cycle in FIG. 11.
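The broadcast order of S602 can be tabulated directly from the rule for the Mth clock cycle. The Python sketch below uses element names as strings (for example "a01") and the N=4 numbering of FIG. 4; the function names `broadcasts_in_cycle` and `operands_of_unit` are hypothetical.

```python
N = 4

def broadcasts_in_cycle(m):
    """Map bank number to the element it broadcasts in clock cycle m (1-based)."""
    out = {}
    for j in range(N):
        out[410 + j] = f"a{j}{m - 1}"   # row j+1, column m of A
        out[414 + j] = f"b{m - 1}{j}"   # row m, column j+1 of B
    return out

def operands_of_unit(unit, m):
    """Return the (A element, B element) pair a computing unit sees in cycle m."""
    row, col = divmod(unit - 430, N)
    sched = broadcasts_in_cycle(m)
    return sched[410 + row], sched[414 + col]
```

For computing unit 430 this gives (a00, b00) in the first cycle and (a01, b10) in the second, matching FIG. 8 and FIG. 9 as described above.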
In each clock cycle, each of banks 418 to 421, according to the instruction it receives and in a predetermined order, places one of the elements of matrix C that it stores into register 504 of the corresponding computing unit.
Specifically, in the Mth clock cycle, banks 418 to 421 send the elements of column M of matrix C to the computing units in column M, with the computing unit in row L, column M receiving the element in row L, column M of matrix C. L is a positive integer less than or equal to 4.
For example, referring to FIG. 8, in the first clock cycle bank 418, bank 419, bank 420, and bank 421 place c00, c10, c20, and c30 into register 504 of computing unit 430, computing unit 434, computing unit 438, and computing unit 442, respectively. Similarly, referring to FIG. 9, in the second clock cycle bank 418, bank 419, bank 420, and bank 421 place c01, c11, c21, and c31 into register 504 of computing unit 431, computing unit 435, computing unit 439, and computing unit 443, respectively, and so on.
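The loading order for matrix C follows the same pattern and can be written as a one-line rule. This Python sketch assumes the N=4 numbering of FIG. 4 and uses element names as strings; the function name `addend_loads_in_cycle` is an assumption.

```python
N = 4

def addend_loads_in_cycle(m):
    """Return {computing unit: element of C} loaded into register 504 in cycle m."""
    col = m - 1   # cycle m serves column m of the computing-unit matrix
    return {430 + row * N + col: f"c{row}{col}" for row in range(N)}
```

Over four cycles every one of the 16 computing units receives exactly one element of C, one column of units per cycle.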
S603: Step S602 is repeated. After four clock cycles, banks 410 to 417 have delivered all of the elements of matrix A and matrix B that they store to the corresponding computing units, and the corresponding multiply-add calculations are complete. For example, after four clock cycles, computing unit 430 has computed a00*b00+a01*b10+a02*b20+a03*b30 and holds the result in its register 503. Meanwhile, banks 418 to 421 have placed the elements of matrix C in register 504 of the corresponding computing units. Referring to FIG. 12, in the fifth clock cycle the adder 508 in each computing unit adds the values in register 503 and register 504, and the result, an element of the result matrix D, is placed in register 505 of that computing unit.
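Steps S602 and S603 together can be checked with a small cycle-level simulation. The following Python sketch is an illustration under the schedule described above, not a hardware model; the function name `simulate` and the use of nested lists for the register contents are assumptions.

```python
def simulate(A, B, C):
    """Compute D = A*B + C in N+1 simulated clock cycles."""
    n = len(A)
    r503 = [[0] * n for _ in range(n)]   # running sums, one per computing unit
    r504 = [[0] * n for _ in range(n)]   # addends, one per computing unit
    cycles = 0
    for m in range(n):                   # clock cycles 1..N
        cycles += 1
        for j in range(n):
            for k in range(n):
                # Unit (j, k) receives column m of A and row m of B.
                r503[j][k] += A[j][m] * B[m][k]
        for row in range(n):
            # Column m of C goes to the units in column m.
            r504[row][m] = C[row][m]
    cycles += 1                          # clock cycle N+1: final addition
    r505 = [[r503[j][k] + r504[j][k] for k in range(n)] for j in range(n)]
    return r505, cycles
```

For 4*4 inputs the function returns the result after 5 simulated cycles, in line with the fifth-cycle addition shown in FIG. 12. The inner loops over `j` and `k` run sequentially here but correspond to work done in parallel by the N*N computing units.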
S604: The elements of the result matrix D held in register 505 of each computing unit are stored in turn into the corresponding banks among banks 422 to 425. Because each bank can be written with only one datum at a time, writing all of the results to the destination banks takes 4 clock cycles.
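One possible write-back order for S604 is sketched below in Python. It assumes the column connection described for FIG. 4 (bank 425 serves the first column of computing units, bank 424 the second, bank 423 the third, bank 422 the fourth) and, as a further assumption not stated in the text, that rows are drained from top to bottom; the function name `write_back_schedule` is hypothetical.

```python
N = 4
# Bank serving each 0-based column of computing units, per the FIG. 4 description.
COLUMN_BANK = {0: 425, 1: 424, 2: 423, 3: 422}

def write_back_schedule():
    """Return, per write cycle, which element of D each bank stores."""
    schedule = []
    for cycle in range(1, N + 1):
        row = cycle - 1   # assumption: one row of units is drained per cycle
        schedule.append({COLUMN_BANK[col]: f"d{row}{col}" for col in range(N)})
    return schedule
```

Each bank appears once per cycle, so the one-write-per-bank limit is respected and all 16 elements are stored in 4 cycles.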
S605:将存储库422至存储库425中所存储的结果矩阵D的元素移至指定的存储空间中。S605: Move the elements of the result matrix D stored in the repository 422 to the repository 425 to the specified storage space.
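The broadcast schedule in steps S602 and S603 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware described in this application: the function name simulate_broadcast_mac and the variables acc and addend are hypothetical, with acc standing in for register 503 and addend for register 504 of each computing unit. The model assumes that the computing unit in grid row i and grid column j receives a[i][t] and b[t][j] in time period t, which is consistent with computing unit 430 accumulating a00*b00+a01*b10+a02*b20+a03*b30.

```python
def simulate_broadcast_mac(A, B, C):
    """Model D = A*B + C on an N x N grid of computing units, one time period at a time.

    Returns the result matrix D and a snapshot of the accumulators after each
    multiply-accumulate period.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # stands in for register 503 of each unit
    addend = [[0] * n for _ in range(n)]  # stands in for register 504 of each unit
    snapshots = []

    # Time periods 1..N: each repository supplies one element per period.
    for t in range(n):
        for i in range(n):
            a = A[i][t]              # broadcast along grid row i
            for j in range(n):
                b = B[t][j]          # broadcast down grid column j
                acc[i][j] += a * b   # multiply-accumulate inside unit (i, j)
            addend[i][t] = C[i][t]   # column t of C goes to the units in grid column t
        snapshots.append([row[:] for row in acc])

    # Time period N+1: every unit adds its two registers.
    D = [[acc[i][j] + addend[i][j] for j in range(n)] for i in range(n)]
    return D, snapshots


D, snapshots = simulate_broadcast_mac([[1, 2], [3, 4]],
                                      [[5, 6], [7, 8]],
                                      [[1, 1], [1, 1]])
```

Each snapshot is the sum of the outer products processed so far, which is why N periods suffice for the product of two N*N matrices in this model.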
The algorithm proposed in the embodiments of this application reduces the number of steps needed to multiply matrix A by matrix B, and thereby improves the efficiency of matrix multiplication on the GPU. Specifically, because the elements of matrix A and matrix B are broadcast to a row or a column of computing units, computing A*B+C requires only 3*N*N read operations and N*N write operations, far fewer reads and writes than in the prior art. In addition, because the repositories are connected directly to the computing units, less chip area is occupied.
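The read and write counts quoted above can be checked with a small tally. The sketch below is an assumption-laden illustration: the function name access_counts is hypothetical, each of the N repositories in a set is assumed to supply or accept exactly one element per time period, and the period total covers only steps S602 to S604 (N multiply-accumulate periods, one addition period, and N write-back periods).

```python
def access_counts(n):
    """Tally repository reads, repository writes, and time periods for D = A*B + C."""
    reads = 0
    writes = 0

    for _ in range(n):   # N multiply-accumulate periods
        reads += n       # one element from each repository holding matrix A
        reads += n       # one element from each repository holding matrix B
        reads += n       # one element from each repository holding matrix C

    for _ in range(n):   # N write-back periods
        writes += n      # one element of D into each destination repository

    periods = n + 1 + n  # multiply-accumulate, then one addition, then write-back
    return reads, writes, periods


reads, writes, periods = access_counts(4)
```

For N=4 this gives 48 reads (3*N*N), 16 writes (N*N), and 9 time periods, matching the four periods of multiply-accumulate, the fifth period of addition, and the four periods of write-back described in the steps above.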
It should be noted that the labels S601 to S605 are used only for reference and do not mean that, in the embodiments of this application, the steps must be performed in a specific order.
To improve the working efficiency of the matrix multiplier provided in this application, two instruction sets are designed, one for external invocation of the matrix multiplier and one for its internal control.
The external invocation instruction set of the matrix multiplier contains three types of instructions.
The first type loads an external matrix into the storage space inside the matrix multiplier, for example mA=load_matrix_mmp(pA,m,n), where pA is a pointer to a matrix outside the matrix multiplier, mA is a pointer to matrix A inside the matrix multiplier, m is the number of rows of matrix A (that is, the number of elements in each column), and n is the number of columns of matrix A (that is, the number of elements in each row). The effect of this instruction is to load A from outside the matrix multiplier into the matrix multiplier.
The second type performs the matrix multiply-add calculation, for example mD=matrix_mul_mmp(mA,mB,mC,m,n,k), where mA, mB, mC, and mD are pointers to matrices A, B, C, and D inside the matrix multiplier, m is the number of elements in each column of matrices A, C, and D, n is the number of elements in each row of matrix A and also the number of elements in each column of matrix B, and k is the number of elements in each row of matrices B, C, and D. In other words, A is m*n, B is n*k, and C and D are m*k. The effect of this instruction is to start the matrix multiplier to perform the operation D=A*B+C.
The third type copies the result matrix to storage outside the matrix multiplier, for example store_matrix_mmp(pD,mD,m,n), where pD is a pointer to a matrix outside the matrix multiplier, mD is a pointer to a matrix inside the matrix multiplier, m is the number of elements in each column of the matrix, and n is the number of elements in each row of the matrix. The effect of this instruction is to copy matrix D, whose size is m*n, to the space pointed to by the pointer outside the matrix multiplier. For example, using the external invocation instruction set above, when the matrix multiplier is used to compute D=A*B+C for 4*4 matrices, the instructions can be issued as follows:
mA=load_matrix_mmp(pA,4,4);
mB=load_matrix_mmp(pB,4,4);
mC=load_matrix_mmp(pC,4,4);
mD=matrix_mul_mmp(mA,mB,mC,4,4,4);
store_matrix_mmp(pD,mD,4,4);
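To make the parameter conventions of the three external instructions concrete, the following is a minimal software stand-in, assuming matrices are represented as Python lists of rows. The class name MatrixMultiplierModel is hypothetical, and the method bodies only mimic the documented effect of each instruction (copy in, compute D=A*B+C, copy out); they say nothing about how the hardware implements them.

```python
class MatrixMultiplierModel:
    """Toy stand-in for the external invocation instructions of the matrix multiplier."""

    def load_matrix_mmp(self, pA, m, n):
        # Copy an m x n matrix from "external" storage into the model.
        if len(pA) != m or any(len(row) != n for row in pA):
            raise ValueError("matrix shape does not match m x n")
        return [list(row) for row in pA]

    def matrix_mul_mmp(self, mA, mB, mC, m, n, k):
        # A is m x n, B is n x k, C and D are m x k.
        return [[sum(mA[i][t] * mB[t][j] for t in range(n)) + mC[i][j]
                 for j in range(k)]
                for i in range(m)]

    def store_matrix_mmp(self, pD, mD, m, n):
        # Copy the m x n result into caller-owned storage.
        for i in range(m):
            pD[i][:n] = mD[i][:n]


# The 4*4 sequence from the text, with illustrative input values.
mmp = MatrixMultiplierModel()
pA = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pB = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity
pC = [[1] * 4 for _ in range(4)]
pD = [[0] * 4 for _ in range(4)]

mA = mmp.load_matrix_mmp(pA, 4, 4)
mB = mmp.load_matrix_mmp(pB, 4, 4)
mC = mmp.load_matrix_mmp(pC, 4, 4)
mD = mmp.matrix_mul_mmp(mA, mB, mC, 4, 4, 4)
mmp.store_matrix_mmp(pD, mD, 4, 4)
```

Because B is the identity matrix and every element of C is 1 in this illustration, each element of pD ends up one greater than the corresponding element of pA.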
The internal instruction set of the matrix multiplier contains two types of instructions.
The first type loads matrix elements into specific registers of the computing units and performs multiply-accumulate calculations on the loaded elements, for example Load_line_mmp(mA,mB,mC,n), where mA, mB, and mC are pointers to matrix A, matrix B, and matrix C, respectively, and n is the index of the row or column to load. The effect of this instruction is to load, by broadcast, the nth column of matrix A and the nth row of matrix B into specific registers of the computing units, to load the nth column of matrix C into specific registers of the computing units, and to perform multiply-accumulate calculations on the loaded matrix elements according to a preset rule.
The second type performs the matrix addition and stores the result row by row to the specified storage space, for example matrix_add_mmp(mD), where mD is a pointer to matrix D. Together with the previous instruction, the effect of this instruction is to add the multiply-accumulate result of matrix A and matrix B to matrix C, and to store the result, as the result matrix D, row by row into the storage space pointed to by mD.
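One way to picture how the two internal instructions combine is sketched below. This is a hypothetical model rather than the multiplier's actual control logic: the class name ComputeGridModel, the helper run_multiply_add, and the lowercase method names are invented for illustration, reg503 and reg504 stand in for the registers of the same numbers, and the sequencing (one line load per index, then a single addition) is an assumption inferred from the instruction descriptions.

```python
class ComputeGridModel:
    """Toy model of an N x N grid of computing units driven by the internal instructions."""

    def __init__(self, n):
        self.n = n
        self.reg503 = [[0] * n for _ in range(n)]  # running multiply-accumulate results
        self.reg504 = [[0] * n for _ in range(n)]  # elements of matrix C

    def load_line_mmp(self, mA, mB, mC, line):
        # Broadcast column `line` of A along the grid rows and row `line` of B
        # down the grid columns, accumulate the products, and load column
        # `line` of C into the units of grid column `line`.
        for i in range(self.n):
            for j in range(self.n):
                self.reg503[i][j] += mA[i][line] * mB[line][j]
            self.reg504[i][line] = mC[i][line]

    def matrix_add_mmp(self, mD):
        # Add the two registers of every unit and store the result row by row.
        for i in range(self.n):
            mD[i][:] = [p + c for p, c in zip(self.reg503[i], self.reg504[i])]


def run_multiply_add(mA, mB, mC):
    """Assumed expansion of one multiply-add into internal instructions."""
    n = len(mA)
    grid = ComputeGridModel(n)
    for line in range(n):
        grid.load_line_mmp(mA, mB, mC, line)
    mD = [[0] * n for _ in range(n)]
    grid.matrix_add_mmp(mD)
    return mD


result = run_multiply_add([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])
```

Under these assumptions, an N*N multiply-add expands into N line loads followed by one addition.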
The matrix multiplier provided in the embodiments of this application can be embedded in a GPU to implement matrix multiplication efficiently. As shown in FIG. 13, the GPU includes a memory controller (MMC), a Peripheral Component Interconnect Express (PCI-E) interface, a thread engine, an L2 cache, and several SMs, among other components (the L2 cache connects the SMs and the memory controller; this connection is not shown in the figure). The SM is the core computing component of the GPU and provides computing power for the entire GPU. It should be noted that the number of SMs in the GPU is not fixed and can be adjusted as needed; the number of SMs shown in FIG. 13 is only an example and should not be construed as limiting this application. The matrix multiplier provided in this application is located in the SM. It reduces the chip area occupied and the number of data reads and writes during matrix multiplication, and thereby improves the matrix multiplication performance and energy efficiency of the GPU.
The matrix multiplier provided in the embodiments of this application can also be combined with CPU cores to build a system on a chip (SoC) that quickly processes the matrix multiply-add operations in an application. FIG. 14 shows a system on a chip that includes the matrix multiplier. As shown in FIG. 14, the system on a chip includes a processor, a digital signal processing (DSP) unit, a codec (CODEC), and a matrix multiplication block (MMB), and these components are connected through an L2 cache. The processor may be an Advanced RISC Machine Processor. As shown in FIG. 15, the MMB consists of several matrix multipliers, which are connected to one another through an L1 cache. With the system on a chip, it is only necessary to extract the matrix multiply-add operations from the application and compute them with the matrix multiplier provided in the embodiments of this application, which improves the efficiency with which the application runs.

Claims (11)

  1. A matrix multiplier, characterized in that the matrix multiplier comprises:
    N*N computing units, the N*N computing units forming an N*N matrix, where N is a positive integer greater than or equal to 2;
    two repository sets, each repository set comprising N repositories, wherein the first repository set is configured to store a first multiplication matrix of the input matrices, the second repository set is configured to store a second multiplication matrix of the input matrices, the Mth repository in the first repository set is connected to every computing unit in the Mth row of the N*N matrix, and the Mth repository in the second repository set is connected to every computing unit in the Mth column of the N*N matrix, where M is a variable with 1≤M≤N;
    in each clock cycle, every computing unit in each row of the N*N matrix is configured to receive first input data broadcast by the repository in the first repository set to which it is connected, and every computing unit in each column of the N*N matrix is configured to receive second input data broadcast by the repository in the second repository set to which it is connected; in each clock cycle, every computing unit in the N*N matrix performs a multiplication on the received first input data and second input data; and after the Nth clock cycle ends, the matrix multiplier has completed the multiplication of the first multiplication matrix by the second multiplication matrix.
  2. The matrix multiplier according to claim 1, characterized in that, in each clock cycle, all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data.
  3. The matrix multiplier according to claim 1 or 2, characterized in that the matrix multiplier further comprises:
    a third repository set, the third repository set being configured to store a result matrix, wherein the Mth repository in the third repository set is connected to every computing unit in the Mth column of the N*N matrix.
  4. The matrix multiplier according to any one of claims 1 to 3, characterized in that the matrix multiplier further comprises: a fourth repository set, the fourth repository set being configured to store an addition matrix of the input matrices, wherein the Mth repository in the fourth repository set is connected to every computing unit in the Mth row of the N*N matrix;
    in the first clock cycle, every computing unit in the first column of the N*N matrix is configured to receive a first group of data input by the repository in the fourth repository set to which it is connected, the first group of data being the first column of the addition matrix; in the second clock cycle, every computing unit in the second column of the N*N matrix is configured to receive a second group of data input by the repository in the fourth repository set to which it is connected, the second group of data being the second column of the addition matrix; and so on, until in the Nth clock cycle, every computing unit in the Nth column of the N*N matrix is configured to receive an Nth group of data input by the repository in the fourth repository set to which it is connected, the Nth group of data being the Nth column of the addition matrix;
    in the (N+1)th clock cycle, every computing unit in the N*N matrix is further configured to perform an addition on the received input data of the addition matrix and the result of multiplying the first multiplication matrix by the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
  5. The matrix multiplier according to any one of claims 1 to 4, characterized in that the matrix multiplier further comprises: a scheduler, the scheduler being configured to obtain the first multiplication matrix and the second multiplication matrix in the form of N*N matrices, and to store the first multiplication matrix and the second multiplication matrix in the first repository set and the second repository set, respectively.
  6. A graphics processor, characterized in that the graphics processor comprises the matrix multiplier according to any one of claims 1 to 5.
  7. A system on a chip, characterized in that the system on a chip comprises the matrix multiplier according to any one of claims 1 to 5.
  8. A calculation method for performing calculation with a matrix multiplier, characterized in that the matrix multiplier comprises: N*N computing units, the N*N computing units forming an N*N matrix, where N is a positive integer greater than or equal to 2; and two repository sets, each repository set comprising N repositories, wherein the first repository set is configured to store a first multiplication matrix of the input matrices, the second repository set is configured to store a second multiplication matrix of the input matrices, the Mth repository in the first repository set is connected to every computing unit in the Mth row of the N*N matrix, and the Mth repository in the second repository set is connected to every computing unit in the Mth column of the N*N matrix, where M is a variable with 1≤M≤N;
    the method comprises:
    in the first clock cycle, every computing unit in each row of the N*N matrix receives first input data broadcast by the repository in the first repository set to which it is connected, every computing unit in each column of the N*N matrix receives second input data broadcast by the repository in the second repository set to which it is connected, every computing unit in the N*N matrix performs a multiplication on the first input data and the second input data to obtain a first multiplication result, and every computing unit in the N*N matrix adds the first multiplication result to the initial value in its internal register to obtain a first multiply-add result and saves the first multiply-add result that it has computed in the internal register;
    in the second clock cycle, every computing unit in each row of the N*N matrix receives first input data broadcast by the repository in the first repository set to which it is connected, every computing unit in each column of the N*N matrix receives second input data broadcast by the repository in the second repository set to which it is connected, every computing unit in the N*N matrix performs a multiplication on the first input data and the second input data to obtain a second multiplication result, and every computing unit in the N*N matrix adds the second multiplication result that it has computed to the first multiply-add result to obtain a second multiply-add result and saves the second multiply-add result in the internal register;
    in subsequent clock cycles, the calculation proceeds in the same way until, after the Nth clock cycle, the matrix multiplier has completed the multiplication of the first multiplication matrix by the second multiplication matrix.
  9. The calculation method according to claim 8, characterized in that, in each clock cycle, all computing units in the same row of the N*N matrix receive the same first input data, and all computing units in the same column of the N*N matrix receive the same second input data.
  10. The calculation method according to claim 8 or 9, characterized in that the matrix multiplier further comprises:
    a third repository set, the third repository set being configured to store a result matrix, wherein the Mth repository in the third repository set is connected to every computing unit in the Mth column of the N*N matrix;
    the method further comprises:
    every computing unit in the N*N matrix outputs the Nth multiply-add result that it has computed to the repository in the third repository set to which it is connected.
  11. The calculation method according to any one of claims 8 to 10, characterized in that the matrix multiplier further comprises: a fourth repository set, the fourth repository set being configured to store an addition matrix of the input matrices, wherein the Mth repository in the fourth repository set is connected to every computing unit in the Mth row of the N*N matrix;
    the method further comprises:
    in the first clock cycle, every computing unit in the first column of the N*N matrix receives a first group of data input by the repository in the fourth repository set to which it is connected, the first group of data being the first column of the addition matrix; in the second clock cycle, every computing unit in the second column of the N*N matrix receives a second group of data input by the repository in the fourth repository set to which it is connected, the second group of data being the second column of the addition matrix; and so on, until in the Nth clock cycle, every computing unit in the Nth column of the N*N matrix receives an Nth group of data input by the repository in the fourth repository set to which it is connected, the Nth group of data being the Nth column of the addition matrix;
    in the (N+1)th clock cycle, every computing unit in the N*N matrix further performs an addition on the received input data of the addition matrix and the Nth multiply-add result of the first multiplication matrix and the second multiplication matrix, to obtain the multiply-add result of the first multiplication matrix, the second multiplication matrix, and the addition matrix.
PCT/CN2018/117559 2018-04-26 2018-11-27 Calculation method and apparatus for matrix multiplication WO2019205617A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810386460.0 2018-04-26
CN201810386460.0A CN110415157B (en) 2018-04-26 2018-04-26 Matrix multiplication calculation method and device

Publications (1)

Publication Number Publication Date
WO2019205617A1 (en) 2019-10-31

Family

ID=68294819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/117559 WO2019205617A1 (en) 2018-04-26 2018-11-27 Calculation method and apparatus for matrix multiplication

Country Status (2)

Country Link
CN (1) CN110415157B (en)
WO (1) WO2019205617A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
CN113536219B (en) * 2020-04-21 2024-01-26 中科寒武纪科技股份有限公司 Operation method, processor and related products
WO2021212972A1 (en) * 2020-04-21 2021-10-28 中科寒武纪科技股份有限公司 Operation method, processor, and related product
CN114186186B (en) * 2020-09-15 2023-08-04 华为技术有限公司 Matrix calculation method and related equipment
CN112199119B (en) * 2020-10-21 2022-02-01 上海壁仞智能科技有限公司 Vector operation device
CN112433760B (en) * 2020-11-27 2022-09-23 海光信息技术股份有限公司 Data sorting method and data sorting circuit
CN112632464B (en) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 Processing device for processing data
CN112991142B (en) * 2021-03-31 2023-06-16 腾讯科技(深圳)有限公司 Matrix operation method, device, equipment and storage medium for image data
EP4310700A1 (en) * 2021-03-31 2024-01-24 Huawei Technologies Co., Ltd. Matrix multiplier, matrix computing method, and related device
CN113268708B (en) * 2021-07-16 2021-10-15 北京壁仞科技开发有限公司 Method and device for matrix calculation
CN115756384A (en) * 2022-11-22 2023-03-07 海光信息技术股份有限公司 Tensor calculation unit and using method, data processing device and operating method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541507A (en) * 2010-12-31 2012-07-04 联芯科技有限公司 Dimension reconfigurable data processing method, system and matrix multiplication processor
US8626815B1 (en) * 2008-07-14 2014-01-07 Altera Corporation Configuring a programmable integrated circuit device to perform matrix multiplication
CN104063357A (en) * 2013-03-22 2014-09-24 富士通株式会社 Processor And Processing Method
CN107943756A (en) * 2017-12-15 2018-04-20 北京中科寒武纪科技有限公司 A kind of computational methods and Related product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100465876C (en) * 2007-07-12 2009-03-04 浙江大学 Matrix multiplier device based on single FPGA
CN103294648B (en) * 2013-05-08 2016-06-01 中国人民解放军国防科学技术大学 Support the partitioned matrix multiplication vectorization method of many MAC operation parts vector treatment device
JP6102645B2 (en) * 2013-09-11 2017-03-29 富士通株式会社 Product-sum operation circuit and product-sum operation system
US10192162B2 (en) * 2015-05-21 2019-01-29 Google Llc Vector computation unit in a neural network processor
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution algorithm chip and communication equipment

Also Published As

Publication number Publication date
CN110415157A (en) 2019-11-05
CN110415157B (en) 2024-01-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18916333

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18916333

Country of ref document: EP

Kind code of ref document: A1