WO2022062004A1 - Data processing method and apparatus for matrix multiplication, and device and medium - Google Patents

Info

Publication number
WO2022062004A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
data
multiplication
instruction
vector general
Prior art date
Application number
PCT/CN2020/122168
Other languages
French (fr)
Chinese (zh)
Inventor
陈庆
华芮
袁庆
Original Assignee
成都海光集成电路设计有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 成都海光集成电路设计有限公司 filed Critical 成都海光集成电路设计有限公司
Publication of WO2022062004A1 publication Critical patent/WO2022062004A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management

Definitions

  • the present disclosure relates to the field of data processing, and more particularly, to a data processing method, apparatus, device, and medium for matrix multiplication.
  • the graphics processing unit includes a large number of data processing units.
  • Each data processing unit has a single instruction, multiple data (SIMD) structure: by executing one instruction, it controls multiple threads to perform the same operation simultaneously. It has a dedicated set of vector general-purpose registers (VGPRs) and a large number of parallel execution units, such as multiplication units. Because of its high degree of parallelism, the SIMD structure is widely used in matrix operations.
  • an embodiment of the present disclosure provides a data processing method for matrix multiplication, including: acquiring a matrix multiplication instruction and a data selection instruction; determining, based on the matrix multiplication instruction and the data selection instruction, a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register; determining, based on the data selection instruction, target operation data among the second number of operation data of the second operation matrix; and providing the first number of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to a first number of multipliers as first multiplication factors, and providing the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
  • the method further comprises: determining, based on the matrix multiplication instruction, a third vector general-purpose register for storing the result of the matrix multiplication operation; performing, by each of the first number of multipliers, a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and storing the operation results in the third vector general-purpose register.
  • the matrix multiplication instruction includes the first number of threads, and the first number of multipliers corresponds to the first number of threads, each of the first number of threads corresponding to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register; wherein determining the target operation data among the second number of operation data of the second operation matrix includes: selecting, based on the data selection instruction, a path from the second number of paths of the second vector general-purpose register, and using the operation data corresponding to that path as the target operation data; and wherein providing the target operation data to the first number of multipliers as second multiplication factors includes: for the thread of the first number of threads corresponding to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and for the remaining threads of the first number of threads, copying the target operation data to the multipliers connected to the paths of the second vector general-purpose register corresponding to the remaining threads.
  • the first operation matrix is a column matrix, and the first number of operation data is the column data of the first operation matrix; the second operation matrix is a row matrix, and the second number of operation data is the row data of the second operation matrix.
  • acquiring a matrix multiplication instruction and a data selection instruction comprises: acquiring a matrix multiplication instruction that includes a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and, when the second operation matrix field is a predefined value, acquiring a data selection instruction that includes an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates which specific data among the second number of operation data of the second operation matrix is selected as the target operation data.
  • Embodiments of the present disclosure provide an apparatus for performing data processing for matrix multiplication, including: an instruction fetch unit configured to acquire a matrix multiplication instruction and a data selection instruction; a decoding unit configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit and decode them, so as to determine a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, and to obtain data selection information, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register; a data selection control unit configured to receive the data selection information from the decoding unit and, based on the data selection information, determine target operation data among the second number of operation data of the second operation matrix; and a read operand unit configured to provide the first number of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to a first number of multipliers as first multiplication factors, and to provide the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
  • the decoding unit further determines, based on the decoding result, a third vector general-purpose register for storing the result of the matrix multiplication operation; the apparatus further includes: a multiplication unit configured to include the first number of multipliers, each of which performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit configured to store the operation results in the third vector general-purpose register.
  • Embodiments of the present disclosure provide a data processing apparatus including: a processor; and a memory having computer-executable instructions stored thereon, the instructions, when executed by the processor, for implementing the method as described above.
  • Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, are used to implement the method as described above.
  • Embodiments of the present disclosure provide a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the data processing method according to the embodiments of the present disclosure.
  • FIG. 1 shows a schematic flowchart of a data processing method 100 for matrix multiplication according to an embodiment of the present disclosure.
  • FIG. 2 shows a schematic diagram of the correspondence between threads performing matrix operations and paths of the VGPR according to an embodiment of the present disclosure.
  • FIG. 3 shows a schematic diagram of data processing for example matrix multiplication according to an embodiment of the present disclosure.
  • FIG. 4 shows a schematic diagram of an example apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure.
  • FIG. 5 shows a schematic diagram of the operation of an example data selection control unit 403 and a read operand unit 404 involved in the second half of data processing according to an embodiment of the present disclosure.
  • FIG. 6 shows a schematic diagram of a data processing apparatus 600 according to an embodiment of the present disclosure.
  • the SIMD structure processing unit of the GPU controls multiple threads to perform the same operation at the same time by executing a matrix operation instruction, so as to realize matrix reading, computation, and result storage.
  • For example, in a SIMD32 structure, executing one instruction can control the data operations of 32 threads at the same time; each SIMD32 structure has its own dedicated set of VGPRs, and each VGPR has 32 channels.
  • Table 1 below shows a general matrix operation instruction, which is the conventional instruction used when performing a matrix operation in the SIMD structure. It includes a first operation matrix (VSRCA) field indicating the first VGPR that stores the first operation matrix, a second operand (SRCB) field indicating the second VGPR that stores the second operand, a destination VGPR (VDST) field indicating the third VGPR used to store the matrix operation result, an operation code (OP) field indicating the specific operation performed by the matrix operation instruction, and an instruction selection (Type) field indicating that a matrix operation instruction is to be executed.
  • the matrix multiplication instruction can be obtained by setting the OP field in the matrix operation instruction to a corresponding value indicating the multiplication operation.
  • the matrix multiplication instruction in the general matrix operation instruction format is used to perform the matrix multiplication A*B, where matrix A is a 32×1 column matrix, that is, A(:,1) contains 32 data, and matrix B is a 1×4 row matrix, that is, B(1,:) contains 4 data.
  • the commonly used prior-art approach is to read the matrix data one by one from the double data rate synchronous dynamic random access memory (DDR SDRAM) into the VGPRs.
  • For example, matrix A is read into VGPR 0, and the four data of matrix B are read into four VGPRs (called VGPR 1, VGPR 2, VGPR 3 and VGPR 4, respectively). In each operation, the data corresponding to the 32 channels of VGPR 0 and the data corresponding to the 32 channels of VGPR 1, VGPR 2, VGPR 3 or VGPR 4 are sent to the corresponding multipliers in the SIMD structure for multiplication.
  • This process involves reading data from the DDR SDRAM multiple times, such as 5 times in this operation, resulting in unnecessary data redundancy and extra power consumption.
  • the present disclosure proposes reading the operation matrix only once (that is, reading the entire second operation matrix into the second VGPR at one time) and, correspondingly, adding an instruction part to the original matrix multiplication instruction to guide the ordered multiplication of the data within the matrix.
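  • The difference in register loads between the two flows can be illustrated with a minimal Python sketch. The function names, the list-based register model, and the assumption that the conventional flow replicates each element of B across all 32 channels are illustrative only and are not taken from the disclosure.

```python
LANES = 32  # SIMD32: each VGPR has 32 channels


def prior_art_loads(a_column, b_row):
    """Conventional flow: A goes to VGPR 0, each element of B to its own VGPR."""
    vgprs = {0: list(a_column)}
    loads = 1  # one read for matrix A
    for j, b in enumerate(b_row, start=1):
        # Assumption: the single element is replicated across all channels.
        vgprs[j] = [b] * LANES
        loads += 1  # one read per element of B
    return vgprs, loads


def proposed_loads(a_column, b_row):
    """Proposed flow: the whole row matrix B is read into one VGPR at once."""
    loads = 0
    vgpr_a = list(a_column)
    loads += 1
    # Only the first len(b_row) channels hold data; the rest hold nothing.
    vgpr_b = list(b_row) + [None] * (LANES - len(b_row))
    loads += 1
    return {"A": vgpr_a, "B": vgpr_b}, loads


a = list(range(1, 33))   # A(:,1), 32 data
b = [2, 3, 5, 7]         # B(1,:), 4 data
_, n_prior = prior_art_loads(a, b)
_, n_new = proposed_loads(a, b)
```

With the 32×1 and 1×4 example from the text, the conventional flow performs five reads, matching the figure stated above, while the sketched single-read flow performs two.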
  • FIG. 1 shows a schematic flowchart of a data processing method 100 for matrix multiplication according to an embodiment of the present disclosure.
  • a matrix multiplication instruction and a data selection instruction are acquired.
  • matrix multiplication instructions and data selection instructions may be retrieved from memory (e.g., DDR SDRAM).
  • an instruction part for operating data between threads is added to guide the selection and copying of the data involved in the operation in the second operation matrix during the matrix multiplication process.
  • the added instruction part is called the data selection instruction, as shown in Table 2.
  • the SRCB field originally used to indicate the second VGPR is used as the entry to obtain the data selection instruction, and the data selection instruction indicates the second VGPR in which the second operation matrix is stored.
  • the data selection instruction may include a second operation matrix (VSRCB) field for indicating the second VGPR, and a data selection (SVF_MODE) field for indicating the data selection.
  • the matrix multiply instruction and the data selection instruction may exist as two separate instructions, or may exist as two parts of one instruction.
  • the SIMD instruction adopted by the data processing method 100 for matrix multiplication includes the above-mentioned matrix multiplication instruction and data selection instruction.
  • the length of a SIMD instruction may be 64 bits. The first 32 bits are the matrix operation instruction part, and the definitions and descriptions of each bit field in the matrix operation instruction are shown in Table 3. The remaining 32 bits are the data selection instruction part, and the definitions and descriptions of each bit field in the data selection instruction are shown in Table 4.
  • In the matrix operation instruction, bits 0 to 8 are the SRCB field. This field can indicate the second VGPR that stores the second operand (for example, when the SRCB value equals 90 or 267); when the SRCB value equals a predefined value (for example, 209), the field instead indicates that data selection is entered and the data selection instruction is obtained.
  • Bits 9 to 16 are the VSRCA field.
  • Bits 17 to 24 are the VDST field.
  • Bits 25 to 30 are the OP field, which takes one of a number of specific values for a matrix multiplication instruction.
  • Bit 31 is the Type field, which indicates that a matrix operation instruction is to be executed.
  • In the data selection instruction, bits 32 to 39 are the VSRCB field.
  • Bits 40 to 44 are the SVF_MODE field; with a length of 5 bits, SVF_MODE can indicate the copy operation of data among 32 threads.
  • The remaining bits are reserved fields of the instruction, which can be reserved for subsequent implementation of other operations.
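  • The bit layout described above can be sketched as a small decoder. This is an illustration only: the names `decode`, `encode` and `MatrixMulInstruction` are hypothetical, bit 0 is assumed to be the least significant bit, and 209 is used as the predefined SRCB value only because the text gives it as an example.

```python
from typing import NamedTuple, Optional

DATA_SELECT_SRCB = 209  # example predefined SRCB value mentioned in the text


class MatrixMulInstruction(NamedTuple):
    srcb: int
    vsrca: int
    vdst: int
    op: int
    type_bit: int
    vsrcb: Optional[int]     # only meaningful when data selection is entered
    svf_mode: Optional[int]  # only meaningful when data selection is entered


def _bits(word, low, high):
    """Extract bits low..high (inclusive), assuming bit 0 is the LSB."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode(word):
    """Split a 64-bit instruction word into the fields described in the text."""
    srcb = _bits(word, 0, 8)
    uses_select = srcb == DATA_SELECT_SRCB
    return MatrixMulInstruction(
        srcb=srcb,
        vsrca=_bits(word, 9, 16),
        vdst=_bits(word, 17, 24),
        op=_bits(word, 25, 30),
        type_bit=_bits(word, 31, 31),
        vsrcb=_bits(word, 32, 39) if uses_select else None,
        svf_mode=_bits(word, 40, 44) if uses_select else None,
    )


def encode(vsrca, vdst, op, vsrcb, svf_mode, type_bit=1):
    """Build a word that enters data selection (type_bit=1 is an assumption)."""
    return (DATA_SELECT_SRCB | vsrca << 9 | vdst << 17 | op << 25
            | type_bit << 31 | vsrcb << 32 | svf_mode << 40)


word = encode(vsrca=0, vdst=100, op=5, vsrcb=80, svf_mode=3)
fields = decode(word)
```

In this sketch, a word whose SRCB field holds an ordinary register index (such as 90) decodes with `vsrcb` and `svf_mode` left as `None`, mirroring the rule that the data selection part is consulted only when SRCB equals the predefined value.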
  • a first VGPR storing the first operation matrix and a second VGPR storing the second operation matrix may be determined based on the matrix multiplication instruction and the data selection instruction.
  • the address information of the first VGPR storing the first operation matrix and of the second VGPR storing the second operation matrix can be obtained according to the VSRCA field in the matrix multiplication instruction and the VSRCB field in the data selection instruction; the address information may be the index of the VGPR among all VGPRs of the SIMD structure processing unit.
  • the first operation matrix may be stored in the first VGPR in advance, and the second operation matrix may be stored in the second VGPR in advance.
  • the first VGPR and the second VGPR have the same number of paths; the first number of operation data of the first operation matrix corresponds to the first number of paths of the first VGPR, and the second number of operation data of the second operation matrix corresponds to the second number of paths of the second VGPR.
  • based on the obtained address information of the first VGPR and the second VGPR, the SIMD structure processing unit can perform a multiplication operation on the first number of operation data of the first operation matrix and the second number of operation data of the second operation matrix.
  • both the first VGPR and the second VGPR have 32 channels, so the VGPR can simultaneously provide up to 32 data in the stored matrix to participate in the operation.
  • For example, the first operation matrix A is a 32×1 column matrix, and the first number of operation data is the 32 column data of A(:,1); the second operation matrix B is a 1×4 row matrix, and the second number of operation data is the 4 row data of B(1,:).
  • The 32 channels of VGPR A, which stores matrix A, correspond respectively to the 32 data of A(:,1) in matrix A; the first 4 of the 32 channels of VGPR B, which stores matrix B, correspond respectively to the 4 data of B(1,:) in matrix B, and the other channels of VGPR B do not correspond to any data.
  • FIG. 2 shows a schematic diagram of the correspondence between threads performing matrix operations and paths of the VGPR according to an embodiment of the present disclosure.
  • the matrix multiplication instruction includes a first number of threads, wherein each thread corresponds to a respective path of the first VGPR and a respective path of the second VGPR.
  • For example, the above matrix multiplication instruction includes 32 threads corresponding to the 32 column data of A(:,1).
  • FIG. 2 shows that each thread corresponds to a respective path of the first VGPR and a respective path of the second VGPR; e.g., thread 0 corresponds to path 0 of the first VGPR and path 0 of the second VGPR, thread 1 corresponds to path 1 of the first VGPR and path 1 of the second VGPR, and so on.
  • Path 0 of the second VGPR, which corresponds to thread 0, corresponds to the first data B(1,1) of B(1,:). After the data B(1,1) corresponding to path 0 of the second VGPR is copied to the 31 paths of the second VGPR corresponding to the remaining 31 threads, the data corresponding to all 32 paths of the second VGPR for the 32 threads are B(1,1).
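  • The select-and-copy behaviour described above can be sketched as follows. The function name `select_and_broadcast` is hypothetical, and treating the selection value directly as a path index is an assumption; the disclosure only states that SVF_MODE guides the selection. The sketch returns the per-thread operands rather than overwriting the register contents.

```python
LANES = 32  # number of paths per VGPR in the SIMD32 example


def select_and_broadcast(vgpr_b, selected_path):
    """Pick one path of the second VGPR and copy its data to every thread.

    vgpr_b        -- list of LANES entries; unused paths hold None
    selected_path -- index of the path whose data becomes the target data
                     (assumed here to equal the data selection value)
    """
    if not 0 <= selected_path < LANES:
        raise ValueError("selected path out of range")
    target = vgpr_b[selected_path]
    if target is None:
        raise ValueError("selected path holds no operation data")
    # Every thread sees the same second multiplication factor.
    return [target] * LANES


# B(1,:) occupies the first 4 paths; the other paths hold no data.
vgpr_b = [10, 20, 30, 40] + [None] * (LANES - 4)
second_factors = select_and_broadcast(vgpr_b, 0)
```

Selecting path 0 gives all 32 threads the value stored for B(1,1), which is the situation depicted for FIG. 2.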
  • target operation data may be determined in the second quantity of operation data of the second operation matrix based on the data selection instruction.
  • the first number of operation data of the first operation matrix may be provided, via the first number of paths of the first VGPR, to the first number of multipliers as the first multiplication factors, and the target operation data may be provided, via the first number of paths of the second VGPR, to the first number of multipliers as the second multiplication factors.
  • the matrix multiply instruction may contain a first number of threads, and the first number of multipliers corresponds to the first number of threads.
  • For example, the target operation data corresponding to channel 1 of the second VGPR, which corresponds to thread 1, is provided to the input of the multiplier corresponding to thread 1, and is copied to the inputs of the multipliers connected to the channels of the second VGPR corresponding to the remaining threads, for the multiplication operation.
  • a third VGPR for storing the result of the matrix multiplication operation can be determined based on the matrix multiplication instruction; the third VGPR has the same number of paths as the first VGPR and the second VGPR. Each of the first number of multipliers can perform a multiplication operation based on its corresponding first multiplication factor and second multiplication factor and, after obtaining the operation result, store it in the third VGPR via the corresponding one of the first number of paths.
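  • The per-thread multiplication and write-back step can be sketched as below. The register file is modelled as a Python dictionary keyed by VGPR index, and the wrap-around to 32 bits is an assumption made for illustration; `multiply_and_store` is a hypothetical name.

```python
LANES = 32
MASK32 = 0xFFFFFFFF  # assumption: results are kept as unsigned 32-bit values


def multiply_and_store(vgpr_file, vsrca, second_factors, vdst):
    """One multiplier per thread: path i of the first VGPR times factor i.

    vgpr_file      -- dict mapping a VGPR index to a list of LANES values
    vsrca          -- index of the first VGPR (first multiplication factors)
    second_factors -- LANES copies of the target operation data
    vdst           -- index of the third VGPR that receives the results
    """
    first_factors = vgpr_file[vsrca]
    if len(first_factors) != LANES or len(second_factors) != LANES:
        raise ValueError("operands must provide one value per path")
    vgpr_file[vdst] = [(a * b) & MASK32
                       for a, b in zip(first_factors, second_factors)]
    return vgpr_file[vdst]


vgpr_file = {0: list(range(1, 33))}          # A(:,1) in the first VGPR
products = multiply_and_store(vgpr_file, 0, [3] * LANES, 100)
```

Each result lands on the same path index as the thread that produced it, so the third VGPR ends up with one product per path.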
  • FIG. 3 shows a schematic diagram of data processing for example matrix multiplication according to an embodiment of the present disclosure.
  • the SIMD example in this embodiment is a SIMD32 structure, and each VGPR includes 32 channels.
  • Each path of VGPR A, which stores matrix A, corresponds to one data of the column vector of matrix A, and each path of VGPR B, which stores matrix B, corresponds to one data of the row vector of matrix B.
  • Each thread executes the multiplication of the data corresponding to its path of VGPR A and the target operation data in VGPR B.
  • the 32 channels of VGPR A correspond to the 32 data A(1,1), A(2,1), ..., A(32,1) of the column vector A(:,1) of the matrix A respectively;
  • a partial assembly instruction example of the method described in this disclosure may be represented as follows:
  • both registers v0 and v80 can store 32 data.
  • v_mul_u32 is an opcode indicating a 32-bit multiplication operation, v0 indicates the register of the first operation matrix A, and v80 indicates the register of the second operation matrix B. The target operation data in the second operation matrix B is selected by changing the value of SVF_MODE, and v100/v101/v102/v103 indicate the intermediate registers used to store the multiplication results of the first operation matrix A and the target operation data, thus realizing, under the SIMD structure, a matrix multiplication operation in which the operation matrix is read only once.
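  • As an illustration of the behaviour just described (not a reconstruction of the assembly listing), the following Python sketch emulates four multiplications that differ only in the selection value and write to registers 100 to 103. The Python function `v_mul_u32` is a hypothetical stand-in for the instruction, the use of the selection value as a path index is an assumption, and unused paths of the second register are filled with zeros for simplicity.

```python
LANES = 32
MASK32 = 0xFFFFFFFF  # 32-bit multiplication, as the opcode name suggests


def v_mul_u32(vgprs, vdst, vsrca, vsrcb, svf_mode):
    """Hypothetical stand-in: multiply every path of vsrca by one path of vsrcb."""
    target = vgprs[vsrcb][svf_mode]  # assumption: svf_mode is the path index
    vgprs[vdst] = [(a * target) & MASK32 for a in vgprs[vsrca]]


def column_times_row(a_column, b_row):
    """Compute the 32 x len(b_row) product with one load of each matrix."""
    vgprs = {
        0: list(a_column),                             # v0 holds A(:,1)
        80: list(b_row) + [0] * (LANES - len(b_row)),  # v80 holds B(1,:)
    }
    for j in range(len(b_row)):
        # Only the selection value and the destination register change.
        v_mul_u32(vgprs, vdst=100 + j, vsrca=0, vsrcb=80, svf_mode=j)
    # Row i of the result collects path i of v100, v101, ...
    return [[vgprs[100 + j][i] for j in range(len(b_row))]
            for i in range(LANES)]


result = column_times_row(list(range(1, 33)), [2, 3, 5, 7])
```

Here `result[i][j]` equals A(i+1,1) multiplied by B(1,j+1), so the four destination registers together hold the full 32×4 product.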
  • The SIMD structure and the matrices involved in the multiplication operation are not limited to the above examples and can be adjusted by those skilled in the art according to the actual situation; further examples are not provided here.
  • FIG. 4 shows a schematic diagram of an example apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure.
  • an apparatus 400 for performing data processing for matrix multiplication may include: an instruction fetch unit 401, a decoding unit 402, a data selection control unit 403, and a read operand unit 404.
  • Instruction fetch unit 401 may be configured to fetch matrix multiply instructions and data select instructions. For example, instruction fetch unit 401 may fetch instructions from a memory such as DDR SDRAM to an instruction register.
  • Decoding unit 402 may be configured to receive matrix multiplication instructions and data selection instructions from instruction fetch unit 401 and decode these instructions to determine a first VGPR storing a first operation matrix and a second VGPR storing a second operation matrix, and to obtain data selection information, wherein the first VGPR and the second VGPR have the same number of paths, the first number of operation data of the first operation matrix corresponds to the first number of paths of the first VGPR, and the second number of operation data of the second operation matrix corresponds to the second number of paths of the second VGPR.
  • the decoding unit 402 splits and interprets the fetched instruction according to a predetermined instruction format, and obtains information such as the VGPR addresses and the operation. In addition, based on the data selection instruction, corresponding data selection information can be obtained; a data selection signal (SVF_MODE), for example, can be used to transmit this information to guide the subsequent data selection in the second operation matrix.
  • the data selection control unit 403 may be configured to receive data selection information from the decoding unit 402 and, based on the data selection information, determine target operation data among the second number of operation data of the second operation matrix. For example, in the data selection control unit 403, the second number of operation data of the second operation matrix may be passed through a selector controlled by the data selection information (e.g., SVF_MODE) to select the target operation data.
  • the read operand unit 404 may be configured to provide the first number of operation data of the first operation matrix to the first number of multipliers via the first number of paths of the first VGPR, respectively, as the first multiplication factor, and the target operation Data is provided to the first number of multipliers as a second multiplication factor via the first number of paths of the second VGPR.
  • the read operand unit 404 may copy the target operation data to the first number of paths connected to the above-mentioned first number of multipliers among the paths of the second VGPR, so as to provide the corresponding multipliers as second multiplication factors.
  • the decoding unit 402 may be further configured to determine a third VGPR for storing the result of the matrix multiplication operation based on the decoding result.
  • the apparatus 400 for performing the data processing method for matrix multiplication may further include: a multiplication unit 405, which may be configured to include a first number of multipliers, each of which performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit 406, which may be configured to store the multiplication operation results in the third VGPR.
  • FIG. 5 shows a schematic diagram of the operation of an example data selection control unit 403 and a read operand unit 404 involved in the second half of data processing according to an embodiment of the present disclosure.
  • Based on the data selection control information received from the decoding unit 402 (with SVF_MODE as the data selection signal), the data selection control unit 403 passes the 32 second operation data on the 32 paths of VGPR B through a 32-to-1 selector to select the second operation data (i.e., the target operation data) corresponding to the designated path of VGPR B.
  • the read operand unit 404 supplies the 32 first operation data of matrix A to the first inputs of the 32 multipliers through the 32 paths of VGPR A, respectively; it provides the target operation data to the second input of the multiplier connected to the designated path of VGPR B, and copies the target operation data to the remaining paths of VGPR B so that it is provided to the second inputs of the remaining multipliers.
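  • The cooperation of units 401 to 406 can be sketched as a chain of small Python functions. This is a simplified model under stated assumptions: instructions are represented as dictionaries of already-separated fields rather than encoded words, the register file is a dictionary of lists, and all function and class names are hypothetical.

```python
from dataclasses import dataclass

LANES = 32


@dataclass
class Decoded:
    vsrca: int     # index of the first VGPR
    vsrcb: int     # index of the second VGPR
    vdst: int      # index of the third VGPR
    svf_mode: int  # data selection signal (assumed to be a path index)


def fetch(instruction_memory, pc):
    """Instruction fetch unit 401: read one instruction."""
    return instruction_memory[pc]


def decode(instruction):
    """Decoding unit 402: obtain register indices and selection information."""
    return Decoded(**instruction)


def select_target(vgprs, d):
    """Data selection control unit 403: a 32-to-1 selector over the second VGPR."""
    return vgprs[d.vsrcb][d.svf_mode]


def read_operands(vgprs, d, target):
    """Read operand unit 404: first factors per path, target copied to all paths."""
    return list(vgprs[d.vsrca]), [target] * LANES


def multiply(first, second):
    """Multiplication unit 405: one multiplier per thread."""
    return [a * b for a, b in zip(first, second)]


def write_back(vgprs, d, results):
    """Operation write-back unit 406: store results in the third VGPR."""
    vgprs[d.vdst] = results


def run(instruction_memory, vgprs):
    for pc in range(len(instruction_memory)):
        d = decode(fetch(instruction_memory, pc))
        target = select_target(vgprs, d)
        first, second = read_operands(vgprs, d, target)
        write_back(vgprs, d, multiply(first, second))
    return vgprs


registers = {0: list(range(32)), 80: [2, 3, 5, 7] + [0] * 28}
program = [{"vsrca": 0, "vsrcb": 80, "vdst": 100, "svf_mode": 1}]
run(program, registers)
```

Keeping each unit as a separate function mirrors the division of responsibilities in FIG. 4 and FIG. 5; a hardware implementation would of course perform the per-path work in parallel rather than in a loop.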
  • FIG. 6 shows a schematic diagram of a data processing apparatus 600 according to an embodiment of the present disclosure.
  • a data processing device 600 may include a processor 601 and a memory 602, which may be interconnected through a bus 603.
  • the processor 601 can perform various actions and processes according to programs or codes stored in the memory 602 .
  • the processor 601 may be an integrated circuit chip, which has signal processing capability.
  • the aforementioned processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor can implement or execute the various methods, steps, processes and logical block diagrams disclosed in the embodiments of the present disclosure.
  • the general-purpose processor may be a microprocessor or any conventional processor, and may use an X86 architecture, an ARM architecture, or the like.
  • the memory 602 stores executable instructions, which when executed by the processor 601 are used to implement the data processing method according to the embodiment of the present disclosure.
  • Memory 602 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
  • the nonvolatile memory may be read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • SRAM: static random access memory
  • DRAM: dynamic random access memory
  • SDRAM: synchronous dynamic random access memory
  • DDR SDRAM: double data rate synchronous dynamic random access memory
  • ESDRAM: enhanced synchronous dynamic random access memory
  • SLDRAM: synchronous link dynamic random access memory
  • DRRAM: direct Rambus random access memory
  • Embodiments of the present disclosure also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data processing method according to the embodiments of the present disclosure.
  • computer-readable storage media in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. It should be noted that the memories described herein are intended to include, without being limited to, these and any other suitable types of memory.
  • Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method according to the embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a data processing method, apparatus, device, and storage medium for matrix multiplication.
  • the data processing method for matrix multiplication provided by the embodiments of the present disclosure first reads the entire matrix into the VGPR, then selects among the multiple paths of the VGPR and copies the data corresponding to the selected path to the other paths of the VGPR, where it participates as the multiplication factor in the multiplication operation of each corresponding thread. This makes full use of the characteristics of the matrix, effectively reuses data between threads, reduces the number of data reads, and reduces power consumption.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which contains at least one executable instruction for implementing the specified logical function.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.
  • the various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flowcharts, or using some other graphical representation, it is to be understood that the blocks, apparatus, systems, techniques, or methods described herein may be implemented, as non-limiting examples, in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.

Abstract

A data processing method and apparatus for matrix multiplication, and a device and a medium. The data processing method comprises: acquiring a matrix multiplication instruction and a data selection instruction; on the basis of the matrix multiplication instruction and the data selection instruction, determining a first vector general-purpose register that stores a first operation matrix, and a second vector general-purpose register that stores a second operation matrix; on the basis of the data selection instruction, determining target operation data from a second number of pieces of operation data in the second operation matrix; respectively providing a first number of pieces of operation data in the first operation matrix to the first number of multipliers, and taking same as first multiplication factors; and providing the target operation data to the first number of multipliers, and then taking same as a second multiplication factor. By means of the data processing method and apparatus for matrix multiplication, and the device and the medium, data can be effectively reused between threads, thereby reducing the number of readings of data, and reducing the power consumption.

Description

Data processing method, apparatus, device and medium for matrix multiplication
This application claims priority to Chinese Patent Application No. 202011019241.2, filed on September 24, 2020, the disclosure of which is hereby incorporated by reference in its entirety as a part of this application.
Technical Field
The present disclosure relates to the field of data processing, and more particularly, to a data processing method, apparatus, device, and medium for matrix multiplication.
Background
A graphics processing unit (GPU) includes a large number of data processing units. Each data processing unit is a single instruction, multiple data (SIMD) structure that executes one instruction to make multiple threads perform the same operation simultaneously. Each SIMD structure has its own dedicated set of vector general-purpose registers (VGPRs) and a large number of operation units that can execute in parallel, such as multiplication units. Because the SIMD structure is highly parallel, it is widely used in matrix operations.
At present, when performing matrix operations, and matrix multiplication in particular, the nature of matrix multiplication often requires reading matrix data multiple times in order to multiply corresponding matrix elements. Moreover, after a piece of matrix data is read into a register, the data carried on all paths of that register is identical, so there is a large amount of redundant data between threads, which also causes additional power consumption. Existing data processing approaches can copy data between threads by executing specific instructions, but those instructions are not suited to matrix operations, and the instructions that operate on inter-thread data exist as separate instructions independent of the operation instructions, which remains inefficient for practical data processing.
Therefore, there is a need for an efficient data processing method that is suitable for matrix operations and can effectively reduce the number of reads.
Summary
To solve the above problems, embodiments of the present disclosure provide a data processing method for matrix multiplication, including: acquiring a matrix multiplication instruction and a data selection instruction; determining, based on the matrix multiplication instruction and the data selection instruction, a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of pieces of operation data of the first operation matrix correspond to a first number of paths of the first vector general-purpose register, and a second number of pieces of operation data of the second operation matrix correspond to a second number of paths of the second vector general-purpose register; determining, based on the data selection instruction, target operation data among the second number of pieces of operation data of the second operation matrix; and providing the first number of pieces of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to a first number of multipliers respectively as first multiplication factors, and providing the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
According to an embodiment of the present disclosure, the method further includes: determining, based on the matrix multiplication instruction, a third vector general-purpose register for storing the result of the matrix multiplication operation; performing, by each of the first number of multipliers, a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and storing the operation result into the third vector general-purpose register.
According to an embodiment of the present disclosure, the matrix multiplication instruction includes the first number of threads, the first number of multipliers correspond to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register. Determining the target operation data among the second number of pieces of operation data of the second operation matrix includes: selecting, based on the data selection instruction, one path from the second number of paths of the second vector general-purpose register, and taking the operation data corresponding to that path as the target operation data. Providing the target operation data to the first number of multipliers as second multiplication factors includes: for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and for the remaining threads among the first number of threads, copying the target operation data to the paths of the remaining threads that are connected to the second vector general-purpose register, and providing it to the corresponding multipliers as second multiplication factors.
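The select-and-copy step in the preceding paragraph can be illustrated with a small software model. This is only a sketch under stated assumptions, not the disclosed hardware: Python lists stand in for the paths of a VGPR, and the names `select_and_broadcast` and `multiply_step` are hypothetical.

```python
def select_and_broadcast(vgpr_b, selected_path, num_threads):
    """Pick the datum on one path of VGPR B and copy it to every thread.

    Assumption: vgpr_b is a list with one entry per path, and
    selected_path is the index of the path chosen by the data
    selection instruction.
    """
    target = vgpr_b[selected_path]   # target operation data
    return [target] * num_threads    # one copy per thread / multiplier


def multiply_step(vgpr_a, vgpr_b, selected_path):
    """Model one multiply: each thread multiplies its own A datum by the target."""
    second_factors = select_and_broadcast(vgpr_b, selected_path, len(vgpr_a))
    return [a * b for a, b in zip(vgpr_a, second_factors)]
```

In this model every thread keeps its own first multiplication factor from VGPR A, while all threads share the single selected value from VGPR B as the second multiplication factor.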
According to an embodiment of the present disclosure, the first operation matrix is a column matrix, and the first number of pieces of operation data are the column data of the first operation matrix; and the second operation matrix is a row matrix, and the second number of pieces of operation data are the row data of the second operation matrix.
According to an embodiment of the present disclosure, acquiring the matrix multiplication instruction and the data selection instruction includes: acquiring the matrix multiplication instruction, which includes a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and when the second operation matrix field has a predefined value, acquiring the data selection instruction, which includes an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates that specific data among the second number of pieces of operation data of the second operation matrix is selected as the target operation data.
Embodiments of the present disclosure provide an apparatus for performing data processing for matrix multiplication, including: an instruction fetch unit configured to acquire a matrix multiplication instruction and a data selection instruction; a decoding unit configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit and decode them, so as to determine a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, and to obtain data selection information, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of pieces of operation data of the first operation matrix correspond to a first number of paths of the first vector general-purpose register, and a second number of pieces of operation data of the second operation matrix correspond to a second number of paths of the second vector general-purpose register; a data selection control unit configured to receive the data selection information from the decoding unit and, based on the data selection information, determine target operation data among the second number of pieces of operation data of the second operation matrix; and a read operand unit configured to provide the first number of pieces of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to a first number of multipliers respectively as first multiplication factors, and to provide the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
According to an embodiment of the present disclosure, the decoding unit further determines, based on the decoding result, a third vector general-purpose register for storing the result of the matrix multiplication operation, and the apparatus further includes: a multiplication unit configured to include the first number of multipliers, each of which performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit configured to store the operation result into the third vector general-purpose register.
According to an embodiment of the present disclosure, the matrix multiplication instruction includes the first number of threads, the first number of multipliers correspond to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register. Determining the target operation data among the second number of pieces of operation data of the second operation matrix includes: selecting, based on the data selection instruction, one path from the second number of paths of the second vector general-purpose register, and taking the operation data corresponding to that path as the target operation data. Providing the target operation data to the first number of multipliers as second multiplication factors includes: for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and for the remaining threads among the first number of threads, copying the target operation data to the paths of the remaining threads that are connected to the second vector general-purpose register, and providing it to the corresponding multipliers as second multiplication factors.
According to an embodiment of the present disclosure, the first operation matrix is a column matrix, and the first number of pieces of operation data are the column data of the first operation matrix; and the second operation matrix is a row matrix, and the second number of pieces of operation data are the row data of the second operation matrix.
According to an embodiment of the present disclosure, acquiring the matrix multiplication instruction and the data selection instruction includes: acquiring the matrix multiplication instruction, which includes a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and when the second operation matrix field has a predefined value, acquiring the data selection instruction, which includes an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates that specific data among the second number of pieces of operation data of the second operation matrix is selected as the target operation data.
Embodiments of the present disclosure provide a data processing device, including: a processor; and a memory storing computer-executable instructions which, when executed by the processor, implement the method described above.
Embodiments of the present disclosure provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described above.
Embodiments of the present disclosure provide a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method according to the embodiments of the present disclosure.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some exemplary embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a schematic flowchart of a data processing method 100 for matrix multiplication according to an embodiment of the present disclosure.
FIG. 2 shows a schematic diagram of the correspondence between the threads performing matrix operations and the paths of a VGPR according to an embodiment of the present disclosure.
FIG. 3 shows a schematic diagram of the data processing of an example matrix multiplication according to an embodiment of the present disclosure.
FIG. 4 shows a schematic diagram of an example apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure.
FIG. 5 shows a schematic diagram of the operation of an example data selection control unit 403 and read operand unit 404 involved in the second half of the data processing according to an embodiment of the present disclosure.
FIG. 6 shows a schematic diagram of a data processing device 600 according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the present disclosure more apparent, example embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the example embodiments described herein.
In this specification and the drawings, substantially the same or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted. In the description of the present disclosure, the terms "first", "second", and the like are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance or order.
In this specification and the drawings, elements are described in the singular or the plural depending on the embodiment. However, the singular and plural forms are chosen to suit the situation presented merely for ease of explanation and are not intended to limit the present disclosure. Accordingly, the singular may include the plural, and the plural may include the singular, unless the context clearly dictates otherwise.
The SIMD processing units of a GPU execute a matrix operation instruction to make multiple threads perform the same operation simultaneously, thereby implementing matrix reads, arithmetic operations, result storage, and so on. For example, in a SIMD32 structure, executing one instruction controls the data operations of 32 threads at the same time; each SIMD32 structure has its own dedicated set of VGPRs, and each VGPR has 32 paths. Table 1 below shows the general matrix operation instruction, which is the conventional instruction used when performing matrix operations in a SIMD structure. It includes a first operation matrix (VSRCA) field indicating the first VGPR storing the first operation matrix, a second operand (SRCB) field indicating the second VGPR storing the second operand, a destination VGPR (VDST) field indicating the third VGPR used to store the matrix operation result, an opcode (OP) field indicating the specific operation performed by the matrix operation instruction, and an instruction selection (Type) field indicating that this matrix operation instruction is to be executed. A matrix multiplication instruction is obtained by setting the OP field of the matrix operation instruction to the value that indicates a multiplication operation.
Type | OP | VDST | VSRCA | SRCB
Table 1
Consider using a matrix multiplication instruction in the general matrix operation instruction format to perform the matrix multiplication A*B on a SIMD32 structure, where matrix A is a 32×1 column matrix, i.e., A(:,1) contains 32 pieces of data, and matrix B is a 1×4 row matrix, i.e., B(1,:) contains 4 pieces of data.
For the above matrix multiplication, the commonly used prior-art approach is to read the matrix data from double data rate synchronous dynamic random access memory (DDR SDRAM) into VGPRs piece by piece. Matrix A is first read into VGPR 0, and then the four pieces of matrix data of matrix B are read into four VGPRs (referred to as VGPR 1, VGPR 2, VGPR 3, and VGPR 4, respectively). In each operation, the data on the 32 paths of VGPR 0 and the data on the 32 paths of VGPR 1, VGPR 2, VGPR 3, or VGPR 4 are sent to the corresponding multipliers in the SIMD structure for multiplication. This process reads data from DDR SDRAM multiple times (5 reads in this example), causing unnecessary data redundancy and extra power consumption.
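To make the read count of this conventional scheme concrete, the following toy model counts memory reads for the 32×1 by 1×4 example. It is a minimal sketch, assuming a dictionary stands in for DDR SDRAM and lists stand in for VGPR paths; `CountingMemory` and `conventional_multiply` are hypothetical names introduced here for illustration.

```python
LANES = 32  # paths per VGPR in the SIMD32 example


class CountingMemory:
    """Toy stand-in for DDR SDRAM that counts how many reads are issued."""

    def __init__(self, contents):
        self.contents = contents
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.contents[key]


def conventional_multiply(memory):
    """Multiply A (32x1) by B (1x4) with one memory read per B element."""
    vgpr0 = memory.read("A")             # read 1: the 32 elements of A
    products = []
    for j in range(4):
        b_j = memory.read(("B", j))      # reads 2 to 5: one element of B each
        vgpr_j = [b_j] * LANES           # the same value sits on all 32 paths
        products.append([a * b for a, b in zip(vgpr0, vgpr_j)])
    return products
```

After one call, `memory.reads` is 5 in this model, and each of the four VGPRs holding an element of B carries 32 identical copies of it, which is the redundancy the paragraph above describes.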
Therefore, to solve the above problems, the present disclosure proposes reading each operation matrix only once (i.e., reading the entire second operation matrix into the second VGPR in a single read) and, correspondingly, adding an instruction portion to the original matrix multiplication instruction to direct the ordered multiplication of the data within the matrices.
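The single-read idea can be sketched for the same 32×1 by 1×4 example. This is an illustrative model only: the function name `single_read_multiply`, the dictionary used as memory, and the zero padding of the unused paths of the second VGPR are all assumptions, not details taken from the disclosure.

```python
LANES = 32  # paths per VGPR in the SIMD32 example


def single_read_multiply(memory):
    """Multiply A (32x1) by B (1x4) after reading each matrix once.

    Assumption: memory is a dict with keys "A" (32 values) and "B"
    (4 values); unused paths of the second VGPR are padded with zeros.
    """
    reads = 0
    vgpr_a = list(memory["A"])
    reads += 1                                   # one read for all of A
    row_b = list(memory["B"])
    vgpr_b = row_b + [0] * (LANES - len(row_b))
    reads += 1                                   # one read for all of B
    products = []
    for path in range(len(row_b)):               # one multiply per selected path
        target = vgpr_b[path]                    # datum chosen by data selection
        products.append([a * target for a in vgpr_a])
    return products, reads
```

In this model the four multiplications reuse the two registers already loaded, so the read count is 2, and only the selected path changes from one multiplication to the next.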
Embodiments of the present disclosure are further described below with reference to the accompanying drawings.
FIG. 1 shows a schematic flowchart of a data processing method 100 for matrix multiplication according to an embodiment of the present disclosure.
As shown in FIG. 1, first, in step 101, a matrix multiplication instruction and a data selection instruction are acquired. For example, the matrix multiplication instruction and the data selection instruction may be fetched from a memory (such as DDR SDRAM).
According to an embodiment of the present disclosure, an instruction portion that operates on inter-thread data is added to the original matrix multiplication instruction to direct the selection and copying of the data in the second operation matrix that participates in the operation during matrix multiplication. In the present disclosure, this added instruction portion is called the data selection instruction, as shown in Table 2. The SRCB field, which originally indicated the second VGPR, serves as the entry point for acquiring the data selection instruction, and the data selection instruction indicates the second VGPR storing the second operation matrix. The data selection instruction may include a second operation matrix (VSRCB) field indicating the second VGPR and a data selection (SVF_MODE) field indicating the data selection. It should be understood that, according to embodiments of the present disclosure, the matrix multiplication instruction and the data selection instruction may exist as two separate instructions or as two parts of one instruction. In the following description, the SIMD instruction used by the data processing method 100 for matrix multiplication includes both parts: the matrix multiplication instruction and the data selection instruction.
Reserved field | SVF_MODE | VSRCB
Table 2
According to an embodiment of the present disclosure, the SIMD instruction may, for example, be 64 bits long. Its first 32 bits are the matrix operation instruction part, whose bit fields are defined and described in Table 3; its following 32 bits are the data selection instruction part, whose bit fields are defined and described in Table 4.
Referring to Table 3, in the matrix operation instruction part of this SIMD instruction, bits 0 to 8 are the SRCB field. This field can indicate the second VGPR storing the second operand (for example, when the SRCB value is 90 or 267). When the SRCB value equals a predefined value (for example, 209), the field instead indicates entry into data selection, and the data selection instruction is acquired. Bits 9 to 16 are the VSRCA field. Bits 17 to 24 are the VDST field. Bits 25 to 30 are the OP field, which for a matrix multiplication instruction takes one of several specific values. Bit 31 is the Type field, which indicates that this matrix operation instruction is to be executed.
[Table image: Figure PCTCN2020122168-appb-000001]
[Table image: Figure PCTCN2020122168-appb-000002]
Table 3
Referring to Table 4, in the data selection instruction part of this SIMD instruction, bits 32 to 39 are the VSRCB field. Bits 40 to 44 are the SVF_MODE field; being 5 bits long, SVF_MODE can indicate data copy operations among 32 threads. The remaining bits are reserved and may be used to implement other operations later.
[Table image: Figure PCTCN2020122168-appb-000003]
Table 4
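The bit layout described for Tables 3 and 4 can be sketched as a small field packer and unpacker. This is only an illustration under stated assumptions: bit 0 is taken to be the least significant bit of the 64-bit word (the text does not specify bit ordering), and the names `FIELDS`, `encode`, and `decode` are hypothetical helpers, not part of the disclosed instruction set.

```python
# Field positions as (least significant bit, width), taken from the
# description of Tables 3 and 4. Assumption: bit 0 is the LSB of the word.
FIELDS = {
    "SRCB": (0, 9),       # bits 0-8
    "VSRCA": (9, 8),      # bits 9-16
    "VDST": (17, 8),      # bits 17-24
    "OP": (25, 6),        # bits 25-30
    "TYPE": (31, 1),      # bit 31
    "VSRCB": (32, 8),     # bits 32-39
    "SVF_MODE": (40, 5),  # bits 40-44; bits 45-63 are reserved
}

# Example SRCB value that the text gives for "enter data selection".
DATA_SELECT_ENTRY = 209


def encode(**values: int) -> int:
    """Pack named field values into a 64-bit instruction word."""
    word = 0
    for name, value in values.items():
        lsb, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word


def decode(word: int) -> dict:
    """Split a 64-bit instruction word into its named bit fields."""
    return {
        name: (word >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in FIELDS.items()
    }


def uses_data_selection(word: int) -> bool:
    """True when SRCB holds the entry value for the data selection part."""
    return decode(word)["SRCB"] == DATA_SELECT_ENTRY
```

A 5-bit SVF_MODE field covers the values 0 to 31, which is why it can address any one of the 32 paths of a VGPR.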
In step 102, a first VGPR storing the first operation matrix and a second VGPR storing the second operation matrix may be determined based on the matrix multiplication instruction and the data selection instruction.
According to an embodiment of the present disclosure, address information of the first VGPR storing the first operation matrix and of the second VGPR storing the second operation matrix can be obtained from the VSRCA field of the matrix multiplication instruction and the VSRCB field of the data selection instruction. The address information may be the index of the VGPR among all VGPRs of the SIMD processing unit.
According to an embodiment of the present disclosure, the first operation matrix may be stored in the first VGPR in advance, and the second operation matrix may be stored in the second VGPR in advance. The first VGPR and the second VGPR have the same number of paths; a first number of operation data of the first operation matrix corresponds to a first number of paths of the first VGPR, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second VGPR.
According to an embodiment of the present disclosure, with the first and second operation matrices stored in the first VGPR and the second VGPR respectively, the SIMD processing unit can use the obtained address information of the two VGPRs to multiply the first number of operation data of the first operation matrix by the second number of operation data of the second operation matrix, where the first number of operation data of the first operation matrix corresponds to the first number of paths of the first VGPR and the second number of operation data of the second operation matrix corresponds to the second number of paths of the second VGPR. For example, in a SIMD 32 structure both the first VGPR and the second VGPR have 32 paths, so a VGPR can supply up to 32 elements of its stored matrix to the operation at the same time.
According to an embodiment of the present disclosure, consider for example the matrix multiplication A*B, where the first operation matrix A is a 32×1 column matrix, the first number of operation data are the 32 elements of column A(:,1), the second operation matrix B is a 1×4 row matrix, and the second number of operation data are the 4 elements of row B(1,:). The 32 paths of VGPR A, which stores matrix A, correspond to the 32 elements of A(:,1), while the first 4 of the 32 paths of VGPR B, which stores matrix B, correspond to the 4 elements of B(1,:); the other paths of VGPR B do not correspond to any data.
FIG. 2 shows a schematic diagram of the correspondence between the threads performing the matrix operation and the paths of the VGPRs according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the matrix multiplication instruction includes a first number of threads, each of which corresponds to a respective path of the first VGPR and a respective path of the second VGPR.
As shown in FIG. 2, the above matrix multiplication instruction includes 32 threads corresponding to the 32 elements of column A(:,1). FIG. 2 shows that each thread corresponds to a respective path of the first VGPR and a respective path of the second VGPR: thread 0 corresponds to path 0 of the first VGPR and path 0 of the second VGPR, thread 1 corresponds to path 1 of the first VGPR and path 1 of the second VGPR, and so on. Taking thread 0 as an example, path 0 of the second VGPR, which corresponds to thread 0, holds the first element B(1,1) of B(1,:). Once B(1,1) has been copied from path 0 to the 31 paths of the second VGPR that correspond to the remaining threads, all 32 paths of the second VGPR corresponding to the 32 threads hold B(1,1).
Next, returning to FIG. 1, in step 103, target operation data may be determined among the second number of operation data of the second operation matrix based on the data selection instruction.
According to an embodiment of the present disclosure, based on the data selection instruction, the path of the second VGPR indicated by the SVF_MODE value can be determined, and the operation data corresponding to that path is taken as the target operation data. For example, when SVF_MODE=1, the operation data corresponding to path 1 of the second VGPR is determined to be the target operation data.
In step 104, the first number of operation data of the first operation matrix may be provided, via the first number of paths of the first VGPR, to a first number of multipliers respectively as first multiplication factors, and the target operation data may be provided, via the first number of paths of the second VGPR, to the first number of multipliers as second multiplication factors.
According to an embodiment of the present disclosure, the matrix multiplication instruction may include a first number of threads, and the first number of multipliers corresponds to the first number of threads. For the thread whose path of the second VGPR holds the target operation data, the target operation data can be provided directly to its multiplier as the second multiplication factor. For the remaining threads, the target operation data is copied to the paths of the second VGPR corresponding to those threads and thereby provided to their multipliers as the second multiplication factor. For example, when SVF_MODE=1, the target operation data corresponding to path 1 of the second VGPR, which belongs to thread 1, is provided to the input of the multiplier of thread 1, and this target operation data is copied to the inputs of the multipliers connected to the paths of the second VGPR of the remaining threads, so that the multiplications can be performed.
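Steps 103 and 104 can be modeled in software as a lane select followed by a broadcast and a per-lane multiply. The sketch below assumes 32 paths and represents each VGPR as a plain Python list; `broadcast_multiply` is a hypothetical name used only for illustration, and the model says nothing about how the hardware actually wires the copy.

```python
LANES = 32  # number of paths per VGPR in the SIMD 32 example


def broadcast_multiply(vgpr_a, vgpr_b, svf_mode):
    """Model one multiply instruction with data selection.

    vgpr_a   -- 32 first multiplication factors, one per thread
    vgpr_b   -- 32 path values of the second VGPR
    svf_mode -- index of the path of vgpr_b holding the target data
    """
    if len(vgpr_a) != LANES or len(vgpr_b) != LANES:
        raise ValueError("each VGPR must have exactly 32 paths")
    if not 0 <= svf_mode < LANES:
        raise ValueError("SVF_MODE must be in the range 0..31")

    # Step 103: the path selected by SVF_MODE supplies the target data.
    target = vgpr_b[svf_mode]

    # Step 104: every thread receives the same second factor (the copy),
    # while the first factor comes from the thread's own path of vgpr_a.
    second_factors = [target] * LANES
    return [a * b for a, b in zip(vgpr_a, second_factors)]
```

The contents of `vgpr_b` are left unchanged in this model, which reflects the point of the scheme: the second operation matrix is loaded once and different elements are selected by changing SVF_MODE.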
According to an embodiment of the present disclosure, a third VGPR for storing the result of the matrix multiplication can be determined based on the matrix multiplication instruction. The third VGPR has the same number of paths as the first VGPR and the second VGPR. Each of the first number of multipliers performs a multiplication on its first multiplication factor and second multiplication factor, and the results are then stored into the third VGPR via the corresponding first number of paths.
FIG. 3 shows a schematic diagram of the data processing of an example matrix multiplication according to an embodiment of the present disclosure.
As shown in FIG. 3, the SIMD example in this embodiment is a SIMD 32 structure in which each VGPR includes 32 paths. The matrix multiplication A*B=C is performed under this structure, where the first operation matrix A is a 32×1 column matrix and the second operation matrix B is a 1×4 row matrix, so the result matrix C is a 32×4 matrix. The general hardware matrix algorithm involved is
[Equation image: Figure PCTCN2020122168-appb-000004]
Each path of VGPR A, which stores matrix A, corresponds to one element of the column vector of matrix A, and each path of VGPR B, which stores matrix B, corresponds to one element of the row vector of matrix B. Each thread multiplies the data corresponding to its path of VGPR A by the target operation data in VGPR B.
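For the shapes stated in this example, the definition of the matrix product reduces to one multiplication per result element. The form below is derived only from the stated dimensions (A is 32×1, B is 1×4) and is not a transcription of the equation image in the original publication:

```latex
C(i,j) \;=\; \sum_{k=1}^{1} A(i,k)\,B(k,j) \;=\; A(i,1)\,B(1,j),
\qquad 1 \le i \le 32,\quad 1 \le j \le 4 .
```

Read column by column, thread i computes C(i,j) from its own element A(i,1) and the shared element B(1,j), so a single instruction with a fixed j yields the whole column C(:,j).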
The specific operations in this embodiment are as follows:
The 32 paths of VGPR A correspond to the 32 elements A(1,1), A(2,1), …, A(32,1) of the column vector A(:,1) of matrix A;
When SVF_MODE=0, B(1,1) is copied to the 32 paths of VGPR B (a dashed arrow in FIG. 3 represents this process), the data corresponding to each path of VGPR A and VGPR B are multiplied pairwise, and the results are stored into VGPR C via its 32 corresponding paths, giving the column vector C(:,1) of matrix C;
Likewise, when SVF_MODE=1, B(1,2) is copied to the 32 paths of VGPR B, the data corresponding to each path of VGPR A and VGPR B are multiplied pairwise, and the column vector C(:,2) of matrix C is obtained;
When SVF_MODE=2, B(1,3) is copied to the 32 paths of VGPR B, the data corresponding to each path of VGPR A and VGPR B are multiplied pairwise, and the column vector C(:,3) of matrix C is obtained;
When SVF_MODE=3, B(1,4) is copied to the 32 paths of VGPR B, the data corresponding to each path of VGPR A and VGPR B are multiplied pairwise, and the column vector C(:,4) of matrix C is obtained, which completes matrix C.
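The four steps above amount to building the 32×4 result one column per SVF_MODE value. The following sketch reproduces that sequence in plain Python as a software model only; `outer_product_by_lane_select` is a hypothetical name, and the unused paths of VGPR B are filled with zeros purely as a placeholder, since the text says they do not correspond to any data.

```python
def outer_product_by_lane_select(a_col, b_row, lanes=32):
    """Compute C = A * B for a (lanes x 1) column A and a (1 x n) row B.

    Each pass of the loop models one multiply instruction whose SVF_MODE
    selects one element of B and broadcasts it to every path.
    """
    if len(a_col) != lanes or not 0 < len(b_row) <= lanes:
        raise ValueError("A needs one element per path; B must fit in one VGPR")

    vgpr_a = list(a_col)
    # Placeholder zeros for the paths of VGPR B that hold no data.
    vgpr_b = list(b_row) + [0] * (lanes - len(b_row))

    columns = []  # columns[j] models the destination VGPR of pass j
    for svf_mode in range(len(b_row)):
        target = vgpr_b[svf_mode]                    # selected element of B
        columns.append([a * target for a in vgpr_a])  # one column of C

    # Rearrange the per-instruction columns into row-major C[i][j].
    return [[columns[j][i] for j in range(len(b_row))] for i in range(lanes)]


if __name__ == "__main__":
    A = [i + 1 for i in range(32)]  # A(1,1) .. A(32,1)
    B = [2, 3, 5, 7]                # B(1,1) .. B(1,4)
    C = outer_product_by_lane_select(A, B)
    print(C[0], C[31])
```

Both operand lists are read once before the loop, mirroring the single load of each matrix; only the selector changes between passes.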
The specific operations of the data processing procedure for matrix multiplication according to an embodiment of the present disclosure are described in detail below.
First, the operation matrices for the matrix multiplication are each read into a designated VGPR. The designated VGPRs are then given in the matrix multiplication instruction and the data selection instruction, and the SVF_MODE value is changed accordingly. As a result, the multiplication of a column matrix by a row matrix can be completed with a single data read, without reading and storing the data multiple times. For example, a partial set of assembly instructions for the method described in the present disclosure may be written as follows:
buffer_load_b32 v0, v_addr_0;
buffer_load_b32 v80, v_addr_1;
v_mul_u32 v100, v0, v80, SVF_MODE=0;
v_mul_u32 v101, v0, v80, SVF_MODE=1;
v_mul_u32 v102, v0, v80, SVF_MODE=2;
v_mul_u32 v103, v0, v80, SVF_MODE=3;
Specifically, in the above assembly instructions, the first buffer_load_b32 instruction reads matrix A from address v_addr_0 into register v0, and the second buffer_load_b32 instruction reads matrix B from address v_addr_1 into register v80. Registers v0 and v80 can each store 32 data elements.
Next, matrix operations are performed on the data in registers v0 and v80. Specifically, the instruction "v_mul_u32 v100, v0, v80, SVF_MODE=0" defines the parameters OP, VDST, VSRCA, VSRCB, and SVF_MODE of Table 3 and Table 4 according to an embodiment of the present disclosure. Here v_mul_u32 is the opcode and indicates a 32-bit multiplication, v0 indicates the register holding the first operation matrix A, and v80 indicates the register holding the second operation matrix B. The target operation data in the second operation matrix B is selected by changing the SVF_MODE value, and v100/v101/v102/v103 indicate the intermediate registers that store the products of the first operation matrix A and the target operation data. In this way, matrix multiplication under a SIMD structure is achieved with each operation matrix read only once.
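The mapping from the operands of such an instruction to the named fields can be sketched with a small parser. The textual syntax accepted here is inferred solely from the example lines above and is an assumption, as is the helper name `parse_operands`; the numeric OP encoding of v_mul_u32 is not given in the text, so the mnemonic is kept as a string.

```python
import re

# Pattern for lines such as "v_mul_u32 v100, v0, v80, SVF_MODE=0;".
# The syntax is inferred from the example instructions only.
_INSTRUCTION = re.compile(
    r"\s*(\w+)\s+v(\d+)\s*,\s*v(\d+)\s*,\s*v(\d+)\s*,"
    r"\s*SVF_MODE\s*=\s*(\d+)\s*;?\s*"
)


def parse_operands(line: str) -> dict:
    """Map one multiply instruction to the fields of Tables 3 and 4."""
    match = _INSTRUCTION.fullmatch(line)
    if match is None:
        raise ValueError(f"unrecognized instruction: {line!r}")
    mnemonic, vdst, vsrca, vsrcb, mode = match.groups()
    return {
        "OP": mnemonic,         # opcode mnemonic, e.g. v_mul_u32
        "VDST": int(vdst),      # destination register index
        "VSRCA": int(vsrca),    # register holding the first operation matrix
        "VSRCB": int(vsrcb),    # register holding the second operation matrix
        "SVF_MODE": int(mode),  # path of VSRCB supplying the target data
    }
```

For the four multiply instructions in the example, only VDST and SVF_MODE differ between lines, while VSRCA and VSRCB stay at 0 and 80.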
It should be understood that the SIMD structure and the matrices taking part in the multiplication are not limited to the above examples; those skilled in the art can adjust them according to the actual situation, and further examples are not listed here.
FIG. 4 shows a schematic diagram of an example apparatus 400 that performs data processing for matrix multiplication according to an embodiment of the present disclosure.
As shown in FIG. 4, the apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure may include an instruction fetch unit 401, a decoding unit 402, a data selection control unit 403, and a read operand unit 404.
The instruction fetch unit 401 may be configured to acquire the matrix multiplication instruction and the data selection instruction. For example, the instruction fetch unit 401 may fetch instructions from a memory such as DDR SDRAM into an instruction register.
The decoding unit 402 may be configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit 401 and to decode them in order to determine the first VGPR storing the first operation matrix and the second VGPR storing the second operation matrix, and to obtain data selection information. The first VGPR and the second VGPR have the same number of paths; the first number of operation data of the first operation matrix corresponds to the first number of paths of the first VGPR, and the second number of operation data of the second operation matrix corresponds to the second number of paths of the second VGPR. The decoding unit 402 splits and interprets the fetched instructions according to a predetermined instruction format to obtain information such as VGPR addresses and operations. In addition, it can obtain the corresponding data selection information from the data selection instruction; this information may be conveyed in the form of, for example, a data selection signal (SVF_MODE) to direct the subsequent data selection in the second operation matrix.
The data selection control unit 403 may be configured to receive the data selection information from the decoding unit 402 and, based on it, to determine the target operation data among the second number of operation data of the second operation matrix. For example, in the data selection control unit 403, the second number of operation data of the second operation matrix may be passed through a selector controlled by the data selection information (for example, SVF_MODE) to select the target operation data.
The read operand unit 404 may be configured to provide the first number of operation data of the first operation matrix, via the first number of paths of the first VGPR, to the first number of multipliers respectively as first multiplication factors, and to provide the target operation data, via the first number of paths of the second VGPR, to the first number of multipliers as second multiplication factors. The read operand unit 404 may copy the target operation data onto those paths of the second VGPR that are connected to the first number of multipliers, so that it is provided to each corresponding multiplier as the second multiplication factor.
According to an embodiment of the present disclosure, the decoding unit 402 may be further configured to determine, based on the decoding result, a third VGPR for storing the result of the matrix multiplication.
According to an embodiment of the present disclosure, as shown in FIG. 4, the apparatus 400 for performing the data processing method for matrix multiplication may further include: a multiplication unit 405, which may be configured to include the first number of multipliers, each of which performs a multiplication on its first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit 406, which may be configured to store the multiplication results into the third VGPR.
FIG. 5 shows a schematic diagram of the operation of an example data selection control unit 403 and read operand unit 404, which are involved in the second half of the data processing according to an embodiment of the present disclosure.
As shown in FIG. 5, based on the data selection control information received from the decoding unit 402 (with SVF_MODE as the data selection signal), the data selection control unit 403 passes the 32 second operation data of matrix B, on the 32 paths of VGPR B, through a 32-to-1 selector and selects the second operation data corresponding to the designated path of VGPR B (that is, the target operation data). The read operand unit 404 then provides the 32 first operation data of matrix A, via the 32 paths of VGPR A, to the first inputs of the 32 multipliers respectively. It provides the target operation data to the second input of the multiplier connected to the designated path of VGPR B, copies the target operation data to the remaining paths of VGPR B, and provides it to the second inputs of the remaining multipliers.
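One way to picture the 32-to-1 selector is as five levels of 2-to-1 selects, each driven by one bit of SVF_MODE. The text does not say how the selector is built, so this tree structure is an assumption chosen only to show why 5 select bits suffice; `mux32` and `read_operands` are hypothetical names for a software model of units 403 and 404.

```python
def mux32(inputs, sel):
    """32-to-1 selector modeled as five levels of 2-to-1 selects.

    Level k halves the candidate list using bit k of `sel`, so after
    five levels exactly one value remains: inputs[sel].
    """
    if len(inputs) != 32 or not 0 <= sel < 32:
        raise ValueError("need 32 inputs and a 5-bit select value")
    level = list(inputs)
    for bit in range(5):
        pick = (sel >> bit) & 1
        level = [level[i + pick] for i in range(0, len(level), 2)]
    return level[0]


def read_operands(vgpr_a, vgpr_b, svf_mode):
    """Return the (first input, second input) pair seen by each multiplier."""
    target = mux32(vgpr_b, svf_mode)       # role of selection unit 403
    return [(a, target) for a in vgpr_a]   # role of read operand unit 404
```

In this model every multiplier receives the same second input, while its first input still comes from its own path of VGPR A.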
FIG. 6 shows a schematic diagram of a data processing device 600 according to an embodiment of the present disclosure.
As shown in FIG. 6, the data processing device 600 according to an embodiment of the present disclosure may include a processor 601 and a memory 602, which may be interconnected through a bus 603.
The processor 601 can perform various actions and processing according to programs or code stored in the memory 602. Specifically, the processor 601 may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the various methods, steps, flows, and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor or any conventional processor, and may be of the X86 architecture, the ARM architecture, or the like.
The memory 602 stores executable instructions which, when executed by the processor 601, implement the data processing method according to the embodiments of the present disclosure. The memory 602 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). It should be noted that the memory of the methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data processing method according to the embodiments of the present disclosure. Similarly, the computer-readable storage medium in the embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both. It should be noted that the memory of the methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method according to the embodiments of the present disclosure.
Embodiments of the present disclosure provide a data processing method, apparatus, device, and storage medium for matrix multiplication. The data processing method for matrix multiplication provided by the embodiments first reads the entire matrix into a VGPR, then selects among the paths of that VGPR and copies the data corresponding to the selected path to the other paths of the VGPR as the multiplication factor for the multiplication in each corresponding thread. This exploits the structure of the matrices, reuses data effectively across threads, reduces the number of data reads, and lowers power consumption.
It should be noted that the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
In general, the various example embodiments of the present disclosure may be implemented in hardware or special-purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while others may be implemented in firmware or software executable by a controller, microprocessor, or other computing device. Where aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flowcharts, or some other graphical representation, it will be understood that the blocks, apparatus, systems, techniques, or methods described herein may be implemented, as non-limiting examples, in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
The example embodiments of the present disclosure described in detail above are illustrative only and not restrictive. Those skilled in the art should understand that various modifications and combinations of these embodiments or their features may be made without departing from the principles and spirit of the present disclosure, and such modifications shall fall within the scope of the present disclosure.

Claims (12)

  1. A data processing method for matrix multiplication, comprising:
    acquiring a matrix multiplication instruction and a data selection instruction;
    determining, based on the matrix multiplication instruction and the data selection instruction, a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix correspond to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix correspond to a second number of paths of the second vector general-purpose register;
    determining, based on the data selection instruction, target operation data among the second number of operation data of the second operation matrix;
    providing the first number of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to the first number of multipliers respectively as first multiplication factors, and providing the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
  2. The method of claim 1, further comprising:
    determining, based on the matrix multiplication instruction, a third vector general-purpose register for storing a result of the matrix multiplication operation;
    performing, by each of the first number of multipliers, a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and
    storing the operation result in the third vector general-purpose register.
  3. The method of claim 1, wherein the matrix multiplication instruction includes the first number of threads, the first number of multipliers correspond to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register;
    wherein determining the target operation data among the second number of operation data of the second operation matrix comprises:
    selecting, based on the data selection instruction, one path from the second number of paths of the second vector general-purpose register, and taking the operation data corresponding to that path as the target operation data;
    wherein providing the target operation data to the first number of multipliers as second multiplication factors comprises:
    for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and
    for the remaining threads among the first number of threads, copying the target operation data to the paths of the remaining threads that are connected to the second vector general-purpose register, and providing it to the respective corresponding multipliers as second multiplication factors.
  4. The method of claim 1, wherein:
    the first operation matrix is a column matrix, and the first number of operation data are column data of the first operation matrix; and
    the second operation matrix is a row matrix, and the second number of operation data are row data of the second operation matrix.
  5. The method of claim 1, wherein acquiring the matrix multiplication instruction and the data selection instruction comprises:
    acquiring the matrix multiplication instruction, the matrix multiplication instruction including a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and
    when the second operation matrix field has a predefined value, acquiring the data selection instruction, the data selection instruction including an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates that specific data among the second number of operation data of the second operation matrix is selected as the target operation data.
  6. An apparatus for performing data processing for matrix multiplication, comprising:
    an instruction fetch unit configured to acquire a matrix multiplication instruction and a data selection instruction;
    a decoding unit configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit and to decode them, so as to determine a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, and to obtain data selection information, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix correspond to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix correspond to a second number of paths of the second vector general-purpose register;
    a data selection control unit configured to receive the data selection information from the decoding unit and, based on the data selection information, to determine target operation data among the second number of operation data of the second operation matrix;
    an operand read unit configured to provide the first number of operation data of the first operation matrix, via the first number of paths of the first vector general-purpose register, to the first number of multipliers respectively as first multiplication factors, and to provide the target operation data, via the first number of paths of the second vector general-purpose register, to the first number of multipliers as second multiplication factors.
  7. The apparatus of claim 6, wherein the decoding unit further determines, based on the decoding result, a third vector general-purpose register for storing a result of the matrix multiplication operation, and the apparatus further comprises:
    a multiplication unit configured to include the first number of multipliers, each of the first number of multipliers performing a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result;
    an operation write-back unit configured to store the operation result in the third vector general-purpose register.
  8. The apparatus of claim 6, wherein:
    the matrix multiplication instruction includes the first number of threads, the first number of multipliers correspond to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register;
    wherein determining the target operation data among the second number of operation data of the second operation matrix comprises:
    selecting, based on the data selection instruction, one path from the second number of paths of the second vector general-purpose register, and taking the operation data corresponding to that path as the target operation data;
    wherein providing the target operation data to the first number of multipliers as second multiplication factors comprises:
    for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and
    for the remaining threads among the first number of threads, copying the target operation data to the paths of the remaining threads that are connected to the second vector general-purpose register, and providing it to the respective corresponding multipliers as second multiplication factors.
  9. The apparatus of claim 6, wherein:
    the first operation matrix is a column matrix, and the first number of operation data are column data of the first operation matrix; and
    the second operation matrix is a row matrix, and the second number of operation data are row data of the second operation matrix.
  10. The apparatus of claim 6, wherein acquiring the matrix multiplication instruction and the data selection instruction comprises:
    acquiring the matrix multiplication instruction, the matrix multiplication instruction including a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and
    when the second operation matrix field has a predefined value, acquiring the data selection instruction, the data selection instruction including an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates that specific data among the second number of operation data of the second operation matrix is selected as the target operation data.
  11. A data processing device, comprising:
    a processor; and
    a memory having stored thereon computer-executable instructions which, when executed by the processor, implement the method of any one of claims 1-5.
  12. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-5.
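
By way of illustration only, and not as part of the claims, the data flow recited in claims 1 to 4 can be sketched in software. The sketch below is a minimal model under stated assumptions: each vector general-purpose register is modelled as a Python list with one element per path, the data selection instruction is reduced to a single integer index `sel`, and the function names `select_target`, `broadcast_multiply`, and `outer_product` are hypothetical and do not appear in the disclosure. It shows one selected element of the second register being copied to every thread and multiplied with that thread's element of the first register.

```python
from typing import List, Sequence


def select_target(second_vgpr: Sequence[float], sel: int, n: int) -> float:
    """Pick the target operation data: the path of the second register
    named by the (hypothetical) data selection index `sel`.
    Only the first `n` paths hold data of the second operation matrix."""
    if not 0 <= sel < n:
        raise IndexError("data selection index outside the second matrix data")
    return second_vgpr[sel]


def broadcast_multiply(first_vgpr: Sequence[float],
                       second_vgpr: Sequence[float],
                       sel: int, m: int, n: int) -> List[float]:
    """Model one matrix multiplication instruction.

    Each of the `m` threads owns one multiplier. Its first factor is the
    thread's own path of the first register; its second factor is the
    selected target datum, copied (broadcast) to every thread.
    """
    if len(first_vgpr) != len(second_vgpr):
        raise ValueError("both registers must have the same number of paths")
    if m > len(first_vgpr) or n > len(second_vgpr):
        raise ValueError("more operation data than register paths")
    target = select_target(second_vgpr, sel, n)
    return [first_vgpr[lane] * target for lane in range(m)]


def outer_product(first_vgpr: Sequence[float],
                  second_vgpr: Sequence[float],
                  m: int, n: int) -> List[List[float]]:
    """Column matrix (m x 1) times row matrix (1 x n), as in claim 4,
    built by issuing one broadcast multiply per element of the row."""
    result = [[0.0] * n for _ in range(m)]
    for j in range(n):
        products = broadcast_multiply(first_vgpr, second_vgpr, j, m, n)
        for i in range(m):
            result[i][j] = products[i]
    return result


if __name__ == "__main__":
    # Four-path registers; the column holds 2 values, the row holds 3.
    column = [1, 2, 0, 0]
    row = [3, 4, 5, 0]
    print(broadcast_multiply(column, row, sel=1, m=2, n=3))
    print(outer_product(column, row, m=2, n=3))
```

In this model the result list returned by `broadcast_multiply` plays the role of the third vector general-purpose register of claim 2. A hardware implementation would perform the per-thread multiplications in parallel; the sequential loops here only mirror the arithmetic.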
PCT/CN2020/122168 2020-09-24 2020-10-20 Data processing method and apparatus for matrix multiplication, and device and medium WO2022062004A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011019241.2A CN112182496B (en) 2020-09-24 2020-09-24 Data processing method and device for matrix multiplication
CN202011019241.2 2020-09-24

Publications (1)

Publication Number Publication Date
WO2022062004A1 (en) 2022-03-31

Family

ID=73943664

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122168 WO2022062004A1 (en) 2020-09-24 2020-10-20 Data processing method and apparatus for matrix multiplication, and device and medium

Country Status (2)

Country Link
CN (1) CN112182496B (en)
WO (1) WO2022062004A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880132A (en) * 2023-02-06 2023-03-31 南京砺算科技有限公司 Graphics processor, matrix multiplication task processing method, device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722669B (en) * 2021-11-03 2022-01-21 海光信息技术股份有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102357A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Bit matrix multiplication
CN110770701A (en) * 2017-06-28 2020-02-07 Arm有限公司 Register based matrix multiplication
CN111079081A (en) * 2019-12-16 2020-04-28 海光信息技术有限公司 Matrix multiplier, data processing method, integrated circuit device and processor
CN111198670A (en) * 2018-11-20 2020-05-26 华为技术有限公司 Method, circuit and SOC for executing matrix multiplication operation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
US9600281B2 (en) * 2010-07-12 2017-03-21 International Business Machines Corporation Matrix multiplication operations using pair-wise load and splat operations
CN106445471B (en) * 2016-10-13 2018-06-01 北京百度网讯科技有限公司 Processor and the method for performing matrix multiplication on a processor
CN111124492B (en) * 2019-12-16 2022-09-20 成都海光微电子技术有限公司 Instruction generation method and device, instruction execution method, processor and electronic equipment

Also Published As

Publication number Publication date
CN112182496A (en) 2021-01-05
CN112182496B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
US9760375B2 (en) Register files for storing data operated on by instructions of multiple widths
WO2022062004A1 (en) Data processing method and apparatus for matrix multiplication, and device and medium
US10261796B2 (en) Processor and method for executing in-memory copy instructions indicating on-chip or off-chip memory
US10678540B2 (en) Arithmetic operation with shift
US10761851B2 (en) Memory apparatus and method for controlling the same
US20070079179A1 (en) Staggered execution stack for vector processing
JP3747936B2 (en) A parallel subword instruction that sends the result to the selected subword location in the data processor's result register
TW201716991A (en) Data processing
WO2023077770A1 (en) Data processing method, apparatus and device, and storage medium
WO2023077769A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
GB2540943A (en) Vector arithmetic instruction
US7668897B2 (en) Result partitioning within SIMD data processing systems
US7234043B2 (en) Decoding predication instructions within a superscaler data processing system
US6609191B1 (en) Method and apparatus for speculative microinstruction pairing
US4812970A (en) Microprogram control system
US20110004743A1 (en) Pipe scheduling for pipelines based on destination register number
US11385897B2 (en) Merge execution unit for microinstructions
US11354126B2 (en) Data processing
EP3729260B1 (en) A multiple-pipeline architecture with special number detection
US20090037702A1 (en) Processor and data load method using the same
US20110238964A1 (en) Data processor
JPS5860355A (en) Information processing device
US20090063808A1 (en) Microprocessor and method of processing data
US8468306B2 (en) Microprocessor and method for deferred store data forwarding for store background data in a system with no memory model restrictions
US11036503B2 (en) Predicate indicator generation for vector processing operations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20954850

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20954850

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 16/10/2023)