WO2022062004A1 - Data processing method and apparatus for matrix multiplication, and device and medium

Info

Publication number: WO2022062004A1
Application number: PCT/CN2020/122168
Authority: WO
Inventors: 陈庆; 华芮; 袁庆
Original assignee: 成都海光集成电路设计有限公司
Priority date: 2020-09-24
Filing date: 2020-10-20
Publication date: 2022-03-31
Also published as: CN112182496A; CN112182496B

Abstract

A data processing method and apparatus for matrix multiplication, and a device and a medium. The data processing method comprises: acquiring a matrix multiplication instruction and a data selection instruction; on the basis of the matrix multiplication instruction and the data selection instruction, determining a first vector general-purpose register that stores a first operation matrix, and a second vector general-purpose register that stores a second operation matrix; on the basis of the data selection instruction, determining target operation data from a second number of pieces of operation data in the second operation matrix; respectively providing a first number of pieces of operation data in the first operation matrix to the first number of multipliers, and taking same as first multiplication factors; and providing the target operation data to the first number of multipliers, and then taking same as a second multiplication factor. By means of the data processing method and apparatus for matrix multiplication, and the device and the medium, data can be effectively reused between threads, thereby reducing the number of readings of data, and reducing the power consumption.

Description

Data processing method, apparatus, device and medium for matrix multiplication

This application claims the priority of Chinese Patent Application No. 202011019241.2 filed on September 24, 2020. The disclosure of the above Chinese patent application is hereby incorporated by reference in its entirety as a part of this application.

Technical Field

The present disclosure relates to the field of data processing, and more particularly, to a data processing method, apparatus, device, and medium for matrix multiplication.

Background

The graphics processing unit (GPU) includes a large number of data processing units. Each data processing unit is a single instruction multiple data stream (SIMD) structure. By executing one instruction, it simultaneously controls multiple threads to perform the same operation. It has a dedicated set of vector general-purpose registers (VGPR) and a large number of parallel execution units, such as multiplication units. Because the SIMD structure has a high degree of parallelism, the SIMD structure is widely used in matrix operations.

At present, performing matrix operations, and matrix multiplication in particular, often requires reading the matrix data multiple times in order to multiply the corresponding elements of the matrices. Moreover, after the matrix data is read into a register, the data carried on all the paths of the register corresponding to the threads is the same, so there is a great deal of redundancy in the data between threads, which also causes additional power consumption. Existing data processing methods can replicate data between threads by executing specific instructions, but those instructions are not suited to matrix operations, and the instructions for operating on data between threads exist as separate instructions independent of the operation instructions, so they remain inefficient for actual data processing.

Therefore, there is a need for a data processing method that is suitable for matrix operations, can effectively reduce the number of readings, and is efficient.

Summary of the Invention

In order to solve the above problem, an embodiment of the present disclosure provides a data processing method for matrix multiplication, including: acquiring a matrix multiplication instruction and a data selection instruction; determining, based on the matrix multiplication instruction and the data selection instruction, a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register; determining, based on the data selection instruction, target operation data among the second number of operation data of the second operation matrix; and providing the first number of operation data of the first operation matrix to a first number of multipliers, respectively, as first multiplication factors via the first number of paths of the first vector general-purpose register, and providing the target operation data to the first number of multipliers as second multiplication factors via the first number of paths of the second vector general-purpose register.

According to an embodiment of the present disclosure, the method further comprises: determining, based on the matrix multiplication instruction, a third vector general-purpose register for storing the result of the matrix multiplication operation; performing, by each of the first number of multipliers, a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and storing the operation results in the third vector general-purpose register.

According to an embodiment of the present disclosure, the matrix multiplication instruction includes the first number of threads, the first number of multipliers corresponds to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register. Determining the target operation data among the second number of operation data of the second operation matrix includes: selecting, based on the data selection instruction, one path from the second number of paths of the second vector general-purpose register, and using the operation data corresponding to that path as the target operation data. Providing the target operation data to the first number of multipliers as second multiplication factors includes: for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and for the remaining threads of the first number of threads, copying the target operation data to the paths of the second vector general-purpose register corresponding to the remaining threads, and providing it to the corresponding multipliers as the second multiplication factors.

According to an embodiment of the present disclosure, the first operation matrix is a column matrix and the first number of operation data is column data of the first operation matrix; and the second operation matrix is a row matrix and the second number of operation data is row data of the second operation matrix.

According to an embodiment of the present disclosure, acquiring a matrix multiplication instruction and a data selection instruction comprises: acquiring the matrix multiplication instruction, which includes a first operation matrix field and a second operation matrix field, wherein the first operation matrix field indicates the first vector general-purpose register storing the first operation matrix; and, when the second operation matrix field is a predefined value, acquiring the data selection instruction, which includes an operation matrix field and a data selection field, wherein the operation matrix field indicates the second vector general-purpose register storing the second operation matrix, and the data selection field indicates which specific data among the second number of operation data of the second operation matrix is selected as the target operation data.

Embodiments of the present disclosure provide an apparatus for performing data processing for matrix multiplication, including: an instruction fetch unit configured to acquire a matrix multiplication instruction and a data selection instruction; a decoding unit configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit and decode them, so as to determine a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix and to obtain data selection information, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register; a data selection control unit configured to receive the data selection information from the decoding unit and, based on the data selection information, determine target operation data among the second number of operation data of the second operation matrix; and a read operand unit configured to provide the first number of operation data of the first operation matrix to a first number of multipliers, respectively, as first multiplication factors via the first number of paths of the first vector general-purpose register, and to provide the target operation data to the first number of multipliers as second multiplication factors via the first number of paths of the second vector general-purpose register.

According to an embodiment of the present disclosure, the decoding unit further determines, based on the decoding result, a third vector general-purpose register for storing the result of the matrix multiplication operation, and the apparatus further includes: a multiplication unit configured to include the first number of multipliers, each of which performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit configured to store the operation results in the third vector general-purpose register.

Embodiments of the present disclosure provide a data processing apparatus including: a processor; and a memory having computer-executable instructions stored thereon, the instructions, when executed by the processor, for implementing the method as described above.

Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, are used to implement the method as described above.

Embodiments of the present disclosure provide a computer program product or computer program including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method according to the embodiment of the present disclosure.

Description of Drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the following briefly introduces the accompanying drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some exemplary embodiments of the present disclosure, and for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative efforts.

FIG. 1 shows a schematic flowchart of a data processing method 100 for matrix multiplication according to an embodiment of the present disclosure.

FIG. 2 shows a schematic diagram of the correspondence between threads performing matrix operations and paths of the VGPR according to an embodiment of the present disclosure.

FIG. 3 shows a schematic diagram of data processing for an example matrix multiplication according to an embodiment of the present disclosure.

FIG. 4 shows a schematic diagram of an example apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure.

FIG. 5 shows a schematic diagram of the operation of an example data selection control unit 403 and a read operand unit 404 involved in the second half of data processing according to an embodiment of the present disclosure.

FIG. 6 shows a schematic diagram of a data processing apparatus 600 according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the example embodiments described herein.

In this specification and the drawings, substantially the same or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second" and the like are only used to distinguish the description, and cannot be understood as indicating or implying relative importance or order.

In this specification and drawings, elements are described in the singular or the plural depending on the embodiment. However, the singular and plural forms have been appropriately chosen for the presented instances only for convenience of explanation and are not intended to limit the disclosure thereto. Thus, the singular may include the plural, and the plural may also include the singular, unless the context clearly dictates otherwise.

The SIMD-structure processing unit of the GPU executes a matrix operation instruction to control multiple threads to perform the same operation at the same time, thereby implementing matrix reading, computation, and result storage. For example, in a SIMD 32 structure, executing one instruction controls the data operations of 32 threads at the same time; each SIMD 32 structure has its own dedicated set of VGPRs, and each VGPR has 32 channels. Table 1 below shows a general matrix operation instruction, which is the conventional instruction used when performing a matrix operation in the SIMD structure. It includes a first operation matrix (VSRCA) field indicating the first VGPR storing the first operation matrix, a second operand (SRCB) field indicating the second VGPR storing the second operand, a destination VGPR (VDST) field indicating the third VGPR for storing the matrix operation result, an operation code (OP) field indicating the specific operation performed by the matrix operation instruction, and an instruction selection (Type) field indicating that the matrix operation instruction is to be executed. A matrix multiplication instruction is obtained by setting the OP field of the matrix operation instruction to the value that indicates the multiplication operation.

Type | OP | VDST | VSRCA | SRCB

Table 1

In the SIMD 32 structure, a matrix multiplication instruction in the general matrix operation instruction format is used to perform the matrix multiplication A*B, where matrix A is a 32×1 column matrix, that is, A(:,1) contains 32 data, and matrix B is a 1×4 row matrix, that is, B(1,:) contains 4 data.

For the above matrix multiplication, the commonly used prior-art approach is to read the matrix data piece by piece from double data rate synchronous dynamic random access memory (DDR SDRAM) into the VGPRs. First, matrix A is read into VGPR 0, and then the four data of matrix B are read into four VGPRs (called VGPR 1, VGPR 2, VGPR 3 and VGPR 4, respectively). In each operation, the data corresponding to the 32 channels of VGPR 0 and the data corresponding to the 32 channels of VGPR 1, VGPR 2, VGPR 3 or VGPR 4 are sent to the corresponding multipliers in the SIMD structure for multiplication. This process reads data from the DDR SDRAM multiple times (5 times in this example), resulting in unnecessary data redundancy and extra power consumption.

Therefore, in order to solve the above problems, the present disclosure proposes reading each operation matrix only once (that is, reading the entire second operation matrix into the second VGPR at one time) and, correspondingly, adding to the original matrix multiplication instruction an instruction part that guides the ordered multiplication of the data within the matrix.
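The difference in read traffic between the two schemes can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: `CountingMemory`, `multiply_conventional`, and `multiply_with_selection` are hypothetical names, the matrix values are arbitrary, and the conventional scheme is modeled as memory that already holds each element of B replicated across 32 lanes.

```python
LANES = 32  # SIMD 32: one VGPR holds 32 lane values


class CountingMemory:
    """Toy stand-in for DDR SDRAM that counts vector loads (hypothetical)."""

    def __init__(self, contents):
        self.contents = contents
        self.loads = 0

    def load(self, name):
        self.loads += 1
        return list(self.contents[name])


def multiply_conventional(mem, n_cols):
    """Prior-art style: one load for A plus one load per element of B."""
    vgpr0 = mem.load("A")
    columns = []
    for j in range(n_cols):
        # Assumption: each element of B arrives replicated across all lanes.
        vgpr_b = mem.load(f"B{j}_replicated")
        columns.append([a * b for a, b in zip(vgpr0, vgpr_b)])
    return columns


def multiply_with_selection(mem, n_cols):
    """Proposed style: load A once and the whole of B once, then select."""
    vgpr_a = mem.load("A")
    vgpr_b = mem.load("B")
    return [[a * vgpr_b[mode] for a in vgpr_a] for mode in range(n_cols)]


a = list(range(1, LANES + 1))  # illustrative A(:,1)
b = [2, 3, 5, 7]               # illustrative B(1,:)
contents = {"A": a, "B": b}
contents.update({f"B{j}_replicated": [b[j]] * LANES for j in range(4)})

old_mem = CountingMemory(contents)
new_mem = CountingMemory(contents)
old_result = multiply_conventional(old_mem, 4)
new_result = multiply_with_selection(new_mem, 4)
print(old_mem.loads, new_mem.loads)  # prints "5 2"
```

Both functions produce the same four result columns; only the number of loads differs, which is the saving the disclosure is aiming at.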

The embodiments of the present disclosure will be further described below with reference to the accompanying drawings.

As shown in FIG. 1, first, in step 101, a matrix multiplication instruction and a data selection instruction are acquired. For example, the matrix multiplication instruction and the data selection instruction may be fetched from memory (e.g., DDR SDRAM).

According to the embodiment of the present disclosure, an instruction part for operating on data between threads is added to the original matrix multiplication instruction, to guide the selection and copying of the data of the second operation matrix that takes part in the operation during matrix multiplication. The added instruction part is called the data selection instruction, as shown in Table 2. The SRCB field, originally used to indicate the second VGPR, serves as the entry point for obtaining the data selection instruction, and the data selection instruction in turn indicates the second VGPR in which the second operation matrix is stored. The data selection instruction may include a second operation matrix (VSRCB) field indicating the second VGPR and a data selection (SVF_MODE) field indicating the data selection. It should be appreciated that, according to embodiments of the present disclosure, the matrix multiplication instruction and the data selection instruction may exist as two separate instructions or as two parts of one instruction. In the following description, the SIMD instruction used by the data processing method 100 for matrix multiplication includes both the matrix multiplication instruction and the data selection instruction.

Reserved | SVF_MODE | VSRCB

Table 2

According to an embodiment of the present disclosure, for example, the length of a SIMD instruction may be 64 bits. The first 32 bits are the matrix operation instruction part, and the definitions and descriptions of its bit fields are shown in Table 3; the last 32 bits are the data selection instruction part, and the definitions and descriptions of its bit fields are shown in Table 4.

Referring to Table 3, in the matrix operation instruction part of this SIMD instruction, bits 0 to 8 are the SRCB field. This field can indicate the second VGPR that stores the second operand (for example, when the SRCB value is equal to 90 or 267); when the SRCB value is equal to a predefined value (for example, 209), the field instead indicates that data selection is entered and the data selection instruction is obtained. Bits 9 to 16 are the VSRCA field. Bits 17 to 24 are the VDST field. Bits 25 to 30 are the OP field, which for a matrix multiplication instruction takes one of a number of specific values. Bit 31 is the Type field, which indicates that the matrix operation instruction is to be executed.

Table 3

Referring to Table 4, in the data selection instruction part of this SIMD instruction, bits 32 to 39 are the VSRCB field. Bits 40 to 44 are the SVF_MODE field; with a length of 5 bits, SVF_MODE can indicate the copy operation of data among 32 threads. The remaining bits are reserved fields of the instruction, which can be kept for the later implementation of other operations.

Table 4
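The bit layout described for Tables 3 and 4 can be exercised with a short Python sketch that packs and unpacks the fields of the 64-bit SIMD instruction. This is an illustration under assumptions rather than the actual encoding logic: the function names are hypothetical, bit 0 is assumed to be the least significant bit, and the opcode value used in the example is made up.

```python
# (low bit, width) of each field, taken from the bit ranges described above.
# Assumption: bit 0 is the least significant bit of the 64-bit word.
FIELDS = {
    "SRCB": (0, 9),
    "VSRCA": (9, 8),
    "VDST": (17, 8),
    "OP": (25, 6),
    "TYPE": (31, 1),
    "VSRCB": (32, 8),
    "SVF_MODE": (40, 5),
}
SRCB_DATA_SELECT = 209  # example predefined value mentioned in the text


def encode(**values):
    """Pack field values into one 64-bit word; reserved bits stay 0."""
    word = 0
    for name, value in values.items():
        low, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def decode(word):
    """Split a 64-bit instruction word back into its named fields."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in FIELDS.items()
    }


def uses_data_selection(word):
    """True when SRCB holds the predefined value that enables data selection."""
    return decode(word)["SRCB"] == SRCB_DATA_SELECT


# OP=5 is a made-up opcode value used only for this illustration.
word = encode(SRCB=SRCB_DATA_SELECT, VSRCA=0, VDST=100, OP=5, TYPE=1,
              VSRCB=80, SVF_MODE=2)
fields = decode(word)
```

Because SVF_MODE is 5 bits wide, `encode` accepts the values 0 to 31, one for each of the 32 paths of a VGPR, and rejects anything larger.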

In step 102, a first VGPR storing the first operation matrix and a second VGPR storing the second operation matrix may be determined based on the matrix multiplication instruction and the data selection instruction.

According to the embodiments of the present disclosure, the address information of the first VGPR storing the first operation matrix and of the second VGPR storing the second operation matrix can be obtained from the VSRCA field in the matrix multiplication instruction and the VSRCB field in the data selection instruction, respectively. The address information may be the index of the VGPR among all VGPRs of the SIMD-structure processing unit.

According to an embodiment of the present disclosure, the first operation matrix may be stored in the first VGPR in advance, and the second operation matrix may be stored in the second VGPR in advance. The first VGPR and the second VGPR have the same number of paths; the first number of operation data of the first operation matrix corresponds to the first number of paths of the first VGPR, and the second number of operation data of the second operation matrix corresponds to the second number of paths of the second VGPR.

According to an embodiment of the present disclosure, with the first operation matrix and the second operation matrix stored in the first VGPR and the second VGPR, respectively, the SIMD-structure processing unit can use the obtained address information of the first VGPR and the second VGPR to perform the multiplication on the first number of operation data of the first operation matrix and the second number of operation data of the second operation matrix, where the first number of operation data corresponds to the first number of paths of the first VGPR and the second number of operation data corresponds to the second number of paths of the second VGPR. For example, in a SIMD 32 structure, both the first VGPR and the second VGPR have 32 channels, so a VGPR can provide up to 32 data of the stored matrix to the operation at the same time.

According to an embodiment of the present disclosure, for example, for the matrix multiplication A*B, the first operation matrix A is a 32×1 column matrix, so the first number of operation data is the 32 column data of A(:,1); the second operation matrix B is a 1×4 row matrix, so the second number of operation data is the 4 row data of B(1,:). The 32 channels of VGPR A, which stores matrix A, correspond to the 32 data of A(:,1), respectively. The first 4 of the 32 channels of VGPR B, which stores matrix B, correspond to the 4 data of B(1,:), respectively, and the other channels of VGPR B do not correspond to any data.
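The correspondence between matrix data and VGPR channels can be sketched in Python as follows. The helper name `fill_vgpr`, the use of `None` for channels that hold no data, and the numeric values are all assumptions made for illustration.

```python
LANES = 32  # number of paths (channels) per VGPR in a SIMD 32 structure


def fill_vgpr(data, lanes=LANES):
    """Place data[i] on path i; paths beyond len(data) hold no data (None)."""
    if len(data) > lanes:
        raise ValueError("more operation data than paths")
    return list(data) + [None] * (lanes - len(data))


# Arbitrary illustrative values for A(:,1) and B(1,:).
column_a = [10 * (i + 1) for i in range(32)]   # A(1,1) .. A(32,1)
row_b = [2, 3, 5, 7]                           # B(1,1) .. B(1,4)

vgpr_a = fill_vgpr(column_a)
vgpr_b = fill_vgpr(row_b)
used_paths_b = [i for i, value in enumerate(vgpr_b) if value is not None]
```

Here every path of `vgpr_a` is occupied, while only paths 0 to 3 of `vgpr_b` carry data of B.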

According to an embodiment of the present disclosure, the matrix multiplication instruction includes a first number of threads, wherein each thread corresponds to a respective path of the first VGPR and a respective path of the second VGPR.

As shown in FIG. 2, the above matrix multiplication instruction includes 32 threads corresponding to the 32 column data of A(:,1). FIG. 2 shows that each thread corresponds to a respective path of the first VGPR and a respective path of the second VGPR; e.g., thread 0 corresponds to path 0 of the first VGPR and path 0 of the second VGPR, thread 1 corresponds to path 1 of the first VGPR and path 1 of the second VGPR, and so on. Taking thread 0 as an example, path 0 of the second VGPR, which corresponds to thread 0, corresponds to the first data B(1,1) of B(1,:). After the data B(1,1) corresponding to path 0 of the second VGPR is copied to the 31 paths of the second VGPR corresponding to the remaining 31 threads, the data corresponding to all 32 paths of the second VGPR is B(1,1).

Next, returning to FIG. 1, in step 103, target operation data may be determined among the second number of operation data of the second operation matrix based on the data selection instruction.

According to an embodiment of the present disclosure, the path of the second VGPR indicated by the SVF_MODE value can be determined based on the data selection instruction, and the operation data corresponding to that path is used as the target operation data. For example, when SVF_MODE=1, the operation data corresponding to channel 1 of the second VGPR is determined as the target operation data.
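Step 103 amounts to indexing the second VGPR by the SVF_MODE value. A minimal Python sketch follows; `select_target` and its `valid_paths` parameter are hypothetical, and rejecting a path that holds no matrix data is an assumption, since the text does not say how such a selection is handled.

```python
def select_target(vgpr_b, svf_mode, valid_paths):
    """Return the operation data on the path of the second VGPR named by SVF_MODE."""
    if not 0 <= svf_mode < len(vgpr_b):
        raise ValueError("SVF_MODE does not name a path of the VGPR")
    if svf_mode >= valid_paths:
        # Assumption: selecting a path that holds no matrix data is an error.
        raise ValueError("selected path holds no operation data")
    return vgpr_b[svf_mode]


vgpr_b = [2, 3, 5, 7] + [0] * 28   # B(1,:) on paths 0..3; the rest are unused
target = select_target(vgpr_b, svf_mode=1, valid_paths=4)
```

With these illustrative values, SVF_MODE=1 picks the data on channel 1, which is B(1,2).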

In step 104, the first number of operation data of the first operation matrix may be provided to the first number of multipliers, respectively, as first multiplication factors via the first number of paths of the first VGPR, and the target operation data may be provided to the first number of multipliers as second multiplication factors via the first number of paths of the second VGPR.

According to an embodiment of the present disclosure, the matrix multiplication instruction may contain a first number of threads, and the first number of multipliers corresponds to the first number of threads. For the thread, among the first number of threads, that corresponds to the selected path of the second VGPR, the target operation data may be provided directly to its corresponding multiplier as the second multiplication factor. For the remaining threads, the target operation data may be copied to the paths of the second VGPR corresponding to those threads and thereby provided to their corresponding multipliers as the second multiplication factors. For example, when SVF_MODE=1, the target operation data corresponding to channel 1 of the second VGPR, which corresponds to thread 1, is provided to the input of the multiplier corresponding to thread 1, and is also copied to the inputs of the multipliers connected to the channels of the second VGPR corresponding to the remaining threads, for the multiplication operation.

According to an embodiment of the present disclosure, a third VGPR for storing the result of the matrix multiplication operation can be determined based on the matrix multiplication instruction; the third VGPR has the same number of paths as the first VGPR and the second VGPR. Each of the first number of multipliers can perform a multiplication operation based on its corresponding first multiplication factor and second multiplication factor and, after obtaining the operation result, store it in the third VGPR via the corresponding one of the first number of paths.
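Step 104 and the write-back to the third VGPR can be pictured as copying the target data to every multiplier input and multiplying lane by lane. The Python sketch below models the register file as a dictionary; the function names, the register indices, and the data values are assumptions chosen for illustration.

```python
LANES = 32


def broadcast(target, lanes=LANES):
    """Copy the target operation data onto every path feeding a multiplier."""
    return [target] * lanes


def multiply_lanes(first_factors, second_factors):
    """One multiplier per thread: thread i multiplies its two path values."""
    return [x * y for x, y in zip(first_factors, second_factors)]


def execute_step(vgprs, vsrca, vsrcb, vdst, svf_mode):
    """Select, broadcast, multiply, and write the results back to VGPR vdst."""
    target = vgprs[vsrcb][svf_mode]
    vgprs[vdst] = multiply_lanes(vgprs[vsrca], broadcast(target))
    return vgprs[vdst]


# Register file modeled as a dict from VGPR index to 32 path values.
vgprs = {0: list(range(1, 33)), 80: [2, 3, 5, 7] + [0] * 28}
result = execute_step(vgprs, vsrca=0, vsrcb=80, vdst=100, svf_mode=1)
```

After the call, the third VGPR (index 100 here) holds one product per path, each formed from that path's value in the first VGPR and the single selected value from the second VGPR.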

As shown in FIG. 3, the SIMD example in this embodiment is a SIMD 32 structure in which each VGPR includes 32 channels, and the matrix multiplication A*B=C is performed under this structure. The first operation matrix A is a 32×1 column matrix and the second operation matrix B is a 1×4 row matrix; correspondingly, the result matrix C is a 32×4 matrix, and the general matrix algorithm involved in the hardware is C(i,j) = A(i,1) × B(1,j), for i = 1, ..., 32 and j = 1, ..., 4.

Each path of VGPR A, which stores matrix A, corresponds to one data of the column vector of matrix A, and the paths of VGPR B, which stores matrix B, correspond to the data of the row vector of matrix B. Each thread multiplies the data corresponding to its path of VGPR A by the target operation data in VGPR B.

The specific operations in this embodiment are as follows:

The 32 channels of VGPR A correspond to the 32 data A(1,1), A(2,1), ..., A(32,1) of the column vector A(:,1) of the matrix A respectively;

When SVF_MODE=0, B(1,1) is copied to the 32 channels of VGPR B (a dashed arrow indicates this process in FIG. 3), the data corresponding to each channel of VGPR A and VGPR B are multiplied accordingly, and the results are stored in VGPR C via its corresponding 32 paths, giving the column vector C(:,1) of matrix C;

By analogy, when SVF_MODE=1, B(1,2) is copied to the 32 channels of VGPR B, and the data corresponding to each channel of VGPR A and VGPR B are multiplied accordingly to obtain the column vector C(:,2) of matrix C;

When SVF_MODE=2, B(1,3) is copied to the 32 channels of VGPR B, and the data corresponding to each channel of VGPR A and VGPR B are multiplied accordingly to obtain the column vector C(:,3) of matrix C;

When SVF_MODE=3, B(1,4) is copied to the 32 channels of VGPR B, and the data corresponding to each channel of VGPR A and VGPR B are multiplied accordingly to obtain the column vector C(:,4) of matrix C, thereby obtaining the whole matrix C.
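The four steps above can be checked end to end with a short Python sketch that builds C one column per SVF_MODE value and compares it with the element-wise products of A(:,1) and B(1,:). The function name and the matrix values are illustrative assumptions.

```python
def outer_product_by_selection(column_a, row_b):
    """Build C = A*B one column per SVF_MODE value, as in the steps above."""
    columns = []
    for svf_mode in range(len(row_b)):
        target = row_b[svf_mode]                        # B(1, svf_mode + 1)
        columns.append([a * target for a in column_a])  # C(:, svf_mode + 1)
    # Turn the list of columns into a row-major 32x4 matrix.
    return [list(row) for row in zip(*columns)]


column_a = list(range(1, 33))   # illustrative A(1,1) .. A(32,1)
row_b = [2, 3, 5, 7]            # illustrative B(1,1) .. B(1,4)
c = outer_product_by_selection(column_a, row_b)

# Reference: each entry of C is the product of one entry of A and one of B,
# written here with 0-based Python indices.
reference = [[column_a[i] * row_b[j] for j in range(4)] for i in range(32)]
print(c == reference, len(c), len(c[0]))  # prints "True 32 4"
```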

Hereinafter, specific operations of the data processing procedure for matrix multiplication according to an embodiment of the present disclosure will be described in detail.

First, the operation matrices for the matrix multiplication are read into their designated VGPRs; then the designated VGPRs are given in the matrix multiplication instruction and the data selection instruction, and the SVF_MODE value is changed from one instruction to the next. As a result, the multiplication of the column matrix by the row matrix can be completed with each operation matrix read only once, without reading and storing data multiple times. For example, part of an assembly instruction sequence for the method described in this disclosure may be written as follows:

buffer_load_b32 v0, v_addr_0;

buffer_load_b32 v80, v_addr_1;

v_mul_u32 v100, v0, v80, SVF_MODE=0;

v_mul_u32 v101, v0, v80, SVF_MODE=1;

v_mul_u32 v102, v0, v80, SVF_MODE=2;

v_mul_u32 v103, v0, v80, SVF_MODE=3;

Specifically, in the above assembly instructions, the first buffer_load_b32 instruction reads matrix A from address v_addr_0 into register v0, and the second buffer_load_b32 instruction reads matrix B from address v_addr_1 into register v80. Registers v0 and v80 can each store 32 data.

Next, matrix operations are performed on the data in registers v0 and v80. Specifically, the instruction "v_mul_u32 v100, v0, v80, SVF_MODE=0" sets the parameters OP, VDST, VSRCA, VSRCB, and SVF_MODE of Table 3 and Table 4 according to the embodiment of the present disclosure. Here, v_mul_u32 is the opcode and indicates a 32-bit multiplication operation, v0 indicates the register of the first operation matrix A, and v80 indicates the register of the second operation matrix B. The target operation data in the second operation matrix B is selected by changing the value of SVF_MODE, and v100/v101/v102/v103 indicate the intermediate registers used to store the products of the first operation matrix A and the target operation data. This realizes, under the SIMD structure, a matrix multiplication in which each operation matrix is read only once.
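The effect of the six assembly instructions can be emulated in a few lines of Python. This is a behavioral sketch only: the toy `memory` and `regs` dictionaries and the data values are assumptions, and masking each product to its low 32 bits is an assumed reading of "32-bit multiplication" that the text does not spell out.

```python
MASK32 = 0xFFFFFFFF  # assumption: v_mul_u32 keeps the low 32 bits of each product
LANES = 32

memory = {
    "v_addr_0": list(range(1, LANES + 1)),         # illustrative matrix A
    "v_addr_1": [2, 3, 5, 7] + [0] * (LANES - 4),  # illustrative matrix B, padded
}
regs = {}


def buffer_load_b32(dst, addr):
    """Load 32 values from the toy memory into register dst."""
    regs[dst] = list(memory[addr])


def v_mul_u32(dst, src_a, src_b, svf_mode):
    """Multiply every lane of src_a by lane svf_mode of src_b."""
    target = regs[src_b][svf_mode]
    regs[dst] = [(a * target) & MASK32 for a in regs[src_a]]


buffer_load_b32("v0", "v_addr_0")
buffer_load_b32("v80", "v_addr_1")
for mode, dst in enumerate(["v100", "v101", "v102", "v103"]):
    v_mul_u32(dst, "v0", "v80", svf_mode=mode)
```

After the loop, `regs["v100"]` through `regs["v103"]` hold the four columns of C, and only two loads from the toy memory were needed.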

It should be understood that the SIMD structure and the matrices involved in the multiplication operation are not limited to the above examples, but can be adjusted by those skilled in the art according to the actual situation; further examples are omitted here.

As shown in FIG. 4, an apparatus 400 for performing data processing for matrix multiplication according to an embodiment of the present disclosure may include an instruction fetch unit 401, a decoding unit 402, a data selection control unit 403, and a read operand unit 404.

Instruction fetch unit 401 may be configured to fetch matrix multiply instructions and data select instructions. For example, instruction fetch unit 401 may fetch instructions from a memory such as DDR SDRAM to an instruction register.

The decoding unit 402 may be configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit 401 and decode them to determine a first VGPR storing a first operation matrix and a second VGPR storing a second operation matrix, wherein the first VGPR and the second VGPR have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first VGPR, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second VGPR. The decoding unit 402 splits and interprets the fetched instruction according to a predetermined instruction format, and obtains information such as the VGPR addresses and the operation. In addition, corresponding data selection information can be obtained based on the data selection instruction; this information can be conveyed by, for example, a data selection signal (SVF_MODE) to guide the subsequent data selection operation in the second operation matrix.
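The splitting step performed by the decoding unit can be sketched in software. In the sketch below, the textual assembly form quoted earlier stands in for the binary instruction format of Table 3 and Table 4, which is not reproduced in this section; the class name DecodedMul, the function name decode, and the regular expression are hypothetical and serve only to show how the fields OP, VDST, VSRCA, VSRCB, and SVF_MODE might be separated.

```python
import re
from typing import NamedTuple


class DecodedMul(NamedTuple):
    """Fields named after the parameters OP, VDST, VSRCA, VSRCB, SVF_MODE."""
    op: str
    vdst: int
    vsrca: int
    vsrcb: int
    svf_mode: int


# Hypothetical text pattern for "v_mul_u32 v100, v0, v80, SVF_MODE=0;".
_PATTERN = re.compile(
    r"^\s*(?P<op>\w+)\s+v(?P<vdst>\d+)\s*,\s*v(?P<vsrca>\d+)\s*,\s*"
    r"v(?P<vsrcb>\d+)\s*,\s*SVF_MODE\s*=\s*(?P<mode>\d+)\s*;?\s*$"
)


def decode(text: str) -> DecodedMul:
    """Split one assembly line into opcode, register indices and selection value."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"unrecognized instruction: {text!r}")
    return DecodedMul(
        m["op"], int(m["vdst"]), int(m["vsrca"]), int(m["vsrcb"]), int(m["mode"])
    )
```

For instance, decoding the line "v_mul_u32 v101, v0, v80, SVF_MODE=1;" with this sketch yields the opcode v_mul_u32, destination register index 101, source register indices 0 and 80, and a selection value of 1.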

The data selection control unit 403 may be configured to receive the data selection information from the decoding unit 402 and, based on the data selection information, determine the target operation data among the second number of operation data of the second operation matrix. For example, in the data selection control unit 403, the second number of operation data of the second operation matrix may be passed through a selector controlled by the data selection information (e.g., SVF_MODE) to select the target operation data.

The read operand unit 404 may be configured to provide the first number of operation data of the first operation matrix to a first number of multipliers via the first number of paths of the first VGPR, respectively, as first multiplication factors, and to provide the target operation data to the first number of multipliers as second multiplication factors via a first number of paths of the second VGPR. The read operand unit 404 may copy the target operation data to those paths of the second VGPR that are connected to the first number of multipliers, so that it is provided to the corresponding multipliers as the second multiplication factors.
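One possible way to picture such a selector is as a tree of 2-to-1 multiplexers, with each level steered by one bit of the data selection signal. This structure is an assumption made for illustration; the disclosure does not state how the selector is built, and the names mux2 and select_target are hypothetical.

```python
def mux2(sel_bit, in0, in1):
    """2-to-1 multiplexer: pass in1 when sel_bit is set, otherwise in0."""
    return in1 if sel_bit else in0


def select_target(lanes, svf_mode):
    """Reduce the lanes to one value with a tree of 2-to-1 multiplexers.

    Level k of the tree is steered by bit k of svf_mode, so the value that
    survives is lanes[svf_mode].
    """
    n = len(lanes)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    if not 0 <= svf_mode < n:
        raise ValueError("SVF_MODE out of range")
    level = list(lanes)
    bit = 0
    while len(level) > 1:
        sel = (svf_mode >> bit) & 1
        level = [mux2(sel, level[i], level[i + 1]) for i in range(0, len(level), 2)]
        bit += 1
    return level[0]
```

With 32 lanes this model uses five levels, one for each bit of a selection value in the range 0 to 31.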

According to an embodiment of the present disclosure, the decoding unit 402 may be further configured to determine a third VGPR for storing the result of the matrix multiplication operation based on the decoding result.

According to an embodiment of the present disclosure, as shown in FIG. 4, the apparatus 400 for performing data processing for matrix multiplication may further include: a multiplication unit 405, which may be configured to include the first number of multipliers, wherein each multiplier performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and an operation write-back unit 406, which may be configured to store the operation result in the third VGPR.

As shown in FIG. 5, based on the data selection control information received from the decoding unit 402 (with SVF_MODE as the data selection signal), the data selection control unit 403 passes the 32 second operation data on the 32 paths of VGPR B through a 32-to-1 selector to select the second operation data (i.e., the target operation data) corresponding to the designated path of VGPR B. The read operand unit 404 then supplies the 32 first operation data of matrix A to the first inputs of the 32 multipliers through the 32 paths of VGPR A, respectively, provides the target operation data to the second input of the multiplier connected to the designated path of VGPR B, and copies the target operation data to the remaining paths of VGPR B so that it is also provided to the second inputs of the remaining multipliers.

As shown in FIG. 6, a data processing device 600 according to an embodiment of the present disclosure may include a processor 601 and a memory 602, which may be interconnected through a bus 603.
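The cooperation of units 403 to 406 described above can be sketched as a chain of small functions, one per unit. This is a behavioural illustration only: the function names, the dictionary used as a register file, and the truncation of each product to 32 bits are assumptions and do not come from the disclosure.

```python
MASK32 = 0xFFFFFFFF  # assumption: each product is truncated to 32 bits


def data_selection_control(vgpr_b, svf_mode):
    """Unit 403: 32-to-1 selection of the target operation data."""
    return vgpr_b[svf_mode]


def read_operands(vgpr_a, vgpr_b, svf_mode, target):
    """Unit 404: pair each lane of A with the target operation data.

    The designated path of B already carries the target; the target is
    copied onto every remaining path before it reaches the multipliers.
    """
    second = list(vgpr_b)  # work on a copy so the source register is untouched
    for lane in range(len(second)):
        if lane != svf_mode:
            second[lane] = target
    return list(zip(vgpr_a, second))


def multiplication_unit(pairs):
    """Unit 405: one multiplier per lane."""
    return [(a * b) & MASK32 for a, b in pairs]


def write_back(register_file, vdst, result):
    """Unit 406: store the lane results in the destination VGPR."""
    register_file[vdst] = result


def execute(register_file, vdst, vsrca, vsrcb, svf_mode):
    """Run one multiplication through the four stages in order."""
    vgpr_a, vgpr_b = register_file[vsrca], register_file[vsrcb]
    target = data_selection_control(vgpr_b, svf_mode)
    pairs = read_operands(vgpr_a, vgpr_b, svf_mode, target)
    write_back(register_file, vdst, multiplication_unit(pairs))
```

Calling execute repeatedly with different svf_mode values reuses the same two source registers, which corresponds to the single read of each operation matrix described in this disclosure.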

The processor 601 can perform various actions and processes according to programs or code stored in the memory 602. Specifically, the processor 601 may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the various methods, steps, processes and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor or any conventional processor, and may have an X86 architecture, an ARM architecture, or the like.

The memory 602 stores executable instructions which, when executed by the processor 601, implement the data processing method according to the embodiment of the present disclosure. The memory 602 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). It should be noted that the memory of the methods described herein is intended to include, but not be limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer-readable storage medium on which computer-executable instructions are stored; when executed by a processor, the instructions implement the data processing method according to the embodiments of the present disclosure. Similarly, the computer-readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the memory of the methods described herein is intended to include, but not be limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method according to the embodiment of the present disclosure.

Embodiments of the present disclosure provide a data processing method, apparatus, device, and storage medium for matrix multiplication. The data processing method for matrix multiplication provided by the embodiments of the present disclosure first reads an entire matrix into a VGPR, then selects among the multiple paths of the VGPR and copies the data corresponding to the selected path to the other paths of the VGPR, where it participates as a multiplication factor in the multiplication operation of the corresponding thread. This makes full use of the characteristics of the matrix, effectively reuses data between threads, reduces the number of data reads, and reduces power consumption.

It should be noted that the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which includes at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.

In general, the various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flowcharts, or using some other graphical representation, it is to be understood that the blocks, apparatus, systems, techniques, or methods described herein may be implemented, as non-limiting examples, in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.

The example embodiments of the present disclosure described in detail above are illustrative only and not restrictive. It should be understood by those skilled in the art that various modifications and combinations of the embodiments or features thereof may be made without departing from the principles and spirit of the present disclosure, and such modifications are intended to fall within the scope of the present disclosure.

Claims

A data processing method for matrix multiplication, comprising:

acquiring a matrix multiplication instruction and a data selection instruction;

determining, based on the matrix multiplication instruction and the data selection instruction, a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register;

determining, based on the data selection instruction, target operation data among the second number of operation data of the second operation matrix; and

providing the first number of operation data of the first operation matrix to a first number of multipliers via the first number of paths of the first vector general-purpose register, respectively, as first multiplication factors, and providing the target operation data to the first number of multipliers as second multiplication factors via a first number of paths of the second vector general-purpose register.
The method of claim 1, further comprising:

determining, based on the matrix multiplication instruction, a third vector general-purpose register for storing the result of the matrix multiplication operation;

performing, by each of the first number of multipliers, a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and

storing the operation result in the third vector general-purpose register.
2. The method of claim 1, wherein the matrix multiplication instruction includes a first number of threads, the first number of multipliers corresponds to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register;

wherein determining the target operation data among the second number of operation data of the second operation matrix comprises:

selecting, based on the data selection instruction, a path from the second number of paths of the second vector general-purpose register, and using the operation data corresponding to the selected path as the target operation data;

wherein providing the target operation data to the first number of multipliers as second multiplication factors comprises:

for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and

for the remaining threads of the first number of threads, copying the target operation data to the paths of the second vector general-purpose register connected to the remaining threads, and providing it to the corresponding multipliers as the second multiplication factors, respectively.
The method of claim 1, wherein,

the first operation matrix is a column matrix, and the first quantity of operation data is column data of the first operation matrix; and

the second operation matrix is a row matrix, and the second quantity of operation data is row data of the second operation matrix.
The method of claim 1, wherein obtaining the matrix multiply instruction and the data selection instruction comprises:

obtaining a matrix multiplication instruction, the matrix multiplication instruction including a first operation matrix field and a second operation matrix field, wherein the first operation matrix field is used to indicate the first vector general-purpose register in which the first operation matrix is stored; and

when the second operation matrix field is a predefined value, acquiring a data selection instruction, the data selection instruction including an operation matrix field and a data selection field, wherein the operation matrix field is used to indicate the second vector general-purpose register in which the second operation matrix is stored, and the data selection field is used to indicate which specific data among the second number of operation data of the second operation matrix is selected as the target operation data.
An apparatus for performing data processing for matrix multiplication, comprising:

an instruction fetch unit configured to acquire a matrix multiplication instruction and a data selection instruction;

a decoding unit configured to receive the matrix multiplication instruction and the data selection instruction from the instruction fetch unit and decode them to determine a first vector general-purpose register storing a first operation matrix and a second vector general-purpose register storing a second operation matrix, and to obtain data selection information, wherein the first vector general-purpose register and the second vector general-purpose register have the same number of paths, a first number of operation data of the first operation matrix corresponds to a first number of paths of the first vector general-purpose register, and a second number of operation data of the second operation matrix corresponds to a second number of paths of the second vector general-purpose register;

a data selection control unit configured to receive the data selection information from the decoding unit and, based on the data selection information, determine target operation data among the second number of operation data of the second operation matrix; and

a read operand unit configured to provide the first number of operation data of the first operation matrix to a first number of multipliers via the first number of paths of the first vector general-purpose register, respectively, as first multiplication factors, and to provide the target operation data to the first number of multipliers as second multiplication factors via a first number of paths of the second vector general-purpose register.
The apparatus according to claim 6, wherein the decoding unit further determines, based on the decoding result, a third vector general-purpose register for storing the result of the matrix multiplication operation, and the apparatus further comprises:

a multiplication unit configured to include the first number of multipliers, wherein each of the first number of multipliers performs a multiplication operation based on its corresponding first multiplication factor and second multiplication factor to obtain an operation result; and

an operation write-back unit configured to store the operation result in the third vector general-purpose register.
The apparatus of claim 6, wherein:

the matrix multiplication instruction includes a first number of threads, the first number of multipliers corresponds to the first number of threads, and each of the first number of threads corresponds to a respective path of the first vector general-purpose register and a respective path of the second vector general-purpose register;

wherein determining the target operation data among the second number of operation data of the second operation matrix comprises:

selecting, based on the data selection instruction, a path from the second number of paths of the second vector general-purpose register, and using the operation data corresponding to the selected path as the target operation data;

wherein providing the target operation data to the first number of multipliers as second multiplication factors comprises:

for the thread, among the first number of threads, that corresponds to the selected path of the second vector general-purpose register, providing the target operation data to its corresponding multiplier as the second multiplication factor; and

for the remaining threads of the first number of threads, copying the target operation data to the paths of the second vector general-purpose register connected to the remaining threads, and providing it to the corresponding multipliers as the second multiplication factors, respectively.
The apparatus of claim 6, wherein:

the first operation matrix is a column matrix, and the first quantity of operation data is column data of the first operation matrix; and

the second operation matrix is a row matrix, and the second quantity of operation data is row data of the second operation matrix.
The apparatus of claim 6, wherein obtaining the matrix multiply instruction and the data selection instruction comprises:

obtaining a matrix multiplication instruction, the matrix multiplication instruction including a first operation matrix field and a second operation matrix field, wherein the first operation matrix field is used to indicate the first vector general-purpose register in which the first operation matrix is stored; and

when the second operation matrix field is a predefined value, acquiring a data selection instruction, the data selection instruction including an operation matrix field and a data selection field, wherein the operation matrix field is used to indicate the second vector general-purpose register in which the second operation matrix is stored, and the data selection field is used to indicate which specific data among the second number of operation data of the second operation matrix is selected as the target operation data.
A data processing device comprising:

a processor; and

a memory having stored thereon computer-executable instructions which, when executed by the processor, are used to implement the method of any of claims 1-5.
A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, are used to implement the method of any of claims 1-5.