CN116149602A - Data processing method, device, electronic equipment and storage medium - Google Patents

Data processing method, device, electronic equipment and storage medium

Info

Publication number
CN116149602A
CN116149602A (application number CN202211530960.XA)
Authority
CN
China
Prior art keywords
data
data block
ith
multiply
shared memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211530960.XA
Other languages
Chinese (zh)
Inventor
高娅
卜景德
赵红朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202211530960.XA
Publication of CN116149602A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present application relates to a data processing method, a data processing apparatus, an electronic device, and a storage medium. The method comprises the following steps: based on an ith data read-write instruction, reading an ith data block from global data and writing the ith data block into a shared memory, wherein the data read-write instruction is an assembly instruction for performing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N; reading the ith data block from the shared memory into a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block; obtaining a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks; and writing the GEMM operation result into the global memory. The method provided by the embodiments of the present disclosure can improve the GEMM optimization effect.

Description

Data processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.
Background
With the rapid development of science and technology, GEMM (General Matrix Multiplication) plays a very important role in computing applications. It is both compute-intensive and memory-intensive, and places very high demands on the computing capability, memory bandwidth, and latency of a processor.
An optimized implementation of GEMM can be described with reference to fig. 1. As shown in fig. 1, GEMM generally divides the global data into a number of data blocks and successively moves the data blocks into a shared memory.
When the global data is moved to the shared memory, it must first be read into registers, and the data in the registers is then written back to the shared memory. Limited by the number of registers in the processor and by the available bandwidth, this gives GEMM a high operation latency and low efficiency, so the optimization effect of GEMM is not obvious enough.
Disclosure of Invention
In view of this, the present application provides a data processing method, apparatus, electronic device, and storage medium that can improve GEMM operation efficiency and reduce GEMM operation delay.
In a first aspect, the present application provides a data processing method, the method comprising:
based on an ith data read-write instruction, reading an ith data block from global data, and writing the ith data block into a shared memory, wherein the data read-write instruction is an assembly instruction for executing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N;
Reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block;
obtaining a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks;
and writing the GEMM operation result into a global memory.
According to the data processing method provided by the embodiments of the present disclosure, the global data in the global memory can be read directly into the shared memory through an assembly instruction, without passing through registers. This accelerates data reads and writes, thereby reducing operation latency and improving GEMM operation efficiency, so the GEMM optimization effect is remarkable.
In one embodiment, the method further comprises:
and in response to a shared memory setting operation for each data block, writing a write address corresponding to each data block in the shared memory into a target register.
According to the data processing method provided by the embodiments of the present disclosure, the write address of each data block of the global data in the shared memory can be written into the target register. When data processing is performed, the write address of each data block in the shared memory can then be read from the target register through the data read-write instruction, and the global data in the global memory can be read directly into the shared memory based on the write address that was read. This accelerates data reads and writes, thereby reducing operation latency and improving GEMM operation efficiency, so the GEMM optimization effect is obvious.
In one embodiment, the i-th data read-write instruction includes a read address of the i-th data block in the global memory, and the reading the i-th data block from the global data based on the i-th data read-write instruction, and writing the i-th data block into the shared memory includes:
reading the ith data block from the global data according to the read address of the ith data block in the global memory in the ith data read-write instruction;
and reading a write address corresponding to the ith data block in the shared memory from the target register according to the ith data read-write instruction, and writing the ith data block into the shared memory based on the write address corresponding to the ith data block in the shared memory.
According to the data processing method provided by the embodiments of the present disclosure, the write address of each data block in the shared memory can be read from the target register through the data read-write instruction, and the global data in the global memory can be read directly into the shared memory based on the write address that was read. This accelerates data reads and writes, thereby reducing operation latency and improving GEMM operation efficiency, so the GEMM optimization effect is remarkable.
In one embodiment, the reading the ith data block from the global data and writing the ith data block into the shared memory includes:
reading the ith data block from the global data, and writing the ith data block into a first cache area of the shared memory;
reading the (i+1) th data block from the global data, and writing the (i+1) th data block into a second cache area of the shared memory;
and the reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block comprises:
and reading the ith data block from the first cache area of the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block.
According to the data processing method provided by the embodiments of the present disclosure, data can be staged alternately through the first cache area and the second cache area; that is, the (i+1)th data block is moved into the shared memory before the multiply-accumulate operation is performed on the ith data block. This hides part of the instruction latency and improves GEMM operation efficiency, so the GEMM optimization effect is obvious.
In one embodiment, the data block includes m rows and k columns of elements, the vector register includes a first vector register and a second vector register, the reading the ith data block from the first buffer area of the shared memory into the vector register to perform multiply-accumulate processing, to obtain a multiply-accumulate operation result of the ith data block, including:
reading the j-th channel data of the i-th data block from a first cache area of the shared memory to the first vector register, wherein j is an integer less than or equal to k;
when j is smaller than k, reading the j+1th channel data of the ith data block from the first cache area of the shared memory to the second vector register;
performing multiply-accumulate operation on the j-th channel data in the first vector register;
when j+1 is smaller than k, reading the j+2 channel data of the ith data block from a first cache area of the shared memory into the first vector register;
performing multiply-accumulate operation on the j+1th channel data in the second vector register;
And when j+2 is smaller than k, taking j+2 as new j, repeating the step of performing multiply-accumulate operation on the j-th channel data in the first vector register until j+2 is equal to k, and performing multiply-accumulate operation on the j+2-th channel data in the second vector register to obtain a multiply-accumulate operation result of the i-th data block.
According to the data processing method provided by the embodiments of the present disclosure, performing data prefetch operations both outside and inside the multiply-accumulate loop can hide the latency of some instructions, reducing the latency of the GEMM operation and improving GEMM operation efficiency.
In one embodiment, the size of the data block is positively correlated with the size of the multiply-accumulate operation.
According to the data processing method provided by the embodiments of the present disclosure, dividing the data blocks reasonably can reduce the number of data reads, thereby improving GEMM operation efficiency and the optimization effect of the GEMM operation.
In a second aspect, the present application also provides a data processing apparatus, the apparatus comprising:
the first reading module is used for reading an ith data block from global data based on an ith data reading and writing instruction and writing the ith data block into a shared memory, wherein the data reading and writing instruction is an assembly instruction for executing data reading and writing processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N;
The second reading module is used for reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block;
the operation module is used for obtaining a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks;
and the first writing module is used for writing the GEMM operation result into the global memory.
In one embodiment, the apparatus further comprises:
and the second writing module is used for responding to the shared memory setting operation for each data block and writing the corresponding writing address of each data block in the shared memory into a target register.
In one embodiment, the i-th data read-write instruction includes a read address of the i-th data block in the global memory, and the first read module is further configured to:
reading the ith data block from the global data according to the read address of the ith data block in the global memory in the ith data read-write instruction;
and reading a write address corresponding to the ith data block in the shared memory from the target register according to the ith data read-write instruction, and writing the ith data block into the shared memory based on the write address corresponding to the ith data block in the shared memory.
In one embodiment, the first reading module is further configured to:
reading the ith data block from the global data, and writing the ith data block into a first cache area of the shared memory;
reading the (i+1) th data block from the global data, and writing the (i+1) th data block into a second cache area of the shared memory;
and reading the ith data block from the first cache area of the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block.
In one embodiment, the data block includes m rows and k columns of elements, the vector register includes a first vector register and a second vector register, and the first read module is further configured to:
reading the j-th channel data of the i-th data block from a first cache area of the shared memory to the first vector register, wherein j is an integer less than or equal to k;
when j is smaller than k, reading the j+1th channel data of the ith data block from the first cache area of the shared memory to the second vector register;
performing multiply-accumulate operation on the j-th channel data in the first vector register;
When j+1 is smaller than k, reading the j+2 channel data of the ith data block from a first cache area of the shared memory into the first vector register;
performing multiply-accumulate operation on the j+1th channel data in the second vector register;
and when j+2 is smaller than k, taking j+2 as new j, repeating the step of performing multiply-accumulate operation on the j-th channel data in the first vector register until j+2 is equal to k, and performing multiply-accumulate operation on the j+2-th channel data in the second vector register to obtain a multiply-accumulate operation result of the i-th data block.
In one embodiment, the size of the data block is positively correlated with the size of the multiply-accumulate operation.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, the memory stores a computer program, and the processor implements any of the data processing methods described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements any of the data processing methods described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements any of the data processing methods described above.
According to the data processing method, apparatus, electronic device, and storage medium described above, the ith data block is read from the global data based on the ith data read-write instruction, and the ith data block is written into the shared memory, wherein the data read-write instruction is an assembly instruction for performing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N. The ith data block is read from the shared memory into a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block; a general matrix multiplication (GEMM) operation result for the global data is obtained according to the multiply-accumulate operation results corresponding to the N data blocks; and the GEMM operation result is written into the global memory. According to the data processing method, apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, the global data in the global memory can be read directly into the shared memory through an assembly instruction, without passing through registers, which accelerates data reads and writes, reduces operation latency, and improves GEMM operation efficiency, so the GEMM optimization effect is remarkable.
Drawings
FIG. 1 is a flow chart of a method of optimizing the implementation of GEMM in one embodiment;
FIG. 2 is a flow chart of a method of data processing in one embodiment;
FIG. 3 is a flow chart of a method of data processing in one embodiment;
FIG. 4 is a flow chart of a method of data processing in one embodiment;
FIG. 5 is a flow chart of a method of data processing in one embodiment;
FIG. 6 is a schematic diagram of a data processing method in one embodiment;
FIG. 7 is a block diagram of a data processing apparatus in one embodiment;
FIG. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Fig. 2 is a flowchart illustrating a data processing method according to an exemplary embodiment. The method is described here as applied to a terminal by way of illustration; it will be understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the terminal and the server.
In this embodiment, the method includes the steps of:
In step 202, based on the ith data read-write instruction, the ith data block is read from the global data, and the ith data block is written into the shared memory, wherein the data read-write instruction is an assembly instruction for performing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N.
In the embodiments of the present disclosure, the global data stored in the global memory may be divided into N data blocks in advance, and the corresponding assembly instructions (data read-write instructions) may be hand-written in advance based on the read addresses of the data blocks in the global memory. For any data block, the assembly instruction corresponding to that data block is used to read the data block from the global memory and write it into the shared memory; that is, based on the ith data read-write instruction, the ith data block corresponding to that instruction can be read from the global data and written into the shared memory.
In step 204, the ith data block is read from the shared memory into the vector register for multiply-accumulate processing, so as to obtain the multiply-accumulate operation result of the ith data block.
In the embodiments of the present disclosure, after the ith data block has been written into the shared memory, the data can be read from the shared memory into the vector register, and the corresponding multiply-accumulate processing is performed in the vector register; after the multiply-accumulate processing of all the data in the ith data block is completed, the corresponding multiply-accumulate operation result is obtained.
In step 206, according to the multiply-accumulate operation result corresponding to the N data blocks, obtaining a GEMM operation result for the global data;
in step 208, the GEMM operation is written to global memory.
In the embodiment of the disclosure, a process of reading and writing data blocks from a global memory to a shared memory and reading the data blocks from the shared memory to a vector register to perform multiply-accumulate processing may be performed based on a data read-write instruction corresponding to each data block, and after multiply-accumulate processing of N data blocks is completed, a GEMM operation result of the global data is obtained, and the GEMM operation result is written into the global memory.
According to the data processing method provided by the embodiments of the present disclosure, the ith data block is read from the global data based on the ith data read-write instruction, and the ith data block is written into the shared memory, wherein the data read-write instruction is an assembly instruction for performing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N. The ith data block is read from the shared memory into a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block; a general matrix multiplication (GEMM) operation result for the global data is obtained according to the multiply-accumulate operation results corresponding to the N data blocks; and the GEMM operation result is written into the global memory. In this way, the global data in the global memory can be read directly into the shared memory through an assembly instruction, without passing through registers, which accelerates data reads and writes, reduces operation latency, and improves GEMM operation efficiency, so the GEMM optimization effect is remarkable.
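For illustration only, the register-bypassing copy described above can be sketched in CUDA-style C++ using the asynchronous-copy mechanism (cp.async via __pipeline_memcpy_async). The patent itself targets a domestic heterogeneous GPU platform and its own assembly instruction, so this is an analogue under assumptions: the 64×8 block shape and 256 threads come from the embodiments below, while the kernel and buffer names are invented for the sketch.

    // Sketch: copy one 64x8 data block from global memory straight into shared
    // memory, bypassing registers, using cp.async. Assumes 256 threads per block,
    // each moving 2 floats (8 bytes), with the shared write offset
    // (tid & 31)*2 + (tid/32)*64 from the embodiment described later.
    #include <cuda_pipeline.h>

    __global__ void copy_block_direct(const float* __restrict__ g_data,
                                      float* __restrict__ out) {
        __shared__ float s_block[64 * 8];
        int tid = threadIdx.x;                                  // 0..255
        int off = (tid & 31) * 2 + (tid / 32) * 64;             // shared write address
        // 8-byte copy: global -> shared without staging in a vector register
        __pipeline_memcpy_async(&s_block[off],
                                &g_data[blockIdx.x * 64 * 8 + off], 8);
        __pipeline_commit();
        __pipeline_wait_prior(0);                               // block has arrived
        __syncthreads();
        // ... the multiply-accumulate on s_block would follow here ...
        out[blockIdx.x * 512 + off] = s_block[off];             // placeholder use
    }
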
In an exemplary embodiment, the data processing method further includes:
and responding to the shared memory setting operation for each data block, and writing the corresponding write address of each data block in the shared memory into the target register.
In the embodiments of the present disclosure, take as an example a multiply-accumulate operation of scale 4×4, a thread group (work group) of 16×16, a block size of 64×64, and a loop depth of 8; that is, the data block selected for global data matrix A is of size 64×8, and the data block selected for global data matrix B is also of size 64×8. Each thread then reads 64×8/256 = 2 elements, and the write address of a data block in the shared memory is (thread_num & 31)×2 + (thread_num/32)×64, where thread_num is the number of the thread and the write address is the address at which the data is written into the shared memory.
After the write address of each data block in the shared memory is obtained, the write address of each data block in the shared memory can be written into the target register. When the read-write operation of the ith data block is performed based on the ith data read-write instruction, the write address of the ith data block in the shared memory can then be read from the target register, and the ith data block read from the global memory is written directly into the shared memory according to that write address. The target register is a register that is set before each global data read and stores the shared-memory write address; for example, the target register may be the m0 register.
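As a concrete check on this address arithmetic, the following CUDA-style helper (a sketch; the function name is assumed) computes the shared-memory write offset for each of the 256 threads in the example above:

    // write address = (thread_num & 31) * 2 + (thread_num / 32) * 64
    // 32 thread positions * 2 elements cover one 64-element row; 256/32 = 8 rows
    // match the loop depth of 8, so the 64x8 block is tiled exactly once.
    __device__ inline int shared_write_offset(int thread_num) {
        int col = (thread_num & 31) * 2;   // column within a 64-element row
        int row = thread_num / 32;         // one of the 8 rows
        return col + row * 64;
    }
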
According to the data processing method provided by the embodiments of the present disclosure, the write address of each data block of the global data in the shared memory can be written into the target register. When data processing is performed, the write address of each data block in the shared memory can then be read from the target register through the data read-write instruction, and the global data in the global memory can be read directly into the shared memory based on the write address that was read. This accelerates data reads and writes, thereby reducing operation latency and improving GEMM operation efficiency, so the GEMM optimization effect is obvious.
In an exemplary embodiment, referring to fig. 3, the ith data read-write instruction includes the read address of the ith data block in the global memory, and step 202, in which the ith data block is read from the global data based on the ith data read-write instruction and written into the shared memory, may be implemented by the following steps:
in step 302, according to the read address of the ith data block in the global memory in the ith data read-write instruction, the ith data block is read from the global data;
in step 304, according to the ith data read/write command, the write address corresponding to the ith data block in the shared memory is read from the target register, and the ith data block is written into the shared memory based on the write address corresponding to the ith data block in the shared memory.
In the embodiments of the present disclosure, again take a multiply-accumulate operation of scale 4×4, a thread group (work group) of 16×16, a block size of 64×64, and a loop depth of 8 as an example; that is, the data block selected for global data matrix A is of size 64×8, and the data block selected for global data matrix B is also of size 64×8. Each thread reads 64×8/256 = 2 elements, and the read address corresponding to a data block of matrix A is blockx×64 + (thread_num & 31)×2 + (thread_num/32)×width1, where blockx is the total number of blocks of global data matrix A and width1 is the span between the channels of global data matrix A. Similarly, for global data matrix B, the read address corresponding to a data block is blocky×64 + (thread_num & 31)×2 + (thread_num/32)×width2, where blocky is the total number of blocks of global data matrix B and width2 is the span between the channels of global data matrix B.
The ith data read-write instruction includes a data read instruction, and the data read instruction includes the read address of the ith data block in the global memory. The configuration of the shared-memory write address is appended after the data read instruction, and the hardware automatically reads the address at which the shared memory is to be written (the write address) from the target register. The ith data block is read from the global data based on the read address of the ith data block in the global memory contained in the ith data read-write instruction, and the ith data block is written directly into the shared memory according to the write address that was read.
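The read-address formulas above can likewise be written out as a small sketch (CUDA-style C++; the helper names are assumed, and blockx/blocky, width1/width2 carry the meanings given in the text):

    // Global read addresses per the embodiment:
    //   A: blockx*64 + (thread_num & 31)*2 + (thread_num/32)*width1
    //   B: blocky*64 + (thread_num & 31)*2 + (thread_num/32)*width2
    __device__ inline int read_offset_A(int blockx, int thread_num, int width1) {
        return blockx * 64 + (thread_num & 31) * 2 + (thread_num / 32) * width1;
    }
    __device__ inline int read_offset_B(int blocky, int thread_num, int width2) {
        return blocky * 64 + (thread_num & 31) * 2 + (thread_num / 32) * width2;
    }
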
According to the data processing method provided by the embodiments of the present disclosure, the write address of each data block in the shared memory can be read from the target register through the data read-write instruction, and the global data in the global memory can be read directly into the shared memory based on the write address that was read. This accelerates data reads and writes, thereby reducing operation latency and improving GEMM operation efficiency, so the GEMM optimization effect is remarkable.
In an exemplary embodiment, referring to fig. 4, in step 202, the i-th data block is read from the global data, and the i-th data block is written into the shared memory, which may be implemented by the following steps:
in step 402, the ith data block is read from the global data, and written into the first cache area of the shared memory;
in step 404, the (i+1) th data block is read from the global data, and the (i+1) th data block is written into the second cache region of the shared memory;
In this embodiment, step 204, in which the ith data block is read from the shared memory into the vector register for multiply-accumulate processing to obtain the multiply-accumulate operation result of the ith data block, may be implemented by the following steps:
in step 406, the ith data block is read from the first buffer area of the shared memory to the vector register for multiply-accumulate processing, so as to obtain the multiply-accumulate operation result of the ith data block.
In the embodiment of the disclosure, the ith data block can be read from the global data and written into the first cache area of the shared memory, then the (i+1) th data block is continuously read from the global memory and written into the second cache area of the shared memory, and the ith data block is read from the first cache area of the shared memory into the vector register for multiply-accumulate processing, so as to obtain the multiply-accumulate operation result of the ith data block.
In the embodiments of the present disclosure, two cache areas are set in the shared memory. The ith data block is placed in the first cache area by the ith data read-write instruction, and before the multiply-accumulate operation on the ith data block is executed, the next (i+1)th data block is moved into the second cache area by the (i+1)th data read-write instruction. Moving global data takes a relatively large number of cycles; that is, if the data were used immediately after the move completes, a relatively large delay would be incurred. The instruction that moves the data is therefore placed before the loop corresponding to the multiply-accumulate operation, which ensures that the data transfer is complete when the loop ends, and the intermediate operation steps can hide part of the instruction latency.
According to the data processing method provided by the embodiments of the present disclosure, data can be staged alternately through the first cache area and the second cache area; that is, the (i+1)th data block is moved into the shared memory before the multiply-accumulate operation is performed on the ith data block, which hides part of the instruction latency and improves GEMM operation efficiency, so the GEMM optimization effect is obvious.
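A minimal sketch of this double-buffered pipeline in CUDA-style C++ follows. It is illustrative only: the patent's data read-write instruction is replaced here by cp.async, a per-thread reduction stands in for the full 4×4 multiply-accumulate, and the sizes (64×8 blocks, 256 threads) are the example values from the embodiments above.

    #include <cuda_pipeline.h>

    __global__ void gemm_double_buffer(const float* __restrict__ g_data,
                                       float* __restrict__ result, int N) {
        __shared__ float buf[2][64 * 8];          // first and second cache areas
        int tid = threadIdx.x;                    // 256 threads assumed
        int off = (tid & 31) * 2 + (tid / 32) * 64;
        float acc = 0.f;

        // stage block 0 into the first cache area before the loop
        __pipeline_memcpy_async(&buf[0][off], &g_data[off], 8);
        __pipeline_commit();

        for (int i = 0; i < N; ++i) {
            int cur = i & 1;
            if (i + 1 < N) {                      // move block i+1 into the
                __pipeline_memcpy_async(          // other cache area early
                    &buf[cur ^ 1][off], &g_data[(i + 1) * 512 + off], 8);
                __pipeline_commit();
            }
            __pipeline_wait_prior(i + 1 < N ? 1 : 0);   // block i has arrived
            __syncthreads();
            acc += buf[cur][off] + buf[cur][off + 1];   // stand-in for the MAC
            __syncthreads();                      // block i no longer in use
        }
        result[blockIdx.x * blockDim.x + tid] = acc;
    }
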
In an exemplary implementation, the data block includes m rows and k columns of elements, that is, the data block is m×k, the vector register includes a first vector register and a second vector register, and referring to fig. 5, in step 406, the i-th data block is read from the first buffer area of the shared memory into the vector register to perform multiply-accumulate processing, to obtain a multiply-accumulate operation result of the i-th data block, where the multiply-accumulate operation result includes:
in step 502, the j-th channel data of the i-th data block is read from the first buffer area of the shared memory into the first vector register, where j is an integer less than or equal to k;
in step 504, if j is less than k, reading the j+1th channel data of the i-th data block from the first cache region of the shared memory into the second vector register;
in step 506, multiply-accumulate the j-th channel data in the first vector register;
in step 508, in the case where j+1 is less than k, reading the j+2-th channel data of the i-th data block from the first cache region of the shared memory into the first vector register;
in step 510, multiply-accumulate the j+1th channel data in the second vector register;
In step 512, when j+2 is smaller than k, taking j+2 as new j, repeating the step of performing multiply-accumulate operation on the j-th channel data in the first vector register until j+2 is equal to k, and performing multiply-accumulate operation on the j+2-th channel data in the second vector register to obtain the multiply-accumulate operation result of the i-th data block.
In this embodiment of the present disclosure, the j-th channel data of the i-th data block may be pre-read from the first buffer area into the first vector register, and in the case where the j-th channel is not the last channel of the i-th data block, before the multiply-accumulate operation of the j-th channel data is calculated, an instruction to read the j+1-th channel data of the i-th data block from the first buffer area in the shared memory into the second vector register is executed, and then the multiply-accumulate operation is performed on the j-th channel data in the first vector register.
Because data reads have latency, the data only arrives in the register after some number of instruction cycles. Therefore, the instruction that reads some data (reading the (j+1)th channel data into the second vector register) is issued before the multiply-accumulate operation on the jth channel data is performed, rather than the data being computed on immediately, so the latency of some instructions can be hidden.
In the embodiments of the present disclosure, the foregoing step 404 may be executed after step 502, inside the multiply-accumulate loop. Considering the latency of data reads, the multiply-accumulate operation is performed on the previously prefetched jth channel data rather than on the data just read (the (i+1)th data block), until all the data in the first cache area of the shared memory has been computed; the data fetched into the second cache area of the shared memory is then processed. In this way, the read latency of the global data can be reduced.
If j+1 is less than k, it may be determined that there is more data not read into the vector registers, so the j+2-th channel data may be read into the first vector register, and the j+1-th channel data in the second vector register may be subjected to multiply-accumulate operation, and j+2 is continuously taken as a new j, and the process returns to step 506.
When j+2 equals k, all the channel data of the ith data block have been read into the vector registers. The multiply-accumulate operation is then performed on the kth channel data in the second vector register, the multiply-accumulate operation result of the ith data block is obtained, and the loop ends.
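The two-register ping-pong over the k channels can be sketched as follows (CUDA-style C++). The real embodiment performs a 4×4 multiply-accumulate per thread; the sketch simplifies that to one scalar MAC per channel so the prefetch structure of steps 502 to 512 stays visible, and the 64-element channel stride is the example value assumed above.

    // Channel j is computed while channel j+1 (then j+2) is being loaded,
    // alternating between a "first" and a "second" register pair.
    __device__ float mac_ping_pong(const float* sA, const float* sB,
                                   int k, int lane) {
        float a0 = sA[lane], b0 = sB[lane];       // pre-read channel 0
        float a1 = 0.f, b1 = 0.f, acc = 0.f;
        for (int j = 0; j < k; j += 2) {
            if (j + 1 < k) {                      // prefetch channel j+1
                a1 = sA[(j + 1) * 64 + lane];     // into the second registers
                b1 = sB[(j + 1) * 64 + lane];
            }
            acc += a0 * b0;                       // MAC on channel j
            if (j + 2 < k) {                      // prefetch channel j+2
                a0 = sA[(j + 2) * 64 + lane];     // back into the first registers
                b0 = sB[(j + 2) * 64 + lane];
            }
            if (j + 1 < k)
                acc += a1 * b1;                   // MAC on channel j+1
        }
        return acc;
    }
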
According to the data processing method provided by the embodiments of the present disclosure, performing data prefetch operations both outside and inside the multiply-accumulate loop can hide the latency of some instructions, reducing the latency of the GEMM operation and improving GEMM operation efficiency.
In an exemplary embodiment, the size of the data block is positively correlated with the size of the multiply-accumulate operation.
In the embodiments of the present disclosure, the data block size can be selected according to the size of the global data. The size of the data block is positively correlated with the scale of the multiply-accumulate operation; that is, the larger the scale of the multiply-accumulate operation, the larger the data block division that can be selected. The size of the data block is related to the scale of the multiply-accumulate operation computed by each thread, i.e., data block size = number of threads × multiply-accumulate scale.
For example, if matrix A of the GEMM operation is of size M×K and matrix B is of size K×N, the data block tiles the M×N output, and its size may be taken as the product of the work group size and the scale of the multiply-accumulate operation. Multiply-accumulate operations come in various scales, such as 2×2, 4×4, 4×6, 6×4, 4×8, 8×4, or 8×8, and the work group (thread group) may be 8×8, 8×16, 16×8, 16×16, 16×32, and so on. Choosing a 4×4 multiply-accumulate with a 16×16 work group, for example, gives a data block of size 64×64.
That is, the multiply-accumulate may be 2×2, 4×4, 4×6, 6×4, 4×8, 8×4, 8×8, and so on, and the work group may be 8×8, 8×16, 16×8, 16×16, 16×32, and so on, so the corresponding data blocks come in a wide variety of sizes (multiply-accumulate scale × work group = data block size).
The size of a particular data block is selected in relation to the global data input, and when mxn is relatively small, the multiply-accumulate may be selected to be smaller, and vice versa.
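The block-size relation described here (data block = work group × multiply-accumulate scale) is small enough to pin down in code; the struct and function names below are assumptions for illustration:

    struct Tile { int x, y; };

    // data block size = number of threads (work group) x multiply-accumulate scale
    constexpr Tile block_size(Tile work_group, Tile mac) {
        return { work_group.x * mac.x, work_group.y * mac.y };
    }

    // e.g. a 16x16 work group doing 4x4 multiply-accumulate -> a 64x64 block
    static_assert(block_size({16, 16}, {4, 4}).x == 64 &&
                  block_size({16, 16}, {4, 4}).y == 64, "16x16 * 4x4 = 64x64");
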
According to the data processing method provided by the embodiments of the present disclosure, dividing the data blocks reasonably can reduce the number of data reads, thereby improving GEMM operation efficiency and the optimization effect of the GEMM operation.
The domestic heterogeneous GPU platform has 32 banks, each bank occupying 4 bytes. If each thread needs to read 64 bytes of data and the data is read contiguously, 2 threads already occupy all the banks; that is, thread 0 and thread 1 do not conflict, but thread 0 conflicts with thread 2, and thread 1 conflicts with thread 3, and so on. Likewise, 16 threads use only 1/8 of the banks, so the conflicts are severe. In the embodiments of the present disclosure, the data read pattern can be changed: considering that the largest data read instruction can read 16 bytes at a time, the 64 bytes of data are divided into 4 parts to be read, with 16 bytes read across every 32 banks, so that thread 0, thread 1, ..., thread 7 do not conflict with one another. Bank conflicts can thus be greatly reduced.
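The phased read can be sketched as follows (CUDA-style C++; names assumed). Only the access pattern matters here: in each of the 4 phases, 8 threads each read 16 contiguous bytes, and 8 × 16 bytes = 128 bytes sweeps all 32 four-byte banks exactly once, so the threads do not conflict. How the four pieces map back to each thread's logical 64 bytes depends on the shared-memory layout, which is assumed to be interleaved accordingly.

    // Read 64 bytes per thread as 4 x 16-byte (float4) pieces.
    __device__ void read_64_bytes_phased(const float4* s_src, float4 dst[4],
                                         int tid) {
        for (int phase = 0; phase < 4; ++phase)
            // threads 0..7 cover float4 slots 0..7 of this phase:
            // 8 threads * 16 B = 128 B = one full pass over the 32 banks
            dst[phase] = s_src[phase * 8 + (tid & 7)];
    }
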
In the embodiments of the present disclosure, when the GEMM operation is performed, the total number of grids = blockx × blocky, where blockx is the number of data blocks corresponding to matrix A and blocky is the number of data blocks corresponding to matrix B. Each card has 64 CUs (compute units). If the number of grids is exactly 64, each grid corresponds exactly to one CU; if there are more grids, the correspondence between grids and CUs can be modified by setting blockx and blocky, which may be calculated as follows:
(The original publication expresses this calculation with two formulas rendered only as images, BDA0003975809630000131 and BDA0003975809630000141, which are not reproduced here.)
In the embodiments of the present disclosure, the domestic heterogeneous GPU platform has 64 CU compute units. If the input data is very small, the number of divided data blocks is relatively small, and each data block can correspond to one CU compute unit, so that the CU compute units are fully utilized. If the data is larger and there are more data blocks, the data of several blocks must be computed on one CU; through configuration, data with nearby memory addresses can be placed on the same CU for computation as much as possible, which increases the cache hit rate.
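The original formulas for this grid-to-CU correspondence are not reproduced above (they appear only as images in the source). As a stand-in, the sketch below shows one common way such a remapping is expressed: linear block IDs are regrouped so that GROUP consecutive blocks along one axis, whose memory addresses are close, tend to be scheduled together. The formula and the GROUP factor are assumptions, not the patent's own calculation.

    // Assumed block swizzle; requires blockx % GROUP == 0.
    __device__ void remap_block(int linear_id, int blocky, int GROUP,
                                int* bx, int* by) {
        int group_size = GROUP * blocky;           // blocks per group
        int group      = linear_id / group_size;
        int in_group   = linear_id % group_size;
        *bx = group * GROUP + in_group % GROUP;    // nearby x indices stay together
        *by = in_group / GROUP;
    }
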
In the embodiments of the present disclosure, the multiply-accumulate computation may be implemented directly with an instruction provided by the domestic heterogeneous GPU platform, for example the v_mac_f32 or v_mac_f64 instruction. Using the multiply-accumulate instruction consecutively can reduce the instruction issue period, so the multiply-accumulate instructions should be placed together as much as possible.
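Whether the patent drives v_mac_f32 through compiler intrinsics or hand-written assembly is not stated; purely as an assumption, the HIP/clang-style inline-assembly sketch below shows four v_mac_f32 instructions (dst += src0 * src1 on the GCN-like ISA) issued back to back, which is the grouping the text recommends:

    // Four consecutive v_mac_f32 ops: acc[n] += a[n] * b[n].
    // "v" constraints (VGPR operands) follow clang's AMDGCN inline-asm syntax.
    __device__ void mac4(const float a[4], const float b[4], float acc[4]) {
        asm volatile("v_mac_f32 %0, %4, %8\n\t"
                     "v_mac_f32 %1, %5, %9\n\t"
                     "v_mac_f32 %2, %6, %10\n\t"
                     "v_mac_f32 %3, %7, %11"
                     : "+v"(acc[0]), "+v"(acc[1]), "+v"(acc[2]), "+v"(acc[3])
                     : "v"(a[0]), "v"(a[1]), "v"(a[2]), "v"(a[3]),
                       "v"(b[0]), "v"(b[1]), "v"(b[2]), "v"(b[3]));
    }
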
In order for those skilled in the art to better understand the disclosed embodiments, the disclosed embodiments are described below by way of specific examples.
Referring to fig. 6, the blocking of the global data is determined by the number of threads used and the scale of the multiply-accumulate computed by each thread, i.e., block size = number of threads × multiply-accumulate scale. The number of threads may be 16×16, 16×8, 32×16, and so on, and the multiply-accumulate may be 2×2, 4×4, 6×6, 4×6, and so on, so the block has a variety of possible choices; for example, with number of threads = 16×16 and multiply-accumulate scale = 4×4, the block size = 16×16 × 4×4 = 64×64. The choice of block is related to the size of the input data: when the input data is relatively small, say 100×100, a block of size 64×64 performs better than a block of size 128×128, whereas for input data of 1024×1024, a 128×128 block performs better than a 64×64 block.
Offset addresses are then calculated, including the read addresses of global data matrix A and global data matrix B, the read-write addresses of global data matrix C, and the write-address and read-address offsets of the data in the shared memory. The domestic heterogeneous GPU platform has 64 CU compute units. If the input data is small, the number of divided data blocks is small, and each data block can correspond to one CU compute unit, fully utilizing the CU compute units. If the data is larger and there are more data blocks, the data of several blocks must be computed on one CU; through configuration, data with nearby memory addresses is placed on the same CU for computation as much as possible, increasing the cache hit rate.
With the data read-write instruction technique, the global data can be moved directly to the shared memory without passing through vector registers. When computing on the data, note that the instruction operates in units of 4 bytes, so when GEMM uses double-precision data, each double-precision value must be split into 2 single-precision-sized pieces for the move, and the shared-memory write address must be placed in the m0 register.
During data movement, because data reads have latency, the data only arrives in the register after some number of instruction cycles. Some data is therefore read outside the loop through a prefetch operation, rather than being computed on immediately, so the latency of some instructions can be hidden.
In each iteration of the GEMM operation loop, one data block can be moved to the shared memory. Taking latency into account, the multiply-accumulate computation is performed not on the data block just read but on the data prefetched earlier, so the read latency of the data just moved is overlapped with computation and some latency is hidden.
The shared memory is provided with 2 cache areas. The data moved by the data read-write instruction technique for the first time is placed in the first cache area shared1, and at the beginning of the loop, the data of the next data block is moved into the second cache area shared2 by the data read-write instruction technique. Moving global data requires a relatively large number of cycles; that is, if the data were used immediately after the move, a relatively large delay would be incurred. The instruction that moves the data is therefore placed at the beginning of the loop, which ensures that the data transfer is complete when the loop ends, and the intermediate operation steps can hide part of the instruction latency.
The multiply-accumulate computation can be implemented directly with an instruction provided by the domestic heterogeneous GPU platform, for example v_mac_f32 or v_mac_f64. Executing a v_mac_f32 instruction involves both an issue time and an execution time; when several mac instructions are placed together, instruction issue and execution overlap one another, reducing the time consumed. Using the multiply-accumulate instruction consecutively can thus reduce the instruction issue period, so the multiply-accumulate instructions should be placed together as much as possible.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the present application also provide a data processing apparatus for implementing the data processing method described above. The implementation of the solution provided by the apparatus is similar to the implementation described in the method above, so for the specific limitations of one or more embodiments of the data processing apparatus provided below, reference may be made to the limitations of the data processing method above, which are not repeated here.
In one embodiment, as shown in fig. 7, there is provided a data processing apparatus including: a first read module 702, a second read module 704, an operation module 706, and a first write module 708, wherein,
a first reading module 702, configured to read an ith data block from global data based on an ith data read/write instruction, and write the ith data block into a shared memory, where the data read/write instruction is an assembly instruction for performing data read/write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N;
a second reading module 704, configured to read the ith data block from the shared memory to a vector register for multiply-accumulate processing, so as to obtain a multiply-accumulate operation result of the ith data block;
The operation module 706 is configured to obtain a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks;
a first writing module 708, configured to write the GEMM operation result into the global memory.
With the above data processing apparatus, the ith data block is read from the global data based on the ith data read-write instruction, and the ith data block is written into the shared memory, wherein the data read-write instruction is an assembly instruction for performing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N. The ith data block is read from the shared memory into a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block; a general matrix multiplication (GEMM) operation result for the global data is obtained according to the multiply-accumulate operation results corresponding to the N data blocks; and the GEMM operation result is written into the global memory. Based on the data processing apparatus provided by the embodiments of the present disclosure, the global data in the global memory can be read directly into the shared memory through an assembly instruction, without passing through registers, which accelerates data reads and writes, reduces operation latency, and improves GEMM operation efficiency, so the GEMM optimization effect is remarkable.
In one embodiment, the apparatus further comprises:
and the second writing module is used for responding to the shared memory setting operation for each data block and writing the corresponding writing address of each data block in the shared memory into a target register.
In one embodiment, the i-th data read-write instruction includes a read address of the i-th data block in the global memory, and the first reading module 702 is further configured to:
reading the ith data block from the global data according to the read address of the ith data block in the global memory in the ith data read-write instruction;
and reading a write address corresponding to the ith data block in the shared memory from the target register according to the ith data read-write instruction, and writing the ith data block into the shared memory based on the write address corresponding to the ith data block in the shared memory.
In one embodiment, the first reading module 702 is further configured to:
reading the ith data block from the global data, and writing the ith data block into a first cache area of the shared memory;
reading the (i+1) th data block from the global data, and writing the (i+1) th data block into a second cache area of the shared memory;
And reading the ith data block from the first cache area of the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block.
In one embodiment, the data block includes m rows and k columns of elements, the vector register includes a first vector register and a second vector register, and the first read module is further configured to:
reading the j-th channel data of the i-th data block from a first cache area of the shared memory to the first vector register, wherein j is an integer less than or equal to k;
when j is smaller than k, reading the j+1th channel data of the ith data block from the first cache area of the shared memory to the second vector register;
performing multiply-accumulate operation on the j-th channel data in the first vector register;
when j+1 is smaller than k, reading the j+2 channel data of the ith data block from a first cache area of the shared memory into the first vector register;
performing multiply-accumulate operation on the j+1th channel data in the second vector register;
And when j+2 is smaller than k, taking j+2 as new j, repeating the step of performing multiply-accumulate operation on the j-th channel data in the first vector register until j+2 is equal to k, and performing multiply-accumulate operation on the j+2-th channel data in the second vector register to obtain a multiply-accumulate operation result of the i-th data block.
In one embodiment, the size of the data block is positively correlated with the size of the multiply-accumulate operation.
The modules in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware, may be independent of the processor in the computer device, or may be stored as software in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; nevertheless, any combination of them that contains no contradiction should be considered within the scope of this specification.
The above examples represent only a few embodiments of the present application, and while they are described in relatively specific detail, they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
based on an ith data read-write instruction, reading an ith data block from global data, and writing the ith data block into a shared memory, wherein the data read-write instruction is an assembly instruction for executing data read-write processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N;
reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block;
obtaining a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks;
and writing the GEMM operation result into a global memory.
2. The method according to claim 1, wherein the method further comprises:
and in response to a shared memory setting operation for each data block, writing a write address corresponding to each data block in the shared memory into a target register.
3. The method according to claim 2, wherein the ith data read-write instruction includes a read address of the ith data block in the global memory, and the reading the ith data block from the global data based on the ith data read-write instruction and writing the ith data block into the shared memory comprises:
reading the ith data block from the global data according to the read address, in the ith data read-write instruction, of the ith data block in the global memory;
and reading, from the target register according to the ith data read-write instruction, the write address corresponding to the ith data block in the shared memory, and writing the ith data block into the shared memory based on the write address corresponding to the ith data block in the shared memory.
4. The method of claim 1, wherein the reading the ith data block from the global data and writing the ith data block into the shared memory comprises:
reading the ith data block from the global data, and writing the ith data block into a first cache area of the shared memory;
reading the (i+1)th data block from the global data, and writing the (i+1)th data block into a second cache area of the shared memory;
and the reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block comprises:
and reading the ith data block from the first cache area of the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block.
5. The method according to claim 4, wherein the data block includes m rows and k columns of elements, the vector register includes a first vector register and a second vector register, and the reading the ith data block from the first cache area of the shared memory into the vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block comprises:
reading the jth channel data of the ith data block from the first cache area of the shared memory into the first vector register, wherein j is an integer less than or equal to k;
when j is smaller than k, reading the (j+1)th channel data of the ith data block from the first cache area of the shared memory into the second vector register;
performing a multiply-accumulate operation on the jth channel data in the first vector register;
when j+1 is smaller than k, reading the (j+2)th channel data of the ith data block from the first cache area of the shared memory into the first vector register;
performing a multiply-accumulate operation on the (j+1)th channel data in the second vector register;
and when j+2 is smaller than k, taking j+2 as the new j and repeating from the step of performing a multiply-accumulate operation on the jth channel data in the first vector register, until j+2 is equal to k and the multiply-accumulate operation is performed on the (j+2)th channel data in the second vector register, so as to obtain the multiply-accumulate operation result of the ith data block.
6. The method according to any of claims 1 to 5, wherein the size of the data block is positively correlated with the amount of the multiply-accumulate computation.
7. A data processing apparatus, the apparatus comprising:
the first reading module is used for reading an ith data block from global data based on an ith data reading and writing instruction and writing the ith data block into a shared memory, wherein the data reading and writing instruction is an assembly instruction for executing data reading and writing processing, the global data is divided into N data blocks, N is an integer greater than 0, and i is an integer less than or equal to N;
the second reading module is used for reading the ith data block from the shared memory to a vector register for multiply-accumulate processing to obtain a multiply-accumulate operation result of the ith data block;
the operation module is used for obtaining a general matrix multiplication (GEMM) operation result for the global data according to the multiply-accumulate operation results corresponding to the N data blocks;
and the first writing module is used for writing the GEMM operation result into the global memory.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
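
By way of illustration only: the flow of claim 1 corresponds to a conventional tiled GEMM, in which each data block of the global input is staged through shared memory before the multiply-accumulate. The CUDA sketch below is a minimal example under assumed names (gemm_tiled, TILE, A, B, C, M, N, K) and the assumption that all dimensions are multiples of TILE; it uses ordinary C-level loads rather than the assembly-level data read-write instructions the claim recites.

// Minimal sketch of claim 1 (illustrative names only; assumes M, N, K are
// multiples of TILE and the grid exactly covers C). Plain CUDA C stands in
// for the claimed assembly-level data read-write instructions.
#define TILE 16

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];  // shared-memory copy of the ith A block
    __shared__ float Bs[TILE][TILE];  // shared-memory copy of the ith B block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                 // per-thread multiply-accumulate result

    // The global data is divided into K / TILE data blocks along k.
    for (int i = 0; i < K / TILE; ++i) {
        // Read the ith data block from global data; write it to shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + i * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(i * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Read the ith data block from shared memory (staged through
        // registers by the compiler) and multiply-accumulate.
        for (int t = 0; t < TILE; ++t)
            acc += As[threadIdx.y][t] * Bs[t][threadIdx.x];
        __syncthreads();
    }

    // The per-block results folded into acc form the GEMM result for this
    // element; write it back to global memory.
    if (row < M && col < N)
        C[row * N + col] = acc;
}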
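
Claim 4's first and second cache areas amount to double buffering: while block i is consumed from one half of shared memory, block i+1 is written into the other half, hiding global-memory latency behind the computation. A minimal sketch under the same assumptions and illustrative names as above:

__global__ void gemm_double_buffered(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    // Two cache areas per operand: one is read while the other is filled.
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int nblk = K / TILE;
    float acc = 0.0f;

    // Preload data block 0 into the first cache area.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int i = 0; i < nblk; ++i) {
        int cur = i & 1, nxt = cur ^ 1;

        // Write data block i+1 into the other cache area while block i
        // is being consumed below.
        if (i + 1 < nblk) {
            As[nxt][threadIdx.y][threadIdx.x] =
                A[row * K + (i + 1) * TILE + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] =
                B[((i + 1) * TILE + threadIdx.y) * N + col];
        }

        // Multiply-accumulate on the data block resident in the current area.
        for (int t = 0; t < TILE; ++t)
            acc += As[cur][threadIdx.y][t] * Bs[cur][t][threadIdx.x];

        __syncthreads();  // prefetch and compute for this step both complete
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}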
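
Claim 5 applies the same overlap one level down: two vector registers alternate so that the shared-memory read of channel j+1 is issued while channel j is multiplied. A device-function sketch (illustrative names; TILE as above, and plain register pairs standing in for vector registers):

__device__ float mac_ping_pong(const float As[TILE][TILE],
                               const float Bs[TILE][TILE],
                               int ty, int tx) {
    float a[2], b[2];                 // first and second "vector registers"
    float acc = 0.0f;

    a[0] = As[ty][0];                 // channel 0 into the first registers
    b[0] = Bs[0][tx];

    for (int j = 0; j < TILE; ++j) {
        int cur = j & 1;
        if (j + 1 < TILE) {
            a[cur ^ 1] = As[ty][j + 1];  // fetch channel j+1 into the other
            b[cur ^ 1] = Bs[j + 1][tx];  // registers while j is multiplied
        }
        acc += a[cur] * b[cur];          // multiply-accumulate channel j
    }
    return acc;
}

In the double-buffered sketch above, the inner loop over t could then be replaced by acc += mac_ping_pong(As[cur], Bs[cur], threadIdx.y, threadIdx.x);.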

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211530960.XA CN116149602A (en) 2022-12-01 2022-12-01 Data processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116149602A (en) 2023-05-23

Family

ID=86349769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211530960.XA Pending CN116149602A (en) 2022-12-01 2022-12-01 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116149602A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination