CN118051168A - Data reading method, apparatus, computer device, storage medium, and program product - Google Patents

Data reading method, apparatus, computer device, storage medium, and program product

Info

Publication number: CN118051168A
Application number: CN202211434973.7A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 尚垚威, 张淮声
Current assignee: Glenfly Tech Co Ltd
Original assignee: Glenfly Tech Co Ltd
Application filed by Glenfly Tech Co Ltd
Priority to CN202211434973.7A
Publication of CN118051168A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present application relates to a data reading method, apparatus, computer device, storage medium and program product. In the method, a first matrix to be processed is split into a plurality of row matrix blocks and a second matrix to be processed is split into a plurality of column matrix blocks; a thread reads the current row matrix block and the current column matrix block while, in parallel, calculating the historical row matrix block and the historical column matrix block already written into a first shared memory, and the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into a second shared memory. Because the calculation on the blocks in the first shared memory and the writes into the second shared memory proceed concurrently with the reads of the current row matrix block and the current column matrix block, the read latency of those reads is covered and the read latency of the global memory is effectively reduced.

Description

Data reading method, apparatus, computer device, storage medium, and program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data reading method, apparatus, computer device, storage medium, and program product.
Background
General Matrix Multiplication (GEMM) is widely used in scientific computing, deep learning and other fields, and is a core computational unit of deep learning. With the improvement of graphics processing unit (GPU) performance and the emergence of new matrix multiplication hardware units (e.g., Tensor Cores), the "memory wall" phenomenon has become more and more apparent.
For general matrix multiplication, the significant latencies typically include the global memory read latency, the shared memory write latency, the shared memory read latency, the synchronization overhead caused by differences in thread execution speed, and so on. Conventionally, a GPU masks memory latency through thread-level parallelism and instruction-level parallelism. For GEMM, however, a tiling algorithm usually occupies many registers and a large amount of shared memory, which limits the number of threads that the hardware can schedule and run; the read/write latency of the global memory therefore cannot be effectively covered, and data processing efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data reading method, apparatus, computer device, computer-readable storage medium, and computer program product that can effectively cover the global memory read/write latency.
In a first aspect, the present application provides a data reading method. The method comprises the following steps:
In response to a data reading request, splitting a first matrix to be processed into a plurality of row matrix blocks, and splitting a second matrix to be processed into a plurality of column matrix blocks;
The thread reads the current row matrix block and the current column matrix block, and the thread calculates the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory in parallel to obtain a first calculation result; during the process of processing the first shared memory by the thread, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory;
After the thread finishes the writing operation on the second shared memory, the thread reads the next row matrix block of the current row matrix block, reads the next column matrix block of the current column matrix block, and calculates the previous row matrix block and the previous column matrix block written into the second shared memory to obtain a second calculation result; in the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory;
Taking the next row matrix block of the current row matrix block in the plurality of row matrix blocks as the current row matrix block of the next iteration loop, taking the next column matrix block of the current column matrix block in the plurality of column matrix blocks as the current column matrix block of the next iteration loop, and returning to the step in which the thread reads the current row matrix block and the current column matrix block, continuing execution until the loop end condition is reached, at which point the iteration stops and a plurality of first calculation results and a plurality of second calculation results are obtained;
and accumulating the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
In one embodiment, if the current column matrix block is the first column matrix block and the current row matrix block is the first row matrix block, the method further comprises:
The thread reads a first row matrix block and a first column matrix block;
After the thread completes the read operation, the thread writes the first row matrix block and the first column matrix block into the first shared memory; during the process of processing the first shared memory by the thread, the thread reads the next row matrix block and the next column matrix block, and thread synchronization is performed in parallel.
In one embodiment, the thread calculates, in parallel, a history row matrix block and a history column matrix block written in a previous iteration loop in the first shared memory, to obtain a first calculation result, including:
splitting a history row matrix block written in a previous iteration cycle in a first shared memory into a plurality of sub-row matrix blocks, and splitting a history column matrix block into a plurality of sub-column matrix blocks;
The thread reads a current sub-row matrix block and a current sub-column matrix block in the first shared memory, and the thread calculates a history sub-row matrix block and a history sub-column matrix block which are read in the last iteration cycle in the first shared memory in parallel to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result;
Taking the next sub-row matrix block of the current sub-row matrix block in the plurality of sub-row matrix blocks as the current sub-row matrix block of the next iteration loop, taking the next sub-column matrix block of the current sub-column matrix block in the plurality of sub-column matrix blocks as the current sub-column matrix block of the next iteration loop, and returning to the step in which the thread reads the current sub-row matrix block and the current sub-column matrix block, continuing execution until the loop end condition is reached, at which point the iteration stops and a plurality of first intermediate results and a plurality of second intermediate results are obtained;
And accumulating the plurality of first intermediate results and the plurality of second intermediate results to obtain a first calculation result.
In one embodiment, if the current sub-column matrix block is the first sub-column matrix block of the history column matrix block in the first shared memory and the current sub-row matrix block is the first sub-row matrix block of the history row matrix block in the first shared memory, the method further comprises:
the thread reads a first sub-row matrix block of a history row matrix block and a first sub-column matrix block of a history column matrix block in a first shared memory;
After the read operation on the first shared memory completes, the thread reads the next sub-row matrix block and the next sub-column matrix block in the first shared memory and, in parallel, performs matrix block multiplication on the current sub-row matrix block and the current sub-column matrix block to obtain a first intermediate result.
In one embodiment, if the current sub-row matrix block is the last sub-row matrix block of the history row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the history column matrix block in the first shared memory, before the thread calculates in parallel the history sub-row matrix block and the history sub-column matrix block read in the last iteration loop in the first shared memory, the method further includes:
Synchronizing the threads that read the sub-row matrix blocks and the sub-column matrix blocks in the first shared memory; and synchronizing the threads that write the sub-row matrix blocks and the sub-column matrix blocks into the first shared memory.
In one embodiment, if the current sub-row matrix block is the last sub-row matrix block of the history row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the history column matrix block in the first shared memory, the method further includes, after the thread calculates the history sub-row matrix block and the history sub-column matrix block read in the last iteration loop in the first shared memory in parallel:
The multiplication of the last sub-row matrix block of the history row matrix block with the last sub-column matrix block of the history column matrix block by the thread is delayed: when the thread subsequently calculates the previous row matrix block and the previous column matrix block written into the second shared memory, this delayed multiplication is executed only after the thread has issued the reads of the first sub-row matrix block of the previous row matrix block and the first sub-column matrix block of the previous column matrix block in the second shared memory.
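By way of illustration only, the following CUDA-style sketch shows one way such a delayed multiplication can be ordered; the function name, the buffer layout and the one-element-per-thread fragments are our own assumptions, not the implementation of the application:

```cuda
// Hypothetical sketch: at the tile boundary, the reads of the FIRST sub-blocks
// of the newly written buffer are issued before the DELAYED multiply of the
// last sub-blocks of the previous tile, so the multiply covers the read latency.
__device__ float boundary_step(const float sNextA[4][64],  // next buffer, sub-blocks of A (layout assumed)
                               const float sNextB[4][64],  // next buffer, sub-blocks of B (layout assumed)
                               int tm, int tn,
                               float lastA, float lastB,   // last sub-block fragments, held in registers
                               float* firstA, float* firstB,
                               float acc)
{
    __syncthreads();          // the other shared memory has now been written
    *firstA = sNextA[0][tm];  // issue the reads of the first sub-blocks first ...
    *firstB = sNextB[0][tn];
    acc += lastA * lastB;     // ... then perform the delayed last multiply
    return acc;
}
```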
In one embodiment, the thread calculates, in parallel, a history subrow matrix block and a history subcolumn matrix block read in a last iteration loop in the first shared memory, to obtain a first intermediate result, including:
The thread loads each row of elements of the historical sub-row matrix block read in the last iteration loop in the first shared memory into a vector register; in parallel, the thread loads a single element of each row of elements of the historical sub-column matrix block read in the last iteration loop in the first shared memory into a scalar register;
The thread multiplies the single element in the scalar register with each row of elements in the vector register and accumulates the products to obtain a first intermediate result.
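As a minimal sketch of this register usage (our own notation: a CUDA float4 plays the role of the vector register holding one row of the sub-row matrix block, and a float plays the role of the scalar register holding one element of the sub-column matrix block):

```cuda
// Multiply the single element of A' in the scalar register with each row
// element of B' in the vector register and accumulate the partial result.
__device__ void scalar_vector_fma(float  elemA,  // scalar register: one element of A'
                                  float4 rowB,   // vector register: one row of B'
                                  float4* acc)   // accumulated first intermediate result
{
    acc->x += elemA * rowB.x;
    acc->y += elemA * rowB.y;
    acc->z += elemA * rowB.z;
    acc->w += elemA * rowB.w;
}
```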
In a second aspect, the application further provides a data reading device. The device comprises:
The block dividing module is used for responding to the data reading request, dividing a first matrix to be processed into a plurality of row matrix blocks, and dividing a second matrix to be processed into a plurality of column matrix blocks;
The first reading module is used for reading, by a thread, the current row matrix block and the current column matrix block, the thread calculating in parallel the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory to obtain a first calculation result; during the process of processing the first shared memory by the thread, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory;
The second reading module is used for reading a next row matrix block of the current row matrix block by the thread after the thread executes the writing operation on the second shared memory, reading a next column matrix block of the current column matrix block, and calculating a previous row matrix block and a previous column matrix block written into the second shared memory by the thread to obtain a second calculation result; in the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory;
The iteration module is used for taking the next row matrix block of the current row matrix block in the plurality of row matrix blocks as the current row matrix block of the next iteration loop, taking the next column matrix block of the current column matrix block in the plurality of column matrix blocks as the current column matrix block of the next iteration loop, and returning to the step in which the thread reads the current row matrix block and the current column matrix block, continuing execution until the loop end condition is reached, at which point the iteration stops and a plurality of first calculation results and a plurality of second calculation results are obtained;
The output module is used for accumulating the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the data reading method of the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data reading method of the first aspect.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data reading method of the first aspect.
According to the data reading method, apparatus, computer device, storage medium and program product, a first matrix to be processed is split into a plurality of row matrix blocks and a second matrix to be processed is split into a plurality of column matrix blocks; the thread reads the current row matrix block and the current column matrix block while, in parallel, calculating the historical row matrix block and the historical column matrix block written into the first shared memory in the previous iteration loop, and, during this processing of the first shared memory, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory. Because the calculation on the historical blocks in the first shared memory and the writes into the second shared memory proceed concurrently with the reads of the current row matrix block and the current column matrix block, the read latency of those global memory reads is covered; the read latency of the global memory is thus effectively reduced, data processing efficiency is improved, and a higher hardware utilization is achieved.
Drawings
FIG. 1 is an application environment diagram of a data reading method in one embodiment;
FIG. 2 is a flow chart of a data reading method according to an embodiment;
FIG. 3 is a schematic diagram of matrix partitioning in one embodiment;
FIG. 4 is a schematic diagram of a double-buffer mechanism according to another embodiment;
FIG. 5 is a timing diagram of the cyclic execution of row matrix blocks and column matrix blocks in one embodiment;
FIG. 6 is a timing diagram of the reading, computing, and writing of a plurality of row matrix blocks and a plurality of column matrix blocks if the number of row matrix blocks and column matrix blocks is even in one embodiment;
FIG. 7 is a timing diagram of reading, calculating, and writing of a plurality of row matrix blocks and a plurality of column matrix blocks if the number of row matrix blocks and the number of column matrix blocks are both odd in one embodiment;
FIG. 8 is a flow chart of obtaining a first calculation result in one embodiment;
FIG. 9 is a schematic diagram of the partitioning and prefetching of row matrix blocks, column matrix blocks in one embodiment;
FIG. 10 is a timing diagram of sub-row matrix block and sub-column matrix block loop execution in one embodiment;
FIG. 11 is a timing diagram of the reading and computation of a plurality of sub-row matrix blocks and a plurality of sub-column matrix blocks if the number of sub-row matrix blocks and sub-column matrix blocks is even in one embodiment;
FIG. 12 is a timing diagram of the reading and computation of a plurality of sub-row matrix blocks and a plurality of sub-column matrix blocks if the number of sub-row matrix blocks and sub-column matrix blocks is odd in one embodiment;
FIG. 13 is a flow chart of a conventional read-write synchronization of shared memory in one embodiment;
FIG. 14 is a flow chart of a delayed-synchronization read/write process of the shared memory in one embodiment;
FIG. 15 is a flow diagram of computing a first intermediate result in one embodiment;
FIG. 16 is a flow diagram of sub-row matrix block and sub-column matrix block multiplication, and data sharing, in one embodiment;
FIG. 17 is a schematic diagram of an overall scheme for data reading in one embodiment;
FIG. 18 is a block diagram showing the structure of a data reading apparatus in one embodiment;
Fig. 19 is an internal structural view of the computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The data reading method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. The computer device 102 obtains a data reading request initiated by the terminal 104, splits a first matrix to be processed in the global memory into a plurality of row matrix blocks, and splits a second matrix to be processed into a plurality of column matrix blocks. The blocks are then processed iteratively as described above: the thread reads the current row matrix block and the current column matrix block while calculating in parallel the historical row matrix block and the historical column matrix block written into the first shared memory in the previous iteration loop to obtain a first calculation result, and, during this processing of the first shared memory, writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory; after the write to the second shared memory completes, the thread reads the next row matrix block and the next column matrix block and calculates the previous row matrix block and the previous column matrix block written into the second shared memory to obtain a second calculation result, writing the current row matrix block and the current column matrix block into the first shared memory in the meantime; the loop then continues with the next row matrix block and the next column matrix block as the current blocks until the loop end condition is reached, and the plurality of first calculation results and second calculation results are accumulated to obtain the output matrix. The computer device 102 may be, but is not limited to, a terminal or a server, and includes a graphics processor or a hardware CPU chip. The terminal can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet-of-Things devices and portable wearable devices; the Internet-of-Things devices can be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices and the like; the portable wearable devices can be smart watches, smart bracelets, headsets and the like. The server can be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a data reading method is provided, and the method is applied to the computer device in fig. 1 for illustration, and includes the following steps:
In step 202, in response to the data reading request, the first matrix to be processed is split into a plurality of row matrix blocks, and the second matrix to be processed is split into a plurality of column matrix blocks.
The first matrix is the right matrix in the matrix multiplication and the second matrix is the left matrix in the matrix multiplication; the elements of the matrices are image data, scientific data, artificial intelligence (AI) data, or other data stored in the global memory. As shown in FIG. 3, the product of the second matrix of dimension m×K and the first matrix of dimension K×n is a matrix of dimension m×n. The first matrix is split in the transverse direction into a plurality of row matrix blocks B (Tile B) of dimension k×n, and the second matrix is split in the longitudinal direction into a plurality of column matrix blocks A (Tile A) of dimension m×k, where k is the block depth along the K direction; the number of row matrix blocks split from the first matrix is equal to the number of column matrix blocks split from the second matrix. The product of one column matrix block A of dimension m×k and one row matrix block B of dimension k×n is a matrix slice of dimension m×n of the output matrix C; performing matrix block multiplication on each column matrix block A and the row matrix block B at the corresponding position yields a plurality of matrix slices of dimension m×n, and accumulating these matrix slices yields the output matrix C of dimension m×n.
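As a plain host-side reference for this decomposition, the following sketch, under our own assumptions of row-major storage and a block depth k that divides K (the function name is ours), accumulates the m×n slices produced by each (Tile A, Tile B) pair:

```cuda
// Reference view of the split (not the GPU kernel): the second matrix A
// (m x K) is split into column matrix blocks of dimension m x k, the first
// matrix B (K x n) into row matrix blocks of dimension k x n, and the
// resulting m x n slices are accumulated into the output matrix C.
void gemm_tiled_reference(const float* A, const float* B, float* C,
                          int m, int n, int K, int k)  // assumes K % k == 0
{
    for (int i = 0; i < m * n; ++i) C[i] = 0.0f;
    for (int t = 0; t < K / k; ++t) {            // one (Tile A, Tile B) pair
        for (int r = 0; r < m; ++r) {
            for (int c = 0; c < n; ++c) {
                float slice = 0.0f;              // element of the m x n slice
                for (int d = 0; d < k; ++d)
                    slice += A[r * K + t * k + d] * B[(t * k + d) * n + c];
                C[r * n + c] += slice;           // accumulate the slices
            }
        }
    }
}
```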
Step 204, the thread reads the current row matrix block and the current column matrix block, and the thread calculates the historical row matrix block and the historical column matrix block written in the last iteration loop in the first shared memory in parallel to obtain a first calculation result; during processing of the thread for the first shared memory, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block to the second shared memory.
A Group of the computer device 102 calculates one output matrix C, and the threads within the Group cooperate to read a column matrix block A and a row matrix block B, and to prefetch the next column matrix block A and row matrix block B along the K direction in FIG. 3. The historical row matrix block and historical column matrix block of the current loop are the current row matrix block and current column matrix block that the thread wrote into the first shared memory in the previous iteration loop.
The first shared memory and the second shared memory are two buffers of the shared memory, which alternately store the current row matrix block B and current column matrix block A read by the thread, and the next row matrix block B and next column matrix block A read by the thread. The read/write timing of the first shared memory (first Buffer) and the second shared memory (second Buffer) of this embodiment is shown in FIG. 4. This embodiment adopts a double-buffer Ping-Pong operation mechanism: while a thread of the computer device reads the current row matrix block and the current column matrix block and, in parallel, performs matrix block multiplication on the historical row matrix block and historical column matrix block written into the first shared memory in the previous iteration loop, the thread writes the next row matrix block B and next column matrix block A it has read into the second shared memory; while the thread writes the row matrix block and column matrix block prefetched next into the first shared memory, the thread performs matrix block multiplication on the historical row matrix block and historical column matrix block stored in the second shared memory; this alternation repeats until the loop stop condition is met. It should be noted that the roles of the first and second shared memories may be exchanged, i.e., the second shared memory may be written first and then the first shared memory. This embodiment takes writing the first row matrix block and the first column matrix block into the first shared memory as an example.
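The following minimal CUDA kernel sketches this Ping-Pong mechanism under simplifying assumptions of our own: square TILE×TILE blocks, matrix sizes that are multiples of TILE, and one output element per thread; sA[0]/sB[0] and sA[1]/sB[1] play the roles of the first and second shared memories. It illustrates the principle only and is not the application's implementation:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Double-buffer (Ping-Pong) GEMM sketch: C (m x n) = A (m x K) * B (K x n).
// Assumptions (ours): m, n, K are multiples of TILE; blockDim = (TILE, TILE);
// each thread produces one element of C.
__global__ void gemm_double_buffer(const float* A, const float* B, float* C,
                                   int m, int n, int K)
{
    __shared__ float sA[2][TILE][TILE];
    __shared__ float sB[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = K / TILE;
    float acc = 0.0f;

    // Prologue: read the first column matrix block of A and the first row
    // matrix block of B, and write them into buffer 0.
    sA[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    sB[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * n + col];
    __syncthreads();

    int cur = 0;
    for (int t = 1; t < numTiles; ++t) {
        // Read the next blocks from global memory into registers; this read
        // overlaps with the computation on the current buffer below.
        float a = A[row * K + t * TILE + threadIdx.x];
        float b = B[(t * TILE + threadIdx.y) * n + col];

        // Compute on the blocks already resident in the current buffer.
        for (int kk = 0; kk < TILE; ++kk)
            acc += sA[cur][threadIdx.y][kk] * sB[cur][kk][threadIdx.x];

        // Ping-Pong: write the prefetched blocks into the other buffer.
        cur ^= 1;
        sA[cur][threadIdx.y][threadIdx.x] = a;
        sB[cur][threadIdx.y][threadIdx.x] = b;
        __syncthreads();
    }

    // Epilogue: compute on the last blocks, then write the result back.
    for (int kk = 0; kk < TILE; ++kk)
        acc += sA[cur][threadIdx.y][kk] * sB[cur][kk][threadIdx.x];
    C[row * n + col] = acc;
}
```

Under these assumptions the kernel would be launched as gemm_double_buffer<<<dim3(n / TILE, m / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, m, n, K); a single __syncthreads() per iteration suffices because the reads and writes of any one iteration always target different halves of the buffers.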
In some embodiments, when the current column matrix block is the first column matrix block and the current row matrix block is the first row matrix block, the data reading method further comprises the steps of:
step 1, a thread reads a first row matrix block and a first column matrix block.
FIG. 5 is a cycle execution timing diagram of the row matrix blocks and the column matrix blocks. In FIG. 5, the code execution timeline of the GPU processor runs from left to right and, from top to bottom, the rows show the global memory reads, the synchronization and check-memory-read-ready instructions, the shared memory writes, the thread loop calculation process, and the final data output process. Instructions of different functional categories are arranged in order in the vertical direction; the latency of instruction execution can be observed through the instruction "tail" areas, and the coverage of the global memory read latency and the shared memory read/write latency can be clearly seen. In FIG. 5, "read column matrix block A" and "read row matrix block B" denote the thread reading a column matrix block and a row matrix block, respectively, and the sequence numbers of the currently read row matrix block and column matrix block are determined by the global memory read order.
Step 2, after the thread completes the read operation, the thread writes the first row matrix block and the first column matrix block into the first shared memory; during the process of processing the first shared memory by the thread, the thread reads the next row matrix block and the next column matrix block, and thread synchronization is performed in parallel.
Optionally, as shown in FIG. 5, after the thread of the computer device finishes reading the first row matrix block and the first column matrix block, the thread writes them into the first shared memory, and reads the second row matrix block and the second column matrix block within the latency corresponding to the "tail" area of that write to the first shared memory; within the latency corresponding to the "tail" area of reading the second row matrix block and the second column matrix block, thread synchronization is performed in parallel; after the reads of the second row matrix block and the second column matrix block complete, the thread writes them into the second shared memory; within the latency corresponding to the "tail" area of the write to the second shared memory, the thread reads the third row matrix block and the third column matrix block while, in parallel, calculating the first row matrix block and the first column matrix block written into the first shared memory in the previous iteration loop to obtain a first calculation result; after the processing of the first shared memory completes, the thread writes the third row matrix block and the third column matrix block into the first shared memory.
In some embodiments, when the current row matrix block is any row matrix block after the second row matrix block and the current column matrix block is any column matrix block after the second column matrix block, the thread reads the current row matrix block and the current column matrix block, and the thread calculates the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory in parallel to obtain a first calculation result; during processing of the thread for the first shared memory, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block to the second shared memory.
Optionally, taking the current row matrix block as a third row matrix block and the current column matrix block as a third column matrix block as an example, as shown in fig. 5, after the thread of the computer device reads the second row matrix block and the second column matrix block, the thread of the computer device reads the third row matrix block and the third column matrix block, and the thread calculates the first row matrix block and the first column matrix block written in the first iteration loop in the first shared memory in parallel to obtain a first calculation result; during processing of the thread for the first shared memory, the thread writes the second row matrix block and the second column matrix block to the second shared memory.
Step 206, after the thread finishes the writing operation to the second shared memory, the thread reads the next row of matrix blocks of the current row of matrix blocks, reads the next column of matrix blocks of the current column of matrix blocks, and calculates the previous row of matrix blocks and the previous column of matrix blocks written into the second shared memory to obtain a second calculation result; and in the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory.
After the current row matrix block and the current column matrix block are read by the thread and written into the first shared memory, the current row matrix block and the current column matrix block written into the first shared memory are used as a history row matrix block and a history column matrix block related to the next iteration.
Optionally, taking the current row matrix block as a third row matrix block and the current column matrix block as a third column matrix block as an example, as shown in fig. 5, after the thread of the computer device writes the second row matrix block and the second column matrix block into the second shared memory, the thread reads the fourth row matrix block and the fourth column matrix block, and the thread performs matrix block multiplication processing on the second row matrix block and the second column matrix block written into the second shared memory to obtain a second calculation result; during the process of processing the second shared memory by the thread, the thread writes the fourth row matrix block and the fourth column matrix block read by the thread into the first shared memory.
Step 208, taking the next row matrix block of the current row matrix block in the plurality of row matrix blocks as the current row matrix block of the next iteration loop, taking the next column matrix block of the current column matrix block in the plurality of column matrix blocks as the current column matrix block of the next iteration loop, and returning to the step in which the thread reads the current row matrix block and the current column matrix block, continuing execution until the loop end condition is reached, at which point the iteration stops and a plurality of first calculation results and a plurality of second calculation results have been obtained.
The loop ending condition may be that a loop interrupt instruction is received, or that execution of each row matrix block and each column matrix block in the first shared memory and the second shared memory is completed. In this embodiment, a plurality of row matrix blocks are respectively marked as a row matrix block B1 to a row matrix block Bn in a splitting order, and a plurality of column matrix blocks are respectively marked as a column matrix block A1 to a column matrix block An in a splitting order.
In this embodiment, taking the case that the first row matrix block and the first column matrix block are written into the first shared memory, in the process that the first shared memory and the second shared memory alternately store the row matrix block and the column matrix block, the row matrix block and the column matrix block marked as odd numbers are written into the first shared memory, and the row matrix block and the column matrix block marked as even numbers are written into the second shared memory. A loop cycle may be performed according to steps 204 and 206 when the current row matrix block is any one of the row matrix blocks after the second row matrix block and the current column matrix block is any one of the column matrix blocks after the second column matrix block.
In some embodiments, if the number of row matrix blocks and column matrix blocks is even, the timing rules of the reading, calculating and writing of the row matrix blocks and the column matrix blocks are as shown in FIG. 6, and the timing steps in the dashed box in FIG. 6 form one loop period. When the current row matrix block is the last row matrix block and the current column matrix block is the last column matrix block, the thread writes the last row matrix block and the last column matrix block, which were read in the previous iteration loop, into the second shared memory; during this write operation, the thread in parallel performs matrix block multiplication on the second-to-last row matrix block and the second-to-last column matrix block to obtain a first calculation result; after that calculation completes, the thread performs matrix block multiplication on the last row matrix block and the last column matrix block to obtain a second calculation result.
In some embodiments, if the number of row matrix blocks and column matrix blocks is odd, the timing of reading, calculating and writing of the row matrix blocks and the column matrix blocks is as shown in fig. 7, and the timing steps in the dashed box in fig. 7 are one cycle period. When the current row matrix block is the last row matrix block and the current column matrix block is the last column matrix block, the last row matrix block and the last column matrix block are in the last cycle period, so that after the last row matrix block and the last column matrix block pass through the cycle period, the thread performs matrix block multiplication processing on the last row matrix block and the last column matrix block to obtain a first calculation result.
Optionally, take as an example that the first row matrix block and the first column matrix block are written into the first shared memory, the number of row matrix blocks and column matrix blocks is even, the current row matrix block is the third row matrix block, and the current column matrix block is the third column matrix block. After steps 204 and 206 have been executed for the third row matrix block and the third column matrix block, as shown in FIG. 6, the computer device takes the next row matrix block of the current row matrix block among the plurality of row matrix blocks as the current row matrix block of the next iteration loop and the next column matrix block of the current column matrix block among the plurality of column matrix blocks as the current column matrix block of the next iteration loop; that is, the fifth row matrix block and the fifth column matrix block become the current row matrix block and the current column matrix block, and the process returns to step 204 and continues to execute. Each subsequent loop period proceeds in the same manner until the loop end condition is reached, at which point the iteration stops and a plurality of first calculation results and a plurality of second calculation results have been obtained.
Step 210, performing accumulation processing on the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
The output matrix is the result obtained by multiplying the column matrix blocks of the second matrix with the row matrix blocks of the first matrix at corresponding positions and accumulating the products. That is, the multiplication result of one column matrix block and one row matrix block equals one matrix slice of the output matrix, and accumulating the plurality of first calculation results and the plurality of second calculation results in time-sequence order yields the output matrix.
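Expressed as a formula (our notation): with the second matrix split along the K direction into column matrix blocks $A_1, \ldots, A_T$ and the first matrix split into row matrix blocks $B_1, \ldots, B_T$,

$$C = \sum_{t=1}^{T} A_t B_t$$

where each product $A_t B_t$ is a matrix slice of dimension m×n; the plurality of first calculation results and second calculation results are exactly these slices, accumulated in time-sequence order.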
The matrix multiplication process of this embodiment may be embodied as a subtask of accelerated convolution computation in deep neural network learning, and the finally obtained output matrix is the result corresponding to that convolution subtask. Specific application scenarios include image recognition, model training, model prediction, artificial intelligence and the like.
In this embodiment, the first matrix to be processed is split into a plurality of row matrix blocks and the second matrix to be processed is split into a plurality of column matrix blocks; the thread reads the current row matrix block and the current column matrix block while, in parallel, calculating the historical row matrix block and the historical column matrix block written into the first shared memory in the previous iteration loop, and, during this processing of the first shared memory, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory. Because these calculations and writes proceed concurrently with the global memory reads of the current row matrix block and the current column matrix block, the read latency of those reads is covered, the read latency of the global memory is effectively reduced, and a higher hardware utilization is achieved.
In this embodiment, data is read from the global memory into the shared memory, and the data is shared among the threads through the shared memory; since the read/write latency of the shared memory is smaller than that of the global memory, the amount of data read from the global memory is reduced. However, to realize more data sharing, more data needs to be stored in the shared memory, which leads to shared memory read/write latency that is not effectively covered and to high synchronization overhead. To solve this problem, this embodiment splits the row matrix blocks and the column matrix blocks written into the first shared memory and the second shared memory once again, obtaining a plurality of sub-row matrix blocks and a plurality of sub-column matrix blocks; by cyclically overlapping the multiplication of the sub-row matrix blocks and sub-column matrix blocks with the reads of the current sub-row matrix block and current sub-column matrix block, the read latency of the shared memory is effectively reduced and the hardware utilization is further improved.
In one embodiment, the process in which the thread calculates in parallel the historical row matrix block and historical column matrix block written in the previous iteration loop in the first shared memory to obtain the first calculation result is the same as the process in which the thread calculates the previous row matrix block and previous column matrix block written into the second shared memory to obtain the second calculation result; therefore, this embodiment only describes the processing of the first shared memory by the thread. Specifically, as shown in FIG. 8, the thread calculates, in parallel, the history row matrix block and the history column matrix block written in the previous iteration loop in the first shared memory to obtain a first calculation result, including:
Step 802, splitting a history row matrix block written in a previous iteration loop in the first shared memory into a plurality of sub-row matrix blocks, and splitting a history column matrix block into a plurality of sub-column matrix blocks.
The historical row matrix block and the historical column matrix block written in the first shared memory in the previous iteration loop are the row matrix block and the column matrix block that were the current row matrix block and the current column matrix block of the previous iteration loop when the thread wrote them into the first shared memory.
The dimension of the history row matrix block is k×n and the dimension of the history column matrix block is m×k, and the multiplication of one history column matrix block with one history row matrix block yields a matrix slice of dimension m×n of the output matrix C. The sub-block depth may take the value 1, 4, 16 or another value; in this embodiment the value 4 is used. The computer device splits the history row matrix block transversely into a plurality of sub-row matrix blocks B' of dimension 4×n, and splits the history column matrix block longitudinally into a plurality of sub-column matrix blocks A' of dimension m×4; the number of sub-row matrix blocks split from the history row matrix block is equal to the number of sub-column matrix blocks split from the history column matrix block. The multiplication of one sub-column matrix block A' of dimension m×4 with one sub-row matrix block B' of dimension 4×n yields a matrix slice of dimension m×n; performing matrix block multiplication and accumulation on each sub-row matrix block B' and the corresponding sub-column matrix block A' yields the matrix slice of dimension m×n of the output matrix C.
As shown in FIG. 9, in this embodiment the history row matrix block is split into 4 sub-row matrix blocks and the history column matrix block is split into 4 sub-column matrix blocks; the first shared memory and the second shared memory store the sub-column matrix blocks A' and sub-row matrix blocks B' that are read and prefetched, and the shared memory read latency is masked by the matrix multiplication time of the sub-column matrix block A' and the sub-row matrix block B', so that higher performance is obtained.
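A sketch of this inner loop, under our own simplifications (one fragment element per thread, SUB_K = 4 sub-block pairs per shared-memory tile, and an assumed tile layout; all names are ours), is given below; two register sets alternate so that the next sub-block fragments are read from the shared memory while the current pair is being multiplied:

```cuda
#define SUB_K 4  // number of sub-block pairs per shared-memory tile (assumed)

// Inner loop over sub-blocks with two alternating register sets: the next
// fragments are prefetched from shared memory while the current fragments
// are multiplied, masking the shared memory read latency.
__device__ void compute_tile_subblocks(const float sA[SUB_K][64],  // sub-column blocks of A (layout assumed)
                                       const float sB[SUB_K][64],  // sub-row blocks of B (layout assumed)
                                       int tm, int tn, float* acc)
{
    float fragA[2], fragB[2];
    fragA[0] = sA[0][tm];                 // read the first sub-column fragment
    fragB[0] = sB[0][tn];                 // read the first sub-row fragment

    for (int s = 0; s < SUB_K; ++s) {
        int cur = s & 1;
        if (s + 1 < SUB_K) {              // prefetch the next sub-block pair
            fragA[cur ^ 1] = sA[s + 1][tm];
            fragB[cur ^ 1] = sB[s + 1][tn];
        }
        *acc += fragA[cur] * fragB[cur];  // compute on the current pair
    }
}
```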
Optionally, taking the current row matrix block as the third row matrix block and the current column matrix block as the third column matrix block as an example, the history row matrix block written in the previous iteration loop in the first shared memory of the computer device is the first row matrix block and the history column matrix block is the first column matrix block; the first row matrix block is split into a plurality of sub-row matrix blocks B', and the first column matrix block is split into a plurality of sub-column matrix blocks A'.
Step 804, the thread reads the current sub-row matrix block and the current sub-column matrix block in the first shared memory, and the thread calculates the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration cycle in the first shared memory in parallel to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result.
The historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory are the sub-row matrix block and sub-column matrix block that the thread prefetched in the last iteration loop, i.e., the next sub-row matrix block of that iteration's current sub-row matrix block of the history row matrix block and the next sub-column matrix block of that iteration's current sub-column matrix block.
In some embodiments, if the current sub-column matrix block is the first sub-column matrix block of the history column matrix block in the first shared memory and the current sub-row matrix block is the first sub-row matrix block of the history row matrix block in the first shared memory, the data reading method further comprises the steps of:
Step 1, a thread reads a first sub-row matrix block of a history row matrix block and a first sub-column matrix block of a history column matrix block in a first shared memory.
FIG. 10 is a cycle execution timing diagram of the sub-row matrix blocks and the sub-column matrix blocks. In FIG. 10, the instruction timeline of the GPU processor runs from left to right and, from top to bottom, the rows show the global memory reads, the shared memory writes, the shared memory reads, the synchronization and check-memory-read-ready instructions, and the thread loop calculation process. Instructions of different functional categories are arranged in order in the vertical direction; the latency of instruction execution can be observed through the instruction "tail" areas, and the coverage of the global memory and shared memory read/write latencies can be clearly seen.
Step 2, after the read operation on the first shared memory completes, the thread reads the next sub-row matrix block and the next sub-column matrix block in the first shared memory and, in parallel, performs matrix block multiplication on the current sub-row matrix block and the current sub-column matrix block to obtain a first intermediate result.
Optionally, as shown in FIG. 10, after the thread initiates the read of the first sub-row matrix block and the first sub-column matrix block, it initiates the prefetch of the second sub-row matrix block and the second sub-column matrix block in the first shared memory; the thread waits for the read of the first sub-row matrix block and the first sub-column matrix block to complete, and then, while reading the second sub-row matrix block and the second sub-column matrix block, performs matrix multiplication on the first sub-row matrix block and the first sub-column matrix block in the first shared memory in parallel to obtain the first intermediate result.
In some embodiments, when the current sub-row matrix block is any sub-row matrix block after the second sub-row matrix block, and the current sub-column matrix block is any sub-column matrix block after the second sub-column matrix block, the thread reads the current sub-row matrix block and the current sub-column matrix block in the first shared memory, and the thread calculates, in parallel, the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result.
The next sub-row matrix block and the next sub-column matrix block read by the thread in the current loop serve as the historical sub-row matrix block and the historical sub-column matrix block of the next iteration.
Optionally, taking the current sub-row matrix block as the third sub-row matrix block and the current sub-column matrix block as the third sub-column matrix block, as shown in fig. 10, the thread reads the third sub-row matrix block and the third sub-column matrix block in the first shared memory and, once the reads of the historical sub-row matrix block and historical sub-column matrix block issued in the last iteration loop have completed, the thread calculates in parallel the second sub-row matrix block and the second sub-column matrix block read in the last iteration loop in the first shared memory to obtain the first intermediate result. After the thread performs the calculation operation on the first shared memory, the thread reads the fourth sub-row matrix block, reads the fourth sub-column matrix block, and calculates the third sub-row matrix block and the third sub-column matrix block in parallel to obtain a second intermediate result.
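This read-one-ahead pattern is, in effect, register-level double buffering: the multiplication of the sub-blocks read in the previous loop period overlaps the in-flight reads of the next sub-blocks. The following CUDA sketch shows the pattern only, reduced to one element per lane; the function and buffer names, the sub-block count NSUB, and the even block count are illustrative assumptions rather than the patent's kernel.

```cuda
// Hedged sketch of steps 802-804: each loop period reads two new sub-blocks
// from shared memory while multiplying the ones read in the previous period.
// "prev" holds the historical sub-blocks (first intermediate result); the
// freshly read ones become current (second intermediate result).
constexpr int NSUB = 8;  // number of sub-row/sub-column blocks (assumed even)

__device__ float subblockPipeline(const float* smemB,  // sub-row blocks B1'..Bn'
                                  const float* smemA,  // sub-column blocks A1'..An'
                                  int lane) {
    float acc = 0.0f;
    // First loop period: read and multiply sub-blocks 1, prefetch sub-blocks 2.
    float b0 = smemB[0 * 32 + lane], a0 = smemA[0 * 32 + lane];
    float prevB = smemB[1 * 32 + lane], prevA = smemA[1 * 32 + lane];
    acc += b0 * a0;

    for (int k = 2; k < NSUB; k += 2) {
        // Read sub-blocks k while multiplying the historical sub-blocks
        // read in the last loop period -> first intermediate result.
        float rB = smemB[k * 32 + lane], rA = smemA[k * 32 + lane];
        acc += prevB * prevA;
        // Read sub-blocks k+1 while multiplying sub-blocks k
        // -> second intermediate result.
        float sB = smemB[(k + 1) * 32 + lane], sA = smemA[(k + 1) * 32 + lane];
        acc += rB * rA;
        prevB = sB; prevA = sA;
    }
    acc += prevB * prevA;  // last sub-blocks (deferrable, cf. fig. 10)
    return acc;
}
```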
Step 806, take the next sub-row matrix block of the current sub-row matrix block among the plurality of sub-row matrix blocks as the current sub-row matrix block of the next iteration loop, and take the next sub-column matrix block of the current sub-column matrix block among the plurality of sub-column matrix blocks as the current sub-column matrix block of the next iteration loop; return to the step in which the thread reads the current sub-row matrix block and the current sub-column matrix block and continue execution, stopping when the loop end condition is reached, so as to obtain a plurality of first intermediate results and a plurality of second intermediate results.
The loop end condition may be that a loop interrupt instruction is received, or that every sub-row matrix block and every sub-column matrix block in the first shared memory and the second shared memory has been processed. In this embodiment, the plurality of sub-row matrix blocks are respectively labeled B1' to Bn' in splitting order, and the plurality of sub-column matrix blocks are respectively labeled A1' to An' in splitting order. When the current sub-row matrix block is any sub-row matrix block after the second sub-row matrix block and the current sub-column matrix block is any sub-column matrix block after the second sub-column matrix block, a loop period may be performed according to steps 802 and 804.
In some embodiments, if the number of sub-row matrix blocks and sub-column matrix blocks is even, the timing rules for reading and calculating the sub-row and sub-column matrix blocks are as shown in fig. 11, where the timing steps in the dashed frame are one loop period. When the current sub-row matrix block is the last sub-row matrix block and the current sub-column matrix block is the last sub-column matrix block, they fall in the last loop period; as can be seen from fig. 11, after the last sub-row matrix block and the last sub-column matrix block have passed through a loop period, the thread performs matrix block multiplication processing on them to obtain a second intermediate result.
In some embodiments, if the number of sub-row matrix blocks and sub-column matrix blocks is odd, the timing rules for reading and calculating the sub-row and sub-column matrix blocks are as shown in fig. 12, where the timing steps in the dashed frame are one loop period. When the current sub-row matrix block is the last sub-row matrix block and the current sub-column matrix block is the last sub-column matrix block, the thread reads the last sub-row matrix block and the last sub-column matrix block in the first shared memory, and the thread performs matrix block multiplication processing in parallel to obtain a second intermediate result; after the thread performs the calculation operation, the thread performs matrix block multiplication processing on the last sub-row matrix block and the last sub-column matrix block to obtain a first intermediate result.
Optionally, take the case where the number of sub-row matrix blocks and sub-column matrix blocks is even, with the current sub-row matrix block being the third sub-row matrix block and the current sub-column matrix block being the third sub-column matrix block as an example. After steps 802 and 804 are executed for the third sub-row matrix block and the third sub-column matrix block, as shown in fig. 11, the next sub-row matrix block of the current sub-row matrix block among the plurality of sub-row matrix blocks is taken as the current sub-row matrix block of the next iteration loop, and the next sub-column matrix block of the current sub-column matrix block among the plurality of sub-column matrix blocks is taken as the current sub-column matrix block of the next iteration loop; that is, since each loop period consumes two sub-blocks, the fifth sub-row matrix block becomes the current sub-row matrix block and the fifth sub-column matrix block becomes the current sub-column matrix block. Execution then returns to step 804 and continues, loop period after loop period, until the loop end condition is reached, yielding a plurality of first intermediate results and a plurality of second intermediate results.
Step 808, performing accumulation processing on the plurality of first intermediate results and the plurality of second intermediate results to obtain a first calculation result.
The result obtained by multiplying and accumulating the corresponding positions of the plurality of sub-row matrix blocks of the current row matrix block and the plurality of sub-column matrix blocks of the current column matrix block written in the first shared memory is the product result of the current row matrix block and the current column matrix block.
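In symbols, with the sub-row matrix blocks denoted B1' to Bn' and the sub-column matrix blocks A1' to An' as above, the accumulation of step 808 computes (a sketch; the product order follows fig. 16, where A' supplies the scalar coefficients and B' the vector rows):

$$C_{\text{block}} = \sum_{k=1}^{n} A_k' \, B_k',$$

where each term A'_k B'_k is exactly one first or second intermediate result of the sub-block loop.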
In this embodiment, the row matrix blocks and the column matrix blocks written into the first shared memory and the second shared memory are split again into a plurality of sub-row matrix blocks and a plurality of sub-column matrix blocks. Global memory data prefetching and shared memory data prefetching are implemented with the first and second shared memories and double register buffers in a two-level loop, and the global memory and shared memory read latencies are covered by the matrix multiplication of the row and column matrix blocks and the matrix multiplication of the sub-row and sub-column matrix blocks. This effectively reduces the read/write latency of the shared memory, reduces the number and cost of synchronization operations, and improves GEMM computing efficiency.
In some embodiments, as shown in fig. 13, after a shared memory write instruction is executed, the usual practice is to execute the synchronization instruction immediately and then execute the shared memory read instruction. On the one hand, this means the latency of the shared memory write instruction cannot be covered; on the other hand, after the synchronization instruction all threads have advanced to the same point, so when the shared memory is subsequently read concurrently, the latency cannot be masked by thread-level parallelism. To solve this problem, as shown in fig. 14, this embodiment executes the synchronization instruction before the second-to-last sub-row matrix block and sub-column matrix block of the first shared memory undergo matrix block multiplication processing, so as to ensure that, when the matrix multiplication loop of the second shared memory starts, the row matrix block and the column matrix block being read have both been written into the first shared memory. Executing the matrix multiplication processing after the synchronization instruction reduces the contention for shared hardware units (such as the data reading unit and the shared memory unit) caused by synchronization, reduces thread-level parallel latency more effectively, and enables parallel reading and writing. As shown in fig. 10, the execution timing of the synchronization instruction is specifically:
If the current sub-row matrix block is the last sub-row matrix block of the historical row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the historical column matrix block in the first shared memory, then before the thread calculates, in parallel, the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory, the threads reading the sub-row matrix blocks and the sub-column matrix blocks in the first shared memory are synchronized, so as to ensure that the threads read the sub-row matrix blocks and the sub-column matrix blocks synchronously; and the threads writing the sub-row matrix blocks and the sub-column matrix blocks into the first shared memory are synchronized, so as to ensure that the threads write the sub-row matrix blocks and the sub-column matrix blocks synchronously.
In this embodiment, a synchronization instruction is executed before the matrix block multiplication processing of the second-to-last sub-row matrix block and sub-column matrix block of the first shared memory, so that when the matrix multiplication loop of the second shared memory starts, the row matrix block and the column matrix block being read have both been written into the first shared memory. Performing the matrix multiplication processing after the synchronization instruction reduces the contention for shared hardware units (such as the data reading unit and the shared memory unit) caused by synchronization, reduces thread-level parallel latency more effectively, and enables parallel reading and writing.
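A minimal CUDA sketch of this barrier placement follows. It assumes the register prefetching described above: by the time the hoisted __syncthreads() executes, the reads of the final sub-blocks have already been issued into registers, so the multiplications after the barrier no longer touch this shared buffer. NSUB and all names are illustrative assumptions, not the patent's instruction sequence.

```cuda
// Hedged sketch of fig. 14: the barrier is hoisted before the second-to-last
// sub-block multiply instead of sitting right after the shared-memory write.
// All shared-memory reads of this buffer precede the barrier; the remaining
// multiplies use registers only, so the buffer may be rewritten immediately.
constexpr int NSUB = 8;  // number of sub-blocks (assumed)

__device__ float earlyBarrierLoop(const float* smemB, const float* smemA,
                                  int lane) {
    float acc = 0.0f;
    float b = smemB[lane], a = smemA[lane];      // sub-blocks 1 into registers
    for (int k = 0; k < NSUB; ++k) {
        float bn = 0.0f, an = 0.0f;
        if (k + 1 < NSUB) {                      // register prefetch of k+1
            bn = smemB[(k + 1) * 32 + lane];
            an = smemA[(k + 1) * 32 + lane];
        }
        if (k == NSUB - 2) __syncthreads();      // early barrier (fig. 14)
        acc += b * a;                            // multiply current sub-blocks
        b = bn; a = an;
    }
    return acc;
}
```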
In some embodiments, as shown in fig. 10, the shared memory read latency is not covered during the matrix block multiplication processing of the last sub-row matrix block and the last sub-column matrix block of the first shared memory, and no other computation covers the latency of reading the first sub-row matrix block and the first sub-column matrix block in the first shared memory, so the latency of those reads is wasted; it can also be seen in figs. 11 and 12 that the shared memory read latency is not covered while the last sub-row matrix block and the last sub-column matrix block undergo matrix block multiplication processing. To solve this problem, when the current sub-row matrix block is the last sub-row matrix block of the historical row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the historical column matrix block in the first shared memory, after the thread calculates the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory, the thread's matrix block multiplication processing of the last sub-row matrix block and the last sub-column matrix block is deferred until the thread calculates the previous row matrix block and the previous column matrix block written into the second shared memory, that is, until the thread is reading in the second shared memory the first sub-row matrix block of the previous row matrix block of the current row matrix block and the first sub-column matrix block of the previous column matrix block of the current column matrix block. In other words, as shown by the dashed line in fig. 10, the final matrix multiplication is delayed so that it executes while the first sub-row matrix block and the first sub-column matrix block in the second shared memory are being read, covering their read latency; in this way, complete coverage of the shared memory read latency is achieved over the whole loop process.
It should be noted that: the calculation of the second calculation result in the second shared memory, and the calculation process of the thread in the second shared memory is the same as the calculation process of the thread in the first shared memory, as shown in fig. 6, after the thread calculates the history row matrix block and the history column matrix block written in the last iteration loop in the first shared memory, the next calculation of the thread calculates the previous row matrix block and the previous column matrix block of the current row matrix block in the second shared memory. That is, after the calculation of the first shared memory is completed, the thread calculates the second shared memory, so in this embodiment, the calculation process of the last column matrix block of the history row matrix block and the last sub-column matrix block of the history column matrix block in the first shared memory is delayed to the calculation process of the second shared memory.
It should be noted that: the calculation process of the thread on the second shared memory is the same as that of the above embodiment, that is, the previous row matrix block of the current row matrix block is split into a plurality of sub-row matrix blocks, the previous column matrix block of the current column matrix block is split into a plurality of sub-column matrix blocks, and in the process of reading the sub-row matrix block of the previous row matrix block of the current row matrix block and the sub-column matrix block of the previous column matrix block of the current column matrix block, the thread calculates the sub-row matrix block and the sub-column matrix block, thereby obtaining the second calculation result. That is, the calculation of the last column matrix block of the history row matrix block and the last sub-column matrix block of the history column matrix block in the first shared memory is delayed until the first sub-row matrix block of the previous row matrix block and the first sub-column matrix block of the previous column matrix block of the current column matrix block in the second shared memory.
It should be noted that, except for the thread's very last calculation, which has no subsequent calculation to defer into, the thread always defers the matrix multiplication of the last sub-row matrix block and the last sub-column matrix block to the start of its next calculation, regardless of whether it is calculating a historical row matrix block and historical column matrix block or the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block.
Optionally, take the current row matrix block as the third row matrix block and the current column matrix block as the third column matrix block. The thread calculates the first row matrix block and the first column matrix block written in the last iteration loop of the first shared memory; during this calculation, the first row matrix block is split into a plurality of sub-row matrix blocks and the first column matrix block into a plurality of sub-column matrix blocks, and before the thread calculates the last sub-row matrix block and the last sub-column matrix block, that calculation is deferred until the thread reads, in the second shared memory, the first sub-row matrix block of the second row matrix block and the first sub-column matrix block of the second column matrix block, so as to cover the latency of the thread reading those first sub-blocks in the second shared memory.
This embodiment executes a delay instruction that postpones the matrix block multiplication of the last sub-row matrix block and the last sub-column matrix block of the first shared memory until the thread is reading the first sub-row matrix block and the first sub-column matrix block in the second shared memory, so that the read latency of those first sub-blocks is covered and complete coverage of the shared memory read latency is achieved.
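A CUDA sketch of this buffer handoff is shown below. In the full pipeline the last sub-blocks would already sit in registers; here they are loaded at the top for self-containment. The function and parameter names, and the one-element-per-lane layout, are illustrative assumptions.

```cuda
// Hedged sketch of the deferred final multiply (dashed line in fig. 10):
// the reads of buffer 1's first sub-blocks are issued first, and the
// deferred multiply of buffer 0's last sub-blocks executes while those
// reads are in flight, covering their latency.
__device__ void bufferHandoff(const float* smem0B, const float* smem0A,
                              const float* smem1B, const float* smem1A,
                              float& acc, float& firstB, float& firstA,
                              int lane, int nsub) {
    float lastB = smem0B[(nsub - 1) * 32 + lane];  // buffer 0, last sub-blocks
    float lastA = smem0A[(nsub - 1) * 32 + lane];
    firstB = smem1B[lane];                         // buffer 1, first sub-blocks:
    firstA = smem1A[lane];                         // reads issued before the
    acc += lastB * lastA;                          // deferred multiply, which
                                                   // overlaps their latency
}
```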
In some embodiments, the GPU hardware executes instructions in a single-instruction multiple-data (SIMD) manner, with 32 threads organized together as a thread bundle. Each thread bundle has a number of vector registers CRF (Common Register File) of length 32, and a thread within the bundle cannot access another thread's data. To enable fast data sharing within a thread bundle, each thread bundle of the GPU has a plurality of scalar registers SRF (Scalar Register File), through which data is shared quickly within the bundle; some GPUs can also broadcast data to other threads through hardware broadcast instructions. To fully exploit these hardware characteristics for fast data sharing and efficient sub-matrix multiplication, in this embodiment each row of elements of the sub-row matrix block is registered into a vector register, the single elements of one row of the sub-column matrix block are respectively registered into scalar registers, and the single elements in the scalar registers are respectively multiplied and accumulated with the rows of elements in the vector registers to obtain the multiplication result of the sub-row matrix block and the sub-column matrix block, i.e., the first intermediate result or the second intermediate result. The calculation processes of the first and second intermediate results are the same, so this embodiment only describes the calculation of the first intermediate result. Specifically, as shown in fig. 15, the thread calculates, in parallel, the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory to obtain a first intermediate result, including:
Step 1502, a thread registers each row element of a history subrow matrix block read in a last iteration loop in a first shared memory into a vector register; the thread parallel registers a single element of each row of elements of the historic subcolumn matrix block read in the last iteration loop in the first shared memory into one scalar register.
The historical sub-column matrix block also needs to be read into the vector registers and then moved to the scalar registers. As shown in fig. 16, in this embodiment the sub-row matrix block B' has length 32 and width 4, and the sub-column matrix block A' has length 4 and width 32; the 4 elements of one row of the sub-column matrix block A' are sequentially registered into scalar registers SR0 to SR3, and the 4 rows of elements of the sub-row matrix block B' are registered into vector registers R1 to R4.
Alternatively, a row of 4 consecutive single elements of the sub-column matrix block A' may be sequentially shared to the scalar registers SR0 to SR3 in the GPU by a MOVQLN instruction.
In step 1504, the thread multiplies and accumulates the single element in the scalar register with each row of elements in the vector register to obtain a first intermediate result.
As shown in fig. 16, the single elements stored in the scalar registers SR0 to SR3 are each multiplied by one row of elements registered in the vector registers, and the products are then accumulated, so as to obtain the matrix multiplication result of the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory, i.e., the first intermediate result.
Optionally, the GPU processor implements the multiply-accumulate calculation R10 = R10 + SR0×R1 + SR1×R2 + SR2×R3 + SR3×R4 with an FMADC instruction plus a Repeat3 modifier. The FMADC instruction is: FMADC.RP3 R10, R10, SR0, R1; the Repeat3 modifier instruction is: MOVQLN SR0, R6, sqd0x0, qdn0x1. The FMADC instruction and the MOVQLN instruction can be fused into one instruction, hiding the cost of the MOVQLN instruction and increasing the proportion of effective calculation instructions.
In this embodiment, each row of elements of the history sub-row matrix block read in the last iteration loop in the first shared memory is registered into a vector register; and registering a single element of each row of elements of the history subcolumn matrix block read in the last iteration cycle in the first shared memory into one scalar register, and realizing quick data sharing in the thread bundle through a scalar register SRF in the GPU thread bundle, thereby realizing efficient submatrix multiplication.
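On GPUs without architectural scalar registers, the same intra-bundle sharing pattern can be approximated with warp shuffle broadcasts. The following CUDA sketch is an analog, not the patent's FMADC/MOVQLN implementation: the lane layout, the function name, and the use of __shfl_sync are assumptions, while the 32×4 / 4×32 shapes and the R1..R4 / SR0..SR3 roles follow fig. 16.

```cuda
#include <cuda_runtime.h>

// Hedged analog of figs. 15-16: each of the 32 lanes holds one column of the
// four rows of B' (vector-register role); the four elements of one row of A'
// are broadcast to all lanes via __shfl_sync (scalar-register role, standing
// in for the patent's SRF/MOVQLN mechanism). B' is 4 rows x 32 columns,
// A' is 32 rows x 4 columns; each lane accumulates one element of a 32-wide
// output row per processed row of A'.
__device__ float subblockMultiply(const float* smemB,  // B': 4 x 32, row-major
                                  const float* smemA,  // A': 32 x 4, row-major
                                  int aRow,            // which row of A'
                                  float acc) {
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Vector-register role: the four rows of B', one element per lane (R1..R4).
    float r1 = smemB[0 * 32 + lane];
    float r2 = smemB[1 * 32 + lane];
    float r3 = smemB[2 * 32 + lane];
    float r4 = smemB[3 * 32 + lane];

    // Lanes 0..3 hold the four elements of A' row aRow; broadcast them (SR0..SR3).
    float a = (lane < 4) ? smemA[aRow * 4 + lane] : 0.0f;
    float sr0 = __shfl_sync(full, a, 0);
    float sr1 = __shfl_sync(full, a, 1);
    float sr2 = __shfl_sync(full, a, 2);
    float sr3 = __shfl_sync(full, a, 3);

    // Multiply-accumulate, cf. R10 = R10 + SR0*R1 + SR1*R2 + SR2*R3 + SR3*R4.
    acc += sr0 * r1 + sr1 * r2 + sr2 * r3 + sr3 * r4;
    return acc;
}
```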
In one embodiment, as shown in fig. 17, the data reading method specifically includes the following steps:
Step 1, responding to a data reading request, splitting a first matrix to be processed into a plurality of row matrix blocks, and splitting a second matrix to be processed into a plurality of column matrix blocks.
Step 2, the thread reads the first row matrix block and the first column matrix block.
Step 3, after the thread completes the read operation, the thread writes the first row matrix block and the first column matrix block into the first shared memory; while the thread is processing the first shared memory, the thread reads the second row matrix block and the second column matrix block, and thread synchronization is performed.
And step 4, after the thread executes the writing operation, the thread reads the current row matrix block and the current column matrix block.
Step 5, while the thread is reading the current row matrix block and the current column matrix block, the historical row matrix block written into the first shared memory in the last iteration loop is split into a plurality of sub-row matrix blocks, and the historical column matrix block is split into a plurality of sub-column matrix blocks.
Step 6, the thread reads the first sub-row matrix block of the historical row matrix block and the first sub-column matrix block of the historical column matrix block in the first shared memory; once those reads complete, the thread reads the next sub-row matrix block and the next sub-column matrix block in the first shared memory and, in parallel, performs matrix block multiplication processing on the first sub-row matrix block and the first sub-column matrix block to obtain a first intermediate result.
Step 7, after the thread performs the matrix block multiplication processing on the first sub-row matrix block and the first sub-column matrix block in parallel, the thread reads the current sub-row matrix block and the current sub-column matrix block in the first shared memory, and the thread calculates, in parallel, the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result; the current sub-row matrix block is any sub-row matrix block after the second sub-row matrix block, and the current sub-column matrix block is any sub-column matrix block after the second sub-column matrix block.
And 8, taking the next sub-row matrix block of the current sub-row matrix block in the sub-row matrix blocks as the current sub-row matrix block of the next iteration cycle, taking the next sub-column matrix block of the current sub-column matrix block in the sub-column matrix blocks as the current sub-column matrix block of the next iteration cycle, returning to the step of reading the current sub-row matrix block and the current sub-column matrix block by the thread, continuing to execute until the cycle end condition is reached, stopping, and obtaining a plurality of first intermediate results and a plurality of second intermediate results. And accumulating the plurality of first intermediate results and the plurality of second intermediate results to obtain a first calculation result.
Step 9, in the process of processing the first shared memory by the thread, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block into the second shared memory; the current row matrix block is any one row matrix block after the second row matrix block, and the current column matrix block is any one column matrix block after the second column matrix block.
Step 10, after the thread finishes the writing operation on the second shared memory, the thread reads the next row matrix block of the current row matrix block, reads the next column matrix block of the current column matrix block, and calculates the previous row matrix block and the previous column matrix block written into the second shared memory to obtain a second calculation result; and in the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory.
And step 11, taking the next row of matrix blocks of the current row of matrix blocks in the row matrix blocks as the current row matrix block of the next iteration cycle, taking the next column of matrix blocks of the current column of matrix blocks in the column matrix blocks as the current column matrix block of the next iteration cycle, returning to the step of reading the current row matrix blocks and the current column matrix blocks by the threads, continuing to execute until the cycle end condition is reached, stopping, and obtaining a plurality of first calculation results and a plurality of second calculation results.
And step 12, accumulating the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
In this embodiment, matrix blocking is adopted, and data is shared through shared memory and register-level data sharing, reducing reads of memory data; for the global memory read latency and the shared memory read/write latency, a two-level loop is adopted and the latency is covered by matrix multiplication time. This achieves a good effect: the hardware utilization reaches 95%, and the effective computing power reaches 90% of the GPU's peak computing power.
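Putting steps 1 to 12 together, the control flow resembles a classic double-buffered GEMM main loop. The following CUDA skeleton sketches that structure only; the tile sizes, the loadBlock helper, the indexing, and the simplified per-lane arithmetic are all illustrative assumptions, and the patent's actual kernel, instruction scheduling, and register-level sub-block pipeline (sketched earlier) are not reproduced here.

```cuda
// Hedged skeleton of steps 1-12: two shared-memory buffers alternate as
// compute/prefetch targets for the row/column matrix blocks, and each
// buffer is consumed sub-block by sub-block. Names and sizes are assumed.
constexpr int TILE = 32;   // matrix block edge (assumed)
constexpr int KSUB = 8;    // sub-blocks per matrix block (assumed)

__device__ void loadBlock(float* dst, const float* src, int blk, int lane) {
    for (int r = 0; r < TILE; ++r)               // indexing illustrative only
        dst[r * TILE + lane] = src[(blk * TILE + r) * TILE + lane];
}

__global__ void gemmPipelined(const float* A, const float* B, float* C, int K) {
    __shared__ float bufA[2][TILE * TILE];       // column matrix blocks
    __shared__ float bufB[2][TILE * TILE];       // row matrix blocks
    int lane = threadIdx.x;                      // launched with 32 threads
    float acc = 0.0f;
    int nBlocks = K / TILE;

    loadBlock(bufA[0], A, 0, lane);              // steps 2-3: first blocks
    loadBlock(bufB[0], B, 0, lane);
    __syncthreads();

    for (int blk = 1; blk <= nBlocks; ++blk) {
        int cur = (blk - 1) & 1, nxt = blk & 1;
        if (blk < nBlocks) {                     // steps 4, 9-10: prefetch next
            loadBlock(bufA[nxt], A, blk, lane);
            loadBlock(bufB[nxt], B, blk, lane);
        }
        for (int k = 0; k < KSUB; ++k)           // steps 5-8: sub-block loop
            acc += bufB[cur][k * TILE + lane] * bufA[cur][k * TILE + lane];
        // Barrier at loop end keeps this sketch race-free; the patent
        // instead hoists it into the sub-block loop (figs. 10 and 14),
        // which is safe there because of the register prefetching.
        __syncthreads();
    }
    C[blockIdx.x * TILE + lane] = acc;           // step 12 (simplified)
}
```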
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor is the order of these sub-steps or stages necessarily sequential, as they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides a data reading device for realizing the above related data reading method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of the embodiment of one or more data reading devices provided below may be referred to the limitation of the data reading method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 18, there is provided a data reading apparatus including:
the partitioning module 100 is configured to split a first matrix to be processed into a plurality of row matrix blocks and a second matrix to be processed into a plurality of column matrix blocks in response to a data reading request;
The first reading module 200 is configured to read the current row matrix block and the current column matrix block by using a thread, and the thread calculates, in parallel, the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory, so as to obtain a first calculation result; during processing of the thread for the first shared memory, the thread writes the previous row matrix block of the current row matrix block and the previous column matrix block of the current column matrix block to the second shared memory.
The second reading module 300 is configured to read, after the thread performs the writing operation on the second shared memory, a next row of matrix blocks of the current row of matrix blocks by the thread, a next column of matrix blocks of the current column of matrix blocks by the thread, and calculate a previous row of matrix blocks and a previous column of matrix blocks written into the second shared memory by the thread, so as to obtain a second calculation result; and in the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory.
The iteration module 400 uses the next row matrix block of the current row matrix block of the row matrix blocks as the current row matrix block of the next iteration loop, uses the next column matrix block of the current column matrix block of the column matrix blocks as the current column matrix block of the next iteration loop, and returns to the step of the thread for reading the current row matrix block and the current column matrix block to continue execution until the loop end condition is reached, and stops to obtain a plurality of first calculation results and a plurality of second calculation results.
The output module 500 is configured to perform accumulation processing on the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
In one embodiment, if the current column matrix block is the first column matrix block and the current row matrix block is the first row matrix block, the first reading module 200 is further configured to read the first row matrix block and the first column matrix block by the thread; after the thread performs the read operation, the thread writes the first row matrix block and the first column matrix block into the first shared memory; while the thread is processing the first shared memory, the thread reads the next row matrix block and the next column matrix block, and thread synchronization is performed.
In one embodiment, the first reading module 200 is further configured to split the history row matrix block written in the previous iteration loop in the first shared memory into a plurality of sub-row matrix blocks, and split the history column matrix block into a plurality of sub-column matrix blocks;
The thread reads a current sub-row matrix block and a current sub-column matrix block in the first shared memory, and the thread calculates a history sub-row matrix block and a history sub-column matrix block which are read in the last iteration cycle in the first shared memory in parallel to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result;
Taking the next sub-row matrix block of the current sub-row matrix block in the sub-row matrix blocks as the current sub-row matrix block of the next iteration cycle, taking the next sub-column matrix block of the current sub-column matrix block in the sub-column matrix blocks as the current sub-column matrix block of the next iteration cycle, and returning to the step of reading the current sub-row matrix block and the current sub-column matrix block by the thread to continue execution until stopping when the cycle end condition is reached, and obtaining a plurality of first intermediate results and a plurality of second intermediate results;
And accumulating the plurality of first intermediate results and the plurality of second intermediate results to obtain a first calculation result.
In one embodiment, if the current sub-column matrix block is the first sub-column matrix block of the historical column matrix block in the first shared memory and the current sub-row matrix block is the first sub-row matrix block of the historical row matrix block in the first shared memory, the first reading module 200 is further configured to read the first sub-row matrix block of the historical row matrix block and the first sub-column matrix block of the historical column matrix block in the first shared memory;
When the thread executes the reading operation of the first shared memory, the thread reads the next sub-row matrix block and the next sub-column matrix block in the first shared memory, and the thread performs matrix block multiplication processing on the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a first intermediate result.
In one embodiment, if the current sub-row matrix block is the last sub-row matrix block of the historical row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the historical column matrix block in the first shared memory, the first reading module 200 is further configured to synchronize, before the thread calculates in parallel the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory, the threads that read the sub-row matrix blocks and sub-column matrix blocks in the first shared memory and the threads that write the sub-row matrix blocks and sub-column matrix blocks into the first shared memory.
In one embodiment, if the current sub-row matrix block is the last sub-row matrix block of the historical row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the historical column matrix block in the first shared memory, the first reading module 200 is further configured to defer, after the thread calculates in parallel the historical sub-row matrix block and the historical sub-column matrix block read in the last iteration loop in the first shared memory, the thread's matrix block multiplication processing of the last sub-row matrix block and the last sub-column matrix block until the thread, while calculating the previous row matrix block and the previous column matrix block written into the second shared memory, reads in the second shared memory the first sub-row matrix block of the previous row matrix block of the current row matrix block and the first sub-column matrix block of the previous column matrix block of the current column matrix block.
In one embodiment, the first reading module 200 is further configured to: registering each row of elements of the historical sub-row matrix block read in the last iteration loop in the first shared memory into a vector register by adopting a thread; the thread parallel registers a single element of each row of elements of the history subcolumn matrix block read in the last iteration loop in the first shared memory into a scalar register;
And multiplying and accumulating the single element in the scalar register with each row of element in the vector register by adopting a thread to obtain a first intermediate result.
The respective modules in the above-described data reading apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be an image processor, the internal structure of which may be as shown in fig. 19. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data reading method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 19 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps in the method embodiments.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of the method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magneto-resistive random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (PHASE CHANGE Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method of reading data, the method comprising:
Responding to a data reading request, splitting a first matrix to be processed into a plurality of row matrix blocks, and splitting a second matrix to be processed into a plurality of column matrix blocks;
The thread reads the current row matrix block and the current column matrix block, and the thread calculates the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory in parallel to obtain a first calculation result; in the process of processing the first shared memory by the thread, the thread writes the matrix block of the previous row of the current row matrix block and the matrix block of the previous column of the current column matrix block into a second shared memory;
After the thread finishes the writing operation on the second shared memory, the thread reads the next row of matrix blocks of the current row of matrix blocks, reads the next column of matrix blocks of the current column of matrix blocks, and calculates the previous row of matrix blocks and the previous column of matrix blocks written into the second shared memory to obtain a second calculation result; during the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory;
Taking the next row matrix block of the current row matrix block in the row matrix blocks as the current row matrix block of the next iteration cycle, taking the next column matrix block of the current column matrix block in the column matrix blocks as the current column matrix block of the next iteration cycle, and returning to the step of reading the current row matrix block and the current column matrix block by the thread to continue execution until stopping when the cycle end condition is reached, so as to obtain a plurality of first calculation results and a plurality of second calculation results;
and accumulating the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
2. The method of claim 1, wherein if the current column matrix block is the first column matrix block and the current row matrix block is the first row matrix block, the method further comprises:
The thread reads a first row matrix block and a first column matrix block;
After the thread performs the read operation, the thread writes the first row matrix block and the first column matrix block into a first shared memory; and in the process of processing the first shared memory by the thread, the thread reads the matrix blocks in the next row and the matrix blocks in the next column, and synchronously processes the thread.
3. The method of claim 1, wherein the computing, by the thread, the historical row matrix block and the historical column matrix block written in the previous iteration loop in the first shared memory in parallel to obtain a first computation result includes:
Splitting a history row matrix block written in a previous iteration cycle in the first shared memory into a plurality of sub-row matrix blocks, and splitting a history column matrix block into a plurality of sub-column matrix blocks;
The thread reads a current sub-row matrix block and a current sub-column matrix block in a first shared memory, and the thread calculates a history sub-row matrix block and a history sub-column matrix block which are read in the last iteration cycle in the first shared memory in parallel to obtain a first intermediate result; after the thread performs the calculation operation on the first shared memory, the thread reads the next sub-row matrix block of the current sub-row matrix block, reads the next sub-column matrix block of the current sub-column matrix block, and calculates the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a second intermediate result;
Taking the next sub-row matrix block of the current sub-row matrix block in the sub-row matrix blocks as the current sub-row matrix block of the next iteration cycle, taking the next sub-column matrix block of the current sub-column matrix block in the sub-column matrix blocks as the current sub-column matrix block of the next iteration cycle, returning to the step of reading the current sub-row matrix block and the current sub-column matrix block by the thread, continuing to execute until stopping when the cycle end condition is reached, and obtaining a plurality of first intermediate results and a plurality of second intermediate results;
And accumulating the plurality of first intermediate results and the plurality of second intermediate results to obtain a first calculation result.
4. The method of claim 3, wherein if the current sub-column matrix block is a first sub-column matrix block of a history column matrix block in the first shared memory and the current sub-row matrix block is a first sub-row matrix block of a history row matrix block in the first shared memory, the method further comprises:
The thread reads a first sub-row matrix block of a history row matrix block and a first sub-column matrix block of a history column matrix block in a first shared memory;
When the thread executes the reading operation of the first shared memory, the thread reads the next sub-row matrix block and the next sub-column matrix block in the first shared memory, and the thread performs matrix block multiplication processing on the current sub-row matrix block and the current sub-column matrix block in parallel to obtain a first intermediate result.
5. The method of claim 3, wherein if the current sub-row matrix block is the last sub-row matrix block of the history row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the history column matrix block in the first shared memory, before the thread calculates, in parallel, the history sub-row matrix block and the history sub-column matrix block read in the last iteration loop in the first shared memory, the method further comprises:
Synchronizing the threads for reading the sub-row matrix blocks and the sub-column matrix blocks in the first shared memory; and synchronizing the threads for writing the sub-row matrix blocks and the sub-column matrix blocks into the first shared memory.
6. The method of claim 3, wherein if the current sub-row matrix block is the last sub-row matrix block of the history row matrix block in the first shared memory and the current sub-column matrix block is the last sub-column matrix block of the history column matrix block in the first shared memory, after the thread calculates, in parallel, the history sub-row matrix block and the history sub-column matrix block read in the last iteration loop in the first shared memory, the method further comprises:
Deferring the thread's matrix block multiplication processing of the last sub-row matrix block and the last sub-column matrix block until, when the thread calculates the previous row matrix block and the previous column matrix block written into the second shared memory, the thread reads in the second shared memory the first sub-row matrix block of the previous row matrix block of the current row matrix block and the first sub-column matrix block of the previous column matrix block of the current column matrix block.
7. The method of claim 3, wherein the thread calculates, in parallel, a history sub-row matrix block and a history sub-column matrix block read in a last iteration loop in the first shared memory to obtain a first intermediate result, including:
The thread registers each row of elements of the history subrow matrix block read in the last iteration loop in the first shared memory into a vector register; the thread parallel registers a single element of each row of elements of the history subcolumn matrix block read in the last iteration loop in the first shared memory into a scalar register;
And the thread respectively multiplies and accumulates the single element in the scalar register with each row of elements in the vector register to obtain a first intermediate result.
8. A data reading apparatus, the apparatus comprising:
The block dividing module is used for responding to the data reading request, dividing a first matrix to be processed into a plurality of row matrix blocks, and dividing a second matrix to be processed into a plurality of column matrix blocks;
The first reading module reads the current row matrix block and the current column matrix block by adopting threads, and the threads calculate the historical row matrix block and the historical column matrix block which are written in by the last iteration cycle in the first shared memory in parallel to obtain a first calculation result; in the process of processing the first shared memory by the thread, the thread writes the matrix block of the previous row of the current row matrix block and the matrix block of the previous column of the current column matrix block into a second shared memory;
The second reading module is used for reading a next row of matrix blocks of the current row of matrix blocks and a next column of matrix blocks of the current column of matrix blocks after the thread executes the writing operation on the second shared memory, and the thread calculates a previous row of matrix blocks and a previous column of matrix blocks written into the second shared memory to obtain a second calculation result; during the process of processing the second shared memory by the thread, the thread writes the current row matrix block and the current column matrix block read by the thread into the first shared memory;
the iteration module is used for taking the next row matrix block of the current row matrix block in the row matrix blocks as the current row matrix block of the next iteration cycle, taking the next column matrix block of the current column matrix block in the column matrix blocks as the current column matrix block of the next iteration cycle, returning to the step of reading the current row matrix block and the current column matrix block by the thread, continuing to execute the step until the cycle end condition is reached, stopping the step, and obtaining a plurality of first calculation results and a plurality of second calculation results;
the output module is used for accumulating the plurality of first calculation results and the plurality of second calculation results to obtain an output matrix.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202211434973.7A 2022-11-16 2022-11-16 Data reading method, apparatus, computer device, storage medium, and program product Pending CN118051168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211434973.7A CN118051168A (en) 2022-11-16 2022-11-16 Data reading method, apparatus, computer device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211434973.7A CN118051168A (en) 2022-11-16 2022-11-16 Data reading method, apparatus, computer device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN118051168A true CN118051168A (en) 2024-05-17

Family

ID=91050635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211434973.7A Pending CN118051168A (en) 2022-11-16 2022-11-16 Data reading method, apparatus, computer device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN118051168A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination