CN117112208A - SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication - Google Patents

SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication Download PDF

Info

Publication number
CN117112208A
Authority
CN
China
Prior art keywords
gpu
batch
spmv
algorithm
thread block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311048985.0A
Other languages
Chinese (zh)
Inventor
曾广森
邹毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202311048985.0A priority Critical patent/CN117112208A/en
Publication of CN117112208A publication Critical patent/CN117112208A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/48Indexing scheme relating to G06F9/48
    • G06F2209/483Multiproc
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The application discloses a method, a system, a device and a medium for realizing SpMV based on GPU asynchronous replication, belonging to the technical field of high-performance numerical computation. The method comprises the following steps: S1, dividing the non-zero elements in a sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch; S2, distributing the batches to the GPU thread blocks according to a preset batch distribution algorithm; S3, each GPU thread block processes its respective batches. The application uses memcpy_async to realize the loading of batches, which releases thread resources from the data copy task to execute other computation tasks, achieves faster data copying between global memory and shared memory, and makes full use of the shared-memory copy time to perform computation, thereby shortening the overall time cost of SpMV and improving performance.

Description

SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication
Technical Field
The application relates to the technical field of high-performance numerical computation, in particular to a method, a system, a device and a medium for realizing SpMV based on GPU asynchronous replication.
Background
Sparse matrix-vector multiplication (SpMV) is a common operation in computer science and numerical computing, with wide application in many fields such as large-scale linear systems, graph analysis and machine learning. In these applications, the SpMV operation is often a performance bottleneck. To reduce memory overhead and improve memory access efficiency, various sparse matrix storage formats have been proposed in the literature, such as COO, CSR, ELL, HYB and CSR5. Among them, the CSR storage format is the most widely used and is also the default storage format; other storage formats inevitably involve format conversion, which causes considerable performance overhead. Improving the performance of the SpMV algorithm based on the CSR storage format is therefore of great significance.
The graphics processing unit (GPU) has the characteristics of high throughput and high parallelism and is an attractive choice in the field of scientific computing. Many studies of GPU-based SpMV algorithms show that, due to the irregularity of the CSR format, the GPU often issues non-coalesced memory accesses, which cause significant time overhead. To address this problem, Joseph L. Greathouse and Mayank Daga proposed CSR-Stream. CSR-Stream first divides the non-zero elements evenly into multiple chunks, each of which is assigned to a GPU thread block for processing. The GPU thread block copies the non-zero elements of its chunk into the block's shared memory, then the threads in the thread block read the non-zero elements from shared memory, perform row-wise accumulation and summation, and finally write the results back to GPU global memory. In this method, two main memory access behaviours actually occur on the GPU: coalesced accesses to global memory, followed by non-coalesced accesses to shared memory. This is much more efficient than issuing non-coalesced accesses directly to global memory, because the access latency of global memory is very large and the cost of non-coalesced global memory accesses is much greater than that of shared memory.
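For reference, the non-coalesced pattern described above is what a minimal row-per-thread ("scalar") CSR SpMV kernel produces. The following baseline is an illustration added to this text, not code from the patent:

// Baseline row-per-thread CSR SpMV: y = alpha*A*x + beta*y.
// Each thread walks one matrix row, so neighbouring threads read values and
// column_indices at unrelated offsets and the global-memory loads are not coalesced.
__global__ void csr_spmv_scalar(int m,
                                const int*    __restrict__ row_offsets,
                                const int*    __restrict__ column_indices,
                                const double* __restrict__ values,
                                const double* __restrict__ x,
                                double alpha, double beta,
                                double*       __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    double sum = 0.0;
    for (int j = row_offsets[row]; j < row_offsets[row + 1]; ++j)
        sum += values[j] * x[column_indices[j]];   // scattered, non-coalesced reads

    y[row] = alpha * sum + beta * y[row];
}

CSR-Stream avoids this pattern by first staging the non-zero elements of a chunk into shared memory with coalesced global loads, and only then performing the irregular, row-wise accesses on shared memory.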
CSR-Stream still has a notable disadvantage. Although coalesced global memory access improves memory access efficiency, the data copy between global memory and shared memory still takes a lot of time, and thread resources are occupied during the copy and cannot perform other computing tasks.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the application aims to provide a method, a system, a device and a medium for realizing the SpMV based on the asynchronous copying of the GPU.
The technical scheme adopted by the application is as follows:
A method for implementing SpMV based on GPU asynchronous replication comprises the following steps:
S1, dividing the non-zero elements in a sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
S2, distributing the batches to the GPU thread blocks according to a preset batch distribution algorithm;
S3, each GPU thread block processes its respective batches.
Further, the sparse matrix A has m rows and n columns, and the three arrays are values, column_indices and row_offsets, respectively. The values array and the column_indices array both have size N_nz and record, in order, the value and the column index of each non-zero element, respectively; the row_offsets array has size m+1 and records the index, within values and column_indices, of the first non-zero element of each matrix row, with row_offsets[m] = N_nz.
The SpMV operation is expressed as:
y ← αAx + βy
where x and y are dense vectors and α and β are scalars.
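As a small, hypothetical illustration added to this text (not taken from the patent), a 4×4 matrix with N_nz = 6 non-zero elements is stored in CSR as follows:

/* Hypothetical example (illustration only):
       [ 5 0 0 1 ]
   A = [ 0 3 0 0 ]
       [ 0 0 2 4 ]
       [ 7 0 0 0 ]                                                       */
double values[]         = { 5, 1, 3, 2, 4, 7 };
int    column_indices[] = { 0, 3, 1, 2, 3, 0 };
int    row_offsets[]    = { 0, 2, 3, 5, 6 };  /* size m+1 = 5, row_offsets[m] = N_nz = 6 */

Row 2, for example, occupies positions row_offsets[2] = 3 up to (but excluding) row_offsets[3] = 5 of values and column_indices.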
Further, the step S3 includes:
loading of batch: copying the values array fragment and the column_indices array fragment corresponding to the non-zero elements in the batch from the global memory of the GPU to the shared memory, wherein the memcpy_async function is used in the copying process;
calculation of batch: distributing the threads in the GPU thread block to the loaded batch, the distributed threads reading the data of the batch from the shared memory and performing the calculation; accumulating the non-zero elements within each matrix row and writing the accumulated results back to the dense vector y.
Further, the loading of the batch and the calculating of the batch are performed concurrently.
Further, in the calculation process of the batch, a calculation mode based on a matrix row or a calculation mode based on a merging path is adopted to calculate the data of the batch.
Further, the SpMV implementation method runs on a device comprising at least one CPU and one GPU; the GPU is an NVIDIA Ampere architecture GPU or another GPU supporting the GPU shared-memory asynchronous copy instruction (memcpy_async).
Further, the steps S1 and S2 are performed on the CPU side, and the step S3 is performed on the GPU side.
The application adopts another technical scheme that:
An SpMV implementation system based on GPU asynchronous replication, wherein a sparse matrix A comprising N_nz non-zero elements is given and the sparse matrix A is stored in three arrays of the CSR storage format; the SpMV implementation system comprises:
the element dividing module, used for dividing the non-zero elements in the sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
the thread allocation module, used for allocating the batches to the GPU thread blocks according to a preset batch allocation algorithm;
and the thread execution module, used for each GPU thread block to process its respective batches.
The application adopts another technical scheme that:
an SpMV implementation apparatus based on GPU asynchronous replication, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
The application adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the application are as follows: the application uses memcpy_async to realize the loading of batch, which can release thread resources from data replication task to execute other computation tasks, realize faster data replication between global memory and shared memory, and fully utilize the data replication time of shared memory to perform computation, thereby shortening the time cost of SpMV as a whole and improving performance.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present application or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present application, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
FIG. 1 is a pseudo code diagram of a GPU thread block processing its respective batches in an embodiment of the present application;
FIG. 2 is a flowchart of a method for implementing SpMV based on CSR storage format based on Ampere architecture GPU according to an embodiment of the present application;
FIG. 3 is a pseudo code diagram of a non-zero element partitioning algorithm in an embodiment of the present application;
FIG. 4 is a pseudo code diagram of processing an "extra-long row" in an embodiment of the application;
FIG. 5 is a pseudo code diagram of "calculation of batch" in an embodiment of the application;
FIG. 6 is a pseudo code diagram of dynamically selecting vector_size in an embodiment of the present application;
FIG. 7 is a schematic diagram of test case performance results in an embodiment of the present application;
fig. 8 is a schematic step diagram of a SpMV implementation method based on GPU asynchronous replication in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "several" means one or more, "a plurality of" means two or more, and greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include the stated number. The description of "first" and "second" is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
NVIDIA's new GPU architecture, the Ampere architecture, implements a new asynchronous copy instruction for shared memory, whose corresponding API in CUDA is memcpy_async. The instruction can bypass the L1 cache and the thread register file and copy data from global memory directly into shared memory; the copy does not occupy thread resources, so threads can execute other computation tasks while the copy is in flight. memcpy_async therefore enables faster data copying for SpMV as well as concurrency between data copying and computation. The innovation of the application lies in applying this new feature of the Ampere architecture to SpMV to achieve these effects, thereby solving the problems of existing algorithms and ultimately improving SpMV performance.
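As a minimal usage sketch (added to this text; the kernel and array names are illustrative, not from the patent), the cooperative-groups flavour of this API is used as follows:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// The whole thread block issues one asynchronous copy from global to shared
// memory and then waits for it before using the data. Launch with
// n * sizeof(double) bytes of dynamic shared memory.
__global__ void stage_to_shared(const double* __restrict__ g_vals, int n)
{
    extern __shared__ double s_vals[];
    cg::thread_block block = cg::this_thread_block();

    // The copy bypasses registers and does not occupy threads while it runs.
    cg::memcpy_async(block, s_vals, g_vals, sizeof(double) * n);

    // ...independent computation could be placed here, overlapping the copy...

    cg::wait(block);   // block until the asynchronous copy has completed
    // s_vals[0..n-1] is now valid for every thread in the block.
}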
As shown in fig. 8, the present embodiment provides an SpMV implementation method based on GPU asynchronous replication. Given is a sparse matrix A of size m×n with N_nz non-zero elements. The sparse matrix A is stored in three arrays of the CSR storage format, namely values, column_indices and row_offsets. The values array and the column_indices array both have size N_nz and record, in order, the value and the column index of each non-zero element, respectively. The row_offsets array has size m+1 and records the index, within values and column_indices, of the first non-zero element of each matrix row, with row_offsets[m] = N_nz. The general SpMV operation can be expressed as y ← αAx + βy, where x and y are dense vectors and α and β are scalars. The SpMV implementation method comprises the following steps:
S1, dividing the non-zero elements in the sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
S2, distributing the batches to the GPU thread blocks according to a preset batch distribution algorithm;
S3, each GPU thread block processes its respective batches.
In step S3, the processing of each batch by the GPU thread block is divided into two steps:
loading of batch: copying the values array fragment and the column_indices array fragment corresponding to the non-zero elements in the batch from the global memory of the GPU to the shared memory, wherein the memcpy_async function is used in the copying process;
calculation of batch: distributing the threads in the GPU thread block to the loaded batch, the distributed threads reading the data of the batch from the shared memory and performing the calculation; accumulating the non-zero elements within each matrix row and writing the accumulated results back to the dense vector y.
Concurrent execution of the "loading of batch" of one batch and the "calculation of batch" of another can be achieved by using the memcpy_async function. The pseudo code of step S3 is shown in fig. 1.
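A minimal kernel sketch of step S3 is given below. It is an illustration consistent with the description above, not a reproduction of the pseudo code of fig. 1: the BatchInfo layout, BATCH_MAX_SIZE, the per-block batch range and the one-thread-per-row computation are simplifying assumptions, batches are assumed to be aligned to matrix-row boundaries, and the cuda::pipeline API is used here as one possible way to drive memcpy_async with double buffering.

#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

struct BatchInfo { int nnz_begin, nnz_end, row_begin, row_end; }; // illustrative layout

constexpr int BATCH_MAX_SIZE = 1024;   // assumed upper bound on non-zeros per batch

// Each thread block walks its assigned batches; while batch k is computed from one
// shared-memory buffer, batch k+1 is already being copied asynchronously into the other.
__global__ void spmv_batched_async(const BatchInfo* __restrict__ batches,
                                   int batch_num, int batches_per_block,
                                   const double* __restrict__ values,
                                   const int*    __restrict__ column_indices,
                                   const int*    __restrict__ row_offsets,
                                   const double* __restrict__ x,
                                   double alpha, double* __restrict__ y)
{
    __shared__ double s_vals[2][BATCH_MAX_SIZE];
    __shared__ int    s_cols[2][BATCH_MAX_SIZE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> pstate;

    cg::thread_block block = cg::this_thread_block();
    auto pipe = cuda::make_pipeline(block, &pstate);

    const int first_batch = blockIdx.x * batches_per_block;
    const int last_batch  = min(first_batch + batches_per_block, batch_num);

    // Issue the asynchronous copy ("loading of batch") of batch k into buffer buf.
    auto stage = [&](int k, int buf) {
        const BatchInfo b = batches[k];
        const int nnz = b.nnz_end - b.nnz_begin;
        pipe.producer_acquire();
        cuda::memcpy_async(block, s_vals[buf], values         + b.nnz_begin,
                           sizeof(double) * nnz, pipe);
        cuda::memcpy_async(block, s_cols[buf], column_indices + b.nnz_begin,
                           sizeof(int)    * nnz, pipe);
        pipe.producer_commit();
    };

    if (first_batch < last_batch) stage(first_batch, 0);      // prologue

    for (int k = first_batch; k < last_batch; ++k) {
        const int buf = (k - first_batch) & 1;
        if (k + 1 < last_batch) stage(k + 1, buf ^ 1);         // overlap: next load in flight

        pipe.consumer_wait();                                  // batch k is now in shared memory

        // "Calculation of batch" k, simplified to one thread per matrix row (the patent
        // assigns sub-warp "vectors" per row and sizes them dynamically, see fig. 5/6).
        const BatchInfo b = batches[k];
        for (int row = b.row_begin + block.thread_rank(); row < b.row_end;
             row += block.size()) {
            double sum = 0.0;
            for (int j = row_offsets[row]; j < row_offsets[row + 1]; ++j) {
                const int s = j - b.nnz_begin;                 // position inside the batch
                sum += s_vals[buf][s] * x[s_cols[buf][s]];
            }
            y[row] += alpha * sum;                             // beta*y assumed applied beforehand
        }

        block.sync();                                          // all threads done with this buffer
        pipe.consumer_release();                               // its pipeline stage can be reused
    }
}

With two shared-memory buffers and a two-stage pipeline, the copy of batch k+1 is in flight while batch k is being computed, which is exactly the overlap of "loading of batch" and "calculation of batch" described above.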
It should be noted that the non-zero element dividing algorithm in step S1, the batch allocation algorithm in step S2 and the calculation algorithm used in the "calculation of batch" of step S3 are all interchangeable, and there are various implementations, which are not described in detail here. The core of the application is: 1) memcpy_async is used to realize faster data copying between global memory and shared memory, shortening the copy time; 2) threads are released from the copying process to execute other computing tasks; that is, the two processes of "loading of batch" and "calculation of batch" are overlapped, which reduces the overall time overhead of SpMV. The idea of overlapping "loading of batch" and "calculation of batch" has not been considered in conventional GPU-based SpMV algorithms built on the CSR storage format. The application is the first to study in depth how to apply memcpy_async, the new feature of the Ampere architecture, to SpMV efficiently.
As an alternative embodiment, the SpMV implementation method runs on a device comprising at least one CPU and one GPU; the GPU is an NVIDIA Ampere architecture GPU or another GPU supporting the GPU shared-memory asynchronous copy instruction. Steps S1 and S2 are performed on the CPU side, and step S3 is performed on the GPU side. Namely: 1) on the CPU side, the non-zero elements are divided according to a pre-designed non-zero element dividing algorithm, each portion being called a batch; 2) on the CPU side, the batches are distributed to the GPU thread blocks according to a pre-designed batch allocation algorithm; 3) on the GPU side, each GPU thread block processes its respective batches.
The above method is explained in detail below with reference to the drawings and specific examples.
Given is a sparse matrix A of size m×n with N_nz non-zero elements. The sparse matrix A is stored in three arrays of the CSR storage format, namely values, column_indices and row_offsets. The general SpMV operation can be expressed as y ← αAx + βy, where x and y are dense vectors and α and β are scalars. The GPU used is an NVIDIA A40, an Ampere architecture GPU with a thread bundle (warp) size of 32.
As shown in fig. 2, this embodiment provides an SpMV implementation method based on the CSR storage format for Ampere architecture GPUs, which specifically includes the following steps:
step 1: the CPU side divides the non-zero elements into components according to a non-zero element dividing algorithm designed in advance, and each component is called a batch. The non-zero element division algorithm in this embodiment specifically includes: firstly, setting a maximum value batch_max_size for the batch, then screening out and recording matrix rows (called 'extra long rows') with row length larger than the batch_max_size into a long_row_info array, then sequentially loading the rest matrix rows into the batch, loading the rest matrix rows into the next batch when the number of non-zero elements in the batch is about to be larger than the batch_max_size, and repeating the steps until all the non-zero elements are divided. The division results of these non-zero elements will be recorded into the batch_info array. The pseudo code for this process is shown in figure 3.
Step 2: the CPU side distributes the batches to the GPU thread blocks according to a pre-designed batch allocation algorithm. The batch allocation algorithm in this embodiment is an even allocation, specifically: let the total number of batches be batch_num and the number of GPU thread blocks be B; each GPU thread block is then responsible for processing ⌈batch_num/B⌉ batches.
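In code, the even allocation could be computed as in the following small helper (an illustrative sketch added to this text; how the range reaches each thread block is an implementation detail):

#include <algorithm>

// Thread block b is responsible for batches [first, last), with
// ceil(batch_num / B) batches per block.
void batch_range(int batch_num, int B, int b, int& first, int& last)
{
    const int per_block = (batch_num + B - 1) / B;   // ceil(batch_num / B)
    first = std::min(b * per_block, batch_num);
    last  = std::min(first + per_block, batch_num);
}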
Step 3: on the GPU side, each GPU thread block processes its respective batches. The processing of each batch includes the two steps "loading of batch" and "calculation of batch", and these two processes are overlapped by means of the memcpy_async function, with pseudo code as shown in FIG. 1.
Step 4: the "extra-long rows" screened out in step 1 are processed with a special algorithm, which is specifically as follows: each GPU thread block is responsible for one "extra-long row"; the thread bundles (warps) in the thread block each compute a partial sum over the non-zero elements of the "extra-long row", and finally the partial sums of the warps are summed to obtain the final result, which is written back to the y vector. The pseudocode for this process is shown in fig. 4.
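A kernel sketch of this step is given below (added to this text as an illustration; the pseudo code of fig. 4 is not reproduced, and the array names are assumptions). It uses the cooperative-groups reduction for the per-warp partial sums; blockDim.x is assumed to be a multiple of 32:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// One thread block per "extra-long row": every thread accumulates a strided partial
// sum, each warp reduces its partial sums, and the first warp combines the per-warp
// results and writes the final value back to y.
__global__ void spmv_long_rows(const int*    __restrict__ long_row_info, // one row index per block
                               const double* __restrict__ values,
                               const int*    __restrict__ column_indices,
                               const int*    __restrict__ row_offsets,
                               const double* __restrict__ x,
                               double alpha, double* __restrict__ y)
{
    __shared__ double warp_sums[32];                 // one slot per warp (<= 1024 threads)

    const int row   = long_row_info[blockIdx.x];
    const int begin = row_offsets[row];
    const int end   = row_offsets[row + 1];

    double sum = 0.0;
    for (int j = begin + threadIdx.x; j < end; j += blockDim.x)
        sum += values[j] * x[column_indices[j]];     // consecutive threads, consecutive j: coalesced

    cg::thread_block block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
    sum = cg::reduce(warp, sum, cg::plus<double>()); // per-warp partial sum

    if (warp.thread_rank() == 0)
        warp_sums[warp.meta_group_rank()] = sum;
    block.sync();

    if (warp.meta_group_rank() == 0) {               // first warp combines the partials
        const int num_warps = blockDim.x / 32;
        double total = (warp.thread_rank() < num_warps) ? warp_sums[warp.thread_rank()] : 0.0;
        total = cg::reduce(warp, total, cg::plus<double>());
        if (warp.thread_rank() == 0)
            y[row] += alpha * total;                 // beta*y assumed applied beforehand
    }
}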
The step 3 can be started after the step 1 and the step 2 are completed, and the step 4 can be started after the step 1 is completed.
The calculation algorithm used for the "calculation of batch" in step 3 of this embodiment is described below. The thread bundles within the GPU thread block are divided into smaller vectors, each having vector_size threads. Each vector processes the non-zero elements of one matrix row at a time, accumulates them, and writes the result back to the y vector. After a vector finishes processing a matrix row, it is assigned a new matrix row, and this is repeated until all matrix rows in the batch have been processed. The pseudo code for this process is shown in fig. 5. To maximize thread utilization, the algorithm dynamically selects vector_size according to the average row length of each batch, where vector_size ∈ {1, 2, 4, 8, 16, 32}; the pseudo code is shown in FIG. 6.
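A plausible sketch of the dynamic selection (the pseudo code of fig. 6 is not reproduced here; this helper is an illustration added to this text) picks the smallest power of two in {1, 2, 4, 8, 16, 32} that is at least the average row length of the batch, so that most threads of each vector stay busy:

__host__ __device__ inline int choose_vector_size(int batch_nnz, int batch_rows)
{
    // Average row length of the batch, rounded up.
    const int avg_row_len = (batch_rows > 0) ? (batch_nnz + batch_rows - 1) / batch_rows : 1;

    int vector_size = 1;
    while (vector_size < 32 && vector_size < avg_row_len)
        vector_size <<= 1;                 // next power of two: 1, 2, 4, 8, 16, 32
    return vector_size;
}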
Using the above technique, we performed tests on a set of matrices. The test platform information is as follows: two AMD EPYC 7713 64-core CPUs, 377 GB of RAM, running the Ubuntu 20.04 operating system. The test platform is equipped with an NVIDIA Ampere architecture GPU, the A40, which has 84 streaming multiprocessors, 10752 CUDA cores in total, 48 GB of global memory and 48 KB of shared memory per streaming multiprocessor. All CUDA programs were compiled with the CUDA 11.7 toolkit using the "-arch sm_86 -O3" compilation parameters.
The test matrices are taken from the internationally known Matrix Market sparse matrix collection; 20 representative matrices are selected as the test set, and the specific matrix information is shown in Table 1. The N_nz of these matrices ranges from 73K to 14.8M, the average row length ranges from 2.6 to 158.5, and the matrices have varying sparsity characteristics.
Table 1 matrix information table
On this test matrix set, we compared our algorithm with a number of well-known CSR-based SpMV algorithms: LightSpMV, Merge-based and cuSPARSE. Among them, cuSPARSE is the GPU-based linear algebra library provided by NVIDIA; its SpMV routine based on the CSR storage format is widely used in various applications and performs well. The test results are shown in fig. 7. No single algorithm fits all types of matrices, but our method is better than the other algorithms in most cases. Across all test matrices, our method has an average throughput of 100.1 GFLOPs, LightSpMV 71.7 GFLOPs, Merge-based 71.1 GFLOPs, and cuSPARSE 71.6 GFLOPs. Our approach achieves an average 39.8% performance improvement over cuSPARSE. It performs particularly well on matrices with an average row length of less than 16; for example, on the aug2d matrix our approach achieves a 119.6% improvement over cuSPARSE, which is also the largest improvement among these test matrices.
In summary, the method of this embodiment provides a more efficient SpMV algorithm based on the CSR storage format for Ampere architecture GPUs. The algorithm first divides the non-zero elements into multiple batches and then assigns the batches to GPU thread blocks for processing. The processing of each batch by a GPU thread block is divided into two steps: "loading of batch" and "calculation of batch". "Loading of batch" refers to copying the non-zero elements from global memory to shared memory, and "calculation of batch" refers to calling the threads within the GPU thread block to read the non-zero elements from shared memory, accumulate them along matrix rows, and write the results back to global memory. The application uses memcpy_async, a new feature of the Ampere architecture GPUs proposed by NVIDIA, to realize the "loading of batch", which releases thread resources from the data copy task so that they can execute other computing tasks. The application applies memcpy_async to SpMV and overlaps the two processes of "loading of batch" and "calculation of batch". The embodiment thereby achieves faster data copying between global memory and shared memory, makes full use of the shared-memory copy time for computation, and shortens the overall time expense of SpMV, thereby improving performance.
The embodiment also provides an SpMV implementation system based on GPU asynchronous replication, wherein a sparse matrix A comprising N_nz non-zero elements is given and the sparse matrix A is stored in three arrays of the CSR storage format; the SpMV implementation system comprises:
the element dividing module, used for dividing the non-zero elements in the sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
the thread allocation module, used for allocating the batches to the GPU thread blocks according to a preset batch allocation algorithm;
and the thread execution module, used for each GPU thread block to process its respective batches.
The embodiment of the application provides a system for realizing the SpMV based on the asynchronous replication of the GPU, which can execute the steps of the method according to any combination of the embodiments of the method, and has the corresponding functions and beneficial effects.
An SpMV implementation apparatus based on GPU asynchronous replication, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as shown in fig. 2 or 8.
The SpMV implementation apparatus based on GPU asynchronous replication provided by this embodiment can execute the method provided by the method embodiments of the application, in any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 2 or 8.
The embodiment also provides a storage medium which stores instructions or programs for executing the method for realizing the SpMV based on the GPU asynchronous copy, and when the instructions or programs are run, the instructions or programs can execute any combination implementation steps of the method embodiment, and the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference has been made to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A method for implementing SpMV based on GPU asynchronous replication, characterized in that a sparse matrix A comprising N_nz non-zero elements is given and the sparse matrix A is stored in three arrays of a CSR storage format; the method comprises the following steps:
S1, dividing the non-zero elements in the sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
S2, distributing the batches to the GPU thread blocks according to a preset batch distribution algorithm;
S3, each GPU thread block processing its respective batches.
2. The method for implementing SpMV based on GPU asynchronous replication according to claim 1, wherein the sparse matrix A has m rows and n columns, the three arrays are values, column_indices and row_offsets, respectively, the values array and the column_indices array both have size N_nz and record, in order, the value and the column index of each non-zero element, respectively, and the row_offsets array has size m+1 and records the index, within values and column_indices, of the first non-zero element of each matrix row, with row_offsets[m] = N_nz;
the SpMV operation is expressed as:
y ← αAx + βy
where x and y are dense vectors and α and β are scalars.
3. The method for implementing SpMV based on GPU asynchronous replication according to claim 1, wherein the step S3 comprises:
loading of batch: copying the values array fragment and the column_indices array fragment corresponding to the non-zero elements in the batch from the global memory of the GPU to the shared memory, wherein the memcpy_async function is used in the copying process;
calculation of batch: distributing the threads in the GPU thread block to the loaded batch, the distributed threads reading the data of the batch from the shared memory and performing the calculation; accumulating the non-zero elements within each matrix row and writing the accumulated results back to the dense vector y.
4. A method of implementing SpMV based on asynchronous replication of GPU according to claim 3, wherein the loading of the batch and the computing of the batch are performed concurrently.
5. The method for realizing the SpMV based on the asynchronous replication of the GPU according to claim 3, wherein in the calculation process of the batch, a calculation mode based on a matrix row or a calculation mode based on a merging path is adopted to calculate the data of the batch.
6. The method according to claim 1, wherein the method is run on a device comprising at least one CPU and one GPU; the GPU is NVIDIA Ampere architecture GPU or GPU supporting GPU shared memory asynchronous copy instruction.
7. The method for implementing SpMV based on asynchronous replication of GPU according to claim 6, wherein the step S1 and the step S2 are executed on the CPU side, and the step S3 is executed on the GPU side.
8. A system for implementing SpMV based on GPU asynchronous replication, characterized in that a sparse matrix A comprising N_nz non-zero elements is given and the sparse matrix A is stored in three arrays of a CSR storage format; the SpMV implementation system comprises:
an element dividing module, used for dividing the non-zero elements in the sparse matrix A by adopting a preset non-zero element dividing algorithm, each portion obtained by the division being recorded as a batch;
a thread allocation module, used for allocating the batches to the GPU thread blocks according to a preset batch allocation algorithm;
and a thread execution module, used for each GPU thread block to process its respective batches.
9. An SpMV implementation apparatus based on GPU asynchronous replication, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-7 when being executed by a processor.
CN202311048985.0A 2023-08-18 2023-08-18 SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication Pending CN117112208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311048985.0A CN117112208A (en) 2023-08-18 2023-08-18 SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311048985.0A CN117112208A (en) 2023-08-18 2023-08-18 SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication

Publications (1)

Publication Number Publication Date
CN117112208A true CN117112208A (en) 2023-11-24

Family

ID=88808502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311048985.0A Pending CN117112208A (en) 2023-08-18 2023-08-18 SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication

Country Status (1)

Country Link
CN (1) CN117112208A (en)

Similar Documents

Publication Publication Date Title
Mittal et al. A survey of techniques for optimizing deep learning on GPUs
US20220383068A1 (en) Systems and methods for improved neural network execution
US7492368B1 (en) Apparatus, system, and method for coalescing parallel memory requests
US7339592B2 (en) Simulating multiported memories using lower port count memories
US7634621B1 (en) Register file allocation
US20180121388A1 (en) Symmetric block sparse matrix-vector multiplication
US9477465B2 (en) Arithmetic processing apparatus, control method of arithmetic processing apparatus, and a computer-readable storage medium storing a control program for controlling an arithmetic processing apparatus
US10877757B2 (en) Binding constants at runtime for improved resource utilization
KR20220054357A (en) Method for performing PROCESSING-IN-MEMORY (PIM) operations on serially allocated data, and related memory devices and systems
Catanzaro et al. A decomposition for in-place matrix transposition
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
D'Amore et al. Towards a parallel component in a GPU–CUDA environment: a case study with the L-BFGS Harwell routine
US20230021472A1 (en) Method to avoid memory bank conflicts and pipeline conflicts in tensor memory layout
US11941743B2 (en) Generation of sample points in rendering applications using elementary interval stratification
CN116775518A (en) Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks
CN117271136A (en) Data processing method, device, equipment and storage medium
CN117112208A (en) SpMV (sparse matrix-vector multiplication) implementation method, system, device and medium based on GPU (graphics processing unit) asynchronous replication
Huang et al. Scalable SIMD-parallel memory allocation for many-core machines
CN114625421A (en) SIMT instruction processing method and device
US11138015B2 (en) Spatial pipelining of software reductions and scans
CN114281554B (en) 3D-CNN acceleration method and device for 3D image processing and electronic equipment
US20220155996A1 (en) Vector processor data storage
Sun et al. Optimizing sparse matrix-vector multiplication on GPUs via index compression
US20230144553A1 (en) Software-directed register file sharing
CN112487352B (en) Fast Fourier transform operation method on reconfigurable processor and reconfigurable processor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination