CN116431967A - SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format - Google Patents

SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format

Info

Publication number
CN116431967A
Authority
CN
China
Prior art keywords
task
blocks
vector
spmv
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310297248.8A
Other languages
Chinese (zh)
Inventor
曾广森
邹毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310297248.8A priority Critical patent/CN116431967A/en
Publication of CN116431967A publication Critical patent/CN116431967A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method, a device and a medium for implementing SpMV based on a CSR storage format, wherein the method comprises the following steps: dividing the SpMV into three sections of operations, namely a value-taking operation, a product operation and a summation operation, each section of operation corresponding to one task; dividing each task into a plurality of task blocks; dividing all threads into a plurality of vectors, wherein one vector comprises vector_size threads and serves as the execution unit of a task block; designing two issue queues and a counter to record the ready states of the task blocks; each vector acquires and executes a task block in the ready state and updates the ready states of task blocks after execution; when all task blocks have been executed, the SpMV operation is complete. According to the invention, the value-taking and product operations are divided into task blocks of fixed size, which improves the data utilization of GPU memory accesses. The invention can be widely applied in the technical field of high-performance numerical computation.

Description

SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format
Technical Field
The invention relates to the technical field of high-performance numerical computation, and in particular to a method, a device and a medium for implementing SpMV based on the CSR storage format.
Background
Sparse matrix-vector multiplication (SpMV) is a fundamental primitive in sparse linear algebra and plays a dominant role in many scientific computing applications, such as iterative methods for solving large linear systems and eigenvalue problems, data mining, graph analysis, and the like. In these applications, SpMV is often a performance bottleneck, so accelerating SpMV operations is of great importance.
The graphics processing unit (GPU) has the characteristics of high throughput and high parallelism, and is an attractive choice in the field of scientific computing. In GPU-oriented SpMV algorithms, memory access behavior is often the performance bottleneck, and handling it effectively is one of the keys to improving SpMV performance. When reading memory, the GPU reads a contiguous memory segment at a time, even if not all of the read data is needed. When the required data is scattered across different contiguous segments, the GPU has to read multiple contiguous segments to obtain it, so random access on the GPU incurs significant time overhead.
Much work has been done in the past on GPU-oriented SpMV algorithms based on the CSR storage format. Bell and Garland propose CSR-vector, which assigns 32 threads (a thread bundle, i.e. a warp) to each row of the matrix as a vector; the vector computes the products of the non-zero elements in the row with the dense vector x, sums the products, and finally writes the result back into the dense vector y. Yongchao Liu and Bertil Schmidt propose LightSpMV: the CSR-vector scheme leaves many threads idle when the number of non-zero elements in a matrix row is less than 32, so to reduce this waste LightSpMV determines vector_size (the number of threads contained in one vector, each vector being responsible for processing one row of the matrix) according to the average number of non-zero elements per matrix row. ACSR, proposed by Arash Ashari et al., is an algorithm for dynamically distributing the load: it first reads the number of non-zero elements in each row of the matrix, groups rows with similar counts into the same group (called bins), and each bin sets an appropriate vector_size for its vectors according to the element count, thereby realizing dynamic load distribution.
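For reference, the CSR-vector idea described above can be sketched as the following CUDA kernel. This is a minimal illustration, not the code of the cited works; it assumes double-precision data, a warp size of 32, and one warp per matrix row, with the grid sized by the caller to cover m warps.

    // Minimal CSR-vector style SpMV sketch: one 32-thread warp per matrix row.
    __global__ void csr_vector_spmv(int m, const int* row_offsets,
                                    const int* column_indices, const double* values,
                                    const double* x, double* y)
    {
        int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp handles one row
        int lane    = threadIdx.x % 32;
        if (warp_id >= m) return;

        double sum = 0.0;
        for (int i = row_offsets[warp_id] + lane; i < row_offsets[warp_id + 1]; i += 32)
            sum += values[i] * x[column_indices[i]];                 // product of non-zeros with x

        for (int offset = 16; offset > 0; offset >>= 1)              // warp-level reduction of partial sums
            sum += __shfl_down_sync(0xffffffff, sum, offset);

        if (lane == 0) y[warp_id] = sum;                             // write the row result back into y
    }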
At present, research on GPU-oriented SpMV algorithms based on the CSR storage format mostly approaches the problem from the load-balancing angle, making the load of each thread as equal as possible to improve overall performance. Memory access in SpMV, however, takes up a significant amount of time, and the above algorithms lack consideration in this regard. The algorithms proposed in the prior art generally fuse the value-taking, product and summation processes of SpMV; however, the length of each matrix row varies, and the data size of some rows is far smaller than the amount of data the GPU reads in a single memory access.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a method, a device and a medium for realizing the SpMV based on the CSR storage format.
The technical scheme adopted by the invention is as follows:
a SpMV implementation method based on CSR storage format comprises the following steps:
setting the scale of a sparse matrix A as m rows and n columns, wherein the sparse matrix comprises a plurality of non-zero elements; adopting the CSR storage format to store the matrix, obtaining three arrays: values, column_indices and row_offsets;
dividing the SpMV into three sections of operations, wherein the three sections of operations are respectively a value-taking operation (Task 1), a product operation (Task 2) and a summation operation (Task 3), and each section of operation corresponds to one Task;
dividing each task into a plurality of task blocks; the task blocks corresponding to the value-taking operation are marked as task1 blocks, the task blocks corresponding to the product operation are marked as task2 blocks, and the task blocks corresponding to the summation operation are marked as task3 blocks;
dividing all threads into a plurality of vectors, wherein one vector comprises vector_size threads and serves as the execution unit of a task block;
designing two issue queues and a counter to record the ready states of the task blocks;
each vector acquires and executes a task block in a ready state, and updates the ready state of the task block after execution; when all task blocks are executed, the SpMV operation is completed.
Further, the value-taking operation (Task 1) is to obtain the value corresponding to the dense vector x according to the column coordinate of each non-zero element in the sparse matrix, and write the result into the temporary array temp as an intermediate value, where the pseudo code is expressed as: x[column_indices[i]] → temp[i].
Further, the product operation (Task 2) is to multiply the intermediate values stored in the temporary array temp in the value-taking operation by the elements of the values array one by one, and write the results back into the temporary array temp as intermediate values; the pseudo code is expressed as: temp[i] × values[i] → temp[i].
Further, the summation operation (Task 3) is to sum the intermediate values saved to the temporary array temp in the product operation according to the number of non-zero elements possessed by each matrix row, and write each summation result back into the corresponding position of the dense vector y, where the pseudo code is expressed as: temp[row_offsets[r]] + temp[row_offsets[r]+1] + ... + temp[row_offsets[r+1]-1] → y[r].
further, the dividing each task into a plurality of task blocks includes:
when dividing tasks, the task1block and the task2 block have the same size, and the size of a task block is set to be an integer multiple of the size of a thread bundle;
the tasks of the summing operation are divided according to the matrix rows, and the non-zero elements in the same matrix row are divided into a task block.
Further, a dependency relationship exists among task blocks, task1 blocks always meet the dependency relationship, task2 blocks depend on task1 blocks in a one-to-one correspondence manner, and task3 blocks depend on task2 blocks in a one-to-one or one-to-many manner;
task blocks meeting the dependency relationship are in a ready state, and task2 blocks and task3 blocks in the ready state are added to the two issue queues task2_issue_queue and task3_issue_queue, respectively.
Further, each vector acquires and executes a task block in a ready state, including:
the vector obtains, according to priority, a task block in the ready state and executes it; this process is called the issuing of a task block;
task2 blocks and task3 blocks are issued to vectors through the issue queues for execution; for the value-taking operation, a counter records the task1 block that currently needs to be issued;
wherein the order of priority is: task3 block > task2 block > task1 block.
Further, the updating the ready state of the task block after execution includes:
when a vector finishes executing a task1 block, the corresponding task2 block is added to the issue queue task2_issue_queue;
when a vector finishes executing a task2 block, it is checked whether any task3 block now meets the dependency condition, and any such task3 block is added to the issue queue task3_issue_queue.
The invention adopts another technical scheme that:
a CSR storage format-based SpMV implementation apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: the "value-taking" and "product" operations are divided into task blocks of fixed size, which improves the data utilization of GPU memory accesses. A vector does not need to wait for all "value-taking" and "product" task blocks to complete before executing a "summation" task block; it only needs to acquire a task block in the ready state from a queue and execute it, achieving an effect similar to out-of-order execution, improving parallelism, and avoiding the blocking caused by many threads initiating memory accesses at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
Fig. 1 is a schematic diagram of the three types of tasks in the GPU-oriented SpMV implementation method based on the CSR storage format (TaskIssue algorithm) in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a task block dependency relationship in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the operation of the transmit queue and counter in an embodiment of the invention;
FIG. 4 is a schematic diagram of the kernel function pseudocode of the TaskIssue algorithm in an embodiment of the present invention;
FIG. 5 is a flow chart of data preprocessing in an embodiment of the invention;
FIG. 6 is a schematic diagram of how the row_dependency array and the task_block_involve_row array record task block dependencies in an embodiment of the present invention;
FIG. 7 is pseudo code for checking whether task3block satisfies a dependency relationship in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more, "a plurality of" means two or more, and greater than, less than, exceeding, etc. are understood to not include the stated number, while above, below, within, etc. are understood to include it. The descriptions "first" and "second" are for the purpose of distinguishing between technical features only and should not be construed as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
A sparse matrix is a matrix in which the number of non-zero elements is much smaller than the total number of elements. Because the proportion of non-zero elements is small, the whole matrix need not be stored; only the values of the non-zero elements and their position information in the matrix need to be stored. CSR (compressed sparse row) is a widely used sparse storage format consisting of 3 arrays (assuming a sparse matrix A with m rows and n columns, with nnz non-zero elements):
1) values, with array size nnz, records the values of the non-zero elements in the matrix row by row in order;
2) column_indices, with array size nnz, records the column indices of the non-zero elements in the matrix row by row in order;
3) row_offsets, with array size m+1, records the index of the first element of each matrix row in the arrays values and column_indices, where row_offsets[m] = nnz.
The definition of SpMV is y = Ax, where A is a sparse matrix and x and y are dense vectors.
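As a small worked example (the numbers are chosen arbitrarily for illustration), a 4×4 sparse matrix and its CSR arrays look as follows:

    A = | 1 0 2 0 |        values         = [1, 2, 3, 4, 5, 6]       (nnz = 6)
        | 0 3 0 0 |        column_indices = [0, 2, 1, 0, 2, 3]
        | 4 0 5 0 |        row_offsets    = [0, 2, 3, 5, 6]          (m + 1 = 5 entries)
        | 0 0 0 6 |

For x = [1, 1, 1, 1], the product y = Ax is [1+2, 3, 4+5, 6] = [3, 3, 9, 6].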
The storage structure of the GPU is as follows:
1) Register File (Register File), which is the fastest memory, has a small capacity. The register variables are private to each thread;
2) A Shared Memory (Shared Memory) located on-chip, faster, accessible by threads in the same thread block;
3) Global Memory (Global Memory) is located on the video Memory, belongs to off-chip Memory, and has slow access speed.
Based on the problems of the prior art, in order to improve memory access efficiency, the invention divides the SpMV operation into three stages: "value-taking", "product" and "summation". In this way, the "value-taking" and "product" stages are decoupled from the "summation" stage; they are no longer constrained by the row lengths of the matrix and can be divided into fixed-length blocks for processing, which improves memory access efficiency. In addition, if the operations were performed strictly in the order "value-taking", "product", "summation", a large number of threads would initiate memory access requests at the same time, causing serious blocking. However, the "summation" stage does not need to wait for all "value-taking" and "product" operations to be completed: a matrix row can be "summed" as long as the non-zero elements required by that row have completed the "value-taking" and "product" operations. To avoid memory access blocking as much as possible, a queue needs to be implemented to keep track of which matrix rows are in a state where they can be "summed", and the threads perform the "summation" operation preferentially.
As shown in fig. 1, the embodiment provides a SpMV implementation method (taskis) based on a CSR storage format for a GPU, which specifically includes the following steps:
s1, setting the scale of a sparse matrix A as m rows and n columns, wherein the sparse matrix A has nnz non-zero elements in total; the matrix is stored in the CSR storage format as three arrays: values, column_indices and row_offsets.
S2, dividing the SpMV into three sections, wherein each section corresponds to one task, namely value-taking (Task1), product (Task2) and summation (Task3).
The value-taking operation (Task1) obtains, according to the column coordinate of each non-zero element, the corresponding value in the dense vector x and writes the result into the temporary array temp as an intermediate value; the pseudo code is expressed as: x[column_indices[i]] → temp[i].
The product operation (Task2) multiplies the intermediate values saved to the temporary array temp in Task1 by the elements of the values array one by one and writes the results back into the temporary array temp as intermediate values; the pseudo code is expressed as: temp[i] × values[i] → temp[i].
The summation operation (Task3) sums the intermediate values saved to the temporary array temp in Task2 according to the number of non-zero elements possessed by each matrix row, and writes each summation result back into the corresponding position of the dense vector y; the pseudo code is expressed as: temp[row_offsets[r]] + temp[row_offsets[r]+1] + ... + temp[row_offsets[r+1]-1] → y[r].
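Written out sequentially, the three stages look as follows. This sketch only makes the data flow through the temporary array temp explicit; it assumes the CSR arrays, m, nnz, x and y defined above are in scope, and the actual execution in this embodiment is by task blocks on the GPU as described below.

    double* temp = new double[nnz];                 // intermediate values shared by the three stages
    for (int i = 0; i < nnz; ++i)                   // Task1: value-taking
        temp[i] = x[column_indices[i]];
    for (int i = 0; i < nnz; ++i)                   // Task2: product
        temp[i] = temp[i] * values[i];
    for (int r = 0; r < m; ++r) {                   // Task3: per-row summation
        double sum = 0.0;
        for (int i = row_offsets[r]; i < row_offsets[r + 1]; ++i)
            sum += temp[i];
        y[r] = sum;                                 // write back into the dense vector y
    }
    delete[] temp;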
s3, dividing each task into a plurality of task blocks (task blocks); the task blocks corresponding to the value-taking operation are marked as task1 blocks, the task blocks corresponding to the product operation are marked as task2 blocks, and the task blocks corresponding to the product operation are marked as task3 blocks.
As shown in fig. 2, the task block sizes (task_size) of Task1 and Task2 are set to be the same, and to be an integer multiple of the thread bundle size (for example, every 32 non-zero elements form one task block, referred to as a task1 block and a task2 block respectively). Task3 divides task blocks by rows: the non-zero elements in the same matrix row form one task block (task3 block). The arrows in fig. 2 represent the dependency relationships. Because the task blocks of Task1 and Task2 are the same size, the dependency between them is a one-to-one correspondence, for example: t_21 requires t_11 to complete before it can be executed, and t_22 requires t_12 to complete before it can be executed. Since the number of non-zero elements possessed by each matrix row differs, the dependency between Task2 and Task3 may be one-to-many, for example: t_32 requires both t_21 and t_22 to complete before it can be executed.
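Concretely, the number of task blocks follows directly from this division; a small sketch (task_size = 32 is only the example value used above, and the variable names are illustrative):

    int task_size        = 32;                                  // an integer multiple of the thread bundle size
    int num_task1_blocks = (nnz + task_size - 1) / task_size;   // ceil(nnz / task_size)
    int num_task2_blocks = num_task1_blocks;                    // task2 blocks correspond one-to-one to task1 blocks
    int num_task3_blocks = m;                                   // one task3 block per matrix row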
S4, dividing all threads into a plurality of vectors, wherein one vector comprises vector_size threads and serves as the execution unit of a task block;
threads are divided into individual vectors, each containing vector_size threads, which is typically an integer multiple of the thread bundle size (e.g., 32). Each task block in the ready state will be assigned a vector to execute.
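Inside the kernel, each thread can derive which vector it belongs to from its thread index; a minimal sketch, assuming (as in the embodiment below) that each GPU block runs its own TaskIssue instance:

    int vector_id         = threadIdx.x / vector_size;   // which vector inside this GPU block
    int lane_id           = threadIdx.x % vector_size;   // the thread's position inside its vector
    int vectors_per_block = blockDim.x / vector_size;    // number of vectors this GPU block provides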
S5, designing two emission queues and a counter to record the ready state of the task block.
S6, each vector acquires and executes a task block in a ready state, and updates the ready state of the task block after execution; when all task blocks are executed, the SpMV operation is completed.
Since task blocks have dependencies, that is, a task block can be executed only when its dependency condition is satisfied, task blocks are divided into three states: the not-ready state, the ready state, and the execution state. The not-ready state means that the task block does not currently satisfy its dependency condition, for example: in FIG. 2, when t_11 is not completed, t_21 is in the not-ready state. The ready state means that the task block currently satisfies its dependency condition and is waiting for an idle vector to execute it, for example: in FIG. 2, when t_21 and t_22 are both finished, t_32 is in the ready state. When a task block is in the ready state and there is an idle vector, the task block is allocated to the idle vector for execution and enters the execution state; this allocation process is called the issuing (Issue) of the task block.
The TaskIssue algorithm requires two issue queues and one counter variable to record the ready states of the tasks, namely task1_counter, task2_issue_queue and task3_issue_queue. Their roles are as follows:
1) task1_counter: since Task1 does not need to rely on other tasks, it is ready all at the beginning, when there is a free vector, task1 will be launched in sequence, and Task1_counter is used to record which Task1block is launched for the current round.
2) task2_issue_queue: the issue queue is used to record which task2 blokcs are in the ready state. Since the dependency relationship between Task1 and Task2 is one-to-one (as shown in fig. 2), when there is Task1block completed, it is necessary to add the corresponding Task2 block to Task2_issue_queue, as in the example shown in fig. 3, when t 11 Will t when completed 21 Added to the task2_isue_queue. When there is a task2 block shot, it needs to be removed from the task2_issue_queue to avoid duplicate execution.
3) task3_issue_queue: the issue queue is used to record which task3 blokcs are in the ready state. Since there is a one-to-many dependency between Task2 and Task3 (as shown in FIG. 2), it is necessary to check that Task3block adds Task3block satisfying the dependency condition to Task3_issue_queue when Task2 block is completed, as in the example shown in FIG. 3, when t 21 And t 22 T after all are finished 32 Satisfy the dependency condition, t 32 Added to the task3_isue_queue. When there is a task3block shot, it needs to be removed from the task3_issue_queue to avoid duplicate execution.
When task blocks of different types are ready at the same time, their issue priorities are as follows: Task3 > Task2 > Task1. An idle vector executes the ready task block with the highest issue priority. If an idle vector finds no task block in the ready state, the thread resources contained in the vector are reclaimed to avoid occupying processor core resources in the hardware.
FIG. 4 is the kernel function pseudocode of the TaskIssue algorithm. The function first performs operations such as vector partitioning, queue initialization and counter initialization, and then enters a loop. In the loop, a vector first obtains, according to the issue priority, the task type and task information of the task block to be issued and removes that task block from its issue queue (if the task type is Task1, task1_counter is incremented instead). The vector then executes the task block, updates the ready queues after completion, and returns to the beginning of the loop to obtain a new task. When all vectors have returned, all task blocks have been executed and the SpMV operation finishes together with the kernel function.
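The kernel itself is only given as pseudocode in FIG. 4. The following CUDA helpers are one possible minimal realization of the issue logic; the names, signatures and lock-based serialization are assumptions consistent with the description rather than the patented implementation, and they assume that only one lane per vector calls them and that vector_size is a multiple of the warp size, so at most one lane per warp spins on the lock.

    // Block-wide spinlock serializing all queue/counter operations, matching the
    // embodiment's use of atomic locks (called by one lane per vector only).
    __device__ void acquire_lock(int* lock) {
        while (atomicCAS(lock, 0, 1) != 0) { }    // spin until the lock is free
        __threadfence_block();                    // see writes made by the previous holder
    }

    __device__ void release_lock(int* lock) {
        __threadfence_block();                    // publish this holder's writes
        atomicExch(lock, 0);
    }

    // Fetch the next ready task block with priority task3 > task2 > task1.
    // Returns 3, 2 or 1, or 0 when nothing is currently ready (the vector then
    // retires, as described above for idle vectors).
    __device__ int fetch_task(int* lock,
                              int* t3_queue, int* t3_head, int* t3_tail,
                              int* t2_queue, int* t2_head, int* t2_tail,
                              int* task1_counter, int num_task1_blocks,
                              int* block_id) {
        int type = 0;
        acquire_lock(lock);
        if (*t3_head < *t3_tail)                    { *block_id = t3_queue[(*t3_head)++]; type = 3; }
        else if (*t2_head < *t2_tail)               { *block_id = t2_queue[(*t2_head)++]; type = 2; }
        else if (*task1_counter < num_task1_blocks) { *block_id = (*task1_counter)++;     type = 1; }
        release_lock(lock);
        return type;
    }

    // Mark a block ready: e.g. after a task1 block completes, push the corresponding
    // task2 block into task2_issue_queue; after a task2 block completes, push every
    // task3 block whose dependency is now satisfied into task3_issue_queue.
    __device__ void push_ready(int* lock, int* queue, int* tail, int block_id) {
        acquire_lock(lock);
        queue[(*tail)++] = block_id;
        release_lock(lock);
    }

In such a sketch, the vector leader would broadcast the fetched task type and block id to the other lanes of its vector (for example with __shfl_sync when vector_size is 32) before the whole vector executes the task block and, on completion, calls push_ready for any newly ready blocks.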
The above method is explained in detail below with reference to the drawings and specific examples.
The specific implementation process of the TaskIssue algorithm comprises two parts, namely data preprocessing and a kernel function. The data preprocessing process is as shown in fig. 5:
1) The data preprocessing first divides the whole sparse matrix into a plurality of task groups; each matrix row can only belong to one task group and cannot be split, and on this basis the numbers of non-zero elements in the task groups are made as close as possible. Each task group is allocated to a GPU block for execution, and each GPU block independently executes the TaskIssue algorithm; the purpose of this is to balance the GPU load as much as possible while avoiding the costly operation of inter-GPU-block communication.
2) Since the number of non-zero elements to be processed in each task group may differ, and the lengths of the issue queues and the size of the temporary space temp depend on the number of non-zero elements, the amount of shared memory space required for each task group needs to be calculated. When checking whether the dependency of a task3 block is satisfied, the completion status of the task2 blocks needs to be known; a task2_finish array is used here to record which task2 blocks have been completed, and this array is also stored in shared memory.
3) As mentioned above, after a task2 block finishes executing it is necessary to check whether any task3 block satisfies the dependency condition and add it to task3_issue_queue. To implement this check, this embodiment defines a row_dependency array and a task_block_involve_row array to record the task block dependencies between Task2 and Task3. As shown in fig. 6, the row_dependency array records which task2 blocks each task3 block depends on; each element of the row_dependency array contains two sub-elements, task_block_start and task_block_end, for example: in fig. 6, t_32 depends on t_21, t_22 and t_23, therefore row_dependency[2].task_block_start = 1 and row_dependency[2].task_block_end = 3. The task_block_involve_row array records which task3 blocks depend on each task2 block; each element of this array contains two sub-elements, row_start and row_end, for example: in fig. 6, t_21 is depended on by t_31 and also by t_32, therefore task_block_involve_row[1].row_start = 1 and task_block_involve_row[1].row_end = 2. The pseudo code that uses these two arrays to check whether any task3 block satisfies the dependency condition is shown in fig. 7.
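FIG. 7 is not reproduced here, but the described check can be sketched as the following device function. The struct layouts and names merely mirror the arrays described above, with inclusive index ranges as in the example; the patent's actual pseudo code may differ.

    struct RowDependency       { int task_block_start; int task_block_end; };  // task2 blocks a matrix row depends on
    struct TaskBlockInvolveRow { int row_start; int row_end; };                // matrix rows that depend on a task2 block

    // Called after the task2 block 'finished_block' completes: for every row that depends on it,
    // check whether all of that row's task2 blocks are finished; the rows found ready would then
    // be pushed into task3_issue_queue by the caller.
    __device__ void check_task3_ready(int finished_block,
                                      const TaskBlockInvolveRow* involve,
                                      const RowDependency* dep,
                                      const int* task2_finish,      // 1 if the task2 block has completed
                                      int* ready_rows, int* num_ready) {
        for (int row = involve[finished_block].row_start; row <= involve[finished_block].row_end; ++row) {
            bool ready = true;
            for (int b = dep[row].task_block_start; b <= dep[row].task_block_end; ++b) {
                if (!task2_finish[b]) { ready = false; break; }
            }
            if (ready) ready_rows[(*num_ready)++] = row;
        }
    }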
After the data preprocessing is completed, the kernel function can be run; its pseudo code is shown in fig. 4. The kernel function first performs initialization (including issue queue initialization, clearing the task2_finish array, counter initialization, etc.), and then enters the main loop. The main loop has two steps: obtaining a task block to be executed, and executing that task block. Both steps operate on the issue queues and the counter, which would cause contention if multiple threads operated on them at the same time; in this embodiment, atomic locks are used to ensure that these operations can only be performed by one thread at a time.
The embodiment also provides a SpMV implementation device based on a CSR storage format, including:
at least one general-purpose processor and a general-purpose graphics processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described herein.
The embodiment of the invention provides a SpMV realizing device based on a CSR storage format, which can execute any combination implementation steps of the embodiment of the method and has corresponding functions and beneficial effects.
The embodiment also provides a storage medium which stores instructions or programs for executing the method for realizing the SpMV based on the CSR storage format, and when the instructions or programs are run, the steps can be implemented by any combination of the embodiments of the executable method, so that the method has the corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference has been made to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The SpMV implementation method based on the CSR storage format is characterized by comprising the following steps of:
setting the scale of a sparse matrix as m rows and n columns, wherein the sparse matrix comprises a plurality of non-zero elements; adopting the CSR storage format to store the matrix, obtaining three arrays: values, column_indices and row_offsets;
dividing the SpMV into three sections of operations, wherein the three sections of operations are respectively a value-taking operation, a product operation and a summation operation, and each section of operation corresponds to one task;
dividing each task into a plurality of task blocks; the task blocks corresponding to the value-taking operation are marked as task1 blocks, the task blocks corresponding to the product operation are marked as task2 blocks, and the task blocks corresponding to the summation operation are marked as task3 blocks;
dividing all threads into a plurality of vectors, wherein one vector comprises vector_size threads and serves as the execution unit of a task block;
designing two issue queues and a counter to record the ready states of the task blocks;
each vector acquires and executes a task block in a ready state, and updates the ready state of the task block after execution; when all task blocks are executed, the SpMV operation is completed.
2. The method for implementing SpMV based on the CSR storage format according to claim 1, wherein the value-taking operation is to obtain the value corresponding to the dense vector x according to the column coordinate of each non-zero element in the sparse matrix, and write the result into the temporary array temp as an intermediate value, and the pseudo code is expressed as: x[column_indices[i]] → temp[i].
3. The method for implementing SpMV based on the CSR storage format according to claim 1, wherein the product operation is to multiply the intermediate values stored in the temporary array temp in the value-taking operation by the elements of the values array one by one, and write the results back into the temporary array temp as intermediate values; the pseudo code is expressed as: temp[i] × values[i] → temp[i].
4. The method according to claim 1, wherein the summation operation is to sum the intermediate values saved to the temporary array temp in the product operation according to the number of non-zero elements possessed by each matrix row, and write each summation result back into the corresponding position of the dense vector y, and the pseudo code is expressed as: temp[row_offsets[r]] + temp[row_offsets[r]+1] + ... + temp[row_offsets[r+1]-1] → y[r].
5. the method for implementing the SpMV based on the CSR memory format according to claim 1, wherein the dividing each task into a plurality of task blocks comprises:
when dividing tasks, the task1block and the task2 block have the same size, and the size of a task block is set to be an integer multiple of the size of a thread bundle;
the tasks of the summing operation are divided according to the matrix rows, and the non-zero elements in the same matrix row are divided into a task block.
6. The method for implementing SpMV based on the CSR storage format according to claim 1, wherein a dependency relationship exists among task blocks, task1 blocks always meet the dependency relationship, task2 blocks depend on task1 blocks in a one-to-one correspondence manner, and task3 blocks depend on task2 blocks in a one-to-one or one-to-many manner;
task blocks meeting the dependency relationship are in a ready state, and task2 blocks and task3 blocks in the ready state are added to the issue queues task2_issue_queue and task3_issue_queue, respectively.
7. The method for implementing SpMV based on CSR storage format according to claim 1, wherein each vector obtains and executes a task block in a ready state, comprising:
the vector acquires the task block in a ready state according to the priority to execute;
task2 blocks and task3 blocks are issued to vectors through the issue queues for execution; for the value-taking operation, a counter records the task1 block that currently needs to be issued.
8. The method for implementing SpMV based on CSR memory format according to claim 1, wherein updating the ready state of the task block after execution comprises:
when a vector finishes executing a task1 block, adding the corresponding task2 block to the corresponding issue queue;
when a vector finishes executing a task2 block, checking whether any task3 block meets the dependency condition, and adding such a task3 block to the corresponding issue queue.
9. A SpMV implementation apparatus based on the CSR storage format, characterized in that it comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-8.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-8 when being executed by a processor.
CN202310297248.8A 2023-03-23 2023-03-23 SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format Pending CN116431967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310297248.8A CN116431967A (en) 2023-03-23 2023-03-23 SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310297248.8A CN116431967A (en) 2023-03-23 2023-03-23 SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format

Publications (1)

Publication Number Publication Date
CN116431967A true CN116431967A (en) 2023-07-14

Family

ID=87089997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310297248.8A Pending CN116431967A (en) 2023-03-23 2023-03-23 SpMV (sparse matrix-vector multiplication) implementation method, device and medium based on CSR (compressed sparse row) storage format

Country Status (1)

Country Link
CN (1) CN116431967A (en)

Similar Documents

Publication Publication Date Title
Mittal et al. A survey of techniques for optimizing deep learning on GPUs
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
CN100412851C (en) Methods and apparatus for sharing processor resources
US20100064291A1 (en) System and Method for Reducing Execution Divergence in Parallel Processing Architectures
US20120143932A1 (en) Data Structure For Tiling And Packetizing A Sparse Matrix
CN103019810A (en) Scheduling and management of compute tasks with different execution priority levels
US9069609B2 (en) Scheduling and execution of compute tasks
CN104572568A (en) Read lock operation method, write lock operation method and system
CN113313247B (en) Operation method of sparse neural network based on data flow architecture
US9715413B2 (en) Execution state analysis for assigning tasks to streaming multiprocessors
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
CN115237599B (en) Rendering task processing method and device
CN111459543B (en) Method for managing register file unit
CN116529775A (en) Method and apparatus for ray tracing merge function call
CN116431967A (en) SpMV (virtual private network) implementation method, device and medium based on CSR (compact storage R) storage format
CN111125070A (en) Data exchange method and platform
Mishra et al. Bulk i/o storage management for big data applications
US20220300326A1 (en) Techniques for balancing workloads when parallelizing multiply-accumulate computations
CN104636207A (en) Collaborative scheduling method and system based on GPGPU system structure
CN117785480B (en) Processor, reduction calculation method and electronic equipment
US11709812B2 (en) Techniques for generating and processing hierarchical representations of sparse matrices
CN117271391B (en) Cache structure and electronic equipment
US11397578B2 (en) Selectively dispatching waves based on accumulators holding behavioral characteristics of waves currently executing
US20240134929A1 (en) Column-partitioned sparse matrix multiplication
EP4002117A1 (en) Systems, methods, and devices for shuffle acceleration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination