CN118094074A - Matrix multiplication calculation result accumulation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN118094074A
Authority
CN
China
Prior art keywords
matrix
result
task
matrix multiplication
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410519577.7A
Other languages
Chinese (zh)
Inventor
张昊翀
何力新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center filed Critical Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202410519577.7A priority Critical patent/CN118094074A/en
Publication of CN118094074A publication Critical patent/CN118094074A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and discloses a matrix multiplication calculation result accumulation method, device, equipment and storage medium. The method obtains a matrix multiplication task and decomposes it to obtain matrix multiplication sub-tasks; sets a concurrency limiting operation for the matrix multiplication sub-tasks; obtains a historical accumulated result from the result storage address based on the concurrency limiting operation; and accumulates the matrix multiplication calculation result of each matrix multiplication sub-task with the historical accumulated result to obtain an accumulated result. Decomposing the matrix multiplication task into a plurality of matrix multiplication sub-tasks and processing the sub-tasks in parallel improves the calculation efficiency and saves calculation time; the concurrency limiting operation gives each matrix multiplication sub-task exclusive access to the result storage address, which ensures data consistency when the matrix multiplication calculation results are accumulated, avoids the data inconsistency and possible calculation errors caused by concurrent access to and modification of the shared resource, and improves the accuracy of the calculation results.

Description

Matrix multiplication calculation result accumulation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for accumulating matrix multiplication calculation results.
Background
Matrix multiplication is a common operation in computer data processing and is widely applied in fields such as computer graphics, artificial intelligence, machine learning, signal processing, image processing, optimization, and simulation.
In order to improve the calculation efficiency of matrix multiplication, batch calculation is generally performed in parallel. Existing matrix multiplication libraries generally adopt batched GEMM or a reduction method for parallel batch matrix multiplication. When the batched GEMM method is used, if the result storage addresses of different matrix multiplications point to the same memory, memory write conflicts can occur, causing errors in the accumulated calculation results; and using a reduction method to accumulate the calculation results after the batch matrix multiplication has completed imposes a serious performance burden. Both methods therefore have significant limitations when performing matrix multiplication calculations.
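As a purely illustrative sketch of the write-conflict problem described above (not part of the prior-art libraries themselves; all names and sizes are assumptions made only for illustration), consider several sub-tasks accumulating into one shared result matrix without any synchronization:

import threading
import numpy as np

# Hypothetical illustration: n sub-tasks accumulate into one shared
# result matrix C with no synchronization, so the read-add-write
# sequences of different threads can interleave and lose updates.
n, dim = 64, 128
rng = np.random.default_rng(0)
A = [rng.random((dim, dim)) for _ in range(n)]
B = [rng.random((dim, dim)) for _ in range(n)]
C = np.zeros((dim, dim))

def unsafe_accumulate(i):
    global C
    C = C + A[i] @ B[i]   # read C, add the product, write C back -- not atomic as a whole

threads = [threading.Thread(target=unsafe_accumulate, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# C may now differ from sum(A[i] @ B[i]) because some updates were overwritten.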
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention mainly aims to provide a matrix multiplication calculation result accumulation method, device, equipment and storage medium, and aims to solve the technical problems that matrix multiplication calculation methods in the prior art are highly limited and have low parallel calculation efficiency.
In order to achieve the above object, the present invention provides a matrix multiplication calculation result accumulating method, which includes the steps of:
obtaining a matrix multiplication task, decomposing the matrix multiplication task, and obtaining a matrix multiplication sub-task;
setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address;
acquiring a historical accumulated result from the result storage address based on the concurrency limiting operation;
and accumulating the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
Optionally, after the step of accumulating the matrix multiplication result of the matrix multiplication sub-task with the historical accumulation result to obtain an accumulation result, the method further includes:
storing the accumulated result to the result storage address;
and releasing the concurrency limit of the matrix multiplication sub-task so as to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task.
Optionally, after the step of releasing the concurrency limit of the matrix multiplication sub-task to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task, the method further includes:
And when all the matrix multiplication sub-tasks corresponding to the matrix multiplication tasks are executed, determining a matrix multiplication calculation result of the matrix multiplication tasks based on the accumulation result in the result storage address.
Optionally, the step of setting a concurrency limiting operation when the matrix multiplication sub-task accesses a result storage address includes:
concurrently executing the matrix multiplication sub-tasks to obtain a matrix multiplication calculation result of each matrix multiplication sub-task;
setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address; the concurrency limiting operation comprises a memory locking operation and an atomic operation.
Optionally, the step of concurrently executing the matrix multiplication sub-tasks to obtain a matrix multiplication calculation result of each matrix multiplication sub-task includes:
acquiring matrix weight parameters of each matrix multiplication sub-task in parallel, and executing the matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of the matrix multiplication sub-task.
Optionally, the step of acquiring matrix weight parameters of each matrix multiplication sub-task in parallel and executing the matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of the matrix multiplication sub-task includes:
acquiring, in parallel, a matrix preprocessing mode of the matrix multiplication sub-task, and performing matrix preprocessing on the matrices involved in the matrix multiplication sub-task based on the matrix preprocessing mode;
and acquiring the matrix weight parameters of the matrix multiplication sub-task, and processing the matrix multiplication sub-task after matrix preprocessing based on the matrix weight parameters to obtain the matrix multiplication calculation result.
Optionally, the preprocessing mode includes: matrix transpose or conjugate transpose.
In addition, in order to achieve the above object, the present invention also provides a matrix multiplication result accumulating apparatus, including:
the task decomposition module is used for obtaining a matrix multiplication task, decomposing the matrix multiplication task and obtaining a matrix multiplication sub task;
the concurrency management module is used for setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address;
The result calculation module is used for acquiring a historical accumulated result from the result storage address based on the concurrency limiting operation;
And the result calculation module is also used for accumulating the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
In addition, to achieve the above object, the present invention also proposes a matrix multiplication calculation result accumulating apparatus, the apparatus comprising: a memory, a processor, and a matrix multiply-calculate result accumulation program stored on the memory and executable on the processor, the matrix multiply-calculate result accumulation program configured to implement the steps of the matrix multiply-calculate result accumulation method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a matrix multiplication result accumulation program which, when executed by a processor, implements the steps of the matrix multiplication result accumulation method as described above.
According to the embodiment of the invention, a matrix multiplication task is obtained and decomposed to obtain matrix multiplication sub-tasks; a concurrency limiting operation is set for the matrix multiplication sub-tasks; a historical accumulated result is obtained from the result storage address based on the concurrency limiting operation; and the matrix multiplication calculation result of each matrix multiplication sub-task is accumulated with the historical accumulated result to obtain an accumulated result. Decomposing the matrix multiplication task into a plurality of matrix multiplication sub-tasks and processing the sub-tasks in parallel remarkably improves the calculation efficiency and effectively saves calculation time; the concurrency limiting operation gives each matrix multiplication sub-task mutually exclusive access to the result storage address, which ensures data consistency when the matrix multiplication calculation results are accumulated, avoids the data inconsistency and possible calculation errors caused by concurrent access to and modification of the shared resource, and improves the accuracy of the matrix multiplication calculation results.
Drawings
FIG. 1 is a schematic diagram of a matrix multiplication calculation result accumulation device of a hardware running environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a matrix multiplication result accumulation method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of the matrix multiplication result accumulation method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of a matrix multiplication result accumulation method according to the present invention;
Fig. 5 is a block diagram of a first embodiment of the matrix multiplication result accumulation device according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a matrix multiplication calculation result accumulating device of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the matrix multiplication calculation result accumulation device may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the matrix multiply-calculate result accumulation apparatus, and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a matrix multiplication calculation result accumulation program may be included in the memory 1005 as one storage medium.
In the matrix multiplication calculation result accumulating apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 may be disposed in the matrix multiplication calculation result accumulation device, which invokes the matrix multiplication calculation result accumulation program stored in the memory 1005 through the processor 1001 and executes the matrix multiplication calculation result accumulation method provided by the embodiment of the present invention.
An embodiment of the present invention provides a matrix multiplication result accumulation method, and referring to fig. 2, fig. 2 is a flowchart of a first embodiment of the matrix multiplication result accumulation method of the present invention.
In this embodiment, the matrix multiplication calculation result accumulation method includes the following steps:
Step S10: and obtaining a matrix multiplication task, and decomposing the matrix multiplication task to obtain a matrix multiplication sub-task.
It should be noted that, the execution body of the method of this embodiment may be a terminal device having functions of data processing, matrix calculation, and program running, for example, a computer, a server, or the like, or may be an electronic device having the same or similar functions, for example, the matrix multiplication calculation result accumulating device described above. The present embodiment and the following embodiments will be described below by taking a matrix multiplication calculation result accumulating apparatus (hereinafter referred to as accumulating apparatus) as an example.
It will be appreciated that matrix multiplication, i.e., the multiplication of matrices with one another, is of great importance in the field of computer data processing. For example, in image processing, image data may generally be represented as a matrix, and image processing operations such as image smoothing, edge detection and filtering may be implemented through matrix multiplication, so as to implement functions such as image enhancement, noise reduction and feature extraction. As another example, in machine learning and deep learning, matrix multiplication implements the computation between neuron weights and input signals, thereby enabling training and prediction of models. As yet another example, in big data processing, matrix multiplication can implement operations such as node traversal and node evaluation, thereby generating a data network. The embodiment of the invention does not enumerate every application range and application field of matrix multiplication; any processing that involves matrix multiplication can serve as a practical application of the matrix multiplication calculation result accumulation method.
It should be noted that, when the embodiment of the present invention is applied, a matrix multiplication task may be performed, so as to obtain a matrix multiplication calculation result. The matrix multiplication task is a computer data processing task for realizing the matrix multiplication operation process. The matrix multiplication task can be a multiplication operation task between two matrixes or a matrix multiplication calculation result accumulation operation task, and the embodiment of the invention does not limit the practical application example of the matrix multiplication task.
It can be appreciated that when the matrix multiplication calculation result accumulation operation task is executed, the matrix multiplication task can be decomposed to obtain a plurality of matrix multiplication sub-tasks. And the subtasks are processed in parallel, and matrix multiplication calculation results corresponding to each matrix multiplication subtask after parallel processing are accumulated, so that the matrix multiplication calculation results corresponding to the matrix multiplication tasks can be determined.
A matrix multiplication sub-task is a task of multiplying two matrices together, and can be expressed as:
result = A * B;
wherein result is used to represent the matrix multiplication result of the matrix multiplication sub-task, A is used to represent the first matrix, and B is used to represent the second matrix.
It should be noted that, decomposing the matrix multiplication task may obtain a plurality of matrix multiplication sub-tasks, each matrix multiplication sub-task may include a pair of matrices, that is, a first matrix and a second matrix, and performing matrix multiplication operation on the first matrix and the second matrix may obtain a matrix multiplication calculation result of the matrix multiplication sub-task. And the matrix multiplication calculation results of the matrix multiplication tasks can be obtained by carrying out parallel processing on a plurality of matrix multiplication sub-tasks and accumulating the matrix multiplication calculation results of the matrix multiplication sub-tasks.
In a specific implementation, the accumulating device acquires a matrix multiplication task, decomposes the matrix multiplication task, and acquires a matrix multiplication sub-task. The matrix multiplication task is decomposed into a plurality of matrix multiplication sub-tasks, and the matrix multiplication sub-tasks are processed in parallel, so that the capability of the modern multi-core processor is fully utilized, and the batch matrix multiply-accumulate calculation from 1 to n is greatly accelerated. Particularly, when the technical problems of large-scale matrixes (such as the fields of deep learning, high-performance calculation and the like) are solved, the calculation efficiency is remarkably improved, and the calculation time can be effectively saved.
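As an illustrative sketch only (assuming, for illustration, that the batch task is given as two lists of operand matrices; the class and function names below are hypothetical and not part of the disclosure), the decomposition of a batch matrix multiplication task into sub-tasks might look like this in Python:

from dataclasses import dataclass
import numpy as np

@dataclass
class MatMulSubTask:
    # One pair of operand matrices; multiplying them yields this
    # sub-task's matrix multiplication calculation result.
    A: np.ndarray   # first matrix
    B: np.ndarray   # second matrix

    def run(self) -> np.ndarray:
        return self.A @ self.B   # result = A * B

def decompose(A_batch, B_batch):
    # Decompose the batch matrix multiplication task into one sub-task
    # per operand pair, so the sub-tasks can later be processed in parallel.
    return [MatMulSubTask(A, B) for A, B in zip(A_batch, B_batch)]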
Step S20: setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address;
step S30: and acquiring a historical accumulated result from the result storage address based on the concurrency limit operation.
It should be noted that, in a multi-task parallel processing environment, the concurrency limiting operation restricts access to a target resource or operation, so as to ensure that concurrent access and operations between threads remain controllable and well-ordered.
Specifically, the step of setting a concurrency limiting operation when the matrix multiplication sub-task accesses a result storage address includes:
step S21: concurrently executing the matrix multiplication sub-tasks to obtain a matrix multiplication calculation result of each matrix multiplication sub-task;
step S22: setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address; the concurrency limiting operation comprises a memory locking operation and an atomic operation.
When the matrix multiplication task is decomposed into a plurality of matrix multiplication sub-tasks, the matrix multiplication sub-tasks can be executed concurrently to obtain the concurrent execution result of each matrix multiplication sub-task.
It should be explained that, in order to save and process the execution result of the matrix multiplication task, in the embodiment of the present invention, the matrix multiplication task corresponds to a result storage address for storing the accumulated result of the matrix multiplication task.
When a matrix multiplication sub-task accesses the result storage address, a concurrency limiting operation can be performed on the result storage address, so that only one matrix multiplication sub-task accesses the result storage address at any one time. This avoids problems such as conflicts and erroneous accumulated results caused by a plurality of matrix multiplication sub-tasks accessing the result storage address simultaneously, and improves the accuracy of matrix multiplication calculation result accumulation. Specifically, the matrix multiplication calculation of a matrix multiplication sub-task is completed to obtain the matrix multiplication calculation result of that sub-task; the result storage address is then subjected to the concurrency limiting operation, so that only one matrix multiplication sub-task can access the result storage address at a time.
It can be understood that, while the concurrency limit is in force on the result storage address, the matrix multiplication sub-task holding it can perform read and write operations on the accumulated result stored at the result storage address.
It should be understood that the historical accumulated result is the accumulated result stored at the result storage address. The accumulated result can be updated by accumulating the historical accumulated result with the matrix multiplication calculation result of the current matrix multiplication sub-task, and the updated accumulated result is stored at the result storage address so that subsequent matrix multiplication sub-tasks can continue to accumulate their matrix multiplication calculation results.
It should be noted that a lock is a mechanism for implementing mutually exclusive access between threads in concurrent processing, so as to protect the consistency and reliability of shared resources in a multithreaded environment. By locking the result storage address, access to it becomes mutually exclusive: while one matrix multiplication sub-task is accessing the result storage address, other matrix multiplication sub-tasks wait until the currently accessing sub-task leaves (namely, releases the lock), so that only one matrix multiplication sub-task can access the result storage address at any one time.
It should be explained that atomic operations refer to indivisible operations that complete without being interrupted by other threads in a multi-threaded or concurrent environment. An atomic operation either executes completely and successfully or does not execute at all, and no intermediate inconsistent state occurs, so that only one matrix multiplication sub-task is allowed to access the result storage address at a time.
In one implementation of the present invention, in order to implement parallel processing of the matrix multiplication sub-tasks, an embodiment of the present invention designs a computing component for performing the matrix multiplication sub-task computation. The computing component may be a thread pool or another concurrency tool, through which parallel execution of the matrix multiplication sub-tasks and accumulation of their matrix multiplication calculation results can be implemented.
Specifically, when the matrix multiplication task is decomposed into a plurality of matrix multiplication sub-tasks, the embodiment of the invention also adds the matrix multiplication sub-tasks to the computing component, realizes parallel processing of the sub-tasks through the computing component, and performs the concurrency limiting operation on the result storage address when the matrix multiplication calculation result of each sub-task is obtained, so that the calculation results are accumulated to the result storage address one by one and the accuracy of the matrix multiplication calculation result accumulation is improved.
In a specific implementation, the accumulation device sets the concurrency limiting operation for the matrix multiplication sub-tasks, and acquires the historical accumulated result from the result storage address based on that concurrency limit. Because the concurrency limiting operation gives each matrix multiplication sub-task mutually exclusive access to the result storage address, the read and write operations on the shared accumulated result can be correctly synchronized in a parallel computing environment, ensuring data consistency and the correctness of the calculation result, and avoiding the data inconsistency and possible calculation errors caused by concurrent access to and modification of the shared resource.
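As a minimal, non-authoritative sketch of steps S20 to S40, assuming a thread pool as the computing component and a mutex as the concurrency limiting operation (the atomic-operation variant is not shown), the locked accumulation might look like this in Python; all names here are illustrative assumptions, not the patent's interface:

import threading
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def accumulate_batch(A_batch, B_batch, C):
    # C is the shared result storage; the lock is the concurrency
    # limiting operation that gives each sub-task exclusive access to it.
    lock = threading.Lock()

    def sub_task(i):
        result = A_batch[i] @ B_batch[i]     # matrix multiplication calculation result
        with lock:                           # set the concurrency limit
            C += result                      # read the historical result, accumulate, store back
        # leaving the with-block releases the limit for the next sub-task

    with ThreadPoolExecutor() as pool:       # the "computing component" (thread pool)
        list(pool.map(sub_task, range(len(A_batch))))
    return C                                 # after all sub-tasks finish, C holds the final result

For example, calling accumulate_batch(A_batch, B_batch, np.zeros((m, p))) would, under these assumptions, return the sum of all A_batch[i] @ B_batch[i] products.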
Step S40: and accumulating the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
It will be appreciated that the historical accumulated result is the accumulated result stored at the result storage address. The accumulated result stored at the result storage address can be updated by accumulating the matrix multiplication calculation results of the matrix multiplication sub-tasks with it.
Specifically, it can be expressed in pseudo code as follows:
C = result + C;
Wherein result is used for representing the matrix multiplication result of the matrix multiplication sub-task, and C is used for representing the accumulation result in the result storage address.
In a specific implementation, the accumulation device accumulates the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
According to the embodiment of the invention, a matrix multiplication task is obtained and decomposed to obtain matrix multiplication sub-tasks; a concurrency limiting operation is set for the matrix multiplication sub-tasks; a historical accumulated result is obtained from the result storage address based on the concurrency limiting operation; and the matrix multiplication calculation result of each matrix multiplication sub-task is accumulated with the historical accumulated result to obtain an accumulated result. Decomposing the matrix multiplication task into a plurality of matrix multiplication sub-tasks and processing the sub-tasks in parallel remarkably improves the calculation efficiency and effectively saves calculation time; the concurrency limiting operation gives each matrix multiplication sub-task mutually exclusive access to the result storage address, which ensures data consistency when the matrix multiplication calculation results are accumulated, avoids the data inconsistency and possible calculation errors caused by concurrent access to and modification of the shared resource, and improves the accuracy of the matrix multiplication calculation results.
Based on the first embodiment of the matrix-multiply-computation-result accumulation method of the present invention as described above, a second embodiment of the matrix-multiply-computation-result accumulation method of the present invention is proposed.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the matrix multiplication result accumulation method according to the present invention.
Based on the first embodiment, in this embodiment, after the step of accumulating the matrix multiplication calculation result of the matrix multiplication sub-task with the historical accumulation result to obtain an accumulation result, the method further includes:
Step S50: storing the accumulated result to the result storage address;
Step S60: releasing the concurrency limit of the matrix multiplication sub-task so as to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task.
It will be appreciated that, when the accumulation is completed, the accumulated result may be stored at the result storage address so that the next matrix multiplication sub-task can read it and continue accumulating. By repeating this accumulation step, once all matrix multiplication sub-tasks are completed, the matrix multiplication calculation result corresponding to the matrix multiplication task is obtained. That is, after the step of releasing the concurrency limit of the matrix multiplication sub-task to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task, the method further includes:
step S70: and when all the matrix multiplication sub-tasks corresponding to the matrix multiplication tasks are executed, determining a matrix multiplication calculation result of the matrix multiplication tasks based on the accumulation result in the result storage address.
It should be noted that, in order to allow the next matrix multiplication sub-task to access the result storage address normally, when the accumulation of the current matrix multiplication sub-task is completed, the concurrency limit of that sub-task may be released, and the process returns to the step of setting the concurrency limit for the matrix multiplication sub-task (i.e., the next matrix multiplication sub-task), so that result accumulation of the next matrix multiplication sub-task is achieved.
In one embodiment of the invention, a result matrix, i.e. the accumulated result stored in the result storage address, may be provided in the result storage address. And accumulating the matrix multiplication calculation result of the matrix multiplication subtask and the result matrix, so that the result matrix can be updated.
It can be understood that when all the matrix multiplication sub-tasks corresponding to the matrix multiplication tasks are executed, the accumulated result in the result storage address can represent the matrix multiplication calculation result of the matrix multiplication tasks.
In a specific implementation, the accumulation device stores the accumulated result at the result storage address and releases the concurrency limit of the matrix multiplication sub-task so as to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task; and when all the matrix multiplication sub-tasks corresponding to the matrix multiplication task have been executed, the matrix multiplication calculation result of the matrix multiplication task is determined based on the accumulated result at the result storage address.
The embodiment of the invention stores the accumulated result to the result storage address; the concurrency limit of the matrix multiplication sub-task is released, so that the matrix multiplication calculation result of the next matrix multiplication sub-task is accumulated; and when all the matrix multiplication sub-tasks corresponding to the matrix multiplication tasks are executed, determining a matrix multiplication calculation result of the matrix multiplication tasks based on the accumulated result in the result storage address. And when the accumulation result of the matrix multiplication sub-task of the current access result storage address is stored, the concurrency limit of the result storage address is released, so that the next matrix multiplication sub-task can access the result storage address, and the accuracy of accumulation of the matrix multiplication calculation result is ensured.
Based on the above embodiments, a third embodiment of the method of the present invention is provided, and referring to fig. 4, fig. 4 is a flowchart of a third embodiment of the matrix multiplication result accumulation method of the present invention.
In this embodiment, the step of concurrently executing the matrix multiplication sub-tasks to obtain a matrix multiplication calculation result of each matrix multiplication sub-task includes:
Step S210: acquiring matrix weight parameters of each matrix multiplication sub-task, and executing each matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of each matrix multiplication sub-task.
It should be noted that, when determining the matrix multiplication calculation result of a matrix multiplication sub-task, the first matrix weight parameter of the sub-task and the matrix weight parameter corresponding to the accumulated result (i.e., the result matrix) at the result storage address may be determined; through these matrix weight parameters, a weighted calculation can be implemented, so as to achieve a better calculation effect and meet the calculation requirements of different processing fields.
It can be understood that, when the accumulation device executes a matrix multiplication sub-task, it can obtain the matrix weight parameters corresponding to the two matrices in the sub-task and execute the sub-task in a weighted manner based on those parameters, thereby obtaining the calculation result of the matrix multiplication sub-task.
In one implementation of the invention, in performing a batch matrix multiply-accumulate computation from 1 to n groups, the process of weighting computation can be expressed using pseudocode as:
result = alpha * A[i] * B[i];
when the result storage address is locked: C[i] = result + beta * C[i];
when atomic operations are used: AtomAdd(C[i], result + beta * C[i]);
Wherein alpha is used to represent a first matrix weight parameter of the matrix multiplier task, beta is used to represent a second matrix weight parameter of the result matrix, A [ i ] is used to represent the first matrix, B [ i ] is used to represent the second matrix, C [ i ] is used to represent the result matrix, atomAdd is used to represent the atomic operation, and i is an integer from 1 to n.
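A hedged sketch of the locked, weighted update above, assuming for illustration that every sub-task's result storage address aliases one shared result matrix C (the AtomAdd form corresponds to a hardware atomic update, for example a GPU atomic add, and is not reproduced here); all names are illustrative assumptions:

import threading
import numpy as np

def weighted_accumulate(A_list, B_list, C, alpha=1.0, beta=1.0):
    # Locked variant of the per-sub-task update shown in the pseudocode:
    #   C[i] = alpha * A[i] * B[i] + beta * C[i]
    # Assumption for illustration: every sub-task's result storage address
    # aliases the same shared matrix C, so the lock serialises the updates.
    lock = threading.Lock()

    def sub_task(i):
        result = alpha * (A_list[i] @ B_list[i])
        with lock:
            C[:] = result + beta * C   # in-place update of the shared result matrix

    threads = [threading.Thread(target=sub_task, args=(i,)) for i in range(len(A_list))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C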
It should be noted that, when the matrix multiplication calculation results corresponding to the matrix multiplication sub-tasks are accumulated, the order in which the sub-tasks finish their calculations determines the order in which they access the result storage address. Specifically, when a matrix multiplication sub-task obtains its matrix multiplication calculation result, if the result storage address is currently being accessed, the sub-task is added to an access waiting queue; if the result storage address is not being accessed, the first matrix multiplication sub-task in the access waiting queue accesses the result storage address, and the concurrency limiting operation is applied to the result storage address during the access.
In a specific implementation, the accumulation device may acquire the matrix weight parameters of each matrix multiplication sub-task in parallel, and execute the sub-task based on those parameters to obtain its matrix multiplication calculation result. Performing the matrix multiplication calculation and result accumulation according to the matrix weight parameters of each sub-task and of the result matrix gives the matrix multiplication calculation result accumulation method a wider application range and improves its practicability.
Further, the step of obtaining the matrix weight parameters of the matrix multiplication sub-task, executing the matrix multiplication sub-task based on the matrix weight parameters, and obtaining the matrix multiplication calculation result of the matrix multiplication sub-task includes:
Step S211: acquiring, in parallel, a matrix preprocessing mode of the matrix multiplication sub-task, and performing matrix preprocessing on the matrices involved in the matrix multiplication sub-task based on the matrix preprocessing mode;
Step S212: acquiring the matrix weight parameters of the matrix multiplication sub-task, and processing the matrix multiplication sub-task after matrix preprocessing based on the matrix weight parameters to obtain the matrix multiplication calculation result.
In the fields of image processing, data compression, encoding, simulation, and emulation, it is generally necessary to pre-process the matrix to improve the processing efficiency, such as matrix transposition, conjugate transposition, dimension reduction, normalization, and the like, which is not limited by the embodiments of the present invention.
Therefore, when the matrix multiplication calculation result accumulation method of the embodiment of the invention is executed, matrix preprocessing can be carried out on the matrices involved in each matrix multiplication sub-task as required. In particular, when performing a batch matrix multiply-accumulate computation over groups 1 to n, the preprocessing step can be expressed in pseudocode as:
result = alpha * op(A[i]) * op(B[i])
Where op represents some preprocessing operation, such as matrix transposition, conjugate transposition, etc.
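A brief illustrative sketch of the op() preprocessing step, assuming transposition or conjugate transposition as described in the text; the mode strings and function names are hypothetical:

import numpy as np

def op(M, mode="none"):
    # Optional matrix preprocessing applied before the multiplication.
    if mode == "transpose":
        return M.T
    if mode == "conjugate_transpose":
        return M.conj().T
    return M   # "none": use the matrix as-is

def preprocessed_product(A, B, alpha=1.0, mode_a="none", mode_b="none"):
    # result = alpha * op(A) * op(B), matching the pseudocode above.
    return alpha * (op(A, mode_a) @ op(B, mode_b))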
According to the embodiment of the invention, the matrix preprocessing mode of the matrix multiplication sub-task is obtained in parallel, and the matrices involved in the sub-task are preprocessed based on that mode; the matrix weight parameters of the matrix multiplication sub-task are then obtained, and the preprocessed sub-task is processed based on those parameters to obtain the matrix multiplication calculation result. Performing the matrix multiplication calculation and result accumulation according to the matrix weight parameters of each sub-task and of the result matrix, and preprocessing the involved matrices during the sub-task, give the matrix multiplication calculation result accumulation method a wider application range and improve its practicability.
In addition, the embodiment of the invention also provides a storage medium, wherein a matrix multiplication calculation result accumulation program is stored on the storage medium, and the matrix multiplication calculation result accumulation program realizes the steps of the matrix multiplication calculation result accumulation method when being executed by a processor.
Based on the first embodiment of the matrix multiplication result accumulation method of the present invention, a first embodiment of the matrix multiplication result accumulation device of the present invention is provided, and referring to fig. 5, fig. 5 is a block diagram of the first embodiment of the matrix multiplication result accumulation device of the present invention.
As shown in fig. 5, the matrix multiplication calculation result accumulating device provided by the embodiment of the present invention includes:
The task decomposition module 501 is configured to obtain a matrix multiplication task, decompose the matrix multiplication task, and obtain a matrix multiplication sub-task;
the concurrency management module 502 is configured to set a concurrency limiting operation when the matrix multiplication sub-task accesses a result storage address;
a result calculation module 503, configured to obtain a history accumulated result from the result storage address based on the concurrency restriction operation;
The result calculation module 503 is further configured to accumulate the matrix multiplication result of the matrix multiplication sub-task with the historical accumulation result, so as to obtain an accumulation result.
According to the embodiment of the invention, a matrix multiplication task is obtained and decomposed to obtain matrix multiplication sub-tasks; a concurrency limiting operation is set for the matrix multiplication sub-tasks; a historical accumulated result is obtained from the result storage address based on the concurrency limiting operation; and the matrix multiplication calculation result of each matrix multiplication sub-task is accumulated with the historical accumulated result to obtain an accumulated result. Decomposing the matrix multiplication task into a plurality of matrix multiplication sub-tasks and processing the sub-tasks in parallel remarkably improves the calculation efficiency and effectively saves calculation time; the concurrency limiting operation gives each matrix multiplication sub-task mutually exclusive access to the result storage address, which ensures data consistency when the matrix multiplication calculation results are accumulated, avoids the data inconsistency and possible calculation errors caused by concurrent access to and modification of the shared resource, and improves the accuracy of the matrix multiplication calculation results.
In one implementation of the present invention, the result calculation module 503 is further configured to store the accumulated result at the result storage address, and to release the concurrency limit of the matrix multiplication sub-task so as to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task.
In one implementation of the present invention, the result calculation module 503 is further configured to determine a matrix multiplication result of the matrix multiplication task based on the accumulated result in the result storage address when all the matrix multiplication sub-tasks corresponding to the matrix multiplication task are executed.
In one implementation of the present invention, the concurrency management module 502 is further configured to execute the matrix multiplication sub-tasks concurrently to obtain a matrix multiplication calculation result of each matrix multiplication sub-task, and to set a concurrency limiting operation when a matrix multiplication sub-task accesses the result storage address; the concurrency limiting operation comprises a memory locking operation and an atomic operation.
In one implementation of the present invention, the result calculation module 503 is further configured to obtain matrix weight parameters of each matrix multiplication sub-task in parallel, and execute the matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of the matrix multiplication sub-task.
In one implementation of the present invention, the result calculation module 503 is further configured to obtain, in parallel, a matrix preprocessing mode of the matrix multiplication sub-task, and perform matrix preprocessing on the matrices involved in the matrix multiplication sub-task based on the matrix preprocessing mode; and to obtain the matrix weight parameters of the matrix multiplication sub-task, and process the matrix multiplication sub-task after matrix preprocessing based on the matrix weight parameters to obtain the matrix multiplication calculation result.
In one implementation of the present invention, the preprocessing mode includes: matrix transpose or conjugate transpose.
Other embodiments or specific implementation manners of the matrix multiplication result accumulation device of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A method for accumulating matrix multiplication results, the method comprising:
obtaining a matrix multiplication task, decomposing the matrix multiplication task, and obtaining a matrix multiplication sub-task;
setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address;
acquiring a historical accumulated result from a result storage address based on the concurrency limit operation;
and accumulating the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
2. The matrix multiplication calculation result accumulation method of claim 1, wherein after the step of accumulating the matrix multiplication calculation result of the matrix multiplication sub-task with the historical accumulation result to obtain an accumulation result, the method further comprises:
storing the accumulated result to the result storage address;
and releasing the concurrency limit of the matrix multiplication sub-task so as to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task.
3. The matrix multiplication calculation result accumulation method according to claim 2, wherein after the step of releasing the concurrency limit of the matrix multiplication sub-task to accumulate the matrix multiplication calculation result of the next matrix multiplication sub-task, the method further comprises:
And when all the matrix multiplication sub-tasks corresponding to the matrix multiplication tasks are executed, determining a matrix multiplication calculation result of the matrix multiplication tasks based on the accumulation result in the result storage address.
4. The matrix multiplication calculation result accumulation method of claim 1, wherein the step of setting a concurrency limiting operation when the matrix multiplication sub-task accesses a result storage address comprises:
concurrently executing the matrix multiplication sub-tasks to obtain a matrix multiplication calculation result of each matrix multiplication sub-task;
setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address; the concurrency limiting operation comprises a memory locking operation and an atomic operation.
5. The matrix multiplication calculation result accumulation method of claim 4, wherein the step of concurrently executing the matrix multiplication sub-tasks to obtain the matrix multiplication calculation result of each matrix multiplication sub-task comprises:
acquiring matrix weight parameters of each matrix multiplication sub-task in parallel, and executing the matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of the matrix multiplication sub-task.
6. The matrix multiplication calculation result accumulation method according to claim 5, wherein the step of acquiring matrix weight parameters of each matrix multiplication sub-task in parallel and executing the matrix multiplication sub-task based on the matrix weight parameters to obtain the matrix multiplication calculation result of the matrix multiplication sub-task comprises:
acquiring, in parallel, a matrix preprocessing mode of the matrix multiplication sub-task, and performing matrix preprocessing on the matrices involved in the matrix multiplication sub-task based on the matrix preprocessing mode;
and acquiring the matrix weight parameters of the matrix multiplication sub-task, and processing the matrix multiplication sub-task after matrix preprocessing based on the matrix weight parameters to obtain the matrix multiplication calculation result.
7. The matrix multiply-calculate result accumulation method of claim 6, wherein the preprocessing mode includes: matrix transpose or conjugate transpose.
8. A matrix multiplication result accumulating apparatus, characterized in that the matrix multiplication result accumulating apparatus comprises:
the task decomposition module is used for obtaining a matrix multiplication task, decomposing the matrix multiplication task and obtaining a matrix multiplication sub task;
the concurrency management module is used for setting a concurrency limiting operation when the matrix multiplication sub-task accesses the result storage address;
The result calculation module is used for acquiring a historical accumulated result from the result storage address based on the concurrency limiting operation;
And the result calculation module is also used for accumulating the matrix multiplication calculation result of the matrix multiplication sub-task and the historical accumulation result to obtain an accumulation result.
9. A matrix multiply-calculate result accumulation apparatus, the apparatus comprising: a memory, a processor and a matrix multiply-calculate result accumulation program stored on the memory and executable on the processor, the matrix multiply-calculate result accumulation program configured to implement the steps of the matrix multiply-calculate result accumulation method of any one of claims 1 to 7.
10. A storage medium having stored thereon a matrix-multiplied-computation-result accumulation program which, when executed by a processor, implements the steps of the matrix-multiplied-computation-result accumulation method of any one of claims 1 to 7.
CN202410519577.7A 2024-04-28 2024-04-28 Matrix multiplication calculation result accumulation method, device, equipment and storage medium Pending CN118094074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410519577.7A CN118094074A (en) 2024-04-28 2024-04-28 Matrix multiplication calculation result accumulation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410519577.7A CN118094074A (en) 2024-04-28 2024-04-28 Matrix multiplication calculation result accumulation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118094074A true CN118094074A (en) 2024-05-28

Family

ID=91153415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410519577.7A Pending CN118094074A (en) 2024-04-28 2024-04-28 Matrix multiplication calculation result accumulation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118094074A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4697247A (en) * 1983-06-10 1987-09-29 Hughes Aircraft Company Method of performing matrix by matrix multiplication
JP2005216124A (en) * 2004-01-30 2005-08-11 Mitsubishi Electric Corp Matrix operation apparatus
US20160094660A1 (en) * 2014-09-30 2016-03-31 Interactic Holdings, Llc Matrix Vector Multiply Techniques
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN116992203A (en) * 2023-07-13 2023-11-03 中山大学 FPGA-based large-scale high-throughput sparse matrix vector integer multiplication method
CN117828252A (en) * 2023-10-31 2024-04-05 深圳爱特思信息技术有限公司 High-performance matrix vector multiplication method based on matrix core


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination