CN115543566A - Method and system for executing computation graph by multiple threads - Google Patents


Info

Publication number
CN115543566A
CN115543566A (application CN202211184997.1A)
Authority
CN
China
Prior art keywords
memory space
threads
computation graph
thread
relative index
Prior art date
Legal status
Pending
Application number
CN202211184997.1A
Other languages
Chinese (zh)
Inventor
孙承根
焦英翔
石光川
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202211184997.1A
Publication of CN115543566A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Image Processing (AREA)

Abstract

A method and system for executing a computation graph using multiple threads are provided. The method comprises: acquiring a computation graph comprising at least one operation, wherein a relative index, in a memory space, of the operation data of the at least one operation is declared in the computation graph; creating a plurality of first threads and allocating a corresponding memory space to each first thread; copying the computation graph for each first thread; and executing the at least one operation included in the computation graph according to the starting address of each first thread's memory space and the relative index. Because only relative indices are declared, the computation graph can be copied directly in a distributed environment to achieve data-parallel acceleration; at the same time, the relative-index declaration involves no distributed logic, so the user need not consider how to handle multithreading, which reduces the user's cost of use and development difficulty.

Description

Method and system for executing computation graph by multiple threads
This application is a divisional application of the patent application filed on September 6, 2018, with application number 201811037341.0, entitled "Method and system for executing a computation graph by multiple threads".
Technical Field
The present invention relates to the field of parallel processing of data, and more particularly, to a method and system for executing a computational graph using multiple threads.
Background
As the amount of data to be processed in various industries grows, easy-to-use distributed data-processing tools are needed; and as the problems to be solved become more complex, the corresponding algorithms grow more complex as well, posing further challenges to the usability and flexibility of such tools. The computation graph model is a general way of representing a computational process. It is widely used on data-processing platforms, is easy to understand and highly flexible, and can express complex logic by combining simple operations.
A common trade-off adopted by current computation-graph execution frameworks is that the computation graph itself is not processed in parallel, so that the user can declare the graph while thinking in terms of sequential execution, which reduces development difficulty; parallelism is instead confined within each individual operation of the graph to speed up execution. With this approach, only one computation graph runs on a compute node at any given time, which also reduces memory consumption to some extent.
However, as hardware improves and its price falls, memory overhead is gradually no longer the bottleneck; and as algorithm designs grow more complex, some operations cannot be parallelized internally. When a user needs a computation the framework does not support, the user must implement the multithreading logic personally, multiplying development difficulty; otherwise execution efficiency suffers severely.
A simple and generally applicable parallelization method is data parallelism: different execution units each process different data. It is the most common approach in multi-machine distribution, so if single-machine multithreading likewise uses data parallelism to achieve acceleration, the user's development cost can be reduced.
For a computation graph, however, the memory space of each segment of data to be processed is controlled by several upstream and downstream nodes of the graph, and the actual index of that memory space is usually recorded on the execution node. When the computation graph is copied for multithreaded parallel processing, the actual indices are copied as well, so multiple threads share the memory space denoted by the same actual index; that is, multiple computation graphs point to the same data, which causes memory conflicts and incorrect results during parallel execution. Avoiding such conflicts requires re-declaring the graph for each thread, which increases development difficulty.
Disclosure of Invention
An object of the present invention is to provide a method and system for executing a computation graph using multiple threads, so as to solve the problem that avoiding memory conflicts in existing multithreaded computation-graph processing entails high development difficulty.
One aspect of the present invention provides a method for executing a computation graph using multiple threads, comprising: acquiring a computation graph comprising at least one operation, wherein a relative index, in a memory space, of the operation data of the at least one operation is declared in the computation graph; creating a plurality of first threads and allocating a corresponding memory space to each first thread; copying the computation graph for each first thread; and executing the at least one operation included in the computation graph according to the starting address of each first thread's memory space and the relative index.
Optionally, the step of executing the at least one operation included in the computation graph according to the starting address of the memory space of each first thread and the relative index comprises: determining, by each first thread, the memory address of the operation data of the at least one operation included in the corresponding computation graph from the starting address of the corresponding memory space and the relative index; and executing, by the plurality of first threads, the at least one operation included in the respective computation graph at the memory addresses so determined.
Optionally, the method further comprises creating a plurality of second threads, and the executing step comprises: generating, by each first thread, an operation packet for the at least one operation included in the corresponding computation graph from the starting address of the corresponding memory space and the relative index; and executing, by the plurality of second threads, the operation packets created by the plurality of first threads. The operation packet carries either (a) the starting address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of that operation's data in the memory space, or (b) the processing procedure of the corresponding operation together with the operation's memory address, generated from the starting address of the corresponding memory space and the relative index of the operation's data in that space.
Optionally, the method further comprises creating a plurality of second threads, and the executing step comprises: generating, by each first thread, an operation packet for the next operation to be executed among the at least one operation included in the corresponding computation graph, from the starting address of the corresponding memory space and the relative index; placing the operation packets generated by the first threads into a buffer queue; fetching, by the plurality of second threads, operation packets from the buffer queue for execution, and fetching a new operation packet after each one completes; and notifying, by the plurality of second threads, the corresponding first thread of the completion of an operation packet, so that the first thread determines the next operation to be executed among the at least one operation included in its computation graph. Each operation packet carries either (a) the starting address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of that operation's data in the memory space, or (b) the processing procedure of the corresponding operation together with the operation's memory address, generated from the starting address and the relative index.
Optionally, the at least one operation involves at least one arithmetic operation in a machine learning algorithm.
Optionally, the number of second threads is greater than the number of first threads.
Optionally, the step of allocating a corresponding memory space to each first thread comprises allocating the memory space according to the total amount of operation data of the at least one operation.
Another aspect of the present invention provides a system for executing a computation graph using multiple threads, comprising: a computation-graph acquiring device for acquiring a computation graph comprising at least one operation, wherein a relative index, in a memory space, of the operation data of the at least one operation is declared in the computation graph; a creating device for creating a plurality of first threads and allocating a corresponding memory space to each first thread; a copying device for copying the computation graph for each first thread; and an executing device for executing the at least one operation included in the computation graph according to the starting address of each first thread's memory space and the relative index.
Optionally, the executing device is configured to use each first thread to determine the memory address of the operation data of the at least one operation included in the corresponding computation graph from the starting address of the corresponding memory space and the relative index, and to use the plurality of first threads to execute the at least one operation of their respective computation graphs at the memory addresses so determined.
Optionally, the creating device is further configured to create a plurality of second threads, and the executing device is configured to use each first thread to generate, from the starting address of the corresponding memory space and the relative index, an operation packet for the at least one operation included in the corresponding computation graph, and to use the plurality of second threads to execute the operation packets created by the first threads. The operation packet carries either (a) the starting address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of that operation's data in the memory space, or (b) the processing procedure of the corresponding operation together with the operation's memory address, generated from the starting address and the relative index.
Optionally, the creating device is further configured to create a plurality of second threads, and the executing device is configured to: use each first thread to generate, from the starting address of the corresponding memory space and the relative index, an operation packet for the next operation to be executed in the corresponding computation graph; use the plurality of first threads to place the generated operation packets into a buffer queue; use the plurality of second threads to take operation packets from the queue and execute them, fetching a new packet after each completes; and use the second threads to notify the corresponding first thread when a packet has finished executing, so that the first thread can determine the next operation to execute in its computation graph. Each operation packet carries either (a) the starting address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of that operation's data in the memory space, or (b) the processing procedure of the corresponding operation together with the operation's memory address, generated from the starting address and the relative index.
Optionally, the at least one operation involves at least one arithmetic operation in a machine learning algorithm.
Optionally, the number of second threads is greater than the number of first threads.
Optionally, the creating device is configured to allocate a corresponding memory space to each first thread according to the total amount of operation data of the at least one operation.
Another aspect of the invention provides a system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform a method of executing a computational graph using multiple threads as described above.
Another aspect of the present invention provides a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method of executing a computational graph using multiple threads as described above.
In the method and system for executing a computation graph using multiple threads according to embodiments of the present invention, because the relative index, in a memory space, of the operation data of the graph's operations is declared in the computation graph, the graph can be copied directly in a distributed environment to achieve data-parallel acceleration; meanwhile, the relative-index declaration involves no distributed logic, so the user need not consider how to handle multithreading, reducing the user's cost of use and development difficulty. In addition, the method and system can support processing of streaming data, and can balance development cost with execution efficiency even when the amount of data to be processed is very large.
Additional aspects and/or advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram illustrating a system for executing a computational graph using multiple threads, according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a conventional computation graph and memory space relationship;
FIGS. 3 and 4 are diagrams illustrating the relationship between the computation graph and the memory space of the present invention;
FIG. 5 is a flow diagram illustrating a method of executing a computational graph using multiple threads, according to an embodiment of the invention;
FIG. 6 is a flowchart illustrating the step of executing at least one operation included in the computation graph according to the starting address of the memory space of each first thread and the relative index, according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a block diagram illustrating a system for executing a computation graph using multiple threads according to an embodiment of the present invention. As shown in FIG. 1, the system includes a computation graph obtaining device 101, a creating device 102, a copying device 103, and an executing device 104.
Specifically, the computation graph obtaining device 101 is configured to obtain a computation graph including at least one operation. A relative index, in a memory space, of the operation data of the at least one operation is declared in the computation graph.
As an example, the computation graph comprises at least one operation, which may also be referred to as an arithmetic operation. As an example, the at least one operation involves at least one arithmetic operation in a machine learning algorithm. The corresponding operation data may include input data and output data of the operation.
When the computation graph is declared, the memory space of its operation data is expressed by relative indices rather than direct (i.e., actual) indices. As an example, an actual memory address is typically represented by 8 bytes of data, whereas each relative index of the computation graph according to an embodiment of the present invention need only range up to the total amount of operation data (i.e., the number of operation data items) of the graph's operations, which is generally small and can be represented by 4 or even 2 bytes; the memory occupied by the computation graph itself is therefore reduced.
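To make the declaration style concrete, here is a minimal sketch in Python (a hypothetical representation invented for illustration; the patent does not prescribe any particular data layout): each operation names its inputs and outputs only by small integer relative indices, so no actual memory address appears anywhere in the graph.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One operation: a processing procedure plus relative slot indices."""
    name: str
    fn: callable        # processing procedure of the operation
    in_slots: list      # relative indices of the input data
    out_slots: list     # relative indices of the output data


# A tiny graph: B reads what A wrote via relative index 1. No actual
# address appears anywhere in the declaration.
graph = [
    Op("A", lambda xs: [xs[0] + 1], in_slots=[0], out_slots=[1]),
    Op("B", lambda xs: [xs[0] * 2], in_slots=[1], out_slots=[2]),
]

# The largest relative index is bounded by the total number of operand
# slots (here 3), so it fits in 2 or 4 bytes instead of an 8-byte pointer.
NUM_SLOTS = 1 + max(s for op in graph for s in op.in_slots + op.out_slots)
```

Because the graph carries only these small slot numbers, copying it for another thread requires no fix-up of addresses.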
The creating device 102 is configured to create a plurality of first threads, and allocate a corresponding memory space to each first thread.
As an example, where the memory for the computation graph's operation data is uniformly managed by a memory management module, the creating device 102 may use that module to allocate a corresponding memory space for each first thread.
As an example, the creating device 102 may allocate a corresponding memory space for each first thread according to the total amount of operation data (i.e., the number of operation data) of at least one operation included in the computation graph.
The copying means 103 is used to copy the computation graph for each first thread.
The executing device 104 is configured to execute the at least one operation included in the computation graph according to the starting address of each first thread's memory space and the relative index; that is, according to each first thread's starting address together with the relative indices, declared in that thread's copy of the graph, of the operation data of the at least one operation.
As an example, the executing device 104 may determine the memory space of each operation's data from the starting address of each first thread's memory space and the relative index of that operation's data, and then execute each operation of the computation graph against the memory space so determined.
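As an illustrative sketch (hypothetical names and operation bodies, not the patent's actual implementation), executing a graph this way reduces to offsetting every relative index by the thread's starting address:

```python
# A minimal operation: (name, procedure, input relative indices, output
# relative index). The layout is invented for this example.
ops = [
    ("A", lambda xs: xs[0] + 1, [0], 1),   # reads slot 0, writes slot 1
    ("B", lambda xs: xs[0] * 2, [1], 2),   # reads slot 1, writes slot 2
]


def run_graph(ops, memory, start):
    """Execute operations; actual address = starting address + relative index."""
    for _name, fn, in_slots, out_slot in ops:
        inputs = [memory[start + i] for i in in_slots]
        memory[start + out_slot] = fn(inputs)


memory = [0] * 3
memory[0] = 5              # input data for operation A
run_graph(ops, memory, start=0)
# memory[2] == (5 + 1) * 2 == 12
```

A thread with a different starting address would run the very same `ops` list untouched, which is the point of declaring only relative indices.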
The difference between the prior-art relationship of computation graph to memory space and that of the present invention is described below with reference to figs. 2 to 4.
Fig. 2 is a diagram showing a relationship between a conventional computation graph and a memory space. Fig. 3 and 4 are diagrams showing the relationship between the computation graph and the memory space according to the present invention. In fig. 2 to 4, the upper half represents the actual index of the memory space, and the lower half represents the computation graph, where the blocks represent operations and the circles represent the actual index or relative index of the operation data in the memory space.
As shown in fig. 2, assume the total memory space has four segments to be operated on, with actual indexes 0 to 3; the output data of operation A is written to the memory space with actual index 1, and the input of operation B is read from that same space. When the computation graph is copied for multithreaded parallel execution, the output of operation A and the input of operation B in the copied graph still point to the memory space with actual index 1. To reassign separate memory space to the copied graph, the original correspondences would first have to be recovered, which is difficult to implement and highly complex.
As shown in figs. 3 and 4, an embodiment of the present invention declares in the computation graph that the relative index of the memory space of operation A's output data and operation B's input data is 1. When creating first thread 0 and first thread 1, the creating device 102 allocates the memory space with actual indexes 0 to 1 to first thread 0 and the memory space with actual indexes 2 to 3 to first thread 1. When the computation graph copied for first thread 0 is executed, the memory space of operation A's output and operation B's input resolves to actual index 1; when the graph copied for first thread 1 is executed, it resolves to actual index 3. The memory spaces of the data processed by the two threads' computation graphs are thereby separated, avoiding execution errors.
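The scenario of figs. 3 and 4 can be sketched as follows (a hypothetical Python illustration; slot sizes and the operation bodies are invented for the example): two first threads run copies of the same graph against disjoint regions of one memory array, so the same relative index 1 resolves to actual index 1 for thread 0 and actual index 3 for thread 1.

```python
import threading

SLOTS_PER_THREAD = 2
memory = [0] * 4                     # actual indexes 0..3


def op_a(start):                     # writes its output at relative index 1
    memory[start + 1] = memory[start + 0] + 10


def op_b(start):                     # reads its input at relative index 1
    memory[start + 1] = memory[start + 1] * 2


def run_copy(thread_id):
    start = thread_id * SLOTS_PER_THREAD   # this thread's starting address
    op_a(start)
    op_b(start)


memory[0], memory[2] = 1, 5          # each thread's own input data
threads = [threading.Thread(target=run_copy, args=(t,)) for t in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# thread 0 touched only actual indexes 0-1, thread 1 only 2-3: no conflict
# memory == [1, 22, 5, 30]
```

No re-declaration of the graph is needed for the second thread; only the starting address differs.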
Returning to fig. 1, as an example, the executing device 104 may use each first thread to determine the memory addresses of the operation data of the at least one operation included in its computation graph from the starting address of the corresponding memory space and the relative index, and use the plurality of first threads to execute the at least one operation of their respective computation graphs at the addresses so determined. In this example, that is, the operations of each computation graph are executed by the first thread itself.
In another embodiment, the efficiency of parallel processing itself is considered. In some scenarios, data parallelism is limited: for example, after each portion of data is processed, the threads must synchronize their results (such synchronization is common when training machine learning models). If the degree of parallelism is high, each thread receives only a small amount of data, and overheads such as waiting on threads and copying data grow, so efficiency actually drops. To address this, dedicated worker threads may be provided that receive the operations to be processed, together with the memory-location information of their data, and perform the corresponding processing. Concretely, all information related to an operation can be packed into one operation packet, so that worker threads can execute the packed operations without data-access conflicts. That is, a plurality of second threads (i.e., worker threads) may be created to perform the operations: the first thread owning the original computation graph is responsible for packing and handing the operation packet to a second thread, which executes the operation using the information in the packet. The advantage is that the parallelism of the computation graph is decoupled from the parallelism of the actual computation. For the computation graph of a typical neural network, many operations have no interdependencies and can be executed simultaneously, so the number of first threads can be smaller than the number of second threads; the second threads may even number several times the first threads, which both reduces overhead and increases computational parallelism.
Specifically, where the first threads are used to pack operation packets, the creating device 102 also creates a plurality of second threads for executing the operations of the computation graphs.
In this embodiment, the executing device 104 uses each first thread to generate, from the starting address of the corresponding memory space and the relative index, an operation packet for the at least one operation included in its computation graph, and uses the plurality of second threads to execute the operation packets created by the first threads.
In one form, the operation packet carries the starting address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of that operation's data in the memory space; when the operation in the packet is executed, the memory addresses of the data involved are determined from this information.
Alternatively, the operation packet carries the processing procedure of the corresponding operation and the operation's memory address, generated from the starting address of the corresponding memory space and the relative index of the operation's data in that space. In this case, the executing device 104 uses the first thread to generate that memory address, and when the operation in the packet is executed, the address contained in the packet can be used directly.
As an example, the execution unit 104 uses each first thread to generate, according to the start address and the relative index of the corresponding memory space, an operation packet for the operation to be executed next among the at least one operation included in the corresponding computation graph; puts the operation packets generated by the first threads into a buffer queue; uses the plurality of second threads to obtain operation packets from the buffer queue for execution, and to continue obtaining new operation packets from the buffer queue after each packet finishes executing; and uses the plurality of second threads to notify the corresponding first thread of the completion of an operation packet, so that the corresponding first thread can determine the operation to be executed next among the at least one operation included in the corresponding computation graph. That is, only after an operation completes are its successor operations packed and placed into the buffer queue for the second threads to obtain, so every operation packet a second thread obtains can be executed directly, without waiting for upstream dependent operations to finish.
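The first-thread/second-thread split described above can be sketched as a producer/consumer pattern: a first thread packs each ready operation (its processing procedure, its memory space, and the relative indices of its operation data) into a packet on a shared buffer queue, and worker (second) threads pull packets, execute them, and notify completion so the first thread can pack the next dependent operation. This is a minimal illustrative sketch, not the patented implementation; all names (`add`, `worker`, the tuple packet layout, the `Event`-based notification) are assumptions for illustration.

```python
import queue
import threading

buffer_queue = queue.Queue()
NUM_WORKERS = 4  # number of second (worker) threads

def worker():
    """Second thread: repeatedly take a packet and execute it."""
    while True:
        packet = buffer_queue.get()
        if packet is None:           # shutdown sentinel
            break
        fn, memory, indices, done = packet
        fn(memory, *indices)         # all needed info travels in the packet
        done.set()                   # notify the packing (first) thread
        buffer_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

def add(memory, out, a, b):
    # A processing procedure: operands are named by relative indices only.
    memory[out] = memory[a] + memory[b]

# A "first thread" role: pack one ready operation, wait for its completion
# notification, then pack the next operation that depended on it.
memory = [2, 3, 0, 0]                # this first thread's memory space
done = threading.Event()
buffer_queue.put((add, memory, (2, 0, 1), done))
done.wait()                          # slot 2 now holds 2 + 3
done2 = threading.Event()
buffer_queue.put((add, memory, (3, 2, 2), done2))  # depends on slot 2
done2.wait()

for _ in workers:
    buffer_queue.put(None)
for w in workers:
    w.join()
```

Because an operation is only enqueued after its dependencies have completed, the workers never need to check dependencies themselves, matching the "directly executable packets" property described above.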
FIG. 5 is a flow diagram illustrating a method of executing a computational graph using multiple threads in accordance with an embodiment of the present invention. Here, the method may be performed by a computer program, or may be performed by a hardware device or an aggregation of hardware and software resources dedicated to performing machine learning, big data computation, or data analysis, for example, by a machine learning platform for implementing a machine learning related business.
Referring to fig. 5, in step S501, a computation graph including at least one operation is acquired. The relative index, in a memory space, of the operation data of the at least one operation is declared in the computation graph.
As an example, the computation graph comprises at least one operation, which may also be referred to as an arithmetic operation. As an example, the at least one operation involves at least one arithmetic operation in a machine learning algorithm. The operation data of the operation may include input data and output data of the operation.
When a computation graph is declared, the memory space of the operation data of the computation graph is represented by relative indices rather than by direct indices (i.e., actual addresses). As an example, in a computer, the actual index of a memory space may be represented by 8 bytes of data, whereas the memory space of each piece of operation data of the computation graph according to an embodiment of the present invention is represented by a relative index, whose maximum value may be set to the total amount of operation data (i.e., the number of pieces of operation data) of the at least one operation included in the computation graph. This total is generally small and can be represented by 4 or even 2 bytes of data, so the memory occupied by the computation graph can be reduced.
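The space saving and the address recovery can be illustrated numerically. The sketch below assumes 8-byte actual addresses and 2-byte relative indices, and a hypothetical `resolve` helper that turns a relative index back into an actual address at execution time; the slot size and base address are invented for illustration.

```python
import struct

# A graph with five pieces of operation data needs indices 0..4 only.
num_operands = 5

# Bytes needed to store all operand references in the graph:
absolute_size = num_operands * struct.calcsize("Q")  # "Q": 8-byte unsigned
relative_size = num_operands * struct.calcsize("H")  # "H": 2-byte unsigned

def resolve(base_address, relative_index, slot_size):
    """Actual address = start address of the thread's memory space + offset.

    The graph itself stores only `relative_index`; `base_address` belongs
    to whichever first thread is executing its copy of the graph.
    """
    return base_address + relative_index * slot_size

# Example: operand 3 of a thread whose memory space starts at this base.
addr = resolve(0x7F0000000000, 3, 8)
```

The same relative index resolves to different actual addresses under different base addresses, which is what lets the graph be copied per thread without modification.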
In step S502, a plurality of first threads are created, and a corresponding memory space is allocated for each first thread.
As an example, in a case where the memories of the operation data of the computation graph are uniformly hosted by the memory management module, the memory management module may be utilized to allocate a corresponding memory space for each first thread.
As an example, each first thread may be allocated a corresponding memory space according to the total amount of operation data (i.e., the number of operation data) of at least one operation included in the computation graph.
The computation graph is copied for each first thread in step S503.
In step S504, at least one operation included in the computation graph is executed according to the start address and the relative index of the memory space of each of the first threads. That is, at least one operation included in the computation graph is executed according to the starting address of the memory space of each first thread and the relative index of the operation data of the at least one operation included in the computation graph corresponding to the starting address.
As an example, in step S504, the memory space of each operation included in the computation graph may be determined according to the starting address of the memory space of each first thread and the relative index of the operation data of each operation included in the computation graph, and each operation included in the computation graph may be executed according to the memory space of each operation included in the computation graph.
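Steps S502 to S504 can be sketched end to end: each first thread receives its own memory space sized by the total amount of operation data, the same graph (containing only relative indices) is used by every thread, and each thread executes the operations against its own space. This is a minimal sketch under assumed names; the tuple-based graph encoding and the `add`/`mul` procedures are illustrative, not from the patent.

```python
import threading

# The computation graph: each operation names its operands purely by
# relative index, so the graph contains no thread-specific addresses.
graph = [
    # (output_index, procedure_name, input_indices)
    (2, "add", (0, 1)),
    (3, "mul", (2, 0)),
]
num_slots = 4  # total amount of operation data of the graph

def run_graph(memory):
    """Execute the graph against one thread's own memory space."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for out, name, ins in graph:
        memory[out] = ops[name](*(memory[i] for i in ins))
    return memory

results = {}

def first_thread(tid, inputs):
    memory = [0] * num_slots       # this thread's allocated memory space
    memory[0], memory[1] = inputs  # slots 0 and 1 hold this thread's data
    results[tid] = run_graph(memory)

threads = [
    threading.Thread(target=first_thread, args=(i, (i + 1, 10)))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread runs the identical graph yet produces independent results, since a relative index is only resolved against that thread's own memory space.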
The difference between the prior art and the present invention regarding the relationship between the computation graph and the memory space is as described above and is not repeated here.
As an example, in step S504, each first thread may be utilized to determine a memory address of operation data of at least one operation included in the corresponding computation graph according to the start address of the corresponding memory space and the relative index; and executing at least one operation included in the respective computation graph according to the respective determined memory address by using the plurality of first threads. That is, in this example, the operations of the respective computation graphs are performed by the first thread itself.
As another example, the method of executing a computation graph using multiple threads according to an embodiment of the present invention further includes the step of creating a plurality of second threads. In this embodiment, in step S504, each first thread is used to generate an operation packet for the at least one operation included in the corresponding computation graph according to the start address and the relative index of the corresponding memory space, and the plurality of second threads are used to execute the operation packets created by the plurality of first threads.
The operation packet carries the start address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of the operation data of the corresponding operation in the memory space.
Alternatively, the operation packet carries the processing procedure of the corresponding operation and the memory address of the corresponding operation, generated based on the start address of the corresponding memory space and the relative index of the operation data of the corresponding operation in the memory space. In this case, in step S504, the memory address of the corresponding operation is generated by the first thread based on the start address of the corresponding memory space and the relative index of the operation data of the corresponding operation in the memory space.
An example of the flowchart of step S504 will be described below with reference to fig. 6.
In step S601, each first thread is used to generate an operation packet of a next operation to be executed among at least one operation included in the corresponding computation graph according to the start address and the relative index of the corresponding memory space.
In step S602, the operation packets generated by the respective first threads are placed in a buffer queue.
In step S603, the plurality of second threads obtain operation packets from the buffer queue for execution, and continue obtaining new operation packets from the buffer queue after each packet finishes executing. That is, only after an operation completes are its successor operations packed and placed into the buffer queue for the second threads to obtain, so every operation packet a second thread obtains can be executed directly, without waiting for upstream dependent operations to finish.
In step S604, the execution completion of the operation packet is notified to the corresponding first thread by using the plurality of second threads, so that the corresponding first thread determines an operation to be executed next among at least one operation included in the corresponding computation graph.
In the method and system for executing a computation graph using multiple threads according to the embodiments of the present invention, the relative index of the operation data of each operation in the memory space is declared in the computation graph, so that the computation graph can be directly copied in a distributed environment to achieve data-parallel acceleration. At the same time, the declaration in terms of relative indices involves no distributed logic, so the user does not need to consider how to handle multithreading, which reduces the cost of use and the difficulty of development. In addition, the method and system according to the embodiments of the present invention can support the processing of streaming data, and can balance development cost and execution efficiency even when the amount of data to be processed is very large.
The method and system for executing a computation graph using multiple threads according to exemplary embodiments of the present invention have been described above with reference to figs. 1 to 6. However, it should be understood that the devices, systems, units, etc. referred to in figs. 1 to 6 may each be configured as software, hardware, firmware, or any combination thereof that performs the specified function. For example, these systems, devices, units, etc. may correspond to an application-specific integrated circuit, to pure software code, or to a module combining software and hardware. Further, one or more functions implemented by these systems, devices, or units may also be uniformly executed by components in a physical entity device (e.g., a processor, a client, or a server).
Furthermore, the above method may be implemented by instructions recorded on a computer-readable medium, for example, according to an exemplary embodiment of the present application, there may be provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the steps of: acquiring a computation graph comprising at least one operation, wherein a relative index of operation data of the at least one operation in a memory space is declared in the computation graph; creating a plurality of first threads and allocating a corresponding memory space for each first thread; copying the computational graph for each first thread; executing the at least one operation included in the computation graph according to the starting address of the memory space of each of the plurality of first threads and the relative index.
The computer program in the computer-readable medium may be executed in an environment deployed on a computing device such as a client, a host, a proxy device, or a server. It should be noted that the computer program may also be used to perform additional steps beyond those described above; the content of these additional steps and further processing is mentioned in the description of the related methods with reference to figs. 5 and 6 and is therefore not repeated here.
It should be noted that the method and system for executing a computation graph using multiple threads according to an exemplary embodiment of the present invention may completely depend on the execution of a computer program to realize corresponding functions, that is, each unit or device corresponds to each step in the functional architecture of the computer program, so that the whole device or system is called by a special software package (e.g., lib library) to realize the corresponding functions.
On the other hand, when each of the units or devices mentioned in fig. 1 to 6 is implemented in software, firmware, middleware or microcode, program codes or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium so that a processor may perform the corresponding operations by reading and executing the corresponding program codes or code segments.
On the other hand, each means included in the system for executing a computation graph using multiple threads according to an exemplary embodiment of the present invention may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, a system implementing a computation graph with multiple threads according to an exemplary embodiment of the present invention may comprise at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the steps of: acquiring a computation graph comprising at least one operation, wherein a relative index of operation data of the at least one operation in a memory space is declared in the computation graph; creating a plurality of first threads and allocating a corresponding memory space for each first thread; copying the computational graph for each first thread; executing the at least one operation included in the computation graph according to the starting address of the memory space of each of the plurality of first threads and the relative index.
Specifically, the system may be deployed in a server or on a node apparatus in a distributed network environment. Additionally, the system may include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the system may be connected to each other via a bus and/or a network.
The system need not be a single apparatus; it can be any collection of devices or circuits capable of executing the above instructions (or instruction sets), individually or jointly. The system may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote system (e.g., via wireless transmission).
In the system, the computing device for performing the method of executing a computation graph using multiple threads according to an exemplary embodiment of the present invention may be a processor, and such a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like. The processor may execute instructions or code stored in one of the storage devices, which may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The storage device may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage device may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage device and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage device.
While exemplary embodiments of the present application have been described above, it should be understood that the above description is exemplary only, and not exhaustive, and that the present application is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present application. Therefore, the protection scope of the present application shall be subject to the scope of the claims.

Claims (10)

1. A method for executing a computational graph using multiple threads, comprising:
acquiring a computation graph comprising at least one operation, wherein a relative index of operation data of the at least one operation in a memory space is declared in the computation graph;
creating a plurality of first threads and allocating a corresponding memory space for each first thread;
copying the computational graph for each first thread;
executing the at least one operation included in the computation graph according to a starting address of a memory space of each of the plurality of first threads and the relative index.
2. The method of claim 1, wherein the step of performing the at least one operation included in the computation graph according to the starting address of the memory space of each of the plurality of first threads and the relative index comprises:
determining, by each first thread, a memory address of operation data of the at least one operation included in the corresponding computation graph according to the start address of the corresponding memory space and the relative index;
executing, by the first threads, the at least one operation included in the respective computation graph according to the respective determined memory address.
3. The method of claim 1, further comprising: creating a plurality of second threads,
wherein the step of executing the at least one operation included in the computation graph according to the starting address of the memory space of each of the plurality of first threads and the relative index includes:
generating an operation packet of the at least one operation included in the corresponding computation graph according to the starting address of the corresponding memory space and the relative index by using each first thread; and
executing the operation packet created by the plurality of first threads using the plurality of second threads,
wherein the operation packet has the start address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of the operation data of the corresponding operation in the memory space; or the operation packet has the processing procedure of the corresponding operation and the memory address of the corresponding operation generated based on the start address of the corresponding memory space and the relative index of the operation data of the corresponding operation in the memory space.
4. The method of claim 1, further comprising: creating a plurality of second threads,
wherein the step of executing the at least one operation included in the computation graph according to the starting address of the memory space of each of the plurality of first threads and the relative index includes:
generating an operation packet of operations to be executed next among the at least one operation included in the corresponding computation graph according to the start address of the corresponding memory space and the relative index by using each first thread;
putting the operation packets generated by the first threads into a buffer queue;
acquiring the operation packets from the buffer queue by using the plurality of second threads for execution, and continuously acquiring new operation packets from the buffer queue after the execution of the operation packets is completed; and
informing, by the plurality of second threads, a corresponding first thread of completion of execution of the operation package, so that the corresponding first thread determines an operation to be executed next among the at least one operation included in the corresponding computation graph,
wherein the operation packet has the start address of the corresponding memory space, the processing procedure of the corresponding operation, and the relative index of the operation data of the corresponding operation in the memory space; or the operation packet has the processing procedure of the corresponding operation and the memory address of the corresponding operation generated based on the start address of the corresponding memory space and the relative index of the operation data of the corresponding operation in the memory space.
5. The method of claim 1, wherein the at least one operation involves at least one arithmetic operation in a machine learning algorithm.
6. The method of claim 3 or 4, wherein the number of second threads is greater than the number of first threads.
7. The method of claim 1, wherein allocating a respective memory space for each first thread comprises: and allocating a corresponding memory space for each first thread according to the total amount of the operation data of the at least one operation.
8. A system for executing a computational graph using multiple threads, comprising:
computation graph obtaining means for obtaining a computation graph including at least one operation, where a relative index of operation data of the at least one operation in a memory space is declared in the computation graph;
the device comprises a creating device, a memory management device and a processing device, wherein the creating device is used for creating a plurality of first threads and allocating corresponding memory space for each first thread;
copying means for copying the computation graph for each first thread;
an executing device, configured to execute the at least one operation included in the computation graph according to a starting address of a memory space of each of the plurality of first threads and the relative index.
9. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform a method of executing a computational graph using multiple threads as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method of executing a computational graph using multiple threads as claimed in any of claims 1 to 7.
CN202211184997.1A 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads Pending CN115543566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211184997.1A CN115543566A (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811037341.0A CN110879744B (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads
CN202211184997.1A CN115543566A (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201811037341.0A Division CN110879744B (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads

Publications (1)

Publication Number Publication Date
CN115543566A true CN115543566A (en) 2022-12-30

Family

ID=69727013

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201811037341.0A Active CN110879744B (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads
CN202211184997.1A Pending CN115543566A (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201811037341.0A Active CN110879744B (en) 2018-09-06 2018-09-06 Method and system for executing computation graph by multiple threads

Country Status (1)

Country Link
CN (2) CN110879744B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954986B2 (en) * 2010-12-17 2015-02-10 Intel Corporation Systems and methods for data-parallel processing
CN103488684B (en) * 2013-08-23 2016-12-28 国家电网公司 Electric reliability index quick calculation method based on data cached multiple threads
CN108292241B (en) * 2015-10-28 2022-05-24 谷歌有限责任公司 Processing a computation graph
CN108279943B (en) * 2017-01-05 2020-09-11 腾讯科技(深圳)有限公司 Index loading method and device
CN107609350B (en) * 2017-09-08 2020-04-03 厦门极元科技有限公司 Data processing method of second-generation sequencing data analysis platform
CN108008975A (en) * 2017-12-22 2018-05-08 郑州云海信息技术有限公司 A kind of processing method and processing device of the view data based on KNL platforms

Also Published As

Publication number Publication date
CN110879744B (en) 2022-08-16
CN110879744A (en) 2020-03-13

Similar Documents

Publication Publication Date Title
US11010681B2 (en) Distributed computing system, and data transmission method and apparatus in distributed computing system
CN109669772B (en) Parallel execution method and equipment of computational graph
TWI525540B (en) Mapping processing logic having data-parallel threads across processors
EP3126971B1 (en) Program execution on heterogeneous platform
US9639374B2 (en) System and method thereof to optimize boot time of computers having multiple CPU's
US20170109415A1 (en) Platform and software framework for data intensive applications in the cloud
US20210158131A1 (en) Hierarchical partitioning of operators
US11733983B2 (en) Method and apparatus for generating metadata by a compiler
US8681166B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
US11900601B2 (en) Loading deep learning network models for processing medical images
US20230004365A1 (en) Multistage compiler architecture
Wozniak et al. MPI jobs within MPI jobs: A practical way of enabling task-level fault-tolerance in HPC workflows
US11954419B2 (en) Dynamic allocation of computing resources for electronic design automation operations
US10496433B2 (en) Modification of context saving functions
US8041551B1 (en) Algorithm and architecture for multi-argument associative operations that minimizes the number of components using a latency of the components
CN114008589A (en) Dynamic code loading for multiple executions on a sequential processor
CN110879744B (en) Method and system for executing computation graph by multiple threads
US20230244966A1 (en) Machine learning deployment platform
WO2019118338A1 (en) Systems and methods for mapping software applications interdependencies
JP2007080049A (en) Built-in program generation method, built-in program development system and information table section
US20230205500A1 (en) Computation architecture synthesis
US11748077B2 (en) Apparatus and method and computer program product for compiling code adapted for secondary offloads in graphics processing unit
CN113704687A (en) Tensor calculation operation method and device and operation system
CN114764331A (en) Code generation method and device, electronic equipment and computer readable storage medium
CN117992149A (en) Offload computation based on extended instruction set architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination