CN110347450B - Multi-stream parallel control system and method thereof - Google Patents

Multi-stream parallel control system and method thereof

Info

Publication number
CN110347450B
Authority
CN
China
Prior art keywords
callback
task
component
flow
event
Prior art date
Legal status
Active
Application number
CN201910633636.2A
Other languages
Chinese (zh)
Other versions
CN110347450A (en)
Inventor
袁进辉 (Yuan Jinhui)
牛冲 (Niu Chong)
Current Assignee
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd filed Critical Beijing Oneflow Technology Co Ltd
Priority to CN201910633636.2A priority Critical patent/CN110347450B/en
Publication of CN110347450A publication Critical patent/CN110347450A/en
Application granted granted Critical
Publication of CN110347450B publication Critical patent/CN110347450B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms

Abstract

The present disclosure relates to a multi-stream parallel control system, comprising: a host thread component comprising a first executor and a second executor, the first executor inserting a computing task into a specified task stream among a plurality of task streams, and the second executor inserting, after each computing task, the event contained in a stream callback structure; a task stream component comprising a task executor and an event executor, the task executor executing the tasks inserted by the first executor and the event executor executing the events contained in the inserted stream callback structures; and a thread callback component configured for each task stream component and comprising a stream callback structure executor, which executes the stream callback structure and sends out a message that the event has completed once the event executor finishes executing.

Description

Multi-stream parallel control system and method thereof
Technical Field
The present invention relates to a control system and a control method for multi-stream parallel processing in a data processing network, and more particularly to a parallel control system and control method that achieve multi-stream parallelism over a CUDA interface.
Background
With the rise of big data computing and deep learning, coprocessors such as the GPU (Graphics Processing Unit) and APU are commonly used to offload data processing from the CPU. The GPU has a highly parallel architecture, which makes it more efficient than the CPU at processing graphics data and complex algorithms. Whereas a CPU executing a computing task processes only one datum at a time, with no true parallelism, a GPU has many processor cores and can process many data in parallel at the same moment. Compared with a CPU, a GPU devotes more of its hardware to ALUs (Arithmetic Logic Units) for data processing rather than to data caching and flow control. This structure is well suited to large volumes of data that are uniform in type and mutually independent, in a clean computing environment that need not be interrupted. To carry out such large numbers of similar, simple operations without occupying CPU resources, multiple GPUs are connected to one or more CPUs for parallel data processing, yielding large quantities of results at high speed.
Currently, NVIDIA GPUs are mostly adopted to perform this kind of parallel, simple data processing. In use, however, the GPU interface has shortcomings in controlling stream parallelism, so that streams which should run in parallel end up running serially. Fig. 1 is a schematic diagram of stream serialization occurring in a conventional GPU interface. The left side of Fig. 1 shows streams executing in parallel on the GPUs, while the right side shows streams becoming serial with respect to one another during execution. In actual operation, the serialization between streams is not limited to the pattern shown on the right side of Fig. 1. This is evidently a problem caused by the CPU's control over the streams of the respective GPUs, and it degrades the data processing speed of each GPU. It is therefore desirable to eliminate this problem and provide a stable stream parallel control system.
In addition, when the conventional GPU interface reaches a callback point while executing the tasks in a stream, it hands control back to the CPU, and control of the GPU stream is returned from the CPU to the GPU only after the CPU has finished running the callback function. The GPU device is therefore left waiting while the CPU executes the callback function. Although each wait is short, it still wastes GPU processing time and lowers the GPU's data processing efficiency. It is desirable to eliminate this inefficiency as well.
Disclosure of Invention
To solve the above problems, the present disclosure provides a multi-stream parallel control system comprising a host thread component, a plurality of task stream components, and thread callback components whose number corresponds to the number of task stream components. The host thread component comprises a first executor and a second executor; the first executor inserts a computing task into a specified task stream among the plurality of task streams, and the second executor, after each computing task has been inserted, inserts a stream callback structure containing an event and a callback function into the specified task stream and simultaneously inserts the same stream callback structure into the thread callback component. Each task stream component comprises a task executor and an event executor; the task executor executes the tasks inserted by the first executor, and the event executor executes the events contained in the inserted stream callback structures. Each thread callback component is configured for one task stream component and comprises a callback structure executor into which the stream callback structure is inserted; the callback structure executor executes the callback function contained in the stream callback structure once the event executor in the task stream component has finished executing the event.
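By way of illustration only, the stream callback structure described above could be sketched in CUDA C++ roughly as follows. The type and helper names (StreamCallbackStruct, initStreamCallbackStruct) are assumptions made for this sketch; the disclosure specifies only the contents, namely a CUDA event bundled with the callback function to run once that event completes.

```cuda
// Hypothetical sketch of the stream callback structure ("CB EVENT").
// Only the contents (event + callback) come from the disclosure; the
// names and layout here are illustrative assumptions.
#include <functional>
#include <cuda_runtime.h>

struct StreamCallbackStruct {
    cudaEvent_t event;               // recorded in the task stream behind a task
    std::function<void()> callback;  // later run by the thread callback component
};

inline StreamCallbackStruct initStreamCallbackStruct(std::function<void()> cb) {
    StreamCallbackStruct s;
    // Timing is not needed here, so create a lightweight event.
    cudaEventCreateWithFlags(&s.event, cudaEventDisableTiming);
    s.callback = std::move(cb);
    return s;
}
```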
According to the stream parallel control system of the present disclosure, the event executor modifies an event result flag when the event finishes executing; when the callback structure executor of the thread callback component learns through a thread channel that the event result flag has been modified, it executes the callback function contained in the stream callback structure and sends a message to the host thread component that the event has completed.
According to another aspect of the present disclosure, there is provided a multi-stream parallel control method, comprising: asynchronously inserting computing tasks into each task stream; initializing a stream callback structure, the stream callback structure containing an initialized event and a callback function; asynchronously inserting the initialized stream callback structure behind each computing task in the task stream; inserting the initialized stream callback structure into a callback thread of the thread callback component; executing, by the thread callback component, the callback function in the stream callback structure upon receiving a message that the event in the task stream has completed; and repeating the above steps.
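To make the sequence of method steps concrete, the following self-contained CUDA C++ sketch plays through them under stated assumptions: the "thread channel" is modelled as a small mutex-protected queue, the structure sketched earlier is repeated so the example compiles on its own, and all names (CbEvent, CbQueue, scaleKernel) are illustrative rather than taken from the disclosure.

```cuda
// Minimal end-to-end sketch of the method: insert a task, record an event
// behind it, hand the CB EVENT to a dedicated callback thread, which runs
// the callback once the event completes.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <cuda_runtime.h>

struct CbEvent {                       // the stream callback structure
    cudaEvent_t event;
    std::function<void()> callback;
};

class CbQueue {                        // hypothetical stand-in for the thread channel
public:
    void push(CbEvent e) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(e)); }
        cv_.notify_one();
    }
    CbEvent pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        CbEvent e = std::move(q_.front());
        q_.pop();
        return e;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<CbEvent> q_;
};

__global__ void scaleKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, numTasks = 4;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    CbQueue channel;

    // Thread callback component: waits on each event, then runs its callback.
    std::thread callbackThread([&] {
        for (int k = 0; k < numTasks; ++k) {
            CbEvent e = channel.pop();
            cudaEventSynchronize(e.event); // event done => preceding task done
            e.callback();                  // runs off the stream, off the host loop
            cudaEventDestroy(e.event);
        }
    });

    // Host thread component: first insert the task, then the CB EVENT behind it.
    for (int k = 0; k < numTasks; ++k) {
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n); // insert task
        CbEvent e;
        cudaEventCreateWithFlags(&e.event, cudaEventDisableTiming);
        e.callback = [k] { std::printf("task %d confirmed done\n", k); };
        cudaEventRecord(e.event, stream);                       // event behind task
        channel.push(std::move(e));                             // into thread channel
    }

    callbackThread.join();
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```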
According to the stream parallel control method of the present disclosure, the initial value of the event is modified after the event has executed, so that the thread callback component, learning of this modification through a thread channel, obtains the message that the event has completed.
With the parallel control system and method described above, a thread callback component is provided separately for each task stream component, so the execution of the individual task streams is completely decoupled; the streams do not affect one another, which eliminates the possibility of stream serialization. Moreover, because the thread callback component executes the callback function only once the event following each task in the task stream has completed, the time during which the GPU device is controlled by the host thread component during data processing is reduced; in particular, the GPU device no longer waits while the host thread component confirms whether a task has finished, so GPU efficiency is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure will now be described in detail by way of example with reference to the accompanying drawings, in which:
FIG. 1 is a diagram showing the result of conventional stream parallel control;
FIG. 2 is a schematic diagram illustrating a stream parallel control system according to the present disclosure; and
FIG. 3 is a flow chart illustrating a stream parallel control method according to the present disclosure.
Detailed Description
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, so that those skilled in the art can practice it by following this description.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this disclosure to describe various pieces of information, the information should not be limited by these terms, which serve only to distinguish information of the same kind. For example, without departing from the scope of the present disclosure, one of two possible devices may be referred to below as the first executor or the second executor, and similarly the other may be referred to as the second executor or the first executor. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In order that those skilled in the art will better understand the present disclosure, the present disclosure will be described in further detail below with reference to the accompanying drawings and detailed description.
As shown in Fig. 1, existing deep learning computing systems mostly adopt GPUs to handle large numbers of simple, repetitive computing tasks, with multiple GPUs processing in parallel. With the interface components provided by conventional NVIDIA hardware, the task streams executed by the multiple GPU devices shown on the right side of Fig. 1 may fail to run in parallel; in particular, several task streams may become serial with respect to one another. Although the reason for this serialization is not clear, the inventors have made various improvements in order to eliminate the situation shown on the right side of Fig. 1.
Fig. 2 is a schematic diagram of a multi-stream parallel control system according to the present disclosure. As shown in Fig. 2, the host thread component 10 includes any number of executors 11, 12, 13, …, 1N for performing various operations. The first executor 11 inserts tasks allocated for GPU execution into the task stream of the GPU according to predetermined instructions. Specifically, the first executor 11 inserts a task into the task stream controlled by the task stream component 30, and the task stream component 30 directs the task executor 31 to execute the inserted task. The second executor 12 then inserts the event of an initialized stream callback structure into the task stream controlled by the task stream component 30, placing it behind the task inserted by the first executor 11. Following this insertion order, the task stream controlled by the task stream component 30 causes the GPU that actually executes it to proceed in sequence, task before event.
At the same time, the second executor 12 inserts the initialized stream callback structure into the callback executor 21 in the thread callback component 20. Although only one callback executor 21 is shown here, the thread callback component 20 includes a number of callback executors corresponding to the number of task executors.
After the GPU is assigned a predetermined computing task, the executors it contains operate on the task according to a predetermined flow, and the task stream component controls each task and event on the GPU device in sequence. Specifically, an event operation is performed immediately after each task in the stream component completes, and once that event operation finishes, the event executor 32 modifies the initial value of the corresponding event. The corresponding callback executor 21 in the thread callback component 20 learns of the change to the initial value through a thread channel.
The thread callback component 20 and the task stream component 30 communicate through a thread channel. When the event operation finishes, the event executor 32 sends a completion message to the callback executor 21 through the thread channel; this message informs the callback executor 21 that the initial value of the event has been changed by the event executor 32. The callback executor 21 may be a state machine: upon learning that the initial value of the event has changed, it begins executing the callback function in the stream callback structure, and after the callback function has run it sends a completion message to the host thread component 10, so that the other executors in the host thread component 10 can carry out subsequent operations in order.
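A minimal sketch of this state-machine behaviour follows, assuming the worker learns of completion by polling the event rather than blocking on it (either fits the description above). cudaEventQuery returns cudaErrorNotReady until all work recorded before the event has finished, and cudaSuccess afterwards.

```cuda
// Hypothetical callback-executor core: spin in the "waiting" state until
// the event fires, then move to the "executing" state and run the callback.
#include <functional>
#include <thread>
#include <cuda_runtime.h>

void runCallbackWhenEventFires(cudaEvent_t event,
                               const std::function<void()>& callback) {
    while (cudaEventQuery(event) == cudaErrorNotReady) {
        std::this_thread::yield();   // waiting state: event not yet complete
    }
    callback();                      // executing state: event has fired
}
```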
Because the stream parallel control system according to the present disclosure allocates one thread callback component 20 to each task stream component 30, the host thread component 10 need not execute a callback function after each task in each task stream completes, as in the existing CUDA interface; instead, the thread callback component 20 dedicated to each task stream executes the callback function and thereby obtains the message that the task has completed. Employing the thread callback component 20 removes the need to execute callback functions in the host thread component 10. Furthermore, since each task stream component 30 has its own thread callback component 20 to execute the callback functions in the stream callback structures, stream parallel control never crosses between parallel task streams; the task streams are independent of one another, and serialization among multiple task streams is eliminated.
Furthermore, in the conventional CUDA interface, executing a task in a task stream involves one data copy from the host to the GPU device, one kernel launch, one data copy from the GPU device back to the host, and finally the addition of a callback function. When the executor on the GPU device reaches the callback point, the GPU hands control back to the host; only after the host finishes running the callback does it return control to the task stream component governing the GPU device, and only then can the executors on the GPU device proceed to the next task. The GPU is therefore idle while the conventional host thread component 10 executes the callback function, which wastes GPU processing time and reduces the GPU's data processing efficiency. By instead having the callback executor 21 of the thread callback component 20 execute the callback function after the event occurs, the time during which the host takes over control of the GPU device is reduced. Moreover, executing an event in the task stream component 30 costs far less than executing a callback function: although event execution also takes some time, it is much shorter than the time the traditional host thread component 10 spends running the callback. By providing the thread callback component 20 and inserting the events of the stream callback structures into the task stream component, the interval between one task and the next in the task stream component 30 is markedly reduced (the event execution time is negligible), essentially achieving seamless succession of tasks in the task stream component 30, noticeably smoothing task execution on the GPU device, and improving the GPU's data processing efficiency.
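For contrast, the conventional per-task sequence described above can be sketched with the stock CUDA runtime API as follows; doubleKernel and onDone are illustrative names. Work queued after cudaStreamAddCallback cannot begin until the host callback has returned, which is exactly the idle window the disclosure removes.

```cuda
// Conventional pattern: copy in, launch kernel, copy out, then a host
// callback that stalls the stream while it runs on the CPU.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void CUDART_CB onDone(cudaStream_t, cudaError_t status, void* userData) {
    std::printf("task %ld done, status=%d\n",
                (long)(intptr_t)userData, (int)status);
}

void enqueueConventionalTask(cudaStream_t s, float* dBuf, float* hBuf,
                             int n, long taskId) {
    cudaMemcpyAsync(dBuf, hBuf, n * sizeof(float), cudaMemcpyHostToDevice, s);
    doubleKernel<<<(n + 255) / 256, 256, 0, s>>>(dBuf, n);
    cudaMemcpyAsync(hBuf, dBuf, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    // The stream sits idle from here until onDone returns on the host.
    cudaStreamAddCallback(s, onDone, (void*)(intptr_t)taskId, 0);
}
```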
Although the system of the present disclosure has been described above with reference to Fig. 2, it should be noted that the executors 11 and 12 may continuously alternate in inserting tasks and stream callback structures into the task stream component 30. Likewise, whenever the executor 12 inserts a stream callback structure (CB EVENT) into the task stream component 30, it simultaneously inserts the same structure into the corresponding callback thread executor in the thread callback component 20.
Fig. 3 is a timing diagram illustrating the stream parallel control method according to the present disclosure. As shown in Fig. 3, at step S31 the first executor 11 in the host thread component 10 asynchronously inserts the tasks to be executed by each task stream, in order, into the corresponding task stream. After the first executor 11 inserts each task, the second executor 12, at step S32, inserts into the task stream a stream callback structure wrapping a CUDA event (CUDAEVENT); this structure is referred to in this disclosure as a CALLBACK EVENT, or simply "CB EVENT". While performing step S32, the second executor 12 also, at step S33, inserts the stream callback structure into the callback thread of the thread callback component 20, where it is operated on by the callback executor 21 (also referred to as the "callback thread executor"). Steps S31, S32, and S33 are repeated as many times as there are tasks to execute. After the tasks and stream callback structures have been inserted, the task stream component 30 executes the tasks and the events (EVENT) contained in the stream callback structures in sequence. That is, at step S34, after a task in a task stream has executed and the event contained in the following stream callback structure has executed, the initial value of the event is modified, so that the callback executor 21 in the thread callback component 20 learns of the modification. Subsequently, at step S35, upon learning of the modification of the initial value of the corresponding event, the callback executor 21 executes the callback function (CALLBACK FUNCTION) in the stream callback structure and sends a message to the host thread component 10.
The callback function must be executed because tasks are inserted into the task stream and executed asynchronously, so the completion of task execution needs to be confirmed to the host thread. Executing the callback function after each task completes therefore allows the host thread component 10 to learn the execution status of the GPU in time and to carry out, as required, operations unrelated to the GPU.
For example, in a computing system employing the multi-stream parallel control system of the present disclosure, a single host may be fitted with four GPU devices; alternatively there may be two, three, six, or twenty. Four GPU devices are taken as the example here. In the control system, each GPU device corresponds to one task stream component 30, so the four GPU devices correspond to four task stream components 30-0, 30-1, 30-2, and 30-3. Precisely because these four task stream components are distributed across the four GPU devices, they have no relationship with one another whatsoever. The tasks inserted into the four task streams occupy the resources of their own GPU devices on the four cards, so the executors on the GPU devices execute their tasks concurrently without blocking. In the conventional CUDA interface, by contrast, the callback function interface (CALLBACK FUNCTION) is inserted after the corresponding task in each task stream, so when the host thread component executes the callbacks, the task streams cross one another. Invoking the callback function interface at certain moments then causes the task execution of the streams to become serial, producing the stream serialization described above.
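The per-device configuration described above might be set up roughly as follows, assuming one stream and one dedicated callback thread per GPU; the callback worker body is elided, since it would simply run the event-waiting loop sketched earlier.

```cuda
// Hypothetical setup: one task stream and one callback thread per device,
// so no callback state is ever shared between streams.
#include <thread>
#include <vector>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);          // e.g. 4 in the example above
    std::vector<cudaStream_t> streams(numDevices);
    std::vector<std::thread> callbackThreads;

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);
        cudaStreamCreate(&streams[dev]);      // one task stream component per card
        callbackThreads.emplace_back([dev] {
            cudaSetDevice(dev);               // bind the worker to its device
            // ... drain this device's CB EVENT queue here (elided) ...
        });
    }
    for (auto& t : callbackThreads) t.join();
    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);
        cudaStreamDestroy(streams[dev]);
    }
    return 0;
}
```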
The callback function interfaces of the present disclosure, in contrast, are placed separately in the thread callback component 20, changing the logic by which the stream parallel control system manages callback functions: each task stream gets its own callback management. The details have been presented above with reference to Figs. 2 and 3. The thread callback component 20 obtains, through the thread channel, the result message of the EVENT in the CB EVENT executed by an executor in the task stream component 30, and thereupon the callback executor 21 of the thread callback component 20 begins executing the callback function for that event. If the thread callback component 20 has not obtained a CB EVENT result message from an executor in the task stream component 30, it performs no operation.
In other words, to eliminate the serialization problem among multiple streams, the present disclosure builds a stream callback structure containing an EVENT on top of the event mechanism: an infinite polling loop is wrapped around the EVENT interface, yielding a new CBEVENT interface that replaces the existing CUDA callback interface and is executed by the thread callback component 20, thereby achieving the stream parallelism control of the present disclosure.
Moreover, in a computing system adopting the stream parallel control system of the present disclosure, tasks and events are inserted alternately into the stream component of the GPU device, and the cost of invoking an event is very small. Specifically, the callback (CALLBACK) is not executed on the stream component as in the conventional CUDA interface; it is executed on the callback executor of the thread callback component 20. In the task stream component 30, an event (EVENT) is executed directly after each task, and event execution consumes little. After the event executes, the next task is executed immediately, without waiting for an executor in the host thread component to run the callback function: the task stream component of the present disclosure executes only the event, and need not, as in the existing CUDA interface, execute the callback, send a message to the host thread component 10 to run the callback function, and wait for the host thread component 10 to finish running that external callback before task completion can be confirmed. In the present disclosure, the precondition for the thread callback component 20 to execute a callback function is that the EVENT has already executed, and since the EVENT executes after its task, the very act of the thread callback component 20 executing the callback function indicates that the task has completed; that execution thus itself serves as the confirmation that the task is done. And since the task stream component spends only a very short time executing the EVENT, far shorter than the waiting time the existing CUDA interface incurs while confirming task completion, the data processing efficiency of the GPU device is greatly improved.
This specification has thus described a parallel control system and method according to an embodiment of the present disclosure. Because a thread callback component 20 is provided separately for each task stream component 30, the execution of the individual task streams is completely decoupled and the streams do not affect one another, eliminating the possibility of stream serialization. Furthermore, because the thread callback component 20 executes the callback function only once the event following each task in the task stream has completed, the time during which the GPU device is controlled by the host thread component during data processing is reduced; in particular, the GPU device no longer waits while the host thread component 10 confirms whether a task has finished, so GPU efficiency is improved.
While the basic principles of the present disclosure have been described above in connection with specific embodiments, it should be noted that all or any steps or components of the methods and apparatus of the present disclosure can be implemented in hardware, firmware, software, or combinations thereof in any computing device (including processors, storage media, etc.) or network of computing devices, as would be apparent to one of ordinary skill in the art upon reading the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or set of programs on any computing device. The computing device may be a well-known general purpose device. Thus, the objects of the present disclosure may also be achieved by simply providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is apparent that the storage medium may be any known storage medium or any storage medium developed in the future.
It should also be noted that in the apparatus and methods of the present disclosure, the components or steps may evidently be decomposed and/or recombined; such decomposition and/or recombination should be regarded as equivalent to the present disclosure. The steps of the series of processes may naturally be executed in the chronological order described, but they need not be; some steps may be performed in parallel or independently of one another.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (4)

1. A multi-stream parallel control system, comprising a host thread component, a plurality of task stream components, and thread callback components whose number corresponds to the number of task stream components, wherein
the host thread component comprises a first executor and a second executor, the first executor inserting a computing task into a specified task stream among the plurality of task streams, and the second executor, after each computing task has been inserted, inserting a stream callback structure containing an event and a callback function into the specified task stream while simultaneously inserting the stream callback structure into the thread callback component;
each task stream component comprises a task executor and an event executor, the task executor executing the tasks inserted by the first executor and the event executor executing the events contained in the inserted stream callback structures; and
each thread callback component is independent of the host thread component and configured for one task stream component, and comprises a callback structure executor into which the stream callback structure is inserted, the callback structure executor executing the callback function contained in the stream callback structure once the event executor in the task stream component has finished executing the event.
2. The stream parallel control system of claim 1, wherein the event executor modifies an event result flag upon completion of event execution, and the callback structure executor of the thread callback component, upon learning through a thread channel that the event result flag has been modified, executes the callback function contained in the stream callback structure and sends a message to the host thread component that the event has completed.
3. A multi-stream parallel control method, comprising:
for each task stream, asynchronously inserting, by a first executor of a host thread component, computing tasks into the task stream;
initializing a stream callback structure, the stream callback structure containing an initialized event and a callback function;
asynchronously inserting, by a second executor of the host thread component, the initialized stream callback structure behind each computing task in the task stream;
inserting the initialized stream callback structure into a callback thread of a thread callback component independent of the host thread component;
executing, by the thread callback component, the callback function in the stream callback structure upon receiving a message that the event in the task stream has completed; and
repeating the above steps.
4. The stream parallel control method of claim 3, wherein an initial value of the event is modified after the event has executed, so that the thread callback component obtains the message that the event has completed by learning of the modification of the initial value through the thread channel.
CN201910633636.2A 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof Active CN110347450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910633636.2A CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910633636.2A CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Publications (2)

Publication Number Publication Date
CN110347450A CN110347450A (en) 2019-10-18
CN110347450B true CN110347450B (en) 2024-02-09

Family

ID=68176148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910633636.2A Active CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Country Status (1)

Country Link
CN (1) CN110347450B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111225063B (en) * 2020-01-20 2020-09-22 北京一流科技有限公司 Data exchange system and method for static distributed computing architecture
CN110955511B (en) * 2020-02-13 2020-08-18 北京一流科技有限公司 Executive body and data processing method thereof
CN114035968B (en) * 2022-01-10 2022-03-18 北京一流科技有限公司 Conflict processing system and method for multi-stream parallelism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345385A (en) * 2013-07-29 2013-10-09 北京汉邦高科数字技术股份有限公司 Method for converting serial events into parallel events
US9342384B1 (en) * 2014-12-18 2016-05-17 Intel Corporation Function callback mechanism between a central processing unit (CPU) and an auxiliary processor
CN109660569A (en) * 2017-10-10 2019-04-19 武汉斗鱼网络科技有限公司 A kind of Multi-task Concurrency executes method, storage medium, equipment and system
CN109783239A (en) * 2019-01-25 2019-05-21 上海创景信息科技有限公司 Multithreading optimization method, system and the medium of SystemC emulation dispatch core

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Man; Zhai Zhengjun. Distributed computing task manager component developed based on ACE technology. Microelectronics & Computer, 2008, 25(08): 121-124. *

Also Published As

Publication number Publication date
CN110347450A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN110347450B (en) Multi-stream parallel control system and method thereof
US7526634B1 (en) Counter-based delay of dependent thread group execution
US10101977B2 (en) Method and system of a command buffer between a CPU and GPU
US9342857B2 (en) Techniques for locally modifying draw calls
TWI522908B (en) A method for executing blocks of instructions using a microprocessor architecture having a register view, source view, instruction view, and a plurality of register templates
CN105893126A (en) Task scheduling method and device
CN104094235B (en) Multithreading calculates
DE102013202495A1 (en) Method for performing interactive debugging on non-interruptible graphics processing units
CN109213607B (en) Multithreading rendering method and device
DE102012221502A1 (en) A system and method for performing crafted memory access operations
CN105183698A (en) Control processing system and method based on multi-kernel DSP
US8370845B1 (en) Method for synchronizing independent cooperative thread arrays running on a graphics processing unit
CN110245024B (en) Dynamic allocation system and method for static storage blocks
US9105208B2 (en) Method and apparatus for graphic processing using multi-threading
CN110188067B (en) Coprocessor and data processing acceleration method thereof
CN107943592B (en) GPU cluster environment-oriented method for avoiding GPU resource contention
CN106227594A (en) A kind of multi-core CPU frame buffer display optimization method based on split screen
CN109656868B (en) Memory data transfer method between CPU and GPU
CN105224410A (en) A kind of GPU of scheduling carries out method and the device of batch computing
DE102013201195A1 (en) Previously scheduled repetitions of divergent operations
DE102012222391A1 (en) Multi-channel time slice groups
CN103218259A (en) Computer-implemented method for selection of a processor, which is incorporated in multiple processors to receive work, which relates to an arithmetic problem
US10198784B2 (en) Capturing commands in a multi-engine graphics processing unit
CN113051049A (en) Task scheduling system, method, electronic device and readable storage medium
CN110119375B (en) Control method for linking multiple scalar cores into single-core vector processing array

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant