CN110347450A - Multithread concurrent control system and its method - Google Patents

Multithread concurrent control system and its method

Info

Publication number
CN110347450A
CN110347450A (application CN201910633636.2A)
Authority
CN
China
Prior art keywords
event
task
component
thread
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910633636.2A
Other languages
Chinese (zh)
Other versions
CN110347450B (en)
Inventor
袁进辉
牛冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing First-Class Technology Co Ltd
Original Assignee
Beijing First-Class Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing First-Class Technology Co Ltd filed Critical Beijing First-Class Technology Co Ltd
Priority to CN201910633636.2A priority Critical patent/CN110347450B/en
Publication of CN110347450A publication Critical patent/CN110347450A/en
Application granted granted Critical
Publication of CN110347450B publication Critical patent/CN110347450B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Stored Programmes (AREA)

Abstract

The present disclosure relates to a multi-stream concurrency control system, comprising: a host thread component, including a first execution body and a second execution body, the first execution body inserting each computing task into a designated task stream among a plurality of task streams, and the second execution body inserting, after each inserted computing task, the event contained in a flow-callback structure; task-stream components, each including a task execution body and an event execution body, the task execution body being configured to execute the tasks inserted by the first execution body, and the event execution body being configured to execute the inserted events contained in the flow-callback structures; and thread callback components, one configured for each task-stream component, each including a flow-callback-structure execution body configured to execute the callback and send out the message that the event of the flow-callback structure has finished when the event execution body finishes executing.

Description

Multithread concurrent control system and its method
Technical field
The present invention relates to a system and method for concurrency control of multiple streams in a data-processing network, and more specifically to a concurrency control system and control method for multi-stream parallel processing implemented over the CUDA interface.
Background art
With the rise of big-data computing and deep learning, various coprocessors, such as the GPU (Graphics Processing Unit) and APU, are commonly used to offload data-processing work from the CPU. Because the GPU has a highly parallel structure, it is more efficient than the CPU at processing graphics data and complex algorithms. When a CPU executes a computing task, it processes only one datum at a time, so there is no true parallelism, whereas a GPU has many processor cores and can process many data in parallel at the same moment. Compared with a CPU, a GPU devotes more of its hardware to ALUs (Arithmetic Logic Units) for data processing, rather than to data caches and flow control. Such a structure is well suited to pure computing environments in which large volumes of highly uniform data are processed without interdependencies or interruptions. To carry out this large number of similar, simple operations without occupying CPU resources, multiple GPUs are typically attached to one or more CPUs to perform data processing in parallel, yielding high-speed processing of large amounts of data.
At present, NVIDIA GPUs are mostly used to realize this kind of parallel processing of simple data. In practice, however, the parallel control of GPU streams has defects that cause streams which should run in parallel to run serially. Fig. 1 shows a schematic diagram of the stream serialization that occurs with the traditional GPU interface. The left side of Fig. 1 shows multiple GPU streams executing in parallel; the right side shows the streams of multiple GPUs becoming serialized with respect to one another during execution. In actual operation, the serialization between streams is not limited to the situation shown on the right of Fig. 1. This is evidently a problem caused by the CPU in its control of each GPU's streams, and it degrades the data-processing speed of each GPU. We therefore wish to eliminate this problem and provide a stable stream concurrency control system.
Furthermore, with the traditional GPU interface, when execution of the tasks in a stream reaches a callback point, the GPU hands control back to the CPU; only after the CPU has run the callback function does control of the GPU stream return from the CPU to the GPU. Consequently, while the CPU executes the callback function, the GPU device sits idle. Although this wait is short, it is still a waste of GPU processing time and reduces the efficiency with which the GPU processes data. It is therefore desirable to eliminate this inefficiency.
Summary of the invention
To solve the above problems, the present disclosure provides a multi-stream concurrency control system comprising a host thread component, a plurality of task-stream components, and a number of thread callback components equal to the number of task-stream components. The host thread component includes a first execution body and a second execution body: the first execution body inserts each computing task into a designated task stream among the plurality of task streams, and the second execution body, after each computing task is inserted, inserts into the designated task stream a flow-callback structure containing an event and a callback function, while simultaneously inserting the flow-callback structure into the corresponding thread callback component. Each task-stream component includes a task execution body and an event execution body; the task execution body executes the tasks inserted by the first execution body, and the event execution body executes the events contained in the inserted flow-callback structures. Each thread callback component, configured one per task-stream component, includes a flow-callback-structure execution body that holds the inserted flow-callback structures and, when the event execution body in the task-stream component finishes executing an event, executes the callback function contained in the corresponding flow-callback structure.
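The flow-callback structure described above couples an event with a callback function. As a minimal illustrative sketch only, under the assumption that the event's "initial value" can be modeled as a boolean flag (the patent gives no code, and the names `CBEvent`, `record`, and `try_fire` are invented here), it might look like:

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch of the flow-callback structure ("CB EVENT"):
// an event flag that the task stream sets when the preceding task has
// finished, plus the callback function to be run by the thread callback
// component rather than by the host thread component.
struct CBEvent {
    std::atomic<bool> event_done{false};   // the event's "initial value"
    std::function<void()> callback;        // the stored callback function

    // Event execution body: modify the event's value after the task completes.
    void record() { event_done.store(true, std::memory_order_release); }

    // Flow-callback-structure execution body: run the callback only once
    // the event's value has been modified; returns whether it fired.
    bool try_fire() {
        if (!event_done.load(std::memory_order_acquire)) return false;
        callback();
        return true;
    }
};
```

In this sketch the task-stream side calls `record()` and the per-stream callback thread calls `try_fire()`; the real system would additionally notify the host thread component once the callback returns.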
According to the stream concurrency control system of the disclosure, the event execution body modifies an event-result flag after an event finishes executing; when the flow-callback-structure execution body of the thread callback component learns, via a thread channel, that the event-result flag has been modified, it executes the callback function contained in the flow-callback structure and sends the host thread component a message that the event has finished executing.
Another aspect of the present disclosure provides a multi-stream concurrency control method, comprising: for each task stream, asynchronously inserting computing tasks into the task stream; initializing a flow-callback structure, the flow-callback structure containing an initialized event and a callback function; asynchronously inserting the initialized flow-callback structure into the task stream after each computing task; inserting the initialized flow-callback structure into the callback thread of a thread callback component; upon receiving the message that the event in the task stream has finished executing, executing, in the callback thread of the thread callback component, the callback function in the flow-callback structure; and repeating the above steps.
According to the stream concurrency control method of the disclosure, after the event is executed its initial value is modified, so that the thread callback component, on learning of the modification of the event's initial value via a thread channel, obtains the message that the event has finished executing.
With the concurrency control system and method of the disclosure, because a thread callback component is provided separately for each task-stream component, the execution of each task stream is completely isolated from the others; the streams therefore do not affect one another, eliminating the possibility of stream serialization. Moreover, because the thread callback component executes the callback function only on the premise that the event following each task in the task stream has finished, the time during which the GPU device is controlled by the host thread component during data processing is reduced; in particular, the time the GPU device spends waiting while the host thread component confirms whether a task has finished is eliminated, optimizing GPU efficiency.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure.
The disclosure is described in detail below by way of embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the results of existing stream parallel control;
Fig. 2 is a schematic diagram of the principle of the stream concurrency control system according to the disclosure; and
Fig. 3 is a flow diagram of the stream concurrency control method according to the disclosure.
Detailed description of the embodiments
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, so that those skilled in the art can implement it by following the specification.
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. In the following description, where reference is made to the drawings, the same numerals in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
The terminology used in the disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "said", and "the" used in the disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the disclosure to describe various information, this information should not be limited by these terms; the terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of the disclosure, one of two possible devices may be referred to as the first execution body or as the second execution body, and similarly the other may be referred to as the second execution body or as the first execution body. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In order that those skilled in the art may better understand the disclosure, it is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, existing deep-learning computing systems mostly use GPUs to handle large numbers of simple, repetitive computing tasks, with multiple GPUs processing in parallel. With the existing interface modules provided by NVIDIA, the task streams executed by multiple GPU devices may fail to run in parallel, as shown on the right of Fig. 1; in particular, multiple task streams may become serialized with respect to one another. Although the cause of this serialization of multiple task streams is not known, the inventors have made various improvements to eliminate the situation shown on the right of Fig. 1.
Fig. 2 is a schematic diagram of the principle of the multi-stream concurrency control system according to the disclosure. As shown in Fig. 2, the host thread component 10 includes any number of execution bodies 11, 12, 13, ..., 1N for carrying out various operations. The first execution body 11, based on predetermined instructions, inserts the tasks to be executed by a GPU into that GPU's task stream. Specifically, the first execution body 11 inserts tasks into the task stream controlled by the task-stream component 30, and the task-stream component 30 directs the task execution body 31 to execute the inserted tasks. The second execution body 12 then inserts the event of a predetermined, initialized flow-callback structure into the task stream controlled by the task-stream component 30, placing it after the task inserted by the first execution body 11. Thus, according to the order of insertion, the task stream controlled by the task-stream component 30 causes the GPU that executes it to run each task followed by its event, in sequence.
At the same time, the second execution body 12 also inserts the predetermined, initialized flow-callback structure into the callback execution body 21 of the thread callback component 20. Although only one callback execution body 21 is shown here, the thread callback component 20 includes multiple callback execution bodies, their number corresponding to the number of task execution bodies.
After the predetermined computing tasks have been assigned to the GPU, the execution bodies carry out the assigned tasks according to the predetermined procedure, and the task-stream component controls each task and event on the GPU device in turn. Specifically, as soon as each task in the stream component completes, the event operation is executed immediately; after the event operation ends, the event execution body 32 modifies the initial value of the corresponding event. The corresponding callback execution body 21 in the thread callback component 20 learns of this change of the initial value through the thread channel.
The thread callback component 20 and the task-stream component 30 communicate with each other through the thread channel. After the event operation ends, the event execution body 32 sends a completion message through the thread channel to the callback execution body 21; this message informs the callback execution body 21 that, the event having been executed by the event execution body 32, its initial value has changed. The callback execution body 21 can be a state machine: on learning that the event's initial value has changed, it begins to execute the callback function in the callback structure, and after the callback function finishes it sends a completion message to the host thread component 10, so that other execution bodies in the host thread component 10 can carry out any subsequent operations in sequence.
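This state-machine behavior can be simulated with ordinary host threads. The sketch below is illustrative only: the patent supplies no code, and `ThreadChannel`, `notify_event_done`, and `run_callback_thread` are invented names; a condition variable stands in for the thread channel, a flag guarded by a mutex for the event's modified initial value, and a dedicated thread for the callback execution body 21.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative stand-in for the thread channel between the event
// execution body (32) and the callback execution body (21).
struct ThreadChannel {
    std::mutex m;
    std::condition_variable cv;
    bool event_value_changed = false;   // has the event's initial value changed?

    // Event execution body: signal that the event has finished executing.
    void notify_event_done() {
        { std::lock_guard<std::mutex> lk(m); event_value_changed = true; }
        cv.notify_one();
    }
    // Callback execution body: block until the event's value changes.
    void wait_event_done() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return event_value_changed; });
    }
};

// Run `callback` in a dedicated callback thread once the channel reports
// that the event is done; returns after the callback has completed.
template <typename F>
void run_callback_thread(ThreadChannel& ch, F callback) {
    std::thread callback_body([&] {    // the callback execution body 21
        ch.wait_event_done();          // learn of the change via the channel
        callback();                    // execute the flow-callback's function
    });
    callback_body.join();
}
```

Because the wait uses a predicate, the design works whether the event finishes before or after the callback thread starts waiting, which matches the gating described above: no event completion, no callback.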
Because the stream concurrency control system of the disclosure assigns a thread callback component 20 to each task-stream component 30, the host thread component 10 no longer needs, as in the existing CUDA interface, to control each task stream by executing a callback function after each task in the stream completes; instead, a thread callback component 20 dedicated to each task stream executes the callback function and thereby obtains the message that the task is complete. Using the thread callback component 20 thus eliminates the need to execute callback functions in the host thread component 10. Furthermore, because each task-stream component 30 is configured with its own thread callback component 20 to execute the callback functions in its flow-callback structures, the parallel control of the streams never intersects across task streams; each task stream is independent of the others, eliminating stream serialization among multiple task streams.
Furthermore, in the traditional CUDA interface, the execution of each task in a task stream typically comprises one host-to-device data copy, one kernel launch, one device-to-host data copy, and finally the addition of a callback function. When execution on the GPU device reaches the callback point, the GPU device hands control back to the host; only after the host has finished running the callback does control return to the task-stream component and the GPU device, whereupon the GPU device can continue executing the operations of each execution body for the next task. Therefore, while the traditional host thread component 10 executes the callback function, the GPU is idle, which wastes GPU processing time and reduces the GPU's data-processing efficiency. By contrast, having the callback execution body 21 of the thread callback component 20 of the disclosure execute the callback function after the event reduces the time for which the host takes over control of the GPU device. And because the time taken to execute an event in the task-stream component 30 is much shorter than the time taken to execute a callback function, the event execution, although it consumes some time, takes far less time than executing a callback in the legacy host thread component 10. Therefore, by providing the thread callback component 20 and inserting the events of the flow-callback structures into the task-stream component, the interval between one task and the next in the task-stream component 30 (the event execution time, which is negligible) is significantly reduced, essentially achieving seamless transitions between task executions in the task-stream component 30. This significantly improves the execution fluency of the GPU devices, and thus the efficiency with which the GPU processes data.
Although the system of the disclosure has been described above with reference to Fig. 2, it should be noted that execution body 11 and execution body 12 may alternate continuously in inserting tasks and flow-callback structures into the task-stream component 30. Likewise, while execution body 12 inserts the flow-callback structure (CBEVENT) into the task stream of component 30, it also inserts the flow-callback structure into the corresponding callback execution body of the thread callback component 20.
Fig. 3 is a timing diagram of the stream concurrency control method according to the disclosure. As shown in Fig. 3, at step S31 the first execution body 11 in the host thread component 10 asynchronously inserts, in order, the tasks required by each task stream into the corresponding task stream. For each task stream, after the first execution body 11 inserts a task, at step S32 the second execution body 12 inserts a flow-callback structure, built around an event (CUDA EVENT), into the task stream; in the disclosure it is called a CALLBACK EVENT, or "CB EVENT" for short. Simultaneously with step S32, at step S33, the second execution body 12 also inserts the flow-callback structure into the callback thread of the thread callback component 20, where the callback execution body 21 (also called the "callback execution body") operates on the inserted flow-callback structure. Steps S31, S32, and S33 are repeated as many times as there are tasks to execute. After the tasks and flow-callback structures are inserted, the task-stream component 30 executes in turn each task and the event (EVENT) contained in its flow-callback structure. That is, at step S34, when a task in a task stream has finished executing and the event (EVENT) contained in the following flow-callback structure has been executed, the initial value of that event is modified, so that the callback execution body 21 in the thread callback component 20 learns of the modification of the event's initial value. Then, at step S35, having learned of the modification of the corresponding event's initial value, the callback execution body 21 executes the callback function (CALLBACK FUNCTION) in the flow-callback structure and sends a message to the host thread component 10.
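Steps S31 through S35 can be illustrated end to end with a small host-side simulation. This is only a sketch under the assumption that a task stream can be modeled as a FIFO work queue drained by one worker thread standing in for a GPU device; `TaskStream` and `simulate_stream` are invented names and none of this code appears in the patent.

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// A task stream modeled as a FIFO of host-side jobs drained in order by
// one worker thread (a stand-in for a GPU device executing the stream).
struct TaskStream {
    std::queue<std::function<void()>> fifo;
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;

    void insert(std::function<void()> job) {  // asynchronous insertion
        { std::lock_guard<std::mutex> lk(m); fifo.push(std::move(job)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_one();
    }
    void run() {  // task and event execution bodies, strictly in order
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return closed || !fifo.empty(); });
                if (fifo.empty()) return;
                job = std::move(fifo.front());
                fifo.pop();
            }
            job();
        }
    }
};

// Steps S31..S35 for one stream of n tasks: after every task an event is
// inserted (S32); the per-stream callback thread fires each callback only
// after that event's value has been modified (S34, S35).
int simulate_stream(int n) {
    TaskStream stream;
    std::deque<std::atomic<bool>> events;          // the EVENTs' "initial values"
    for (int i = 0; i < n; ++i) events.emplace_back(false);
    std::atomic<int> tasks_done{0}, callbacks_run{0};

    for (int i = 0; i < n; ++i) {                  // S31 + S32 (host thread side)
        stream.insert([&] { tasks_done.fetch_add(1); });     // computing task
        stream.insert([&e = events[i]] { e.store(true); });  // its EVENT
    }
    stream.close();

    std::thread callback_thread([&] {              // S33: per-stream callback thread
        for (int i = 0; i < n; ++i) {
            while (!events[i].load()) std::this_thread::yield();  // thread channel
            callbacks_run.fetch_add(1);            // S35: CALLBACK FUNCTION
        }
    });
    std::thread device(&TaskStream::run, &stream); // GPU-side execution of the stream
    device.join();
    callback_thread.join();
    return tasks_done.load() == n ? callbacks_run.load() : -1;
}
```

Note that the stream thread never waits on the callback thread: it executes task, event, task, event without interruption, which is the seamless-transition property claimed above; the callback thread only observes event completions.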
The callback function needs to be executed because the task streams are inserted into and executed asynchronously, so that the host thread can confirm the completion of task execution. Therefore, in order for the host thread component to learn promptly of the GPU's execution state, the callback function is executed after completion, so that the host thread component 10 can, as needed, perform operations unrelated to the GPU.
By way of example, in a computing system using the multi-stream parallel control system of the disclosure, one host machine may have four GPU devices attached; alternatively it could be two, three, six, or twenty GPU devices. Four GPU devices are used here for explanation. In the control system, each GPU device corresponds to one task-stream component 30, so four GPU devices correspond to four task-stream components 30-0, 30-1, 30-2, 30-3. Precisely because the four task-stream components are distributed across four GPU devices, there is no relationship at all among the four task-stream components. The tasks inserted into the four task streams occupy the resources of each of the four GPU cards, so the execution bodies of each GPU device execute their corresponding tasks concurrently without producing any blocking. In the traditional CUDA interface, by contrast, the callback function interface (CALLBACK FUNCTION) is inserted after the corresponding task in the task stream, so the task streams intersect with one another when callback functions are executed in the host thread component. At some moments, calling these callback functions causes the execution of the tasks of the different streams to become serial, which leads to stream serialization.
The callback function interface of the disclosure, by contrast, is placed separately in the thread callback component 20, changing the logic by which the stream parallel control system manages callback functions: callbacks are managed separately and individually for each task stream. For the specific implementation, see the above description of Figs. 2 and 3. The thread callback component 20 obtains, through the thread channel, the result message for the EVENT in a CB EVENT executed by the execution bodies in the task-stream component 30, whereupon the callback execution body 21 of the thread callback component 20 begins to execute the callback function of that EVENT in the flow-callback structure. If the thread callback component 20 has not obtained the result message for a CB EVENT executed by the execution bodies in the task-stream component 30, it performs no operation.
In other words, to eliminate the serialization problem among multiple streams, the disclosure constructs, on the basis of EVENT, a flow-callback structure containing the EVENT, wrapping a further layer (a polling loop) around the EVENT interface to obtain a new CBEVENT interface that replaces the existing CUDA interface and is executed by the thread callback component 20, thereby achieving the stream-parallelism goal of the control of the disclosure.
Moreover, in a computing system using the stream concurrency control system of the disclosure, tasks and events alternate in the stream component of the GPU device, and the invocation overhead of an event is very small. Specifically, the stream component of the disclosure does not execute CALLBACK as the traditional CUDA interface does; CALLBACK is instead executed on the callback execution body of the thread callback component 20. In the task-stream component 30, each task is directly followed by the execution of its event (EVENT), and the cost of an EVENT is tiny. After the event has executed, the next task is executed directly, without waiting for an execution body in the host thread component to run a callback function: the task-stream component of the disclosure executes only the EVENT, rather than, as with the existing CUDA interface, executing CALLBACK, sending the host thread component 10 a message to run the callback function, and waiting for the host thread component 10 to confirm after the callback whether the task has finished. In the disclosure, because the premise for the thread callback component 20 executing a callback function is that the EVENT has finished, and the EVENT is executed after each task, the thread callback component 20 executing the callback function necessarily indicates that the task has finished; the thread callback component 20 executing the callback function is in itself the confirmation that the task has finished. Moreover, because the time the task-stream component spends executing the EVENT is minimal, far shorter than the wait for confirmation of task progress with the existing CUDA interface, the efficiency with which the GPU device processes data is greatly improved.
Thus far, this specification has described a concurrency control system and method according to embodiments of the present disclosure. Because a thread callback component 20 is provided separately for each task-stream component 30, the execution of each task stream is completely separated from the others; the streams do not affect one another, eliminating the possibility of stream serialization. Also, because the thread callback component 20 executes the callback function only on the premise that the event following each task in the task stream has finished, the time during which the GPU device is controlled by the host thread component during data processing is reduced; in particular, the time the GPU device spends waiting while the host thread component 10 confirms whether a task has finished is eliminated, optimizing GPU efficiency.
The basic principles of the disclosure have been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the methods and devices of the disclosure can be implemented in hardware, firmware, software, or a combination thereof in any computing device (including processors, storage media, etc.) or network of computing devices; having read the description of the disclosure, those of ordinary skill in the art can achieve this using their basic programming skills.
Therefore, the object of the disclosure can also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. The object of the disclosure can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the disclosure, and a storage medium storing such a program product also constitutes the disclosure. Obviously, the storage medium may be any well-known storage medium or any storage medium developed in the future.
It should also be noted that in the devices and methods of the disclosure, each component or step can obviously be decomposed and/or recombined; such decompositions and/or recombinations should be regarded as equivalent schemes of the disclosure. Also, the steps of the above series of processes may naturally be executed in chronological order in the sequence described, but they need not necessarily be executed in that order; some steps may be executed in parallel or independently of one another.
The specific embodiments above do not limit the scope of protection of the disclosure. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the disclosure shall be included within the scope of protection of the disclosure.

Claims (4)

1. A multi-stream concurrency control system, comprising a host thread component, a plurality of task-stream components, and thread callback components equal in number to the task-stream components, wherein
the host thread component comprises a first execution body and a second execution body, the first execution body inserting each computing task into a designated task stream among the plurality of task streams, and the second execution body, after each computing task has been inserted, inserting into the designated task stream a stream callback structure containing an event and a callback function while simultaneously inserting the stream callback structure into the thread callback component;
each task-stream component comprises a task execution body and an event execution body, the task execution body executing the tasks inserted by the first execution body, and the event execution body executing the event contained in the inserted stream callback structure; and
each thread callback component, configured in one-to-one correspondence with a task-stream component, comprises a stream-callback-structure execution body that executes the stream callback structures inserted into it: when the event execution body in the corresponding task-stream component has finished executing an event, the stream-callback-structure execution body executes the callback function contained in that stream callback structure.
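The architecture recited in claim 1 — a host thread that inserts tasks and event/callback structures into a stream, and a per-stream callback thread that fires each callback once its event completes — can be sketched as follows. This is an illustrative model only: all names (`Stream`, `FlowCallback`) are hypothetical, and plain Python threading primitives stand in for whatever device stream API (e.g. GPU streams) an actual embodiment would use.

```python
import threading
import queue

class Stream:
    """A task stream: a worker thread draining a FIFO queue in order
    (stand-in for the patent's task-stream component)."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, fn):
        # The host thread's first execution body inserts a task here.
        self.q.put(fn)

    def _run(self):
        while True:
            self.q.get()()  # task execution body / event execution body

class FlowCallback:
    """The stream callback structure: an event plus a callback function."""
    def __init__(self, callback):
        self.event = threading.Event()
        self.callback = callback

def callback_thread(inbox):
    """Thread callback component: for each inserted structure, wait for
    its event to finish in the stream, then run the callback."""
    while True:
        fc = inbox.get()
        fc.event.wait()   # event has finished executing in the stream
        fc.callback()

# Host thread component: insert a task, then insert the callback
# structure into both the stream and the callback thread's inbox.
stream, inbox = Stream(), queue.Queue()
threading.Thread(target=callback_thread, args=(inbox,), daemon=True).start()

done = threading.Event()
stream.submit(lambda: None)        # first execution body: the computing task
fc = FlowCallback(done.set)        # callback notifies the host thread
stream.submit(fc.event.set)        # event execution body completes the event
inbox.put(fc)                      # second execution body: into callback thread
done.wait(timeout=5)
```

Because the stream is a FIFO, the event cannot complete before the task that precedes it, so the callback is guaranteed to observe the task's effects.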
2. The multi-stream concurrency control system according to claim 1, wherein the event execution body modifies an event result flag after an event has finished executing, and the stream-callback-structure execution body of the thread callback component, upon learning via a thread channel that the event result flag has been modified, executes the callback function contained in the stream callback structure and sends to the host thread component a message that execution of the event has finished.
3. A multi-stream concurrency control method, comprising:
for each task stream, asynchronously inserting a computing task into the task stream;
initializing a stream callback structure, the stream callback structure containing an event to be initialized and a callback function;
asynchronously inserting the initialized stream callback structure into the task stream after each computing task;
inserting the initialized stream callback structure into a callback thread of a thread callback component;
upon the callback thread receiving a message that execution of the event in the task stream has completed, executing, in the thread callback component, the callback function in the stream callback structure; and
repeating the above steps.
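The method steps of claim 3 can be sketched end to end for several streams at once. The sketch below is a minimal model under stated assumptions: one worker thread per stream, one callback thread per stream fed through its own channel, and a `results` queue standing in for the completion messages sent back to the host thread; all of these names are illustrative, not taken from the patent.

```python
import threading
import queue

NUM_STREAMS, TASKS_PER_STREAM = 2, 3
streams = [queue.Queue() for _ in range(NUM_STREAMS)]    # task streams
channels = [queue.Queue() for _ in range(NUM_STREAMS)]   # one per callback thread
results = queue.Queue()          # "event finished" messages back to the host

def stream_worker(tasks):
    """Drain one task stream in FIFO order until a None sentinel."""
    for fn in iter(tasks.get, None):
        fn()

def callback_worker(chan):
    """Callback thread: wait on each event, then run its callback."""
    for event, callback in iter(chan.get, None):
        event.wait()             # event flag set by the event execution body
        callback()

workers = [threading.Thread(target=stream_worker, args=(s,)) for s in streams]
cb_threads = [threading.Thread(target=callback_worker, args=(c,)) for c in channels]
for t in workers + cb_threads:
    t.start()

# Host thread: the four insertion steps of claim 3, per stream, per task.
for i, (s, c) in enumerate(zip(streams, channels)):
    for j in range(TASKS_PER_STREAM):
        s.put(lambda i=i, j=j: None)               # 1: async task insert
        ev = threading.Event()                     # 2: init callback structure
        cb = lambda i=i, j=j: results.put((i, j))  # ...callback reports completion
        s.put(ev.set)                              # 3: insert event after the task
        c.put((ev, cb))                            # 4: insert into callback thread

for s in streams:
    s.put(None)                                    # shut down stream workers
for c in channels:
    c.put(None)                                    # shut down callback threads
for t in workers + cb_threads:
    t.join()

assert results.qsize() == NUM_STREAMS * TASKS_PER_STREAM
```

Since tasks in different streams run on independent worker threads, the streams overlap freely, while the per-stream FIFO order still guarantees each callback fires only after its own preceding task.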
4. The multi-stream concurrency control method according to claim 3, wherein after the event is executed, an initial value of the event is modified, so that the thread callback component, upon learning of the modification of the event's initial value via a thread channel, obtains the message that execution of the event has finished.
CN201910633636.2A 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof Active CN110347450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910633636.2A CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910633636.2A CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Publications (2)

Publication Number Publication Date
CN110347450A true CN110347450A (en) 2019-10-18
CN110347450B CN110347450B (en) 2024-02-09

Family

ID=68176148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910633636.2A Active CN110347450B (en) 2019-07-15 2019-07-15 Multi-stream parallel control system and method thereof

Country Status (1)

Country Link
CN (1) CN110347450B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955511A (en) * 2020-02-13 2020-04-03 北京一流科技有限公司 Executive body and data processing method thereof
CN111225063A (en) * 2020-01-20 2020-06-02 北京一流科技有限公司 Data exchange system and method for static distributed computing architecture
CN114035968A (en) * 2022-01-10 2022-02-11 北京一流科技有限公司 Conflict processing system and method for multi-stream parallelism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345385A (en) * 2013-07-29 2013-10-09 北京汉邦高科数字技术股份有限公司 Method for converting serial events into parallel events
US9342384B1 (en) * 2014-12-18 2016-05-17 Intel Corporation Function callback mechanism between a central processing unit (CPU) and an auxiliary processor
CN109660569A (en) * 2017-10-10 2019-04-19 武汉斗鱼网络科技有限公司 A kind of Multi-task Concurrency executes method, storage medium, equipment and system
CN109783239A (en) * 2019-01-25 2019-05-21 上海创景信息科技有限公司 Multithreading optimization method, system and the medium of SystemC emulation dispatch core

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345385A (en) * 2013-07-29 2013-10-09 北京汉邦高科数字技术股份有限公司 Method for converting serial events into parallel events
US9342384B1 (en) * 2014-12-18 2016-05-17 Intel Corporation Function callback mechanism between a central processing unit (CPU) and an auxiliary processor
CN109660569A (en) * 2017-10-10 2019-04-19 武汉斗鱼网络科技有限公司 A kind of Multi-task Concurrency executes method, storage medium, equipment and system
CN109783239A (en) * 2019-01-25 2019-05-21 上海创景信息科技有限公司 Multithreading optimization method, system and the medium of SystemC emulation dispatch core

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Man; Zhai Zhengjun: "Distributed computing task manager component developed based on ACE technology" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111225063A (en) * 2020-01-20 2020-06-02 北京一流科技有限公司 Data exchange system and method for static distributed computing architecture
CN110955511A (en) * 2020-02-13 2020-04-03 北京一流科技有限公司 Executive body and data processing method thereof
CN110955511B (en) * 2020-02-13 2020-08-18 北京一流科技有限公司 Executive body and data processing method thereof
WO2021159927A1 (en) * 2020-02-13 2021-08-19 北京一流科技有限公司 Executor and data processing method therefor
CN114035968A (en) * 2022-01-10 2022-02-11 北京一流科技有限公司 Conflict processing system and method for multi-stream parallelism

Also Published As

Publication number Publication date
CN110347450B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN110347450A (en) Multithread concurrent control system and its method
US4980824A (en) Event driven executive
US7904905B2 (en) System and method for efficiently executing single program multiple data (SPMD) programs
CN102597950B (en) Hardware-based scheduling of GPU work
CN106874094A (en) timed task processing method, device and computing device
CN100440151C (en) Context pipelines
EP3458959B1 (en) Reconfigurable distributed processing
CN104050033A (en) System and method for hardware scheduling of indexed barriers
CN103366338A (en) Image processing device and image processing method
JPH04507160A (en) computer
CN104461970B (en) Dma controller, mobile terminal and data method for carrying
CN105183698A (en) Control processing system and method based on multi-kernel DSP
CN106681820A (en) Message combination based extensible big data computing method
CN106062716A (en) Method, device and single task system for realizing multiple tasks in single task system
CN105608138B (en) A kind of system of optimization array data base concurrency data loading performance
CN101216780B (en) Method and apparatus for accomplishing multi-instance and thread communication under SMP system
CN111158890A (en) System and method for controlling task parallelism in task set
CN103197918B (en) Hyperchannel timeslice group
CN104597832B (en) PLC program scheduler IP core based on AMBA bus
CN110245024A (en) The dynamic allocation system and its method of static storage block
CN107851027A (en) Data handling system
CN103294449A (en) Pre-scheduled replays of divergent operations
CN103218259A (en) Computer-implemented method for selection of a processor, which is incorporated in multiple processors to receive work, which relates to an arithmetic problem
JPS63184841A (en) Control of execution of tasks mutually related
US20200264879A1 (en) Enhanced scalar vector dual pipeline architecture with cross execution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant