CN113806044A - Heterogeneous platform task bottleneck elimination method for computer vision application - Google Patents

Info

Publication number
CN113806044A
CN113806044A
Authority
CN
China
Prior art keywords
task
tasks
bottleneck
execution
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111008450.1A
Other languages
Chinese (zh)
Other versions
CN113806044B (en)
Inventor
王祎
刘志磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202111008450.1A
Publication of CN113806044A
Application granted
Publication of CN113806044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a heterogeneous platform task bottleneck elimination method for computer vision applications, comprising the following steps: splitting a computer vision application into a plurality of semantically independent tasks, each of which must implement a predefined uniform interface; connecting the tasks through queues, organizing the application as a directed graph; discovering bottleneck tasks; processing bottleneck tasks; and wrapping the tasks as coroutine tasks submitted to a coroutine scheduler for execution.

Description

Heterogeneous platform task bottleneck elimination method for computer vision application
Technical field:
The invention relates to the fields of computer vision, streaming computing systems, and dynamic scheduling, and in particular to a heterogeneous-platform-oriented task bottleneck elimination method based on a streaming computing model.
Background art:
With the rapid development of the computer vision field, a large body of basic research has made computer vision methods applicable in fields such as security and logistics, and computer vision algorithm engineering is becoming a crucial link in moving algorithms into production. The training process of a computer vision algorithm is usually more complex than the inference process; inference typically contains no complex iterative algorithm and is regarded as the easier part. Research in the field has therefore focused on improving the efficiency or accuracy of model training, computer vision frameworks likewise center on training, and the inference side offers only simple basic functions. As computer vision applications multiply, a highly available, easily extensible computer vision inference framework that helps algorithms reach production quickly is becoming a rapidly growing demand.
As shown in fig. 1, a computer vision application is generally formed by connecting several network models in series, with a definite order and dependency relationships, and is usually executed on dedicated computing devices: the image or video stream to be analyzed passes through the network models in turn, assisted by some logic control, and the analysis result is finally returned to the user. Such a processing pipeline is well suited to abstraction under a streaming computing model, which contains three concepts: graph, task, and edge. A task is an abstraction of a unit of work, typically model inference or logic control in a computer vision application. An edge is an abstraction of the dependency and communication pattern between tasks, usually a queue; one task may attach to multiple edges, and one edge may connect multiple tasks. The graph is an abstraction of the task pipeline: the tasks and the edges between them form a directed graph. Abstracting the application into this directed-graph structure expresses its streaming structure well and fully exposes the latent parallelism of the computer vision application.
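For illustration, the three abstractions can be written down as minimal Rust types; the names and the u64 payload here are illustrative only, not the patent's definitions:

// Edge: abstraction of the dependency/communication between tasks, usually a queue.
struct Edge { queue: std::collections::VecDeque<u64> }

// Task node: a unit of work, typically model inference or logic control.
struct Node { name: &'static str, inputs: Vec<usize>, outputs: Vec<usize> } // indexes into `edges`

// Graph: the tasks plus the edges between them form a directed graph.
struct Graph { nodes: Vec<Node>, edges: Vec<Edge> }

fn main() {
    // decode -> detect -> track, connected by two queue edges
    let graph = Graph {
        edges: vec![Edge { queue: Default::default() }, Edge { queue: Default::default() }],
        nodes: vec![
            Node { name: "decode", inputs: vec![],  outputs: vec![0] },
            Node { name: "detect", inputs: vec![0], outputs: vec![1] },
            Node { name: "track",  inputs: vec![1], outputs: vec![] },
        ],
    };
    for n in &graph.nodes {
        println!("{} has {} input edge(s)", n.name, n.inputs.len());
    }
    assert_eq!(graph.edges.len(), 2);
}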
In deploying computer vision applications, cost control under high load is a core engineering requirement, and there are generally two key performance indicators: throughput and latency. The two are fundamentally in tension: all else being equal, measures that raise throughput tend to raise latency. The invention focuses on eliminating task bottlenecks of computer vision applications on heterogeneous platforms, and provides a comprehensive scheduling method that is transparent to the specific application and adapts at runtime to both throughput and latency, thereby improving deployment efficiency and reducing enterprise development cost.
Summary of the invention:
The invention aims to provide a method for heterogeneous platforms that is transparent to the specific application and can eliminate throughput and latency bottlenecks at runtime. The technical scheme is implemented in the following steps:
a heterogeneous platform task bottleneck elimination method for computer vision applications, comprising the steps of:
(1) splitting a computer vision application into a plurality of semantically independent tasks, each of which must implement a predefined uniform interface;
(2) connecting the tasks through queues, where each queue provides index-based splitting, broadcasting, aggregation, and order-preserving functions; each task has several input queues and several output queues, shared between tasks, so that the application is organized as a directed graph;
(3) discovering bottleneck tasks, as follows:
1) if the outflow rate of a task's input queue is lower than its inflow rate, the task is considered to be on a bottleneck path, and the last task on the bottleneck path is considered a bottleneck task; for any task, its execution overhead is judged from the change in the amount of data remaining in its input queues before and after a single execution, calculating the task execution overhead C:
C = max(N_1 - N_2, 1)
Cost_{n+1} = Cost_n × S + C × (1 - S)
where N_1 is the minimum amount of data remaining across all input queues before the single execution, N_2 is that minimum after the execution, S is a smoothing coefficient (the method uses 0.7), Cost_n is the previous execution cost, initialized to a large value, and Cost_{n+1} is the current execution cost;
2) to judge the current application's bottleneck task more accurately, the task's own execution overhead is combined with the execution overheads of the tasks it depends on before and after it, giving the context execution cost: the Cost_{n+1} value is multiplied by a dependency cost coefficient, where N_3 is the minimum amount of data remaining across all output queues after a single execution of the task (the coefficient's formula is given only as an image in the original publication);
(4) processing the bottleneck task, as follows: all tasks are traversed in arbitrary order and their context execution costs are calculated; the task with the largest context execution cost is the global bottleneck task; if the bottleneck task is a computing task on the CPU, it is parallelized, by plain replication or by index-based replication depending on whether it is stateful, eliminating the bottleneck; if the bottleneck task is a computing task on the GPU, the timeout for batching input data is increased to improve the bottleneck task's throughput; all tasks whose context execution cost is below 0.5 times that of the bottleneck task are called low-overhead tasks; if a low-overhead task is a stateful computing task on the CPU, its index-based replicas are aggregated into a single computing task with the same unique identifier, reducing system load; if the low-overhead task is a computing task on the GPU, the batching timeout is reduced so that the current batch executes as soon as possible, reducing the task's latency;
(5) wrapping the tasks as coroutine tasks and submitting them to a coroutine scheduler for execution;
(6) when the coroutine scheduler invokes a coroutine task, the task's input queues are traversed in arbitrary order and dynamic batching is applied in turn, determining each input's current batch size and the specific input data in the batch according to the remaining waiting time, after which the task is executed; the remaining waiting time T_r is calculated as:
W_before = Σ_{i=1}^{n-1} W_i
T_before = Σ_{i=1}^{n-1} T_i
T_r = T_e × (W_before + W_n) - T_before - T_n
where T_n is the average execution time of the n-th task, T_total is the average delay of all data through the whole flow, W_n is the delay weight of the n-th task, updated from T_n and T_total whenever T_n is updated (its update formula is given only as an image in the original publication), T_e is the expected delay, W_before is the cumulative delay-weight sum of tasks 1 through n-1, i.e. the delay weights a datum has accumulated on reaching the n-th task, and T_before is the sum of the delays a datum has experienced before reaching the n-th task.
Exploiting the good decoupling and ready parallelism of the streaming computing model, the invention provides a comprehensive scheduling algorithm that is transparent to the specific computer vision application and adapts at runtime to both throughput and latency, thereby improving deployment efficiency and reducing enterprise development cost.
Description of the drawings:
FIG. 1: a common streaming abstraction of computer vision applications.
FIG. 2: schematic diagram of the automatic parallelization strategy of the invention.
FIG. 3: schematic diagram of the automatic batching strategy of the invention.
FIG. 4: effect in a real computer vision project; the numbers on each edge denote the amount of data in the queue / input rate / output rate.
Detailed description of embodiments:
The specific steps of the method are as follows:
(1) The computer vision application is split into a plurality of semantically independent tasks, each of which must implement a predefined uniform interface. The interface is defined as follows:
fn id(&self) -> usize;                            // obtain the task's unique identifier
fn exec(&mut self);                               // task execution interface
fn set_input(&mut self, i: usize, edge: Queue);   // set an input edge
fn get_input(&mut self, i: usize) -> &mut Queue;  // get an input edge
fn set_output(&mut self, i: usize, edge: Queue);  // set an output edge
fn get_output(&mut self, i: usize) -> &mut Queue; // get an output edge
fn cost(&mut self) -> f64;                        // update and return the last execution cost
fn indexes(&self) -> Vec<usize>;                  // indexes held by the task; empty means the task has no context
fn clone(&self) -> Self;                          // copy the current task
fn clone_by_index(&self, index: usize) -> Self;   // copy the current task by index
fn collect_by_index(&mut self, other: Self);      // aggregate tasks by index
fn device(&self) -> Device;                       // get the task's execution device
fn cost_time(&mut self) -> usize;                 // update and return the average execution time
fn weight(&self) -> f64;                          // delay weight
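Gathered into a Rust trait, a subset of the interface can be exercised as below; Queue and Device here are simplified stand-ins (the patent does not publish their definitions), and the pass-through task is a minimal stateless implementor for illustration:

use std::collections::VecDeque;

pub struct Queue(pub VecDeque<u64>);              // simplified stand-in for the queue type

#[derive(Clone, Copy, PartialEq)]
pub enum Device { Cpu, Gpu }                      // simplified stand-in for the device type

pub trait Task {
    fn id(&self) -> usize;
    fn exec(&mut self);
    fn cost(&mut self) -> f64;
    fn indexes(&self) -> Vec<usize>;              // empty: the task is stateless
    fn device(&self) -> Device;
    fn weight(&self) -> f64;
}

// A trivially stateless task that moves one item from its input to its output.
pub struct PassThrough { pub input: Queue, pub output: Queue }

impl Task for PassThrough {
    fn id(&self) -> usize { 0 }
    fn exec(&mut self) {
        if let Some(x) = self.input.0.pop_front() { self.output.0.push_back(x); }
    }
    fn cost(&mut self) -> f64 { 1.0 }
    fn indexes(&self) -> Vec<usize> { Vec::new() }
    fn device(&self) -> Device { Device::Cpu }
    fn weight(&self) -> f64 { 1.0 }
}

fn main() {
    let mut t = PassThrough { input: Queue(VecDeque::from([42])), output: Queue(VecDeque::new()) };
    t.exec();
    assert_eq!(t.output.0.pop_front(), Some(42));
    assert!(t.indexes().is_empty()); // stateless
}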
(2) The tasks are connected through queues; each task has several input queues and several output queues, shared between tasks, so that the application is organized as a directed graph. The queue is implemented on top of concurrent_queue in Intel TBB but, unlike concurrent_queue, additionally provides index-based splitting, broadcasting, aggregation, and order-preserving functions, supporting the different forms of task dependency.
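The patent does not disclose the queue's internals beyond its basis on concurrent_queue; the following is a minimal sketch of how the split/broadcast/index-dispatch modes could sit on top of per-consumer buffers, with the concrete routing rule by index an assumption:

use std::collections::VecDeque;

// How an edge delivers data downstream; the mode names follow the description above.
pub enum DispatchMode {
    Split,      // distribute items round-robin across consumers
    Broadcast,  // every consumer sees every item
    ByIndex,    // route by an index key, preserving per-index order (assumed rule)
}

pub struct Edge<T> {
    mode: DispatchMode,
    consumers: Vec<VecDeque<T>>, // one buffer per downstream task
    next: usize,                 // round-robin cursor for Split
}

impl<T: Clone> Edge<T> {
    pub fn new(mode: DispatchMode, n_consumers: usize) -> Self {
        Edge { mode, consumers: (0..n_consumers).map(|_| VecDeque::new()).collect(), next: 0 }
    }

    // `index` is the item's routing key (e.g., a video-stream id); used only in ByIndex mode.
    pub fn push(&mut self, item: T, index: usize) {
        match self.mode {
            DispatchMode::Split => {
                let i = self.next;
                self.next = (self.next + 1) % self.consumers.len();
                self.consumers[i].push_back(item);
            }
            DispatchMode::Broadcast => {
                for q in self.consumers.iter_mut() { q.push_back(item.clone()); }
            }
            DispatchMode::ByIndex => {
                let i = index % self.consumers.len(); // stable key -> same consumer, order kept
                self.consumers[i].push_back(item);
            }
        }
    }

    pub fn pop(&mut self, consumer: usize) -> Option<T> {
        self.consumers[consumer].pop_front()
    }
}

fn main() {
    let mut e = Edge::new(DispatchMode::ByIndex, 2);
    e.push("frame-a", 7);
    assert_eq!(e.pop(7 % 2), Some("frame-a"));
}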
(3) Bottleneck tasks are discovered by the following bottleneck detection method:
as shown in fig. 2a), if the output flow rate of the input queue of a task is less than the input flow rate, the task can be considered to be on the bottleneck path, and the last task on the bottleneck path can be considered to be a bottleneck task, which needs to be executed by more physical threads to eliminate the performance bottleneck. One task may be stateful or stateless, for stateless tasks, runtime parallelization is relatively easy, and the key challenge of runtime dynamic parallelization of one task is how to handle stateful tasks. Regarding to the problem, as shown in fig. 2b), we have invented a run-time index-based splitting scheme, a stateful task can only process a full data set related to some indexes, a run-time global scheduling algorithm can dynamically adjust the existing indexes of the task, exchange index-related cache data, and further allocate more physical threads to the stateful task to process the indexed data streams in parallel, so as to improve the throughput rate of the stateful task.
For bottleneck detection, the execution overhead of any task is judged from the change in the amount of data remaining in its input queues before and after a single execution. The execution overhead is calculated as:
C = max(N_1 - N_2, 1)
Cost_{n+1} = Cost_n × S + C × (1 - S)
where N_1 is the minimum amount of data remaining across all input queues before the single execution, N_2 is that minimum after the execution, S is a smoothing coefficient (the method uses 0.7), Cost_n is the previous execution cost, initialized to a large value, and Cost_{n+1} is the current execution cost.
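A minimal sketch of this smoothed overhead estimate; the concrete choice of the initial "maximum value" for the cost is an assumption:

const S: f64 = 0.7; // smoothing coefficient used by the method

// Update the smoothed execution cost after one execution of a task.
// `n1` / `n2`: minimum remaining data across all input queues before / after the execution.
fn update_cost(prev_cost: f64, n1: usize, n2: usize) -> f64 {
    let c = (n1 as f64 - n2 as f64).max(1.0); // C = max(N_1 - N_2, 1)
    prev_cost * S + c * (1.0 - S)             // Cost_{n+1} = Cost_n * S + C * (1 - S)
}

fn main() {
    let mut cost = f64::MAX / 2.0; // "a maximum value" as the initial cost (assumed choice)
    for (n1, n2) in [(10, 7), (9, 5), (8, 8)] {
        cost = update_cost(cost, n1, n2);
        println!("cost = {cost:.3}");
    }
}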
To judge the current application's bottleneck task more accurately, a task's own execution overhead is combined with the execution overheads of its predecessor and successor tasks: the Cost_{n+1} value above is multiplied by a dependency cost coefficient to obtain the context execution cost, where N_3 is the minimum amount of data remaining across all output queues after a single execution of the task (the coefficient's formula is given only as an image in the original publication).
(4) All tasks are traversed in arbitrary order and their context execution costs are calculated; the task with the largest context execution cost is the global bottleneck task. If the bottleneck task is a computing task on the CPU, it is parallelized, by plain replication or by index-based replication depending on whether it is stateful, eliminating the bottleneck. If the bottleneck task is a computing task on the GPU, then, to avoid creating video-memory fragmentation, the timeout for collecting batched input data is increased so that batches execute as large as possible, improving the bottleneck task's throughput.
All tasks whose context execution cost is below 0.5 times that of the bottleneck task are called low-overhead tasks. If a low-overhead task is a stateful computing task on the CPU, its index-based replicas are aggregated into a single computing task with the same unique identifier, reducing system load. If the low-overhead task is a computing task on the GPU, the batching timeout is reduced so that the current batch executes as soon as possible, reducing the task's latency.
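A sketch of this scheduling decision; the trait and its method names (parallelize, merge_replicas, the batch-timeout adjustments) are assumed stand-ins for the mechanisms described above, not the patent's actual API:

// Hypothetical scheduling-facing view of a task.
trait SchedTask {
    fn context_cost(&self) -> f64;
    fn device(&self) -> Device;
    fn is_stateful(&self) -> bool;
    fn parallelize(&mut self);          // replicate (by index if stateful)
    fn merge_replicas(&mut self);       // aggregate index-based replicas back together
    fn grow_batch_timeout(&mut self);   // larger batches -> higher throughput
    fn shrink_batch_timeout(&mut self); // smaller batches -> lower latency
}

#[derive(Clone, Copy, PartialEq)]
enum Device { Cpu, Gpu }

fn rebalance(tasks: &mut [Box<dyn SchedTask>]) {
    // 1. The task with the largest context execution cost is the global bottleneck.
    let bottleneck = (0..tasks.len())
        .max_by(|&a, &b| tasks[a].context_cost().total_cmp(&tasks[b].context_cost()))
        .expect("non-empty graph");
    let max_cost = tasks[bottleneck].context_cost();

    // 2. Speed up the bottleneck; relax tasks whose cost is below 0.5x the bottleneck's.
    for (i, t) in tasks.iter_mut().enumerate() {
        if i == bottleneck {
            match t.device() {
                Device::Cpu => t.parallelize(),
                Device::Gpu => t.grow_batch_timeout(),
            }
        } else if t.context_cost() < 0.5 * max_cost {
            match t.device() {
                Device::Cpu if t.is_stateful() => t.merge_replicas(),
                Device::Gpu => t.shrink_batch_timeout(),
                _ => {}
            }
        }
    }
}

fn main() {} // SchedTask implementations omitted; this is a scheduling fragment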
(5) The tasks are wrapped as coroutine tasks and submitted to a coroutine scheduler for execution. Because computer vision applications contain many light logic-control tasks, coroutine scheduling avoids the larger overhead of threads.
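The patent does not name a coroutine runtime; the following sketch assumes the tokio crate to show a task's exec loop wrapped as a cheap cooperative coroutine instead of an OS thread:

// assumed dependency: tokio = { version = "1", features = ["full"] }

// Wrap one pipeline task's execution loop as a coroutine.
async fn run_task(mut exec_once: impl FnMut()) {
    loop {
        exec_once();                    // one call to the task's exec() interface
        tokio::task::yield_now().await; // cooperatively yield to other coroutine tasks
    }
}

#[tokio::main]
async fn main() {
    // Each pipeline task becomes one cheap coroutine task rather than one OS thread.
    let handle = tokio::spawn(run_task(|| { /* task.exec() would go here */ }));
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    handle.abort(); // stop the demo coroutine
}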
(6) When the coroutine scheduler invokes a coroutine task, the task's input queues are traversed and the dynamic batching algorithm is applied in turn, determining each input's current batch size and the specific input data in the batch; the task's exec interface is then called. The dynamic batching algorithm works as follows:
As shown in FIG. 3, given a specified global delay expectation, the algorithm computes, for each datum, its remaining waiting time in the current task from that delay expectation and the time already spent. If the remaining waiting time is 0, the datum is executed immediately. If it is not 0 and the current task has not yet accumulated enough input data for a batch, the datum waits for the next round. If it is not 0 but enough input data has accumulated, batches are executed in ascending order of remaining waiting time, and the remaining data waits for the next round.
The remaining waiting time T_r is calculated as:
W_before = Σ_{i=1}^{n-1} W_i
T_before = Σ_{i=1}^{n-1} T_i
T_r = T_e × (W_before + W_n) - T_before - T_n
where T_n is the average execution time of the n-th task, T_total is the average delay of all data through the whole flow, W_n is the delay weight of the n-th task, updated from T_n and T_total whenever T_n is updated (its update formula is given only as an image in the original publication), T_e is the expected delay, W_before is the cumulative delay-weight sum of tasks 1 through n-1, i.e. the delay weights a datum has accumulated on reaching the n-th task, and T_before is the sum of the delays a datum has experienced before reaching the n-th task.
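A sketch of the remaining-wait-time computation; the weight rule W_n = T_n / T_total is an assumption (the original formula survives only as an image), while the cumulative sums follow the stated definitions:

// Per-task average execution time measured at runtime (T_n).
struct Stage { avg_exec: f64 }

// Remaining waiting time of a datum that has reached stage `n` (0-based).
// `t_total`: average end-to-end delay; `t_e`: the global delay expectation.
fn remaining_wait(stages: &[Stage], n: usize, t_total: f64, t_e: f64) -> f64 {
    let w = |i: usize| stages[i].avg_exec / t_total;              // W_i = T_i / T_total (assumed)
    let w_before: f64 = (0..n).map(w).sum();                      // W_before = sum of W_i, i < n
    let t_before: f64 = (0..n).map(|i| stages[i].avg_exec).sum(); // T_before = sum of T_i, i < n
    let t_r = t_e * (w_before + w(n)) - t_before - stages[n].avg_exec;
    t_r.max(0.0) // T_r == 0 means: execute immediately; otherwise the datum may wait for a fuller batch
}

fn main() {
    let stages = [Stage { avg_exec: 5.0 }, Stage { avg_exec: 20.0 }, Stage { avg_exec: 5.0 }];
    let t_total = 30.0; // average delay across the whole flow
    let t_e = 45.0;     // global delay expectation
    for n in 0..stages.len() {
        println!("stage {n}: T_r = {:.1}", remaining_wait(&stages, n, t_total, t_e));
    }
}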
(7) Steps (3) to (5) are repeated.

Claims (1)

1. A heterogeneous platform task bottleneck elimination method for computer vision applications, comprising the steps of:
(1) splitting a computer vision application into a plurality of semantically independent tasks, each of which must implement a predefined uniform interface;
(2) connecting the tasks through queues, where each queue provides index-based splitting, broadcasting, aggregation, and order-preserving functions; each task has several input queues and several output queues, shared between tasks, so that the application is organized as a directed graph;
(3) discovering bottleneck tasks, as follows:
1) if the outflow rate of a task's input queue is lower than its inflow rate, the task is considered to be on a bottleneck path, and the last task on the bottleneck path is considered a bottleneck task; for any task, its execution overhead is judged from the change in the amount of data remaining in its input queues before and after a single execution, calculating the task execution overhead C:
C = max(N_1 - N_2, 1)
Cost_{n+1} = Cost_n × S + C × (1 - S)
where N_1 is the minimum amount of data remaining across all input queues before the single execution, N_2 is that minimum after the execution, S is a smoothing coefficient, S = 0.7, Cost_n is the previous execution cost, initialized to a large value, and Cost_{n+1} is the current execution cost;
2) to judge the current application's bottleneck task more accurately, the task's own execution overhead is combined with the execution overheads of the tasks it depends on before and after it, giving the context execution cost: the Cost_{n+1} value is multiplied by a dependency cost coefficient, where N_3 is the minimum amount of data remaining across all output queues after a single execution of the task (the coefficient's formula is given only as an image in the original publication);
(4) processing the bottleneck task, as follows: all tasks are traversed in arbitrary order and their context execution costs are calculated; the task with the largest context execution cost is the global bottleneck task; if the bottleneck task is a computing task on the CPU, it is parallelized, by plain replication or by index-based replication depending on whether it is stateful, eliminating the bottleneck; if the bottleneck task is a computing task on the GPU, the timeout for batching input data is increased to improve the bottleneck task's throughput; all tasks whose context execution cost is below 0.5 times that of the bottleneck task are called low-overhead tasks; if a low-overhead task is a stateful computing task on the CPU, its index-based replicas are aggregated into a single computing task with the same unique identifier, reducing system load; if the low-overhead task is a computing task on the GPU, the batching timeout is reduced so that the current batch executes as soon as possible, reducing the task's latency;
(5) wrapping the tasks as coroutine tasks and submitting them to a coroutine scheduler for execution;
(6) when the coroutine scheduler invokes a coroutine task, the task's input queues are traversed in arbitrary order and dynamic batching is applied in turn, determining each input's current batch size and the specific input data in the batch according to the remaining waiting time, after which the task is executed; the remaining waiting time T_r is calculated as:
W_before = Σ_{i=1}^{n-1} W_i
T_before = Σ_{i=1}^{n-1} T_i
T_r = T_e × (W_before + W_n) - T_before - T_n
where T_n is the average execution time of the n-th task, T_total is the average delay of all data through the whole flow, W_n is the delay weight of the n-th task, updated from T_n and T_total whenever T_n is updated (its update formula is given only as an image in the original publication), T_e is the expected delay, W_before is the cumulative delay-weight sum of tasks 1 through n-1, i.e. the delay weights a datum has accumulated on reaching the n-th task, and T_before is the sum of the delays a datum has experienced before reaching the n-th task.
CN202111008450.1A 2021-08-31 2021-08-31 Heterogeneous platform task bottleneck eliminating method for computer vision application Active CN113806044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111008450.1A CN113806044B (en) 2021-08-31 2021-08-31 Heterogeneous platform task bottleneck eliminating method for computer vision application

Publications (2)

Publication Number Publication Date
CN113806044A (en) 2021-12-17
CN113806044B CN113806044B (en) 2023-11-07

Family

ID=78941988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111008450.1A Active CN113806044B (en) 2021-08-31 2021-08-31 Heterogeneous platform task bottleneck eliminating method for computer vision application

Country Status (1)

Country Link
CN (1) CN113806044B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160171330A1 (en) * 2014-12-15 2016-06-16 Reflex Robotics, Inc. Vision based real-time object tracking system for robotic gimbal control
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
CN111782355A (en) * 2020-06-03 2020-10-16 上海交通大学 Cloud computing task scheduling method and system based on mixed load
CN112612615A (en) * 2020-12-28 2021-04-06 中孚安全技术有限公司 Data processing method and system based on multithreading memory allocation and context scheduling
CN112905317A (en) * 2021-02-04 2021-06-04 西安电子科技大学 Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform
CN113191945A (en) * 2020-12-03 2021-07-30 陕西师范大学 High-energy-efficiency image super-resolution system and method for heterogeneous platform
CN113238837A (en) * 2020-07-10 2021-08-10 北京旷视科技有限公司 Computing flow chart construction method, computing efficiency optimization method, computing efficiency construction device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高原; 顾文杰; 丁雨恒; 彭晖; 陈泊宇; 顾雯轩: "Design and implementation of CPU-GPU cooperative scheduling algorithm in heterogeneous clusters" (异构集群中CPU与GPU协同调度算法的设计与实现), Computer Engineering and Design (计算机工程与设计), no. 02

Also Published As

Publication number Publication date
CN113806044B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US10373053B2 (en) Stream-based accelerator processing of computational graphs
Ousterhout et al. Monotasks: Architecting for performance clarity in data analytics frameworks
US10089142B2 (en) Dynamic task prioritization for in-memory databases
CN110717574B (en) Neural network operation method and device and heterogeneous intelligent chip
US10366084B2 (en) Optimizing pipelining result sets with fault tolerance in distributed query execution
CN112052081B (en) Task scheduling method and device and electronic equipment
WO2017005115A1 (en) Adaptive optimization method and device for distributed dag system
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
WO2016041126A1 (en) Method and device for processing data stream based on gpu
CN104346220A (en) Task scheduling method and system
CN114217930A (en) Accelerator system resource optimization management method based on mixed task scheduling
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
Chakrabarti et al. Resource scheduling for parallel database and scientific applications
US10592473B2 (en) Method for improving energy efficiency of map-reduce system and apparatus thereof
CN113806044B (en) Heterogeneous platform task bottleneck eliminating method for computer vision application
CN108710640B (en) Method for improving search efficiency of Spark SQL
Tang et al. A network load perception based task scheduler for parallel distributed data processing systems
Nasr et al. Task scheduling algorithm for high performance heterogeneous distributed computing systems
CN113076181B (en) Data processing flow optimization method, system and storage medium
JP2023544911A (en) Method and apparatus for parallel quantum computing
CN113094155B (en) Task scheduling method and device under Hadoop platform
Zheng et al. Conch: A cyclic mapreduce model for iterative applications
Liu et al. Multivariate modeling and two-level scheduling of analytic queries
CN111984398A (en) Method and computer readable medium for scheduling operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant