CN113935472A - Model scheduling processing method, device, equipment and storage medium - Google Patents

Model scheduling processing method, device, equipment and storage medium

Info

Publication number
CN113935472A
Authority
CN
China
Prior art keywords
network model
executing
task
model
scheduling
Prior art date
Legal status
Pending
Application number
CN202111299696.9A
Other languages
Chinese (zh)
Inventor
Zhang Haijun
Zhu Yaping
Yao Wenjun
Current Assignee
University of Science and Technology of China (USTC)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111299696.9A
Publication of CN113935472A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Abstract

The application provides a model scheduling processing method, device, equipment and storage medium. The method comprises: determining performance parameters of each network model according to running log data of each network model when executing a single-path serial task and running log data of each network model when executing a multi-path parallel task; and determining a scheduling strategy for each network model according to the performance parameters of each network model and running log data of each network model when executing a target task. By analyzing the performance parameters of the network models together with the running log data recorded while the target task is executed, the scheme realizes analysis and adjustment of model scheduling and improves the performance of applications in which multiple network models work cooperatively.

Description

Model scheduling processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model scheduling processing method, apparatus, device, and storage medium.
Background
With the continuous improvement of hardware computing power and the increasingly complex and diverse application scenarios, the structure and parameter count of neural network models keep growing, and application scenarios in which multiple network models work cooperatively keep emerging. For example, in an OCR-based translation application scenario, the task flow involves OCR text recognition and text translation, where the OCR text recognition may in turn involve image segmentation, image warping, image recognition and the like, so that several neural network models are used in the whole service and there are scheduling dependencies in time sequence among these models. In such an application with multiple network models being scheduled, the overall performance of the application depends not only on the performance of the model operators themselves but is also influenced by the model scheduling overhead.
For example, even though each individual model has excellent processing performance, a complex application scenario requires a large number of models to be scheduled to work together; if the model scheduling is unreasonable, individual models cannot play their role in time, and multiple models may even compete with one another for device resources, so that the application cannot run properly.
Therefore, analyzing and optimizing the model scheduling strategy in applications where multiple network models work cooperatively is also key to improving the overall performance of the application. At the current stage, optimization of multi-network-model applications mainly focuses on analyzing and optimizing the performance of operators inside the models, while analysis and adjustment of model scheduling are largely ignored, so the achievable performance improvement of such applications is limited.
Disclosure of Invention
In view of the above technical situation, the application provides a model scheduling processing method, device, equipment and storage medium, which can realize analysis and adjustment of model scheduling and help improve the performance of applications in which multiple network models work cooperatively.
A model scheduling processing method comprises the following steps:
determining performance parameters of each network model according to running log data of each network model when executing a single-path serial task and running log data of each network model when executing a multi-path parallel task;
and determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
Optionally, the performance parameter of the network model includes an acceleration ratio of the network model when executing the multiple parallel tasks; the method further comprises the following steps:
determining ideal data processing performance of each network model when executing multi-path parallel tasks according to running log data of each network model when executing single-path serial tasks and performance parameters of each network model;
the determining the scheduling strategy of each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
and determining the scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, with the goal of reaching the ideal data processing performance when executing the target task.
Optionally, the running log data of each network model when executing the single-path serial task and the multi-path parallel task respectively includes inference time consumption of the network model when executing the single-path serial task and inference time consumption of the network model when executing the multi-path parallel task; the performance parameters of the network model comprise the acceleration ratio of the network model when the network model executes a plurality of paths of parallel tasks;
the determining, according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task, the acceleration ratio of each network model when executing the multi-path parallel task comprises:
and calculating and determining the acceleration ratio of the network model when executing the multi-path parallel task of the set dimension input data according to the inference time consumption of the network model when executing the single-path serial task and the inference time consumption of the network model when executing the multi-path parallel task of the set dimension input data.
Optionally, the performance parameter of the network model includes an acceleration ratio of the network model when executing the multiple parallel tasks; running log data when the network model executes the target task comprises network model reasoning time consumption information and input data dimension information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, wherein the scheduling strategy comprises the following steps:
determining the average acceleration ratio and the average input data dimension information of the network model when the network model executes the target task according to the running log data of the network model when the network model executes the target task;
and determining the scheduling frequency of the network model when the target task is executed according to the performance parameters of the network model, and the average acceleration ratio and the average input data dimension information of the network model when the target task is executed.
Optionally, determining an average acceleration ratio and average input data dimension information of the network model when the network model executes the target task according to the running log data of the network model when the network model executes the target task, where the determining includes:
determining the acceleration ratio of the network model when being called each time in the process of executing the target task according to the inference time consumption and the input data dimension information of the network model when being called each time in the process of executing the target task and the inference time consumption of the network model when executing the single-path serial task;
calculating and determining the average acceleration ratio of the network model when the network model executes the target task according to the acceleration ratio of the network model when the network model is called each time in the process of executing the target task;
and calculating and determining average input data dimension information of the network model in the process of executing the target task according to the input data dimension information of the network model every time the network model is called in the process of executing the target task.
Optionally, the performance parameters of the network model further include response delay performance of the network model, and the running log data of the network model when executing the target task further includes input completed data information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, and further comprising:
determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task;
and determining a scheduling strategy for each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model.
Optionally, the performance parameter of the network model further includes a type of the network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, and further comprising:
determining a network model collocation strategy which runs in parallel according to the type of each network model;
the determining the scheduling strategy of each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model comprises the following steps:
and determining the scheduling strategy of each network model according to the scheduling frequency of each network model, the scheduling priority of each network model and the network model collocation strategy which runs in parallel.
Optionally, the performance parameter of the network model includes response delay performance of the network model, and the running log data of the network model when executing the target task includes input completed data information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, wherein the scheduling strategy comprises the following steps:
and determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task.
Optionally, the performance parameter of the network model includes a type of the network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, wherein the scheduling strategy comprises the following steps:
and determining a network model collocation strategy which runs in parallel according to the type of each network model.
A model scheduling processing apparatus comprising:
the model analysis unit is used for determining the performance parameters of each network model according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task;
and the strategy making unit is used for determining the scheduling strategy of each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
A model schedule processing apparatus comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is used for realizing the model scheduling processing method by operating the program in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the model scheduling processing method described above.
With the model scheduling processing method described above, the performance parameters of each network model can be determined according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task; a scheduling strategy for each network model is then determined according to the performance parameters of each network model and the running log data of each network model when executing the target task. By analyzing the performance parameters of the network models together with the running log data recorded while the target task is executed, the scheme realizes analysis and adjustment of model scheduling and improves the performance of applications in which multiple network models work cooperatively.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic diagram of an application scenario in which multiple network models work together according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a model scheduling processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an inference timing diagram provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of another inference timing diagram provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a timing diagram for another inference provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a model scheduling processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a model scheduling processing device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for an application scene of multi-network model cooperative work, in particular to an application scene of scheduling the multi-network model in the application scene. By adopting the technical scheme of the embodiment of the application, the scheduling efficiency of the multiple network models can be improved, and each network model can work cooperatively according to a scientific and reasonable scheduling strategy, so that the performance of the application based on the multiple network models can be further improved.
Fig. 1 shows a schematic diagram of an application scenario in which multiple network models work together. In this application scenario, there are at least one execution device, one scheduler, and a plurality of neural network models. Each neural network model is mapped to an inference engine for final inference execution. The scheduler is a global scheduler: each neural network model is executed according to the scheduling of the scheduler, that is, it is selected by the scheduler and scheduled to run on the execution device. When the scheduler schedules a certain neural network model, the model is dispatched to the execution device and runs by relying on the hardware resources of that device, such as memory resources and computing resources, so as to realize the corresponding model function.
The technical scheme of the embodiment of the application can be particularly applied to the scheduler in the application scene, and the scheduler schedules each neural network model according to the model scheduling strategy determined based on the technical scheme of the embodiment of the application, so that the model scheduling efficiency can be improved, and the model scheduling overhead can be reduced, thereby improving the performance of the application based on the network models.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the present application first provides a model scheduling processing method, as shown in fig. 2, the method includes:
s201, determining performance parameters of each network model according to running log data of each network model when executing a single-path serial task and running log data of each network model when executing a multi-path parallel task.
The operation log data refers to data used for recording changes of various parameters or indexes in the whole operation process of the neural network model, wherein the data content includes but is not limited to information which is helpful for task scheduling and analysis, such as model reasoning time consumption, input data dimension information, input completed data information, task response time and the like.
The model inference time refers to the time consumed from the moment the model is scheduled and starts inference computation to the moment inference is completed.
The input data dimension information mainly refers to the input data Batch, i.e. the number of input data paths. In general, the number of paths of input data corresponds to the leading dimension of the input of the neural network model. For example, if the input shape of a single-path session is (1, 4, 40), 5 paths of session data may be spliced together for performance reasons during calculation; after splicing, the shape is (5, 1, 4, 40), where 5 is the Batch of the model input data.
The input completed data information refers to the dimension of the data whose input has been completed, specifically the number of input data paths whose input is complete. For example, assuming that 5 paths of data are input to the neural network model in total, i.e. Batch is 5, and that 3 of those paths have finished their input, the number of completed input paths is 3. As an exemplary representation, the embodiment of the present application uses a Flush parameter to denote the number of input data paths whose input is complete.
The task response time is the time consumed by the neural network model to respond to a certain task.
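For illustration only, the Batch and Flush bookkeeping described above can be sketched as follows (a minimal Python example; the array shapes and variable names are assumptions introduced here, not part of the embodiment):

import numpy as np

# Five single-path inputs, each of shape (1, 4, 40), as in the example above.
single_inputs = [np.zeros((1, 4, 40)) for _ in range(5)]

# Splice the 5 paths together for performance: shape (5, 1, 4, 40), so Batch = 5.
batch_input = np.stack(single_inputs)

# Suppose only 3 of the 5 paths have finished feeding their data so far:
flush = 3  # Flush = number of input data paths whose input is complete

print(batch_input.shape, flush)  # (5, 1, 4, 40) 3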
According to the embodiment of the application, a log recording function is configured for each neural network model, so that the neural network model can record operation log data in real time in the operation process.
On this basis, each network model of the application is called to execute a single-path serial task and a multi-path parallel task. In the single-path serial task, only one path of data is input to each neural network model and the models are scheduled one after another independently, that is, the models run in series and no two models are scheduled at the same time. In the multi-path parallel task, on the one hand each neural network model has multiple paths of input data, and on the other hand the models are not restricted to being scheduled one at a time, so several models may be scheduled simultaneously, that is, multiple models can run in parallel.
It can be understood that when each model executes a single-path serial task, the situation that multiple models contend for executing equipment resources does not exist, so that each model can exert the theoretically best performance without interference. Therefore, the running log data of the network model when executing the single-path serial task can be used for representing the best working state, namely the ideal working state, which can be exerted by the network model under the condition of no interference.
Each network model executing the multi-path parallel task is equivalent to simulating the working scenario of multiple network models in an actual application, because an application involving multiple network models usually does not run the models one after another but runs several of them in parallel to ensure processing efficiency. Therefore, the running log data of each network model when executing the multi-path parallel task can be used to represent the theoretical working state of the network models when executing multi-path parallel tasks.
As an exemplary implementation, the embodiment of the present application uses the inference timing diagram as shown in fig. 3 and fig. 4 to record the operation log data of the neural network model.
Fig. 3 shows an inference timing diagram of 5 neural network models when a single-path serial task is executed. T1, T2, T3, T4 and T5 respectively represent the inference time consumed by network model 1 to network model 5; the Batch in brackets indicates the Batch size of an inference task, i.e. the dimension of the input data; Flush indicates the input data whose input has been completed in the inference task, i.e. the number of completed input paths; other running log data can be added according to analysis requirements. Gaps between the inference periods of the models indicate that there is scheduling overhead, or that no task is scheduled onto the device because no task is ready. The total time T represents the time from task entry to final completion.
Similarly, fig. 4 shows the inference timing diagram of 5 neural network models when performing multiple parallel tasks. As can be seen from fig. 4, there are not only multiple paths of input data of 5 neural network models, but also parallel operation between the respective neural network models.
Based on the running log data of each neural network model when executing a single-path serial task and the running log data of each neural network model when executing a multi-path parallel task, the embodiment of the application further analyzes the performance of each neural network model and determines the performance parameters of the neural network model.
Specifically, for a certain neural network model, according to the running log data of the neural network model when executing a single-path serial task and the running log data of the neural network model when executing a multi-path parallel task, the performance of the neural network model is analyzed, and the performance parameters of the neural network model are determined.
The performance parameter refers to a parameter or an index capable of representing the performance of the neural network model, and may be, for example, the response delay performance of the model, the acceleration ratio of the model when executing multiple parallel tasks, the type of the model, and the like. According to the method and the device, the running log data of each network model when executing the single-path serial task and the running log data of each network model when executing the multi-path parallel task are analyzed, and the specific values of various performance parameters of each network model are determined, namely, various performance parameter values of each network model are determined.
The response delay performance of the model can be determined by the task response time of the model when the task is executed.
The acceleration ratio of the model when the multi-path parallel task is executed can be determined by comparing the reasoning time consumption of the model when the multi-path parallel task is executed and the reasoning time consumption of the model when the single-path serial task is executed.
The model type can be determined by analyzing the performance of the operators inside the model, and the model types are divided into memory-access intensive models and compute-intensive models. A memory-access intensive model is one in which the ratio of the memory bandwidth the model needs to access to the model's computation amount exceeds a certain threshold; conversely, a model in which this ratio does not exceed the threshold is a compute-intensive model.
It can be understood that after the performance parameters of each network model are determined, the theoretical performance indexes of the network model can be determined.
S202, determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
Specifically, the target task, that is, a task executed to schedule an application of each network model, may be understood as an actual task executed by each network model. Referring to fig. 1, when each network model executes an actual task, the scheduler schedules each network model to run on the execution device according to the task requirement.
When each network model is scheduled to run, each network model records the running log respectively to obtain the running log data, and an inference sequence diagram similar to that shown in fig. 5 can be obtained.
It can be understood that, when scheduling of multiple network models is involved, the resources of the execution device are limited, so several network models may preempt the execution device's resources at the same time; in this case a network model may not exert its theoretical performance when actually executing a task. A network model may also fail to reach its normal performance level for other reasons, so the application performance cannot be brought into full play.
Based on the operation log data of each network model when executing the target task, with reference to the above description, the actual performance of the network model when executing the target task, such as the actual response delay performance, the acceleration ratio, and the like, can be analytically determined.
On this basis, by combining the performance parameters of each network model analyzed and determined in step S201, that is, the theoretical performance of each network model, it may be compared and analyzed whether the network model exerts its theoretical performance when actually executing the target task, and then a scheduling policy for each network model when executing the target task is formulated, and the formulation of the scheduling policy is directed at least that each network model can exert its theoretical performance when executing the target task.
For example, if the analysis determines that a certain network model does not exert its theoretical performance when executing the target task, the scheduling strategy for the network models is adjusted so that this network model can exert its theoretical performance without affecting the performance of the other network models, thereby improving the performance of the application implemented by the network models when executing the target task.
In addition, it should be noted that, because the scheduling strategy is determined on the basis of the running log data of the network models when executing the target task, a scheduling strategy made in a single pass may not schedule every network model to its best performance. In that case the above steps of the embodiment of the present application can be executed iteratively, i.e. the scheduling strategy is updated iteratively, so that the model scheduling strategy is adjusted in real time according to how the network models actually run while executing the target task, and the performance of the network models when executing the target task is steadily improved.
As can be seen from the above description, the model scheduling processing method provided in the embodiment of the present application can determine the performance parameters of each network model according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task; and then, determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task. According to the scheme, the scheduling strategy of each network model is determined by analyzing the performance parameters of the network model and the running log data of each network model when the network model executes the target task, so that the analysis and adjustment of model scheduling are realized, and the performance of the multi-network model cooperative work application is improved.
As an optional implementation manner, in the embodiment of the present application, the speed-up ratio of each network model when executing multiple parallel tasks is determined according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing multiple parallel tasks, that is, the speed-up ratio of each network model when executing multiple parallel tasks is taken as one of the performance parameters of the network model.
On the basis, the method and the device determine the ideal data processing performance of each network model when executing the multi-path parallel task according to the running log data of each network model when executing the single-path serial task and the performance parameters of each network model.
In the embodiment of the present application, the ideal data processing performance is expressed by the data amount that each network model can process in an ideal state, and the larger the data amount that each network model can process, the better the data processing performance. Wherein, the data volume that can be processed is measured by the number of throughput data paths. The more data throughput paths of each network model when the multiple parallel tasks are executed, that is, the more data paths that can be input, the better the data processing performance of each network model when the multiple parallel tasks are executed.
Specifically, the maximum data volume which can be processed by each network model under an expected acceleration ratio is calculated and determined according to the running log data of each network model when executing a single-path serial task and the acceleration ratio of each network model when executing a multi-path parallel task.
The expected acceleration ratio is the expected value of the acceleration ratio of the network models when executing the multi-path parallel task. More specifically, the expected acceleration ratio is calculated from the device occupancy of each network model and the acceleration ratio of each network model when executing the multi-path parallel task. For example, assume there are 3 network models occupying the device in the ratio 1:2:3, and their acceleration ratios when executing the multi-path parallel task are S1, S2 and S3 respectively; then a normalized acceleration ratio S = (1/6)*S1 + (2/6)*S2 + (3/6)*S3 can be calculated, and this normalized acceleration ratio S is the expected acceleration ratio of the network models.
Referring to the running log data of each network model when executing the single-path serial task shown in fig. 3, assume that the total time for the network models to execute the serial task is T, and that the total gap time in the models' operation is Tx, which can be used to represent the time consumed by model scheduling; assume further that the total inference time of the network models is Ts, i.e. T1+T2+T3+T4+T5 = Ts. In an ideal scheduling scenario the total gap time is reduced to 0, i.e. the model scheduling overhead is reduced to 0, and the number of data throughput paths can then be increased by Tx/Ts. Combining this ideal scheduling scenario with the expected acceleration ratio, the maximum number of data throughput paths that the network models can reach when running in parallel is estimated as [(Tx/Ts)+1]*S, and this maximum number of data throughput paths can represent the ideal data processing performance of the network models when executing the multi-path parallel task.
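For illustration only, the estimate above can be sketched as follows (a minimal Python example; the function names and the sample numbers are assumptions introduced here):

def expected_speedup(occupancy_shares, speedups):
    # Normalized (expected) acceleration ratio, weighted by each model's share of device occupancy.
    total = sum(occupancy_shares)
    return sum((share / total) * s for share, s in zip(occupancy_shares, speedups))

def ideal_throughput_paths(total_time_T, total_inference_time_Ts, speedup_S):
    # Maximum number of data throughput paths under ideal scheduling,
    # i.e. with the total gap time Tx = T - Ts reduced to 0: [(Tx/Ts) + 1] * S.
    Tx = total_time_T - total_inference_time_Ts
    return (Tx / total_inference_time_Ts + 1) * speedup_S

# Example: 3 models occupying the device in the ratio 1:2:3 with speedups S1, S2, S3.
S = expected_speedup([1, 2, 3], [4.0, 5.0, 6.0])
max_paths = ideal_throughput_paths(total_time_T=100.0, total_inference_time_Ts=60.0, speedup_S=S)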
Based on the ideal data processing performance, when determining the scheduling policy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, specifically, the scheduling policy for each network model is determined according to the performance parameters of each network model and the running log data of each network model when executing the target task, wherein the target is to achieve the ideal data processing performance when executing the target task.
That is, when a scheduling strategy for the network models is formulated according to their performance parameters and their running log data when executing the target task, it should be ensured that, after the network models are scheduled according to the formulated strategy, the data processing performance of the network models when executing the target task is improved, i.e. the amount of data processed is increased. After the scheduling strategy has been updated at least once, the network models are scheduled to execute the target task according to the updated strategy, so that their data processing performance when executing the target task reaches or approaches the ideal data processing performance.
The ideal data processing performance defines the target and direction for scheduling the network models, so that adjustment of the scheduling strategy has a clear rule to follow; invalid or harmful scheduling of the network models can thus be avoided, and the models' working efficiency is not harmed by an unreasonable scheduling strategy.
The embodiment of the application also provides an exemplary implementation mode for determining the acceleration ratio of the network model when the multi-path parallel task is executed.
In order to determine the acceleration ratio of the network model when executing the multi-path parallel tasks, it is necessary that the time consumed for reasoning when the network model executes the single-path serial tasks is recorded in the running log data when each network model executes the single-path serial tasks, and meanwhile, the time consumed for reasoning when the network model executes the multi-path parallel tasks is recorded in the running log data when each network model executes the multi-path parallel tasks.
On the basis, the acceleration ratio of each network model when executing the multi-path parallel task is determined according to the running log data of each network model when executing the single-path serial task and the running log data of each network model when executing the multi-path parallel task, specifically, the acceleration ratio of each network model when executing the multi-path parallel task of the set dimension input data is calculated and determined according to the inference time consumption of each network model when executing the single-path serial task and the inference time consumption of each network model when executing the multi-path parallel task of the set dimension input data.
Specifically, for each network model, the acceleration ratio of the network model when executing a plurality of parallel tasks of input data with set dimensions is determined by the following processes:
first, it is clear that the existing accelerator card can better utilize the theoretical computing power of the device for the task of inputting data Batch. Therefore, the embodiment of the application completes the acceleration ratio analysis of each model according to the size of the Batch of the network model input data, namely, the acceleration ratio of the network model under the specific input data Batch is analyzed.
Considering that the accelerator card is generally more friendly to the calculation task that the Batch dimension is a multiple of 8, the network model acceleration ratio of 8, 16, 32 and so on of the input data Batch can be analyzed.
For example, assume that, according to the running log data of the network model when executing the parallel task with input data Batch = 8, the inference time of the neural network model when executing the multi-path parallel task with Batch = 8 is determined to be T1'; meanwhile, according to the running log data of the network model when executing the serial task with input data Batch = 1, the inference time of the network model when executing the single-path serial task with Batch = 1 is determined to be T1. Then the acceleration ratio of the network model when executing the multi-path parallel task with input data Batch = 8 is S8 = T1/T1'.
With reference to the above description, for each network model, it is possible to calculate and determine the acceleration ratio when it performs a multi-path parallel task of input data of arbitrary dimensions. And for each network model participating in scheduling, the acceleration ratio of the network model when executing the multi-path parallel tasks of input data with any dimension can be calculated and determined respectively by referring to the processing.
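For illustration only, the per-Batch acceleration-ratio calculation can be sketched as follows (a minimal Python example following the ratio S8 = T1/T1' as reconstructed above; the log layout and the sample times are assumptions introduced here):

def speedup_for_batch(serial_inference_time, parallel_inference_time_for_batch):
    # Acceleration ratio of one network model under a specific input data Batch,
    # e.g. S8 = T1 / T1' for Batch = 8.
    return serial_inference_time / parallel_inference_time_for_batch

T1 = 12.0                            # inference time on the single-path serial task (Batch = 1)
parallel_times = {8: 6.0, 16: 9.0}   # illustrative inference times at Batch = 8 and Batch = 16
S8 = speedup_for_batch(T1, parallel_times[8])   # = 2.0 in this illustration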
In the following, the present application introduces, through a plurality of embodiments, specific implementation manners for determining a scheduling policy for each network model according to performance parameters of each network model and running log data of each network model when executing a target task.
In order to ensure clear discussion logic, different scheduling policy making schemes are introduced in different embodiments, and it should be understood that any one of the scheduling policy determining schemes provided in the present application may be selected or one or more of the scheduling policies provided in the present application may be combined to determine a final scheduling policy when the technical scheme of the embodiments of the present application is actually implemented.
The first scheduling strategy making embodiment:
according to the method and the device, when the performance parameters of each network model are determined according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task, at least the acceleration ratio of the network models when executing the multi-path parallel task is determined.
Meanwhile, in the running log data of each network model when the target task is executed, at least the reasoning time consumption information of the network model and the input data dimension information of the network model, namely the input data Batch of the network model, are recorded.
On the basis, the following steps A1-A2 are carried out to realize the formulation of the scheduling strategy of each network model:
and A1, determining the average acceleration ratio and the average input data dimension information of the network model when the network model executes the target task according to the running log data of the network model when the network model executes the target task.
Specifically, the inference time of the network model and the input data dimension information are recorded in the running log data of the network model when the network model executes the target task. On this basis, the average acceleration ratio of the network model in executing the target task, and the average input data Batch, are determined by performing the following steps A11-A13:
and A11, determining the acceleration ratio of the network model when the network model is called each time in the process of executing the target task according to the inference time consumption and the input data dimension information of the network model when the network model is called each time in the process of executing the target task and the inference time consumption of the network model when the network model executes the single-path serial task.
Specifically, in the process of executing the target task, when the network model is called each time, the inference time and the input data Batch of the model can be determined according to the corresponding running log data. Then, with reference to the above embodiment, in combination with the inference time consumed by the network model when executing the one-way serial task, the acceleration ratio of the network model when being called this time may be calculated and determined, and more specifically, the acceleration ratio of the network model under a specific Batch may be calculated and determined.
And A12, calculating and determining the average acceleration ratio of the network model when executing the target task according to the acceleration ratio of the network model when being called each time in the process of executing the target task.
Specifically, the average value of the acceleration ratios of the network model when being called each time in the process of executing the target task is calculated, so that the average acceleration ratio of the network model when executing the target task can be determined.
And A13, calculating and determining average input data dimension information of the network model in the process of executing the target task according to the input data dimension information of the network model each time the network model is called in the process of executing the target task.
Specifically, an average value is calculated for the input data Batch of the network model when the network model is called each time in the process of executing the target task, and then average input data dimension information, namely the average input data Batch, of the network model in the process of executing the target task can be obtained.
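For illustration only, steps A11 to A13 can be sketched as follows (a minimal Python example; the per-call log record layout is an assumption introduced here):

def average_speedup_and_batch(call_logs, serial_inference_time):
    # call_logs: one record per time the model was called while executing the target task.
    speedups = [serial_inference_time / rec["inference_time"] for rec in call_logs]  # step A11
    avg_speedup = sum(speedups) / len(speedups)                                      # step A12
    avg_batch = sum(rec["batch"] for rec in call_logs) / len(call_logs)              # step A13
    return avg_speedup, avg_batch

call_logs = [{"batch": 4, "inference_time": 6.0},
             {"batch": 6, "inference_time": 8.0}]
avg_speedup, avg_batch = average_speedup_and_batch(call_logs, serial_inference_time=12.0)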
By performing the processing of steps A11, A12 and A13 for each network model, the average acceleration ratio and the average input data dimension information of that network model when executing the target task can be determined. Based on the average acceleration ratio and the average input data Batch of each network model, the scheduling strategy corresponding to the network model can be determined through the following step A2:
A2, determining the scheduling frequency of the network model when executing the target task according to the performance parameters of the network model and the average acceleration ratio and average input data dimension information of the network model when executing the target task.
Specifically, for a certain network model, if the average acceleration ratio of the network model during execution of the target task is smaller than the acceleration ratio determined in advance as its performance parameter, and the average input data Batch of the network model during execution of the target task is also small, for example smaller than the input data Batch corresponding to that predetermined acceleration ratio, then the input data Batch of the network model should be increased as much as possible when the network model is scheduled.
Because the average acceleration ratio of the network model is low and the average input data Batch is low during the process of executing the target task, the possible reason is that the network model is frequently scheduled during the process of executing the target task, so that the input data Batch of the network model has insufficient time to be pieced together, and the scheduling policy of the network model is to reduce the scheduling frequency of the network model.
For example, assume that, when the performance parameters of network model a are determined according to the foregoing embodiment, its acceleration ratio is determined to be 5 when Batch = 8; however, from the running log data of network model a when executing the target task, its average acceleration ratio is calculated to be 2 and its average input data dimension to be Batch = 5. It can then be considered that the acceleration ratio of network model a when executing the target task fails to reach 5 because its input data Batch is too small, and the input data Batch is too small possibly because network model a is scheduled too frequently during execution of the target task, so that its input data never has time to accumulate to a Batch of 8. Therefore, when network model a is scheduled, the scheduling frequency should be reduced so that its input data Batch has enough time to be assembled, allowing network model a to obtain a higher acceleration ratio under a larger Batch.
The scheduling frequency of each network model is determined according to the method, so that each network model can be scheduled at a reasonable frequency, and the phenomenon that the network model cannot exert the performance of the network model due to frequent scheduling of the network model is avoided. When the scheduling frequency of the network model is more reasonable, the performance of the network model when being scheduled at a single time can be respectively improved, and therefore the application performance can be integrally improved.
In addition, if the network model is already scheduled at a low frequency when executing the target task but its acceleration ratio and input data Batch are still low, then, in order to increase the input data Batch of the network model, the scheduling frequency should be kept unchanged and the number of external data throughput paths should be increased, so that the network model can receive more paths of input data per unit time.
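For illustration only, the frequency adjustment described above can be sketched as follows (a minimal Python example; the thresholds and the returned strings are assumptions introduced here):

def adjust_scheduling(avg_speedup, avg_batch, ref_speedup, ref_batch, scheduled_frequently):
    # Compare the model's behaviour on the target task with its predetermined performance parameters.
    if avg_speedup < ref_speedup and avg_batch < ref_batch:
        if scheduled_frequently:
            return "reduce scheduling frequency so a larger input Batch can accumulate"
        return "keep scheduling frequency, increase the number of external data throughput paths"
    return "keep the current scheduling"

# Example from the description: reference speedup 5 at Batch = 8, observed average speedup 2 at Batch = 5.
print(adjust_scheduling(avg_speedup=2, avg_batch=5, ref_speedup=5, ref_batch=8, scheduled_frequently=True))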
The second scheduling strategy making embodiment:
the embodiment of the application sets that when the performance parameters of each network model are determined according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task, at least the response delay performance of the network model is determined. The response delay performance of the network model can be represented by the task response time of the network model when executing the single-path serial task. Since the network model is the response time without interference when executing the single-path serial task, the response time can be considered as the limit response time of the network model, that is, the response time can be used for representing the fastest response delay performance of the network model.
Meanwhile, in the running log data of each network model when the target task is executed, at least the input completed data information of the network model is recorded. The input completed data information is mainly the number of input completed data paths. In the embodiment of the present application, the number of input data paths that have been input to completion is represented by a value of Flush. For example, assuming that the input data Batch of the network model is 5, at a certain time, assuming that the input of 3-way data is completed, Flush at this time is 3.
On this basis, when determining the scheduling policy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, the embodiment of the application may specifically execute the following processing to implement the formulation of the scheduling policy for each network model:
and determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task.
Specifically, in the process of executing the target task by the network models, the network models that have a large influence on the task response time need to be identified. For example, suppose network model B is called many times during execution of the target task and its response delay performance is poor; if network model B cannot be scheduled in time, the task response time is directly affected. Network model B is then a network model that has a large influence on the task response time.
For each network model executing a target task, especially for the network model having a large influence on task response time, in the embodiments of the present application, a scheduling priority for the network model is determined according to corresponding delay performance of the network model and data information that has been input and completed when the network model executes the target task, that is, Flush information.
Specifically, in the process of executing the target task by each network model, if a certain path session of the target task has completed data input, whether the scheduling priority of the network model associated with the path session is high enough is confirmed. If the scheduling priority of the network model associated with the way session is not high enough, a higher scheduling priority is set for the network model. The completed data input is represented by Flush information in the running log data of the network model when the target task is executed.
For example, as shown in fig. 5, assume that at the current moment the scheduling of the network models corresponding to T1, T2 and T3 has been completed, and T4 and T5 correspond to the candidate network model scheduling tasks. Although the Batch of the T4 task is larger at this time, since T5 contains a session whose input has been completed (Flush), T5 rather than T4 should be scheduled preferentially in consideration of the response requirements of response-type tasks.
Therefore, when the scheduling strategy for each network model is determined, the data input conditions of the network models can be examined in real time, and the scheduling priority of each network model is determined in combination with its response delay performance, so that network models whose data input has been completed, especially those that also have a large influence on the task response time, can be scheduled promptly and preferentially; the task response time can thus be shortened and the task response efficiency improved.
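For illustration only, the Flush-aware priority rule can be sketched as follows (a minimal Python example; the candidate record layout is an assumption introduced here):

def pick_next_model(candidates):
    # Prefer candidates that already contain a fully-input (Flush) session; among those,
    # prefer the model with the longer limit response time, since delaying it has the
    # larger impact on task response time; otherwise fall back to the larger Batch.
    return max(candidates, key=lambda c: (c["flush"] > 0, c["response_time"], c["batch"]))

candidates = [
    {"name": "T4", "flush": 0, "batch": 8, "response_time": 3.0},
    {"name": "T5", "flush": 1, "batch": 2, "response_time": 5.0},
]
print(pick_next_model(candidates)["name"])  # "T5", matching the Fig. 5 example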
The third scheduling strategy making embodiment:
the embodiment of the application sets that when the performance parameters of each network model are determined according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task, at least the type of the network model is determined. The performance of the network model is divided into a memory access intensive model and a computation intensive model.
With regard to the type division of the network model, each computing device has theoretical computing power GFlops and theoretical access storage bandwidth M, and the ratio of the theoretical computing power GFlops to the theoretical access storage bandwidth M is the computing access ratio. The memory access intensive model is that the ratio of the overall calculated quantity GFlops of the current model to the memory bandwidth M which needs to be accessed by the model is smaller than the theoretical calculated memory access ratio of the device. On the contrary, if the ratio of the overall calculation quantity GFlops of the model to the memory bandwidth M which the model needs to access is not less than the theoretical calculation memory access ratio of the device, the model is a calculation intensive model.
As another exemplary implementation, as described in the foregoing embodiment, the acceleration ratio of each network model when executing the multi-path parallel task is determined according to the running log data of each network model when executing the single-path serial task and the running log data of each network model when executing the multi-path parallel task. If the acceleration ratio of a certain model is too low, a tool such as nvprof or Nsight is used to perform hotspot occupancy analysis on the model and to analyze the possibility of acceleration; if operator acceleration is limited, the model can be regarded as a memory-access intensive model, otherwise it can be regarded as a compute-intensive model.
When the type of each network model is determined, the compute-to-memory-access ratio calculation described above can be used directly; alternatively, the types of the network models with a low acceleration ratio can be analyzed with the help of tools such as nvprof or Nsight, while the types of the network models with a high acceleration ratio are determined by the compute-to-memory-access ratio calculation.
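For illustration only, the type classification can be sketched as follows (a minimal Python example; the sample numbers are assumptions introduced here, and in practice the model figures would come from profiling, e.g. with nvprof or Nsight, and from the device specification):

def classify_model(model_compute, model_memory_traffic, device_compute, device_memory_bandwidth):
    # Compare the model's compute-to-memory-access ratio with the device's theoretical ratio.
    model_ratio = model_compute / model_memory_traffic
    device_ratio = device_compute / device_memory_bandwidth
    return "memory-access intensive" if model_ratio < device_ratio else "compute intensive"

# Illustrative values only (e.g. GFlops of computation vs. GB of memory traffic).
print(classify_model(model_compute=2.0, model_memory_traffic=4.0,
                     device_compute=10000.0, device_memory_bandwidth=900.0))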
On this basis, when determining the scheduling policy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task, the embodiment of the application may specifically execute the following processing to implement the formulation of the scheduling policy for each network model:
and determining a network model collocation strategy which runs in parallel according to the type of each network model.
Specifically, the parallel-operation network model collocation strategy specifies which types of network models are allowed to run in parallel with each other.
A simple network model collocation strategy is to avoid, as far as possible, multiple computation-intensive models occupying device resources at the same time; when the same task involves multiple devices, the number of computation-intensive models running in parallel should not exceed the number of devices. A more friendly collocation strategy is to let a computation-intensive model and a memory-access-intensive model run in parallel; that is, on the same device, a computation-intensive model and a memory-access-intensive model can be scheduled to run at the same time.
As shown in fig. 4, T2 and T3 have overlapping (parallel) time periods. If T2 and T3 are both computation-intensive models, the parallel periods in fig. 4 may cause preemption of the computing device, which is unfavorable for fully exploiting performance. In this case the scheduling policy, that is, the parallel-operation collocation policy, can be adjusted so that T2 and T3 are not collocated to run in parallel. Assuming that T1 is a memory-access-intensive model, a collocation policy can be formulated such that, while T2 is being scheduled, T1 is scheduled to run in parallel with T2.
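A minimal sketch of such a collocation check is given below; the type labels and the example pairing that mirrors fig. 4 are illustrative assumptions rather than the application's own implementation.

    def can_run_in_parallel(model_a_type: str, model_b_type: str) -> bool:
        """Allow two models to be collocated on the same device only if they are
        not both computation intensive (the 'friendly' collocation strategy)."""
        return not (model_a_type == "computation intensive"
                    and model_b_type == "computation intensive")

    # Mirroring fig. 4: if T2 and T3 are both computation intensive they should not
    # be collocated, whereas a memory-access-intensive T1 may run alongside T2.
    print(can_run_in_parallel("computation intensive", "computation intensive"))    # False
    print(can_run_in_parallel("memory-access intensive", "computation intensive"))  # True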
On the other hand, instead of strictly limiting the collocation policy to allowing only a computation-intensive model and a memory-access-intensive model to run in parallel, a collocation policy that lets computation-intensive network models run in parallel with each other can also be formulated according to the device utilization of the computation-intensive network models.
Specifically, for multiple computation-intensive network models running in parallel, if the operators of each model are developed to use the device at a high utilization rate, contention for device resources will occur; if the operators of each model are developed to occupy about 50% of the device utilization or even less, the parallel computation and task scheduling capabilities of the device can be used to raise the device utilization and accelerate the overall task, in which case the computation-intensive models can be allowed to run in parallel with each other.
That is, a collocation policy allowing computation-intensive network models to run in parallel is formulated on the premise that the sum of the device utilization rates of the parallel-running computation-intensive models does not exceed the capacity of the device.
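As a minimal sketch of this premise, the check below sums the device utilization of the candidate computation-intensive models against an assumed device capacity; the capacity value of 1.0 and the example utilization figures are illustrative.

    def can_collocate_compute_intensive(utilizations, capacity: float = 1.0) -> bool:
        """Allow a set of computation-intensive models to run in parallel only if
        their summed device utilization stays within the device capacity.
        capacity = 1.0 means 100% of the device; both values are illustrative."""
        return sum(utilizations) <= capacity

    # Example: two operators developed for roughly 50% utilization each can be
    # collocated, while two near-full-utilization operators cannot.
    print(can_collocate_compute_intensive([0.5, 0.45]))  # True
    print(can_collocate_compute_intensive([0.9, 0.8]))   # False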
In addition, nvprof can be used to analyze the hardware device information of the computation-intensive models during their parallel time periods, so as to determine whether the collocation of the parallel-running computation-intensive models is a reasonable combination and thereby refine the scheduling policy.
The three embodiments above respectively introduce how to formulate the network model scheduling frequency, the network model scheduling priority, and the parallel-operation network model collocation policy. When the technical solution of the embodiments of the present application is put into practice, the model scheduling policy may be made according to the network model scheduling frequency, the network model scheduling priority, or the parallel-operation collocation policy individually, and is not limited to any single one of the above embodiments; the scheduling policy for each network model may also be determined by combining two or all three of the above embodiments.
Corresponding to the above model scheduling processing method, an embodiment of the present application further provides a model scheduling processing apparatus. As shown in fig. 6, the apparatus includes:
the model analysis unit 100 is configured to determine performance parameters of each network model according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task;
the policy making unit 110 is configured to determine a scheduling policy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
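A minimal structural sketch of these two units is given below; the class and method names are illustrative assumptions rather than the apparatus's actual interfaces.

    class ModelAnalysisUnit:
        """Corresponds to the model analysis unit 100."""
        def determine_performance_params(self, serial_run_logs, parallel_run_logs):
            # Derive per-model performance parameters (e.g. speed-up ratio, response
            # delay performance, model type) from the two kinds of running logs.
            raise NotImplementedError

    class PolicyMakingUnit:
        """Corresponds to the policy making unit 110."""
        def determine_scheduling_policy(self, performance_params, target_task_run_logs):
            # Derive the scheduling policy (scheduling frequency, scheduling priority,
            # parallel-operation collocation) from the performance parameters and the
            # running logs recorded while executing the target task.
            raise NotImplementedError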
As an alternative embodiment, the performance parameters of the network model include the acceleration ratio of the network model when executing multiple parallel tasks; the apparatus further comprises:
the scheduling target analysis unit is used for determining ideal data processing performance of each network model when executing multi-path parallel tasks according to running log data of each network model when executing single-path serial tasks and performance parameters of each network model;
the determining the scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises:
and determining a scheduling strategy for each network model according to the performance parameters of each network model, the ideal data processing performance of each network model when executing the multi-path parallel tasks, and the running log data of each network model when executing the target task.
As an optional implementation manner, the running log data of each network model when executing the single-path serial task and the multi-path parallel task respectively includes the inference time consumption of the network model when executing the single-path serial task and the inference time consumption of the network model when executing the multi-path parallel task; the performance parameters of the network model comprise the acceleration ratio of the network model when the network model executes a plurality of paths of parallel tasks;
determining the speed-up ratio of each network model when executing the multi-path parallel task according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task comprises the following steps:
and calculating and determining the acceleration ratio of the network model when executing the multi-path parallel task of the set dimension input data according to the inference time consumption of the network model when executing the single-path serial task and the inference time consumption of the network model when executing the multi-path parallel task of the set dimension input data.
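A minimal sketch of this calculation is given below; the concrete formula (serial time for the same number of inputs divided by the parallel inference time) and the parameter names are assumptions made for illustration, not the application's own definition.

    def speedup_ratio(n_paths: int, serial_infer_time: float,
                      parallel_infer_time: float) -> float:
        """Speed-up ratio of a network model when executing an n-path parallel task
        with set-dimension input data: the time to process the n inputs one by one
        in single-path serial mode divided by the time to process them together in
        multi-path parallel mode."""
        return (n_paths * serial_infer_time) / parallel_infer_time

    # Example: a 4-path parallel inference taking 30 ms versus 10 ms per serial call.
    print(speedup_ratio(4, 0.010, 0.030))  # 1.33...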
As an alternative embodiment, the performance parameters of the network model include the acceleration ratio of the network model when executing multiple parallel tasks; running log data when the network model executes the target task comprises network model reasoning time consumption information and input data dimension information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
determining the average acceleration ratio and the average input data dimension information of the network model when the network model executes the target task according to the running log data of the network model when the network model executes the target task;
and determining the scheduling frequency of the network model when the target task is executed according to the performance parameters of the network model, and the average acceleration ratio and the average input data dimension information of the network model when the target task is executed.
As an optional implementation manner, determining, according to the running log data of the network model when executing the target task, an average acceleration ratio and average input data dimension information of the network model when executing the target task includes:
determining the acceleration ratio of the network model when being called each time in the process of executing the target task according to the inference time consumption and the input data dimension information of the network model when being called each time in the process of executing the target task and the inference time consumption of the network model when executing the single-path serial task;
calculating and determining the average acceleration ratio of the network model when the network model executes the target task according to the acceleration ratio of the network model when the network model is called each time in the process of executing the target task;
and calculating and determining average input data dimension information of the network model in the process of executing the target task according to the input data dimension information of the network model every time the network model is called in the process of executing the target task.
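The aggregation over the per-call running logs can be sketched as follows; the log field names and the per-call speed-up formula are assumptions about the log format, not the application's own definition.

    def average_speedup_and_dims(call_logs, serial_infer_time: float):
        """call_logs: one record per invocation of the model during the target task,
        each with 'infer_time' (seconds for that call) and 'input_dims' (e.g. the
        batch size of that call); serial_infer_time is the single-path serial
        inference time per input."""
        speedups = [serial_infer_time * log["input_dims"] / log["infer_time"]
                    for log in call_logs]
        avg_speedup = sum(speedups) / len(speedups)
        avg_dims = sum(log["input_dims"] for log in call_logs) / len(call_logs)
        return avg_speedup, avg_dims

    # Example with two invocations of the model during the target task.
    logs = [{"infer_time": 0.030, "input_dims": 4},
            {"infer_time": 0.018, "input_dims": 2}]
    print(average_speedup_and_dims(logs, serial_infer_time=0.010))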
As an optional implementation manner, the performance parameters of the network model further include response delay performance of the network model, and the running log data of the network model when executing the target task further includes data information that has been input and completed;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task further comprises:
determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task;
and determining a scheduling strategy for each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model.
As an optional implementation, the performance parameter of the network model further includes a type of the network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task further comprises:
determining a network model collocation strategy which runs in parallel according to the type of each network model;
the determining the scheduling strategy of each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model comprises the following steps:
and determining the scheduling strategy of each network model according to the scheduling frequency of each network model, the scheduling priority of each network model and the network model collocation strategy which runs in parallel.
As an optional implementation manner, the performance parameter of the network model includes response delay performance of the network model, and the running log data of the network model when executing the target task includes input completed data information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
and determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task.
As an alternative embodiment, the performance parameter of the network model includes a type of the network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
and determining a network model collocation strategy which runs in parallel according to the type of each network model.
Specifically, for the specific working content of each unit of the model scheduling processing apparatus, reference may be made to the corresponding processing steps in the above method embodiments, which are not repeated here.
Another embodiment of the present application further provides a model scheduling processing device. As shown in fig. 7, the device includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the model scheduling processing method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the model scheduling processing device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores a program for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, a random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
The communication interface 220 may include any apparatus that uses a transceiver or the like to communicate with other devices or communication networks, such as an Ethernet network, a radio access network (RAN), a wireless local area network (WLAN), etc.
The processor 210 executes the program stored in the memory 200 and invokes the other devices, so as to implement the steps of the model scheduling processing method provided by the above embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the model scheduling processing method provided in the foregoing embodiment of the present application.
Specifically, the specific working content of each part of the model scheduling processing apparatus and the specific processing content of the computer program on the storage medium when being executed by the processor may refer to the content of each embodiment of the model scheduling processing method, which is not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A model scheduling processing method is characterized by comprising the following steps:
determining performance parameters of each network model according to running log data of each network model when executing a single-path serial task and running log data of each network model when executing a multi-path parallel task;
and determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
2. The method of claim 1, wherein the performance parameters of the network model include a speed-up ratio of the network model in performing the plurality of parallel tasks; the method further comprises the following steps:
determining ideal data processing performance of each network model when executing multi-path parallel tasks according to running log data of each network model when executing single-path serial tasks and performance parameters of each network model;
the determining the scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises:
and determining a scheduling strategy for each network model according to the performance parameters of each network model, the ideal data processing performance of each network model when executing the multi-path parallel tasks, and the running log data of each network model when executing the target task.
3. The method according to claim 1, wherein the running log data of each network model when executing the single-path serial task and the multi-path parallel task respectively comprises inference time consumption of the network model when executing the single-path serial task and inference time consumption of the network model when executing the multi-path parallel task; the performance parameters of the network model comprise the acceleration ratio of the network model when the network model executes a plurality of paths of parallel tasks;
determining the speed-up ratio of each network model when executing the multi-path parallel task according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task comprises the following steps:
and calculating and determining the acceleration ratio of the network model when executing the multi-path parallel task of the set dimension input data according to the inference time consumption of the network model when executing the single-path serial task and the inference time consumption of the network model when executing the multi-path parallel task of the set dimension input data.
4. The method according to any one of claims 1 to 3, wherein the performance parameters of the network model comprise a speed-up ratio of the network model when executing the multiple parallel tasks; running log data when the network model executes the target task comprises network model reasoning time consumption information and input data dimension information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
determining the average acceleration ratio and the average input data dimension information of the network model when the network model executes the target task according to the running log data of the network model when the network model executes the target task;
and determining the scheduling frequency of the network model when the target task is executed according to the performance parameters of the network model, and the average acceleration ratio and the average input data dimension information of the network model when the target task is executed.
5. The method of claim 4, wherein determining the average acceleration ratio and the average input data dimension information of the network model when executing the target task according to the running log data of the network model when executing the target task comprises:
determining the acceleration ratio of the network model when being called each time in the process of executing the target task according to the inference time consumption and the input data dimension information of the network model when being called each time in the process of executing the target task and the inference time consumption of the network model when executing the single-path serial task;
calculating and determining the average acceleration ratio of the network model when the network model executes the target task according to the acceleration ratio of the network model when the network model is called each time in the process of executing the target task;
and calculating and determining average input data dimension information of the network model in the process of executing the target task according to the input data dimension information of the network model every time the network model is called in the process of executing the target task.
6. The method of claim 4, wherein the performance parameters of the network model further include response delay performance of the network model, and the running log data of the network model when executing the target task further includes data information of completed input;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task further comprises:
determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task;
and determining a scheduling strategy for each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model.
7. The method of claim 6, wherein the performance parameters of the network model further include a type of the network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task further comprises:
determining a network model collocation strategy which runs in parallel according to the type of each network model;
the determining the scheduling strategy of each network model at least according to the scheduling frequency of each network model and the scheduling priority of each network model comprises the following steps:
and determining the scheduling strategy of each network model according to the scheduling frequency of each network model, the scheduling priority of each network model and the network model collocation strategy which runs in parallel.
8. The method according to any one of claims 1 to 3, wherein the performance parameters of the network model comprise response delay performance of the network model, and the running log data of the network model when executing the target task comprises the input completed data information;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
and determining the scheduling priority of each network model according to the response delay performance of each network model and the input completed data information of each network model when executing the target task.
9. A method according to any one of claims 1 to 3, characterized in that the performance parameters of the network model comprise the type of network model; the network model is a memory access intensive model or a calculation intensive model;
determining a scheduling strategy for each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task comprises the following steps:
and determining a network model collocation strategy which runs in parallel according to the type of each network model.
10. A model scheduling processing apparatus, comprising:
the model analysis unit is used for determining the performance parameters of each network model according to the running log data of each network model when executing a single-path serial task and the running log data of each network model when executing a multi-path parallel task;
and the strategy making unit is used for determining the scheduling strategy of each network model according to the performance parameters of each network model and the running log data of each network model when executing the target task.
11. A model scheduling processing apparatus, comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is configured to implement the model scheduling processing method according to any one of claims 1 to 9 by executing the program in the memory.
12. A storage medium having stored thereon a computer program which, when executed by a processor, implements the model scheduling processing method according to any one of claims 1 to 9.
CN202111299696.9A 2021-11-04 2021-11-04 Model scheduling processing method, device, equipment and storage medium Pending CN113935472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111299696.9A CN113935472A (en) 2021-11-04 2021-11-04 Model scheduling processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111299696.9A CN113935472A (en) 2021-11-04 2021-11-04 Model scheduling processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113935472A true CN113935472A (en) 2022-01-14

Family

ID=79285688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111299696.9A Pending CN113935472A (en) 2021-11-04 2021-11-04 Model scheduling processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113935472A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023221949A1 (en) * 2022-05-18 2023-11-23 阿里云计算有限公司 Model scheduling method and device, and storage medium


Similar Documents

Publication Publication Date Title
Wu et al. An efficient application partitioning algorithm in mobile environments
US11016673B2 (en) Optimizing serverless computing using a distributed computing framework
US7496683B2 (en) Maximization of sustained throughput of distributed continuous queries
CN102707995B (en) Service scheduling method and device based on cloud computing environments
CN110427256A (en) Job scheduling optimization method, equipment, storage medium and device priority-based
CN110069341B (en) Method for scheduling tasks with dependency relationship configured according to needs by combining functions in edge computing
CN113904923B (en) Service function chain joint optimization method based on software defined network
CN110502321A (en) A kind of resource regulating method and system
Mostafavi et al. A stochastic approximation approach for foresighted task scheduling in cloud computing
CN112860337B (en) Method and system for unloading dependent tasks in multi-access edge computing
CN110928666B (en) Method and system for optimizing task parallelism based on memory in Spark environment
CN114629960B (en) Resource scheduling method, device, system, equipment, medium and program product
CN113935472A (en) Model scheduling processing method, device, equipment and storage medium
JP5108011B2 (en) System, method, and computer program for reducing message flow between bus-connected consumers and producers
CN113608852A (en) Task scheduling method, scheduling module, inference node and collaborative operation system
CN112905317A (en) Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform
CN106874215B (en) Serialized storage optimization method based on Spark operator
CN115543577A (en) Kubernetes resource scheduling optimization method based on covariates, storage medium and equipment
CN112994911B (en) Calculation unloading method and device and computer readable storage medium
CN114980216A (en) Dependent task unloading system and method based on mobile edge calculation
Farhat et al. Towards stochastically optimizing data computing flows
Xu Proactive VNF scaling with heterogeneous cloud resources: fusing long short-term memory prediction and cooperative allocation
Ge et al. Efficient Computation Offloading with Energy Consumption Constraint for Multi-Cloud System
Attiya et al. Task allocation for minimizing programs completion time in multicomputer systems
CN117271101B (en) Operator fusion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230526

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.
