CN116225653A - QOS-aware resource allocation method and device under deep learning multi-model deployment scene - Google Patents

QOS-aware resource allocation method and device under deep learning multi-model deployment scene

Info

Publication number
CN116225653A
CN116225653A
Authority
CN
China
Prior art keywords
task
subtasks
segmentation
deep learning
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310221373.0A
Other languages
Chinese (zh)
Inventor
吴悦文
吴恒
余甜
罗钓寒
张文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202310221373.0A priority Critical patent/CN116225653A/en
Publication of CN116225653A publication Critical patent/CN116225653A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/48Indexing scheme relating to G06F9/48
    • G06F2209/484Precedence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5021Priority
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a QoS-aware resource allocation method and device in a deep learning multi-model deployment scenario. The method comprises the following steps: splitting a deep learning model into a plurality of serially dependent sub-models, so that the corresponding target task is split into a plurality of sub-tasks; inserting the sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes; when a sub-task is about to run, issuing a token for it according to the current numbers of tasks of each type and the attributes of the sub-task, so that the running result of the target task is obtained based on either the deep learning model or the plurality of serially dependent sub-models. The invention effectively solves the problem of short tasks waiting too long behind long tasks, adjusts the resource allocation strategy of tasks with extremely low scheduling overhead, and reduces jitter in task service levels.

Description

QOS-aware resource allocation method and device under deep learning multi-model deployment scene
Technical Field
The invention belongs to the technical field of software, and particularly relates to a QoS (Quality of Service) aware resource allocation method and device in a deep learning multi-model deployment scenario.
Background
Deep learning is widely used as a key technology for lane detection, pedestrian recognition, object tracking, image segmentation, and similar tasks in edge computing scenarios such as transportation, medical care, education, and smart cities. Given communication overhead, privacy and security concerns, and applications' need for low latency, edge computing platforms have become an important way to deploy deep learning applications. Task requests in such scenarios typically come from a specific area, and because the service scenarios are complex, the number of deep learning inference tasks that must be processed simultaneously inevitably keeps growing. On the other hand, any single type of task makes only limited use of the computing resources, and multiple service scenarios must be served; to reduce deployment cost, it is therefore common for multiple deep learning models to share computing resources, as in edge computing boxes such as the TW-T906/TW-T906G and UBoxAI, which provide inference services for multiple applications in specific scenarios on a single GPU. Because the inference latencies of deep learning inference tasks differ greatly and task requests arrive randomly, a long task can force subsequent short tasks to wait too long, so the short tasks face worse QoS. Guaranteeing the QoS of every task request through an effective resource allocation policy therefore remains a challenge.
Existing work discusses the above resource allocation problem at the operation-unit level, the computational-graph level, and the task level, respectively.
Resource allocation at the operation-unit level uses the worker-thread context-switching mechanism of the computing resource (e.g., a GPU), together with techniques such as thread blocking and cyclic blocking, to improve resource utilization when multiple deep learning inference tasks share the computing resource. A widely adopted example is MPS (Multi-Process Service), a multi-process parallel acceleration service developed by NVIDIA for its GPUs since the Kepler architecture. On the one hand, deep learning inference resources are becoming increasingly diverse, such as the Mali GPU introduced by ARM and traditional low-power devices such as FPGAs (field-programmable gate arrays) and DSPs (digital signal processors), so the scenarios to which MPS applies are quite limited; on the other hand, the technique is characterized by promptly invoking idle GPU compute cores to raise resource utilization. Research has shown that although this approach increases overall system throughput, it cannot perceive the QoS of tasks. Random resource allocation brings serious unfairness from the task's point of view, so task QoS cannot be effectively guaranteed. Although some research has tried to correct this shortcoming based on task preemption, a recent work (REEF: Han M, Zhang H, Chen R, et al. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022: 539-558) is still limited to preemptive scheduling based on preset task priorities, i.e., it serves the scenario under the premise of guaranteeing the QoS of specific tasks while essentially ignoring the QoS requirements of the other tasks.
Resource allocation at the computational-graph level (Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, and Xiang Chen. 2021. Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU. In IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2021, Munich, Germany, November 1-4, 2021. IEEE, 1-9. https://doi.org/10.1109/ICCAD51958.2021.9643501) combines multiple deep learning models into one large model with multiple parallel branches, speeding up overall computation by reasonably ordering and fusing the internal computation units. Such work mainly suits scenarios where multiple deep learning tasks are strongly associated and their requests are always generated and invoked synchronously; it does not suit scenarios where multiple types of tasks occur independently and randomly.
Resource allocation at the task level treats a task as the unit of resource management and is insensitive to the structural details of the deep learning models, which simplifies scheduling. Clockwork (Arpan Gujarati, Reza Karimi, Safya Alzayat, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. CoRR abs/2006.02464. https://arxiv.org/abs/2006.02464) uses FCFS task-level scheduling and discards tasks that, on arrival, are predicted to miss their QoS. PREMA (Yujeong Choi and Minsoo Rhu. 2020. PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22-26, 2020. IEEE, 220-233. https://doi.org/10.1109/HPCA47549.2020.00027) implements a token-based online task scheduling algorithm on top of QoS-aware prediction of task completion time, thereby achieving overall optimization. Although such approaches can predict task QoS in advance, they cannot effectively solve the problem of subsequent short tasks waiting too long while a long task occupies the computing resource; the short tasks thus face worse QoS that, although predictable, cannot be effectively adjusted.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a QoS-aware resource allocation method and device for the deep learning multi-model deployment scenario. First, by analyzing the computational graph structure of a deep learning model, partitioning it automatically and uniformly, and guaranteeing the serial dependency among the resulting sub-models, a long task is uniformly split into a plurality of serial sub-tasks; this makes task QoS predictable and at the same time effectively solves the problem of short tasks waiting too long behind long tasks. Second, a microsecond-level task scheduling and ordering strategy is provided, which adjusts the resource allocation of tasks with extremely low scheduling overhead on the basis of continuously monitoring task QoS. Finally, a dynamic decision mechanism adaptively adjusts the working mode of tasks, further reducing jitter in task service levels.
In order to achieve the above object, the present invention includes:
a QOS-aware resource allocation method in a deep learning multi-model deployment scenario, the method comprising:
splitting a deep learning model into a plurality of serially dependent sub-models, so that the corresponding target task is split into a plurality of sub-tasks;
inserting the sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes;
when a sub-task is about to run, issuing a token for the sub-task according to the current numbers of tasks of each type and the attributes of the sub-task, so that the running result of the target task is obtained based on the deep learning model or the plurality of serially dependent sub-models.
Further, the splitting of the deep learning model into a plurality of serially dependent sub-models based on the segmentation overhead and segmentation uniformity of the sub-models includes:
converting the deep learning model into a model in ONNX format;
splitting the model in ONNX format based on set cut points to generate a first-generation segmentation strategy population;
performing effect prediction on the current-generation segmentation strategy population, and obtaining the fitness of each individual in the segmentation strategy population based on the effect prediction results, where the effect prediction results include segmentation overhead and segmentation uniformity;
judging whether the segmentation overhead and segmentation uniformity of the strategy with the optimal fitness meet the criteria;
if the criteria are not met, performing individual selection, crossover, and mutation on the individuals in the segmentation strategy population to obtain the next-generation segmentation strategy population, taking it as the current segmentation strategy population, and returning to the step of performing effect prediction to obtain the fitness of each individual;
if the criteria are met, obtaining the plurality of serially dependent sub-models.
Further, the splitting of the model in ONNX format based on set cut points to generate a first-generation segmentation strategy population includes:
numbering the operators sequentially from 1 to n according to their execution order;
dividing the model in ONNX format into N+1 serial sub-models to obtain the first-generation segmentation strategy population, where N ≥ 0, sub-model M_1 consists of the operators numbered 1 to x_1, sub-model M_i consists of the operators numbered x_(i-1)+1 to x_i, sub-model M_(N+1) consists of the operators numbered x_N+1 to n, and 2 ≤ i ≤ N.
Further, the performing effect prediction on the current segmentation strategy population includes:
obtaining the inference latency latency_M of the deep learning model and the inference latency latency_i of each sub-model;
obtaining the segmentation overhead based on the inference latency latency_M and the inference latencies latency_i;
obtaining the segmentation uniformity based on the inference latencies latency_i.
Further, the fitness of an individual is
[formula image BDA0004116801180000041: individual fitness]
where Cost_raw is the execution time of the original model, std is the standard deviation, N+1 is the number of sub-models, k_1 is the weight coefficient of uniformity, k_2 is the weight coefficient of segmentation overhead, p is a first expected coefficient, q is a second expected coefficient, and overhead is the segmentation overhead.
Further, the inserting of the sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes includes:
obtaining, for each sub-task in the global task queue, the time it has already queued, its predicted queuing time, and the inference latency latency_M of the deep learning model corresponding to the sub-task;
for each candidate position at which the sub-task may be inserted into the global task queue, obtaining the predicted inference latency of each sub-task in the global task queue;
obtaining the response ratio of each sub-task in the global task queue based on the queued time, the predicted queuing time, the inference latency latency_M, and the predicted inference latency;
inserting the sub-task into the global task queue at the position that gives all sub-tasks in the global task queue the lowest overall response ratio.
Further, the issuing of a token for the sub-task according to the current numbers of tasks of each type and the attributes of the sub-task, so as to obtain the running result of the target task based on the deep learning model or the plurality of serially dependent sub-models, includes:
when the current task execution scenario is a first scenario or a second scenario, submitting the sub-task to the deep learning model for execution to obtain the running result of the target task, where the first scenario is that the sub-task is the first part of the target task and the total number of tasks of the target task's type exceeds a set threshold, and the second scenario is that the sub-task is the first part of the target task and the total number of tasks of all task types other than the target task's type exceeds the set threshold;
when the current task execution scenario is neither the first scenario nor the second scenario, handing the sub-task to the corresponding sub-model and obtaining the running result of the target task based on the execution results of the sub-models.
A QOS-aware resource allocation apparatus in a deep learning multi-model deployment scenario, the apparatus comprising:
a uniform segmentation module, configured to split the deep learning model into a plurality of serially dependent sub-models, so that the corresponding target task is split into a plurality of sub-tasks;
a task management module, configured to insert sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes;
a token management module, configured to issue a token for a sub-task according to the current numbers of tasks of each type and the attributes of the sub-task when the sub-task is about to run;
a task processing module, configured to obtain the running result of the target task based on the deep learning model or the plurality of serially dependent sub-models.
An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of the above.
A computer readable storage medium storing a computer program which, when executed by a computer, implements the method of any one of the preceding claims.
Compared with the prior art, the invention has at least the following advantages.
(1) The system continuously perceives the QoS of every task request and takes the low-latency requirement of each task into account;
(2) Based on the model segmentation technique, the problem of short tasks waiting too long behind long tasks is effectively solved, providing a strong scheduling guarantee for improving task QoS;
(3) The service level of each task is stable, with little jitter.
Drawings
Fig. 1 is a schematic system configuration diagram of a resource allocation system according to the present invention.
Fig. 2 is a schematic diagram of an execution flow of the resource allocation system of the present invention.
FIG. 3 is a diagram illustrating the abstract execution of a genetic algorithm in the model segmentation of the resource allocation system of the present invention.
Description of the embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings; the described embodiments are some embodiments of the present invention and do not represent all embodiments.
The QOS-aware resource allocation device in a deep learning multi-model deployment scenario, as shown in Fig. 1, comprises a uniform segmentation module, a request responder, a task management module, a task processing module, and a token management module.
Uniform segmentation tool: invoked in the offline stage. By analyzing the deep learning model to be deployed, it automatically generates a uniform segmentation strategy for the model so that the execution times of the sub-models are consistent, and it automatically adds redundant inputs and outputs to the computational graph of the deep learning model to establish a complete serial dependency among the sub-models. The final goal of the tool is to automatically produce the final deployment strategy and the corresponding deployment configuration file. It consists of a model slicer, an execution predictor, and a deployment decision maker. The following concepts are defined first: (1) segmenting a deep learning model into a plurality of continuous sub-models; (2) segmentation overhead; (3) segmentation uniformity.
1) Splitting the deep learning model into a plurality of sequential sub-models: the computational graph G of the original model M is analyzed and divided into k subgraphs, denoted sub_G_1 ~ sub_G_k; the subgraph structures are then restored into model structures and stored, corresponding in order to sub_M_1 ~ sub_M_k. The partition must satisfy that, for arbitrary data D, taking D as the input of sub_M_1, taking the output of sub_M_p as the input of sub_M_(p+1), and executing sub_M_1 ~ sub_M_k serially in order yields a final result consistent with the output obtained by taking D directly as the input of M;
2) Segmentation overhead: assuming that the inference latency of model M is latency_M and the latency of model sub_M_i is latency_i, the segmentation overhead is expressed as:
[formula image BDA0004116801180000061: segmentation overhead]
3) Segmentation uniformity: assuming the latency of model sub_M_i is latency_i, uniformity refers to how dispersed the latency_i values are and is expressed by their standard deviation.
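By way of illustration only (not part of the patent text), a minimal Python sketch of how segmentation uniformity and segmentation overhead could be computed from measured sub-model latencies: the uniformity follows the standard-deviation definition above, while the exact overhead formula is given only as a figure, so the relative-extra-latency form used here is an assumption.

```python
import statistics

def segmentation_uniformity(sub_latencies):
    # Uniformity is expressed as the standard deviation of the sub-model latencies:
    # the lower the standard deviation, the more uniform the segmentation.
    return statistics.pstdev(sub_latencies)

def segmentation_overhead(latency_m, sub_latencies):
    # Assumed form: relative extra latency introduced by splitting, i.e.
    # (sum of sub-model latencies - original model latency) / original model latency.
    # The patent gives the exact formula only as a figure.
    return (sum(sub_latencies) - latency_m) / latency_m
```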
The three core devices that make up the tool are described in detail as follows:
1) Model slicer: splitting the deep learning model into a plurality of serial dependent sub-models according to a given segmentation scheme;
2) Execution predictor: calculates a predicted inference latency for each sub-model produced by a specific segmentation scheme and evaluates the segmentation overhead of that scheme. The prediction is obtained by cyclically executing the model n times on the resource device in task-exclusive mode; because of the cold-start problem of model loading, the prediction model is:
[formula image BDA0004116801180000062: latency prediction]
where latency_m is the model inference latency measured in the m-th cycle.
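As a purely illustrative sketch (the patent's exact prediction formula is the figure above; the rule below of dropping the first warm-up run and averaging the rest is an assumption):

```python
import time

def predict_latency(run_inference, n=20, warmup=1):
    # Execute the (sub-)model n times in task-exclusive mode and time each run.
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        run_inference()
        samples.append(time.perf_counter() - start)
    # Assumed handling of the cold-start problem: discard the first `warmup`
    # measurements (model loading, cache warm-up) and average the remainder.
    usable = samples[warmup:]
    return sum(usable) / len(usable)
```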
3) Deployment decision maker: comprehensively evaluates the segmentation overhead and segmentation uniformity of the current segmentation scheme. When the expected target values are reached, it adopts the scheme as the final output strategy and creates the corresponding deployment configuration file, which contains the structural information of each sub-model and the associations between the sub-models; otherwise it generates the next group of segmentation schemes based on the heuristic strategy and hands them to the model slicer.
Request responder: invoked in the online stage; receives and registers users' task requests to the system based on protocols such as RPC, and returns the computed result to the user in time after the system finishes processing;
the task management module: and (3) calling in an online stage to complete the data format encapsulation/decapsulation, monitoring, log recording, priority ordering and pushing related work of the user request task. The system consists of a task encapsulator, a task decapsulator, a task manager, a priority sequencer and a log collector, wherein:
1) Task encapsulator: deep learning runtime deployment frameworks typically have their own data representation formats. To decouple from the underlying deployment framework, this component first converts the data format when a task request is received, translating the user's request data into the input format specific to the runtime framework, and then attaches it to an internally defined data structure used to represent and record the various features of a task request for subsequent processing. The data structure contains the input/output data in the format required by the deep learning model, the start and end times of each stage of task processing, the predicted inference time of the deep learning model to be invoked by the task, the actual inference time of each processing stage, the processing progress of the task, the times at which the task was received and completed, and the task number. Any field whose data does not yet exist is temporarily filled with a null value;
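For illustration, a minimal sketch of such an internally defined task record; the field names are assumptions, the patent only enumerates the kinds of information stored:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class TaskRecord:
    task_id: int
    task_type: str
    input_data: Any = None                 # input in the runtime-framework-specific format
    output_data: Any = None                # output, filled in after execution
    stage_starts: List[float] = field(default_factory=list)   # start time of each processing stage
    stage_ends: List[float] = field(default_factory=list)     # end time of each processing stage
    predicted_inference_time: Optional[float] = None          # predicted latency of the model to call
    actual_inference_times: List[float] = field(default_factory=list)  # measured latency per stage
    progress: int = 0                      # index of the next sub-task to run
    received_at: Optional[float] = None    # time the task request was received
    completed_at: Optional[float] = None   # time the task finished
```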
2) Task decapsulator: the inverse of the task encapsulator; it parses all recorded features of the task's processing out of the internally defined data structure and converts the runtime framework's output format into a general format readable by the user;
3) Task manager: manages all pending task requests in the current system and maintains the global priority queue. When the previous computation finishes, it promptly pushes the sub-task with the highest priority to the processing module, monitors the state of each task (waiting, processing, completed), and writes log information into the data structure whenever a task's state changes;
4) Priority sorter: splits a newly added task into a plurality of serial sub-tasks according to the model deployment strategy, calculates a priority number for each sub-task with a response-ratio-based quick sorting algorithm, and finally inserts the sub-tasks into the system priority queue. Specifically, the algorithm inserts the sub-tasks of the new task into the system without changing the existing priority order of the sub-tasks already in the system, and computes an insertion strategy that gives all recorded tasks the lowest overall response ratio under this premise, where the response ratio is calculated as:
[formula image BDA0004116801180000071: response ratio]
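A minimal sketch of this insertion search (illustrative only; the concrete response-ratio formula is the figure above, so the classic waiting-plus-service-over-service form below is an assumption, and the helper callables and the TaskRecord fields are also assumed):

```python
def response_ratio(waited, predicted_wait, predicted_service):
    # Assumed form: (time already waited + predicted remaining wait + service time) / service time.
    return (waited + predicted_wait + predicted_service) / predicted_service

def insert_with_lowest_overall_ratio(queue, new_subtasks, now, predict_wait, predict_service):
    # Try every position that preserves the relative order of the sub-tasks already
    # in the queue, and keep the position minimizing the summed response ratio.
    best_pos, best_total = 0, float("inf")
    for pos in range(len(queue) + 1):
        candidate = queue[:pos] + new_subtasks + queue[pos:]
        total = sum(
            response_ratio(now - t.received_at,
                           predict_wait(candidate, i),   # predicted queuing time at position i
                           predict_service(t))           # predicted inference latency of t
            for i, t in enumerate(candidate)
        )
        if total < best_total:
            best_pos, best_total = pos, total
    return queue[:best_pos] + new_subtasks + queue[best_pos:]
```

For brevity the sketch inserts the new sub-tasks as one contiguous block; the patent's sorter may place them at separate positions.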
5) Log collector: as a unified logging tool, log information generated by all tasks is recorded for subsequent viewing.
The task processing module: invoked in the online stage; actually loads the task request data and the deep learning inference runtime framework and computes the output result. It consists of two parts, a model executor and a controller, wherein:
1) Model executor: has two working modes, "sub-model serial execution" and "original-model execution", used to perform the concrete computation of a task/sub-task. In "sub-model serial execution" mode, only the pushed sub-task is completed; in "original-model execution" mode, the entire task corresponding to the sub-task is processed directly in one pass, which requires that the pushed sub-task be the first part of the entire task.
2) Controller: starts an independent model executor for each type of task in a multithreaded manner and controls their lifecycles, deciding when they move from the blocked state into the execution state while monitoring their state information.
The token management module: invoked in the online stage; issues a token of a specific type for the task pushed by the task management module, designates the task computation mode according to the token's category, and provides a token verification mechanism so that only the model executor corresponding to the task can pass verification. It consists of a segmentation-switch decision maker, a token distributor, and a token verifier, wherein:
1) Segmentation-switch decision maker: issues one of two different types of tokens according to the current number of tasks of each type and the attributes of the currently pushed sub-task, thereby determining the working mode in which the system processes the data. The decision algorithm is as follows:
a) when the current sub-task is the first part of a user's original task (assume its type is T) and the total number of pending tasks of type T exceeds 1, the current sub-task is executed by processing the original model directly;
b) when the total number of pending tasks of all types other than T is greater than 1, the current sub-task is executed by processing the original model directly;
c) in all other cases, the current sub-task is executed in sub-model serial processing mode.
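An illustrative sketch of this decision (attribute names are assumptions; rule (b) additionally assumes the sub-task is the first part of its task, since original-model execution requires that):

```python
def decide_mode(subtask, pending_counts):
    # pending_counts: mapping from task type to the number of pending tasks of that type.
    t = subtask.task_type
    same_type = pending_counts.get(t, 0)
    other_types = sum(n for typ, n in pending_counts.items() if typ != t)
    if subtask.is_first_part and same_type > 1:
        return "original_model"      # rule (a): same-type tasks are queuing up
    if subtask.is_first_part and other_types > 1:
        return "original_model"      # rule (b): other task types are piling up
    return "submodel_serial"         # rule (c): default sub-model serial processing
```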
2) Token distributor: binds the token to the ID of the model executor corresponding to the sub-task, so that only the designated model executor can pass token verification;
3) Token verifier: provides verification of a token against a model executor ID; verification passes only when that ID is bound to the token.
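A minimal sketch (names assumed) of the binding and verification just described:

```python
import secrets

class TokenManager:
    def __init__(self):
        self._bindings = {}   # token value -> (executor_id, working mode)

    def issue(self, executor_id, mode):
        # Token distributor: bind a fresh token to the executor chosen for this sub-task.
        token = secrets.token_hex(8)
        self._bindings[token] = (executor_id, mode)
        return token

    def verify(self, token, executor_id):
        # Token verifier: only the executor ID bound to the token passes.
        bound = self._bindings.get(token)
        return bound is not None and bound[0] == executor_id

    def refresh(self, token):
        # Invalidate the token after use so the next round can issue a new one.
        self._bindings.pop(token, None)
```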
In one embodiment, the QoS aware resource allocation method in a deep learning multi-model deployment scenario of the present invention, as shown in fig. 2, includes steps 001-006 of an offline decision stage and steps 101-121 of an online service stage.
1. Offline decision stage
Step 001: the user exports the deep learning model to be deployed into ONNX format in advance (this step is usually supported directly by development frameworks such as PyTorch and TensorFlow) as the input of the automated uniform model segmentation tool. The tool analyzes each model file independently;
step 002: the model slicer reads the model;
step 003: the model slicer generates the sub-models after slicing according to each individual in the segmentation strategy population and hands them to the execution predictor. The first-generation segmentation strategy population is generated randomly.
Step 004: the execution predictor predicts and evaluates the sub-models submitted by the model slicer, giving the segmentation uniformity and segmentation overhead corresponding to each strategy in the population;
step 005: the deployment decision maker analyzes the prediction results fed back by the execution predictor; when the best scheme reaches the expected values (i.e., the segmentation is uniform enough and the segmentation overhead meets the threshold), the algorithm ends; otherwise the genetic algorithm continues, generating the next-generation segmentation strategy population, which is handed to the model slicer, and steps 003-005 are repeated;
step 006: when the deployment decision maker ends the algorithm, it records the optimal segmentation strategy and writes to disk each corresponding sub-model file together with its description information, which the model executor reads and loads when the online system starts.
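A compact, non-authoritative sketch of the search loop in steps 003-005 (population size, termination thresholds, and the helpers evaluate, meets_threshold, and next_generation are assumptions; next_generation is sketched further below with the genetic-algorithm rules):

```python
import random

def offline_search(model, n_cuts, evaluate, meets_threshold, pop_size=30, max_gens=100):
    # Step 003: start from a random first-generation population of cut-point vectors.
    population = [sorted(random.sample(range(1, model.num_operators), n_cuts))
                  for _ in range(pop_size)]
    best, stagnant = None, 0
    for _ in range(max_gens):
        # Step 004: predict uniformity and overhead for every strategy and score it.
        scored = sorted(((evaluate(model, ind), ind) for ind in population),
                        key=lambda s: s[0].fitness)       # lower fitness = better individual
        stagnant = stagnant + 1 if best is not None and scored[0][1] == best[1] else 0
        best = scored[0]
        # Step 005: stop when the best scheme meets the thresholds, or when the best
        # individual has not changed for 5 consecutive generations.
        if meets_threshold(best[0]) or stagnant >= 5:
            break
        population = next_generation(scored, model.num_operators, pop_size)
    return best
```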
The offline decision stage operates by means of a classical genetic algorithm, one of the heuristic algorithms. The technical details of the genetic algorithm are described first; the specific flow is shown in Fig. 3:
the processing object of the genetic algorithm is the ONNX model, and each time the genetic algorithm runs, there will be a desired number of submodels entered by the user. Assuming that the ONNX model has a total of M operators, the user expects to split it into N+1 (N.gtoreq.0) serial sub-models. At this time, according to the execution sequence of operators, the sequence numbers of the agreed operators are sequentially numbered from 1 to M.
Encoding rule of a strategy: [x_1, x_2, ..., x_N] represents the segmented sub-models M_1 ~ M_(N+1), where M_i (2 ≤ i ≤ N) consists of the operators numbered x_(i-1)+1 to x_i, M_1 consists of the operators numbered 1 to x_1, and M_(N+1) consists of the operators numbered x_N+1 to M. The encoding must satisfy 1 ≤ x_i ≤ M for all 1 ≤ i ≤ N, and x_i ≠ x_j for any 1 ≤ i < j ≤ N.
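For illustration (not part of the patent), a sketch of decoding such a cut-point vector into operator ranges and checking its validity:

```python
def decode(cut_points, num_operators):
    # cut_points = [x_1, ..., x_N]; returns the (first, last) operator numbers of each of
    # the N+1 serial sub-models, using 1-based inclusive numbering.
    bounds = [0] + sorted(cut_points) + [num_operators]
    return [(bounds[i] + 1, bounds[i + 1]) for i in range(len(bounds) - 1)]

def is_valid(cut_points, num_operators):
    # Every cut point lies within the operator range and no two cut points coincide.
    return (all(1 <= x <= num_operators for x in cut_points)
            and len(set(cut_points)) == len(cut_points))
```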
Fitness calculation: the fitness of a single individual in the segmentation strategy population is evaluated from the standard deviation std of the execution times of the sub-models corresponding to the strategy and from the segmentation overhead, using the following formula:
[formula image BDA0004116801180000091: individual fitness]
where Cost_raw is the execution time of the original model, std is the standard deviation of the execution times of the sub-models of the specific individual (segmentation strategy), and N+1 is the number of sub-models; the uniformity weight coefficient k_1, the segmentation-overhead weight coefficient k_2, the expected coefficient p, and the expected coefficient q are all empirical values.
When judging whether the population meets the threshold, only the overhead and std of the strategy with the optimal (minimum) fitness need to be checked against the criteria. In addition, when the optimal individual has not changed for 5 consecutive generations, the method considers that the best achievable solution for this model has been found, i.e., that the set threshold cannot be reached.
Individual selection: assuming there are m individuals in the population in total, the probability that individual g in the population is selected is:
[formula image BDA0004116801180000092: selection probability]
Crossover rule: for an individual A encoded as [a_1, a_2, ..., a_N] and an individual B encoded as [b_1, b_2, ..., b_N], the offspring produced by their crossover is encoded as [a_1*r_1 + b_1*(1-r_1), a_2*r_2 + b_2*(1-r_2), ..., a_N*r_N + b_N*(1-r_N)], where r_1 ~ r_N are random floating-point numbers between 0 and 1.
Mutation rule: each generated offspring performs the mutation operation with a certain probability. When mutating, one of x_1 ~ x_N is chosen at random and adjusted to the left or right, while the encoding rule of the strategy must still be satisfied.
In addition, when generating a new population of offspring, it should always be ensured that the optimal individuals in the previous generation are fully retained in the next generation of individuals.
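A sketch of these crossover, mutation, and elitism rules (illustrative; rounding the blended cut points to integers, the ±1 mutation step, and the uniform parent choice are assumptions, since the patent's selection probability is given only as a figure):

```python
import random

def crossover(a, b):
    # Offspring gene i = a_i*r_i + b_i*(1 - r_i) with r_i uniform in [0, 1];
    # rounding to an integer operator number is an assumption.
    child = []
    for ai, bi in zip(a, b):
        r = random.random()
        child.append(round(ai * r + bi * (1 - r)))
    return child

def mutate(ind, num_operators, prob=0.1):
    # With some probability, shift one randomly chosen cut point left or right,
    # keeping it inside [1, num_operators] and distinct from the other cut points.
    if random.random() < prob:
        i = random.randrange(len(ind))
        for step in random.sample([-1, 1], 2):
            x = ind[i] + step
            if 1 <= x <= num_operators and x not in ind:
                ind[i] = x
                break
    return ind

def next_generation(scored, num_operators, pop_size):
    # scored: list of (evaluation, individual) sorted best-first.
    elite = scored[0][1]
    parents = [ind for _, ind in scored]
    children = [elite]                        # elitism: keep the previous best unchanged
    while len(children) < pop_size:
        a, b = random.sample(parents, 2)      # uniform parent choice here; the patent's
        children.append(mutate(crossover(a, b), num_operators))  # selection uses a fitness-based probability
    return children
```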
2. Online service phase
Step 101: the user sends a task request to a request responder, wherein the task request comprises task data to be processed;
step 102: the task encapsulator receives the new task request submitted by the request responder;
step 103: the task encapsulator disassembles the task to form a plurality of subtasks, and encapsulates the subtasks according to the internally defined data structure;
step 104: the priority sorter calculates the priority of the sub-tasks generated by the task encapsulator and inserts them into the global task queue;
step 105: the task is registered with the task manager, which carries out its subsequent management;
step 106: the task manager continuously logs the task;
step 107: when the system is in its initial state or receives an end signal from a model executor, the task manager pushes the sub-task with the highest priority at the current moment to the token distributor; if that sub-task is marked as completed, it is deleted automatically and the sub-task with the next-highest priority is pushed;
step 108: the token distributor calls the segmentation-switch decision maker to write the token category;
step 109: the segmentation-switch decision maker writes the category into the token according to the current task state of the system;
step 110: the token distributor writes data into the token so that the token is bound with a model executor corresponding to the subtask;
step 111: after the token is generated, the token management module pushes it to the controller;
step 112: the controller hands the model executor ID and the token to the token verifier for comparison until verification succeeds, thereby determining the thread and working mode of the model executor to be woken up;
step 113: the token verifier refreshes the token content for the next round of token generation;
step 114: the controller wakes up the model executor thread;
step 115: the model executor loads subtask input data from the task data structure;
step 116: the model executor performs operation according to the appointed working mode to obtain an operation result;
step 117: the model executor records the operation result into a task data structure;
step 118: the model executor sends a signal of finishing the calculation of the subtask to the task manager so that the next subtask is smoothly pushed;
step 119: when all sub-tasks corresponding to a task have been processed, the task is handed to the task decapsulator for decapsulation, and the task manager sends the complete log information to the log collector;
step 120: the request responder receives the processing result from the task decapsulator and responds the processing result to the user;
step 121: the user receives the return data from the request responder.
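As a summary of steps 107-118, a schematic and non-authoritative sketch of the dispatch loop; the component objects and their methods are assumptions standing in for the modules of Fig. 1:

```python
def dispatch_loop(task_manager, token_distributor, token_verifier, controller):
    # Steps 107-118: push the highest-priority sub-task, wrap it in a token,
    # wake the matching model executor, run it, and report completion back.
    while True:
        subtask = task_manager.pop_highest_priority()            # step 107
        if subtask is None:
            break
        token = token_distributor.issue_for(subtask)             # steps 108-110: mode + executor binding
        executor = controller.find_executor(subtask.task_type)
        if token_verifier.verify(token, executor.executor_id):   # step 112
            token_verifier.refresh(token)                        # step 113: prepare next round
            controller.wake(executor, mode=token.mode)           # step 114
            result = executor.run(subtask)                       # steps 115-116
            subtask.record_result(result)                        # step 117
        task_manager.notify_done(subtask)                        # step 118: triggers the next push
```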
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A QOS-aware resource allocation method in a deep learning multi-model deployment scenario, the method comprising:
splitting a deep learning model into a plurality of serially dependent sub-models, so that the corresponding target task is split into a plurality of sub-tasks;
inserting the sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes;
when a sub-task is about to run, issuing a token for the sub-task according to the current numbers of tasks of each type and the attributes of the sub-task, so that the running result of the target task is obtained based on the deep learning model or the plurality of serially dependent sub-models.
2. The method of claim 1, wherein splitting the deep learning model into a plurality of serially dependent sub-models based on the segmentation overhead and segmentation uniformity of the sub-models comprises:
converting the deep learning model into a model in ONNX format;
splitting the model in ONNX format based on set cut points to generate a first-generation segmentation strategy population;
performing effect prediction on the current-generation segmentation strategy population, and obtaining the fitness of each individual in the segmentation strategy population based on the effect prediction results, wherein the effect prediction results include segmentation overhead and segmentation uniformity;
judging whether the segmentation overhead and segmentation uniformity of the strategy with the optimal fitness meet the criteria;
if the criteria are not met, performing individual selection, crossover, and mutation on the individuals in the segmentation strategy population to obtain the next-generation segmentation strategy population, taking it as the current segmentation strategy population, and returning to the step of performing effect prediction to obtain the fitness of each individual;
if the criteria are met, obtaining the plurality of serially dependent sub-models.
3. The method of claim 2, wherein splitting the model in ONNX format based on set cut points to generate a first-generation segmentation strategy population comprises:
numbering the operators sequentially from 1 to n according to their execution order;
dividing the model in ONNX format into N+1 serial sub-models to obtain the first-generation segmentation strategy population, wherein N ≥ 0, sub-model M_1 consists of the operators numbered 1 to x_1, sub-model M_i consists of the operators numbered x_(i-1)+1 to x_i, sub-model M_(N+1) consists of the operators numbered x_N+1 to n, and 2 ≤ i ≤ N.
4. The method of claim 2, wherein said performing effect prediction on the current segmentation strategy population comprises:
obtaining the inference latency latency_M of the deep learning model and the inference latency latency_i of each sub-model;
obtaining the segmentation overhead based on the inference latency latency_M and the inference latencies latency_i;
obtaining the segmentation uniformity based on the inference latencies latency_i.
5. The method of claim 4, wherein the fitness of an individual is
[formula images FDA0004116801170000011 and FDA0004116801170000021: individual fitness]
wherein Cost_raw is the execution time of the original model, std is the standard deviation, N+1 is the number of sub-models, k_1 is the weight coefficient of uniformity, k_2 is the weight coefficient of segmentation overhead, p is a first expected coefficient, q is a second expected coefficient, and overhead is the segmentation overhead.
6. The method of claim 1, wherein inserting the sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes comprises:
obtaining, for each sub-task in the global task queue, the time it has already queued, its predicted queuing time, and the inference latency latency_M of the deep learning model corresponding to the sub-task;
for each candidate position at which the sub-task may be inserted into the global task queue, obtaining the predicted inference latency of each sub-task in the global task queue;
obtaining the response ratio of each sub-task in the global task queue based on the queued time, the predicted queuing time, the inference latency latency_M, and the predicted inference latency;
inserting the sub-task into the global task queue at the position that gives all sub-tasks in the global task queue the lowest overall response ratio.
7. The method of claim 1, wherein issuing a token for the sub-task according to the current numbers of tasks of each type and the attributes of the sub-task, so as to obtain the running result of the target task based on the deep learning model or the plurality of serially dependent sub-models, comprises:
when the current task execution scenario is a first scenario or a second scenario, submitting the sub-task to the deep learning model for execution to obtain the running result of the target task, wherein the first scenario is that the sub-task is the first part of the target task and the total number of tasks of the target task's type exceeds a set threshold, and the second scenario is that the sub-task is the first part of the target task and the total number of tasks of all task types other than the target task's type exceeds the set threshold;
when the current task execution scenario is neither the first scenario nor the second scenario, handing the sub-task to the corresponding sub-model and obtaining the running result of the target task based on the execution results of the sub-models.
8. A QOS-aware resource allocation apparatus in a deep learning multi-model deployment scenario, the apparatus comprising:
a uniform segmentation module, configured to split the deep learning model into a plurality of serially dependent sub-models, so that the corresponding target task is split into a plurality of sub-tasks;
a task management module, configured to insert sub-tasks into the global task queue according to the overall response ratio of all sub-tasks in the global task queue when the queue changes;
a token management module, configured to issue a token for a sub-task according to the current numbers of tasks of each type and the attributes of the sub-task when the sub-task is about to run;
a task processing module, configured to obtain the running result of the target task based on the deep learning model or the plurality of serially dependent sub-models.
9. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
CN202310221373.0A 2023-03-09 2023-03-09 QOS-aware resource allocation method and device under deep learning multi-model deployment scene Pending CN116225653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310221373.0A CN116225653A (en) 2023-03-09 2023-03-09 QOS-aware resource allocation method and device under deep learning multi-model deployment scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310221373.0A CN116225653A (en) 2023-03-09 2023-03-09 QOS-aware resource allocation method and device under deep learning multi-model deployment scene

Publications (1)

Publication Number Publication Date
CN116225653A true CN116225653A (en) 2023-06-06

Family

ID=86582187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310221373.0A Pending CN116225653A (en) 2023-03-09 2023-03-09 QOS-aware resource allocation method and device under deep learning multi-model deployment scene

Country Status (1)

Country Link
CN (1) CN116225653A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117724823A (en) * 2024-02-07 2024-03-19 之江实验室 Task execution method of multi-model workflow description based on declarative semantics


Similar Documents

Publication Publication Date Title
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
CN104168318B (en) A kind of Resource service system and its resource allocation methods
CN113254178B (en) Task scheduling method and device, electronic equipment and readable storage medium
Heo et al. Real-time object detection system with multi-path neural networks
CN108509276A (en) A kind of video task dynamic migration method in edge calculations environment
CN103677990B (en) Dispatching method, device and the virtual machine of virtual machine real-time task
CN111367630A (en) Multi-user multi-priority distributed cooperative processing method based on cloud computing
WO2015066979A1 (en) Machine learning method for mapreduce task resource configuration parameters
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
KR102585591B1 (en) Slo-aware artificial intelligence inference scheduler for heterogeneous processors in edge platforms
CN114610474B (en) Multi-strategy job scheduling method and system under heterogeneous supercomputing environment
WO2022057940A1 (en) Method for updating computational node resource information, node, and storage medium
CN113472597B (en) Distributed convolutional neural network fine-grained parameter transmission scheduling method and device
CN116225653A (en) QOS-aware resource allocation method and device under deep learning multi-model deployment scene
CN110570075A (en) Power business edge calculation task allocation method and device
CN116450312A (en) Scheduling strategy determination method and system for pipeline parallel training
CN114579270A (en) Task scheduling method and system based on resource demand prediction
CN110519386B (en) Elastic resource supply method and device based on data clustering in cloud environment
Fu et al. Kalmia: A heterogeneous QoS-aware scheduling framework for DNN tasks on edge servers
WO2019000435A1 (en) Task processing method and device, medium, and device thereof
Grigoras et al. Elastic management of reconfigurable accelerators
CN107656805A (en) A kind of electric power data job scheduling method based on Hadoop platform
CN116915869A (en) Cloud edge cooperation-based time delay sensitive intelligent service quick response method
CN116451585A (en) Adaptive real-time learning task scheduling method based on target detection model
CN108958919A (en) More DAG task schedule expense fairness assessment models of limited constraint in a kind of cloud computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination