CN116915869A

CN116915869A - Cloud edge cooperation-based time delay sensitive intelligent service quick response method

Info

Publication number: CN116915869A
Application number: CN202311017986.9A
Authority: CN
Inventors: 许小龙; 胡宇昊; 王青洋; 楚义
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2023-08-14
Filing date: 2023-08-14
Publication date: 2023-10-20

Abstract

The invention discloses a cloud edge cooperation-based delay-sensitive intelligent service quick response method, which is based on a double-delay depth deterministic strategy algorithm to realize joint optimization of convolution kernel segmentation and server resource allocation, and the core idea is that computing resources in each edge server are partitioned, and resources in each region are only used for reasoning a certain part of a certain convolution layer in a certain convolution neural network; all intelligent services based on the convolutional neural network of the same category in one scene are placed in a queue, and the reasoning order of different services is determined according to the priority; in the convolutional neural network corresponding to each intelligent service, each convolutional layer is divided into a plurality of parts, so that parallel reasoning of the convolutional neural network on different servers is realized; the whole algorithm has the characteristics of reasonable convolution kernel segmentation, accurate resource allocation and preferential treatment of high-priority service requests, and can meet the quick response requirement of delay-sensitive intelligent service in a high-task load scene.

Description

Cloud edge cooperation-based time delay sensitive intelligent service quick response method

Technical Field

The invention relates to the technical field of edge calculation, in particular to a cloud edge cooperation-based delay-sensitive intelligent service quick response method.

Background

Deep Learning (DL) is one of the most important Deep Neural Networks (DNNs), convolutional Neural Networks (CNNs) that are widely used in many vision applications, such as augmented reality, object classification, object recognition, and the like. Although CNNs can help these intelligent applications produce accurate results, their high computational complexity makes it difficult for models of these applications to be deployed on edge devices, including terminal devices such as cell phones, tablet computers, and notebook computers, as well as edge servers; the traditional solution is to reason the CNN model on a cloud server, however, considering the distance between the edge and the cloud is far, the transmission delay of data in cloud computing is very high.

This problem is solved in the prior art by model partitioning, for example, based on Bian Yun synergy, by using a maximum cut minimum flow algorithm to design a dynamic adaptive model partitioning method, and by using a model partitioning method based on joint parameter quantization. However, in these prior art techniques, the different parts within the CNN are still sequential reasoning and thus do not minimize the response time of the intelligent application. As a matter of course, there is a new method proposed in the prior art called model parallelism, which groups channels of convolution kernels in CNNs and performs parallel reasoning on different devices, so as to implement model parallelism. However, existing models work in parallel, resulting in a non-negligible loss of accuracy or additional training costs; furthermore, with the development of edge computing and the popularization of intelligent applications, more and more tasks are offloaded from terminal devices to edge servers or cloud ends, which brings huge computing load to corresponding devices (particularly edge devices). The limited computational resources and the large number of tasks result in lengthy response times for user requests, especially failing to meet quality of service (QoS) for delay-sensitive tasks such as automatic driving, road monitoring, etc. Therefore, in a scenario where the load of each server is extremely high, a new reasoning acceleration method is urgently required to be proposed for the delay-sensitive service.

Disclosure of Invention

The purpose of the invention is that: the quick response method for the delay-sensitive intelligent service based on cloud edge cooperation is characterized in that priority is set for each task in a scene, each convolution layer is divided into a plurality of parts along high direction for each CNN, and the parts are deployed on a plurality of edge servers so as to use a certain amount of resources for parallel reasoning.

In order to achieve the above functions, the invention designs a cloud-edge collaboration-based delay-sensitive intelligent service quick response method, which is based on the mutual collaboration among all sides and cloud devices under a user-dense side-cloud collaboration system consisting of all intelligent service users, all edge servers, controllers and cloud servers, and the side-cloud collaboration system executes the following steps A-E to obtain an intelligent service request queue to be processed and a CNN distributed reasoning method, and responds according to the priorities of the intelligent service requests:

step A: aiming at a target scene, selecting a category of a corresponding convolutional neural network according to intelligent service requests sent by intelligent service users, obtaining a convolutional layer segmentation strategy according to a numerical optimization theory, segmenting each convolutional kernel of the convolutional neural network along a high, and distributing different parts of the convolutional layer obtained by segmentation to different edge servers in the scene for parallel reasoning;

and (B) step (B): aiming at all intelligent service requests based on the convolutional neural network of the same category in the scene, a queue is established, and the priority of each intelligent service request in the same queue is determined according to the sensitivity degree of the intelligent service request to response time delay so as to determine the running sequence of each intelligent service request;

step C: partitioning computing resources in each edge server according to the computing amount of each class of convolutional neural network in the scene, wherein the computing resources in each region can only be used for reasoning about the convolutional neural network of a certain corresponding class;

step D: c, further dividing the computing resources in each resource partition divided in the step C by using a deep reinforcement learning model aiming at each edge server to form a plurality of subareas, wherein the computing resources in each subarea are only used for reasoning a certain specific convolution layer in a certain specific type of convolution neural network;

step E: and sequentially running the intelligent service requests in each queue until all the intelligent service requests in the queues are processed.

As a preferred technical scheme of the invention: the convolution kernel segmentation strategy in the step A is implemented in the following steps A1 to A4, and a parallel reasoning strategy of each convolution kernel corresponding to each edge server is obtained;

step A1: having N in the target scene _server The edge servers share N for a certain specific type of convolutional neural network _con The convolution layers are divided and inferred on the edge servers in parallel respectively; reasoning all pooling layers on the controller; reasoning all the full connection layers on the cloud server; for each convolution layer, sending the reasoning result of each edge server to a controller for aggregation, and returning the aggregation result to each edge server again for parallel reasoning of the next convolution layer, or sending the aggregation result to a cloud server for reasoning of a full-connection layer;

step A2: the response delay of the intelligent service request sent by the intelligent service user is as follows:

in the method, in the process of the invention,representing the partition ratio, float, of the nth convolutional layer corresponding to the mth edge server _n Floating-point operand representing the nth convolution layer, < +.>Representing the number of computing resources allocated to the nth convolutional layer in the mth edge server, gamma _m Representing the computation rate per unit of computing resource in the mth edge server, N _con Representing the total number of convolution layers;

step A3: in the parallel reasoning process of each convolution layer, the reasoning time delay of each edge server is completely equal, and the convolution kernel segmentation problem is converted into the following formula:

thereby deriving a convolutional layer segmentation strategy, i.eWherein->j∈{1,2,…,N _server And N _server Is the total number of edge servers;

step A4: for each edge server, step a 4.1-step a4.1 are performed:

step a4.1: the convolution kernel size calculated by the step A3 and distributed to the server is recorded asThen pair->Rounding down->And calculate->Fractional parts of the number of times; corresponding +.>The fractional parts of the number are added together and added to +.>

Step a4.2: traversing the corresponding inference delay at that time for each edge serverAnd find the minimum value among themAnd the size of the part of the convolution kernel allocated by the server is recorded as +.>

Step a4.3: execution ofAnd->

Step a4.4: repeating steps A4.2-A4.3 until each edge serverThe maximum value of (2) is not less than +.>

Step a4.5: for each server remainingAnd executing the operation once to finally obtain the convolution layer segmentation strategy.

As a preferred technical scheme of the invention: and B, classifying the intelligent service requests sent by the intelligent service users according to the type of the convolutional neural network based on the intelligent service requests, establishing a task queue for the convolutional neural network of one type, putting the intelligent service requests based on the convolutional neural network of the same type into the corresponding task queue, and sequencing the intelligent service requests in the task queue according to the priority of the intelligent service requests, wherein the priority of each intelligent service request is preset by a service provider.

As a preferred technical scheme of the invention: in the step C, aiming at the computing resources in each edge server, the computing resources in each area are partitioned proportionally according to the computing amount of the convolutional neural network of all the categories in the scene, and the computing resources in each area are only used for reasoning the convolutional neural network of one category.

As a preferred technical scheme of the invention: d, adopting a double-delay depth deterministic strategy algorithm in a depth reinforcement learning model, executing the following steps D1 to D3, dividing computing resource subareas in each server, and distributing computing resources for each convolution layer in each convolution neural network in a scene;

step D1: dividing computing resource subareas by utilizing a double-delay depth deterministic strategy algorithm in a depth reinforcement learning model, reasoning a convolution neural network of a specific class, and setting a state space and an action space as follows:

wherein a is _i Is the action space s _i Is a state space;the proportion is allocated for the resources; />Calculating the quantity of resources which are not allocated to the subareas in a certain resource area; c (C) _i The number of convolution layers which are not yet inferred in the convolution neural network;

step D2: setting a reward function reward _n The deep reinforcement learning model is trained as follows:

in the method, in the process of the invention,representing an interaction time delay between the mth edge server and the controller;

step D3: and storing the trained deep reinforcement learning model in a controller, and dynamically adjusting a computing resource allocation strategy according to the fluctuation of network bandwidth and aiming at reducing the total reasoning time delay of the convolutional neural network in the actual running process of the intelligent service request.

As a preferred technical scheme of the invention: e, the intelligent service requests in each queue run successively according to the priority order, and each convolution layer in the corresponding convolution neural network of each intelligent service request sequentially performs parallel reasoning in different resource subareas in the running process; for the convolutional neural network model of the same class, the reasoning processes of different convolutional layers of different intelligent service requests are mutually independent.

The beneficial effects are that: the advantages of the present invention over the prior art include:

the invention provides a quick response method of a time delay sensitive intelligent service based on edge cloud cooperation, and provides a distributed reasoning method of a Convolutional Neural Network (CNN), which is based on a mathematical optimization method and a double-delay depth deterministic strategy (TD 3) algorithm in deep reinforcement learning to realize joint optimization of convolutional kernel segmentation and server resource allocation. The core idea is to partition the computing resources in each edge server, and the resources in each area are only used for reasoning a certain part of a certain convolution layer in a certain CNN; all intelligent services based on the same class CNN in one scene are placed in one queue, and the reasoning order of different services is determined according to the priority; in the CNN corresponding to each intelligent service, each convolution layer is divided into a plurality of parts, so that parallel reasoning of the CNN on different servers is realized. The whole algorithm has the characteristics of reasonable convolution kernel segmentation, accurate resource allocation and preferential treatment of high-priority service requests, and can meet the quick response requirement of delay-sensitive intelligent service in a high-task load scene. It should be noted that, compared with the traditional model segmentation and model parallelism, the novel parallel reasoning method for the convolution layer further reduces CNN reasoning delay and ensures the accuracy of the model reasoning result; in addition, the reasonable distribution of the computing resources in the edge servers in the invention realizes the full and efficient utilization of the computing resources in all the edge servers in the scene.

Drawings

FIG. 1 is a flow chart of a method for providing fast response to a latency sensitive intelligent service based on Yun Bian collaboration in accordance with an embodiment of the present invention;

FIG. 2 is a diagram of a convolution kernel parallel reasoning framework provided in accordance with an embodiment of the present invention;

FIG. 3 is a distributed inference scenario diagram of an intelligent service model provided according to an embodiment of the present invention;

FIG. 4 is a diagram of a queue and multitasking alternating mechanism provided in accordance with an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

The invention provides a quick response method of a time delay sensitive intelligent service based on edge cloud cooperation, namely a Convolutional Neural Network (CNN) distributed reasoning method under a multi-user scene. The method is based on a mathematical optimization method and a double-delay depth deterministic strategy (TD 3) algorithm in deep reinforcement learning, and realizes joint optimization of convolution kernel segmentation and server resource allocation. Compared with other model segmentation and model parallel methods, the method further reduces CNN reasoning delay and ensures the accuracy of model reasoning results; meanwhile, the reasonable distribution of the computing resources in the edge servers in the invention realizes the full and efficient utilization of the computing resources in all the edge servers in the scene. In addition, aiming at different priorities corresponding to different intelligent services in a scene, a queue and multitask alternating mechanism is designed, all intelligent services based on the same kind of CNN are placed in the same queue, and the running sequence is determined according to the different priorities; and in CNNs corresponding to front and rear services in the same queue, different convolution layers of different services can be inferred at the same time. This minimizes the response time of delay-sensitive intelligent service requests in high-mission load scenarios.

Referring to fig. 1, according to the delay-sensitive intelligent service quick response method based on Yun Bian collaboration provided by the embodiment of the invention, based on the mutual collaboration between each side and cloud equipment under a user-dense side-cloud collaboration system formed by each intelligent service user, each edge server, each controller and a cloud server, aiming at intelligent service requests with different priorities in a scene, the side-cloud collaboration system executes the following steps a-E to obtain an intelligent service request queue to be processed and a CNN distributed reasoning method, and responds according to the priorities of the intelligent service requests:

the convolution kernel segmentation strategy in the step A is implemented in the following steps A1 to A4, and a parallel reasoning strategy of each convolution kernel corresponding to each edge server is obtained;

referring to fig. 2, for each convolution kernel, the convolution kernel is divided into a plurality of different parts along the height, each part is deployed on different edge servers for parallel reasoning, and finally the reasoning results of the edge servers are aggregated on the controller. Furthermore, the division of the convolution kernel causes the corresponding input data to be divided as well. h is a _c And w is equal to _c Representing the height and width of the convolution kernel respectively,x _c and y is _c Respectively representing a starting point and an ending point of the convolution kernel segmentation; thereby, the start point and end point (x) _input Represents the starting point, y _input Represent end point):

x _input ＝x _c

y _input ＝h _c -(w _c -y _c )

thus, the size of the reasoning result of each edge server is calculated for each convolution kernel, and is completely equal to the size of the result obtained by normal reasoning without dividing the convolution kernel; and the process of the controller for aggregating the reasoning results of the edge servers is simply adding the reasoning results of the servers.

Step A1: referring to FIG. 3, assume that N is common in a particular user-dense load scenario _server And an edge server available for use. For a certain class of convolutional neural networks, there is a total of N _con And a plurality of convolution layers. Dividing a convolution layer and reasoning on a plurality of edge servers in parallel; because the calculation amount of the pooling layers is extremely small, all pooling layers are directly inferred on the controller; all the full connection layers are inferred on the cloud server because they are computationally intensive and not easily segmented. And for each convolution layer, the reasoning result of each server is sent to the controller for aggregation, and the aggregation result is returned to each edge server again for parallel reasoning of the next convolution layer, or the aggregation result is sent to the cloud server for reasoning of the full connection layer. The calculation amount of a certain neural network layer is measured through floating point operands, and the floating point operand of the nth convolutional layer in a certain convolutional neural network is recorded as float _n The division ratio of the convolution layer corresponding to the mth edge server is as followsThe number of computing resources allocated to the convolutional layer in the mth edge server is +.>Mth mThe computing rate per unit computing resource in each edge server is gamma _m . The reasoning time of the full connection layer on the cloud server is recorded as t _full The data transmission delay between cloud edges is marked as +.>For the reasoning result of the nth convolution layer, the interaction time delay between the mth edge server and the controller is marked as +.>Thus, a response time delay formula of a certain intelligent service is obtained:

the aggregation time delay and the time delay of pooling operation of the convolution layer parallel reasoning result in the controller are ignored;

step A2: the invention focuses on the parallel reasoning of the convolution layer, so the reasoning time t of the full connection layer on the cloud server _full Data transmission delay with cloud edgeCan be removed from the formula. In addition, to simplify the problem, the data interaction delay between the edge server and the controller is temporarily ignored in the convolutional layer segmentation stage>Thus resulting in a simplified formula:

step A4: for each edge server, step a 4.1-step a4.1 are performed:

step a4.1: considering that the high stress of the part of the convolution kernel actually allocated to each edge after segmentation is an integer, the above method needs to be further optimized, and the convolution kernel size allocated to the server calculated by step A3 is recorded asThen pair->Downward take-offGet whole->And calculate->Fractional parts of the number of times; corresponding +.>The fractional parts of the number are added together and added to +.>

Step a4.3: execution ofAnd->

and B, classifying the intelligent service requests sent by the intelligent service users according to the type of the convolutional neural network based on the intelligent service requests, establishing a task queue for the convolutional neural network of one type, putting the intelligent service requests based on the convolutional neural network of the same type into the corresponding task queue, and sequencing the intelligent service requests in the task queue according to the priority of the intelligent service requests, wherein the priority of each intelligent service request is preset by a service provider.

in the step C, aiming at the computing resources in each edge server, the computing resources in each area are partitioned proportionally according to the computing amount of the convolutional neural network of all the categories in the scene, and the computing resources in each area are only used for reasoning the convolutional neural network of one category.

d, adopting a double-delay depth deterministic strategy algorithm in a depth reinforcement learning model, executing the following steps D1 to D3, dividing computing resource subareas in each server, and distributing computing resources for each convolution layer in each convolution neural network in a scene;

the dual-delay depth deterministic strategy gradient (TD 3) algorithm is improved by a depth deterministic strategy gradient (DDPG) algorithm, a Critic network and a target Critic network are added on the basis of the DDPG algorithm, and a smaller Q value is selected to avoid overestimation, so that a better decision is made. In the server resource allocation decision process, the number of computing resources which are not allocated to the subareas in each time resource area is taken as a state, the resource allocation proportion (the decimal between 0 and 1) at each time is taken as an action, and the state at the next time is easily obtained and is decided by the state at the last time and the action at the current time, so that the server resource allocation decision process accords with Markov properties, and can solve the problem by considering deep reinforcement learning. Considering continuity of the action space and a large number of discrete states in the state space, and combining with strong convergence of the TD3 algorithm, the TD3 algorithm is finally selected to make a decision for resource allocation in the edge server.

wherein a is _i Is the action space s _i Is a state space;the proportion is allocated for the resources; />For a resource region not yet allocatedCalculating the number of resources of the subareas; c (C) _i The number of convolution layers which are not yet inferred in the convolution neural network;

in the reward function, the reasoning time delay of the edge server and the data round-trip transmission time delay between the server and the controller are considered, so that the time delay of the reasoning of the convolutional neural network is reduced as much as possible.

As shown in fig. 4, the intelligent services to be processed in each queue in step E are operated sequentially according to the priority order. In the running process of each intelligent service, each convolution layer in the corresponding convolution neural network model sequentially carries out parallel reasoning in different resource subareas. For the convolutional neural network model of the same class, the reasoning processes of different convolutional layers of different intelligent services are mutually independent. Thus, while inferring the second convolution layer of the first processed intelligent service in the queue in parallel, the first convolution layer of the next intelligent service to be processed in the queue can also be inferred in parallel, and so on, until all intelligent services in the queue are processed.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. The cloud-edge collaboration-based time delay sensitive intelligent service quick response method is characterized in that based on the mutual collaboration among all sides and cloud equipment under a user-dense edge-cloud collaboration system consisting of all intelligent service users, all edge servers, controllers and cloud servers, aiming at intelligent service requests with different priorities in a scene, the edge-cloud collaboration system executes the following steps A-E to obtain an intelligent service request queue to be processed and a CNN distributed reasoning method, and responds according to the priorities of the intelligent service requests:

2. The cloud-edge collaboration-based time delay sensitive intelligent service quick response method according to claim 1, wherein the convolution kernel segmentation strategy in the step A is implemented as follows in the steps A1 to A4, and a parallel reasoning strategy of each convolution kernel corresponding to each edge server is obtained;

step A4: for each edge server, step a 4.1-step a4.1 are performed:

step a4.1: the convolution kernel size calculated by the step A3 and distributed to the server is recorded asThen pair->Rounding down->And calculate->Is excessive fromIs a fraction of the fraction of (2); corresponding +.>The fractional parts of the number are added together and added to +.>

Step a4.2: traversing the corresponding inference delay at that time for each edge serverAnd find the minimum value therein +.>And the size of the part of the convolution kernel allocated by the server is recorded as +.>

Step a4.3: execution ofAnd->

3. The cloud edge collaboration-based time delay sensitive intelligent service quick response method according to claim 1, wherein in the step B, intelligent service requests sent by intelligent service users are classified according to the convolutional neural network class on which the intelligent service requests are based, a task queue is established for one class of convolutional neural network, the intelligent service requests based on the same class of convolutional neural network are placed in the corresponding task queue, the intelligent service requests are ordered in the task queue according to the priority of the intelligent service requests, and the priority of each intelligent service request is preset by a service provider.

4. The cloud edge collaboration-based time delay sensitive intelligent service quick response method according to claim 1, wherein in the step C, for the computing resources in each edge server, the computing resources in each area are used for reasoning about the convolutional neural network of one category only according to the computing amount of the convolutional neural network of all categories in the scene and are partitioned proportionally.

5. The cloud edge collaboration-based time delay sensitive intelligent service quick response method according to claim 1, wherein in the step D, a double-delay depth deterministic strategy algorithm in a depth reinforcement learning model is adopted, the following steps D1 to D3 are executed, computing resource subareas in each server are divided, and computing resources are allocated for each convolution layer in each convolution neural network in a scene;

6. The cloud edge collaboration-based time delay sensitive intelligent service quick response method according to claim 1, wherein in the step E, intelligent service requests in each queue run sequentially according to a priority order, and in the running process of each intelligent service request, each convolution layer in a corresponding convolution neural network sequentially performs parallel reasoning in different resource subareas; for the convolutional neural network model of the same class, the reasoning processes of different convolutional layers of different intelligent service requests are mutually independent.