WO2024032121A1 - Deep learning model reasoning acceleration method based on cloud-edge-end collaboration - Google Patents

Deep learning model reasoning acceleration method based on cloud-edge-end collaboration

Info

Publication number
WO2024032121A1
Authority
WO
WIPO (PCT)
Prior art keywords
delay
deep learning
learning model
computing
hierarchical
Prior art date
Application number
PCT/CN2023/098730
Other languages
French (fr)
Chinese (zh)
Inventor
郭永安
周金粮
王宇翱
钱琪杰
孙洪波
Original Assignee
南京邮电大学
Priority date
Filing date
Publication date
Application filed by 南京邮电大学
Publication of WO2024032121A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network

Definitions

  • the invention belongs to the field of cloud-edge-end collaborative computing, and specifically relates to a deep learning model inference acceleration method based on cloud-edge-end collaboration.
  • Intelligent applications based on deep learning models usually require substantial computation. There are currently two feasible solutions.
  • One is the End-only mode, which uses simple models and lightweight deep learning frameworks, such as TensorFlow Lite or Caffe for Android, to perform all computation on the physical side; the other is the Cloud-only mode, which offloads all computing tasks to a cloud center with powerful computing capability to perform complex deep learning model computation.
  • However, these approaches either reduce recognition accuracy, because only a simple model is deployed on the physical side, or incur excessive transmission delay, because the wide-area network link between the physical side and the cloud is unstable. It is therefore quite difficult to guarantee reasonable latency and recognition accuracy at the same time.
  • To overcome this tension between latency and recognition accuracy, a better solution is to leverage the edge computing paradigm.
  • However, existing edge computing execution frameworks and offloading mechanisms for deep learning model inference still have limitations, because they ignore the characteristics of deep learning applications and the dynamics of the edge environment.
  • The purpose of the invention is to minimize the response delay of computing tasks while meeting the required computation accuracy, by combining the edge computing paradigm with cloud computing and offloading the deep learning model layer by layer to different edge computing nodes.
  • To this end, the present invention provides the following technical solution: a deep learning model inference acceleration method based on cloud-edge-end collaboration, where cloud-edge-end collaboration refers to a cloud server, at least two edge computing nodes communicating with the cloud server, and at least one physical terminal.
  • The communication distance between the physical terminal and an edge computing node is smaller than the distance between the edge computing node and the cloud server.
  • the method includes the following steps:
  • Step S1: The physical terminal preprocesses the image data into image feature data D1 with the same resolution and equal data volume, feeds it into the partitioned DNN layers DNNz of the deep learning model to be offloaded, and uses the output of each layer as the input of the next layer, finally obtaining Dz;
  • Step S2: Offline learning phase: based on the preset load conditions of each edge computing node, take as input the process of each DNNz of the deep learning model to be offloaded processing the image feature data Dz on each edge computing node, and take as output the known computation delay of Dz passing through each DNNz on that node, and construct and train a hierarchical computation delay prediction model CT;
  • at the same time, based on the preset load condition of the cloud server, take as input the process of each DNNz of the deep learning model to be offloaded processing Dz on the cloud server, and take as output the known computation delay of Dz through each DNNz on the cloud server, and construct and train a cloud server computation delay prediction model CTc;
  • Step S3: According to the actual computing resource load of each edge computing node, the edge computing node responsible for the physical terminal's computing task applies the hierarchical computation delay prediction model CT, taking as input the process of each DNNz of the model to be offloaded processing Dz, and obtains as output the theoretical hierarchical computation delay CT′, i.e. the predicted delay of Dz passing through each DNNz on each edge computing node;
  • Step S4: Based on the known LAN bandwidth r of the edge computing nodes and the physical distance l between edge computing nodes, compute the data transmission delay T and propagation delay S required to transmit the image feature data Dz from the current edge computing node to other edge computing nodes; at the same time, based on the known cloud server network bandwidth rc and the physical distance lc between the task's edge computing node and the cloud server, compute the data transmission delay Tc and propagation delay Sc required to transmit the image feature data D1 from that edge computing node to the cloud server;
  • Step S5: Take the theoretical hierarchical computation delay CT′ of each edge computing node obtained in step S3 and the data transmission delay T and propagation delay S obtained in step S4 as input, and the corresponding response delay TIME as output, to construct the hierarchical offloading model of the deep learning model, TIME = F(CT′, T, S) + t; with the minimum response delay TIME as the optimization objective, obtain the hierarchical offloading model with the smallest response delay TIME, where t is the time from the edge computing node receiving the computing task sent by the physical terminal to generating the hierarchical offloading model;
  • Step S6: According to the cloud server computation delay prediction model CTc obtained in step S2 and the computing resource load of the cloud server, apply CTc, taking as input the process of each DNNz of the model to be offloaded processing Dz, to obtain as output the theoretical hierarchical computation delay CT′z of Dz passing through each DNNz on the cloud server; then compute the theoretical computation delay CT′c of processing the computing task on the cloud server alone, where CT′1 is the computation delay of passing D1 through DNN1, and the corresponding response delay TIMEc = F(CT′c, Tc, Sc);
  • Step S7: Dynamically compare the response delay TIMEc of using the cloud server alone with the minimum response delay TIME of the hierarchical offloading model. If TIME is less than TIMEc, take the hierarchical offloading model corresponding to the minimum response delay TIME as the hierarchical offloading strategy and offload the data to be computed with the goal of minimizing the response delay; otherwise, take processing the data to be computed on the cloud server alone, corresponding to the response delay TIMEc, as the final hierarchical offloading strategy;
  • Step S8 Based on the hierarchical offloading strategy obtained in step S7, each edge computing node that executes the hierarchical offloading strategy collects the computing load of the computing task, and then returns to step S2.
  • The partitioned DNN layers of the deep learning model to be offloaded are obtained as follows: the neurons of the input layer, hidden layers, and output layer of the model are arranged column by column and divided into n columns, giving separate neuron columns, each of which becomes one DNNz, where n is a positive integer.
  • step S1 is specifically:
  • Based on each DNNz of the partitioned deep learning model to be offloaded, take as input the process of each DNNz processing the image feature data Dz on each edge computing node, and as output the corresponding computation delay of Dz through each DNNz on that node, and construct for each edge computing node a hierarchical computation delay model CT = f(α, β, γ), where α is the preset CPU load, β the preset GPU load, and γ the preset cache load of the computing resources.
  • edge computing nodes include deep reinforcement networks, deep learning models, situational awareness centers, and decision-making transceiver centers;
  • the deep reinforcement network includes:
  • the hierarchical computation delay prediction module is used to calculate the theoretical hierarchical computation delays CT′ and CT′c, and to store the hierarchical computation delay prediction model CT and the cloud server computation delay prediction model CTc;
  • Transmission delay calculation module used to calculate data transmission delay T and propagation delay S;
  • the online decision-making delay statistics module is used to calculate the time t from the edge computing node receiving the computing task sent by the physical terminal to generating the deep learning model hierarchical offloading model;
  • the online learning module is used to collect and transmit the actual computing load and actual computing delay data during computing tasks to the hierarchical computing delay prediction module of the edge computing node;
  • the offline sample data storage module is used to store, under the preset load conditions of each edge computing node and the cloud server, the computation delay of the image feature data Dz passing through each DNNz of the deep learning model to be offloaded on each edge computing node, and the computation delay of Dz passing through each DNNz of the model to be offloaded on the cloud server;
  • the decision information generation module is used to pass the generated final hierarchical offloading strategy to the decision transceiver center;
  • the situation awareness center includes:
  • the edge computing node computing capability awareness module is used to calculate the computing resource load of each edge computing node
  • the cloud server computing capability awareness module is used to calculate the computing resource load of the cloud server
  • the network telemetry module is used to calculate the network bandwidth r of the local area network where each edge computing node is located, and is used to calculate the physical distance l between each edge computing node;
  • the decision-making transceiver center is used to send and receive the final hierarchical offloading strategy.
  • the aforementioned cloud server includes a deep learning model and a decision transceiver center; the deep learning model is a trained deep learning model; and the decision transceiver center is used to receive the final hierarchical offloading strategy.
  • the situation awareness center includes a computing capability awareness module and a network telemetry module.
  • This method combines the edge computing paradigm with cloud computing and offloads the deep learning model layer by layer to different edge computing nodes, fully exploiting the computing potential of the edge side and minimizing the response delay of computing tasks while satisfying the required computation accuracy.
  • This method starts from an offline learning phase; furthermore, it can update the hierarchical computation delay prediction model in real time, based on the computing resource load and computation delay actually measured for each task, to optimize the decision-making process for hierarchical offloading of the deep learning model.
  • The deep learning model is hierarchically offloaded to computing nodes such as edge computing nodes and cloud servers.
  • the collaborative reasoning method can effectively ensure the security of computing data and reduce network bandwidth occupancy.
  • Figure 1 is a technical principle diagram of the present invention.
  • Figure 2 is a schematic diagram of the module composition of the deep reinforcement network of the present invention.
  • Figure 3 is a schematic diagram of the hierarchical offloading of the deep learning model of the present invention.
  • Figure 4 is a flow chart of the method of the present invention.
  • At least two edge computing nodes are provided within the communication range of the cloud server c.
  • The edge computing nodes are deployed on WiFi access points or base stations, and at least one physical terminal is set up within the local area network where each edge computing node is located; the distance between each edge computing node and each physical terminal within its communication range is smaller than the distance between the edge computing node and the cloud server.
  • For any edge computing node i within the communication range of cloud server c, the total number of other edge computing nodes within the communication range of node i whose physical distance from it is less than a preset distance is recorded as N, with 1 ≤ j ≤ N, where j is the index of each such edge computing node.
  • These N edge computing nodes together with edge computing node i form an edge cluster; a deep learning model and a decision transceiver center are deployed on the cloud server c; a deep reinforcement network, a deep learning model, a situational awareness center, and a decision transceiver center are deployed on each edge computing node.
  • a deep reinforcement network is deployed on the edge computing node.
  • The deep reinforcement network includes a hierarchical computation delay prediction module, a transmission delay calculation module, an online decision delay statistics module, an online learning module, an offline sample data storage module, and a decision information generation module; with the goal of minimizing the computing task response delay TIME, it jointly considers the data transmission delay T, the data propagation delay S, the hierarchical computation delay CT of the deep learning model, and the decision delay t, to find the optimal strategy for hierarchically offloading the deep learning model to the individual computing nodes and so achieve rapid inference.
  • The hierarchical computation delay prediction module calculates the theoretical hierarchical computation delay; the transmission delay calculation module calculates the data transmission delay T and propagation delay S; the online decision delay statistics module calculates the time t from the edge computing node receiving the computing task sent by the physical terminal to generating the hierarchical offloading model; the online learning module collects the actual computing load and actual computation delay during task execution and passes them to the hierarchical computation delay prediction module of the edge computing node.
  • The actual computation delay refers to the computation delay of the image feature data Dz passing through each DNNz of the deep learning model to be offloaded on each edge computing node while that node executes the computing task.
  • The offline sample data storage module is used to store the hierarchical computation delay prediction model CT; the decision information generation module is used to pass the generated final hierarchical offloading strategy to the decision transceiver center;
  • The deep learning model is a trained deep learning model; the situational awareness center includes a computing capability awareness module and a network telemetry module;
  • the computing capability awareness module is used to calculate the computing resource load of each edge computing node;
  • the network telemetry module is used to calculate the network bandwidth r of the local area network where each edge computing node is located and the physical distance l between edge computing nodes;
  • the decision transceiver center is used to receive the final hierarchical offloading strategy.
  • Cloud server c includes a deep learning model and a decision transceiver center; the deep learning model is a trained deep learning model; the decision transceiver center is used to receive the final hierarchical offloading strategy.
  • the situation awareness center includes a computing capability awareness module and a network telemetry module.
  • The deep learning model has a multi-layer structure. The neurons of the input layer, hidden layers, and output layer of the model to be offloaded are arranged column by column and divided into n columns, giving separate neuron columns, each of which becomes one DNNz, where n is a positive integer.
  • For any edge computing node i within the communication range of cloud server c, assume the total number of other edge computing nodes within the communication range of node i whose distance from it is less than the preset distance is 2, and let I and II denote these two edge computing nodes.
  • These two edge computing nodes together with edge computing node i form an edge cluster, i.e. the edge cluster contains three edge computing nodes.
  • Assuming the deep learning model to be offloaded has three columns of neurons, it can be divided into a two-layer model to be offloaded (DNN1, DNN2), denoted 1 ≤ z ≤ 2.
  • In the offline learning phase, under different computing resource loads of each edge computing node i, I, II and of the cloud server c itself, a generic single image feature data D1 is used as input, and the hierarchical computation delays required by each edge computing node, and by cloud server c, to execute each layer of the deep learning model are measured separately.
  • The hierarchical computation delays of each of the above nodes under the different computing resource loads are recorded in the offline sample data storage module of the deep reinforcement network.
  • Computing resource load includes: CPU load ⁇ , GPU load ⁇ and cache load ⁇ .
  • the hierarchical computing delay prediction module uses the sample data in the offline sample data storage module to perform multivariate nonlinear function fitting to obtain the hierarchical computing delay prediction model:
  • CTiz = f(αi, βi, γi)
  • The above formula gives, for any edge computing node i among the three edge computing nodes in the edge cluster, the computation delay CTiz of the z-th layer of the deep learning model (DNNz) when its CPU load, GPU load, and cache load are αi, βi, and γi respectively.
  • The trained hierarchical computation delay prediction model is stored in the hierarchical computation delay prediction module; CTIz = f(αI, βI, γI) and CTIIz = f(αII, βII, γII) are defined analogously.
  • CTcz = f(αc, βc, γc)
  • The above formula gives, for the cloud server c of the edge cluster, the computation delay CTcz of the z-th layer of the deep learning model (DNNz) when its CPU load, GPU load, and cache load are αc, βc, and γc respectively.
  • the trained hierarchical computing delay prediction model is stored in the hierarchical computing delay prediction module of each edge computing node.
  • the physical terminal preprocesses the computing task (image data) into image feature data D 1 with the same resolution and equal data volume, and loads it to the edge computing node i located in the same local area network as the current physical terminal.
  • the online decision delay statistics module of edge computing node i starts timing and dynamically sends the decision delay t to the decision information generation module (the decision delay t is the time from when edge computing node i receives the computing task to when it generates the hierarchical offloading strategy for the deep learning model);
  • The computing capability awareness module under the situational awareness center of edge computing node i and the computing capability awareness module of cloud server c pass the dynamically sensed computing resource loads of the edge computing nodes (bi, bI, bII) and of cloud server c (bc) to the hierarchical computation delay prediction module;
  • the network telemetry module passes the dynamically measured network bandwidths (ri, rI, rII, rc) and physical distances (liI, liII, lic, lI II, lIc, lIIc) of the regions where the edge computing nodes and the cloud server are located to the transmission delay calculation module;
  • the hierarchical computation delay prediction module combines the computing resource load of each edge computing node and of cloud server c with the pre-stored hierarchical computation delay prediction model to predict the theoretical hierarchical computation delay each edge computing node needs to compute each layer DNNz of the deep learning model;
  • Ti = Dz/ri and SiI = liI/C denote the data transmission delay Ti and propagation delay SiI required to transmit the image feature data Dz from edge computing node i to edge computing node I: the data transmission delay Ti depends on the image feature data Dz to be transmitted and on the network bandwidth ri of edge computing node i, while the propagation delay SiI depends on the channel length from edge computing node i to edge computing node I (estimated by the physical distance liI) and on the propagation rate of electromagnetic waves on the channel (estimated by the speed of light C).
  • Based on deep reinforcement learning technology, the decision information generation module takes as its basis the theoretical hierarchical computation delays (CTiz′, CTIz′, CTIIz′) required by each edge computing node to process each layer DNNz of the deep learning model, the theoretical computation delay CT′c required to compute the whole deep learning model on cloud server c alone, the theoretical data transmission delays, and the theoretical propagation delays, and, with the minimum task response delay TIME as the optimization goal, determines the hierarchical offloading strategy of the optimal deep learning model (different hierarchical offloading strategies correspond to different task response delays TIME, and the goal is to find the optimal one):
  • To avoid the search for the minimum task response delay TIME falling into over-optimization, the response delay TIMEc of using cloud server c alone is dynamically compared with the smallest response delay TIME of the hierarchical offloading model: if TIME is less than TIMEc, the hierarchical offloading model corresponding to the smallest response delay TIME is used as the hierarchical offloading strategy, and the data to be computed is offloaded with the goal of minimizing the response delay; otherwise, processing the data to be computed on cloud server c alone, corresponding to the response delay TIMEc, is used as the final hierarchical offloading strategy.
  • the decision information generation module transmits the generated optimal deep learning model hierarchical offloading strategy to the decision transceiver center (the hierarchical offloading strategy information includes the edge computing nodes participating in this calculation and the number of deep learning model layers that need to be calculated by the edge computing nodes).
  • the policy information is sent to the decision-making transceiver center of each edge computing node that needs to participate in the task calculation through the decision-making transceiver center.
  • the edge computing node starts task calculation according to the strategy. Task calculation results are sent directly to the physical terminal.
  • The online learning module of each edge computing node participating in the task collects the computing resource load (CPU load, GPU load, and cache load) and the actual computation delay during task execution, and transfers all of the above sample data to the hierarchical computation delay prediction module of edge computing node i, which updates the hierarchical computation delay prediction model for the current deep learning model; furthermore, all edge computing nodes share the updated hierarchical computation delay prediction model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a deep learning model reasoning acceleration method based on cloud-edge-end collaboration, and in particular, relates to a deep learning model hierarchical offloading method. According to the method, theoretical modeling is performed on a computing delay, a data transmission delay, a data propagation delay, and a model hierarchical offloading policy generation delay in a whole deep learning model reasoning process, and a hierarchical offloading policy of an optimal deep learning model is determined by using the minimum computing task response delay as an optimization target. Compared with a deep learning model execution framework dominated by a physical end and a deep learning model execution framework dominated by a cloud computing center, according to the present method, an edge computing paradigm and cloud computing are combined, and a deep learning model is hierarchically offloaded to different edge computing nodes, so that the computing task response delay is minimized when the computing precision is met.

Description

A deep learning model inference acceleration method based on cloud-edge-end collaboration
Technical field
The invention belongs to the field of cloud-edge-end collaborative computing, and specifically relates to a deep learning model inference acceleration method based on cloud-edge-end collaboration.
Background art
Intelligent applications based on deep learning models usually require substantial computation. There are currently two feasible solutions. One is the End-only mode, which uses simple models and lightweight deep learning frameworks, such as TensorFlow Lite or Caffe for Android, to perform all computation on the physical side; the other is the Cloud-only mode, which offloads all computing tasks to a cloud center with powerful computing capability to perform complex deep learning model computation. However, these approaches either reduce recognition accuracy, because only a simple model is deployed on the physical side, or incur excessive transmission delay, because the wide-area network link between the physical side and the cloud is unstable. It is therefore quite difficult to guarantee reasonable latency and recognition accuracy at the same time.
To overcome this tension between latency and recognition accuracy, a better solution is to leverage the edge computing paradigm. However, existing edge computing execution frameworks and offloading mechanisms for deep learning model inference still have limitations, because they ignore the characteristics of deep learning applications and the dynamics of the edge environment.
Summary of the invention
The purpose of the invention is to minimize the response delay of computing tasks while meeting the required computation accuracy, by combining the edge computing paradigm with cloud computing and offloading the deep learning model layer by layer to different edge computing nodes.
To achieve the above purpose, the present invention provides the following technical solution: a deep learning model inference acceleration method based on cloud-edge-end collaboration, where cloud-edge-end collaboration refers to a cloud server, at least two edge computing nodes communicating with the cloud server, and at least one physical terminal, and the communication distance between the physical terminal and an edge computing node is smaller than the distance between the edge computing node and the cloud server. The method includes the following steps:
Step S1: The physical terminal preprocesses the image data into image feature data D1 with the same resolution and equal data volume, feeds it into the partitioned DNN layers DNNz of the deep learning model to be offloaded, and uses the output of each layer as the input of the next layer, finally obtaining Dz;
Step S2: Offline learning phase: based on the preset load conditions of each edge computing node, take as input the process of each DNNz of the deep learning model to be offloaded processing the image feature data Dz on each edge computing node, and take as output the known computation delay of Dz passing through each DNNz on that node, and construct and train a hierarchical computation delay prediction model CT;
At the same time, based on the preset load condition of the cloud server, take as input the process of each DNNz of the deep learning model to be offloaded processing Dz on the cloud server, and take as output the known computation delay of Dz through each DNNz on the cloud server, and construct and train a cloud server computation delay prediction model CTc;
Step S3: According to the actual computing resource load of each edge computing node, the edge computing node responsible for the physical terminal's computing task applies the hierarchical computation delay prediction model CT, taking as input the process of each DNNz of the model to be offloaded processing Dz, and obtains as output the theoretical hierarchical computation delay CT′, i.e. the predicted delay of Dz passing through each DNNz on each edge computing node;
Step S4: Based on the known LAN bandwidth r of the edge computing nodes and the physical distance l between edge computing nodes, compute the data transmission delay T and propagation delay S required to transmit the image feature data Dz from the current edge computing node to other edge computing nodes; at the same time, based on the known cloud server network bandwidth rc and the physical distance lc between the task's edge computing node and the cloud server, compute the data transmission delay Tc and propagation delay Sc required to transmit the image feature data D1 from that edge computing node to the cloud server;
Step S5: Take the theoretical hierarchical computation delay CT′ of each edge computing node obtained in step S3 and the data transmission delay T and propagation delay S obtained in step S4 as input, and the corresponding response delay TIME as output, to construct the hierarchical offloading model of the deep learning model as follows:
TIME = F(CT′, T, S) + t,
and, with the minimum response delay TIME as the optimization objective, obtain the hierarchical offloading model of the deep learning model with the smallest response delay TIME, where t is the time from the edge computing node receiving the computing task sent by the physical terminal to generating the hierarchical offloading model;
Step S6: According to the cloud server computation delay prediction model CTc obtained in step S2 and the computing resource load of the cloud server, apply CTc, taking as input the process of each DNNz of the model to be offloaded processing Dz, to obtain as output the theoretical hierarchical computation delay CT′z of Dz passing through each DNNz on the cloud server; from these, compute the theoretical computation delay CT′c of processing the computing task on the cloud server alone, where CT′1 is the computation delay of passing D1 through DNN1, and then compute the response delay TIMEc of processing the image feature data on the cloud server alone as:
TIMEc = F(CT′c, Tc, Sc);
Step S7: Dynamically compare the response delay TIMEc of using the cloud server alone with the minimum response delay TIME of the hierarchical offloading model. If TIME is less than TIMEc, take the hierarchical offloading model corresponding to the minimum response delay TIME as the hierarchical offloading strategy and offload the data to be computed with the goal of minimizing the response delay; otherwise, take processing the data to be computed on the cloud server alone, corresponding to the response delay TIMEc, as the final hierarchical offloading strategy;
Step S8: Based on the hierarchical offloading strategy obtained in step S7, each edge computing node that executes the hierarchical offloading strategy collects its computing load during the computing task, and the method then returns to step S2.
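To make the decision procedure of steps S3 to S7 concrete, the following is a minimal sketch in Python, not the patent's implementation: it assumes the per-layer predicted delays CT′, the pairwise transmission delays T and propagation delays S, the cloud-only response delay TIMEc, and the decision delay t are already available as plain numbers, and it only enumerates strategies that run a prefix of the layers on one edge node and the remaining layers on another.

```python
from itertools import product

def response_time(layers_on_a, node_a, node_b, ct, T, S, t_decision):
    """Response delay TIME for running layers 1..layers_on_a on node_a and the
    remaining layers on node_b: compute on A, transfer the intermediate data
    once, then compute on B (plus the decision delay t)."""
    n_layers = len(ct[node_a])
    compute = sum(ct[node_a][:layers_on_a]) + sum(ct[node_b][layers_on_a:])
    needs_transfer = node_a != node_b and 0 < layers_on_a < n_layers
    transfer = T[(node_a, node_b)] + S[(node_a, node_b)] if needs_transfer else 0.0
    return compute + transfer + t_decision

def choose_strategy(nodes, ct, T, S, time_cloud, t_decision):
    """Enumerate two-node split strategies, keep the one with the smallest TIME,
    and fall back to cloud-only processing if that is not faster (step S7)."""
    n_layers = len(ct[nodes[0]])
    best = None
    for node_a, node_b in product(nodes, repeat=2):
        for k in range(n_layers + 1):
            time_k = response_time(k, node_a, node_b, ct, T, S, t_decision)
            if best is None or time_k < best[0]:
                best = (time_k, node_a, node_b, k)
    if best[0] < time_cloud:
        return {"mode": "edge", "nodes": (best[1], best[2]),
                "split_after_layer": best[3], "TIME": best[0]}
    return {"mode": "cloud_only", "TIME": time_cloud}

# Hypothetical numbers: predicted per-layer delays CT' on nodes i, I, II (seconds),
# pairwise transmission delay T and propagation delay S, and cloud-only delay TIMEc.
ct = {"i": [0.030, 0.045], "I": [0.020, 0.030], "II": [0.025, 0.035]}
T = {(a, b): 0.004 for a in ct for b in ct}
S = {(a, b): 0.0001 for a in ct for b in ct}
print(choose_strategy(list(ct), ct, T, S, time_cloud=0.080, t_decision=0.002))
```

Richer strategy spaces (more than two nodes per task, arbitrary per-layer placement) would enlarge the enumeration but leave the comparison against TIMEc in step S7 unchanged.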
Further, the partitioned DNN layers of the deep learning model to be offloaded are obtained as follows: the neurons of the input layer, hidden layers, and output layer of the model are arranged column by column and divided into n columns, giving separate neuron columns, each of which becomes one DNNz, where n is a positive integer.
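Purely as an illustration of this partitioning, and assuming PyTorch is available, the sketch below cuts a small hypothetical network into per-layer sub-models DNN1..DNNn and chains them so that the output Dz of one partition becomes the input of the next, as in step S1; the layer sizes are invented for the example.

```python
import torch
import torch.nn as nn

# A hypothetical model with an input layer, hidden layers, and an output layer.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # column 1
    nn.Linear(64, 32), nn.ReLU(),    # column 2
    nn.Linear(32, 10),               # column 3
)

# Split the model into per-column sub-models DNN_z that could be placed on
# different edge computing nodes.
dnn_parts = [model[0:2], model[2:4], model[4:5]]

# Chain the partitions: the output of DNN_z becomes the input of DNN_{z+1}.
d = torch.randn(1, 128)              # preprocessed feature data D1
with torch.no_grad():
    for part in dnn_parts:
        d = part(d)                  # D2, D3, ... in turn
print(d.shape)                       # final output after the last partition
```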
Further, the aforementioned step S1 is specifically: based on each DNNz of the partitioned deep learning model to be offloaded, take as input the process of each DNNz processing the image feature data Dz on each edge computing node, and as output the corresponding computation delay of Dz through each DNNz on that node, and construct for each edge computing node a hierarchical computation delay model CT = f(α, β, γ), where α is the preset CPU load, β the preset GPU load, and γ the preset cache load of the computing resources.
Further, in the aforementioned step S3, based on the known LAN bandwidth r of the edge computing nodes and the physical distance l between edge computing nodes, the data transmission delay T and the propagation delay S required for each edge computing node to transmit the image feature data Dz to other edge computing nodes are calculated according to the following formulas:
T = Dz/r,
S = l/C;
where the speed of light C represents the propagation rate of electromagnetic waves on the channel.
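A minimal numeric sketch of these two formulas, with hypothetical values for the data volume, bandwidth, and distance, could look as follows:

```python
SPEED_OF_LIGHT = 3.0e8  # m/s, propagation rate of electromagnetic waves on the channel

def transmission_delay(data_bits: float, bandwidth_bps: float) -> float:
    """T = Dz / r: time to push the image feature data onto the link."""
    return data_bits / bandwidth_bps

def propagation_delay(distance_m: float) -> float:
    """S = l / C: time for the signal to travel the physical distance."""
    return distance_m / SPEED_OF_LIGHT

# Hypothetical example: 2 MB of feature data over a 100 Mbit/s LAN, nodes 300 m apart.
d_z = 2 * 8 * 1e6          # bits
T = transmission_delay(d_z, 100e6)
S = propagation_delay(300.0)
print(f"T = {T*1e3:.2f} ms, S = {S*1e6:.2f} us")
```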
Further, each edge computing node includes a deep reinforcement network, a deep learning model, a situational awareness center, and a decision transceiver center.
The deep reinforcement network includes:
a hierarchical computation delay prediction module, used to calculate the theoretical hierarchical computation delays CT′ and CT′c and to store the hierarchical computation delay prediction model CT and the cloud server computation delay prediction model CTc;
a transmission delay calculation module, used to calculate the data transmission delay T and the propagation delay S;
an online decision delay statistics module, used to calculate the time t from the edge computing node receiving the computing task sent by the physical terminal to generating the hierarchical offloading model of the deep learning model;
an online learning module, used to collect the actual computing load and actual computation delay during computing tasks and pass them to the hierarchical computation delay prediction module of the edge computing node;
an offline sample data storage module, used to store, under the preset load conditions of each edge computing node and the cloud server, the computation delay of the image feature data Dz passing through each DNNz of the deep learning model to be offloaded on each edge computing node, and the computation delay of Dz passing through each DNNz of the model to be offloaded on the cloud server;
a decision information generation module, used to pass the generated final hierarchical offloading strategy to the decision transceiver center.
The situational awareness center includes:
an edge computing node computing capability awareness module, used to calculate the computing resource load of each edge computing node;
a cloud server computing capability awareness module, used to calculate the computing resource load of the cloud server;
a network telemetry module, used to calculate the network bandwidth r of the local area network where each edge computing node is located and the physical distance l between edge computing nodes.
The decision transceiver center is used to send and receive the final hierarchical offloading strategy.
Further, the cloud server includes a deep learning model and a decision transceiver center; the deep learning model is a trained deep learning model; the decision transceiver center is used to receive the final hierarchical offloading strategy. The situational awareness center includes a computing capability awareness module and a network telemetry module.
Compared with the prior art, the above technical solution of the present invention has the following beneficial effects:
(1) Unlike deep learning model execution frameworks dominated by the physical side or by a cloud computing center, this method combines the edge computing paradigm with cloud computing and offloads the deep learning model layer by layer to different edge computing nodes, fully exploiting the computing potential of the edge side and minimizing the response delay of computing tasks while satisfying the required computation accuracy.
(2) By theoretically modeling the computation delay, data transmission delay, data propagation delay, and hierarchical offloading strategy generation delay of the entire deep learning model inference process, and taking the minimum computing task response delay as the optimization objective, the method determines the hierarchical offloading strategy of the optimal deep learning model and ultimately accelerates inference of the deep learning model.
(3) The method starts from an offline learning phase; furthermore, it can update the hierarchical computation delay prediction model in real time, based on the computing resource load and computation delay actually measured for each task, to optimize the decision-making process for hierarchical offloading of the deep learning model.
(4) By hierarchically offloading the deep learning model to computing nodes such as edge computing nodes and cloud servers, the collaborative inference approach effectively ensures the security of the computed data and reduces network bandwidth occupancy.
Brief description of the drawings
Figure 1 is a technical schematic diagram of the present invention.
Figure 2 is a schematic diagram of the module composition of the deep reinforcement network of the present invention.
Figure 3 is a schematic diagram of the hierarchical offloading of the deep learning model of the present invention.
Figure 4 is a flow chart of the method of the present invention.
Detailed description of embodiments
In order to better understand the technical content of the present invention, specific embodiments are described below in conjunction with the accompanying drawings.
Aspects of the present invention are described herein with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. The embodiments of the present invention are not limited to those shown in the drawings. It should be understood that the present invention can be implemented through any of the concepts and embodiments introduced above and described in detail below, since the disclosed concepts and embodiments are not limited to any particular implementation. In addition, some aspects disclosed herein may be used alone or in any appropriate combination with other disclosed aspects.
As shown in Figure 1, at least two edge computing nodes are provided within the communication range of the cloud server c. The edge computing nodes are deployed on WiFi access points or base stations, and at least one physical terminal is set up within the local area network where each edge computing node is located; the distance between each edge computing node and each physical terminal within its communication range is smaller than the distance between the edge computing node and the cloud server. For any edge computing node i within the communication range of cloud server c, the total number of other edge computing nodes within the communication range of node i whose physical distance from it is less than a preset distance is recorded as N, with 1 ≤ j ≤ N, where j is the index of each such edge computing node; these N edge computing nodes together with edge computing node i form an edge cluster. A deep learning model and a decision transceiver center are deployed on the cloud server c; a deep reinforcement network, a deep learning model, a situational awareness center, and a decision transceiver center are deployed on each edge computing node.
As shown in Figure 2, a deep reinforcement network is deployed on each edge computing node. The deep reinforcement network includes a hierarchical computation delay prediction module, a transmission delay calculation module, an online decision delay statistics module, an online learning module, an offline sample data storage module, and a decision information generation module; with the goal of minimizing the computing task response delay TIME, it jointly considers the data transmission delay T, the data propagation delay S, the hierarchical computation delay CT of the deep learning model, and the decision delay t, to find the optimal strategy for hierarchically offloading the deep learning model to the individual computing nodes and so achieve rapid inference. The hierarchical computation delay prediction module calculates the theoretical hierarchical computation delay; the transmission delay calculation module calculates the data transmission delay T and propagation delay S; the online decision delay statistics module calculates the time t from the edge computing node receiving the computing task sent by the physical terminal to generating the hierarchical offloading model of the deep learning model; the online learning module collects the actual computing load and actual computation delay during task execution and passes them to the hierarchical computation delay prediction module of the edge computing node. The actual computation delay refers to the computation delay of the image feature data Dz passing through each DNNz of the deep learning model to be offloaded on each edge computing node while that node executes the computing task.
The offline sample data storage module is used to store the hierarchical computation delay prediction model CT; the decision information generation module is used to pass the generated final hierarchical offloading strategy to the decision transceiver center. The deep learning model is a trained deep learning model. The situational awareness center includes a computing capability awareness module and a network telemetry module; the computing capability awareness module is used to calculate the computing resource load of each edge computing node; the network telemetry module is used to calculate the network bandwidth r of the local area network where each edge computing node is located and the physical distance l between edge computing nodes. The decision transceiver center is used to receive the final hierarchical offloading strategy.
Cloud server c includes a deep learning model and a decision transceiver center; the deep learning model is a trained deep learning model; the decision transceiver center is used to receive the final hierarchical offloading strategy. The situational awareness center includes a computing capability awareness module and a network telemetry module.
As shown in Figure 3, the deep learning model has a multi-layer structure. The neurons of the input layer, hidden layers, and output layer of the model to be offloaded are arranged column by column and divided into n columns, giving separate neuron columns, each of which becomes one DNNz, where n is a positive integer.
As shown in Figure 4, for any edge computing node i within the communication range of cloud server c, assume the total number of other edge computing nodes within the communication range of node i whose physical distance from it is less than the preset distance is 2, and let I and II denote these two edge computing nodes. These two edge computing nodes together with edge computing node i form an edge cluster, i.e. the edge cluster contains three edge computing nodes.
Assuming the deep learning model to be offloaded has three columns of neurons, it can be divided into a two-layer model to be offloaded (DNN1, DNN2), denoted 1 ≤ z ≤ 2.
In the offline learning phase, under different computing resource loads of each edge computing node i, I, II and of the cloud server c itself, a generic single image feature data D1 is used as input, and the hierarchical computation delays CTiz, CTIz, CTIIz required by each edge computing node, and the hierarchical computation delay CTcz required by cloud server c, to execute each layer of the deep learning model are measured separately. The hierarchical computation delays of each of the above nodes under the different computing resource loads are recorded in the offline sample data storage module of the deep reinforcement network.
The computing resource load includes the CPU load α, the GPU load β, and the cache load γ.
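The offline sampling described above can be pictured with the hypothetical sketch below: it times each sub-model DNNz once on a single generic input D1 and records the result together with the load triple (α, β, γ). The load values are passed in rather than measured here, since how each node reports CPU, GPU, and cache load is deployment-specific, and dnn_parts refers to the partitioned sub-models from the earlier sketch.

```python
import time

def measure_layer_delays(dnn_parts, d1, load_triple):
    """Run D1 through DNN_1..DNN_n once and record (alpha, beta, gamma, z, CT_z)
    rows for the offline sample data storage module."""
    alpha, beta, gamma = load_triple      # CPU, GPU, and cache load at measurement time
    rows, d = [], d1
    for z, part in enumerate(dnn_parts, start=1):
        start = time.perf_counter()
        d = part(d)                       # layer-z inference
        ct_z = time.perf_counter() - start
        rows.append((alpha, beta, gamma, z, ct_z))
    return rows

# Usage sketch: samples = measure_layer_delays(dnn_parts, d1, (0.35, 0.10, 0.50))
# would be appended to the offline sample store and repeated under other loads.
```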
Next, based on deep reinforcement learning technology, the hierarchical computation delay prediction module uses the sample data in the offline sample data storage module to perform multivariate nonlinear function fitting and obtain the hierarchical computation delay prediction model:
CTiz = f(αi, βi, γi)
The above formula gives, for any edge computing node i among the three edge computing nodes in the edge cluster, the computation delay CTiz of the z-th layer of the deep learning model (DNNz) when its CPU load, GPU load, and cache load are αi, βi, and γi respectively. The trained hierarchical computation delay prediction model is stored in the hierarchical computation delay prediction module. CTIz = f(αI, βI, γI) and CTIIz = f(αII, βII, γII) are defined analogously.
CTcz = f(αc, βc, γc)
The above formula gives, for the cloud server c of the edge cluster, the computation delay CTcz of the z-th layer of the deep learning model (DNNz) when its CPU load, GPU load, and cache load are αc, βc, and γc respectively. The trained hierarchical computation delay prediction model is stored in the hierarchical computation delay prediction module of each edge computing node.
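A minimal sketch of such a multivariate nonlinear fit, assuming SciPy is available and assuming, purely for illustration, a linear form with one interaction term for f, is shown below; the patent does not fix a particular functional form, so the model family and the sample values here are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def delay_model(loads, c0, c1, c2, c3, c4):
    """Hypothetical form of CT_z = f(alpha, beta, gamma), used only for illustration."""
    alpha, beta, gamma = loads
    return c0 + c1 * alpha + c2 * beta + c3 * gamma + c4 * alpha * beta

# Offline samples for one layer z of one node: load triples and measured delays (seconds).
alpha = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.7])
beta  = np.array([0.1, 0.3, 0.5, 0.7, 0.6, 0.2])
gamma = np.array([0.3, 0.2, 0.6, 0.4, 0.5, 0.8])
ct    = np.array([0.021, 0.034, 0.055, 0.071, 0.042, 0.049])

params, _ = curve_fit(delay_model, (alpha, beta, gamma), ct)

# Predict the theoretical hierarchical computation delay CT'_z under a new load.
ct_pred = delay_model((0.5, 0.4, 0.5), *params)
print(f"predicted CT'_z = {ct_pred*1e3:.1f} ms")
```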
After the offline learning phase, task computation can be performed.
Based on image compression and image segmentation techniques, the physical terminal preprocesses the computing task (image data) into image feature data D1 with the same resolution and equal data volume and loads it onto the edge computing node i located in the same local area network as the physical terminal; the online decision delay statistics module of edge computing node i starts timing and dynamically sends the decision delay t to the decision information generation module (the decision delay t is the time from when edge computing node i receives the computing task to when it generates the hierarchical offloading strategy for the deep learning model).
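The preprocessing into fixed-resolution, equal-volume feature data D1 could be sketched as follows, assuming Pillow and NumPy are available; the target resolution and the normalization are illustrative choices rather than values specified by the patent:

```python
import numpy as np
from PIL import Image

def preprocess(image_path, size=(224, 224)):
    """Turn raw image data into feature data D1 with a fixed resolution and a
    fixed data volume, so every task offloaded to node i has the same shape."""
    img = Image.open(image_path).convert("RGB").resize(size)
    d1 = np.asarray(img, dtype=np.float32) / 255.0   # normalize to [0, 1]
    return d1

# d1 = preprocess("frame_0001.jpg")  # shape (224, 224, 3), then sent to edge node i
```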
The computing capability awareness module under the situational awareness center of edge computing node i and the computing capability awareness module of cloud server c pass the dynamically sensed computing resource loads of the edge computing nodes (bi, bI, bII) and of cloud server c (bc) to the hierarchical computation delay prediction module; the network telemetry module passes the dynamically measured network bandwidths (ri, rI, rII, rc) and physical distances (liI, liII, lic, lI II, lIc, lIIc) of the regions where the edge computing nodes and the cloud server are located to the transmission delay calculation module.
The hierarchical computing delay prediction module combines the computing resource loads of the edge computing nodes and of cloud server c with the pre-stored hierarchical computing delay prediction model to predict the theoretical hierarchical computing delays needed by each edge computing node to compute each layer DNN_z of the deep learning model (CT_iz′, CT_Iz′, CT_IIz′) and the theoretical computing delay (CT_c′) needed to perform the entire deep learning model computation on cloud server c alone; these theoretical computing delay results are passed synchronously to the decision information generation module. The transmission delay calculation module takes the input image feature data D_1 as the reference to estimate the theoretical data transmission delays of each edge computing node (T_i, T_I, T_II) and the theoretical propagation delays (S_iI, S_iII, S_ic, S_I,II, S_Ic, S_IIc); these theoretical delay results are likewise passed synchronously to the decision information generation module:

T_i = D_z / r_i,  S_iI = l_iI / C

The above expressions give the data transmission delay T_i and the propagation delay S_iI required to transmit the image feature data D_z from edge computing node i to edge computing node I. The data transmission delay T_i depends on the image feature data D_z to be transmitted and on the network bandwidth r_i of edge computing node i; the propagation delay S_iI depends on the channel length from edge computing node i to edge computing node I (estimated by the physical distance l_iI) and on the propagation speed of electromagnetic waves over the channel (estimated by the speed of light C).
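A small numeric example of these two formulas is sketched below; the data volume, bandwidth, and distance values are assumed for illustration and are not measurements from the invention.

```python
# Minimal sketch: T_i = D_z / r_i (data transmission delay) and S_iI = l_iI / C
# (propagation delay). All numeric values below are illustrative assumptions.
C = 3.0e8  # propagation speed over the channel, estimated by the speed of light (m/s)

def transmission_delay(data_bits: float, bandwidth_bps: float) -> float:
    """Time needed to push the data onto the link, in seconds."""
    return data_bits / bandwidth_bps

def propagation_delay(distance_m: float, speed_mps: float = C) -> float:
    """Time for the signal to travel the channel length, in seconds."""
    return distance_m / speed_mps

D_z = 2.0e6      # 2 Mbit of image feature data (assumed)
r_i = 100.0e6    # 100 Mbit/s LAN bandwidth at edge node i (assumed)
l_iI = 1500.0    # 1.5 km between edge node i and edge node I (assumed)

T_i = transmission_delay(D_z, r_i)     # 0.02 s
S_iI = propagation_delay(l_iI)         # 5e-6 s
print(f"T_i = {T_i * 1e3:.2f} ms, S_iI = {S_iI * 1e6:.2f} us")
```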
The remaining transmission and propagation delays are obtained in the same way. Based on deep reinforcement learning techniques, the decision information generation module takes as its basis the theoretical hierarchical computing delays needed by each edge computing node to process each layer DNN_z of the deep learning model (CT_iz′, CT_Iz′, CT_IIz′), the theoretical computing delay needed to perform the entire deep learning model computation on cloud server c alone, the theoretical data transmission delays (T_i, T_I, T_II), and the theoretical propagation delays (S_iI, S_iII, S_ic, S_I,II, S_Ic, S_IIc); taking minimization of the task response delay TIME as the optimization objective, it determines the optimal hierarchical offloading strategy for the deep learning model (different hierarchical offloading strategies correspond to different task response delays TIME, and the goal is to find the optimal hierarchical offloading strategy). Further, during generation of the hierarchical offloading strategy of the deep learning model, to keep the search for the minimum task response delay TIME from falling into over-optimization, the response delay TIMEc obtained when cloud server c is used alone, i.e. (CT_c′ + T_i + S_ic), is dynamically compared with the smallest response delay TIME of the hierarchical offloading model of the deep learning model. If TIME is less than TIMEc, the hierarchical offloading model of the deep learning model corresponding to the minimum response delay TIME is adopted as the hierarchical offloading strategy, and the offloaded computation of the data to be computed is completed with the objective of minimizing the response delay; otherwise, processing the data to be computed on cloud server c alone, corresponding to the response delay TIMEc, is adopted as the final hierarchical offloading strategy, and the offloaded computation of the data to be computed is completed so as to minimize the response delay.
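To make this optimization concrete, the sketch below enumerates candidate layer split points between two edge computing nodes, picks the split with the smallest response delay TIME, and then applies the comparison with the cloud-only delay TIMEc described above. The exhaustive enumeration is a deliberately simplified stand-in for the deep reinforcement learning search of the invention, it considers only two edge nodes, and all delay values are assumed inputs.

```python
# Minimal sketch: pick the layer split point between edge node i and edge node I that
# minimises the response delay TIME, then compare TIME with the cloud-only delay TIMEc.
# Exhaustive enumeration is used here as a simplified stand-in for the deep
# reinforcement learning search described in the text; all delays are assumed inputs.

def best_two_node_split(ct_i, ct_I, T_i, S_iI, t_decision):
    """ct_i[z], ct_I[z]: theoretical per-layer delays (ms) on nodes i and I.
    Layers 0..k-1 run on node i, layers k..Z-1 run on node I (k = Z keeps all on i)."""
    Z = len(ct_i)
    best_time, best_k = float("inf"), 0
    for k in range(Z + 1):
        compute = sum(ct_i[:k]) + sum(ct_I[k:])
        transfer = 0.0 if k == Z else (T_i + S_iI)  # one i -> I handover if node I computes anything
        time_k = compute + transfer + t_decision
        if time_k < best_time:
            best_time, best_k = time_k, k
    return best_k, best_time

ct_i = [4.0, 6.0, 9.0, 12.0]        # assumed per-layer delays on node i (ms)
ct_I = [7.0, 7.0, 7.0, 7.0]         # assumed per-layer delays on node I (ms)
k, TIME = best_two_node_split(ct_i, ct_I, T_i=3.0, S_iI=0.01, t_decision=1.0)

TIMEc = 30.0 + 5.0 + 0.05           # assumed CT_c' + T_i + S_ic for the cloud-only case
strategy = f"split at layer {k}" if TIME < TIMEc else "cloud-only"
print(f"TIME = {TIME:.2f} ms, TIMEc = {TIMEc:.2f} ms -> strategy: {strategy}")
```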
The decision information generation module passes the generated optimal hierarchical offloading strategy of the deep learning model to the decision transceiver center (the hierarchical offloading strategy information includes the edge computing nodes participating in this computation and the number of deep learning model layers each of those edge computing nodes needs to compute), and the decision transceiver center sends the strategy information to the decision transceiver centers of the edge computing nodes that need to participate in this task computation; the edge computing nodes then begin the task computation according to the strategy. The task computation results are sent directly to the physical terminal.
The online learning module of each edge computing node participating in the task computation collects that node's computing resource load (CPU load, GPU load, and cache load) and actual computing delay during the task computation, and passes all of these sample data to the hierarchical computing delay prediction module of edge computing node i, which updates the hierarchical computing delay prediction model for the current deep learning model; further, all edge computing nodes share the updated hierarchical computing delay prediction model.
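As a sketch of this online refinement, the snippet below first fits a placeholder predictor (standing in for the offline-trained model) and then incrementally updates it with newly measured (load, actual delay) samples; the regressor choice and all sample values are assumptions of the sketch.

```python
# Minimal sketch: online update of a per-layer delay predictor with samples
# (CPU, GPU, cache load -> actual delay) collected while a real task was computed.
# The initial model and all sample values are illustrative placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
hist_loads = rng.uniform(0.0, 1.0, size=(200, 3))
hist_delay = 5.0 + 20.0 * hist_loads[:, 0] * hist_loads[:, 1] + 8.0 * hist_loads[:, 2] ** 2

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=1)
model.fit(hist_loads, hist_delay)           # stands in for the offline-trained predictor

# New samples reported by the online learning module after task execution.
new_loads = np.array([[0.72, 0.41, 0.55],
                      [0.30, 0.25, 0.80]])
new_delay = np.array([18.4, 9.7])           # actual measured per-layer delays (ms)

# Incremental refinement; the updated model is then shared with the other edge nodes.
model.partial_fit(new_loads, new_delay)
print("refreshed predictions:", np.round(model.predict(new_loads), 2))
```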
Although the present invention has been described above with preferred embodiments, they are not intended to limit the present invention. Those of ordinary skill in the technical field to which the present invention belongs may make various changes and refinements without departing from the spirit and scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the claims.

Claims (6)

  1. A deep learning model inference acceleration method based on cloud-edge-end collaboration, wherein the cloud-edge-end collaboration refers to a cloud server, at least two edge computing nodes communicating with the cloud server, and at least one physical terminal, the communication distance between the physical terminal and an edge computing node being smaller than the distance between the edge computing node and the cloud server, characterized in that the method comprises the following steps:
    Step S1: the physical terminal preprocesses the image data into image feature data D_1 with identical resolution and equal data volume, and feeds it into the divided DNN layers of the deep learning model to be offloaded, namely DNN_z, taking the output of each layer as the input of the next layer, finally obtaining D_z;
    Step S2: performing an offline learning phase: based on preset computing resource loads of each edge computing node, with the process of each DNN_z of the deep learning model to be offloaded on each edge computing node processing the image feature data D_z as input, and the known computing delay of the image feature data D_z through each DNN_z of the deep learning model to be offloaded on each edge computing node as output, constructing and training a hierarchical computing delay prediction model CT;
    meanwhile, based on preset computing resource loads of the cloud server, with the process of each DNN_z of the deep learning model to be offloaded on the cloud server processing the image feature data D_z as input, and the known computing delay of the image feature data D_z through each DNN_z of the deep learning model to be offloaded on the cloud server as output, constructing and training a cloud server computing delay prediction model CT_c;

    Step S3: according to the actual computing resource load of each edge computing node, the edge computing node corresponding to the computing task of the physical terminal applies the hierarchical computing delay prediction model CT, with the process of each DNN_z of the deep learning model to be offloaded processing the image feature data D_z as input, to obtain as output the theoretical hierarchical computing delay CT′, i.e. the computing delay of the image feature data D_z through each DNN_z of the deep learning model to be offloaded on each edge computing node;
    Step S4: based on the known local area network bandwidth r of the edge computing nodes and the physical distance l between the edge computing nodes, calculating the data transmission delay T and propagation delay S required to transmit the image feature data D_z from the current edge computing node to the other edge computing nodes; meanwhile, based on the known cloud server network bandwidth r_c and the physical distance l_c between the edge computing node handling the computing task and the cloud server, calculating the data transmission delay T_c and propagation delay S_c required to transmit the image feature data D_1 from the edge computing node handling the computing task to the cloud server;
    Step S5: with the theoretical hierarchical computing delay CT′ of each edge computing node obtained in step S3 and the data transmission delay T and propagation delay S obtained in step S4 as input, and the corresponding response delay TIME as output, constructing the hierarchical offloading model of the deep learning model as follows:

    TIME = F(CT′, T, S) + t,

    and, with the minimum response delay TIME as the optimization objective, obtaining the hierarchical offloading model of the deep learning model with the minimum response delay TIME, where t is the time from the edge computing node receiving the computing task sent by the physical terminal to its generating the hierarchical offloading model of the deep learning model;
    Step S6: according to the cloud server computing delay prediction model CT_c obtained in step S2 and the actual computing resource load of the cloud server, applying the hierarchical computing delay prediction model CT_c, with the process of each DNN_z of the deep learning model to be offloaded processing the image feature data D_z as input, to obtain as output the theoretical hierarchical computing delay CT_z′, i.e. the computing delay of the image feature data D_z through each DNN_z of the deep learning model to be offloaded on the cloud server; then, according to the following formula:

    calculating the theoretical computing delay CT_c′ incurred when the cloud server alone processes the computing task, where CT_1′ is the computing delay of passing D_1 through DNN_1; and then calculating, according to the following formula, the response delay TIMEc of processing the image feature data D_z when the cloud server is used alone:

    TIMEc = F(CT_c′, T_c, S_c);
    Step S7: dynamically comparing the response delay TIMEc when the cloud server is used alone with the minimum response delay TIME of the hierarchical offloading model of the deep learning model; if TIME is less than TIMEc, adopting the hierarchical offloading model of the deep learning model corresponding to the minimum response delay TIME as the hierarchical offloading strategy, and completing the offloaded computation of the data to be computed with the objective of minimizing the response delay; otherwise, adopting processing of the data to be computed on the cloud server alone, corresponding to the response delay TIMEc, as the final hierarchical offloading strategy, and completing the offloaded computation of the data to be computed so as to minimize the response delay;
    Step S8: based on the hierarchical offloading strategy obtained in step S7, each edge computing node executing the hierarchical offloading strategy collects its computing load and actual computing delay during the computing task, and the method then returns to step S2.
  2. The deep learning model inference acceleration method based on cloud-edge-end collaboration according to claim 1, characterized in that the divided DNN layers of the deep learning model to be offloaded are obtained as follows: the neurons contained in the hidden layers, the input layer, and the output layer of the deep learning model to be offloaded are divided into n columns, each column of neurons forming a separate layer, thereby obtaining DNN_z;

    n is a positive integer.
  3. The deep learning model inference acceleration method based on cloud-edge-end collaboration according to claim 2, characterized in that step S1 is specifically as follows:
    based on each divided DNN_z of the deep learning model to be offloaded, with the process of each DNN_z of the deep learning model to be offloaded on each edge computing node processing the image feature data D_z as input, and the computing delay corresponding to the image feature data D_z passing through each DNN_z of the deep learning model to be offloaded on each edge computing node as output, the hierarchical computing delay model of each edge computing node is constructed as follows: CT = f(α, β, γ); where α is the preset CPU load of the computing resource load condition, β is the preset GPU load of the computing resource load condition, and γ is the preset cache load of the computing resource load condition.
  4. The deep learning model inference acceleration method based on cloud-edge-end collaboration according to claim 3, characterized in that, in step S3, based on the known local area network bandwidth r of the edge computing nodes and the physical distance l between the edge computing nodes, according to the following formulas:

    T = D_z / r,
    S = l / C;

    the data transmission delay T and the propagation delay S required by each edge computing node to transmit the image feature data D_z to the other edge computing nodes are calculated respectively; where the speed of light C represents the propagation speed of electromagnetic waves over the channel.
  5. The deep learning model inference acceleration method based on cloud-edge-end collaboration according to claim 4, characterized in that the edge computing node comprises a deep reinforcement network, a situation awareness center, and a decision transceiver center;

    wherein the deep reinforcement network comprises:

    a hierarchical computing delay prediction module, configured to calculate the theoretical hierarchical computing delays CT′ and CT_c′, and to store the hierarchical computing delay prediction model CT and the cloud server computing delay prediction model CT_c;

    a transmission delay calculation module, configured to calculate the data transmission delay T and the propagation delay S;

    an online decision-making delay statistics module, configured to calculate the time t from the edge computing node receiving the computing task sent by the physical terminal to its generating the hierarchical offloading model of the deep learning model;

    an online learning module, configured to collect the actual computing load and actual computing delay data during the computing task and pass them to the hierarchical computing delay prediction module of the edge computing node;

    an offline sample data storage module, configured to store, for each edge computing node and the cloud server under preset load conditions, the computing delay corresponding to the image feature data D_z passing through each DNN_z of the deep learning model to be offloaded on each edge computing node, and the computing delay corresponding to the image feature data D_z passing through each DNN_z of the deep learning model to be offloaded on the cloud server;

    a decision information generation module, configured to pass the generated final hierarchical offloading strategy to the decision transceiver center;

    the situation awareness center comprises:

    an edge computing node computing capability awareness module, configured to calculate the computing resource load of each edge computing node;

    a cloud server computing capability awareness module, configured to calculate the computing resource load of the cloud server;

    a network telemetry module, configured to calculate the network bandwidth r of the local area network where each edge computing node is located, and to calculate the physical distance l between the edge computing nodes;

    and the decision transceiver center is configured to send and receive the final hierarchical offloading strategy.
  6. The deep learning model inference acceleration method based on cloud-edge-end collaboration according to claim 5, characterized in that the cloud server includes a deep learning model and a decision transceiver center; the deep learning model is a trained deep learning model; and the decision transceiver center is configured to receive the final hierarchical offloading strategy.
PCT/CN2023/098730 2022-08-11 2023-06-07 Deep learning model reasoning acceleration method based on cloud-edge-end collaboration WO2024032121A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210961978.9 2022-08-11
CN202210961978.9A CN115034390B (en) 2022-08-11 2022-08-11 Deep learning model reasoning acceleration method based on cloud edge-side cooperation

Publications (1)

Publication Number Publication Date
WO2024032121A1 true WO2024032121A1 (en) 2024-02-15

Family

ID=83130472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098730 WO2024032121A1 (en) 2022-08-11 2023-06-07 Deep learning model reasoning acceleration method based on cloud-edge-end collaboration

Country Status (2)

Country Link
CN (1) CN115034390B (en)
WO (1) WO2024032121A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034390B (en) * 2022-08-11 2022-11-18 南京邮电大学 Deep learning model reasoning acceleration method based on cloud edge-side cooperation
CN115562760B (en) * 2022-11-22 2023-05-30 南京邮电大学 Deep learning model layered unloading method based on edge computing node classification table
CN116894469B (en) * 2023-09-11 2023-12-15 西南林业大学 DNN collaborative reasoning acceleration method, device and medium in end-edge cloud computing environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3489865B1 (en) * 2017-11-22 2021-01-06 Commissariat à l'énergie atomique et aux énergies alternatives A stdp-based learning method for a network having dual accumulator neurons
CN114153572A (en) * 2021-10-27 2022-03-08 中国电子科技集团公司第五十四研究所 Calculation unloading method for distributed deep learning in satellite-ground cooperative network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200272896A1 (en) * 2019-02-25 2020-08-27 Alibaba Group Holding Limited System for deep learning training using edge devices
CN110309914A (en) * 2019-07-03 2019-10-08 中山大学 Deep learning model reasoning accelerated method based on Edge Server Yu mobile terminal equipment collaboration
CN111242282A (en) * 2020-01-09 2020-06-05 中山大学 Deep learning model training acceleration method based on end edge cloud cooperation
KR20220061827A (en) * 2020-11-06 2022-05-13 한국전자통신연구원 Adaptive deep learning inference apparatus and method in mobile edge computing
CN114422349A (en) * 2022-03-30 2022-04-29 南京邮电大学 Cloud-edge-end-collaboration-based deep learning model training and reasoning architecture deployment method
CN115034390A (en) * 2022-08-11 2022-09-09 南京邮电大学 Deep learning model reasoning acceleration method based on cloud edge-side cooperation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117834643A (en) * 2024-03-05 2024-04-05 南京邮电大学 Deep neural network collaborative reasoning method for industrial Internet of things
CN117834643B (en) * 2024-03-05 2024-05-03 南京邮电大学 Deep neural network collaborative reasoning method for industrial Internet of things

Also Published As

Publication number Publication date
CN115034390A (en) 2022-09-09
CN115034390B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
WO2024032121A1 (en) Deep learning model reasoning acceleration method based on cloud-edge-end collaboration
Liu et al. FedCPF: An efficient-communication federated learning approach for vehicular edge computing in 6G communication networks
CN112202928B (en) Credible unloading cooperative node selection system and method for sensing edge cloud block chain network
Sun et al. Dynamic digital twin and federated learning with incentives for air-ground networks
CN114584581B (en) Federal learning system and federal learning training method for intelligent city internet of things (IOT) letter fusion
WO2023109699A1 (en) Multi-agent communication learning method
Li et al. Joint optimization of computation cost and delay for task offloading in vehicular fog networks
He et al. Resource allocation based on digital twin-enabled federated learning framework in heterogeneous cellular network
Ma et al. Joint scheduling and resource allocation for efficiency-oriented distributed learning over vehicle platooning networks
CN114357676A (en) Aggregation frequency control method for hierarchical model training framework
CN116187429A (en) End Bian Yun collaborative synchronization federal learning training algorithm based on segmentation learning
Wang et al. A credibility-aware swarm-federated deep learning framework in internet of vehicles
Wang et al. Eidls: An edge-intelligence-based distributed learning system over internet of things
Cui et al. Multi-Agent Reinforcement Learning Based Cooperative Multitype Task Offloading Strategy for Internet of Vehicles in B5G/6G Network
Ni et al. An effective hybrid V2V/V2I transmission latency method based on LSTM neural network
Zheng et al. A distributed learning architecture for semantic communication in autonomous driving networks for task offloading
CN114205251B (en) Switch link resource prediction method based on space-time characteristics
CN115150288B (en) Distributed communication system and method
CN116260821A (en) Distributed parallel computing unloading method based on deep reinforcement learning and blockchain
Deng et al. On dynamic resource allocation for blockchain assisted federated learning over wireless channels
Wang et al. Knowledge selection and local updating optimization for federated knowledge distillation with heterogeneous models
CN112910716B (en) Mobile fog calculation loss joint optimization system and method based on distributed DNN
CN114022731A (en) Federal learning node selection method based on DRL
Zhang et al. On-Device Intelligence for 5G RAN: Knowledge Transfer and Federated Learning Enabled UE-Centric Traffic Steering
Shi et al. Multi-UAV-assisted computation offloading in DT-based networks: A distributed deep reinforcement learning approach

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851349

Country of ref document: EP

Kind code of ref document: A1