CN117114113B - Collaborative reasoning acceleration method based on queuing theory - Google Patents

Collaborative reasoning acceleration method based on queuing theory

Info

Publication number
CN117114113B
Authority
CN
China
Prior art keywords: reasoning, edge server, task, model, representing
Legal status: Active
Application number
CN202311378988.0A
Other languages
Chinese (zh)
Other versions
CN117114113A (en)
Inventor
郭永安
齐帅
王宇翱
钱琪杰
白晨浩
Current Assignee: Nanjing University of Posts and Telecommunications
Original Assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2023-12-29
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202311378988.0A
Publication of CN117114113A
Application granted
Publication of CN117114113B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of edge computing and relates to a collaborative reasoning acceleration method based on queuing theory, comprising the following steps: step 1, establishing a task attribute model; step 2, establishing a communication model; step 3, deciding, according to the current queue state information, whether to upload directly to the cloud server: if so, jumping to step 7, otherwise jumping to step 4; step 4, partitioning the DNN model; if the task does not trigger the regret mechanism at this time, jumping to step 5, otherwise jumping to step 6; step 5, based on step 4, the edge server cooperatively executes the reasoning task, then jumps to step 8; step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7; step 7, the model is uploaded to the cloud server, the aggregation of the reasoning results is completed, and the flow enters step 8; step 8, obtaining the model partitioning strategy and the total reasoning delay at this time; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay are output. The model partition point is optimized in combination with the queue state information, thereby minimizing the system reasoning delay.

Description

Collaborative reasoning acceleration method based on queuing theory
Technical Field
The invention belongs to the technical field of edge computing, and particularly relates to a collaborative reasoning acceleration method based on queuing theory.
Background
Currently, research on queuing theory receives wide attention and application in the field of mobile edge computing. Meanwhile, as technologies such as machine learning and artificial intelligence are integrated into queuing-theory research, the performance of mobile edge computing systems is further improved; in particular, queuing theory plays a key role in reducing task processing delay and optimizing resource allocation in mobile edge computing scenarios.
Common queuing models include First-Come First-Served (FCFS), Last-Come First-Served (LCFS), and priority-based queuing systems. When a large number of tasks wait for processing at an edge server, the waiting time of a task in the queue may become excessively long, so that the task exceeds its delay tolerance and processing fails. To address this problem, a regret (reneging) mechanism is introduced into the queue; however, in existing studies the reneging behaviour of customers is random, and no timely adjustment decision is made according to the queuing situation in the current system.
In research on collaborative reasoning, most model partitioning schemes only consider the transmission delay and the processing delay of the reasoning model, while the queuing delay is ignored even though it is usually not negligible in the whole reasoning process.
Therefore, aiming at the above problems, the invention provides a collaborative reasoning acceleration method based on queuing theory, which optimizes the model partition point while taking into account the waiting delay of a reasoning task in the queue and triggering a regret mechanism according to the length of that waiting delay, finally achieving the goal of minimizing the delay.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a collaborative reasoning acceleration method based on queuing theory: after a reasoning task reaches an edge server, it first hesitates at the edge server and decides whether to upload to the cloud; after entering the waiting queue of the edge server, it decides whether to renege (regret); combined with model partitioning, the whole collaborative reasoning acceleration process is finally realized.
In order to achieve the above functions, the present invention designs a collaborative reasoning acceleration scenario based on impatient queuing, which includes a plurality of terminal devices, a plurality of edge servers and a cloud server. Because the physical distance between the terminal devices and the edge servers is much shorter than that to the cloud server, a terminal device preferentially uploads a DNN (Deep Neural Network) model to an edge server to complete reasoning; when the load of the edge server is too large, uploading to the remote cloud is considered to complete the collaborative reasoning process. The method includes the following steps:
step 1, constructing an architecture system comprising terminal devices, edge servers and a cloud server; collecting task data generated by the terminal devices, establishing a task attribute model, and simultaneously calculating the reasoning delay of each layer of the DNN model required to complete the reasoning task; the set of terminal devices i and the set of edge servers m are defined;
Step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal equipment and the edge server and between the edge server and the cloud server, and uploading an reasoning task to the edge server;
step 3, monitoring and evaluating the load condition and queue state information at the current edge server; the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information; if so, jumping to step 7, otherwise jumping to step 4;
step 4, the reasoning task enters the queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the per-layer reasoning delay obtained in step 1, obtaining a partitioning strategy that divides the model into a DNN shallow section and a DNN deep section; if the reasoning task does not trigger the regret mechanism at this time, jump to step 5, otherwise jump to step 6;
step 5, based on the partitioning strategy obtained in step 4, the edge server cooperatively executes the reasoning task on the DNN shallow section and the DNN deep section, the queue state information at this moment is collected, and the flow jumps to step 8;
step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7;
step 7, the model is uploaded to the cloud server to complete reasoning, the aggregation of the reasoning results is completed at the cloud server, and the flow enters step 8;
step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, comparing the total reasoning delay T of the current round with the reasoning delay T' obtained in the previous round; when T < T', iterate back to step 3; when the total reasoning delay no longer decreases, output the optimal model partitioning strategy and the minimum reasoning delay result.
Further, the step 1 of collecting task data generated by the terminal device, establishing a task attribute model, and simultaneously calculating an inference delay of each layer of the DNN model required for completing an inference task, specifically includes the following steps:
step 11, define the set of all tasks generated by the terminal devices; a reasoning task k is described as k = (s_k, W_k, D_k), where s_k represents the model split point of reasoning task k, W_k represents the data size of the reasoning task in MBytes, and D_k represents the maximum tolerable delay of reasoning task k;
step 12, calculating, under the given load of the initial edge server, the reasoning delay of each layer of the DNN model required to complete the reasoning task, wherein the DNN model required for completing reasoning task k has L_k layers and 0 < l_k ≤ L_k is satisfied.
An initial partitioning strategy is determined according to the obtained per-layer reasoning delay of the DNN model and the dependency among the DNN layers (dependent layers cannot be divided from each other); the reasoning delay T0 corresponding to the initial partitioning strategy can then be compared with the reasoning delay T1 obtained in the first iteration, so as to carry out the iterative optimization of the subsequent rounds.
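For illustration only, the task attribute model and the per-layer delay profiling of step 1 can be sketched in Python as follows; the field names, the run_layer callback and the profiling loop are assumptions of this sketch, not part of the patented method:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class InferenceTask:
    """Task attribute model k = (s_k, W_k, D_k) from step 11 (field names are illustrative)."""
    split_point: int        # s_k: current model split point (layer index)
    data_size_mb: float     # W_k: input data size in MBytes
    max_delay_s: float      # D_k: maximum tolerable delay in seconds

def profile_layers(run_layer: Callable[[int, object], Tuple[object, float]],
                   num_layers: int, sample: object) -> List[float]:
    """Measure the per-layer inference delay on the edge server under its initial load.

    run_layer(l, x) is assumed to execute layer l on input x and return
    (output, elapsed_seconds); only the elapsed times are kept for partitioning.
    """
    delays, x = [], sample
    for layer in range(num_layers):
        x, dt = run_layer(layer, x)
        delays.append(dt)
    return delays
```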
Further, in the step 2, the network bandwidth of the current wireless link is monitored, and meanwhile, a communication model between the terminal device and the edge server and between the edge server and the cloud server is established, and the reasoning task is uploaded to the edge server, which specifically includes the following steps:
step 21, assuming the total bandwidth is B, that the associated channel is frequency non-selective, and that the channel gain remains constant and can be accurately estimated by the edge server during the transmission of each reasoning task; correspondingly, the instantaneous uplink transmission rate from terminal device i to edge server m during the upload of the reasoning task is obtained by

r_{i,m} = b_i · log2(1 + p_i · h_{i,m} / σ²),

where b_i represents the bandwidth allocated to terminal device i; p_i represents the transmission power of terminal device i; the scalar h_{i,m} represents the uplink channel gain between terminal device i and edge server m; and σ² represents the variance of the additive white Gaussian noise (AWGN);

owing to the limitation of network resources, the sum of the bandwidth resources allocated to the terminal devices i satisfies the constraint

Σ_i b_i ≤ B;

the instantaneous uplink transmission rate achieved between edge server m and the cloud server is expressed as

r_{m,c} = b_c · log2(1 + p_m · h_{m,c} / σ²),

where b_c represents the bandwidth resource allocated to the edge server; p_m represents the transmission power from edge server m to the cloud server; and h_{m,c} represents the uplink channel gain between edge server m and the cloud server, including the path loss and small-scale fading in the communication link;
step 23, the transmission delay from terminal device i to edge server m is

T^t_{i,m} = W_k / r_{i,m},

where W_k represents the data size of the reasoning task;

step 24, the transmission delay from edge server m to the cloud server is defined as

T^t_{m,c} = ρ_mc · W^s_k / r_{m,c},

where W^s_k represents the data size of reasoning task k after partitioning and ρ_mc represents the decision on whether to upload to the cloud server.
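A minimal Python sketch of the communication model of step 2, assuming the Shannon-type rate expression reconstructed above and purely illustrative numbers for bandwidth, power, channel gain, noise and data sizes:

```python
import math

def uplink_rate(bandwidth_hz: float, tx_power_w: float,
                channel_gain: float, noise_var: float) -> float:
    """Shannon-type instantaneous rate r = b * log2(1 + p * h / sigma^2)."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_var)

def transmission_delay(data_size_bits: float, rate_bps: float) -> float:
    """Transmission delay = data size / achievable rate."""
    return data_size_bits / rate_bps

# Illustrative numbers only: device i -> edge m, then edge m -> cloud for the deep section.
r_im = uplink_rate(bandwidth_hz=5e6,  tx_power_w=0.2, channel_gain=1e-6, noise_var=1e-9)
r_mc = uplink_rate(bandwidth_hz=20e6, tx_power_w=1.0, channel_gain=1e-7, noise_var=1e-9)
t_up_edge  = transmission_delay(4.0 * 8e6, r_im)   # W_k = 4 MBytes of task data
t_up_cloud = transmission_delay(1.5 * 8e6, r_mc)   # partitioned data, charged only if uploaded
```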
Further, in step 3, the load condition and the queue state information at the current edge server are monitored and evaluated, the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information; the specific steps are as follows:
step 31, the queue state information Q includes the position of the task in the current queue and the service rate of the edge server in the current queue, which can be specifically expressed as

Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate.
Step 32, a decision is made as to whether to upload to the cloud server according to the current queue status information, which is defined as:
wherein,representing the delay of the reasoning task to wait in line at the edge server m;representing the processing delay of the inference task at the edge server m;representation ofThe transmission delay from the edge server to the cloud server is used as an reasoning task; because the cloud server has sufficient computing resources and has high processing speed, processing time delay on the cloud server is not considered;indicating that the reasoning task completes reasoning at the edge server;uploading the representation to a cloud server;
step 33, the queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; its value affects the decision on whether the DNN model is uploaded to the cloud server.
Further, when the reasoning task does not make a regret decision after waiting in the queue, the task is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the data size W^s_k of the partitioned reasoning task k and the service rate μ_m of edge server m.
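The hesitation decision of step 3 can be illustrated with the following Python sketch; the processing-delay model (partitioned data size divided by the service rate μ_m) and the numeric values are assumptions of this sketch, and only the comparison rule for ρ_mc follows the description above:

```python
from typing import List

def queuing_delay(proc_delays_ahead: List[float]) -> float:
    """T^w_{k,m}: total processing delay of the p_k - 1 tasks queued ahead of task k."""
    return sum(proc_delays_ahead)

def processing_delay(partitioned_size_mb: float, service_rate_mb_s: float) -> float:
    """Assumed edge processing delay: partitioned data size W^s_k over service rate mu_m."""
    return partitioned_size_mb / service_rate_mb_s

def upload_decision(wait_delay: float, proc_delay: float, cloud_tx_delay: float) -> int:
    """rho_mc: 0 -> finish inference at the edge, 1 -> upload directly to the cloud server."""
    return 0 if wait_delay + proc_delay <= cloud_tx_delay else 1

# Example: three tasks already queued ahead; compare the edge path with the cloud path.
t_wait = queuing_delay([0.12, 0.08, 0.20])
t_proc = processing_delay(partitioned_size_mb=1.5, service_rate_mb_s=25.0)
rho_mc = upload_decision(t_wait, t_proc, cloud_tx_delay=0.30)   # -> 1 here: hand off to the cloud
```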
Further, the task in step 4 enters a queuing waiting queue of the edge server, and simultaneously performs DNN model partitioning according to the current queue state information and the reasoning delay of each layer obtained in step 1 to obtain a partitioning strategy, and the specific steps are as follows:
step 41, partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
step 42, the model partition point of the DNN model required to complete the reasoning task k generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, meaning that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively.
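A hedged Python sketch of how a partition point could be chosen from the per-layer delays; the delay estimate, the intermediate-output sizes and the 0-based indexing are simplifications of this sketch rather than the patented formulation:

```python
from typing import List

def edge_cloud_delay(layer_delays: List[float], inter_sizes_mb: List[float],
                     s: int, t_wait: float, r_mc_mb_s: float) -> float:
    """Rough delay estimate for split point s (a sketch, not the patented formula).

    Layers 0..s run on the edge server after waiting t_wait in the queue; the
    intermediate output of layer s (inter_sizes_mb[s]) is then sent to the cloud,
    which runs the remaining layers (cloud compute time neglected, as in step 32).
    """
    edge_part = sum(layer_delays[: s + 1])
    upload = inter_sizes_mb[s] / r_mc_mb_s if s + 1 < len(layer_delays) else 0.0
    return t_wait + edge_part + upload

def best_split(layer_delays: List[float], inter_sizes_mb: List[float],
               t_wait: float, r_mc_mb_s: float) -> int:
    """Enumerate the candidate split points and keep the one with the smallest delay."""
    return min(range(len(layer_delays)),
               key=lambda s: edge_cloud_delay(layer_delays, inter_sizes_mb, s,
                                              t_wait, r_mc_mb_s))
```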
Further, in step 9, the total reasoning delay T of the current round is compared with the reasoning delay T' obtained in the previous round; when T < T', the flow iterates back to step 3; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay result are output, specifically as follows:
the inference delay of the DNN model is defined as the following problem:
minimize the total reasoning delay T_total,k, subject to
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,
wherein s_k represents the model split point of reasoning task k; T^t_{i,m} represents the transmission delay from terminal device i to edge server m; T^w_{k,m} represents the queuing waiting delay at edge server m; T^p_{k,m} represents the processing delay of reasoning task k at edge server m; and T_total,k is composed of these transmission, queuing and processing delay terms, together with the transmission delay to the cloud server when uploading occurs;
(a) indicates that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the architecture system; (b) indicates that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance; (c) indicates that the model partition point of the DNN model must be valid so that the model can be effectively partitioned; (d) indicates that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server;
and continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
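The iterative procedure of steps 3 to 9 can be summarised by the following Python sketch, where evaluate_round is a hypothetical callback standing in for one full pass of hesitation, partitioning, possible regret and execution:

```python
from typing import Callable, Tuple

def optimize_partition(evaluate_round: Callable[[], Tuple[int, float]],
                       max_rounds: int = 50) -> Tuple[int, float]:
    """Outer loop of steps 3-9: iterate until the total reasoning delay stops decreasing.

    evaluate_round() is assumed to return (split_point, total_delay) for one round.
    """
    best_split, best_delay = evaluate_round()
    for _ in range(max_rounds - 1):
        split, delay = evaluate_round()
        if delay < best_delay:          # T < T': keep the better strategy and iterate
            best_split, best_delay = split, delay
        else:                           # delay no longer decreases: stop and output
            break
    return best_split, best_delay
```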
The invention has the following beneficial effects: 1. The invention is oriented to mobile edge computing scenarios and incorporates regret (reneging) behaviour from queuing theory. Different from the random reneging introduced in prior studies, the model partitioning strategy is made and the regret mechanism is triggered according to the current queue state information. On this basis, a collaborative reasoning acceleration method based on impatient queuing of reasoning tasks is designed, which takes the current load state of the edge server into account while fully utilizing the computing resources of the cloud server, finally minimizing the system reasoning delay and effectively completing the collaborative reasoning acceleration process.
2. According to the position of the DNN model in the waiting queue and the dynamic load condition of the edge server in the current system, the method intelligently selects a suitable computing node and optimizes the partitioning strategy, improving the overall reasoning efficiency of the system and adapting better to real scenarios.
Drawings
FIG. 1 is a scene model diagram for an MEC network environment in the present invention;
FIG. 2 is a diagram of the overall framework of the present invention;
FIG. 3 is a flow chart of a collaborative reasoning acceleration method based on queuing theory.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to be limiting.
With the increasing complexity and computational demands of deep learning models, it is often difficult for a single compute node to meet the demands of real-time reasoning tasks. Therefore, joint collaborative reasoning of multiple computing nodes becomes an important solution. Meanwhile, queuing theory is used as a mathematical tool, so that the design of an elastic collaborative reasoning system can be facilitated, the system configuration can be adaptively adjusted according to the load condition and the resource availability of the current computing node, the stability and the robustness of the system can be improved, and the system can be adapted to different working scenes and the change of loads.
Based on this, queuing theory is introduced into the research of collaborative reasoning, and a collaborative reasoning acceleration method based on impatient queuing is designed for the cellular network scenario.
Referring to fig. 1, in combination with current practical applications, a cellular network is used to realize multi-cell coverage, specifically covering multiple production and daily-life aspects such as smartphone communication services, robot-arm detection and quality-inspection services, and real-time road-condition monitoring. Specifically, the scene model includes a set of terminal devices, where each terminal device i is responsible for a task consisting of a single inference model (i.e. a DNN model); because the computing power of the terminal devices is limited, the present invention does not consider local computation on the terminal device. The set of edge servers m is also defined, and the scene model further comprises a cloud server.
Based on the architecture system, the invention provides a collaborative reasoning acceleration method based on queuing theory, as shown in fig. 3, and the operation mechanism is as follows:
step 1, collecting task data generated by terminal equipment, establishing a task attribute model, and simultaneously calculating reasoning delay of each layer of DNN model required by completing tasks;
each terminal deviceMultiple consecutive reasoning tasks k of the same type can be generated, so that DNN models required for the reasoning tasks generated for the same terminal device are the same, andis the set of overall tasks that are generated in the architecture system. Distinguishing these heterogeneous tasks based on their computational properties, in particular, inference task k is described asWhereinModel segmentation points representing an reasoning task k;the data size (in MBytes) representing the reasoning task;representing the maximum tolerable delay of the reasoning task k. We assume that heterogeneous tasks differ in latency tolerance, separating tasks into latency sensitive tasks and latency tolerant tasks, such as: the judgment of the road condition information in intelligent driving has extremely low time delay tolerance, and the road condition information needs to be acquired in real time and decided in time to avoid traffic accidents; the processing of photo information in the daily photographing process has higher delay tolerance;
the inference delay for each layer of the DNN model required to complete a task based on the initial edge server given the load can be expressed asWherein DNN reasoning models required by the reasoning task k are as followsLayer and satisfyThe method comprises the steps of carrying out a first treatment on the surface of the Determining an initial partition strategy according to the reasoning delay of each layer of the obtained DNN model and the correlation among the layers of DNN;
step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal device and the edge server and a communication model between the edge server and the cloud server, and uploading the reasoning task to the edge server;
the use of Orthogonal Frequency Division Multiple Access (OFDMA) to allocate frequency band resources to terminal devices is considered in the architecture system. The number of sub-carriers is assumed to be sufficiently large so that the division of bandwidth is approximately continuous. Let the architecture system total bandwidth be B, assuming that the associated channel is frequency non-selective, and the channel gain can remain constant and accurately estimated by the edge server during the transmission of each inference task. Correspondingly, slave terminals during the upload of the reasoning tasksThe achievable instantaneous uplink transmission rate to the edge server m is obtained by:
wherein,presentation of an allocation to a terminal deviceIs a bandwidth of (a);indicating terminal equipmentIs set to the transmission power of (a); scalar quantityRepresented at terminal equipmentAnd the uplink channel gain between edge server m.
For the limitation of network resources, each terminal device is allocatedThe bandwidth resource sum of (a) satisfies the following constraint:
similarly, the instantaneous uplink transmission rate achievable between edge server m and cloud server can be expressed as:
wherein,representing bandwidth resources allocated to the edge servers;representing the transmit power from edge server m to cloud server;representing uplink channel gain between edge server m and cloud server, where path loss and small scale fading in the communication link have been included;representing the variance of Additive White Gaussian Noise (AWGN).
By terminal equipmentTransmission delay to edge server m:
wherein,representing slave terminal deviceData transfer rate to edge server m;
the transmission delay from the edge server m to the cloud server is defined as:
wherein,representing the data upload rate from edge server m to cloud server.
Step 3, monitoring and evaluating the load condition and queue state information at the current edge server, hesitating an reasoning task uploaded to the edge server, making a decision whether to directly upload the DNN model to the cloud server according to the current queue state information, if so, jumping to step 7, otherwise jumping to step 4;
The queue state information Q includes the position of the reasoning task in the current queue and the service rate of the edge server in the current queue, which can be specifically expressed as

Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate.
A decision on whether to upload to the cloud server is made according to the current queue state information:

ρ_mc = 0 if T^w_{k,m} + T^p_{k,m} ≤ T^t_{m,c}, and ρ_mc = 1 otherwise,

where T^w_{k,m} represents the delay of the reasoning task waiting in the queue at edge server m; T^p_{k,m} represents the processing delay of the reasoning task at edge server m; and T^t_{m,c} represents the transmission delay of the reasoning task from the edge server to the cloud server; because the cloud server has abundant computing resources and a high processing speed, the processing delay at the cloud server is not considered;
The queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; this value determines the decision on whether the model is uploaded to the cloud server.
When the reasoning task does not make a regret decision after queuing, it is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the partitioned data size W^s_k of reasoning task k and the service rate μ_m of the edge server, where μ_m is related to the load condition of the current edge server and is expressed in terms of f_m, the computing power of edge server m, n_m, the number of terminal devices connected to the current edge server m, and an introduced compensation function;
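The exact expressions for μ_m and the processing delay appear only in the drawings; a minimal Python sketch consistent with the stated dependencies (computing power f_m, number of attached devices n_m, and a compensation term) might look as follows, with the concrete formulas being assumptions of this sketch rather than the patented ones:

```python
def service_rate(f_m: float, n_m: int, compensation: float = 0.0) -> float:
    """Assumed load-dependent service rate mu_m: grows with computing power f_m,
    shrinks as more devices n_m attach, plus an additive compensation term."""
    return f_m / max(n_m, 1) + compensation

def edge_processing_delay(partitioned_size_mb: float, mu_m: float) -> float:
    """Assumed processing delay of the partitioned task at edge server m."""
    return partitioned_size_mb / mu_m
```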
step 4, the reasoning task enters the queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the per-layer reasoning delay obtained in step 1 to obtain a partitioning strategy; if the task does not trigger the regret mechanism at this time, jump to step 5, otherwise jump to step 6;
the partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
The model partition point of the DNN reasoning task generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, meaning that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively;
step 5, based on the partitioning strategy obtained in step 4, the edge server cooperatively executes the reasoning task on the DNN shallow section and the DNN deep section, the queue state information at this moment is collected, and the flow jumps to step 8;
step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7;
Specifically, in connection with FIG. 2, the regret decision is made as follows: when the sum of the waiting delay of the reasoning task in the current queue and the processing delay at the edge server becomes larger than the transmission delay of uploading the reasoning task to the cloud server, the regret mechanism is triggered for the deep section of the model, and the reasoning task is withdrawn from the waiting queue and uploaded to the cloud server to complete the reasoning process;
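A small Python sketch of the regret (reneging) check as it might be re-evaluated while the task waits in the edge queue; estimate_wait is a hypothetical callback and the polling loop is an assumption of this sketch:

```python
from typing import Callable

def wait_and_maybe_renege(estimate_wait: Callable[[int], float], proc_delay: float,
                          cloud_tx_delay: float, poll_steps: int = 100) -> str:
    """Re-check the regret condition each time the queue state is refreshed.

    estimate_wait(t) is assumed to return the task's remaining queuing delay at
    poll step t (it can grow if the edge load rises); the first time the edge
    path becomes slower than shipping the deep section to the cloud, the task reneges.
    """
    for t in range(poll_steps):
        remaining = estimate_wait(t)
        if remaining + proc_delay > cloud_tx_delay:
            return "renege: withdraw from the queue, upload the deep section (step 6 -> 7)"
        if remaining == 0.0:
            break
    return "served at the edge according to the current partition (step 5)"
```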
step 7, the model is uploaded to the cloud server to complete reasoning, and the aggregation of the reasoning results is completed at the cloud server before entering step 8 (there are two possibilities for the model arriving here: 1. in step 3, after hesitation at the edge server, the reasoning model is uploaded directly to the cloud server for processing; 2. in step 6, the deep section of the model whose regret mechanism has been triggered is uploaded);
Step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, the total reasoning delay T obtained in the current iteration round is compared with the reasoning delay T' obtained in the previous iteration round; when T < T', the flow iterates back to step 3; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay are output;
specifically, the optimization objective of the iterative process is as follows:
minimize the total reasoning delay T_total,k, subject to:
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,
wherein (a) means that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the system, which ensures the bandwidth limitation of the system; (b) means that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance, ensuring that the reasoning task is successfully processed by the server within the specified time and thus guaranteeing the success rate of task processing; (c) means that the model partition point of the DNN model should be valid so that the model can be effectively partitioned; (d) means that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server.
And continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Meanwhile, the above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (7)

1. The collaborative reasoning acceleration method based on queuing theory is characterized by comprising the following steps:
step 1, constructing an architecture system comprising terminal devices, edge servers and a cloud server; collecting task data generated by the terminal devices, establishing a task attribute model, and simultaneously calculating the reasoning delay of each layer of the DNN model required to complete the reasoning task; the set of terminal devices i and the set of edge servers m are defined;
Step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal equipment and the edge server and between the edge server and the cloud server, and uploading an reasoning task to the edge server;
step 3, monitoring and evaluating the load condition and queue state information at the current edge server, hesitating an reasoning task uploaded to the edge server, making a decision whether to directly upload the DNN model to the cloud server according to the current queue state information, if so, jumping to step 7, otherwise jumping to step 4;
step 4, the reasoning task enters a queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the reasoning delay of each layer obtained in the step 1, so that a partitioning strategy is obtained and the partitioning strategy is divided into a DNN model shallow section part and a DNN model deep section part; if the reasoning task does not trigger the regret mechanism at this time, the step 5 is skipped, otherwise the step 6 is skipped;
step 5, based on the partition strategy obtained in the step 4, the edge server cooperates with the DNN model shallow section part and the DNN model deep section part to execute reasoning tasks, and collects queue state information at the moment, and the step 8 is skipped;
step 6, when the regret mechanism is triggered, skipping the deep section part of the DNN model to step 7;
step 7, uploading the model to a cloud server to complete reasoning, and completing aggregation of reasoning results at the cloud server to enter step 8;
step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, comparing the total reasoning delay T of the current round with the reasoning delay T' obtained in the previous round, and iterating back to step 3 when T < T'; when the total reasoning delay is no longer reduced, outputting the optimal model partitioning strategy and the minimum reasoning delay result.
2. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the task data generated by the terminal device in step 1 is collected, a task attribute model is built, and the reasoning delay of each layer of the DNN model required for completing the reasoning task is calculated, which specifically comprises the following steps:
step 11, defining the set of all tasks generated by the terminal devices; a reasoning task k is described as k = (s_k, W_k, D_k), where s_k represents the model split point of reasoning task k, W_k represents the data size of the reasoning task in MBytes, and D_k represents the maximum tolerable delay of reasoning task k;
step 12, calculating, under the given load of the initial edge server, the reasoning delay of each layer of the DNN model required to complete the reasoning task, wherein the DNN model required for completing reasoning task k has L_k layers and 0 < l_k ≤ L_k is satisfied;
And determining an initial partitioning strategy according to the reasoning delay of each layer of the obtained DNN model and the correlation among the layers of the DNN.
3. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the monitoring of the network bandwidth of the current wireless link in step 2 simultaneously establishes a communication model between the terminal device and the edge server and between the edge server and the cloud server, and uploads a reasoning task to the edge server, specifically comprising the following steps:
step 21, assuming the total bandwidth is B, that the associated channel is frequency non-selective, and that the channel gain remains constant and can be accurately estimated by the edge server during the transmission of each reasoning task; accordingly, the instantaneous uplink transmission rate achieved from terminal device i to edge server m during the upload of the reasoning task is obtained by

r_{i,m} = b_i · log2(1 + p_i · h_{i,m} / σ²),

where b_i represents the bandwidth allocated to terminal device i; p_i represents the transmission power of terminal device i; the scalar h_{i,m} represents the uplink channel gain between terminal device i and edge server m; and σ² represents the variance of the additive white Gaussian noise (AWGN);
owing to the limitation of network resources, the sum of the bandwidth resources allocated to the terminal devices i satisfies the constraint

Σ_i b_i ≤ B;

the instantaneous uplink transmission rate achieved between edge server m and the cloud server is expressed as

r_{m,c} = b_c · log2(1 + p_m · h_{m,c} / σ²),

where b_c represents the bandwidth resource allocated to the edge server; p_m represents the transmission power from edge server m to the cloud server; and h_{m,c} represents the uplink channel gain between edge server m and the cloud server, including the path loss and small-scale fading in the communication link;
step 23, the transmission delay from terminal device i to edge server m is

T^t_{i,m} = W_k / r_{i,m},

where W_k represents the data size of the reasoning task;
step 24, the transmission delay from edge server m to the cloud server is defined as

T^t_{m,c} = ρ_mc · W^s_k / r_{m,c},

where W^s_k represents the data size of reasoning task k after partitioning and ρ_mc represents the decision on whether to upload to the cloud server.
4. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein in step 3 the load condition and the queue state information at the current edge server are monitored and evaluated, the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information, with the specific steps as follows:
step 31, the queue status information Q includes the position of the task in the current queue and the service rate of the edge server in the current queue, which may be specifically expressed as:
Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate;
step 32, a decision on whether to upload to the cloud server is made according to the current queue state information, defined as

ρ_mc = 0 if T^w_{k,m} + T^p_{k,m} ≤ T^t_{m,c}, and ρ_mc = 1 otherwise,

where T^w_{k,m} represents the delay of the reasoning task waiting in the queue at edge server m; T^p_{k,m} represents the processing delay of the reasoning task at edge server m; and T^t_{m,c} represents the transmission delay of the reasoning task from the edge server to the cloud server; because the cloud server has abundant computing resources and a high processing speed, the processing delay at the cloud server is not considered; ρ_mc = 0 means that the reasoning task completes reasoning at the edge server; ρ_mc = 1 represents uploading to the cloud server;
step 33, the queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; its value affects the decision on whether the DNN model is uploaded to the cloud server.
5. The collaborative reasoning acceleration method based on queuing theory according to claim 4, wherein, when the reasoning task does not make a regret decision after waiting in the queue, the task is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the data size W^s_k of the partitioned reasoning task k and the service rate μ_m of edge server m.
6. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the task in step 4 enters a queuing waiting queue of an edge server, and simultaneously performs DNN model partitioning according to current queue state information and each layer of reasoning delay obtained in step 1 to obtain a partitioning strategy, and the specific steps are as follows:
step 41, partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
step 42, the model partition point of the DNN model required for completing the reasoning task k generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, representing that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively.
7. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein, in step 9, the total reasoning delay T of the current round is compared with the reasoning delay T' obtained in the previous round, and the flow iterates back to step 3 when T < T'; when the total reasoning delay is no longer reduced, the optimal model partitioning strategy and the minimum reasoning delay result are output, with the specific steps as follows:
the inference delay of the DNN model is defined as the following problem:
minimize the total reasoning delay T_total,k, subject to
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,

wherein s_k represents the model split point of reasoning task k; T^t_{i,m} represents the transmission delay from terminal device i to edge server m; T^w_{k,m} represents the queuing waiting delay at edge server m; T^p_{k,m} represents the processing delay of reasoning task k at edge server m; T^t_{m,c} represents the transmission delay from edge server m to the cloud server; L_k represents the number of layers of the DNN model required by task k; the total task set is the set of all tasks generated by the terminal devices; ρ_mc represents the decision on whether to upload to the cloud server; μ_m represents the service rate of the current edge server m; and μ_max represents the maximum service rate of the current edge server m;
(a) indicates that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the architecture system; (b) indicates that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance; (c) indicates that the model partition point of the DNN model must be valid so that the model can be effectively partitioned; (d) indicates that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server;
and continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
CN202311378988.0A 2023-10-24 2023-10-24 Collaborative reasoning acceleration method based on queuing theory Active CN117114113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311378988.0A CN117114113B (en) 2023-10-24 2023-10-24 Collaborative reasoning acceleration method based on queuing theory


Publications (2)

Publication Number Publication Date
CN117114113A CN117114113A (en) 2023-11-24
CN117114113B true CN117114113B (en) 2023-12-29

Family

ID=88798755


Country Status (1)

Country Link
CN (1) CN117114113B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107071879A (en) * 2015-12-24 2017-08-18 英特尔公司 Uplink channel interference management in shared spectrum network
CN110309914A (en) * 2019-07-03 2019-10-08 中山大学 Deep learning model reasoning accelerated method based on Edge Server Yu mobile terminal equipment collaboration
CN110532078A (en) * 2019-08-29 2019-12-03 中国科学院软件研究所 A kind of edge calculations method for optimizing scheduling and system
CN111176820A (en) * 2019-12-31 2020-05-19 中科院计算技术研究所大数据研究院 Deep neural network-based edge computing task allocation method and device
KR102165864B1 (en) * 2019-07-22 2020-10-14 성균관대학교산학협력단 Methods and apparatuses for packet scheduling for software defined networking in edge computing environment
CN112348172A (en) * 2020-11-13 2021-02-09 之江实验室 Deep neural network collaborative reasoning method based on end edge cloud architecture
CN112926660A (en) * 2021-02-26 2021-06-08 武汉大学 Water level identification system and method with cooperative end edges
CN113485803A (en) * 2021-06-29 2021-10-08 天津大学 Self-adaptive packaging and collaborative reasoning method under task flow field scene with time delay constraint
CN114422349A (en) * 2022-03-30 2022-04-29 南京邮电大学 Cloud-edge-end-collaboration-based deep learning model training and reasoning architecture deployment method
CN114723057A (en) * 2022-03-31 2022-07-08 北京理工大学 Neural network collaborative reasoning method for multi-access edge computing system
CN115629865A (en) * 2022-12-20 2023-01-20 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Deep learning inference task scheduling method based on edge calculation
CN116455768A (en) * 2023-06-16 2023-07-18 南京邮电大学 Cloud edge end collaborative CNN reasoning method and system for global time delay optimization
CN116663644A (en) * 2023-06-08 2023-08-29 中南大学 Multi-compression version Yun Bianduan DNN collaborative reasoning acceleration method
CN116915869A (en) * 2023-08-14 2023-10-20 南京信息工程大学 Cloud edge cooperation-based time delay sensitive intelligent service quick response method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Dynamic Regret of Randomized Online Service Caching in Edge Computing; Siqi Fan et al.; arXiv; 1-10 *
Impatient Queuing for Intelligent Task Offloading in Multiaccess Edge Computing; Bin Han et al.; IEEE Transactions on Wireless Communications, vol. 22, no. 1; 59-72 *
Latency-Aware Collaborative Perception; Zixing Lei et al.; arXiv; 1-17 *
CNN inference acceleration framework based on edge-end collaboration; 郭永安 et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol. 43, no. 3; 68-77 *
Balanced allocation method of wireless network resources based on edge computing; 刘长青; Information & Computer (Theoretical Edition), no. 14, 2023; 12-14 *
Research on cloud-edge task migration and scheduling for deep learning; 顾冰雪; China Master's Theses Full-text Database, Information Science and Technology, no. 12, 2022; I139-422 *

Also Published As

Publication number Publication date
CN117114113A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN111372314A (en) Task unloading method and task unloading device based on mobile edge computing scene
CN110839184B (en) Method and device for adjusting bandwidth of mobile fronthaul optical network based on flow prediction
CN110798849A (en) Computing resource allocation and task unloading method for ultra-dense network edge computing
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN114138373B (en) Edge computing task unloading method based on reinforcement learning
CN110839075A (en) Service migration method based on particle swarm in edge computing environment
CN110753319B (en) Heterogeneous service-oriented distributed resource allocation method and system in heterogeneous Internet of vehicles
CN113156992B (en) Three-layer architecture collaborative optimization method for unmanned aerial vehicle in edge environment
CN114422349B (en) Cloud-edge-end-collaboration-based deep learning model training and reasoning architecture deployment method
CN114285853A (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN113961264B (en) Intelligent unloading algorithm and system for video monitoring cloud edge cooperation
CN114745383A (en) Mobile edge calculation assisted multilayer federal learning method
CN113573363B (en) MEC calculation unloading and resource allocation method based on deep reinforcement learning
CN115665227B (en) Universal heterogeneous integrated computing network resource intelligent adaptation network architecture and method
CN116916386A (en) Large model auxiliary edge task unloading method considering user competition and load
CN115802370A (en) Communication method and device
CN117114113B (en) Collaborative reasoning acceleration method based on queuing theory
CN117156492A (en) Deep reinforcement learning-based dual-time-scale resource allocation method for joint service caching, communication and calculation
CN111930435A (en) Task unloading decision method based on PD-BPSO technology
CN117580063A (en) Multi-dimensional resource collaborative management method in vehicle-to-vehicle network
CN114928611B (en) IEEE802.11p protocol-based energy-saving calculation unloading optimization method for Internet of vehicles
CN116017570A (en) Edge computing system resource management method based on block chain
CN116016380A (en) Shi Min service heterogeneous network resource allocation method, system, equipment and medium
CN113452625A (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN116668447B (en) Edge computing task unloading method based on improved self-learning weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant