CN117114113B - Collaborative reasoning acceleration method based on queuing theory - Google Patents

Collaborative reasoning acceleration method based on queuing theory

Info

Publication number
CN117114113B
Authority
CN
China
Prior art keywords: reasoning, edge server, task, model, representing
Legal status: Active
Application number
CN202311378988.0A
Other languages
Chinese (zh)
Other versions
CN117114113A (en)
Inventor
郭永安
齐帅
王宇翱
钱琪杰
白晨浩
Current Assignee: Nanjing University of Posts and Telecommunications
Original Assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2023-12-29
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202311378988.0A
Publication of CN117114113A
Application granted
Publication of CN117114113B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of edge computing and relates to a collaborative reasoning acceleration method based on queuing theory, comprising the following steps: step 1, establishing a task attribute model; step 2, establishing a communication model; step 3, deciding, according to the current queue state information, whether to upload directly to the cloud server: if so, jumping to step 7, otherwise jumping to step 4; step 4, partitioning the DNN model; if the task does not trigger the regret mechanism at this time, jumping to step 5, otherwise jumping to step 6; step 5, based on step 4, the edge server cooperatively executes the reasoning task, then jumps to step 8; step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7; step 7, the model is uploaded to the cloud server, the aggregation of the reasoning results is completed, and the flow enters step 8; step 8, obtaining the model partitioning strategy and the total reasoning delay at this time; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay are output. The model partition point is optimized in combination with the queue state information, thereby minimizing the system reasoning delay.

Description

Collaborative reasoning acceleration method based on queuing theory
Technical Field
The invention belongs to the technical field of edge computing, and particularly relates to a collaborative reasoning acceleration method based on queuing theory.
Background
Currently, research on queuing theory receives wide attention and application in the field of mobile edge computing. Meanwhile, as technologies such as machine learning and artificial intelligence are integrated into queuing-theory research, the performance of mobile edge computing systems is further improved; in particular, queuing theory plays a key role in reducing task processing delay and optimizing resource allocation in mobile edge computing scenarios.
Common queuing models include First-Come First-Served (FCFS), Last-Come First-Served (LCFS), and priority-based queuing systems. When a large number of tasks wait for processing at an edge server, the waiting time of a task in the queue may become excessively long, so that the task exceeds its delay tolerance and processing fails. To address this problem, a regret (reneging) mechanism is introduced into the queue; however, in existing studies the reneging behaviour of customers is random, and no timely adjustment decision is made according to the queuing situation in the current system.
In research on collaborative reasoning, most model partitioning schemes only consider the transmission delay and the processing delay of the reasoning model, while the queuing delay is ignored even though it is usually not negligible in the whole reasoning process.
Therefore, aiming at the above problems, the invention provides a collaborative reasoning acceleration method based on queuing theory, which optimizes the model partition point while taking into account the waiting delay of a reasoning task in the queue and triggering a regret mechanism according to the length of that waiting delay, finally achieving the goal of minimizing the delay.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a collaborative reasoning acceleration method based on queuing theory: after a reasoning task reaches an edge server, it first hesitates at the edge server and decides whether to upload to the cloud; after entering the waiting queue of the edge server, it decides whether to renege (regret); combined with model partitioning, the whole collaborative reasoning acceleration process is finally realized.
In order to achieve the above functions, the present invention designs a collaborative reasoning acceleration scenario based on impatient queuing, which includes a plurality of terminal devices, a plurality of edge servers and a cloud server. Because the physical distance between the terminal devices and the edge servers is much shorter than that to the cloud server, a terminal device preferentially uploads a DNN (Deep Neural Network) model to an edge server to complete reasoning; when the load of the edge server is too large, uploading to the remote cloud is considered to complete the collaborative reasoning process. The method includes the following steps:
step 1, constructing an architecture system comprising terminal devices, edge servers and a cloud server; collecting task data generated by the terminal devices, establishing a task attribute model, and simultaneously calculating the reasoning delay of each layer of the DNN model required to complete the reasoning task; the set of terminal devices i and the set of edge servers m are defined;
Step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal equipment and the edge server and between the edge server and the cloud server, and uploading an reasoning task to the edge server;
step 3, monitoring and evaluating the load condition and queue state information at the current edge server; the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information; if so, jumping to step 7, otherwise jumping to step 4;
step 4, the reasoning task enters the queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the per-layer reasoning delay obtained in step 1, obtaining a partitioning strategy that divides the model into a DNN shallow section and a DNN deep section; if the reasoning task does not trigger the regret mechanism at this time, jump to step 5, otherwise jump to step 6;
step 5, based on the partitioning strategy obtained in step 4, the edge server cooperatively executes the reasoning task on the DNN shallow section and the DNN deep section, the queue state information at this moment is collected, and the flow jumps to step 8;
step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7;
step 7, the model is uploaded to the cloud server to complete reasoning, the aggregation of the reasoning results is completed at the cloud server, and the flow enters step 8;
step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, comparing the total reasoning delay T of the current round with the reasoning delay T' obtained in the previous round; when T < T', iterate back to step 3; when the total reasoning delay no longer decreases, output the optimal model partitioning strategy and the minimum reasoning delay result.
Further, the step 1 of collecting task data generated by the terminal device, establishing a task attribute model, and simultaneously calculating an inference delay of each layer of the DNN model required for completing an inference task, specifically includes the following steps:
step 11, define the set of all tasks generated by the terminal devices; a reasoning task k is described as k = (s_k, W_k, D_k), where s_k represents the model split point of reasoning task k, W_k represents the data size of the reasoning task in MBytes, and D_k represents the maximum tolerable delay of reasoning task k;
step 12, calculating, under the given load of the initial edge server, the reasoning delay of each layer of the DNN model required to complete the reasoning task, wherein the DNN model required for completing reasoning task k has L_k layers and 0 < l_k ≤ L_k is satisfied.
An initial partitioning strategy is determined according to the obtained per-layer reasoning delay of the DNN model and the dependency among the DNN layers (dependent layers cannot be divided from each other); the reasoning delay T0 corresponding to the initial partitioning strategy can then be compared with the reasoning delay T1 obtained in the first iteration, so as to carry out the iterative optimization of the subsequent rounds.
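For illustration only, the task attribute model and the per-layer delay profiling of step 1 can be sketched in Python as follows; the field names, the run_layer callback and the profiling loop are assumptions of this sketch, not part of the patented method:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class InferenceTask:
    """Task attribute model k = (s_k, W_k, D_k) from step 11 (field names are illustrative)."""
    split_point: int        # s_k: current model split point (layer index)
    data_size_mb: float     # W_k: input data size in MBytes
    max_delay_s: float      # D_k: maximum tolerable delay in seconds

def profile_layers(run_layer: Callable[[int, object], Tuple[object, float]],
                   num_layers: int, sample: object) -> List[float]:
    """Measure the per-layer inference delay on the edge server under its initial load.

    run_layer(l, x) is assumed to execute layer l on input x and return
    (output, elapsed_seconds); only the elapsed times are kept for partitioning.
    """
    delays, x = [], sample
    for layer in range(num_layers):
        x, dt = run_layer(layer, x)
        delays.append(dt)
    return delays
```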
Further, in the step 2, the network bandwidth of the current wireless link is monitored, and meanwhile, a communication model between the terminal device and the edge server and between the edge server and the cloud server is established, and the reasoning task is uploaded to the edge server, which specifically includes the following steps:
step 21, assuming the total bandwidth is B, that the associated channel is frequency non-selective, and that the channel gain remains constant and can be accurately estimated by the edge server during the transmission of each reasoning task; correspondingly, the instantaneous uplink transmission rate from terminal device i to edge server m during the upload of the reasoning task is obtained by

r_{i,m} = b_i · log2(1 + p_i · h_{i,m} / σ²),

where b_i represents the bandwidth allocated to terminal device i; p_i represents the transmission power of terminal device i; the scalar h_{i,m} represents the uplink channel gain between terminal device i and edge server m; and σ² represents the variance of the additive white Gaussian noise (AWGN);

owing to the limitation of network resources, the sum of the bandwidth resources allocated to the terminal devices i satisfies the constraint

Σ_i b_i ≤ B;

the instantaneous uplink transmission rate achieved between edge server m and the cloud server is expressed as

r_{m,c} = b_c · log2(1 + p_m · h_{m,c} / σ²),

where b_c represents the bandwidth resource allocated to the edge server; p_m represents the transmission power from edge server m to the cloud server; and h_{m,c} represents the uplink channel gain between edge server m and the cloud server, including the path loss and small-scale fading in the communication link;
step 23, the transmission delay from terminal device i to edge server m is

T^t_{i,m} = W_k / r_{i,m},

where W_k represents the data size of the reasoning task;

step 24, the transmission delay from edge server m to the cloud server is defined as

T^t_{m,c} = ρ_mc · W^s_k / r_{m,c},

where W^s_k represents the data size of reasoning task k after partitioning and ρ_mc represents the decision on whether to upload to the cloud server.
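A minimal Python sketch of the communication model of step 2, assuming the Shannon-type rate expression reconstructed above and purely illustrative numbers for bandwidth, power, channel gain, noise and data sizes:

```python
import math

def uplink_rate(bandwidth_hz: float, tx_power_w: float,
                channel_gain: float, noise_var: float) -> float:
    """Shannon-type instantaneous rate r = b * log2(1 + p * h / sigma^2)."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_var)

def transmission_delay(data_size_bits: float, rate_bps: float) -> float:
    """Transmission delay = data size / achievable rate."""
    return data_size_bits / rate_bps

# Illustrative numbers only: device i -> edge m, then edge m -> cloud for the deep section.
r_im = uplink_rate(bandwidth_hz=5e6,  tx_power_w=0.2, channel_gain=1e-6, noise_var=1e-9)
r_mc = uplink_rate(bandwidth_hz=20e6, tx_power_w=1.0, channel_gain=1e-7, noise_var=1e-9)
t_up_edge  = transmission_delay(4.0 * 8e6, r_im)   # W_k = 4 MBytes of task data
t_up_cloud = transmission_delay(1.5 * 8e6, r_mc)   # partitioned data, charged only if uploaded
```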
Further, in step 3, the load condition and the queue state information at the current edge server are monitored and evaluated, the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information; the specific steps are as follows:
step 31, the queue state information Q includes the position of the task in the current queue and the service rate of the edge server in the current queue, which can be specifically expressed as

Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate.
Step 32, a decision is made as to whether to upload to the cloud server according to the current queue status information, which is defined as:
wherein,representing the delay of the reasoning task to wait in line at the edge server m;representing the processing delay of the inference task at the edge server m;representation ofThe transmission delay from the edge server to the cloud server is used as an reasoning task; because the cloud server has sufficient computing resources and has high processing speed, processing time delay on the cloud server is not considered;indicating that the reasoning task completes reasoning at the edge server;uploading the representation to a cloud server;
step 33, the queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; its value affects the decision on whether the DNN model is uploaded to the cloud server.
Further, when the reasoning task does not make a regret decision after waiting in the queue, the task is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the data size W^s_k of the partitioned reasoning task k and the service rate μ_m of edge server m.
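The hesitation decision of step 3 can be illustrated with the following Python sketch; the processing-delay model (partitioned data size divided by the service rate μ_m) and the numeric values are assumptions of this sketch, and only the comparison rule for ρ_mc follows the description above:

```python
from typing import List

def queuing_delay(proc_delays_ahead: List[float]) -> float:
    """T^w_{k,m}: total processing delay of the p_k - 1 tasks queued ahead of task k."""
    return sum(proc_delays_ahead)

def processing_delay(partitioned_size_mb: float, service_rate_mb_s: float) -> float:
    """Assumed edge processing delay: partitioned data size W^s_k over service rate mu_m."""
    return partitioned_size_mb / service_rate_mb_s

def upload_decision(wait_delay: float, proc_delay: float, cloud_tx_delay: float) -> int:
    """rho_mc: 0 -> finish inference at the edge, 1 -> upload directly to the cloud server."""
    return 0 if wait_delay + proc_delay <= cloud_tx_delay else 1

# Example: three tasks already queued ahead; compare the edge path with the cloud path.
t_wait = queuing_delay([0.12, 0.08, 0.20])
t_proc = processing_delay(partitioned_size_mb=1.5, service_rate_mb_s=25.0)
rho_mc = upload_decision(t_wait, t_proc, cloud_tx_delay=0.30)   # -> 1 here: hand off to the cloud
```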
Further, the task in step 4 enters a queuing waiting queue of the edge server, and simultaneously performs DNN model partitioning according to the current queue state information and the reasoning delay of each layer obtained in step 1 to obtain a partitioning strategy, and the specific steps are as follows:
step 41, partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
step 42, the model partition point of the DNN model required to complete the reasoning task k generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, meaning that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively.
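A hedged Python sketch of how a partition point could be chosen from the per-layer delays; the delay estimate, the intermediate-output sizes and the 0-based indexing are simplifications of this sketch rather than the patented formulation:

```python
from typing import List

def edge_cloud_delay(layer_delays: List[float], inter_sizes_mb: List[float],
                     s: int, t_wait: float, r_mc_mb_s: float) -> float:
    """Rough delay estimate for split point s (a sketch, not the patented formula).

    Layers 0..s run on the edge server after waiting t_wait in the queue; the
    intermediate output of layer s (inter_sizes_mb[s]) is then sent to the cloud,
    which runs the remaining layers (cloud compute time neglected, as in step 32).
    """
    edge_part = sum(layer_delays[: s + 1])
    upload = inter_sizes_mb[s] / r_mc_mb_s if s + 1 < len(layer_delays) else 0.0
    return t_wait + edge_part + upload

def best_split(layer_delays: List[float], inter_sizes_mb: List[float],
               t_wait: float, r_mc_mb_s: float) -> int:
    """Enumerate the candidate split points and keep the one with the smallest delay."""
    return min(range(len(layer_delays)),
               key=lambda s: edge_cloud_delay(layer_delays, inter_sizes_mb, s,
                                              t_wait, r_mc_mb_s))
```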
Further, in step 9, the total reasoning delay T of the current round is compared with the reasoning delay T' obtained in the previous round; when T < T', the flow iterates back to step 3; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay result are output, specifically as follows:
the inference delay of the DNN model is defined as the following problem:
minimize the total reasoning delay T_total,k, subject to
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,
wherein s_k represents the model split point of reasoning task k; T^t_{i,m} represents the transmission delay from terminal device i to edge server m; T^w_{k,m} represents the queuing waiting delay at edge server m; T^p_{k,m} represents the processing delay of reasoning task k at edge server m; and T_total,k is composed of these transmission, queuing and processing delay terms, together with the transmission delay to the cloud server when uploading occurs;
(a) indicates that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the architecture system; (b) indicates that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance; (c) indicates that the model partition point of the DNN model must be valid so that the model can be effectively partitioned; (d) indicates that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server;
and continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
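The iterative procedure of steps 3 to 9 can be summarised by the following Python sketch, where evaluate_round is a hypothetical callback standing in for one full pass of hesitation, partitioning, possible regret and execution:

```python
from typing import Callable, Tuple

def optimize_partition(evaluate_round: Callable[[], Tuple[int, float]],
                       max_rounds: int = 50) -> Tuple[int, float]:
    """Outer loop of steps 3-9: iterate until the total reasoning delay stops decreasing.

    evaluate_round() is assumed to return (split_point, total_delay) for one round.
    """
    best_split, best_delay = evaluate_round()
    for _ in range(max_rounds - 1):
        split, delay = evaluate_round()
        if delay < best_delay:          # T < T': keep the better strategy and iterate
            best_split, best_delay = split, delay
        else:                           # delay no longer decreases: stop and output
            break
    return best_split, best_delay
```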
The invention has the following beneficial effects: 1. The invention is oriented to mobile edge computing scenarios and incorporates regret (reneging) behaviour from queuing theory. Different from the random reneging introduced in prior studies, the model partitioning strategy is made and the regret mechanism is triggered according to the current queue state information. On this basis, a collaborative reasoning acceleration method based on impatient queuing of reasoning tasks is designed, which takes the current load state of the edge server into account while fully utilizing the computing resources of the cloud server, finally minimizing the system reasoning delay and effectively completing the collaborative reasoning acceleration process.
2. According to the position of the DNN model in the waiting queue and the dynamic load condition of the edge server in the current system, the method intelligently selects a suitable computing node and optimizes the partitioning strategy, improving the overall reasoning efficiency of the system and adapting better to real scenarios.
Drawings
FIG. 1 is a scene model diagram for an MEC network environment in the present invention;
FIG. 2 is a diagram of the overall framework of the present invention;
FIG. 3 is a flow chart of a collaborative reasoning acceleration method based on queuing theory.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to be limiting.
With the increasing complexity and computational demands of deep learning models, it is often difficult for a single compute node to meet the demands of real-time reasoning tasks. Therefore, joint collaborative reasoning of multiple computing nodes becomes an important solution. Meanwhile, queuing theory is used as a mathematical tool, so that the design of an elastic collaborative reasoning system can be facilitated, the system configuration can be adaptively adjusted according to the load condition and the resource availability of the current computing node, the stability and the robustness of the system can be improved, and the system can be adapted to different working scenes and the change of loads.
Based on this, queuing theory is introduced into the research of collaborative reasoning, and a collaborative reasoning acceleration method based on impatient queuing is designed for the cellular network scenario.
Referring to fig. 1, in combination with current practical applications, a cellular network is used to realize multi-cell coverage, specifically covering multiple production and daily-life aspects such as smartphone communication services, robot-arm detection and quality-inspection services, and real-time road-condition monitoring. Specifically, the scene model includes a set of terminal devices, where each terminal device i is responsible for a task consisting of a single inference model (i.e. a DNN model); because the computing power of the terminal devices is limited, the present invention does not consider local computation on the terminal device. The set of edge servers m is also defined, and the scene model further comprises a cloud server.
Based on the architecture system, the invention provides a collaborative reasoning acceleration method based on queuing theory, as shown in fig. 3, and the operation mechanism is as follows:
step 1, collecting task data generated by terminal equipment, establishing a task attribute model, and simultaneously calculating reasoning delay of each layer of DNN model required by completing tasks;
each terminal deviceMultiple consecutive reasoning tasks k of the same type can be generated, so that DNN models required for the reasoning tasks generated for the same terminal device are the same, andis the set of overall tasks that are generated in the architecture system. Distinguishing these heterogeneous tasks based on their computational properties, in particular, inference task k is described asWhereinModel segmentation points representing an reasoning task k;the data size (in MBytes) representing the reasoning task;representing the maximum tolerable delay of the reasoning task k. We assume that heterogeneous tasks differ in latency tolerance, separating tasks into latency sensitive tasks and latency tolerant tasks, such as: the judgment of the road condition information in intelligent driving has extremely low time delay tolerance, and the road condition information needs to be acquired in real time and decided in time to avoid traffic accidents; the processing of photo information in the daily photographing process has higher delay tolerance;
the inference delay for each layer of the DNN model required to complete a task based on the initial edge server given the load can be expressed asWherein DNN reasoning models required by the reasoning task k are as followsLayer and satisfyThe method comprises the steps of carrying out a first treatment on the surface of the Determining an initial partition strategy according to the reasoning delay of each layer of the obtained DNN model and the correlation among the layers of DNN;
step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal device and the edge server and a communication model between the edge server and the cloud server, and uploading the reasoning task to the edge server;
the use of Orthogonal Frequency Division Multiple Access (OFDMA) to allocate frequency band resources to terminal devices is considered in the architecture system. The number of sub-carriers is assumed to be sufficiently large so that the division of bandwidth is approximately continuous. Let the architecture system total bandwidth be B, assuming that the associated channel is frequency non-selective, and the channel gain can remain constant and accurately estimated by the edge server during the transmission of each inference task. Correspondingly, slave terminals during the upload of the reasoning tasksThe achievable instantaneous uplink transmission rate to the edge server m is obtained by:
wherein,presentation of an allocation to a terminal deviceIs a bandwidth of (a);indicating terminal equipmentIs set to the transmission power of (a); scalar quantityRepresented at terminal equipmentAnd the uplink channel gain between edge server m.
For the limitation of network resources, each terminal device is allocatedThe bandwidth resource sum of (a) satisfies the following constraint:
similarly, the instantaneous uplink transmission rate achievable between edge server m and cloud server can be expressed as:
wherein,representing bandwidth resources allocated to the edge servers;representing the transmit power from edge server m to cloud server;representing uplink channel gain between edge server m and cloud server, where path loss and small scale fading in the communication link have been included;representing the variance of Additive White Gaussian Noise (AWGN).
By terminal equipmentTransmission delay to edge server m:
wherein,representing slave terminal deviceData transfer rate to edge server m;
the transmission delay from the edge server m to the cloud server is defined as:
wherein,representing the data upload rate from edge server m to cloud server.
Step 3, monitoring and evaluating the load condition and queue state information at the current edge server, hesitating an reasoning task uploaded to the edge server, making a decision whether to directly upload the DNN model to the cloud server according to the current queue state information, if so, jumping to step 7, otherwise jumping to step 4;
The queue state information Q includes the position of the reasoning task in the current queue and the service rate of the edge server in the current queue, which can be specifically expressed as

Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate.
A decision on whether to upload to the cloud server is made according to the current queue state information:

ρ_mc = 0 if T^w_{k,m} + T^p_{k,m} ≤ T^t_{m,c}, and ρ_mc = 1 otherwise,

where T^w_{k,m} represents the delay of the reasoning task waiting in the queue at edge server m; T^p_{k,m} represents the processing delay of the reasoning task at edge server m; and T^t_{m,c} represents the transmission delay of the reasoning task from the edge server to the cloud server; because the cloud server has abundant computing resources and a high processing speed, the processing delay at the cloud server is not considered;
The queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; this value determines the decision on whether the model is uploaded to the cloud server.
When the reasoning task does not make a regret decision after queuing, it is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the partitioned data size W^s_k of reasoning task k and the service rate μ_m of the edge server, where μ_m is related to the load condition of the current edge server and is expressed in terms of f_m, the computing power of edge server m, n_m, the number of terminal devices connected to the current edge server m, and an introduced compensation function;
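The exact expressions for μ_m and the processing delay appear only in the drawings; a minimal Python sketch consistent with the stated dependencies (computing power f_m, number of attached devices n_m, and a compensation term) might look as follows, with the concrete formulas being assumptions of this sketch rather than the patented ones:

```python
def service_rate(f_m: float, n_m: int, compensation: float = 0.0) -> float:
    """Assumed load-dependent service rate mu_m: grows with computing power f_m,
    shrinks as more devices n_m attach, plus an additive compensation term."""
    return f_m / max(n_m, 1) + compensation

def edge_processing_delay(partitioned_size_mb: float, mu_m: float) -> float:
    """Assumed processing delay of the partitioned task at edge server m."""
    return partitioned_size_mb / mu_m
```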
step 4, the reasoning task enters the queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the per-layer reasoning delay obtained in step 1 to obtain a partitioning strategy; if the task does not trigger the regret mechanism at this time, jump to step 5, otherwise jump to step 6;
the partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
The model partition point of the DNN reasoning task generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, meaning that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively;
step 5, based on the partitioning strategy obtained in step 4, the edge server cooperatively executes the reasoning task on the DNN shallow section and the DNN deep section, the queue state information at this moment is collected, and the flow jumps to step 8;
step 6, when the regret mechanism is triggered, the deep section of the DNN model is handed over to step 7;
Specifically, in connection with FIG. 2, the regret decision is made as follows: when the sum of the waiting delay of the reasoning task in the current queue and the processing delay at the edge server becomes larger than the transmission delay of uploading the reasoning task to the cloud server, the regret mechanism is triggered for the deep section of the model, and the reasoning task is withdrawn from the waiting queue and uploaded to the cloud server to complete the reasoning process;
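A small Python sketch of the regret (reneging) check as it might be re-evaluated while the task waits in the edge queue; estimate_wait is a hypothetical callback and the polling loop is an assumption of this sketch:

```python
from typing import Callable

def wait_and_maybe_renege(estimate_wait: Callable[[int], float], proc_delay: float,
                          cloud_tx_delay: float, poll_steps: int = 100) -> str:
    """Re-check the regret condition each time the queue state is refreshed.

    estimate_wait(t) is assumed to return the task's remaining queuing delay at
    poll step t (it can grow if the edge load rises); the first time the edge
    path becomes slower than shipping the deep section to the cloud, the task reneges.
    """
    for t in range(poll_steps):
        remaining = estimate_wait(t)
        if remaining + proc_delay > cloud_tx_delay:
            return "renege: withdraw from the queue, upload the deep section (step 6 -> 7)"
        if remaining == 0.0:
            break
    return "served at the edge according to the current partition (step 5)"
```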
step 7, the model is uploaded to the cloud server to complete reasoning, and the aggregation of the reasoning results is completed at the cloud server before entering step 8 (there are two possibilities for the model arriving here: 1. in step 3, after hesitation at the edge server, the reasoning model is uploaded directly to the cloud server for processing; 2. in step 6, the deep section of the model whose regret mechanism has been triggered is uploaded);
Step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, the total reasoning delay T obtained in the current iteration round is compared with the reasoning delay T' obtained in the previous iteration round; when T < T', the flow iterates back to step 3; when the total reasoning delay no longer decreases, the optimal model partitioning strategy and the minimum reasoning delay are output;
specifically, the optimization objective of the iterative process is as follows:
minimize the total reasoning delay T_total,k, subject to:
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,
wherein (a) means that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the system, which ensures the bandwidth limitation of the system; (b) means that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance, ensuring that the reasoning task is successfully processed by the server within the specified time and thus guaranteeing the success rate of task processing; (c) means that the model partition point of the DNN model should be valid so that the model can be effectively partitioned; (d) means that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server.
And continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Meanwhile, the above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (7)

1. The collaborative reasoning acceleration method based on queuing theory is characterized by comprising the following steps:
step 1, constructing an architecture system comprising terminal devices, edge servers and a cloud server; collecting task data generated by the terminal devices, establishing a task attribute model, and simultaneously calculating the reasoning delay of each layer of the DNN model required to complete the reasoning task; the set of terminal devices i and the set of edge servers m are defined;
Step 2, monitoring the network bandwidth of the current wireless link, simultaneously establishing a communication model between the terminal equipment and the edge server and between the edge server and the cloud server, and uploading an reasoning task to the edge server;
step 3, monitoring and evaluating the load condition and queue state information at the current edge server, hesitating an reasoning task uploaded to the edge server, making a decision whether to directly upload the DNN model to the cloud server according to the current queue state information, if so, jumping to step 7, otherwise jumping to step 4;
step 4, the reasoning task enters a queuing waiting queue of the edge server, and DNN model partitioning is carried out according to the current queue state information and the reasoning delay of each layer obtained in the step 1, so that a partitioning strategy is obtained and the partitioning strategy is divided into a DNN model shallow section part and a DNN model deep section part; if the reasoning task does not trigger the regret mechanism at this time, the step 5 is skipped, otherwise the step 6 is skipped;
step 5, based on the partition strategy obtained in the step 4, the edge server cooperates with the DNN model shallow section part and the DNN model deep section part to execute reasoning tasks, and collects queue state information at the moment, and the step 8 is skipped;
step 6, when the regret mechanism is triggered, skipping the deep section part of the DNN model to step 7;
step 7, uploading the model to a cloud server to complete reasoning, and completing aggregation of reasoning results at the cloud server to enter step 8;
step 8, calculating to obtain a model partitioning strategy and total reasoning time delay T at the moment;
step 9, comparing the total reasoning delay T of the current round with the reasoning delay T' obtained in the previous round, and iterating back to step 3 when T < T'; when the total reasoning delay is no longer reduced, outputting the optimal model partitioning strategy and the minimum reasoning delay result.
2. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the task data generated by the terminal device in step 1 is collected, a task attribute model is built, and the reasoning delay of each layer of the DNN model required for completing the reasoning task is calculated, which specifically comprises the following steps:
step 11, defining the set of all tasks generated by the terminal devices; a reasoning task k is described as k = (s_k, W_k, D_k), where s_k represents the model split point of reasoning task k, W_k represents the data size of the reasoning task in MBytes, and D_k represents the maximum tolerable delay of reasoning task k;
step 12, calculating, under the given load of the initial edge server, the reasoning delay of each layer of the DNN model required to complete the reasoning task, wherein the DNN model required for completing reasoning task k has L_k layers and 0 < l_k ≤ L_k is satisfied;
And determining an initial partitioning strategy according to the reasoning delay of each layer of the obtained DNN model and the correlation among the layers of the DNN.
3. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the monitoring of the network bandwidth of the current wireless link in step 2 simultaneously establishes a communication model between the terminal device and the edge server and between the edge server and the cloud server, and uploads a reasoning task to the edge server, specifically comprising the following steps:
step 21, assuming the total bandwidth is B, that the associated channel is frequency non-selective, and that the channel gain remains constant and can be accurately estimated by the edge server during the transmission of each reasoning task; accordingly, the instantaneous uplink transmission rate achieved from terminal device i to edge server m during the upload of the reasoning task is obtained by

r_{i,m} = b_i · log2(1 + p_i · h_{i,m} / σ²),

where b_i represents the bandwidth allocated to terminal device i; p_i represents the transmission power of terminal device i; the scalar h_{i,m} represents the uplink channel gain between terminal device i and edge server m; and σ² represents the variance of the additive white Gaussian noise (AWGN);
owing to the limitation of network resources, the sum of the bandwidth resources allocated to the terminal devices i satisfies the constraint

Σ_i b_i ≤ B;

the instantaneous uplink transmission rate achieved between edge server m and the cloud server is expressed as

r_{m,c} = b_c · log2(1 + p_m · h_{m,c} / σ²),

where b_c represents the bandwidth resource allocated to the edge server; p_m represents the transmission power from edge server m to the cloud server; and h_{m,c} represents the uplink channel gain between edge server m and the cloud server, including the path loss and small-scale fading in the communication link;
step 23, the transmission delay from terminal device i to edge server m is

T^t_{i,m} = W_k / r_{i,m},

where W_k represents the data size of the reasoning task;
step 24, the transmission delay from edge server m to the cloud server is defined as

T^t_{m,c} = ρ_mc · W^s_k / r_{m,c},

where W^s_k represents the data size of reasoning task k after partitioning and ρ_mc represents the decision on whether to upload to the cloud server.
4. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein in step 3 the load condition and the queue state information at the current edge server are monitored and evaluated, the reasoning task uploaded to the edge server hesitates, and a decision on whether to upload the DNN model directly to the cloud server is made according to the current queue state information, with the specific steps as follows:
step 31, the queue status information Q includes the position of the task in the current queue and the service rate of the edge server in the current queue, which may be specifically expressed as:
Q = (p_k, μ_m),

where p_k represents the position of reasoning task k in the current waiting queue; μ_m represents the service rate of the current edge server m, where μ_m is related to the current load condition of the edge server: the larger the current load of the edge server, the smaller its service rate; conversely, the smaller the load, the larger the service rate;
step 32, a decision on whether to upload to the cloud server is made according to the current queue state information, defined as

ρ_mc = 0 if T^w_{k,m} + T^p_{k,m} ≤ T^t_{m,c}, and ρ_mc = 1 otherwise,

where T^w_{k,m} represents the delay of the reasoning task waiting in the queue at edge server m; T^p_{k,m} represents the processing delay of the reasoning task at edge server m; and T^t_{m,c} represents the transmission delay of the reasoning task from the edge server to the cloud server; because the cloud server has abundant computing resources and a high processing speed, the processing delay at the cloud server is not considered; ρ_mc = 0 means that the reasoning task completes reasoning at the edge server; ρ_mc = 1 represents uploading to the cloud server;
step 33, the queuing waiting delay at edge server m is

T^w_{k,m} = Σ_{j=1}^{p_k − 1} T^p_{j,m},

i.e. the overall processing delay of the reasoning tasks at the p_k − 1 positions ahead of reasoning task k in the current queue; its value affects the decision on whether the DNN model is uploaded to the cloud server.
5. The collaborative reasoning acceleration method based on queuing theory according to claim 4, wherein, when the reasoning task does not make a regret decision after waiting in the queue, the task is processed by the edge server; in this case, the processing delay T^p_{k,m} of reasoning task k at edge server m is determined by the data size W^s_k of the partitioned reasoning task k and the service rate μ_m of edge server m.
6. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein the task in step 4 enters a queuing waiting queue of an edge server, and simultaneously performs DNN model partitioning according to current queue state information and each layer of reasoning delay obtained in step 1 to obtain a partitioning strategy, and the specific steps are as follows:
step 41, partitioning strategy divides the DNN model into two parts: 1) The shallow part of the DNN model is executed at the edge server; 2) Transmitting the deep section part of the DNN model to a cloud server for execution, wherein the output result of the DNN model at the shallow section part of the edge server is transmitted to the cloud server through a communication link as the input of the deep section part;
step 42, the model partition point of the DNN model required for completing the reasoning task k generated by terminal device i is expressed as an integer variable s_k ∈ {0, 1, 2, …, l_k}, representing that layer 0 to layer s_k of the DNN model are executed at the edge server and layer s_k+1 to layer l_k are computed at the cloud server; in particular, s_k = 0 and s_k = l_k represent that the DNN model is executed entirely at the cloud server and entirely at the edge server, respectively.
7. The collaborative reasoning acceleration method based on queuing theory according to claim 1, wherein, in step 9, the total reasoning delay T of the current round is compared with the reasoning delay T' obtained in the previous round, and the flow iterates back to step 3 when T < T'; when the total reasoning delay is no longer reduced, the optimal model partitioning strategy and the minimum reasoning delay result are output, with the specific steps as follows:
the inference delay of the DNN model is defined as the following problem:
minimize the total reasoning delay T_total,k, subject to
(a) Σ_i b_i ≤ B;
(b) T_total,k ≤ D_k;
(c) s_k ∈ {0, 1, …, l_k};
(d) ρ_mc ∈ {0, 1};
(e) μ_m ≤ μ_max,

wherein s_k represents the model split point of reasoning task k; T^t_{i,m} represents the transmission delay from terminal device i to edge server m; T^w_{k,m} represents the queuing waiting delay at edge server m; T^p_{k,m} represents the processing delay of reasoning task k at edge server m; T^t_{m,c} represents the transmission delay from edge server m to the cloud server; L_k represents the number of layers of the DNN model required by task k; the total task set is the set of all tasks generated by the terminal devices; ρ_mc represents the decision on whether to upload to the cloud server; μ_m represents the service rate of the current edge server m; and μ_max represents the maximum service rate of the current edge server m;
(a) indicates that the sum of the bandwidths allocated to the terminal devices is not greater than the total bandwidth from the terminal devices to the edge server in the architecture system; (b) indicates that the processing delay of reasoning task k in the system cannot exceed its maximum delay tolerance; (c) indicates that the model partition point of the DNN model must be valid so that the model can be effectively partitioned; (d) indicates that the reasoning task selects the server where it is processed according to the delay comparison; (e) represents the limitation of the computing power of the edge server;
and continuously iterating until the reasoning delay is not reduced any more, and outputting a final partitioning strategy and a minimum reasoning delay result.
CN202311378988.0A 2023-10-24 2023-10-24 Collaborative reasoning acceleration method based on queuing theory Active CN117114113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311378988.0A CN117114113B (en) 2023-10-24 2023-10-24 Collaborative reasoning acceleration method based on queuing theory


Publications (2)

Publication Number Publication Date
CN117114113A CN117114113A (en) 2023-11-24
CN117114113B true CN117114113B (en) 2023-12-29

Family

ID=88798755


Country Status (1)

Country Link
CN (1) CN117114113B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107071879A (en) * 2015-12-24 2017-08-18 英特尔公司 Uplink channel interference management in shared spectrum network
CN110309914A (en) * 2019-07-03 2019-10-08 中山大学 Deep learning model reasoning accelerated method based on Edge Server Yu mobile terminal equipment collaboration
CN110532078A (en) * 2019-08-29 2019-12-03 中国科学院软件研究所 A kind of edge calculations method for optimizing scheduling and system
CN111176820A (en) * 2019-12-31 2020-05-19 中科院计算技术研究所大数据研究院 Deep neural network-based edge computing task allocation method and device
KR102165864B1 (en) * 2019-07-22 2020-10-14 성균관대학교산학협력단 Methods and apparatuses for packet scheduling for software defined networking in edge computing environment
CN112348172A (en) * 2020-11-13 2021-02-09 之江实验室 Deep neural network collaborative reasoning method based on end edge cloud architecture
CN112926660A (en) * 2021-02-26 2021-06-08 武汉大学 Water level identification system and method with cooperative end edges
CN113485803A (en) * 2021-06-29 2021-10-08 天津大学 Self-adaptive packaging and collaborative reasoning method under task flow field scene with time delay constraint
CN114422349A (en) * 2022-03-30 2022-04-29 南京邮电大学 Cloud-edge-end-collaboration-based deep learning model training and reasoning architecture deployment method
CN114723057A (en) * 2022-03-31 2022-07-08 北京理工大学 Neural network collaborative reasoning method for multi-access edge computing system
CN115629865A (en) * 2022-12-20 2023-01-20 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Deep learning inference task scheduling method based on edge calculation
CN116455768A (en) * 2023-06-16 2023-07-18 南京邮电大学 Cloud edge end collaborative CNN reasoning method and system for global time delay optimization
CN116663644A (en) * 2023-06-08 2023-08-29 中南大学 Multi-compression version Yun Bianduan DNN collaborative reasoning acceleration method
CN116915869A (en) * 2023-08-14 2023-10-20 南京信息工程大学 Cloud edge cooperation-based time delay sensitive intelligent service quick response method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Dynamic Regret of Randomized Online Service Caching in Edge Computing; Siqi Fan et al.; arXiv; 1-10 *
Impatient Queuing for Intelligent Task Offloading in Multiaccess Edge Computing; Bin Han et al.; IEEE Transactions on Wireless Communications, vol. 22, no. 1; 59-72 *
Latency-Aware Collaborative Perception; Zixing Lei et al.; arXiv; 1-17 *
CNN inference acceleration framework based on edge-end collaboration; 郭永安 et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol. 43, no. 3; 68-77 *
Balanced allocation method of wireless network resources based on edge computing; 刘长青; Information & Computer (Theoretical Edition), no. 14, 2023; 12-14 *
Research on cloud-edge task migration and scheduling for deep learning; 顾冰雪; China Master's Theses Full-text Database, Information Science and Technology, no. 12, 2022; I139-422 *

Also Published As

Publication number Publication date
CN117114113A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN111372314A (en) Task unloading method and task unloading device based on mobile edge computing scene
CN110839184B (en) Method and device for adjusting bandwidth of mobile fronthaul optical network based on flow prediction
CN110798849A (en) Computing resource allocation and task unloading method for ultra-dense network edge computing
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN114138373B (en) Edge computing task unloading method based on reinforcement learning
CN110839075A (en) Service migration method based on particle swarm in edge computing environment
CN110753319B (en) Heterogeneous service-oriented distributed resource allocation method and system in heterogeneous Internet of vehicles
CN113156992B (en) Three-layer architecture collaborative optimization method for unmanned aerial vehicle in edge environment
CN114422349B (en) Cloud-edge-end-collaboration-based deep learning model training and reasoning architecture deployment method
CN114285853A (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN113961264B (en) Intelligent unloading algorithm and system for video monitoring cloud edge cooperation
CN114745383A (en) Mobile edge calculation assisted multilayer federal learning method
CN113573363B (en) MEC calculation unloading and resource allocation method based on deep reinforcement learning
CN115665227B (en) Universal heterogeneous integrated computing network resource intelligent adaptation network architecture and method
CN116916386A (en) Large model auxiliary edge task unloading method considering user competition and load
CN115802370A (en) Communication method and device
CN117114113B (en) Collaborative reasoning acceleration method based on queuing theory
CN117156492A (en) Deep reinforcement learning-based dual-time-scale resource allocation method for joint service caching, communication and calculation
CN111930435A (en) Task unloading decision method based on PD-BPSO technology
CN117580063A (en) Multi-dimensional resource collaborative management method in vehicle-to-vehicle network
CN114928611B (en) IEEE802.11p protocol-based energy-saving calculation unloading optimization method for Internet of vehicles
CN116017570A (en) Edge computing system resource management method based on block chain
CN116016380A (en) Shi Min service heterogeneous network resource allocation method, system, equipment and medium
CN113452625A (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN116668447B (en) Edge computing task unloading method based on improved self-learning weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant