CN116963034A - Emergency scene-oriented air-ground network distributed resource scheduling method - Google Patents
- Publication number
- CN116963034A (application number CN202310861810.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- action
- parameters
- representing
- agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/90—Services for handling of emergency or hazardous situations, e.g. earthquake and tsunami warning systems [ETWS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/09—Management thereof
- H04W28/0925—Management thereof using policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/09—Management thereof
- H04W28/0958—Management thereof based on metrics or performance parameters
- H04W28/0967—Quality of Service [QoS] parameters
- H04W28/0975—Quality of Service [QoS] parameters for reducing delays
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/30—Services specially adapted for particular environments, situations or purposes
- H04W4/40—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
Abstract
The invention discloses an air-ground network distributed offloading decision and resource scheduling method for emergency scenes. For emergency disaster scenes, an air-ground integrated Internet of Things consisting of unmanned aerial vehicles and emergency rescue vehicle users is constructed; considering the demands of computation-intensive and delay-sensitive services, an optimization problem is formulated with the goal of minimizing the total system delay, and an improved dueling double deep Q network (ID3QN) algorithm is designed to solve it. The ID3QN algorithm used by the invention can minimize the time cost of the system while satisfying delay, power, and other constraints, and effectively solves the joint optimization of offloading decisions, channel allocation, and power allocation for vehicle users in emergency scenes.
Description
Technical Field
The invention relates to the field of the air-ground integrated Internet of Things, and in particular to an air-ground network distributed offloading decision and resource optimization method for emergency scenes based on an improved dueling double deep Q network.
Background
Emergency disaster scenarios place higher demands on the mobility, reliability, and flexibility of field rescue communication and computing facilities. Deploying Multi-access Edge Computing (MEC) in emergency scenarios may alleviate the limited computing resources of Internet of Things (IoT) devices; however, MEC deployed in advance suffers from inflexibility and uneven service coverage, and preset base stations are easily destroyed and left unable to provide service, so conventional terrestrial networks cannot meet the quick-response requirement of emergency scenarios. For this situation, the air-ground integrated Internet of Things plays a key role, providing support for assisting and supplementing terrestrial systems. The Third Generation Partnership Project (3GPP) regards Non-Terrestrial Networks (NTN) as a new feature of 5G, intended to provide wireless access services worldwide beyond geographic limitations. Unmanned Aerial Vehicles (UAVs) have the advantages of low cost and flexible maneuvering, and are widely applied in wireless communication. As aerial computing platforms, UAVs can assist edge computing, particularly in high-density public emergency scenarios.
In addition, due to the presence of various random and nonlinear factors, wireless communication systems are often difficult to model accurately, and even when modeling is possible, the models and algorithms become complex and fail to meet real-time response requirements. Artificial Intelligence (AI) techniques, with powerful data processing and representation capabilities and low inference complexity, can provide technical support; in particular, Deep Reinforcement Learning (DRL) has been widely applied to resource allocation and computation offloading problems in the Internet of Things.
Disclosure of Invention
The invention aims at constructing a UAV-assisted air-ground integrated Internet of Things architecture for emergency rescue scenes, and proposes an improved dueling double deep Q network (ID3QN) algorithm for offloading decisions and resource optimization, considering the requirements of computation-intensive and delay-sensitive services, so as to reduce the total system delay. To achieve this object, the invention adopts the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: constructing a deep reinforcement learning model from the optimization problem by adopting a distributed resource allocation method, and setting the key parameters of the dueling double deep Q network (D3QN);
step 4: a prioritized experience replay mechanism is introduced into the D3QN to accelerate training convergence and improve system performance;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is used to obtain the optimal offloading decision, user transmit power, and channel allocation strategy.
further, the step 1 includes the following specific steps:
step 1-1: consider a microcell in a disaster area in which M unmanned aerial vehicles equipped with computing resources serve as airborne MEC nodes; they perform trajectory optimization in advance and preferentially fly to the vicinity of the required area according to the users' situation. The set of UAVs is expressed as M = {1, 2, …, M};
Step 1-2: on the ground, there are N emergency vehicle users (Emergency vehicle users, EVUs) that need to perform computationally intensive and delay sensitive tasks, each EVU can move, the aggregate representation of whichIs thatAssuming that each EVU has only one computation task in each time slot, denoted +.> wherein ,dn Representing the amount of calculated data entered; i.e n Representing the number of CPU revolutions required to complete the calculation task; />Representing the maximum tolerable time delay of the task n; when the EVU does not have enough computing resources, the UAV is selected for computing and unloading;
further, the step 2 includes the following specific steps:
step 2-1: define a_{n,m} ∈ {0, 1} to indicate the execution location of the n-th EVU's computation task: a_{n,0} = 1 means the computation task of EVU n is executed locally; a_{n,m} = 1 (m > 0) means task Ω_n is executed on UAV m; otherwise a_{n,m} = 0 means that EVU n has not selected UAV m to complete the computation offloading task. It is assumed that each EVU can select only one UAV for computation offloading;
step 2-2: if EVU n selects UAV m for computation offloading, the signal-to-interference-plus-noise ratio γ_{n,m} of the V2U link between the EVU and the UAV can be expressed as

γ_{n,m} = P[n]·h_{n,m} / (I_n + σ²)   (1)

where P[n] and σ² represent the transmit power of EVU n and the power of the additive white Gaussian noise, respectively; h_{n,m} represents the channel coefficient between EVU n and UAV m; I_n represents the interference to EVU n from the other V2U links using the same sub-band, and can be calculated by

I_n = Σ_{n'≠n} P[n']·h_{n',m}   (2)

where h_{n',m} represents the channel coefficient between EVU n' and UAV m on the same sub-band, and P[n'] follows the same definition as P[n] with n replaced by n';
step 2-3: because the channel between the EVU and the UAV is a free-space line-of-sight (LOS) link, the channel coefficient depends on the path loss and can be expressed as

h_{n,m} = PL(d_{n,m})   (3)

where PL(d_{n,m}) is the path loss at distance d_{n,m}. Let the position coordinates of the transmitting and receiving ends of the V2U link be (x_n, y_n, z_n) and (x_m, y_m, z_m); the Euclidean distance d_{n,m} between EVU n and UAV m can then be expressed as

d_{n,m} = √((x_n − x_m)² + (y_n − y_m)² + (z_n − z_m)²)   (4)
Step 2-4: the transmission rate between EVU n and UAV m can be expressed as

R_{n,m} = B·log2(1 + γ_{n,m})   (5)

where B represents the bandwidth of the V2U link;
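To make the link model of equations (1)-(5) concrete, the sketch below computes the distance, channel gain, SINR, and rate of one V2U link in plain Python. The carrier frequency and the concrete free-space path-loss formula are illustrative assumptions; the text does not specify a particular path-loss model.

```python
import math

def euclidean_distance(pos_tx, pos_rx):
    """Euclidean distance d_{n,m} between EVU n and UAV m, eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pos_tx, pos_rx)))

def free_space_path_gain(d, fc_hz=2e9):
    """Free-space LOS channel gain h_{n,m} = PL(d_{n,m}), eq. (3).
    The standard free-space formula (c / (4*pi*fc*d))^2 is assumed here."""
    c = 3e8  # speed of light, m/s
    return (c / (4 * math.pi * fc_hz * d)) ** 2

def v2u_rate(p_tx, h, interference, noise_power, bandwidth_hz):
    """SINR gamma_{n,m} of eq. (1) and rate R_{n,m} = B*log2(1+gamma), eq. (5)."""
    sinr = p_tx * h / (interference + noise_power)
    return bandwidth_hz * math.log2(1 + sinr)
```

The gain falls off with the square of the distance, so a UAV hovering closer to its EVUs directly raises the achievable V2U rate.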
step 2-5: then the total transmission delay can be expressed as
wherein ,representing the transmission delay after the UAvm is selected by the EVUn;
step 2-6: the total computation delay of all EVUs executing their tasks can be expressed as

T^comp = Σ_{n=1}^{N} ( a_{n,0}·i_n / f_n^loc + Σ_{m=1}^{M} a_{n,m}·i_n / f_{n,m} )   (7)

where f_{n,m} represents the computing resource allocated to task Ω_n; f_n^loc indicates the local computing resource with which EVU n executes the task; when m > 0, f_{n,m} represents the number of CPU cycles per second allocated by UAV m to EVU n; and i_n / f_{n,m} represents the computation time required when EVU n selects UAV m to execute the task;
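A minimal sketch of the delay model of equations (6)-(7): a single task either runs locally, or is uploaded and then computed on a UAV. The units (bits, CPU cycles, bit/s, cycles/s) are illustrative assumptions.

```python
def transmission_delay(d_n_bits, rate_bps):
    """t^trans_{n,m} = d_n / R_{n,m}: time to upload the d_n input bits, eq. (6)."""
    return d_n_bits / rate_bps

def computation_delay(i_n_cycles, f_cycles_per_s):
    """i_n / f: CPU time at the chosen execution location, eq. (7)."""
    return i_n_cycles / f_cycles_per_s

def task_delay(offload, d_n, i_n, rate, f_local, f_uav):
    """Total delay of one task: local compute only, or upload plus UAV compute.
    (The result-download delay is ignored, as is common when outputs are small.)"""
    if not offload:
        return computation_delay(i_n, f_local)
    return transmission_delay(d_n, rate) + computation_delay(i_n, f_uav)
```

For instance, a 1 Mbit task of 10^9 cycles takes 1 s locally at 1 GHz, but 0.3 s when uploaded at 10 Mbit/s to a 5 GHz UAV node, which is exactly the trade-off the offloading decision weighs.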
step 2-7: the total time cost of the system can be expressed as

T_total = T^trans + T^comp   (8)
Step 2-8: based on the above definitions, the optimization problem of minimizing the total system delay is expressed as

min_{A, C, P} T_total
s.t. C1: t_n ≤ τ_n^max, ∀n
     C2: 0 ≤ P[n] ≤ P^max, ∀n
     C3: f_{n,m} ≥ 0, ∀n, m
     C4: Σ_{n=1}^{N} a_{n,m}·f_{n,m} ≤ F_m^max, ∀m
     C5: Σ_{m=0}^{M} a_{n,m} = 1, ∀n   (9)

where A, C, and P denote the offloading policy, the channel allocation, and the user transmit-power allocation, respectively; P^max represents the maximum transmit power of each EVU, and F_m^max represents the maximum computing resource of UAV m. Constraint C1 represents the maximum tolerable delay of task Ω_n; constraints C2, C3, and C4 represent the power constraint of each EVU and the UAV computing-resource constraints, respectively; constraint C5 indicates that each EVU can select only one UAV for computation offloading;
further, the step 3 includes the following specific steps:
step 3-1: each EVU is regarded as an agent. For each agent n, at each time step t the current state s_t(n) is first obtained from the state space by local observation. The state space consists of the EVU's computation task information Ω_n, the current channel state information H_t, the UAV state information F_t, the training episode number e, and the random exploration variable ε of the ε-greedy algorithm, i.e.

s_t(n) = {Ω_n, H_t, F_t, e, ε}   (10)
Step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q^π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space. Each agent's action space consists of the offloading decision x_n, the sub-channel c_n, and the transmit power P[n], expressed as

a_t(n) = {x_n, c_n, P[n]}   (11)

where x_n indicates the computation location of the agent: if the agent chooses local computation, it does not enter the training stage; if the EVU selects UAV m for computation offloading, it selects one sub-channel from the sub-channel set C_m; the transmit power P[n] is limited to 4 levels, i.e. [23, 10, 5, 0] dBm. The joint action space of all agents is then expressed as

A_t = {a_t(1), a_t(2), …, a_t(N)}   (12)
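As an illustration of the discrete per-agent action space of equation (11), the sketch below enumerates one agent's actions for a hypothetical setting with 2 UAVs and 3 sub-channels per UAV (both counts are assumptions); the four power levels are the ones given in the text.

```python
from itertools import product

M_UAVS = 2                   # assumed number of UAVs for illustration
SUBCHANNELS = [0, 1, 2]      # assumed sub-channel set C_m for illustration
POWER_DBM = [23, 10, 5, 0]   # the 4 transmit-power levels from the text

def build_action_space():
    """Enumerate one agent's discrete actions a_t(n) = {x_n, c_n, P[n]}.
    Action 0 is local computation; every offloading action combines a UAV
    index, a sub-channel, and one of the four power levels."""
    actions = [("local", None, None)]
    for m, c, p in product(range(1, M_UAVS + 1), SUBCHANNELS, POWER_DBM):
        actions.append((m, c, p))
    return actions
```

With these assumed sizes the agent's Q network has 1 + 2·3·4 = 25 output heads, one per discrete action, which is the scale at which a DQN-family method remains tractable.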
Step 3-3: based on the action selections of all agents, the environment transitions to a new state S_{t+1}. All agents share a global reward, and the single-step reward function of each agent at time t is defined as

r_t = C − T_total   (13)

where C is a constant used to adjust r_t so as to facilitate training;
step 3-4: in order to find the optimal policy that maximizes the overall return, both current and future returns must be considered, so the return is defined as the cumulative discounted reward

R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}   (14)

where γ ∈ [0, 1] is the discount factor; γ close to 1 means future rewards are emphasized, and γ close to 0 means the current reward is emphasized;
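The cumulative discounted reward of equation (14), taken over a finite episode, can be computed by the backward recursion R = r + γR:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward R_t = sum_k gamma^k * r_{t+k}, eq. (14),
    accumulated backwards over a finite episode tail."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```

With γ = 0.5 and three unit rewards the return is 1 + 0.5 + 0.25 = 1.75, and with γ = 0 only the immediate reward survives, matching the two regimes described in step 3-4.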
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate Q*(s_t, a_t) = max_π Q^π(s_t, a_t), and then selects the optimal action according to the optimal action-value function. In the D3QN algorithm, a neural network with parameters θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q^π(S_t, A_t);
Step 3-6: then, the network structure is designed, and unlike the traditional deep double-Q network, a breach layer is introduced before an output layer to evaluate the state and the action respectively, so that the intelligent agent can process the state with smaller relation with the action more effectively; the layer divides the network output into two parts, namely a state-dependent value function V (S t ) And a dominance function A (S) t ,A t ) In this way, the states can be independently evaluated rather than always relying on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as

Q(S_t, A_t; θ_t) = V(S_t; θ^com_t, θ^V_t) + A(S_t, A_t; θ^com_t, θ^A_t)   (15)

where θ^com_t, θ^V_t, and θ^A_t represent the network parameters of the common part, the value-function part, and the advantage-function part, respectively, and together constitute the network parameters θ_t; the value function V(·) represents the value of the current state; the advantage function A(·) represents the value of each action compared with the other actions in the current state;
step 3-8: however, based on the above formula, V(·) and A(·) cannot be uniquely determined from Q(·) alone; in practical applications, it is rewritten as

Q(S_t, A_t; θ_t) = V(S_t; θ^com_t, θ^V_t) + A(S_t, A_t; θ^com_t, θ^A_t) − (1/|A|)·Σ_{A'} A(S_t, A'; θ^com_t, θ^A_t)   (16)

By subtracting the mean of the advantage function, V(·) and A(·) can be determined once Q(·) is fixed;
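The mean-subtracted dueling aggregation of equation (16) can be sketched for a single state as follows; the mean of the resulting Q values recovers V, which is what makes the two streams identifiable.

```python
def dueling_q(v, advantages):
    """Dueling aggregation of eq. (16) for one state:
    Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a').
    Subtracting the advantage mean pins down V and A uniquely."""
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]
```

Note that the subtraction shifts every action's Q value by the same constant, so the greedy action is unchanged while V becomes the per-state baseline.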
Step 3-9: in the training process, the D3QN adopts two networks of a prediction network and a target network to relieve the problem of Q value overestimation, firstly finds out the action of maximizing the Q value in the prediction network, and then uses the action to obtain the target Q value in the target network, wherein the target value can be expressed as
wherein θt Andrepresenting parameters of the predicted network and the target network, respectively, two network structuresThe same, predict network parameter to upgrade continuously, the goal network parameter is updated once every certain cycle; q (S) t+1 ,A t ;θ t ) Representing neural network θ t The following is for state S t+1 Take action A t The obtained cost function;
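The double-Q target of equation (17), in which the greedy action comes from the prediction network but is evaluated by the target network, can be sketched as below; the `done` flag for terminal states is an addition not stated in the text.

```python
def double_dqn_target(r, gamma, next_q_pred, next_q_target, done=False):
    """Target value y_t of eq. (17): select the greedy action with the
    prediction network's Q values, evaluate it with the target network's."""
    if done:
        return r
    # argmax over the prediction network's Q values for S_{t+1}
    a_star = max(range(len(next_q_pred)), key=lambda a: next_q_pred[a])
    return r + gamma * next_q_target[a_star]
```

Decoupling selection from evaluation in this way is what curbs the overestimation bias of a single max operator.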
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n is stored in the replay memory as samples for subsequent training. A stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling is used, defining the probability that sample i is drawn as

P(i) = p_i^α / Σ_{k∈B} p_k^α   (18)

where α is an exponent, with α = 0 corresponding to uniform sampling; B denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its priority reaches 0; δ_i is the temporal-difference error (TD-error) of sample i, expressed as

δ_i = y_i − Q(s_i(n), a_i(n); θ_i)   (19)
Step 4-2: in updating the network, each agent needs to minimize the loss function to achieve gradient descent, which is defined as when considering sample priority
wherein ,wi =[BP(i)] -μ Represents a sampling-Importance (IS) weight, B represents an empirical playback pool size, μ IS an index, and when μ=1, w i Completely compensating the non-uniform probability P (i);
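The prioritized-replay quantities of equations (18)-(20) can be sketched as follows. Normalizing the IS weights by their maximum is a common stabilization convention assumed here, not something stated in the text.

```python
def sampling_probs(td_errors, alpha, beta=1e-3):
    """Priorities p_i = |delta_i| + beta and probabilities
    P(i) = p_i^alpha / sum_k p_k^alpha, eqs. (18)-(19)."""
    prios = [(abs(d) + beta) ** alpha for d in td_errors]
    total = sum(prios)
    return [p / total for p in prios]

def is_weights(probs, pool_size, mu):
    """Importance-sampling weights w_i = (B * P(i))^(-mu), normalized by
    the maximum weight for stability (an assumed convention)."""
    w = [(pool_size * p) ** (-mu) for p in probs]
    w_max = max(w)
    return [x / w_max for x in w]

def weighted_loss(td_errors, weights):
    """IS-weighted squared TD-error loss of eq. (20)."""
    return sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
```

Setting α = 0 recovers uniform replay, and μ = 1 with uniform probabilities yields equal weights, matching the limiting cases described in steps 4-1 and 4-2.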
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator; initialize the agents' prediction network parameters θ and target network parameters θ⁻, the target-network update frequency, and other related parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B and the exponents α and μ;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains the immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the replay memory;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the sampling probability in equation (18), computes the IS weights, and updates the sample priorities; the loss function is then obtained according to equation (20), and the parameters θ_t of the agent's prediction network are updated by backpropagation through the neural network using a mini-batch gradient-descent strategy;
Step 5-8: when the training times reach the target network updating interval, according to the predicted network parametersUpdating target network parameters +.>
Step 5-9: judging whether T is less than T, if T is the total time step in the e round, entering the step (5-4) if t=t+1, otherwise, entering the step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training episodes; if so, set e = e + 1 and go to step 5-3; otherwise the optimization is complete and the trained network model is obtained;
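The ε-greedy action selection used inside the training loop of steps 5-1 to 5-10 can be sketched minimally as below; the decay schedule of ε across episodes is left out.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action (exploration);
    otherwise pick the action with the largest Q value (exploitation)."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

During training ε typically starts near 1 and is annealed toward a small floor, which is exactly why the state of step 3-1 includes the episode number e and the exploration variable ε.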
further, the step 6 includes the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by the agent at a given moment;
step 6-2: output the optimal policy, obtaining the computation offloading node selected by each EVU and the corresponding channel and power allocation.
Drawings
Fig. 1 is a model of the air-ground integrated internet of things provided by the embodiment of the invention;
FIG. 2 is a framework diagram of the ID3QN algorithm provided by an embodiment of the invention;
FIG. 3 shows simulation results of the total system delay versus the computation task size according to an embodiment of the invention;
FIG. 4 shows simulation results of the total system delay versus the number of EVUs according to an embodiment of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples.
The invention targets an emergency rescue scene and builds the UAV-assisted air-ground integrated Internet of Things architecture shown in FIG. 1. Considering the demands of computation-intensive and delay-sensitive services, an optimization problem is constructed with the goal of minimizing the total system delay, an algorithm based on the dueling double deep Q network is proposed to jointly optimize offloading decisions and resource allocation, and a prioritized experience replay mechanism is introduced to improve performance. A framework diagram of the improved dueling double deep Q network (ID3QN) algorithm is shown in FIG. 2; from the trained model, the optimal offloading strategy and the corresponding channel and power allocation strategy can be obtained.
The present invention is described in further detail below.
Step 1: an air-ground integrated Internet of things system model formed by an unmanned aerial vehicle and an emergency rescue vehicle user is constructed, and the method comprises the following steps:
step 1-1: considering a microcell in a disaster area, in which M unmanned aerial vehicles are equipped with computing resources as airborne MEC nodes, they perform trajectory optimization in advance and preferentially fly to the vicinity of a desired area according to the situation of users, a set of UAVs is expressed as
Step 1-2: on the ground, there are N emergency vehiclesUsers (Emergency vehicle users, EVUs) need to perform computationally intensive and delay sensitive tasks, each EVU can move, the aggregate of which is denoted asAssuming that each EVU has only one computation task in each time slot, denoted +.> wherein ,dn Representing the amount of calculated data entered; i.e n Representing the number of CPU revolutions required to complete the calculation task; />Representing the maximum tolerable time delay of the task n; when the EVU does not have enough computing resources, the UAV is selected for computing and unloading;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the system time delay, comprising the following steps:
step 2-1: definition of the definitionTo indicate the position of execution of the nth EVU calculation task when +.>When the calculation task representing EVUn is executed locally,/>Representing task->Executing on UAvm, otherwise, +.>Then it is indicated that EVUn has not selected UAVm to complete the computational offload task, assuming that each EVU can only select one UAV for computational offload;
step 2-2: if the EVUn selects UAvm for computational offloading, then the signal-to-interference-and-noise ratio gamma of the V2U link between the EVU and the UAV n,m Can be expressed as
Wherein, P [ n ]]Sum sigma 2 The transmit power of EVUn and the power of additive white gaussian noise are represented, respectively;representing the channel coefficients between EVUn and UAVm; i n Representing the interference of EVUn from other V2U links using the same sub-band, can be calculated by
wherein ,represents the channel coefficient between EVUn' and UAvm using the same V2U link,/V>And->Using the same definition, n in the formula is changed to n';
step 2-3: because the channel between EVU and UAV is a Line of sight (LOS) of free space, the channel coefficients are related to the effects of path LOSs and can be expressed as
wherein ,is made of distance->The path loss represented; let the position coordinates of the transmitting end and the receiving end of the V2U link be (x) n ,y n ,z n ),(x m ,y m ,z m ) The Euclidean distance between EVUn and UAvm>Can be expressed as
Step 2-4: the transmission rate between EVUn and UAVm can be expressed as
R n,m =Blog 2 (1+γ n,m ) (25)
Wherein B represents the bandwidth of the V2U link;
step 2-5: then the total transmission delay can be expressed as
wherein ,representing the transmission delay after the UAvm is selected by the EVUn;
step 2-6: the total computation latency of all EVU execution tasks can be expressed as
wherein ,representing allocation to computing tasks>Is a computing resource of (a); />Indicating that local computing resources are available +.>Executing a computing task; when m > 0, & gt>Representing the number of CPU revolutions per second assigned to EVUn by the UAV; />Representing the computation time required by the EVUn to select UAvm to execute the task;
step 2-7: the total time cost of all systems can be expressed as
Step 2-8: based on the above definition, the optimization problem is expressed as that the total time delay of the system is minimized
wherein ,allocation policy indicating offloading policy, channel and user transmit power, respectively, +.>Represents the maximum transmit power per EVU, < >>Representing the largest computational resource of UAVm; constraint C1 represents task->A maximum tolerable delay time limit of (2); constraints C2, C3 and C4 represent the power constraints of each EVU and the constraints of the UAV computing resources, respectively; constraint C5 indicates that each EVU can only select one UAV for computational offloading;
step 3: by adopting a distributed resource allocation method, a deep reinforcement learning model is constructed according to an optimization problem, key parameters of a dual-Q network (Dueling double deep Q network, D3 QN) of a diagonal depth are set, and the method comprises the following steps:
step 3-1: regarding EVU as an agent, for each agent n, the current state s is first obtained from the state space by local observation at each time step t t (n) state space is calculated by the EVU's computing task informationCurrent channel state information->UAV status information F t And the random exploration variable epsilon composition in the training round number e and epsilon-greedy algorithm, namely
Step 3-2: thereafter, each agent passes through a state-action cost function Q π (s t (n),a t (n)) obtaining a policy pi and selecting action a from the action space t (n) each agent action space is defined by an offloading policySubchannel->And transmit powerIs expressed as
wherein ,indicating the calculated location of the agent, if the agent chooses to calculate locally +.>The training stage is not entered; if the EVU selects UAvm for computational offloading, then the EVU selects UAvm from the subchannel set C m One subchannel is selected; transmit power->Limited to 4 levels, i.e. [23, 10,5,0 ]]dBm, then the joint action space of the agent is expressed as
Step 3-3: based on the action selections of all agents, the environment is converted into a new state S i+1 All agents share a global prize, defining a single step prize function at t for each agent as
r t =C-T total (33)
Wherein C is a constant for adjusting r t So as to train;
step 3-4: in order to find the best strategy to maximize the overall return, both current and future returns must be considered, so the return is defined as the cumulative discount prize R t ,
wherein ,representing discount factors->Indicating that future rewards are more focused and +.>Representing that the current prize is more focused;
step 3-5: value-based deep reinforcement learning approximates Q(s) with the nonlinear proximity capability of neural networks t ,a t )=max π Q π (s t ,a t ) Then selecting an optimal action according to the optimal action value function; in the D3QN algorithm, the use parameter is theta t To better estimate the optimal action value function, i.e. Q * (S t ,A t ;θ t )≈max π Q π (S t ,A t );
Step 3-6: then, the network structure is designed, and different from the traditional deep double-Q network (Double deep Q network, DDQN), a breach layer is introduced before an output layer to evaluate the state and the action respectively, so that the intelligent agent can process the state with smaller relation with the action more effectively; the layer divides the network output into two parts, namely a state-dependent value function V (S t ) And a dominance function A (S) t ,A t ) In this way, the states can be independently evaluated rather than always relying on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A)
where θ^c, θ^V and θ^A denote the network parameters of the common part, the value-function part and the advantage-function part respectively, which together form the network parameter θ_t; the value function V represents the value of the current state; the advantage function A represents the value of each action compared with the other actions in the current state;
step 3-8: however, given Q, the above formula cannot uniquely determine V and A; in practical applications it needs to be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ^c, θ^A)
by subtracting the mean of the advantage function, V and A can be uniquely determined once Q is fixed;
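The dueling aggregation of steps 3-7 and 3-8 can be sketched with fixed toy values standing in for the two network heads (in the real network, V and A are produced by the dueling layer):

```python
import numpy as np

# Q = V + (A - mean(A)): subtracting the advantage mean makes the V/A
# decomposition identifiable for a given Q.
def dueling_q(v, adv):
    adv = np.asarray(adv, dtype=float)
    return float(v) + adv - adv.mean()   # broadcasts V over all actions

q = dueling_q(v=2.0, adv=[1.0, 0.0, -1.0])
print(q)
```

Note that the mean subtraction leaves the action ranking unchanged and forces the mean of the Q-values to equal V, which is what pins down the two heads.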
Step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found with the prediction network, and that action is then used to obtain the target Q value from the target network; the target value can be expressed as
y_t = r_t + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; θ_t); θ_t^-)
where θ_t and θ_t^- denote the parameters of the prediction network and the target network respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed interval; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameters θ_t for taking action A_t in state S_{t+1};
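A minimal sketch of this double-Q target computation, with toy arrays standing in for the two networks' Q-value outputs at state S_{t+1}:

```python
import numpy as np

# Double-Q target: the prediction network selects the argmax action,
# the target network evaluates it.
def double_q_target(r, gamma, q_pred_next, q_targ_next, done=False):
    if done:                                   # terminal transition: no bootstrap
        return r
    a_star = int(np.argmax(q_pred_next))       # action chosen by prediction net
    return r + gamma * q_targ_next[a_star]     # value taken from target net

y = double_q_target(r=1.0, gamma=0.7,
                    q_pred_next=np.array([0.2, 0.9, 0.1]),
                    q_targ_next=np.array([0.3, 0.5, 0.8]))
print(y)  # 1.0 + 0.7 * 0.5 = 1.35
```

The decoupling is visible in the example: the prediction network prefers action 1, so the target uses the target network's value 0.5 rather than its own maximum 0.8, which is what mitigates overestimation.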
step 4: a priority experience playback mechanism is introduced into the D3QN, so that the convergence rate of training is increased, and the system performance is improved;
the traditional experience replay mechanism draws mini-batch samples uniformly at random, yet the samples in fact differ in value: some samples accelerate network convergence, so if a priority is assigned to each sample in advance and samples are drawn according to priority, training becomes more efficient;
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n is stored in the memory replay pool as samples for subsequent training; stochastic sampling is used to interpolate between pure greedy prioritization and uniform random sampling, defining the probability that each sample i is drawn as
P(i) = p_i^α / Σ_k p_k^α          (18)
where α is an exponent, with α = 0 corresponding to uniform sampling; b denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its TD-error reaches 0; δ_i denotes the temporal difference error (Temporal difference error, TD-error) of sample i, expressed as
δ_i = y_t − Q(s_t, a_t; θ_t)
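The drawing probability of step 4-1 can be sketched as follows; the values of the exponent α and the offset β are illustrative assumptions, since the text does not fix them:

```python
import numpy as np

# Proportional prioritization: p_i = |delta_i| + beta,
# P(i) = p_i^alpha / sum_k p_k^alpha.
def sample_probs(td_errors, alpha=0.6, beta=1e-3):
    p = np.abs(np.asarray(td_errors, dtype=float)) + beta   # per-sample priority
    scaled = p ** alpha
    return scaled / scaled.sum()

probs = sample_probs([2.0, 0.5, 0.0])
uniform = sample_probs([2.0, 0.5, 0.0], alpha=0.0)  # alpha = 0 -> uniform draw
```

The β offset is what keeps the third sample (TD-error 0) drawable, and setting α = 0 recovers uniform sampling exactly as noted in the text.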
Step 4-2: when updating the network, each agent minimizes a loss function to perform gradient descent; when sample priorities are considered, the loss function is defined as
L(θ_t) = (1/|b|) Σ_{i∈b} w_i δ_i²          (20)
where w_i = [B·P(i)]^(−μ) is the importance-sampling (Importance sampling, IS) weight, B denotes the size of the experience replay pool, and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i);
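The IS-weighted loss of step 4-2 can be sketched as follows; normalizing the weights by their maximum is an assumption borrowed from common prioritized-replay practice rather than stated in the text, and all numbers are illustrative:

```python
import numpy as np

# IS-corrected loss: w_i = (B * P(i))^(-mu); mu = 1 fully compensates
# the non-uniform sampling probabilities.
def weighted_loss(td_errors, probs, pool_size, mu=1.0):
    w = (pool_size * np.asarray(probs, dtype=float)) ** (-mu)  # IS weights
    w = w / w.max()                 # max-normalization (assumed convention)
    return float(np.mean(w * np.asarray(td_errors, dtype=float) ** 2))

loss = weighted_loss(td_errors=[1.0, -2.0], probs=[0.25, 0.75], pool_size=4)
```

In the example the over-sampled second transition (P = 0.75) gets weight 1/3, so its squared TD-error of 4 contributes only 4/3 to the mean, which is exactly the down-weighting the correction is meant to apply.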
step 5: designing an ID3QN training algorithm and training a DRL model, wherein the training algorithm comprises the following steps:
step 5-1: start the environment simulator; initialize each agent's prediction network parameters θ and target network parameters θ^-, the target network update frequency, and other parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B, the exponents α and μ, etc.;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains an immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the memory replay pool;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the extraction probability given in formula (18), calculates the IS weights and updates the sample priorities; the loss function is obtained according to formula (20), and the parameters θ_t of the agent's prediction network are updated via neural network back propagation using a mini-batch gradient descent strategy;
Step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ_t^- are updated from the prediction network parameters θ_t;
Step 5-9: judge whether t < T, where T is the total number of time steps in round e; if so, set t = t + 1 and go to step (5-4); otherwise, go to step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training rounds; if so, set e = e + 1 and go to step (5-3); otherwise, the optimization is finished and the trained network model is obtained;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy, and the method comprises the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by an agent at a given moment;
Step 6-2: output the optimal policy π*, obtaining the computation offloading node selected by the EVU and the corresponding channel and power allocation.
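The greedy execution described in steps 6-1 and 6-2 amounts to an argmax over the trained network's Q-values, with no ε-exploration; a minimal sketch with a toy Q-vector standing in for the network output:

```python
import numpy as np

# Execution phase: pick the action with the highest predicted Q-value.
def execute_policy(q_values):
    return int(np.argmax(q_values))

# Toy Q-values over 3 joint actions (offloading node / channel / power combos).
best = execute_policy(np.array([0.1, 1.4, 0.7]))
print(best)
```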
To verify the effectiveness of the ID3QN method, simulations are carried out with PyCharm; the simulation environment is a space 2000 m long and 500 m wide, and the emergency rescue vehicles travel on a two-way four-lane road 2000 m long and 14 m wide; the UAV flying height is 50-120 m and the flying speed is 10 m/s; each UAV has 4 subchannels, a bandwidth of 4 MHz, a coverage diameter of 500 m, and a computing resource of 2 GHz.
Only LOS channels are considered in the simulation, and the path loss is set to 32.4 + 22log10(d) + 20log10(f_c), where f_c represents the carrier frequency in GHz and d represents the Euclidean distance between the EVU and the UAV in three-dimensional space; the shadow fading follows a lognormal distribution with a standard deviation of 4 dB; the large-scale fading is updated once per training round, and the small-scale fading is updated at every training step; the ID3QN in the simulation consists of 1 input layer, 4 hidden layers and 1 output layer, where the input layer size equals the state-space dimension D_s and the output layer size equals the action-space dimension D_a; the first 3 hidden layers are fully connected layers with 128, 64 and 64 neurons respectively, and the 4th hidden layer is the dueling layer with D_a + 1 neurons. During training, ReLU is used as the activation function, and parameters are updated with the RMSProp optimizer.
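The described architecture can be shape-checked with a small NumPy sketch; D_s = 10 is an assumed example state dimension, and D_a = 33 is likewise an assumption (it would correspond to, e.g., 2 UAVs × 4 subchannels × 4 power levels plus local computation). Weights are random, so this only verifies the stated layer dimensions, not any learned behavior:

```python
import numpy as np

# Layer sizes: input D_s -> 128 -> 64 -> 64 -> dueling head with D_a + 1
# outputs (1 state value + D_a advantages), combined into D_a Q-values.
D_s, D_a = 10, 33                 # assumed example dimensions
rng = np.random.default_rng(0)
sizes = [D_s, 128, 64, 64, D_a + 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]

x = rng.standard_normal(D_s)
for w in weights[:-1]:
    x = np.maximum(x @ w, 0.0)    # ReLU hidden layers, as in the simulation
head = x @ weights[-1]            # dueling layer output: [V, A_1 .. A_Da]
v, adv = head[0], head[1:]
q = v + adv - adv.mean()          # aggregation from step 3-8
print(q.shape)
```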
The number of training rounds is set to 1500 with 100 steps per round, and the target network parameters are updated once every 5 rounds; the experience replay pool size is 16384 and the mini-batch size is 2048; furthermore, the discount factor γ and the learning rate η are set to 0.7 and 0.001 respectively, and the initial and final values of ε are 1 and 0.02 respectively.
The ID3QN algorithm is compared with several baseline algorithms: 1. the traditional DDQN algorithm; 2. a DDQN algorithm with prioritized experience replay, abbreviated IDDQN; 3. the D3QN algorithm without prioritized experience replay.
Fig. 3 and Fig. 4 respectively show the performance comparison of the algorithms under different computation task sizes and different numbers of EVU users; it can be seen that the average system overhead of the ID3QN algorithm is always the lowest, that the D3QN algorithm has a clear performance advantage over the DDQN algorithm, and that introducing the prioritized experience replay mechanism further improves system performance.
What is not described in detail in the present application belongs to the prior art known to those skilled in the art.
Claims (1)
1. An emergency scene-oriented air-ground network distributed offloading decision and resource optimization method based on an improved dueling double deep Q network (Improved dueling double deep Q network, ID3QN), characterized by comprising the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: adopting a distributed resource allocation method, constructing a deep reinforcement learning model according to the optimization problem, and setting the key parameters of the dueling double deep Q network (Dueling double deep Q network, D3QN);
step 4: a priority experience playback mechanism is introduced into the D3QN, so that the convergence rate of training is increased, and the system performance is improved;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy;
further, the step 3 includes the following specific steps:
step 3-1: emergency vehicle users (Emergency vehicle users, EVUs) are regarded as agents; for each agent n, the current state s_t(n) is first obtained from the state space by local observation at each time step t; the state space is composed of the EVU's computing task information, the current channel state information, the UAV status information F_t, the training round number e, and the random exploration variable ε of the ε-greedy algorithm;
Step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q_π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space; each agent's action space is composed of the offloading policy o_t^n, the selected subchannel c_t^n, and the transmit power P_t^n,
wherein o_t^n indicates the computation location selected by the agent; if the agent chooses to compute locally, the training stage is not entered; if the EVU selects UAV m for computational offloading, it selects one subchannel from the subchannel set C_m; the transmit power P_t^n is limited to 4 levels, i.e. [23, 10, 5, 0] dBm; the joint action space of the agent is then expressed as A_t = {o_t^n, c_t^n, P_t^n};
Step 3-3: based on the action selections of all the agents, the environment transitions to a new state S_{t+1}; all agents share a global reward, and the single-step reward function of each agent at time t is defined as
r t =C-T total (4)
where C is a constant used to adjust r_t to facilitate training, and T_total represents the total time delay of the system;
step 3-4: in order to find the best strategy that maximizes the overall return, both current and future returns must be considered, so the return is defined as the cumulative discounted reward
R_t = Σ_{k=0}^{∞} γ^k r_{t+k},
where γ ∈ [0, 1] denotes the discount factor; a value of γ close to 1 means that future rewards are weighted more heavily, while a value close to 0 means that the current reward is weighted more heavily;
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate Q*(s_t, a_t) = max_π Q_π(s_t, a_t), then selects the optimal action according to the optimal action-value function; in the D3QN algorithm, a neural network with parameters θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q_π(S_t, A_t);
Step 3-6: next, the network structure is designed; unlike the traditional double deep Q network, a dueling layer is introduced before the output layer to evaluate the state and the action separately, so that the agent can handle states that are only weakly related to actions more effectively; this layer splits the network output into two parts, a state-dependent value function V(S_t) and an advantage function A(S_t, A_t), so that states can be evaluated independently rather than always through actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A)
where θ^c, θ^V and θ^A denote the network parameters of the common part, the value-function part and the advantage-function part respectively, which together form the network parameter θ_t; the value function V represents the value of the current state; the advantage function A represents the value of each action compared with the other actions in the current state;
step 3-8: however, given Q, the above formula cannot uniquely determine V and A; in practical applications it needs to be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ^c, θ^A)
by subtracting the mean of the advantage function, V and A can be uniquely determined once Q is fixed;
step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found with the prediction network, and that action is then used to obtain the target Q value from the target network; the target value can be expressed as
y_t = r_t + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; θ_t); θ_t^-)
where θ_t and θ_t^- denote the parameters of the prediction network and the target network respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed interval; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameters θ_t for taking action A_t in state S_{t+1};
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator; initialize each agent's prediction network parameters θ and target network parameters θ^-, the target network update frequency, and other parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B, the exponents α and μ, etc.;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains an immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the memory replay pool;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the probability
P(i) = p_i^α / Σ_k p_k^α
where α is an exponent, with α = 0 corresponding to uniform sampling; b denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its TD-error reaches 0, and δ_i denotes the temporal difference error (Temporal difference error, TD-error) of sample i, expressed as δ_i = y_t − Q(s_t, a_t; θ_t);
the IS weights w_i = [B·P(i)]^(−μ) are then calculated and the sample priorities updated, where B denotes the experience replay pool size and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i); the resulting loss function is
L(θ_t) = (1/|b|) Σ_{i∈b} w_i δ_i²,
and the parameters θ_t of the agent's prediction network are updated via neural network back propagation using a mini-batch gradient descent strategy;
Step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ_t^- are updated from the prediction network parameters θ_t;
Step 5-9: judge whether t < T, where T is the total number of time steps in round e; if so, set t = t + 1 and go to step (5-4); otherwise, go to step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training rounds; if so, set e = e + 1 and go to step (5-3); otherwise, the optimization is finished and the trained network model is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310861810.5A CN116963034A (en) | 2023-07-13 | 2023-07-13 | Emergency scene-oriented air-ground network distributed resource scheduling method
Publications (1)
Publication Number | Publication Date |
---|---|
CN116963034A true CN116963034A (en) | 2023-10-27 |
Family
ID=88443824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310861810.5A Pending CN116963034A (en) | 2023-07-13 | 2023-07-13 | Emergency scene-oriented air-ground network distributed resource scheduling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116963034A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117176213A (en) * | 2023-11-03 | 2023-12-05 | 中国人民解放军国防科技大学 | SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * | 2023-11-03 | 2024-01-30 | 中国人民解放军国防科技大学 | SCMA codebook selection and power distribution method based on deep prediction Q network
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113162679B (en) | DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method | |
CN114422056B (en) | Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface | |
Li et al. | Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning | |
CN113543074A (en) | Joint computing migration and resource allocation method based on vehicle-road cloud cooperation | |
CN109905860A (en) | A kind of server recruitment and task unloading prioritization scheme based on the calculating of vehicle mist | |
CN114567888B (en) | Multi-unmanned aerial vehicle dynamic deployment method | |
CN116963034A (en) | Emergency scene-oriented air-ground network distributed resource scheduling method | |
CN113115344B (en) | Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization | |
CN116456493A (en) | D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm | |
CN114169234A (en) | Scheduling optimization method and system for unmanned aerial vehicle-assisted mobile edge calculation | |
CN116600316A (en) | Air-ground integrated Internet of things joint resource allocation method based on deep double Q networks and federal learning | |
Zhang et al. | New computing tasks offloading method for MEC based on prospect theory framework | |
CN117098189A (en) | Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning | |
Nasr-Azadani et al. | Single-and multiagent actor–critic for initial UAV’s deployment and 3-D trajectory design | |
CN116321298A (en) | Multi-objective joint optimization task unloading strategy based on deep reinforcement learning in Internet of vehicles | |
CN115499921A (en) | Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network | |
CN115134242A (en) | Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy | |
CN116684925B (en) | Unmanned aerial vehicle-mounted intelligent reflecting surface safe movement edge calculation method | |
CN117221951A (en) | Task unloading method based on deep reinforcement learning in vehicle-mounted edge environment | |
CN114051252A (en) | Multi-user intelligent transmitting power control method in wireless access network | |
CN115811788B (en) | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning | |
CN116009590B (en) | Unmanned aerial vehicle network distributed track planning method, system, equipment and medium | |
CN116367231A (en) | Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm | |
Yang et al. | Deep reinforcement learning in NOMA-assisted UAV networks for path selection and resource offloading | |
CN116582836B (en) | Task unloading and resource allocation method, device, medium and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |