CN116963034A - Emergency scene-oriented air-ground network distributed resource scheduling method - Google Patents
- Publication number
- CN116963034A (application number CN202310861810.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- action
- parameters
- representing
- agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/90—Services for handling of emergency or hazardous situations, e.g. earthquake and tsunami warning systems [ETWS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/09—Management thereof
- H04W28/0925—Management thereof using policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/09—Management thereof
- H04W28/0958—Management thereof based on metrics or performance parameters
- H04W28/0967—Quality of Service [QoS] parameters
- H04W28/0975—Quality of Service [QoS] parameters for reducing delays
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/30—Services specially adapted for particular environments, situations or purposes
- H04W4/40—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
Abstract
The invention discloses an air-ground network distributed offloading decision and resource scheduling method for emergency scenes. For emergency disaster scenes, an air-ground integrated Internet of Things consisting of unmanned aerial vehicles and emergency rescue vehicle users is constructed; considering the demands of computation-intensive and delay-sensitive services, an optimization problem is formulated with the goal of minimizing the total system delay, and an improved dueling double deep Q network (ID3QN) algorithm is designed to solve it. The ID3QN algorithm used by the invention can minimize the time cost of the system while satisfying delay, power, and other constraints, and effectively solves the joint optimization of offloading decisions, channel allocation, and power allocation for vehicle users in emergency scenes.
Description
Technical Field
The invention relates to the field of the air-ground integrated Internet of Things, and in particular to an air-ground network distributed offloading decision and resource optimization method for emergency scenes based on an improved dueling double deep Q network.
Background
Emergency disaster scenarios place higher demands on the mobility, reliability, and flexibility of field rescue communication and computing facilities. Deploying Multi-access Edge Computing (MEC) in emergency scenarios may alleviate the limited computing resources of Internet of Things (IoT) devices; however, MEC deployed in advance suffers from inflexibility and uneven service coverage, and preset base stations are easily destroyed and left unable to provide service, so conventional terrestrial networks cannot meet the quick-response requirement of emergency scenarios. For this situation, the air-ground integrated Internet of Things plays a key role, providing support for assisting and supplementing terrestrial systems. The Third Generation Partnership Project (3GPP) regards Non-Terrestrial Networks (NTN) as a new feature of 5G, intended to provide wireless access services worldwide beyond geographic limitations. Unmanned Aerial Vehicles (UAVs) have the advantages of low cost and flexible maneuvering, and are widely applied in wireless communication. As aerial computing platforms, UAVs can assist edge computing, particularly in high-density public emergency scenarios.
In addition, due to the presence of various random and nonlinear factors, wireless communication systems are often difficult to model accurately, and even when modeling is possible, the models and algorithms become complex and fail to meet real-time response requirements. Artificial Intelligence (AI) techniques, with powerful data processing and representation capabilities and low inference complexity, can provide technical support; in particular, Deep Reinforcement Learning (DRL) has been widely applied to resource allocation and computation offloading problems in the Internet of Things.
Disclosure of Invention
The invention aims at constructing a UAV-assisted air-ground integrated Internet of Things architecture for emergency rescue scenes, and proposes an improved dueling double deep Q network (ID3QN) algorithm for offloading decisions and resource optimization, considering the requirements of computation-intensive and delay-sensitive services, so as to reduce the total system delay. To achieve this object, the invention adopts the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: constructing a deep reinforcement learning model from the optimization problem by adopting a distributed resource allocation method, and setting the key parameters of the dueling double deep Q network (D3QN);
step 4: a prioritized experience replay mechanism is introduced into the D3QN to accelerate training convergence and improve system performance;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is used to obtain the optimal offloading decision, user transmit power, and channel allocation strategy.
further, the step 1 includes the following specific steps:
step 1-1: consider a microcell in a disaster area in which M unmanned aerial vehicles equipped with computing resources serve as airborne MEC nodes; they perform trajectory optimization in advance and preferentially fly to the vicinity of the required area according to the users' situation. The set of UAVs is expressed as M = {1, 2, …, M};
Step 1-2: on the ground, there are N emergency vehicle users (Emergency vehicle users, EVUs) that need to perform computationally intensive and delay sensitive tasks, each EVU can move, the aggregate representation of whichIs thatAssuming that each EVU has only one computation task in each time slot, denoted +.> wherein ,dn Representing the amount of calculated data entered; i.e n Representing the number of CPU revolutions required to complete the calculation task; />Representing the maximum tolerable time delay of the task n; when the EVU does not have enough computing resources, the UAV is selected for computing and unloading;
further, the step 2 includes the following specific steps:
step 2-1: define a_{n,m} ∈ {0, 1} to indicate the execution location of the n-th EVU's computation task: a_{n,0} = 1 means the computation task of EVU n is executed locally; a_{n,m} = 1 (m > 0) means task Ω_n is executed on UAV m; otherwise a_{n,m} = 0 means that EVU n has not selected UAV m to complete the computation offloading task. It is assumed that each EVU can select only one UAV for computation offloading;
step 2-2: if EVU n selects UAV m for computation offloading, the signal-to-interference-plus-noise ratio γ_{n,m} of the V2U link between the EVU and the UAV can be expressed as

γ_{n,m} = P[n]·h_{n,m} / (I_n + σ²)   (1)

where P[n] and σ² represent the transmit power of EVU n and the power of the additive white Gaussian noise, respectively; h_{n,m} represents the channel coefficient between EVU n and UAV m; I_n represents the interference to EVU n from the other V2U links using the same sub-band, and can be calculated by

I_n = Σ_{n'≠n} P[n']·h_{n',m}   (2)

where h_{n',m} represents the channel coefficient between EVU n' and UAV m on the same sub-band, and P[n'] follows the same definition as P[n] with n replaced by n';
step 2-3: because the channel between the EVU and the UAV is a free-space line-of-sight (LOS) link, the channel coefficient depends on the path loss and can be expressed as

h_{n,m} = PL(d_{n,m})   (3)

where PL(d_{n,m}) is the path loss at distance d_{n,m}. Let the position coordinates of the transmitting and receiving ends of the V2U link be (x_n, y_n, z_n) and (x_m, y_m, z_m); the Euclidean distance d_{n,m} between EVU n and UAV m can then be expressed as

d_{n,m} = √((x_n − x_m)² + (y_n − y_m)² + (z_n − z_m)²)   (4)
Step 2-4: the transmission rate between EVU n and UAV m can be expressed as

R_{n,m} = B·log2(1 + γ_{n,m})   (5)

where B represents the bandwidth of the V2U link;
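To make the link model of equations (1)-(5) concrete, the sketch below computes the distance, channel gain, SINR, and rate of one V2U link in plain Python. The carrier frequency and the concrete free-space path-loss formula are illustrative assumptions; the text does not specify a particular path-loss model.

```python
import math

def euclidean_distance(pos_tx, pos_rx):
    """Euclidean distance d_{n,m} between EVU n and UAV m, eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pos_tx, pos_rx)))

def free_space_path_gain(d, fc_hz=2e9):
    """Free-space LOS channel gain h_{n,m} = PL(d_{n,m}), eq. (3).
    The standard free-space formula (c / (4*pi*fc*d))^2 is assumed here."""
    c = 3e8  # speed of light, m/s
    return (c / (4 * math.pi * fc_hz * d)) ** 2

def v2u_rate(p_tx, h, interference, noise_power, bandwidth_hz):
    """SINR gamma_{n,m} of eq. (1) and rate R_{n,m} = B*log2(1+gamma), eq. (5)."""
    sinr = p_tx * h / (interference + noise_power)
    return bandwidth_hz * math.log2(1 + sinr)
```

The gain falls off with the square of the distance, so a UAV hovering closer to its EVUs directly raises the achievable V2U rate.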
step 2-5: then the total transmission delay can be expressed as
wherein ,representing the transmission delay after the UAvm is selected by the EVUn;
step 2-6: the total computation delay of all EVUs executing their tasks can be expressed as

T^comp = Σ_{n=1}^{N} ( a_{n,0}·i_n / f_n^loc + Σ_{m=1}^{M} a_{n,m}·i_n / f_{n,m} )   (7)

where f_{n,m} represents the computing resource allocated to task Ω_n; f_n^loc indicates the local computing resource with which EVU n executes the task; when m > 0, f_{n,m} represents the number of CPU cycles per second allocated by UAV m to EVU n; and i_n / f_{n,m} represents the computation time required when EVU n selects UAV m to execute the task;
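A minimal sketch of the delay model of equations (6)-(7): a single task either runs locally, or is uploaded and then computed on a UAV. The units (bits, CPU cycles, bit/s, cycles/s) are illustrative assumptions.

```python
def transmission_delay(d_n_bits, rate_bps):
    """t^trans_{n,m} = d_n / R_{n,m}: time to upload the d_n input bits, eq. (6)."""
    return d_n_bits / rate_bps

def computation_delay(i_n_cycles, f_cycles_per_s):
    """i_n / f: CPU time at the chosen execution location, eq. (7)."""
    return i_n_cycles / f_cycles_per_s

def task_delay(offload, d_n, i_n, rate, f_local, f_uav):
    """Total delay of one task: local compute only, or upload plus UAV compute.
    (The result-download delay is ignored, as is common when outputs are small.)"""
    if not offload:
        return computation_delay(i_n, f_local)
    return transmission_delay(d_n, rate) + computation_delay(i_n, f_uav)
```

For instance, a 1 Mbit task of 10^9 cycles takes 1 s locally at 1 GHz, but 0.3 s when uploaded at 10 Mbit/s to a 5 GHz UAV node, which is exactly the trade-off the offloading decision weighs.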
step 2-7: the total time cost of the system can be expressed as

T_total = T^trans + T^comp   (8)
Step 2-8: based on the above definitions, the optimization problem of minimizing the total system delay is expressed as

min_{A, C, P} T_total
s.t. C1: t_n ≤ τ_n^max, ∀n
     C2: 0 ≤ P[n] ≤ P^max, ∀n
     C3: f_{n,m} ≥ 0, ∀n, m
     C4: Σ_{n=1}^{N} a_{n,m}·f_{n,m} ≤ F_m^max, ∀m
     C5: Σ_{m=0}^{M} a_{n,m} = 1, ∀n   (9)

where A, C, and P denote the offloading policy, the channel allocation, and the user transmit-power allocation, respectively; P^max represents the maximum transmit power of each EVU, and F_m^max represents the maximum computing resource of UAV m. Constraint C1 represents the maximum tolerable delay of task Ω_n; constraints C2, C3, and C4 represent the power constraint of each EVU and the UAV computing-resource constraints, respectively; constraint C5 indicates that each EVU can select only one UAV for computation offloading;
further, the step 3 includes the following specific steps:
step 3-1: each EVU is regarded as an agent. For each agent n, at each time step t the current state s_t(n) is first obtained from the state space by local observation. The state space consists of the EVU's computation task information Ω_n, the current channel state information H_t, the UAV state information F_t, the training episode number e, and the random exploration variable ε of the ε-greedy algorithm, i.e.

s_t(n) = {Ω_n, H_t, F_t, e, ε}   (10)
Step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q^π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space. Each agent's action space consists of the offloading decision x_n, the sub-channel c_n, and the transmit power P[n], expressed as

a_t(n) = {x_n, c_n, P[n]}   (11)

where x_n indicates the computation location of the agent: if the agent chooses local computation, it does not enter the training stage; if the EVU selects UAV m for computation offloading, it selects one sub-channel from the sub-channel set C_m; the transmit power P[n] is limited to 4 levels, i.e. [23, 10, 5, 0] dBm. The joint action space of all agents is then expressed as

A_t = {a_t(1), a_t(2), …, a_t(N)}   (12)
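As an illustration of the discrete per-agent action space of equation (11), the sketch below enumerates one agent's actions for a hypothetical setting with 2 UAVs and 3 sub-channels per UAV (both counts are assumptions); the four power levels are the ones given in the text.

```python
from itertools import product

M_UAVS = 2                   # assumed number of UAVs for illustration
SUBCHANNELS = [0, 1, 2]      # assumed sub-channel set C_m for illustration
POWER_DBM = [23, 10, 5, 0]   # the 4 transmit-power levels from the text

def build_action_space():
    """Enumerate one agent's discrete actions a_t(n) = {x_n, c_n, P[n]}.
    Action 0 is local computation; every offloading action combines a UAV
    index, a sub-channel, and one of the four power levels."""
    actions = [("local", None, None)]
    for m, c, p in product(range(1, M_UAVS + 1), SUBCHANNELS, POWER_DBM):
        actions.append((m, c, p))
    return actions
```

With these assumed sizes the agent's Q network has 1 + 2·3·4 = 25 output heads, one per discrete action, which is the scale at which a DQN-family method remains tractable.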
Step 3-3: based on the action selections of all agents, the environment transitions to a new state S_{t+1}. All agents share a global reward, and the single-step reward function of each agent at time t is defined as

r_t = C − T_total   (13)

where C is a constant used to adjust r_t so as to facilitate training;
step 3-4: in order to find the optimal policy that maximizes the overall return, both current and future returns must be considered, so the return is defined as the cumulative discounted reward

R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}   (14)

where γ ∈ [0, 1] is the discount factor; γ close to 1 means future rewards are emphasized, and γ close to 0 means the current reward is emphasized;
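The cumulative discounted reward of equation (14), taken over a finite episode, can be computed by the backward recursion R = r + γR:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward R_t = sum_k gamma^k * r_{t+k}, eq. (14),
    accumulated backwards over a finite episode tail."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```

With γ = 0.5 and three unit rewards the return is 1 + 0.5 + 0.25 = 1.75, and with γ = 0 only the immediate reward survives, matching the two regimes described in step 3-4.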
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate Q*(s_t, a_t) = max_π Q^π(s_t, a_t), and then selects the optimal action according to the optimal action-value function. In the D3QN algorithm, a neural network with parameters θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q^π(S_t, A_t);
Step 3-6: then, the network structure is designed, and unlike the traditional deep double-Q network, a breach layer is introduced before an output layer to evaluate the state and the action respectively, so that the intelligent agent can process the state with smaller relation with the action more effectively; the layer divides the network output into two parts, namely a state-dependent value function V (S t ) And a dominance function A (S) t ,A t ) In this way, the states can be independently evaluated rather than always relying on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as

Q(S_t, A_t; θ_t) = V(S_t; θ^com_t, θ^V_t) + A(S_t, A_t; θ^com_t, θ^A_t)   (15)

where θ^com_t, θ^V_t, and θ^A_t represent the network parameters of the common part, the value-function part, and the advantage-function part, respectively, and together constitute the network parameters θ_t; the value function V(·) represents the value of the current state; the advantage function A(·) represents the value of each action compared with the other actions in the current state;
step 3-8: however, based on the above formula, V(·) and A(·) cannot be uniquely determined from Q(·) alone; in practical applications, it is rewritten as

Q(S_t, A_t; θ_t) = V(S_t; θ^com_t, θ^V_t) + A(S_t, A_t; θ^com_t, θ^A_t) − (1/|A|)·Σ_{A'} A(S_t, A'; θ^com_t, θ^A_t)   (16)

By subtracting the mean of the advantage function, V(·) and A(·) can be determined once Q(·) is fixed;
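The mean-subtracted dueling aggregation of equation (16) can be sketched for a single state as follows; the mean of the resulting Q values recovers V, which is what makes the two streams identifiable.

```python
def dueling_q(v, advantages):
    """Dueling aggregation of eq. (16) for one state:
    Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a').
    Subtracting the advantage mean pins down V and A uniquely."""
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]
```

Note that the subtraction shifts every action's Q value by the same constant, so the greedy action is unchanged while V becomes the per-state baseline.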
Step 3-9: in the training process, the D3QN adopts two networks of a prediction network and a target network to relieve the problem of Q value overestimation, firstly finds out the action of maximizing the Q value in the prediction network, and then uses the action to obtain the target Q value in the target network, wherein the target value can be expressed as
wherein θt Andrepresenting parameters of the predicted network and the target network, respectively, two network structuresThe same, predict network parameter to upgrade continuously, the goal network parameter is updated once every certain cycle; q (S) t+1 ,A t ;θ t ) Representing neural network θ t The following is for state S t+1 Take action A t The obtained cost function;
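The double-Q target of equation (17), in which the greedy action comes from the prediction network but is evaluated by the target network, can be sketched as below; the `done` flag for terminal states is an addition not stated in the text.

```python
def double_dqn_target(r, gamma, next_q_pred, next_q_target, done=False):
    """Target value y_t of eq. (17): select the greedy action with the
    prediction network's Q values, evaluate it with the target network's."""
    if done:
        return r
    # argmax over the prediction network's Q values for S_{t+1}
    a_star = max(range(len(next_q_pred)), key=lambda a: next_q_pred[a])
    return r + gamma * next_q_target[a_star]
```

Decoupling selection from evaluation in this way is what curbs the overestimation bias of a single max operator.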
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n is stored in the replay memory as samples for subsequent training. A stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling is used, defining the probability that sample i is drawn as

P(i) = p_i^α / Σ_{k∈B} p_k^α   (18)

where α is an exponent, with α = 0 corresponding to uniform sampling; B denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its priority reaches 0; δ_i is the temporal-difference error (TD-error) of sample i, expressed as

δ_i = y_i − Q(s_i(n), a_i(n); θ_i)   (19)
Step 4-2: in updating the network, each agent needs to minimize the loss function to achieve gradient descent, which is defined as when considering sample priority
wherein ,wi =[BP(i)] -μ Represents a sampling-Importance (IS) weight, B represents an empirical playback pool size, μ IS an index, and when μ=1, w i Completely compensating the non-uniform probability P (i);
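The prioritized-replay quantities of equations (18)-(20) can be sketched as follows. Normalizing the IS weights by their maximum is a common stabilization convention assumed here, not something stated in the text.

```python
def sampling_probs(td_errors, alpha, beta=1e-3):
    """Priorities p_i = |delta_i| + beta and probabilities
    P(i) = p_i^alpha / sum_k p_k^alpha, eqs. (18)-(19)."""
    prios = [(abs(d) + beta) ** alpha for d in td_errors]
    total = sum(prios)
    return [p / total for p in prios]

def is_weights(probs, pool_size, mu):
    """Importance-sampling weights w_i = (B * P(i))^(-mu), normalized by
    the maximum weight for stability (an assumed convention)."""
    w = [(pool_size * p) ** (-mu) for p in probs]
    w_max = max(w)
    return [x / w_max for x in w]

def weighted_loss(td_errors, weights):
    """IS-weighted squared TD-error loss of eq. (20)."""
    return sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
```

Setting α = 0 recovers uniform replay, and μ = 1 with uniform probabilities yields equal weights, matching the limiting cases described in steps 4-1 and 4-2.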
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator; initialize the agents' prediction network parameters θ and target network parameters θ⁻, the target-network update frequency, and other related parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B and the exponents α and μ;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains the immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the replay memory;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the sampling probability in equation (18), computes the IS weights, and updates the sample priorities; the loss function is then obtained according to equation (20), and the parameters θ_t of the agent's prediction network are updated by backpropagation through the neural network using a mini-batch gradient-descent strategy;
Step 5-8: when the training times reach the target network updating interval, according to the predicted network parametersUpdating target network parameters +.>
Step 5-9: judging whether T is less than T, if T is the total time step in the e round, entering the step (5-4) if t=t+1, otherwise, entering the step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training episodes; if so, set e = e + 1 and go to step 5-3; otherwise the optimization is complete and the trained network model is obtained;
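The ε-greedy action selection used inside the training loop of steps 5-1 to 5-10 can be sketched minimally as below; the decay schedule of ε across episodes is left out.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action (exploration);
    otherwise pick the action with the largest Q value (exploitation)."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

During training ε typically starts near 1 and is annealed toward a small floor, which is exactly why the state of step 3-1 includes the episode number e and the exploration variable ε.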
further, the step 6 includes the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by the agent at a given moment;
step 6-2: output the optimal policy, obtaining the computation offloading node selected by each EVU and the corresponding channel and power allocation.
Drawings
Fig. 1 is a model of the air-ground integrated internet of things provided by the embodiment of the invention;
FIG. 2 is a framework diagram of the ID3QN algorithm provided by an embodiment of the invention;
FIG. 3 shows simulation results of the total system delay versus the computation task size according to an embodiment of the invention;
FIG. 4 shows simulation results of the total system delay versus the number of EVUs according to an embodiment of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples.
The invention targets an emergency rescue scene and builds the UAV-assisted air-ground integrated Internet of Things architecture shown in FIG. 1. Considering the demands of computation-intensive and delay-sensitive services, an optimization problem is constructed with the goal of minimizing the total system delay, an algorithm based on the dueling double deep Q network is proposed to jointly optimize offloading decisions and resource allocation, and a prioritized experience replay mechanism is introduced to improve performance. A framework diagram of the improved dueling double deep Q network (ID3QN) algorithm is shown in FIG. 2; from the trained model, the optimal offloading strategy and the corresponding channel and power allocation strategy can be obtained.
The present invention is described in further detail below.
Step 1: an air-ground integrated Internet of things system model formed by an unmanned aerial vehicle and an emergency rescue vehicle user is constructed, and the method comprises the following steps:
step 1-1: considering a microcell in a disaster area, in which M unmanned aerial vehicles are equipped with computing resources as airborne MEC nodes, they perform trajectory optimization in advance and preferentially fly to the vicinity of a desired area according to the situation of users, a set of UAVs is expressed as
Step 1-2: on the ground, there are N emergency vehiclesUsers (Emergency vehicle users, EVUs) need to perform computationally intensive and delay sensitive tasks, each EVU can move, the aggregate of which is denoted asAssuming that each EVU has only one computation task in each time slot, denoted +.> wherein ,dn Representing the amount of calculated data entered; i.e n Representing the number of CPU revolutions required to complete the calculation task; />Representing the maximum tolerable time delay of the task n; when the EVU does not have enough computing resources, the UAV is selected for computing and unloading;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the system time delay, comprising the following steps:
step 2-1: definition of the definitionTo indicate the position of execution of the nth EVU calculation task when +.>When the calculation task representing EVUn is executed locally,/>Representing task->Executing on UAvm, otherwise, +.>Then it is indicated that EVUn has not selected UAVm to complete the computational offload task, assuming that each EVU can only select one UAV for computational offload;
step 2-2: if the EVUn selects UAvm for computational offloading, then the signal-to-interference-and-noise ratio gamma of the V2U link between the EVU and the UAV n,m Can be expressed as
Wherein, P [ n ]]Sum sigma 2 The transmit power of EVUn and the power of additive white gaussian noise are represented, respectively;representing the channel coefficients between EVUn and UAVm; i n Representing the interference of EVUn from other V2U links using the same sub-band, can be calculated by
wherein ,represents the channel coefficient between EVUn' and UAvm using the same V2U link,/V>And->Using the same definition, n in the formula is changed to n';
step 2-3: because the channel between EVU and UAV is a Line of sight (LOS) of free space, the channel coefficients are related to the effects of path LOSs and can be expressed as
wherein ,is made of distance->The path loss represented; let the position coordinates of the transmitting end and the receiving end of the V2U link be (x) n ,y n ,z n ),(x m ,y m ,z m ) The Euclidean distance between EVUn and UAvm>Can be expressed as
Step 2-4: the transmission rate between EVUn and UAVm can be expressed as
R n,m =Blog 2 (1+γ n,m ) (25)
Wherein B represents the bandwidth of the V2U link;
step 2-5: then the total transmission delay can be expressed as
wherein ,representing the transmission delay after the UAvm is selected by the EVUn;
step 2-6: the total computation latency of all EVU execution tasks can be expressed as
wherein ,representing allocation to computing tasks>Is a computing resource of (a); />Indicating that local computing resources are available +.>Executing a computing task; when m > 0, & gt>Representing the number of CPU revolutions per second assigned to EVUn by the UAV; />Representing the computation time required by the EVUn to select UAvm to execute the task;
step 2-7: the total time cost of all systems can be expressed as
Step 2-8: based on the above definition, the optimization problem is expressed as that the total time delay of the system is minimized
wherein ,allocation policy indicating offloading policy, channel and user transmit power, respectively, +.>Represents the maximum transmit power per EVU, < >>Representing the largest computational resource of UAVm; constraint C1 represents task->A maximum tolerable delay time limit of (2); constraints C2, C3 and C4 represent the power constraints of each EVU and the constraints of the UAV computing resources, respectively; constraint C5 indicates that each EVU can only select one UAV for computational offloading;
step 3: by adopting a distributed resource allocation method, a deep reinforcement learning model is constructed according to an optimization problem, key parameters of a dual-Q network (Dueling double deep Q network, D3 QN) of a diagonal depth are set, and the method comprises the following steps:
step 3-1: regarding EVU as an agent, for each agent n, the current state s is first obtained from the state space by local observation at each time step t t (n) state space is calculated by the EVU's computing task informationCurrent channel state information->UAV status information F t And the random exploration variable epsilon composition in the training round number e and epsilon-greedy algorithm, namely
Step 3-2: thereafter, each agent passes through a state-action cost function Q π (s t (n),a t (n)) obtaining a policy pi and selecting action a from the action space t (n) each agent action space is defined by an offloading policySubchannel->And transmit powerIs expressed as
wherein ,indicating the calculated location of the agent, if the agent chooses to calculate locally +.>The training stage is not entered; if the EVU selects UAvm for computational offloading, then the EVU selects UAvm from the subchannel set C m One subchannel is selected; transmit power->Limited to 4 levels, i.e. [23, 10,5,0 ]]dBm, then the joint action space of the agent is expressed as
Step 3-3: based on the action selections of all agents, the environment is converted into a new state S i+1 All agents share a global prize, defining a single step prize function at t for each agent as
r t =C-T total (33)
Wherein C is a constant for adjusting r t So as to train;
step 3-4: in order to find the best strategy to maximize the overall return, both current and future returns must be considered, so the return is defined as the cumulative discount prize R t ,
wherein ,representing discount factors->Indicating that future rewards are more focused and +.>Representing that the current prize is more focused;
step 3-5: value-based deep reinforcement learning approximates Q(s) with the nonlinear proximity capability of neural networks t ,a t )=max π Q π (s t ,a t ) Then selecting an optimal action according to the optimal action value function; in the D3QN algorithm, the use parameter is theta t To better estimate the optimal action value function, i.e. Q * (S t ,A t ;θ t )≈max π Q π (S t ,A t );
Step 3-6: then, the network structure is designed, and different from the traditional deep double-Q network (Double deep Q network, DDQN), a breach layer is introduced before an output layer to evaluate the state and the action respectively, so that the intelligent agent can process the state with smaller relation with the action more effectively; the layer divides the network output into two parts, namely a state-dependent value function V (S t ) And a dominance function A (S) t ,A t ) In this way, the states can be independently evaluated rather than always relying on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A)
where θ^c, θ^V and θ^A denote the network parameters of the common part, the value-function part and the advantage-function part respectively, which together form the network parameter θ_t; the value function V represents the value of the current state; the advantage function A represents the value of each action compared with the other actions in the current state;
step 3-8: however, given Q, the above formula cannot uniquely determine V and A; in practical applications it needs to be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ^c, θ^A)
by subtracting the mean of the advantage function, V and A can be uniquely determined once Q is fixed;
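The dueling aggregation of steps 3-7 and 3-8 can be sketched with fixed toy values standing in for the two network heads (in the real network, V and A are produced by the dueling layer):

```python
import numpy as np

# Q = V + (A - mean(A)): subtracting the advantage mean makes the V/A
# decomposition identifiable for a given Q.
def dueling_q(v, adv):
    adv = np.asarray(adv, dtype=float)
    return float(v) + adv - adv.mean()   # broadcasts V over all actions

q = dueling_q(v=2.0, adv=[1.0, 0.0, -1.0])
print(q)
```

Note that the mean subtraction leaves the action ranking unchanged and forces the mean of the Q-values to equal V, which is what pins down the two heads.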
Step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found with the prediction network, and that action is then used to obtain the target Q value from the target network; the target value can be expressed as
y_t = r_t + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; θ_t); θ_t^-)
where θ_t and θ_t^- denote the parameters of the prediction network and the target network respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed interval; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameters θ_t for taking action A_t in state S_{t+1};
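A minimal sketch of this double-Q target computation, with toy arrays standing in for the two networks' Q-value outputs at state S_{t+1}:

```python
import numpy as np

# Double-Q target: the prediction network selects the argmax action,
# the target network evaluates it.
def double_q_target(r, gamma, q_pred_next, q_targ_next, done=False):
    if done:                                   # terminal transition: no bootstrap
        return r
    a_star = int(np.argmax(q_pred_next))       # action chosen by prediction net
    return r + gamma * q_targ_next[a_star]     # value taken from target net

y = double_q_target(r=1.0, gamma=0.7,
                    q_pred_next=np.array([0.2, 0.9, 0.1]),
                    q_targ_next=np.array([0.3, 0.5, 0.8]))
print(y)  # 1.0 + 0.7 * 0.5 = 1.35
```

The decoupling is visible in the example: the prediction network prefers action 1, so the target uses the target network's value 0.5 rather than its own maximum 0.8, which is what mitigates overestimation.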
step 4: a priority experience playback mechanism is introduced into the D3QN, so that the convergence rate of training is increased, and the system performance is improved;
the traditional experience replay mechanism draws mini-batch samples uniformly at random, yet the samples in fact differ in value: some samples accelerate network convergence, so if a priority is assigned to each sample in advance and samples are drawn according to priority, training becomes more efficient;
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n is stored in the memory replay pool as samples for subsequent training; stochastic sampling is used to interpolate between pure greedy prioritization and uniform random sampling, defining the probability that each sample i is drawn as
P(i) = p_i^α / Σ_k p_k^α          (18)
where α is an exponent, with α = 0 corresponding to uniform sampling; b denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its TD-error reaches 0; δ_i denotes the temporal difference error (Temporal difference error, TD-error) of sample i, expressed as
δ_i = y_t − Q(s_t, a_t; θ_t)
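The drawing probability of step 4-1 can be sketched as follows; the values of the exponent α and the offset β are illustrative assumptions, since the text does not fix them:

```python
import numpy as np

# Proportional prioritization: p_i = |delta_i| + beta,
# P(i) = p_i^alpha / sum_k p_k^alpha.
def sample_probs(td_errors, alpha=0.6, beta=1e-3):
    p = np.abs(np.asarray(td_errors, dtype=float)) + beta   # per-sample priority
    scaled = p ** alpha
    return scaled / scaled.sum()

probs = sample_probs([2.0, 0.5, 0.0])
uniform = sample_probs([2.0, 0.5, 0.0], alpha=0.0)  # alpha = 0 -> uniform draw
```

The β offset is what keeps the third sample (TD-error 0) drawable, and setting α = 0 recovers uniform sampling exactly as noted in the text.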
Step 4-2: when updating the network, each agent minimizes a loss function to perform gradient descent; when sample priorities are considered, the loss function is defined as
L(θ_t) = (1/|b|) Σ_{i∈b} w_i δ_i²          (20)
where w_i = [B·P(i)]^(−μ) is the importance-sampling (Importance sampling, IS) weight, B denotes the size of the experience replay pool, and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i);
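The IS-weighted loss of step 4-2 can be sketched as follows; normalizing the weights by their maximum is an assumption borrowed from common prioritized-replay practice rather than stated in the text, and all numbers are illustrative:

```python
import numpy as np

# IS-corrected loss: w_i = (B * P(i))^(-mu); mu = 1 fully compensates
# the non-uniform sampling probabilities.
def weighted_loss(td_errors, probs, pool_size, mu=1.0):
    w = (pool_size * np.asarray(probs, dtype=float)) ** (-mu)  # IS weights
    w = w / w.max()                 # max-normalization (assumed convention)
    return float(np.mean(w * np.asarray(td_errors, dtype=float) ** 2))

loss = weighted_loss(td_errors=[1.0, -2.0], probs=[0.25, 0.75], pool_size=4)
```

In the example the over-sampled second transition (P = 0.75) gets weight 1/3, so its squared TD-error of 4 contributes only 4/3 to the mean, which is exactly the down-weighting the correction is meant to apply.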
step 5: designing an ID3QN training algorithm and training a DRL model, wherein the training algorithm comprises the following steps:
step 5-1: start the environment simulator; initialize each agent's prediction network parameters θ and target network parameters θ^-, the target network update frequency, and other parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B, the exponents α and μ, etc.;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains an immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the memory replay pool;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the extraction probability given in formula (18), calculates the IS weights and updates the sample priorities; the loss function is obtained according to formula (20), and the parameters θ_t of the agent's prediction network are updated via neural network back propagation using a mini-batch gradient descent strategy;
Step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ_t^- are updated from the prediction network parameters θ_t;
Step 5-9: judge whether t < T, where T is the total number of time steps in round e; if so, set t = t + 1 and go to step (5-4); otherwise, go to step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training rounds; if so, set e = e + 1 and go to step (5-3); otherwise, the optimization is finished and the trained network model is obtained;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy, and the method comprises the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by an agent at a given moment;
Step 6-2: output the optimal policy π*, obtaining the computation offloading node selected by the EVU and the corresponding channel and power allocation.
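The greedy execution described in steps 6-1 and 6-2 amounts to an argmax over the trained network's Q-values, with no ε-exploration; a minimal sketch with a toy Q-vector standing in for the network output:

```python
import numpy as np

# Execution phase: pick the action with the highest predicted Q-value.
def execute_policy(q_values):
    return int(np.argmax(q_values))

# Toy Q-values over 3 joint actions (offloading node / channel / power combos).
best = execute_policy(np.array([0.1, 1.4, 0.7]))
print(best)
```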
To verify the effectiveness of the ID3QN method, simulations are carried out with PyCharm; the simulation environment is a space 2000 m long and 500 m wide, and the emergency rescue vehicles travel on a two-way four-lane road 2000 m long and 14 m wide; the UAV flying height is 50-120 m and the flying speed is 10 m/s; each UAV has 4 subchannels, a bandwidth of 4 MHz, a coverage diameter of 500 m, and a computing resource of 2 GHz.
Only LOS channels are considered in the simulation, and the path loss is set to 32.4 + 22log10(d) + 20log10(f_c), where f_c represents the carrier frequency in GHz and d represents the Euclidean distance between the EVU and the UAV in three-dimensional space; the shadow fading follows a lognormal distribution with a standard deviation of 4 dB; the large-scale fading is updated once per training round, and the small-scale fading is updated at every training step; the ID3QN in the simulation consists of 1 input layer, 4 hidden layers and 1 output layer, where the input layer size equals the state-space dimension D_s and the output layer size equals the action-space dimension D_a; the first 3 hidden layers are fully connected layers with 128, 64 and 64 neurons respectively, and the 4th hidden layer is the dueling layer with D_a + 1 neurons. During training, ReLU is used as the activation function, and parameters are updated with the RMSProp optimizer.
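The described architecture can be shape-checked with a small NumPy sketch; D_s = 10 is an assumed example state dimension, and D_a = 33 is likewise an assumption (it would correspond to, e.g., 2 UAVs × 4 subchannels × 4 power levels plus local computation). Weights are random, so this only verifies the stated layer dimensions, not any learned behavior:

```python
import numpy as np

# Layer sizes: input D_s -> 128 -> 64 -> 64 -> dueling head with D_a + 1
# outputs (1 state value + D_a advantages), combined into D_a Q-values.
D_s, D_a = 10, 33                 # assumed example dimensions
rng = np.random.default_rng(0)
sizes = [D_s, 128, 64, 64, D_a + 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]

x = rng.standard_normal(D_s)
for w in weights[:-1]:
    x = np.maximum(x @ w, 0.0)    # ReLU hidden layers, as in the simulation
head = x @ weights[-1]            # dueling layer output: [V, A_1 .. A_Da]
v, adv = head[0], head[1:]
q = v + adv - adv.mean()          # aggregation from step 3-8
print(q.shape)
```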
The number of training rounds is set to 1500 with 100 steps per round, and the target network parameters are updated once every 5 rounds; the experience replay pool size is 16384 and the mini-batch size is 2048; furthermore, the discount factor γ and the learning rate η are set to 0.7 and 0.001 respectively, and the initial and final values of ε are 1 and 0.02 respectively.
The ID3QN algorithm is compared with several baseline algorithms: 1. the traditional DDQN algorithm; 2. a DDQN algorithm with prioritized experience replay, abbreviated IDDQN; 3. the D3QN algorithm without prioritized experience replay.
Fig. 3 and Fig. 4 respectively show the performance comparison of the algorithms under different computation task sizes and different numbers of EVU users; it can be seen that the average system overhead of the ID3QN algorithm is always the lowest, that the D3QN algorithm has a clear performance advantage over the DDQN algorithm, and that introducing the prioritized experience replay mechanism further improves system performance.
What is not described in detail in the present application belongs to the prior art known to those skilled in the art.
Claims (1)
1. An emergency scene-oriented air-ground network distributed offloading decision and resource optimization method based on an improved dueling double deep Q network (Improved dueling double deep Q network, ID3QN), characterized by comprising the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: adopting a distributed resource allocation method, constructing a deep reinforcement learning model according to the optimization problem, and setting the key parameters of the dueling double deep Q network (Dueling double deep Q network, D3QN);
step 4: a priority experience playback mechanism is introduced into the D3QN, so that the convergence rate of training is increased, and the system performance is improved;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy;
further, the step 3 includes the following specific steps:
step 3-1: emergency vehicle users (Emergency vehicle users, EVUs) are regarded as agents; for each agent n, the current state s_t(n) is first obtained from the state space by local observation at each time step t; the state space is composed of the EVU's computing task information, the current channel state information, the UAV status information F_t, the training round number e, and the random exploration variable ε of the ε-greedy algorithm;
Step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q_π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space; each agent's action space is composed of the offloading policy o_t^n, the selected subchannel c_t^n, and the transmit power P_t^n,
wherein o_t^n indicates the computation location selected by the agent; if the agent chooses to compute locally, the training stage is not entered; if the EVU selects UAV m for computational offloading, it selects one subchannel from the subchannel set C_m; the transmit power P_t^n is limited to 4 levels, i.e. [23, 10, 5, 0] dBm; the joint action space of the agent is then expressed as A_t = {o_t^n, c_t^n, P_t^n};
Step 3-3: based on the action selections of all the agents, the environment transitions to a new state S_{t+1}; all agents share a global reward, and the single-step reward function of each agent at time t is defined as
r t =C-T total (4)
where C is a constant used to adjust r_t to facilitate training, and T_total represents the total time delay of the system;
step 3-4: in order to find the best strategy that maximizes the overall return, both current and future returns must be considered, so the return is defined as the cumulative discounted reward
R_t = Σ_{k=0}^{∞} γ^k r_{t+k},
where γ ∈ [0, 1] denotes the discount factor; a value of γ close to 1 means that future rewards are weighted more heavily, while a value close to 0 means that the current reward is weighted more heavily;
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate Q*(s_t, a_t) = max_π Q_π(s_t, a_t), then selects the optimal action according to the optimal action-value function; in the D3QN algorithm, a neural network with parameters θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q_π(S_t, A_t);
Step 3-6: next, the network structure is designed; unlike the traditional double deep Q network, a dueling layer is introduced before the output layer to evaluate the state and the action separately, so that the agent can handle states that are only weakly related to actions more effectively; this layer splits the network output into two parts, a state-dependent value function V(S_t) and an advantage function A(S_t, A_t), so that states can be evaluated independently rather than always through actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A)
where θ^c, θ^V and θ^A denote the network parameters of the common part, the value-function part and the advantage-function part respectively, which together form the network parameter θ_t; the value function V represents the value of the current state; the advantage function A represents the value of each action compared with the other actions in the current state;
step 3-8: however, given Q, the above formula cannot uniquely determine V and A; in practical applications it needs to be rewritten as
Q(S_t, A_t; θ^c, θ^V, θ^A) = V(S_t; θ^c, θ^V) + A(S_t, A_t; θ^c, θ^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ^c, θ^A)
by subtracting the mean of the advantage function, V and A can be uniquely determined once Q is fixed;
step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found with the prediction network, and that action is then used to obtain the target Q value from the target network; the target value can be expressed as
y_t = r_t + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; θ_t); θ_t^-)
where θ_t and θ_t^- denote the parameters of the prediction network and the target network respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed interval; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameters θ_t for taking action A_t in state S_{t+1};
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator; initialize each agent's prediction network parameters θ and target network parameters θ^-, the target network update frequency, and other parameters; initialize the parameters related to prioritized experience replay, setting the replay pool size B, the exponents α and μ, etc.;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy strategy, obtains an immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) is stored in the memory replay pool;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training data from the experience replay pool as samples according to the probability
P(i) = p_i^α / Σ_k p_k^α
where α is an exponent, with α = 0 corresponding to uniform sampling; b denotes a mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample from never being revisited once its TD-error reaches 0, and δ_i denotes the temporal difference error (Temporal difference error, TD-error) of sample i, expressed as δ_i = y_t − Q(s_t, a_t; θ_t);
the IS weights w_i = [B·P(i)]^(−μ) are then calculated and the sample priorities updated, where B denotes the experience replay pool size and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i); the resulting loss function is
L(θ_t) = (1/|b|) Σ_{i∈b} w_i δ_i²,
and the parameters θ_t of the agent's prediction network are updated via neural network back propagation using a mini-batch gradient descent strategy;
Step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ_t^- are updated from the prediction network parameters θ_t;
Step 5-9: judge whether t < T, where T is the total number of time steps in round e; if so, set t = t + 1 and go to step (5-4); otherwise, go to step (5-10);
step 5-10: judge whether e < I, where I is the set total number of training rounds; if so, set e = e + 1 and go to step (5-3); otherwise, the optimization is finished and the trained network model is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310861810.5A CN116963034A (en) | 2023-07-13 | 2023-07-13 | Emergency scene-oriented air-ground network distributed resource scheduling method
Publications (1)
Publication Number | Publication Date |
---|---|
CN116963034A true CN116963034A (en) | 2023-10-27 |
Family
ID=88443824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310861810.5A Pending CN116963034A (en) | 2023-07-13 | 2023-07-13 | Emergency scene-oriented air-ground network distributed resource scheduling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116963034A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117176213A (en) * | 2023-11-03 | 2023-12-05 | 中国人民解放军国防科技大学 | SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * | 2023-11-03 | 2024-01-30 | 中国人民解放军国防科技大学 | SCMA codebook selection and power distribution method based on deep prediction Q network
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113162679B (en) | DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method | |
CN114422056B (en) | Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface | |
Li et al. | Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning | |
CN113543074A (en) | Joint computing migration and resource allocation method based on vehicle-road cloud cooperation | |
CN109905860A (en) | A kind of server recruitment and task unloading prioritization scheme based on the calculating of vehicle mist | |
CN114567888B (en) | Multi-unmanned aerial vehicle dynamic deployment method | |
CN116963034A (en) | Emergency scene-oriented air-ground network distributed resource scheduling method | |
CN113115344B (en) | Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization | |
CN116456493A (en) | D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm | |
CN114169234A (en) | Scheduling optimization method and system for unmanned aerial vehicle-assisted mobile edge calculation | |
CN116600316A (en) | Air-ground integrated Internet of things joint resource allocation method based on deep double Q networks and federal learning | |
Zhang et al. | New computing tasks offloading method for MEC based on prospect theory framework | |
CN117098189A (en) | Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning | |
Nasr-Azadani et al. | Single-and multiagent actor–critic for initial UAV’s deployment and 3-D trajectory design | |
CN116321298A (en) | Multi-objective joint optimization task unloading strategy based on deep reinforcement learning in Internet of vehicles | |
CN115499921A (en) | Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network | |
CN115134242A (en) | Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy | |
CN116684925B (en) | Unmanned aerial vehicle-mounted intelligent reflecting surface safe movement edge calculation method | |
CN117221951A (en) | Task unloading method based on deep reinforcement learning in vehicle-mounted edge environment | |
CN114051252A (en) | Multi-user intelligent transmitting power control method in wireless access network | |
CN115811788B (en) | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning | |
CN116009590B (en) | Unmanned aerial vehicle network distributed track planning method, system, equipment and medium | |
CN116367231A (en) | Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm | |
Yang et al. | Deep reinforcement learning in NOMA-assisted UAV networks for path selection and resource offloading | |
CN116582836B (en) | Task unloading and resource allocation method, device, medium and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |