CN116963034A - Emergency scene-oriented air-ground network distributed resource scheduling method - Google Patents

Emergency scene-oriented air-ground network distributed resource scheduling method

Info

Publication number
CN116963034A
CN116963034A (Application CN202310861810.5A)
Authority
CN
China
Prior art keywords
network
action
parameters
representing
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310861810.5A
Other languages
Chinese (zh)
Inventor
程梦倩
宋晓勤
雷磊
李楠
张莉涓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202310861810.5A priority Critical patent/CN116963034A/en
Publication of CN116963034A publication Critical patent/CN116963034A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/90 Services for handling of emergency or hazardous situations, e.g. earthquake and tsunami warning systems [ETWS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/08 Load balancing or load distribution
    • H04W 28/09 Management thereof
    • H04W 28/0925 Management thereof using policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/08 Load balancing or load distribution
    • H04W 28/09 Management thereof
    • H04W 28/0958 Management thereof based on metrics or performance parameters
    • H04W 28/0967 Quality of Service [QoS] parameters
    • H04W 28/0975 Quality of Service [QoS] parameters for reducing delays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Emergency Management (AREA)
  • Medical Informatics (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Environmental & Geological Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an air-ground network distributed offloading decision and resource scheduling method for emergency scenes. Aiming at emergency disaster scenes, an air-ground integrated Internet of Things consisting of unmanned aerial vehicles and emergency rescue vehicle users is constructed; considering the demands of computation-intensive and delay-sensitive services, an optimization problem is formulated with the goal of minimizing the total system delay, and an improved dueling double deep Q network (ID3QN) algorithm is then designed to solve it. The ID3QN algorithm used by the invention can minimize the time cost of the system while satisfying the delay, power and other constraints, and effectively solves the joint optimization problem of offloading decision, channel allocation and power allocation for vehicle users in emergency scenes.

Description

Emergency scene-oriented air-ground network distributed resource scheduling method
Technical Field
The invention relates to the field of the air-ground integrated Internet of Things, in particular to an air-ground network distributed offloading decision and resource optimization method based on an improved dueling double deep Q network for emergency scenes.
Background
Emergency disaster scenarios place higher demands on the mobility, reliability and flexibility of field rescue communication and computing facilities. Deploying multi-access edge computing (Multi-access Edge Computing, MEC) in emergency scenarios can alleviate the problem of limited computing resources of Internet of Things (IoT) devices; however, MEC deployed in advance in an emergency scene is inflexible and provides uneven service, and pre-installed base stations are easily destroyed and unable to provide service, so a conventional terrestrial network cannot meet the requirement of rapid response in emergency scenarios. For this situation, the air-ground integrated Internet of Things plays a key role and provides support for assisting and supplementing the terrestrial system. The Third Generation Partnership Project (3GPP) regards non-terrestrial networks (Non-Terrestrial Networks, NTN) as a new feature of 5G, intended to provide wireless access services worldwide beyond space limitations. Unmanned aerial vehicles (Unmanned Aerial Vehicles, UAVs) have the advantages of low cost and flexible maneuvering, and are widely applied in the field of wireless communication. As aerial computing platforms, UAVs can assist edge computing, particularly in high-density public emergency scenarios.
In addition, due to the presence of various random and nonlinear factors, wireless communication systems are often difficult to model accurately, and even when modeling is possible, the models and algorithms become complex and fail to meet real-time response requirements. Artificial intelligence (AI) technology, with its powerful data processing and representation capability and low inference complexity, can provide technical support; in particular, deep reinforcement learning (Deep Reinforcement Learning, DRL) has been widely applied to resource allocation and computation offloading problems in the Internet of Things.
Disclosure of Invention
The invention constructs an unmanned aerial vehicle-assisted air-ground integrated Internet of Things architecture for emergency rescue scenes, and proposes an improved dueling double deep Q network (Improved Dueling Double Deep Q Network, ID3QN) algorithm for offloading decision and resource optimization, so as to reduce the total system delay while considering the demands of computation-intensive and delay-sensitive services. To achieve this object, the invention adopts the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: adopting a distributed resource allocation method, constructing a deep reinforcement learning model according to the optimization problem, and setting the key parameters of the dueling double deep Q network (Dueling Double Deep Q Network, D3QN);
step 4: introducing a prioritized experience replay mechanism into the D3QN to accelerate training convergence and improve system performance;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy;
further, the step 1 includes the following specific steps:
step 1-1: consider a micro-cell in a disaster area, in which M unmanned aerial vehicles equipped with computing resources serve as airborne MEC nodes; they perform trajectory optimization in advance and preferentially fly to the vicinity of the required area according to the users' situation, and the set of UAVs is expressed as M = {1, 2, ..., M};
step 1-2: on the ground, there are N emergency vehicle users (Emergency Vehicle Users, EVUs) that need to execute computation-intensive and delay-sensitive tasks; each EVU can move, and the set of EVUs is expressed as N = {1, 2, ..., N}. Assuming that each EVU has only one computation task in each time slot, the task is denoted Φ_n = (d_n, i_n, T_n^max), where d_n represents the amount of input data to be computed, i_n represents the number of CPU cycles required to complete the computation task, and T_n^max represents the maximum tolerable delay of task n; when an EVU does not have enough computing resources, it selects a UAV for computation offloading;
further, the step 2 includes the following specific steps:
step 2-1: define a binary offloading indicator x_{n,m} ∈ {0, 1}, m ∈ {0, 1, ..., M}, to indicate where the computation task of the n-th EVU is executed: x_{n,0} = 1 means the computation task of EVU n is executed locally, x_{n,m} = 1 (m > 0) means task Φ_n is executed on UAV m, and x_{n,m} = 0 means EVU n has not selected UAV m for computation offloading; it is assumed that each EVU can select only one UAV for computation offloading;
step 2-2: if EVU n selects UAV m for computation offloading, the signal-to-interference-plus-noise ratio γ_{n,m} of the V2U link between the EVU and the UAV can be expressed as
γ_{n,m} = P[n] · h_{n,m} / (σ² + I_n)   (1)
where P[n] and σ² represent the transmit power of EVU n and the power of the additive white Gaussian noise, respectively; h_{n,m} represents the channel coefficient between EVU n and UAV m; I_n represents the interference suffered by EVU n from other V2U links using the same sub-band, which can be calculated by
I_n = Σ_{n'≠n} P[n'] · h_{n',m}   (2)
where h_{n',m} represents the channel coefficient between EVU n', which uses the same sub-band, and UAV m, and P[n'] uses the same definition as P[n] with n replaced by n';
step 2-3: because the channel between an EVU and a UAV is a free-space line-of-sight (Line of Sight, LOS) channel, the channel coefficient is determined by the path loss and can be expressed as
h_{n,m} = PL(d_{n,m})   (3)
where PL(d_{n,m}) is the path loss determined by the distance d_{n,m}; let the position coordinates of the transmitting end and the receiving end of the V2U link be (x_n, y_n, z_n) and (x_m, y_m, z_m), then the Euclidean distance d_{n,m} between EVU n and UAV m can be expressed as
d_{n,m} = √((x_n − x_m)² + (y_n − y_m)² + (z_n − z_m)²)   (4)
step 2-4: the transmission rate between EVU n and UAV m can be expressed as
R_{n,m} = B · log₂(1 + γ_{n,m})   (5)
where B represents the bandwidth of the V2U link;
step 2-5: the total transmission delay can then be expressed as
T^trans = Σ_n Σ_{m>0} x_{n,m} · T^trans_{n,m},   T^trans_{n,m} = d_n / R_{n,m}   (6)
where T^trans_{n,m} represents the transmission delay after EVU n selects UAV m;
step 2-6: the total computation delay of all EVUs executing their tasks can be expressed as
T^comp = Σ_n Σ_m x_{n,m} · T^comp_{n,m},   T^comp_{n,m} = i_n / f_{n,m}   (7)
where f_{n,m} represents the computing resource allocated to task Φ_n: when m = 0, the task is executed with the local computing resource f_{n,0} of EVU n, and when m > 0, f_{n,m} represents the number of CPU cycles per second assigned to EVU n by the UAV; T^comp_{n,m} represents the computation time required when EVU n selects UAV m to execute the task;
step 2-7: the total time cost of the system can be expressed as
T_total = T^trans + T^comp   (8)
step 2-8: based on the above definitions, the optimization problem of minimizing the total system delay is expressed as
min_{X, C, P} T_total,  subject to constraints C1–C5   (9)
where X, C and P denote the offloading policy, the channel allocation policy and the user transmit-power allocation policy, respectively; P^max represents the maximum transmit power of each EVU, and F_m^max represents the maximum computing resource of UAV m; constraint C1 represents the maximum tolerable delay limit of task Φ_n; constraints C2, C3 and C4 represent the power constraint of each EVU and the constraints on the UAV computing resources; constraint C5 indicates that each EVU can select only one UAV for computation offloading;
further, the step 3 includes the following specific steps:
step 3-1: each EVU is regarded as an agent; for each agent n, the current state s_t(n) is first obtained from the state space by local observation at each time step t. The state space consists of the EVU's computation-task information Φ_n, the current channel state information H_t(n), the UAV state information F_t, the training episode number e and the random exploration variable ε of the ε-greedy algorithm, namely
s_t(n) = {Φ_n, H_t(n), F_t, e, ε}   (10)
step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q_π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space; the action space of each agent consists of the offloading decision x_t^n, the sub-channel selection c_t^n and the transmit power P_t^n, and is expressed as
a_t(n) = {x_t^n, c_t^n, P_t^n}   (11)
where x_t^n indicates the computation location of the agent: if the agent chooses local computation (x_t^n = 0), it does not enter the training stage; if the EVU selects UAV m for computation offloading, it selects one sub-channel from the sub-channel set C_m; the transmit power P_t^n is limited to 4 levels, namely [23, 10, 5, 0] dBm; the joint action space of the agents is then expressed as
A_t = {a_t(1), a_t(2), ..., a_t(N)}   (12)
step 3-3: based on the action selections of all agents, the environment transitions to a new state S_{t+1}; all agents share a global reward, and the single-step reward function of each agent at time t is defined as
r_t = C − T_total   (13)
where C is a constant used to adjust r_t so as to facilitate training;
step 3-4: in order to find the optimal policy that maximizes the overall return, both current and future rewards must be considered, so the return is defined as the cumulative discounted reward
R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}   (14)
where γ ∈ [0, 1] represents the discount factor; a value of γ close to 1 means that future rewards are emphasized, while a value close to 0 means that the current reward is emphasized;
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate the optimal action-value function Q*(s_t, a_t) = max_π Q_π(s_t, a_t), and then selects the optimal action according to this function; in the D3QN algorithm, a deep neural network with parameter θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q_π(S_t, A_t);
step 3-6: the network structure is then designed; unlike the conventional deep double-Q network, a dueling layer is introduced before the output layer to evaluate the state and the action separately, so that the agent can handle states that have little relation to the actions more effectively; this layer splits the network output into two parts, a state-dependent value function V(S_t) and an advantage function A(S_t, A_t), so that states can be evaluated independently instead of always depending on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A)   (15)
where θ_t^c, θ_t^V and θ_t^A represent the network parameters of the common part, the value-function part and the advantage-function part, respectively, and together constitute the network parameter θ_t; the value function V(·) represents the value of the current state, and the advantage function A(·) represents the value of each action compared with the other actions in the current state;
step 3-8: however, the above formula cannot uniquely determine V(·) and A(·) from Q(·); in practical applications it needs to be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ_t^c, θ_t^A)   (16)
by subtracting the mean of the advantage function, V(·) and A(·) can be uniquely determined once Q(·) is fixed;
step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found in the prediction network, and this action is then used in the target network to obtain the target Q value, which can be expressed as
y_t = r_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)   (17)
where θ_t and θ_t⁻ represent the parameters of the prediction network and the target network, respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed number of cycles; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameter θ_t for taking action A_t in state S_{t+1};
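A minimal PyTorch sketch of the dueling head and the double-Q target described in steps 3-6 to 3-9 is given below; the layer width and the names DuelingQNet and double_q_target are illustrative assumptions rather than the exact network of the invention, and the default discount factor 0.7 follows the simulation settings quoted later.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')  (eqs. 15-16)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, action_dim)  # A(s,a)

    def forward(self, s):
        z = self.common(s)
        v, a = self.value(z), self.advantage(z)
        return v + a - a.mean(dim=1, keepdim=True)

def double_q_target(pred_net, target_net, r, s_next, gamma=0.7):
    """Double-Q target (eq. 17): action chosen by the prediction network,
    value evaluated by the target network."""
    with torch.no_grad():
        a_star = pred_net(s_next).argmax(dim=1, keepdim=True)
        return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
```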
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n are stored in the replay memory as samples for subsequent training; a stochastic sampling method is used to interpolate between pure greedy prioritization and uniform random sampling, and the probability that sample i is drawn is defined as
P(i) = p_i^α / Σ_k p_k^α   (18)
where α is an exponent, and α = 0 corresponds to uniform sampling; b denotes the mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample whose priority is 0 from never being revisited, and δ_i is the temporal-difference error (Temporal Difference error, TD-error) of sample i, expressed as
δ_i = y_i − Q(S_i, A_i; θ_t)   (19)
where y_i is the target value given by equation (17);
step 4-2: when updating the network, each agent minimizes the loss function to perform gradient descent; when the sample priority is taken into account, the loss function is defined as
L(θ_t) = (1/b) Σ_{i∈b} w_i · δ_i²   (20)
where w_i = (B · P(i))^{−μ} represents the importance-sampling (IS) weight, B represents the size of the experience replay pool, and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i);
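The proportional prioritized sampling and importance-sampling weights of equations (18)–(20) can be sketched in plain NumPy as follows; this is a simplified illustration (a flat array rather than a sum-tree), and the class name PERBuffer and the default exponent values α = 0.6 and μ = 0.4 are assumptions, since the patent does not state them.

```python
import numpy as np

class PERBuffer:
    """Proportional prioritized experience replay: P(i) ∝ p_i^alpha, w_i = (B*P(i))^-mu."""
    def __init__(self, capacity, alpha=0.6, mu=0.4, beta_small=1e-5):
        self.capacity, self.alpha, self.mu, self.beta_small = capacity, alpha, mu, beta_small
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + self.beta_small)   # p_i = |delta_i| + beta

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        prob = p / p.sum()                                         # eq. (18)
        idx = np.random.choice(len(self.data), batch_size, p=prob)
        w = (len(self.data) * prob[idx]) ** (-self.mu)             # IS weights, eq. (20)
        return idx, [self.data[i] for i in idx], w / w.max()       # normalized for stability

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.priorities[i] = abs(d) + self.beta_small
```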
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator, initialize the prediction network parameters θ and the target network parameters θ⁻ of each agent, and initialize parameters such as the target network update frequency; initialize the parameters related to prioritized experience replay, such as the replay pool size B and the exponents α and μ;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy policy, obtains the immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) are stored in the replay memory;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training samples from the experience replay pool according to the sampling probability given in equation (18), computes the IS weights and updates the sample priorities; the loss function is obtained according to equation (20), and the parameters θ of the agent's prediction network are updated through back-propagation of the neural network using a mini-batch gradient descent strategy;
step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ⁻ are updated from the prediction network parameters θ;
step 5-9: judge whether t < T, where T is the total number of time steps in episode e; if so, set t = t + 1 and go to step 5-4; otherwise go to step 5-10;
step 5-10: judge whether e < I, where I is the set total number of training episodes; if so, set e = e + 1 and go to step 5-3; otherwise the optimization ends and the trained network model is obtained;
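Putting steps 5-1 to 5-10 together, a compact training-loop sketch that reuses the DuelingQNet, double_q_target and PERBuffer sketches above could look as follows; the environment interface (env.reset, env.step, env.epsilon_greedy_action, env.update_small_scale_fading) is an assumed abstraction, while the hyperparameter defaults follow the simulation settings given later in the text.

```python
import numpy as np
import torch

def train_id3qn(env, agents, buffers, episodes=1500, steps=100,
                batch_size=2048, target_update=5, lr=1e-3, gamma=0.7):
    """agents[n] = (pred_net, target_net) for EVU agent n; buffers[n] = a PERBuffer."""
    optims = [torch.optim.RMSprop(pred.parameters(), lr=lr) for pred, _ in agents]
    for e in range(episodes):                            # steps 5-2 and 5-10
        states = env.reset()                             # step 5-4: positions, fading, UAVs
        for t in range(steps):                           # steps 5-3 and 5-9
            for n, (pred, _) in enumerate(agents):       # step 5-5: epsilon-greedy acting
                a = env.epsilon_greedy_action(pred, states[n], e)
                s_next, r = env.step(n, a)
                buffers[n].add((states[n], a, r, s_next))
                states[n] = s_next
            env.update_small_scale_fading()              # step 5-6
            for n, (pred, targ) in enumerate(agents):    # step 5-7: PER sample and update
                idx, batch, w = buffers[n].sample(batch_size)
                s, a, r, s2 = (torch.as_tensor(np.array(x), dtype=torch.float32)
                               for x in zip(*batch))
                y = double_q_target(pred, targ, r, s2, gamma)                 # eq. (17)
                q = pred(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
                td = y - q
                loss = (torch.as_tensor(w, dtype=torch.float32) * td.pow(2)).mean()  # eq. (20)
                optims[n].zero_grad(); loss.backward(); optims[n].step()
                buffers[n].update_priorities(idx, td.detach().numpy())
        if (e + 1) % target_update == 0:                 # step 5-8: copy prediction -> target
            for pred, targ in agents:
                targ.load_state_dict(pred.state_dict())
```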
further, the step 6 includes the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by the agent at a given moment;
step 6-2: output the optimal policy a_t*(n) = argmax_a Q(s_t(n), a; θ), and obtain the computation offloading node selected by the EVU and the corresponding channel and power allocation.
Drawings
Fig. 1 is the air-ground integrated Internet of Things model provided by an embodiment of the invention;
Fig. 2 is a framework diagram of the ID3QN algorithm provided by an embodiment of the invention;
Fig. 3 is a simulation result diagram of the total system delay versus the computation task size according to an embodiment of the invention;
Fig. 4 is a simulation result diagram of the total system delay versus the number of EVUs according to an embodiment of the invention;
Detailed Description
The invention is described in further detail below with reference to the drawings and examples.
The invention is directed at emergency rescue scenes. It builds the unmanned aerial vehicle-assisted air-ground integrated Internet of Things architecture shown in Fig. 1, considers the demands of computation-intensive and delay-sensitive services, and formulates an optimization problem aiming at minimizing the total system delay. An algorithm based on the dueling double deep Q network is proposed to jointly optimize the offloading decision and resource allocation, and a prioritized experience replay mechanism is introduced to improve performance. The framework of the improved dueling double deep Q network (Improved Dueling Double Deep Q Network, ID3QN) algorithm is shown in Fig. 2; according to the trained model, the optimal offloading strategy and the corresponding channel and power allocation strategy can be obtained.
The present invention is described in further detail below.
Step 1: an air-ground integrated Internet of things system model formed by an unmanned aerial vehicle and an emergency rescue vehicle user is constructed, and the method comprises the following steps:
step 1-1: consider a micro-cell in a disaster area, in which M unmanned aerial vehicles equipped with computing resources serve as airborne MEC nodes; they perform trajectory optimization in advance and preferentially fly to the vicinity of the required area according to the users' situation, and the set of UAVs is expressed as M = {1, 2, ..., M};
step 1-2: on the ground, there are N emergency vehicle users (Emergency Vehicle Users, EVUs) that need to execute computation-intensive and delay-sensitive tasks; each EVU can move, and the set of EVUs is expressed as N = {1, 2, ..., N}. Assuming that each EVU has only one computation task in each time slot, the task is denoted Φ_n = (d_n, i_n, T_n^max), where d_n represents the amount of input data to be computed, i_n represents the number of CPU cycles required to complete the computation task, and T_n^max represents the maximum tolerable delay of task n; when an EVU does not have enough computing resources, it selects a UAV for computation offloading;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the system time delay, comprising the following steps:
step 2-1: define a binary offloading indicator x_{n,m} ∈ {0, 1}, m ∈ {0, 1, ..., M}, to indicate where the computation task of the n-th EVU is executed: x_{n,0} = 1 means the computation task of EVU n is executed locally, x_{n,m} = 1 (m > 0) means task Φ_n is executed on UAV m, and x_{n,m} = 0 means EVU n has not selected UAV m for computation offloading; it is assumed that each EVU can select only one UAV for computation offloading;
step 2-2: if EVU n selects UAV m for computation offloading, the signal-to-interference-plus-noise ratio γ_{n,m} of the V2U link between the EVU and the UAV can be expressed as
γ_{n,m} = P[n] · h_{n,m} / (σ² + I_n)   (21)
where P[n] and σ² represent the transmit power of EVU n and the power of the additive white Gaussian noise, respectively; h_{n,m} represents the channel coefficient between EVU n and UAV m; I_n represents the interference suffered by EVU n from other V2U links using the same sub-band, which can be calculated by
I_n = Σ_{n'≠n} P[n'] · h_{n',m}   (22)
where h_{n',m} represents the channel coefficient between EVU n', which uses the same sub-band, and UAV m, and P[n'] uses the same definition as P[n] with n replaced by n';
step 2-3: because the channel between an EVU and a UAV is a free-space line-of-sight (Line of Sight, LOS) channel, the channel coefficient is determined by the path loss and can be expressed as
h_{n,m} = PL(d_{n,m})   (23)
where PL(d_{n,m}) is the path loss determined by the distance d_{n,m}; let the position coordinates of the transmitting end and the receiving end of the V2U link be (x_n, y_n, z_n) and (x_m, y_m, z_m), then the Euclidean distance d_{n,m} between EVU n and UAV m can be expressed as
d_{n,m} = √((x_n − x_m)² + (y_n − y_m)² + (z_n − z_m)²)   (24)
step 2-4: the transmission rate between EVU n and UAV m can be expressed as
R_{n,m} = B · log₂(1 + γ_{n,m})   (25)
where B represents the bandwidth of the V2U link;
step 2-5: the total transmission delay can then be expressed as
T^trans = Σ_n Σ_{m>0} x_{n,m} · T^trans_{n,m},   T^trans_{n,m} = d_n / R_{n,m}   (26)
where T^trans_{n,m} represents the transmission delay after EVU n selects UAV m;
step 2-6: the total computation delay of all EVUs executing their tasks can be expressed as
T^comp = Σ_n Σ_m x_{n,m} · T^comp_{n,m},   T^comp_{n,m} = i_n / f_{n,m}   (27)
where f_{n,m} represents the computing resource allocated to task Φ_n: when m = 0, the task is executed with the local computing resource f_{n,0} of EVU n, and when m > 0, f_{n,m} represents the number of CPU cycles per second assigned to EVU n by the UAV; T^comp_{n,m} represents the computation time required when EVU n selects UAV m to execute the task;
step 2-7: the total time cost of the system can be expressed as
T_total = T^trans + T^comp   (28)
step 2-8: based on the above definitions, the optimization problem of minimizing the total system delay is expressed as
min_{X, C, P} T_total,  subject to constraints C1–C5   (29)
where X, C and P denote the offloading policy, the channel allocation policy and the user transmit-power allocation policy, respectively; P^max represents the maximum transmit power of each EVU, and F_m^max represents the maximum computing resource of UAV m; constraint C1 represents the maximum tolerable delay limit of task Φ_n; constraints C2, C3 and C4 represent the power constraint of each EVU and the constraints on the UAV computing resources; constraint C5 indicates that each EVU can select only one UAV for computation offloading;
step 3: adopting a distributed resource allocation method, a deep reinforcement learning model is constructed according to the optimization problem and the key parameters of the dueling double deep Q network (Dueling Double Deep Q Network, D3QN) are set, comprising the following steps:
step 3-1: each EVU is regarded as an agent; for each agent n, the current state s_t(n) is first obtained from the state space by local observation at each time step t. The state space consists of the EVU's computation-task information Φ_n, the current channel state information H_t(n), the UAV state information F_t, the training episode number e and the random exploration variable ε of the ε-greedy algorithm, namely
s_t(n) = {Φ_n, H_t(n), F_t, e, ε}   (30)
step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q_π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space; the action space of each agent consists of the offloading decision x_t^n, the sub-channel selection c_t^n and the transmit power P_t^n, and is expressed as
a_t(n) = {x_t^n, c_t^n, P_t^n}   (31)
where x_t^n indicates the computation location of the agent: if the agent chooses local computation (x_t^n = 0), it does not enter the training stage; if the EVU selects UAV m for computation offloading, it selects one sub-channel from the sub-channel set C_m; the transmit power P_t^n is limited to 4 levels, namely [23, 10, 5, 0] dBm; the joint action space of the agents is then expressed as
A_t = {a_t(1), a_t(2), ..., a_t(N)}   (32)
step 3-3: based on the action selections of all agents, the environment transitions to a new state S_{t+1}; all agents share a global reward, and the single-step reward function of each agent at time t is defined as
r_t = C − T_total   (33)
where C is a constant used to adjust r_t so as to facilitate training;
step 3-4: in order to find the optimal policy that maximizes the overall return, both current and future rewards must be considered, so the return is defined as the cumulative discounted reward
R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}   (34)
where γ ∈ [0, 1] represents the discount factor; a value of γ close to 1 means that future rewards are emphasized, while a value close to 0 means that the current reward is emphasized;
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate the optimal action-value function Q*(s_t, a_t) = max_π Q_π(s_t, a_t), and then selects the optimal action according to this function; in the D3QN algorithm, a deep neural network with parameter θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q_π(S_t, A_t);
step 3-6: the network structure is then designed; unlike the conventional deep double-Q network (Double Deep Q Network, DDQN), a dueling layer is introduced before the output layer to evaluate the state and the action separately, so that the agent can handle states that have little relation to the actions more effectively; this layer splits the network output into two parts, a state-dependent value function V(S_t) and an advantage function A(S_t, A_t), so that states can be evaluated independently instead of always depending on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A)   (35)
where θ_t^c, θ_t^V and θ_t^A represent the network parameters of the common part, the value-function part and the advantage-function part, respectively, and together constitute the network parameter θ_t; the value function V(·) represents the value of the current state, and the advantage function A(·) represents the value of each action compared with the other actions in the current state;
step 3-8: however, the above formula cannot uniquely determine V(·) and A(·) from Q(·); in practical applications it needs to be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ_t^c, θ_t^A)   (36)
by subtracting the mean of the advantage function, V(·) and A(·) can be uniquely determined once Q(·) is fixed;
step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found in the prediction network, and this action is then used in the target network to obtain the target Q value, which can be expressed as
y_t = r_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)   (37)
where θ_t and θ_t⁻ represent the parameters of the prediction network and the target network, respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed number of cycles; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameter θ_t for taking action A_t in state S_{t+1};
step 4: a prioritized experience replay mechanism is introduced into the D3QN to accelerate training convergence and improve system performance;
The conventional experience replay mechanism draws mini-batches of samples uniformly at random; in fact the values of the samples differ, and some samples can accelerate network convergence. If a priority is set for each sample in advance and samples are drawn according to this priority, training becomes more efficient;
further, the step 4 includes the following specific steps:
step 4-1: the training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) of agent n are stored in the replay memory as samples for subsequent training; a stochastic sampling method is used to interpolate between pure greedy prioritization and uniform random sampling, and the probability that sample i is drawn is defined as
P(i) = p_i^α / Σ_k p_k^α   (38)
where α is an exponent, and α = 0 corresponds to uniform sampling; b denotes the mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample whose priority is 0 from never being revisited, and δ_i is the temporal-difference error (Temporal Difference error, TD-error) of sample i, expressed as
δ_i = y_i − Q(S_i, A_i; θ_t)   (39)
where y_i is the target value given by equation (37);
step 4-2: when updating the network, each agent minimizes the loss function to perform gradient descent; when the sample priority is taken into account, the loss function is defined as
L(θ_t) = (1/b) Σ_{i∈b} w_i · δ_i²   (40)
where w_i = (B · P(i))^{−μ} represents the importance-sampling (IS) weight, B represents the size of the experience replay pool, and μ is an exponent; when μ = 1, w_i fully compensates for the non-uniform probability P(i);
step 5: designing an ID3QN training algorithm and training a DRL model, wherein the training algorithm comprises the following steps:
step 5-1: start the environment simulator, initialize the prediction network parameters θ and the target network parameters θ⁻ of each agent, and initialize parameters such as the target network update frequency; initialize the parameters related to prioritized experience replay, such as the replay pool size B and the exponents α and μ;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy policy, obtains the immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) are stored in the replay memory;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training samples from the experience replay pool according to the sampling probability given in equation (38), computes the IS weights and updates the sample priorities; the loss function is obtained according to equation (40), and the parameters θ of the agent's prediction network are updated through back-propagation of the neural network using a mini-batch gradient descent strategy;
step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ⁻ are updated from the prediction network parameters θ;
Step 5-9: judging whether T is less than T, if T is the total time step in the e round, entering the step (5-4) if t=t+1, otherwise, entering the step (5-10);
step 5-10: judging whether e < I is met, wherein I is the set total training round number, if yes, e=e+1, entering a step (5-3), otherwise, finishing optimization, and obtaining a trained network model;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy, and the method comprises the following specific steps:
step 6-1: using the network model trained by the ID3QN algorithm, input the state information s_t(n) observed by the agent at a given moment;
step 6-2: output the optimal policy a_t*(n) = argmax_a Q(s_t(n), a; θ), and obtain the computation offloading node selected by the EVU and the corresponding channel and power allocation.
In order to verify the effectiveness of the ID3QN method, simulations were carried out in PyCharm. The simulation environment is a 2000 m long and 500 m wide area, in which the emergency rescue vehicles travel on a two-way four-lane road of length 2000 m and road width 14 m; the UAV flying height is 50–120 m and the flying speed is 10 m/s; each UAV has 4 sub-channels with a bandwidth of 4 MHz, a coverage diameter of 500 m and a computing resource of 2 GHz.
Only LOS channels are considered in the simulation, and the path loss is set to 32.4 + 22·log10(d) + 20·log10(f_c), where f_c represents the carrier frequency in GHz and d represents the Euclidean distance between the EVU and the UAV in three-dimensional space; the shadow fading follows a log-normal distribution with a standard deviation of 4 dB; the large-scale fading is updated once per training episode, and the small-scale fading is updated once per training step. The ID3QN in the simulation consists of 1 input layer, 4 hidden layers and 1 output layer; the size of the input layer equals the state-space dimension D_s, and the size of the output layer equals the action-space dimension D_a; the first 3 hidden layers are fully connected layers with 128, 64 and 64 neurons respectively, and the 4th hidden layer is the dueling layer with D_a + 1 neurons. During training, ReLU is used as the activation function, and the parameters are updated with the RMSProp optimizer.
The number of training episodes is set to 1500 with 100 steps per episode, and the target network parameters are updated once every 5 episodes; the experience replay pool size is 16384 and the mini-batch size is 2048; furthermore, the discount factor γ and the learning rate η are set to 0.7 and 0.001 respectively, and the initial and final values of ε are 1 and 0.02 respectively.
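For completeness, a small sketch of the ε-greedy exploration schedule implied by these settings follows; the linear decay from 1 to 0.02 is an assumption, since the patent only states the initial and final values of ε.

```python
import numpy as np

def epsilon(episode, total_episodes=1500, eps_start=1.0, eps_end=0.02):
    """Exploration rate for the epsilon-greedy policy; a linear decay from the stated
    initial value 1 to the final value 0.02 is assumed here."""
    frac = min(episode / total_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, eps, rng=None):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```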
The ID3QN algorithm is compared with several baseline algorithms: 1. the conventional DDQN algorithm; 2. the DDQN algorithm with prioritized experience replay, abbreviated as IDDQN; 3. the D3QN algorithm without prioritized experience replay.
Fig. 3 and Fig. 4 compare the performance of these algorithms under different computation task sizes and different numbers of EVU users, respectively. It can be seen that the average system overhead of the ID3QN algorithm is always the lowest, and that the D3QN algorithm has an obvious performance advantage over the DDQN algorithm; in addition, introducing the prioritized experience replay mechanism further improves the system performance.
What is not described in detail in the present application belongs to the prior art known to those skilled in the art.

Claims (1)

1. An emergency scene-oriented air-ground network distributed offloading decision and resource optimization method based on an improved dueling double deep Q network (Improved Dueling Double Deep Q Network, ID3QN), characterized by comprising the following steps:
step 1: constructing an air-ground integrated Internet of things system model consisting of an unmanned aerial vehicle and emergency rescue vehicle users;
step 2: describing a communication and calculation model of the system, and constructing an optimization problem based on the model with the aim of minimizing the time delay of the system;
step 3: adopting a distributed resource allocation method, constructing a deep reinforcement learning model according to the optimization problem, and setting the key parameters of the dueling double deep Q network (Dueling Double Deep Q Network, D3QN);
step 4: introducing a prioritized experience replay mechanism into the D3QN to accelerate training convergence and improve system performance;
step 5: designing an ID3QN training algorithm and training a DRL model;
step 6: in the execution stage, the trained ID3QN model is utilized to obtain the optimal user transmitting power and channel allocation strategy;
further, the step 3 includes the following specific steps:
step 3-1: emergency vehicle users (Emergency Vehicle Users, EVUs) are regarded as agents; for each agent n, the current state s_t(n) is first obtained from the state space by local observation at each time step t. The state space consists of the EVU's computation-task information Φ_n, the current channel state information H_t(n), the UAV state information F_t, the training episode number e and the random exploration variable ε of the ε-greedy algorithm, namely
s_t(n) = {Φ_n, H_t(n), F_t, e, ε}   (1)
step 3-2: thereafter, each agent obtains a policy π through the state-action value function Q_π(s_t(n), a_t(n)) and selects an action a_t(n) from the action space; the action space of each agent consists of the offloading decision x_t^n, the sub-channel selection c_t^n and the transmit power P_t^n, and is expressed as
a_t(n) = {x_t^n, c_t^n, P_t^n}   (2)
where x_t^n indicates the computation location of the agent: if the agent chooses local computation (x_t^n = 0), it does not enter the training stage; if the EVU selects UAV m for computation offloading, it selects one sub-channel from the sub-channel set C_m; the transmit power P_t^n is limited to 4 levels, namely [23, 10, 5, 0] dBm; the joint action space of the agents is then expressed as
A_t = {a_t(1), a_t(2), ..., a_t(N)}   (3)
step 3-3: based on the action selections of all agents, the environment transitions to a new state S_{t+1}; all agents share a global reward, and the single-step reward function of each agent at time t is defined as
r_t = C − T_total   (4)
where C is a constant used to adjust r_t so as to facilitate training, and T_total represents the total system delay;
step 3-4: in order to find the optimal policy that maximizes the overall return, both current and future rewards must be considered, so the return is defined as the cumulative discounted reward
R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}   (5)
where γ ∈ [0, 1] represents the discount factor; a value of γ close to 1 means that future rewards are emphasized, while a value close to 0 means that the current reward is emphasized;
step 3-5: value-based deep reinforcement learning uses the nonlinear approximation capability of neural networks to approximate the optimal action-value function Q*(s_t, a_t) = max_π Q_π(s_t, a_t), and then selects the optimal action according to this function; in the D3QN algorithm, a deep neural network with parameter θ_t is used to better estimate the optimal action-value function, i.e. Q*(S_t, A_t; θ_t) ≈ max_π Q_π(S_t, A_t);
step 3-6: the network structure is then designed; unlike the conventional deep double-Q network, a dueling layer is introduced before the output layer to evaluate the state and the action separately, so that the agent can handle states that have little relation to the actions more effectively; this layer splits the network output into two parts, a state-dependent value function V(S_t) and an advantage function A(S_t, A_t), so that states can be evaluated independently instead of always depending on actions;
step 3-7: based on this network structure, the Q-value function can be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A)   (6)
where θ_t^c, θ_t^V and θ_t^A represent the network parameters of the common part, the value-function part and the advantage-function part, respectively, and together constitute the network parameter θ_t; the value function V(·) represents the value of the current state, and the advantage function A(·) represents the value of each action compared with the other actions in the current state;
step 3-8: however, the above formula cannot uniquely determine V(·) and A(·) from Q(·); in practical applications it needs to be rewritten as
Q(S_t, A_t; θ_t) = V(S_t; θ_t^c, θ_t^V) + A(S_t, A_t; θ_t^c, θ_t^A) − (1/|A|) Σ_{a'} A(S_t, a'; θ_t^c, θ_t^A)   (7)
by subtracting the mean of the advantage function, V(·) and A(·) can be uniquely determined once Q(·) is fixed;
step 3-9: during training, the D3QN uses two networks, a prediction network and a target network, to alleviate the Q-value overestimation problem: the action that maximizes the Q value is first found in the prediction network, and this action is then used in the target network to obtain the target Q value, which can be expressed as
y_t = r_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)   (8)
where θ_t and θ_t⁻ represent the parameters of the prediction network and the target network, respectively; the two networks have the same structure, the prediction network parameters are updated continuously, and the target network parameters are updated once every fixed number of cycles; Q(S_{t+1}, A_t; θ_t) denotes the value obtained by the network with parameter θ_t for taking action A_t in state S_{t+1};
further, the step 5 includes the following specific steps:
step 5-1: start the environment simulator, initialize the prediction network parameters θ and the target network parameters θ⁻ of each agent, and initialize parameters such as the target network update frequency; initialize the parameters related to prioritized experience replay, such as the replay pool size B and the exponents α and μ;
step 5-2: initializing a training round number e;
step 5-3: initializing a time step t in the e round;
step 5-4: updating position, path loss and large-scale fading parameters, and setting UAV parameters;
step 5-5: each agent n observes the current state s_t(n), selects an action a_t(n) according to the ε-greedy policy, obtains the immediate reward r_t, and transitions to the next state s_{t+1}(n); the resulting training data (s_t(n), a_t(n), r_t, s_{t+1}(n)) are stored in the replay memory;
step 5-6: updating the small-scale fading parameters;
step 5-7: each agent draws training samples from the experience replay pool according to the sampling probability
P(i) = p_i^α / Σ_k p_k^α   (9)
where α is an exponent, and α = 0 corresponds to uniform sampling; b denotes the mini-batch; p_i = |δ_i| + β represents the priority of sample i, where β is a small positive number that prevents a sample whose priority is 0 from never being revisited, and δ_i is the temporal-difference error (Temporal Difference error, TD-error) of sample i, expressed as
δ_i = y_i − Q(S_i, A_i; θ_t)   (10)
where y_i is the target value given by equation (8); the IS weights w_i = (B · P(i))^{−μ} are then calculated and the sample priorities are updated, where B represents the size of the experience replay pool and μ is an exponent, and when μ = 1, w_i fully compensates for the non-uniform probability P(i); the loss function
L(θ_t) = (1/b) Σ_{i∈b} w_i · δ_i²   (11)
is obtained, and the parameters θ of the agent's prediction network are updated through back-propagation of the neural network using a mini-batch gradient descent strategy;
step 5-8: when the number of training steps reaches the target network update interval, the target network parameters θ⁻ are updated from the prediction network parameters θ;
step 5-9: judge whether t < T, where T is the total number of time steps in episode e; if so, set t = t + 1 and go to step 5-4; otherwise go to step 5-10;
step 5-10: judge whether e < I, where I is the set total number of training episodes; if so, set e = e + 1 and go to step 5-3; otherwise the optimization ends and the trained network model is obtained.
CN202310861810.5A 2023-07-13 2023-07-13 Emergency scene-oriented air-ground network distributed resource scheduling method Pending CN116963034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310861810.5A CN116963034A (en) 2023-07-13 2023-07-13 Emergency scene-oriented air-ground network distributed resource scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310861810.5A CN116963034A (en) 2023-07-13 2023-07-13 Emergency scene-oriented air-ground network distributed resource scheduling method

Publications (1)

Publication Number Publication Date
CN116963034A true CN116963034A (en) 2023-10-27

Family

ID=88443824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310861810.5A Pending CN116963034A (en) 2023-07-13 2023-07-13 Emergency scene-oriented air-ground network distributed resource scheduling method

Country Status (1)

Country Link
CN (1) CN116963034A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117176213A (en) * 2023-11-03 2023-12-05 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * 2023-11-03 2024-01-30 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network

Similar Documents

Publication Publication Date Title
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN114422056B (en) Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
Li et al. Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning
CN113543074A (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN109905860A (en) A kind of server recruitment and task unloading prioritization scheme based on the calculating of vehicle mist
CN114567888B (en) Multi-unmanned aerial vehicle dynamic deployment method
CN116963034A (en) Emergency scene-oriented air-ground network distributed resource scheduling method
CN113115344B (en) Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
CN114169234A (en) Scheduling optimization method and system for unmanned aerial vehicle-assisted mobile edge calculation
CN116600316A (en) Air-ground integrated Internet of things joint resource allocation method based on deep double Q networks and federal learning
Zhang et al. New computing tasks offloading method for MEC based on prospect theory framework
CN117098189A (en) Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning
Nasr-Azadani et al. Single-and multiagent actor–critic for initial UAV’s deployment and 3-D trajectory design
CN116321298A (en) Multi-objective joint optimization task unloading strategy based on deep reinforcement learning in Internet of vehicles
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN115134242A (en) Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy
CN116684925B (en) Unmanned aerial vehicle-mounted intelligent reflecting surface safe movement edge calculation method
CN117221951A (en) Task unloading method based on deep reinforcement learning in vehicle-mounted edge environment
CN114051252A (en) Multi-user intelligent transmitting power control method in wireless access network
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN116009590B (en) Unmanned aerial vehicle network distributed track planning method, system, equipment and medium
CN116367231A (en) Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm
Yang et al. Deep reinforcement learning in NOMA-assisted UAV networks for path selection and resource offloading
CN116582836B (en) Task unloading and resource allocation method, device, medium and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination