CN117880858B

CN117880858B - Multi-unmanned aerial vehicle track optimization and power control method based on communication learning

Info

Publication number: CN117880858B
Application number: CN202410275005.9A
Authority: CN
Inventors: 毕远国; 袁梓梦; 刘羽霏; 刘雨衡; 郑彤; 樊彦伯
Original assignee: 东北大学
Priority date: 2024-03-12
Filing date: 2024-03-12
Publication date: 2024-05-10
Anticipated expiration: 2044-03-12
Also published as: CN117880858A

Abstract

The invention belongs to the technical field of intelligent decision and control of robots, and discloses a multi-unmanned aerial vehicle track optimization and power control method based on communication learning, which can maximize the number of users meeting service quality under the condition of lower energy consumption. The invention designs a communication mechanism, and adds memory storage equipment to help UAV-ABS acquire and utilize experiences of other unmanned aerial vehicles and store self-learned experiences. In order to construct a more effective unmanned aerial vehicle cooperation strategy and reduce the cost of network model deployment, the invention designs a concentrated attention commentator neural network, which can reduce redundant information and solve the problem of dimension disasters caused by the increase of the number of UAV-ABS and ground users.

Description

Multi-unmanned aerial vehicle track optimization and power control method based on communication learning

Technical Field

The invention relates to the technical field of intelligent decision and control of robots, in particular to a multi-unmanned aerial vehicle track optimization and power control method based on communication learning.

Background

The use of Unmanned aerial vehicles as air base stations (UAV-ABS) has attracted considerable academic and industrial attention. Compared with a ground fixed wireless base station, the UAV-ABS has the following advantages: firstly, the three-dimensional movement of the wireless communication device can bring a higher visual angle, so that a sight distance wireless transmission link with higher possibility is provided for ground users, and the communication quality is improved. Second, UAV-ABS is an ideal solution for post-disaster areas such as floods, earthquakes, or hot spots of short-lived communication needs such as concert, performance sites. Finally, unmanned networking is more flexible and the cost of the required links is lower. In addition, a network using a drone as an air base station is becoming an indispensable component in the next generation mobile communication network system.

The deployment of UAV-ABS still faces many challenges, including limitations in communication range, bandwidth, and energy consumption. UAV-ABS needs to move in almost every slot to provide high quality wireless service near ground users. However, excessive unnecessary movement may cause an increase in energy consumption, thereby degrading communication quality. Furthermore, we need to control the power allocation of the drone to achieve a tradeoff between communication quality and interference management. Thus, we have urgent need for a carefully designed strategy to help multiple UAV-ABS adaptively power distribution and optimize flight trajectories.

Traditional mathematical methods convert the non-convex problem into a convex problem for solution, however, this method sacrifices accuracy and cannot handle the mobility of the ground user. The existing method mainly utilizes a multi-agent reinforcement learning algorithm, such as a multi-agent depth deterministic strategy gradient algorithm, to solve, but cannot realize direct communication and information sharing between UAV-ABS. Such limitations may lead to information asymmetry and hinder cooperation between the UAV-ABS. In addition, as the number of UAV-ABS increases, the likelihood that the critics network will be affected by irrelevant information also increases, creating a problem of dimension disasters.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a multi-unmanned aerial vehicle track optimization and power control method based on communication learning.

The technical scheme of the invention is as follows: a multi-unmanned aerial vehicle track optimization and power control method based on communication learning comprises the following steps:

1) Building a multi-unmanned aerial vehicle auxiliary wireless communication system, wherein the multi-unmanned aerial vehicle auxiliary wireless communication system comprises an unmanned aerial vehicle, a ground user and a training center; defining a motion model, a communication model and an energy consumption model, and constructing a joint optimization objective function;

2) Converting the joint optimization objective function in the step 1) into Markov games, determining observation, actions and rewards, and designing a multi-agent reinforcement learning algorithm for solving the problem of the joint optimization objective function; the multi-agent reinforcement learning algorithm comprises a communication actor neural network NN1, a target actor neural network NN2, a concentrated attention commentary neural network NN3 and a target commentary family neural network NN4;

3) Initializing the position of the unmanned aerial vehicle and the position of a ground user; initializing neural network parameters in an empirical buffer and multi-agent reinforcement learning algorithm, including communication actor neural network parameters Target actor neural network parameters/>Focusing on commentator neural network parameters/>And target comment home neural network parameters/>；

4) Training begins and memory storage device is initialized; For each drone, its communication actor neural network NN1 is derived from memory storage device/>Reading communication information/>And combine local observation information/>Pair/>Coding, learning and updating; update information/>Is stored to a memory storage device/>In (a) and (b);

The update information The acquisition steps are as follows, firstly, the local observation information is encodedWherein/>Representing a fully connected neural network;

After encoding the local observation information, the communication actor neural network NN1 in the unmanned aerial vehicle reads experience from the memory storage device and learns the experience by combining the experience with the previous encoded information, and the formula is expressed as ; Wherein the method comprises the steps ofGating cell/>By concatenating information vectors/>Linear mapping acquisition, wherein/>Representing an activation function,/>Is extracted by gating unit only/>Context vector of spatiotemporal information, formulated as/>，/>Is a linear mapped learnable vector; /(I)Learning information of the unmanned aerial vehicle;

the unmanned aerial vehicle selectively updates and stores the learned information in the memory storage device In (a) and (b); two gating units/>And/>Wherein/>And/>Is a parameter that can be learned; candidate update information is denoted/>Wherein/>As a learnable parameter, the last updated information is denoted/>；

5) Communication actor neural network based on local observation informationCommunication information/>And memory storage device/>Update information stored therein/>Making an action, the action comprising unmanned aerial vehicle/>Shaft speed/>Y-axis speed/>And transmit power/>；

6) When all unmanned aerial vehicles giveShaft speed/>Y-axis speed/>And transmit power/>After that, the training center receives environmental feedback including rewards and the status of the next time slot; the training center packages all unmanned plane states, actions, communication information and rewards of the next time slot into experiences, and stores the experiences into an experience buffer area of the training center;

7) When the experience stored in the experience buffer exceeds a certain amount, the training center extracts a batch of experience to update the neural network parameters; the attentive critic network NN3 calculates the current action value using a batch of experience And updating the parameter/>, by minimizing the joint regression function; Updating parameters/>, by means of strategy gradient descent, of communication actor neural network NN1; Soft updating the target actor neural network NN2 and the target critic network NN4; soft update target comment home neural network NN4 parameter，/>For super parameters, soft updating the target actor neural network NN2 parameters,；

8) The unmanned aerial vehicle outputs a track decision result and a power control decision result by utilizing a self-deployed trained communication actor neural network NN 1.

The multi-drone assisted wireless communication system in a multi-drone wireless communication scenario,The individual unmanned aerial vehicle is deployed as an air base station, as/>Providing communication service by individual ground users; /(I)Defined as a collection of unmanned aerial vehicles,Defined as a ground user set; multi-unmanned aerial vehicle auxiliary wireless communication system with equal length/>Run on consecutive time slots, where/>Defined as a set of time slots;

defining a motion model of the unmanned aerial vehicle; the three-dimensional coordinates of the ground user are defined as The three-dimensional coordinates of the unmanned aerial vehicle are/>Wherein/>The flying height of the unmanned aerial vehicle; unmanned aerial vehicle is atThe coordinates of the time slots are expressed as,

（1）

Wherein the method comprises the steps ofIndicating that unmanned aerial vehicle is in time slot/>Running speed of/>Is an environmental impact factor,/>Representing the duration of a time slot,/>Expressed as maximum speed that the unmanned aerial vehicle can reach;

Defining an unmanned aerial vehicle communication model; the unmanned aerial vehicle communication model simultaneously considers the sight distance LoS and the non-sight distance NLoS; the probability of the line of sight LoS between the drone n and the ground user m is expressed as:

（2）

The probability of NLoS is that ; Where a, b are constants that depend on the environment,Is the elevation angle between the drone n and the ground user m,

（3）

Wherein,Representing the horizontal distance between the unmanned plane and the user; the channel gains of the drone and the ground user are expressed as:

（4）

Wherein the method comprises the steps of Is the path loss index,/>Is the path loss of LoS,/>Is the path loss of NLoS; definition of the definitionFor white gaussian noise, the signal to interference plus noise ratio received by ground users and unmanned aerial vehicles is defined as:

（5）

Wherein the method comprises the steps of Representing the transmitting power of the current unmanned aerial vehicle; /(I)The transmitting power is expressed as the transmitting power of other unmanned aerial vehicles; in the case of applying the frequency division multiple access wireless communication technology, the data transmission rate of the terrestrial users is defined as follows using shannon's capacity theorem:

（6）

Wherein the method comprises the steps of Representing the number of ground users within the range of the current unmanned plane;

Defining an energy consumption model; the energy consumption of the unmanned aerial vehicle comprises communication energy consumption of data transmission and flight energy consumption of the unmanned aerial vehicle in the moving process; to simplify the analysis, the energy consumption of the unmanned aerial vehicle during take-off, landing and hover is eliminated; definition of the definition For blade power,/>For blade speed, unmanned aerial vehicle's flight energy consumption represents:

（7）

the communication energy consumption of data transmission is expressed as ，/>Representing rated emission power of the unmanned aerial vehicle; the energy consumption model is expressed as:

（8）

Defining a joint optimization objective function; the goal of the multi-unmanned aerial vehicle auxiliary wireless communication system is to control the operation speed and the emission power of unmanned aerial vehicles in T time slots, and simultaneously maximize the number of ground users meeting the service quality requirement and minimize the energy consumption, and the joint optimization objective function is expressed as:

（9）

Is limited by ；

Is limited by；

Wherein the method comprises the steps ofThe maximum energy consumption of the unmanned aerial vehicle is achieved; /(I)Is an index function for measuring the service quality of the ground user ifThe formula value is 1, otherwise is/>Is the minimum transmission rate/>, which the surface user needs to meetA weight factor representing the relative importance of adjusting energy consumption and ground user quality of service in a joint optimization objective function.

The markov game is defined as that the unmanned aerial vehicle obtains real-time rewards by selecting actions, observing the current state and obtaining real-time rewards at each time slot, the rewards being used for interacting with the environment; a common goal of all drones is to maximize long-term accumulated rewards by selecting the best sequence of actions;

Defining observation information, actions, and rewards; the observation information of the unmanned aerial vehicle comprises the position information of all unmanned aerial vehicles and the position information of users in the communication range of the unmanned aerial vehicle;

（10）

The unmanned aerial vehicle comprises unmanned aerial vehicle Shaft speed/>Y-axis speed/>And transmit power/>It is expressed as:

（11）

all unmanned aerial vehicles aim at maximizing the number of ground users meeting the quality of service while minimizing the energy consumption of the multi-unmanned aerial vehicle auxiliary wireless communication system; meeting various constraints while executing tasks, when not adhering to the constraints, being penalized ; The prize is therefore formulated as:

（12）。

the communication actor neural network in the step 5) combines the self-observation information Learning information/>Updating informationGive action, its formula is expressed as/>，/>Neural network NN1 for communicating actors; the communication actor neural network NN1 comprises a characteristic network, an action network and a series of linear transformations; the characteristic network is a three-layer full-connection neural network, and the action network is a two-layer full-connection neural network; in the forward propagation process, self-observed informationGenerating update information/>, through feature network, linear transformation, sigmoid function and write operation; The action network combines the unmanned aerial vehicle with the self-observation information/>Learning information/>And update information/>And splicing, and calculating the final action output through the two full-connection layers and the activation function.

The training center in step 6) obtains rewardsNext state/>; Defining a set of statesAnd communication experience set/>; Experience/>Stored in an experience buffer, and the action value/> is calculated by using the concentrated attention commentator neural network NN3A value; the focused critic neural network NN3 architecture is as follows: the input information of the concentrated attention commentator neural network NN3 is the states and actions of all unmanned aerial vehicles; encoding input information by using a linear layer formed by a plurality of fully connected networks to obtain encoded information/>; Secondly, the multi-head attention layer is utilized to screen information, and the information is encoded/>Mapped to three trainable and shared weight matrices: query/>Bond/>Sum/>The formula is:

（13）

（14）

（15）

Wherein the method comprises the steps of 、/>And/>Is the mapped result; the attention weight of a single attention head is then calculated by three weight matrices, where/>Representing a key vector, the formula is:

（16）

the final multi-headed attention layer output is a weighted sum of all attention weights ; Next/>And/>Inputting the information into an addition and normalization layer, and performing addition and normalization operation on the information to ensure the validity and consistency of the information; normalized information/>Linear and nonlinear transformation by feed-forward layer to yield output/>The activation function of the feed-forward layer is Relu functions; finally will/>And/>The predicted/>, is finally obtained as input through addition and normalization processingValues.

Updating parameters by means of strategy gradient descent in the step 7)：

（17）

Wherein the method comprises the steps ofCommunication actor neural network NN1 parameters representing nth agent,/>Operation for describing gradients in three-dimensional space,/>Representing the predicted Q value of the concentrated attention commentator neural network NN3,/>Rewards representing cumulative policy,/>Representing a desire to fix a value function of an action of the other agent; /(I)To balance maximum entropy and rewards;

Updating the concentrated attention commentator neural network NN3 parameters, and updating the formula to be expressed as:

（18）

Wherein the method comprises the steps of Representing target/>Value, final goal is to minimize the difference between predicted value and target value/>Representing the current prize value,/>Representing the predicted Q value of the target critic neural network NN 4; /(I)Representing target actor neural network NN2 parameters; /(I)To balance maximum entropy and rewards.

The invention has the beneficial effects that: aiming at the defects of the prior art, the invention provides a multi-unmanned aerial vehicle track optimization and power control method based on communication learning. To help the actor's network obtain more a priori information before performing the action, we devised a communication mechanism that adds a memory storage device to help the UAV-ABS obtain and utilize the experience of other drones and store the experience learned by itself. In order to build more efficient unmanned aerial vehicle cooperation strategies and reduce the cost of network model deployment, we designed a focused review home network that can reduce redundant information and solve the dimension disaster problem that occurs with increasing numbers of UAV-ABS and ground users.

Drawings

FIG. 1 is a flow chart of a method for multi-unmanned aerial vehicle trajectory optimization and power control based on communication learning;

FIG. 2 is a multi-unmanned auxiliary wireless communication system model;

FIG. 3 is a framework of a communication actor's concentrated attention critics algorithm;

FIG. 4 is a block diagram of a focused commentator neural network;

FIG. 5 is a training prize diagram for different algorithms;

FIG. 6 is a graph of energy consumption versus various algorithms.

Fig. 7 is a graph of the number of users meeting the quality of service for different algorithms.

Detailed Description

The present invention will be described in further detail with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the present invention more apparent.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a multi-unmanned aerial vehicle track optimization and power control method based on communication learning comprises the following steps:

Step1: and constructing a multi-unmanned aerial vehicle auxiliary wireless communication system, defining a motion model, a communication model and an energy consumption model, and constructing a joint optimization objective function.

Step 2: converting the joint optimization objective function in the step 1 into Markov games, determining observation, actions and rewards, and designing a multi-agent reinforcement learning algorithm to solve the problem of the joint optimization objective function. The multi-agent reinforcement learning algorithm comprises a plurality of types of neural networks: communication actor neural network NN1, target actor neural network NN2, concentration reviewer neural network NN3, and target review home neural network NN4.

Step 3: initializing the position of the unmanned aerial vehicle and the position of a ground user; initializing neural network parameters in an empirical buffer and multi-agent reinforcement learning algorithm, including communication actor neural network parametersTarget actor neural network parameters/>Focusing on commentator neural network parameters/>And target comment home neural network parameters/>；

Step 4: the invention designs a communication mechanism to help the unmanned aerial vehicle acquire, learn and update the communication experience of other unmanned aerial vehicles, and then stores the updated communication experience. Training begins and memory storage device is initialized; For each drone, its communication actor neural network NN1 is derived from memory storage device/>Reading communication information/>And incorporate local observation informationPair/>Coding, learning and updating; update information/>Is stored to a memory storage device/>Is a kind of medium.

Step 5: communication actor neural network based on local observation informationCommunication information/>And memory storage device/>Update information stored therein/>Making an action, the action comprising unmanned aerial vehicle/>Shaft speed/>Y-axis speed/>And transmit power/>；

Step 6: when all unmanned aerial vehicles giveShaft speed/>Y-axis speed/>And transmit power/>After that, the training center receives environmental feedback including rewards and the status of the next time slot; the training center packages all unmanned plane states, actions, communication information and rewards of the next time slot into experiences, and stores the experiences into an experience buffer area of the training center;

Step 7: when the experience stored in the experience buffer exceeds a certain amount, the training center extracts a batch of experience to update the neural network parameters; the attentive critic network NN3 calculates the current action value using a batch of experience And updating the parameter/>, by minimizing the joint regression function; Updating parameters of communication actor neural network NN1 in a strategy gradient descending manner; Soft updating the target actor neural network NN2 and the target critic network NN4;

step 8: the unmanned aerial vehicle outputs track decision and power control decision results by using a trained communication actor neural network NN1 deployed by the unmanned aerial vehicle.

The flow chart of the technical scheme is shown in figure 1.

The specific steps of the step 1 include:

step 1.1: in a multi-drone wireless communication scenario, The individual unmanned aerial vehicle is deployed as an air base station feed/>Individual ground subscribers provide communication services. /(I)Defined as unmanned aerial vehicle set,/>Defining a ground user set; multi-unmanned aerial vehicle auxiliary wireless communication system with equal length/>Run on consecutive time slots, where/>Defined as a set of time slots;

a multi-drone assisted wireless communication system model is shown in fig. 2.

Step 1.2: a motion model of the drone is defined. The three-dimensional coordinates of the user are defined asThe three-dimensional coordinates of the unmanned aerial vehicle are/>Wherein/>Is the flying height of the unmanned aerial vehicle. Unmanned aerial vehicle is/>The coordinates of the time slots are expressed as,

Wherein the method comprises the steps ofIndicating that unmanned aerial vehicle is in time slot/>Running speed of/>Is an environmental impact factor,/>Representing the duration of a time slot,/>Expressed as the maximum speed that the unmanned aerial vehicle can reach.

Step 1.3: a drone communication model is defined. The unmanned aerial vehicle communication model considers both line of sight LoS and non-line of sight NLoS. The probability of the line of sight LoS between the drone n and the ground user m is expressed as:

The probability of NLoS is that . Where a, b are constants dependent on the environment,/>Is the elevation angle between the two,

Wherein,Representing the horizontal distance of the drone from the user. The channel gains of the drone and the ground user are expressed as:

Wherein the method comprises the steps of Is the path loss index,/>Is the path loss of LoS,/>Is the path loss of NLoS. Definition of the definitionFor white gaussian noise, the signal to interference plus noise ratio received by ground users and unmanned aerial vehicles is defined as:

Wherein the method comprises the steps of Representing the current unmanned aerial vehicle transmitting power,/>Represented as the transmit power of the other drones. In the case of applying the frequency division multiple access wireless communication technology, the data transmission rate of the terrestrial users is defined as follows using shannon's capacity theorem:

Wherein the method comprises the steps of Representing the number of ground users within range of the current drone.

Step 1.4: an energy consumption model is defined. The energy consumption of the unmanned aerial vehicle comprises communication energy consumption of data transmission and flight energy consumption of the unmanned aerial vehicle in the moving process. To simplify the analysis, the energy consumption of the unmanned aerial vehicle during take-off, landing and hover is excluded. Definition of the definitionFor blade power,/>For blade speed, unmanned aerial vehicle's flight energy consumption represents:

the communication energy consumption of data transmission is expressed as ，/>Indicating the rated transmit power of the drone. The energy consumption model is expressed as:

Step 1.5: a joint optimization objective function is defined. Multi-unmanned aerial vehicle assisted wireless communication system targeting through control The operation speed and the emission power of the unmanned aerial vehicle in each time slot are maximized, the number of ground users meeting the service quality requirement is maximized, and the energy consumption is minimized, and the joint optimization objective function is expressed as:

Is limited by

Wherein the method comprises the steps ofIs the maximum energy consumption of the unmanned aerial vehicle. /(I)Is an index function for measuring the service quality of the ground user ifThe formula value is 1, otherwise 0. /(I)Is the minimum transmission rate that the surface user needs to meet. /(I)A weight factor representing the relative importance of adjusting energy consumption and ground user quality of service in a joint optimization objective function.

The specific steps of the step 2 include:

Step 2.1: to establish a clear problem setting for the algorithm, markov games are defined before the deep reinforcement learning approach is used. Markov games are defined as unmanned aerial vehicles that interact with the environment by selecting actions, observing the current state, and obtaining real-time rewards at each time slot. A common goal of all drones is to maximize long-term accumulated rewards by selecting the best sequence of actions.

Step 2.2: observation information, actions, and rewards are defined. The observation information of the unmanned aerial vehicle comprises the position information of all unmanned aerial vehicles and the position information of users in the communication range, and the observation information is expressed as:

The unmanned aerial vehicle comprises unmanned aerial vehicle Shaft speed/>、/>Shaft speed/>And transmit power/>It is expressed as:

all unmanned aerial vehicles aim to maximize the number of ground users meeting quality of service while minimizing the energy consumption of the multi-unmanned aerial vehicle assisted wireless communication system. Meeting various constraints while executing tasks, when not adhering to the constraints, being penalized . The prize is therefore formulated as:

（12）

Step 2.3: the communication actor concentrating on the reviewer algorithm CACAC is designed to solve the joint optimization problem, and the algorithm framework diagram is shown in fig. 3. Algorithms include a variety of neural networks: communication actor neural network NN1, target actor neural network NN2, concentration reviewer neural network NN3, and target review home neural network NN4.

Initializing a user, a unmanned aerial vehicle position, a neural network parameter and an experience buffer. The neural network parameters include communication actor neural network parametersTarget actor neural network parameters/>Focusing on commentator neural network parameters/>And target comment home neural network parameters/>。

Training begins and memory storage device is initialized。

For each unmanned aerial vehicle, its communication actor neural network NN1 is from memory storage deviceReading communication messagesAnd combine local observation information/>Pair/>Coding, learning and updating are carried out. Then update information/>Is stored to a memory storage device/>Is a kind of medium. Communication actor neural network NN1 based on local observations/>Communication message/>And update information/>An action is made. After all unmanned aerial vehicles give actions, the training center will receive environmental feedback including rewards and status of the next time slot. The training center packages all the unmanned plane states, the states of the next time slot, actions, communication information and rewards into experiences and stores the experiences into an experience buffer. When the experience stored in the experience buffer exceeds a certain amount, the training center may extract a batch of experience to update the neural network parameters. The concentrated commentator network NN3 calculates the current action value/>, using a batch of experienceAnd updating the parameter/>, by minimizing the joint regression function. Updating parameters/>, by means of strategy gradient descent, of communication actor neural network NN1。

The communication mechanism comprises three steps of acquisition, learning and updating. For each drone, they acquire local observationsCommunication message/>. The communication message is encoded, read and updated. First encoding local observations/>Wherein/>Representing a fully connected neural network.

After encoding the local observations, the communication actor neural network NN1 in the unmanned aerial vehicle reads experience from the communication storage device and learns by combining the previous encoded information, expressed as a formula; Wherein the method comprises the steps ofGating cell/>By concatenating information vectors/>Linear mapping acquisition, wherein/>Representing an activation function,/>Is used by the gating unit only to extract/>Context vector of spatio-temporal information, formulated as，/>Is a linear mapped leachable vector,/>Is learning information of the unmanned aerial vehicle.

Finally, the unmanned aerial vehicle selectively updates the learned information and stores the information in the memory storage deviceIs a kind of medium. Two gating units/>And/>Wherein/>And/>Is a parameter that can be learned. Candidate update information is denoted/>The last updated information is represented as。

The step 5 comprises the following steps:

Step 5.1: the unmanned plane gives actions by combining self observation, learning information and updating information, and the formula is expressed as Wherein/>For communication of actor neural networks NN1.

Step 5.2: the communication actor neural network NN1 is composed of a feature network, an action network, and a series of linear transformations. The characteristic network is a three-layer fully connected neural network, and the action network is a two-layer fully connected neural network. In the forward propagation process, self-observed informationObtaining feature vector/>, through feature network coding. Then by a read operation, calculate/>, using a linear transformation and a sigmoid functionFor controlling the reading of the message. Then by a write operation, a linear transformation and activation function is used to calculate/>、/>And/>For generating updated messages/>. Finally, the action network combines the unmanned aerial vehicle with the self-observation information/>Learning information/>And update information/>And splicing, and calculating the final action output through the two full-connection layers and the activation function. /(I)

The specific steps of the step 6 include:

Step 6.1: obtaining rewards Next state/>。

Step 6.2: defining a set of statesAnd communication experience set/>And will experienceAnd storing the action values into an experience buffer area, and calculating the action values by using the concentrated attention commentator neural network NN 3.

Step 6.3: as shown in fig. 4, the concentrated commentator neural network NN3 has the following architecture: the input information of the model is the states and actions of all unmanned aerial vehicles; firstly, we encode input information by using a linear layer formed by a plurality of fully connected networks to obtain encoded information; Secondly, the multi-head attention layer is utilized to screen information, and the information is encoded/>Mapped to three trainable and shared weight matrices: query/>Bond/>Sum/>The formula is:

Wherein the method comprises the steps of 、/>And/>Is the result after mapping. The attention weight of a single attention head is then calculated by three weight matrices, formulated as: /(I)

The final multi-headed attention layer output is a weighted sum of all attention weights. Next/>And/>Input into the addition and normalization layer, and perform addition and normalization operation on the information to ensure the validity and consistency of the information. Normalized information/>Linear and nonlinear transformation by feed-forward layer to yield output/>The activation function of the feed-forward layer is Relu functions. Finally will/>And/>The predicted/>, is finally obtained as input through addition and normalization processingValues.

The specific steps of the step 7 include:

step 7.1: updating actor neural network NN1 parameters using strategy gradient descent:

Wherein the method comprises the steps of Represents the/>Communication actor network parameters of individual agents,/>Rewards representing cumulative policy,/>Is a desire to fix the value function of an action under other agent actions.

Step 7.2: updating the concentrated attention commentator neural network NN3 parameters, and updating the formula to be expressed as:

Wherein the method comprises the steps of Representing/>, of concentrated-attention commentator neural network NN3 predictionsValue/>Representing target/>Value, final goal is to minimize the difference between predicted value and target value/>。/>Representing the current prize value,/>To balance maximum entropy and rewards.

Step 7.3: soft update target comment home neural network NN4 parameterTarget actor neural network NN2 parameters/>。

The present invention implements the proposed method using python3.7 as a programming language and performs subsequent training and verification. The training times of the algorithm are 30000 times, the training rounds are 25 steps each time, the training set size extracted by the experience buffer area is 256, and the learning rate of the communication actor neural network NN1 and the concentrated attention critter neural network is 0.001. In order to evaluate the effectiveness of the method, the invention is compared with the current advanced MADDPG, commnet reinforcement learning algorithm. The CACAC algorithm provided by the invention performs comparison analysis from three indexes of rewarding, the number of users meeting the service quality and energy consumption. Fig. 5 shows the change of rewards during training of three algorithms, wherein the solid line is the method proposed by the present invention. Fig. 6 and 7 show the number of users and the energy consumption of 3 algorithms, respectively, that meet the quality of service during the execution phase, wherein the diamonds are the method proposed by the present invention. It can be found from the figure that the method provided by the invention can converge faster and reach higher rewarding value in the training process, and the algorithm can enable more ground users to meet the service quality under lower energy consumption in the execution stage, i.e. the method provided by the invention has better performance.

The present invention is not limited to the above-described embodiments, and any modifications, variations, substitutions, etc. which are within the spirit and principles of the present invention should be included in the scope of the present invention. What is not described in detail in the present specification belongs to the prior art known to those skilled in the art.

Claims

1. The multi-unmanned aerial vehicle track optimization and power control method based on communication learning is characterized by comprising the following steps of:

The update information The acquisition steps are as follows, firstly encoding/>, local observation informationWherein/>Representing a fully connected neural network;

the unmanned aerial vehicle selectively updates and stores the learned information in the memory storage device In (a) and (b); two gating units/>And/>Wherein/>And/>Is a parameter that can be learned; candidate update information is denoted/>Wherein/>Is a parameter that can be learned; the last updated information is denoted/>；

5) Communication actor neural network based on local observation informationCommunication information/>And memory storage device/>Update information stored therein/>Making an action, the action comprising unmanned aerial vehicle/>Shaft speed/>Y-axis speed/>And transmit power；

7) When the experience stored in the experience buffer exceeds a certain amount, the training center extracts a batch of experience to update the neural network parameters; the attentive critic network NN3 calculates the current action value using a batch of experience And updating the parameter/>, by minimizing the joint regression function; Updating parameters/>, by means of strategy gradient descent, of communication actor neural network NN1; Soft updating the target actor neural network NN2 and the target critic network NN4; soft update target comment home neural network NN4 parameter，/>Soft updating target actor neural network NN2 parameters for super parameters,/>；

2. The method for optimizing and controlling power of multiple unmanned aerial vehicle trajectories based on communication learning according to claim 1, wherein the multiple unmanned aerial vehicle auxiliary wireless communication system is used for controlling power of multiple unmanned aerial vehicles in a multiple unmanned aerial vehicle wireless communication scene,The individual unmanned aerial vehicle is deployed as an air base station, as/>Providing communication service by individual ground users; /(I)Defined as a collection of unmanned aerial vehicles,Defined as a ground user set; multi-unmanned aerial vehicle auxiliary wireless communication system with equal length/>Run on consecutive time slots, where/>Defined as a set of time slots;

defining a motion model of the unmanned aerial vehicle; the three-dimensional coordinates of the ground user are defined as The three-dimensional coordinates of the unmanned aerial vehicle are/>Wherein/>The flying height of the unmanned aerial vehicle; unmanned aerial vehicle is/>The coordinates of the time slots are expressed as,

（1）

（2）

The probability of NLoS is that ; Where a, b are constants dependent on the environment,/>Is the elevation angle between the drone n and the ground user m,

（3）

（4）

Wherein the method comprises the steps of Is the path loss index,/>Is the path loss of LoS,/>Is the path loss of NLoS; definition/>For white gaussian noise, the signal to interference plus noise ratio received by ground users and unmanned aerial vehicles is defined as:

（5）

（6）

（7）

（8）

（9）

Is limited by ；

Is limited by；

Wherein the method comprises the steps ofThe maximum energy consumption of the unmanned aerial vehicle is achieved; /(I)Is an index function for measuring the service quality of the ground user ifThe formula value is 1, otherwise is/>Is the minimum transmission rate that the surface user needs to meetA weight factor representing the relative importance of adjusting energy consumption and ground user quality of service in a joint optimization objective function.

3. A multi-drone trajectory optimization and power control method based on communication learning according to claim 2, characterized in that the markov game is defined as that the drone obtains real-time rewards for interacting with the environment by selecting actions, observing the current state and at each time slot; a common goal of all drones is to maximize long-term accumulated rewards by selecting the best sequence of actions;

（10）

（11）

（12）。

4. The method for optimizing and controlling power of multiple unmanned aerial vehicle trajectories based on communication learning of claim 3, wherein said step 5) of communicating actor neural networks incorporates self-observed information Learning information/>Updating informationGive action, its formula is expressed as/>，/>Neural network NN1 for communicating actors; the communication actor neural network NN1 comprises a characteristic network, an action network and a series of linear transformations; the characteristic network is a three-layer full-connection neural network, and the action network is a two-layer full-connection neural network; in the forward propagation process, self-observed information/>Generating update information/>, through feature network, linear transformation, sigmoid function and write operation; The action network combines the unmanned aerial vehicle with the self-observation information/>Learning information/>And update information/>And splicing, and calculating the final action output through the two full-connection layers and the activation function.

5. The method for optimizing and controlling power of multiple unmanned aerial vehicle trajectories based on communication learning of claim 4, wherein the training center in step 6) obtains rewardsNext state/>; Define state set/>And communication experience set/>; Experience/>Stored in an experience buffer, and the action value/> is calculated by using the concentrated attention commentator neural network NN3A value; the focused critic neural network NN3 architecture is as follows: the input information of the concentrated attention commentator neural network NN3 is the states and actions of all unmanned aerial vehicles; encoding input information by using a linear layer formed by a plurality of fully connected networks to obtain encoded information/>; Secondly, the multi-head attention layer is utilized to screen information, and the information is encoded/>Mapped to three trainable and shared weight matrices: query/>Bond/>Sum/>The formula is:

（13）

（14）

（15）

（16）

the final multi-headed attention layer output is a weighted sum of all attention weights ; Next/>And/>Inputting the information into an addition and normalization layer, and performing addition and normalization operation on the information to ensure the validity and consistency of the information; normalized informationLinear and nonlinear transformation by feed-forward layer to yield output/>The activation function of the feed-forward layer is Relu functions; finally will/>And/>The predicted/>, is finally obtained as input through addition and normalization processingValues.

6. The method for optimizing and controlling power of multiple unmanned aerial vehicles based on communication learning according to claim 5, wherein the parameters are updated in step 7) by means of strategy gradient descent：

（17）

（18）