CN113572548A - Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning - Google Patents

Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Info

Publication number
CN113572548A
CN113572548A (application number CN202110680187.4A)
Authority
CN
China
Prior art keywords
unmanned aerial
pair
aerial vehicle
time slot
action
Prior art date
Legal status
Granted
Application number
CN202110680187.4A
Other languages
Chinese (zh)
Other versions
CN113572548B (en)
Inventor
彭诺蘅 (Peng Nuoheng)
林艳 (Lin Yan)
张一晋 (Zhang Yijin)
李骏 (Li Jun)
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202110680187.4A
Publication of CN113572548A
Application granted
Publication of CN113572548B
Active legal-status: Current
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/318Received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B1/00Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/69Spread spectrum techniques
    • H04B1/713Spread spectrum techniques using frequency hopping
    • H04B1/715Interference-related aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/345Interference values
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/391Modelling the propagation channel
    • H04B17/3911Fading models or fading generators
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W84/00Network topologies
    • H04W84/18Self-organising networks, e.g. ad-hoc networks or sensor networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) network cooperative fast frequency hopping method based on multi-agent reinforcement learning, which specifically comprises the following steps: inputting the UAV network environment, where each pair of UAVs initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts; in the current time slot, each UAV pair selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed; each UAV pair observes the current state of the environment, exchanges the Q values of each action in the current state with the other UAV pairs to obtain global Q values, and generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm; each UAV pair then updates its Q table and parameters; when the maximum number of steps of the training round is reached, the UAV network environment is input again to start the next round. The invention improves the total throughput of all UAV pairs and provides a communication guarantee for the UAV network.

Description

Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of communication in a wireless mobile network, and particularly relates to an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning.
Background
In the face of increasing communication demands, unmanned aerial vehicle (UAV) communication networks are receiving attention because ground communication infrastructure is limited in deployment cost and flexibility (Zeng Y, Wu Q, Zhang R. Accessing from the sky: A tutorial on UAV communications for 5G and beyond [J]. Proceedings of the IEEE, 2019, 107(12): 2327-2375.). UAVs are small, cheap to deploy, and highly agile and controllable, so they can be used to handle emergency search-and-rescue tasks, serve as mobile relays, and monitor weather and traffic (Gupta L, Jain R, Vaszkun G. Survey of important issues in UAV communication networks [J]. IEEE Communications Surveys & Tutorials, 2015, 18(2): 1123-1152.).
In particular, when UAV pairs communicate directly with each other, the short-range line-of-sight links they establish can effectively reduce signal fading. However, as with ground device-to-device communication, UAV-to-UAV communication is threatened by malicious jamming attacks. Moreover, owing to the shortage of spectrum resources, co-channel interference among users also exists in the UAV communication network, so an effective dynamic resource allocation scheme is needed to provide a communication guarantee (Xu Y, Ren G, Chen J, et al. A one-leader multi-follower Bayesian-Stackelberg game for anti-jamming transmission in UAV communication networks [J]. IEEE Access, 2018, 6: 21697-21709.).
In some studies using conventional optimization methods, researchers artificially restrict characteristics of the UAVs, such as their flight trajectories, in order to simplify the optimization problem (Zhang S, Zhang H, Di B, et al. Cellular UAV-to-X communications: Design and optimization for multi-UAV networks [J]. IEEE Transactions on Wireless Communications, 2019, 18(2): 1346-.). Reinforcement learning algorithms, by contrast, can cope with complex UAV communication networks, because agents keep learning to improve network performance while interacting with the environment. However, single-agent reinforcement learning requires a central controller to collect global information for decision making, and such a controller is difficult to deploy in a UAV communication network, so researchers have introduced multi-agent reinforcement learning to solve the resource allocation optimization problem in UAV communication networks (Cui J, Liu Y, Nallanathan A. Multi-agent reinforcement learning-based resource allocation for UAV networks [J]. IEEE Transactions on Wireless Communications, 2019, 19(2): 729-.). Some researchers have proposed a resource allocation scheme based on independent multi-agent learning, which outperforms conventional schemes but does not consider the performance gain offered by a cooperative multi-agent framework (Tang J, Song J, Ou J, et al. Minimum throughput maximization for multi-UAV enabled WPCN: A deep reinforcement learning method [J]. IEEE Access, 2020, 8: 9124-9132.).
Disclosure of Invention
The invention provides an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, which improves the total throughput performance of all unmanned aerial vehicle pairs and provides communication guarantee for an unmanned aerial vehicle network.
The technical solution for realizing the purpose of the invention is as follows: an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning comprises the following steps:
step 1, inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
step 2, in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
step 3, each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
step 4, each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm;
and step 5, when the maximum number of steps of the training round is reached, ending the current round, starting the next round, re-inputting the unmanned aerial vehicle network environment, and repeating steps 2 to 4.
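For illustration only, the following Python sketch shows how one training round covering steps 1 to 5 could be organized. The environment and agent interfaces used here (env.reset, env.step, agent.initialize, agent.act, agent.update, combine_q_tables) are hypothetical names introduced for this sketch and are not part of the invention.

def combine_q_tables(q_list):
    # One simple way to fuse the exchanged per-pair Q values into global Q values
    # (assumption: element-wise sum over all UAV pairs).
    return [sum(q) for q in zip(*q_list)]

def run_round(env, agents, max_steps):
    # Step 1: input the UAV network environment and let every UAV pair (agent)
    # initialize its Q table, prior-action estimate rho(a), penalty coefficient
    # beta, and state-action visit counts.
    state = env.reset()
    for agent in agents:
        agent.initialize()
    actions = [agent.initial_action(state) for agent in agents]
    for step in range(max_steps):
        # Step 2: transmit on the channels chosen by the actions generated in the
        # previous slot and receive the rewards fed back by the environment.
        next_state, rewards = env.step(actions)
        # Step 3: exchange per-action Q values to obtain global Q values, then
        # generate the next actions with the behavior policy of the
        # mutual-information-regularized soft Q-learning algorithm.
        global_q = combine_q_tables([agent.local_q(next_state) for agent in agents])
        next_actions = [agent.act(next_state, global_q) for agent in agents]
        # Step 4: each pair updates its own Q table and parameters.
        for m, agent in enumerate(agents):
            agent.update(state, actions[m], rewards[m], next_state, global_q)
        state, actions = next_state, next_actions
    # Step 5 is handled by the caller: when max_steps is reached, the round ends
    # and the environment is re-input for the next round.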
Further, the continuous training time is discretized into time slots, and a positive integer j ∈ {1, 2, ...} denotes the j-th slot; suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
Further, step 1 inputs the unmanned aerial vehicle network environment, wherein the network environment comprises:
(1) network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair of unmanned aerial vehicles is bounded;
(2) channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss and fast fading, where the path loss considers only the line-of-sight case and the fast fading is Rayleigh fading;
(3) wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0;
(4) interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
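As a concrete illustration of the channel and wireless transmission models above, the Python sketch below combines line-of-sight path loss with Rayleigh fast fading and applies the throughput rule; the free-space path-loss formula and the 2.4 GHz carrier frequency are assumptions made for this example, not values fixed by the invention.

import numpy as np

def channel_power_gain(distance_m, carrier_freq_hz=2.4e9, rng=np.random):
    # Line-of-sight path loss (free-space model assumed) multiplied by a
    # unit-mean Rayleigh fast-fading power |h|^2 ~ Exp(1).
    wavelength = 3e8 / carrier_freq_hz
    path_loss = (wavelength / (4.0 * np.pi * distance_m)) ** 2
    fast_fading_power = rng.exponential(1.0)
    return path_loss * fast_fading_power

def slot_throughput(actual_rate_bps, achievable_rate_bps, transmit_time_s):
    # Wireless transmission model: the slot delivers actual_rate * t_r bits only
    # if the actual rate does not exceed the achievable rate of the chosen
    # channel; otherwise the transmission fails and the throughput is 0.
    if actual_rate_bps <= achievable_rate_bps:
        return actual_rate_bps * transmit_time_s
    return 0.0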
Further, in step 2, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) Action of an unmanned aerial vehicle pair
The action of each pair of unmanned aerial vehicles comprises two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot. The action a_m^j of the m-th unmanned aerial vehicle pair at time slot j is therefore expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1. In fact, since each pair of unmanned aerial vehicles can only control its own transmission channel for the next slot, the transmission channel vector actually used by all pairs at slot j+1 is represented as f^(j+1) = (f_1^(j+1), ..., f_M^(j+1)).
(2) System reward
To maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j is expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
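A minimal Python sketch of this reward follows, assuming that the normalization constant is the maximum number of bits C_trans · t_r that a single pair can deliver in one slot; the function and variable names are placeholders.

def system_reward(throughputs_bits, c_trans_bps, t_r_s):
    # Total normalized throughput of all UAV pairs; every pair receives the same
    # reward, which encourages cooperation.
    max_bits_per_pair = c_trans_bps * t_r_s
    return sum(throughputs_bits) / max_bits_per_pair

# Example with three UAV pairs, one of which was jammed in the current slot:
# system_reward([2.0e5, 2.0e5, 0.0], c_trans_bps=1.0e6, t_r_s=0.2) == 2.0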
Further, in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) State of an unmanned aerial vehicle pair
The state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state s_m^j of the m-th pair at time slot j is expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j;
the false-alarm/missed-detection probabilities of the observation process are ignored, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot;
(2) Generating the behavior policy
When generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β;
during exploration, the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation, the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution; therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
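As an illustration of this behavior policy, the Python sketch below switches between sampling from ρ(a) and a greedy choice; the greedy rule written here, argmax over ρ(a)·exp(β·Q(s,a)), is the standard mutual-information-regularized form and is used as an assumption consistent with the description, not as the exact formula of the invention.

import numpy as np

def select_action(q_row, rho, beta, epsilon, rng=np.random):
    # q_row: Q values of every action in the current state (1-D numpy array);
    # rho:   current estimate of the optimal prior action distribution (sums to 1).
    if rng.random() < epsilon:
        # Exploration: sample the next-slot action from rho(a).
        return int(rng.choice(len(rho), p=rho))
    # Exploitation: pick the action whose weight rho(a) * exp(beta * Q) is largest;
    # subtracting max(Q) only rescales the weights and keeps exp numerically stable.
    weights = rho * np.exp(beta * (q_row - q_row.max()))
    return int(np.argmax(weights))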
further, in step 4, each pair of unmanned aerial vehicles updates its own Q table and each parameter according to an update mode in the mutual information regularization soft Q-learning algorithm, which specifically includes:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure BDA0003122215460000046
Is the policy generated by the mth drone pair in slot j,
Figure BDA0003122215460000047
indicating that the mth drone pair is generated in time slot j-1
Figure BDA0003122215460000048
Then estimating the current distribution of the optimal prior action; because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure BDA0003122215460000049
In the form of self-selected letterWay, and therefore current estimate of optimal prior motion distribution
Figure BDA0003122215460000051
The update equation of (2) is as follows:
Figure BDA0003122215460000052
wherein alpha isρIs the learning rate, and
Figure BDA0003122215460000053
is uniformly distributed;
(2) Updating the coefficient β of the mutual information penalty term
Let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula;
(3) Updating the Q table
The Q table update uses the estimate ρ(a) of the optimal prior action distribution and the mutual information penalty coefficient β. At time slot j the m-th pair updates its Q table with a temporal-difference rule whose target combines the received reward with the soft Q value of the new state, where the soft Q value is computed from the Q values weighted by ρ(a) and β, and γ is the discount factor. The learning rate α_m^j of the m-th pair at time slot j decreases as the number of occurrences n_m^j(s_m^j, a_m^j) of its state-action pair grows, wherein ω is a positive constant.
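For illustration, the per-slot updates can be sketched as follows in Python. The ρ(a) update toward the previously generated policy and the visit-count-based learning rate follow the description above; the explicit soft Q-value expression, the β schedule, and the decay form α = ω / n are standard mutual-information-regularized soft Q-learning choices adopted here as assumptions rather than the exact formulas of the invention.

import numpy as np

def update_rho(rho, policy, alpha_rho):
    # Move the estimate of the optimal prior action distribution toward the
    # policy generated in the previous slot (learning rate alpha_rho).
    return (1.0 - alpha_rho) * rho + alpha_rho * policy

def update_beta(beta, c):
    # Assumed schedule: the mutual-information penalty coefficient grows by a
    # positive constant c every slot, gradually shifting weight onto the Q values.
    return beta + c

def soft_value(q_row, rho, beta):
    # Soft Q value of a state (assumed standard form):
    # V(s) = (1/beta) * log( sum_a rho(a) * exp(beta * Q(s, a)) )
    z = beta * np.asarray(q_row)
    z_max = z.max()
    return (z_max + np.log(np.sum(np.asarray(rho) * np.exp(z - z_max)))) / beta

def update_q(q_table, s, a, reward, s_next, rho, beta, visit_count, gamma, omega):
    # Tabular temporal-difference update; the learning rate decays with the
    # number of visits to the state-action pair (alpha = omega / n, assumed form).
    alpha = omega / max(visit_count, 1)
    target = reward + gamma * soft_value(q_table[s_next], rho, beta)
    q_table[s, a] += alpha * (target - q_table[s, a])
    return q_table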
Compared with the prior art, the invention has the following notable advantages: (1) the unmanned aerial vehicle pairs cooperate with each other and avoid interference through information exchange, so as to maximize the total throughput of the system; (2) by adopting the mutual-information-regularized soft Q-learning algorithm, the method converges faster and achieves better and more stable throughput in a highly dynamic unmanned aerial vehicle network environment; (3) the anti-jamming communication problem in the dynamically changing unmanned aerial vehicle network is solved with a cooperative multi-agent framework that balances exploration and exploitation, providing a communication guarantee for the unmanned aerial vehicle network.
Drawings
Fig. 1 is a flow chart of the unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning.
Fig. 2 is a schematic diagram of a network topology of an unmanned aerial vehicle according to an embodiment of the present invention.
Fig. 3 is a graph of throughput performance of an average drone pair as a function of the number of training rounds in an embodiment of the present invention.
Fig. 4 is a graph of throughput performance of an average drone pair as a function of the number of jammers in an embodiment of the invention.
Fig. 5 is a graph of throughput performance of an average drone pair as a function of the number of available channels in an embodiment of the invention.
Fig. 6 is a graph of throughput performance of an average drone pair as a function of the number of drone pairs in an embodiment of the invention.
Detailed Description
The invention provides an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, in which multiple unmanned aerial vehicle pairs cooperate through interacting Q tables to avoid internal co-channel interference and malicious jamming from jammers, thereby improving the total throughput of all unmanned aerial vehicle pairs. With reference to Figs. 1 and 2, the method comprises the following steps:
Step 1: inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
Step 2: in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
Step 3: each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
Step 4: each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm.
Step 5: when the maximum number of steps of the training round is reached, the current round ends and the next round starts; the unmanned aerial vehicle network environment is input again and steps 2 to 4 are repeated.
The invention discretizes the continuous training time into time slots, with a positive integer j ∈ {1, 2, ...} denoting the j-th slot. Suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
Further, in step 1 the unmanned aerial vehicle network environment is input, and each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts, specifically:
The unmanned aerial vehicle network environment includes:
(1) Network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair is bounded.
(2) Channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss (only the line-of-sight case is considered) and fast fading (Rayleigh).
(3) Wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0.
(4) Interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
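As an illustration of the interference model in item (4), the Python sketch below returns the set of channels jammed in a given slot under single-tone swept-frequency jamming with non-overlapping jammers; the round-robin sweep order and the equally spaced starting offsets are assumptions made for this example.

def jammed_channels(slot_index, num_jammers, num_channels):
    # Each jammer sweeps one tone per slot over the UAVs' available channel set;
    # equally spaced starting offsets keep the jammed channels non-overlapping.
    assert 1 <= num_jammers <= num_channels
    spacing = num_channels // num_jammers
    return {(slot_index + n * spacing) % num_channels for n in range(num_jammers)}

# Example: jammed_channels(5, 3, 8) == {1, 5, 7}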
Further, in step 2 each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) Action of an unmanned aerial vehicle pair
The action of each pair of unmanned aerial vehicles consists of two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot. The action of the m-th pair at time slot j can therefore be expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1. In fact, since each pair of unmanned aerial vehicles can only control its own transmission channel for the next slot, the transmission channel vector actually used by all pairs at slot j+1 can be written f^(j+1) = (f_1^(j+1), ..., f_M^(j+1)).
(2) System reward
To maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j can be expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
Further, in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) State of an unmanned aerial vehicle pair
The state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state of the m-th pair at time slot j can be expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j. It should be noted that the invention does not consider false-alarm/missed-detection probabilities in the observation process, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot.
(2) Generating the behavior policy
When generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β. Specifically, during exploration the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution. Therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
further, in step 4, each pair of unmanned aerial vehicles updates its own Q table and each parameter according to an update mode in the mutual information regularization soft Q-learning algorithm, which specifically includes:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure BDA0003122215460000085
Is the policy generated by the mth drone pair in slot j,
Figure BDA0003122215460000086
indicating that the mth drone pair is generated in time slot j-1
Figure BDA0003122215460000087
And then estimating the current optimal prior motion distribution. Because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure BDA0003122215460000091
The channel selected for itself, and thus the current estimate of the optimal prior motion distribution
Figure BDA0003122215460000092
The update equation of (2) is as follows:
Figure BDA0003122215460000093
wherein alpha isρIs the learning rate, and
Figure BDA0003122215460000094
is uniformly distributed.
(2) Updating the coefficient β of the mutual information penalty term
Let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula.
(3) updating Q table
The Q table updating needs to use the estimation rho (a) of the optimal prior action distribution and the coefficient beta of a mutual information penalty term, and the updating formula of the m-th unmanned aerial vehicle on the Q table in the time slot j is as follows:
Figure BDA0003122215460000098
wherein
Figure BDA0003122215460000099
Is a calculation formula for soft Q value, gamma is a discount factor,
Figure BDA00031222154600000910
the learning rate of the mth drone pair in the time slot j is changed along with the occurrence number of the motion state pair of the mth drone pair. The specific calculation formula is as follows:
Figure BDA00031222154600000911
where omega is a normal number,
Figure BDA00031222154600000912
is the action state pair of the mth unmanned aerial vehicle pair in the time slot j
Figure BDA00031222154600000913
The number of occurrences of (c).
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
One embodiment of the invention is described in detail below. The simulation is implemented in Python, and the parameter settings do not affect generality. The baseline methods compared with the proposed method are: (1) a random fast frequency hopping method; (2) a multi-agent cooperative unmanned aerial vehicle fast frequency hopping method based on conventional Q-learning.
As shown in Fig. 2, all unmanned aerial vehicle pairs and jammers move within a rectangular area at a fixed height, where the pairs and jammers fly at a certain speed along their respective flight directions. It is assumed that during communication the distance D between the receiver and the transmitter of each pair satisfies D ≤ D_max, where D_max = 100 m is the communication range of each pair. In addition, the roles of receiver and transmitter in all pairs do not change during training, but the receivers and transmitters are regrouped at the beginning of each round.
During training, the number of training rounds is set to 1000, and the maximum number of steps per round is set to 1000. In addition, the speed of each unmanned aerial vehicle is drawn randomly from [10 m/s, 20 m/s], and its flight direction changes with probability 1/3. The initial value of the greedy factor ε is set to 1 and then decreases with the number of training steps. Table 1 lists the other simulation parameters.
TABLE 1 Primary simulation parameters (table not reproduced here)
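To make the mobility setting concrete, the following Python sketch performs one step of the Markov random mobility model described above: the flight direction is kept with probability 2/3 and redrawn with probability 1/3, and the speed is drawn from [10 m/s, 20 m/s]. The rectangular area size, the continuous direction set, and the boundary clipping are assumptions introduced for this example.

import numpy as np

def markov_move(position_xy, direction_rad, slot_duration_s,
                area_xy=(1000.0, 1000.0), rng=np.random):
    # Direction changes with probability 1/3, otherwise it is kept.
    if rng.random() < 1.0 / 3.0:
        direction_rad = rng.uniform(0.0, 2.0 * np.pi)
    speed = rng.uniform(10.0, 20.0)                 # m/s
    step = speed * slot_duration_s
    x = position_xy[0] + step * np.cos(direction_rad)
    y = position_xy[1] + step * np.sin(direction_rad)
    # Keep the node inside the rectangular area (clipping is an assumption).
    x = float(np.clip(x, 0.0, area_xy[0]))
    y = float(np.clip(y, 0.0, area_xy[1]))
    return (x, y), direction_rad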
As shown in Fig. 3, compared with the baseline methods, the proposed method benefits from an efficient exploration-exploitation mechanism: its converged throughput is higher, it converges faster, and its training process is more stable.
As shown in Figs. 4 to 6, when the number of jammers, the number of available channels, or the number of unmanned aerial vehicle pairs changes, the performance of the proposed method remains significantly better than that of the baseline methods. This is because the adaptive estimate of the optimal prior action distribution, together with the improved Q value update and behavior policy generation, allows the priority of actions to be learned faster.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. An unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, characterized by comprising the following steps:
step 1, inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
step 2, in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
step 3, each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
step 4, each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm;
and step 5, when the maximum number of steps of the training round is reached, ending the current round, starting the next round, re-inputting the unmanned aerial vehicle network environment, and repeating steps 2 to 4.
2. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 1, wherein the continuous training time is discretized into time slots and a positive integer j ∈ {1, 2, ...} denotes the j-th slot; suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
3. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 2, wherein the unmanned aerial vehicle network environment input in step 1 comprises:
(1) network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair of unmanned aerial vehicles is bounded;
(2) channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss and fast fading, where the path loss considers only the line-of-sight case and the fast fading is Rayleigh fading;
(3) wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0;
(4) interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
4. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 2 or 3, wherein in step 2 each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) action of an unmanned aerial vehicle pair
the action of each pair of unmanned aerial vehicles consists of two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot; the action of the m-th pair at time slot j is expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1; in fact, since each pair can only control its own transmission channel for the next slot, the transmission channel vector used by all pairs at slot j+1 is represented as f^(j+1) = (f_1^(j+1), ..., f_M^(j+1));
(2) system reward
to maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j is expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
5. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 4, wherein in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) state of an unmanned aerial vehicle pair
the state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state of the m-th pair at time slot j is expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j;
the false-alarm/missed-detection probabilities of the observation process are ignored, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot;
(2) generating the behavior policy
when generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β;
during exploration, the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation, the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution; therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
6. the unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning as claimed in claim 1, wherein step 4 each pair of unmanned aerial vehicles updates its own Q table and various parameters according to an update mode in a mutual information regularization soft Q-learning algorithm, specifically:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure FDA0003122215450000036
Is the policy generated by the mth drone pair in slot j,
Figure FDA0003122215450000037
indicating that the mth drone pair is generated in time slot j-1
Figure FDA0003122215450000038
Then estimating the current distribution of the optimal prior action; because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure FDA0003122215450000041
The channel selected for itself, and thus the current estimate of the optimal prior motion distribution
Figure FDA0003122215450000042
The update equation of (2) is as follows:
Figure FDA0003122215450000043
wherein alpha isρIs the learning rate, and
Figure FDA0003122215450000044
is uniformly distributed;
(2) updating the coefficient β of the mutual information penalty term
let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula;
(3) updating the Q table
the Q table update uses the estimate ρ(a) of the optimal prior action distribution and the mutual information penalty coefficient β; at time slot j the m-th pair updates its Q table with a temporal-difference rule whose target combines the received reward with the soft Q value of the new state, where the soft Q value is computed from the Q values weighted by ρ(a) and β, and γ is the discount factor; the learning rate α_m^j of the m-th pair at time slot j decreases as the number of occurrences n_m^j(s_m^j, a_m^j) of its state-action pair grows, wherein ω is a positive constant.
CN202110680187.4A 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning Active CN113572548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110680187.4A CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110680187.4A CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN113572548A true CN113572548A (en) 2021-10-29
CN113572548B CN113572548B (en) 2023-07-07

Family

ID=78162317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110680187.4A Active CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113572548B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024170A1 (en) * 2018-08-01 2020-02-06 东莞理工学院 Nash equilibrium strategy and social network consensus evolution model in continuous action space
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Haichao, Wang Jinlong, Ding Guoru, Chen Jin: "Intelligent cooperative anti-jamming technology in space-air-ground integrated networks", Journal of Command and Control *

Also Published As

Publication number Publication date
CN113572548B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
Wang et al. A survey on applications of model-free strategy learning in cognitive wireless networks
Lei et al. Deep reinforcement learning-based spectrum allocation in integrated access and backhaul networks
Shi et al. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach
Li Multi-agent Q-learning of channel selection in multi-user cognitive radio systems: A two by two case
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN113382381B (en) Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
Du et al. Multi-agent reinforcement learning for dynamic resource management in 6G in-X subnetworks
Guan et al. User association and power allocation for UAV-assisted networks: A distributed reinforcement learning approach
Albinsaid et al. Multi-agent reinforcement learning-based distributed dynamic spectrum access
Qin et al. Deep reinforcement learning based resource allocation and trajectory planning in integrated sensing and communications UAV network
Xu et al. Voting-based multiagent reinforcement learning for intelligent IoT
Ghavimi et al. Energy-efficient uav communications with interference management: Deep learning framework
Zhou et al. Multi-agent few-shot meta reinforcement learning for trajectory design and channel selection in UAV-assisted networks
Sande et al. Access and radio resource management for IAB networks using deep reinforcement learning
Wang et al. Intelligent resource allocation in UAV-enabled mobile edge computing networks
Wu et al. AoI minimization for UAV-to-device underlay communication by multi-agent deep reinforcement learning
Iturria-Rivera et al. Cooperate or not Cooperate: Transfer Learning with Multi-Armed Bandit for Spatial Reuse in Wi-Fi
Cao et al. Deep reinforcement learning for user access control in UAV networks
Huang et al. Delay-Oriented Knowledge-Driven Resource Allocation in SAGIN-Based Vehicular Networks
Wang et al. Joint spectrum access and power control in air-air communications-a deep reinforcement learning based approach
Zhang et al. Machine learning driven UAV-assisted edge computing
Gong et al. Distributed DRL-based resource allocation for multicast D2D communications
Huang et al. Fast spectrum sharing in vehicular networks: A meta reinforcement learning approach
Li et al. A Q-learning-based channel selection and data scheduling approach for high-frequency communications in jamming environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Lin Yan

Inventor after: Peng Nuoheng

Inventor after: Zhang Yijin

Inventor after: Li Jun

Inventor before: Peng Nuoheng

Inventor before: Lin Yan

Inventor before: Zhang Yijin

Inventor before: Li Jun

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant