CN113572548A - Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning - Google Patents

Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Info

Publication number
CN113572548A
CN113572548A (application number CN202110680187.4A)
Authority
CN
China
Prior art keywords
unmanned aerial
pair
aerial vehicle
time slot
action
Prior art date
Legal status
Granted
Application number
CN202110680187.4A
Other languages
Chinese (zh)
Other versions
CN113572548B (en)
Inventor
彭诺蘅 (Peng Nuoheng)
林艳 (Lin Yan)
张一晋 (Zhang Yijin)
李骏 (Li Jun)
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202110680187.4A
Publication of CN113572548A
Application granted
Publication of CN113572548B
Active legal-status: Current
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/318Received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B1/00Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/69Spread spectrum techniques
    • H04B1/713Spread spectrum techniques using frequency hopping
    • H04B1/715Interference-related aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/345Interference values
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/391Modelling the propagation channel
    • H04B17/3911Fading models or fading generators
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W84/00Network topologies
    • H04W84/18Self-organising networks, e.g. ad-hoc networks or sensor networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) network cooperative fast frequency hopping method based on multi-agent reinforcement learning, which specifically comprises the following steps: inputting the UAV network environment, where each pair of UAVs initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts; in the current time slot, each UAV pair selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed; each UAV pair observes the current state of the environment, exchanges the Q values of each action in the current state with the other UAV pairs to obtain global Q values, and generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm; each UAV pair then updates its Q table and parameters; when the maximum number of steps of the training round is reached, the UAV network environment is input again to start the next round. The invention improves the total throughput of all UAV pairs and provides a communication guarantee for the UAV network.

Description

Unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of communication in a wireless mobile network, and particularly relates to an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning.
Background
In the face of increasing communication demands, unmanned aerial vehicle (UAV) communication networks are receiving attention because ground communication infrastructure is limited in deployment cost and flexibility (Zeng Y, Wu Q, Zhang R. Accessing from the sky: A tutorial on UAV communications for 5G and beyond [J]. Proceedings of the IEEE, 2019, 107(12): 2327-2375.). UAVs are small, cheap to deploy, and highly agile and controllable, so they can be used to handle emergency search-and-rescue tasks, serve as mobile relays, and monitor weather and traffic (Gupta L, Jain R, Vaszkun G. Survey of important issues in UAV communication networks [J]. IEEE Communications Surveys & Tutorials, 2015, 18(2): 1123-1152.).
In particular, when UAV pairs communicate directly with each other, the short-range line-of-sight links they establish can effectively reduce signal fading. However, as with ground device-to-device communication, UAV-to-UAV communication is threatened by malicious jamming attacks. Moreover, owing to the shortage of spectrum resources, co-channel interference among users also exists in the UAV communication network, so an effective dynamic resource allocation scheme is needed to provide a communication guarantee (Xu Y, Ren G, Chen J, et al. A one-leader multi-follower Bayesian-Stackelberg game for anti-jamming transmission in UAV communication networks [J]. IEEE Access, 2018, 6: 21697-21709.).
In some studies using conventional optimization methods, researchers artificially restrict characteristics of the UAVs, such as their flight trajectories, in order to simplify the optimization problem (Zhang S, Zhang H, Di B, et al. Cellular UAV-to-X communications: Design and optimization for multi-UAV networks [J]. IEEE Transactions on Wireless Communications, 2019, 18(2): 1346-.). Reinforcement learning algorithms, by contrast, can cope with complex UAV communication networks, because agents keep learning to improve network performance while interacting with the environment. However, single-agent reinforcement learning requires a central controller to collect global information for decision making, and such a controller is difficult to deploy in a UAV communication network, so researchers have introduced multi-agent reinforcement learning to solve the resource allocation optimization problem in UAV communication networks (Cui J, Liu Y, Nallanathan A. Multi-agent reinforcement learning-based resource allocation for UAV networks [J]. IEEE Transactions on Wireless Communications, 2019, 19(2): 729-.). Some researchers have proposed a resource allocation scheme based on independent multi-agent learning, which outperforms conventional schemes but does not consider the performance gain offered by a cooperative multi-agent framework (Tang J, Song J, Ou J, et al. Minimum throughput maximization for multi-UAV enabled WPCN: A deep reinforcement learning method [J]. IEEE Access, 2020, 8: 9124-9132.).
Disclosure of Invention
The invention provides an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, which improves the total throughput performance of all unmanned aerial vehicle pairs and provides communication guarantee for an unmanned aerial vehicle network.
The technical solution for realizing the purpose of the invention is as follows: an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning comprises the following steps:
step 1, inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
step 2, in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
step 3, each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
step 4, each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm;
and step 5, when the maximum number of steps of the training round is reached, ending the current round, starting the next round, re-inputting the unmanned aerial vehicle network environment, and repeating steps 2 to 4.
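For illustration only, the following Python sketch shows how one training round covering steps 1 to 5 could be organized. The environment and agent interfaces used here (env.reset, env.step, agent.initialize, agent.act, agent.update, combine_q_tables) are hypothetical names introduced for this sketch and are not part of the invention.

def combine_q_tables(q_list):
    # One simple way to fuse the exchanged per-pair Q values into global Q values
    # (assumption: element-wise sum over all UAV pairs).
    return [sum(q) for q in zip(*q_list)]

def run_round(env, agents, max_steps):
    # Step 1: input the UAV network environment and let every UAV pair (agent)
    # initialize its Q table, prior-action estimate rho(a), penalty coefficient
    # beta, and state-action visit counts.
    state = env.reset()
    for agent in agents:
        agent.initialize()
    actions = [agent.initial_action(state) for agent in agents]
    for step in range(max_steps):
        # Step 2: transmit on the channels chosen by the actions generated in the
        # previous slot and receive the rewards fed back by the environment.
        next_state, rewards = env.step(actions)
        # Step 3: exchange per-action Q values to obtain global Q values, then
        # generate the next actions with the behavior policy of the
        # mutual-information-regularized soft Q-learning algorithm.
        global_q = combine_q_tables([agent.local_q(next_state) for agent in agents])
        next_actions = [agent.act(next_state, global_q) for agent in agents]
        # Step 4: each pair updates its own Q table and parameters.
        for m, agent in enumerate(agents):
            agent.update(state, actions[m], rewards[m], next_state, global_q)
        state, actions = next_state, next_actions
    # Step 5 is handled by the caller: when max_steps is reached, the round ends
    # and the environment is re-input for the next round.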
Further, the continuous training time is discretized into time slots, and a positive integer j ∈ {1, 2, ...} denotes the j-th slot; suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
Further, step 1 inputs the unmanned aerial vehicle network environment, wherein the network environment comprises:
(1) network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair of unmanned aerial vehicles is bounded;
(2) channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss and fast fading, where the path loss considers only the line-of-sight case and the fast fading is Rayleigh fading;
(3) wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0;
(4) interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
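As a concrete illustration of the channel and wireless transmission models above, the Python sketch below combines line-of-sight path loss with Rayleigh fast fading and applies the throughput rule; the free-space path-loss formula and the 2.4 GHz carrier frequency are assumptions made for this example, not values fixed by the invention.

import numpy as np

def channel_power_gain(distance_m, carrier_freq_hz=2.4e9, rng=np.random):
    # Line-of-sight path loss (free-space model assumed) multiplied by a
    # unit-mean Rayleigh fast-fading power |h|^2 ~ Exp(1).
    wavelength = 3e8 / carrier_freq_hz
    path_loss = (wavelength / (4.0 * np.pi * distance_m)) ** 2
    fast_fading_power = rng.exponential(1.0)
    return path_loss * fast_fading_power

def slot_throughput(actual_rate_bps, achievable_rate_bps, transmit_time_s):
    # Wireless transmission model: the slot delivers actual_rate * t_r bits only
    # if the actual rate does not exceed the achievable rate of the chosen
    # channel; otherwise the transmission fails and the throughput is 0.
    if actual_rate_bps <= achievable_rate_bps:
        return actual_rate_bps * transmit_time_s
    return 0.0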
Further, in step 2, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) Action of an unmanned aerial vehicle pair
The action of each pair of unmanned aerial vehicles comprises two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot. The action a_m^j of the m-th unmanned aerial vehicle pair at time slot j is therefore expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1. In fact, since each pair of unmanned aerial vehicles can only control its own transmission channel for the next slot, the transmission channel vector actually used by all pairs at slot j+1 is represented as f^(j+1) = (f_1^(j+1), ..., f_M^(j+1)).
(2) System reward
To maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j is expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
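A minimal Python sketch of this reward follows, assuming that the normalization constant is the maximum number of bits C_trans · t_r that a single pair can deliver in one slot; the function and variable names are placeholders.

def system_reward(throughputs_bits, c_trans_bps, t_r_s):
    # Total normalized throughput of all UAV pairs; every pair receives the same
    # reward, which encourages cooperation.
    max_bits_per_pair = c_trans_bps * t_r_s
    return sum(throughputs_bits) / max_bits_per_pair

# Example with three UAV pairs, one of which was jammed in the current slot:
# system_reward([2.0e5, 2.0e5, 0.0], c_trans_bps=1.0e6, t_r_s=0.2) == 2.0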
Further, in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) State of an unmanned aerial vehicle pair
The state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state s_m^j of the m-th pair at time slot j is expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j;
the false-alarm/missed-detection probabilities of the observation process are ignored, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot;
(2) Generating the behavior policy
When generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β;
during exploration, the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation, the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution; therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
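As an illustration of this behavior policy, the Python sketch below switches between sampling from ρ(a) and a greedy choice; the greedy rule written here, argmax over ρ(a)·exp(β·Q(s,a)), is the standard mutual-information-regularized form and is used as an assumption consistent with the description, not as the exact formula of the invention.

import numpy as np

def select_action(q_row, rho, beta, epsilon, rng=np.random):
    # q_row: Q values of every action in the current state (1-D numpy array);
    # rho:   current estimate of the optimal prior action distribution (sums to 1).
    if rng.random() < epsilon:
        # Exploration: sample the next-slot action from rho(a).
        return int(rng.choice(len(rho), p=rho))
    # Exploitation: pick the action whose weight rho(a) * exp(beta * Q) is largest;
    # subtracting max(Q) only rescales the weights and keeps exp numerically stable.
    weights = rho * np.exp(beta * (q_row - q_row.max()))
    return int(np.argmax(weights))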
further, in step 4, each pair of unmanned aerial vehicles updates its own Q table and each parameter according to an update mode in the mutual information regularization soft Q-learning algorithm, which specifically includes:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure BDA0003122215460000046
Is the policy generated by the mth drone pair in slot j,
Figure BDA0003122215460000047
indicating that the mth drone pair is generated in time slot j-1
Figure BDA0003122215460000048
Then estimating the current distribution of the optimal prior action; because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure BDA0003122215460000049
In the form of self-selected letterWay, and therefore current estimate of optimal prior motion distribution
Figure BDA0003122215460000051
The update equation of (2) is as follows:
Figure BDA0003122215460000052
wherein alpha isρIs the learning rate, and
Figure BDA0003122215460000053
is uniformly distributed;
(2) Updating the coefficient β of the mutual information penalty term
Let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula;
(3) Updating the Q table
The Q table update uses the estimate ρ(a) of the optimal prior action distribution and the mutual information penalty coefficient β. At time slot j the m-th pair updates its Q table with a temporal-difference rule whose target combines the received reward with the soft Q value of the new state, where the soft Q value is computed from the Q values weighted by ρ(a) and β, and γ is the discount factor. The learning rate α_m^j of the m-th pair at time slot j decreases as the number of occurrences n_m^j(s_m^j, a_m^j) of its state-action pair grows, wherein ω is a positive constant.
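For illustration, the per-slot updates can be sketched as follows in Python. The ρ(a) update toward the previously generated policy and the visit-count-based learning rate follow the description above; the explicit soft Q-value expression, the β schedule, and the decay form α = ω / n are standard mutual-information-regularized soft Q-learning choices adopted here as assumptions rather than the exact formulas of the invention.

import numpy as np

def update_rho(rho, policy, alpha_rho):
    # Move the estimate of the optimal prior action distribution toward the
    # policy generated in the previous slot (learning rate alpha_rho).
    return (1.0 - alpha_rho) * rho + alpha_rho * policy

def update_beta(beta, c):
    # Assumed schedule: the mutual-information penalty coefficient grows by a
    # positive constant c every slot, gradually shifting weight onto the Q values.
    return beta + c

def soft_value(q_row, rho, beta):
    # Soft Q value of a state (assumed standard form):
    # V(s) = (1/beta) * log( sum_a rho(a) * exp(beta * Q(s, a)) )
    z = beta * np.asarray(q_row)
    z_max = z.max()
    return (z_max + np.log(np.sum(np.asarray(rho) * np.exp(z - z_max)))) / beta

def update_q(q_table, s, a, reward, s_next, rho, beta, visit_count, gamma, omega):
    # Tabular temporal-difference update; the learning rate decays with the
    # number of visits to the state-action pair (alpha = omega / n, assumed form).
    alpha = omega / max(visit_count, 1)
    target = reward + gamma * soft_value(q_table[s_next], rho, beta)
    q_table[s, a] += alpha * (target - q_table[s, a])
    return q_table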
Compared with the prior art, the invention has the following notable advantages: (1) the unmanned aerial vehicle pairs cooperate with each other and avoid interference through information exchange, so as to maximize the total throughput of the system; (2) by adopting the mutual-information-regularized soft Q-learning algorithm, the method converges faster and achieves better and more stable throughput in a highly dynamic unmanned aerial vehicle network environment; (3) the anti-jamming communication problem in the dynamically changing unmanned aerial vehicle network is solved with a cooperative multi-agent framework that balances exploration and exploitation, providing a communication guarantee for the unmanned aerial vehicle network.
Drawings
Fig. 1 is a flow chart of the unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning.
Fig. 2 is a schematic diagram of a network topology of an unmanned aerial vehicle according to an embodiment of the present invention.
Fig. 3 is a graph of throughput performance of an average drone pair as a function of the number of training rounds in an embodiment of the present invention.
Fig. 4 is a graph of throughput performance of an average drone pair as a function of the number of jammers in an embodiment of the invention.
Fig. 5 is a graph of throughput performance of an average drone pair as a function of the number of available channels in an embodiment of the invention.
Fig. 6 is a graph of throughput performance of an average drone pair as a function of the number of drone pairs in an embodiment of the invention.
Detailed Description
The invention provides an unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, in which multiple unmanned aerial vehicle pairs cooperate through interacting Q tables to avoid internal co-channel interference and malicious jamming from jammers, thereby improving the total throughput of all unmanned aerial vehicle pairs. With reference to Figs. 1 and 2, the method comprises the following steps:
Step 1: inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
Step 2: in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
Step 3: each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
Step 4: each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm.
Step 5: when the maximum number of steps of the training round is reached, the current round ends and the next round starts; the unmanned aerial vehicle network environment is input again and steps 2 to 4 are repeated.
The invention discretizes the continuous training time into time slots, with a positive integer j ∈ {1, 2, ...} denoting the j-th slot. Suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
Further, in step 1 the unmanned aerial vehicle network environment is input, and each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts, specifically:
The unmanned aerial vehicle network environment includes:
(1) Network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair is bounded.
(2) Channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss (only the line-of-sight case is considered) and fast fading (Rayleigh).
(3) Wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0.
(4) Interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
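As an illustration of the interference model in item (4), the Python sketch below returns the set of channels jammed in a given slot under single-tone swept-frequency jamming with non-overlapping jammers; the round-robin sweep order and the equally spaced starting offsets are assumptions made for this example.

def jammed_channels(slot_index, num_jammers, num_channels):
    # Each jammer sweeps one tone per slot over the UAVs' available channel set;
    # equally spaced starting offsets keep the jammed channels non-overlapping.
    assert 1 <= num_jammers <= num_channels
    spacing = num_channels // num_jammers
    return {(slot_index + n * spacing) % num_channels for n in range(num_jammers)}

# Example: jammed_channels(5, 3, 8) == {1, 5, 7}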
Further, in step 2 each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) Action of an unmanned aerial vehicle pair
The action of each pair of unmanned aerial vehicles consists of two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot. The action of the m-th pair at time slot j can therefore be expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1. In fact, since each pair of unmanned aerial vehicles can only control its own transmission channel for the next slot, the transmission channel vector actually used by all pairs at slot j+1 can be written f^(j+1) = (f_1^(j+1), ..., f_M^(j+1)).
(2) System reward
To maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j can be expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
Further, in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) State of an unmanned aerial vehicle pair
The state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state of the m-th pair at time slot j can be expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j. It should be noted that the invention does not consider false-alarm/missed-detection probabilities in the observation process, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot.
(2) Generating the behavior policy
When generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β. Specifically, during exploration the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution. Therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
further, in step 4, each pair of unmanned aerial vehicles updates its own Q table and each parameter according to an update mode in the mutual information regularization soft Q-learning algorithm, which specifically includes:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure BDA0003122215460000085
Is the policy generated by the mth drone pair in slot j,
Figure BDA0003122215460000086
indicating that the mth drone pair is generated in time slot j-1
Figure BDA0003122215460000087
And then estimating the current optimal prior motion distribution. Because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure BDA0003122215460000091
The channel selected for itself, and thus the current estimate of the optimal prior motion distribution
Figure BDA0003122215460000092
The update equation of (2) is as follows:
Figure BDA0003122215460000093
wherein alpha isρIs the learning rate, and
Figure BDA0003122215460000094
is uniformly distributed.
(2) Updating the coefficient β of the mutual information penalty term
Let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula.
(3) updating Q table
The Q table updating needs to use the estimation rho (a) of the optimal prior action distribution and the coefficient beta of a mutual information penalty term, and the updating formula of the m-th unmanned aerial vehicle on the Q table in the time slot j is as follows:
Figure BDA0003122215460000098
wherein
Figure BDA0003122215460000099
Is a calculation formula for soft Q value, gamma is a discount factor,
Figure BDA00031222154600000910
the learning rate of the mth drone pair in the time slot j is changed along with the occurrence number of the motion state pair of the mth drone pair. The specific calculation formula is as follows:
Figure BDA00031222154600000911
where omega is a normal number,
Figure BDA00031222154600000912
is the action state pair of the mth unmanned aerial vehicle pair in the time slot j
Figure BDA00031222154600000913
The number of occurrences of (c).
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
One embodiment of the invention is described in detail below. The simulation is implemented in Python, and the parameter settings do not affect generality. The baseline methods compared with the proposed method are: (1) a random fast frequency hopping method; (2) a multi-agent cooperative unmanned aerial vehicle fast frequency hopping method based on conventional Q-learning.
As shown in Fig. 2, all unmanned aerial vehicle pairs and jammers move within a rectangular area at a fixed height, where the pairs and jammers fly at a certain speed along their respective flight directions. It is assumed that during communication the distance D between the receiver and the transmitter of each pair satisfies D ≤ D_max, where D_max = 100 m is the communication range of each pair. In addition, the roles of receiver and transmitter in all pairs do not change during training, but the receivers and transmitters are regrouped at the beginning of each round.
During training, the number of training rounds is set to 1000, and the maximum number of steps per round is set to 1000. In addition, the speed of each unmanned aerial vehicle is drawn randomly from [10 m/s, 20 m/s], and its flight direction changes with probability 1/3. The initial value of the greedy factor ε is set to 1 and then decreases with the number of training steps. Table 1 lists the other simulation parameters.
TABLE 1 Primary simulation parameters (table not reproduced here)
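To make the mobility setting concrete, the following Python sketch performs one step of the Markov random mobility model described above: the flight direction is kept with probability 2/3 and redrawn with probability 1/3, and the speed is drawn from [10 m/s, 20 m/s]. The rectangular area size, the continuous direction set, and the boundary clipping are assumptions introduced for this example.

import numpy as np

def markov_move(position_xy, direction_rad, slot_duration_s,
                area_xy=(1000.0, 1000.0), rng=np.random):
    # Direction changes with probability 1/3, otherwise it is kept.
    if rng.random() < 1.0 / 3.0:
        direction_rad = rng.uniform(0.0, 2.0 * np.pi)
    speed = rng.uniform(10.0, 20.0)                 # m/s
    step = speed * slot_duration_s
    x = position_xy[0] + step * np.cos(direction_rad)
    y = position_xy[1] + step * np.sin(direction_rad)
    # Keep the node inside the rectangular area (clipping is an assumption).
    x = float(np.clip(x, 0.0, area_xy[0]))
    y = float(np.clip(y, 0.0, area_xy[1]))
    return (x, y), direction_rad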
As shown in Fig. 3, compared with the baseline methods, the proposed method benefits from an efficient exploration-exploitation mechanism: its converged throughput is higher, it converges faster, and its training process is more stable.
As shown in Figs. 4 to 6, when the number of jammers, the number of available channels, or the number of unmanned aerial vehicle pairs changes, the performance of the proposed method remains significantly better than that of the baseline methods. This is because the adaptive estimate of the optimal prior action distribution, together with the improved Q value update and behavior policy generation, allows the priority of actions to be learned faster.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. An unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning, characterized by comprising the following steps:
step 1, inputting the unmanned aerial vehicle network environment, wherein each pair of unmanned aerial vehicles, acting as an independent agent, initializes its own Q table, its estimate of the optimal prior action distribution, the mutual information penalty coefficient, and the state-action pair visit counts;
step 2, in the current time slot, each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot, and receives the reward fed back by the environment after transmission is completed;
step 3, each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other unmanned aerial vehicle pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm;
step 4, each pair of unmanned aerial vehicles updates its own Q table and parameters according to the update rules of the mutual-information-regularized soft Q-learning algorithm;
and step 5, when the maximum number of steps of the training round is reached, ending the current round, starting the next round, re-inputting the unmanned aerial vehicle network environment, and repeating steps 2 to 4.
2. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 1, wherein the continuous training time is discretized into time slots and a positive integer j ∈ {1, 2, ...} denotes the j-th slot; suppose there are M unmanned aerial vehicle pairs and N jammers in the network, denoted by the sets {1, 2, ..., M} and {1, 2, ..., N}, respectively.
3. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 2, wherein the unmanned aerial vehicle network environment input in step 1 comprises:
(1) network model: the unmanned aerial vehicle pairs and the jammers move according to a Markov random mobility model, and the distance between the receiver and the transmitter in each pair of unmanned aerial vehicles is bounded;
(2) channel model: the system has a limited number of sub-bands, and the channel power gain consists of path loss and fast fading, where the path loss considers only the line-of-sight case and the fast fading is Rayleigh fading;
(3) wireless transmission model: when the actual transmission rate is less than or equal to the achievable rate of the selected channel, the throughput is the number of bits transmitted within the transmission time of the slot; otherwise, the throughput is 0;
(4) interference model: the jammers perform single-tone swept-frequency jamming, the channels jammed by different jammers do not overlap, and the set of channels swept by the jammers is the same as the set of channels available to the unmanned aerial vehicles.
4. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 2 or 3, wherein in step 2 each pair of unmanned aerial vehicles selects a transmission channel according to the action generated in the previous time slot and receives the reward fed back by the environment after transmission is completed, specifically:
(1) action of an unmanned aerial vehicle pair
the action of each pair of unmanned aerial vehicles consists of two parts: the first part selects the pair's own transmission channel for the next time slot, and the second part predicts the transmission channels that the other pairs will select for the next time slot; the action of the m-th pair at time slot j is expressed as
a_m^j = (f_m^(j+1), f̂_(-m)^(j+1)),
wherein f_m^(j+1) denotes the transmission channel of the m-th pair at time slot j+1, and f̂_(-m)^(j+1) = (f̂_(m')^(j+1), m' ≠ m) is the vector of transmission channels that the m-th pair predicts the other pairs will use at time slot j+1; in fact, since each pair can only control its own transmission channel for the next slot, the transmission channel vector used by all pairs at slot j+1 is represented as f^(j+1) = (f_1^(j+1), ..., f_M^(j+1));
(2) system reward
to maximize the throughput of all unmanned aerial vehicle pairs, the system reward is set to the total normalized throughput of all pairs, i.e., the reward r_m^j of the m-th pair at time slot j is expressed as
r_m^j = ( Σ_{m'=1}^{M} T_{m'}^j ) / (C_trans · t_r),
wherein T_m^j is the throughput of the m-th pair at time slot j, C_trans is the actual transmission rate of each pair, and t_r is the transmission time within each slot.
5. The multi-agent reinforcement learning-based unmanned aerial vehicle network cooperative fast frequency hopping method according to claim 4, wherein in step 3 each pair of unmanned aerial vehicles observes the current state of the environment, exchanges the Q values of each action in the current state with the other pairs to obtain the global Q values, and then generates its action according to the behavior policy of the mutual-information-regularized soft Q-learning algorithm, specifically:
(1) state of an unmanned aerial vehicle pair
the state of each pair includes the channels currently jammed and the transmission channel vector used by all pairs in the current time slot, so the state of the m-th pair at time slot j is expressed as
s_m^j = (o^j, f^j),
wherein o^j denotes the jammed channels observed by each pair at time slot j and f^j is the transmission channel vector used by all pairs at slot j;
the false-alarm/missed-detection probabilities of the observation process are ignored, and it is assumed that each pair can accurately observe which channels are currently jammed, so the states of all unmanned aerial vehicle pairs are identical in every time slot;
(2) generating the behavior policy
when generating the behavior policy, the mutual-information-regularized soft Q-learning algorithm adopts a method similar to an ε-greedy strategy and adjusts the balance between exploration and exploitation through the estimate ρ(a) of the optimal prior action distribution and a dynamically changing mutual information penalty coefficient β;
during exploration, the agent samples the action for the next time slot from its current estimate of the optimal prior action distribution, in which different actions have different probabilities; during exploitation, the agent directly selects the action with the highest probability, and this probability depends not only on the Q value but also on the current estimate of the optimal prior action distribution; therefore, the behavior policy of the m-th pair at time slot j is
a_m^j = { an action sampled from ρ_m^j(a), if x < ε;  a_m^(j,*), otherwise },
wherein x is a random number uniformly distributed over [0, 1], ε is the greedy factor, and the current optimal action during exploitation is
a_m^(j,*) = argmax_a ρ_m^j(a) · exp(β_m^j · Q_m^j(s_m^j, a)).
6. the unmanned aerial vehicle network cooperative fast frequency hopping method based on multi-agent reinforcement learning as claimed in claim 1, wherein step 4 each pair of unmanned aerial vehicles updates its own Q table and various parameters according to an update mode in a mutual information regularization soft Q-learning algorithm, specifically:
(1) updating an estimate of optimal a priori motion distribution ρ (a)
Suppose that
Figure FDA0003122215450000036
Is the policy generated by the mth drone pair in slot j,
Figure FDA0003122215450000037
indicating that the mth drone pair is generated in time slot j-1
Figure FDA0003122215450000038
Then estimating the current distribution of the optimal prior action; because the mth unmanned aerial vehicle pair is according to the motion vector when the time slot j
Figure FDA0003122215450000041
The channel selected for itself, and thus the current estimate of the optimal prior motion distribution
Figure FDA0003122215450000042
The update equation of (2) is as follows:
Figure FDA0003122215450000043
wherein alpha isρIs the learning rate, and
Figure FDA0003122215450000044
is uniformly distributed;
(2) updating the coefficient β of the mutual information penalty term
let β_m^j be the mutual information penalty coefficient of the m-th pair at time slot j; it is updated recursively at every time slot, wherein c is a positive constant in the update formula;
(3) updating the Q table
the Q table update uses the estimate ρ(a) of the optimal prior action distribution and the mutual information penalty coefficient β; at time slot j the m-th pair updates its Q table with a temporal-difference rule whose target combines the received reward with the soft Q value of the new state, where the soft Q value is computed from the Q values weighted by ρ(a) and β, and γ is the discount factor; the learning rate α_m^j of the m-th pair at time slot j decreases as the number of occurrences n_m^j(s_m^j, a_m^j) of its state-action pair grows, wherein ω is a positive constant.
CN202110680187.4A 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning Active CN113572548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110680187.4A CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110680187.4A CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN113572548A true CN113572548A (en) 2021-10-29
CN113572548B CN113572548B (en) 2023-07-07

Family

ID=78162317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110680187.4A Active CN113572548B (en) 2021-06-18 2021-06-18 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113572548B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024170A1 (en) * 2018-08-01 2020-02-06 东莞理工学院 Nash equilibrium strategy and social network consensus evolution model in continuous action space
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Haichao, Wang Jinlong, Ding Guoru, Chen Jin: "Intelligent cooperative anti-jamming technology in space-air-ground integrated networks", Journal of Command and Control *

Also Published As

Publication number Publication date
CN113572548B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
Wang et al. A survey on applications of model-free strategy learning in cognitive wireless networks
Lei et al. Deep reinforcement learning-based spectrum allocation in integrated access and backhaul networks
Shi et al. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach
Li Multi-agent Q-learning of channel selection in multi-user cognitive radio systems: A two by two case
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN113382381B (en) Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
Du et al. Multi-agent reinforcement learning for dynamic resource management in 6G in-X subnetworks
Guan et al. User association and power allocation for UAV-assisted networks: A distributed reinforcement learning approach
Albinsaid et al. Multi-agent reinforcement learning-based distributed dynamic spectrum access
Qin et al. Deep reinforcement learning based resource allocation and trajectory planning in integrated sensing and communications UAV network
Xu et al. Voting-based multiagent reinforcement learning for intelligent IoT
Ghavimi et al. Energy-efficient uav communications with interference management: Deep learning framework
Zhou et al. Multi-agent few-shot meta reinforcement learning for trajectory design and channel selection in UAV-assisted networks
Sande et al. Access and radio resource management for IAB networks using deep reinforcement learning
Wang et al. Intelligent resource allocation in UAV-enabled mobile edge computing networks
Wu et al. AoI minimization for UAV-to-device underlay communication by multi-agent deep reinforcement learning
Iturria-Rivera et al. Cooperate or not Cooperate: Transfer Learning with Multi-Armed Bandit for Spatial Reuse in Wi-Fi
Cao et al. Deep reinforcement learning for user access control in UAV networks
Huang et al. Delay-Oriented Knowledge-Driven Resource Allocation in SAGIN-Based Vehicular Networks
Wang et al. Joint spectrum access and power control in air-air communications-a deep reinforcement learning based approach
Zhang et al. Machine learning driven UAV-assisted edge computing
Gong et al. Distributed DRL-based resource allocation for multicast D2D communications
Huang et al. Fast spectrum sharing in vehicular networks: A meta reinforcement learning approach
Li et al. A Q-learning-based channel selection and data scheduling approach for high-frequency communications in jamming environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Lin Yan

Inventor after: Peng Nuoheng

Inventor after: Zhang Yijin

Inventor after: Li Jun

Inventor before: Peng Nuoheng

Inventor before: Lin Yan

Inventor before: Zhang Yijin

Inventor before: Li Jun

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant