CN115454141A - Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information

Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information

Info

Publication number
CN115454141A
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, time slot, cluster, value
Prior art date
Legal status: Pending
Application number
CN202211261459.8A
Other languages
Chinese (zh)
Inventor
刘梦泽
单雯
卢其然
林艳
张一晋
邹骏
吴志娟
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202211261459.8A priority Critical patent/CN115454141A/en
Publication of CN115454141A publication Critical patent/CN115454141A/en

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) cluster multi-agent multi-domain anti-interference method based on partially observable information. Using the partially observable environment information of each agent, historical experience data are retained through a long short-term memory (LSTM) network and fed into each agent's deep recurrent Q-network (DRQN) to fit the action-value function; the channel and power corresponding to the maximum output Q value are selected with an ε-greedy algorithm; each agent's DRQN is trained independently and continuously to update the Q-value distribution; and finally the optimal UAV channel and transmit-power decision that minimizes communication transmission energy consumption under an unknown interference scenario is learned. For a UAV cluster network under either sweep-frequency interference or Markov interference, effective multi-agent anti-interference communication is achieved in both the spectrum domain and the power domain by using the historical experience data of partially observable information. Compared with a baseline scheme based on multi-agent deep Q-learning, the proposed scheme reduces the long-term communication transmission energy consumption of the UAV cluster network more efficiently when the environment information is only partially observable.

Description

Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information.
Background
In recent years, with the rapid development of radio technology, the advantages of unmanned aerial vehicle (UAV) communication systems have become increasingly prominent, and UAVs are widely used in emergency networks to relieve terminal demands on the communication system. UAV cluster network anti-interference technology is an important technology for protecting UAV communication from interference threats, and frequency-hopping anti-interference is one of the most common anti-interference techniques. Because traditional frequency-hopping anti-interference cannot cope with unknown, highly dynamic, and complex interference environments, frequency-hopping anti-interference based on reinforcement learning has become a research hotspot for UAV communication networks in recent years.
Most previous studies adopt the Q-Learning (QL) algorithm, which only applies to low-dimensional discrete action spaces; when the action space is large, it suffers from the curse of dimensionality. To address this problem, Shangxing Wang et al. proposed a channel selection algorithm based on online Deep Q-Network (DQN) learning, which effectively improves the anti-interference performance of UAV communication networks in complex environments. Fuqiang Yao and Luliang Jia established a multi-agent Markov decision process (MDP) model for the UAV cluster communication system by means of a Markov game framework, which reduces communication overhead when applied to a practical communication environment. However, none of these anti-interference communication techniques considers partially observable communication environments.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information. Using the Deep Recurrent Q-Network (DRQN) algorithm, and building on a Dec-POMDP model established for the cluster-head UAVs, the method retains historical information data with a long short-term memory (LSTM) network to train the DRQN and thereby approximate the real environment model.
The technical solution for achieving the purpose of the invention is as follows: an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, comprising the following specific steps:
Step 1: initialize the algorithm parameters;
Step 2: each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster;
Step 3: each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;
Step 4: each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value;
Step 5: the observation, action, and reward of the current time slot and the observation of the next time slot of each cluster-head UAV are stored in its experience pool;
Step 6: when the experience pool holds enough samples, each cluster-head UAV randomly samples batches of historical information data from its experience pool to form time sequences, feeds them into its value network, and updates the value-network parameters by gradient descent;
Step 7: every fixed number of time slots, the value-network parameters are copied to form a new target network;
Step 8: repeat Steps 2 to 7 until 100 data transmissions are completed;
Step 9: repeat Step 8 until the total reward value of the UAV cluster network converges, at which point local training is finished.
Compared with the prior art, the invention has the following notable advantages: (1) a multi-agent multi-domain anti-interference framework suitable for partially observable environments is provided; with the goal of minimizing the long-term communication transmission energy consumption of the UAV cluster network, the multi-domain anti-interference decision process is modeled as a multi-agent partially observable Markov decision process, and the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, are used as historical experience to help each UAV cluster agent complete its channel selection and transmit-power allocation; (2) a multi-domain anti-interference algorithm based on a multi-agent deep recurrent Q-network is provided; historical information data are retained by a long short-term memory network and then fed into each agent's deep recurrent Q-network to fit the action-value function, the parameters of each deep recurrent Q-network are updated, and finally the optimal UAV channel and transmit-power decision that minimizes communication transmission energy consumption under an unknown interference scenario is obtained.
Drawings
Fig. 1 is a flowchart of the unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information.
Fig. 2 is a schematic diagram of the learning convergence of different algorithms in the sweep-frequency interference mode.
Fig. 3 is a schematic diagram of the learning convergence of different algorithms in the Markov interference mode.
Fig. 4 shows the environment-reward convergence value versus the number of channels for different algorithms in the sweep-frequency interference mode.
Fig. 5 shows the environment-reward convergence value versus the number of channels for different algorithms in the Markov interference mode.
Fig. 6 shows the environment-reward convergence value versus the number of jammers for different algorithms in the sweep-frequency interference mode.
Fig. 7 shows the environment-reward convergence value versus the number of jammers for different algorithms in the Markov interference mode.
Detailed Description
The invention relates to an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, which comprises the following specific steps:
Step 1: initialize the algorithm parameters;
Step 2: each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster;
Step 3: each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;
Step 4: each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value;
Step 5: the observation, action, and reward of the current time slot and the observation of the next time slot of each cluster-head UAV are stored in its experience pool;
Step 6: when the experience pool holds enough samples, each cluster-head UAV randomly samples batches of historical information data from its experience pool to form time sequences, feeds them into its value network, and updates the value-network parameters by gradient descent;
Step 7: every fixed number of time slots, the value-network parameters are copied to form a new target network;
Step 8: repeat Steps 2 to 7 until 100 data transmissions are completed;
Step 9: repeat Step 8 until the total reward value of the UAV cluster network converges, at which point local training is finished.
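For reference, the following is a minimal Python sketch of how Steps 2 to 9 fit together as a training loop. The environment and agent objects (env, agents) and their methods are hypothetical placeholders introduced only for illustration, not identifiers from the patent.

NUM_TRANSMISSIONS = 100  # one round = 100 data transmissions (Step 8)

def run_round(env, agents):
    """One communication round: Steps 2-7 repeated for each time slot."""
    obs = env.reset()                                      # Step 2: previous-slot channels/powers
    total_reward = 0.0
    for t in range(NUM_TRANSMISSIONS):
        actions = [ag.select_action(o) for ag, o in zip(agents, obs)]  # Step 3: eps-greedy selection
        next_obs, rewards = env.step(actions)                          # Step 4: energy overhead -> reward
        for ag, o, a, r, o2 in zip(agents, obs, actions, rewards, next_obs):
            ag.replay.store(o, a, r, o2)                               # Step 5: experience pool
            ag.train_step()                                            # Step 6: sample + gradient descent
            ag.maybe_sync_target(t)                                    # Step 7: copy value net into target net
        obs = next_obs
        total_reward += sum(rewards)
    return total_reward

def train(env, agents, max_rounds=2000, window=50, tol=1e-3):
    """Step 9: repeat rounds until the total reward value of the cluster network converges."""
    history = []
    for _ in range(max_rounds):
        history.append(run_round(env, agents))
        if len(history) > window and abs(history[-1] - history[-window]) < tol:
            break
    return history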
Further, the algorithm parameters in Step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value-network parameter w, and target-network parameter w'.
Further, in Step 2, each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster. The details are as follows:
The communication environment in the invention imitates the real environment as closely as possible; in most real environments, an agent cannot observe all state information because of noise and interference. The UAV anti-interference decision problem is therefore modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
The system is modeled as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. D = {1, …, N} is defined as the set of N agents. The current observation of cluster-head UAV n in time slot t+1 is defined as o_n^{t+1} = {(c_{n,i}^{t+1}, p_{n,i}^{t+1})}_i, where c_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for cluster member i in time slot t+1; the joint observation is o^{t+1} = (o_1^{t+1}, …, o_N^{t+1}). The action of cluster-head UAV n in time slot t is defined as a_n^t = {(c_{n,i}^t, p_{n,i}^t)}_i, where c_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for cluster member i; the joint action is a^t = (a_1^t, …, a_N^t). The joint state set S is all environment state information, while the joint observation set O is the partial information observable by the N agents, so O can be regarded as a subset of S. Finally, r_n^t is defined as the reward value of cluster-head UAV n in time slot t.
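A minimal Python sketch of these per-agent quantities is given below; the container names (MemberDecision, Observation, Action, Experience) are illustrative assumptions, not identifiers from the patent.

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class MemberDecision:
    channel: int        # frequency-hopping channel index c_{n,i}^t
    power_dbm: float    # selected transmit power p_{n,i}^t

# Observation o_n^{t+1} of cluster-head UAV n: the (channel, power) each cluster
# member used in the previous time slot, keyed by member id.
Observation = Dict[int, MemberDecision]
# Action a_n^t of cluster-head UAV n: the (channel, power) it assigns to each member.
Action = Dict[int, MemberDecision]
# One experience tuple stored in the pool in Step 5: (o_t, a_t, r_t, o_{t+1}).
Experience = Tuple[Observation, Action, float, Observation]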
Further, in Step 3, each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members. The details are as follows:
Step 3-1: the observation of each cluster-head UAV is used as the input of its value network, and the Q value of each action is the output. The Q value of cluster-head UAV n executing action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative discounted future reward from the beginning of time slot t+1:
Q_n(o_n^t, a_n^t) = E[ Σ_{k≥1} γ^{k-1} r_n^{t+k} | s_t, a_n^t ],
where s_t is the environment state information of time slot t and the expectation is taken over the state-transition probabilities P(s_{t+1} | s_t, a_n^t), i.e., the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t.
Step 3-2: the action is selected according to the ε-greedy algorithm as follows: a random number p between 0 and 1 is drawn; if p < ε, a random action is selected from the action space, and otherwise the action with the maximum Q value is selected, a_n^t = argmax_a Q_n(o_n^t, a; h_n^t, w), where ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w is the value-network parameter. In this network, the output is related not only to the input but also to the hidden-layer state h_n^t of time slot t, which stores the past network state of cluster-head UAV n and therefore contains historical information. The hidden-layer state is zero at the beginning of a round, i.e., it contains no historical information. As the round proceeds it is updated iteratively: the hidden state generated by the network in time slot t is used as the hidden-layer state of time slot t+1, thereby influencing the output of the value network in time slot t+1, and so on step by step.
This strategy selects a random action in the action space with probability ε and thus avoids falling into a local optimum. ε is the exploration probability and 1−ε is the exploitation probability (the probability of selecting the current best strategy); the larger ε is, the smaller the exploitation probability. At the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.
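A minimal Python sketch of this selection rule, together with the exploration-probability decay ε = max{0.01, θ^x} used in the embodiment, is shown below; q_values stands for the DRQN outputs over all (channel, power) actions and is an assumed input.

import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index, otherwise the argmax action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(round_idx, theta=0.998, floor=0.01):
    """Exploration probability for round x: eps = max(0.01, theta ** x)."""
    return max(floor, theta ** round_idx)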
Further, in Step 4, each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value. The details are as follows:
Let p_{n,i}^t and p_j^t denote the transmit powers of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m (with k ≠ i when m = n). G_U and G_J are the antenna gains of the UAVs and of the jammer, d denotes the Euclidean distance between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, ρ is the noise figure of the UAV, σ² is the mean-square value of the environment noise, h denotes the fast fading between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over an additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with its real part as a Gaussian process of mean 0 and variance ξ² and its imaginary part as an independent, identically distributed Gaussian process of mean 0 and variance ξ², so the channel fast fading is written h = a + bi, with a the real part and b the imaginary part. The energy overhead of cluster-head UAV n in time slot t is then defined by the corresponding equations of the original filing, in which β = 1 when cluster member i of cluster-head UAV n is in the same channel as jammer j and β = 0 otherwise, and α = 1 when cluster member i of cluster-head UAV n is in the same channel as cluster member k of cluster-head UAV m and α = 0 otherwise. The total environment reward value of time slot t is obtained from these energy overheads. The practical physical meaning of the energy overhead is the energy consumed by cluster-head UAV n and all its member UAVs to complete one data transmission.
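As an illustration only, the following Python sketch computes a transmission energy from an SINR-based achievable rate (energy = transmit power × transmission time, with time = data size / rate). The exact expressions are given by the equations of the original filing; the distance-attenuation model and the negative-energy reward used here are assumptions.

import math

def achievable_rate(p_tx, g_tx, fading_gain, dist, bandwidth_hz,
                    noise_fig, noise_var, interference):
    """Illustrative Shannon rate: received power over (noise + interference)."""
    received = p_tx * g_tx * fading_gain / (dist ** 2)   # simple distance attenuation (assumption)
    sinr = received / (noise_fig * noise_var + interference)
    return bandwidth_hz * math.log2(1.0 + sinr)

def transmission_energy(p_tx, rate_bps, data_bits):
    """Energy for one transmission: power multiplied by the time needed at the achievable rate."""
    return p_tx * (data_bits / rate_bps)

def slot_reward(energies):
    """Assumed reward shaping: a lower total energy overhead gives a higher reward."""
    return -sum(energies)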
Further, in Step 5, the observation, action, and reward of the current time slot and the observation of the next time slot of each cluster-head UAV are stored in its experience pool. The details are as follows:
After cluster-head UAV n selects, according to a_n^t, the frequency-hopping channels and transmit powers of its member UAVs in time slot t, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t in state s_t is computed with the reward-value formula, and the observation o_n^{t+1} is obtained. The tuple (o_n^t, a_n^t, r_n^t, o_n^{t+1}) generated in the current time slot t is stored in the experience pool as historical experience data.
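A minimal Python sketch of such an experience pool is shown below; the capacity default follows the embodiment (μ = 200), while the class and method names are illustrative assumptions.

import random
from collections import deque

class ExperiencePool:
    """Fixed-size pool of (o_t, a_t, r_t, o_{t+1}) tuples for one cluster-head agent."""

    def __init__(self, capacity=200):                 # mu = 200 in the embodiment
        self.buffer = deque(maxlen=capacity)

    def store(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def sample_sequences(self, batch_size, seq_len):
        """Sample contiguous slices so the LSTM sees ordered history (Step 6)."""
        assert len(self.buffer) > seq_len, "not enough samples yet"
        data = list(self.buffer)
        starts = [random.randrange(len(data) - seq_len) for _ in range(batch_size)]
        return [data[s:s + seq_len] for s in starts]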
Further, in Step 6, when the experience pool holds enough samples, each agent randomly samples batches of historical information data from its experience pool to form time sequences, feeds them into its value network, and updates the value-network parameters by gradient descent. The details are as follows:
The observation o_n^t of time slot t is the input of the neural network of cluster-head UAV n, and the Q value of each action in time slot t is the output. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value-network parameters and w' the target-network parameters, and the target-network parameters w' are updated once every certain number of rounds in Step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences, each of which is a complete communication round; within each sequence a time slot is randomly chosen and several consecutive steps are taken as training samples. At sample time slot t, the value network computes the action-value function Q_n(o_n^t, a_n^t; h_n^t, w) of cluster-head UAV n in time slot t as the estimated Q value, and the target network computes the action-value function of cluster-head UAV n in time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action-value function is calculated as
y_n^t = r_n^t + γ max_a Q_n(o_n^{t+1}, a; h_n^{t+1}, w').
Step 6-2: substituting the true Q value and the estimated Q value into the loss, the value-network parameter w is updated so as to gradually reduce (y_n^t − Q_n(o_n^t, a_n^t; h_n^t, w))²; that is, the Q value computed by the value network is brought closer to the true Q value by gradient descent. Before each agent trains its neural network, the hidden-layer state must be set to zero, and the hidden-layer states of the subsequent steps are generated by the network iterations.
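A minimal PyTorch-style sketch of this update is given below; it assumes value_net and target_net are LSTM-based modules that map an observation sequence to per-action Q values (as in the network sketch given further below), and the tensor shapes and helper names are illustrative assumptions.

import torch
import torch.nn.functional as F

def drqn_train_step(value_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step of Step 6: fit Q(o_t, a_t) toward y_t = r_t + gamma * max_a Q'(o_{t+1}, a).
    batch: obs [B, L, obs_dim], actions [B, L] (long), rewards [B, L], next_obs [B, L, obs_dim]."""
    obs, actions, rewards, next_obs = batch

    # Hidden state starts at zero before the sampled sub-sequence is replayed.
    q_seq, _ = value_net(obs, hidden=None)                         # [B, L, num_actions]
    q_taken = q_seq.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # estimated Q of the taken actions

    with torch.no_grad():                                          # target network provides the "true" Q
        q_next, _ = target_net(next_obs, hidden=None)
        target = rewards + gamma * q_next.max(dim=-1).values

    loss = F.mse_loss(q_taken, target)                             # squared error, minimized by gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(value_net, target_net):
    """Step 7: copy the value-network parameters into the target network (w' <- w)."""
    target_net.load_state_dict(value_net.state_dict())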
Further, in Step 7, every certain number of time slots the parameters of the value network are copied to form a new target network, i.e., w' ← w.
The invention is further described in detail below with reference to the drawings and specific embodiments.
Examples
In this embodiment, the cluster-head UAVs and the jammer complete 100 movements and channel and transmit-power selections in one round, i.e., one communication task; within a round, each movement, channel selection, and transmit-power selection made by the cluster-head UAVs and the jammer constitutes one time slot.
With reference to Fig. 1, the present embodiment provides an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, which includes the following specific steps:
Step 1: the algorithm parameters are initialized.
The algorithm parameters comprise the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value-network parameter w, and target-network parameter w'.
Step 2: each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster. The details are as follows:
The communication environment in the invention imitates the real environment as closely as possible; in most real environments, an agent cannot observe all state information because of noise and interference. The UAV anti-interference decision problem is therefore modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
The system is modeled as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. D = {1, …, N} is defined as the set of N agents. The current observation of cluster-head UAV n in time slot t+1 is defined as o_n^{t+1} = {(c_{n,i}^{t+1}, p_{n,i}^{t+1})}_i, where c_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for cluster member i in time slot t+1; the joint observation is o^{t+1} = (o_1^{t+1}, …, o_N^{t+1}). The action of cluster-head UAV n in time slot t is defined as a_n^t = {(c_{n,i}^t, p_{n,i}^t)}_i, where c_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for cluster member i; the joint action is a^t = (a_1^t, …, a_N^t). The joint state set S is all environment state information, while the joint observation set O is the partial information observable by the N agents, so O can be regarded as a subset of S. Finally, r_n^t is defined as the reward value of cluster-head UAV n in time slot t.
Step 3: each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members. The details are as follows:
step 3-1: the observed value of each cluster head unmanned aerial vehicle is used as the input of the value network, the Q value corresponding to each action is used as the output of the value network, wherein, the time slot t is that the cluster head unmanned aerial vehicle n observes
Figure BDA0003891111010000077
Lower execution of actions
Figure BDA0003891111010000078
Q value of (c) is the expectation of the cumulative future prize value at the beginning of time slot t +1, as follows:
Figure BDA0003891111010000079
wherein s is t For the time slot t the environmental status information,
Figure BDA00038911110100000710
taking action for time slot t cluster head unmanned aerial vehicle n
Figure BDA00038911110100000711
Environmental state s t From transition to s t+1 The probability of (c).
Step 3-2: the actions are selected according to the epsilon-greedy algorithm in the following specific way:
Figure BDA00038911110100000712
wherein p is a random number between 0 and 1, epsilon (0 < epsilon < 1) is an exploration probability,
Figure BDA00038911110100000713
the hidden layer state of the time slot t cluster head unmanned aerial vehicle n neural network is shown, and w is a value network parameter. In this network, the output is not only related to the input, but also to the slot t hidden layer state
Figure BDA00038911110100000714
In connection with this, the first and second electrodes,
Figure BDA00038911110100000715
the cluster head unmanned aerial vehicle n network state storage device is used for storing the past network state of the cluster head unmanned aerial vehicle n and comprises historical information. Hidden layer state
Figure BDA00038911110100000716
Figure BDA00038911110100000716
0 at the beginning of the round, i.e., contains no historical information. As the round is made, it is,
Figure BDA00038911110100000717
will be iteratively updated, generated by the time-slot t network
Figure BDA00038911110100000718
And the state of the hidden layer is used as the time slot t +1, so that the output of the time slot t +1 value network is influenced, and iteration is performed step by step.
The strategy randomly selects an action in the action space with the probability of epsilon and avoids falling into local optimum. ε is the exploration probability and 1- ε is the utilization (selection of the current best strategy) probability. The larger the value of epsilon, the smaller the probability of utilization. In the initial stage of algorithm execution, because the state action space is large, the exploration probability should be a large value, and gradually approaches the optimal strategy along with the increase of the iteration times, and the utilization probability should be increased accordingly. The probability epsilon is updated as follows:
ε=max{0.01,θ x }
where x is the number of rounds currently in progress.
Step 4: each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value. The details are as follows:
Let p_{n,i}^t and p_j^t denote the transmit powers of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m (with k ≠ i when m = n). G_U and G_J are the antenna gains of the UAVs and of the jammer, d denotes the Euclidean distance between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, ρ is the noise figure of the UAV, σ² is the mean-square value of the environment noise, h denotes the fast fading between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over an additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with its real part as a Gaussian process of mean 0 and variance ξ² and its imaginary part as an independent, identically distributed Gaussian process of mean 0 and variance ξ², so the channel fast fading is written h = a + bi, with a the real part and b the imaginary part. The energy overhead of cluster-head UAV n in time slot t is then defined by the corresponding equations of the original filing, in which β = 1 when cluster member i of cluster-head UAV n is in the same channel as jammer j and β = 0 otherwise, and α = 1 when cluster member i of cluster-head UAV n is in the same channel as cluster member k of cluster-head UAV m and α = 0 otherwise. The total environment reward value of time slot t is obtained from these energy overheads. The practical physical meaning of the energy overhead is the energy consumed by cluster-head UAV n and all its member UAVs to complete one data transmission.
Step 5: the observation, action, and reward of the current time slot and the observation of the next time slot of each cluster-head UAV are stored in its experience pool. The details are as follows:
when the cluster head unmanned aerial vehicle n is in the time slot t according to
Figure BDA00038911110100000813
After the cluster member unmanned aerial vehicle frequency hopping channel and the transmitting power are selected, the environment state is determined by s t Jump to s t+1 Calculated at s by a reward value calculation formula t Down selection action
Figure BDA0003891111010000091
Awarding of prizes
Figure BDA0003891111010000092
And observation of
Figure BDA0003891111010000093
Generated by current time slot t
Figure BDA0003891111010000094
And storing the historical experience data into an experience pool.
Step 6: when the experience pool holds enough samples, each agent randomly samples batches of historical information data from its experience pool to form time sequences, feeds them into its value network, and updates the value-network parameters by gradient descent. The details are as follows:
the neural network of the cluster head unmanned aerial vehicle n is composed of 3 neural units, and the first neural unit is a Long Short-Term Memory unit (LSTM). The LSTM architecture is a special recurrent neural network architecture that can use historical information to predict and process sequence data. The LSTM is composed of a forgetting gate, an input gate and an output gate, wherein control parameters in the forgetting gate determine historical information needing to be discarded, the input gate determines new information to be added, and the output gate determines data to be output from the LSTM unit to a next unit.
Forget the door:
Figure BDA0003891111010000095
an input gate:
Figure BDA0003891111010000096
Figure BDA0003891111010000097
Figure BDA0003891111010000098
an output gate:
Figure BDA0003891111010000099
wherein, W i,f,c,o And b i,f,c,o The input weight and the offset of the gate,
Figure BDA00038911110100000910
is the input of the slot tlstm unit.
The LSTM structure uses three gates to determine the degree of retention for the incoming data sequence, allowing prediction of the future from historical information. In the anti-interference scene of the invention, each agent only exchanges reward information, so that the action information of other agents cannot be determined. The LSTM structure helps each intelligent agent to pre-estimate the actions of other intelligent agents by using the experience of historical information, and a better unmanned aerial vehicle cluster network communication anti-interference strategy can be obtained.
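A minimal PyTorch sketch of such a value network (an LSTM followed by fully connected layers that output one Q value per (channel, power) action) is given below; the layer sizes and class name are illustrative assumptions, not values from the patent.

import torch.nn as nn

class DRQNet(nn.Module):
    """DRQN value network of one cluster-head agent: LSTM for history, then a Q-value head."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)  # retains historical information
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, obs_seq, hidden=None):
        # obs_seq: [batch, seq_len, obs_dim]; hidden=None zeroes the state at the start of a round.
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden        # Q values: [batch, seq_len, num_actions]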
The observation o_n^t of time slot t is the input of the neural network of cluster-head UAV n, and the Q value of each action in time slot t is the output. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value-network parameters and w' the target-network parameters, and the target-network parameters w' are updated once every certain number of rounds in Step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences, each of which is a complete communication round; within each sequence a time slot is randomly chosen and several consecutive steps are taken as training samples. At sample time slot t, the value network computes the action-value function Q_n(o_n^t, a_n^t; h_n^t, w) of cluster-head UAV n in time slot t as the estimated Q value, and the target network computes the action-value function of cluster-head UAV n in time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action-value function is calculated as
y_n^t = r_n^t + γ max_a Q_n(o_n^{t+1}, a; h_n^{t+1}, w').
Step 6-2: substituting the true Q value and the estimated Q value into the loss, the value-network parameter w is updated so as to gradually reduce (y_n^t − Q_n(o_n^t, a_n^t; h_n^t, w))²; that is, the Q value computed by the value network is brought closer to the true Q value by gradient descent. Before each agent trains its neural network, the hidden-layer state must be set to zero, and the hidden-layer states of the subsequent steps are generated by the network iterations.
The gradient descent process uses Adaptive Moment Estimation (ADAM). During the value-network parameter update, only a batch of historical experience data is sampled for training in each iteration; since different data sets lead to different loss functions, using ADAM reduces the probability of converging to a local optimum. ADAM dynamically adjusts the learning rate of each parameter according to first- and second-moment estimates of the gradient of the loss with respect to that parameter. ADAM is based on gradient descent, but the learning step of each parameter in every iteration is kept within a certain range, so a large gradient does not cause a large learning step and the parameter values remain more stable. The ADAM algorithm proceeds as follows:
Assuming that in time slot t the first derivative of the objective function with respect to the parameters is g_t, the exponential moving averages are first computed:
m_t = λ_1 m_{t−1} + (1 − λ_1) g_t,
ω_t = λ_2 ω_{t−1} + (1 − λ_2) g_t²,
then the two bias-correction terms are calculated:
m̂_t = m_t / (1 − λ_1^t),
ω̂_t = ω_t / (1 − λ_2^t),
and the resulting gradient update is
w_t = w_{t−1} − η · m̂_t / (√ω̂_t + τ),
which finally returns the parameters that minimize the error function. In the algorithm, m_t is the exponential moving average of the gradient and ω_t that of the squared gradient, the parameters λ_1 and λ_2 control the exponential decay rates of these moving averages, η is the learning step size, and τ is a constant, typically 10^(-8).
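In PyTorch this corresponds to the built-in Adam optimizer; a minimal sketch using the DRQNet from the earlier example is shown below, where betas and eps map to λ_1, λ_2, and τ (the specific values are common defaults assumed here, apart from the learning rate taken from the embodiment).

import torch

value_net = DRQNet(obs_dim=8, num_actions=16)    # DRQNet from the sketch above; sizes are assumptions
optimizer = torch.optim.Adam(value_net.parameters(),
                             lr=0.002,            # learning rate delta = 0.002 in the embodiment
                             betas=(0.9, 0.999),  # lambda_1, lambda_2
                             eps=1e-8)            # tau, typically 1e-8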
Step 7: every fixed number of time slots, the value-network parameters are copied to form a new target network.
Step 8: Steps 2 to 7 are repeated until 100 data transmissions are completed.
Step 9: Step 8 is repeated until the total reward value of the UAV cluster network converges, at which point local training is finished.
The method is implemented in Python. The total number of channels is C = 4, the number of UAVs is N = 9, and the number of jammers is J = 1. The transmit power selected by cluster-head UAV n for cluster member i in time slot t can be 27, 30, 33, or 36 dBm, and the transmit power of jammer j is typically 33 dBm. The UAV cluster is designed to complete one communication task, i.e., cluster-head UAV n and cluster-member UAV i are required to complete q communications, and anti-interference systems under two different interference modes are simulated. The learning rate is δ = 0.002, the discount factor γ = 0.99, the attenuation factor θ = 0.998, the experience pool size μ = 200, and the environment noise mean-square value σ² = −114 dBm. The Rician fading channel gain is modeled with its real part as a Gaussian process of mean 0 and variance ξ² and its imaginary part as an independent, identically distributed Gaussian process of mean 0 and variance ξ², so the channel fast fading is written h = a + bi, with a the real part and b the imaginary part.
The learning convergence in the sweep-frequency and Markov interference modes is shown in Figs. 2 and 3, respectively. The sweep-frequency jammer interferes with 1 channel at a time and uses a sweep step of 1 MHz. In the Markov interference mode, 4 interference states are set; the interference pattern is randomly generated at the start of the simulation, and the interference state of any time slot evolves according to a fixed state transition matrix (given in the original filing).
Figs. 2 and 3 show the convergence of the reward values of the random channel/power selection scheme and of the DQN-based and DRQN-based channel/power selection schemes under sweep-frequency and Markov interference, respectively. The DRQN-based scheme reaches a higher convergence reward value than the DQN-based scheme and converges more stably, because the DRQN contains a long short-term memory network: each agent can extract hidden information, such as the action-change and environment-change patterns of the other agents, from historical experience, so the network output is not determined by its own observation alone. The output of the DQN, in contrast, is determined entirely by its own observation, and once the environment or the decision rules of the other agents change, the whole network fluctuates. Comparing Fig. 2 with Fig. 3, the convergence behavior of the three channel/power selection schemes is roughly the same in both modes; under Markov interference the DRQN-based scheme improves performance by 34.6% over the DQN-based scheme and by 54.5% over the random scheme, and under sweep-frequency interference it improves performance by 38.4% over the DQN-based scheme and by 56% over the random scheme.
Figs. 4 and 5 show the average reward convergence value versus the number of channels for the three channel/power selection schemes under sweep-frequency and Markov interference, respectively. As the number of channels increases, the average reward convergence values of all three schemes improve, because more channels reduce co-channel interference and hence the energy overhead of UAV communication. The DRQN-based scheme shows the smallest change in the average reward convergence value, indicating that it is not sensitive to this environmental condition.
Figs. 6 and 7 show the reward convergence value versus the number of jammers for the three channel/power selection schemes under sweep-frequency and Markov interference, respectively. In both interference modes, as the number of jammers increases the environment keeps deteriorating and the average reward convergence values of all three schemes decrease; however, the average reward convergence value of the DRQN-based scheme is more stable, dropping by no more than 10%, which shows that the DRQN algorithm has better robustness.

Claims (8)

1. An unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, characterized by comprising the following specific steps:
Step 1: initialize the algorithm parameters;
Step 2: each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster;
Step 3: each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;
Step 4: each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value;
Step 5: the observation, action, and reward of the current time slot and the observation of the next time slot of each cluster-head UAV are stored in its experience pool;
Step 6: when the experience pool holds enough samples, each cluster-head UAV randomly samples batches of historical information data from its experience pool to form time sequences, feeds them into its value network, and updates the value-network parameters by gradient descent;
Step 7: every fixed number of time slots, the value-network parameters are copied to form a new target network;
Step 8: repeat Steps 2 to 7 until 100 data transmissions are completed;
Step 9: repeat Step 8 until the total reward value of the UAV cluster network converges, at which point local training is finished.
2. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein the algorithm parameters in Step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value-network parameter w, and target-network parameter w'.
3. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in Step 2 each cluster-head UAV obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member UAVs in its cluster, specifically as follows:
the communication environment imitates the real environment as closely as possible; in most real environments, an agent cannot observe all state information because of noise and interference, so the UAV anti-interference decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP);
the system is modeled as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function; D = {1, …, N} is defined as the set of N agents; the current observation of cluster-head UAV n in time slot t+1 is defined as o_n^{t+1} = {(c_{n,i}^{t+1}, p_{n,i}^{t+1})}_i, where c_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for cluster member i in time slot t+1, and the joint observation is o^{t+1} = (o_1^{t+1}, …, o_N^{t+1}); the action of cluster-head UAV n in time slot t is defined as a_n^t = {(c_{n,i}^t, p_{n,i}^t)}_i, where c_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for cluster member i, and the joint action is a^t = (a_1^t, …, a_N^t); the joint state set S is all environment state information and the joint observation set O is the partial information observable by the N agents, so O can be regarded as a subset of S; and r_n^t is defined as the reward value of cluster-head UAV n in time slot t.
4. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in Step 3 each cluster-head UAV uses an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members, specifically as follows:
Step 3-1: the observation of each cluster-head UAV is used as the input of its value network, and the Q value of each action is the output; the Q value of cluster-head UAV n executing action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative discounted future reward from the beginning of time slot t+1, Q_n(o_n^t, a_n^t) = E[ Σ_{k≥1} γ^{k-1} r_n^{t+k} | s_t, a_n^t ], where s_t is the environment state information of time slot t and the expectation is taken over the state-transition probabilities P(s_{t+1} | s_t, a_n^t), i.e., the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t;
Step 3-2: the action is selected according to the ε-greedy algorithm: a random number p between 0 and 1 is drawn; if p < ε, a random action is selected from the action space, and otherwise the action with the maximum Q value is selected, a_n^t = argmax_a Q_n(o_n^t, a; h_n^t, w), where ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w is the value-network parameter; in this network the output is related not only to the input but also to the hidden-layer state h_n^t of time slot t, which stores the past network state of cluster-head UAV n and therefore contains historical information; the hidden-layer state is zero at the beginning of a round, i.e., it contains no historical information, and as the round proceeds it is updated iteratively: the hidden state generated by the network in time slot t is used as the hidden-layer state of time slot t+1, thereby influencing the output of the value network in time slot t+1, and so on step by step;
this strategy selects a random action in the action space with probability ε and thus avoids falling into a local optimum; ε is the exploration probability and 1−ε is the exploitation probability (the probability of selecting the current best strategy), and the larger ε is, the smaller the exploitation probability; at the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value, and as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.
5. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in Step 4 each cluster-head UAV calculates the total energy overhead required for communication with its cluster members and obtains the corresponding environment reward value, specifically as follows:
let p_{n,i}^t and p_j^t denote the transmit powers of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m (with k ≠ i when m = n); G_U and G_J are the antenna gains of the UAVs and of the jammer, d denotes the Euclidean distance between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, ρ is the noise figure of the UAV, σ² is the mean-square value of the environment noise, h denotes the fast fading between cluster-head UAV n and its cluster member i, or between cluster-head UAV n and jammer j, in time slot t, B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over an additive white Gaussian noise channel in time slot t; the Rician fading channel gain is modeled with its real part as a Gaussian process of mean 0 and variance ξ² and its imaginary part as an independent, identically distributed Gaussian process of mean 0 and variance ξ², so the channel fast fading is written h = a + bi, with a the real part and b the imaginary part; the energy overhead of cluster-head UAV n in time slot t is defined by the corresponding equations, in which β = 1 when cluster member i of cluster-head UAV n is in the same channel as jammer j and β = 0 otherwise, and α = 1 when cluster member i of cluster-head UAV n is in the same channel as cluster member k of cluster-head UAV m and α = 0 otherwise; the total environment reward value of time slot t is obtained from these energy overheads; the practical physical meaning of the energy overhead is the energy consumed by cluster-head UAV n and all its member UAVs to complete one data transmission.
6. The unmanned aerial vehicle cluster multi-agent multi-domain anti-jamming method based on partially observable information as claimed in claim 1, wherein in step 5, the observed value of the current time slot, the action, the reward and the observed value of the next time slot of each cluster head unmanned aerial vehicle are stored in respective experience pools, specifically as follows:
After cluster head unmanned aerial vehicle n selects the frequency-hopping channel and the transmit power for its cluster member unmanned aerial vehicles according to its action $a_n^t$ at time slot t, the environment state jumps from $s_t$ to $s_{t+1}$; the reward value calculation formula gives the reward $r^t$ obtained by selecting action $a_n^t$ in state $s_t$, and the observation $o_n^{t+1}$ is obtained. The tuple $(o_n^t, a_n^t, r^t, o_n^{t+1})$ generated at the current time slot t is stored as historical experience data in the experience pool.
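As a concrete, hypothetical illustration of this storage step, each cluster head agent could keep a fixed-size experience pool and append one transition tuple per time slot; the class name ExperiencePool, its fields and the capacity value are assumptions, not terms from the claims.

import random
from collections import deque

class ExperiencePool:
    # Fixed-size replay memory for one cluster-head agent; the capacity is an assumption.
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)      # oldest transitions are discarded first

    def store(self, obs, action, reward, next_obs):
        # Store the tuple (o_t, a_t, r_t, o_{t+1}) generated at the current time slot.
        self.memory.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniform random sampling of historical experience data (used later in step 6).
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)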
7. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information according to claim 1, wherein in step 6, when the sample data in the experience pools is sufficient, each agent randomly samples several batches of historical experience data from its respective experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by a gradient descent method, specifically as follows:
The neural network of cluster head unmanned aerial vehicle n takes the observation $o_n^t$ of time slot t as input and outputs the Q value corresponding to each action at time slot t. In order to enhance the stability of the algorithm, the invention adopts a double-network structure: w denotes the value network parameters and w′ the target network parameters, and the target network parameters w′ are updated once every certain number of rounds in step 7.
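A minimal deep recurrent Q-network of the kind this claim describes, i.e. a recurrent (LSTM) layer whose hidden state retains historical experience, followed by a linear head that outputs one Q value per action, could be sketched in PyTorch as follows; the class name, layer sizes and default dimensions are assumptions for illustration.

import torch.nn as nn

class DRQN(nn.Module):
    # Deep recurrent Q-network: the LSTM hidden state carries historical
    # observations, and the linear head outputs one Q value per action.
    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); hidden defaults to zeros when None.
        out, hidden = self.lstm(obs_seq, hidden)
        return self.q_head(out), hidden       # Q values: (batch, seq_len, num_actions)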
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences, each of which is a complete communication round; a time slot is randomly chosen within each sequence, and several consecutive steps from it are taken as training samples. At the sample time slot t, the value network computes the action-value function $Q(o_n^t, a_n^t, h_n^t; w)$ of cluster head unmanned aerial vehicle n as the estimated Q value, and the target network computes the action-value function $Q(o_n^{t+1}, a, h_n^{t+1}; w')$ of cluster head unmanned aerial vehicle n at time slot t + 1, where $o_n^t$, $a_n^t$ and $h_n^t$ are the observation, action and hidden-layer state of cluster head unmanned aerial vehicle n at time slot t. The true value of the action-value function is calculated as
$$y_n^t = r^t + \gamma \max_{a} Q\bigl(o_n^{t+1}, a, h_n^{t+1}; w'\bigr),$$
where γ is the discount factor.
Step 6-2: the true Q value and the estimated Q value are substituted into the loss function
$$L(w) = \bigl(y_n^t - Q(o_n^t, a_n^t, h_n^t; w)\bigr)^2,$$
and the value network parameter w is updated by gradient descent so that L(w) is gradually reduced, i.e. the Q value computed by the value network approaches the true Q value. Before each agent trains its neural network, the hidden-layer state must be reset to zero; the hidden-layer states $h_n^t$ of the subsequent steps are generated by the network iterations.
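Purely as an illustrative sketch of steps 6-1 and 6-2 (reusing the hypothetical DRQN class above and assuming a discount factor value and an externally supplied optimizer, neither of which is specified in this claim), one gradient-descent update on the value network could look like:

import torch
import torch.nn.functional as F

def train_step(value_net, target_net, optimizer, batch, gamma=0.9):
    # batch: observation sequences o_t (B, L, obs_dim), actions a_t (B, L),
    # rewards r_t (B, L) and next observations o_{t+1} (B, L, obs_dim),
    # taken as consecutive steps from randomly chosen communication rounds.
    obs, actions, rewards, next_obs = batch
    q_all, _ = value_net(obs)                                      # hidden state starts at zero
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # estimated Q of chosen actions
    with torch.no_grad():                                          # the target network is not trained here
        next_q, _ = target_net(next_obs)
        target = rewards + gamma * next_q.max(dim=-1).values       # "true" Q value y_t
    loss = F.mse_loss(q_taken, target)                             # (y_t - Q)^2 averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                               # gradient descent on w
    return loss.item()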
8. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method according to claim 1, wherein in step 7, every fixed number of time slots the parameters of the value network are copied to form a new target network, i.e. w′ ← w.
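Under the same assumed PyTorch setup, the periodic hard update described in this claim could be as simple as the following; the interval value is an assumption.

def sync_target(value_net, target_net, step, interval=100):
    # Every `interval` training rounds, copy the value-network parameters into
    # the target network, i.e. w' <- w.
    if step % interval == 0:
        target_net.load_state_dict(value_net.state_dict())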
CN202211261459.8A 2022-10-14 2022-10-14 Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information Pending CN115454141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211261459.8A CN115454141A (en) 2022-10-14 2022-10-14 Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information

Publications (1)

Publication Number Publication Date
CN115454141A 2022-12-09

Family

ID=84311660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211261459.8A Pending CN115454141A (en) 2022-10-14 2022-10-14 Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information

Country Status (1)

Country Link
CN (1) CN115454141A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190011531A1 (en) * 2016-03-11 2019-01-10 Goertek Inc. Following method and device for unmanned aerial vehicle and wearable device
CN107179777A (en) * 2017-06-03 2017-09-19 复旦大学 Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system
US20210373552A1 (en) * 2018-11-06 2021-12-02 Battelle Energy Alliance, Llc Systems, devices, and methods for millimeter wave communication for unmanned aerial vehicles
CN113382381A (en) * 2021-05-30 2021-09-10 南京理工大学 Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Ziheng; MENG Chao: "Wireless network resource allocation algorithm based on deep reinforcement learning", Communications Technology, no. 08, 10 August 2020 (2020-08-10) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116131963A (en) * 2023-02-02 2023-05-16 广东工业大学 Fiber link multipath interference noise equalization method based on LSTM neural network
CN116432690A (en) * 2023-06-15 2023-07-14 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN116432690B (en) * 2023-06-15 2023-08-18 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN117675054B (en) * 2024-02-02 2024-04-23 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system

Similar Documents

Publication Publication Date Title
CN115454141A (en) Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN108777872B (en) Intelligent anti-interference method and intelligent anti-interference system based on deep Q neural network anti-interference model
CN111050330B (en) Mobile network self-optimization method, system, terminal and computer readable storage medium
CN114422056B (en) Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN113795049B (en) Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
CN114422363A (en) Unmanned aerial vehicle loaded RIS auxiliary communication system capacity optimization method and device
CN115103372B (en) Multi-user MIMO system user scheduling method based on deep reinforcement learning
CN111050413A (en) Unmanned aerial vehicle CSMA access method based on adaptive adjustment strategy
CN113423110A (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN115567148A (en) Intelligent interference method based on cooperative Q learning
CN113382060A (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Parras et al. An online learning algorithm to play discounted repeated games in wireless networks
CN116938314A (en) Unmanned aerial vehicle relay safety communication optimizing system under eavesdropping attack
CN114980254B (en) Dynamic multichannel access method and device based on duel deep cycle Q network
CN116866895A (en) Intelligent countering method based on neural virtual self-game
CN116866048A (en) Anti-interference zero-and Markov game model and maximum and minimum depth Q learning method
CN116165886A (en) Multi-sensor intelligent cooperative control method, device, equipment and medium
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
Evmorfos et al. Deep actor-critic for continuous 3D motion control in mobile relay beamforming networks
Zappone et al. Complexity-aware ANN-based energy efficiency maximization
Dong et al. Multi-agent adversarial attacks for multi-channel communications
CN116073856A (en) Intelligent frequency hopping anti-interference decision method based on depth deterministic strategy
CN113411099B (en) Double-change frequency hopping pattern intelligent decision method based on PPER-DQN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination