CN115454141A - Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information - Google Patents
Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information
- Publication number
- CN115454141A (publication number); CN202211261459.8A (application number)
- Authority
- CN
- China
- Prior art keywords
- unmanned aerial
- aerial vehicle
- time slot
- cluster
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information. Using the partially observable environment information of each agent, historical experience data are retained through a long short-term memory network and fed into the deep recurrent Q-network of each agent to fit the action-value function; an ε-greedy algorithm selects the channel and transmit power corresponding to the maximum output Q value; the deep recurrent Q-network of each agent is trained independently and continuously and its Q-value distribution is updated, so that each UAV finally learns the channel and transmit-power decision that minimizes communication transmission energy consumption under an unknown interference scene. For a UAV cluster network under swept-frequency interference and under Markov interference, effective multi-agent anti-interference communication is achieved in the spectrum domain and the power domain using historical experience built from partially observable information; compared with a baseline based on multi-agent deep Q-learning, the scheme reduces the long-term communication transmission energy consumption of the UAV cluster network more efficiently when the environment information is only partially observable.
Description
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information.
Background
In recent years, with the rapid development of radio technology, the advantages of unmanned aerial vehicle (UAV) communication systems have become increasingly prominent, and UAVs are widely used in emergency networks to relieve terminal demand on the communication system. UAV cluster network anti-interference technology is an important technology for protecting UAV communication from interference threats. Among such techniques, frequency-hopping anti-interference is one of the most common. Because traditional frequency-hopping anti-interference cannot cope with unknown, highly dynamic and complex interference environments, frequency-hopping anti-interference based on reinforcement learning has become a research hotspot for UAV communication networks in recent years.
Most previous studies adopt the Q-Learning (QL) algorithm, which only applies to low-dimensional discrete action spaces; when the action space is large, it suffers from the curse of dimensionality. To address this, Shangxing Wang et al. proposed a channel selection algorithm based on online Deep Q-Network (DQN) learning, which effectively improves the anti-interference performance of UAV communication networks in complex environments. Fuqiang Yao and Luliang Jia established a multi-agent Markov decision process (MDP) model for the UAV cluster communication system by means of a Markov game framework, reducing communication overhead when applied to a practical communication environment. However, these anti-interference communication techniques do not consider partially observable communication environments.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information. It uses the Deep Recurrent Q-Network (DRQN) algorithm: on the basis of a Dec-POMDP model established by the cluster head UAVs, a long short-term memory network retains historical information data used to train the DRQN, thereby approaching the true environment model.
The technical solution for achieving the purpose of the invention is as follows: an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, comprising the following specific steps:
Step 1: initialize algorithm parameters;
Step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot;
Step 3: each cluster head unmanned aerial vehicle adopts an ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster;
Step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value;
Step 5: store the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observed value of the next time slot, into the respective experience pools;
Step 6: when the experience pool holds enough samples, each cluster head unmanned aerial vehicle randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent;
Step 7: every fixed number of time slots, copy the parameters of the value network to form a new target network;
Step 8: repeat step 2 to step 7 until 100 data transmissions are finished;
Step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, finishing local training.
Compared with the prior art, the invention has the following remarkable advantages: (1) a multi-agent multi-domain anti-interference framework applicable to partially observable environments is provided; with the objective of minimizing the long-term communication transmission energy consumption of the UAV cluster network, the multi-domain anti-interference decision process is modeled as a decentralized partially observable Markov decision process, and the observed value, action and reward of the current time slot of each cluster head UAV, together with the observed value of the next time slot, are used as historical experience to help each UAV cluster agent complete its own channel selection and transmit-power allocation; (2) a multi-domain anti-interference algorithm based on a multi-agent deep recurrent Q-network is provided: historical information data are retained by a long short-term memory network and then input into the deep recurrent Q-network of each agent to fit the action-value function, the parameters of each deep recurrent Q-network are updated, and finally the optimal channel and transmit-power decision that minimizes communication transmission energy consumption under an unknown interference scene is obtained.
Drawings
Fig. 1 is a flowchart of the unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partial observable information.
Fig. 2 is a schematic diagram of learning convergence effects of different algorithms in a swept frequency interference mode.
Fig. 3 is a schematic diagram of learning convergence effects of different algorithms in the markov interference mode.
Fig. 4 is a graph of the convergence value of the environment reward of different algorithms versus the number of channels in the swept-frequency interference mode.
Fig. 5 is a graph of the convergence value of the environment reward of different algorithms versus the number of channels in the Markov interference mode.
Fig. 6 is a graph of the convergence value of the environment reward of different algorithms versus the number of jammers in the swept-frequency interference mode.
Fig. 7 is a graph of the convergence value of the environment reward of different algorithms versus the number of jammers in the Markov interference mode.
Detailed Description
The invention relates to an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, which comprises the following specific steps:
Step 1: initialize algorithm parameters;
Step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot;
Step 3: each cluster head unmanned aerial vehicle adopts an ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster;
Step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value;
Step 5: store the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observed value of the next time slot, into the respective experience pools;
Step 6: when the experience pool holds enough samples, each cluster head unmanned aerial vehicle randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent;
Step 7: every fixed number of time slots, copy the parameters of the value network to form a new target network;
Step 8: repeat step 2 to step 7 until 100 data transmissions are finished;
Step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, finishing local training.
Further, the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value network parameters w and target network parameters w'.
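A minimal initialization sketch of step 1 is given below; the container and variable names are illustrative assumptions, while the numeric values are those quoted later in the embodiment.

```python
# Illustrative hyperparameter initialization for step 1 (names are assumptions;
# values follow the embodiment described later in this document).
hyper = {
    "delta": 0.002,   # learning rate
    "epsilon": 1.0,   # exploration probability, decayed per round
    "gamma": 0.99,    # discount factor
    "mu": 200,        # experience pool size
    "theta": 0.998,   # attenuation factor used by the epsilon schedule
}
# The value network parameters w and the target network parameters w' start out
# identical; they are created when the DRQN is built in the later steps.
```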
Further, in step 2, each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot, specifically as follows:
The communication environment considered here imitates the real environment as closely as possible; in most real environments, an agent cannot observe all of the state information because of noise and interference. The UAV anti-interference decision problem is therefore modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
The system is modeled as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S the joint state set, A the joint action set, O the joint observation set and R the reward function. D = {1, …, N} is the set of N agents. The observation of cluster head UAV n at time slot t+1 is o_n^{t+1} = {(c_{n,i}^{t+1}, p_{n,i}^{t+1})}_i, where c_{n,i}^{t+1} is the channel of cluster member i of cluster head UAV n at slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster head UAV n for cluster member i at slot t+1; the joint observation set is O = {o_1^t, …, o_N^t}. The action of cluster head UAV n at slot t is a_n^t = {(c_{n,i}^t, p_{n,i}^t)}_i, where c_{n,i}^t is the channel to which cluster member i of cluster head UAV n hops at slot t and p_{n,i}^t is the transmit power selected by cluster head UAV n for cluster member i at slot t; the joint action set is A = {a_1^t, …, a_N^t}. The joint state set S contains all environment state information, while the joint observation set O contains only the partial information observable by the N agents, so O can be regarded as a subset of S. Finally, r_n^t denotes the reward value of cluster head UAV n at slot t.
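For concreteness, one way the per-slot observation, action and experience tuple of a single cluster head agent could be held in code is sketched below; the field names are illustrative assumptions, not notation taken from the patent.

```python
# Per-agent containers for the Dec-POMDP <D, S, A, O, R> (field names are assumptions).
from dataclasses import dataclass
from typing import Dict

@dataclass
class Observation:            # o_n^t: what cluster head n observes for its members
    channel: Dict[int, int]   # member id -> observed channel
    power: Dict[int, float]   # member id -> observed transmit power (dBm)

@dataclass
class Action:                 # a_n^t: what cluster head n decides for its members
    channel: Dict[int, int]   # member id -> hopping channel for this slot
    power: Dict[int, float]   # member id -> transmit power for this slot (dBm)

@dataclass
class Experience:             # one entry of the experience pool used in step 5
    obs: Observation          # o_n^t
    action: Action            # a_n^t
    reward: float             # r_n^t
    next_obs: Observation     # o_n^{t+1}
```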
Further, in step 3, each cluster head unmanned aerial vehicle adopts an ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster, specifically as follows:
Step 3-1: the observed value of each cluster head unmanned aerial vehicle is used as the input of the value network, and the Q value of each action is its output. The Q value of cluster head UAV n executing action a_n^t under observation o_n^t at slot t is the expectation of the cumulative discounted future reward from the beginning of slot t+1, i.e.
Q_n(o_n^t, a_n^t) = E[ Σ_{k≥0} γ^k r_n^{t+k+1} ],
where s_t is the environment state information at slot t and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster head UAV n takes action a_n^t.
Step 3-2: the action is selected according to the ε-greedy algorithm: with probability ε a random action is drawn from the action space, and otherwise the action with the largest value-network output Q(o_n^t, h_n^t, a; w) is chosen, where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden layer state of the neural network of cluster head UAV n at slot t, and w are the value network parameters. In this network the output depends not only on the input but also on the hidden layer state h_n^t, which stores the past network states of cluster head UAV n and therefore contains historical information. At the beginning of a round the hidden layer state is 0, i.e. it contains no historical information; as the round proceeds, h_n^t is updated iteratively, and the hidden state produced by the network at slot t is used as the hidden state of slot t+1, thereby influencing the output of the value network at slot t+1, and so on step by step.
This strategy randomly selects an action in the action space with probability ε and avoids getting stuck in a local optimum. ε is the exploration probability and 1 − ε is the exploitation probability (selecting the current best strategy); the larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm the state-action space is large, so the exploration probability should be large; as the number of iterations grows and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.
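A short sketch of this ε-greedy selection over the recurrent value network is shown below; the q_network call signature (returning per-action Q values and the next hidden state) is an assumption made for the example, not code from the patent.

```python
import random

def select_action(q_network, obs, hidden, epsilon, action_space):
    """epsilon-greedy choice of (channel, power) for one cluster head (sketch)."""
    q_values, next_hidden = q_network(obs, hidden)   # Q(o_n^t, h_n^t, a; w) for every action a
    if random.random() < epsilon:
        action = random.choice(action_space)                     # explore
    else:
        action = max(action_space, key=lambda a: q_values[a])    # exploit current best
    return action, next_hidden
```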
Further, in step 4, each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value, as follows:
Denote by p_{n,i}^t and p_j^t the transmit power of cluster member i of cluster head UAV n and of jammer j at slot t, respectively, and by p_{m,k}^t the transmit power of cluster member k of cluster head UAV m (with k ≠ i when m = n); G_U and G_J are the antenna gains of the UAVs and the jammer; d is the Euclidean distance at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; ρ is the noise figure of the UAV and σ² the mean square value of the ambient noise; h is the fast fading at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; B is the channel bandwidth, T the time required for a single communication transmission, and s the data size of a single communication transmission; R_{n,i}^t is the maximum average information rate for error-free transmission between cluster head UAV n and its cluster member i at slot t in an additive white Gaussian noise channel. The Rician fading channel gain is modeled with its real part and imaginary part as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written h = a + jb, with a the real part and b the imaginary part. The energy overhead E_n^t of cluster head UAV n at slot t is defined accordingly as the transmit energy needed to deliver the data to all of its cluster members in that slot.
When cluster member i of cluster head UAV n is on the same channel as jammer j, β = 1, otherwise β = 0; when cluster member i of cluster head UAV n is on the same channel as cluster member k of cluster head UAV m, α = 1, otherwise α = 0. The total environment reward value of slot t is then obtained from the energy overheads of all cluster head UAVs.
The practical physical meaning of the energy overhead is the energy consumed by cluster head UAV n and all of its cluster member UAVs to complete one data transmission.
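The following sketch illustrates one way this per-slot energy overhead and reward could be computed. It assumes Shannon-capacity links over each member's SINR, a simple distance-squared path loss, and a reward equal to the negative total energy; these are assumptions consistent with the stated goal of minimizing transmission energy, not a transcription of the patent's exact formulas.

```python
import math

def link_rate(p_tx, gain, dist, fading_gain, interference, noise, bandwidth):
    """Maximum error-free rate of one cluster-head -> member link (bit/s), Shannon form."""
    sinr = (p_tx * gain * fading_gain) / (dist ** 2 * (interference + noise))
    return bandwidth * math.log2(1.0 + sinr)

def cluster_reward(links, data_size, bandwidth, noise):
    """Negative total transmit energy for one cluster head to serve every member once."""
    energy = 0.0
    for link in links:  # each link: p_tx, gain, dist, fading_gain, interference (co-channel + jammer)
        rate = link_rate(link["p_tx"], link["gain"], link["dist"],
                         link["fading_gain"], link["interference"], noise, bandwidth)
        energy += link["p_tx"] * data_size / rate   # E = P * (s / R) for that member link
    return -energy                                   # minimizing energy <=> maximizing reward
```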
Further, in step 5, the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle and the observed value of the next time slot are stored into the respective experience pools, specifically as follows:
After cluster head UAV n selects the frequency-hopping channel and transmit power for its cluster member UAVs at slot t according to its policy, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is calculated by the reward value formula, and the observation o_n^{t+1} is obtained. The historical experience tuple (o_n^t, a_n^t, r_n^t, o_n^{t+1}) generated at the current slot t is stored into the experience pool.
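A minimal experience pool for step 5 could look as follows; it is a plain FIFO buffer and the class and method names are illustrative.

```python
from collections import deque

class ExperiencePool:
    """FIFO replay memory holding (o_t, a_t, r_t, o_{t+1}) tuples, capacity mu."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experience dropped when full

    def store(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def __len__(self):
        return len(self.buffer)
```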
Further, in step 6, when the experience pool holds enough samples, each agent randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent, specifically as follows:
The observation o_n^t of slot t is the input of the neural network of cluster head UAV n, and the Q values of all actions at slot t are its output. To enhance the stability of the algorithm, a double-network structure is adopted: w denotes the value network parameters and w' the target network parameters, and the target network parameters w' are updated once every certain number of rounds in step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences; each time sequence is a complete communication round, and from each sequence a time slot is chosen at random together with several consecutive subsequent steps as training samples. At slot t of a sample, the value network computes the action Q-value of cluster head UAV n at slot t, Q(o_n^t, h_n^t, a_n^t; w), which serves as the estimated Q value, and the target network computes the action Q-values of slot t+1, Q(o_n^{t+1}, h_n^{t+1}, a; w'), where o_n^t, a_n^t and h_n^t are the observation, action and hidden layer state of cluster head UAV n at slot t. The true value of the action Q-function is calculated as
y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, h_n^{t+1}, a; w').
Step 6-2: the true Q value y_n^t and the estimated Q value are substituted into the loss L(w) = (y_n^t − Q(o_n^t, h_n^t, a_n^t; w))², and the value network parameters w are updated so as to gradually reduce this loss.
The gradient descent method makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden layer state must be reset to zero; the hidden layer states of the subsequent steps are produced by the network iterations.
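A compact sketch of this update, written with PyTorch for illustration, is given below; the value_net/target_net interface (returning Q values and the next hidden state) and the sequence-sampling details are assumptions made for the example.

```python
import random
import torch
import torch.nn.functional as F

def train_step(value_net, target_net, optimizer, episodes, batch_size, seq_len, gamma):
    """One gradient-descent update on sampled sub-sequences of stored episodes (sketch)."""
    batch = random.sample(episodes, batch_size)            # each episode: list of (o, a, r, o')
    loss = torch.zeros(())
    for episode in batch:
        start = random.randrange(0, len(episode) - seq_len + 1)
        hidden = value_net.init_hidden()                   # hidden state zeroed before training
        target_hidden = target_net.init_hidden()
        for obs, action, reward, next_obs in episode[start:start + seq_len]:
            q, hidden = value_net(obs, hidden)             # estimated Q(o_t, h_t, a; w)
            with torch.no_grad():
                q_next, target_hidden = target_net(next_obs, target_hidden)
                y = reward + gamma * q_next.max()          # target ("true") Q value
            loss = loss + F.mse_loss(q[action], y)         # (y - Q)^2 accumulated over the sequence
    optimizer.zero_grad()
    loss.backward()                                        # gradient descent on the value network
    optimizer.step()
```

Step 7 then amounts to refreshing the target network every fixed number of time slots, e.g. target_net.load_state_dict(value_net.state_dict()).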
Further, in step 7, every certain number of time slots, the parameters of the value network are copied to form a new target network, i.e. w' ← w.
The invention is further described in detail below with reference to the drawings and specific embodiments.
Examples
In this embodiment, the cluster head UAVs and the jammer complete 100 moves and channel/transmit-power selections in one round, i.e. one communication task; within a round, each move and channel/power selection made by the cluster head UAVs and the jammer is called one time slot.
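The round structure can be summarized by the following skeleton; the environment and agent methods are placeholders standing in for the steps detailed below, not actual code from the patent.

```python
def run_round(cluster_heads, env, slots_per_round=100):
    """One communication round: 100 slots of observe -> act -> reward -> store -> train."""
    obs = {ch.id: env.observe(ch) for ch in cluster_heads}                        # step 2
    for t in range(slots_per_round):
        actions = {ch.id: ch.select_action(obs[ch.id]) for ch in cluster_heads}   # step 3
        rewards, next_obs = env.step(actions)                                     # step 4
        for ch in cluster_heads:
            ch.pool.store(obs[ch.id], actions[ch.id],
                          rewards[ch.id], next_obs[ch.id])                        # step 5
            ch.maybe_train()                                # steps 6-7 once the pool is full
        obs = next_obs
```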
With reference to fig. 1, the present embodiment provides an unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, which includes the following specific steps:
step 1: algorithm parameters are initialized.
The algorithm parameters comprise the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value network parameters w and target network parameters w'.
Step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot. The details are as follows:
the communication environment in the invention imitates the real environment as much as possible, and under most real environments, the intelligent agent cannot observe all state information due to the influence of noise and interference. Thus, the drone anti-jamming Decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
Modeling a system model as Dec-POMDP<D,S,A,O,R>Wherein D is a plurality of agent sets, S is a joint state set, A is a joint action set, O is a joint observation set, and R is a reward function; defining D = {1, \8230, N } as a set of N agents; defining the current observation of the time slot t +1 cluster head unmanned aerial vehicle n as:joint observation setWhereinIs the channel of cluster member i of time slot t +1 cluster head drone n,is the transmission power selected by the time slot t +1 cluster head unmanned aerial vehicle n for the cluster member i; define the action of time slot t cluster head unmanned plane n asJoint observation setWhereinCluster member i frequency hopping to channel of time slot t cluster head unmanned aerial vehicle nIs the transmission power selected by the time slot t cluster head unmanned aerial vehicle n for the cluster member iDefining a joint state set S as all environment state information, and defining a joint observation set O as partial information which can be observed by N intelligent agents, so that the joint observation set O can be regarded as a subset of the joint state set S; definition ofIs the reward value for slot t cluster head drone n.
And step 3: and each cluster head unmanned aerial vehicle adopts an epsilon-greedy algorithm to select the channel and the transmitting power of the current time slot for the members in the cluster. The method comprises the following specific steps:
step 3-1: the observed value of each cluster head unmanned aerial vehicle is used as the input of the value network, the Q value corresponding to each action is used as the output of the value network, wherein, the time slot t is that the cluster head unmanned aerial vehicle n observesLower execution of actionsQ value of (c) is the expectation of the cumulative future prize value at the beginning of time slot t +1, as follows:
wherein s is t For the time slot t the environmental status information,taking action for time slot t cluster head unmanned aerial vehicle nEnvironmental state s t From transition to s t+1 The probability of (c).
Step 3-2: the actions are selected according to the epsilon-greedy algorithm in the following specific way:
wherein p is a random number between 0 and 1, epsilon (0 < epsilon < 1) is an exploration probability,the hidden layer state of the time slot t cluster head unmanned aerial vehicle n neural network is shown, and w is a value network parameter. In this network, the output is not only related to the input, but also to the slot t hidden layer stateIn connection with this, the first and second electrodes,the cluster head unmanned aerial vehicle n network state storage device is used for storing the past network state of the cluster head unmanned aerial vehicle n and comprises historical information. Hidden layer state 0 at the beginning of the round, i.e., contains no historical information. As the round is made, it is,will be iteratively updated, generated by the time-slot t networkAnd the state of the hidden layer is used as the time slot t +1, so that the output of the time slot t +1 value network is influenced, and iteration is performed step by step.
The strategy randomly selects an action in the action space with the probability of epsilon and avoids falling into local optimum. ε is the exploration probability and 1- ε is the utilization (selection of the current best strategy) probability. The larger the value of epsilon, the smaller the probability of utilization. In the initial stage of algorithm execution, because the state action space is large, the exploration probability should be a large value, and gradually approaches the optimal strategy along with the increase of the iteration times, and the utilization probability should be increased accordingly. The probability epsilon is updated as follows:
ε=max{0.01,θ x }
where x is the number of rounds currently in progress.
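A direct transcription of this schedule in code (with θ = 0.998 as in the embodiment) behaves as sketched below.

```python
def epsilon_schedule(theta, round_index):
    """Exploration probability epsilon = max(0.01, theta ** x)."""
    return max(0.01, theta ** round_index)

# With theta = 0.998: round 0 -> 1.0, round ~500 -> ~0.37, round ~2300 -> floor of 0.01.
```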
Step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value. The details are as follows:
Denote by p_{n,i}^t and p_j^t the transmit power of cluster member i of cluster head UAV n and of jammer j at slot t, respectively, and by p_{m,k}^t the transmit power of cluster member k of cluster head UAV m (with k ≠ i when m = n); G_U and G_J are the antenna gains of the UAVs and the jammer; d is the Euclidean distance at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; ρ is the noise figure of the UAV and σ² the mean square value of the ambient noise; h is the fast fading at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; B is the channel bandwidth, T the time required for a single communication transmission, and s the data size of a single communication transmission; R_{n,i}^t is the maximum average information rate for error-free transmission between cluster head UAV n and its cluster member i at slot t in an additive white Gaussian noise channel. The Rician fading channel gain is modeled with its real part and imaginary part as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written h = a + jb, with a the real part and b the imaginary part. The energy overhead E_n^t of cluster head UAV n at slot t is defined accordingly as the transmit energy needed to deliver the data to all of its cluster members in that slot.
When cluster member i of cluster head UAV n is on the same channel as jammer j, β = 1, otherwise β = 0; when cluster member i of cluster head UAV n is on the same channel as cluster member k of cluster head UAV m, α = 1, otherwise α = 0. The total environment reward value of slot t is then obtained from the energy overheads of all cluster head UAVs.
The practical physical meaning of the energy overhead is the energy consumed by cluster head UAV n and all of its cluster member UAVs to complete one data transmission.
Step 5: the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle and the observed value of the next time slot are stored into the respective experience pools. The details are as follows:
After cluster head UAV n selects the frequency-hopping channel and transmit power for its cluster member UAVs at slot t according to its policy, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is calculated by the reward value formula, and the observation o_n^{t+1} is obtained. The historical experience tuple (o_n^t, a_n^t, r_n^t, o_n^{t+1}) generated at the current slot t is stored into the experience pool.
Step 6: when the sample data of the experience pool is enough, each intelligent agent carries out random sampling from the experience pool to obtain a plurality of batches of historical information data to form a time sequence, the time sequence is input into the value network of each intelligent agent, and the value network parameters are updated by adopting a gradient descent method. The method comprises the following specific steps:
the neural network of the cluster head unmanned aerial vehicle n is composed of 3 neural units, and the first neural unit is a Long Short-Term Memory unit (LSTM). The LSTM architecture is a special recurrent neural network architecture that can use historical information to predict and process sequence data. The LSTM is composed of a forgetting gate, an input gate and an output gate, wherein control parameters in the forgetting gate determine historical information needing to be discarded, the input gate determines new information to be added, and the output gate determines data to be output from the LSTM unit to a next unit.
Forget the door:
an input gate:
an output gate:
wherein, W i,f,c,o And b i,f,c,o The input weight and the offset of the gate,is the input of the slot tlstm unit.
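The following PyTorch sketch shows a recurrent Q-network consistent with this description: one LSTM unit followed by fully connected layers mapping the hidden state to per-action Q values. The layer sizes and the interface are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """LSTM unit + fully connected head producing one Q value per (channel, power) action."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)   # retains history across slots
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def init_hidden(self, batch_size=1):
        h = torch.zeros(1, batch_size, self.lstm.hidden_size)        # zeroed at the start of a round
        return (h, h.clone())

    def forward(self, obs, hidden):
        out, hidden = self.lstm(obs, hidden)   # obs: (batch, 1, obs_dim) for a single slot
        return self.head(out[:, -1]), hidden
```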
The LSTM structure uses the three gates to determine how much of the incoming data sequence is retained, allowing the future to be predicted from historical information. In the anti-interference scenario considered here, the agents only exchange reward information, so the action information of the other agents cannot be determined directly. The LSTM structure helps each agent estimate the actions of the other agents from the experience contained in historical information, so that a better anti-interference strategy for the UAV cluster communication network can be obtained.
The observation o_n^t of slot t is the input of the neural network of cluster head UAV n, and the Q values of all actions at slot t are its output. To enhance the stability of the algorithm, a double-network structure is adopted: w denotes the value network parameters and w' the target network parameters, and the target network parameters w' are updated once every certain number of rounds in step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences; each time sequence is a complete communication round, and from each sequence a time slot is chosen at random together with several consecutive subsequent steps as training samples. At slot t of a sample, the value network computes the action Q-value of cluster head UAV n at slot t, Q(o_n^t, h_n^t, a_n^t; w), which serves as the estimated Q value, and the target network computes the action Q-values of slot t+1, Q(o_n^{t+1}, h_n^{t+1}, a; w'), where o_n^t, a_n^t and h_n^t are the observation, action and hidden layer state of cluster head UAV n at slot t. The true value of the action Q-function is calculated as
y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, h_n^{t+1}, a; w').
Step 6-2: the true Q value y_n^t and the estimated Q value are substituted into the loss L(w) = (y_n^t − Q(o_n^t, h_n^t, a_n^t; w))², and the value network parameters w are updated so as to gradually reduce this loss.
The gradient descent method makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden layer state must be reset to zero; the hidden layer states of the subsequent steps are produced by the network iterations.
The gradient descent process adopts Adaptive Moment Estimation (ADAM). In the value-network parameter update, only a batch of historical experience data is sampled for training at each iteration; different data sets give different loss functions, and ADAM reduces the probability of converging to a local optimum. ADAM dynamically adjusts the learning rate of every parameter according to first- and second-moment estimates of the gradient of the loss function with respect to that parameter. ADAM is based on gradient descent, but the learning step of every parameter in each iteration stays within a bounded range, so a large gradient does not lead to a large learning step and the parameter values remain more stable. The ADAM algorithm proceeds as follows.
Let g_t be the first derivative of the objective function with respect to the parameter at slot t. First the exponential moving averages are computed:
m_t = λ₁ m_{t-1} + (1 − λ₁) g_t,
ω_t = λ₂ ω_{t-1} + (1 − λ₂) g_t².
Then the two bias-correction terms are computed:
m̂_t = m_t / (1 − λ₁^t), ω̂_t = ω_t / (1 − λ₂^t).
The resulting gradient update is
w_t = w_{t-1} − η m̂_t / (√ω̂_t + τ),
and the result parameters related to the error function are returned. Here m_t is the exponential moving average, ω_t the squared-gradient moving average, the parameters λ₁ and λ₂ control the exponential decay rates of these moving averages, η is the learning step, and τ is a constant, typically 10⁻⁸.
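A direct NumPy transcription of one such update step is sketched below; the default values of λ₁ and λ₂ are the usual ADAM defaults and are assumptions here, since the patent does not state them.

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.002, lam1=0.9, lam2=0.999, tau=1e-8):
    """One ADAM update of parameters w given gradient g at iteration t (1-based)."""
    m = lam1 * m + (1 - lam1) * g           # first-moment exponential moving average
    v = lam2 * v + (1 - lam2) * g ** 2      # second-moment (squared-gradient) average
    m_hat = m / (1 - lam1 ** t)             # bias corrections
    v_hat = v / (1 - lam2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + tau)
    return w, m, v
```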
Step 7: every fixed number of time slots, copy the parameters of the value network to form a new target network;
Step 8: repeat step 2 to step 7 until the data transmission has been completed 100 times;
Step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, finishing local training.
The method is implemented in Python. The total number of channels is C = 4, the number of UAVs N = 9 and the number of jammers J = 1; the transmit power selected by cluster head UAV n for cluster member i at slot t can be 27, 30, 33 or 36 dBm, and the transmit power of jammer j is usually 33 dBm. The UAV cluster is designed to complete one communication task, which requires cluster head UAV n and cluster member UAV i to complete q communications, and the anti-interference system is simulated under two different interference modes. The learning rate is δ = 0.002, the discount factor γ = 0.99, the attenuation factor θ = 0.998, the experience pool size μ = 200 and the ambient noise mean square value σ² = −114 dBm. The Rician fading channel gain is modeled with its real part and imaginary part as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written h = a + jb, with a the real part and b the imaginary part.
The learning convergence effects in the swept-frequency and Markov interference modes are shown in fig. 2 and fig. 3, respectively. The swept-frequency jammer is set to interfere with 1 channel at a time, with a sweep step of 1 MHz. In the Markov interference mode, 4 interference states are set; each interference pattern is generated randomly at the start of the simulation, and the transition of the interference pattern at any time slot follows a fixed state transition matrix.
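For reference, the two interference models can be simulated along the following lines; the Markov transition matrix entries below are placeholders, since the patent's actual matrix is not reproduced here.

```python
import numpy as np

def sweep_jammer(slot, n_channels=4):
    """Swept-frequency jammer: occupies one channel per slot, sweeping cyclically."""
    return slot % n_channels

def markov_jammer(state, transition_matrix, rng):
    """Markov jammer: next interference state drawn from the current state's row."""
    return rng.choice(len(transition_matrix), p=transition_matrix[state])

# Placeholder 4-state transition matrix (illustrative values only).
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
rng = np.random.default_rng(0)
next_state = markov_jammer(0, P, rng)
```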
Fig. 2 and fig. 3 show the convergence of the reward values of the random channel-and-power selection scheme and of the DQN-based and DRQN-based channel-and-power selection schemes under swept-frequency interference and Markov interference, respectively. The DRQN-based scheme reaches a higher convergence reward value than the DQN scheme and its convergence is more stable, because the DRQN contains a long short-term memory network: each agent can extract hidden information such as the action-change and environment-change rules of the other agents from historical experience, so the network output is not determined by its own observation alone. The output of the DQN is determined entirely by its own observation, and once the environment or the decision rules of other agents change, the whole network fluctuates. Comparing fig. 2 with fig. 3, the convergence behaviour of the three channel-and-power selection schemes is roughly the same; under Markov interference, the DRQN-based scheme improves performance by 34.6% over the DQN-based scheme and by 54.5% over the random scheme, and under swept-frequency interference it improves performance by 38.4% over the DQN-based scheme and by 56% over the random scheme.
Fig. 4 and fig. 5 show the average reward convergence value versus the number of channels for the three channel-and-power selection schemes under swept-frequency interference and Markov interference, respectively. When the number of channels increases, the average reward convergence values of all three schemes improve, because more channels reduce co-channel interference and therefore the energy overhead of UAV communication. The DRQN-based scheme shows a smaller change in the average reward convergence value than the other schemes, indicating that it is not sensitive to this environmental condition.
Fig. 6 and fig. 7 show the reward convergence value versus the number of jammers for the three channel-and-power selection schemes under swept-frequency interference and Markov interference, respectively. In both interference modes, as the number of jammers increases the environment deteriorates and the average reward convergence values of all three schemes decrease; however, the average reward convergence value of the DRQN-based scheme is more stable, with a decrease of no more than 10%, showing that the DRQN algorithm has better robustness.
Claims (8)
1. An unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, characterized by comprising the following specific steps:
step 1: initialize algorithm parameters;
step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot;
step 3: each cluster head unmanned aerial vehicle adopts an ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster;
step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value;
step 5: store the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observed value of the next time slot, into the respective experience pools;
step 6: when the experience pool holds enough samples, each cluster head unmanned aerial vehicle randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent;
step 7: every fixed number of time slots, copy the parameters of the value network to form a new target network;
step 8: repeat step 2 to step 7 until 100 data transmissions are finished;
step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, finishing local training.
2. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value network parameters w and target network parameters w'.
3. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 2 each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected for the member unmanned aerial vehicles in its cluster in the previous time slot, specifically as follows:
the communication environment imitates the real environment as closely as possible; in most real environments, an agent cannot observe all of the state information because of noise and interference, so the UAV anti-interference decision problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP);
the system is modeled as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S the joint state set, A the joint action set, O the joint observation set and R the reward function; D = {1, …, N} is the set of N agents; the observation of cluster head UAV n at time slot t+1 is o_n^{t+1} = {(c_{n,i}^{t+1}, p_{n,i}^{t+1})}_i, where c_{n,i}^{t+1} is the channel of cluster member i of cluster head UAV n at slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster head UAV n for cluster member i at slot t+1, and the joint observation set is O = {o_1^t, …, o_N^t}; the action of cluster head UAV n at slot t is a_n^t = {(c_{n,i}^t, p_{n,i}^t)}_i, where c_{n,i}^t is the channel to which cluster member i of cluster head UAV n hops at slot t and p_{n,i}^t is the transmit power selected by cluster head UAV n for cluster member i at slot t, and the joint action set is A = {a_1^t, …, a_N^t}; the joint state set S contains all environment state information, while the joint observation set O contains only the partial information observable by the N agents, so O can be regarded as a subset of S; r_n^t is the reward value of cluster head UAV n at slot t.
4. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 3 each cluster head unmanned aerial vehicle adopts an ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members, specifically as follows:
step 3-1: the observed value of each cluster head unmanned aerial vehicle is used as the input of the value network and the Q value of each action is its output, where the Q value of cluster head UAV n executing action a_n^t under observation o_n^t at slot t is the expectation of the cumulative discounted future reward from the beginning of slot t+1, i.e. Q_n(o_n^t, a_n^t) = E[ Σ_{k≥0} γ^k r_n^{t+k+1} ], s_t is the environment state information at slot t, and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster head UAV n takes action a_n^t;
step 3-2: the action is selected according to the ε-greedy algorithm: with probability ε a random action is drawn from the action space, otherwise the action with the largest value-network output Q(o_n^t, h_n^t, a; w) is chosen, where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden layer state of the neural network of cluster head UAV n at slot t, and w are the value network parameters; in this network the output depends not only on the input but also on the hidden layer state h_n^t, which stores the past network states of cluster head UAV n and therefore contains historical information; at the beginning of a round the hidden layer state is 0, i.e. it contains no historical information; as the round proceeds, h_n^t is updated iteratively, and the hidden state produced by the network at slot t is used as the hidden state of slot t+1, thereby influencing the output of the value network at slot t+1, and so on step by step;
this strategy randomly selects an action in the action space with probability ε and avoids getting stuck in a local optimum; ε is the exploration probability and 1 − ε is the exploitation probability (selecting the current best strategy); the larger ε is, the smaller the exploitation probability; in the initial stage of the algorithm the state-action space is large, so the exploration probability should be large, and as the number of iterations grows and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.
5. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 4 each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with its cluster members and obtains the corresponding environment reward value, specifically as follows:
denote by p_{n,i}^t and p_j^t the transmit power of cluster member i of cluster head UAV n and of jammer j at slot t, respectively, and by p_{m,k}^t the transmit power of cluster member k of cluster head UAV m (with k ≠ i when m = n); G_U and G_J are the antenna gains of the UAVs and the jammer; d is the Euclidean distance at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; ρ is the noise figure of the UAV and σ² the mean square value of the ambient noise; h is the fast fading at slot t between cluster head UAV n and its cluster member i, or between cluster head UAV n and jammer j; B is the channel bandwidth, T the time required for a single communication transmission, and s the data size of a single communication transmission; R_{n,i}^t is the maximum average information rate for error-free transmission between cluster head UAV n and its cluster member i at slot t in an additive white Gaussian noise channel; the Rician fading channel gain is modeled with its real part and imaginary part as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written h = a + jb, with a the real part and b the imaginary part; the energy overhead E_n^t of cluster head UAV n at slot t is defined accordingly as the transmit energy needed to deliver the data to all of its cluster members in that slot;
when cluster member i of cluster head UAV n is on the same channel as jammer j, β = 1, otherwise β = 0; when cluster member i of cluster head UAV n is on the same channel as cluster member k of cluster head UAV m, α = 1, otherwise α = 0; the total environment reward value of slot t is then obtained from the energy overheads of all cluster head UAVs;
the practical physical meaning of the energy overhead is the energy consumed by cluster head UAV n and all of its cluster member UAVs to complete one data transmission.
6. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 5 the observed value, action and reward of the current time slot of each cluster head unmanned aerial vehicle and the observed value of the next time slot are stored into the respective experience pools, specifically as follows:
after cluster head UAV n selects the frequency-hopping channel and transmit power for its cluster member UAVs at slot t according to its policy, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is calculated by the reward value calculation formula, and the observation o_n^{t+1} is obtained; the historical experience tuple (o_n^t, a_n^t, r_n^t, o_n^{t+1}) generated at the current slot t is stored into the experience pool.
7. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information according to claim 1, wherein in step 6, when sample data of the experience pools is sufficient, each intelligent agent randomly samples from each experience pool to obtain a plurality of batches of historical information data to form a time sequence, the time sequence is input into a value network of each intelligent agent, and a value network parameter is updated by a gradient descent method, specifically as follows:
observed value with time slot t as neural network input of cluster head unmanned aerial vehicle nAnd outputting Q values corresponding to each action of the time slots t. In order to enhance the stability of the algorithm, the invention adopts a double-network structure, w is recorded as a value network parameter, w 'is a target network parameter, and the target network parameter w' is updated once every certain round in step 7.
Step 6-1: when each agent trains the value network, a batch of historical experience data is randomly selected from an experience pool to form a plurality of time sequences, each time sequence is a complete communication turn, a time slot is randomly selected from each sequence, and a plurality of continuous steps are selected as training samples. Calculating the action Q value function of the cluster head unmanned aerial vehicle n of the time slot t through the value network at the time slot t of the sampleAs the estimated Q value, the target network calculates the action Q value function of the time slot t +1 cluster head unmanned aerial vehicle nWherein,and the observed value, the action and the hidden layer state of the time slot t cluster head unmanned aerial vehicle n. The true value of the action Q function is calculated using the following equation:
Step 6-2: the true Q value and the estimated Q value are substituted into the loss L(w) = (y_n^t − Q(o_n^t, a_n^t, h_n^t | w))², and the value network parameter w is updated along the negative gradient of L(w) so as to gradually reduce the loss.
Through the gradient descent method, the Q value calculated by the value network approaches the true Q value. Before each agent trains the neural network, the hidden layer state is set to zero, and the hidden layer states of the subsequent steps are generated by the network iterations.
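A minimal PyTorch sketch of the double-network update in steps 6-1 and 6-2 is given below, assuming a long short-term memory (LSTM) value network as suggested by the abstract. The names RecurrentQNetwork and train_step, the hidden size, and the discount factor value are assumptions made for illustration; the source does not prescribe a specific framework.

```python
import torch
import torch.nn as nn


class RecurrentQNetwork(nn.Module):
    """Value network with a recurrent hidden state, one instance per cluster head agent."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); hidden=None means the hidden state starts at zero.
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden           # Q values for every action at every time slot


def train_step(value_net, target_net, optimizer, obs, actions, rewards, next_obs, gamma=0.95):
    """One gradient-descent update of the value-network parameters w (step 6-2)."""
    q_all, _ = value_net(obs)                                   # estimated Q(o_t, a, h_t | w)
    q_est = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():                                       # target network: Q(o_{t+1}, a, h_{t+1} | w')
        q_next, _ = target_net(next_obs)
        q_true = rewards + gamma * q_next.max(dim=-1).values    # "true" Q value y_t
    loss = nn.functional.mse_loss(q_est, q_true)                # gap between estimated and true Q
    optimizer.zero_grad()
    loss.backward()                                             # gradient descent reduces the loss
    optimizer.step()
    return loss.item()
```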
8. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information according to claim 1, wherein in step 7, every certain number of time slots, the parameters of the value network are copied to form a new target network, i.e. w' ← w.
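Continuing the sketch above, the periodic hard copy of step 7 can be expressed in a single call; value_net and target_net are the hypothetical objects from the previous snippet.

```python
def hard_update(target_net, value_net):
    # Every fixed number of time slots, copy the value-network parameters into the
    # target network, i.e. w' <- w.
    target_net.load_state_dict(value_net.state_dict())
```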
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211261459.8A CN115454141A (en) | 2022-10-14 | 2022-10-14 | Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115454141A true CN115454141A (en) | 2022-12-09 |
Family
ID=84311660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211261459.8A Pending CN115454141A (en) | 2022-10-14 | 2022-10-14 | Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115454141A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190011531A1 (en) * | 2016-03-11 | 2019-01-10 | Goertek Inc. | Following method and device for unmanned aerial vehicle and wearable device |
CN107179777A (en) * | 2017-06-03 | 2017-09-19 | 复旦大学 | Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system |
US20210373552A1 (en) * | 2018-11-06 | 2021-12-02 | Battelle Energy Alliance, Llc | Systems, devices, and methods for millimeter wave communication for unmanned aerial vehicles |
CN113382381A (en) * | 2021-05-30 | 2021-09-10 | 南京理工大学 | Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning |
CN114415735A (en) * | 2022-03-31 | 2022-04-29 | 天津大学 | Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method |
Non-Patent Citations (1)
Title |
---|
李孜恒; 孟超: "Wireless network resource allocation algorithm based on deep reinforcement learning", 通信技术 (Communications Technology), no. 08, 10 August 2020 (2020-08-10) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116131963A (en) * | 2023-02-02 | 2023-05-16 | 广东工业大学 | Fiber link multipath interference noise equalization method based on LSTM neural network |
CN116432690A (en) * | 2023-06-15 | 2023-07-14 | 中国人民解放军国防科技大学 | Markov-based intelligent decision method, device, equipment and storage medium |
CN116432690B (en) * | 2023-06-15 | 2023-08-18 | 中国人民解放军国防科技大学 | Markov-based intelligent decision method, device, equipment and storage medium |
CN117675054A (en) * | 2024-02-02 | 2024-03-08 | 中国电子科技集团公司第十研究所 | Multi-domain combined anti-interference intelligent decision method and system |
CN117675054B (en) * | 2024-02-02 | 2024-04-23 | 中国电子科技集团公司第十研究所 | Multi-domain combined anti-interference intelligent decision method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115454141A (en) | Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information | |
CN113162679A (en) | DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method | |
CN108777872B (en) | Intelligent anti-interference method and intelligent anti-interference system based on deep Q neural network anti-interference model | |
CN111050330B (en) | Mobile network self-optimization method, system, terminal and computer readable storage medium | |
CN114422056B (en) | Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface | |
CN113795049B (en) | Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning | |
CN116456493A (en) | D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm | |
CN114422363A (en) | Unmanned aerial vehicle loaded RIS auxiliary communication system capacity optimization method and device | |
CN115103372B (en) | Multi-user MIMO system user scheduling method based on deep reinforcement learning | |
CN111050413A (en) | Unmanned aerial vehicle CSMA access method based on adaptive adjustment strategy | |
CN113423110A (en) | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning | |
CN115567148A (en) | Intelligent interference method based on cooperative Q learning | |
CN113382060A (en) | Unmanned aerial vehicle track optimization method and system in Internet of things data collection | |
Parras et al. | An online learning algorithm to play discounted repeated games in wireless networks | |
CN116938314A (en) | Unmanned aerial vehicle relay safety communication optimizing system under eavesdropping attack | |
CN114980254B (en) | Dynamic multichannel access method and device based on duel deep cycle Q network | |
CN116866895A (en) | Intelligent countering method based on neural virtual self-game | |
CN116866048A (en) | Anti-interference zero-and Markov game model and maximum and minimum depth Q learning method | |
CN116165886A (en) | Multi-sensor intelligent cooperative control method, device, equipment and medium | |
CN116340737A (en) | Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning | |
Evmorfos et al. | Deep actor-critic for continuous 3D motion control in mobile relay beamforming networks | |
Zappone et al. | Complexity-aware ANN-based energy efficiency maximization | |
Dong et al. | Multi-agent adversarial attacks for multi-channel communications | |
CN116073856A (en) | Intelligent frequency hopping anti-interference decision method based on depth deterministic strategy | |
CN113411099B (en) | Double-change frequency hopping pattern intelligent decision method based on PPER-DQN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||