CN112087749B - Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning - Google Patents
- Publication number: CN112087749B (application CN202010878680.2A)
- Authority: CN (China)
- Prior art keywords: legal, network, listener, eavesdropping, actor
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D30/70: Reducing energy consumption in wireless communication networks
Abstract
The application provides a reinforcement-learning-based cooperative active eavesdropping method for multiple listeners. Each lawful listener is treated as an energy-limited device that eavesdrops and jams cooperatively under a maximum-interference-power constraint; that is, a reinforcement-learning-based method maximizes the expected eavesdropping energy efficiency of each lawful listener. The method uses reinforcement learning to solve the joint interference-power allocation problem of cooperative active eavesdropping by two lawful listeners. The two lawful listeners cooperatively transmit interference power so as to successfully intercept the information sent by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each lawful listener.
Description
Technical Field
The invention relates to the field of communications, and in particular to a cooperative active eavesdropping method for multiple listeners based on reinforcement learning.
Background
Many technologies for suspicious communications have been developed. Interception of suspicious links by lawful listeners, i.e. active eavesdropping, plays an important role in wireless communication security and is a new research direction in that field.

In active eavesdropping systems, most current research uses a single lawful listener to monitor suspicious links. In systems containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve system performance. In addition, most existing work does not consider the limited energy of lawful listeners; in practice, however, a lawful listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and can even cause eavesdropping failure.
Disclosure of Invention
In order to solve the above technical problems, the application provides a reinforcement-learning-based cooperative active eavesdropping method for multiple listeners. Each lawful listener is treated as an energy-limited device that eavesdrops and jams cooperatively under a maximum-interference-power constraint; that is, a reinforcement-learning-based method maximizes the expected eavesdropping energy efficiency of each lawful listener. The method uses reinforcement learning to solve the joint interference-power allocation problem of cooperative active eavesdropping by two lawful listeners. The two lawful listeners cooperatively transmit interference power so as to successfully intercept the information sent by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each lawful listener.
The specific technical scheme provided by the application is as follows:
a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning, the method comprising:
determining the primary parameters of the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating the eavesdropping energy efficiency function of each lawful listener from the channel state information at time t and the interference power transmitted by that listener;
in the cooperative active eavesdropping system, selecting a multi-agent reinforcement learning algorithm for the cooperative scenario, the multi-agent deep deterministic policy gradient (MADDPG) algorithm, according to the interference-power allocation problem of the two lawful listeners to be solved;
based on the MADDPG algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two lawful listeners;
the cooperative transmission of interference power by the two lawful listeners based on the MADDPG algorithm specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem the state dimension is high and the action space is continuous, so when Q-value-based iteration is used, the time and memory needed to enumerate states and actions is prohibitive; a function approximator therefore needs to be built with a deep neural network (DNN) to create a learning agent. In the proposed system the two lawful listeners are two agents, each comprising four networks: an actor estimation network, an actor target network, a critic estimation network and a critic target network. Each estimation network and its target network have identical structure: a fully connected DNN with two hidden layers, ReLU nonlinear activation functions, and parameterization by a set of weights. The parameters of the actor estimation network are θ and those of the critic estimation network are ω; the parameters of the actor target network are θ′ and those of the critic target network are ω′. The actor and critic target networks periodically update their parameters from the corresponding estimation networks until the target networks converge, after which no further training is performed.
(2) In the cooperative active eavesdropping system, define the states and actions of the MADDPG algorithm.
State: for each lawful listener i, the state obtained from the environment at time t consists of: the channel power gain of the suspicious link; the channel power gain from the suspicious transmitter T to lawful listener i; the channel power gain from lawful listener i to the suspicious receiver D; and the self-interference channel power gain of lawful listener i.
Action: for each lawful listener i, the interference power to transmit must be chosen according to the observed environment state; the action is the interference power transmitted by listener i.
(3) Determine the objective function of each lawful listener, the expected eavesdropping energy efficiency, based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection rule that optimizes long-term performance. We therefore take the expected eavesdropping energy efficiency over a horizon T as the objective function. The Q value of agent i is defined as the expected return of selecting action a in state s starting from time t, wherein r_i^t is the immediate reward of agent i and the behavior policy of agent i in state s outputs the action to be performed. The optimal Q value is the maximum attainable when the optimal action is taken at every decision. The value function is approximated with a DNN to construct the learning agent.
Thus, the expected eavesdropping energy efficiency of each lawful listener is a discounted sum of rewards, wherein γ ∈ (0, 1) is the discount factor, θ_i are the parameters of the actor estimation network of lawful listener i, and r_i^t is the immediate eavesdropping-energy-efficiency reward of lawful listener i at time t. The optimal policy maximizes this expected return.
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training, so the parameters θ and ω of the actor and critic networks are initialized first. Since there is no reward at the initial time, the reward of lawful listener i is r_i^0 = 0, i.e. the initial eavesdropping energy efficiency is zero; the initial state information is also initialized.
(5) Cooperative interference power decision in cooperative active eavesdropping system-two legal listeners cooperate to transmit interference power.
Centralized training: the input of each critic estimation network is the state and action information of both lawful listeners. During training, the critic networks of the two lawful listeners obtain the full information state together with the policy actions taken by both listeners. Thus, even though each actor cannot observe all information or know the other actor's policy, it is guided by a critic with global information in optimizing its policy. Each lawful listener therefore updates its own policy with knowledge of the other listener's policy, which enables cooperative interception of the suspicious link by the two lawful listeners.
The updating mode of the actor network is as follows:
where M is the number of samples drawn at random from the experience replay pool and the superscript j denotes, for the i-th lawful listener, the other lawful listener. The critic network informs the actor of an expected reward value based on the global state information, and x_j = {s_1, s_2} denotes the state information of lawful listener i together with that of the other lawful listener j. The actor network updates its policy according to the expected reward given by the critic: if the action taken increases the expected reward reported by the critic, the actor increases the value in that policy-gradient direction, and otherwise decreases it. The actor network thus follows the direction of policy-gradient ascent to update its parameters θ.
The critic network loss function is:
wherein the loss function is the square of the difference between the true Q value and the estimated Q value, a_i is the action taken by lawful listener i in the current state, r_i is the immediate reward, the target-network policy is evaluated under the target-network parameters θ′_i, x′ is the global state information at the next moment, and a′_i is the action taken at the next moment. For lawful listener i, the critic network updates its parameters ω_i by minimizing this loss function, i.e. the gradient of L(ω_i) with respect to ω_i is computed and a gradient-descent step is taken.
For the legal monitor i, the parameters of the actor target network and the critic target network are updated regularly and a soft update mode is adopted:
θ′_i ← τθ_i + (1 − τ)θ′_i
ω′_i ← τω_i + (1 − τ)ω′_i
where τ is the soft-update coefficient, i.e. the fraction of the estimated network parameters blended into the target network parameters at each update.
Distributed execution: after the model is trained, i.e. the parameters have converged, the parameters no longer change and only the two actors need to interact with the environment, i.e. only the loop represented by the solid black line in fig. 2 is needed; the two lawful listeners take actions, namely the interference powers to be transmitted, according to the state information they observe.
The model is trained centrally with the MADDPG algorithm and then executed in a distributed manner. During distributed execution, the trained model lets the two lawful listeners determine the interference-power allocation cooperatively, optimizing the expected eavesdropping energy efficiency.
Compared with the prior art, the technical scheme has the following advantages:
according to the cooperative active interception method for realizing the multi-monitor based on reinforcement learning, each legal monitor is considered to be an energy-limited device, and cooperative interception and interference are carried out under the constraint of maximum interference power. I.e., reinforcement learning based methods maximize the expected eavesdropping energy efficiency of each lawful listener. The method mainly relates to the solution of the joint interference power distribution problem of cooperative active interception of two legal listeners by using reinforcement learning. Interference power is cooperatively transmitted through two legal listeners so as to realize successful interception of information transmitted by suspicious transmitters and maximize expected interception energy efficiency of each legal listener.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning according to an embodiment of the present invention;
fig. 2 is a flowchart of the MADDPG algorithm, taking two agents as an example, according to an embodiment of the present invention.
Detailed Description
In active eavesdropping systems, most current research uses a single lawful listener to monitor suspicious links. In systems containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve system performance. In addition, most existing research does not consider the limited energy of lawful listeners; in practice, however, a lawful listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and can even cause eavesdropping failure.
The inventors found that in active eavesdropping systems the existing research intercepts suspicious links with a single lawful listener and mostly ignores the listener's limited energy, which does not match reality. Moreover, it does not consider that simultaneous eavesdropping and jamming by multiple lawful listeners can further improve system eavesdropping performance.
In order to solve these technical problems, the application provides a reinforcement-learning-based cooperative active eavesdropping method for multiple listeners, i.e. a reinforcement-learning-based method that maximizes the expected eavesdropping energy efficiency of each lawful listener. The method uses reinforcement learning to solve the joint interference-power allocation problem of cooperative active eavesdropping by two lawful listeners. The two lawful listeners cooperatively transmit interference power to successfully intercept the information sent by the suspicious transmitter and to maximize the expected eavesdropping energy efficiency of each lawful listener, which requires finding the optimal interference-power allocation strategy of each lawful listener. The eavesdropping energy efficiency function evaluates the relation between each lawful listener's eavesdropping rate and its interference power: the eavesdropping energy efficiency is the ratio of the eavesdropping rate to the power.
As shown in fig. 1, the present application proposes a cooperative active eavesdropping method for implementing multiple listeners based on reinforcement learning, including:
determining a primary parameter in the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in a cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm-a multi-agent depth deterministic strategy gradient algorithm of a cooperative scene is determined according to the problem of interference power distribution of two legal listeners to be solved;
the cooperative active eavesdropping system based on the multi-agent depth deterministic strategy gradient algorithm realizes cooperative transmission interference power realization eavesdropping of two legal listeners.
The cooperative active eavesdropping method based on the reinforcement learning method for realizing the multi-listener is described in detail below.
First, the primary parameters in a cooperative active eavesdropping system are determined.
Based on the wireless communication environment, and considering that the channel state information changes dynamically, each lawful listener must transmit interference power according to the channel state information in the environment. At time t, each lawful listener i observes: the channel power gain of the suspicious link; the channel power gain from the suspicious transmitter T to lawful listener i; the channel power gain from lawful listener i to the suspicious receiver D; and the self-interference channel power gain of lawful listener i. The action of each lawful listener is the interference power it transmits.
In the second step, in the cooperative active eavesdropping system, the eavesdropping energy efficiency function of each legal listener is generated according to the channel power gain (namely the channel state information) at the moment t and the interference power of the legal listener.
The eavesdropping energy efficiency is the ratio of the eavesdropping rate to the transmitted interference power, i.e. the ratio of the rate at which each lawful listener successfully eavesdrops on the suspicious link to the interference power it transmits. The eavesdropping rate is determined from the data transmission rates of the suspicious link and the eavesdropping link, which are computed with the Shannon formula as follows:
the signal-to-interference-plus-noise ratio (SINR) at the suspect receiver D is:
the signal to interference plus noise ratio (SINR) at lawful listener E1 is:
the signal to interference plus noise ratio (SINR) at lawful listener E2 is:
wherein P_T is the power at which the suspicious transmitter transmits its signal, J_i^t is the interference power transmitted by lawful listener i, and σ² is the noise power.
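The SINR expressions themselves are not reproduced in the text above. A hedged reconstruction consistent with the quantities described, where all symbols (P_T for the suspicious transmit power, J_i^t for listener i's jamming power, the h terms for channel power gains, g_i^t for the self-interference gain, σ² for noise) are notation introduced here, and the assumption that each full-duplex listener suffers only residual self-interference is ours, would be:

```latex
\gamma_D^t \;=\; \frac{P_T\, h_{TD}^t}{J_1^t h_{1D}^t + J_2^t h_{2D}^t + \sigma^2},
\qquad
\gamma_i^t \;=\; \frac{P_T\, h_{Ti}^t}{J_i^t\, g_i^t + \sigma^2},
\quad i \in \{1,2\},
```

with the corresponding Shannon rates $R_D^t = \log_2(1+\gamma_D^t)$ for the suspicious link and $R_i^t = \log_2(1+\gamma_i^t)$ for each eavesdropping link.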
Using the SINR, the data transmission rates of the suspicious link and of the eavesdropping links of lawful listeners E1 and E2 follow from Shannon's formula:
in lawful interception systems, ifThe lawful listener can decode the information sent by the overheard suspicious transmitter T with any small error rate, thereby giving an effective overheard rate of +.>If->The lawful interception device cannot decode the information sent by the suspicious transmitter T, and the interception rate is +.>Thus, an indicator function is introduced to indicate whether two legitimate monitors successfully eavesdropped:
wherein, for legal monitor i, at time t, ifIndicating a successful eavesdropping; otherwise, the eavesdropping fails.
The eavesdropping rate is defined as:
For the cooperative active eavesdropping system, eavesdropping performance depends on the eavesdropping rate and the interference-power allocation. For the i-th listener, the eavesdropping energy efficiency is the ratio of its eavesdropping rate to its transmitted interference power.
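The rate and energy-efficiency computation described above can be written out as a short sketch. The concrete SINR model (whose jamming reaches which receiver) and every variable name here are illustrative assumptions, since the formulas are not reproduced in the text:

```python
import math

def shannon_rate(sinr):
    # Shannon formula: achievable rate in bit/s/Hz
    return math.log2(1.0 + sinr)

def eavesdrop_energy_efficiency(P_T, h_TD, h_Ti, h_iD_all, J_all, g_i, i,
                                sigma2=1e-3):
    """Eavesdropping energy efficiency of lawful listener i (0-based index).

    Assumed model: the suspicious receiver D is jammed by both listeners,
    while listener i suffers only residual self-interference J_i * g_i.
    """
    # SINR at the suspicious receiver D under cooperative jamming
    sinr_D = P_T * h_TD / (sum(J * h for J, h in zip(J_all, h_iD_all)) + sigma2)
    # SINR at listener i's eavesdropping receiver
    sinr_i = P_T * h_Ti / (J_all[i] * g_i + sigma2)
    R_D = shannon_rate(sinr_D)        # suspicious-link rate
    R_i = shannon_rate(sinr_i)        # eavesdropping-link rate
    success = 1 if R_i >= R_D else 0  # indicator: can listener i decode?
    # energy efficiency = successfully eavesdropped rate / jamming power spent
    return success * R_D / J_all[i]
```

With a small self-interference gain the listener decodes the suspicious link and the efficiency is positive; with a large one, decoding fails and the efficiency is zero.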
thirdly, in the cooperative active eavesdropping system, a Multi-agent reinforcement learning algorithm-Multi-agent depth deterministic strategy gradient (Multi-agent Deep Deterministic Policy Gradient, MADDPG) algorithm of a cooperative scene is determined according to the problem of interference power distribution of two legal listeners to be solved.
In a multi-agent system, each agent selects its own action according to the state information of the environment. Traditional single-agent reinforcement learning cannot solve the problem of cooperative active eavesdropping between two lawful listeners well: because every agent keeps changing during training, the environment is non-stationary from each individual agent's point of view. That is, an agent observing only its own local state may encounter the same state-action pair with different reward values. In other words, the agents interact, and each agent needs not only its own state information but also the other agents' states and actions, since these affect its reward. Therefore, to solve the cooperative active eavesdropping problem of the two agents, i.e. the two lawful listeners, the policy update of one listener should take the other listener's policy into account instead of relying only on its own behavior. Each lawful listener must determine its required interference power (action) from the environment information, i.e. the state-action information of all lawful listeners, so as to successfully eavesdrop on the information sent by the suspicious transmitter and maximize its expected eavesdropping energy efficiency. Given this structure of the cooperative active eavesdropping system and of the problem to be solved, the MADDPG multi-agent reinforcement learning algorithm is well suited. After an agent obtains state information and selects an action, the environment feeds back reward information used to judge whether the action was good or bad.
The MADDPG algorithm adopts centralized training with distributed execution and suits cooperative scenarios. During training, each critic observes global state information to guide the training of its actor, i.e. it uses not only its own state-action pair but also the state-action information of the other agents. At test time only the actors are used to take actions. Fig. 2 shows the flow of the MADDPG algorithm with two agents as an example, where S_all is the global state information of the two agents, a_1 and a_2 are the actions taken by agent 1 and agent 2, r_1 and r_2 are their immediate rewards after acting, and S_1 and S_2 are the states observed by agent 1 and agent 2, respectively.
And fourthly, the cooperative active eavesdropping system based on the MADDPG algorithm realizes the cooperative transmission interference power realization eavesdropping of two legal listeners.
(1) In the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem the state dimension is high and the action space is continuous, so when Q-value-based iteration is used, the time and memory needed to enumerate states and actions is prohibitive. A function approximator therefore needs to be built with a deep neural network (DNN) to create a learning agent. In the proposed system the two lawful listeners are two agents, each comprising four networks: an actor estimation network, an actor target network, a critic estimation network and a critic target network. Each estimation network and its target network have identical structure: a fully connected DNN with two hidden layers, ReLU nonlinear activation functions, and parameterization by a set of weights. The parameters of the actor estimation network are θ and those of the critic estimation network are ω; the parameters of the actor target network are θ′ and those of the critic target network are ω′. Finally, the actor and critic target networks periodically update their parameters from the corresponding estimation networks until the target networks converge and training stops.
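This network structure can be sketched minimally in numpy; the layer widths, initialization scale and all names below are illustrative assumptions, and only the forward pass of a fully connected two-hidden-layer ReLU network is shown:

```python
import numpy as np

def init_mlp(sizes, rng):
    # one (weight matrix, bias vector) pair per layer; this set of weights
    # is the parameter vector (theta for an actor, omega for a critic)
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # fully connected DNN: two hidden layers with ReLU activation
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = params[-1]
    return x @ W + b              # linear output layer

rng = np.random.default_rng(0)
# actor: 4 observed channel gains in, 1 jamming power out (widths assumed)
theta = init_mlp([4, 64, 64, 1], rng)
# the target network starts as an exact copy of the estimation network
theta_target = [(W.copy(), b.copy()) for W, b in theta]
```

A critic would be built the same way, with the concatenated states and actions of both listeners as input and a scalar Q value as output.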
(2) In the cooperative active eavesdropping system, define the states and actions of the MADDPG algorithm.
State: for each lawful listener i, the state obtained from the environment at time t consists of: the channel power gain of the suspicious link; the channel power gain from the suspicious transmitter T to lawful listener i; the channel power gain from lawful listener i to the suspicious receiver D; and the self-interference channel power gain of lawful listener i.
Action: for each lawful listener i, the interference power to transmit is chosen according to the observed environment state; the action is the interference power transmitted by listener i.
(3) Determine the objective function of each lawful listener, the expected eavesdropping energy efficiency, based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection rule that optimizes long-term performance. We therefore take the expected eavesdropping energy efficiency over a horizon T as the objective function. The Q value of agent i is defined as the expected return of selecting action a in state s starting from time t, wherein r_i^t is the immediate reward of agent i and the behavior policy of agent i in state s outputs the action to be performed. The optimal Q value is the maximum attainable when the optimal action is taken at every decision. The value function is approximated with a DNN to construct the learning agent.
Thus, the expected eavesdropping energy efficiency of each lawful listener is a discounted sum of rewards, wherein γ ∈ (0, 1) is the discount factor, θ_i are the parameters of the actor estimation network of lawful listener i, and r_i^t is the immediate eavesdropping-energy-efficiency reward of lawful listener i at time t. The optimal policy maximizes this expected return.
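The expected return above is the discounted sum of the per-step eavesdropping-energy-efficiency rewards; a minimal sketch (function and variable names assumed):

```python
def discounted_return(rewards, gamma=0.9):
    # sum over t of gamma**t * r_t, with discount factor gamma in (0, 1);
    # computed right-to-left so each step costs one multiply and one add
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.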
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training, so the parameters θ and ω of the actor and critic networks are initialized first. Since there is no reward at the initial time, the reward of lawful listener i is r_i^0 = 0, i.e. the initial eavesdropping energy efficiency is zero; the initial state information is also initialized.
(5) Cooperative interference power decision in cooperative active eavesdropping system-two legal listeners cooperate to transmit interference power.
Centralized training: the input of each critic estimation network is the state and action information of both lawful listeners. During training, the critic networks of the two lawful listeners obtain the full information state together with the policy actions taken by both listeners. Thus, even though each actor cannot observe all information or know the other actor's policy, it is guided by a critic with global information in optimizing its policy. Each lawful listener therefore updates its own policy with knowledge of the other listener's policy, which enables cooperative interception of the suspicious link by the two lawful listeners.
The updating mode of the actor network is as follows:
where M is the number of samples drawn at random from the experience replay pool and the superscript j denotes, for the i-th lawful listener, the other lawful listener. The critic network informs the actor of an expected reward value based on the global state information, and x_j = {s_1, s_2} denotes the state information of lawful listener i together with that of the other lawful listener j. The actor network updates its policy according to the expected reward given by the critic: if the action taken increases the expected reward reported by the critic, the actor increases the value in that policy-gradient direction, and otherwise decreases it. The actor network thus follows the direction of policy-gradient ascent to update its parameters θ.
The critic network loss function is:
wherein the loss function is the square of the difference between the true Q value and the estimated Q value, a_i is the action taken by lawful listener i in the current state, r_i is the immediate reward, the target-network policy is evaluated under the target-network parameters θ′_i, x′ is the global state information at the next moment, and a′_i is the action taken at the next moment. For lawful listener i, the critic network updates its parameters ω_i by minimizing this loss function, i.e. the gradient of L(ω_i) with respect to ω_i is computed and a gradient-descent step is taken.
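A minimal numeric sketch of this update target and loss, with names assumed:

```python
import numpy as np

def td_target(r, q_target_next, gamma=0.9):
    # "true" Q value y = r + gamma * Q'(x', a') from the target critic
    return r + gamma * q_target_next

def critic_loss(q_est, r, q_target_next, gamma=0.9):
    # mean squared difference between the target and the estimated Q value;
    # omega_i is updated by gradient descent on this quantity
    y = td_target(r, q_target_next, gamma)
    return float(np.mean((y - q_est) ** 2))
```

When the estimated Q value already equals the target, the loss is zero and no update is needed.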
For the legal monitor i, the parameters of the actor target network and the critic target network are updated regularly and a soft update mode is adopted:
θ′_i ← τθ_i + (1 − τ)θ′_i
ω′_i ← τω_i + (1 − τ)ω′_i
where τ is the soft-update coefficient, i.e. the fraction of the estimated network parameters blended into the target network parameters at each update.
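The soft update can be sketched for parameters stored as a list of (weight, bias) pairs; the data layout is an assumption:

```python
def soft_update(target_params, est_params, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta': the target network keeps
    # a (1 - tau) share of its old parameters on every periodic update
    return [(tau * W + (1.0 - tau) * Wt, tau * b + (1.0 - tau) * bt)
            for (W, b), (Wt, bt) in zip(est_params, target_params)]
```

With tau = 0.5 a target parameter of 0 blended with an estimate of 1 becomes 0.5; small tau makes the target track the estimate slowly, which stabilizes training.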
Distributed execution: after the model is trained, i.e. the parameters have converged, the parameters no longer change and only the two actors need to interact with the environment, i.e. only the loop represented by the solid black line in fig. 2 is needed; the two lawful listeners take actions, namely the interference powers to be transmitted, according to the state information they observe.
The model is trained centrally with the MADDPG algorithm and then executed in a distributed manner. During distributed execution, the trained model lets the two lawful listeners determine the interference-power allocation cooperatively, optimizing the expected eavesdropping energy efficiency.
The specific process of the MADDPG algorithm is as follows:
1) Initialize the parameters of the actor and critic networks, including the estimation-network parameters and the target-network parameters; initialize the random noise Δ used for action exploration;
2) Obtain the initial state x = (s_1, s_2), where at time slot t, for agent i, its state is s_i^t = {h_TD^t, h_Ti^t, h_iD^t, h_ji^t}; initialize the reward r_i = 0;
3) For each agent i, observe the initial state s_i^t = {h_TD^t, h_Ti^t, h_iD^t, h_ji^t}, where h_TD^t is the channel power gain of the suspicious link at time t, h_Ti^t is the channel power gain from the suspicious transmitter T to legal listener i, h_iD^t is the channel power gain from legal listener i to the suspicious receiver D, and h_ji^t is the power gain of the interfering channel received by legal listener i;
4) For each agent i, select an action according to the state, a_i^t = μ_i(s_i^t) + Δ, where a_i^t denotes the interference power p_i^t transmitted by legal listener i;
5) Execute the action and obtain the reward r_i and the next state x′;
6) Store the experience e(t) = (x, a, r, x′) of time t in the experience replay pool D(t), from which mini-batches of samples are randomly drawn for training;
7) Update the actor network using the policy gradient;
8) Update the critic network by minimizing the loss function L(ω_i);
9) Update the state: x ← x′;
10) Update the target-network parameters θ′_i and ω′_i in soft-update mode until convergence, then stop training: θ′_i ← τθ_i + (1−τ)θ′_i, ω′_i ← τω_i + (1−τ)ω′_i.
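Steps 1)-10) can be arranged into a schematic training loop; every environment detail below (reward shape, channel-gain drift) is a placeholder, not the patent's model:

```python
import random
from collections import deque

def train_maddpg(episodes=5, batch_size=4, buffer_size=100):
    replay = deque(maxlen=buffer_size)   # experience replay pool D(t)
    theta = [0.0, 0.0]                   # actor parameters, one per legal listener
    state = [0.5, 0.5]                   # 2)-3) initial channel-gain observations

    for _ in range(episodes):
        # 4) each agent picks a jamming power from its policy plus exploration noise
        actions = [theta[i] * state[i] + random.uniform(-0.1, 0.1) for i in range(2)]
        # 5) placeholder reward standing in for eavesdropping energy efficiency
        rewards = [-abs(a - 0.3) for a in actions]
        next_state = [min(1.0, s + 0.01) for s in state]
        # 6) store e(t) = (x, a, r, x') and sample a mini-batch once enough data exists
        replay.append((state, actions, rewards, next_state))
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            # 7)-8) actor policy-gradient and critic loss-minimization updates,
            # 10) soft target-network updates, would be computed from `batch` here
        state = next_state               # 9) x <- x'
    return theta, len(replay)

theta, stored = train_maddpg()
```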
With this method, each legal listener can make its own interference power decision and the MADDPG algorithm can be implemented. When the two legal listeners choose interference power actions in the continuous action space according to their respective optimal expected eavesdropping energy efficiency, the joint interference power allocation strategy reaches an optimal equilibrium; that is, under the optimal interference power allocation strategy, each legal listener obtains its maximum expected eavesdropping energy efficiency.
As can be seen, the present application contemplates that each lawful listener is an energy-limited device that performs cooperative eavesdropping and interference under maximum interference power constraints.
Specifically, when allocating the joint interference power of the cooperative active eavesdropping system containing two legal listeners, the application also takes into account that the channel state information changes dynamically, and uses the MADDPG algorithm to maximize the expected eavesdropping energy efficiency of each legal listener; that is, the two listeners can cooperatively eavesdrop on the suspicious link while sending interference signals to the suspicious receiver to realize successful eavesdropping. The MADDPG algorithm is used to select the interference power allocation decision autonomously, and through continual training it can achieve the goal of maximizing the expected eavesdropping energy efficiency.
The application can achieve the following beneficial effects:
(1) Under the interference power constraint, the legal listeners cooperatively transmit interference power to realize successful eavesdropping of the information sent by the suspicious transmitter; compared with non-cooperative active eavesdropping, the expected eavesdropping energy efficiency of cooperative active eavesdropping by the legal listeners is significantly improved.
(2) The MADDPG algorithm applies well to the cooperative active eavesdropping scenario and is suitable for continuous-action-space problems. Its convergence can also be accelerated by appropriately increasing the learning rate.
(3) The MADDPG algorithm is provided to optimize the interference power allocation decision, and the optimal interference power allocation strategy can be found under the conditions of large state action space and continuous action space, so that the expected eavesdropping energy efficiency of each legal listener is maximized.
The parts of this description are described in a progressive manner; each part focuses on what distinguishes it from the other parts, and identical or similar content among the parts may be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (1)
1. A cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning, characterized by comprising the following steps:
determining a primary parameter in a cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, determining, according to the interference power allocation problem of the two legal listeners to be solved, a multi-agent reinforcement learning algorithm for the cooperative scenario: the multi-agent deep deterministic policy gradient (MADDPG) algorithm;
realizing, based on the multi-agent deep deterministic policy gradient algorithm, cooperative transmission of interference power by the two legal listeners to achieve eavesdropping in the cooperative active eavesdropping system;
wherein realizing cooperative transmission of interference power for eavesdropping by the two legal listeners based on the multi-agent deep deterministic policy gradient algorithm specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, establishing a network structure of an MADDPG algorithm;
since the state dimension is higher in the proposed cooperative active eavesdropping problem and is a continuous action space problem, when the Q-value-based iteration is used, the time and memory consumed for enumerating the state and action space are not measurable, so that a function approximator needs to be built by using a Deep Neural Network (DNN) to create a learned proxy, in the proposed cooperative active eavesdropping system, two legal listeners represent two proxies, each proxy comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, wherein the structure of the estimation network and the target network is the same, namely the estimation network consists of a fully connected DNN which comprises two layers of hidden layers with activation functions being ReLU nonlinear activation functions, the parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are theta ', the parameters of the critic target network are omega', and finally, the actor and the critic target network need to update the parameters of the target network at regular time according to the parameters of the estimated network until convergence of the actor and the critic target network is not trained;
(2) defining, in the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm;
status: for each legal listener i, the state obtained from the environment is s_i^t = {h_TD^t, h_Ti^t, h_iD^t, h_ji^t}, wherein h_TD^t denotes the channel power gain of the suspicious link, h_Ti^t denotes the channel power gain from the suspicious transmitter T to legal listener i, h_iD^t denotes the channel power gain from legal listener i to the suspicious receiver D, and h_ji^t denotes the power gain of the interfering channel received by legal listener i;
actions: for each legal listener i, the interference power to be transmitted is selected according to the observed environmental state, i.e., a_i^t = p_i^t;
(3) Determining an objective function-expected eavesdropping energy efficiency of each legal listener based on an MADDPG algorithm in a cooperative active eavesdropping system;
in reinforcement learning, the policy is an action-selection strategy that optimizes long-term performance; therefore, the expected eavesdropping energy efficiency over a period T needs to be defined as the objective function; the Q value is defined as the expected return of agent i selecting action a in state s starting from time t, and for agent i the Q value is:

Q_i^μ(s, a) = E[Σ_{k≥0} γ^k r_i^{t+k} | s, a]

wherein r_i^t is the timely reward of agent i, μ_i is the behavior policy of agent i in state s whose output is the action to be performed, and the optimal Q value is the maximum that can be reached when the optimal action is taken at every decision; the value function uses a DNN to construct a learning agent, and the resulting value-function approximator is Q_i(s, a; ω_i);
thus, the expected eavesdropping energy efficiency of each legal listener is:

J(θ_i) = E[Σ_{t=0}^{T} γ^t r_i^t]

wherein γ ∈ (0, 1) is the discount factor, θ_i are the parameters of the actor estimation network of legal listener i, r_i^t is the timely eavesdropping-energy-efficiency reward of legal listener i at time t, and the optimal policy is μ_i* = argmax J(θ_i);
(4) Initializing network parameters and required initial data;
in reinforcement learning, initial parameters are needed to start network training, so the parameters θ and ω of the actor network and the critic network are first initialized randomly; since there is no reward value at the initial time, the reward of legal listener i is initialized as r_i^0 = 0, i.e., the eavesdropping energy efficiency at the initial time is 0, and the state information s_i^0 at the initial time is initialized;
(5) Cooperative interference power decision-two legal listeners cooperate to transmit interference power in a cooperative active eavesdropping system;
centralized training: the input to the critic estimation network is the state and action information of both legal listeners, i.e., (s_1, s_2) and (a_1, a_2); the critic networks of the two legal listeners obtain the full information state during training, together with the policy actions adopted by both legal listeners; thus, even though an actor cannot obtain all the information and does not know the policies of the other actors, each actor is guided in optimizing its policy by a critic that has global information; each legal listener updates its own policy given the policy of the other legal listener, so that cooperative eavesdropping of the suspicious link by the two legal listeners can be realized;
the updating mode of the actor network is as follows:
∇_{θ_i} J(μ_i) ≈ (1/M) Σ_{j=1}^{M} ∇_{θ_i} μ_i(s_i) ∇_{a_i} Q_i^μ(x_j, a_1, a_2)|_{a_i = μ_i(s_i)}

where M represents the number of samples randomly drawn from the experience replay pool, and the superscript j denotes, for the i-th legal listener, the other legal listener; Q_i^μ denotes the expected reward value that the critic network reports to the actor based on global state information; x_j represents the state information of legal listener i together with the other legal listener j, i.e., x_j = {s_1, s_2}; the actor network updates its policy according to the expected reward given by the critic network, i.e., if the action taken increases the expected reward Q_i^μ reported by the critic, the actor increases the value in this policy-gradient direction, and conversely decreases it, so the actor network updates its parameters θ along the direction of policy-gradient ascent, ∇_{θ_i} denoting the gradient with respect to the policy parameters;
the critic network loss function is:
L(ω_i) = (1/M) Σ_j (y_j − Q_i^μ(x_j, a_1, a_2))², with the target value y_j = r_i + γ Q_i^{μ′}(x′_j, a′_1, a′_2)

wherein the loss function is the square of the difference between the true (target) Q value y_j and the estimated Q value; a_i represents the action taken by legal listener i in the current state; y_j represents the true value; r_i is the timely reward; Q_i^{μ′} is the target-network policy under the target network parameters θ′_i; x′ is the global state information at the next moment; a′_i represents the action taken at the next moment; for legal listener i, the critic network is updated by minimizing its loss function, i.e., taking the gradient of L(ω_i) with respect to ω_i and updating ω_i along the gradient-descent direction;
for the legal monitor i, the parameters of the actor target network and the critic target network are updated regularly and a soft update mode is adopted:
θ′ i ←τθ i +(1-τ)θ′ i
ω′ i ←τω i +(1-τ)ω′ i
wherein τ represents a retention parameter, i.e., a degree to which the estimated network parameter is retained during the target network parameter update process;
distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors interact with the environment, and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information;
the model is trained centrally with the MADDPG algorithm and then executed in a distributed manner, so that during distributed execution the trained model allows the two legal listeners to determine the interference power allocation cooperatively, and the expected eavesdropping energy efficiency can be optimized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010878680.2A CN112087749B (en) | 2020-08-27 | 2020-08-27 | Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112087749A CN112087749A (en) | 2020-12-15 |
CN112087749B true CN112087749B (en) | 2023-06-02 |
Family
ID=73729707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010878680.2A Active CN112087749B (en) | 2020-08-27 | 2020-08-27 | Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112087749B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113747442B (en) * | 2021-08-24 | 2023-06-06 | 华北电力大学(保定) | IRS-assisted wireless communication transmission method, device, terminal and storage medium |
CN115296705B (en) * | 2022-04-28 | 2023-11-21 | 南京大学 | Active monitoring method in MIMO communication system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107948173A (en) * | 2017-11-30 | 2018-04-20 | 华北电力大学(保定) | A kind of monitor method |
CN108235423A (en) * | 2017-12-29 | 2018-06-29 | 中山大学 | Wireless communication anti-eavesdrop jamming power control algolithm based on Q study |
CN109088891A (en) * | 2018-10-18 | 2018-12-25 | 南通大学 | Legal listening method based on safety of physical layer under a kind of more relay systems |
CN109302262A (en) * | 2018-09-27 | 2019-02-01 | 电子科技大学 | A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth |
CN109635917A (en) * | 2018-10-17 | 2019-04-16 | 北京大学 | A kind of multiple agent Cooperation Decision-making and training method |
Non-Patent Citations (3)
Title |
---|
UAV-Enabled Secure Communications by Multi-Agent Deep Reinforcement Learning; Yu Zhang et al.; IEEE Transactions on Vehicular Technology; full text *
Design of a cooperative eavesdropping scheme based on relaying and proactive jamming; Zhu Min, Zhang Dengyin; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), Issue 03; full text *
A survey of deep reinforcement learning based on value functions and policy gradients; Liu Jianwei, Gao Feng, Luo Xionglin; Chinese Journal of Computers, Issue 06; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112087749A (en) | 2020-12-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||