CN112087749B - Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning - Google Patents

Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Info

Publication number
CN112087749B
CN112087749B
Authority
CN
China
Prior art keywords
legal
network
listener
eavesdropping
actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010878680.2A
Other languages
Chinese (zh)
Other versions
CN112087749A (en)
Inventor
李保罡
杨亚欣
张淑娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN202010878680.2A priority Critical patent/CN112087749B/en
Publication of CN112087749A publication Critical patent/CN112087749A/en
Application granted granted Critical
Publication of CN112087749B publication Critical patent/CN112087749B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application provides a cooperative active eavesdropping method in which multiple listeners are realized based on reinforcement learning. Each legal listener is treated as an energy-limited device that performs cooperative eavesdropping and jamming under a maximum interference power constraint; that is, a reinforcement learning based method maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information transmitted by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each legal listener.

Description

Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
Technical Field
The invention relates to the field of communication, in particular to a cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning.
Background
Many technologies for suspicious communications have been developed. Eavesdropping on suspicious links by lawful listeners, i.e., active eavesdropping, plays an important role in wireless communication security and has become a new research direction in this field.
In active eavesdropping systems, most current research uses a single lawful listener to eavesdrop on suspicious links. In an active eavesdropping system containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve the system's eavesdropping performance. In addition, most existing work does not consider the limited energy of legal listeners; in practice, a legal listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and may even cause eavesdropping to fail.
Disclosure of Invention
In order to solve the above technical problems, the application provides a cooperative active eavesdropping method in which multiple listeners are realized based on reinforcement learning. The application treats each legal listener as an energy-limited device that performs cooperative eavesdropping and jamming under a maximum interference power constraint; that is, a reinforcement learning based method maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information transmitted by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each legal listener.
The specific technical scheme provided by the application is as follows:
a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning, the method comprising:
determining a primary parameter in the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners;
the realization of eavesdropping through cooperative transmission of interference power by the two legal listeners, based on the multi-agent deep deterministic policy gradient algorithm, specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem, the state dimension is high and the action space is continuous; when Q-value based iteration is used, the time and memory consumed in enumerating the state and action spaces would be prohibitive, so a deep neural network (DNN) is used to construct a function approximator and create a learning agent. In the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents. Each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, where the estimation networks and the target networks have identical structures, namely a fully connected DNN with two hidden layers whose activation functions are ReLU nonlinear activation functions, parameterized by a set of weights. The parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′. Finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks; once the networks converge, the target networks are no longer trained.
(2) In the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined.
State: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i.
Action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state; the action is $a_i^t=P_{E_i}^t$.
(3) The objective function of each legal listener, namely the expected eavesdropping energy efficiency, is determined based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection strategy that optimizes long-term performance. Therefore, the expected eavesdropping energy efficiency over a period of time T is taken as the objective function. The Q value is defined as the expected return of the agent selecting action a in state s starting from time t; for agent i, the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed. The optimal Q value is the maximum that can be reached when the optimal action is taken at every decision. The value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$. Thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t. The optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$.
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training; therefore, the parameters θ and ω of the actor network and the critic network are first initialized. Since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$. The initial state information $s_i^0$ is also initialized.
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power.
Centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$. During training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners. Thus, even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy. This means that each legal listener updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved.
The actor network is updated as follows:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information. $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$. The actor network updates its policy according to the expected reward given by the critic network: if the action taken increases the expected reward $Q_i^{\mu}$ reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise. The actor network is therefore driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy.
The critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

where the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment. For legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent.
For legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

where τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update.
Distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors need to interact with the environment (only the loop represented by the solid black line in Fig. 2 is required), and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information.
The MADDPG algorithm is used to train the model in a centralized manner, and actions are then executed in a distributed manner after training; during distributed execution, the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, so that the expected eavesdropping energy efficiency is optimized.
Compared with the prior art, the technical scheme has the following advantages:
according to the cooperative active interception method for realizing the multi-monitor based on reinforcement learning, each legal monitor is considered to be an energy-limited device, and cooperative interception and interference are carried out under the constraint of maximum interference power. I.e., reinforcement learning based methods maximize the expected eavesdropping energy efficiency of each lawful listener. The method mainly relates to the solution of the joint interference power distribution problem of cooperative active interception of two legal listeners by using reinforcement learning. Interference power is cooperatively transmitted through two legal listeners so as to realize successful interception of information transmitted by suspicious transmitters and maximize expected interception energy efficiency of each legal listener.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning according to an embodiment of the present invention;
Fig. 2 is a flowchart of the MADDPG algorithm, taking two agents as an example, according to an embodiment of the present invention.
Detailed Description
In active eavesdropping systems, most current research uses a single lawful listener to eavesdrop on suspicious links. In an active eavesdropping system containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve the system's eavesdropping performance. In addition, most existing research does not consider the limited energy of legal listeners; in practice, a legal listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and may even cause eavesdropping to fail.
The inventors find that, in active eavesdropping systems, existing research eavesdrops on suspicious links with a single legal listener and mostly ignores the energy limitation of the legal listener, which does not match the actual situation. Moreover, it does not consider that simultaneous eavesdropping and jamming by multiple legal listeners may further improve the system's eavesdropping performance.
In order to solve the above technical problems, the application provides a cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning, namely a reinforcement learning based method that maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power to successfully eavesdrop on the information transmitted by the suspicious transmitter and to maximize the expected eavesdropping energy efficiency of each legal listener, which requires finding the optimal interference power allocation strategy of each legal listener. The eavesdropping energy efficiency function evaluates the relation between the eavesdropping rate of each legal listener and its interference power; the eavesdropping energy efficiency is calculated as the ratio of the eavesdropping rate to the power.
As shown in fig. 1, the present application proposes a cooperative active eavesdropping method for implementing multiple listeners based on reinforcement learning, including:
determining a primary parameter in the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners.
The cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning is described in detail below.
First, the primary parameters in a cooperative active eavesdropping system are determined.
According to the wireless communication environment, and considering first that the channel state information changes dynamically, each legal listener needs to transmit interference power according to the channel state information in the environment. At time t, the information observed by each legal listener i comprises the channel power gain of the suspicious link $h_{TD}^t$, the channel power gain from the suspicious transmitter T to legal listener i $h_{TE_i}^t$, the channel power gain from legal listener i to the suspicious receiver D $h_{E_iD}^t$, and the power gain of the interfering channel received by legal listener i $h_{E_jE_i}^t$. The power transmitted by each legal listener is $P_{E_i}^t$.
In the second step, in the cooperative active eavesdropping system, the eavesdropping energy efficiency function of each legal listener is generated according to the channel power gain (namely the channel state information) at the moment t and the interference power of the legal listener.
The eavesdropping energy efficiency is the ratio of the eavesdropping rate to the transmitted interference power, i.e., the ratio of the rate at which each legal listener successfully eavesdrops on the suspicious link to the interference power it transmits. The eavesdropping rate is determined from the data transmission rate of the suspicious link and the data transmission rate of the eavesdropping link, and the data transmission rates are calculated according to the Shannon formula. The specific calculation process is as follows:
the signal-to-interference-plus-noise ratio (SINR) at the suspect receiver D is:
Figure BDA0002653430110000056
the signal to interference plus noise ratio (SINR) at lawful listener E1 is:
Figure BDA0002653430110000057
the signal to interference plus noise ratio (SINR) at lawful listener E2 is:
Figure BDA0002653430110000061
wherein P is T Is the power at which the suspect transmitter transmits a signal,
Figure BDA0002653430110000062
is the interference power, sigma, transmitted by lawful listener i 2 Is the noise power.
By using SINR, we can obtain the data transmission rates of the suspicious link and the lawful listener E1, E2 eavesdropping link according to shannon's formula as:
Figure BDA0002653430110000063
Figure BDA0002653430110000064
Figure BDA0002653430110000065
in lawful interception systems, if
$R_{E_i}^t\ge R_D^t$, the legal listener can decode the eavesdropped information sent by the suspicious transmitter T with an arbitrarily small error rate, giving an effective eavesdropping rate of $R_D^t$; if $R_{E_i}^t<R_D^t$, the legal listener cannot decode the information sent by the suspicious transmitter T, and the eavesdropping rate is 0. An indicator function is therefore introduced to indicate whether the two legal listeners eavesdrop successfully:

$$X_i^t=\begin{cases}1, & R_{E_i}^t\ge R_D^t,\\ 0, & \text{otherwise},\end{cases}$$

where, for legal listener i at time t, $X_i^t=1$ indicates successful eavesdropping; otherwise, eavesdropping fails.

The eavesdropping rate is defined as

$$R_{ev,i}^t=X_i^t\,R_D^t.$$

For cooperative active eavesdropping systems, the eavesdropping performance depends on the eavesdropping rate and the allocation of interference power. Thus, for the i-th listener, its eavesdropping energy efficiency is

$$\eta_i^t=\frac{R_{ev,i}^t}{P_{E_i}^t}.$$
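To make the above calculation concrete, the following Python sketch evaluates the SINRs, Shannon rates, success indicator and eavesdropping energy efficiency of the two legal listeners for one time slot. It is only an illustrative sketch under the notation reconstructed above; the helper name, the argument layout and the assumption that each listener is interfered only by the other listener's jamming are assumptions of this sketch, not values fixed by the patent.

```python
import numpy as np

def eavesdropping_energy_efficiency(P_T, P_E, h_TD, h_TE, h_ED, h_EE, noise=1e-9):
    """Eavesdropping energy efficiency eta_i^t of both legal listeners for one slot.

    P_T  : transmit power of the suspicious transmitter
    P_E  : [P_E1, P_E2], interference powers of the two legal listeners
    h_TD : channel power gain of the suspicious link T -> D
    h_TE : [h_TE1, h_TE2], gains from T to listener i
    h_ED : [h_E1D, h_E2D], gains from listener i to the suspicious receiver D
    h_EE : [h_E2E1, h_E1E2], interfering-channel gains received by listener i
    """
    P_E, h_TE, h_ED, h_EE = map(np.asarray, (P_E, h_TE, h_ED, h_EE))

    # SINR at the suspicious receiver D (jammed by both listeners)
    sinr_D = P_T * h_TD / (P_E[0] * h_ED[0] + P_E[1] * h_ED[1] + noise)
    # SINR at listener i (assumed interfered only by the other listener's jamming)
    sinr_E = P_T * h_TE / (P_E[::-1] * h_EE + noise)

    R_D = np.log2(1.0 + sinr_D)           # rate of the suspicious link
    R_E = np.log2(1.0 + sinr_E)           # eavesdropping-link rates
    success = (R_E >= R_D).astype(float)  # indicator X_i^t
    R_ev = success * R_D                  # effective eavesdropping rates
    return R_ev / np.maximum(P_E, 1e-12)  # eavesdropping energy efficiency eta_i^t
```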
thirdly, in the cooperative active eavesdropping system, a Multi-agent reinforcement learning algorithm-Multi-agent depth deterministic strategy gradient (Multi-agent Deep Deterministic Policy Gradient, MADDPG) algorithm of a cooperative scene is determined according to the problem of interference power distribution of two legal listeners to be solved.
In a multi-agent system, each agent needs to select its own action according to the state information of the environment. Traditional single-agent reinforcement learning methods cannot solve the cooperative active eavesdropping problem between the two legal listeners well, because each agent keeps changing during training, which makes the environment non-stationary from the perspective of each individual agent. That is, an agent only observes its own local state information, so when making decisions the same state-action pair may receive different reward values. In other words, the agents interact with each other, and each agent needs not only its own state information but also the state information and actions of the other agents, since such information and behavior affect its rewards. Thus, to solve the problem of cooperative active eavesdropping by two agents, i.e., the two legal listeners, the policy update of one listener should take the other listener's policy into account instead of relying only on its own actions. Here, each legal listener needs to determine its required interference power (action) according to the environment information, i.e., the state-action information of all legal listeners, so as to successfully eavesdrop on the information sent by the suspicious transmitter and maximize the expected eavesdropping energy efficiency of each legal listener. Given the characteristics of the cooperative active eavesdropping system and the structure of the problem to be solved, the MADDPG multi-agent reinforcement learning algorithm can solve the cooperative active eavesdropping problem well. After an agent obtains the state information and selects an action, the environment feeds back reward information to the agent, which is used to judge whether the action was good or bad.
The MADDPG algorithm adopts centralized training and distributed execution, and is suited to cooperative scenarios. That is, during training the critic can guide the training of the actor by observing global state information, i.e., it uses not only its own state-action information but also that of the other agents; during testing, actions are taken using only the actor. Fig. 2 shows the flow of the MADDPG algorithm, taking two agents as an example, where $S_{all}$ is the global state information of the two agents, $a_1, a_2$ denote the actions taken by agent 1 and agent 2 respectively, $r_1, r_2$ are the immediate rewards of the two agents after acting, and $S_1, S_2$ are the state information observed by agent 1 and agent 2 respectively.
In the fourth step, based on the MADDPG algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners.
(1) In the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem, the state dimension is high and the action space is continuous. When Q-value based iteration is used, the time and memory consumed in enumerating the state and action spaces would be prohibitive. A deep neural network (DNN) is therefore used to construct a function approximator and create a learning agent. In the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents. Each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, where the estimation networks and the target networks have identical structures, namely a fully connected DNN with two hidden layers whose activation functions are ReLU nonlinear activation functions, parameterized by a set of weights. The parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′. Finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks; once the networks converge, the target networks are no longer trained.
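As an illustration of the network structure just described (each agent owning actor and critic estimation networks plus target copies, each a fully connected DNN with two ReLU hidden layers), the following PyTorch sketch is one possible realization; the layer width, the sigmoid output scaling to a maximum interference power p_max, and the helper names are assumptions rather than details given in the patent.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a listener's local state to its interference power in [0, p_max]."""
    def __init__(self, state_dim, hidden=64, p_max=1.0):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # squash, then scale to the power range
        )

    def forward(self, state):
        return self.p_max * self.net(state)

class Critic(nn.Module):
    """Scores one listener's (global state, joint action) pair."""
    def __init__(self, global_state_dim, joint_action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, global_state, joint_action):
        return self.net(torch.cat([global_state, joint_action], dim=-1))

def make_agent(state_dim=4, n_agents=2):
    """One agent per legal listener: estimation networks plus target copies."""
    actor = Actor(state_dim)
    critic = Critic(state_dim * n_agents, n_agents)
    return {
        "actor": actor, "critic": critic,
        "actor_target": copy.deepcopy(actor),
        "critic_target": copy.deepcopy(critic),
    }

agents = [make_agent() for _ in range(2)]
```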
(2) In the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined.
State: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i.
Action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state; the action is $a_i^t=P_{E_i}^t$.
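As a concrete reading of these state and action definitions, the sketch below (continuing the notation and the Actor class assumed in the previous sketch; all names are hypothetical) packs the four observed channel power gains into the state vector of listener i and lets its actor output the interference power as the action, clipped to the maximum-power constraint.

```python
import numpy as np
import torch

def build_state(h_TD, h_TEi, h_EiD, h_EjEi):
    """State s_i^t = {h_TD, h_TE_i, h_E_iD, h_E_jE_i} as a flat vector."""
    return np.array([h_TD, h_TEi, h_EiD, h_EjEi], dtype=np.float32)

def select_action(actor, state, noise_std=0.0, p_max=1.0):
    """Action a_i^t = P_E_i^t: interference power chosen from the observed state,
    with optional exploration noise (used during training only)."""
    with torch.no_grad():
        power = actor(torch.from_numpy(state)).item()
    power += np.random.normal(0.0, noise_std)   # exploration noise delta
    return float(np.clip(power, 0.0, p_max))    # respect the maximum-power constraint
```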
(3) The objective function of each legal listener, namely the expected eavesdropping energy efficiency, is determined based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection strategy that optimizes long-term performance. Therefore, the expected eavesdropping energy efficiency over a period of time T is taken as the objective function. The Q value is defined as the expected return of the agent selecting action a in state s starting from time t; for agent i, the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed. The optimal Q value is the maximum that can be reached when the optimal action is taken at every decision. The value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$. Thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t. The optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$.
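A minimal sketch of the objective just defined: given one listener's per-slot eavesdropping-energy-efficiency rewards, it computes the discounted sum that the actor parameters are trained to maximize (the horizon and discount factor below are illustrative values, not ones specified in the patent).

```python
def expected_eavesdropping_energy_efficiency(eta, gamma=0.95):
    """Discounted objective J = sum_t gamma^t * eta_i^t for one legal listener.

    eta : sequence of immediate eavesdropping-energy-efficiency rewards eta_i^t
          over a horizon of T time slots.
    """
    return sum((gamma ** t) * r for t, r in enumerate(eta))

# Example: three time slots of rewards for listener i.
print(expected_eavesdropping_energy_efficiency([0.8, 0.5, 0.0]))
```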
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training; therefore, the parameters θ and ω of the actor network and the critic network are first initialized. Since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$. The initial state information $s_i^0$ is also initialized.
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power.
Centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$. During training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners. Thus, even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy. This means that each legal listener updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved.
The actor network is updated as follows:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information. $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$. The actor network updates its policy according to the expected reward given by the critic network: if the action taken increases the expected reward $Q_i^{\mu}$ reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise. The actor network is therefore driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy.
The critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

where the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment. For legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent.
For legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

where τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update.
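The centralized update just described (critic regression toward the target value y, actor update along the deterministic policy gradient reported by the critic, and soft target updates with retention parameter τ) can be sketched as follows; this is a minimal illustration assuming the two-agent setup, the network classes and per-listener optimizers from the earlier sketches, not the exact patented procedure.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agents, optims, batch, gamma=0.95, tau=0.01):
    """One centralized training step for both legal listeners.

    batch  : tensors (states [B,2,4], actions [B,2], rewards [B,2], next_states [B,2,4])
    agents : per-listener dicts of networks (see the earlier make_agent sketch)
    optims : per-listener dicts {"actor": optimizer, "critic": optimizer} (assumed)
    """
    states, actions, rewards, next_states = batch
    B = states.shape[0]
    x, x_next = states.reshape(B, -1), next_states.reshape(B, -1)  # global states

    # Joint next actions from the actor *target* networks.
    with torch.no_grad():
        a_next = torch.cat([ag["actor_target"](next_states[:, i])
                            for i, ag in enumerate(agents)], dim=-1)

    for i, ag in enumerate(agents):
        # --- critic: minimize (y - Q_i(x, a1, a2))^2 ---
        with torch.no_grad():
            y = rewards[:, i:i+1] + gamma * ag["critic_target"](x_next, a_next)
        critic_loss = F.mse_loss(ag["critic"](x, actions), y)
        optims[i]["critic"].zero_grad(); critic_loss.backward(); optims[i]["critic"].step()

        # --- actor: ascend the policy gradient reported by the critic ---
        a_i = ag["actor"](states[:, i])
        a_pred = torch.cat([a_i if j == i else actions[:, j:j+1]
                            for j in range(len(agents))], dim=-1)
        actor_loss = -ag["critic"](x, a_pred).mean()
        optims[i]["actor"].zero_grad(); actor_loss.backward(); optims[i]["actor"].step()

        # --- soft update: theta' <- tau*theta + (1 - tau)*theta' ---
        for est, tgt in ((ag["actor"], ag["actor_target"]),
                         (ag["critic"], ag["critic_target"])):
            for p, p_t in zip(est.parameters(), tgt.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```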
Distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors need to interact with the environment (only the loop represented by the solid black lines in Fig. 2 is required), and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information.
The MADDPG algorithm is used to train the model in a centralized manner, and actions are then executed in a distributed manner after training; during distributed execution, the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, so that the expected eavesdropping energy efficiency is optimized.
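For the execution phase, only the two trained actors are needed; a minimal sketch (reusing the select_action helper assumed above, with exploration noise disabled, and an observe_state callback standing in for the channel measurements) is:

```python
def distributed_execution(actors, observe_state, n_slots=100):
    """After training, each legal listener acts from its own local observation only."""
    history = []
    for t in range(n_slots):
        states = [observe_state(i, t) for i in range(2)]  # local channel-gain observations
        # powers[i] is the interference power P_E_i^t that listener i transmits at slot t
        powers = [select_action(actors[i], states[i], noise_std=0.0) for i in range(2)]
        history.append(powers)
    return history
```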
The specific process of the MADDPG algorithm is as follows:
1) Initialize the parameters of the actor networks and the critic networks, including the estimation network parameters and the target network parameters; initialize the random noise δ used for action exploration;
2) Obtain the initial global state $x=(s_1, s_2)$, where, at time slot t, the state of agent i is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$; initialize the reward $r_i=0$;
3) For each agent i, observe the initial state $s_i^0=\{h_{TD}^0,\,h_{TE_i}^0,\,h_{E_iD}^0,\,h_{E_jE_i}^0\}$, where $h_{TD}^t$ is the channel power gain of the suspicious link at time t, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i;
4) For each agent i, select an action according to the state, $a_i^t=P_{E_i}^t=\mu_{\theta_i}(s_i^t)+\delta$, where $P_{E_i}^t$ denotes the interference power transmitted by legal listener i;
5) Execute the action and obtain the reward $r_i$ and the next state $x'$;
6) Store the experience $e(t)=(x,a,r,x')$ at time t in the experience replay unit D(t), from which small batches of samples are randomly drawn for training;
7) Update the actor network using the policy gradient:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)};$$

8) Update the critic network by minimizing the loss function $L(\omega_i)$:

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big);$$

9) Update the next state: $x\leftarrow x'$;
10) Update the target network parameters $\theta_i',\,\omega_i'$ using the soft-update rule $\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i'$, $\omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i'$, and stop training upon convergence.
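Tying steps 1) to 10) together, one possible outer training loop looks as follows; the experience replay pool, episode and step counts, batch size and noise level are illustrative assumptions, env is a stand-in interface for the channel model described earlier, and maddpg_update / select_action refer to the earlier sketches.

```python
import random
from collections import deque

import numpy as np
import torch

def train(env, agents, optims, episodes=500, steps=200, batch_size=64, noise_std=0.1):
    """Centralized MADDPG training loop for the two legal listeners.

    env is assumed to expose reset() -> [s_1, s_2] and
    step([a_1, a_2]) -> (next_states, rewards), built from the channel model above.
    """
    replay = deque(maxlen=100_000)                        # experience replay pool D
    for ep in range(episodes):
        states = env.reset()                              # steps 2)-3): initial local states
        for t in range(steps):
            # step 4): each actor picks its interference power, plus exploration noise
            actions = [select_action(ag["actor"], s, noise_std)
                       for ag, s in zip(agents, states)]
            next_states, rewards = env.step(actions)      # step 5): eavesdropping-EE rewards
            replay.append((states, actions, rewards, next_states))  # step 6)
            states = next_states                          # step 9)

            if len(replay) >= batch_size:                 # steps 7)-8) and 10)
                batch = random.sample(replay, batch_size)
                s, a, r, s2 = (torch.tensor(np.array(z), dtype=torch.float32)
                               for z in zip(*batch))
                maddpg_update(agents, optims, (s, a, r, s2))
```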
According to the above method, the interference power decision can be made at each legal listener and the MADDPG algorithm can be implemented. When the two legal listeners take interference power actions according to their respective optimal expected eavesdropping energy efficiencies in the continuous action space, the joint interference power allocation strategy reaches an optimal equilibrium; that is, under the optimal interference power allocation strategy, each legal listener obtains its maximum expected eavesdropping energy efficiency.
As can be seen, the present application contemplates that each lawful listener is an energy-limited device that performs cooperative eavesdropping and interference under maximum interference power constraints.
Specifically, when allocating the joint interference power of the cooperative active eavesdropping system containing two legal listeners, the application also considers that the channel state information changes dynamically, and the MADDPG algorithm is used to maximize the expected eavesdropping energy efficiency of each legal listener; that is, the two listeners can cooperatively eavesdrop on the suspicious link while sending interference signals to the suspicious receiver to achieve successful eavesdropping. The MADDPG algorithm is used to select the interference power allocation decision autonomously, and through continual training it can achieve the goal of maximizing the expected eavesdropping energy efficiency.
The application can achieve the following beneficial effects:
(1) Under the interference power constraint, the legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information sent by the suspicious transmitter; compared with the eavesdropping energy efficiency of non-cooperative active eavesdropping, the expected eavesdropping energy efficiency of cooperative active eavesdropping by the legal listeners can be significantly improved.
(2) The MADDPG algorithm is well suited to the cooperative active eavesdropping scenario and to continuous action space problems. Its convergence speed can also be increased by appropriately raising the learning rate.
(3) The MADDPG algorithm is provided to optimize the interference power allocation decision, and the optimal interference power allocation strategy can be found under the conditions of large state action space and continuous action space, so that the expected eavesdropping energy efficiency of each legal listener is maximized.
In the present description, each part is described in a progressive manner, and each part is mainly described as different from other parts, and identical and similar parts between the parts are mutually referred.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. The cooperative active eavesdropping method for realizing the multi-listener based on reinforcement learning is characterized by comprising the following steps:
determining a primary parameter in a cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners;
the realization of eavesdropping through cooperative transmission of interference power by the two legal listeners, based on the multi-agent deep deterministic policy gradient algorithm, specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, establishing a network structure of an MADDPG algorithm;
since the state dimension is high in the proposed cooperative active eavesdropping problem and the action space is continuous, when Q-value based iteration is used the time and memory consumed in enumerating the state and action spaces would be prohibitive, so a deep neural network (DNN) is used to construct a function approximator and create a learning agent; in the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents, and each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, wherein the estimation networks and the target networks have identical structures, namely a fully connected DNN comprising two hidden layers whose activation functions are ReLU nonlinear activation functions; the parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′; finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks until convergence, after which the target networks are no longer trained;
(2) in the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined;
state: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i;
action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state, namely the action is $a_i^t=P_{E_i}^t$;
(3) Determining the objective function of each legal listener, namely the expected eavesdropping energy efficiency, based on the MADDPG algorithm in the cooperative active eavesdropping system;
in reinforcement learning, the policy is an action-selection strategy that optimizes long-term performance; therefore, the expected eavesdropping energy efficiency over a period of time T is defined as the objective function; the Q value is defined as the expected return of the agent selecting action a in state s starting from time t, and for agent i the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed; the optimal Q value is the maximum that can be reached when the optimal action is taken at every decision; the value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$; thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t; the optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$;
(4) Initializing network parameters and the required initial data;
in reinforcement learning, initial parameters are needed to start network training, so the parameters θ and ω of the actor network and the critic network are first initialized randomly; since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$; the initial state information $s_i^0$ is also initialized;
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power;
centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$; during training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners, so that even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy; each legal listener therefore updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved;
the actor network is updated as

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information; $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$; the actor network updates its policy according to the expected reward given by the critic network, i.e., if the action taken increases the expected reward reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise, so that the actor network is driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy;
the critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

wherein the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment; for legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent;
for legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

wherein τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update;
distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors interact with the environment, and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information;
the MADDPG algorithm is used to train the model in a centralized manner and actions are then executed in a distributed manner after training, so that during distributed execution the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, and the expected eavesdropping energy efficiency is optimized.
CN202010878680.2A 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning Active CN112087749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878680.2A CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878680.2A CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112087749A CN112087749A (en) 2020-12-15
CN112087749B true CN112087749B (en) 2023-06-02

Family

ID=73729707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878680.2A Active CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112087749B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113747442B (en) * 2021-08-24 2023-06-06 华北电力大学(保定) IRS-assisted wireless communication transmission method, device, terminal and storage medium
CN115296705B (en) * 2022-04-28 2023-11-21 南京大学 Active monitoring method in MIMO communication system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107948173A (en) * 2017-11-30 2018-04-20 华北电力大学(保定) A kind of monitor method
CN108235423A (en) * 2017-12-29 2018-06-29 中山大学 Wireless communication anti-eavesdrop jamming power control algolithm based on Q study
CN109088891A (en) * 2018-10-18 2018-12-25 南通大学 Legal listening method based on safety of physical layer under a kind of more relay systems
CN109302262A (en) * 2018-09-27 2019-02-01 电子科技大学 A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107948173A (en) * 2017-11-30 2018-04-20 华北电力大学(保定) A kind of monitor method
CN108235423A (en) * 2017-12-29 2018-06-29 中山大学 Wireless communication anti-eavesdrop jamming power control algolithm based on Q study
CN109302262A (en) * 2018-09-27 2019-02-01 电子科技大学 A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN109088891A (en) * 2018-10-18 2018-12-25 南通大学 Legal listening method based on safety of physical layer under a kind of more relay systems

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
UAV-Enabled Secure Communications by Multi-Agent Deep Reinforcement Learning; Yu Zhang et al.; IEEE Transactions on Vehicular Technology; full text *
Design of a cooperative eavesdropping scheme based on relaying and proactive jamming; Zhu Min; Zhang Dengyin; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), No. 3; full text *
A survey of deep reinforcement learning based on value functions and policy gradients; Liu Jianwei; Gao Feng; Luo Xionglin; Chinese Journal of Computers, No. 6; full text *

Also Published As

Publication number Publication date
CN112087749A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
Li et al. Multi-agent deep reinforcement learning based spectrum allocation for D2D underlay communications
CN112087749B (en) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
Kong et al. A reinforcement learning approach for dynamic spectrum anti-jamming in fading environment
CN113507328B (en) Time slot MAC protocol method, system, device and medium for underwater acoustic network
CN113225794B (en) Full-duplex cognitive communication power control method based on deep reinforcement learning
CN116744311B (en) User group spectrum access method based on PER-DDQN
CN111865474B (en) Wireless communication anti-interference decision method and system based on edge calculation
Saad et al. A cooperative Q-learning approach for online power allocation in femtocell networks
Tan et al. Deep reinforcement learning for channel selection and power control in D2D networks
Zhang et al. Deep reinforcement learning-empowered beamforming design for IRS-assisted MISO interference channels
Li et al. Reinforcement learning-based intelligent reflecting surface assisted communications against smart attackers
Xu et al. A new anti-jamming strategy based on deep reinforcement learning for MANET
Mafuta et al. Decentralized resource allocation-based multiagent deep learning in vehicular network
Huang et al. Fast spectrum sharing in vehicular networks: A meta reinforcement learning approach
Lu et al. A learning approach towards power control in full-duplex underlay cognitive radio networks
Zhang et al. Deep Deterministic Policy Gradient for End-to-End Communication Systems without Prior Channel Knowledge
Song et al. Deep Q-network based power allocation meets reservoir computing in distributed dynamic spectrum access networks
Wang et al. Joint Spectrum Allocation and Power Control in Vehicular Networks Based on Reinforcement Learning
Mallouh et al. A hierarchy of deep reinforcement learning agents for decision making in blockchain nodes
CN111741050A (en) Data distribution method and system based on roadside unit
Zhang et al. Multi-Agent Reinforcement Learning Based Channel Access Scheme for Underwater Optical Wireless Communication Networks
CN115866559B (en) Non-orthogonal multiple access auxiliary Internet of vehicles low-energy-consumption safe unloading method
Irkiçatal et al. Deep Reinforcement Learning Aided Rate-Splitting for Interference Channels
CN112867087B (en) Anti-interference method based on multiuser random forest reinforcement learning
Chen et al. RTE: Rapid and Reliable Trust Evaluation for Collaborator Selection and Time-Sensitive Task Handling in Internet of Vehicles

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant