CN112087749B - Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning - Google Patents

Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Info

Publication number
CN112087749B
CN112087749B
Authority
CN
China
Prior art keywords
legal
network
listener
eavesdropping
actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010878680.2A
Other languages
Chinese (zh)
Other versions
CN112087749A (en)
Inventor
李保罡
杨亚欣
张淑娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN202010878680.2A priority Critical patent/CN112087749B/en
Publication of CN112087749A publication Critical patent/CN112087749A/en
Application granted granted Critical
Publication of CN112087749B publication Critical patent/CN112087749B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application provides a cooperative active eavesdropping method in which multiple listeners are realized based on reinforcement learning. Each legal listener is treated as an energy-limited device that performs cooperative eavesdropping and jamming under a maximum interference power constraint; that is, a reinforcement learning based method maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information transmitted by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each legal listener.

Description

Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
Technical Field
The invention relates to the field of communication, in particular to a cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning.
Background
Many technologies for suspicious communications have been developed. Eavesdropping on suspicious links by lawful listeners, i.e., active eavesdropping, plays an important role in wireless communication security and has become a new research direction in this field.
In active eavesdropping systems, most current research uses a single lawful listener to eavesdrop on suspicious links. In an active eavesdropping system containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve the system's eavesdropping performance. In addition, most existing work does not consider the limited energy of legal listeners; in practice, a legal listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and may even cause eavesdropping to fail.
Disclosure of Invention
In order to solve the above technical problems, the application provides a cooperative active eavesdropping method in which multiple listeners are realized based on reinforcement learning. The application treats each legal listener as an energy-limited device that performs cooperative eavesdropping and jamming under a maximum interference power constraint; that is, a reinforcement learning based method maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information transmitted by the suspicious transmitter while maximizing the expected eavesdropping energy efficiency of each legal listener.
The specific technical scheme provided by the application is as follows:
a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning, the method comprising:
determining a primary parameter in the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners;
the realization of eavesdropping through cooperative transmission of interference power by the two legal listeners, based on the multi-agent deep deterministic policy gradient algorithm, specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem, the state dimension is high and the action space is continuous; when Q-value based iteration is used, the time and memory consumed in enumerating the state and action spaces would be prohibitive, so a deep neural network (DNN) is used to construct a function approximator and create a learning agent. In the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents. Each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, where the estimation networks and the target networks have identical structures, namely a fully connected DNN with two hidden layers whose activation functions are ReLU nonlinear activation functions, parameterized by a set of weights. The parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′. Finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks; once the networks converge, the target networks are no longer trained.
(2) In the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined.
State: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i.
Action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state; the action is $a_i^t=P_{E_i}^t$.
(3) The objective function of each legal listener, namely the expected eavesdropping energy efficiency, is determined based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection strategy that optimizes long-term performance. Therefore, the expected eavesdropping energy efficiency over a period of time T is taken as the objective function. The Q value is defined as the expected return of the agent selecting action a in state s starting from time t; for agent i, the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed. The optimal Q value is the maximum that can be reached when the optimal action is taken at every decision. The value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$. Thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t. The optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$.
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training; therefore, the parameters θ and ω of the actor network and the critic network are first initialized. Since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$. The initial state information $s_i^0$ is also initialized.
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power.
Centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$. During training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners. Thus, even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy. This means that each legal listener updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved.
The actor network is updated as follows:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information. $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$. The actor network updates its policy according to the expected reward given by the critic network: if the action taken increases the expected reward $Q_i^{\mu}$ reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise. The actor network is therefore driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy.
The critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

where the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment. For legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent.
For legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

where τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update.
Distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors need to interact with the environment (only the loop represented by the solid black line in Fig. 2 is required), and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information.
The MADDPG algorithm is used to train the model in a centralized manner, and actions are then executed in a distributed manner after training; during distributed execution, the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, so that the expected eavesdropping energy efficiency is optimized.
Compared with the prior art, the technical scheme has the following advantages:
according to the cooperative active interception method for realizing the multi-monitor based on reinforcement learning, each legal monitor is considered to be an energy-limited device, and cooperative interception and interference are carried out under the constraint of maximum interference power. I.e., reinforcement learning based methods maximize the expected eavesdropping energy efficiency of each lawful listener. The method mainly relates to the solution of the joint interference power distribution problem of cooperative active interception of two legal listeners by using reinforcement learning. Interference power is cooperatively transmitted through two legal listeners so as to realize successful interception of information transmitted by suspicious transmitters and maximize expected interception energy efficiency of each legal listener.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a collaborative active eavesdropping method for implementing multiple listeners based on reinforcement learning according to an embodiment of the present invention;
Fig. 2 is a flowchart of the MADDPG algorithm, taking two agents as an example, according to an embodiment of the present invention.
Detailed Description
In active eavesdropping systems, most current research uses a single lawful listener to eavesdrop on suspicious links. In an active eavesdropping system containing multiple lawful listeners, existing work does not consider that the listeners can eavesdrop and jam simultaneously in full-duplex mode to achieve successful eavesdropping and improve the system's eavesdropping performance. In addition, most existing research does not consider the limited energy of legal listeners; in practice, a legal listener is usually a power-limited device, and insufficient energy degrades eavesdropping performance and may even cause eavesdropping to fail.
The inventors find that, in active eavesdropping systems, existing research eavesdrops on suspicious links with a single legal listener and mostly ignores the energy limitation of the legal listener, which does not match the actual situation. Moreover, it does not consider that simultaneous eavesdropping and jamming by multiple legal listeners may further improve the system's eavesdropping performance.
In order to solve the above technical problems, the application provides a cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning, namely a reinforcement learning based method that maximizes the expected eavesdropping energy efficiency of each legal listener. The method mainly uses reinforcement learning to solve the joint interference power allocation problem of cooperative active eavesdropping by two legal listeners. The two legal listeners cooperatively transmit interference power to successfully eavesdrop on the information transmitted by the suspicious transmitter and to maximize the expected eavesdropping energy efficiency of each legal listener, which requires finding the optimal interference power allocation strategy of each legal listener. The eavesdropping energy efficiency function evaluates the relation between the eavesdropping rate of each legal listener and its interference power; the eavesdropping energy efficiency is calculated as the ratio of the eavesdropping rate to the power.
As shown in fig. 1, the present application proposes a cooperative active eavesdropping method for implementing multiple listeners based on reinforcement learning, including:
determining a primary parameter in the cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners.
The cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning is described in detail below.
First, the primary parameters in a cooperative active eavesdropping system are determined.
According to the wireless communication environment, and considering first that the channel state information changes dynamically, each legal listener needs to transmit interference power according to the channel state information in the environment. At time t, the information observed by each legal listener i comprises the channel power gain of the suspicious link $h_{TD}^t$, the channel power gain from the suspicious transmitter T to legal listener i $h_{TE_i}^t$, the channel power gain from legal listener i to the suspicious receiver D $h_{E_iD}^t$, and the power gain of the interfering channel received by legal listener i $h_{E_jE_i}^t$. The power transmitted by each legal listener is $P_{E_i}^t$.
In the second step, in the cooperative active eavesdropping system, the eavesdropping energy efficiency function of each legal listener is generated according to the channel power gain (namely the channel state information) at the moment t and the interference power of the legal listener.
The eavesdropping energy efficiency is the ratio of the eavesdropping rate to the transmitted interference power, i.e., the ratio of the rate at which each legal listener successfully eavesdrops on the suspicious link to the interference power it transmits. The eavesdropping rate is determined from the data transmission rate of the suspicious link and the data transmission rate of the eavesdropping link, and the data transmission rates are calculated according to the Shannon formula. The specific calculation process is as follows:
the signal-to-interference-plus-noise ratio (SINR) at the suspect receiver D is:
Figure BDA0002653430110000056
the signal to interference plus noise ratio (SINR) at lawful listener E1 is:
Figure BDA0002653430110000057
the signal to interference plus noise ratio (SINR) at lawful listener E2 is:
Figure BDA0002653430110000061
wherein P is T Is the power at which the suspect transmitter transmits a signal,
Figure BDA0002653430110000062
is the interference power, sigma, transmitted by lawful listener i 2 Is the noise power.
By using SINR, we can obtain the data transmission rates of the suspicious link and the lawful listener E1, E2 eavesdropping link according to shannon's formula as:
Figure BDA0002653430110000063
Figure BDA0002653430110000064
Figure BDA0002653430110000065
in lawful interception systems, if
$R_{E_i}^t\ge R_D^t$, the legal listener can decode the eavesdropped information sent by the suspicious transmitter T with an arbitrarily small error rate, giving an effective eavesdropping rate of $R_D^t$; if $R_{E_i}^t<R_D^t$, the legal listener cannot decode the information sent by the suspicious transmitter T, and the eavesdropping rate is 0. An indicator function is therefore introduced to indicate whether the two legal listeners eavesdrop successfully:

$$X_i^t=\begin{cases}1, & R_{E_i}^t\ge R_D^t,\\ 0, & \text{otherwise},\end{cases}$$

where, for legal listener i at time t, $X_i^t=1$ indicates successful eavesdropping; otherwise, eavesdropping fails.

The eavesdropping rate is defined as

$$R_{ev,i}^t=X_i^t\,R_D^t.$$

For cooperative active eavesdropping systems, the eavesdropping performance depends on the eavesdropping rate and the allocation of interference power. Thus, for the i-th listener, its eavesdropping energy efficiency is

$$\eta_i^t=\frac{R_{ev,i}^t}{P_{E_i}^t}.$$
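To make the above calculation concrete, the following Python sketch evaluates the SINRs, Shannon rates, success indicator and eavesdropping energy efficiency of the two legal listeners for one time slot. It is only an illustrative sketch under the notation reconstructed above; the helper name, the argument layout and the assumption that each listener is interfered only by the other listener's jamming are assumptions of this sketch, not values fixed by the patent.

```python
import numpy as np

def eavesdropping_energy_efficiency(P_T, P_E, h_TD, h_TE, h_ED, h_EE, noise=1e-9):
    """Eavesdropping energy efficiency eta_i^t of both legal listeners for one slot.

    P_T  : transmit power of the suspicious transmitter
    P_E  : [P_E1, P_E2], interference powers of the two legal listeners
    h_TD : channel power gain of the suspicious link T -> D
    h_TE : [h_TE1, h_TE2], gains from T to listener i
    h_ED : [h_E1D, h_E2D], gains from listener i to the suspicious receiver D
    h_EE : [h_E2E1, h_E1E2], interfering-channel gains received by listener i
    """
    P_E, h_TE, h_ED, h_EE = map(np.asarray, (P_E, h_TE, h_ED, h_EE))

    # SINR at the suspicious receiver D (jammed by both listeners)
    sinr_D = P_T * h_TD / (P_E[0] * h_ED[0] + P_E[1] * h_ED[1] + noise)
    # SINR at listener i (assumed interfered only by the other listener's jamming)
    sinr_E = P_T * h_TE / (P_E[::-1] * h_EE + noise)

    R_D = np.log2(1.0 + sinr_D)           # rate of the suspicious link
    R_E = np.log2(1.0 + sinr_E)           # eavesdropping-link rates
    success = (R_E >= R_D).astype(float)  # indicator X_i^t
    R_ev = success * R_D                  # effective eavesdropping rates
    return R_ev / np.maximum(P_E, 1e-12)  # eavesdropping energy efficiency eta_i^t
```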
thirdly, in the cooperative active eavesdropping system, a Multi-agent reinforcement learning algorithm-Multi-agent depth deterministic strategy gradient (Multi-agent Deep Deterministic Policy Gradient, MADDPG) algorithm of a cooperative scene is determined according to the problem of interference power distribution of two legal listeners to be solved.
In a multi-agent system, each agent needs to select its own action according to the state information of the environment. Traditional single-agent reinforcement learning methods cannot solve the cooperative active eavesdropping problem between the two legal listeners well, because each agent keeps changing during training, which makes the environment non-stationary from the perspective of each individual agent. That is, an agent only observes its own local state information, so when making decisions the same state-action pair may receive different reward values. In other words, the agents interact with each other, and each agent needs not only its own state information but also the state information and actions of the other agents, since such information and behavior affect its rewards. Thus, to solve the problem of cooperative active eavesdropping by two agents, i.e., the two legal listeners, the policy update of one listener should take the other listener's policy into account instead of relying only on its own actions. Here, each legal listener needs to determine its required interference power (action) according to the environment information, i.e., the state-action information of all legal listeners, so as to successfully eavesdrop on the information sent by the suspicious transmitter and maximize the expected eavesdropping energy efficiency of each legal listener. Given the characteristics of the cooperative active eavesdropping system and the structure of the problem to be solved, the MADDPG multi-agent reinforcement learning algorithm can solve the cooperative active eavesdropping problem well. After an agent obtains the state information and selects an action, the environment feeds back reward information to the agent, which is used to judge whether the action was good or bad.
The MADDPG algorithm adopts centralized training and distributed execution, and is suited to cooperative scenarios. That is, during training the critic can guide the training of the actor by observing global state information, i.e., it uses not only its own state-action information but also that of the other agents; during testing, actions are taken using only the actor. Fig. 2 shows the flow of the MADDPG algorithm, taking two agents as an example, where $S_{all}$ is the global state information of the two agents, $a_1, a_2$ denote the actions taken by agent 1 and agent 2 respectively, $r_1, r_2$ are the immediate rewards of the two agents after acting, and $S_1, S_2$ are the state information observed by agent 1 and agent 2 respectively.
In the fourth step, based on the MADDPG algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners.
(1) In the cooperative active eavesdropping system, the network structure of the MADDPG algorithm is established.
In the proposed cooperative active eavesdropping problem, the state dimension is high and the action space is continuous. When Q-value based iteration is used, the time and memory consumed in enumerating the state and action spaces would be prohibitive. A deep neural network (DNN) is therefore used to construct a function approximator and create a learning agent. In the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents. Each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, where the estimation networks and the target networks have identical structures, namely a fully connected DNN with two hidden layers whose activation functions are ReLU nonlinear activation functions, parameterized by a set of weights. The parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′. Finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks; once the networks converge, the target networks are no longer trained.
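As an illustration of the network structure just described (each agent owning actor and critic estimation networks plus target copies, each a fully connected DNN with two ReLU hidden layers), the following PyTorch sketch is one possible realization; the layer width, the sigmoid output scaling to a maximum interference power p_max, and the helper names are assumptions rather than details given in the patent.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a listener's local state to its interference power in [0, p_max]."""
    def __init__(self, state_dim, hidden=64, p_max=1.0):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # squash, then scale to the power range
        )

    def forward(self, state):
        return self.p_max * self.net(state)

class Critic(nn.Module):
    """Scores one listener's (global state, joint action) pair."""
    def __init__(self, global_state_dim, joint_action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, global_state, joint_action):
        return self.net(torch.cat([global_state, joint_action], dim=-1))

def make_agent(state_dim=4, n_agents=2):
    """One agent per legal listener: estimation networks plus target copies."""
    actor = Actor(state_dim)
    critic = Critic(state_dim * n_agents, n_agents)
    return {
        "actor": actor, "critic": critic,
        "actor_target": copy.deepcopy(actor),
        "critic_target": copy.deepcopy(critic),
    }

agents = [make_agent() for _ in range(2)]
```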
(2) In the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined.
State: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i.
Action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state; the action is $a_i^t=P_{E_i}^t$.
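As a concrete reading of these state and action definitions, the sketch below (continuing the notation and the Actor class assumed in the previous sketch; all names are hypothetical) packs the four observed channel power gains into the state vector of listener i and lets its actor output the interference power as the action, clipped to the maximum-power constraint.

```python
import numpy as np
import torch

def build_state(h_TD, h_TEi, h_EiD, h_EjEi):
    """State s_i^t = {h_TD, h_TE_i, h_E_iD, h_E_jE_i} as a flat vector."""
    return np.array([h_TD, h_TEi, h_EiD, h_EjEi], dtype=np.float32)

def select_action(actor, state, noise_std=0.0, p_max=1.0):
    """Action a_i^t = P_E_i^t: interference power chosen from the observed state,
    with optional exploration noise (used during training only)."""
    with torch.no_grad():
        power = actor(torch.from_numpy(state)).item()
    power += np.random.normal(0.0, noise_std)   # exploration noise delta
    return float(np.clip(power, 0.0, p_max))    # respect the maximum-power constraint
```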
(3) The objective function of each legal listener, namely the expected eavesdropping energy efficiency, is determined based on the MADDPG algorithm in the cooperative active eavesdropping system.
In reinforcement learning, a policy is an action-selection strategy that optimizes long-term performance. Therefore, the expected eavesdropping energy efficiency over a period of time T is taken as the objective function. The Q value is defined as the expected return of the agent selecting action a in state s starting from time t; for agent i, the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed. The optimal Q value is the maximum that can be reached when the optimal action is taken at every decision. The value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$. Thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t. The optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$.
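A minimal sketch of the objective just defined: given one listener's per-slot eavesdropping-energy-efficiency rewards, it computes the discounted sum that the actor parameters are trained to maximize (the horizon and discount factor below are illustrative values, not ones specified in the patent).

```python
def expected_eavesdropping_energy_efficiency(eta, gamma=0.95):
    """Discounted objective J = sum_t gamma^t * eta_i^t for one legal listener.

    eta : sequence of immediate eavesdropping-energy-efficiency rewards eta_i^t
          over a horizon of T time slots.
    """
    return sum((gamma ** t) * r for t, r in enumerate(eta))

# Example: three time slots of rewards for listener i.
print(expected_eavesdropping_energy_efficiency([0.8, 0.5, 0.0]))
```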
(4) Network parameters and the required initial data are initialized.
In reinforcement learning, initial parameters are required to start network training; therefore, the parameters θ and ω of the actor network and the critic network are first initialized. Since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$. The initial state information $s_i^0$ is also initialized.
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power.
Centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$. During training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners. Thus, even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy. This means that each legal listener updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved.
The actor network is updated as follows:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information. $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$. The actor network updates its policy according to the expected reward given by the critic network: if the action taken increases the expected reward $Q_i^{\mu}$ reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise. The actor network is therefore driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy.
The critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

where the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment. For legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent.
For legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

where τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update.
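The centralized update just described (critic regression toward the target value y, actor update along the deterministic policy gradient reported by the critic, and soft target updates with retention parameter τ) can be sketched as follows; this is a minimal illustration assuming the two-agent setup, the network classes and per-listener optimizers from the earlier sketches, not the exact patented procedure.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agents, optims, batch, gamma=0.95, tau=0.01):
    """One centralized training step for both legal listeners.

    batch  : tensors (states [B,2,4], actions [B,2], rewards [B,2], next_states [B,2,4])
    agents : per-listener dicts of networks (see the earlier make_agent sketch)
    optims : per-listener dicts {"actor": optimizer, "critic": optimizer} (assumed)
    """
    states, actions, rewards, next_states = batch
    B = states.shape[0]
    x, x_next = states.reshape(B, -1), next_states.reshape(B, -1)  # global states

    # Joint next actions from the actor *target* networks.
    with torch.no_grad():
        a_next = torch.cat([ag["actor_target"](next_states[:, i])
                            for i, ag in enumerate(agents)], dim=-1)

    for i, ag in enumerate(agents):
        # --- critic: minimize (y - Q_i(x, a1, a2))^2 ---
        with torch.no_grad():
            y = rewards[:, i:i+1] + gamma * ag["critic_target"](x_next, a_next)
        critic_loss = F.mse_loss(ag["critic"](x, actions), y)
        optims[i]["critic"].zero_grad(); critic_loss.backward(); optims[i]["critic"].step()

        # --- actor: ascend the policy gradient reported by the critic ---
        a_i = ag["actor"](states[:, i])
        a_pred = torch.cat([a_i if j == i else actions[:, j:j+1]
                            for j in range(len(agents))], dim=-1)
        actor_loss = -ag["critic"](x, a_pred).mean()
        optims[i]["actor"].zero_grad(); actor_loss.backward(); optims[i]["actor"].step()

        # --- soft update: theta' <- tau*theta + (1 - tau)*theta' ---
        for est, tgt in ((ag["actor"], ag["actor_target"]),
                         (ag["critic"], ag["critic_target"])):
            for p, p_t in zip(est.parameters(), tgt.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```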
Distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors need to interact with the environment (only the loop represented by the solid black lines in Fig. 2 is required), and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information.
The MADDPG algorithm is used to train the model in a centralized manner, and actions are then executed in a distributed manner after training; during distributed execution, the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, so that the expected eavesdropping energy efficiency is optimized.
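For the execution phase, only the two trained actors are needed; a minimal sketch (reusing the select_action helper assumed above, with exploration noise disabled, and an observe_state callback standing in for the channel measurements) is:

```python
def distributed_execution(actors, observe_state, n_slots=100):
    """After training, each legal listener acts from its own local observation only."""
    history = []
    for t in range(n_slots):
        states = [observe_state(i, t) for i in range(2)]  # local channel-gain observations
        # powers[i] is the interference power P_E_i^t that listener i transmits at slot t
        powers = [select_action(actors[i], states[i], noise_std=0.0) for i in range(2)]
        history.append(powers)
    return history
```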
The specific process of the MADDPG algorithm is as follows:
1) Initialize the parameters of the actor networks and the critic networks, including the estimation network parameters and the target network parameters; initialize the random noise δ used for action exploration;
2) Obtain the initial global state $x=(s_1, s_2)$, where, at time slot t, the state of agent i is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$; initialize the reward $r_i=0$;
3) For each agent i, observe the initial state $s_i^0=\{h_{TD}^0,\,h_{TE_i}^0,\,h_{E_iD}^0,\,h_{E_jE_i}^0\}$, where $h_{TD}^t$ is the channel power gain of the suspicious link at time t, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i;
4) For each agent i, select an action according to the state, $a_i^t=P_{E_i}^t=\mu_{\theta_i}(s_i^t)+\delta$, where $P_{E_i}^t$ denotes the interference power transmitted by legal listener i;
5) Execute the action and obtain the reward $r_i$ and the next state $x'$;
6) Store the experience $e(t)=(x,a,r,x')$ at time t in the experience replay unit D(t), from which small batches of samples are randomly drawn for training;
7) Update the actor network using the policy gradient:

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)};$$

8) Update the critic network by minimizing the loss function $L(\omega_i)$:

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big);$$

9) Update the next state: $x\leftarrow x'$;
10) Update the target network parameters $\theta_i',\,\omega_i'$ using the soft-update rule $\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i'$, $\omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i'$, and stop training upon convergence.
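Tying steps 1) to 10) together, one possible outer training loop looks as follows; the experience replay pool, episode and step counts, batch size and noise level are illustrative assumptions, env is a stand-in interface for the channel model described earlier, and maddpg_update / select_action refer to the earlier sketches.

```python
import random
from collections import deque

import numpy as np
import torch

def train(env, agents, optims, episodes=500, steps=200, batch_size=64, noise_std=0.1):
    """Centralized MADDPG training loop for the two legal listeners.

    env is assumed to expose reset() -> [s_1, s_2] and
    step([a_1, a_2]) -> (next_states, rewards), built from the channel model above.
    """
    replay = deque(maxlen=100_000)                        # experience replay pool D
    for ep in range(episodes):
        states = env.reset()                              # steps 2)-3): initial local states
        for t in range(steps):
            # step 4): each actor picks its interference power, plus exploration noise
            actions = [select_action(ag["actor"], s, noise_std)
                       for ag, s in zip(agents, states)]
            next_states, rewards = env.step(actions)      # step 5): eavesdropping-EE rewards
            replay.append((states, actions, rewards, next_states))  # step 6)
            states = next_states                          # step 9)

            if len(replay) >= batch_size:                 # steps 7)-8) and 10)
                batch = random.sample(replay, batch_size)
                s, a, r, s2 = (torch.tensor(np.array(z), dtype=torch.float32)
                               for z in zip(*batch))
                maddpg_update(agents, optims, (s, a, r, s2))
```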
According to the above method, the interference power decision can be made at each legal listener and the MADDPG algorithm can be implemented. When the two legal listeners take interference power actions according to their respective optimal expected eavesdropping energy efficiencies in the continuous action space, the joint interference power allocation strategy reaches an optimal equilibrium; that is, under the optimal interference power allocation strategy, each legal listener obtains its maximum expected eavesdropping energy efficiency.
As can be seen, the present application contemplates that each lawful listener is an energy-limited device that performs cooperative eavesdropping and interference under maximum interference power constraints.
Specifically, when allocating the joint interference power of the cooperative active eavesdropping system containing two legal listeners, the application also considers that the channel state information changes dynamically, and the MADDPG algorithm is used to maximize the expected eavesdropping energy efficiency of each legal listener; that is, the two listeners can cooperatively eavesdrop on the suspicious link while sending interference signals to the suspicious receiver to achieve successful eavesdropping. The MADDPG algorithm is used to select the interference power allocation decision autonomously, and through continual training it can achieve the goal of maximizing the expected eavesdropping energy efficiency.
The application can achieve the following beneficial effects:
(1) Under the interference power constraint, the legal listeners cooperatively transmit interference power so as to successfully eavesdrop on the information sent by the suspicious transmitter; compared with the eavesdropping energy efficiency of non-cooperative active eavesdropping, the expected eavesdropping energy efficiency of cooperative active eavesdropping by the legal listeners can be significantly improved.
(2) The MADDPG algorithm is well suited to the cooperative active eavesdropping scenario and to continuous action space problems. Its convergence speed can also be increased by appropriately raising the learning rate.
(3) The MADDPG algorithm is provided to optimize the interference power allocation decision, and the optimal interference power allocation strategy can be found under the conditions of large state action space and continuous action space, so that the expected eavesdropping energy efficiency of each legal listener is maximized.
In the present description, each part is described in a progressive manner, and each part is mainly described as different from other parts, and identical and similar parts between the parts are mutually referred.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. The cooperative active eavesdropping method for realizing the multi-listener based on reinforcement learning is characterized by comprising the following steps:
determining a primary parameter in a cooperative active eavesdropping system;
in the cooperative active eavesdropping system, generating an eavesdropping energy efficiency function of each legal listener according to channel state information at the moment t and the interference power emitted by the legal listener;
in the cooperative active eavesdropping system, a multi-agent reinforcement learning algorithm for the cooperative scenario, namely the multi-agent deep deterministic policy gradient algorithm, is determined according to the interference power allocation problem of the two legal listeners to be solved;
based on the multi-agent deep deterministic policy gradient algorithm, the cooperative active eavesdropping system realizes eavesdropping through cooperative transmission of interference power by the two legal listeners;
the realization of eavesdropping through cooperative transmission of interference power by the two legal listeners, based on the multi-agent deep deterministic policy gradient algorithm, specifically comprises the following steps:
(1) in the cooperative active eavesdropping system, establishing a network structure of an MADDPG algorithm;
since the state dimension is high in the proposed cooperative active eavesdropping problem and the action space is continuous, when Q-value based iteration is used the time and memory consumed in enumerating the state and action spaces would be prohibitive, so a deep neural network (DNN) is used to construct a function approximator and create a learning agent; in the proposed cooperative active eavesdropping system, the two legal listeners correspond to two agents, and each agent comprises 4 networks, namely an actor estimation network, an actor target network, a critic estimation network and a critic target network, wherein the estimation networks and the target networks have identical structures, namely a fully connected DNN comprising two hidden layers whose activation functions are ReLU nonlinear activation functions; the parameters of the actor estimation network are θ, and the parameters of the critic estimation network are ω; the parameters of the actor target network are θ′, and the parameters of the critic target network are ω′; finally, the actor and critic target networks update their parameters at regular intervals from the parameters of the corresponding estimation networks until convergence, after which the target networks are no longer trained;
(2) in the cooperative active eavesdropping system, the states and actions in the MADDPG algorithm are defined;
state: for each legal listener i, the state obtained from the environment is $s_i^t=\{h_{TD}^t,\,h_{TE_i}^t,\,h_{E_iD}^t,\,h_{E_jE_i}^t\}$, where $h_{TD}^t$ denotes the channel power gain of the suspicious link, $h_{TE_i}^t$ denotes the channel power gain from the suspicious transmitter T to legal listener i, $h_{E_iD}^t$ denotes the channel power gain from legal listener i to the suspicious receiver D, and $h_{E_jE_i}^t$ denotes the power gain of the interfering channel received by legal listener i;
action: for each legal listener i, the interference power to transmit must be chosen according to the observed environment state, namely the action is $a_i^t=P_{E_i}^t$;
(3) Determining the objective function of each legal listener, namely the expected eavesdropping energy efficiency, based on the MADDPG algorithm in the cooperative active eavesdropping system;
in reinforcement learning, the policy is an action-selection strategy that optimizes long-term performance; therefore, the expected eavesdropping energy efficiency over a period of time T is defined as the objective function; the Q value is defined as the expected return of the agent selecting action a in state s starting from time t, and for agent i the Q value is

$$Q_i^{\mu}(s,a)=\mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}\,r_i^{t+k}\,\Big|\,s^t=s,\;a^t=a\Big],$$

where $r_i^t$ is the immediate reward of agent i and $\mu_{\theta_i}(s)$ is the behavior policy of agent i in state s, whose output is the action to be performed; the optimal Q value is the maximum that can be reached when the optimal action is taken at every decision; the value function uses a DNN to construct the learning agent, and the resulting value-function approximator is $Q_i(s,a;\omega_i)$; thus, the expected eavesdropping energy efficiency of each legal listener is

$$J(\theta_i)=\mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}\,\eta_i^{t}\Big],$$

where $\gamma\in(0,1)$ is a discount factor, $\theta_i$ are the parameters of the actor estimation network of legal listener i, and $\eta_i^t$ is the immediate eavesdropping-energy-efficiency reward of legal listener i at time t; the optimal policy is $\mu_{\theta_i}^{*}=\arg\max_{\theta_i} J(\theta_i)$;
(4) Initializing network parameters and the required initial data;
in reinforcement learning, initial parameters are needed to start network training, so the parameters θ and ω of the actor network and the critic network are first initialized randomly; since there is no reward value at the initial time, the reward of legal listener i is $r_i^0=0$, i.e., the initial eavesdropping energy efficiency is $\eta_i^0=0$; the initial state information $s_i^0$ is also initialized;
(5) Cooperative interference power decision in the cooperative active eavesdropping system: the two legal listeners cooperate to transmit interference power;
centralized training: the input to the critic estimation network is the state and action information of the two legal listeners, i.e., $x=\{s_1,s_2\}$ and $a=\{a_1,a_2\}$; during training, the critic networks of the two legal listeners can obtain the full state information as well as the policy actions taken by both legal listeners, so that even though each actor cannot obtain all information and does not know the other actors' policies, it still has a critic with global information to guide it in optimizing its policy; each legal listener therefore updates its own policy with knowledge of the other legal listener's policy, so that cooperative eavesdropping on the suspicious link by the two legal listeners can be achieved;
the actor network is updated as

$$\nabla_{\theta_i} J(\theta_i)\approx\frac{1}{M}\sum_{j=1}^{M}\nabla_{\theta_i}\mu_i\big(s_i^j\big)\,\nabla_{a_i}Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big|_{a_i=\mu_i(s_i^j)},$$

where M denotes the number of samples randomly drawn from the experience replay pool and, for the i-th legal listener, the superscript j indexes the sampled values that include the other legal listener's information; $Q_i^{\mu}(x^j,a_1^j,a_2^j;\omega_i)$ denotes the expected reward value that the critic network reports to the actor on the basis of the global state information, and $x^j$ denotes the state information of legal listener i together with the other legal listener j, i.e., $x^j=\{s_1,s_2\}$; the actor network updates its policy according to the expected reward given by the critic network, i.e., if the action taken increases the expected reward reported by the critic, the actor increases the value along this policy-gradient direction, and decreases it otherwise, so that the actor network is driven along the direction of policy-gradient ascent to update its parameters θ, and $\nabla_{\theta_i}\mu_i$ denotes the gradient of the policy;
the critic network loss function is

$$L(\omega_i)=\frac{1}{M}\sum_{j=1}^{M}\Big(y^j-Q_i^{\mu}\big(x^j,a_1^j,a_2^j;\omega_i\big)\Big)^{2},\qquad y=r_i+\gamma\,Q_i^{\mu'}\big(x',a_1',a_2';\omega_i'\big),$$

wherein the loss function is the square of the difference between the true Q value and the estimated Q value, $a_i$ denotes the action taken by legal listener i in the current state, $y$ denotes the true (target) value computed for each sample, $r_i$ is the immediate reward, the next actions $a_k'=\mu_k'(s_k')$ are given by the target-network policies under the target network parameters $\theta_k'$, $x'$ is the global state information at the next moment, and $a_i'$ denotes the action taken at the next moment; for legal listener i, the critic network is updated by minimizing its loss function to update the parameters $\omega_i$, i.e., the gradient of $L(\omega_i)$ with respect to $\omega_i$ is computed and the parameters are updated along the direction of gradient descent;
for legal listener i, the parameters of the actor target network and the critic target network are updated periodically using a soft update:

$$\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i',\qquad \omega_i'\leftarrow\tau\omega_i+(1-\tau)\omega_i',$$

wherein τ denotes the retention parameter, i.e., the degree to which the estimation network parameters are retained during the target network parameter update;
distributed execution: after the model is trained, i.e., the parameters have converged and no longer change, only the two actors interact with the environment, and the two legal listeners take actions, i.e., the interference power to be transmitted, according to the obtained state information;
the MADDPG algorithm is used to train the model in a centralized manner and actions are then executed in a distributed manner after training, so that during distributed execution the trained model enables the two legal listeners to determine the allocation of interference power cooperatively, and the expected eavesdropping energy efficiency is optimized.
CN202010878680.2A 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning Active CN112087749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878680.2A CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878680.2A CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112087749A CN112087749A (en) 2020-12-15
CN112087749B true CN112087749B (en) 2023-06-02

Family

ID=73729707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878680.2A Active CN112087749B (en) 2020-08-27 2020-08-27 Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112087749B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113747442B (en) * 2021-08-24 2023-06-06 华北电力大学(保定) IRS-assisted wireless communication transmission method, device, terminal and storage medium
CN115296705B (en) * 2022-04-28 2023-11-21 南京大学 Active monitoring method in MIMO communication system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107948173A (en) * 2017-11-30 2018-04-20 华北电力大学(保定) A kind of monitor method
CN108235423A (en) * 2017-12-29 2018-06-29 中山大学 Wireless communication anti-eavesdrop jamming power control algolithm based on Q study
CN109088891A (en) * 2018-10-18 2018-12-25 南通大学 Legal listening method based on safety of physical layer under a kind of more relay systems
CN109302262A (en) * 2018-09-27 2019-02-01 电子科技大学 A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107948173A (en) * 2017-11-30 2018-04-20 华北电力大学(保定) A kind of monitor method
CN108235423A (en) * 2017-12-29 2018-06-29 中山大学 Wireless communication anti-eavesdrop jamming power control algolithm based on Q study
CN109302262A (en) * 2018-09-27 2019-02-01 电子科技大学 A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN109088891A (en) * 2018-10-18 2018-12-25 南通大学 Legal listening method based on safety of physical layer under a kind of more relay systems

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
UAV-Enabled Secure Communications by Multi-Agent Deep Reinforcement Learning; Yu Zhang et al.; IEEE Transactions on Vehicular Technology; full text *
Design of a cooperative eavesdropping scheme based on relaying and proactive jamming; Zhu Min; Zhang Dengyin; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), No. 3; full text *
A survey of deep reinforcement learning based on value functions and policy gradients; Liu Jianwei; Gao Feng; Luo Xionglin; Chinese Journal of Computers, No. 6; full text *

Also Published As

Publication number Publication date
CN112087749A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
Li et al. Multi-agent deep reinforcement learning based spectrum allocation for D2D underlay communications
CN112087749B (en) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
Kong et al. A reinforcement learning approach for dynamic spectrum anti-jamming in fading environment
CN113507328B (en) Time slot MAC protocol method, system, device and medium for underwater acoustic network
CN113225794B (en) Full-duplex cognitive communication power control method based on deep reinforcement learning
CN116744311B (en) User group spectrum access method based on PER-DDQN
CN111865474B (en) Wireless communication anti-interference decision method and system based on edge calculation
Saad et al. A cooperative Q-learning approach for online power allocation in femtocell networks
Tan et al. Deep reinforcement learning for channel selection and power control in D2D networks
Zhang et al. Deep reinforcement learning-empowered beamforming design for IRS-assisted MISO interference channels
Li et al. Reinforcement learning-based intelligent reflecting surface assisted communications against smart attackers
Xu et al. A new anti-jamming strategy based on deep reinforcement learning for MANET
Mafuta et al. Decentralized resource allocation-based multiagent deep learning in vehicular network
Huang et al. Fast spectrum sharing in vehicular networks: A meta reinforcement learning approach
Lu et al. A learning approach towards power control in full-duplex underlay cognitive radio networks
Zhang et al. Deep Deterministic Policy Gradient for End-to-End Communication Systems without Prior Channel Knowledge
Song et al. Deep Q-network based power allocation meets reservoir computing in distributed dynamic spectrum access networks
Wang et al. Joint Spectrum Allocation and Power Control in Vehicular Networks Based on Reinforcement Learning
Mallouh et al. A hierarchy of deep reinforcement learning agents for decision making in blockchain nodes
CN111741050A (en) Data distribution method and system based on roadside unit
Zhang et al. Multi-Agent Reinforcement Learning Based Channel Access Scheme for Underwater Optical Wireless Communication Networks
CN115866559B (en) Non-orthogonal multiple access auxiliary Internet of vehicles low-energy-consumption safe unloading method
Irkiçatal et al. Deep Reinforcement Learning Aided Rate-Splitting for Interference Channels
CN112867087B (en) Anti-interference method based on multiuser random forest reinforcement learning
Chen et al. RTE: Rapid and Reliable Trust Evaluation for Collaborator Selection and Time-Sensitive Task Handling in Internet of Vehicles

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant