CN109302262B - Communication anti-interference method based on depth determination gradient reinforcement learning - Google Patents

Communication anti-interference method based on depth determination gradient reinforcement learning Download PDF

Info

Publication number
CN109302262B
CN109302262B CN201811129485.9A CN201811129485A
Authority
CN
China
Prior art keywords
interference
neural network
strategy
actor
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811129485.9A
Other languages
Chinese (zh)
Other versions
CN109302262A (en)
Inventor
黎伟
王军
李黎
党泽
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
CETC 54 Research Institute
Original Assignee
University of Electronic Science and Technology of China
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, CETC 54 Research Institute filed Critical University of Electronic Science and Technology of China
Priority to CN201811129485.9A priority Critical patent/CN109302262B/en
Publication of CN109302262A publication Critical patent/CN109302262A/en
Application granted granted Critical
Publication of CN109302262B publication Critical patent/CN109302262B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04KSECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00Jamming of communication; Counter-measures
    • H04K3/20Countermeasures against jamming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04KSECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00Jamming of communication; Counter-measures
    • H04K3/40Jamming having variable characteristics

Abstract

The invention belongs to the technical field of wireless communication and relates to a communication anti-interference method based on deep deterministic policy gradient reinforcement learning. First, an interference environment model is constructed according to the number of interference sources and the wireless channel model; a utility function is constructed from the communication quality index of the legitimate user and used as the reward in learning; and the spectrum information sampled in different time slots is assembled into a spectrum time-slot matrix that describes the interference environment state. Then, following the deep deterministic policy gradient reinforcement learning mechanism, convolutional neural networks are constructed, and when an anti-interference decision is made the environment state matrix is passed through the target actor convolutional neural network to select an anti-interference strategy for the corresponding state over a continuous space. Continuous anti-interference strategy selection in communication is thus completed by the deep deterministic policy gradient reinforcement learning mechanism. The quantization error caused by discretizing the strategy space is avoided, the number of neural network output units and the network complexity are reduced, and the performance of the anti-interference algorithm is improved.

Description

Communication anti-interference method based on depth determination gradient reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication and relates to a communication anti-interference method based on deep deterministic policy gradient reinforcement learning.
Background
With the development of wireless communication technology, the electromagnetic environment faced by wireless communication systems is increasingly complex and harsh: a system may suffer both unintentional interference from friendly communications and interference signals intentionally released by an adversary. Traditional anti-interference means adopt a fixed anti-interference strategy against a static interference pattern of the interference source. As interference means become intelligent, an interference source can dynamically adjust its interference strategy according to changes in the communication state of the legitimate user, so traditional anti-interference methods cannot guarantee normal communication of the legitimate user in a dynamic interference environment. It is therefore necessary to adopt a correspondingly intelligent anti-interference strategy against the dynamic interference strategy of the interference source to ensure normal communication of the legitimate user in a dynamic interference environment.
At present, methods for dynamically adjusting the anti-interference strategy against an interference source mainly rely on reinforcement learning. First, the anti-interference strategy space is discretized to construct an anti-interference strategy set; second, a utility function related to the communication quality of the legitimate user is constructed, and an environment state matrix is obtained through spectrum sampling and preprocessing; a discrete strategy is then selected from the environment state matrix by a deep neural network; finally, the selected strategy is applied to the environment, the environment state transition is estimated, and the optimal communication strategy under the dynamic interference strategy is obtained through repeated learning. See, for example, Xin Liu et al., "Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach," IEEE Communications Letters, vol. 22, no. 5, May 2018. Moreover, when the transmit power on different sub-channels is discretized, the quantization rule requires the constructed strategy set to contain N × L elements, where N is the number of channels and L is the number of quantization levels, and the corresponding deep neural network needs L^N outputs. When the number of system channels and quantization levels is large, the number of neural network outputs grows exponentially, which increases the complexity of training the neural network and of selecting strategies based on the ε-greedy policy.
Disclosure of Invention
Aiming at the above technical problems, the invention provides a communication anti-interference power selection method based on a deep deterministic policy gradient (DDPG) reinforcement learning mechanism. Without discretizing the power strategy space, the anti-interference power strategy is selected deterministically, which improves anti-interference performance and reduces strategy selection complexity.
The invention first constructs an interference environment according to the number of interference sources and the wireless channel model. A utility function is constructed from the communication quality index of the legitimate user and used as the reward in learning. The spectrum information sampled in different time slots is assembled into a spectrum time-slot matrix, and this matrix describes the interference environment state. Four deep neural networks are constructed in the invention: a target actor (target_actor), an estimated actor (estimated_actor), a target critic (target_critic) and an estimated critic (estimated_critic), used respectively for strategy selection based on the environment state matrix, strategy-selection network training, strategy selection evaluation, and evaluation network training. The target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same network structure. The environment state matrix is passed through the target actor neural network to output an anti-interference strategy. The legitimate user adjusts the transmit power and selects channels, realizing intelligent adjustment of the anti-interference strategy. The return function value and the transition environment state matrix are calculated from the wireless interference environment model and the anti-interference strategy. The current environment state, the current anti-interference strategy, the return function value and the transition environment state form an experience group that is stored in an experience pool. Finally, experience groups are sampled from the experience pool to train the estimated actor neural network and the estimated critic neural network. When the number of learning steps reaches a certain value, the target actor neural network and the target critic neural network are updated with the parameters of the estimated actor and estimated critic neural networks, respectively. This learning mechanism continues until the learning results converge.
The method for realizing the intelligent anti-interference scheme of the legal user comprises the following steps:
s1, defining each algorithm module of the intelligent anti-interference scheme: the method comprises the following steps of interference environment definition, interference environment state definition, return function definition, anti-interference strategy definition and experience storage pool definition.
S2, constructing four deep neural networks: a target actor neural network (target_actor), an estimated actor neural network (estimated_actor), a target critic neural network (target_critic) and an estimated critic neural network (estimated_critic). The target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same structure.
And S3, the environment state information, namely the spectrum time sequence matrix, is passed through the target actor neural network to obtain the anti-interference strategy; the strategy is applied to the interference environment, the return value and the transition state matrix of the anti-interference strategy in the current interference environment are calculated, and both are stored.
And S4, the estimated actor neural network and the estimated critic neural network are trained and their parameters updated by sampling experience groups from the experience pool.
S5, judging whether the learning mechanism meets the stop condition, if so, stopping learning to obtain the final anti-interference strategy; otherwise, go back to S2 to continue learning.
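To make the flow of steps S1-S5 easier to follow, the minimal Python sketch below lays out the learning loop with toy stand-ins (a random spectrum environment, a fixed linear map in place of the target actor, small matrix sizes); every name and dimension in it is an illustrative assumption, not part of the patented embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CH, T_OBS, P_MAX = 8, 8, 1.0          # toy sizes; the embodiment uses 128 x 128

# S1: toy interference environment and experience pool
def env_step(power_vec):
    """Return (reward, observed spectrum row) for a chosen power vector (toy model)."""
    interference = rng.random(N_CH)                 # stand-in for sampled jammer power
    sinr = power_vec / (interference + 0.1)
    reward = np.log2(1.0 + sinr).sum() - 0.1 * power_vec.sum()
    return reward, power_vec + interference

pool = []                                           # experience groups (S, A, R, S_)

# S2: stand-in "target actor": a fixed linear map from state matrix to power vector
W = rng.normal(scale=0.01, size=(T_OBS * N_CH, N_CH))
def target_actor(state):
    return np.clip(state.reshape(-1) @ W + 0.5, 0.0, P_MAX)

state = rng.random((T_OBS, N_CH))                   # spectrum time-slot matrix
for step in range(200):
    # S3: select a strategy, act on the environment, store the experience group
    action = target_actor(state)
    reward, new_row = env_step(action)
    next_state = np.vstack([new_row, state[:-1]])
    pool.append((state, action, reward, next_state))
    # S4: sample a batch and (in the real method) train the estimated networks,
    #     periodically overwriting the target networks with the estimated parameters
    if len(pool) >= 32:
        batch = [pool[i] for i in rng.choice(len(pool), 32, replace=False)]
    # S5: stop when the reward has converged (fixed horizon in this sketch)
    state = next_state
```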
According to an embodiment of the present invention, the above step S1 includes the steps of:
s1.1, interference environment definition: an interference environment is defined according to the number of interferers, the interference mode and the wireless channel model.
S1.2, interference environment state definition: and forming a spectrum time slot matrix by spectrum information measured by different time slots, wherein the size of the spectrum time slot matrix is determined by an observation spectrum range and an observation time slot length.
S1.3, return function definition: and constructing a feedback return function according to the communication quality index of the legal user.
S1.4, anti-interference strategy definition: and defining the combination of the transmit powers on different sub-channels as the anti-interference strategy set. The transmit power on each subchannel may take any value in the continuous interval [0, p_max^m], where p_max^m is the maximum transmit power of subchannel m.
S1.5, experience storage pool definition: an experience storage pool with a fixed size is preset and used for storing experience groups, each consisting of a current environment state matrix, an anti-interference strategy, a return function value and a transition environment state matrix.
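A minimal sketch of the fixed-size experience storage pool of S1.5, assuming a Python deque whose maxlen implements the overwrite-the-oldest behaviour; the class name ExperiencePool and its method names are illustrative choices for this sketch.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity pool of experience groups (S, A, R, S_)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest groups are dropped automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Randomly draw a batch of experience groups for training."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# usage: pool = ExperiencePool(capacity=10000); pool.store(S, A, R, S_)
#        batch = pool.sample(64) once len(pool) >= 64
```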
According to an embodiment of the present invention, the step S2 includes the following steps:
and S2.1, constructing a target actor neural network and an estimated actor neural network by adopting the convolutional neural networks with the same structure. The convolutional neural network includes a plurality of convolutional layers, a plurality of pooling layers, and a plurality of fully-connected layers. And the target actor neural network completes the selection of an anti-interference strategy according to the input spectrum time slot state matrix. And estimating the actor neural network to complete network training and parameter updating according to the sampling experience group. And when the training steps reach a preset value, covering the target actor neural network parameters with the estimated actor neural network parameters so as to finish the parameter updating of the target actor neural network.
And S2.2, constructing a target critic neural network and an estimated critic neural network using conventional deep neural networks with the same structure. The deep neural network comprises a plurality of neural network layers, each containing a plurality of neurons and activation functions. The output of the target critic neural network is used to help evaluate the quality of the actor neural network's strategy selection. The estimated critic neural network performs network training and parameter updating according to the sampled experience information. When the number of training steps reaches a preset value, the estimated critic neural network parameters overwrite the target critic neural network to complete the parameter update.
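The pairing of structurally identical target/estimated networks, and the hard overwrite of the target parameters by the estimated parameters, can be sketched in PyTorch as follows; the stand-in layer sizes are placeholders and do not reproduce the structures of Figs. 2 and 3.

```python
import torch.nn as nn

def make_actor():
    # placeholder structure; the embodiment uses the convolutional network of Fig. 2
    return nn.Sequential(nn.Flatten(), nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 4), nn.Sigmoid())

estimated_actor = make_actor()
target_actor = make_actor()                                   # same structure, separate parameters
target_actor.load_state_dict(estimated_actor.state_dict())    # start from identical weights

def hard_update(target, estimated):
    """Overwrite the target network parameters with the estimated network parameters."""
    target.load_state_dict(estimated.state_dict())

# after every preset number of training steps:
# hard_update(target_actor, estimated_actor); hard_update(target_critic, estimated_critic)
```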
According to an embodiment of the present invention, the above step S3 includes the steps of:
and S3.1, obtaining an anti-interference strategy through the target actor neural network constructed in the step S2.1 by the environment state matrix according to the definition of the environment state in the step S1.2. And applying an anti-interference strategy to the interference environment defined in the step S1.1, and calculating a return function value and a state matrix after next transfer.
And S3.2, defining an experience pool with the capacity of M, and storing an experience group (S, A, R, S _) formed by the current environment state, the selected strategy behavior, the obtained return function value and the next environment state in the S3.1 in the experience pool.
According to an embodiment of the present invention, the above step S4 includes the steps of:
and S4.1, randomly extracting a certain number of experience groups from the experience pool obtained in the S3.2 for training and updating the parameters of the convolutional neural network.
And S4.2, according to the current state S and the next state S_ in the experience group extracted in step S4.1, obtaining two corresponding state-action values through the target neural network and the estimated neural network. A loss function is constructed from the current return function value and the two state-action values, and network training and updating of the estimated critic neural network are completed by minimizing this loss function.
And S4.3, obtaining the state-action value of the current state S in the experience group extracted in step S4.1 through the estimated critic neural network, and obtaining the state-action value corresponding to the current state S and the strategy A in the experience group extracted in step S4.1 through the target actor neural network. A loss function is constructed from these two state-action values, and the estimated actor neural network is trained and its parameters updated.
The invention has the beneficial effects that:
the invention completes the selection of continuous anti-interference strategies in communication based on the reinforcement learning mechanism of the depth determination strategy gradient strategy. The quantization error caused by the quantization discrete processing strategy space is overcome, the grid number of the output unit of the neural network and the complexity of the network are reduced, and the performance of the anti-interference algorithm is improved.
Drawings
FIG. 1 is the processing framework of the anti-interference strategy selection algorithm based on the deep deterministic policy gradient reinforcement learning mechanism designed by the invention
FIG. 2 shows the structure of a target actor neural network and an estimated actor neural network designed according to the present invention
FIG. 3 is a diagram of a target critic neural network and an estimated critic neural network structure designed by the present invention
Fig. 4 is a comparison of the performance of the algorithm designed by the present invention with that of optimal strategy selection, random strategy selection and a DQN-based discretization decision method.
Detailed Description
In order to make the steps of the present invention more detailed and clear, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Example one
Fig. 1 is a specific implementation method of the algorithm of the present invention, and the following describes each step and its principle in detail with reference to fig. 1.
The algorithm implementation framework of the continuous-strategy-selection anti-interference method based on deep deterministic policy gradient reinforcement learning is shown in FIG. 1. In step S1, interference and radio environment modeling is completed in S1.1. In the considered scenario, multiple interference sources interfere with a legitimate communication link, and the interference may include, but is not limited to, five types: single-tone interference, multi-tone interference, linear frequency sweep interference, partial-band interference and noise frequency modulation interference. The interference source can dynamically adjust its interference against the legitimate user by adjusting interference parameters or switching interference modes. The five interference modes are mathematically modeled as follows:
(1) single tone interference
The complex baseband expression of the single-tone interfering signal is:

J(t) = A·exp(j(2π·f_J·t + φ_J))

where A is the amplitude of the single-tone interference signal, f_J is the single-tone interference signal frequency, and φ_J is the initial phase of the single-tone interference.
(2) Multitone interference
The complex baseband expression of the multi-tone interference signal is:

J(t) = Σ_{m=1}^{M} A_m·exp(j(2π·f_m·t + φ_m))

where M is the number of tones, A_m is the amplitude of the m-th single tone in the multi-tone interference, f_m is the frequency of the m-th single tone, and φ_m is the initial phase of the m-th single tone.
(3) Linear swept frequency interference
The complex baseband expression of the linear sweep interference signal is:

J(t) = A·exp(j(2π·f_0·t + π·k·t² + φ_0)),  0 ≤ t ≤ T

where A is the amplitude, f_0 is the initial frequency, k is the frequency modulation (sweep) coefficient, φ_0 is the initial phase, and T is the signal duration.
(4) Partial band interference
The partial-band noise interference appears as Gaussian white noise within part of the band; its complex baseband expression is:

J(t) = U_n(t)·exp(j(2π·f_J·t + φ(t)))

where U_n(t) is baseband noise with zero mean and variance σ_n², f_J is the center frequency of the signal, and φ(t) is a phase uniformly distributed on [0, 2π] and independent of U_n(t).
(5) Noise frequency modulation interference
The complex baseband of the noise frequency-modulated signal can be represented as:

J(t) = A·exp(j(2π·f_0·t + 2π·k_fm·∫_0^t ξ(τ)dτ + φ))

where A is the amplitude of the noise FM signal, f_0 is its carrier frequency, k_fm is the frequency modulation index, and ξ(t) is zero-mean narrow-band Gaussian white noise with variance σ_ξ². The term ∫_0^t ξ(τ)dτ is a Wiener process with a Gaussian distribution. The frequency modulation index k_fm and the variance σ_ξ² together determine the effective bandwidth of the noise modulation.
The interference source dynamically selects the interference mode and the corresponding parameters so as to maximize the interference effect.
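To make the five interference models concrete, the NumPy sketch below generates complex-baseband samples for each of them following the expressions reconstructed above; the sample rate, amplitudes, frequencies and variances are arbitrary example values, not parameters prescribed by the invention.

```python
import numpy as np

fs = 1e6                                    # sample rate (example value)
t = np.arange(0, 1e-3, 1 / fs)              # 1 ms of samples
rng = np.random.default_rng(0)

def single_tone(A=1.0, fJ=100e3, phi=0.0):
    return A * np.exp(1j * (2 * np.pi * fJ * t + phi))

def multi_tone(amps=(1.0, 0.5), freqs=(80e3, 150e3), phases=(0.0, 0.3)):
    return sum(A * np.exp(1j * (2 * np.pi * f * t + p))
               for A, f, p in zip(amps, freqs, phases))

def linear_sweep(A=1.0, f0=50e3, k=2e8, phi=0.0):
    # instantaneous frequency f0 + k*t over the signal duration
    return A * np.exp(1j * (2 * np.pi * f0 * t + np.pi * k * t**2 + phi))

def partial_band(sigma=1.0, fJ=120e3):
    Un = rng.normal(0.0, sigma, t.size)                 # zero-mean baseband noise
    phase = rng.uniform(0.0, 2 * np.pi, t.size)         # phases uniform on [0, 2*pi]
    return Un * np.exp(1j * (2 * np.pi * fJ * t + phase))

def noise_fm(A=1.0, f0=100e3, kfm=5e4, sigma=1.0):
    xi = rng.normal(0.0, sigma, t.size)                 # modulating Gaussian noise
    integral = np.cumsum(xi) / fs                       # approximates the Wiener-process term
    return A * np.exp(1j * (2 * np.pi * f0 * t + 2 * np.pi * kfm * integral))

jammers = {"single tone": single_tone(), "multi-tone": multi_tone(),
           "linear sweep": linear_sweep(), "partial band": partial_band(),
           "noise FM": noise_fm()}
```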
The legitimate user's anti-interference procedure computes the return function value R and the environment state matrix S by sampling the wireless spectrum information in the environment; a historical experience group is constructed from the return function value, the environment state, the current anti-interference strategy and the next transition state matrix and stored in the experience pool; the neural network selects the next anti-interference action according to the current environment state matrix, applies the anti-interference strategy to the environment, and updates its parameters according to historical experience; the whole algorithm iterates until it converges. The specific implementation steps of the algorithm are as follows:
in the invention, steps S1.2, S1.3 and S1.4 respectively complete the design of the environment state, the design of the return function and the design of the anti-interference strategy. In the case of multiple sub-channels, the signal received on a sub-channel by the receiving end of a legal link can be represented as:
y_t^m = h_t^m·x_t + Σ_{j=1}^{J} h_{j,t}^m·x_j + n_t^m

where m ∈ {1, …, N} is the channel index number, N is the number of channels, x_t is the useful transmitted signal, x_j is the signal of interference source j, j ∈ {1, …, J} is the interference source index number, J is the number of interference sources, and t is the time-slot index; h_t^m denotes the channel between the legitimate communication users, h_{j,t}^m denotes the interference channel from interference source j to the legitimate user's receiver, and n_t^m is the receiver noise. Therefore, the signal-to-interference-and-noise ratio and the achievable rate available to the receiving end of the legitimate user can be expressed as:

SINR_t^m = |h_t^m|²·p_t^m / (Σ_{j=1}^{J} |h_{j,t}^m|²·p_{j,t}^m + σ_m²)

r_t^m = log₂(1 + SINR_t^m)

where |h_t^m|² is the equivalent channel gain on the sub-channel, p_t^m and p_{j,t}^m are the corresponding legitimate and interfering transmit powers, and σ_m² is the corresponding noise power. The achievable rate at time t at the receiving end can be expressed as the sum of the rates on the N subchannels:

R_t = Σ_{m=1}^{N} r_t^m
before an anti-interference decision is made, the corresponding power on each subchannel is obtained by sampling the wireless environment, and the power of all the subchannels forms a power vector P ═ Pt,1,pt,2,…,pt,N]Where N corresponds to the number of subchannels. The state matrix S is formed by a plurality of historical power vectors St=[Pt-1Pt-2…Pt-t]TWhere t is the observation time window. Meanwhile, the limit of the anti-interference strategy on the transmission power is considered, the return function designed in the invention considers the gain and power overhead of the adopted anti-interference strategy on the signal-to-interference-and-noise ratio at the same time, and the specific expression is as follows:
Figure BDA0001813149920000072
wherein
Figure BDA0001813149920000073
Is the interference power of the interferer on the channel; function(s)
Figure BDA0001813149920000079
Is shown when fjWhen m, 1 is output, otherwise 0 is output;
Figure BDA0001813149920000074
is the transmit power overhead.
Owing to the influence of the interference sources, the interference strength p_j^J on certain sub-channels may be large, and the transmit power on the corresponding channels can be adjusted so that the communication quality of the link is maximized within a controllable power range. The anti-interference strategy on each subchannel is therefore the transmit power on that subchannel. Assuming that the maximum transmit power of subchannel m is p_max^m, where m ∈ {1, …, N}, the set of anti-interference strategies can be expressed as

A = {[p_1, p_2, …, p_N] : 0 ≤ p_m ≤ p_max^m, m ∈ {1, …, N}}.
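Under the reward structure described above (per-channel SINR gain minus a weighted transmit power overhead; the weighting coefficient β is part of the reconstruction, since the original formula image is not reproduced here), the return for one time slot can be computed as in the following NumPy sketch, with all variable names chosen for illustration.

```python
import numpy as np

def reward(p_tx, h_gain, jam_power, jam_channel, noise_power, beta=0.1):
    """Return value for one time slot.

    p_tx        : (N,) transmit power chosen on each sub-channel
    h_gain      : (N,) equivalent channel gain |h|^2 of the legitimate link
    jam_power   : (J,) interference power of each interference source
    jam_channel : (J,) index of the sub-channel each interference source occupies
    noise_power : (N,) noise power per sub-channel
    beta        : weight of the transmit power overhead term (assumed form)
    """
    interference = np.zeros(p_tx.size)
    for pj, fj in zip(jam_power, jam_channel):
        interference[fj] += pj                 # f(fj, m): interference source j hits channel m
    sinr = h_gain * p_tx / (interference + noise_power)
    return sinr.sum() - beta * p_tx.sum()

# example: 4 sub-channels, 2 interference sources
r = reward(p_tx=np.array([0.5, 0.8, 0.2, 0.9]), h_gain=np.ones(4),
           jam_power=np.array([1.0, 0.5]), jam_channel=np.array([1, 3]),
           noise_power=0.01 * np.ones(4))
```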
In step S1.5 of step S1, experience groups and the experience pool are defined; the storage and sampling of historical experience supports the training and parameter updating of the neural networks in the subsequent steps. Following the algorithm structure of FIG. 1, the invention defines an experience pool of capacity M_e, which can store M_e historical experience groups. The current environment state S, the return function value R and the current anti-interference strategy A_t obtained through S1.2-S1.5, together with the transition environment state S_, form an experience group {S, A_t, R, S_}. The experience groups are stored in the experience pool one by one; when the number of stored experience groups reaches the capacity limit, the experience group stored for the longest time is overwritten by the new experience group.
In step S2.1 of step S2, a target actor neural network μ(·|θ^μ) and an estimated actor neural network μ'(·|θ^μ') are constructed using convolutional neural networks. The target actor neural network and the estimated actor neural network have the same network structure; the specific structure is shown in FIG. 2 and the specific parameters are given in the second embodiment. The current environment state matrix obtained in step S1.2 is passed through the target actor neural network, which selects the transmit power vector of the corresponding sub-channels from the continuous anti-interference strategy space:

μ(S_t|θ^μ) = [p_{t,1}, p_{t,2}, …, p_{t,N}]

In order to explore unknown strategies and avoid becoming trapped in local optima, random exploration noise of the same dimension is superimposed on the power vector, i.e.

A_t = μ(S_t|θ^μ) + N_t

which forms the current anti-interference strategy A_t. The strategy acts on the environment, completing the interaction between the strategy and the interference environment, after which the next transition state of the environment and the return function value are calculated. In step S2.2 of step S2, a target critic neural network Q(·|θ^Q) and an estimated critic neural network Q'(·|θ^Q') are constructed with the same deep neural network structure. The target actor neural network completes the selection of the anti-interference strategy according to the input spectrum time-slot state matrix. The estimated actor neural network completes network training and parameter updating according to the sampled experience groups. When the number of training steps reaches a preset value, the estimated actor network parameters overwrite the target actor network parameters, completing the parameter update of the target actor network. The output of the target critic neural network is used to help evaluate the quality of the actor network's strategy selection. The estimated critic neural network performs network training and parameter updating according to the sampled experience information. When the number of training steps reaches a preset value, the estimated critic network parameters overwrite the target critic network parameters, completing the parameter update.
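The selection A_t = μ(S_t|θ^μ) plus exploration noise can be sketched as below; the Gaussian form of the noise and the clipping to [0, p_max] are assumptions of this sketch, since the text only states that same-dimension random exploration noise is superimposed on the power vector.

```python
import numpy as np
import torch

def select_action(target_actor, state, p_max, noise_std=0.05):
    """Continuous power vector from the target actor plus exploration noise."""
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # 1 x 1 x H x W
        power = target_actor(s).squeeze(0).numpy()                                 # mu(S_t | theta_mu)
    power = power + np.random.normal(0.0, noise_std, size=power.shape)             # exploration noise
    return np.clip(power, 0.0, p_max)                                              # stay in the feasible power range

# toy usage with a stand-in actor that maps any state to a constant power vector
toy_actor = lambda s: torch.full((1, 4), 0.5)
a_t = select_action(toy_actor, np.zeros((8, 8)), p_max=1.0)
```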
In step S3, step S3.1 takes the strategy obtained in step S2.2 as the transmit power on the current channel m, and the next environment state is calculated from the new transmit power and the interference model. In step S3.2 of step S3, the current environment state from S2.1, the strategy action selected in S2.2, the return function value obtained in S2.2 and the next environment state obtained in S3.1 are combined, according to the capacity and structure of the experience storage pool defined in S1.5, into an experience group {S, A_t, R, S_} and stored in the experience pool. When the stored experience groups reach the capacity limit of the experience pool, the most recently obtained experience group is written into the memory unit holding the oldest experience group, overwriting the oldest experience group.
In step S4.1 of step S4, a number of experience groups set by the preset batch_size is sampled from the experience storage pool of step S3 in order to train the parameters of the estimated critic neural network Q'(·|θ^Q'). Referring to FIG. 1, in step S4.2 of step S4 the training of the estimated critic network Q'(·|θ^Q') is achieved by minimizing its loss function loss_function, where loss_function is defined as follows:
loss_function(θ^Q') = (1/N_b)·Σ_i (y_i − Q'(S_i, A_i|θ^Q'))²   (10)

y_i = R_i + γ·Q(S_{i+1}, μ'(S_{i+1}|θ^μ')|θ^Q)   (11)
where Q'(S_i, A_i|θ^Q') denotes the state-action value given by the estimated critic network with parameters θ^Q', N_b is the number of sampled experience groups, and γ denotes the long-term return discount factor. When the number of training steps reaches the update step number I, the network parameters of the estimated critic neural network are copied into the target critic neural network to complete the update of the network parameters. In step S4.3 of step S4, the training of the estimated actor neural network μ'(·|θ^μ') is realized by updating the estimated actor parameters, for the current environment state, in the direction in which the target critic network evaluates the strategy selection as optimal; the update is:
∇_{θ^μ'} J ≈ (1/N_b)·Σ_i ∇_a Q(S_i, a|θ^Q)|_{a=μ'(S_i|θ^μ')} · ∇_{θ^μ'} μ'(S_i|θ^μ')
and when the training step number reaches the updating step number I, copying the network parameters in the estimated actor neural network into the target actor neural network to complete the updating of the network parameters.
In step S5, the return function R gradually converges to its optimal value as training continues. In the invention, the change in the mean value of R over ζ steps is recorded; when the change in the mean is small enough, training is considered to have converged, the algorithm stops, and the finally output strategy is used as the final anti-interference strategy. Convergence is determined as follows:
|(1/ζ)·Σ_{i=k−ζ+1}^{k} R_i − (1/ζ)·Σ_{i=k−2ζ+1}^{k−ζ} R_i| ≤ v
where v is the termination condition for determining convergence and is set to a very small positive value.
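A small helper for this stopping rule, assuming the ζ-step mean of R is compared between the two most recent non-overlapping windows (the precise windowing is not visible in the original formula image); zeta and v are example values.

```python
import numpy as np

def has_converged(reward_history, zeta=50, v=1e-3):
    """True when the mean of R over the last zeta steps changed by no more than v."""
    if len(reward_history) < 2 * zeta:
        return False
    recent = np.mean(reward_history[-zeta:])
    previous = np.mean(reward_history[-2 * zeta:-zeta])
    return abs(recent - previous) <= v
```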
Example two
The structure of the convolutional neural network used for the anti-interference decision is shown in FIG. 2. In the simulation the system is assumed to be divided into 128 sub-channels; a 128 × 128 spectrum time-slot state matrix is constructed from the spectrum sampling signals and used as the input of the convolutional neural network, and a 1 × 128 power vector is then output through three convolutional layers, two pooling layers and two fully connected layers.
Assume that the input data of the convolution operation is I and that the corresponding convolution kernel K has the same number of dimensions as the input data. Take three-dimensional input data as an example (when the input data is two-dimensional, the third dimension can be taken as 1). The convolution operation requires that the third dimension of the convolution kernel K equal the third dimension of the input data I. Denoting the three dimensions of the kernel by w_1, w_2, w_3, the output after the convolution operation is:

s(x, y) = Σ_{i=1}^{w_1} Σ_{j=1}^{w_2} Σ_{k=1}^{w_3} I(x+i−1, y+j−1, k)·K(i, j, k)
the convolutional neural network pooling operation generally comprises maximum pooling and mean pooling, and the calculation method comprises the following steps:
and (3) mean value pooling:
Figure BDA0001813149920000093
maximum pooling:
Figure BDA0001813149920000094
maximum pooling is employed in the present invention.
Specifically, in this embodiment, each layer structure is as shown in fig. 2, and each layer structure is specifically described as follows:
the available spectrum is divided into 128 sub-channels in the network model, the observation time slot is 128 in length, so the input state matrix dimension is 128 × 128.
Specifically, the state matrix from the input layer is first subjected to a convolution operation with a convolution kernel size of 3 × 3, where the number of convolution kernels is 20, the convolution stride is 1, and ReLU is adopted as the activation function; the dimension of the output after this operation is 126 × 126 × 20. The ReLU activation function is:

y = max{0, x}   (17)
the output is then subjected to a maximum pooling operation with a pooling size of 2 × 2. the output dimensionality after the first layer of convolutional pooling is 63 × 63 × 20.
The output from the second layer after the convolution pooling operation passes through a third layer of the convolution network, and the convolution operation obtains the output of 31 × 31 × 30, wherein the dimension of a convolution kernel ruler is 3 × 3, the number of convolution kernels is 30, the Relu function is adopted as an activation function, and the convolution step size is 2.
The fourth layer of the convolution network carries out convolution operation by taking the output of the third layer as input, the size of the adopted convolution kernel is 4 × 4, the number of the convolution kernels is 30, the convolution step size is 2, and the convolution operation is carried out on the w1,w2And performing zero padding operation on two dimensions, wherein the number of zero padding is 1, outputting the dimension of 15 × 15 × 30 after the layer of convolution operation, performing maximum pooling operation on the output after the convolution operation, wherein the pooling size is 3 × 3, and outputting the dimension of 5 × 5 × 30 after the pooling operation.
The output of the convolutional neural network with the fourth layer of dimension 5 × 5 × 30 is recombined into a vector with the dimension 1 × 750, and the vector with the dimension 1 × 360 is output after the processing of the fully connected layer.
The sixth layer of the convolutional network is a fully-connected layer, 128 neurons are constructed in the layer, and the Relu function is adopted as an activation function. The output from the fifth layer of the convolutional neural network is processed by the full connection layer and then outputs Q (. | [ theta ]) corresponding to the dimensionality of the anti-interference strategy sett) Vector of values, output dimension 1 × 128.
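The layer dimensions stated above can be reproduced with the following PyTorch sketch of the actor network. It is an illustrative reconstruction of Fig. 2 from the text (kernel counts, strides and padding as stated); the activation of the intermediate fully connected layer is assumed to be ReLU.

```python
import torch
import torch.nn as nn

class ActorCNN(nn.Module):
    """128 x 128 spectrum time-slot matrix -> 1 x 128 power vector (per the text of Fig. 2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3, stride=1), nn.ReLU(),              # -> 126 x 126 x 20
            nn.MaxPool2d(2),                                                    # -> 63 x 63 x 20
            nn.Conv2d(20, 30, kernel_size=3, stride=2), nn.ReLU(),             # -> 31 x 31 x 30
            nn.Conv2d(30, 30, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # -> 15 x 15 x 30
            nn.MaxPool2d(3),                                                    # -> 5 x 5 x 30
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                                  # 5 * 5 * 30 = 750
            nn.Linear(750, 360), nn.ReLU(),
            nn.Linear(360, 128), nn.ReLU(),                # one power value per sub-channel
        )

    def forward(self, x):
        return self.fc(self.features(x))

out = ActorCNN()(torch.randn(1, 1, 128, 128))
print(out.shape)    # torch.Size([1, 128])
```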
FIG. 3 shows the deep neural network structure used to implement the estimated critic neural network and the target critic neural network. The first layer is an input layer of dimension 128 × (128+1), containing the state matrix S_t representing the channel power information and the action vector A_t representing the strategy. The second layer is neural layer 1, with 1024 neurons, output dimension 1024 × 1 and the ReLU activation function. The third layer is neural layer 2, with 128 neurons, output dimension 128 × 1 and the ReLU activation function. The fourth layer is neural layer 3, with 32 neurons, output dimension 32 × 1 and the ReLU activation function. The fifth layer is neural layer 4, with 1 neuron, which outputs the Q value used to evaluate the quality of the actor network's strategy selection.
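A corresponding sketch of the critic network of Fig. 3; flattening the 128 × (128+1) input (state matrix with the action vector appended as an extra column) before the fully connected layers is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class CriticMLP(nn.Module):
    """State matrix (128 x 128) plus action vector (128) -> scalar Q value (per the text of Fig. 3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128 * 129, 1024), nn.ReLU(),   # neural layer 1
            nn.Linear(1024, 128), nn.ReLU(),         # neural layer 2
            nn.Linear(128, 32), nn.ReLU(),           # neural layer 3
            nn.Linear(32, 1),                        # neural layer 4: Q value
        )

    def forward(self, state, action):
        x = torch.cat([state, action.unsqueeze(-1)], dim=-1)   # (batch, 128, 129)
        return self.net(x.flatten(start_dim=1))

q = CriticMLP()(torch.randn(2, 128, 128), torch.rand(2, 128))
print(q.shape)    # torch.Size([2, 1])
```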
Further, FIG. 4 shows the performance of the continuous power selection anti-interference strategy based on deep deterministic policy gradient reinforcement learning proposed in the invention. The figure compares a random power selection strategy, a DQN-based discrete power selection strategy, the continuous power selection strategy based on the deep deterministic policy gradient, and the ideal optimal power selection strategy. It can be seen from the figure that the reward obtained by the proposed algorithm is substantially improved compared with the random power selection strategy.

Claims (3)

1. A communication anti-interference method based on depth determination gradient reinforcement learning is characterized by comprising the following steps:
s1, initializing definition, including:
interference environment: defining an interference environment according to the number of interferers, an interference mode and a wireless channel model;
interference environment state: forming a spectrum time slot matrix by spectrum information measured by different time slots, wherein the size of the spectrum time slot matrix is determined by an observation spectrum range and an observation time slot length;
a return function: constructing a feedback return function according to the communication quality index of a legal user;
an anti-interference strategy is as follows: defining the combination of the transmitting power on different sub-channels as an anti-interference strategy set;
deep neural network: constructing four deep neural networks, namely a target actor, an estimated actor, a target critic and an estimated critic, wherein the target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same network structure;
an experience storage pool: presetting an experience storage pool with a fixed size, wherein the experience storage pool is used for storing an experience group consisting of a current environment state, a current anti-interference strategy, a return function value and a transition environment state;
s2, obtaining an anti-interference strategy from the interference environment state, namely the spectrum time sequence matrix, through the target actor convolutional neural network, applying the strategy to the interference environment, and observing, according to the return function, the return value of the interference environment under the current anti-interference strategy and the state matrix after the next transition; the output of the target critic neural network is used to help evaluate the quality of the actor neural network's strategy selection;
s3, forming an experience group by the current anti-interference strategy, the interference environment state, the return value under the anti-interference strategy and the transfer environment state, and storing the experience group into an experience pool;
s4, training the estimated actor neural network and the estimated critic neural network by sampling experience groups from the experience pool, and, when the number of training steps reaches a preset value, overwriting the target actor neural network parameters with the estimated actor neural network parameters and overwriting the target critic neural network parameters with the estimated critic neural network parameters, thereby completing the parameter updating of the target neural networks;
s5, judging whether the learning mechanism meets a preset stopping condition, and if so, stopping learning to obtain a final anti-interference strategy; otherwise, go back to S2 to continue learning.
2. The method of claim 1, wherein the reward function in step S1 is:

R_t = Σ_{m=1}^{N} |h_t^m|²·p_{t,m} / (Σ_{j=1}^{J} p_j^J·f(f_j, m) + σ_m²) − β·Σ_{m=1}^{N} p_{t,m}

where m ∈ {1, …, N} is the channel index number, N is the number of channels, p_j^J is the interference power of interference source j on the channel it occupies, j ∈ {1, …, J} is the interference source index number, J is the number of interference sources, and t is the time-slot index number; h_t^m indicates the channel between the legitimate communication users, p_{t,m} is the transmit power on the sub-channel, the function f(f_j, m) outputs 1 when f_j = m and 0 otherwise, and the last term, weighted by the coefficient β, is the transmit power overhead.
3. The communication interference rejection method based on the depth-determined gradient reinforcement learning of claim 2, wherein in step S4, the method for updating the parameters of the convolutional neural network is as follows:
training the parameters of the convolutional neural network, obtaining corresponding state behavior values through the convolutional neural network according to the current state and the next state in the extracted experience group, constructing a corresponding loss function, and updating the network parameters through the minimized loss function.
CN201811129485.9A 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning Expired - Fee Related CN109302262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Publications (2)

Publication Number Publication Date
CN109302262A CN109302262A (en) 2019-02-01
CN109302262B true CN109302262B (en) 2020-07-10

Family

ID=65164716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811129485.9A Expired - Fee Related CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Country Status (1)

Country Link
CN (1) CN109302262B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109861720B (en) * 2019-03-15 2021-07-30 中国科学院上海高等研究院 WSN anti-interference method, device, equipment and medium based on reinforcement learning
CN110113418B (en) * 2019-05-08 2020-06-02 电子科技大学 Collaborative cache updating method for vehicle-associated information center network
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110944354B (en) * 2019-11-12 2022-10-04 广州丰石科技有限公司 Base station interference monitoring method and system based on waveform analysis and deep learning
CN111181618B (en) * 2020-01-03 2022-05-10 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111835453B (en) * 2020-07-01 2022-09-20 中国人民解放军空军工程大学 Communication countermeasure process modeling method
CN112087749B (en) * 2020-08-27 2023-06-02 华北电力大学(保定) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
CN112188004B (en) * 2020-09-28 2022-04-05 精灵科技有限公司 Obstacle call detection system based on machine learning and control method thereof
CN112202527B (en) * 2020-10-01 2022-09-13 西北工业大学 Intelligent electromagnetic signal identification system interference method based on momentum gradient disturbance
CN112492691B (en) * 2020-11-26 2024-03-26 辽宁工程技术大学 Downlink NOMA power distribution method of depth deterministic strategy gradient
CN114696925B (en) * 2020-12-31 2023-12-15 华为技术有限公司 Channel quality assessment method and related device
CN113038616B (en) * 2021-03-16 2022-06-03 电子科技大学 Frequency spectrum resource management and allocation method based on federal learning
CN112906640B (en) * 2021-03-19 2022-10-14 电子科技大学 Space-time situation prediction method and device based on deep learning and readable storage medium
CN113098565B (en) * 2021-04-02 2022-06-07 甘肃工大舞台技术工程有限公司 Stage carrier communication self-adaptive frequency hopping anti-interference method based on deep network
CN113221454B (en) * 2021-05-06 2022-09-13 西北工业大学 Electromagnetic radiation source identification method based on deep reinforcement learning
CN113411099B (en) * 2021-05-28 2022-04-29 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113890564B (en) * 2021-08-24 2023-04-11 浙江大学 Special ad hoc network frequency hopping anti-interference method and device for unmanned aerial vehicle based on federal learning
CN114417939B (en) * 2022-01-27 2022-06-28 中国人民解放军32802部队 Interference strategy generation method based on knowledge graph

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581738A (en) * 2015-01-30 2015-04-29 厦门大学 Cognitive radio hostile interference resisting method based on Q learning
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107479368B (en) * 2017-06-30 2021-09-21 北京百度网讯科技有限公司 Method and system for training unmanned aerial vehicle control model based on artificial intelligence

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581738A (en) * 2015-01-30 2015-04-29 厦门大学 Cognitive radio hostile interference resisting method based on Q learning
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A heterogeneous information fusion deep reinforcement learning for intelligent frequency selection of HF communication;Xin Liu;《China Communications》;20180906;Vol. 19, No. 9;full text *
Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach;Xin Liu;《IEEE COMMUNICATIONS LETTERS》;20180312;Vol. 22, No. 5;full text *
Two-dimensional anti-jamming communication based on deep reinforcement learning;Guoan Han;《2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20180719;full text *

Also Published As

Publication number Publication date
CN109302262A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109302262B (en) Communication anti-interference method based on depth determination gradient reinforcement learning
CN109274456B (en) Incomplete information intelligent anti-interference method based on reinforcement learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109639377B (en) Spectrum resource management method based on deep reinforcement learning
CN109617584B (en) MIMO system beam forming matrix design method based on deep learning
CN111970072B (en) Broadband anti-interference system and method based on deep reinforcement learning
Menon et al. A game-theoretic framework for interference avoidance
CN113795049B (en) Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning
CN113613301B (en) Air-ground integrated network intelligent switching method based on DQN
CN109068382B (en) NOMA cross-layer power distribution method based on time delay QoS
Eisen et al. Large scale wireless power allocation with graph neural networks
Nikoloska et al. Modular meta-learning for power control via random edge graph neural networks
Rahmani et al. Deep reinforcement learning-based sum rate fairness trade-off for cell-free mMIMO
CN113038612A (en) Cognitive radio power control method based on deep learning
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN117240331A (en) No-cellular network downlink precoding design method based on graph neural network
Elarfaoui et al. Optimization of QoS parameters in cognitive radio using combination of two crossover methods in genetic algorithm
CN115276858A (en) Dynamic spectrum multi-domain anti-interference method and system based on cognitive anti-interference model
CN115278896A (en) MIMO full duplex power distribution method based on intelligent antenna
Jiang et al. Dynamic spectrum access for femtocell networks: A graph neural network based learning approach
Zhang et al. A convolutional neural network based resource management algorithm for NOMA enhanced D2D and cellular hybrid networks
CN110474663B (en) Iterative intelligent signal detection method based on neural network
Ali et al. Deep-Q Reinforcement Learning for Fairness in Multiple-Access Cognitive Radio Networks
Dai et al. Power allocation for multiple transmitter-receiver pairs under frequency-selective fading based on convolutional neural network
Kim et al. RL-based transmission completion time minimization with energy harvesting for time-varying channels

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200710

Termination date: 20210927

CF01 Termination of patent right due to non-payment of annual fee