CN109302262B - Communication anti-interference method based on depth determination gradient reinforcement learning - Google Patents

Communication anti-interference method based on depth determination gradient reinforcement learning Download PDF

Info

Publication number
CN109302262B
CN109302262B CN201811129485.9A CN201811129485A
Authority
CN
China
Prior art keywords
interference
neural network
strategy
actor
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811129485.9A
Other languages
Chinese (zh)
Other versions
CN109302262A (en)
Inventor
黎伟
王军
李黎
党泽
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
CETC 54 Research Institute
Original Assignee
University of Electronic Science and Technology of China
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, CETC 54 Research Institute filed Critical University of Electronic Science and Technology of China
Priority to CN201811129485.9A priority Critical patent/CN109302262B/en
Publication of CN109302262A publication Critical patent/CN109302262A/en
Application granted granted Critical
Publication of CN109302262B publication Critical patent/CN109302262B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04KSECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00Jamming of communication; Counter-measures
    • H04K3/20Countermeasures against jamming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04KSECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00Jamming of communication; Counter-measures
    • H04K3/40Jamming having variable characteristics

Abstract

The invention belongs to the technical field of wireless communication and relates to a communication anti-interference method based on deep deterministic policy gradient reinforcement learning. First, an interference environment model is constructed according to the number of interference sources and the wireless channel model; a utility function is constructed from the communication quality index of the legitimate user and used as the reward in learning; and the spectrum information sampled in different time slots is assembled into a spectrum time-slot matrix that describes the interference environment state. Then, following the deep deterministic policy gradient reinforcement learning mechanism, convolutional neural networks are constructed, and when an anti-interference decision is made the environment state matrix is passed through the target actor convolutional neural network to select an anti-interference strategy for the corresponding state over a continuous space. Continuous anti-interference strategy selection in communication is thus completed by the deep deterministic policy gradient reinforcement learning mechanism. The quantization error caused by discretizing the strategy space is avoided, the number of neural network output units and the network complexity are reduced, and the performance of the anti-interference algorithm is improved.

Description

Communication anti-interference method based on depth determination gradient reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication and relates to a communication anti-interference method based on deep deterministic policy gradient reinforcement learning.
Background
With the development of wireless communication technology, the electromagnetic environment faced by wireless communication systems is increasingly complex and harsh: a system may suffer both unintentional interference from friendly communications and interference signals intentionally released by an adversary. Traditional anti-interference means adopt a fixed anti-interference strategy against a static interference pattern of the interference source. As interference means become intelligent, an interference source can dynamically adjust its interference strategy according to changes in the communication state of the legitimate user, so traditional anti-interference methods cannot guarantee normal communication of the legitimate user in a dynamic interference environment. It is therefore necessary to adopt a correspondingly intelligent anti-interference strategy against the dynamic interference strategy of the interference source to ensure normal communication of the legitimate user in a dynamic interference environment.
At present, methods for dynamically adjusting the anti-interference strategy against an interference source mainly rely on reinforcement learning. First, the anti-interference strategy space is discretized to construct an anti-interference strategy set; second, a utility function related to the communication quality of the legitimate user is constructed, and an environment state matrix is obtained through spectrum sampling and preprocessing; a discrete strategy is then selected from the environment state matrix by a deep neural network; finally, the selected strategy is applied to the environment, the environment state transition is estimated, and the optimal communication strategy under the dynamic interference strategy is obtained through repeated learning. See, for example, Xin Liu et al., "Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach," IEEE Communications Letters, vol. 22, no. 5, May 2018. Moreover, when the transmit power on different sub-channels is discretized, the quantization rule requires the constructed strategy set to contain N × L elements, where N is the number of channels and L is the number of quantization levels, and the corresponding deep neural network needs L^N outputs. When the number of system channels and quantization levels is large, the number of neural network outputs grows exponentially, which increases the complexity of training the neural network and of selecting strategies based on the ε-greedy policy.
Disclosure of Invention
Aiming at the above technical problems, the invention provides a communication anti-interference power selection method based on a deep deterministic policy gradient (DDPG) reinforcement learning mechanism. Without discretizing the power strategy space, the anti-interference power strategy is selected deterministically, which improves anti-interference performance and reduces strategy selection complexity.
The invention first constructs an interference environment according to the number of interference sources and the wireless channel model. A utility function is constructed from the communication quality index of the legitimate user and used as the reward in learning. The spectrum information sampled in different time slots is assembled into a spectrum time-slot matrix, and this matrix describes the interference environment state. Four deep neural networks are constructed in the invention: a target actor (target_actor), an estimated actor (estimated_actor), a target critic (target_critic) and an estimated critic (estimated_critic), used respectively for strategy selection based on the environment state matrix, strategy-selection network training, strategy selection evaluation, and evaluation network training. The target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same network structure. The environment state matrix is passed through the target actor neural network to output an anti-interference strategy. The legitimate user adjusts the transmit power and selects channels, realizing intelligent adjustment of the anti-interference strategy. The return function value and the transition environment state matrix are calculated from the wireless interference environment model and the anti-interference strategy. The current environment state, the current anti-interference strategy, the return function value and the transition environment state form an experience group that is stored in an experience pool. Finally, experience groups are sampled from the experience pool to train the estimated actor neural network and the estimated critic neural network. When the number of learning steps reaches a certain value, the target actor neural network and the target critic neural network are updated with the parameters of the estimated actor and estimated critic neural networks, respectively. This learning mechanism continues until the learning results converge.
The method for realizing the intelligent anti-interference scheme of the legal user comprises the following steps:
s1, defining each algorithm module of the intelligent anti-interference scheme: the method comprises the following steps of interference environment definition, interference environment state definition, return function definition, anti-interference strategy definition and experience storage pool definition.
S2, constructing four deep neural networks: a target actor neural network (target_actor), an estimated actor neural network (estimated_actor), a target critic neural network (target_critic) and an estimated critic neural network (estimated_critic). The target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same structure.
And S3, the environment state information, namely the spectrum time sequence matrix, is passed through the target actor neural network to obtain the anti-interference strategy; the strategy is applied to the interference environment, the return value and the transition state matrix of the anti-interference strategy in the current interference environment are calculated, and both are stored.
And S4, the estimated actor neural network and the estimated critic neural network are trained and their parameters updated by sampling experience groups from the experience pool.
S5, judging whether the learning mechanism meets the stop condition, if so, stopping learning to obtain the final anti-interference strategy; otherwise, go back to S2 to continue learning.
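To make the flow of steps S1-S5 easier to follow, the minimal Python sketch below lays out the learning loop with toy stand-ins (a random spectrum environment, a fixed linear map in place of the target actor, small matrix sizes); every name and dimension in it is an illustrative assumption, not part of the patented embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CH, T_OBS, P_MAX = 8, 8, 1.0          # toy sizes; the embodiment uses 128 x 128

# S1: toy interference environment and experience pool
def env_step(power_vec):
    """Return (reward, observed spectrum row) for a chosen power vector (toy model)."""
    interference = rng.random(N_CH)                 # stand-in for sampled jammer power
    sinr = power_vec / (interference + 0.1)
    reward = np.log2(1.0 + sinr).sum() - 0.1 * power_vec.sum()
    return reward, power_vec + interference

pool = []                                           # experience groups (S, A, R, S_)

# S2: stand-in "target actor": a fixed linear map from state matrix to power vector
W = rng.normal(scale=0.01, size=(T_OBS * N_CH, N_CH))
def target_actor(state):
    return np.clip(state.reshape(-1) @ W + 0.5, 0.0, P_MAX)

state = rng.random((T_OBS, N_CH))                   # spectrum time-slot matrix
for step in range(200):
    # S3: select a strategy, act on the environment, store the experience group
    action = target_actor(state)
    reward, new_row = env_step(action)
    next_state = np.vstack([new_row, state[:-1]])
    pool.append((state, action, reward, next_state))
    # S4: sample a batch and (in the real method) train the estimated networks,
    #     periodically overwriting the target networks with the estimated parameters
    if len(pool) >= 32:
        batch = [pool[i] for i in rng.choice(len(pool), 32, replace=False)]
    # S5: stop when the reward has converged (fixed horizon in this sketch)
    state = next_state
```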
According to an embodiment of the present invention, the above step S1 includes the steps of:
s1.1, interference environment definition: an interference environment is defined according to the number of interferers, the interference mode and the wireless channel model.
S1.2, interference environment state definition: and forming a spectrum time slot matrix by spectrum information measured by different time slots, wherein the size of the spectrum time slot matrix is determined by an observation spectrum range and an observation time slot length.
S1.3, return function definition: and constructing a feedback return function according to the communication quality index of the legal user.
S1.4, anti-interference strategy definition: and defining the combination of the transmit powers on different sub-channels as the anti-interference strategy set. The transmit power on each subchannel may take any value in the continuous interval [0, p_max^m], where p_max^m is the maximum transmit power of subchannel m.
S1.5, experience storage pool definition: an experience storage pool with a fixed size is preset and used for storing experience groups, each consisting of a current environment state matrix, an anti-interference strategy, a return function value and a transition environment state matrix.
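A minimal sketch of the fixed-size experience storage pool of S1.5, assuming a Python deque whose maxlen implements the overwrite-the-oldest behaviour; the class name ExperiencePool and its method names are illustrative choices for this sketch.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity pool of experience groups (S, A, R, S_)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest groups are dropped automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Randomly draw a batch of experience groups for training."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# usage: pool = ExperiencePool(capacity=10000); pool.store(S, A, R, S_)
#        batch = pool.sample(64) once len(pool) >= 64
```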
According to an embodiment of the present invention, the step S2 includes the following steps:
and S2.1, constructing a target actor neural network and an estimated actor neural network by adopting the convolutional neural networks with the same structure. The convolutional neural network includes a plurality of convolutional layers, a plurality of pooling layers, and a plurality of fully-connected layers. And the target actor neural network completes the selection of an anti-interference strategy according to the input spectrum time slot state matrix. And estimating the actor neural network to complete network training and parameter updating according to the sampling experience group. And when the training steps reach a preset value, covering the target actor neural network parameters with the estimated actor neural network parameters so as to finish the parameter updating of the target actor neural network.
And S2.2, constructing a target critic neural network and an estimated critic neural network using conventional deep neural networks with the same structure. The deep neural network comprises a plurality of neural network layers, each containing a plurality of neurons and activation functions. The output of the target critic neural network is used to help evaluate the quality of the actor neural network's strategy selection. The estimated critic neural network performs network training and parameter updating according to the sampled experience information. When the number of training steps reaches a preset value, the estimated critic neural network parameters overwrite the target critic neural network to complete the parameter update.
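The pairing of structurally identical target/estimated networks, and the hard overwrite of the target parameters by the estimated parameters, can be sketched in PyTorch as follows; the stand-in layer sizes are placeholders and do not reproduce the structures of Figs. 2 and 3.

```python
import torch.nn as nn

def make_actor():
    # placeholder structure; the embodiment uses the convolutional network of Fig. 2
    return nn.Sequential(nn.Flatten(), nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 4), nn.Sigmoid())

estimated_actor = make_actor()
target_actor = make_actor()                                   # same structure, separate parameters
target_actor.load_state_dict(estimated_actor.state_dict())    # start from identical weights

def hard_update(target, estimated):
    """Overwrite the target network parameters with the estimated network parameters."""
    target.load_state_dict(estimated.state_dict())

# after every preset number of training steps:
# hard_update(target_actor, estimated_actor); hard_update(target_critic, estimated_critic)
```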
According to an embodiment of the present invention, the above step S3 includes the steps of:
and S3.1, obtaining an anti-interference strategy through the target actor neural network constructed in the step S2.1 by the environment state matrix according to the definition of the environment state in the step S1.2. And applying an anti-interference strategy to the interference environment defined in the step S1.1, and calculating a return function value and a state matrix after next transfer.
And S3.2, defining an experience pool with the capacity of M, and storing an experience group (S, A, R, S _) formed by the current environment state, the selected strategy behavior, the obtained return function value and the next environment state in the S3.1 in the experience pool.
According to an embodiment of the present invention, the above step S4 includes the steps of:
and S4.1, randomly extracting a certain number of experience groups from the experience pool obtained in the S3.2 for training and updating the parameters of the convolutional neural network.
And S4.2, according to the current state S and the next state S_ in the experience group extracted in step S4.1, obtaining two corresponding state-action values through the target neural network and the estimated neural network. A loss function is constructed from the current return function value and the two state-action values, and network training and updating of the estimated critic neural network are completed by minimizing this loss function.
And S4.3, obtaining the state-action value of the current state S in the experience group extracted in step S4.1 through the estimated critic neural network, and obtaining the state-action value corresponding to the current state S and the strategy A in the experience group extracted in step S4.1 through the target actor neural network. A loss function is constructed from these two state-action values, and the estimated actor neural network is trained and its parameters updated.
The invention has the beneficial effects that:
the invention completes the selection of continuous anti-interference strategies in communication based on the reinforcement learning mechanism of the depth determination strategy gradient strategy. The quantization error caused by the quantization discrete processing strategy space is overcome, the grid number of the output unit of the neural network and the complexity of the network are reduced, and the performance of the anti-interference algorithm is improved.
Drawings
FIG. 1 is the processing framework of the anti-interference strategy selection algorithm based on the deep deterministic policy gradient reinforcement learning mechanism designed by the invention
FIG. 2 shows the structure of a target actor neural network and an estimated actor neural network designed according to the present invention
FIG. 3 is a diagram of a target critic neural network and an estimated critic neural network structure designed by the present invention
Fig. 4 is a comparison of the performance of the algorithm designed by the present invention with that of optimal strategy selection, random strategy selection and a DQN-based discretization decision method.
Detailed Description
In order to make the steps of the present invention more detailed and clear, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Example one
Fig. 1 is a specific implementation method of the algorithm of the present invention, and the following describes each step and its principle in detail with reference to fig. 1.
The algorithm implementation framework of the continuous-strategy-selection anti-interference method based on deep deterministic policy gradient reinforcement learning is shown in FIG. 1. In step S1, interference and radio environment modeling is completed in S1.1. In the considered scenario, multiple interference sources interfere with a legitimate communication link, and the interference may include, but is not limited to, five types: single-tone interference, multi-tone interference, linear frequency sweep interference, partial-band interference and noise frequency modulation interference. The interference source can dynamically adjust its interference against the legitimate user by adjusting interference parameters or switching interference modes. The five interference modes are mathematically modeled as follows:
(1) single tone interference
The complex baseband expression of the single-tone interfering signal is:

J(t) = A·exp(j(2π·f_J·t + φ_J))

where A is the amplitude of the single-tone interference signal, f_J is the single-tone interference signal frequency, and φ_J is the initial phase of the single-tone interference.
(2) Multitone interference
The complex baseband expression of the multi-tone interference signal is:

J(t) = Σ_{m=1}^{M} A_m·exp(j(2π·f_m·t + φ_m))

where M is the number of tones, A_m is the amplitude of the m-th single tone in the multi-tone interference, f_m is the frequency of the m-th single tone, and φ_m is the initial phase of the m-th single tone.
(3) Linear swept frequency interference
The complex baseband expression of the linear sweep interference signal is:

J(t) = A·exp(j(2π·f_0·t + π·k·t² + φ_0)),  0 ≤ t ≤ T

where A is the amplitude, f_0 is the initial frequency, k is the frequency modulation (sweep) coefficient, φ_0 is the initial phase, and T is the signal duration.
(4) Partial band interference
The partial-band noise interference appears as Gaussian white noise within part of the band; its complex baseband expression is:

J(t) = U_n(t)·exp(j(2π·f_J·t + φ(t)))

where U_n(t) is baseband noise with zero mean and variance σ_n², f_J is the center frequency of the signal, and φ(t) is a phase uniformly distributed on [0, 2π] and independent of U_n(t).
(5) Noise frequency modulation interference
The complex baseband of the noise frequency-modulated signal can be represented as:

J(t) = A·exp(j(2π·f_0·t + 2π·k_fm·∫_0^t ξ(τ)dτ + φ))

where A is the amplitude of the noise FM signal, f_0 is its carrier frequency, k_fm is the frequency modulation index, and ξ(t) is zero-mean narrow-band Gaussian white noise with variance σ_ξ². The term ∫_0^t ξ(τ)dτ is a Wiener process with a Gaussian distribution. The frequency modulation index k_fm and the variance σ_ξ² together determine the effective bandwidth of the noise modulation.
The interference source dynamically selects the interference mode and the corresponding parameters so as to maximize the interference effect.
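To make the five interference models concrete, the NumPy sketch below generates complex-baseband samples for each of them following the expressions reconstructed above; the sample rate, amplitudes, frequencies and variances are arbitrary example values, not parameters prescribed by the invention.

```python
import numpy as np

fs = 1e6                                    # sample rate (example value)
t = np.arange(0, 1e-3, 1 / fs)              # 1 ms of samples
rng = np.random.default_rng(0)

def single_tone(A=1.0, fJ=100e3, phi=0.0):
    return A * np.exp(1j * (2 * np.pi * fJ * t + phi))

def multi_tone(amps=(1.0, 0.5), freqs=(80e3, 150e3), phases=(0.0, 0.3)):
    return sum(A * np.exp(1j * (2 * np.pi * f * t + p))
               for A, f, p in zip(amps, freqs, phases))

def linear_sweep(A=1.0, f0=50e3, k=2e8, phi=0.0):
    # instantaneous frequency f0 + k*t over the signal duration
    return A * np.exp(1j * (2 * np.pi * f0 * t + np.pi * k * t**2 + phi))

def partial_band(sigma=1.0, fJ=120e3):
    Un = rng.normal(0.0, sigma, t.size)                 # zero-mean baseband noise
    phase = rng.uniform(0.0, 2 * np.pi, t.size)         # phases uniform on [0, 2*pi]
    return Un * np.exp(1j * (2 * np.pi * fJ * t + phase))

def noise_fm(A=1.0, f0=100e3, kfm=5e4, sigma=1.0):
    xi = rng.normal(0.0, sigma, t.size)                 # modulating Gaussian noise
    integral = np.cumsum(xi) / fs                       # approximates the Wiener-process term
    return A * np.exp(1j * (2 * np.pi * f0 * t + 2 * np.pi * kfm * integral))

jammers = {"single tone": single_tone(), "multi-tone": multi_tone(),
           "linear sweep": linear_sweep(), "partial band": partial_band(),
           "noise FM": noise_fm()}
```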
The legitimate user's anti-interference procedure computes the return function value R and the environment state matrix S by sampling the wireless spectrum information in the environment; a historical experience group is constructed from the return function value, the environment state, the current anti-interference strategy and the next transition state matrix and stored in the experience pool; the neural network selects the next anti-interference action according to the current environment state matrix, applies the anti-interference strategy to the environment, and updates its parameters according to historical experience; the whole algorithm iterates until it converges. The specific implementation steps of the algorithm are as follows:
in the invention, steps S1.2, S1.3 and S1.4 respectively complete the design of the environment state, the design of the return function and the design of the anti-interference strategy. In the case of multiple sub-channels, the signal received on a sub-channel by the receiving end of a legal link can be represented as:
y_t^m = h_t^m·x_t + Σ_{j=1}^{J} h_{j,t}^m·x_j + n_t^m

where m ∈ {1, …, N} is the channel index number, N is the number of channels, x_t is the useful transmitted signal, x_j is the signal of interference source j, j ∈ {1, …, J} is the interference source index number, J is the number of interference sources, and t is the time-slot index; h_t^m denotes the channel between the legitimate communication users, h_{j,t}^m denotes the interference channel from interference source j to the legitimate user's receiver, and n_t^m is the receiver noise. Therefore, the signal-to-interference-and-noise ratio and the achievable rate available to the receiving end of the legitimate user can be expressed as:

SINR_t^m = |h_t^m|²·p_t^m / (Σ_{j=1}^{J} |h_{j,t}^m|²·p_{j,t}^m + σ_m²)

r_t^m = log₂(1 + SINR_t^m)

where |h_t^m|² is the equivalent channel gain on the sub-channel, p_t^m and p_{j,t}^m are the corresponding legitimate and interfering transmit powers, and σ_m² is the corresponding noise power. The achievable rate at time t at the receiving end can be expressed as the sum of the rates on the N subchannels:

R_t = Σ_{m=1}^{N} r_t^m
before an anti-interference decision is made, the corresponding power on each subchannel is obtained by sampling the wireless environment, and the power of all the subchannels forms a power vector P ═ Pt,1,pt,2,…,pt,N]Where N corresponds to the number of subchannels. The state matrix S is formed by a plurality of historical power vectors St=[Pt-1Pt-2…Pt-t]TWhere t is the observation time window. Meanwhile, the limit of the anti-interference strategy on the transmission power is considered, the return function designed in the invention considers the gain and power overhead of the adopted anti-interference strategy on the signal-to-interference-and-noise ratio at the same time, and the specific expression is as follows:
Figure BDA0001813149920000072
wherein
Figure BDA0001813149920000073
Is the interference power of the interferer on the channel; function(s)
Figure BDA0001813149920000079
Is shown when fjWhen m, 1 is output, otherwise 0 is output;
Figure BDA0001813149920000074
is the transmit power overhead.
Owing to the influence of the interference sources, the interference strength p_j^J on certain sub-channels may be large, and the transmit power on the corresponding channels can be adjusted so that the communication quality of the link is maximized within a controllable power range. The anti-interference strategy on each subchannel is therefore the transmit power on that subchannel. Assuming that the maximum transmit power of subchannel m is p_max^m, where m ∈ {1, …, N}, the set of anti-interference strategies can be expressed as

A = {[p_1, p_2, …, p_N] : 0 ≤ p_m ≤ p_max^m, m ∈ {1, …, N}}.
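Under the reward structure described above (per-channel SINR gain minus a weighted transmit power overhead; the weighting coefficient β is part of the reconstruction, since the original formula image is not reproduced here), the return for one time slot can be computed as in the following NumPy sketch, with all variable names chosen for illustration.

```python
import numpy as np

def reward(p_tx, h_gain, jam_power, jam_channel, noise_power, beta=0.1):
    """Return value for one time slot.

    p_tx        : (N,) transmit power chosen on each sub-channel
    h_gain      : (N,) equivalent channel gain |h|^2 of the legitimate link
    jam_power   : (J,) interference power of each interference source
    jam_channel : (J,) index of the sub-channel each interference source occupies
    noise_power : (N,) noise power per sub-channel
    beta        : weight of the transmit power overhead term (assumed form)
    """
    interference = np.zeros(p_tx.size)
    for pj, fj in zip(jam_power, jam_channel):
        interference[fj] += pj                 # f(fj, m): interference source j hits channel m
    sinr = h_gain * p_tx / (interference + noise_power)
    return sinr.sum() - beta * p_tx.sum()

# example: 4 sub-channels, 2 interference sources
r = reward(p_tx=np.array([0.5, 0.8, 0.2, 0.9]), h_gain=np.ones(4),
           jam_power=np.array([1.0, 0.5]), jam_channel=np.array([1, 3]),
           noise_power=0.01 * np.ones(4))
```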
In step S1.5 of step S1, experience groups and the experience pool are defined; the storage and sampling of historical experience supports the training and parameter updating of the neural networks in the subsequent steps. Following the algorithm structure of FIG. 1, the invention defines an experience pool of capacity M_e, which can store M_e historical experience groups. The current environment state S, the return function value R and the current anti-interference strategy A_t obtained through S1.2-S1.5, together with the transition environment state S_, form an experience group {S, A_t, R, S_}. The experience groups are stored in the experience pool one by one; when the number of stored experience groups reaches the capacity limit, the experience group stored for the longest time is overwritten by the new experience group.
In step S2.1 of step S2, a target actor neural network μ(·|θ^μ) and an estimated actor neural network μ'(·|θ^μ') are constructed using convolutional neural networks. The target actor neural network and the estimated actor neural network have the same network structure; the specific structure is shown in FIG. 2 and the specific parameters are given in the second embodiment. The current environment state matrix obtained in step S1.2 is passed through the target actor neural network, which selects the transmit power vector of the corresponding sub-channels from the continuous anti-interference strategy space:

μ(S_t|θ^μ) = [p_{t,1}, p_{t,2}, …, p_{t,N}]

In order to explore unknown strategies and avoid becoming trapped in local optima, random exploration noise of the same dimension is superimposed on the power vector, i.e.

A_t = μ(S_t|θ^μ) + N_t

which forms the current anti-interference strategy A_t. The strategy acts on the environment, completing the interaction between the strategy and the interference environment, after which the next transition state of the environment and the return function value are calculated. In step S2.2 of step S2, a target critic neural network Q(·|θ^Q) and an estimated critic neural network Q'(·|θ^Q') are constructed with the same deep neural network structure. The target actor neural network completes the selection of the anti-interference strategy according to the input spectrum time-slot state matrix. The estimated actor neural network completes network training and parameter updating according to the sampled experience groups. When the number of training steps reaches a preset value, the estimated actor network parameters overwrite the target actor network parameters, completing the parameter update of the target actor network. The output of the target critic neural network is used to help evaluate the quality of the actor network's strategy selection. The estimated critic neural network performs network training and parameter updating according to the sampled experience information. When the number of training steps reaches a preset value, the estimated critic network parameters overwrite the target critic network parameters, completing the parameter update.
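The selection A_t = μ(S_t|θ^μ) plus exploration noise can be sketched as below; the Gaussian form of the noise and the clipping to [0, p_max] are assumptions of this sketch, since the text only states that same-dimension random exploration noise is superimposed on the power vector.

```python
import numpy as np
import torch

def select_action(target_actor, state, p_max, noise_std=0.05):
    """Continuous power vector from the target actor plus exploration noise."""
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # 1 x 1 x H x W
        power = target_actor(s).squeeze(0).numpy()                                 # mu(S_t | theta_mu)
    power = power + np.random.normal(0.0, noise_std, size=power.shape)             # exploration noise
    return np.clip(power, 0.0, p_max)                                              # stay in the feasible power range

# toy usage with a stand-in actor that maps any state to a constant power vector
toy_actor = lambda s: torch.full((1, 4), 0.5)
a_t = select_action(toy_actor, np.zeros((8, 8)), p_max=1.0)
```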
In step S3, step S3.1 takes the strategy obtained in step S2.2 as the transmit power on the current channel m, and the next environment state is calculated from the new transmit power and the interference model. In step S3.2 of step S3, the current environment state from S2.1, the strategy action selected in S2.2, the return function value obtained in S2.2 and the next environment state obtained in S3.1 are combined, according to the capacity and structure of the experience storage pool defined in S1.5, into an experience group {S, A_t, R, S_} and stored in the experience pool. When the stored experience groups reach the capacity limit of the experience pool, the most recently obtained experience group is written into the memory unit holding the oldest experience group, overwriting the oldest experience group.
In step S4.1 of step S4, a number of experience groups set by the preset batch_size is sampled from the experience storage pool of step S3 in order to train the parameters of the estimated critic neural network Q'(·|θ^Q'). Referring to FIG. 1, in step S4.2 of step S4 the training of the estimated critic network Q'(·|θ^Q') is achieved by minimizing its loss function loss_function, where loss_function is defined as follows:
loss_function(θ^Q') = (1/N_b)·Σ_i (y_i − Q'(S_i, A_i|θ^Q'))²   (10)

y_i = R_i + γ·Q(S_{i+1}, μ'(S_{i+1}|θ^μ')|θ^Q)   (11)
where Q'(S_i, A_i|θ^Q') denotes the state-action value given by the estimated critic network with parameters θ^Q', N_b is the number of sampled experience groups, and γ denotes the long-term return discount factor. When the number of training steps reaches the update step number I, the network parameters of the estimated critic neural network are copied into the target critic neural network to complete the update of the network parameters. In step S4.3 of step S4, the training of the estimated actor neural network μ'(·|θ^μ') is realized by updating the estimated actor parameters, for the current environment state, in the direction in which the target critic network evaluates the strategy selection as optimal; the update is:
∇_{θ^μ'} J ≈ (1/N_b)·Σ_i ∇_a Q(S_i, a|θ^Q)|_{a=μ'(S_i|θ^μ')} · ∇_{θ^μ'} μ'(S_i|θ^μ')
and when the training step number reaches the updating step number I, copying the network parameters in the estimated actor neural network into the target actor neural network to complete the updating of the network parameters.
In step S5, the return function R gradually converges to its optimal value as training continues. In the invention, the change in the mean value of R over ζ steps is recorded; when the change in the mean is small enough, training is considered to have converged, the algorithm stops, and the finally output strategy is used as the final anti-interference strategy. Convergence is determined as follows:
|(1/ζ)·Σ_{i=k−ζ+1}^{k} R_i − (1/ζ)·Σ_{i=k−2ζ+1}^{k−ζ} R_i| ≤ v
where v is the termination condition for determining convergence and is set to a very small positive value.
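A small helper for this stopping rule, assuming the ζ-step mean of R is compared between the two most recent non-overlapping windows (the precise windowing is not visible in the original formula image); zeta and v are example values.

```python
import numpy as np

def has_converged(reward_history, zeta=50, v=1e-3):
    """True when the mean of R over the last zeta steps changed by no more than v."""
    if len(reward_history) < 2 * zeta:
        return False
    recent = np.mean(reward_history[-zeta:])
    previous = np.mean(reward_history[-2 * zeta:-zeta])
    return abs(recent - previous) <= v
```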
Example two
The structure of the convolutional neural network used for the anti-interference decision is shown in FIG. 2. In the simulation the system is assumed to be divided into 128 sub-channels; a 128 × 128 spectrum time-slot state matrix is constructed from the spectrum sampling signals and used as the input of the convolutional neural network, and a 1 × 128 power vector is then output through three convolutional layers, two pooling layers and two fully connected layers.
Assume that the input data of the convolution operation is I and that the corresponding convolution kernel K has the same number of dimensions as the input data. Take three-dimensional input data as an example (when the input data is two-dimensional, the third dimension can be taken as 1). The convolution operation requires that the third dimension of the convolution kernel K equal the third dimension of the input data I. Denoting the three dimensions of the kernel by w_1, w_2, w_3, the output after the convolution operation is:

s(x, y) = Σ_{i=1}^{w_1} Σ_{j=1}^{w_2} Σ_{k=1}^{w_3} I(x+i−1, y+j−1, k)·K(i, j, k)
the convolutional neural network pooling operation generally comprises maximum pooling and mean pooling, and the calculation method comprises the following steps:
and (3) mean value pooling:
Figure BDA0001813149920000093
maximum pooling:
Figure BDA0001813149920000094
maximum pooling is employed in the present invention.
Specifically, in this embodiment, each layer structure is as shown in fig. 2, and each layer structure is specifically described as follows:
the available spectrum is divided into 128 sub-channels in the network model, the observation time slot is 128 in length, so the input state matrix dimension is 128 × 128.
Specifically, the state matrix from the input layer is first subjected to a convolution operation with a convolution kernel size of 3 × 3, where the number of convolution kernels is 20, the convolution stride is 1, and ReLU is adopted as the activation function; the dimension of the output after this operation is 126 × 126 × 20. The ReLU activation function is:

y = max{0, x}   (17)
the output is then subjected to a maximum pooling operation with a pooling size of 2 × 2. the output dimensionality after the first layer of convolutional pooling is 63 × 63 × 20.
The output from the second layer after the convolution pooling operation passes through a third layer of the convolution network, and the convolution operation obtains the output of 31 × 31 × 30, wherein the dimension of a convolution kernel ruler is 3 × 3, the number of convolution kernels is 30, the Relu function is adopted as an activation function, and the convolution step size is 2.
The fourth layer of the convolution network carries out convolution operation by taking the output of the third layer as input, the size of the adopted convolution kernel is 4 × 4, the number of the convolution kernels is 30, the convolution step size is 2, and the convolution operation is carried out on the w1,w2And performing zero padding operation on two dimensions, wherein the number of zero padding is 1, outputting the dimension of 15 × 15 × 30 after the layer of convolution operation, performing maximum pooling operation on the output after the convolution operation, wherein the pooling size is 3 × 3, and outputting the dimension of 5 × 5 × 30 after the pooling operation.
The output of the convolutional neural network with the fourth layer of dimension 5 × 5 × 30 is recombined into a vector with the dimension 1 × 750, and the vector with the dimension 1 × 360 is output after the processing of the fully connected layer.
The sixth layer of the convolutional network is a fully-connected layer, 128 neurons are constructed in the layer, and the Relu function is adopted as an activation function. The output from the fifth layer of the convolutional neural network is processed by the full connection layer and then outputs Q (. | [ theta ]) corresponding to the dimensionality of the anti-interference strategy sett) Vector of values, output dimension 1 × 128.
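The layer dimensions stated above can be reproduced with the following PyTorch sketch of the actor network. It is an illustrative reconstruction of Fig. 2 from the text (kernel counts, strides and padding as stated); the activation of the intermediate fully connected layer is assumed to be ReLU.

```python
import torch
import torch.nn as nn

class ActorCNN(nn.Module):
    """128 x 128 spectrum time-slot matrix -> 1 x 128 power vector (per the text of Fig. 2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3, stride=1), nn.ReLU(),              # -> 126 x 126 x 20
            nn.MaxPool2d(2),                                                    # -> 63 x 63 x 20
            nn.Conv2d(20, 30, kernel_size=3, stride=2), nn.ReLU(),             # -> 31 x 31 x 30
            nn.Conv2d(30, 30, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # -> 15 x 15 x 30
            nn.MaxPool2d(3),                                                    # -> 5 x 5 x 30
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                                  # 5 * 5 * 30 = 750
            nn.Linear(750, 360), nn.ReLU(),
            nn.Linear(360, 128), nn.ReLU(),                # one power value per sub-channel
        )

    def forward(self, x):
        return self.fc(self.features(x))

out = ActorCNN()(torch.randn(1, 1, 128, 128))
print(out.shape)    # torch.Size([1, 128])
```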
FIG. 3 shows the deep neural network structure used to implement the estimated critic neural network and the target critic neural network. The first layer is an input layer of dimension 128 × (128+1), containing the state matrix S_t representing the channel power information and the action vector A_t representing the strategy. The second layer is neural layer 1, with 1024 neurons, output dimension 1024 × 1 and the ReLU activation function. The third layer is neural layer 2, with 128 neurons, output dimension 128 × 1 and the ReLU activation function. The fourth layer is neural layer 3, with 32 neurons, output dimension 32 × 1 and the ReLU activation function. The fifth layer is neural layer 4, with 1 neuron, which outputs the Q value used to evaluate the quality of the actor network's strategy selection.
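A corresponding sketch of the critic network of Fig. 3; flattening the 128 × (128+1) input (state matrix with the action vector appended as an extra column) before the fully connected layers is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class CriticMLP(nn.Module):
    """State matrix (128 x 128) plus action vector (128) -> scalar Q value (per the text of Fig. 3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128 * 129, 1024), nn.ReLU(),   # neural layer 1
            nn.Linear(1024, 128), nn.ReLU(),         # neural layer 2
            nn.Linear(128, 32), nn.ReLU(),           # neural layer 3
            nn.Linear(32, 1),                        # neural layer 4: Q value
        )

    def forward(self, state, action):
        x = torch.cat([state, action.unsqueeze(-1)], dim=-1)   # (batch, 128, 129)
        return self.net(x.flatten(start_dim=1))

q = CriticMLP()(torch.randn(2, 128, 128), torch.rand(2, 128))
print(q.shape)    # torch.Size([2, 1])
```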
Further, FIG. 4 shows the performance of the continuous power selection anti-interference strategy based on deep deterministic policy gradient reinforcement learning proposed in the invention. The figure compares a random power selection strategy, a DQN-based discrete power selection strategy, the continuous power selection strategy based on the deep deterministic policy gradient, and the ideal optimal power selection strategy. It can be seen from the figure that the reward obtained by the proposed algorithm is substantially improved compared with the random power selection strategy.

Claims (3)

1. A communication anti-interference method based on depth determination gradient reinforcement learning is characterized by comprising the following steps:
s1, initializing definition, including:
interference environment: defining an interference environment according to the number of interferers, an interference mode and a wireless channel model;
interference environment state: forming a spectrum time slot matrix by spectrum information measured by different time slots, wherein the size of the spectrum time slot matrix is determined by an observation spectrum range and an observation time slot length;
a return function: constructing a feedback return function according to the communication quality index of a legal user;
an anti-interference strategy is as follows: defining the combination of the transmitting power on different sub-channels as an anti-interference strategy set;
deep neural network: constructing four deep neural networks, namely a target actor, an estimated actor, a target critic and an estimated critic, wherein the target actor neural network and the estimated actor neural network have the same network structure, and the target critic neural network and the estimated critic neural network have the same network structure;
an experience storage pool: presetting an experience storage pool with a fixed size, wherein the experience storage pool is used for storing an experience group consisting of a current environment state, a current anti-interference strategy, a return function value and a transition environment state;
s2, obtaining an anti-interference strategy from the interference environment state, namely the spectrum time sequence matrix, through the target actor convolutional neural network, applying the strategy to the interference environment, and observing, according to the return function, the return value of the interference environment under the current anti-interference strategy and the state matrix after the next transition; the output of the target critic neural network is used to help evaluate the quality of the actor neural network's strategy selection;
s3, forming an experience group by the current anti-interference strategy, the interference environment state, the return value under the anti-interference strategy and the transfer environment state, and storing the experience group into an experience pool;
s4, training the estimated actor neural network and the estimated critic neural network by sampling experience groups from the experience pool, and, when the number of training steps reaches a preset value, overwriting the target actor neural network parameters with the estimated actor neural network parameters and overwriting the target critic neural network parameters with the estimated critic neural network parameters, thereby completing the parameter updating of the target neural networks;
s5, judging whether the learning mechanism meets a preset stopping condition, and if so, stopping learning to obtain a final anti-interference strategy; otherwise, go back to S2 to continue learning.
2. The method of claim 1, wherein the reward function in step S1 is:

R_t = Σ_{m=1}^{N} |h_t^m|²·p_{t,m} / (Σ_{j=1}^{J} p_j^J·f(f_j, m) + σ_m²) − β·Σ_{m=1}^{N} p_{t,m}

where m ∈ {1, …, N} is the channel index number, N is the number of channels, p_j^J is the interference power of interference source j on the channel it occupies, j ∈ {1, …, J} is the interference source index number, J is the number of interference sources, and t is the time-slot index number; h_t^m indicates the channel between the legitimate communication users, p_{t,m} is the transmit power on the sub-channel, the function f(f_j, m) outputs 1 when f_j = m and 0 otherwise, and the last term, weighted by the coefficient β, is the transmit power overhead.
3. The communication interference rejection method based on the depth-determined gradient reinforcement learning of claim 2, wherein in step S4, the method for updating the parameters of the convolutional neural network is as follows:
training the parameters of the convolutional neural network, obtaining corresponding state behavior values through the convolutional neural network according to the current state and the next state in the extracted experience group, constructing a corresponding loss function, and updating the network parameters through the minimized loss function.
CN201811129485.9A 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning Expired - Fee Related CN109302262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Publications (2)

Publication Number Publication Date
CN109302262A CN109302262A (en) 2019-02-01
CN109302262B true CN109302262B (en) 2020-07-10

Family

ID=65164716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811129485.9A Expired - Fee Related CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-interference method based on depth determination gradient reinforcement learning

Country Status (1)

Country Link
CN (1) CN109302262B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109861720B (en) * 2019-03-15 2021-07-30 中国科学院上海高等研究院 WSN anti-interference method, device, equipment and medium based on reinforcement learning
CN110113418B (en) * 2019-05-08 2020-06-02 电子科技大学 Collaborative cache updating method for vehicle-associated information center network
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110944354B (en) * 2019-11-12 2022-10-04 广州丰石科技有限公司 Base station interference monitoring method and system based on waveform analysis and deep learning
CN111181618B (en) * 2020-01-03 2022-05-10 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111835453B (en) * 2020-07-01 2022-09-20 中国人民解放军空军工程大学 Communication countermeasure process modeling method
CN112087749B (en) * 2020-08-27 2023-06-02 华北电力大学(保定) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
CN112188004B (en) * 2020-09-28 2022-04-05 精灵科技有限公司 Obstacle call detection system based on machine learning and control method thereof
CN112202527B (en) * 2020-10-01 2022-09-13 西北工业大学 Intelligent electromagnetic signal identification system interference method based on momentum gradient disturbance
CN112492691B (en) * 2020-11-26 2024-03-26 辽宁工程技术大学 Downlink NOMA power distribution method of depth deterministic strategy gradient
CN114696925B (en) * 2020-12-31 2023-12-15 华为技术有限公司 Channel quality assessment method and related device
CN113038616B (en) * 2021-03-16 2022-06-03 电子科技大学 Frequency spectrum resource management and allocation method based on federal learning
CN112906640B (en) * 2021-03-19 2022-10-14 电子科技大学 Space-time situation prediction method and device based on deep learning and readable storage medium
CN113098565B (en) * 2021-04-02 2022-06-07 甘肃工大舞台技术工程有限公司 Stage carrier communication self-adaptive frequency hopping anti-interference method based on deep network
CN113221454B (en) * 2021-05-06 2022-09-13 西北工业大学 Electromagnetic radiation source identification method based on deep reinforcement learning
CN113411099B (en) * 2021-05-28 2022-04-29 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113890564B (en) * 2021-08-24 2023-04-11 浙江大学 Special ad hoc network frequency hopping anti-interference method and device for unmanned aerial vehicle based on federal learning
CN114417939B (en) * 2022-01-27 2022-06-28 中国人民解放军32802部队 Interference strategy generation method based on knowledge graph

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581738A (en) * 2015-01-30 2015-04-29 厦门大学 Cognitive radio hostile interference resisting method based on Q learning
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107479368B (en) * 2017-06-30 2021-09-21 北京百度网讯科技有限公司 Method and system for training unmanned aerial vehicle control model based on artificial intelligence

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581738A (en) * 2015-01-30 2015-04-29 厦门大学 Cognitive radio hostile interference resisting method based on Q learning
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A heterogeneous information fusion deep reinforcement learning for intelligent frequency selection of HF communication;Xin Liu;《China Communications》;20180906;Vol. 19, No. 9;full text *
Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach;Xin Liu;《IEEE COMMUNICATIONS LETTERS》;20180312;Vol. 22, No. 5;full text *
Two-dimensional anti-jamming communication based on deep reinforcement learning;Guoan Han;《2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20180719;full text *

Also Published As

Publication number Publication date
CN109302262A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109302262B (en) Communication anti-interference method based on depth determination gradient reinforcement learning
CN109274456B (en) Incomplete information intelligent anti-interference method based on reinforcement learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109639377B (en) Spectrum resource management method based on deep reinforcement learning
CN109617584B (en) MIMO system beam forming matrix design method based on deep learning
CN111970072B (en) Broadband anti-interference system and method based on deep reinforcement learning
Menon et al. A game-theoretic framework for interference avoidance
CN113795049B (en) Femtocell heterogeneous network power self-adaptive optimization method based on deep reinforcement learning
CN113613301B (en) Air-ground integrated network intelligent switching method based on DQN
CN109068382B (en) NOMA cross-layer power distribution method based on time delay QoS
Eisen et al. Large scale wireless power allocation with graph neural networks
Nikoloska et al. Modular meta-learning for power control via random edge graph neural networks
Rahmani et al. Deep reinforcement learning-based sum rate fairness trade-off for cell-free mMIMO
CN113038612A (en) Cognitive radio power control method based on deep learning
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN117240331A (en) No-cellular network downlink precoding design method based on graph neural network
Elarfaoui et al. Optimization of QoS parameters in cognitive radio using combination of two crossover methods in genetic algorithm
CN115276858A (en) Dynamic spectrum multi-domain anti-interference method and system based on cognitive anti-interference model
CN115278896A (en) MIMO full duplex power distribution method based on intelligent antenna
Jiang et al. Dynamic spectrum access for femtocell networks: A graph neural network based learning approach
Zhang et al. A convolutional neural network based resource management algorithm for NOMA enhanced D2D and cellular hybrid networks
CN110474663B (en) Iterative intelligent signal detection method based on neural network
Ali et al. Deep-Q Reinforcement Learning for Fairness in Multiple-Access Cognitive Radio Networks
Dai et al. Power allocation for multiple transmitter-receiver pairs under frequency-selective fading based on convolutional neural network
Kim et al. RL-based transmission completion time minimization with energy harvesting for time-varying channels

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200710

Termination date: 20210927

CF01 Termination of patent right due to non-payment of annual fee