CN109302262A - Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning - Google Patents

Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning

Info

Publication number
CN109302262A
CN109302262A (application CN201811129485.9A; granted publication CN109302262B)
Authority
CN
China
Prior art keywords
interference
neural network
strategy
actor
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811129485.9A
Other languages
Chinese (zh)
Other versions
CN109302262B (en)
Inventor
黎伟
王军
李黎
党泽
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
CETC 54 Research Institute
Original Assignee
University of Electronic Science and Technology of China
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China and CETC 54 Research Institute
Priority to CN201811129485.9A priority Critical patent/CN109302262B/en
Publication of CN109302262A publication Critical patent/CN109302262A/en
Application granted granted Critical
Publication of CN109302262B publication Critical patent/CN109302262B/en
Expired - Fee Related
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04K - SECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00 - Jamming of communication; Counter-measures
    • H04K3/20 - Countermeasures against jamming
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04K - SECRET COMMUNICATION; JAMMING OF COMMUNICATION
    • H04K3/00 - Jamming of communication; Counter-measures
    • H04K3/40 - Jamming having variable characteristics

Abstract

The invention belongs to the field of wireless communication technology and relates to a communication anti-jamming method based on deep deterministic policy gradient (DDPG) reinforcement learning. The invention first constructs an interference environment model from the number of interference sources and the wireless channel model; constructs a utility function from the legitimate user's communication quality indicators and uses this utility function as the reward in learning; and assembles the spectrum information sampled at different time slots into a spectrum-time slot matrix that describes the interference environment state. Convolutional neural networks are then constructed according to the deep deterministic policy gradient reinforcement learning mechanism; when an anti-jamming decision is made, the environment state matrix is passed through the target actor convolutional neural network, which selects an anti-jamming strategy for the corresponding state over a continuous action space. Based on the deep deterministic policy gradient reinforcement learning mechanism, the invention completes continuous anti-jamming strategy selection in communication, overcomes the quantization error introduced by quantizing and discretizing the strategy space, reduces the number of neural network output cells and the network complexity, and improves the performance of the anti-jamming algorithm.

Description

Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning
Technical field
The invention belongs to the field of wireless communication technology and relates to a communication anti-jamming method based on deep deterministic policy gradient reinforcement learning.
Background art
With the development of wireless communication technology, the electromagnetic environment faced by wireless communication systems has become increasingly complex and hostile: a system may suffer unintentional interference from friendly communications, and may also be affected by interference signals deliberately released by an adversary. Traditional anti-jamming measures assume a static interference pattern of the interference source and adopt a fixed anti-jamming strategy. As jamming becomes intelligent, an interference source can dynamically adjust its jamming strategy according to changes in the legitimate user's communication state, so traditional anti-jamming methods can no longer guarantee the legitimate user's normal communication in a dynamic interference environment. It is therefore necessary to adopt intelligent anti-jamming strategies that respond to the dynamic jamming strategies of the interference source, so as to guarantee the legitimate user's normal communication in a dynamic interference environment.
At present, dynamic adjustment of the anti-jamming strategy against an interference source's dynamic jamming is mainly carried out by means of reinforcement learning. This method first discretizes the anti-jamming strategy space to construct an anti-jamming strategy set; it then constructs a utility function related to the legitimate user's communication quality; the environment state matrix is obtained by spectrum sampling and preprocessing, and a discrete strategy is selected from the environment state matrix by a deep neural network; finally, the selected strategy is applied to the environment and the environment state transition is estimated. Through repeated learning, the optimal communication strategy under a dynamic jamming strategy is obtained. See: Xin Liu, et al., "Anti-jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach", IEEE Communications Letters, vol. 22, no. 5, May 2018. This method quantizes and discretizes the power selection strategy to form a power selection set. It then constructs a deep neural network which, from the spectrum-time slot matrix sampled from the air-interface interference environment, outputs the state-action value of each discrete power strategy. Power strategy selection is finally carried out by an ε-greedy policy. However, quantizing and discretizing the power introduces quantization error, so the power selection result cannot reach the optimum. Moreover, when the transmit power on each of the subchannels is discretized, the constructed strategy set must contain L^N elements, where N is the number of channels and L is the number of quantization levels, and the corresponding deep neural network needs L^N outputs. When the number of channels or the number of quantization levels is large, the number of neural network outputs grows exponentially, which increases the complexity of training the neural network and of carrying out ε-greedy strategy selection.
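The scale problem can be made concrete with a quick calculation: with N subchannels and L quantization levels per subchannel, a discrete joint power action space has L^N entries, while a continuous actor needs only N outputs. A tiny Python sketch, with illustrative numbers:

```python
# Discrete joint action space vs. continuous actor output size (illustrative values).
N = 8   # number of subchannels
L = 10  # quantization levels per subchannel

discrete_actions = L ** N   # 100_000_000 outputs for a DQN-style head
continuous_outputs = N      # one continuous power value per subchannel
print(discrete_actions, continuous_outputs)
```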
Summary of the invention
In view of the above technical problems, the invention proposes a communication anti-jamming power selection method based on the deep deterministic policy gradient (Deep Deterministic Policy Gradient, DDPG) reinforcement learning mechanism. Without discretizing the power strategy space, it completes deterministic anti-jamming power strategy selection, improves anti-jamming performance, and reduces strategy selection complexity.
The invention first constructs the interference environment from the number of interference sources and the wireless channel model. A utility function is constructed from the legitimate user's communication quality indicators and used as the reward in learning. The spectrum information sampled at different time slots is assembled into a spectrum-time slot matrix, which describes the interference environment state. The invention constructs four deep neural networks, a target actor (target_actor), an estimation actor (evaluate_actor), a target critic (target_critic) and an estimation critic (evaluate_critic), which are used respectively for strategy selection from the environment state matrix, strategy-selection network training, strategy-selection evaluation and evaluation-network training. The target actor network and the estimation actor network have the same network structure, and the target critic network and the estimation critic network have the same network structure. The environment state matrix is passed through the target actor network, which outputs an anti-jamming strategy; the legitimate user adjusts its transmit power and channel selection to realize intelligent anti-jamming strategy adjustment. The reward value and the next environment state matrix are computed from the air-interface interference environment model and the anti-jamming strategy. The current environment state, the current anti-jamming strategy, the reward value and the next environment state form an experience tuple, which is stored in an experience pool. Finally, experience tuples drawn from the experience pool are used to complete the training of the estimation actor network and the estimation critic network. When the number of learning steps reaches a certain amount, the parameters of the estimation actor network and the estimation critic network are copied to the target actor network and the target critic network, respectively, completing their updates. This learning mechanism continues until the learning result converges.
The legitimate user's intelligent anti-jamming scheme proposed by the invention is realized by the following steps (a schematic sketch of the overall learning loop is given after the step list):
S1, define each algorithm module of the intelligent anti-jamming scheme: the interference environment, the interference environment state, the reward function, the anti-jamming strategy and the experience storage pool.
S2, construct four deep neural networks: a target actor network (target_actor), an estimation actor network (evaluate_actor), a target critic network (target_critic) and an estimation critic network (evaluate_critic). The target actor network and the estimation actor network have the same network structure, and the target critic network and the estimation critic network have the same structure.
S3, pass the environment state information, i.e. the spectrum-time slot matrix, through the target actor network to obtain an anti-jamming strategy; apply this strategy to the interference environment; compute the reward value of the anti-jamming strategy under the current interference environment and the next state matrix; and store them.
S4, sample experience tuples from the experience pool to train the estimation actor network and the estimation critic network and update their parameters.
S5, judge whether the learning mechanism meets the stop condition; if so, stop learning and obtain the final anti-jamming strategy; otherwise return to S2 and continue learning.
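For concreteness, the following Python sketch outlines how steps S1-S5 fit together in one training loop. It is an illustrative reconstruction, not the patented implementation: the environment object `env`, the helper functions `select_action`, `ddpg_update`, `hard_update`, `has_converged` and `to_tensors`, and all hyperparameter values are assumptions (sketches for most of these helpers appear later in this description).

```python
# Minimal skeleton of the S1-S5 loop; env, networks and helpers are
# illustrative assumptions, sketched later in this description.
def train(env, actor, critic, target_actor, target_critic,
          pool, actor_opt, critic_opt,
          batch_size=32, update_interval=100, warmup=500):
    state = env.reset()                   # S1: initial spectrum-time slot matrix
    step, rewards = 0, []
    while not has_converged(rewards):     # S5: stop condition
        action = select_action(target_actor, state)    # S2/S3: pick strategy
        next_state, reward = env.step(action)          # apply to environment
        pool.store(state, action, reward, next_state)  # S3: experience pool
        rewards.append(reward)
        if len(pool) >= warmup:                        # S4: network training
            batch = to_tensors(pool.sample(batch_size))
            ddpg_update(batch, actor, critic, target_actor, target_critic,
                        actor_opt, critic_opt)
            if step % update_interval == 0:            # hard target-net copy
                hard_update(target_actor, actor)
                hard_update(target_critic, critic)
        state = next_state
        step += 1
    return target_actor                   # final anti-jamming strategy
```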
According to an embodiment of the invention, the above step S1 comprises the following steps:
S1.1, interference environment definition: the interference environment is defined according to the number of jammers, the jamming patterns and the wireless channel model.
S1.2, interference environment state definition: the spectrum information measured at different time slots forms the spectrum-time slot matrix; the size of the spectrum-time slot matrix is determined by the observed spectrum range and the observation slot length.
S1.3, reward function definition: a feedback reward function is constructed according to the legitimate user's communication quality indicators.
S1.4, anti-jamming strategy definition: the combination of transmit powers on the different subchannels is defined as the anti-jamming strategy set. The transmit power on each subchannel can take any value in the continuous interval $[0, P_{max}^m]$.
S1.5, experience storage pool definition: an experience storage pool of fixed size is preset for storing experience tuples composed of the current environment state matrix, the anti-jamming strategy, the reward value and the next environment state matrix.
According to an embodiment of the invention, the above step S2 comprises the following steps:
S2.1, construct the target actor network and the estimation actor network with convolutional neural networks of identical structure. The convolutional neural network comprises multiple convolutional layers, multiple pooling layers and multiple fully connected layers. The target actor network completes anti-jamming strategy selection from the input spectrum-time slot state matrix. The estimation actor network completes network training and parameter updates from the sampled experience tuples. When the number of training steps reaches a preset value, the target actor network parameters are overwritten with the estimation actor network parameters, completing the parameter update of the target actor network.
S2.2, construct the target critic network and the estimation critic network with conventional deep neural networks of identical structure. The deep neural network comprises multiple neural layers, each containing multiple neurons and an activation function. The output of the target critic network is used to evaluate the quality of the actor network's strategy selection. The estimation critic network performs network training and parameter updates from the sampled experience information. When the number of training steps reaches a preset value, the target critic network parameters are overwritten with the estimation critic network parameters, completing the parameter update.
According to an embodiment of the invention, the above step S3 comprises the following steps:
S3.1, according to the environment state definition in step S1.2, pass the environment state matrix through the target actor network constructed in step S2.1 to obtain an anti-jamming strategy. Apply the anti-jamming strategy to the interference environment defined in step S1.1, and compute the reward value and the next state matrix after the transition.
S3.2, define an experience pool of capacity M, and store the experience tuple {S, A, R, S_} formed by the current environment state, the strategy selected in S3.1, the reward value obtained from the strategy-environment interaction and the next environment state into the experience pool.
According to an embodiment of the invention, the above step S4 comprises the following steps:
S4.1, randomly draw a certain number of experience tuples from the experience pool obtained in S3.2 for training and updating the neural network parameters.
S4.2, for the current state S and the next state S_ in the experience tuples drawn in step S4.1, obtain the two corresponding state-action values through the target networks and the estimation networks. Construct the loss function from the current reward value and the two state-action values, and complete the training and update of the estimation critic network by minimizing the loss function.
S4.3, pass the current state S in the experience tuples drawn in step S4.1 through the estimation critic network to obtain one state-action value, and pass the current state S and the strategy A through the target actor network to obtain the corresponding state-action value. Construct a loss function from the two state-action values and carry out the training and parameter update of the estimation actor network.
The beneficial effects of the invention are:
Based on the deep deterministic policy gradient reinforcement learning mechanism, the invention completes continuous anti-jamming strategy selection in communication. It overcomes the quantization error introduced by quantizing and discretizing the strategy space, reduces the number of neural network output cells and the network complexity, and improves the performance of the anti-jamming algorithm.
Description of the drawings
Fig. 1 is the processing framework of the anti-jamming strategy selection algorithm based on the deep deterministic policy gradient reinforcement learning mechanism designed by the invention.
Fig. 2 shows the structure of the target actor network and the estimation actor network designed by the invention.
Fig. 3 shows the structure of the target critic network and the estimation critic network designed by the invention.
Fig. 4 compares the performance of the algorithm designed by the invention with the optimal policy selection, random policy selection and DQN-based discretized decision methods.
Specific embodiments
To make the steps of the invention clearer, the invention is explained in further detail below with reference to the attached drawings and implementation examples.
Embodiment one
Fig. 1 shows the specific implementation of the algorithm of the invention; each step and its principle are described in detail below with reference to Fig. 1.
The implementation framework of the continuous-strategy-selection anti-jamming method based on deep deterministic policy gradient reinforcement learning proposed by the invention is shown in Fig. 1. Interference and wireless environment modeling is completed in step S1.1 of step S1. In the scenario, multiple interference sources jam the legitimate communication link; the jamming patterns may include, but are not limited to, five types: single-tone jamming, multi-tone jamming, linear sweep jamming, partial-band jamming and noise frequency-modulation jamming. An interference source can dynamically adjust its jamming of the legitimate user by adjusting jamming parameters or switching jamming patterns. The concrete mathematical models of the five jamming patterns are as follows:
(1) Single-tone jamming
The complex baseband expression of the single-tone jamming signal is:
$$J(t) = A\, e^{j(2\pi f_J t + \varphi)}$$
where A is the single-tone jamming signal amplitude, $f_J$ is the single-tone jamming signal frequency, and $\varphi$ is the single-tone jamming initial phase.
(2) Multi-tone jamming
The complex baseband expression of the multi-tone jamming signal is:
$$J(t) = \sum_{m=1}^{M} A_m\, e^{j(2\pi f_m t + \varphi_m)}$$
where $A_m$ is the amplitude of the m-th tone in the multi-tone jamming, $f_m$ is the frequency of the m-th tone, and $\varphi_m$ is the initial phase of the m-th tone.
(3) Linear sweep jamming
The complex baseband expression of the linear sweep jamming signal is:
$$J(t) = A\, e^{j(2\pi f_0 t + \pi k t^2 + \varphi)}, \quad 0 \le t \le T$$
where A is the amplitude, $f_0$ is the starting frequency, k is the frequency-modulation (chirp) coefficient, $\varphi$ is the initial phase, and T is the signal duration.
(4) Partial-band jamming
Partial-band Gaussian noise jamming behaves as white Gaussian noise within a partial band; its complex baseband expression is:
$$J(t) = U_n(t)\, e^{j(2\pi f_J t + \varphi)}$$
where $U_n(t)$ is baseband noise with zero mean and variance $\sigma_n^2$, $f_J$ is the center frequency of the signal, and $\varphi$ is a uniformly distributed and mutually independent phase on $[0, 2\pi]$.
(5) Noise frequency-modulation jamming
The complex baseband of the noise frequency-modulation (noise FM) signal can be expressed as:
$$J(t) = A\, e^{j\left(2\pi f_0 t + 2\pi k_{fm} \int_0^t \xi(\tau)\,d\tau\right)}$$
where A is the amplitude of the noise FM signal, $f_0$ is the carrier frequency of the noise FM signal, $k_{fm}$ is the frequency-modulation index, and $\xi(t)$ is zero-mean narrowband Gaussian white noise with variance $\sigma^2$. The integral $\int_0^t \xi(\tau)\,d\tau$ is a Wiener process and follows a $\mathcal{N}(0, \sigma^2 t)$ Gaussian distribution. The frequency-modulation index $k_{fm}$ and the variance $\sigma^2$ jointly determine the effective bandwidth of the noise FM signal.
The interference source dynamically selects the jamming pattern and the corresponding parameters to maximize the jamming effect.
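To make the five jamming models concrete, the following NumPy sketch generates each complex-baseband waveform as defined above. The sample rate, duration and all default parameter values are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1e6                          # sample rate (Hz), illustrative
t = np.arange(0, 1e-3, 1 / fs)    # 1 ms of samples

def single_tone(A=1.0, fJ=100e3, phi=0.0):
    return A * np.exp(1j * (2 * np.pi * fJ * t + phi))

def multi_tone(A=(1.0, 0.8, 0.6), f=(50e3, 120e3, 200e3), phi=(0.0, 0.5, 1.0)):
    return sum(Am * np.exp(1j * (2 * np.pi * fm * t + pm))
               for Am, fm, pm in zip(A, f, phi))

def linear_sweep(A=1.0, f0=10e3, k=2e8, phi=0.0):
    # instantaneous frequency sweeps linearly: f0 + k * t
    return A * np.exp(1j * (2 * np.pi * f0 * t + np.pi * k * t**2 + phi))

def partial_band(sigma2=1.0, fJ=150e3):
    # zero-mean complex Gaussian baseband noise shifted to center frequency fJ
    Un = (rng.normal(0, np.sqrt(sigma2 / 2), t.size)
          + 1j * rng.normal(0, np.sqrt(sigma2 / 2), t.size))
    phi = rng.uniform(0, 2 * np.pi)
    return Un * np.exp(1j * (2 * np.pi * fJ * t + phi))

def noise_fm(A=1.0, f0=100e3, kfm=5e4, sigma2=1.0):
    xi = rng.normal(0, np.sqrt(sigma2), t.size)
    integral = np.cumsum(xi) / fs     # discrete Wiener-process approximation
    return A * np.exp(1j * (2 * np.pi * f0 * t + 2 * np.pi * kfm * integral))
```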
In the environment, the legitimate user's anti-jamming strategy computes the reward value R and the environment state matrix S through intelligent sampling of the wireless spectrum; a historical experience tuple is constructed from the reward, the environment state, the current anti-jamming strategy and the next state matrix, and stored in the experience pool; the neural network selects the next anti-jamming action from the current environment state matrix and applies this anti-jamming strategy to the environment, while updating its parameters from the historical experience; the whole algorithm iterates until convergence. Specifically, the implementation steps of the algorithm are as follows:
Steps S1.2, S1.3 and S1.4 of the invention complete the design of the environment state, the reward function and the anti-jamming strategy, respectively. In a multi-subchannel system, the signal received on subchannel m at the receiving end of the legitimate link can be expressed as:
$$y_t^m = h_t^m x_t + \sum_{j=1}^{J} h_{j,t}^m x_j + n_t^m$$
where $m \in \{1, \ldots, N\}$ is the channel index and N is the number of channels; $x_t$ is the useful transmitted signal and $x_j$ is the interference signal; $n_t^m$ is the white Gaussian noise in the subchannel; $j \in \{1, \ldots, J\}$ is the interference source index and J is the number of interference sources; t is the time index; $h_t^m$ denotes the channel between the legitimate communication users, and $h_{j,t}^m$ denotes the interference channel from interference source j to the legitimate user's receiver. Therefore, the signal-to-interference-plus-noise ratio (SINR) obtained at the legitimate user's receiving end and the achievable rate can be expressed as:
$$\mathrm{SINR}_t^m = \frac{p_t^m\, |h_t^m|^2}{\sum_{j=1}^{J} p_{j,t}^m\, |h_{j,t}^m|^2 + \sigma_m^2}, \qquad R_t^m = \log_2\!\left(1 + \mathrm{SINR}_t^m\right)$$
where $|h_t^m|^2$ is the equivalent channel gain on the subchannel and $\sigma_m^2$ is the corresponding noise power. The achievable rate of the receiving end at time t can be expressed as the sum of the rates over the N subchannels:
$$R_t = \sum_{m=1}^{N} \log_2\!\left(1 + \mathrm{SINR}_t^m\right)$$
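The per-subchannel SINR and sum achievable rate above can be computed as in the following NumPy sketch; the array shapes and the function name `achievable_rate` are assumptions chosen for illustration.

```python
import numpy as np

def achievable_rate(p_tx, h_legit, p_jam, h_jam, noise_power):
    """Sum achievable rate over N subchannels.

    p_tx        : (N,)   transmit power per subchannel
    h_legit     : (N,)   legitimate channel gains |h|^2
    p_jam       : (J, N) jamming power of each of J sources per subchannel
    h_jam       : (J, N) interference channel gains |h_j|^2
    noise_power : (N,)   noise power per subchannel
    """
    interference = np.sum(p_jam * h_jam, axis=0)          # (N,)
    sinr = p_tx * h_legit / (interference + noise_power)  # (N,)
    return np.sum(np.log2(1.0 + sinr))
```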
Before the anti-jamming decision, the power on each subchannel is first obtained by sampling the wireless environment; the powers of all subchannels form the power vector $P_t = [p_{t,1}, p_{t,2}, \ldots, p_{t,N}]$, where N is the number of subchannels. The state matrix S is formed from multiple historical power vectors, $S_t = [P_{t-1}\ P_{t-2}\ \cdots\ P_{t-\tau}]^T$, where $\tau$ is the observation time window. The anti-jamming strategy also takes the transmit power limits into account; the reward function designed in the invention considers both the SINR gain of the adopted anti-jamming strategy and the power overhead, and is expressed as follows:
$$R_t = \sum_{m=1}^{N} \log_2\!\left(1 + \frac{p_t^m\, |h_t^m|^2}{\sum_{j=1}^{J} p_j\, f(f_j, m)\, |h_{j,t}^m|^2 + \sigma_m^2}\right) - \lambda \sum_{m=1}^{N} p_t^m$$
where $p_j$ is the jamming power of interference source j on channel $f_j$; the function $f(f_j, m)$ outputs 1 when $f_j = m$ and 0 otherwise; and $\lambda \sum_{m} p_t^m$ is the transmit power overhead.
Because some subchannels are affected by interference sources, the interference strength on those subchannels is large; link communication quality can be maximized within the allowed power range by adjusting the transmit power on the corresponding channels. Therefore, in the invention, the anti-jamming strategy on each subchannel is the transmit power on that subchannel. The invention assumes the maximum transmit power of subchannel m is $P_{max}^m$, where $m \in \{1, \ldots, N\}$, so the anti-jamming strategy set can be expressed as $\mathcal{A} = \{(p^1, \ldots, p^N) \mid 0 \le p^m \le P_{max}^m\}$.
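As an illustration of the state construction, the sketch below accumulates the sampled per-subchannel power vectors of the last τ slots into the spectrum-time slot state matrix. The constants (128 subchannels, window length 128, matching embodiment two), the row ordering (newest last) and the helper name are assumptions.

```python
import numpy as np
from collections import deque

N_CHANNELS = 128    # subchannels, as in embodiment two
WINDOW = 128        # observation time window tau

power_history = deque(maxlen=WINDOW)

def update_state(sampled_power, history=power_history):
    """Append the newest per-subchannel power vector and build S_t.

    sampled_power : (N_CHANNELS,) spectrum sample of the current slot.
    Returns the (WINDOW, N_CHANNELS) spectrum-time slot state matrix,
    zero-padded until enough history has accumulated.
    """
    history.append(np.asarray(sampled_power, dtype=np.float32))
    state = np.zeros((WINDOW, N_CHANNELS), dtype=np.float32)
    state[-len(history):] = np.stack(history)   # newest rows last
    return state
```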
Step S1.5 of step S1 of the invention defines the experience tuple and the experience pool; the storage and sampling of historical experience provide the training and parameter updates of the neural networks in the subsequent steps. Following the algorithm structure of Fig. 1, the invention defines an experience pool of capacity $M_e$, which can store $M_e$ historical experiences. The current environment state S obtained through S1.2-S1.5 of step S1, the reward value R, the current anti-jamming strategy $A_t$ and the next environment state S_ form the experience tuple {S, $A_t$, R, S_}. The experience tuples are stored in the experience pool one by one; when the number of stored tuples reaches the maximum capacity, the tuple stored for the longest time is overwritten by the newly arriving tuple.
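A minimal sketch of such a fixed-capacity experience pool, assuming the FIFO overwrite policy described above; the class name and interface are illustrative.

```python
import random
from collections import deque

class ExperiencePool:
    """FIFO experience pool of capacity M_e; oldest tuples are overwritten."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest entry

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```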
In step S2.1 of step S2 of the invention, the target actor network $\mu(\cdot\,|\,\theta^\mu)$ and the estimation actor network $\mu'(\cdot\,|\,\theta^{\mu'})$ are constructed with convolutional neural networks. The target actor network and the estimation actor network have the same network structure; the specific structure is shown in Fig. 2, and the specific parameters are given in embodiment two. The current environment state matrix obtained in step S1.2 is passed through the target actor network, which selects the transmit power vector of the corresponding subchannels from the continuous anti-jamming strategy space: $A = \mu(S_t\,|\,\theta^\mu)$. In order to explore unknown strategies and avoid falling into local optima, a random exploration noise of the same dimension is superimposed on the power vector, i.e. $A_t = \mu(S_t\,|\,\theta^\mu) + \mathcal{N}_t$, forming the current anti-jamming strategy $A_t$. This strategy is applied to the environment, completing the interaction between the strategy and the interference environment, so that the next environment state and the reward value can be computed. In step S2.2 of step S2 of the invention, the target critic network $Q(\cdot\,|\,\theta^Q)$ and the estimation critic network $Q'(\cdot\,|\,\theta^{Q'})$ are constructed with the same deep neural network structure. The target actor network completes anti-jamming strategy selection from the input spectrum-time slot state matrix. The estimation actor network completes network training and parameter updates from the sampled experience tuples; when the number of training steps reaches a preset value, the target actor network parameters are overwritten with the estimation actor network parameters, completing the parameter update of the target actor network. The output of the target critic network is used to evaluate the quality of the actor network's strategy selection. The estimation critic network performs network training and parameter updates from the sampled experience information; when the number of training steps reaches a preset value, the target critic network parameters are overwritten with the estimation critic network parameters, completing the parameter update.
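Strategy selection with superimposed exploration noise, as described above, might look like the following sketch. The Gaussian noise model, the noise standard deviation and the power limit `p_max` are assumptions; the patent does not specify the noise distribution.

```python
import numpy as np
import torch

def select_action(actor, state, noise_std=0.1, p_max=1.0):
    """Pass the state matrix through the actor and add exploration noise.

    actor : torch.nn.Module mapping a (1, 1, H, W) state to (1, N) powers
    state : (H, W) spectrum-time slot matrix
    """
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
        a = actor(s).squeeze(0).numpy()
    a = a + np.random.normal(0.0, noise_std, size=a.shape)  # exploration noise
    return np.clip(a, 0.0, p_max)   # respect per-subchannel power limits
```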
In step S3.1 of step S3, the strategy obtained in step S2 is applied as the transmit power on each channel m; the next environment state is then computed from the new transmit power and the interference model. In step S3.2 of step S3, according to the capacity and structure of the experience storage pool defined in S1.5, the current environment state, the selected strategy, the reward value obtained from the strategy-environment interaction and the next environment state obtained in S3.1 form the experience tuple {S, $A_t$, R, S_}, which is stored in the experience pool. When the number of stored experience tuples reaches the maximum capacity, the newest tuple is written into the storage unit of the oldest tuple, overwriting the oldest tuple.
In step S4.1 of step S4, a number of experience tuples given by the preset batch_size is drawn from the experience storage pool of step S3 to train the parameters of the estimation critic network $Q'(\cdot\,|\,\theta^{Q'})$. As shown in Fig. 1, in step S4.2 of step S4 the estimation critic network $Q'(\cdot\,|\,\theta^{Q'})$ is trained by minimizing its loss function, defined as follows:
$$L(\theta^{Q'}) = \frac{1}{N_b}\sum_{i}\left(y_i - Q'(S_i, A_i\,|\,\theta^{Q'})\right)^2 \qquad (10)$$
$$y_i = R_i + \gamma\, Q\!\left(S_{i+1},\, \mu'(S_{i+1}\,|\,\theta^{\mu'})\,\big|\,\theta^{Q}\right) \qquad (11)$$
where $N_b$ is the number of sampled experience tuples.
Here $Q'(S_i, A_i\,|\,\theta^{Q'})$ denotes the state-action value function depending on the estimation critic network parameters $\theta^{Q'}$, and $\gamma$ denotes the long-term reward discount factor. When the number of training steps reaches the update step number I, the network parameters of the estimation critic network are copied into the target critic network, completing the network parameter update. In step S4.3 of step S4, the estimation actor network $\mu'(\cdot\,|\,\theta^{\mu'})$ is trained by reinforcing, under the current environment state, the parameter direction in which the target critic network evaluates the strategy choice as optimal, i.e. by the deterministic policy gradient:
$$\nabla_{\theta^{\mu'}} J \approx \frac{1}{N_b}\sum_{i} \nabla_{a}\, Q(S_i, a\,|\,\theta^{Q})\Big|_{a=\mu'(S_i)}\, \nabla_{\theta^{\mu'}}\, \mu'(S_i\,|\,\theta^{\mu'})$$
When the number of training steps reaches the update step number I, the network parameters of the estimation actor network are copied into the target actor network, completing the network parameter update.
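The S4.2-S4.3 updates correspond to the standard DDPG critic and actor steps; the following PyTorch sketch shows one such update under that assumption, together with the hard target-network copy performed every I steps. The network, optimizer and batch objects are assumed to exist (the actor and critic sketches appear in embodiment two below).

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One training step on a sampled batch (S, A, R, S_)."""
    S, A, R, S_next = batch   # tensors: (B,1,H,W), (B,N), (B,1), (B,1,H,W)

    # --- critic update: minimize (y - Q(S, A))^2, eqs. (10)-(11) ---
    with torch.no_grad():
        y = R + gamma * target_critic(S_next, target_actor(S_next))
    critic_loss = F.mse_loss(critic(S, A), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # --- actor update: ascend Q(S, mu(S)) via the deterministic policy gradient ---
    actor_loss = -critic(S, actor(S)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

def hard_update(target, source):
    """Overwrite target network parameters with estimation network parameters."""
    target.load_state_dict(source.state_dict())
```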
In step S5, as training continues, the reward R gradually converges to its optimal value. The invention records the change in the mean of R over ζ steps; when this change is sufficiently small, training is considered to have converged, the algorithm is stopped, and the finally output strategy is taken as the final anti-jamming strategy. The convergence decision criterion is:
$$\left|\,\frac{1}{\zeta}\sum_{i=t-\zeta+1}^{t} R_i \;-\; \frac{1}{\zeta}\sum_{i=t-2\zeta+1}^{t-\zeta} R_i\,\right| < \upsilon$$
where υ is the termination threshold that determines convergence, set to a very small positive value.
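A small sketch of this convergence test, comparing the mean reward of the two most recent ζ-step windows; the sliding-window form is an assumption consistent with the criterion above.

```python
import numpy as np

def has_converged(rewards, zeta=100, upsilon=1e-3):
    """True when the mean reward change between consecutive windows is below upsilon."""
    if len(rewards) < 2 * zeta:
        return False
    recent = np.mean(rewards[-zeta:])
    previous = np.mean(rewards[-2 * zeta:-zeta])
    return abs(recent - previous) < upsilon
```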
Embodiment two
The structure of the convolutional neural network proposed by the invention for anti-jamming decisions is shown in Fig. 2. In the simulation, the system is assumed to be divided into 128 subchannels, and a 128 × 128 spectrum-time slot state matrix constructed from the spectrum samples serves as the input of the convolutional neural network; three convolutional layers, two pooling layers and two fully connected layers then output a 1 × 128 power vector. Concretely, the convolution and pooling operations in the convolutional neural network are as follows:
Assume the input data of the convolution operation is I and the corresponding convolution kernel is K, with a dimensionality matching that of the input data. Taking three-dimensional input data as an example (when the input data is two-dimensional, the third dimension can be regarded as 1), the convolution operation requires the third dimension of the kernel K to be identical to the third dimension of the input data I. Denoting the three kernel dimensions by $w_1, w_2, w_3$, the output after the convolution operation is:
$$S(i,j) = (I * K)(i,j) = \sum_{a=1}^{w_1}\sum_{b=1}^{w_2}\sum_{c=1}^{w_3} I(i+a-1,\, j+b-1,\, c)\, K(a, b, c)$$
Pooling operations in convolutional neural networks generally include max pooling and mean pooling, computed over a pooling window W as follows:
Mean pooling: $S(i,j) = \frac{1}{|W|}\sum_{(a,b)\in W} I(i+a,\, j+b)$
Max pooling: $S(i,j) = \max_{(a,b)\in W} I(i+a,\, j+b)$
Max pooling is used in the invention.
Specifically, the structure of each layer in this embodiment is shown in Fig. 2; each layer is described in detail below:
The first layer of the convolutional neural network is the input layer, whose input size is determined by the number of subchannels and the observation slot length. In the network model, the usable spectrum is divided into 128 subchannels and the observation slot length is 128, so the input state matrix dimension is 128 × 128.
The second layer of the convolutional neural network consists of a convolution, a ReLU activation function and a pooling operation. Specifically, the state matrix from the input layer first passes through a convolution with kernel size 3 × 3, 20 kernels and stride 1, with ReLU as the activation function. The output dimension after this operation is 126 × 126 × 20. The ReLU activation is:
$$y = \max\{0, x\} \qquad (17)$$
This output then undergoes a max pooling operation with pooling size 2 × 2. After the convolution and pooling operations of this layer, the output dimension is 63 × 63 × 20.
The third layer of the convolutional network applies a convolution to the output of the second layer's convolution and pooling operations, producing a 31 × 31 × 30 output. The kernel size is 3 × 3, the number of kernels is 30, the activation function is ReLU, and the convolution stride is 2.
The fourth layer of the convolutional network applies a convolution to the output of the third layer, with kernel size 4 × 4, 30 kernels and stride 2, and zero-padding of 1 on the $w_1, w_2$ dimensions. The output dimension after this convolution is 15 × 15 × 30. The output of the convolution then undergoes a max pooling operation with pooling size 3 × 3; the output dimension after pooling is 5 × 5 × 30.
The fifth layer of the convolutional network is a fully connected layer with 1024 neurons and ReLU activation. The 5 × 5 × 30 output of the fourth layer is reshaped into a vector of dimension 1 × 750, and a vector of dimension 1 × 1024 is output after this fully connected layer.
The sixth layer of the convolutional network is a fully connected layer with 128 neurons and ReLU activation. The output of the fifth layer is processed by this fully connected layer into the output vector whose dimension matches the anti-jamming strategy set; the output dimension is 1 × 128.
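Assembling the layer dimensions above gives the following PyTorch sketch of the actor network. It is an illustrative reconstruction that reproduces the stated tensor shapes (126×126×20 → 63×63×20 → 31×31×30 → 15×15×30 → 5×5×30 → 750 → 1024 → 128), not the patented code.

```python
import torch.nn as nn

class ActorCNN(nn.Module):
    """Actor network following the layer sizes in embodiment two."""

    def __init__(self, n_channels=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3, stride=1), nn.ReLU(),   # 126x126x20
            nn.MaxPool2d(2),                                        # 63x63x20
            nn.Conv2d(20, 30, kernel_size=3, stride=2), nn.ReLU(),  # 31x31x30
            nn.Conv2d(30, 30, kernel_size=4, stride=2, padding=1),  # 15x15x30
            nn.ReLU(),
            nn.MaxPool2d(3),                                        # 5x5x30
        )
        self.head = nn.Sequential(
            nn.Flatten(),                             # 5*5*30 = 750
            nn.Linear(750, 1024), nn.ReLU(),
            nn.Linear(1024, n_channels), nn.ReLU(),   # 1x128 power vector
        )

    def forward(self, state):                         # state: (B, 1, 128, 128)
        return self.head(self.features(state))
```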
Fig. 3 shows the multilayer neural network structure used to realize the estimation critic network and the target critic network. The first layer is the input layer, with dimension 128 × (128 + 1), containing the state matrix $S_t$ representing the channel power information and the action vector $A_t$ representing the strategy. The second layer is neural layer 1, with 1024 neurons, output dimension 1024 × 1 and ReLU activation. The third layer is neural layer 2, with 128 neurons, output dimension 128 × 1 and ReLU activation. The fourth layer is neural layer 3, with 32 neurons, output dimension 32 × 1 and ReLU activation. The fifth layer is neural layer 4, with a single neuron, which outputs the Q value used to evaluate the quality of the actor network's strategy selection.
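Correspondingly, a PyTorch sketch of the critic of Fig. 3, under the assumption that the 128 × (128+1) input is realized by flattening the state matrix and concatenating the action vector as an extra column:

```python
import torch
import torch.nn as nn

class CriticMLP(nn.Module):
    """Critic network following Fig. 3: state matrix plus action vector in, Q value out."""

    def __init__(self, window=128, n_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * (n_channels + 1), 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),                     # scalar Q value
        )

    def forward(self, state, action):
        # state: (B, 1, 128, 128) or (B, 128, 128); action: (B, 128)
        s = state.flatten(start_dim=1)            # (B, 128*128)
        x = torch.cat([s, action], dim=1)         # (B, 128*(128+1))
        return self.net(x)
```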
Further, Fig. 4 illustrates the anti-jamming performance of the continuous power selection based on deep deterministic policy gradient reinforcement learning in the invention. The figure compares the performance of the random power selection strategy, the DQN-based discrete power selection strategy, the continuous power selection strategy based on the deep deterministic policy gradient proposed by the invention, and the ideal optimal power selection strategy. As can be seen from the figure, the reward of the algorithm proposed in the invention is greatly improved compared with the random power selection strategy.

Claims (3)

1. A communication anti-jamming method based on deep deterministic policy gradient reinforcement learning, characterized by comprising the following steps:
S1, initialization definitions, comprising:
Interference environment: the interference environment is defined according to the number of jammers, the jamming patterns and the wireless channel model;
Interference environment state: the spectrum information measured at different time slots forms the spectrum-time slot matrix, whose size is determined by the observed spectrum range and the observation slot length;
Reward function: a feedback reward function is constructed according to the legitimate user's communication quality indicators;
Anti-jamming strategy: the combination of transmit powers on the different subchannels is defined as the anti-jamming strategy set;
Deep neural networks: four deep neural networks are constructed, namely a target actor, an estimation actor, a target critic and an estimation critic, wherein the target actor network and the estimation actor network have the same network structure, and the target critic network and the estimation critic network have the same network structure;
Experience storage pool: an experience storage pool of fixed size is preset for storing experience tuples composed of the current environment state, the current anti-jamming strategy, the reward and the next environment state;
S2, pass the interference environment state, i.e. the spectrum-time slot matrix, through the target actor convolutional neural network to obtain an anti-jamming strategy, apply the strategy to the interference environment, and observe, according to the reward function, the reward value of the current anti-jamming strategy in the interference environment and the state matrix after the next transition; the output of the target critic network is used to evaluate the quality of the actor network's strategy selection;
S3, store the experience tuple formed by the current anti-jamming strategy, the interference environment state, the reward value of the anti-jamming strategy and the next environment state into the experience pool;
S4, sample experience tuples from the experience pool to train the estimation actor network and the estimation critic network; when the number of training steps reaches a preset value, overwrite the target actor network parameters with the estimation actor network parameters and the target critic network parameters with the estimation critic network parameters, completing the parameter update of the target networks;
S5, judge whether the learning mechanism meets the preset stop condition; if so, stop learning and obtain the final anti-jamming strategy; otherwise return to S2 and continue learning.
2. The communication anti-jamming method based on deep deterministic policy gradient reinforcement learning according to claim 1, characterized in that the reward function in step S1 is:
$$R_t = \sum_{m=1}^{N} \log_2\!\left(1 + \frac{p_t^m\, |h_t^m|^2}{\sum_{j=1}^{J} p_j\, f(f_j, m)\, |h_{j,t}^m|^2 + \sigma_m^2}\right) - \lambda \sum_{m=1}^{N} p_t^m$$
where $m \in \{1, \ldots, N\}$ is the channel index and N is the number of channels; $p_j$ is the jamming power of interference source j on channel $f_j$, $j \in \{1, \ldots, J\}$ is the interference source index and J is the number of interference sources; t is the time index; $h_t^m$ denotes the channel between the legitimate communication users and $p_t^m$ is the subchannel transmit power; the function $f(f_j, m)$ outputs 1 when $f_j = m$ and 0 otherwise; and $\lambda \sum_{m} p_t^m$ is the transmit power overhead.
3. The communication anti-jamming method based on deep deterministic policy gradient reinforcement learning according to claim 2, characterized in that, in step S4, the method for updating the convolutional neural network parameters is:
the convolutional neural network parameters are trained by passing the current state and the next state in the drawn experience tuples through the convolutional neural networks to obtain the corresponding state-action values, constructing the corresponding loss function, and updating the network parameters by minimizing the loss function.
CN201811129485.9A 2018-09-27 2018-09-27 Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning Expired - Fee Related CN109302262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811129485.9A CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning

Publications (2)

Publication Number Publication Date
CN109302262A true CN109302262A (en) 2019-02-01
CN109302262B CN109302262B (en) 2020-07-10

Family

ID=65164716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811129485.9A Expired - Fee Related CN109302262B (en) 2018-09-27 2018-09-27 Communication anti-jamming method based on deep deterministic policy gradient reinforcement learning

Country Status (1)

Country Link
CN (1) CN109302262B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581738A (en) * 2015-01-30 2015-04-29 厦门大学 Cognitive radio hostile interference resisting method based on Q learning
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method
US20190004518A1 (en) * 2017-06-30 2019-01-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and system for training unmanned aerial vehicle control model based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Guoan Han: "Two-dimensional anti-jamming communication based on deep reinforcement learning", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
Xin Liu: "A heterogeneous information fusion deep reinforcement learning for intelligent frequency selection of HF communication", China Communications *
Xin Liu: "Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach", IEEE Communications Letters *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109861720A (en) * 2019-03-15 2019-06-07 中国科学院上海高等研究院 WSN anti-interference method, device, equipment and medium based on intensified learning
CN110113418A (en) * 2019-05-08 2019-08-09 电子科技大学 A kind of collaboration buffering updating method of Che Lian information centre network
CN110611619A (en) * 2019-09-12 2019-12-24 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110944354A (en) * 2019-11-12 2020-03-31 广州丰石科技有限公司 Base station interference monitoring method and system based on waveform analysis and deep learning
CN111181618A (en) * 2020-01-03 2020-05-19 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning
CN111526592A (en) * 2020-04-14 2020-08-11 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111835453B (en) * 2020-07-01 2022-09-20 中国人民解放军空军工程大学 Communication countermeasure process modeling method
CN111835453A (en) * 2020-07-01 2020-10-27 中国人民解放军空军工程大学 Communication countermeasure process modeling method
CN112087749A (en) * 2020-08-27 2020-12-15 华北电力大学(保定) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
CN112087749B (en) * 2020-08-27 2023-06-02 华北电力大学(保定) Cooperative active eavesdropping method for realizing multiple listeners based on reinforcement learning
CN112188004A (en) * 2020-09-28 2021-01-05 精灵科技有限公司 Obstacle call detection system based on machine learning and control method thereof
CN112202527A (en) * 2020-10-01 2021-01-08 西北工业大学 Intelligent electromagnetic signal identification system interference method based on momentum gradient disturbance
CN112202527B (en) * 2020-10-01 2022-09-13 西北工业大学 Intelligent electromagnetic signal identification system interference method based on momentum gradient disturbance
CN112492691A (en) * 2020-11-26 2021-03-12 辽宁工程技术大学 Downlink NOMA power distribution method of deep certainty strategy gradient
CN112492691B (en) * 2020-11-26 2024-03-26 辽宁工程技术大学 Downlink NOMA power distribution method of depth deterministic strategy gradient
CN114696925A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Channel quality assessment method and related device
CN114696925B (en) * 2020-12-31 2023-12-15 华为技术有限公司 Channel quality assessment method and related device
CN113038616A (en) * 2021-03-16 2021-06-25 电子科技大学 Frequency spectrum resource management and allocation method based on federal learning
CN113038616B (en) * 2021-03-16 2022-06-03 电子科技大学 Frequency spectrum resource management and allocation method based on federal learning
CN112906640A (en) * 2021-03-19 2021-06-04 电子科技大学 Space-time situation prediction method and device based on deep learning and readable storage medium
CN113098565A (en) * 2021-04-02 2021-07-09 甘肃工大舞台技术工程有限公司 Stage carrier communication self-adaptive frequency hopping anti-interference technology based on deep network
CN113098565B (en) * 2021-04-02 2022-06-07 甘肃工大舞台技术工程有限公司 Stage carrier communication self-adaptive frequency hopping anti-interference method based on deep network
CN113221454A (en) * 2021-05-06 2021-08-06 西北工业大学 Electromagnetic radiation source identification method based on deep reinforcement learning
CN113221454B (en) * 2021-05-06 2022-09-13 西北工业大学 Electromagnetic radiation source identification method based on deep reinforcement learning
CN113411099A (en) * 2021-05-28 2021-09-17 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113411099B (en) * 2021-05-28 2022-04-29 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113890564A (en) * 2021-08-24 2022-01-04 浙江大学 Special ad hoc network frequency hopping anti-interference method and device for unmanned aerial vehicle based on federal learning
CN113890564B (en) * 2021-08-24 2023-04-11 浙江大学 Special ad hoc network frequency hopping anti-interference method and device for unmanned aerial vehicle based on federal learning
CN114417939B (en) * 2022-01-27 2022-06-28 中国人民解放军32802部队 Interference strategy generation method based on knowledge graph
CN114417939A (en) * 2022-01-27 2022-04-29 中国人民解放军32802部队 Interference strategy generation method based on knowledge graph

Also Published As

Publication number Publication date
CN109302262B (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN109302262A (en) A kind of communication anti-interference method determining Gradient Reinforcement Learning based on depth
Jiang et al. Deep learning for fading channel prediction
CN108777872B (en) Intelligent anti-interference method and intelligent anti-interference system based on deep Q neural network anti-interference model
Liu et al. Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach
CN109274456A (en) A kind of imperfect information intelligence anti-interference method based on intensified learning
Jiang et al. Recurrent neural networks with long short-term memory for fading channel prediction
CN111970072B (en) Broadband anti-interference system and method based on deep reinforcement learning
CN111726217B (en) Deep reinforcement learning-based autonomous frequency selection method and system for broadband wireless communication
Jiang et al. Multi-antenna fading channel prediction empowered by artificial intelligence
CN109845310A (en) The method and unit of wireless resource management are carried out using intensified learning
Jiang et al. A deep learning method to predict fading channel in multi-antenna systems
CN108712748B (en) Cognitive radio anti-interference intelligent decision-making method based on reinforcement learning
Ak et al. Avoiding jammers: A reinforcement learning approach
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
Çavdar PSO tuned ANFIS equalizer based on fuzzy C-means clustering algorithm
CN108401254A (en) A kind of wireless network resource distribution method based on intensified learning
KR20210124897A (en) Method and system of channel esimiaion for precoded channel
CN113420495B (en) Active decoy type intelligent anti-interference method
Zhou et al. Deep deterministic policy gradient with prioritized sampling for power control
CN114051252A (en) Multi-user intelligent transmitting power control method in wireless access network
CN116866048A (en) Anti-interference zero-and Markov game model and maximum and minimum depth Q learning method
Zappone et al. Complexity-aware ANN-based energy efficiency maximization
Evmorfos et al. Deep actor-critic for continuous 3D motion control in mobile relay beamforming networks
Sriharipriya et al. Artifical neural network based multi dimensional spectrum sensing in full duplex cognitive radio networks
CN113747447A (en) Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200710

Termination date: 20210927