CN116866048A - Anti-interference zero-sum Markov game model and max-min deep Q-learning method - Google Patents

Anti-interference zero-sum Markov game model and max-min deep Q-learning method

Info

Publication number
CN116866048A
Authority
CN
China
Prior art keywords
interference
user
value
representing
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310888224.XA
Other languages
Chinese (zh)
Inventor
徐煜华
李文
陈瑾
冯智斌
韩昊
袁鸿程
徐逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA
Priority to CN202310888224.XA
Publication of CN116866048A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/16 Implementing security features at a particular protocol layer
    • H04L 63/168 Implementing security features at a particular protocol layer above the transport layer
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses an anti-interference zero-sum Markov game model and a max-min deep Q-learning method. The model is as follows: a communication countermeasure scenario is considered involving a pair of communication transceivers, a fixed jammer and an intelligent jammer; the fixed jammer releases swept-frequency interference, the intelligent jammer optimizes its interference channel selection, and the user optimizes channel and power selection to maximize transmission utility; the countermeasure interaction process is modeled as an anti-interference zero-sum Markov game. The method comprises the following steps: initializing the anti-interference network parameters and training hyper-parameters; constructing a spectrum waterfall diagram, inputting it into the anti-interference network, and outputting the anti-interference Q-value matrix; calculating and executing the anti-interference action, calculating the reward value obtained by the current action and storing the countermeasure record; sampling the countermeasure records, calculating the Q-value estimation error, updating the anti-interference network by gradient back-propagation, and repeating the interaction process until network training is completed. The application can effectively handle the non-stationary environment changes caused by interference strategy updates and obtain a more robust anti-interference strategy.

Description

Anti-interference zero-sum Markov game model and max-min deep Q-learning method
Technical Field
The application belongs to the technical field of wireless communication, and particularly relates to an anti-interference zero-sum Markov game model and a max-min deep Q-learning method.
Background
Due to the openness of wireless channels, legitimate user communication in a wireless communication network is vulnerable to jamming attacks by malicious users. Communication anti-interference technology therefore plays a crucial role in both civil and military communications. However, traditional anti-interference techniques such as frequency-hopping spread spectrum and direct-sequence spread spectrum follow fixed patterns and have difficulty coping with dynamic interference. In recent years, researchers have therefore proposed intelligent anti-interference techniques based on machine learning: empowered by artificial-intelligence algorithms, legitimate users can learn and mine the interference variation pattern and adopt efficient and reliable communication modes. Prior work has applied deep reinforcement learning to the anti-jamming field, obtaining an optimal access policy through interactive learning with the interference environment and without requiring prior information about the interference (X. Liu et al., "Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach," IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, 2018). Similarly, deep reinforcement learning has been applied to unmanned aerial vehicle (UAV) video transmission, effectively avoiding jamming attacks by optimizing the coding and modulation modes, power and channel, and improving the intelligence of video transmission (L. Xiao et al., "UAV anti-jamming video transmissions with QoE guarantee: A reinforcement learning-based approach," IEEE Transactions on Communications, vol. 69, no. 9, pp. 5933-5947, 2021). Further work considers intelligent anti-interference in a multi-user scenario, where multi-user decisions are made by computing the joint Q values of all users (Q. Zhou et al., "Intelligent Anti-Jamming Communication for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach," IEEE Open Journal of the Communications Society, vol. 2, pp. 775-784, 2021). However, most previous studies assume that the interference is not intelligent and that the jamming attack has an obvious regularity. With the continuous development of wireless and artificial-intelligence technology, jamming capability is also improving: jammers can now possess environment-sensing and policy-updating capabilities and intelligence comparable to the communicating party. Intelligent anti-interference methods designed against fixed-pattern interference will therefore fail, or even be completely suppressed, when faced with intelligent interference.
At present, some research has begun to consider countering intelligent interference. One line of work considers a UAV-assisted mobile network anti-jamming scenario and assumes that the jammer updates its strategy with Q-learning; a UAV relay power selection method based on deep reinforcement learning is designed that effectively reduces the transmission error rate and energy consumption (X. Lu et al., "UAV-aided cellular communications with deep reinforcement learning against jamming," IEEE Wireless Communications, vol. 27, no. 4, pp. 48-53, 2020). Other work considers a jamming UAV that uses deep reinforcement learning to implement intelligent jamming by observing the trajectory of the communication UAV, and designs a deep-reinforcement-learning-based countermeasure algorithm to evade the jamming UAV's attack (N. Gao et al., "Anti-intelligent UAV jamming strategy via deep Q-networks," IEEE Transactions on Communications, vol. 68, no. 1, pp. 569-581, 2019). However, such work assumes a relatively weak opponent and does not deeply analyze the characteristics of intelligent communication countermeasures: intelligent interference is treated directly as part of the environment, the non-stationarity of environment state changes caused by dynamic interference strategy updates is ignored, and the stationarity assumption underlying the convergence of single-agent reinforcement learning is violated. The characteristics of intelligent countermeasures therefore need to be considered further in the design of anti-interference algorithms.
In summary, existing intelligent anti-interference research results have difficulty effectively countering intelligent interference with policy-updating capability, mainly for the following reasons: 1) the interference patterns considered by most existing intelligent anti-interference methods are relatively simple and are not assumed to be intelligent; 2) existing research on intelligent interference does not deeply analyze the characteristics of intelligent countermeasures, but simply regards the intelligent interference as part of the environment, ignoring the non-stationary environment changes caused by interference policy updates, which makes it difficult to gain the upper hand in the communication countermeasure.
Disclosure of Invention
Aiming at the problems existing in the prior art, the application provides an anti-interference zero-sum Markov game model and a max-min deep Q-learning method, which avoid intelligent jamming attacks and effectively improve the user transmission rate.
The technical solution for realizing the purpose of the application is as follows: in one aspect, an anti-interference zero-sum Markov game model is provided, in which both the user and the intelligent interference have environment observation and policy-updating capabilities; both sides observe the environment state and make decisions that in turn change the environment state; the user aims to maximize its transmission utility, the intelligent interference has the completely opposite objective, and the sum of the utilities of the two sides is zero.
In another aspect, an anti-intelligent-interference method based on max-min deep Q-learning is provided, the method comprising the following steps:
Step 1, the anti-interference problem under an intelligent interference threat is modeled as an anti-interference zero-sum Markov game, where the game participants are the user and the intelligent interference; the optimization objective of the user is to obtain the optimal anti-interference strategy under the worst-case interference strategy, which corresponds to the Nash equilibrium strategy;
Step 2, the user builds an anti-interference decision network and randomly initializes the network parameters, and sets the hyper-parameters of network training, including the learning rate α, discount factor γ, exploration probability ε_0, target network update step size N_T and experience replay unit D;
Step 3, the user perceives the spectrum state of the real-time environment and constructs the history perceiving data into a spectrum waterfall diagram s t Inputting the frequency spectrum waterfall diagram into the anti-interference decision network, and calculating and outputting a current state Q value matrix Q(s) t (i theta), where theta represents decision network parameters, and then calculate the current anti-interference action using epsilon-greedy strategy and linear programmingAnd executing the joint action; wherein (1)>And p t The user communication frequency and the transmission power respectively;
Step 4, the return value r_t under the current action is calculated, the interference action o_t is obtained from the sensing data, the spectrum state s_{t+1} is constructed, and the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is added to the experience replay unit D;
Step 5, a training sample batch is randomly drawn from the experience replay unit D, where (s_i, a_i, o_i, r_i, s'_i) denote the current state, anti-interference action, interference action, return value and next state of the i-th training sample, respectively; the Q-value estimation error L(θ) is then calculated, and the anti-interference decision network is updated by gradient descent;
Step 6, steps 3 to 5 are repeated until the specified number of iterations is reached.
Further, the modeling in step 1 as an anti-interference zero-sum Markov game model is specifically as follows:
The anti-interference zero-sum Markov game is represented by the six-tuple (S, A, O, P, R, γ). Here S is the set of environment states, with each state defined as a spectrum waterfall diagram containing three-dimensional time-frequency-power information; A is the user action set, from which the user selects a joint channel-power decision for anti-interference communication; O is the interference action set, from which the interference selects a channel to jam the user transmission; P is the state transition function, giving the probability of transitioning to the next state under the joint influence of the user action and the interference action; R is the reward function, defined as r_t = C_t - ωp_t, where C_t is the transmission rate, ω the power cost factor and p_t the user transmit power; γ is the discount factor.
Considering the worst-case interference, the optimization goal of the user is to obtain the optimal anti-interference strategy π* that maximizes the future cumulative discounted return:

π* = argmax_π min_μ V^{π,μ}(s_t),  with  V^{π,μ}(s_t) = E[ Σ_{i=0}^{∞} γ^i r_{t+i} | s_t, π, μ ]

where V^{π,μ}(s_t) is the future cumulative discounted return obtained by the user in state s_t when the user and the interference adopt policies π and μ respectively, and the state evolution obeys the transition function P; E[·] denotes the expected value, γ is the discount factor, and r_{t+i} is the return value at time t+i. In a zero-sum stochastic game, this optimization objective corresponds to solving the Nash equilibrium strategy of the game.
Further, the anti-interference decision network in step 2 comprises two convolutional layers and three fully connected layers, where the convolutional layers extract useful features from the spectrum waterfall diagram and the fully connected layers integrate the feature values and compute the Q-value matrix. The first convolutional layer contains f_1 convolution kernels of size z_1 × z_1 with stride d_1; the second convolutional layer contains f_2 convolution kernels of size z_2 × z_2 with stride d_2; the first and second fully connected layers contain n_1 and n_2 neurons respectively; the number of neurons in the last fully connected layer of the anti-interference decision network is |A|·|O|, one output per (user action, interference action) pair of the Q-value matrix.
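By way of illustration, a minimal PyTorch sketch of such a decision network is given below. The concrete sizes (16 and 32 kernels of size 4×4 with stride 2, 512 and 256 fully connected neurons, a 200×200 waterfall, 100 user actions and 10 interference actions) are borrowed from the embodiment later in this description and are assumed example values only; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class AntiJamQNet(nn.Module):
    """Sketch of the anti-interference decision network: two convolutional layers
    extract features from the spectrum waterfall, three fully connected layers
    produce a Q-value matrix over (user action, interference action) pairs.
    Layer sizes follow the embodiment and are illustrative, not normative."""
    def __init__(self, n_user_actions=100, n_jam_actions=10, waterfall_size=200):
        super().__init__()
        self.n_user_actions = n_user_actions
        self.n_jam_actions = n_jam_actions
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=2), nn.ReLU(),   # f_1=16, z_1=4, d_1=2
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # f_2=32, z_2=4, d_2=2
        )
        with torch.no_grad():  # infer the flattened feature size from the input shape
            feat = self.conv(torch.zeros(1, 1, waterfall_size, waterfall_size))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat.numel(), 512), nn.ReLU(),          # n_1 = 512
            nn.Linear(512, 256), nn.ReLU(),                   # n_2 = 256
            nn.Linear(256, n_user_actions * n_jam_actions),   # |A|·|O| outputs
        )

    def forward(self, s):
        # s: (batch, 1, waterfall_size, waterfall_size) spectrum waterfall
        q = self.fc(self.conv(s))
        return q.view(-1, self.n_user_actions, self.n_jam_actions)  # Q(s, a, o)
```

The reshaped output can be read directly as the Q-value matrix Q(s_t, a, o|θ) that the linear program described below operates on.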
Further, in step 3 the historical sensing data are constructed into the spectrum waterfall diagram s_t as follows:
The instantaneous spectrum data sensed and sampled by the user at time t are o_t = [o_1, o_2, …, o_L], where L = (f_U - f_L)/Δf is the number of sampling points, f_U is the upper frequency limit, f_L the lower frequency limit, and Δf the frequency resolution. The i-th sample value is

o_i = ∫_{f_L+(i-1)Δf}^{f_L+iΔf} S(f) df

where S(f) is the power spectral density of the signal received by the user, expressed as

S(f) = h_1·S_u(f - f_u) + h_2·Σ_{m=1}^{M} S_j(f - f_{j,m}) + h_3·S_w(f - f_w) + n(f)

where h_1, h_2 and h_3 are the channel gains from the user transmitter, the intelligent jammer and the fixed jammer to the user receiver, respectively; S_u(·) is the power spectral density function of the user signal and f_u the user center frequency; S_j(·) is the power spectral density function of the intelligent interference signal, M the number of channels covered by the interference signal, and f_{j,m} the center frequency of the m-th channel covered by the intelligent interference; S_w(·) is the power spectral density function of the fixed interference signal and f_w the center frequency of the channel covered by the fixed interference; n(f) is the power spectral density function of the ambient noise.
The spectrum waterfall diagram is then constructed from the historical spectrum data and denoted s_t = [o_t, o_{t-1}, …, o_{t-Φ+1}]^T, where Φ is the history length.
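A minimal sketch of this construction is given below, assuming the sampled vector o_t is supplied by an external spectrum sensing receiver; the class name, the zero-padding used before Φ rows have been collected, and the default sizes L = 200 and Φ = 200 are illustrative assumptions.

```python
import numpy as np
from collections import deque

class SpectrumWaterfall:
    """Keeps the most recent Phi sensing vectors and stacks them into the
    waterfall matrix s_t = [o_t, o_{t-1}, ..., o_{t-Phi+1}]^T."""
    def __init__(self, n_samples=200, history_len=200):
        self.n_samples = n_samples                  # L = (f_U - f_L) / delta_f
        self.history = deque(maxlen=history_len)    # Phi most recent rows, newest first

    def push(self, o_t):
        # o_t: length-L vector of sampled power values at time t
        assert len(o_t) == self.n_samples
        self.history.appendleft(np.asarray(o_t, dtype=np.float32))

    def state(self):
        # Zero-pad until Phi rows are available, then return the Phi x L matrix
        rows = list(self.history)
        while len(rows) < self.history.maxlen:
            rows.append(np.zeros(self.n_samples, dtype=np.float32))
        return np.stack(rows, axis=0)
```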
Further, in step 3 the current anti-interference action a_t is calculated using an ε-greedy strategy and linear programming and the joint action is executed, specifically:
The user selects actions with an ε-greedy policy: with probability ε_t a random action is selected, and with probability 1 - ε_t an equilibrium strategy is computed from the Q-value matrix and the action a_t is sampled according to that equilibrium strategy. The exploration probability follows the update rule ε_t = ε_f + (ε_0 - ε_f)·e^{-t/v}, where ε_0 is the initial value, ε_f the final value and v the decay coefficient.
After the Q-value matrix Q(s_t, ·|θ) has been computed, the anti-interference equilibrium strategy π_t can be obtained by linear programming, namely

π_t = argmax_{π(·|s_t)} min_{o∈O} Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ)

where π(a|s_t) is the probability that the user selects action a in state s_t. Introducing V = min_{o∈O} Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ), the problem can be converted, for any interference action o ∈ O, into the linear program

maximize V
subject to  Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ) ≥ V  for all o ∈ O,
            Σ_{a∈A} π(a|s_t) = 1,  π(a|s_t) ≥ 0.
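The following sketch solves this linear program with SciPy's linprog routine; the function name is hypothetical, and the decision vector is simply [π(a_1|s_t), …, π(a_|A| |s_t), V] as in the formulation above.

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_strategy(q_matrix):
    """Solve the max-min linear program for one state.
    q_matrix: |A| x |O| array of Q(s_t, a, o | theta) values.
    Returns (pi, v): the equilibrium mixed strategy over user actions and
    the game value V = min_o sum_a pi(a) Q(a, o)."""
    n_a, n_o = q_matrix.shape
    # Decision variables x = [pi_1, ..., pi_|A|, V]; objective: maximize V
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -V
    # For every interference action o:  V - sum_a pi(a) Q(a, o) <= 0
    A_ub = np.hstack([-q_matrix.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to one
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]   # pi >= 0, V free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    pi, v = res.x[:n_a], res.x[-1]
    return pi, v
```

In step 3 the ε-greedy layer then samples a_t from the returned strategy with probability 1 - ε_t, or picks a uniformly random action otherwise.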
Further, in step 4 the return value r_t under the current action is calculated, the interference action o_t is obtained from the sensing data, the spectrum state s_{t+1} is constructed, and the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is added to the experience replay unit D, specifically:
The intelligent interference inputs its sensed spectrum waterfall diagram into its interference DQN and selects an interference action o_t in an ε-greedy manner;
After the user and the interference execute their respective actions, the user calculates the reward value at the current time, r_t = C_t - ωp_t;
The next state s_{t+1} is obtained by sensing the spectrum;
The interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is stored in the experience replay unit D.
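A minimal experience replay unit consistent with this step might look as follows; the capacity of 10000 is taken from the embodiment, and the class and field names are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1})
Transition = namedtuple("Transition", ["s", "a", "o", "r", "s_next"])

class ReplayUnit:
    """Fixed-size experience replay unit D with uniform random sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, o, r, s_next):
        self.buffer.append(Transition(s, a, o, r, s_next))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```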
Further, in step 5 the Q-value estimation error L(θ) is calculated and the anti-interference decision network is then updated by gradient descent, specifically:
B training samples {(s_i, a_i, o_i, r_i, s'_i)}_{i∈[B]} are randomly sampled for the network update;
For each training sample, the state s_i is input into the anti-interference decision network to obtain the estimated Q value Q(s_i, a_i, o_i|θ);
The next state s'_i is input into the target network to obtain the Q-value matrix Q(s'_i, ·|θ^-), where θ^- denotes the target network parameters; the target Q value y_i is then computed as

y_i = r_i + γ · min_{o∈O} Σ_{a∈A} π'_i(a|s'_i)·Q(s'_i, a, o|θ^-)

where π'_i is the user equilibrium strategy under the Q-value matrix Q(s'_i, ·|θ^-);
The estimation error L(θ) is computed as

L(θ) = (1/B) · Σ_{i=1}^{B} ( y_i - Q(s_i, a_i, o_i|θ) )²

The network parameters are updated by gradient descent: θ ← θ - α·∇_θ L(θ).
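Putting the pieces together, one possible sketch of a single network update is shown below. It reuses the maxmin_strategy solver and the Transition record from the earlier sketches; the tensor layout, the averaged squared-error form of L(θ) and all function names are illustrative assumptions rather than the exact patented implementation.

```python
import torch
import torch.nn.functional as F

def update_network(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One max-min DQN update step (sketch). `batch` is a list of Transition
    records; maxmin_strategy is the LP solver sketched earlier."""
    s = torch.stack([torch.as_tensor(t.s).unsqueeze(0) for t in batch]).float()
    s_next = torch.stack([torch.as_tensor(t.s_next).unsqueeze(0) for t in batch]).float()
    a = torch.tensor([t.a for t in batch])
    o = torch.tensor([t.o for t in batch])
    r = torch.tensor([t.r for t in batch], dtype=torch.float32)

    # Estimated Q value Q(s_i, a_i, o_i | theta) from the policy network
    q_all = policy_net(s)                                  # (B, |A|, |O|)
    q_est = q_all[torch.arange(len(batch)), a, o]

    # Target y_i = r_i + gamma * min_o sum_a pi'(a) Q(s'_i, a, o | theta^-)
    with torch.no_grad():
        q_next = target_net(s_next)                        # (B, |A|, |O|)
        y = []
        for i in range(len(batch)):
            pi_i, _ = maxmin_strategy(q_next[i].cpu().numpy())   # equilibrium under theta^-
            worst = (torch.as_tensor(pi_i, dtype=torch.float32) @ q_next[i]).min()
            y.append(r[i] + gamma * worst)
        y = torch.stack(y)

    loss = F.mse_loss(q_est, y)      # L(theta): mean squared Q-value estimation error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                 # theta <- theta - alpha * grad L(theta)
    return loss.item()
```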
Compared with the prior art, the application has the following remarkable advantages:
(1) Intelligent interference with policy-updating capability can be countered, the user gains the upper hand in the intelligent communication countermeasure, and the anti-interference performance of the user is effectively improved.
(2) The user can continuously optimize its own strategy purely by learning from interaction with the spectrum environment, without any prior information about the interference or the channel.
The application is described in further detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of the anti-intelligent-interference method based on max-min deep Q-learning of the present application.
Fig. 2 is a framework diagram of the anti-intelligent-interference method based on max-min deep Q-learning of the present application.
FIG. 3 is a schematic diagram showing the convergence performance of the proposed method and the comparison algorithms in an embodiment of the present application.
FIG. 4 is a schematic diagram of the performance test of the proposed method and the comparison algorithms in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, the application provides an anti-interference zero-sum Markov game model and a max-min deep Q-learning method, in which the transmission power and the communication channel under the threat of intelligent interference are jointly optimized.
A typical communication anti-interference scenario consists of an intelligent jammer, a fixed jammer and a communication transceiver pair, as shown in Fig. 1. The transceiver pair exchanges data over a transmission link, while the jammers reduce the received signal-to-noise ratio by launching wireless jamming attacks, disrupting the normal communication of the transceiver pair. The transceiver pair therefore needs to adjust its power and channel in real time to avoid the jamming attacks and ensure reliable data transmission.
The application provides an anti-interference zero-sum Markov game model that considers an intelligent communication countermeasure scenario comprising a pair of communication transceivers (the user), a fixed jammer and an intelligent jammer: the fixed jammer releases swept-frequency interference, the intelligent jammer uses deep reinforcement learning to optimize its interference channel selection, and the user optimizes its channel and power selection to maximize the transmission utility; the countermeasure interaction process is modeled as the anti-interference zero-sum Markov game.
The application aims to jointly optimize the transmission power and the communication channel, and uses an intelligent learning algorithm so that the user continuously interacts with the interference environment to find the optimal communication parameters.
The application provides an anti-intelligent-interference method based on max-min deep Q-learning, which comprises the following steps:
Step 1, the anti-interference problem under an intelligent interference threat is modeled as an anti-interference zero-sum Markov game, where the game participants are the user and the intelligent interference, and the intelligent interference is considered fully rational; the user aims to obtain the optimal anti-interference strategy under the worst-case interference strategy, which corresponds to the Nash equilibrium strategy;
Here, the modeling as an anti-interference zero-sum Markov game model is specifically as follows:
The anti-interference zero-sum Markov game is represented by the six-tuple (S, A, O, P, R, γ). Here S is the set of environment states, with each state defined as a spectrum waterfall diagram containing three-dimensional time-frequency-power information, which not only reflects the change of the spectrum state but also provides sufficient information for the anti-interference decision; A is the user action set, from which the user selects a joint channel-power decision for anti-interference communication; O is the interference action set, from which the interference selects a channel to jam the user transmission; P is the state transition function, giving the probability of transitioning to the next state under the joint influence of the user action and the interference action; R is the reward function, defined as r_t = C_t - ωp_t, where C_t is the transmission rate, ω the power cost factor and p_t the user transmit power; γ is the discount factor.
Considering the worst-case interference, the optimization goal of the user is to find the optimal anti-interference strategy π* that maximizes the future cumulative discounted return:

π* = argmax_π min_μ V^{π,μ}(s_t),  with  V^{π,μ}(s_t) = E[ Σ_{i=0}^{∞} γ^i r_{t+i} | s_t, π, μ ]

where V^{π,μ}(s_t) is the future cumulative discounted return obtained by the user in state s_t when the user and the interference adopt policies π and μ respectively, and the state evolution obeys the transition function P; E[·] denotes the expected value, γ is the discount factor, and r_{t+i} is the return value at time t+i. In a zero-sum stochastic game, this optimization objective corresponds to solving the Nash equilibrium strategy of the game.
Step 2, the user builds an anti-interference decision network and randomly initializes the network parameters, and sets the hyper-parameters of network training, including the learning rate, discount factor, exploration probability and so on;
Here, since the input of the anti-interference decision network is a two-dimensional spectrum waterfall diagram, a neural network model composed of convolutional layers and fully connected layers is used: the convolutional layers extract useful features from the two-dimensional spectrum waterfall diagram and reduce the computational complexity, and the fully connected layers integrate the feature values and compute the Q-value matrix. The designed anti-interference decision network comprises two convolutional layers and three fully connected layers. The first convolutional layer contains f_1 convolution kernels of size z_1 × z_1 with stride d_1; the second convolutional layer contains f_2 convolution kernels of size z_2 × z_2 with stride d_2; the first and second fully connected layers contain n_1 and n_2 neurons respectively; the number of neurons in the last fully connected layer of the anti-interference decision network is |A|·|O|, one output per (user action, interference action) pair of the Q-value matrix.
The anti-interference network parameters are randomly initialized, and the discount factor γ, learning rate α, exploration probability ε_0, target network update step size N_T and experience replay unit D are initialized.
Step 3, the user senses the real-time spectrum state of the environment and constructs the historical sensing data into a spectrum waterfall diagram s_t; the spectrum waterfall diagram is input into the anti-interference decision network, which computes and outputs the current-state Q-value matrix Q(s_t, ·|θ); the current anti-interference action is then calculated using an ε-greedy strategy and linear programming, and the joint action is executed.
Here, the historical sensing data are constructed into the spectrum waterfall diagram s_t as follows:
The instantaneous spectrum data sensed and sampled by the user at time t are o_t = [o_1, o_2, …, o_L], where L = (f_U - f_L)/Δf is the number of sampling points, f_U is the upper frequency limit, f_L the lower frequency limit, and Δf the frequency resolution. The i-th sample value is

o_i = ∫_{f_L+(i-1)Δf}^{f_L+iΔf} S(f) df

where S(f) is the power spectral density of the signal received by the user, expressed as

S(f) = h_1·S_u(f - f_u) + h_2·Σ_{m=1}^{M} S_j(f - f_{j,m}) + h_3·S_w(f - f_w) + n(f)

where h_1, h_2 and h_3 are the channel gains from the user transmitter, the intelligent jammer and the fixed jammer to the user receiver, respectively; S_u(·) is the power spectral density function of the user signal and f_u the user center frequency; S_j(·) is the power spectral density function of the intelligent interference signal, M the number of channels covered by the interference signal, and f_{j,m} the center frequency of the m-th channel covered by the intelligent interference; S_w(·) is the power spectral density function of the fixed interference signal and f_w the center frequency of the channel covered by the fixed interference; n(f) is the power spectral density function of the ambient noise.
The spectrum waterfall diagram is then constructed from the historical spectrum data and denoted s_t = [o_t, o_{t-1}, …, o_{t-Φ+1}]^T, where Φ is the history length.
Here, calculating the current anti-interference action using an ε-greedy strategy and linear programming and executing the joint action proceeds as follows:
The user selects actions in an ε-greedy manner: with probability ε_t a random action is selected, and with probability 1 - ε_t an equilibrium strategy is computed from the Q-value matrix and the action a_t is sampled from it. To balance exploration, the random exploration decays with the number of iterations according to the update rule ε_t = ε_f + (ε_0 - ε_f)·e^{-t/v}, where ε_0 is the initial value, ε_f the final value and v the decay coefficient.
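As a small illustration, the decay schedule can be written as the following helper; the numeric defaults for ε_0, ε_f and v are assumptions, since the embodiment does not state them.

```python
import math

def epsilon(t, eps0=1.0, eps_f=0.05, v=1000.0):
    """Exploration probability at iteration t:
    eps_t = eps_f + (eps0 - eps_f) * exp(-t / v).
    The default values of eps0, eps_f and v are illustrative assumptions."""
    return eps_f + (eps0 - eps_f) * math.exp(-t / v)
```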
After calculating the Q value matrix Q (s t (|θ), an antijam equalization strategy can be calculated by linear programmingNamely:
wherein pi (a|s) t ) Representing the user in state s t Probability of lower selection action a, assumingThen for any disturbing action o +.>The above can be converted into:
Step 4, the return value r_t under the current action is calculated, the interference action o_t is obtained from the sensing data, the spectrum state s_{t+1} is constructed, and the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is added to the experience replay unit D, specifically:
The intelligent interference inputs its sensed spectrum waterfall diagram into its interference DQN and selects an interference action o_t in an ε-greedy manner. After the user and the interference execute their respective actions, the user calculates the reward value at the current time, r_t = C_t - ωp_t, then obtains the next state s_{t+1} by sensing the spectrum, and finally stores the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) in the experience replay unit D.
Step 5, a training sample batch is randomly drawn from the experience replay unit D, the Q-value estimation error L(θ) is calculated, and the anti-interference decision network is then updated by gradient descent, specifically:
B training samples {(s_i, a_i, o_i, r_i, s'_i)}_{i∈[B]} are randomly sampled for the network update. For each training sample, the state s_i is input into the policy network to obtain the estimated Q value Q(s_i, a_i, o_i|θ). The next state s'_i is input into the target network to obtain the Q-value matrix Q(s'_i, ·|θ^-), where θ^- denotes the target network parameters; the target Q value y_i is then computed as

y_i = r_i + γ · min_{o∈O} Σ_{a∈A} π'_i(a|s'_i)·Q(s'_i, a, o|θ^-)

where π'_i is the user equilibrium strategy under the Q-value matrix Q(s'_i, ·|θ^-). Finally, the estimation error

L(θ) = (1/B) · Σ_{i=1}^{B} ( y_i - Q(s_i, a_i, o_i|θ) )²

is calculated and the network parameters are updated by gradient descent: θ ← θ - α·∇_θ L(θ).
Step 6, steps 3 to 5 are repeated until the specified number of iterations is reached.
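For orientation, a sketch of the overall interaction loop covering steps 3 to 6 is given below. The env object standing in for the radio environment and the DQN jammer, its sense()/step() interface, and the reuse of the epsilon, maxmin_strategy, update_network and ReplayUnit sketches from above are all assumptions made for this example.

```python
import random
import torch

def train(env, policy_net, target_net, optimizer, replay,
          n_iterations=20000, batch_size=32, target_update=500, gamma=0.99):
    """High-level training loop tying steps 3-6 together (sketch only)."""
    s = env.sense()                                   # initial spectrum waterfall
    for t in range(n_iterations):
        # Step 3: equilibrium strategy from the Q matrix, epsilon-greedy action
        q = policy_net(torch.as_tensor(s).float().unsqueeze(0).unsqueeze(0))[0]
        pi, _ = maxmin_strategy(q.detach().cpu().numpy())
        if random.random() < epsilon(t):
            a = random.randrange(len(pi))
        else:
            a = int(torch.multinomial(torch.as_tensor(pi, dtype=torch.float32), 1))
        # Step 4: execute, observe reward, jamming action and next state
        r, o, s_next = env.step(a)
        replay.add(s, a, o, r, s_next)
        # Step 5: sample a batch and update the decision network
        if len(replay) >= batch_size:
            update_network(policy_net, target_net, optimizer,
                           replay.sample(batch_size), gamma)
        if t % target_update == 0:
            target_net.load_state_dict(policy_net.state_dict())
        s = s_next
```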
As a specific example, in one embodiment, the present application is described in further detail.
In this embodiment, the system simulation uses the PyTorch deep learning framework and network training runs on an RTX 2080Ti; the parameter settings do not affect generality. The simulated communication band ranges from 830 MHz to 850 MHz and is uniformly divided into 10 non-overlapping channels with a bandwidth of 2 MHz each. The user communication slot length is set to 10 ms. The frequency sampling interval of the sensor is 0.1 MHz with 200 sampling points per sensing, and the trace length of the spectrum waterfall diagram is set to 200, so the spectrum waterfall diagram is a 200×200 matrix. The interference and user baseband signals are generated in simulation with a raised-cosine roll-off filter with roll-off coefficient 0.5. The maximum user transmit power is 200 mW (23 dBm), divided into 10 power levels, and the receiver signal-to-noise ratio threshold is 5 dB. The fixed jammer releases swept-frequency interference with a bandwidth of 4 MHz, cyclically sweeping the band from its lower to its upper edge at a sweep rate of 500 MHz/s with a transmit power of 50 dBm. The intelligent jammer has a transmit power of 50 dBm and an interference signal bandwidth of 6 MHz, covering 3 channels. The background noise power is -90 dBm. The learning rate of the deep reinforcement learning is 1e-4, the batch size is 32 and the experience pool size is 10000; the first convolutional layer contains 16 convolution kernels of size 4×4 with stride 2, the second convolutional layer contains 32 convolution kernels of size 4×4 with stride 2, and the three fully connected layers contain 512, 256 and 1000 neurons respectively. The following three comparison algorithms are mainly considered:
1) DQN-based method: the interference is regarded directly as part of the environment, no in-depth modeling or analysis of the intelligent communication countermeasure process is performed, and the user learns its policy with a single-agent DQN;
2) Perception-access-based method: spectrum access follows a fixed rule; based on the current spectrum sensing result, the user accesses the channel with the lowest current signal energy and keeps its power unchanged;
3) Random selection-based method: in each time slot the user randomly selects an action from the action set.
FIG. 2 shows the framework diagram of the proposed method. At each time step the user senses the spectrum environment, constructs a spectrum waterfall diagram and inputs it into the decision network, which outputs the joint channel and power decision; a batch of training samples is then randomly selected from the memory replay unit, the network loss value is calculated, and the network is updated.
Fig. 3 compares the performance of the different algorithms in online countermeasure against DQN interference. During the online communication countermeasure, the user and the interference constantly update their own policies. The average reward value on the ordinate is obtained by averaging the reward over every 100 time slots; each curve is the result of 5 repeated runs, and the shaded area is the 95% confidence interval, representing the fluctuation range of algorithm performance. The figure shows that the average reward of the perception-access-based method gradually decreases as the number of iterations increases: being a fixed-rule anti-interference method, its pattern is easily learned by the DQN interference, so it is suppressed by the intelligent interference. Because the random selection method has no pattern, the DQN interference can hardly learn effective countermeasures against it, and its anti-interference performance remains stable; however, owing to the fixed-pattern interference, random selection may land on a jammed channel, so its reward remains low and its performance cannot improve. Both the single-agent DQN-based method and the proposed method improve their anti-interference strategies by interacting with the environment, so their average rewards increase with the number of iterations. The proposed method, however, achieves roughly a 15% performance improvement over the DQN method.
Fig. 4 tests the stability of the anti-interference models. Exploitability is an important metric for evaluating the stability of a trained model: it characterizes how easily the model can be learned and exploited by an adversary. The higher the exploitability, the more vulnerable the anti-interference model is and the more easily the intelligent interference learns its pattern; conversely, the lower the exploitability, the more robust the model and the harder it is for the intelligent interference to learn an effective countermeasure. The application compares the performance of the different anti-interference models against DQN interference with an online-updated strategy, while the anti-interference models keep their network parameters fixed. As shown in the figure, except for the random selection method, which remains unchanged, the performance of every method decreases with the number of iterations, because the DQN interference keeps improving its own jamming strategy through interaction. The single-agent DQN method eventually drops to the same level as the perception-based algorithm, indicating that its anti-interference model has high exploitability and is completely suppressed by the intelligent interference. In contrast, the performance of the method proposed in the application degrades markedly more slowly than the other methods and still maintains a higher average reward, indicating that the anti-interference model obtained by the proposed method has lower exploitability.
Through this comparison, the intelligent countermeasure algorithm provided by the application outperforms the traditional DQN anti-interference algorithm, obtaining an anti-interference performance improvement of about 15%, and its performance is more stable during the dynamic countermeasure process.
In summary, the max-min deep Q-learning method jointly optimizes the user transmission power and channel and converges to the equilibrium solution of the countermeasure game by considering the optimal anti-interference strategy under the worst-case interference; in the decision process the user can obtain a robust anti-interference strategy only by interacting with the interference environment, without prior information about the interference or the channel.
The foregoing has outlined and described the basic principles, features and advantages of the present application. It will be understood by those skilled in the art that the foregoing embodiments do not limit the application; they merely illustrate its principles, and various modifications, equivalent substitutions and improvements may be made without departing from the spirit and scope of the application.

Claims (8)

1. An anti-interference zero-sum Markov game model, characterized in that the user and the intelligent interference in the model both have environment observation and policy-updating capabilities; both sides observe the environment state and make decisions that in turn change the environment state; the user aims to maximize its transmission utility, the intelligent interference has the completely opposite objective, and the sum of the utilities of the two sides is zero.
2. An anti-intelligent-interference method based on max-min deep Q-learning using the anti-interference zero-sum Markov game model according to claim 1, characterized in that it comprises the following steps:
step 1, modeling the anti-interference problem under an intelligent interference threat as an anti-interference zero-sum Markov game, where the game participants are the user and the intelligent interference; the optimization objective of the user is to obtain the optimal anti-interference strategy under the worst-case interference strategy, which corresponds to the Nash equilibrium strategy;
step 2, the user builds an anti-interference decision network and randomly initializes the network parameters, and sets the hyper-parameters of network training, including the learning rate α, discount factor γ, exploration probability ε_0, target network update step size N_T and experience replay unit D;
step 3, the user senses the real-time spectrum state of the environment and constructs the historical sensing data into a spectrum waterfall diagram s_t; the spectrum waterfall diagram is input into the anti-interference decision network, which computes and outputs the current-state Q-value matrix Q(s_t, ·|θ), where θ denotes the decision network parameters; the current anti-interference action a_t = (f_t^u, p_t) is then calculated using an ε-greedy strategy and linear programming and the joint action is executed, where f_t^u and p_t are the user communication frequency and transmission power, respectively;
step 4, the return value r_t under the current action is calculated, the interference action o_t is obtained from the sensing data, the spectrum state s_{t+1} is constructed, and the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is added to the experience replay unit D;
step 5, a training sample batch is randomly drawn from the experience replay unit D, where (s_i, a_i, o_i, r_i, s'_i) denote the current state, anti-interference action, interference action, return value and next state of the i-th training sample, respectively; the Q-value estimation error L(θ) is then calculated, and the anti-interference decision network is updated by gradient descent;
and step 6, repeating steps 3 to 5 until the specified number of iterations is reached.
3. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that the modeling in step 1 as an anti-interference zero-sum Markov game model is specifically:
the anti-interference zero-sum Markov game is represented by the six-tuple (S, A, O, P, R, γ), where S is the set of environment states, with each state defined as a spectrum waterfall diagram containing three-dimensional time-frequency-power information; A is the user action set, from which the user selects a joint channel-power decision for anti-interference communication; O is the interference action set, from which the interference selects a channel to jam the user transmission; P is the state transition function, giving the probability of transitioning to the next state under the joint influence of the user action and the interference action; R is the reward function, defined as r_t = C_t - ωp_t, where C_t is the transmission rate, ω the power cost factor and p_t the user transmit power; γ is the discount factor;
considering the worst-case interference, the optimization goal of the user is to obtain the optimal anti-interference strategy π* that maximizes the future cumulative discounted return:

π* = argmax_π min_μ V^{π,μ}(s_t),  with  V^{π,μ}(s_t) = E[ Σ_{i=0}^{∞} γ^i r_{t+i} | s_t, π, μ ]

where V^{π,μ}(s_t) is the future cumulative discounted return obtained by the user in state s_t when the user and the interference adopt policies π and μ respectively, and the state evolution obeys the transition function P; E[·] denotes the expected value, γ is the discount factor, and r_{t+i} is the return value at time t+i; in a zero-sum stochastic game, this optimization objective corresponds to solving the Nash equilibrium strategy of the game.
4. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that the anti-interference decision network in step 2 comprises two convolutional layers and three fully connected layers, where the convolutional layers extract useful features from the spectrum waterfall diagram and reduce the computational complexity, and the fully connected layers integrate the feature values and compute the Q-value matrix; the first convolutional layer contains f_1 convolution kernels of size z_1 × z_1 with stride d_1; the second convolutional layer contains f_2 convolution kernels of size z_2 × z_2 with stride d_2; the first and second fully connected layers contain n_1 and n_2 neurons respectively; the number of neurons in the last fully connected layer of the anti-interference decision network is |A|·|O|.
5. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that in step 3 the historical sensing data are constructed into the spectrum waterfall diagram s_t as follows:
the instantaneous spectrum data sensed and sampled by the user at time t are o_t = [o_1, o_2, …, o_L], where L = (f_U - f_L)/Δf is the number of sampling points, f_U is the upper frequency limit, f_L the lower frequency limit, and Δf the frequency resolution; the i-th sample value is

o_i = ∫_{f_L+(i-1)Δf}^{f_L+iΔf} S(f) df

where S(f) is the power spectral density of the signal received by the user, expressed as

S(f) = h_1·S_u(f - f_u) + h_2·Σ_{m=1}^{M} S_j(f - f_{j,m}) + h_3·S_w(f - f_w) + n(f)

where h_1, h_2 and h_3 are the channel gains from the user transmitter, the intelligent jammer and the fixed jammer to the user receiver, respectively; S_u(·) is the power spectral density function of the user signal and f_u the user center frequency; S_j(·) is the power spectral density function of the intelligent interference signal, M the number of channels covered by the interference signal, and f_{j,m} the center frequency of the m-th channel covered by the intelligent interference; S_w(·) is the power spectral density function of the fixed interference signal and f_w the center frequency of the channel covered by the fixed interference; n(f) is the power spectral density function of the ambient noise;
the spectrum waterfall diagram is then constructed from the historical spectrum data and denoted s_t = [o_t, o_{t-1}, …, o_{t-Φ+1}]^T, where Φ is the history length.
6. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that in step 3 the current anti-interference action is calculated using an ε-greedy strategy and linear programming and the joint action is executed, specifically:
the user selects actions with an ε-greedy policy: with probability ε_t a random action is selected, and with probability 1 - ε_t an equilibrium strategy is computed from the Q-value matrix and the action a_t is sampled according to that equilibrium strategy; the exploration probability follows the update rule ε_t = ε_f + (ε_0 - ε_f)·e^{-t/v}, where ε_0 is the initial value, ε_f the final value and v the decay coefficient;
after the Q-value matrix Q(s_t, ·|θ) has been computed, the anti-interference equilibrium strategy π_t can be obtained by linear programming, namely

π_t = argmax_{π(·|s_t)} min_{o∈O} Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ)

where π(a|s_t) is the probability that the user selects action a in state s_t; introducing V = min_{o∈O} Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ), the problem can be converted, for any interference action o ∈ O, into the linear program

maximize V
subject to  Σ_{a∈A} π(a|s_t)·Q(s_t, a, o|θ) ≥ V  for all o ∈ O,
            Σ_{a∈A} π(a|s_t) = 1,  π(a|s_t) ≥ 0.
7. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that in step 4 the return value r_t under the current action is calculated, the interference action o_t is obtained from the sensing data, the spectrum state s_{t+1} is constructed, and the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is added to the experience replay unit D, specifically:
the intelligent interference inputs its sensed spectrum waterfall diagram into its interference DQN and selects an interference action o_t in an ε-greedy manner;
after the user and the interference execute their respective actions, the user calculates the reward value at the current time, r_t = C_t - ωp_t;
the next state s_{t+1} is obtained by sensing the spectrum;
the interaction record e_t = (s_t, a_t, o_t, r_t, s_{t+1}) is stored in the experience replay unit D.
8. The anti-intelligent-interference method based on max-min deep Q-learning according to claim 2, characterized in that in step 5 the Q-value estimation error L(θ) is calculated and the anti-interference decision network is then updated by gradient descent, specifically:
B training samples {(s_i, a_i, o_i, r_i, s'_i)}_{i∈[B]} are randomly sampled for the network update;
for each training sample, the state s_i is input into the anti-interference decision network to obtain the estimated Q value Q(s_i, a_i, o_i|θ);
the next state s'_i is input into the target network to obtain the Q-value matrix Q(s'_i, ·|θ^-), where θ^- denotes the target network parameters; the target Q value y_i is then computed as

y_i = r_i + γ · min_{o∈O} Σ_{a∈A} π'_i(a|s'_i)·Q(s'_i, a, o|θ^-)

where π'_i is the user equilibrium strategy under the Q-value matrix Q(s'_i, ·|θ^-);
the estimation error L(θ) is computed as

L(θ) = (1/B) · Σ_{i=1}^{B} ( y_i - Q(s_i, a_i, o_i|θ) )²

and the network parameters are updated by gradient descent: θ ← θ - α·∇_θ L(θ).
CN202310888224.XA 2023-07-19 2023-07-19 Anti-interference zero-sum Markov game model and max-min deep Q-learning method Pending CN116866048A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310888224.XA CN116866048A (en) 2023-07-19 2023-07-19 Anti-interference zero-sum Markov game model and max-min deep Q-learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310888224.XA CN116866048A (en) 2023-07-19 2023-07-19 Anti-interference zero-sum Markov game model and max-min deep Q-learning method

Publications (1)

Publication Number Publication Date
CN116866048A true CN116866048A (en) 2023-10-10

Family

ID=88231969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310888224.XA Pending CN116866048A (en) 2023-07-19 2023-07-19 Anti-interference zero-and Markov game model and maximum and minimum depth Q learning method

Country Status (1)

Country Link
CN (1) CN116866048A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN117675054B (en) * 2024-02-02 2024-04-23 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system

Similar Documents

Publication Publication Date Title
CN108777872B (en) Intelligent anti-interference method and intelligent anti-interference system based on deep Q neural network anti-interference model
CN111970072B (en) Broadband anti-interference system and method based on deep reinforcement learning
CN108712748B (en) Cognitive radio anti-interference intelligent decision-making method based on reinforcement learning
CN110996343B (en) Intelligent recognition system and recognition method of interference recognition model based on deep convolutional neural network
CN109274456B (en) Incomplete information intelligent anti-interference method based on reinforcement learning
CN113406579B (en) Camouflage interference waveform generation method based on deep reinforcement learning
CN116866048A (en) Anti-interference zero-sum Markov game model and max-min deep Q-learning method
CN111917509A (en) Multi-domain intelligent communication model and communication method based on channel-bandwidth joint decision
CN115567148A (en) Intelligent interference method based on cooperative Q learning
CN113423110A (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN115454141A (en) Unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on part observable information
CN113420495B (en) Active decoy type intelligent anti-interference method
CN115276858B (en) Dynamic spectrum multi-domain anti-interference method and system based on cognitive anti-interference model
CN113973362A (en) Reinforced learning non-zero and non-cooperative multi-agent secure communication power control method
CN114978388B (en) Unmanned aerial vehicle time-frequency domain combined cognition anti-interference intelligent decision-making method
CN114509732B (en) Deep reinforcement learning anti-interference method of frequency agile radar
CN114501457B (en) Invisible interference attack protection method and system for sensing edge cloud unloading link
Li et al. Know Thy Enemy: An Opponent Modeling-Based Anti-Intelligent Jamming Strategy Beyond Equilibrium Solutions
CN115103446A (en) Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning
Meng et al. Intelligent attack defense scheme based on DQL algorithm in mobile fog computing
KR102234049B1 (en) Receiver, system and method for adaptive modulation based on reinforcement learning
CN116866895A (en) Intelligent countering method based on neural virtual self-game
CN117750525B (en) Frequency domain anti-interference method and system based on reinforcement learning
CN116755046B (en) Multifunctional radar interference decision-making method based on imperfect expert strategy
Zhang et al. An Anti-jamming Intelligent Decision-Making Method for Multi-user Communication Based on Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination