CN117750525A - Frequency domain anti-interference method and system based on reinforcement learning - Google Patents
Frequency domain anti-interference method and system based on reinforcement learning
- Publication number
- CN117750525A (application CN202410182440.7A)
- Authority
- CN
- China
- Prior art keywords
- interference
- channel
- receiver
- communication
- transmitter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 49
- 230000002787 reinforcement Effects 0.000 title claims abstract description 41
- 238000004891 communication Methods 0.000 claims abstract description 112
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 32
- 230000005540 biological transmission Effects 0.000 claims abstract description 26
- 238000012549 training Methods 0.000 claims abstract description 19
- 238000004364 calculation method Methods 0.000 claims abstract description 6
- 230000009471 action Effects 0.000 claims description 46
- 238000001228 spectrum Methods 0.000 claims description 14
- 230000006870 function Effects 0.000 claims description 12
- 230000007704 transition Effects 0.000 claims description 12
- 238000005562 fading Methods 0.000 claims description 9
- 230000003595 spectral effect Effects 0.000 claims description 9
- 230000000875 corresponding effect Effects 0.000 claims description 7
- 230000008569 process Effects 0.000 claims description 6
- 230000001629 suppression Effects 0.000 claims 2
- 238000010586 diagram Methods 0.000 description 14
- 238000013461 design Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 4
- 238000004088 simulation Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000036039 immunity Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Mobile Radio Communication Systems (AREA)
- Noise Elimination (AREA)
Abstract
The invention discloses a frequency-domain anti-interference method and system based on reinforcement learning. In the method, a transmitter and a receiver transmit data through a communication link and transmit control information through a control link; while the communication users transmit data, a plurality of patterned jammers generate interference signals that disturb them. An agent is embedded in the receiver, one communication period of the transmitter and receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time-slot channels is calculated. Whether the avoidance rate reaches a preset threshold is then judged; if not, the agent trains with the WDQL algorithm, updates the channel strategy, sends the updated channel strategy and a NACK to the transmitter through the control link, and data transmission of the next communication period begins. The invention not only ensures low iteration time and calculation complexity but also achieves a fast training and decision speed and excellent anti-interference performance.
Description
Technical Field
The invention relates to the technical field of wireless communication, in particular to a frequency domain anti-interference method and system based on reinforcement learning.
Background
The openness of the wireless communication channel makes it vulnerable to interference attacks, which in turn results in loss of communication performance, reducing the reliability of the wireless communication system. Thus, the anti-interference technology becomes a crucial research direction in the communication field.
Traditional anti-interference technologies, such as Frequency Hopping Spread Spectrum (FHSS) and Direct Sequence Spread Spectrum (DSSS), although able to provide a communication system with a certain anti-interference capability, cannot flexibly optimize the anti-interference strategy according to the real-time spectrum environment and interference pattern because of their fixed modes. Thus, a more intelligent method of selecting communication frequencies is needed to counter malicious interference effectively.
With the development of machine learning technology, scholars have in recent years proposed anti-interference channel selection methods based on Q learning (reference: S. Liu, Y. Xu, X. Chen, M. Wang, W. Li, Y. Li and Y. Xu, "Pattern-Aware Intelligent Anti-Jamming Communication: A Sequential Deep Reinforcement Learning Approach," IEEE Access, vol. 7, pp. 169204-169216, 2019). However, adjusting only the frequency-domain parameters of the system does not take full advantage of the multi-domain flexibility of a wireless communication system. Some scholars have therefore focused on the joint anti-interference problem in the frequency and power domains and proposed a multi-parameter Q-learning anti-interference algorithm (reference: Z. Pu, Y. Niu and G. Zhang, "A Multi-Parameter Intelligent Communication Anti-Jamming Method Based on Three-Dimensional Q-Learning," 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI), Beijing, China, 2022, pp. 205-210). In addition, other scholars have combined Q learning with deep learning, fitting the Q-value table with deep reinforcement learning algorithms to achieve dynamic spectrum anti-jamming (reference: X. Liu, Y. Xu, L. Jia, et al., "Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach," IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, May 2018).
However, while anti-jamming algorithms employing deep reinforcement learning successfully solve the "dimensional explosion" problem of huge state decision space, in many cases they have long convergence times and are difficult to train effectively. The reinforcement learning anti-jamming algorithm employing Q learning, while capable of converging in a shorter time than deep reinforcement learning, does not adequately take into account overestimation problems that may result when a single estimator is employed to update the Q value. This problem may make the resulting interference rejection strategy less than optimal.
Thus, how to achieve rapid convergence and good interference immunity in the face of an unknown communication interference environment is a challenge that needs to be addressed by practitioners of the art.
Disclosure of Invention
The invention aims to provide a frequency-domain anti-interference method and system based on reinforcement learning that are particularly suited to unknown patterned interference. The method avoids interference quickly to obtain good anti-interference performance while reducing the frequency of channel switching as much as possible to lower the communication cost. It thereby addresses the difficulty of training and the long convergence time of prior anti-interference research based on deep reinforcement learning, as well as the sub-optimal strategies encountered in anti-interference research based on Q-learning reinforcement learning algorithms.
The invention discloses a frequency domain anti-interference method based on reinforcement learning, which comprises the following steps:
step 1: the method comprises the steps that a transmitter and a receiver which are mutually communicated are used as communication users, the transmitter and the receiver transmit data through a communication link, and control information is transmitted through a control link, wherein the control information comprises channel strategies and NACK; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value, if not, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to a transmitter through a control link, and starting data transmission of the next communication period.
Further, the step 2 includes:
Within one communication cycle T_c, data transmission is carried out during the first period T_1 (or T_2) of the cycle; at the same time, during this first T_1 (or T_2), the agent located in the receiver senses the spectrum environment in real time and generates N signal-to-interference-plus-noise-ratio subframes. After the data transmission is completed, the avoidance rate of all time-slot channels in the current cycle is calculated. The communication cycle T_c comprises the data-transmission period T_1 (or T_2) together with the subsequent learning and feedback periods; T_1 and T_2 both denote time periods.
Further, in each time slot, the communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-plus-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
Further, the channel gain between the transmitter and the receiver and the channel gain between the receiver and the j-th jammer are modeled in terms of the following quantities: d is the Euclidean distance between the transmitter and the receiver, d_j is the Euclidean distance between the receiver and the j-th jammer, α and α_j are the path fading factors, and δ is the instantaneous fading coefficient, a complex Gaussian variable with mean 0 and variance σ².
The signal-to-interference-and-noise ratio of the receiver is then computed from these channel gains, where f_t denotes the center frequency of the channel selected by the communication user in time slot t, B is the baseband signal bandwidth, P_s is the communication signal power, J(f) and n(f) denote the interference power spectral density function and the noise power spectral density function respectively, f is the frequency variable, J is the number of jammers, and f_{j,t} is the center frequency of the channel selected by the j-th jammer in time slot t.
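As a point of reference, a plausible reconstruction of the omitted expressions, following the standard block-fading and spectrum-sensing models used in the anti-jamming literature cited above (the exact formulas of the original filing are not reproduced in this text, so the forms below are assumptions), is:

\[
h_t=\sqrt{d^{-\alpha}}\,\delta,\qquad h_{j,t}=\sqrt{d_j^{-\alpha_j}}\,\delta,
\]
\[
\mathrm{SINR}_t=\frac{P_s\,\lvert h_t\rvert^{2}}{\displaystyle\int_{f_t-B/2}^{\,f_t+B/2}\Big(\sum_{j=1}^{J}\lvert h_{j,t}\rvert^{2}\,J_j\!\left(f-f_{j,t}\right)+n(f)\Big)\,df},
\]

where δ ~ CN(0, σ²) and J_j(·) is the power spectral density of the j-th jammer's signal.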
Further, the step 3 includes:
step 31: setting training subframe numberFront +.A WDQL algorithm is used for the current communication period>Training sub-frames to obtain +.>For Q value table (+)>);/>And->All are Q value tables;
step 32: for the pair ofThe Q value table is averaged to extract the action with the maximum Q value in each time slot to form a length of +.>As an optimal channel strategy;
step 33: and sending the optimal channel strategy and NACK to a transmitter together, guiding the transmitter to carry out channel selection in the next communication period, wherein N is a positive integer greater than 1.
Further, the step 32 includes:
First, in the state s_t of time slot t, an action is selected at random with probability ε, or the action with the maximum Q value is selected with probability 1-ε, i.e. a_t = argmax_a Q(s_t, a); a_t denotes the action taken in time slot t.
Then the reward r_t of action a_t is calculated, and either the Q-value table Q_A or the Q-value table Q_B is randomly selected to be updated.
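A minimal sketch of this ε-greedy selection over the two Q-value tables is shown below; using the sum Q_A + Q_B for the greedy choice is an assumption in line with common double-Q-learning practice rather than a statement of the original filing.

```python
import numpy as np

def select_action(q_a, q_b, state, epsilon, num_channels, rng=np.random.default_rng()):
    """Epsilon-greedy: a random channel with probability epsilon,
    otherwise the channel whose combined Q value is largest."""
    if rng.random() < epsilon:
        return int(rng.integers(num_channels))   # explore
    combined = q_a[state] + q_b[state]           # one Q value per channel in this state
    return int(np.argmax(combined))              # exploit
```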
Further, the calculation process of the reward r_t comprises:
The agent in the receiver avoids interference through a Markov decision process. The Markov decision process comprises states, actions, state transition probabilities, and rewards. The state of time slot t is denoted s_t, and all states constitute the state space S. The action in time slot t is denoted a_t, a_t ∈ {f_1, f_2, ..., f_M}, where M is the number of available channels and f_m denotes the m-th available channel; all actions constitute the action space A. The state transition probability satisfies Pr(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t), namely the probability that, in time slot t, when the agent selects action a_t in the current environment state s_t, the environment transitions to the state s_{t+1} of the next time slot t+1, i.e. the channel avoidance rate. A_t denotes the actions of all available channels in the action space A at time slot t; S_t and S_{t+1} denote the state spaces of time slots t and t+1, s_{t+1} denotes the state of time slot t+1, A_{t+1} denotes the action space of time slot t+1, and Pr and p both denote probabilities.
If the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values. The reward represents the gain obtained when, in state s_t, the agent selects action a_t and the environment transitions to state s_{t+1}; it is determined by the demodulation thresholds β_1, β_2 and β_3 corresponding to three preset modulation modes, together with the switching cost C incurred when a channel switch occurs.
In the reward expression, 1(·) is an indicator function: when f_{t+1} = f_t, i.e. when no channel switch occurs between consecutive time slots, no cost is incurred; otherwise a cost of size C is incurred. f_{t+1} denotes the center frequency of the channel selected by the communication user for the t+1 time slot.
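A hedged reconstruction of the omitted reward expression, consistent with the description above (three demodulation thresholds plus a switching cost) but with the per-threshold gains R_1, R_2, R_3 introduced here as assumptions, could take the form:

\[
r_t=\sum_{k=1}^{3} R_k\,\mathbb{1}\!\left(\mathrm{SINR}_t\ge\beta_k\right)-C\,\mathbb{1}\!\left(f_{t+1}\ne f_t\right),
\]

where 1(·) is the indicator function, β_1, β_2, β_3 are the demodulation thresholds of the three preset modulation modes, and C is the switching cost.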
Further, the goal of the Markov decision is to maximize the total gain within one subframe, where π* denotes the optimal channel policy of the communication user, argmax_π denotes selecting the policy that maximizes the communication user's total gain, the sum of r_τ over τ = 1, ..., T denotes the rewards obtained over all time slots of a single subframe, a denotes an action, π(τ) denotes the action corresponding to policy π in time slot τ, and T is the number of time slots within a single subframe.
Further, randomly selecting to update Q_A or Q_B comprises the following: with equal probability either Q_A is updated or Q_B is updated. In the update, a* and a_L respectively denote the action with the maximum Q value and the action with the minimum Q value in the next state s_{t+1}, obtained from the Q-value table; the weights w_A and w_B are used to balance the overestimation of a single estimator against the underestimation of a double estimator. When updating the Q value, the two Q-value tables Q_A and Q_B are used, λ is the learning rate, γ is the discount factor, c is the weight parameter, and max(·) denotes the maximum function.
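For illustration, a minimal Python sketch of one weighted double Q-learning update step is given below; it follows the published WDQL form and is offered as an assumed reconstruction of the update rule, since the original equations are not reproduced in this text.

```python
import numpy as np

def wdql_update(q_a, q_b, state, action, reward, next_state,
                lr=0.1, gamma=0.9, c=1.0, rng=np.random.default_rng()):
    """One weighted double Q-learning step: randomly pick which table to update,
    then weight the two estimators to balance over- and under-estimation."""
    if rng.random() < 0.5:
        q_upd, q_other = q_a, q_b     # update Q_A, using Q_B in the target
    else:
        q_upd, q_other = q_b, q_a     # update Q_B, using Q_A in the target

    a_star = int(np.argmax(q_upd[next_state]))   # action with maximum Q in the next state
    a_low = int(np.argmin(q_upd[next_state]))    # action with minimum Q in the next state

    # Weight w balances single-estimator overestimation against double-estimator underestimation.
    gap = abs(q_other[next_state, a_star] - q_other[next_state, a_low])
    w = gap / (c + gap)

    target = reward + gamma * (w * q_upd[next_state, a_star]
                               + (1.0 - w) * q_other[next_state, a_star])
    q_upd[state, action] += lr * (target - q_upd[state, action])
    return q_a, q_b
```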
The invention also discloses a frequency domain anti-interference system based on reinforcement learning, which is used for realizing the frequency domain anti-interference method based on reinforcement learning, and comprises the following steps:
the communication module is used for taking a transmitter and a receiver which are mutually communicated as communication users, transmitting data through a communication link and transmitting control information through a control link; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. Through a continuously optimized weighted double Q-learning algorithm, the system can update the Q-value table, formulate the optimized channel strategy, and communicate it to the transmitter through a stable control link. The model of the invention has a complete design and a reasonable algorithm; it not only ensures low iteration time and calculation complexity, but also achieves a fast training and decision speed. In particular, when facing fixed-pattern interference, the system converges quickly and has excellent anti-interference performance, providing a strong guarantee for the reliability of the wireless communication system.
2. Based on reinforcement learning technology, an intelligent communication anti-interference method is designed. According to the method, the frequency spectrum environment is periodically sensed, the signal-to-interference-and-noise ratio subframes are generated, and a channel strategy is formulated through training of the subframes so as to achieve the purpose of avoiding interference.
3. The invention does not need to estimate the interference pattern and parameters of the jammer in advance, i.e. it requires no model of the jammer, so it can be widely applied to a variety of patterned-interference scenarios.
4. By taking a Q-learning algorithm as the reinforcement learning algorithm, the invention avoids the long training and convergence times of deep reinforcement learning algorithms as well as the overestimation of actions. By effectively solving the proposed model, good anti-interference performance can be obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments described in the embodiments of the present invention, and other drawings may be obtained according to these drawings for those skilled in the art.
FIG. 1 is a diagram of a frequency domain anti-interference system of the present invention;
FIG. 2 is a structural design diagram of a single communication cycle T_c of the frequency domain anti-interference system of the present invention;
FIG. 3 is a flow chart of the frequency domain anti-interference method of the present invention;
FIG. 4 (a) is a signal-to-interference-and-noise ratio thermodynamic diagram of comb interference rejection in accordance with an embodiment of the present invention;
FIG. 4 (b) is a diagram of yet another SNR thermodynamic diagram for comb interference rejection in accordance with an embodiment of the present invention;
FIG. 4 (c) is a graph of yet another SNR thermodynamic diagram of comb interference rejection in accordance with an embodiment of the present invention;
FIG. 5 (a) is a signal-to-interference-and-noise ratio thermodynamic diagram for combating swept-frequency interference in an embodiment of the invention;
FIG. 5 (b) is a diagram of a further signal-to-interference-and-noise ratio thermodynamic diagram for combating swept-frequency interference in an embodiment of the invention;
FIG. 5 (c) is a diagram of a further signal-to-interference-and-noise ratio thermodynamic diagram for combating swept-frequency interference in an embodiment of the invention;
FIG. 6 is a graph of prize variation against comb interference in an embodiment of the invention;
FIG. 7 is a graph showing a reward variation of the anti-sweep interference according to an embodiment of the present invention.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings and examples; the described examples are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments fall within the scope of protection of the invention.
Fig. 1 is a model diagram of a frequency domain interference rejection system. In this model, a pair of communication transceivers form a communication user, and a transmitter and a receiver transmit data over a communication link while transmitting control information over a control link. The agent is embedded in the receiver, obtains channel information using spectrum sensing, and optimizes channel strategies using reinforcement learning algorithms. Meanwhile, a plurality of patterned jammers generate high-power interference signals for interfering communication users.
FIG. 2 shows the internal structure of a single communication cycle T_c of the frequency domain anti-interference system. In this structure, the communication user performs the following operations. During the first T_1 (or T_2) of the cycle, data transmission is carried out. Meanwhile, during this period, the agent in the receiver senses the spectrum environment in real time and generates N signal-to-interference-plus-noise-ratio subframes; after the data transmission is completed, the communication user calculates the avoidance rate of all time-slot channels in the current cycle. If the avoidance rate is greater than the set threshold, the channel strategy does not need to be optimized through reinforcement learning, and an ACK (Acknowledgement) is sent to the transmitter during the final feedback period. Conversely, during the following learning period the Q-value tables are updated by the reinforcement learning algorithm to give the optimal channel strategy, and a NACK (Negative Acknowledgement) together with the latest channel strategy is sent to the transmitter during the final feedback period. Finally, the communication user uses the latest channel strategy for data transmission after the next communication cycle starts.
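The per-cycle control flow described above can be summarized by the sketch below; the function names (sense_spectrum, compute_avoidance_rate, train_wdql_strategy, send_control) are illustrative placeholders rather than elements of the original filing.

```python
def run_communication_cycle(threshold, current_strategy, sense_spectrum,
                            compute_avoidance_rate, train_wdql_strategy, send_control):
    """One communication cycle: transmit and sense, evaluate the avoidance rate,
    then either ACK (keep the strategy) or retrain with WDQL and NACK (update it)."""
    sinr_subframes = sense_spectrum()        # N SINR subframes sensed during data transmission
    avoidance_rate = compute_avoidance_rate(sinr_subframes, current_strategy)

    if avoidance_rate > threshold:
        send_control(ack=True)               # channel strategy stays unchanged
        return current_strategy
    new_strategy = train_wdql_strategy(sinr_subframes)   # learning period: update the Q tables
    send_control(ack=False, strategy=new_strategy)       # NACK + latest channel strategy
    return new_strategy                                   # used from the next cycle onward
```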
The invention provides an embodiment of a frequency domain anti-interference method based on reinforcement learning, which comprises the following steps:
step 1: the method comprises the steps that a transmitter and a receiver which are mutually communicated are used as communication users, the transmitter and the receiver transmit data through a communication link, and control information is transmitted through a control link, wherein the control information comprises channel strategies and NACK; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value; if not, training and updating the channel strategy by using a WDQL (Weighted Double Q-Learning) algorithm, then sending the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
The model of the invention has complete design and reasonable algorithm, not only ensures lower iteration time and calculation complexity, but also realizes rapid training decision speed. Particularly, when facing fixed mode interference, the system can quickly converge and has excellent anti-interference performance, and a powerful guarantee is provided for the reliability of the wireless communication system.
In this embodiment, step 2 includes:
Within one communication cycle T_c, data transmission is carried out during the first period T_1 (or T_2) of the cycle; at the same time, during this first T_1 (or T_2), the agent located in the receiver senses the spectrum environment in real time and generates N signal-to-interference-plus-noise-ratio subframes. After the data transmission is completed, the avoidance rate of all time-slot channels in the current cycle is calculated. The communication cycle T_c comprises the data-transmission period T_1 (or T_2) together with the subsequent learning and feedback periods; T_1 and T_2 both denote time periods.
In this embodiment, in each time slot, a communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-and-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
In this embodiment, the channel gain between the transmitter and the receiver and the channel gain between the receiver and the j-th jammer are modeled in terms of the following quantities: d is the Euclidean distance between the transmitter and the receiver, d_j is the Euclidean distance between the receiver and the j-th jammer, α and α_j are the path fading factors, and δ is the instantaneous fading coefficient, a complex Gaussian variable with mean 0 and variance σ².
The signal-to-interference-and-noise ratio of the receiver is then computed from these channel gains, where f_t denotes the center frequency of the channel selected by the communication user in time slot t, B is the baseband signal bandwidth, P_s is the communication signal power, J(f) and n(f) denote the interference power spectral density function and the noise power spectral density function respectively, f is the frequency variable, J is the number of jammers, and f_{j,t} is the center frequency of the channel selected by the j-th jammer in time slot t.
In this embodiment, step 3 includes:
step 31: setting training subframe numberFront +.A WDQL algorithm is used for the current communication period>Training sub-frames to obtain +.>For Q value table (+)>);/>And->All are Q value tables;
step 32: for the pair ofThe Q value table is averaged to extract the action with the maximum Q value in each time slot to form a length of +.>As an optimal channel strategy;
step 33: and sending the optimal channel strategy and NACK to a transmitter together, guiding the transmitter to carry out channel selection in the next communication period, wherein N is a positive integer greater than 1.
In this embodiment, step 32 includes:
First, in the state s_t of time slot t, an action is selected at random with probability ε, or the action with the maximum Q value is selected with probability 1-ε, i.e. a_t = argmax_a Q(s_t, a); a_t denotes the action taken in time slot t.
Then the reward r_t of action a_t is calculated, and either the Q-value table Q_A or the Q-value table Q_B is randomly selected to be updated.
In the present embodiment, the calculation process of the reward r_t comprises:
The agent in the receiver avoids interference through a Markov decision process. The Markov decision process comprises states, actions, state transition probabilities, and rewards. The state of time slot t is denoted s_t, and all states constitute the state space S. The action in time slot t is denoted a_t, a_t ∈ {f_1, f_2, ..., f_M}, where M is the number of available channels and f_m denotes the m-th available channel; all actions constitute the action space A. The state transition probability satisfies Pr(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t), namely the probability that, in time slot t, when the agent selects action a_t in the current environment state s_t, the environment transitions to the state s_{t+1} of the next time slot t+1, i.e. the channel avoidance rate. A_t denotes the actions of all available channels in the action space A at time slot t; S_t and S_{t+1} denote the state spaces of time slots t and t+1, s_{t+1} denotes the state of time slot t+1, A_{t+1} denotes the action space of time slot t+1, and Pr and p both denote probabilities.
If the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values. The reward represents the gain obtained when, in state s_t, the agent selects action a_t and the environment transitions to state s_{t+1}; it is determined by the demodulation thresholds β_1, β_2 and β_3 corresponding to three preset modulation modes, together with the switching cost C incurred when a channel switch occurs.
In the reward expression, 1(·) is an indicator function: when f_{t+1} = f_t, i.e. when no channel switch occurs between consecutive time slots, no cost is incurred; otherwise a cost of size C is incurred. f_{t+1} denotes the center frequency of the channel selected by the communication user for the t+1 time slot.
In this embodiment, the goal of the Markov decision is to maximize the total gain within one subframe, where π* denotes the optimal channel policy of the communication user, argmax_π denotes selecting the policy that maximizes the communication user's total gain, the sum of r_τ over τ = 1, ..., T denotes the rewards obtained over all time slots of a single subframe, a denotes an action, π(τ) denotes the action corresponding to policy π in time slot τ, and T is the number of time slots within a single subframe.
In this embodiment, randomly selecting to update Q_A or Q_B comprises the following: with equal probability either Q_A is updated or Q_B is updated. In the update, a* and a_L respectively denote the action with the maximum Q value and the action with the minimum Q value in the next state s_{t+1}, obtained from the Q-value table; the weights w_A and w_B are used to balance the overestimation of a single estimator against the underestimation of a double estimator. When updating the Q value, the two Q-value tables Q_A and Q_B are used, λ is the learning rate, γ is the discount factor, c is the weight parameter, and max(·) denotes the maximum function.
Step 3 further comprises:
if the avoidance rate is greater than the preset threshold value, the receiver sends ACK to the sender through the control link, the channel strategy of the sender is unchanged, and data transmission in the next communication period is started.
The invention also provides an embodiment of a frequency domain anti-interference system based on reinforcement learning, which is used for realizing the frequency domain anti-interference method based on reinforcement learning described in the above embodiment, and comprises the following steps:
the communication module is used for taking a transmitter and a receiver which are mutually communicated as communication users, transmitting data through a communication link and transmitting control information through a control link; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
The invention is further illustrated by the following examples:
Under the Windows 10 64-bit operating system, simulations were completed in PyCharm using the Python language on a 12th Gen Intel(R) Core(TM) i3-12100 CPU @ 3.30 GHz. To analyze the effectiveness of the system, it is compared with a random channel selection algorithm. The relevant reinforcement learning parameter settings are shown in Table 1.
Table 1 simulation parameter settings
In an embodiment, one communication period is divided into 20 signal-to-interference-and-noise ratio subframes, two fixed interference modes are considered: comb interference and swept interference. Fig. 4 (a), fig. 4 (b) and fig. 4 (c) respectively show signal-to-interference-and-noise-ratio thermodynamic diagrams of a communication user obtained by performing system simulation by using a WDQL algorithm in a comb interference environment. Each block represents a channel and the black blocks represent the optimal channel strategy given by the reinforcement learning algorithm for the current communication cycle. The shade of the grey square represents the magnitude of the signal-to-interference-plus-noise value, and the darker the color, the smaller the value, which indicates that the corresponding channel is disturbed to a greater extent and is unsuitable for communication. Fig. 4 (a), fig. 4 (b) and fig. 4 (c) correspond to the first, second and third snr subframes of the current communication period, respectively, and the perceived snr information is different from time slot to time slot, so that the thermodynamic diagrams of the different subframes have different color shades, but the interference patterns are consistent. It can be observed that after the reinforcement learning algorithm is trained, the channel strategy given by the intelligent agent basically avoids the interference of the jammer, and the purpose of avoiding the interference is achieved. Similarly, fig. 5 (a), fig. 5 (b) and fig. 5 (c) show thermodynamic diagrams of signal-to-interference-and-noise ratios of communication users obtained by performing system simulation using the WDQL algorithm in a swept interference environment. Sweep interference is more complex than comb interference, resulting in more frequent channel switching frequencies.
Fig. 6 and Fig. 7 show the reward-variation curves of the reinforcement-learning-based channel selection algorithm and the random channel selection algorithm in comb-interference and swept-interference environments. It can be observed from the curves that, as the number of training rounds increases, the per-round reward of the reinforcement-learning-based algorithm increases continuously, so that interference is effectively avoided, and the final reward tends to a stable value. Conversely, the reward value of the random channel selection algorithm does not increase, and interference is naturally not effectively avoided.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (10)
1. The frequency domain anti-interference method based on reinforcement learning is characterized by comprising the following steps of:
step 1: the method comprises the steps that a transmitter and a receiver which are mutually communicated are used as communication users, the transmitter and the receiver transmit data through a communication link, and control information is transmitted through a control link, wherein the control information comprises channel strategies and NACK; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value, if not, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to a transmitter through a control link, and starting data transmission of the next communication period.
2. The reinforcement learning-based frequency domain interference rejection method according to claim 1, wherein the step 2 comprises:
Within one communication cycle T_c, data transmission is carried out during the first period T_1 (or T_2) of the cycle; at the same time, during this first T_1 (or T_2), the agent located in the receiver senses the spectrum environment in real time and generates N signal-to-interference-plus-noise-ratio subframes. After the data transmission is completed, the avoidance rate of all time-slot channels in the current cycle is calculated. The communication cycle T_c comprises the data-transmission period T_1 (or T_2) together with the subsequent learning and feedback periods; T_1 and T_2 both denote time periods.
3. The reinforcement learning-based frequency domain anti-interference method according to claim 1 or 2, wherein in each time slot, a communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-plus-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
4. The reinforcement learning-based frequency domain interference rejection method according to claim 3, wherein the channel gain between the transmitter and the receiver and the channel gain between the receiver and the j-th jammer are modeled in terms of the following quantities: d is the Euclidean distance between the transmitter and the receiver, d_j is the Euclidean distance between the receiver and the j-th jammer, α and α_j are the path fading factors, and δ is the instantaneous fading coefficient, a complex Gaussian variable with mean 0 and variance σ²;
the signal-to-interference-and-noise ratio of the receiver is computed from these channel gains, wherein f_t denotes the center frequency of the channel selected by the communication user in time slot t, B is the baseband signal bandwidth, P_s is the communication signal power, J(f) and n(f) denote the interference power spectral density function and the noise power spectral density function respectively, f is the frequency variable, J is the number of jammers, and f_{j,t} is the center frequency of the channel selected by the j-th jammer in time slot t.
5. The reinforcement learning-based frequency domain interference rejection method according to claim 1, wherein the step 3 comprises:
step 31: setting training subframe numberFront +.A WDQL algorithm is used for the current communication period>Training sub-frames to obtain +.>For Q value table (+)>);/>And->All are Q value tables;
step 32: for the pair ofThe Q value table is averaged to extract the action with the maximum Q value in each time slot to form a length of +.>As an optimal channel strategy;
step 33: and sending the optimal channel strategy and NACK to a transmitter together, guiding the transmitter to carry out channel selection in the next communication period, wherein N is a positive integer greater than 1.
6. The reinforcement learning-based frequency domain interference rejection method according to claim 5, wherein said step 32 comprises:
First, in the state s_t of time slot t, an action is selected at random with probability ε, or the action with the maximum Q value is selected with probability 1-ε, i.e. a_t = argmax_a Q(s_t, a); a_t denotes the action taken in time slot t.
Then the reward r_t of action a_t is calculated, and either the Q-value table Q_A or the Q-value table Q_B is randomly selected to be updated.
7. The reinforcement learning based frequency domain interference avoidance method according to claim 6, wherein the calculation process of the reward r_t comprises:
The agent in the receiver avoids interference through a Markov decision process. The Markov decision process comprises states, actions, state transition probabilities, and rewards. The state of time slot t is denoted s_t, and all states constitute the state space S. The action in time slot t is denoted a_t, a_t ∈ {f_1, f_2, ..., f_M}, where M is the number of available channels and f_m denotes the m-th available channel; all actions constitute the action space A. The state transition probability satisfies Pr(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t), namely the probability that, in time slot t, when the agent selects action a_t in the current environment state s_t, the environment transitions to the state s_{t+1} of the next time slot t+1, i.e. the channel avoidance rate. A_t denotes the actions of all available channels in the action space A at time slot t; S_t and S_{t+1} denote the state spaces of time slots t and t+1, s_{t+1} denotes the state of time slot t+1, A_{t+1} denotes the action space of time slot t+1, and Pr and p both denote probabilities.
If the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values. The reward represents the gain obtained when, in state s_t, the agent selects action a_t and the environment transitions to state s_{t+1}; it is determined by the demodulation thresholds β_1, β_2 and β_3 corresponding to three preset modulation modes, together with the switching cost C incurred when a channel switch occurs.
In the reward expression, 1(·) is an indicator function: when f_{t+1} = f_t, i.e. when no channel switch occurs between consecutive time slots, no cost is incurred; otherwise a cost of size C is incurred. f_{t+1} denotes the center frequency of the channel selected by the communication user for the t+1 time slot.
8. The reinforcement learning based frequency domain interference rejection method according to claim 7, wherein the goal of the Markov decision is to maximize the total gain within one subframe, where π* denotes the optimal channel policy of the communication user, argmax_π denotes selecting the policy that maximizes the communication user's total gain, the sum of r_τ over τ = 1, ..., T denotes the rewards obtained over all time slots of a single subframe, a denotes an action, π(τ) denotes the action corresponding to policy π in time slot τ, and T is the number of time slots within a single subframe.
9. The reinforcement learning based frequency domain interference rejection method according to claim 8, wherein randomly selecting to update Q_A or Q_B comprises the following: with equal probability either Q_A is updated or Q_B is updated; a* and a_L respectively denote the action with the maximum Q value and the action with the minimum Q value in the next state s_{t+1}, obtained from the Q-value table; the weights w_A and w_B are used to balance the overestimation of a single estimator against the underestimation of a double estimator; when updating the Q value, the two Q-value tables Q_A and Q_B are used, λ is the learning rate, γ is the discount factor, c is the weight parameter, and max(·) denotes the maximum function.
10. A reinforcement learning-based frequency domain interference suppression system for implementing the reinforcement learning-based frequency domain interference suppression method of any one of claims 1-9, comprising:
the communication module is used for taking a transmitter and a receiver which are mutually communicated as communication users, transmitting data through a communication link and transmitting control information through a control link; when the communication user performs data transmission, the plurality of patterned jammers generate interference signals to interfere the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182440.7A CN117750525B (en) | 2024-02-19 | 2024-02-19 | Frequency domain anti-interference method and system based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182440.7A CN117750525B (en) | 2024-02-19 | 2024-02-19 | Frequency domain anti-interference method and system based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117750525A true CN117750525A (en) | 2024-03-22 |
CN117750525B CN117750525B (en) | 2024-05-31 |
Family
ID=90259480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410182440.7A Active CN117750525B (en) | 2024-02-19 | 2024-02-19 | Frequency domain anti-interference method and system based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117750525B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | Dynamic spectrum anti-interference model in fading environment and reinforcement learning anti-interference algorithm |
US20210123741A1 (en) * | 2019-10-29 | 2021-04-29 | Loon Llc | Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning |
US20210241090A1 (en) * | 2020-01-31 | 2021-08-05 | At&T Intellectual Property I, L.P. | Radio access network control with deep reinforcement learning |
CN114280558A (en) * | 2021-12-23 | 2022-04-05 | 北京邮电大学 | Interference signal waveform optimization method based on reinforcement learning |
US20220209885A1 (en) * | 2020-12-24 | 2022-06-30 | Viettel Group | Method and apparatus for adaptive anti-jamming communications based on deep double-q reinforcement learning |
CN114845403A (en) * | 2022-04-28 | 2022-08-02 | 淮安欧特科技有限公司 | Competitive double-depth Q network intelligent channel decision method |
CN114978388A (en) * | 2022-05-18 | 2022-08-30 | 大连大学 | Unmanned aerial vehicle time-frequency domain combined cognitive anti-interference intelligent decision method |
CN115103446A (en) * | 2022-05-25 | 2022-09-23 | 南京邮电大学 | Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning |
CN115236607A (en) * | 2022-06-30 | 2022-10-25 | 北京邮电大学 | Radar anti-interference strategy optimization method based on double-layer Q learning |
CN116744311A (en) * | 2023-05-24 | 2023-09-12 | 中国人民解放军国防科技大学 | User group spectrum access method based on PER-DDQN |
CN116866895A (en) * | 2023-07-19 | 2023-10-10 | 中国人民解放军陆军工程大学 | Intelligent countering method based on neural virtual self-game |
CN116886236A (en) * | 2023-08-29 | 2023-10-13 | 电子科技大学 | Resource optimization-oriented collaborative interference strategy generation method |
-
2024
- 2024-02-19 CN CN202410182440.7A patent/CN117750525B/en active Active
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | Dynamic spectrum anti-interference model in fading environment and reinforcement learning anti-interference algorithm |
US20210123741A1 (en) * | 2019-10-29 | 2021-04-29 | Loon Llc | Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning |
US20210241090A1 (en) * | 2020-01-31 | 2021-08-05 | At&T Intellectual Property I, L.P. | Radio access network control with deep reinforcement learning |
US20220209885A1 (en) * | 2020-12-24 | 2022-06-30 | Viettel Group | Method and apparatus for adaptive anti-jamming communications based on deep double-q reinforcement learning |
CN114280558A (en) * | 2021-12-23 | 2022-04-05 | 北京邮电大学 | Interference signal waveform optimization method based on reinforcement learning |
CN114845403A (en) * | 2022-04-28 | 2022-08-02 | 淮安欧特科技有限公司 | Competitive double-depth Q network intelligent channel decision method |
CN114978388A (en) * | 2022-05-18 | 2022-08-30 | 大连大学 | Unmanned aerial vehicle time-frequency domain combined cognitive anti-interference intelligent decision method |
CN115103446A (en) * | 2022-05-25 | 2022-09-23 | 南京邮电大学 | Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning |
CN115236607A (en) * | 2022-06-30 | 2022-10-25 | 北京邮电大学 | Radar anti-interference strategy optimization method based on double-layer Q learning |
CN116744311A (en) * | 2023-05-24 | 2023-09-12 | 中国人民解放军国防科技大学 | User group spectrum access method based on PER-DDQN |
CN116866895A (en) * | 2023-07-19 | 2023-10-10 | 中国人民解放军陆军工程大学 | Intelligent countering method based on neural virtual self-game |
CN116886236A (en) * | 2023-08-29 | 2023-10-13 | 电子科技大学 | Resource optimization-oriented collaborative interference strategy generation method |
Non-Patent Citations (3)
Title |
---|
FANGMIN XU; FAN YANG; CHENGLIN ZHAO; SHENG WU: "Deep Reinforcement Learning Based Joint Edge Resource Management in Maritime Network", China Communications, no. 05, 15 May 2020 (2020-05-15) * |
WANLING LI等: "D2D Communication Power Control Based on Deep Q Learning and Fractional Frequency Reuse", 2023 15TH INTERNATIONAL CONFERENCE ON COMMUNICATION SOFTWARE AND NETWORKS (ICCSN), 6 November 2023 (2023-11-06) * |
TAN Junjie; LIANG Yingchang: "Deep Reinforcement Learning Methods for Intelligent Communications", Journal of University of Electronic Science and Technology of China, no. 02, 30 March 2020 (2020-03-30) * |
Also Published As
Publication number | Publication date |
---|---|
CN117750525B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109586820A (en) | Dynamic spectrum anti-interference model in fading environment and reinforcement learning anti-interference algorithm | |
CN111726217B (en) | Deep reinforcement learning-based autonomous frequency selection method and system for broadband wireless communication | |
Xing et al. | Stochastic learning solution for distributed discrete power control game in wireless data networks | |
CN108712748B (en) | Cognitive radio anti-interference intelligent decision-making method based on reinforcement learning | |
CN104994569B (en) | Multi-user reinforcement learning-based method for resisting hostile interference of cognitive wireless network | |
CN109274456B (en) | Incomplete information intelligent anti-interference method based on reinforcement learning | |
Liu et al. | A heterogeneous information fusion deep reinforcement learning for intelligent frequency selection of HF communication | |
Kong et al. | A reinforcement learning approach for dynamic spectrum anti-jamming in fading environment | |
Van Huynh et al. | DeepFake: Deep dueling-based deception strategy to defeat reactive jammers | |
Ilahi et al. | LoRaDRL: Deep reinforcement learning based adaptive PHY layer transmission parameters selection for LoRaWAN | |
Kim | Adaptive online power control scheme based on the evolutionary game theory | |
CN114978388B (en) | Unmanned aerial vehicle time-frequency domain combined cognition anti-interference intelligent decision-making method | |
CN115567148A (en) | Intelligent interference method based on cooperative Q learning | |
CN113225794A (en) | Full-duplex cognitive communication power control method based on deep reinforcement learning | |
CN113423110A (en) | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning | |
Zhou et al. | A countermeasure against random pulse jamming in time domain based on reinforcement learning | |
Li et al. | Intelligent dynamic spectrum anti-jamming communications: A deep reinforcement learning perspective | |
Pei et al. | Joint time-frequency anti-jamming communications: A reinforcement learning approach | |
CN115766089A (en) | Energy acquisition cognitive Internet of things anti-interference optimal transmission method | |
CN113271119B (en) | Anti-interference cooperative frequency hopping method based on transmission scheduling | |
Cho et al. | Power control for MACA-based underwater MAC protocol: A Q-learning approach | |
CN113038567B (en) | Anti-interference method of anti-interference system in multi-relay communication | |
CN117750525B (en) | Frequency domain anti-interference method and system based on reinforcement learning | |
Ali et al. | Defeating proactive jammers using deep reinforcement learning for resource-constrained IoT networks | |
Chen et al. | Adaptive repetition scheme with machine learning for 3GPP NB-IoT |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |