CN117750525A - Frequency domain anti-interference method and system based on reinforcement learning - Google Patents

Frequency domain anti-interference method and system based on reinforcement learning

Info

Publication number
CN117750525A
CN117750525A (application CN202410182440.7A)
Authority
CN
China
Prior art keywords
interference
channel
receiver
communication
transmitter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410182440.7A
Other languages
Chinese (zh)
Inventor
李刚
吴麒
王翔
董珊珊
罗浩
乔冠华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202410182440.7A
Publication of CN117750525A
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a frequency domain anti-interference method and system based on reinforcement learning. In the method, a transmitter and a receiver transmit data through a communication link and transmit control information through a control link; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user. An intelligent agent is embedded in the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated. Whether the avoidance rate reaches a preset threshold value is judged; if not, training is carried out with the WDQL algorithm, the channel strategy is updated, the updated channel strategy and a NACK are sent to the transmitter through the control link, and data transmission of the next communication period starts. The invention not only ensures low iteration time and computational complexity, but also achieves fast training and decision speed and excellent anti-interference performance.

Description

Frequency domain anti-interference method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a frequency domain anti-interference method and system based on reinforcement learning.
Background
The openness of the wireless communication channel makes it vulnerable to interference attacks, which in turn cause loss of communication performance and reduce the reliability of the wireless communication system. Anti-interference technology has therefore become a crucial research direction in the communication field.
Traditional anti-interference technologies, such as Frequency Hopping Spread Spectrum (FHSS) and Direct Sequence Spread Spectrum (DSSS), can provide a certain anti-interference capability to a communication system, but because of their fixed modes they cannot flexibly optimize the anti-interference strategy according to the real-time spectrum environment and interference pattern. Thus, a more intelligent method of selecting communication frequencies is needed to counter malicious interference effectively.
With the development of machine learning technology, scholars have in recent years proposed anti-interference channel selection methods based on Q learning (reference: S. Liu, Y. Xu, X. Chen, M. Wang, W. Li, Y. Li and Y. Xu, "Pattern-Aware Intelligent Anti-Jamming Communication: A Sequential Deep Reinforcement Learning Approach," IEEE Access, vol. 7, pp. 169204-169216, 2019). However, adjusting only the frequency domain parameters of the system does not take full advantage of the multi-domain flexibility of the wireless communication system. Thus, some scholars focused on the joint anti-interference problem of the frequency domain and the power domain and proposed a multi-parameter Q learning anti-interference algorithm (reference: Z. Pu, Y. Niu and G. Zhang, "A Multi-Parameter Intelligent Communication Anti-Jamming Method Based on Three-Dimensional Q-Learning," 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI), Beijing, China, 2022, pp. 205-210). In addition, other scholars have combined Q learning with deep learning, fitting the Q-value table with deep reinforcement learning algorithms to achieve dynamic spectrum anti-jamming (reference: X. Liu, Y. Xu, L. Jia, et al., "Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach," IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, May 2018).
However, while anti-jamming algorithms employing deep reinforcement learning successfully address the "dimensional explosion" of a huge state-decision space, in many cases they have long convergence times and are difficult to train effectively. Reinforcement learning anti-jamming algorithms based on Q learning, although able to converge in a shorter time than deep reinforcement learning, do not adequately account for the overestimation problem that can arise when a single estimator is used to update the Q value. This problem may make the resulting anti-interference strategy less than optimal.
Thus, how to achieve rapid convergence and good interference immunity in the face of an unknown communication interference environment is a challenge that needs to be addressed by practitioners of the art.
Disclosure of Invention
The invention aims to provide a frequency domain anti-interference method and system based on reinforcement learning that are particularly suitable for situations involving unknown patterned interference. The method can avoid interference rapidly to obtain good anti-interference performance while reducing the frequency of channel switching as much as possible to lower the communication cost. It thereby addresses the problems in the prior art that anti-interference research based on deep reinforcement learning is difficult to train and slow to converge, and that the strategies obtained by anti-interference research based on Q learning are not optimal.
The invention discloses a frequency domain anti-interference method based on reinforcement learning, which comprises the following steps:
step 1: a transmitter and a receiver which communicate with each other are used as communication users; the transmitter and the receiver transmit data through a communication link and transmit control information through a control link, wherein the control information comprises the channel strategy and NACK; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value, if not, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to a transmitter through a control link, and starting data transmission of the next communication period.
Further, the step 2 includes:
within one communication cycle, data transmission is carried out during the first portion of the cycle, namely the cycle length minus $t_0$ when no retraining is needed, or minus $t_0 + t_1$ when retraining is needed; at the same time, during this data transmission portion, an agent located in the receiver perceives the spectrum environment in real time and generates the signal-to-interference-and-noise ratio subframes; after the data transmission is completed, the avoidance rate of all time slot channels in the current period is calculated; the communication cycle is thus composed of the data transmission portion together with the training period $t_1$ and the control feedback period $t_0$, where $t_0$ and $t_1$ both represent time periods.
Further, in each time slot, the communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-plus-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
Further, the channel gain model between the transmitter and the receiver is
$g_s(t) = d_{sr}^{-\alpha_s}\,|h_s(t)|^2$
The channel gain model between the receiver and the j-th jammer is
$g_j(t) = d_{rj}^{-\alpha_j}\,|h_j(t)|^2$
Wherein, $d_{sr}$ is the Euclidean distance between the transmitter and the receiver, $d_{rj}$ is the Euclidean distance between the receiver and the j-th jammer, $\alpha_s$ and $\alpha_j$ are path fading factors, and the instantaneous fading coefficients $h_s(t)$ and $h_j(t)$ are complex Gaussian variables with mean 0 and variance $\sigma^2$;
The signal-to-interference-and-noise ratio of the receiver is
$\mathrm{SINR}(t) = \frac{p_s\, g_s(t)}{\int_{f_t - B/2}^{f_t + B/2} \big[\, n(f) + \sum_{j=1}^{J} g_j(t)\, J_j(f, f_{j,t}) \big]\, \mathrm{d}f}$
Wherein, $f_t$ denotes the center frequency of the channel selected by the communication user in time slot t, $B$ is the baseband signal bandwidth, $p_s$ is the communication signal power, $J_j(\cdot)$ and $n(\cdot)$ represent the interference power spectral density function and the noise power spectral density function respectively, $f$ represents the frequency variable, $n(f)$ represents the power spectral density of the noise, $J$ indicates the number of jammers, and $f_{j,t}$ represents the center frequency of the channel selected by the j-th jammer in time slot t.
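To make the block-fading channel and SINR model above concrete, the following minimal Python sketch computes the per-slot SINR of the receiver. It assumes a flat (rectangular) interference power spectral density for each jammer; all function names, distances, powers and other numerical values are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(42)

def channel_gain(distance, path_loss_exp, sigma=1.0):
    """Block-fading gain d^(-alpha) * |h|^2, h complex Gaussian with mean 0 and variance sigma^2."""
    h = rng.normal(0.0, sigma / np.sqrt(2)) + 1j * rng.normal(0.0, sigma / np.sqrt(2))
    return distance ** (-path_loss_exp) * np.abs(h) ** 2

def sinr_db(user_fc, jammer_fcs, p_s=1.0, p_j=10.0, bandwidth=2.0, noise_psd=1e-3,
            d_sr=100.0, d_rj=120.0, alpha_s=2.0, alpha_j=2.0):
    """Receiver SINR on the channel centred at user_fc; all units and values illustrative.

    Interference is integrated over the user band [user_fc - B/2, user_fc + B/2]; each
    jammer is assumed to spread its power p_j uniformly over a band of the same width B
    around its own centre frequency (a flat interference power spectral density).
    """
    signal = p_s * channel_gain(d_sr, alpha_s)
    interference = 0.0
    for f_j in jammer_fcs:
        overlap = max(0.0, bandwidth - abs(user_fc - f_j))   # spectral overlap of the two bands
        interference += channel_gain(d_rj, alpha_j) * (p_j / bandwidth) * overlap
    noise = noise_psd * bandwidth                            # integral of n(f) over the user band
    return 10.0 * np.log10(signal / (interference + noise))

# Example: 10 channels of width B = 2, user on channel 3, comb jammers on channels 2, 5 and 8.
print(sinr_db(user_fc=3 * 2.0, jammer_fcs=[2 * 2.0, 5 * 2.0, 8 * 2.0]))
```

The SINR values sensed in this way over the time slots of a subframe form one signal-to-interference-and-noise ratio subframe used by the agent.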
Further, the step 3 includes:
step 31: setting the number of training subframes N; for the first N subframes of the current communication period, training with the WDQL algorithm to obtain N pairs of Q value tables $(Q^U_i, Q^V_i)$, $i = 1, \dots, N$; $Q^U_i$ and $Q^V_i$ are both Q value tables;
step 32: averaging the N pairs of Q value tables and extracting, in each time slot, the action with the maximum averaged Q value, so as to form an action sequence whose length equals the number of time slots in a subframe, which serves as the optimal channel strategy (see the sketch following step 33);
step 33: sending the optimal channel strategy together with the NACK to the transmitter to guide the channel selection of the transmitter in the next communication period, wherein N is a positive integer greater than 1.
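A minimal sketch of steps 31 to 33 is given below. It assumes the state is the time-slot index within a subframe, so each Q value table has one row per time slot and one column per channel; `wdql_train_subframe` is a placeholder for the per-subframe WDQL training routine sketched after the weighted update formulas later in this section.

```python
import numpy as np

def extract_channel_strategy(q_pairs):
    """Steps 32-33: average the N pairs of Q value tables and take the per-slot argmax.

    q_pairs -- list of N (Q_U, Q_V) pairs, each table of shape (T_slots, M_channels)
    returns -- length-T array of channel indices, i.e. the optimal channel strategy
    """
    stacked = np.stack([q for pair in q_pairs for q in pair])   # shape (2N, T, M)
    q_mean = stacked.mean(axis=0)                               # averaged table, shape (T, M)
    return q_mean.argmax(axis=1)

# Usage sketch (step 31 trains one pair of tables per training subframe):
# q_pairs = [wdql_train_subframe(sf, n_channels) for sf in sinr_subframes[:N]]
# strategy = extract_channel_strategy(q_pairs)   # sent to the transmitter together with the NACK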
Further, the step 32 includes:
first, in the state $s_t$ of time slot t, an action is selected randomly with probability $\varepsilon$, or the action with the maximum Q value is selected with probability $1-\varepsilon$, namely $a_t = \arg\max_{a \in A}\big[Q^U(s_t, a) + Q^V(s_t, a)\big]$, where $a_t$ denotes the action taken in time slot t;
then the reward $r_t$ of action $a_t$ is calculated, and either the Q value table $Q^U$ or the Q value table $Q^V$ is randomly selected for updating.
Further, the calculation process of the reward $r_t$ comprises:
the agent in the receiver avoids the interference through a Markov decision process; the Markov decision process includes states, actions, state transition probabilities and rewards; the state of time slot t is expressed as $s_t$, and all states constitute the state space $S$; the action in time slot t is denoted $a_t$, $a_t \in A$, $A = \{1, 2, \dots, M\}$, where $M$ is the number of available channels and m represents the m-th channel among the available channels; all actions constitute the action space $A$; the state transition probability $p(s_{t+1} \mid s_t, a_t)$ satisfies $\Pr\{s_{t+1} \mid s_t, a_t\} = p(s_{t+1} \mid s_t, a_t)$ and indicates the probability that, when the agent selects action $a_t$ in the current environment state $s_t$ of time slot t, the environment transfers to the state $s_{t+1}$ of the next time slot t+1, i.e. the instantaneous channel avoidance rate; $a_t \in A$ ranges over the actions of all available channels in the action space $A$ at time slot t; $s_t$ and $s_{t+1}$ belong to the state spaces of time slots t and t+1 respectively, and Pr and p both represent probabilities;
if the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values; the reward represents the gain obtained when the agent selects action $a_t$ in state $s_t$ and the environment transfers to state $s_{t+1}$; it is determined by the demodulation thresholds $\eta_1$, $\eta_2$ and $\eta_3$ corresponding to three preset modulation modes and by the switching cost $c_s$ incurred when a channel switch is generated; the gain $r_t$ is expressed as:
$r_t = R\big(\mathrm{SINR}(t);\, \eta_1, \eta_2, \eta_3\big) - c_s\,\mathbb{1}\big(f_{t+1} \neq f_t\big)$
Wherein, $R(\cdot)$ assigns the gain of the highest preset modulation mode whose demodulation threshold is reached by $\mathrm{SINR}(t)$; $\mathbb{1}(\cdot)$ is an indication function expressing that when $f_{t+1} = f_t$, i.e. when no channel switch occurs in the front and rear time slots, no cost is lost, and otherwise a cost of size $c_s$ is incurred; $f_{t+1}$ represents the center frequency of the channel selected by the communication user at time slot t+1.
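A minimal sketch of this reward is shown below. The per-mode gains (1, 2 and 3 units for the three preset modulation modes) and the threshold and cost values are illustrative assumptions; the patent only states that the gain is determined by the three demodulation thresholds and that a switching cost is charged when the channel changes.

```python
def reward(sinr_db, prev_channel, channel, thresholds=(2.0, 5.0, 8.0), switch_cost=0.4):
    """Illustrative reward r_t: gain of the best demodulable mode minus the switching cost.

    thresholds  -- assumed demodulation thresholds (eta_1, eta_2, eta_3), ascending
    switch_cost -- cost c_s charged only when the selected channel changes between slots
    """
    gain = sum(1 for eta in thresholds if sinr_db >= eta)     # 0..3, piecewise constant in the SINR
    cost = switch_cost if channel != prev_channel else 0.0    # indicator 1(f_{t+1} != f_t)
    return gain - cost
```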
Further, the goal of the Markov decision is to maximize the total revenue within one subframe:
$\pi^{*} = \arg\max_{\pi} \sum_{\tau=1}^{T} r_{\tau}\big(s_{\tau}, \pi(\tau)\big)$
Wherein, $\pi^{*}$ represents the optimal channel policy of the communication user, $\arg\max_{\pi}$ represents selecting the policy that maximizes the total revenue of the communication user, $\sum_{\tau=1}^{T} r_{\tau}(\cdot)$ represents the sum of the rewards earned over all time slots of a single subframe, $a_{\tau}$ represents an action, $\pi(\tau)$ represents the action corresponding to policy $\pi$ in time slot $\tau$, and $T$ represents the number of time slots within a single subframe.
Further, randomly selecting to update $Q^U$ or $Q^V$ comprises the following steps:
if $Q^U$ is updated:
$a^{*} = \arg\max_{a} Q^{U}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{U}(s_{t+1}, a)$
$\beta^{U} = \frac{\big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}{c + \big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}$
$Q^{U}(s_{t}, a_{t}) \leftarrow Q^{U}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{U} Q^{U}(s_{t+1}, a^{*}) + (1-\beta^{U})\, Q^{V}(s_{t+1}, a^{*})\big) - Q^{U}(s_{t}, a_{t})\big]$
if $Q^V$ is updated, the symmetric update is applied with the roles of $Q^U$ and $Q^V$ exchanged:
$a^{*} = \arg\max_{a} Q^{V}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{V}(s_{t+1}, a)$
$\beta^{V} = \frac{\big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}{c + \big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}$
$Q^{V}(s_{t}, a_{t}) \leftarrow Q^{V}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{V} Q^{V}(s_{t+1}, a^{*}) + (1-\beta^{V})\, Q^{U}(s_{t+1}, a^{*})\big) - Q^{V}(s_{t}, a_{t})\big]$
Wherein, $a^{*}$ and $a_{L}$ respectively represent, in the next state $s_{t+1}$ and based on the Q value table being updated, the action with the maximum Q value and the action with the minimum Q value; $\beta^{U}$ and $\beta^{V}$ serve as weights to balance the overestimation problem of a single estimator against the underestimation problem of a double estimator; when updating the Q values, the two Q value tables $Q^{U}$ and $Q^{V}$ are both used; $\alpha$ is the learning rate, $\gamma$ is the discount factor, $c$ is the weight parameter, and $\max$ denotes the maximum-value function.
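The epsilon-greedy selection, reward and weighted double Q-learning update described above can be combined into the following per-subframe training sketch. It follows the standard WDQL formulation; treating the state as the time-slot index, the table shapes, the episode count and the learning rate, discount factor and weight parameter values are assumptions made for illustration, and `reward` refers to the reward sketch given earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_U, Q_V, s, n_actions, eps=0.1):
    """Behaviour policy: random action with probability eps, else argmax of Q_U + Q_V."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q_U[s] + Q_V[s]))

def wdql_step(Q_U, Q_V, s, a, r, s_next, alpha=0.1, gamma=0.9, c=1.0):
    """One weighted double Q-learning update; Q_U and Q_V have shape (n_states, n_actions)."""
    # Randomly choose which table is updated this step (the other provides the cross-estimate).
    Q_a, Q_b = (Q_U, Q_V) if rng.random() < 0.5 else (Q_V, Q_U)
    a_star = int(np.argmax(Q_a[s_next]))          # action with the maximum Q value in s_{t+1}
    a_low = int(np.argmin(Q_a[s_next]))           # action with the minimum Q value in s_{t+1}
    gap = abs(Q_b[s_next, a_star] - Q_b[s_next, a_low])
    beta = gap / (c + gap)                        # weight balancing over- and under-estimation
    target = r + gamma * (beta * Q_a[s_next, a_star] + (1.0 - beta) * Q_b[s_next, a_star])
    Q_a[s, a] += alpha * (target - Q_a[s, a])     # in-place update of the chosen table

def wdql_train_subframe(sinr_subframe, n_channels, n_episodes=200):
    """Train one (Q_U, Q_V) pair on one SINR subframe (state = slot index, an assumption).

    sinr_subframe -- array of shape (T_slots, n_channels) with the sensed per-channel SINR in dB
    """
    n_slots = len(sinr_subframe)
    Q_U = np.zeros((n_slots, n_channels))
    Q_V = np.zeros((n_slots, n_channels))
    for _ in range(n_episodes):
        prev_a = 0
        for t in range(n_slots - 1):
            a = epsilon_greedy(Q_U, Q_V, t, n_channels)
            r = reward(sinr_subframe[t][a], prev_a, a)     # reward sketch defined earlier
            wdql_step(Q_U, Q_V, t, a, r, t + 1)
            prev_a = a
    return Q_U, Q_V
```

One such pair of tables would be produced per training subframe, yielding the N pairs averaged in step 32.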
The invention also discloses a frequency domain anti-interference system based on reinforcement learning, which is used for realizing the above frequency domain anti-interference method based on reinforcement learning and comprises:
the communication module is used for taking a transmitter and a receiver which communicate with each other as communication users, transmitting data through a communication link and transmitting control information through a control link; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. Through a continuously optimized weighted double Q learning algorithm, the system can update the Q value tables, formulate the optimized channel strategy and communicate it to the transmitter through a stable control link. The model of the invention is fully designed and its algorithm is sound: it not only ensures low iteration time and computational complexity, but also achieves a fast training and decision speed. In particular, when facing fixed-mode interference, the system converges quickly and shows excellent anti-interference performance, providing a powerful guarantee for the reliability of the wireless communication system.
2. Based on reinforcement learning technology, an intelligent communication anti-interference method is designed. According to the method, the frequency spectrum environment is periodically sensed, the signal-to-interference-and-noise ratio subframes are generated, and a channel strategy is formulated through training of the subframes so as to achieve the purpose of avoiding interference.
3. The invention does not need to estimate the interference pattern and parameters of the jammer in advance, i.e. it is model-free, so it can be widely applied to various patterned anti-interference scenarios.
4. By adopting a Q learning type algorithm as the reinforcement learning algorithm, the invention avoids both the long training and convergence time of deep reinforcement learning algorithms and the overestimation of actions. By effectively solving the proposed model, good anti-interference performance can be obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a diagram of a frequency domain anti-interference system of the present invention;
FIG. 2 is a structural design diagram of a single communication cycle of the frequency domain anti-interference system of the present invention;
FIG. 3 is a flow chart of the frequency domain anti-interference method of the present invention;
FIG. 4 (a) is a signal-to-interference-and-noise ratio heat map for resisting comb interference in an embodiment of the present invention;
FIG. 4 (b) is a further signal-to-interference-and-noise ratio heat map for resisting comb interference in an embodiment of the present invention;
FIG. 4 (c) is a further signal-to-interference-and-noise ratio heat map for resisting comb interference in an embodiment of the present invention;
FIG. 5 (a) is a signal-to-interference-and-noise ratio heat map for resisting swept-frequency interference in an embodiment of the present invention;
FIG. 5 (b) is a further signal-to-interference-and-noise ratio heat map for resisting swept-frequency interference in an embodiment of the present invention;
FIG. 5 (c) is a further signal-to-interference-and-noise ratio heat map for resisting swept-frequency interference in an embodiment of the present invention;
FIG. 6 is a reward variation curve against comb interference in an embodiment of the present invention;
FIG. 7 is a reward variation curve against swept-frequency interference in an embodiment of the present invention.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings and embodiments; the embodiments described represent only a part, and not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments are intended to fall within the scope of protection of the present invention.
Fig. 1 is a model diagram of the frequency domain anti-interference system. In this model, a pair of communication transceivers form a communication user: the transmitter and the receiver transmit data over a communication link while transmitting control information over a control link. The agent is embedded in the receiver, obtains channel information using spectrum sensing, and optimizes the channel strategy using a reinforcement learning algorithm. Meanwhile, a plurality of patterned jammers generate high-power interference signals to interfere with the communication user.
FIG. 2 shows the structural design within a single communication cycle of the frequency domain anti-interference system. In this structure, the communication user performs the following operations: data transmission is carried out during the first portion of the cycle (the cycle length minus $t_0$, or minus $t_0 + t_1$ when retraining is needed). Meanwhile, during this period, the intelligent agent in the receiver perceives the spectrum environment in real time, generating the signal-to-interference-and-noise ratio subframes, and after the data transmission is completed, the communication user calculates the avoidance rate of all time slot channels in the current period. If the avoidance rate is greater than the set threshold value, the channel strategy does not need to be optimized through reinforcement learning, and an ACK (Acknowledgement) is sent to the transmitter within the last $t_0$. Conversely, during the following $t_1$, the Q value tables are updated by the reinforcement learning algorithm to give the optimal channel strategy, and a NACK (Negative Acknowledgement) and the latest channel strategy are sent to the transmitter within the last $t_0$. Finally, the communication user transmits data with the latest channel strategy after the next communication period starts. A minimal control-flow sketch of one such period is given after this paragraph.
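The following sketch summarizes one communication period as described above. The `env` object, its methods and the definition of the avoidance rate (assumed here to be provided by the environment as the share of undisturbed slots) are hypothetical placeholders, and `wdql_train_subframe` and `extract_channel_strategy` refer to the earlier sketches.

```python
def run_communication_period(env, n_channels, n_subframes, n_train, threshold=0.9):
    """One communication period: transmit and sense, then either ACK or retrain and NACK."""
    # Data transmission phase: the agent senses the spectrum and builds the SINR subframes.
    sinr_subframes = [env.transmit_and_sense_subframe() for _ in range(n_subframes)]
    # Avoidance rate over all time slots of the current period (assumed definition).
    avoidance_rate = env.avoidance_rate(sinr_subframes)
    if avoidance_rate >= threshold:
        env.send_control("ACK")                    # keep the current channel strategy
        return None
    # Otherwise: retrain with WDQL during t1 and feed the new strategy back during t0.
    q_pairs = [wdql_train_subframe(sf, n_channels) for sf in sinr_subframes[:n_train]]
    strategy = extract_channel_strategy(q_pairs)
    env.send_control("NACK", strategy)             # the transmitter applies it next period
    return strategy
```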
The invention provides an embodiment of a frequency domain anti-interference method based on reinforcement learning, which comprises the following steps:
step 1: a transmitter and a receiver which communicate with each other are used as communication users; the transmitter and the receiver transmit data through a communication link and transmit control information through a control link, wherein the control information comprises the channel strategy and NACK; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value; if not, training with a WDQL (weighted double Q-learning) algorithm and updating the channel strategy, then sending the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
The model of the invention is fully designed and its algorithm is sound: it not only ensures low iteration time and computational complexity, but also achieves a fast training and decision speed. In particular, when facing fixed-mode interference, the system converges quickly and shows excellent anti-interference performance, providing a powerful guarantee for the reliability of the wireless communication system.
In this embodiment, step 2 includes:
within one communication cycle, data transmission is carried out during the first portion of the cycle, namely the cycle length minus $t_0$ when no retraining is needed, or minus $t_0 + t_1$ when retraining is needed; at the same time, during this data transmission portion, an agent located in the receiver perceives the spectrum environment in real time and generates the signal-to-interference-and-noise ratio subframes; after the data transmission is completed, the avoidance rate of all time slot channels in the current period is calculated; the communication cycle is thus composed of the data transmission portion together with the training period $t_1$ and the control feedback period $t_0$, where $t_0$ and $t_1$ both represent time periods.
In this embodiment, in each time slot, a communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-and-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
In this embodiment, the channel gain model between the transmitter and the receiver is
$g_s(t) = d_{sr}^{-\alpha_s}\,|h_s(t)|^2$
The channel gain model between the receiver and the j-th jammer is
$g_j(t) = d_{rj}^{-\alpha_j}\,|h_j(t)|^2$
Wherein, $d_{sr}$ is the Euclidean distance between the transmitter and the receiver, $d_{rj}$ is the Euclidean distance between the receiver and the j-th jammer, $\alpha_s$ and $\alpha_j$ are path fading factors, and the instantaneous fading coefficients $h_s(t)$ and $h_j(t)$ are complex Gaussian variables with mean 0 and variance $\sigma^2$;
The signal-to-interference-and-noise ratio of the receiver is
$\mathrm{SINR}(t) = \frac{p_s\, g_s(t)}{\int_{f_t - B/2}^{f_t + B/2} \big[\, n(f) + \sum_{j=1}^{J} g_j(t)\, J_j(f, f_{j,t}) \big]\, \mathrm{d}f}$
Wherein, $f_t$ denotes the center frequency of the channel selected by the communication user in time slot t, $B$ is the baseband signal bandwidth, $p_s$ is the communication signal power, $J_j(\cdot)$ and $n(\cdot)$ represent the interference power spectral density function and the noise power spectral density function respectively, $f$ represents the frequency variable, $n(f)$ represents the power spectral density of the noise, $J$ indicates the number of jammers, and $f_{j,t}$ represents the center frequency of the channel selected by the j-th jammer in time slot t.
In this embodiment, step 3 includes:
step 31: setting the number of training subframes N; for the first N subframes of the current communication period, training with the WDQL algorithm to obtain N pairs of Q value tables $(Q^U_i, Q^V_i)$, $i = 1, \dots, N$; $Q^U_i$ and $Q^V_i$ are both Q value tables;
step 32: averaging the N pairs of Q value tables and extracting, in each time slot, the action with the maximum averaged Q value, so as to form an action sequence whose length equals the number of time slots in a subframe, which serves as the optimal channel strategy;
step 33: sending the optimal channel strategy together with the NACK to the transmitter to guide the channel selection of the transmitter in the next communication period, wherein N is a positive integer greater than 1.
In this embodiment, step 32 includes:
first, in the state $s_t$ of time slot t, an action is selected randomly with probability $\varepsilon$, or the action with the maximum Q value is selected with probability $1-\varepsilon$, namely $a_t = \arg\max_{a \in A}\big[Q^U(s_t, a) + Q^V(s_t, a)\big]$, where $a_t$ denotes the action taken in time slot t;
then the reward $r_t$ of action $a_t$ is calculated, and either the Q value table $Q^U$ or the Q value table $Q^V$ is randomly selected for updating.
In the present embodiment, the calculation process of the reward $r_t$ comprises:
the agent in the receiver avoids the interference through a Markov decision process; the Markov decision process includes states, actions, state transition probabilities and rewards; the state of time slot t is expressed as $s_t$, and all states constitute the state space $S$; the action in time slot t is denoted $a_t$, $a_t \in A$, $A = \{1, 2, \dots, M\}$, where $M$ is the number of available channels and m represents the m-th channel among the available channels; all actions constitute the action space $A$; the state transition probability $p(s_{t+1} \mid s_t, a_t)$ satisfies $\Pr\{s_{t+1} \mid s_t, a_t\} = p(s_{t+1} \mid s_t, a_t)$ and indicates the probability that, when the agent selects action $a_t$ in the current environment state $s_t$ of time slot t, the environment transfers to the state $s_{t+1}$ of the next time slot t+1, i.e. the instantaneous channel avoidance rate; $a_t \in A$ ranges over the actions of all available channels in the action space $A$ at time slot t; $s_t$ and $s_{t+1}$ belong to the state spaces of time slots t and t+1 respectively, and Pr and p both represent probabilities;
if the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values; the reward represents the gain obtained when the agent selects action $a_t$ in state $s_t$ and the environment transfers to state $s_{t+1}$; it is determined by the demodulation thresholds $\eta_1$, $\eta_2$ and $\eta_3$ corresponding to three preset modulation modes and by the switching cost $c_s$ incurred when a channel switch is generated; the gain $r_t$ is expressed as:
$r_t = R\big(\mathrm{SINR}(t);\, \eta_1, \eta_2, \eta_3\big) - c_s\,\mathbb{1}\big(f_{t+1} \neq f_t\big)$
Wherein, $R(\cdot)$ assigns the gain of the highest preset modulation mode whose demodulation threshold is reached by $\mathrm{SINR}(t)$; $\mathbb{1}(\cdot)$ is an indication function expressing that when $f_{t+1} = f_t$, i.e. when no channel switch occurs in the front and rear time slots, no cost is lost, and otherwise a cost of size $c_s$ is incurred; $f_{t+1}$ represents the center frequency of the channel selected by the communication user at time slot t+1.
In this embodiment, the goal of the Markov decision is to maximize the total benefit within one subframe:
$\pi^{*} = \arg\max_{\pi} \sum_{\tau=1}^{T} r_{\tau}\big(s_{\tau}, \pi(\tau)\big)$
Wherein, $\pi^{*}$ represents the optimal channel policy of the communication user, $\arg\max_{\pi}$ represents selecting the policy that maximizes the total benefit of the communication user, $\sum_{\tau=1}^{T} r_{\tau}(\cdot)$ represents the sum of the rewards earned over all time slots of a single subframe, $a_{\tau}$ represents an action, $\pi(\tau)$ represents the action corresponding to policy $\pi$ in time slot $\tau$, and $T$ represents the number of time slots within a single subframe.
In this embodiment, randomly selecting to update $Q^U$ or $Q^V$ comprises the following steps:
if $Q^U$ is updated:
$a^{*} = \arg\max_{a} Q^{U}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{U}(s_{t+1}, a)$
$\beta^{U} = \frac{\big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}{c + \big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}$
$Q^{U}(s_{t}, a_{t}) \leftarrow Q^{U}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{U} Q^{U}(s_{t+1}, a^{*}) + (1-\beta^{U})\, Q^{V}(s_{t+1}, a^{*})\big) - Q^{U}(s_{t}, a_{t})\big]$
if $Q^V$ is updated, the symmetric update is applied with the roles of $Q^U$ and $Q^V$ exchanged:
$a^{*} = \arg\max_{a} Q^{V}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{V}(s_{t+1}, a)$
$\beta^{V} = \frac{\big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}{c + \big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}$
$Q^{V}(s_{t}, a_{t}) \leftarrow Q^{V}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{V} Q^{V}(s_{t+1}, a^{*}) + (1-\beta^{V})\, Q^{U}(s_{t+1}, a^{*})\big) - Q^{V}(s_{t}, a_{t})\big]$
Wherein, $a^{*}$ and $a_{L}$ respectively represent, in the next state $s_{t+1}$ and based on the Q value table being updated, the action with the maximum Q value and the action with the minimum Q value; $\beta^{U}$ and $\beta^{V}$ serve as weights to balance the overestimation problem of a single estimator against the underestimation problem of a double estimator; when updating the Q values, the two Q value tables $Q^{U}$ and $Q^{V}$ are both used; $\alpha$ is the learning rate, $\gamma$ is the discount factor, $c$ is the weight parameter, and $\max$ denotes the maximum-value function.
Step 3 further comprises:
if the avoidance rate is greater than the preset threshold value, the receiver sends ACK to the sender through the control link, the channel strategy of the sender is unchanged, and data transmission in the next communication period is started.
The invention also provides an embodiment of a frequency domain anti-interference system based on reinforcement learning, which is used for realizing the frequency domain anti-interference method based on reinforcement learning described in the above embodiments and comprises:
the communication module is used for taking a transmitter and a receiver which communicate with each other as communication users, transmitting data through a communication link and transmitting control information through a control link; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
The invention is further illustrated by the following examples:
Under the Windows 10 64-bit operating system, using a 12th Gen Intel(R) Core(TM) i3-12100 CPU (3.30 GHz), the simulations were completed in PyCharm using the Python language. To analyze the effectiveness of the system, it is compared with a random channel selection algorithm. The relevant parameter settings for reinforcement learning are shown in Table 1.
Table 1 simulation parameter settings
In the embodiment, one communication period is divided into 20 signal-to-interference-and-noise ratio subframes, and two fixed interference modes are considered: comb interference and swept-frequency interference. Fig. 4 (a), Fig. 4 (b) and Fig. 4 (c) respectively show signal-to-interference-and-noise ratio heat maps of the communication user obtained by system simulation with the WDQL algorithm in a comb interference environment. Each block represents a channel, and the black blocks represent the optimal channel strategy given by the reinforcement learning algorithm for the current communication cycle. The shade of a grey block represents the magnitude of the signal-to-interference-and-noise value: the darker the color, the smaller the value, which indicates that the corresponding channel is disturbed to a greater extent and is unsuitable for communication. Fig. 4 (a), Fig. 4 (b) and Fig. 4 (c) correspond to the first, second and third SINR subframes of the current communication period, respectively; since the perceived SINR information differs from time slot to time slot, the heat maps of the different subframes have different color shades, but the interference patterns are consistent. It can be observed that after the reinforcement learning algorithm is trained, the channel strategy given by the intelligent agent essentially avoids the interference of the jammers, achieving the purpose of interference avoidance. Similarly, Fig. 5 (a), Fig. 5 (b) and Fig. 5 (c) show SINR heat maps of the communication user obtained by system simulation with the WDQL algorithm in a swept-frequency interference environment. Swept-frequency interference is more complex than comb interference, resulting in more frequent channel switching.
Fig. 6 and Fig. 7 show the reward variation curves of the reinforcement learning based channel selection algorithm and the random channel selection algorithm in comb interference and swept-frequency interference environments. It can be observed from the figures that as the number of training rounds increases, the reward per round of the reinforcement learning based algorithm continuously increases, so that the interference is effectively avoided, and the final reward tends to a stable value. Conversely, the reward value of the random channel selection algorithm does not increase, and the interference is naturally not effectively avoided.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. The frequency domain anti-interference method based on reinforcement learning is characterized by comprising the following steps of:
step 1: a transmitter and a receiver which communicate with each other are used as communication users; the transmitter and the receiver transmit data through a communication link and transmit control information through a control link, wherein the control information comprises the channel strategy and NACK; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
step 2: the intelligent agent is embedded into the receiver, one communication period of the transmitter and the receiver is divided into a plurality of subframes, each subframe comprises a plurality of time slots, and the avoidance rate of all time slot channels is calculated;
step 3: judging whether the avoidance rate reaches a preset threshold value, if not, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to a transmitter through a control link, and starting data transmission of the next communication period.
2. The reinforcement learning-based frequency domain interference rejection method according to claim 1, wherein the step 2 comprises:
within one communication cycle, data transmission is carried out during the first portion of the cycle, namely the cycle length minus $t_0$ when no retraining is needed, or minus $t_0 + t_1$ when retraining is needed; at the same time, during this data transmission portion, an agent located in the receiver perceives the spectrum environment in real time and generates the signal-to-interference-and-noise ratio subframes; after the data transmission is completed, the avoidance rate of all time slot channels in the current period is calculated; the communication cycle is thus composed of the data transmission portion together with the training period $t_1$ and the control feedback period $t_0$, where $t_0$ and $t_1$ both represent time periods.
3. The reinforcement learning-based frequency domain anti-interference method according to claim 1 or 2, wherein in each time slot, a communication user obtains the signal-to-interference-and-noise ratio of each channel through spectrum sensing, and combines the signal-to-interference-and-noise ratio information of a plurality of time slots into a plurality of signal-to-interference-and-noise ratio subframes;
the signal-to-interference-plus-noise ratio acquisition method comprises the following steps:
describing channel state information by adopting a block fading channel model, wherein channel parameters are kept unchanged in each time slot; modeling the channel gain between the transmitter and the receiver and the channel gain between the jammer and the receiver respectively;
and calculating the signal-to-interference-and-noise ratio of the receiver based on the channel gain model between the transmitter and the receiver and the channel gain model between the jammer and the receiver.
4. The reinforcement learning-based frequency domain interference rejection method according to claim 3, wherein the channel gain model between the transmitter and the receiver is
$g_s(t) = d_{sr}^{-\alpha_s}\,|h_s(t)|^2$
the channel gain model between the receiver and the j-th jammer is
$g_j(t) = d_{rj}^{-\alpha_j}\,|h_j(t)|^2$
wherein $d_{sr}$ is the Euclidean distance between the transmitter and the receiver, $d_{rj}$ is the Euclidean distance between the receiver and the j-th jammer, $\alpha_s$ and $\alpha_j$ are path fading factors, and the instantaneous fading coefficients $h_s(t)$ and $h_j(t)$ are complex Gaussian variables with mean 0 and variance $\sigma^2$;
the signal-to-interference-and-noise ratio of the receiver is
$\mathrm{SINR}(t) = \frac{p_s\, g_s(t)}{\int_{f_t - B/2}^{f_t + B/2} \big[\, n(f) + \sum_{j=1}^{J} g_j(t)\, J_j(f, f_{j,t}) \big]\, \mathrm{d}f}$
wherein $f_t$ denotes the center frequency of the channel selected by the communication user in time slot t, $B$ is the baseband signal bandwidth, $p_s$ is the communication signal power, $J_j(\cdot)$ and $n(\cdot)$ represent the interference power spectral density function and the noise power spectral density function respectively, $f$ represents the frequency variable, $n(f)$ represents the power spectral density of the noise, $J$ indicates the number of jammers, and $f_{j,t}$ represents the center frequency of the channel selected by the j-th jammer in time slot t.
5. The reinforcement learning-based frequency domain interference rejection method according to claim 1, wherein the step 3 comprises:
step 31: setting the number of training subframes N; for the first N subframes of the current communication period, training with the WDQL algorithm to obtain N pairs of Q value tables $(Q^U_i, Q^V_i)$, $i = 1, \dots, N$; $Q^U_i$ and $Q^V_i$ are both Q value tables;
step 32: averaging the N pairs of Q value tables and extracting, in each time slot, the action with the maximum averaged Q value, so as to form an action sequence whose length equals the number of time slots in a subframe, which serves as the optimal channel strategy;
step 33: sending the optimal channel strategy together with the NACK to the transmitter to guide the channel selection of the transmitter in the next communication period, wherein N is a positive integer greater than 1.
6. The reinforcement learning-based frequency domain interference rejection method according to claim 5, wherein said step 32 comprises:
first, in the state $s_t$ of time slot t, an action is selected randomly with probability $\varepsilon$, or the action with the maximum Q value is selected with probability $1-\varepsilon$, namely $a_t = \arg\max_{a \in A}\big[Q^U(s_t, a) + Q^V(s_t, a)\big]$, where $a_t$ denotes the action taken in time slot t;
then the reward $r_t$ of action $a_t$ is calculated, and either the Q value table $Q^U$ or the Q value table $Q^V$ is randomly selected for updating.
7. The reinforcement learning-based frequency domain interference avoidance method according to claim 6, wherein the calculation process of the reward $r_t$ comprises:
the agent in the receiver avoids the interference through a Markov decision process; the Markov decision process includes states, actions, state transition probabilities and rewards; the state of time slot t is expressed as $s_t$, and all states constitute the state space $S$; the action in time slot t is denoted $a_t$, $a_t \in A$, $A = \{1, 2, \dots, M\}$, where $M$ is the number of available channels and m represents the m-th channel among the available channels; all actions constitute the action space $A$; the state transition probability $p(s_{t+1} \mid s_t, a_t)$ satisfies $\Pr\{s_{t+1} \mid s_t, a_t\} = p(s_{t+1} \mid s_t, a_t)$ and indicates the probability that, when the agent selects action $a_t$ in the current environment state $s_t$ of time slot t, the environment transfers to the state $s_{t+1}$ of the next time slot t+1, i.e. the instantaneous channel avoidance rate; $a_t \in A$ ranges over the actions of all available channels in the action space $A$ at time slot t; $s_t$ and $s_{t+1}$ belong to the state spaces of time slots t and t+1 respectively, and Pr and p both represent probabilities;
if the communication user has already perceived the signal-to-interference-and-noise ratio information of each time slot, all the state transition probabilities are determined values; the reward represents the gain obtained when the agent selects action $a_t$ in state $s_t$ and the environment transfers to state $s_{t+1}$; it is determined by the demodulation thresholds $\eta_1$, $\eta_2$ and $\eta_3$ corresponding to three preset modulation modes and by the switching cost $c_s$ incurred when a channel switch is generated; the gain $r_t$ is expressed as:
$r_t = R\big(\mathrm{SINR}(t);\, \eta_1, \eta_2, \eta_3\big) - c_s\,\mathbb{1}\big(f_{t+1} \neq f_t\big)$
wherein $R(\cdot)$ assigns the gain of the highest preset modulation mode whose demodulation threshold is reached by $\mathrm{SINR}(t)$; $\mathbb{1}(\cdot)$ is an indication function expressing that when $f_{t+1} = f_t$, i.e. when no channel switch occurs in the front and rear time slots, no cost is lost, and otherwise a cost of size $c_s$ is incurred; $f_{t+1}$ represents the center frequency of the channel selected by the communication user at time slot t+1.
8. The reinforcement learning-based frequency domain interference rejection method according to claim 7, wherein the goal of the Markov decision is to maximize the total gain within one subframe:
$\pi^{*} = \arg\max_{\pi} \sum_{\tau=1}^{T} r_{\tau}\big(s_{\tau}, \pi(\tau)\big)$
wherein $\pi^{*}$ represents the optimal channel policy of the communication user, $\arg\max_{\pi}$ represents selecting the policy that maximizes the total gain of the communication user, $\sum_{\tau=1}^{T} r_{\tau}(\cdot)$ represents the sum of the rewards earned over all time slots of a single subframe, $a_{\tau}$ represents an action, $\pi(\tau)$ represents the action corresponding to policy $\pi$ in time slot $\tau$, and $T$ represents the number of time slots within a single subframe.
9. The reinforcement learning-based frequency domain interference rejection method according to claim 8, wherein randomly selecting to update $Q^U$ or $Q^V$ comprises the following steps:
if $Q^U$ is updated:
$a^{*} = \arg\max_{a} Q^{U}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{U}(s_{t+1}, a)$
$\beta^{U} = \frac{\big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}{c + \big|Q^{V}(s_{t+1}, a^{*}) - Q^{V}(s_{t+1}, a_{L})\big|}$
$Q^{U}(s_{t}, a_{t}) \leftarrow Q^{U}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{U} Q^{U}(s_{t+1}, a^{*}) + (1-\beta^{U})\, Q^{V}(s_{t+1}, a^{*})\big) - Q^{U}(s_{t}, a_{t})\big]$
if $Q^V$ is updated, the symmetric update is applied with the roles of $Q^U$ and $Q^V$ exchanged:
$a^{*} = \arg\max_{a} Q^{V}(s_{t+1}, a)$, $a_{L} = \arg\min_{a} Q^{V}(s_{t+1}, a)$
$\beta^{V} = \frac{\big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}{c + \big|Q^{U}(s_{t+1}, a^{*}) - Q^{U}(s_{t+1}, a_{L})\big|}$
$Q^{V}(s_{t}, a_{t}) \leftarrow Q^{V}(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma\big(\beta^{V} Q^{V}(s_{t+1}, a^{*}) + (1-\beta^{V})\, Q^{U}(s_{t+1}, a^{*})\big) - Q^{V}(s_{t}, a_{t})\big]$
wherein $a^{*}$ and $a_{L}$ respectively represent, in the next state $s_{t+1}$ and based on the Q value table being updated, the action with the maximum Q value and the action with the minimum Q value; $\beta^{U}$ and $\beta^{V}$ serve as weights to balance the overestimation problem of a single estimator against the underestimation problem of a double estimator; when updating the Q values, the two Q value tables $Q^{U}$ and $Q^{V}$ are both used; $\alpha$ is the learning rate, $\gamma$ is the discount factor, $c$ is the weight parameter, and $\max$ denotes the maximum-value function.
10. A reinforcement learning-based frequency domain interference suppression system for implementing the reinforcement learning-based frequency domain interference suppression method of any one of claims 1-9, comprising:
the communication module is used for taking a transmitter and a receiver which communicate with each other as communication users, transmitting data through a communication link and transmitting control information through a control link; while the communication user performs data transmission, a plurality of patterned jammers generate interference signals to interfere with the communication user;
the computing module is used for embedding an agent into the receiver, dividing one communication period of the transmitter and the receiver into a plurality of subframes, wherein each subframe comprises a plurality of time slots, and computing the avoidance rate of all time slot channels;
and the decision module is used for judging whether the avoidance rate reaches a preset threshold value, if the avoidance rate does not reach the preset threshold value, training by using a WDQL algorithm, updating the channel strategy, transmitting the updated channel strategy and NACK to the transmitter through the control link, and starting data transmission of the next communication period.
CN202410182440.7A 2024-02-19 2024-02-19 Frequency domain anti-interference method and system based on reinforcement learning Pending CN117750525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182440.7A CN117750525A (en) 2024-02-19 2024-02-19 Frequency domain anti-interference method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410182440.7A CN117750525A (en) 2024-02-19 2024-02-19 Frequency domain anti-interference method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN117750525A true CN117750525A (en) 2024-03-22

Family

ID=90259480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182440.7A Pending CN117750525A (en) 2024-02-19 2024-02-19 Frequency domain anti-interference method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117750525A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109586820A (en) * 2018-12-28 2019-04-05 中国人民解放军陆军工程大学 The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment
US20210123741A1 (en) * 2019-10-29 2021-04-29 Loon Llc Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning
US20210241090A1 (en) * 2020-01-31 2021-08-05 At&T Intellectual Property I, L.P. Radio access network control with deep reinforcement learning
US20220209885A1 (en) * 2020-12-24 2022-06-30 Viettel Group Method and apparatus for adaptive anti-jamming communications based on deep double-q reinforcement learning
CN114280558A (en) * 2021-12-23 2022-04-05 北京邮电大学 Interference signal waveform optimization method based on reinforcement learning
CN115103446A (en) * 2022-05-25 2022-09-23 南京邮电大学 Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning
CN115236607A (en) * 2022-06-30 2022-10-25 北京邮电大学 Radar anti-interference strategy optimization method based on double-layer Q learning
CN116744311A (en) * 2023-05-24 2023-09-12 中国人民解放军国防科技大学 User group spectrum access method based on PER-DDQN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FANGMIN XU; FAN YANG; CHENGLIN ZHAO; SHENG WU: "Deep Reinforcement Learning Based Joint Edge Resource Management in Maritime Network", China Communications, no. 05, 15 May 2020 (2020-05-15) *
WANLING LI et al.: "D2D Communication Power Control Based on Deep Q Learning and Fractional Frequency Reuse", 2023 15th International Conference on Communication Software and Networks (ICCSN), 6 November 2023 (2023-11-06) *
TAN JUNJIE; LIANG YINGCHANG: "Deep Reinforcement Learning Methods for Intelligent Communication", Journal of University of Electronic Science and Technology of China, no. 02, 30 March 2020 (2020-03-30) *

Similar Documents

Publication Publication Date Title
CN109586820A (en) The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment
CN111726217B (en) Deep reinforcement learning-based autonomous frequency selection method and system for broadband wireless communication
CN108712748B (en) Cognitive radio anti-interference intelligent decision-making method based on reinforcement learning
CN104994569B (en) Multi-user reinforcement learning-based method for resisting hostile interference of cognitive wireless network
CN109274456B (en) Incomplete information intelligent anti-interference method based on reinforcement learning
Kong et al. A reinforcement learning approach for dynamic spectrum anti-jamming in fading environment
Liu et al. A heterogeneous information fusion deep reinforcement learning for intelligent frequency selection of HF communication
Kim Adaptive online power control scheme based on the evolutionary game theory
Van Huynh et al. DeepFake: Deep dueling-based deception strategy to defeat reactive jammers
Ilahi et al. LoRaDRL: Deep reinforcement learning based adaptive PHY layer transmission parameters selection for LoRaWAN
CN113225794A (en) Full-duplex cognitive communication power control method based on deep reinforcement learning
CN113423110A (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
Han et al. Primary-user-friendly dynamic spectrum anti-jamming access: A GAN-enhanced deep reinforcement learning approach
CN115567148A (en) Intelligent interference method based on cooperative Q learning
CN115766089A (en) Energy acquisition cognitive Internet of things anti-interference optimal transmission method
Pei et al. Joint time-frequency anti-jamming communications: A reinforcement learning approach
Li et al. Intelligent dynamic spectrum anti-jamming communications: A deep reinforcement learning perspective
Zhou et al. A countermeasure against random pulse jamming in time domain based on reinforcement learning
CN113271119B (en) Anti-interference cooperative frequency hopping method based on transmission scheduling
Courjault et al. How robust is a LoRa communication against impulsive noise?
CN113038567B (en) Anti-interference method of anti-interference system in multi-relay communication
CN117750525A (en) Frequency domain anti-interference method and system based on reinforcement learning
CN111741520A (en) Cognitive underwater acoustic communication system power distribution method based on particle swarm
CN116866048A (en) Anti-interference zero-and Markov game model and maximum and minimum depth Q learning method
CN114978388B (en) Unmanned aerial vehicle time-frequency domain combined cognition anti-interference intelligent decision-making method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination