CN114885425A

CN114885425A - USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method

Info

Publication number: CN114885425A
Application number: CN202210397780.2A
Authority: CN
Inventors: 田峰; 王展; 陈宇航; 吴夜
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-04-15
Filing date: 2022-04-15
Publication date: 2022-08-09

Abstract

The invention discloses a reinforcement learning frequency hopping communication anti-interference implementation method based on a USRP platform. The method is implemented by graphical programming in LabVIEW on the host computer and comprises the following steps: first, double-threshold energy detection is used at the transmitter end to perform spectrum sensing and acquire spectrum state information; second, the reward table R of Q learning is updated using the acquired spectrum information as input, and iterative training with the Q learning algorithm then yields an updated Q table. The spectrum state is monitored continuously: when it changes, the Q table is updated again by the above steps; otherwise the Q table is left unchanged. When the system starts to communicate, the optimal spectrum sub-band for frequency hopping is selected according to the frequency hopping decision of the Q table, so that interference is avoided intelligently and actively. The invention enables flexible frequency hopping and reduces the probability of being interfered with; it can also effectively improve spectrum utilization, reduce the number of frequency hops, and greatly reduce system overhead.

Description

Reinforcement learning frequency hopping communication anti-interference implementation method based on USRP platform

Technical Field

The invention relates to a reinforcement learning frequency hopping communication anti-interference implementation method, in particular to a reinforcement learning frequency hopping communication anti-interference implementation method based on a USRP platform.

Background

Frequency hopping is the most widely applied anti-interference technology in the communication field today and is therefore of considerable research significance. However, in most current frequency hopping communication systems, the two communicating parties hop synchronously according to a predetermined frequency hopping pattern. If, under suppression jamming or smart jamming, the system cannot hop in time to avoid the jammed spectrum, its communication is seriously disrupted and communication quality cannot be guaranteed. Meanwhile, as the number of communication devices keeps growing, the spectrum environment becomes more complex and communication frequency bands readily overlap, so the conventional frequency hopping mode severely limits the capacity of the frequency hopping communication network.

With the introduction and development of Cognitive Radio (CR), its dynamic spectrum access capability can be applied effectively to frequency hopping systems. Cognitive radio can be combined with a reinforcement learning algorithm: the spectrum sensing capability of the CR is used to scan and sense the surrounding environment, the reinforcement learning algorithm then learns a strategy from the sensing information, and that strategy is used as the frequency hopping scheme, thereby achieving intelligent anti-interference. Meanwhile, as hardware has improved, software radio technology has matured; researching frequency hopping communication algorithms on a software radio platform brings the work closer to real environments and is of practical significance.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a reinforcement learning frequency hopping communication anti-interference implementation method based on a USRP platform, which can effectively resist partial-band blocking interference and spot-frequency interference.

Technical scheme: the invention discloses a reinforcement learning frequency hopping communication anti-interference implementation method. The USRP platform comprises a transmitter, a receiver, a jammer and a Q learning cognitive engine, wherein the Q learning cognitive engine comprises an information processing center, a Q learning algorithm module and a plurality of sensing nodes; the transmitter sends data to the receiver, and the jammer, as the interference source, sends single-tone/multi-tone interference; the transmitter and the receiver communicate wirelessly; the Q learning cognitive engine is connected to the transmitter and the receiver by optical fiber, and the information processing center is connected to the sensing nodes by optical fiber; the method comprises the following steps:

S1, starting the system, checking whether each hardware module of the system functions normally, and waiting for the user to communicate;

S2, the user prepares to communicate: the transmitter waits to send information and requests the Q learning cognitive engine to allocate available spectrum resources, while the sensing nodes start to periodically collect data on the surrounding environment and send data packets to the information processing center;

S3, the information processing center processes the received data and updates the spectrum information of the usable/unusable frequency bands;

S4, obtaining interference band information from the spectrum resource condition fed back by the information processing center, and then updating the corresponding reward value R of each state-action pair;

S5, using the synchronous Q learning algorithm to start online training and learning, and generating the corresponding Q table;

S6, after receiving the service request of the transmitter, the Q learning cognitive engine allocates available spectrum resources to the transmitter;

S7, during communication, the information processing center examines the spectrum resources currently in use and judges whether the wireless environment has changed; if it has not changed, the user continues to make frequency hopping decisions according to the current Q table and to receive data; if it has changed, the Q learning cognitive engine is retrained and the Q table is updated, and the user then makes new frequency hopping decisions according to the updated Q table;

S8, the user communication is completed and the service ends.

Further, in step S2, a dual-threshold energy detection algorithm is used to detect interference: the sensing nodes periodically measure the energy of the spectrum space in which they are located, and the information processing center samples this data;

after sampling is finished, the energy detection data N_j of the j-th sensing node on the same frequency band are averaged, and the formula for judging whether the spectrum space is available is as follows:

wherein H₀ is the hard decision threshold for judging that the spectrum space is available, H₁ is the hard decision threshold for judging that the spectrum space is unavailable, n is the number of sensing nodes, and f represents the decision result;

when f is 0, it indicates that the spectrum space is available;

when f is 2, the spectrum space is not available;

if f is equal to 1, the state of the spectrum space cannot be determined, and secondary determination is required.
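The decision formula is not reproduced in this text. One form consistent with the definitions above is sketched below; the symbol \bar{N} for the average over the sensing nodes and the treatment of an average exactly equal to a threshold are assumptions.

```latex
\bar{N} = \frac{1}{n}\sum_{j=1}^{n} N_j, \qquad
f =
\begin{cases}
0, & \bar{N} < H_0 \\
1, & H_0 \le \bar{N} \le H_1 \\
2, & \bar{N} > H_1
\end{cases}
```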

Further, the secondary decision adopts a voting mechanism that counts the number of sensing nodes with N_j < H₀ and the number with N_j > H₁: when the nodes with N_j < H₀ exceed half of the votes, the spectrum space is judged to be an idle resource;

when the nodes with N_j > H₁ exceed half of the votes, it is judged to be a non-idle resource.

Further, in step S5, the spectrum resources are divided into equal spectrum sub-bands, and the sub-band containing the working frequency of information transmission is the current state; during user communication, a frequency hopping decision is made according to the Q table, and hopping to the optimal sub-band for the current state is the action; the environment feeds back a corresponding reward according to the action of the agent, and the states and actions taken form the Q table through training and learning; the Q value update formula is as follows:

wherein α is the learning rate, representing the proportion of the newly learned Q value in the updated Q value; γ is the discount factor; R denotes the reward value obtained by the agent at time i when it is in state s and takes action a; Q'(s', a') represents the potential future reward; and Q(s, a) denotes the Q value in the Q table at time i for state s and action a.
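The update formula is not reproduced in this text. Assuming the standard Q-learning update, which matches the roles given to α, γ, the reward and Q'(s', a'), it could be written as below. The subscripted symbols R_i(s, a) and Q_i(s, a) and the maximisation over a' are assumed notation.

```latex
Q_{i+1}(s, a) = (1 - \alpha)\, Q_i(s, a)
  + \alpha \left[ R_i(s, a) + \gamma \max_{a'} Q'(s', a') \right]
```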

Further, in step S5, the synchronous Q learning algorithm replaces the off-policy strategy of the standard Q learning algorithm with an on-policy strategy, selecting the action corresponding to the maximum Q value in each time slot;

after entering the next state, the wide-band spectrum sensing capability of the CR is used to detect the jammer's frequency at that moment, so that all state-action pairs of the current state, that is, all columns of the corresponding Q matrix row, are updated.

Further, in step S7, when the information processing center finds that the current operating frequency band is about to be interfered, it immediately makes a frequency hopping decision according to the Q-table, selects a new optimal available spectrum sub-band, and notifies the transmitter and the receiver to change spectrum resources.

Compared with the prior art, the invention has the following remarkable effects:

1. the Q learning cognitive engine interacts with the environment to learn the jamming strategy of the jammer and make corresponding decisions, so the system can adapt itself and actively avoid interference, reducing the probability of being interfered with, reducing the number of frequency hops, and greatly reducing system overhead; an adaptive intelligent frequency hopping pattern can be generated as the frequency hopping decision, so the behavior of the jammer can be predicted and actively avoided, achieving communication anti-interference; compared with traditional communication anti-interference modes, the frequency hopping system can effectively resist partial-band blocking interference and spot-frequency interference;

2. the cognitive engine of the frequency hopping communication system has short training period and quick decision making, and can effectively cope with different interference strategies; the synchronous Q learning algorithm is adopted, so that training can be accelerated, and an anti-interference decision strategy is obtained after a short learning period; meanwhile, when the jammer changes the jamming strategy, the anti-jamming decision can be quickly updated, and the communication quality of the user is ensured;

3. the invention is implemented on the USRP RIO software radio platform, a relatively mature platform, with the software written in LabVIEW graphical programming software; compared with other devices of the same type, it is simple to operate, has better hardware processing capacity, and offers a wider and more accurately adjustable range of hardware parameters.

Drawings

FIG. 1 is a flow diagram of a test system of the present invention;

FIG. 2 is a diagram of a communication anti-jamming system model of the present invention;

FIG. 3 is a diagram of the physical deployment of the system of the present invention;

FIG. 4 is a functional diagram of a sensing node of the present invention;

FIG. 5 is a data encapsulation diagram of the present invention;

FIG. 6 is a functional block diagram of the transmitting end of the present invention;

FIG. 7 is a functional block diagram of the receiver of the present invention;

FIG. 8 is a functional diagram of an information handling center and a cognitive engine of the present invention;

FIG. 9 is a simplified functional diagram of single/multi-tone interference of the present invention;

FIG. 10 is a simplified functional diagram of frequency sweep interference of the present invention;

FIG. 11 is a video playback screenshot of the transmitting end;

FIG. 12 is a video playback screenshot of an interfered receiving end;

FIG. 13 is a constellation diagram under interference;

FIG. 14 is a graph illustrating an error rate of a received signal;

FIG. 15 is a video playback screenshot after a Q learning cognitive engine decision is taken;

FIG. 16 is a constellation diagram after decision-making using a Q learning cognitive engine;

FIG. 17 is a diagram illustrating bit error rates after a Q learning cognitive engine is adopted for decision making;

FIG. 18 is a graph comparing the performance of different algorithms against multi-tone interference;

FIG. 19 is a comparison graph of the performance of different algorithms against swept frequency interference;

Detailed Description

The invention is described in further detail below with reference to the drawings and the detailed description.

The invention takes the software radio platform USRP RIO as the experimental environment and studies and applies a reinforcement learning algorithm to the problem of frequency hopping communication in a real environment, realizing a reinforcement learning-based frequency hopping communication anti-interference system. Tests show that the reinforcement learning frequency hopping decisions can effectively resist interference of different strategies, improve spectrum utilization, and effectively ensure the communication quality of system users.

The method is implemented by graphical programming in LabVIEW on the host computer. As shown in FIG. 1, double-threshold energy detection is first used at the transmitter end to perform spectrum sensing and acquire spectrum state information; second, the reward table R of Q learning is updated using the acquired spectrum information as input, and iterative training with the Q learning algorithm then yields an updated Q table. The spectrum state is monitored continuously: when it changes (that is, a band becomes interfered with or occupied), the Q table is updated again by the above steps; otherwise the Q table is left unchanged. When the system starts to communicate, it makes adaptive frequency hopping decisions according to the Q table and selects the optimal spectrum sub-band for frequency hopping communication, thereby avoiding interference intelligently and actively, keeping the communication service unaffected by interference, improving communication stability, and ensuring reliable delivery of information.

The invention applies the USRP RIO software radio platform to verify and test the frequency hopping communication anti-interference method based on reinforcement learning. The specific implementation comprises the following steps:

(I) Information processing and spectrum determination

The invention uses a dual-threshold energy detection algorithm to detect interference. A plurality of sensing nodes periodically measure the energy of the spectrum space in which they are located, and the information processing center samples this data at a sampling rate M (the value of M is determined by the performance of the system). After sampling is finished, the energy detection data N_j of the j-th sensing node on the same frequency band are averaged. Whether the spectrum space is available is judged by the following hard decision formula:

wherein H₀ and H₁ are the two thresholds of dual-threshold energy detection: H₀ is the hard decision threshold for judging that the spectrum space is available, H₁ is the hard decision threshold for judging that the spectrum space is unavailable, n is the number of sensing nodes, and f represents the decision result. When f is 0, the spectrum space is available; when f is 2, the spectrum space is unavailable; when f is 1, the state of the spectrum space cannot be determined and a secondary decision is required. The secondary decision uses a voting mechanism: the number of sensing nodes with N_j < H₀ and the number with N_j > H₁ are counted. When the nodes with N_j < H₀ exceed half of the votes, the spectrum space is judged to be an idle resource; when the nodes with N_j > H₁ exceed half of the votes, it is judged to be a non-idle resource. If the secondary decision still yields no result, the resource is discarded in the current round and judged again when the next round of decisions is initiated.
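A minimal Python sketch of the two-stage decision described above, assuming each sensing node contributes one energy value per band. The function names, the string labels, and the choice to treat an average exactly equal to a threshold as undetermined are illustrative assumptions, not part of the source.

```python
from statistics import mean

AVAILABLE, UNDETERMINED, UNAVAILABLE = 0, 1, 2


def hard_decision(energies, h0, h1):
    """First stage: compare the average node energy with the two thresholds."""
    avg = mean(energies)
    if avg < h0:
        return AVAILABLE
    if avg > h1:
        return UNAVAILABLE
    return UNDETERMINED


def secondary_vote(energies, h0, h1):
    """Second stage: majority vote over the individual sensing nodes."""
    n = len(energies)
    idle_votes = sum(1 for e in energies if e < h0)
    busy_votes = sum(1 for e in energies if e > h1)
    if idle_votes > n / 2:
        return "idle"
    if busy_votes > n / 2:
        return "busy"
    return None  # no result: the band is skipped until the next round


def classify_band(energies, h0, h1):
    """Return "idle", "busy", or None for one frequency band."""
    f = hard_decision(energies, h0, h1)
    if f == AVAILABLE:
        return "idle"
    if f == UNAVAILABLE:
        return "busy"
    return secondary_vote(energies, h0, h1)
```

For example, with thresholds 0.5 and 1.0, the readings `[0.1, 0.2, 2.0]` average to a value between the thresholds, so the vote decides: two of three nodes are below the lower threshold and the band is classified as idle.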

(II) anti-interference decision based on reinforcement learning

The invention uses the Q-learning algorithm from machine learning to make frequency hopping decisions and implement the communication anti-interference strategy. The spectrum resources are divided into equal spectrum sub-bands, and the sub-band containing the working frequency of information transmission is the current state. During user communication, a frequency hopping decision is made according to the Q table, and hopping to the optimal sub-band for the current state is the action. The environment feeds back a corresponding reward according to the action of the agent, and the states and actions form the Q table through training and learning. The Q value update formula is as follows:

where α is the learning rate, representing the proportion of the newly learned Q value in the updated Q value, and γ is the discount factor, which defines the importance of future rewards. R denotes the reward value obtained by the agent at time i when it is in state s and takes action a. Q'(s', a') represents the potential future reward. Q(s, a) denotes the Q value in the Q table at time i for state s and action a.

However, when a traditional Q learning algorithm is used to learn the jamming strategy so that jammed channels can be avoided actively, the biggest problem is that learning the behavior of the jammer requires a large amount of training time: many iterations are needed before the optimal anti-interference strategy is formed. Furthermore, the CR (cognitive radio) must try random actions before the Q learning algorithm converges, which is unsuitable for learning on an operating communication link because the CR may lose a large number of transmitted packets. To solve these problems, the invention implements the Q-learning algorithm in LabVIEW on the host computer of the USRP RIO software radio platform and applies an improved Q learning algorithm, namely the synchronous Q learning algorithm (OPSQ-learning).

The synchronous Q learning algorithm has two main characteristics: (1) the off-policy strategy of the standard Q learning algorithm is replaced by an on-policy strategy: in each time slot the CR no longer tries random actions but follows a greedy strategy, selecting the action corresponding to the maximum Q value. (2) Using the wide-band spectrum sensing capability of the CR, n Q values are updated synchronously each time instead of a single cell of the Q matrix being updated asynchronously; that is, after entering the next state, the CR detects the jammer frequency at that moment with its wide-band spectrum sensing capability and updates all state-action pairs of that row (all columns of the Q matrix row Q(s, :)).
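The two characteristics can be illustrated with a small Python sketch. It assumes a fixed set of jammed sub-bands, a reward of +1 for a free band and -1 for a jammed band, and that hopping to band a makes a the next state; these choices, the parameter values and all function names are illustrative assumptions rather than the patented implementation.

```python
def make_q_table(n_bands):
    """Square Q table: rows are current sub-bands, columns are target sub-bands."""
    return [[0.0] * n_bands for _ in range(n_bands)]


def greedy_action(q, state):
    """On-policy greedy choice: the column with the largest Q value in this row."""
    row = q[state]
    return max(range(len(row)), key=row.__getitem__)


def synchronous_update(q, state, jammed, alpha=0.5, gamma=0.5):
    """Update every column of row `state` from one wide-band sensing result."""
    n = len(q)
    # Hopping to band a makes a the next state, so its future value is max(q[a]).
    best_next = [max(q[a]) for a in range(n)]
    for action in range(n):
        reward = -1.0 if action in jammed else 1.0  # assumed reward values
        target = reward + gamma * best_next[action]
        q[state][action] = (1 - alpha) * q[state][action] + alpha * target


def train(n_bands, jammed, slots=50, start=0):
    """Run `slots` time slots against a static jammer and return the Q table."""
    q = make_q_table(n_bands)
    state = start
    for _ in range(slots):
        action = greedy_action(q, state)      # no random exploration
        synchronous_update(q, state, jammed)  # sense, then refresh the whole row
        state = action
    return q
```

Because every column of the visited row is refreshed from a single sensing result, a single time slot is enough for the row to rank jammed bands below free ones, without the agent having to try the jammed bands itself.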

The method comprises the following implementation steps:

Step 1: The sensing nodes start to periodically collect data on the surrounding environment and send data packets to the information processing center.

Step 2: The information processing center processes the received data and updates the spectrum information of the available/unavailable frequency bands.

Step 3: Interference band information is obtained from the spectrum resource condition fed back by the information processing center, and the corresponding reward value R of each state-action pair is then updated.

Step 4: The Q learning algorithm starts online training and learning and generates the corresponding Q table.

Step 5: The information processing center continuously examines the spectrum resources currently in use and judges whether the wireless environment has changed. If it has not changed, the Q table is not updated; otherwise, the Q learning cognitive engine is retrained and the Q table is updated so that a new anti-interference decision can be made.

Step 6: After receiving a service request from the transmitter, the Q learning cognitive engine allocates available spectrum resources to the transmitter for the video communication service between the transmitter and the receiver.

Step 7: During video communication, the spectrum resources currently in use are examined to see whether the wireless environment has changed. If it has not changed, the user continues to make frequency hopping decisions according to the current Q table and to receive video data; otherwise, the Q learning cognitive engine is retrained and the Q table is updated so that a new anti-interference decision can be made. The user then makes new frequency hopping decisions according to the updated Q table to ensure communication quality.

Step 8: The user communication is completed and the service ends.
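The control flow of the monitoring steps (retrain only when the spectrum state changes, otherwise keep deciding from the current Q table) can be sketched as follows. The retraining function here is a deliberately simple stand-in, and all names and the snapshot format (one 0/1 busy flag per sub-band) are assumptions made for illustration.

```python
def stub_retrain(busy):
    """Stand-in for retraining: score free bands +1 and busy bands -1 in every row."""
    return [[-1.0 if b else 1.0 for b in busy] for _ in busy]


def greedy_band(q_table, band):
    """Pick the sub-band with the largest Q value in the row of the current band."""
    row = q_table[band]
    return max(range(len(row)), key=row.__getitem__)


def run_session(snapshots, start_band=0, retrain=stub_retrain, choose=greedy_band):
    """Process periodic spectrum snapshots; return the bands used and the retrain count."""
    q_table, last, retrains, history = None, None, 0, []
    band = start_band
    for busy in snapshots:
        if busy != last:              # the wireless environment has changed
            q_table = retrain(busy)   # update rewards and retrain the Q table
            retrains += 1
            last = busy
        band = choose(q_table, band)  # frequency hopping decision from the Q table
        history.append(band)
    return history, retrains
```

With the snapshots `[(0, 1, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0)]` and a start on band 0, the session stays on band 0 while it is free, hops to band 1 when band 0 becomes busy, and retrains twice rather than four times.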

The invention realizes the communication anti-interference decision with zero interference and low error code. Interference cognition based on energy detection is realized through a physical layer cognitive radio technology, the state of a frequency spectrum space is periodically monitored and is sent to a Q learning algorithm module as input information, and the Q learning algorithm carries out on-line training learning to obtain a Q table matrix. The process is used as a Q learning cognitive engine, so that the user can select an optimal frequency hopping decision scheme according to a Q table obtained by the Q learning cognitive engine in the communication process to carry out frequency hopping communication, and the transmission channel is actively prevented from being interfered. By the means, the information transmission channel of the user is ensured to be zero interference and low error code in the communication process.

The experimental verification is as follows:

First, experimental platform

In the software radio, only the basic frequency conversion, A/D and D/A conversion, and radio frequency driving are implemented by the hardware platform USRP RIO2943R; the remaining functions are implemented in software. Apart from the basic transceiving functions, the whole communication process has to be designed and programmed by the user.

The main physical parameters of the NI USRP RIO2943R are as follows: an adjustable frequency range of 1.2 GHz to 6 GHz, a real-time bandwidth of 40 MHz, a PCIe x4 bus speed of 800 MB/s, and a Kintex-7 FPGA chip.

The software part of the experiment was designed and debugged in LabVIEW 2015. On the basis of the radio frequency transceiving drivers provided by the software, the functions required by the invention were added, thereby implementing the invention as a whole.

Second, setting up experimental environment

FIG. 2 shows the specific deployment of the experiment in the test system. The experimental environment contains a transmitter, a receiver, a jammer, an information processing center, a Q learning algorithm module and a plurality of sensing nodes. During the experiment, the transmitter sends video data to the receiver, and the jammer, as the interference source, sends single-tone/multi-tone interference.

The communication mode between the transmitter and the receiver is wireless communication; the communication mode between the Q learning cognitive engine and the transmitter and the receiver is wired communication, and optical fibers are used for connection; the information processing center and the sensing nodes are connected by optical fibers. By such a connection, the reliability of all communications of the system can be ensured.

The experimental setup is mainly divided into three parts: the users (a transmitter and a receiver), the jammer, and the Q learning cognitive engine (comprising a Q learning algorithm module, an information processing center and sensing nodes), as shown in FIG. 3.

Transmitter and receiver experimental setup:

The transmitter and the receiver are responsible for data communication in the experiment: they carry out video transmission and display the transmission results, while also collecting statistics on the data and feeding back the quality of service. Multiple antennas can be configured on one USRP RIO device, so SISO (single input single output) self-transmission and self-reception is possible; considering also the limited laboratory equipment, the transmitter and the receiver are placed on the same USRP RIO device, with two single antennas configured for transmitting and receiving. FIG. 6 is a functional block diagram of the transmitting end, and FIG. 7 is a functional block diagram of the receiver.

As shown in FIG. 6, at the transmitting end, data from the source undergoes source coding, channel coding, QAM modulation, insertion of the guard interval UW, framing and other operations, and is then transmitted to the wireless channel through the RF transmitting module. So that the radio frequency parameters of the transmitting end, such as center frequency, local oscillator and gain, can be modified in real time, an external expansion interface is added to the RF transmitting module, through which the module for modifying the transmission parameters is called. The test system in this experiment transmits video data, so the source is data packets processed by VLC software.

As shown in FIG. 7, at the receiving end (i.e., UDP receiving module), wireless information is received from the receiving antenna, passes through the RF receiving module, and reaches the sink after frame synchronization, frame parsing, channel equalization, QAM demodulation, channel decoding and source decoding. Similarly, so that the radio frequency parameters of the receiving end can be modified in real time, an external expansion interface is added to the RF receiving module. The sink is also VLC software: after the data is obtained it is decoded internally, and the playback quality can be observed while the video is played.

Jammer experimental setup:

The jammer mainly applies interference to the wireless channel, blocking part of the communication in the radio space. In the experiment, several interference modes, such as single-tone interference, multi-tone interference, frequency sweep interference and intelligent interference, are implemented on one USRP RIO device.

As shown in FIG. 9, single-tone or multi-tone interference is implemented by modulating a single interference source or multiple interference sources and transmitting the interference data to the wireless channel via radio frequency, thereby interfering with one or more channels. As shown in FIG. 10, frequency sweep interference between 2.2 GHz and 2.8 GHz is implemented: the interference source is modulated and the interference data is transmitted to the wireless channel through the RF transmitting module. To implement the sweep, a frequency sweep interference function module is added, which modifies the transmitted radio frequency parameter, namely the transmission center frequency, in real time by calling the external expansion interface of the RF transmitting module, so that different wireless channels are interfered with in different time slots.

Sensing node experimental setup:

The sensing node mainly detects the spectrum space environment. Because the USRP platform can sense data only within a limited bandwidth at any one time, a frequency sweep module is added to the system to detect a larger spectrum space: each band within a set bandwidth range is swept in turn to obtain the spectrum data of each frequency band. To simplify deployment of the sensing node, the detected spectrum data is sent to the data processing node for processing. FIG. 4 is a functional block diagram of a sensing node: after the radio frequency transceiving parameters are entered, the program is started and begins sensing data; the sensed data is then encapsulated in the data format of FIG. 5 and sent to the information processing center over UDP. The band between 2.2 GHz and 2.8 GHz is selected for testing in the experiment, and with the frequency sweep function module the sensing process is executed sequentially and without gaps across this spectrum range. When the scan of the range is complete, the next sensing pass over the spectrum begins. To achieve the wide-band spectrum sensing capability of cognitive radio, several USRP devices perform frequency sweep detection simultaneously, which widens the swept spectrum, shortens the sweep time, and allows spectrum space information to be acquired more quickly.
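As a rough illustration of how a gap-free sweep over 2.2 GHz to 2.8 GHz might be shared among several devices, the Python sketch below lists the center frequencies and splits them into contiguous blocks. The 40 MHz step (taken from the real-time bandwidth quoted for the hardware), the contiguous split, and the function name are assumptions; the source does not state how the sweep is partitioned.

```python
def split_sweep(start_mhz, stop_mhz, step_mhz, n_devices):
    """Return one list of center frequencies (MHz) per sensing device."""
    n_steps = (stop_mhz - start_mhz) // step_mhz
    # Center of each adjacent step-wide window, so the windows tile the range.
    centers = [start_mhz + k * step_mhz + step_mhz // 2 for k in range(n_steps)]
    chunk = -(-len(centers) // n_devices)  # ceiling division
    return [centers[i * chunk:(i + 1) * chunk] for i in range(n_devices)]


plan = split_sweep(2200, 2800, 40, n_devices=3)
```

With three devices, each one covers five of the fifteen 40 MHz windows, so a full pass over the range takes a third of the retuning steps that a single device would need.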

Information processing center and Q learning algorithm module experimental setup:

The information processing center mainly performs data processing and feeds back channel quality to update the reward. Interference behavior is judged from the real-time state of the environment, the algorithm trains on and learns this behavior, and the Q table is updated. As shown in FIG. 8, the UDP receiving module receives two kinds of data: data sent by the transmitter and data sent by the sensing nodes. In UDP communication, the different data sources are easily distinguished by their different UDP port numbers. After the cognitive engine has trained and learned, the Q table is updated; the transmitter and the receiver can then communicate using frequency hopping decisions based on the Q table, and they are notified to change spectrum resources through the UDP sending module on the experimental platform. Data received from the sensing nodes enters the data processing module, where it is sampled and processed; the state of each channel is then judged, the availability of each channel and the reward table are updated, and the Q learning algorithm is trained to update the Q table. As soon as the current working frequency band is found to be interfered with, a frequency hopping decision is made immediately according to the Q table, a new optimal available spectrum sub-band is selected, and the transmitter and the receiver are notified through the UDP sending module to change spectrum resources.
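One way to tell the two data sources apart by port number is to listen on a separate UDP port for each, as in the Python sketch below. Whether the original system separates sources by listening port or by sender port is not stated, so this arrangement, the loopback addresses, and the labels are assumptions for illustration only.

```python
import select
import socket


def open_udp(port=0):
    """Bind a UDP socket on the loopback interface (port 0 lets the OS choose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    return sock


def receive_once(sources, timeout=1.0):
    """Return (label, payload) for every labelled socket that has data waiting."""
    by_sock = {sock: label for label, sock in sources.items()}
    ready, _, _ = select.select(list(by_sock), [], [], timeout)
    return [(by_sock[s], s.recvfrom(65535)[0]) for s in ready]


# One listening port per data source; the label identifies the sender type.
sources = {"transmitter": open_udp(), "sensing_node": open_udp()}
```

A datagram that arrives on the socket labelled `sensing_node` would then be routed to the data processing module, while one arriving on the `transmitter` socket would be handled as a service request.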

Third, the experimental process

Step 1: Configure the preset parameters. The parameters must be pre-configured before any program is started. The radio frequency parameters of the transmitting end are set as follows: radio frequency antenna TX1, initial center frequency 2.4 GHz, local oscillator frequency -1 Hz, and transmission gain 0 dBm. The radio frequency parameters of the receiving end are set as follows: radio frequency antenna RX1, initial center frequency 2.4 GHz, local oscillator frequency -1 Hz, and receiving gain 0 dBm. The radio frequency parameters of the jammer are set as follows: radio frequency antenna TX1, initial center frequency 2.4 GHz, local oscillator frequency -1 Hz, and transmission gain 0 dBm. The initial radio frequency parameters of the sensing node are set as follows: radio frequency antenna RX1, initial center frequency 2.2 GHz, local oscillator frequency -1 Hz, and receiving gain 0 dBm.

Step 2: Run the sensing node, information processing center, and Q learning algorithm programs. The sensing node performs frequency sweep detection on the spectrum space and sends the data to the information processing center for processing; the available state of each channel is updated and passed to the Q learning algorithm module to update the reward. The Q learning cognitive engine can then obtain a Q table, that is, a frequency hopping decision strategy, through training and learning.

Step 3: Run the transmitting end and receiving end programs, select an available channel according to the Q learning cognitive engine, and start data communication. Open the VLC script file to start generating video source data and to play the received video source data.

Step 4: Run the jammer program to interfere with the channel, test whether the anti-interference scheme of the system is effective, and verify whether the result meets expectations. When the interference reaches the frequency band carrying the transmitted data, the quality of the transmitted video is severely affected. Without any anti-interference measure, the transmitted and received video images are as shown in fig. 11 and fig. 12; fig. 12 shows that the video playing quality is very poor, with obvious frame loss and delay. Fig. 13 is the constellation diagram at the receiving end, which is disordered at this time. Fig. 14 is the bit error rate curve, which shows a very high bit error rate.

Step 5: When the environment changes, the Q learning cognitive engine continues training and learning, and the Q table is updated. The user then makes a new frequency hopping decision according to the current state and the Q table, and the decision is sent to the receiving end and the transmitting end over UDP, so that the spectrum resources are reallocated and frequency hopping communication takes place. After the hop, the center frequency of transmission and reception is adjusted to 2.3 GHz, which matches the expected result. Fig. 15 is a video playing screenshot taken while the reinforcement learning algorithm makes the frequency hopping decisions; the video transmission quality is good. Fig. 16 is the constellation diagram of the receiving end at this time, and the constellation is clear. Fig. 17 is the average bit error rate curve: during transmission with frequency hopping decisions made by reinforcement learning, the channel is essentially free of interference and the channel bit error rate is zero.

Fig. 18 and fig. 19 show the performance of each reinforcement learning algorithm against multi-tone interference and frequency-sweep interference, respectively. Against both multi-tone and frequency-sweep jammers, the conventional Q learning algorithm and the SARSA algorithm only begin to converge after about one hundred training iterations; if these techniques were applied on the software defined radio platform, hundreds of data packets would be lost during real-time communication. The OPSQ-learning algorithm used by the invention converges faster than the other algorithms and needs only 15 to 20 training iterations to provide a suitable anti-interference defense strategy against these kinds of jammers.

The invention implements an improved synchronous Q learning algorithm on the software radio platform USRP RIO and uses the broadband spectrum sensing capability of cognitive radio to accelerate the training and learning process. N Q values can be updated synchronously for the current state; compared with asynchronously updating a single Q value each time, this accelerates convergence of the algorithm, so that an optimal anti-interference decision strategy is obtained more quickly. In tests on the experimental system, the training period of the improved Q learning algorithm is only 7 iteration cycles, whereas standard Q learning converges only after hundreds of episodes, so the improved algorithm is better suited to real-time communication. The improved Q learning algorithm loses only a few data packets during learning, and the packet loss rate is tolerable.
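The idea of updating all N Q values of the current state at once can be sketched as follows. This is an illustration under stated assumptions, not the patented algorithm: it uses the textbook Q-learning update, assumes that hopping to sub-band a puts the agent in state a, and the learning rate `alpha` and discount factor `gamma` are hypothetical values.

```python
def sync_q_update(q_table, state, rewards, alpha=0.5, gamma=0.9):
    """Update every Q value of `state` in one step.

    rewards[a] is the reward for hopping to sub-band a, which wideband
    sensing makes available for all N sub-bands at the same time.
    """
    # Snapshot the best value of every possible next state before changing anything,
    # so that all N updates are computed from the same table.
    best_next = [max(row) for row in q_table]
    for action, reward in enumerate(rewards):
        next_state = action  # assumption: the chosen sub-band becomes the next state
        target = reward + gamma * best_next[next_state]
        q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table


# Example: three sub-bands, sub-band 0 is jammed, the other two are free.
q = [[0.0] * 3 for _ in range(3)]
sync_q_update(q, state=0, rewards=[-1.0, 1.0, 1.0])
```

A standard asynchronous update would change only one entry of row 0 per time slot; here the whole row is refreshed in a single slot, which is why fewer iterations are needed before the row reflects the current spectrum state.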

Claims

1. A reinforcement learning frequency hopping communication anti-interference implementation method based on a USRP platform, the platform comprising a transmitter, a receiver, a jammer and a Q learning cognitive engine, wherein the Q learning cognitive engine comprises an information processing center, a Q learning algorithm module and a plurality of sensing nodes; the transmitter sends data to the receiver, and the jammer sends single-tone/multi-tone interference as an interference source;

the transmitter and the receiver adopt wireless communication; the Q learning cognitive engine is connected with the transmitter and the receiver, and the information processing center and the sensing node by optical fibers respectively; the method is characterized by comprising the following steps:

S7, in the process of communication, the information processing center examines the currently used spectrum resource and judges whether the wireless environment has changed; if it has not changed, the user continues to make frequency hopping decisions according to the current Q table to receive data; if it has changed, the Q learning cognitive engine is retrained and the Q table is updated, and the user then makes a new frequency hopping decision according to the updated Q table;

and S8, the user communication is finished and the service ends.

2. The USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method according to claim 1, wherein in step S2, a dual-threshold energy detection algorithm is adopted to detect interference, a plurality of sensing nodes periodically detect energy data information of frequency spectrum space where the sensing nodes are located, and an information processing center samples the data information;

when f is 0, it indicates that the spectrum space is available;

when f is 2, the spectrum space is not available;

when f is 1, the state of the spectrum space cannot be judged, and secondary judgment is needed.
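The three-way decision on f can be sketched as follows. The mapping of f to thresholds is an assumption inferred from the threshold names H_0 and H_1 used in claim 3: energy below the lower threshold gives f = 0, energy above the upper threshold gives f = 2, and anything in between gives f = 1. The function name is hypothetical.

```python
def dual_threshold_flag(energy, h0, h1):
    """Return f for one energy measurement, given lower threshold h0 and upper threshold h1."""
    if h0 >= h1:
        raise ValueError("h0 must be below h1")
    if energy < h0:
        return 0  # spectrum space available
    if energy > h1:
        return 2  # spectrum space not available
    return 1      # undetermined: a secondary decision is needed


# Example with arbitrary illustrative thresholds.
flags = [dual_threshold_flag(e, h0=2.0, h1=8.0) for e in (1.0, 5.0, 9.0)]
```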

3. The USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method according to claim 2, wherein the secondary decision adopts a voting mechanism to count the number of all samples for which N_j < H_0 or N_j > H_1:

when the number of votes with N_j < H_0 exceeds half, the spectrum space is judged to be an idle resource;
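The voting step can be sketched as follows. Two points are assumptions, because the text states only the idle condition: "half" is taken to mean half of the decisive votes (samples with N_j < H_0 or N_j > H_1), and every other outcome, including no decisive votes at all, is treated as occupied. The function name is hypothetical.

```python
def vote_idle(samples, h0, h1):
    """Secondary decision: True if the spectrum space is judged idle by majority vote."""
    below = sum(1 for n in samples if n < h0)  # votes for "idle"
    above = sum(1 for n in samples if n > h1)  # votes for "occupied"
    decisive = below + above                   # samples between h0 and h1 do not vote
    if decisive == 0:
        return False  # no usable votes: treated as occupied (a conservative assumption)
    return below > decisive / 2


# Example: three idle votes, one occupied vote, one undecided sample.
is_idle = vote_idle([1.0, 1.0, 1.0, 9.0, 5.0], h0=2.0, h1=8.0)
```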

4. The USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method according to claim 1, wherein in step S5, the spectrum resources are divided into equal spectrum sub-bands, and the spectrum sub-band in which the working frequency of information transmission lies is the current state; in the process of user communication, a frequency hopping decision is made according to the Q table, and hopping to the optimal spectrum sub-band corresponding to the current state is selected as the action; the environment feeds back the corresponding reward according to the action of the agent, the previous state and the action taken form a Q table through training and learning, and the Q value update formula is as follows:

where the reward term represents the reward value obtained when the agent is in state s at time i and takes action a; Q'(s', a') represents the potential future reward;

and the Q value term represents the Q value in the Q table at time i for state s and action a.
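For orientation, the textbook tabular Q-learning update, which may differ in detail from the claimed formula, can be written with the terms defined above. The learning rate \alpha and the discount factor \gamma are assumed symbols that are not defined in this text, as are the notations r_i(s,a) and Q_i(s,a) for the reward term and the Q value term.

```latex
Q_{i+1}(s,a) = Q_i(s,a) + \alpha \left[ r_i(s,a) + \gamma \max_{a'} Q'(s',a') - Q_i(s,a) \right]
```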

5. The USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method according to claim 1, wherein in step S5, the synchronous Q learning algorithm replaces the off-policy strategy of the standard Q learning algorithm with an on-policy strategy, and selects the action corresponding to the maximum Q value in each time slot;

6. The USRP platform-based reinforcement learning frequency hopping communication anti-interference implementation method according to claim 5, wherein in step S7, when the information processing center finds that the current working frequency band is about to be interfered, the information processing center immediately makes a frequency hopping decision according to the Q table, selects a new optimal available spectrum sub-band, and notifies the transmitter and the receiver to change spectrum resources.